How Large Language Models Perform on the United States Medical Licensing Examination: A Systematic Review
2023; Cold Spring Harbor Laboratory; Language: English
10.1101/2023.09.03.23294842
Authors: Dana Brin, Vera Sorin, Eli Konen, Girish N. Nadkarni, Benjamin S. Glicksberg, Eyal Klang
Topic(s): Healthcare cost, quality, practices
ABSTRACT

Objective: The United States Medical Licensing Examination (USMLE) assesses physicians’ competency, and passing is a requirement to practice medicine in the U.S. With the emergence of large language models (LLMs) like ChatGPT and GPT-4, understanding their performance on these exams illuminates their potential in medical education and healthcare.

Materials and Methods: A literature search following the 2020 PRISMA guidelines was conducted, focusing on studies using official USMLE questions and publicly available LLMs.

Results: Three relevant studies were found, with GPT-4 showing the highest accuracy rates of 80–90% on the USMLE. Open-ended prompts typically outperformed multiple-choice ones, with 5-shot prompting slightly edging out zero-shot.

Conclusion: LLMs, especially GPT-4, display proficiency in tackling USMLE-standard questions. While the USMLE is a structured evaluation tool, it may not fully capture the expansive capabilities and limitations of LLMs in medical scenarios. As AI integrates further into healthcare, ongoing assessments against trusted benchmarks are essential.