How Large Language Models Perform on the United States Medical Licensing Examination: A Systematic Review
2023; Cold Spring Harbor Laboratory; Language: English
10.1101/2023.09.03.23294842
Authors: Dana Brin, Vera Sorin, Eli Konen, Girish N. Nadkarni, Benjamin S. Glicksberg, Eyal Klang
Topic(s): Healthcare cost, quality, practices
ABSTRACT

Objective: The United States Medical Licensing Examination (USMLE) assesses physicians’ competency, and passing is a requirement to practice medicine in the U.S. With the emergence of large language models (LLMs) like ChatGPT and GPT-4, understanding their performance on these exams illuminates their potential in medical education and healthcare.

Materials and Methods: A literature search following the 2020 PRISMA guidelines was conducted, focusing on studies using official USMLE questions and publicly available LLMs.

Results: Three relevant studies were found, with GPT-4 showing the highest accuracy rates of 80–90% on the USMLE. Open-ended prompts typically outperformed multiple-choice ones, with 5-shot prompting slightly edging out zero-shot.

Conclusion: LLMs, especially GPT-4, display proficiency in tackling USMLE-standard questions. While the USMLE is a structured evaluation tool, it may not fully capture the expansive capabilities and limitations of LLMs in medical scenarios. As AI integrates further into healthcare, ongoing assessments against trusted benchmarks are essential.