Artigo Acesso aberto Revisado por pares

Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank

2023; Lippincott Williams & Wilkins; Volume: 93; Issue: 5 Linguagem: Inglês

10.1227/neu.0000000000002551

ISSN

1524-4040

Autores

Rohaid Ali, Oliver Y. Tang, Ian D. Connolly, Jared S. Fridley, John H. Shin, Patricia L. Zadnik Sullivan, Deus Cielo, Adetokunbo A. Oyelese, Curtis E. Doberstein, Albert E. Telfeian, Ziya L. Gokaslan, Wael F. Asaad,

Tópico(s)

Medical Imaging and Analysis

Resumo

BACKGROUND AND OBJECTIVES: General large language models (LLMs), such as ChatGPT (GPT-3.5), have demonstrated the capability to pass multiple-choice medical board examinations. However, comparative accuracy of different LLMs and LLM performance on assessments of predominantly higher-order management questions is poorly understood. We aimed to assess the performance of 3 LLMs (GPT-3.5, GPT-4, and Google Bard) on a question bank designed specifically for neurosurgery oral boards examination preparation. METHODS: The 149-question Self-Assessment Neurosurgery Examination Indications Examination was used to query LLM accuracy. Questions were inputted in a single best answer, multiple-choice format. χ 2 , Fisher exact, and univariable logistic regression tests assessed differences in performance by question characteristics. RESULTS: On a question bank with predominantly higher-order questions (85.2%), ChatGPT (GPT-3.5) and GPT-4 answered 62.4% (95% CI: 54.1%-70.1%) and 82.6% (95% CI: 75.2%-88.1%) of questions correctly, respectively. By contrast, Bard scored 44.2% (66/149, 95% CI: 36.2%-52.6%). GPT-3.5 and GPT-4 demonstrated significantly higher scores than Bard (both P < .01), and GPT-4 outperformed GPT-3.5 ( P = .023). Among 6 subspecialties, GPT-4 had significantly higher accuracy in the Spine category relative to GPT-3.5 and in 4 categories relative to Bard (all P < .01). Incorporation of higher-order problem solving was associated with lower question accuracy for GPT-3.5 (odds ratio [OR] = 0.80, P = .042) and Bard (OR = 0.76, P = .014), but not GPT-4 (OR = 0.86, P = .085). GPT-4's performance on imaging-related questions surpassed GPT-3.5's (68.6% vs 47.1%, P = .044) and was comparable with Bard's (68.6% vs 66.7%, P = 1.000). However, GPT-4 demonstrated significantly lower rates of “hallucination” on imaging-related questions than both GPT-3.5 (2.3% vs 57.1%, P < .001) and Bard (2.3% vs 27.3%, P = .002). Lack of question text description for questions predicted significantly higher odds of hallucination for GPT-3.5 (OR = 1.45, P = .012) and Bard (OR = 2.09, P < .001). CONCLUSION: On a question bank of predominantly higher-order management case scenarios for neurosurgery oral boards preparation, GPT-4 achieved a score of 82.6%, outperforming ChatGPT and Google Bard.

Referência(s)