ORIGINAL RESEARCH article
Front. Digit. Health
Sec. Health Informatics
This article is part of the Research Topic: The Digitalization of Neurology - Volume II
Evaluation of Multiple Generative Large Language Models on Neurology Board-Style Questions
Provisionally accepted
1 University of Texas Medical Branch at Galveston, Galveston, United States
2 The University of Texas Medical Branch at Galveston, Galveston, United States
3 University of Illinois Chicago College of Medicine, Chicago, United States
Objective: To compare the performance of eight large language models (LLMs) with neurology residents on board-style multiple-choice questions across seven subspecialties and two cognitive levels.

Methods: In a cross-sectional benchmarking study, we evaluated Bard, Claude, Gemini v1, Gemini 2.5, ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, and ChatGPT-5 using 107 text-only items spanning movement disorders, vascular neurology, neuroanatomy, neuroimmunology, epilepsy, neuromuscular disease, and neuro-infectious disease. Items were labeled as lower- or higher-order per Bloom's taxonomy by two neurologists. Models answered each item in a fresh session and reported confidence and a Bloom classification. Residents completed the same set under exam-like conditions. Outcomes included overall and domain-level accuracies, guessing-adjusted accuracy, confidence–accuracy calibration (Spearman ρ), agreement with expert Bloom labels (Cohen κ), and inter-generation scaling (linear regression of topic-level accuracies). Group differences were assessed with Fisher exact or χ² tests with Bonferroni correction.

Results: Residents scored 64.9%. ChatGPT-5 achieved 84.1% and ChatGPT-4o 81.3%, followed by Gemini 2.5 at 77.6% and ChatGPT-4 at 68.2%; Claude (56.1%), Bard (54.2%), ChatGPT-3.5 (53.3%), and Gemini v1 (39.3%) underperformed residents. On higher-order items, ChatGPT-5 (86%) and ChatGPT-4o (82.5%) maintained superiority; Gemini 2.5 matched 82.5%. Guessing-adjusted accuracy preserved the rank order (ChatGPT-5 78.8%, ChatGPT-4o 75.1%, Gemini 2.5 70.1%). Confidence–accuracy calibration was weak across models. Inter-generation scaling was strong within the ChatGPT lineage (ChatGPT-4 to 4o R²=0.765, p=0.010; 4o to 5 R²=0.908, p<0.001) but absent for Gemini v1 to 2.5 (R²=0.002, p=0.918), suggesting discontinuous improvements.

Conclusions: LLMs, particularly ChatGPT-5 and ChatGPT-4o, exceeded resident performance on text-based neurology board-style questions across subspecialties and cognitive levels. Gemini 2.5 showed substantial gains over v1 but with domain-uneven scaling. Given weak confidence calibration, LLMs should be integrated as supervised educational adjuncts with ongoing validation, version governance, and transparent metadata to support safe use in neurology education.
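To make the reported metrics concrete, the sketch below illustrates how the abstract's analyses could be computed. It is not the authors' code: the column names (`correct`, `confidence`, `bloom_model`, `bloom_expert`, `topic`) are hypothetical, the guessing correction assumes four-option items, and the worked example (approximately 90/107 correct, i.e. 84.1% raw accuracy) is back-calculated from the reported ChatGPT-5 figures and is consistent with the 78.8% guessing-adjusted value in the Results.

```python
"""Minimal sketch (not the authors' analysis code) of the metrics named in the abstract,
assuming a per-item results table with hypothetical column names."""
import numpy as np
import pandas as pd
from scipy.stats import spearmanr, fisher_exact, linregress
from sklearn.metrics import cohen_kappa_score


def guessing_adjusted_accuracy(n_correct: int, n_total: int, k: int = 4) -> float:
    """Classical correction for guessing on k-option multiple-choice items:
    adjusted = (C - W/(k-1)) / N, where W = N - C. k=4 is an assumption here."""
    wrong = n_total - n_correct
    return (n_correct - wrong / (k - 1)) / n_total


# Worked example: ~90/107 correct (84.1% raw) yields ~78.8% guessing-adjusted.
print(round(guessing_adjusted_accuracy(90, 107) * 100, 1))


def calibration_and_agreement(df: pd.DataFrame):
    """df is a hypothetical per-item frame: 'correct' (0/1), 'confidence' (model-reported),
    'bloom_model' and 'bloom_expert' ('lower'/'higher')."""
    rho, p_rho = spearmanr(df["confidence"], df["correct"])           # confidence-accuracy calibration
    kappa = cohen_kappa_score(df["bloom_model"], df["bloom_expert"])  # agreement with expert Bloom labels
    return rho, p_rho, kappa


def compare_accuracy(correct_a, total_a, correct_b, total_b, n_comparisons=8, alpha=0.05):
    """Pairwise accuracy comparison (e.g., one model vs. residents) via Fisher exact test,
    judged against a Bonferroni-corrected alpha for the number of comparisons performed."""
    table = [[correct_a, total_a - correct_a],
             [correct_b, total_b - correct_b]]
    _, p = fisher_exact(table)
    return p, p < alpha / n_comparisons


def scaling_r2(acc_old: np.ndarray, acc_new: np.ndarray):
    """Inter-generation scaling: regress successor-version topic accuracies on
    predecessor-version topic accuracies across the seven subspecialties; report R² and p."""
    fit = linregress(acc_old, acc_new)
    return fit.rvalue ** 2, fit.pvalue
```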
Keywords: artificial intelligence, large language models, neurology education, board examinations, model performance analysis
Received: 02 Nov 2025; Accepted: 01 Dec 2025.
Copyright: © 2025 Rodríguez-Fernández, Almomani, Valaparla, Weatherhead, Fang, Dabi, Li, McCaffrey and Hier. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Jorge Mario Rodríguez-Fernández
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
