
ORIGINAL RESEARCH article

Front. Digit. Health

Sec. Health Informatics

This article is part of the Research Topic: The Digitalization of Neurology - Volume II

Evaluation of Multiple Generative Large Language Models on Neurology Board-Style Questions

Provisionally accepted
  • 1University of Texas Medical Branch at Galveston, Galveston, United States
  • 2The University of Texas Medical Branch at Galveston, Galveston, United States
  • 3University of Illinois Chicago College of Medicine, Chicago, United States

The final, formatted version of the article will be published soon.

Objective: To compare the performance of eight large language models (LLMs) with neurology residents on board-style multiple-choice questions across seven subspecialties and two cognitive levels.

Methods: In a cross-sectional benchmarking study, we evaluated Bard, Claude, Gemini v1, Gemini 2.5, ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, and ChatGPT-5 using 107 text-only items spanning movement disorders, vascular neurology, neuroanatomy, neuroimmunology, epilepsy, neuromuscular disease, and neuro-infectious disease. Two neurologists labeled each item as lower- or higher-order per Bloom's taxonomy. Models answered each item in a fresh session and reported a confidence rating and a Bloom classification. Residents completed the same item set under exam-like conditions. Outcomes included overall and domain accuracies, guessing-adjusted accuracy, confidence–accuracy calibration (Spearman ρ), agreement with expert Bloom labels (Cohen κ), and inter-generation scaling (linear regression of topic-level accuracies). Group differences were assessed with Fisher exact or χ² tests with Bonferroni correction.

Results: Residents scored 64.9%. ChatGPT-5 achieved 84.1% and ChatGPT-4o 81.3%, followed by Gemini 2.5 at 77.6% and ChatGPT-4 at 68.2%; Claude (56.1%), Bard (54.2%), ChatGPT-3.5 (53.3%), and Gemini v1 (39.3%) underperformed residents. On higher-order items, ChatGPT-5 (86%) and ChatGPT-4o (82.5%) maintained superiority, and Gemini 2.5 matched 82.5%. Guessing-adjusted accuracy preserved the rank order (ChatGPT-5 78.8%, ChatGPT-4o 75.1%, Gemini 2.5 70.1%). Confidence–accuracy calibration was weak across all models. Inter-generation scaling was strong within the ChatGPT lineage (ChatGPT-4 to 4o R²=0.765, p=0.010; 4o to 5 R²=0.908, p<0.001) but absent from Gemini v1 to 2.5 (R²=0.002, p=0.918), suggesting discontinuous improvements.

Conclusions: LLMs, particularly ChatGPT-5 and ChatGPT-4o, exceeded resident performance on text-based neurology board-style questions across subspecialties and cognitive levels. Gemini 2.5 showed substantial gains over v1 but with domain-uneven scaling. Given weak confidence calibration, LLMs should be integrated as supervised educational adjuncts with ongoing validation, version governance, and transparent metadata to support safe use in neurology education.
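For readers unfamiliar with the reported metrics, the sketch below illustrates how guessing-adjusted accuracy, confidence–accuracy calibration (Spearman ρ), and Bloom-label agreement (Cohen κ) could be computed for a single model. This is not the authors' analysis code: the item data, confidence scale, and number of answer options are hypothetical placeholders, and the guessing correction shown is the standard correction-for-guessing formula, which is an assumption about how the adjustment was made.

```python
# Illustrative sketch only; all data below are randomly generated placeholders.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

k = 4                                  # assumed number of answer options per item
n_items = 107                          # item count reported in the abstract
rng = np.random.default_rng(0)

correct = rng.integers(0, 2, n_items)              # 1 = model answered correctly (placeholder)
confidence = rng.uniform(0, 100, n_items)           # model's self-reported confidence (placeholder)

# Raw accuracy
raw_acc = correct.mean()

# Guessing-adjusted accuracy: standard (R - W/(k-1)) / N correction,
# assuming incorrect responses reflect chance-level guessing among k options.
n_correct = correct.sum()
n_wrong = n_items - n_correct
adj_acc = (n_correct - n_wrong / (k - 1)) / n_items

# Confidence-accuracy calibration: Spearman rho between stated confidence
# and item-level correctness.
rho, p_rho = spearmanr(confidence, correct)

# Agreement with expert Bloom labels (0 = lower order, 1 = higher order); placeholders.
expert_bloom = rng.integers(0, 2, n_items)
model_bloom = rng.integers(0, 2, n_items)
kappa = cohen_kappa_score(expert_bloom, model_bloom)

print(f"raw={raw_acc:.3f} adjusted={adj_acc:.3f} "
      f"rho={rho:.2f} (p={p_rho:.3f}) kappa={kappa:.2f}")
```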

Keywords: artificial intelligence, large language models, neurology education, board examinations, model performance analysis

Received: 02 Nov 2025; Accepted: 01 Dec 2025.

Copyright: © 2025 Rodríguez-Fernández, Almomani, Valaparla, Weatherhead, Fang, Dabi, Li, McCaffrey and Hier. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Jorge Mario Rodríguez-Fernández

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.