ORIGINAL RESEARCH article
Front. Med.
Sec. Healthcare Professions Education
This article is part of the Research Topic: Artificial Intelligence for Technology Enhanced Learning.
Diagnostic Performance of Large Language Models on the NEJM Image Challenge: A Comparative Study with Human Evaluators and the Impact of Prompt Engineering
Provisionally accepted
1 Department of Radiation Oncology, Peking Union Medical College Hospital, Beijing, China
2 School of Medicine, Tsinghua University, Beijing, China
Multimodal large language models (LLMs) that can interpret clinical text and images are emerging as potential decision-support tools, yet their accuracy on standardized cases, and how it compares with human performance across difficulty levels, remains largely unclear. This study aimed to rigorously evaluate the performance of four leading LLMs on the 200-item New England Journal of Medicine (NEJM) Image Challenge. We assessed OpenAI o4-mini-high, Claude 4 Opus, Gemini 2.5 Pro, and Qwen 3, and benchmarked the top-performing model against three medical students (Years 5-7) and an internal-medicine attending physician under identical test conditions. We also characterized the dominant error types for OpenAI o4-mini-high and tested whether prompt engineering strategies could correct them. OpenAI o4-mini-high achieved the highest overall accuracy, 94%, and its performance remained consistently high across easy, moderate, and difficult cases. Human accuracies in this cohort ranged from 38.5% for the three medical students to 70.5% for the attending physician, all significantly lower than that of OpenAI o4-mini-high. An analysis of OpenAI o4-mini-high's 12 errors revealed that most (83.3%) reflected lapses in diagnostic logic rather than failures of input processing. Notably, simple prompting techniques such as chain-of-thought and few-shot learning corrected more than half of these initial errors. In conclusion, within this standardized challenge, a leading multimodal LLM delivered high diagnostic accuracy that surpassed both its peer models and the recruited human participants. These results should be interpreted as evidence of pattern recognition rather than human-like clinical understanding. While further validation on real-world data is warranted, the findings support the potential utility of LLMs in educational and standardized settings. Most residual errors arose from gaps in diagnostic logic that refined user prompting can partly mitigate, underscoring the importance of human-AI interaction for maximizing reliability.
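For readers unfamiliar with the prompting strategies named in the abstract, the short Python sketch below illustrates, in generic form, how a chain-of-thought prompt and a few-shot prompt for an NEJM-style image-based multiple-choice case might be assembled before submission to a multimodal LLM. It is an assumption-laden illustration, not the authors' exact protocol or prompt wording; the question text, option placeholders, and function names are hypothetical.

# Illustrative sketch only: generic chain-of-thought and few-shot prompt
# construction for an image-based multiple-choice diagnostic question.
# The case text and answer options are placeholders, not study materials.

BASE_QUESTION = (
    "A patient presents with the findings shown in the attached image.\n"
    "Which of the following is the most likely diagnosis?\n"
    "A) ...  B) ...  C) ...  D) ...  E) ..."
)

def chain_of_thought_prompt(question: str) -> str:
    """Ask the model to reason step by step before committing to an answer."""
    return (
        f"{question}\n\n"
        "First describe the salient image findings, then reason step by step "
        "through the differential diagnosis, and only then state the single "
        "best answer as a letter (A-E)."
    )

def few_shot_prompt(question: str, worked_examples: list[str]) -> str:
    """Prepend previously solved cases so the model can imitate their format."""
    examples = "\n\n".join(worked_examples)
    return f"{examples}\n\nNow answer the new case:\n{question}"

if __name__ == "__main__":
    # Print the chain-of-thought variant of the placeholder case.
    print(chain_of_thought_prompt(BASE_QUESTION))

In practice, the assembled prompt would be paired with the case image and sent through whichever multimodal API the evaluated model exposes; the sketch deliberately stops at prompt construction to avoid assuming any particular interface.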
Keywords: artificial intelligence in medicine, clinical decision support, medical education, multimodal large language models, NEJM Image Challenge
Received: 20 Sep 2025; Accepted: 08 Dec 2025.
Copyright: © 2025 Zhou, Wang, Wang and Hu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Ke Hu
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
