BRIEF RESEARCH REPORT article
Front. Artif. Intell.
Sec. Medicine and Public Health
Volume 8 - 2025 | doi: 10.3389/frai.2025.1614874
This article is part of the Research Topic "Digital Medicine and Artificial Intelligence".
A Multi-Model Longitudinal Assessment of ChatGPT Performance on Medical Residency Examinations
Provisionally accepted
1 University of the State of Rio Grande do Norte, Mossoro, Rio Grande do Norte, Brazil
2 University of São Paulo, São Paulo, Brazil
Introduction: ChatGPT, a generative artificial intelligence, has potential applications in numerous fields, including medical education. This potential can be assessed through its performance on medical exams. Medical residency exams, critical for entry into medical specialties, serve as a valuable benchmark.

Methods: This study assessed the accuracy of GPT-4 and GPT-4o in answering 1,041 medical residency questions from Brazil, examining overall accuracy and performance across different medical areas, based on evaluations conducted in 2023 and 2024. The questions were classified into higher and lower cognitive levels according to Bloom's taxonomy. Additionally, questions answered incorrectly by both models were tested using recent GPT models that use chain-of-thought reasoning (o1-preview, o3, and o4-mini-high), with evaluations carried out in 2024 and 2025.

Results: GPT-4 achieved 81.27% accuracy (95% CI: 78.89%-83.64%), while GPT-4o reached 85.88% (95% CI: 83.76%-88.00%), significantly outperforming GPT-4 (p < 0.05). Both models showed reduced accuracy on higher-order thinking questions. On the questions both models answered incorrectly, o1-preview achieved 53.26% accuracy (95% CI: 42.87%-63.65%), o3 47.83% (95% CI: 37.42%-58.23%), and o4-mini-high 35.87% (95% CI: 25.88%-45.86%), with all three models performing better on higher-order questions.

Discussion: Artificial intelligence could be a beneficial tool in medical education, enhancing residency exam preparation, helping learners understand complex topics, and improving teaching strategies. However, careful use of artificial intelligence is essential, given ethical concerns and potential limitations in both educational and clinical practice.
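The reported confidence intervals are consistent with a normal-approximation (Wald) interval for a binomial proportion; for instance, GPT-4's 81.27% accuracy on 1,041 questions implies roughly 846 correct answers. A minimal sketch of that arithmetic, assuming the Wald method and the inferred correct-answer count (both are assumptions reconstructed from the reported figures, not stated in the article):

```python
import math

def wald_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) 95% CI for an accuracy proportion.

    correct/total is the observed accuracy; z = 1.96 corresponds to 95%.
    """
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

# GPT-4: assumed 846/1041 correct, i.e. ~81.27% accuracy
lo, hi = wald_ci(846, 1041)
print(f"{846/1041:.2%} (95% CI: {lo:.2%} - {hi:.2%})")
```

Running this reproduces an interval matching the article's GPT-4 figure to within rounding, which supports (but does not confirm) the Wald assumption; narrower intervals for the chain-of-thought models simply reflect their smaller denominator (the 92 questions both baseline models missed).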
Keywords: Generative Artificial Intelligence, Medical Residency Examinations, Medical Education, Artificial Intelligence, Chain-of-Thought Reasoning, Large Language Model
Received: 20 Apr 2025; Accepted: 08 Aug 2025.
Copyright: © 2025 Souto, Fernandes, Silva, Ribeiro and Fernandes. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence:
Maria Souto, University of the State of Rio Grande do Norte, Mossoro, 59610-210, Rio Grande do Norte, Brazil
Alexandre Chaves Fernandes, University of São Paulo, São Paulo, Brazil
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.