
ORIGINAL RESEARCH article

Front. Med.

Sec. Pulmonary Medicine

Performance of Large Language Models on Sleep Medicine Certification Examination: A Comprehensive Multi-Model Analysis

Provisionally accepted
Abdurrahman Koç1*, Abdullah Enes Ataş2, Şebnem Yosunkaya2, Hülya Vatansev2
  • 1Meram State Hospital, Konya, Türkiye
  • 2Necmettin Erbakan Üniversitesi Tıp Fakültesi, Meram, Türkiye

The final, formatted version of the article will be published soon.

Purpose: To evaluate and compare the performance of nine contemporary large language model (LLM) configurations on sleep medicine certification examination-aligned questions, analyzing version differences, pricing tiers, and subdomain competencies.

Methods: Cross-sectional comparative analysis of 197 multiple-choice questions structured according to American Academy of Sleep Medicine (AASM) standards. Nine LLM configurations were evaluated: ChatGPT (GPT-3.5 free, GPT-4o paid), Gemini (2.5 Flash free, 2.5 Pro paid), Claude (3.7 Sonnet previous, Opus 4 paid), DeepSeek V3 (free), xAI Grok 3 (free), and Llama 3 (free). Each question was posed three times in independent sessions. Final accuracy was determined using a strict 3/3 concordance criterion (an answer was scored correct only when all three iterations yielded the identical correct answer); supplementary analyses using majority voting (2/3) yielded consistent rankings. Performance metrics included overall accuracy, 95% confidence intervals (CI), and subdomain analyses across seven categories. Statistical analyses employed Pearson's chi-square and McNemar's tests.

Results: Model performance demonstrated significant heterogeneity (χ²=101.95, df=8, p<0.001), with accuracy ranging from 68.5% to 95.9%. Gemini 2.5 Pro achieved the highest accuracy (95.9%, 95% CI: 93.2–98.7), followed by Claude Opus 4 (93.9%, 95% CI: 90.6–97.2) and ChatGPT GPT-4o (93.4%, 95% CI: 89.9–96.9). Premium versions consistently outperformed their free alternatives, with differences of 5.1–8.6 percentage points (all p<0.05). Subdomain analysis revealed the highest consistency in Secondary Sleep Disorders (92.0% mean accuracy) and the greatest variability in Diagnostic Methods (85.9%). Sensitivity analysis comparing three scoring criteria (single-try, majority voting, strict concordance) revealed minimal impact on rankings (Spearman's ρ=0.879–1.000, p<0.01); majority voting and strict concordance yielded identical accuracy for seven models owing to high response consistency. Eight models exceeded the 80% reference benchmark under all criteria.

Conclusions: Contemporary LLMs demonstrate substantially improved performance compared with previous evaluations, with premium models exceeding the 80% benchmark. However, these results reflect performance on a certification-aligned question bank rather than the official board examination. The significant performance advantage of paid versions raises important considerations regarding equitable access to AI-enhanced medical education and clinical decision support tools.
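To make the scoring criteria and summary statistics concrete, the Python sketch below illustrates, under stated assumptions, how accuracy under the single-try, majority-voting, and strict 3/3 concordance criteria, a 95% confidence interval, the Pearson chi-square heterogeneity test, and the Spearman rank comparison could be computed. The model names, simulated 197×3 correctness matrices, and the normal-approximation (Wald) interval are illustrative assumptions, not the authors' actual data or analysis code.

```python
"""Minimal, hypothetical sketch of the scoring and analysis pipeline
described in the abstract; data shapes and accuracy rates are simulated."""

import numpy as np
from scipy.stats import chi2_contingency, spearmanr

# Assumed data shape: per model, a (197 questions x 3 iterations) boolean
# array marking whether each iteration produced the correct answer.
rng = np.random.default_rng(0)
models = {name: rng.random((197, 3)) < p
          for name, p in [("Model A", 0.96), ("Model B", 0.93),
                          ("Model C", 0.70)]}

def score(correct, criterion):
    """Per-question correctness under one scoring criterion.
    single   -- first iteration only
    majority -- at least 2 of 3 iterations correct
    strict   -- all 3 iterations correct (3/3 concordance)"""
    return {"single": correct[:, 0],
            "majority": correct.sum(axis=1) >= 2,
            "strict": correct.all(axis=1)}[criterion]

def wald_ci(p, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    half = z * np.sqrt(p * (1 - p) / n)
    return p - half, p + half

n_items = 197
strict_counts = []
for name, correct in models.items():
    per_q = score(correct, "strict")
    acc = per_q.mean()
    lo, hi = wald_ci(acc, n_items)
    strict_counts.append([per_q.sum(), n_items - per_q.sum()])
    print(f"{name}: {acc:.1%} (95% CI {lo:.1%}-{hi:.1%})")

# Heterogeneity across models: Pearson chi-square on correct/incorrect counts.
chi2, p_val, dof, _ = chi2_contingency(np.array(strict_counts))
print(f"chi2={chi2:.2f}, df={dof}, p={p_val:.4f}")

# Rank stability across scoring criteria: Spearman's rho between criteria.
accs = {c: [score(m, c).mean() for m in models.values()]
        for c in ("single", "majority", "strict")}
rho, p_rho = spearmanr(accs["strict"], accs["majority"])
print(f"Spearman rho (strict vs majority) = {rho:.3f}, p = {p_rho:.3f}")
```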

Keywords: artificial intelligence, certification examination, large language models, medical education, sleep medicine

Received: 04 Dec 2025; Accepted: 16 Feb 2026.

Copyright: © 2026 Koç, Ataş, Yosunkaya and Vatansev. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Abdurrahman Koç

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.