ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. Medicine and Public Health

ChatGPT and Reference Intervals: A Comparative Analysis of Repeatability in GPT-3.5 Turbo, GPT-4, and GPT-4o

Provisionally accepted
Annika Meyer1,2*, Edgar Schömig3, Thomas Streichert2
  • 1University Hospital Cologne, Department of Anesthesiology and Operative Intensive Care, Faculty of Medicine and University Hospital, Cologne, Germany
  • 2University of Cologne, Institute of Clinical Chemistry, Faculty of Medicine and University Hospital Cologne, Kerpener Str. 62, 50937 Cologne, Germany
  • 3Medical Director, Faculty of Medicine and University Hospital, University Hospital Cologne, Cologne, Germany

The final, formatted version of the article will be published soon.

Background: Large language models such as ChatGPT hold promise as rapid “curbside consultation” tools in laboratory medicine. However, their ability to generate consistent and clinically reliable reference intervals - particularly in the absence of contextual clinical information - remains uncertain.

Method: This cross-sectional study evaluated whether three versions of ChatGPT (GPT-3.5-Turbo, GPT-4, GPT-4o) produce repeatable reference-interval outputs when the prompt intentionally omits the interval, using reference-interval variability as a stress test for model consistency. Standardized prompts were submitted in 726,000 chatbot requests, and the resulting 246,842 reference intervals across 47 laboratory parameters were analyzed for consistency using the coefficient of variation (CV) and regression models.

Results: Across repetitions, the chatbots exhibited a CV of 26.50% (IQR: 7.35%-129.01%) for the lower limit and 15.82% (IQR: 4.50%-45.30%) for the upper limit. GPT-4 and GPT-4o demonstrated significantly lower CVs than GPT-3.5-Turbo. Reference intervals for poorly standardized parameters were particularly inconsistent for both the lower limit (β: 0.6; 95% CI: 0.35 to 0.86; p < 0.001) and the upper limit (β: 0.5; 95% CI: 0.28 to 0.71; p < 0.001), and unit expressions also varied.

Conclusion: While the newer ChatGPT versions tested demonstrate improved repeatability, diagnostically unacceptable variability persists, particularly for poorly standardized analytes. Mitigating this requires thoughtful prompt design (e.g., mandatory inclusion of reference intervals), global harmonization of laboratory standards, further model refinement, and robust regulatory oversight. Until then, AI chatbots should be restricted to professional use and trained to refuse laboratory interpretation when reference intervals are not provided by the user.
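For illustration, the repeatability metric described above (the coefficient of variation of repeated chatbot outputs, computed separately for the lower and upper limit of each laboratory parameter) can be sketched in a few lines of Python. This is a minimal sketch, not the authors' analysis code; the file name and column names (model, parameter, lower_limit, upper_limit) are hypothetical assumptions.

```python
# Minimal sketch of the repeatability metric: CV (%) of repeated reference-interval
# outputs, grouped by model and laboratory parameter. Not the study's actual pipeline.
import pandas as pd

def coefficient_of_variation(values: pd.Series) -> float:
    """CV in percent: sample standard deviation divided by the mean of repeated outputs."""
    return 100.0 * values.std(ddof=1) / values.mean()

# Hypothetical long-format table: one row per chatbot response.
# df = pd.read_csv("chatbot_reference_intervals.csv")
# cv_per_parameter = (
#     df.groupby(["model", "parameter"])[["lower_limit", "upper_limit"]]
#       .agg(coefficient_of_variation)
# )
# print(cv_per_parameter.describe())  # summarize repeatability across parameters
```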

Keywords: Chatbot, ChatGPT, Reference interval, Repeatability, Consistency, Large language model

Received: 14 Aug 2025; Accepted: 12 Nov 2025.

Copyright: © 2025 Meyer, Schömig and Streichert. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Annika Meyer, annika.meyer@uk-koeln.de

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.