ORIGINAL RESEARCH article
Front. Artif. Intell.
Sec. Medicine and Public Health
ChatGPT and Reference Intervals: A Comparative Analysis of Repeatability in GPT-3.5 Turbo, GPT-4, and GPT-4o
Provisionally accepted
- 1University Hospital Cologne, Department of Anesthesiology and Operative Intensive Care, Faculty of Medicine and University Hospital, Cologne, Germany
- 2University of Cologne, Institute of Clinical Chemistry, Faculty of Medicine and University Hospital, University Hospital Cologne, Kerpener Str. 62, 50937 Cologne, Germany
- 3Medical Director, Faculty of Medicine and University Hospital, University Hospital Cologne, Cologne, Germany
Background: Large language models such as ChatGPT hold promise as rapid "curbside consultation" tools in laboratory medicine. However, their ability to generate consistent and clinically reliable reference intervals, particularly in the absence of contextual clinical information, remains uncertain.

Methods: This cross-sectional study evaluated whether three versions of ChatGPT (GPT-3.5-Turbo, GPT-4, GPT-4o) produce repeatable reference-interval outputs when the prompt intentionally omits the interval, using reference-interval variability as a stress test for model consistency. Standardized prompts were submitted in 726,000 chatbot requests, and the resulting 246,842 reference intervals across 47 laboratory parameters were analyzed for consistency using the coefficient of variation (CV) and regression models.

Results: On average, the chatbots exhibited a CV of 26.50% (IQR: 7.35%-129.01%) for the lower limit and 15.82% (IQR: 4.50%-45.30%) for the upper limit on repetition. GPT-4 and GPT-4o demonstrated significantly lower CVs than GPT-3.5-Turbo. Reference intervals for poorly standardized parameters were particularly inconsistent for both the lower limit (β: 0.6; 95% CI: 0.35 to 0.86; p < 0.001) and the upper limit (β: 0.5; 95% CI: 0.28 to 0.71; p < 0.001), and unit expressions also varied.

Conclusion: Although the newer ChatGPT versions tested show improved repeatability, diagnostically unacceptable variability persists, particularly for poorly standardized analytes. Mitigating this requires thoughtful prompt design (e.g., mandatory inclusion of reference intervals), global harmonization of laboratory standards, further model refinement, and robust regulatory oversight. Until then, AI chatbots should be restricted to professional use and trained to refuse laboratory interpretation when reference intervals are not provided by the user.
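To make the repeatability metric concrete, the sketch below illustrates the general approach described in the abstract: a standardized prompt is sent repeatedly to each model, the returned lower and upper limits are parsed, and their coefficient of variation (CV = SD / mean) is computed. This is a minimal illustration, not the authors' actual pipeline; the prompt wording, the example analyte (serum potassium), the parsing regex, and the repetition count are all assumptions.

```python
# Minimal sketch (not the authors' pipeline): repeatedly query a ChatGPT model
# for a reference interval and quantify repeatability via the coefficient of
# variation (CV = 100 * SD / mean) of the returned lower and upper limits.
# Prompt text, analyte, parsing, and repetition count are illustrative assumptions.
import re
import statistics

from openai import OpenAI  # pip install openai

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "State the reference interval for serum potassium in adults. "
    "Answer only in the form 'lower-upper unit'."
)


def query_interval(model: str) -> tuple[float, float] | None:
    """Send the standardized prompt once and parse 'lower-upper' from the reply."""
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    ).choices[0].message.content
    match = re.search(r"(\d+(?:\.\d+)?)\s*[-–]\s*(\d+(?:\.\d+)?)", reply or "")
    return (float(match.group(1)), float(match.group(2))) if match else None


def cv_percent(values: list[float]) -> float:
    """Coefficient of variation in percent: 100 * sample SD / mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


def repeatability(model: str, n: int = 50) -> tuple[float, float]:
    """Repeat the query n times and return (CV of lower limit, CV of upper limit)."""
    intervals = [iv for iv in (query_interval(model) for _ in range(n)) if iv]
    lowers, uppers = zip(*intervals)
    return cv_percent(list(lowers)), cv_percent(list(uppers))


if __name__ == "__main__":
    for model in ("gpt-3.5-turbo", "gpt-4", "gpt-4o"):
        cv_low, cv_high = repeatability(model)
        print(f"{model}: CV lower = {cv_low:.1f}%, CV upper = {cv_high:.1f}%")
```

In this framing, a higher CV across repetitions of the identical prompt indicates poorer repeatability of the model's reference-interval output for that analyte.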
Keywords: Chatbot, ChatGPT, Reference interval, Repeatability, Consistency, Large language model
Received: 14 Aug 2025; Accepted: 12 Nov 2025.
Copyright: © 2025 Meyer, Schömig and Streichert. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Annika Meyer, annika.meyer@uk-koeln.de
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
