ORIGINAL RESEARCH article

Front. Digit. Health

Sec. Health Informatics

Evaluating the Accuracy of AI-Model Generated Medical Information by ChatGPT and Gemini in Alignment with International Clinical Guidelines: A Comparison with Surviving Sepsis Campaign

  • 1. King Fahad Armed Forces Hospital, Jeddah, Saudi Arabia

  • 2. Jeddah International College, Jeddah, Saudi Arabia



Abstract

Background: The assessment of artificial intelligence (AI) chatbots such as ChatGPT and Google Gemini against international guidelines in providing medical information is a burgeoning area of research. These AI models are increasingly considered for their potential to support clinical decision-making and patient education, but their accuracy and reliability in delivering medical information that aligns with established guidelines remain under scrutiny. This study assesses the accuracy of medical information generated by ChatGPT and Gemini with respect to its alignment with international guidelines for sepsis management.

Methods: ChatGPT-4o and Gemini 1.5, accessed in December 2024, were asked 18 questions (Supplementary Data S1, S2, and S3) derived from the Surviving Sepsis Campaign international guidelines, and the responses were evaluated by seven independent intensive care physicians. Each response was scored as follows: 3 = correct, complete, and accurate; 2 = correct but incomplete or inaccurate; 1 = incorrect. This scoring system was chosen to provide a clear and straightforward assessment of the accuracy and completeness of the responses. Fleiss' kappa was used to assess agreement among the evaluators, and the Mann-Whitney U test was used to test for a significant difference between the responses generated by ChatGPT and Gemini.

Results: ChatGPT provided 5 (28%) perfect responses, 12 (67%) nearly perfect responses, and 1 (5%) low-quality response, with substantial agreement among the evaluators (Fleiss' kappa = 0.656). Gemini provided 3 (17%) perfect responses, 14 (78%) nearly perfect responses, and 1 (5%) low-quality response, with moderate agreement among the evaluators (Fleiss' kappa = 0.582). The Mann-Whitney U test revealed no statistically significant difference between the two platforms (p = 0.4843).

Conclusion: Both ChatGPT and Gemini demonstrated potential for generating medical information and, despite their current limitations, showed promise as complementary tools for patient education and clinical decision-making. The medical information generated by ChatGPT and Gemini still requires continuous evaluation of its accuracy, reliability, and alignment with international guidelines across medical domains, particularly in sepsis care.
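The sketch below is a minimal, illustrative rendering of the statistical workflow described in the Methods: Fleiss' kappa for inter-rater agreement among the seven evaluators and a two-sided Mann-Whitney U test for the between-platform comparison. The rating matrices are randomly generated stand-ins rather than the study data (the actual scores are in the supplementary material), and aggregating per-question scores by the median across raters is an assumption; the authors' exact aggregation may differ.

```python
# Illustrative sketch only, with hypothetical rating data (not the study's scores).
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

# Hypothetical scores: 18 questions x 7 physicians, each rated 1, 2, or 3.
chatgpt_scores = rng.integers(1, 4, size=(18, 7))
gemini_scores = rng.integers(1, 4, size=(18, 7))

# Inter-rater agreement per platform: convert raw ratings to a
# (questions x categories) count table, then compute Fleiss' kappa.
chatgpt_table, _ = aggregate_raters(chatgpt_scores)
gemini_table, _ = aggregate_raters(gemini_scores)
print("ChatGPT Fleiss' kappa:", fleiss_kappa(chatgpt_table))
print("Gemini  Fleiss' kappa:", fleiss_kappa(gemini_table))

# Compare the platforms' per-question scores (median across raters, an
# assumed aggregation) with a two-sided Mann-Whitney U test.
chatgpt_median = np.median(chatgpt_scores, axis=1)
gemini_median = np.median(gemini_scores, axis=1)
stat, p = mannwhitneyu(chatgpt_median, gemini_median, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.4f}")
```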


Keywords

artificial intelligence, chatbots, ChatGPT, Gemini, information, large language models, sepsis

Received

25 July 2025

Accepted

20 January 2026

Copyright

© 2026 Kutbi, Abou-Bakr and Mousa. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Dina Kutbi

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
