ORIGINAL RESEARCH article
Front. Public Health
Sec. Digital Public Health
This article is part of the Research Topic: Advancing Healthcare AI: Evaluating Accuracy and Future Directions
Evaluation of Accuracy, Quality, and Readability of Information on Hypothyroidism Provided by Different Artificial Intelligence Chatbot Models
Provisionally accepted
1 Liaoning University of Traditional Chinese Medicine, Shenyang, China
2 Department of Thyroid and Breast Surgery, People's Hospital of China Medical University (Liaoning Provincial People's Hospital), Shenyang, China
3 Department of Cardiology, People's Hospital of Liaoning Province, Shenyang, China
4 Department of General Medicine, People's Hospital of Liaoning Province, Shenyang, China
Objective: This study assessed the accuracy, quality, and readability of responses from three leading AI chatbots (ChatGPT-3.5, DeepSeek-V3, and Google Gemini-2.5) on the diagnosis, treatment, and long-term risks of adult hypothyroidism, comparing their outputs with current clinical guidelines.
Methods: Two thyroid specialists developed 27 questions based on the Guideline for the Diagnosis and Management of Hypothyroidism in Adults (2017 edition), covering three categories: diagnosis, treatment, and long-term health risks. Responses from each AI model were independently evaluated by two reviewers. Accuracy was rated on a six-point Likert scale; quality was assessed with the DISCERN tool and a five-point Likert scale; and readability was measured by the Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), and Simple Measure of Gobbledygook (SMOG).
Results: All three AI models demonstrated excellent accuracy (mean score > 4.5) and quality (high-quality rate > 94%). According to the DISCERN tool, no significant difference was observed in overall information quality among the models. However, Gemini-2.5 generated responses of significantly lower quality for treatment-related questions than for diagnostic inquiries. The content generated by all models was relatively difficult to comprehend (low FRE scores and high FKGL/GFI scores), generally requiring a college-level or higher education for adequate understanding.
Conclusion: All three AI chatbots produced highly accurate, high-quality medical information on hypothyroidism, with responses strongly consistent with clinical guidelines. This underscores the substantial potential of AI in supporting medical information delivery. However, the consistently high reading difficulty of their outputs may limit their practical utility in patient education.
Future research should focus on improving the readability and patient-friendliness of AI outputs, for example through prompt engineering and multi-round dialogue optimization, while maintaining professional accuracy, to enable broader application of AI in health education.
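The four readability indices named in the abstract are standard published formulas based on word, sentence, and syllable counts. The sketch below shows those formulas in Python; it is illustrative only and is not the authors' actual scoring pipeline, and the example counts are hypothetical (in practice the counts would be extracted from each chatbot response by a text-analysis tool).

```python
import math

def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease (FRE): higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level (FKGL): approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(words, sentences, complex_words):
    """Gunning Fog Index (GFI); complex words have three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

def smog(sentences, polysyllables):
    """SMOG grade; the 30/sentences factor scales to a 30-sentence sample."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291

# Hypothetical counts for a 100-word, 5-sentence chatbot response:
print(flesch_reading_ease(100, 5, 160))  # mid-range FRE ("fairly difficult")
print(flesch_kincaid_grade(100, 5, 160)) # grade level around 11
print(gunning_fog(100, 5, 15))           # college-level fog index
print(smog(5, 15))                       # college-level SMOG grade
```

Low FRE combined with high FKGL/GFI/SMOG values, as reported for all three models, corresponds to text requiring college-level education or above.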
Keywords: Artificial intelligence chatbot, hypothyroidism, readability, clinical guideline, patient education
Received: 03 Sep 2025; Accepted: 25 Nov 2025.
Copyright: © 2025 Ruan, Shao, Sun, Ju and Cui. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence:
Xingai Ju
Jianchun Cui
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.