
ORIGINAL RESEARCH article

Front. Digit. Health

Sec. Health Informatics

Volume 7 - 2025 | doi: 10.3389/fdgth.2025.1670510

This article is part of the Research Topic: AI in Healthcare: Transforming Clinical Risk Prediction, Medical Large Language Models, and Beyond.

Comparative Performance Evaluation of Large Language Models in Answering Esophageal Cancer-Related Questions: A Multi-Model Assessment Study

Provisionally accepted
Zijie He1, Lilan Zhao1, Genglin Li1, Jintao Wang1, Songyu Cai1, Pengjie Tu1, Jingbo Chen2, Jianman Wu2, Juan Zhang1, Ruiqi Chen1, Yangyun Huang1, Xiaojie Pan1, Wenshu Chen1*
  • 1Department of Thoracic Surgery, Fujian Provincial Hospital, Fuzhou, Fujian, China
  • 2Fujian Provincial Hospital, Fuzhou, China

The final, formatted version of the article will be published soon.

Background: Esophageal cancer has high incidence and mortality rates, leading to increased public demand for accurate information. However, the reliability of online medical information is often questionable. This study systematically compares the accuracy, completeness, and comprehensibility of mainstream large language models (LLMs) in answering esophageal cancer-related questions. Methods: A total of 65 questions covering fundamental knowledge, preoperative preparation, surgical treatment, and postoperative management were selected. Each model—ChatGPT 5, Claude Sonnet 4.0, DeepSeek-R1, Gemini 2.5 Pro, and Grok-4—was queried independently using standardized prompts. Five senior clinical experts, including three thoracic surgeons, one radiologist, and one medical oncologist, evaluated responses on a five-point Likert scale. A retesting mechanism was applied to low-scoring responses, and the intraclass correlation coefficient (ICC) was used to assess rating consistency. Statistical analyses included the Friedman test and the Wilcoxon signed-rank test with Bonferroni correction. Results: All models performed well, with average scores exceeding 4.0. However, significant differences emerged: Gemini excelled in accuracy, while ChatGPT led in completeness, particularly in surgical and postoperative contexts. Differences were minor for fundamental knowledge but notable in complex areas. Retesting improved overall response quality, although some responses showed decreased completeness and relevance. Conclusion: Large language models demonstrate considerable potential in esophageal cancer question answering, with significant between-model differences in completeness. ChatGPT is more comprehensive in complex scenarios, while Gemini excels in accuracy.
This study offers guidance for selecting AI tools in clinical settings, advocating a tiered application strategy tailored to specific scenarios and highlighting the importance of user education to clarify the limitations and applicability of LLMs.
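The Methods describe assessing inter-rater agreement with the intraclass correlation coefficient (ICC). As a rough illustration of that step only, the sketch below computes a two-way random-effects ICC(2,1) in pure Python; the function name and all Likert scores are hypothetical placeholders, not the study's data or code.

```python
# Illustrative sketch (assumed names/data): ICC(2,1) for inter-rater agreement,
# rows = rated responses, columns = raters. Not the authors' implementation.

def icc2_1(ratings):
    """Two-way random-effects, single-rater ICC(2,1)."""
    n = len(ratings)        # number of rated responses
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)    # between responses
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)    # between raters
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical 1-5 Likert scores: four responses, five expert raters.
sample = [
    [5, 5, 4, 5, 5],
    [4, 4, 4, 3, 4],
    [3, 2, 3, 3, 2],
    [5, 4, 5, 5, 4],
]
print(round(icc2_1(sample), 3))
```

With perfect agreement across raters the function returns 1.0; values near 0 indicate agreement no better than chance. The subsequent Friedman and pairwise Wilcoxon tests would then be run on the per-question scores for the five models.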

Keywords: large language models, artificial intelligence, esophageal cancer, medical question answering, medical education

Received: 21 Jul 2025; Accepted: 16 Sep 2025.

Copyright: © 2025 He, Zhao, Li, Wang, Cai, Tu, Chen, Wu, Zhang, Chen, Huang, Pan and Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Wenshu Chen, doctorcws@163.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.