AUTHOR=He Zijie, Zhao Lilan, Li Genglin, Wang Jintao, Cai Songyu, Tu Pengjie, Chen Jingbo, Wu Jianman, Zhang Juan, Chen Ruiqi, Huang Yangyun, Pan Xiaojie, Chen Wenshu TITLE=Comparative performance evaluation of large language models in answering esophageal cancer-related questions: a multi-model assessment study JOURNAL=Frontiers in Digital Health VOLUME=Volume 7 - 2025 YEAR=2025 URL=https://www.frontiersin.org/journals/digital-health/articles/10.3389/fdgth.2025.1670510 DOI=10.3389/fdgth.2025.1670510 ISSN=2673-253X ABSTRACT=Background: Esophageal cancer has high incidence and mortality rates, leading to increased public demand for accurate information. However, the reliability of online medical information is often questionable. This study systematically compared the accuracy, completeness, and comprehensibility of mainstream large language models (LLMs) in answering esophageal cancer-related questions. Methods: In total, 65 questions covering fundamental knowledge, preoperative preparation, surgical treatment, and postoperative management were selected. Each model, namely, ChatGPT 5, Claude Sonnet 4.0, DeepSeek-R1, Gemini 2.5 Pro, and Grok-4, was queried independently using standardized prompts. Five senior clinical experts, including three thoracic surgeons, one radiologist, and one medical oncologist, evaluated the responses using a five-point Likert scale. A retesting mechanism was applied for the low-scoring responses, and intraclass correlation coefficients were used to assess the rating consistency. The statistical analyses were conducted using the Friedman test, the Wilcoxon signed-rank test, and the Bonferroni correction. Results: All the models performed well, with average scores exceeding 4.0. However, the following significant differences emerged: Gemini excelled in accuracy, while ChatGPT led in completeness, particularly in surgical and postoperative contexts.
Minor differences appeared in fundamental knowledge, but notable disparities were found in complex areas. Retesting showed improvements in overall quality, yet some responses showed decreased completeness and relevance. Conclusion: Large language models have considerable potential in answering questions about esophageal cancer, with significant differences in completeness. ChatGPT is more comprehensive in complex scenarios, while Gemini excels in accuracy. This study offers guidance for selecting artificial intelligence tools in clinical settings, advocating for a tiered application strategy tailored to specific scenarios and highlighting the importance of user education to understand the limitations and applicability of LLMs.
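
The statistical workflow named in the Methods (a Friedman test across the five models, followed by pairwise Wilcoxon signed-rank tests with a Bonferroni correction) can be illustrated with a minimal sketch. It is not the authors' analysis code. The function names and the score matrix are hypothetical, and the abstract does not state the significance level or how many comparisons were corrected for, so the sketch assumes alpha = 0.05 and all ten model pairs. Only the tie-corrected Friedman statistic and the Bonferroni-adjusted threshold are computed; the Wilcoxon tests themselves are left to a statistics package.

```python
from collections import Counter
from itertools import combinations

# Model names as listed in the abstract.
MODELS = ["ChatGPT 5", "Claude Sonnet 4.0", "DeepSeek-R1", "Gemini 2.5 Pro", "Grok-4"]


def average_ranks(row):
    """Rank one question's scores across models, averaging ranks over ties."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(scores):
    """Tie-corrected Friedman chi-square; rows are questions, columns are models."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    tie_term = 0
    for row in scores:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
        tie_term += sum(t * (t * t - 1) for t in Counter(row).values())
    raw = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    correction = 1 - tie_term / (n * k * (k * k - 1))
    if correction == 0:
        raise ValueError("every question has identical scores for all models")
    return raw / correction


def bonferroni_alpha(models, alpha=0.05):
    """Return all model pairs and the per-comparison significance threshold."""
    pairs = list(combinations(models, 2))
    return pairs, alpha / len(pairs)


# Hypothetical Likert scores (1-5): one row per question, one column per model.
example_scores = [
    [5, 4, 4, 5, 3],
    [4, 4, 3, 5, 4],
    [5, 3, 4, 4, 4],
    [4, 5, 3, 5, 3],
]

pairs, adjusted_alpha = bonferroni_alpha(MODELS)
print("Friedman chi-square:", friedman_statistic(example_scores))
print("Pairwise comparisons:", len(pairs))
print("Bonferroni-adjusted alpha:", adjusted_alpha)
```

With five models there are ten possible pairwise comparisons, so under these assumptions each Wilcoxon signed-rank p-value would be compared against 0.05 / 10 = 0.005 rather than 0.05.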