ORIGINAL RESEARCH article
Front. Oncol.
Sec. Surgical Oncology
Volume 15 - 2025 | doi: 10.3389/fonc.2025.1590230
This article is part of the Research Topic: Artificial Intelligence in Clinical Oncology: Enhancements in Tumor Management
ChatGPT-4o, Gemini Advanced and DeepSeek R1 in Preoperative Decision-Making for Thyroid Surgery: A Comparative Assessment with Human Surgeons
Provisionally accepted
The Third Affiliated Hospital of Sun Yat-Sen University, Guangzhou, China
The integration of large language models (LLMs) into surgical decision-making is an emerging field with potential clinical value. This study assessed how consistently ChatGPT-4o, Gemini Advanced, and DeepSeek R1 matched expert consensus in preoperative decision-making, using clinical data from 123 patients undergoing thyroid surgery. Overall concordance rates were 47.97% for ChatGPT-4o, 24.39% for Gemini Advanced, and 56.10% for DeepSeek R1. For decisions on the extent of thyroidectomy, all three models showed moderate consistency with the surgical team, with agreement rates of 61.79% (κ = 0.484) for ChatGPT-4o, 67.48% (κ = 0.548) for Gemini Advanced, and 67.48% (κ = 0.535) for DeepSeek R1 (all p < 0.001). The models diverged markedly, however, in lymph node dissection planning: DeepSeek R1 achieved the highest concordance at 79.67% (κ = 0.741), followed by ChatGPT-4o at 69.11% (κ = 0.616), whereas Gemini Advanced performed relatively poorly at 34.96% (κ = 0.188). These findings indicate that ChatGPT-4o and DeepSeek R1 exhibit substantial agreement with experienced surgeons in preoperative planning, although overall performance still leaves room for improvement. Model-specific variability, particularly in oncologic decision-making, highlights the need for refinement and robust clinical validation before widespread clinical adoption.
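The agreement rates and κ values above can be related through a short sketch. It assumes the reported κ is Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), which the abstract does not state explicitly; the function names and the toy labels are hypothetical and are not the study's data. For orientation, an agreement rate of 61.79% over 123 patients corresponds to 76 concordant cases (76/123 ≈ 0.6179).

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of cases on which the two raters chose the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of cases")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability that both raters pick the same label
    # if each chose independently according to their own label frequencies.
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy data (NOT from the study): 10 cases, two surgical options.
surgeons = ["lobectomy"] * 5 + ["total"] * 5
model = ["lobectomy"] * 4 + ["total"] + ["total"] * 4 + ["lobectomy"]

agreement = percent_agreement(surgeons, model)
kappa = cohens_kappa(surgeons, model)
print(f"agreement = {agreement:.2%}, kappa = {kappa:.3f}")
```

In this toy case the raters agree on 8 of 10 cases, while chance alone would be expected to produce 50% agreement, so κ comes out lower than the raw agreement rate. The same correction explains why the reported κ values sit below the corresponding percentages.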
Keywords: thyroid surgery, large language models, preoperative decision-making, clinical concordance, artificial intelligence
Received: 09 Mar 2025; Accepted: 30 Sep 2025.
Copyright: © 2025 Zou, Zhang, Jiang, Wang, Yan, Wu, Qi, Li, Cai, Xuan and Hu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Kunpeng Hu, hkpdhy918@126.com
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.