ORIGINAL RESEARCH article

Front. Oncol.

Sec. Surgical Oncology

Volume 15 - 2025 | doi: 10.3389/fonc.2025.1590230

This article is part of the Research Topic: Artificial Intelligence in Clinical Oncology: Enhancements in Tumor Management

ChatGPT-4o, Gemini Advanced and DeepSeek R1 in Preoperative Decision-Making for Thyroid Surgery: A Comparative Assessment with Human Surgeons

Provisionally accepted
Long Zou, Peng Zhang, Yu-Qi Jiang, Xiaowen Wang, Xi-Jing Yan, Jie-Zhong Wu, Jia Qi, Wenchao Li, Qing-Qing Cai, Zhi-Rong Xuan, Kunpeng Hu*
  • The Third Affiliated Hospital of Sun Yat-Sen University, Guangzhou, China

The final, formatted version of the article will be published soon.

The integration of large language models (LLMs) into surgical decision-making is an emerging field with potential clinical value. This study assessed how consistently the preoperative decisions of ChatGPT-4o, Gemini Advanced, and DeepSeek R1 matched expert consensus, using clinical data from 123 patients undergoing thyroid surgery. Overall concordance rates were 47.97% for ChatGPT-4o, 24.39% for Gemini Advanced, and 56.10% for DeepSeek R1. In decisions on the extent of thyroidectomy, all three models showed moderate consistency with the surgical team, with agreement rates of 61.79% (κ=0.484) for ChatGPT-4o, 67.48% (κ=0.548) for Gemini Advanced, and 67.48% (κ=0.535) for DeepSeek R1 (all p < 0.001). In lymph node dissection planning, however, the models diverged markedly: DeepSeek R1 achieved the highest concordance rate at 79.67% (κ=0.741), followed by ChatGPT-4o at 69.11% (κ=0.616), while Gemini Advanced performed relatively poorly at 34.96% (κ=0.188). Our findings demonstrate that ChatGPT-4o and DeepSeek R1 exhibit substantial agreement with experienced surgeons in preoperative planning, but overall performance still leaves room for improvement. Model-specific variability, particularly in oncologic decision-making, highlights the need for refinement and robust clinical validation before widespread clinical adoption.
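As a reading aid for the agreement statistics above, the following minimal sketch shows how a raw agreement rate and an unweighted Cohen's κ can be computed for two raters. The abstract does not state which κ variant or decision categories the authors used, so unweighted κ is an assumption here, and the `surgeon` and `model` labels are hypothetical toy data, not study data. The helper `implied_count` is likewise illustrative: it only converts a reported percentage into the nearest whole number of patients out of the 123 in the cohort.

```python
from collections import Counter

N_PATIENTS = 123  # cohort size reported in the abstract


def percent_agreement(a, b):
    """Fraction of cases where the two raters chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and equally long")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of the raters' marginal label frequencies.
    p_e = sum(counts_a[k] * counts_b[k] for k in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


def implied_count(percent, n=N_PATIENTS):
    """Nearest whole number of concordant cases for a reported percentage."""
    return round(percent * n / 100)


# Hypothetical toy labels (NOT study data): extent of thyroidectomy per case.
surgeon = ["total", "total", "lobectomy", "lobectomy",
           "total", "lobectomy", "total", "lobectomy"]
model = ["total", "lobectomy", "lobectomy", "lobectomy",
         "total", "total", "total", "lobectomy"]

print(percent_agreement(surgeon, model))  # 0.75
print(cohen_kappa(surgeon, model))        # 0.5
print(implied_count(79.67))               # 98
```

In the toy data both raters pick each label half the time, so chance agreement is 0.5 and a 75% raw agreement shrinks to κ = 0.5. This is why the abstract reports κ alongside the agreement rates: the same raw rate can correspond to different κ values depending on how the labels are distributed.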

Keywords: thyroid surgery, large language models, preoperative decision-making, clinical concordance, artificial intelligence

Received: 09 Mar 2025; Accepted: 30 Sep 2025.

Copyright: © 2025 Zou, Zhang, Jiang, Wang, Yan, Wu, Qi, Li, Cai, Xuan and Hu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Kunpeng Hu, hkpdhy918@126.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.