ORIGINAL RESEARCH article

Front. Endocrinol.

Sec. Thyroid Endocrinology

Volume 16 - 2025 | doi: 10.3389/fendo.2025.1667809

LLM Evaluation for Thyroid Nodule Assessment: Comparing ACR-TIRADS, C-TIRADS, and Clinician-AI Trust Gap

Provisionally accepted
Xi Dai1, Yu Xi2, Yong Hu2, Qingyan Ding2, Yu Zhang2, Hui Liu2, Piaofei Chen2, Xi Wang2, Wenjun Wang2, Chaoxue Zhang3*
  • 1The First Affiliated Hospital of Anhui Medical University, Hefei, China
  • 2Huangshan City People's Hospital, Huangshan, China
  • 3Department of Ultrasound, The First Affiliated Hospital of Anhui Medical University, Hefei, China

The final, formatted version of the article will be published soon.

Objective: To evaluate the diagnostic performance and clinical utility of three advanced large language models (LLMs), namely GPT-4o, GPT-o3-mini, and DeepSeek-R1, in stratifying thyroid nodule malignancy risk and generating guideline-aligned management recommendations from structured narrative ultrasound descriptions.

Methods: This diagnostic modeling study evaluated the three LLMs using standardized narrative ultrasound descriptors. The descriptors were annotated by consensus among three senior board-certified sonologists, and each was processed independently in a stateless manner (no context carried over between cases) to avoid bias in the outputs. LLM outputs were assessed under both the ACR-TIRADS and C-TIRADS frameworks. Two experienced clinicians (a thyroid surgeon and an endocrinologist) independently rated the outputs on five clinical dimensions using 5-point Likert scales. Primary outcomes were the area under the receiver operating characteristic curve (AUC) for malignancy prediction and the clinician ratings of guideline adherence, patient safety, operational feasibility, clinical applicability, and overall performance.

Results: GPT-4o achieved the highest predictive AUC (0.898) under C-TIRADS, approaching expert-level accuracy. DeepSeek-R1, particularly with C-TIRADS, received the highest clinician ratings (mean Likert: surgeon 4.65, endocrinologist 4.63), reflecting greater trust in its practical recommendations. Clinicians consistently favored the C-TIRADS framework across all models. GPT-4o and GPT-o3-mini received lower ratings for trustworthiness and recommendation quality, especially from the endocrinologist.

Conclusion: Although GPT-4o demonstrated superior diagnostic accuracy, clinicians most trusted DeepSeek-R1 combined with the C-TIRADS framework for generating practical, guideline-consistent recommendations. The findings highlight the need to align AI-generated outputs with clinician expectations, and the importance of incorporating region-specific clinical guidelines (such as C-TIRADS) for effective real-world implementation of LLMs in decision support for thyroid nodule management.
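
For readers unfamiliar with the primary outcome, the following is a minimal sketch of how an AUC can be computed when the predictor is an ordinal risk category rather than a continuous score. It is not the authors' analysis code: the function name `auc_from_scores`, the integer encoding of risk levels, and the toy data are all hypothetical and chosen only for illustration. The sketch uses the rank-based definition of AUC, that is, the probability that a randomly chosen malignant nodule receives a higher risk level than a randomly chosen benign one, with ties counted as one half.

```python
from typing import Sequence


def auc_from_scores(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC: P(score of a malignant case > score of a benign case).

    Ties between a malignant and a benign case contribute 0.5.
    `labels` uses 1 for malignant and 0 for benign.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one malignant and one benign case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the study): one ordinal risk level per
# nodule as assigned by a model, and the reference diagnosis for each nodule.
risk_levels = [2, 3, 3, 4, 4, 5, 5, 5]
malignant = [0, 0, 0, 0, 1, 1, 0, 1]

auc = auc_from_scores(risk_levels, malignant)
print(f"AUC = {auc:.3f}")
```

Because risk categories take only a few distinct values, ties are common, which is why the half-credit rule for ties matters in this setting.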

Keywords: Large Language Models (LLMs), thyroid nodules, risk stratification, ACR-TIRADS, C-TIRADS, clinical decision-making

Received: 17 Jul 2025; Accepted: 15 Sep 2025.

Copyright: © 2025 Dai, Xi, Hu, Ding, Zhang, Liu, Chen, Wang, Wang and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Chaoxue Zhang, zcxay@163.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.