
ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. Medicine and Public Health

This article is part of the Research Topic: Explainable Artificial Intelligence for Trustworthy and Human-Centric Healthcare: Methods, Evaluation, and Clinical Impact

Evaluating Chain-of-Thought Reasoning in Large Language Models for Thyroid Ultrasound Interpretation: A Dual-Information Approach

Provisionally accepted
Yu-Tong Zhang1, Si-Yi Wu1, Dong Zhang2, Zheng-Yi Yang1, Sheng-Wei Zhao2, Hongcheng Han2, Xin Yuan1, Lirong Wang1, Jiang Jue1, Shaoyi Du2, Qi Zhou1, Juan Wang1*
  • 1Department of Ultrasound, The Second Affiliated Hospital of Xi'an Jiaotong University, Xi'an, China
  • 2Xi'an Jiaotong University, Xi'an, China

The final, formatted version of the article will be published soon.

Objective: To assess whether reasoning-capable large language models (LLMs) can accurately interpret both qualitative and quantitatively encoded ultrasound features of thyroid nodules within the ACR-TIRADS framework and improve diagnostic reliability.

Methods: This retrospective study analyzed thyroid nodules with both radiologist-labeled qualitative ultrasound features and quantitatively encoded descriptors generated through standardized numerical modeling. Both formats were converted into structured prompts and submitted separately to four chain-of-thought (CoT)-enabled LLMs (ChatGPT-O3, Grok-3, DeepSeek-R1, Gemini-2.5 Pro), each performing three reasoning rounds per task. Diagnostic performance was evaluated by accuracy and reproducibility, and two types of inconsistencies (cross-threshold and cross-modal conflicts) were quantified. Reasoning authenticity and conciseness were independently assessed by radiologists of varying experience. Sankey diagrams were used to summarize ACR-TIRADS category transitions.

Results: ChatGPT-O3, Gemini-2.5 Pro, and Grok-3 showed strong ACR-TIRADS categorization accuracy (91%, 96%, and 96%, respectively), outperforming DeepSeek-R1 (79%). Grok-3 achieved the highest score-based accuracy (96%) and DeepSeek-R1 the lowest (52%). Categorization reproducibility was 93% for Grok-3, 90% for Gemini-2.5 Pro, and 88% for ChatGPT-O3, versus 67% for DeepSeek-R1. For scoring reproducibility, Grok-3 (93%), ChatGPT-O3 (90%), and Gemini-2.5 Pro (79%) exceeded DeepSeek-R1 (18%). Physicians rated Grok-3 and Gemini-2.5 Pro highest in reasoning authenticity, while ChatGPT-O3 was most concise (mean 144 words). For quantitative tasks, Gemini-2.5 Pro (78%) and DeepSeek-R1 (74%) were the most accurate, while Grok-3 was the least accurate (64%). Reproducibility was highest for Gemini-2.5 Pro (84%) and DeepSeek-R1 (86%). Across models, the proportion of nodules exhibiting cross-threshold discrepancies ranged from 3% to 17%, with Grok-3 lowest and DeepSeek-R1 highest. Cross-modal conflicts were more frequent, ranging from 27% to 36% across the four LLMs.

Conclusion: Grok-3 excelled in qualitative tasks, while Gemini-2.5 Pro and DeepSeek-R1 showed strengths in quantitative analysis. CoT-enabled LLMs offered interpretable reasoning, showing promise for clinical decision support.
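To make the dual-information setup in the Methods concrete, the minimal sketch below shows one way radiologist-labeled qualitative features could be serialized into a structured ACR-TIRADS prompt, queried over three independent reasoning rounds, and checked for reproducibility. The point values and TR cut-offs follow the published ACR TI-RADS lexicon, but the prompt wording, the query_llm client, and the answer-parsing helper are illustrative assumptions rather than the authors' actual pipeline.

```python
# Minimal sketch, assuming a hypothetical query_llm(model, prompt) chat client.
# Prompt wording and helpers are illustrative, not the authors' pipeline;
# point values and TR cut-offs follow the ACR TI-RADS (2017) lexicon.
import re

ACR_POINTS = {
    "composition": {"cystic": 0, "spongiform": 0, "mixed cystic and solid": 1, "solid": 2},
    "echogenicity": {"anechoic": 0, "hyperechoic or isoechoic": 1,
                     "hypoechoic": 2, "very hypoechoic": 3},
    "shape": {"wider-than-tall": 0, "taller-than-wide": 3},
    "margin": {"smooth": 0, "ill-defined": 0, "lobulated or irregular": 2,
               "extra-thyroidal extension": 3},
    "echogenic foci": {"none": 0, "macrocalcifications": 1,
                       "peripheral calcifications": 2, "punctate echogenic foci": 3},
}

def score(features: dict) -> int:
    """Sum ACR TI-RADS points over the five qualitative feature categories."""
    return sum(ACR_POINTS[k][v] for k, v in features.items())

def tirads_category(points: int) -> str:
    """Map total points to a TR category (0 -> TR1, 2 -> TR2, 3 -> TR3, 4-6 -> TR4, >=7 -> TR5)."""
    if points == 0:
        return "TR1"
    if points <= 2:
        return "TR2"
    if points == 3:
        return "TR3"
    if points <= 6:
        return "TR4"
    return "TR5"

def build_prompt(features: dict) -> str:
    """Serialize radiologist-labeled features into a structured prompt."""
    lines = "\n".join(f"- {k}: {v}" for k, v in features.items())
    return ("Thyroid nodule ultrasound features:\n" + lines +
            "\nUsing ACR TI-RADS, reason step by step, then state the total "
            "points and the TR category on the last line.")

def extract_tr(reply: str) -> str | None:
    """Pull the last 'TR<n>' token from a model reply (hypothetical parsing)."""
    hits = re.findall(r"TR[1-5]", reply)
    return hits[-1] if hits else None

# Example nodule: 2 (solid) + 2 (hypoechoic) = 4 points -> TR4.
nodule = {"composition": "solid", "echogenicity": "hypoechoic",
          "shape": "wider-than-tall", "margin": "smooth", "echogenic foci": "none"}
print(score(nodule), tirads_category(score(nodule)))

# Three reasoning rounds per nodule; a run counts as reproducible only if
# all three rounds return the same TR category.
# replies = [query_llm("gemini-2.5-pro", build_prompt(nodule)) for _ in range(3)]
# reproducible = len({extract_tr(r) for r in replies}) == 1
```

Under this sketch, accuracy would be judged against the reference TR category, and a cross-threshold discrepancy would correspond to rounds whose point totals fall on different sides of a TR cut-off; the quantitative-descriptor prompts would be built analogously from the numerically encoded features.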

Keywords: ACR-TIRADS, chain-of-thought reasoning, dual-modal ultrasound characteristics, large language models, thyroid nodules

Received: 04 Jan 2026; Accepted: 16 Feb 2026.

Copyright: © 2026 Zhang, Wu, Zhang, Yang, Zhao, Han, Yuan, Wang, Jue, Du, Zhou and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Juan Wang

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.