ORIGINAL RESEARCH article
Front. Artif. Intell.
Sec. Medicine and Public Health
This article is part of the Research Topic: Explainable Artificial Intelligence for Trustworthy and Human-Centric Healthcare: Methods, Evaluation, and Clinical Impact
Evaluating Chain-of-Thought Reasoning in Large Language Models for Thyroid Ultrasound Interpretation: A Dual-Information Approach
Provisionally accepted
1 Department of Ultrasound, The Second Affiliated Hospital of Xi'an Jiaotong University, Xi'an, China
2 Xi'an Jiaotong University, Xi'an, China
Objective: To assess whether reasoning-capable large language models (LLMs) can accurately interpret both qualitative and quantitatively encoded ultrasound features of thyroid nodules within the ACR-TIRADS framework and improve diagnostic reliability.

Methods: This retrospective study analyzed thyroid nodules with both radiologist-labeled qualitative ultrasound features and quantitatively encoded descriptors generated through standardized numerical modeling. Both formats were converted into structured prompts and input separately into four chain-of-thought (CoT)-enabled LLMs (ChatGPT-O3, Grok-3, DeepSeek-R1, Gemini-2.5 Pro), each performing three reasoning rounds per task. Diagnostic performance was evaluated by accuracy and reproducibility, and two types of inconsistencies—cross-threshold and cross-modal conflicts—were quantified. Reasoning authenticity and conciseness were independently assessed by radiologists of varying experience. Sankey diagrams were used to summarize ACR-TIRADS category transitions.

Results: ChatGPT-O3, Gemini-2.5 Pro, and Grok-3 showed high ACR-TIRADS categorization accuracy (91%, 96%, and 96%, respectively), outperforming DeepSeek-R1 (79%). Grok-3 achieved the highest score-based accuracy (96%) and DeepSeek-R1 the lowest (52%). Categorization reproducibility was 93% for Grok-3, 90% for Gemini-2.5 Pro, and 88% for ChatGPT-O3, versus 67% for DeepSeek-R1. For scoring reproducibility, Grok-3 (93%), ChatGPT-O3 (90%), and Gemini-2.5 Pro (79%) exceeded DeepSeek-R1 (18%). Physicians rated Grok-3 and Gemini-2.5 Pro highest in reasoning authenticity, while ChatGPT-O3 was most concise (mean 144 words). For quantitative tasks, Gemini-2.5 Pro (78%) and DeepSeek-R1 (74%) were most accurate, and Grok-3 was least accurate (64%). Reproducibility was highest for Gemini-2.5 Pro (84%) and DeepSeek-R1 (86%). Across models, the proportion of nodules exhibiting cross-threshold discrepancies ranged from 3% to 17%, lowest for Grok-3 and highest for DeepSeek-R1. Cross-modal conflicts were more frequent, ranging from 27% to 36% across the four LLMs.

Conclusion: Grok-3 excelled in qualitative tasks, while Gemini-2.5 Pro and DeepSeek-R1 showed strengths in quantitative analysis. CoT-enabled LLMs offered interpretable reasoning with promise for clinical decision support.
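To make the evaluation concrete, the sketch below illustrates the kind of reference computation against which model outputs can be checked: mapping radiologist-labeled qualitative features to ACR TI-RADS points and testing whether three reasoning rounds agree. This is a minimal illustration, not the authors' pipeline; the feature names, example nodule, and reproducibility rule are assumptions, while the point values follow the published ACR TI-RADS lexicon (single dominant echogenic-focus type assumed for brevity).

```python
# Minimal sketch (illustrative, not the study's actual pipeline): map qualitative
# ultrasound features to ACR TI-RADS points and check reproducibility across
# three chain-of-thought reasoning rounds.

# Point values follow the published ACR TI-RADS lexicon; verify against the
# current guideline before any clinical use.
POINTS = {
    "composition": {"cystic": 0, "spongiform": 0, "mixed": 1, "solid": 2},
    "echogenicity": {"anechoic": 0, "hyperechoic": 1, "isoechoic": 1,
                     "hypoechoic": 2, "very_hypoechoic": 3},
    "shape": {"wider_than_tall": 0, "taller_than_wide": 3},
    "margin": {"smooth": 0, "ill_defined": 0, "lobulated": 2,
               "irregular": 2, "extrathyroidal_extension": 3},
    "echogenic_foci": {"none": 0, "comet_tail": 0, "macrocalcification": 1,
                       "peripheral_rim": 2, "punctate": 3},
}


def tirads_category(features):
    """Sum lexicon points and map the total to an ACR TI-RADS category (TR1-TR5)."""
    total = sum(POINTS[axis][value] for axis, value in features.items())
    if total == 0:
        return total, "TR1"
    if total <= 2:          # a 1-point total is grouped with TR2, per common practice
        return total, "TR2"
    if total == 3:
        return total, "TR3"
    if total <= 6:
        return total, "TR4"
    return total, "TR5"


def reproducible(rounds):
    """Treat a categorization as reproducible only if all reasoning rounds agree."""
    return len(set(rounds)) == 1


# Hypothetical example nodule and three model outputs (illustrative values only).
nodule = {
    "composition": "solid",
    "echogenicity": "hypoechoic",
    "shape": "wider_than_tall",
    "margin": "smooth",
    "echogenic_foci": "punctate",
}
points, reference = tirads_category(nodule)   # 7 points -> TR5
model_rounds = ["TR5", "TR5", "TR4"]          # e.g., one LLM's three reasoning rounds
print(points, reference, reproducible(model_rounds))
```

Under this kind of reference scoring, a model's output for each nodule can be compared with the lexicon-derived category (accuracy) and with its own repeated answers (reproducibility), which is the structure of the evaluation reported above.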
Keywords: ACR-TIRADS, Chain-of-thought reasoning, Dual-Modal Ultrasound Characteristics, Large language models, thyroid nodules
Received: 04 Jan 2026; Accepted: 16 Feb 2026.
Copyright: © 2026 Zhang, Wu, Zhang, Yang, Zhao, Han, Yuan, Wang, Jue, Du, Zhou and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Juan Wang
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
