A Multi-Modal Prompt-tuning Method of Ultrasound Diagnosis for Thyroid Nodule

Xiao, Xiao; Zhou, Ying; Zhu, Yi; Li, Yun; Qi, Tingyue; WANG, WEI

doi:10.3389/fmed.2025.1686374

ORIGINAL RESEARCH article

Front. Med.

Sec. Precision Medicine

This article is part of the Research Topic: Advancements and Challenges in AI-Driven Healthcare Innovation

Provisionally accepted

Xiao Xiao¹

Ying Zhou²

Yi Zhu²

Yun Li²

Tingyue Qi¹

WEI WANG¹*

¹Affiliated Hospital of Yangzhou University, Yangzhou, China
²Yangzhou University Department of Electronic Information Engineering, Yangzhou, China

The final, formatted version of the article will be published soon.

ABSTRACT

Background and Objective: Accurate diagnosis of thyroid nodules from ultrasound images depends heavily on the clinical expertise of radiologists. This reliance poses significant challenges in underdeveloped countries and regions where access to specialized medical resources is limited. Recently, Multi-modal Large Language Models (M-LLMs) have shown promise in handling heterogeneous data, such as images and text, making them attractive candidates for automating labor-intensive diagnostic tasks. However, M-LLMs often struggle in ultrasound diagnosis of thyroid nodules for two main reasons: (1) without domain-specific fine-tuning, they are prone to generating hallucinated content, especially in classification tasks that demand expert-level decision-making; and (2) the multi-modal ultrasound datasets of thyroid nodules that are essential for fine-tuning M-LLMs are prohibitively costly and laborious to build.

Methods: We propose a novel multi-modal prompt-tuning method based on ultrasound images and textual descriptions, which can assist radiologists in improving their diagnoses of the etiology of thyroid nodules. Our approach leverages an image encoder and a prompt-tuning framework to learn effective representations from both modalities without the need for expensive full-model fine-tuning. The resulting features are then fed into a multi-layer perceptron (MLP), which fuses the multi-modal relationships so that the textual information complements the image features and assists in the diagnosis of thyroid nodules.

Results: Extensive experiments on publicly available and privately enrolled datasets demonstrate that our method achieved state-of-the-art performance. It significantly outperformed traditional single-modality methods, with accuracy improvements of up to 40.62% over ResNet and 28.51% over AlexNet on the publicly available dataset. Compared with other multi-modal models, it improved accuracy and F1 score by up to 23.12% and 25.21%, respectively.

Conclusions: Our method surpassed all participating radiologists in accuracy, highlighting its strong potential to assist in expert-level diagnostic decision-making and to provide scalable support for resource-limited clinical environments. Practically, it facilitates faster and more consistent thyroid nodule screening, thereby enhancing diagnostic efficiency.
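The fusion step described in the Methods can be illustrated with a minimal sketch. The abstract does not specify the encoders, feature dimensions, fusion operator, or number of classes, so everything below is an assumption made for illustration: random vectors stand in for the image-encoder output and the prompt-tuned text representation, the two are fused by simple concatenation, and a two-layer MLP with untrained weights produces class probabilities. It is not the authors' implementation.

```python
import numpy as np

# Hypothetical sizes; the abstract does not state any of these.
IMG_DIM, TXT_DIM, HIDDEN, N_CLASSES = 512, 768, 128, 2
BATCH = 4

rng = np.random.default_rng(0)


def mlp_fuse(image_feat, text_feat, w1, b1, w2, b2):
    """Concatenate image and text features, then classify with a 2-layer MLP.

    Concatenation is an assumed fusion operator, chosen for simplicity.
    """
    fused = np.concatenate([image_feat, text_feat], axis=-1)
    hidden = np.maximum(0.0, fused @ w1 + b1)  # ReLU activation
    logits = hidden @ w2 + b2
    # Numerically stable softmax over the class dimension.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Random stand-ins for the two modality representations.
image_feat = rng.normal(size=(BATCH, IMG_DIM))  # would come from an image encoder
text_feat = rng.normal(size=(BATCH, TXT_DIM))   # would come from the prompt-tuned text branch

# Untrained MLP parameters (illustrative only).
w1 = rng.normal(scale=0.02, size=(IMG_DIM + TXT_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
w2 = rng.normal(scale=0.02, size=(HIDDEN, N_CLASSES))
b2 = np.zeros(N_CLASSES)

probs = mlp_fuse(image_feat, text_feat, w1, b1, w2, b2)
print(probs.shape)  # one probability vector per nodule in the batch
```

In a prompt-tuning setup of this kind, only the prompt parameters and the small fusion head would typically be trained while the large backbone stays frozen, which is what keeps the cost below that of full-model fine-tuning.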

Keywords: Multi-Modal, prompt-tuning, Ultrasound diagnosis, Thyroid Nodule, Medical artificial intelligence

Received: 15 Aug 2025; Accepted: 14 Oct 2025.

Copyright: © 2025 Xiao, Zhou, Zhu, Li, Qi and WANG. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: WEI WANG, waywang@126.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.