ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. Medicine and Public Health

Volume 8 - 2025 | doi: 10.3389/frai.2025.1618426

This article is part of the Research Topic: Digital Medicine and Artificial Intelligence

Thyroid Nodule Segmentation in Ultrasound Images Using Transformer Models with Masked Autoencoder Pre-Training

Provisionally accepted
  • 1Singapore Health Services Pte Ltd, Singapore, Singapore
  • 2University of Southern Queensland, Toowoomba, Queensland, Australia
  • 3Singapore General Hospital, Singapore, Singapore

The final, formatted version of the article will be published soon.

Thyroid nodule segmentation in ultrasound (US) images is a valuable yet challenging task that plays a critical role in diagnosing thyroid cancer. The difficulty arises from the absence of prior knowledge about the thyroid region, low contrast between anatomical structures, and speckle noise, all of which obscure nodule boundaries and introduce variability in nodule appearance across images. To address these challenges, we propose a transformer-based model for thyroid nodule segmentation. Unlike traditional convolutional neural networks (CNNs), transformers capture global context from the first layer, enabling a more comprehensive image representation, which is crucial for identifying subtle nodule boundaries. In this study, we first pre-train a Masked Autoencoder (MAE) to reconstruct masked patches, then fine-tune it on thyroid US data, and further explore a cross-attention mechanism to enhance information flow between encoder and decoder. Our experiments on the public AIMI, TN3K, and DDTI datasets show that MAE pre-training accelerates convergence. However, overall improvements are modest: the model achieves Dice Similarity Coefficient (DSC) scores of 0.63, 0.64, and 0.65 on AIMI, TN3K, and DDTI, respectively, highlighting limitations under small-sample conditions. Furthermore, adding cross-attention did not yield consistent gains, suggesting that data volume and diversity may be more critical than additional architectural complexity.
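For readers unfamiliar with the two quantities the abstract relies on, the following is a minimal illustrative sketch, not the authors' implementation. It shows how a random patch mask for MAE-style pre-training might be drawn and how the Dice Similarity Coefficient between two binary masks is computed as DSC = 2|A ∩ B| / (|A| + |B|). The function names, the 0.75 mask ratio, and the 196-patch grid are assumptions chosen for illustration; the abstract does not state the values used in the study.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Return a list of booleans, True where a patch is hidden.

    mask_ratio=0.75 is an assumed default for illustration only;
    the abstract does not report the ratio used in the study.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    return [i in masked_ids for i in range(num_patches)]


def dice_coefficient(pred, target):
    """Dice Similarity Coefficient between two binary masks.

    pred and target are 2D lists of 0/1 values with the same shape.
    Returns 1.0 when both masks are empty (a common convention,
    assumed here rather than taken from the article).
    """
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    # A 14 x 14 grid of patches (196 in total) is an assumed example size.
    mask = random_patch_mask(196, mask_ratio=0.75, seed=0)
    print("masked patches:", sum(mask), "of", len(mask))

    predicted = [[1, 1], [0, 0]]
    reference = [[1, 0], [0, 0]]
    print("DSC:", round(dice_coefficient(predicted, reference), 3))
```

In the toy example the prediction covers two pixels and the reference one, with one pixel shared, so the DSC is 2 × 1 / (2 + 1) ≈ 0.667. This is roughly the range of the scores reported above.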

Keywords: thyroid nodule segmentation, ultrasound imaging, transformer-based network, masked autoencoder, self-supervised learning

Received: 26 Apr 2025; Accepted: 07 Jul 2025.

Copyright: © 2025 Xiang, Acharya, Le, Tan and Chng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Yi Xiang, Singapore Health Services Pte Ltd, Singapore, 168753, Singapore

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.