ORIGINAL RESEARCH article

Front. Oncol., 12 January 2026

Sec. Cancer Imaging and Image-directed Interventions

Volume 15 - 2025 | https://doi.org/10.3389/fonc.2025.1676360

Deep learning for multitask prediction on thyroid nodule frozen sections

Chunyang Wang1†, Juan Hu2†, Xiang Li3†, Yufeng Cai1, Shixiang Wang1, Xusheng Wu4, Haixia Liu1, Zhongliang Hu5,6*, Dehua Hu1*
  • 1School of Life Sciences, Central South University, Changsha, China
  • 2Eight-Year Medical Program, Xiangya School of Medicine, Central South University, Changsha, China
  • 3The First People’s Hospital of Xiangtan City, Xiangtan, Hunan, China
  • 4Shenzhen Health Development Research and Data Management Center, Shenzhen, China
  • 5Department of Pathology, Xiangya Hospital, Central South University, Changsha, China
  • 6Department of Pathology, Xiangya School of Basic Medical Sciences, Central South University, Changsha, China

Background: For thyroid nodules that remain indeterminate before surgery, surgical planning often depends on intraoperative frozen sections, but misdiagnosis can occur due to low-quality frozen sections, limited diagnostic time, and a shortage of pathologists. Deep learning models and conventional radiomics have shown potential in improving diagnostic accuracy in thyroid nodules, yet their integration remains under-explored. This study aimed to develop deep-learning-based models to assist in the intraoperative pathological diagnosis of thyroid nodules by classifying benign/malignant cases, predicting BRAFV600E gene mutation, and identifying lymph node metastasis.

Methods: A total of 436 Whole-Slide Images (WSIs) of thyroid frozen sections were analyzed using deep learning techniques. The analysis included image preprocessing, feature extraction, and classifier training. Patch-to-WSI feature aggregation was done via Patch Likelihood Histogram (PLH) and Bag of Words (BoW) methods.

Results: On the test set, the InceptionV3 model performed best in benign/malignant classification with an AUC of 0.998 and accuracy of 0.988, where weakly supervised strategies surpassed supervised ones. For BRAFV600E gene mutation prediction, the ResNet50 model achieved a patch-level AUC of 0.831 and a WSI-level accuracy of 94.4% under the extended strategy. A ViT-based model for lymph node metastasis prediction obtained an AUC of 0.671 and accuracy of 76%.

Conclusions: The study indicates that deep learning models can effectively classify benign/malignant thyroid frozen sections, predict BRAFV600E gene mutations, and predict lymph node metastasis status. It also emphasizes the effectiveness of weakly supervised strategies in thyroid lesion frozen sections, which could lessen reliance on pathologists’ annotations.

1 Introduction

The overall prevalence of thyroid nodules in the general population is 24.8% (1), and thyroid cancer is the most common malignant lesion among them (2). It often presents with symptoms such as dysphagia, dyspnea, and local pain and is prone to distant metastasis in later stages, severely affecting patients’ quality of life. The incidence of thyroid cancer shows a complex global trend, with increases observed in some regions and declines in others (3, 4). Incidence is generally higher in females than in males (5), and significant differences exist across countries and regions (5, 6).

With the continuous development of scanning, digital, and computer technologies, high-quality digital pathology images, especially Whole-Slide Image (WSI) data, have greatly advanced the field of pathology, and research on digital pathological images has garnered widespread attention (7). Advances in artificial intelligence (AI) have made AI-assisted pathological diagnosis possible. Deep learning (DL) methods have been widely applied in the analysis of pathological images, such as classifying, segmenting, and detecting basic elements (e.g., cells, glands), and identifying morphology-related prognostic features (8), supporting objective diagnosis and prognosis prediction in digital pathological images. However, most AI pathology studies are based on paraffin sections, and there is a significant gap in research on frozen sections. Among specific tumor types, breast cancer (9, 10) was one of the earliest fields to develop AI pathology, while other well-studied tumors include lung cancer (11, 12), prostate cancer (13, 14), gastrointestinal tumors (15, 16), and neurological tumors (17, 18), with relatively few studies on thyroid cancer.

The most common application of AI-assisted diagnosis is determining whether a tumor is benign or malignant, which is a fundamental requirement for AI-assisted pathology. AI-assisted diagnosis related to thyroid cancer also includes tasks such as predicting BRAFV600E gene mutation and predicting lymph node metastasis. The BRAF gene, formally known as V-Raf murine sarcoma viral oncogene homolog B, is involved in regulating cell proliferation, differentiation, and apoptosis and was first described in 1983 (19). The V600E mutation is the most common type of BRAF mutation and is the most frequently occurring mutation in thyroid cancer patients. Studies have demonstrated that the BRAFV600E mutation has diagnostic value in improving malignancy risk assessment and treatment selection for patients with cytologically “indeterminate” thyroid nodules (20). Lymph node metastasis (LNM) is a common clinical manifestation of thyroid cancer and is also an important indicator for evaluating patient prognosis and treatment plans. Preoperative detection of LNM in thyroid cancer primarily relies on neck ultrasound and neck computed tomography (CT) scans. However, the accuracy of ultrasound and neck CT in detecting cervical LNM is limited, with 60-70% of central LNM being missed by neck ultrasound or CT (21).

Therefore, we analyzed WSIs of thyroid frozen sections using deep learning methods to explore auxiliary intraoperative pathological diagnosis. Our goal was to make a preliminary determination of tumor benignity/malignancy, BRAFV600E gene mutation status, and LNM status to support intraoperative pathological diagnosis.

2 Materials and methods

2.1 Dataset

We used WSIs of thyroid frozen sections from Xiangya Hospital, affiliated with Central South University. Intraoperative thyroid lesion tissues were collected from patients and processed into frozen sections through a series of procedures including freezing, sectioning, and staining. After preparation, the pathological sections were scanned using a Pannoramic Scan scanner from 3D Histech (Hungary) at 40× magnification, corresponding to a pixel size of 0.25 μm × 0.25 μm. The resulting digital pathology images were stored in mrxs format.

In this study, sample labeling was based on postoperative paraffin histopathological diagnosis, while the goal of model training was to serve intraoperative diagnosis. Slides diagnosed as normal thyroid tissue, nodular hyperplasia, nodular goiter, subacute thyroiditis, Hashimoto’s thyroiditis, or thyroid adenoma were labeled as benign, while those diagnosed as papillary thyroid carcinoma, medullary carcinoma, poorly differentiated thyroid carcinoma, etc., were labeled as malignant. We present the clinical diagnosis and benign/malignant classification for all samples (Supplementary Table S1). Follicular neoplasms were not included in the study because they cannot be definitively diagnosed by intraoperative frozen section. WSI images were annotated by pathologists with extensive experience using QuPath software to mark Regions of Interest (ROI), and reviewed by another pathologist.

The dataset included 335 malignant samples and 101 benign samples with ROI annotations. Some WSIs also had BRAFV600E gene mutation detection results (301 mutated samples, 29 non-mutated samples) and information on lymph node metastasis (185 metastatic samples, 147 non-metastatic samples). Although the number of slides is limited, deep learning in this study is performed at the patch level: a single WSI can be divided into thousands of patches, which provides enough training examples for patch-level models. Therefore, these WSIs and their annotations were used to construct models for the preliminary classification of thyroid sections as benign or malignant, prediction of BRAFV600E gene mutation in thyroid cancer, and prediction of lymph node metastasis.

2.2 Image preprocessing

A single WSI can contain up to hundreds of billions of pixels, making it impractical to feed complete WSIs directly into deep learning models for training. The WSIs therefore have to be cropped into patches. Using a sliding window at 20× magnification, the original images were sequentially cropped into 512×512 pixel patches, and the proportion of each patch covered by the ROI was recorded. To filter tissue patches, we set color thresholds to remove blank patches and used a trained ResNet50 model to distinguish between tissue and non-tissue patches. In addition, to reduce color differences between WSIs caused by instruments and operators, the Vahadane method was used for color standardization.
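The tiling and blank-patch filtering described above can be sketched as follows. This is a minimal illustration on an in-memory RGB grid, not the pipeline used in the study: the function names, the near-white level of 220, and the 90% blank fraction are assumptions chosen for the example, and the ResNet50 tissue filter and Vahadane normalization are not reproduced.

```python
from typing import Iterator, List, Tuple

Pixel = Tuple[int, int, int]          # (R, G, B)
Image = List[List[Pixel]]             # rows of pixels


def iter_patches(image: Image, size: int = 512, stride: int = 512
                 ) -> Iterator[Tuple[int, int, Image]]:
    """Yield (top, left, patch) for every full window that fits in the image."""
    height, width = len(image), len(image[0])
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            patch = [row[left:left + size] for row in image[top:top + size]]
            yield top, left, patch


def is_blank(patch: Image, white_level: int = 220,
             max_white_fraction: float = 0.9) -> bool:
    """Assumed rule: a patch is blank when most pixels are near-white."""
    pixels = [px for row in patch for px in row]
    white = sum(1 for r, g, b in pixels if min(r, g, b) >= white_level)
    return white / len(pixels) > max_white_fraction


def tissue_patches(image: Image, size: int = 512, stride: int = 512
                   ) -> List[Tuple[int, int]]:
    """Return the (top, left) corners of patches that pass the color filter."""
    return [(top, left)
            for top, left, patch in iter_patches(image, size, stride)
            if not is_blank(patch)]
```

With a non-overlapping stride equal to the patch size, every pixel belongs to at most one patch; a smaller stride would produce overlapping patches at the cost of more inference work.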

In the classification of thyroid benignity/malignancy, patch labels were assigned using both supervised and weakly supervised strategies. Under the supervised strategy, a patch was labeled as malignant (label 1) if more than 80% of its area fell within the ROI annotation; patches that overlapped the ROI annotation but did not exceed 80% were discarded; and the remaining patches, which did not include any ROI annotation area, were labeled as benign (label 0). Under the weakly supervised strategy, each patch inherited the label of its WSI, and ROI annotation information was not used.
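The two labeling rules can be written as a small function. This is a sketch under the reading given above; `patch_label`, its argument names, and the strategy strings are hypothetical and introduced only for illustration.

```python
from typing import Optional


def patch_label(roi_fraction: float, wsi_label: int, strategy: str,
                threshold: float = 0.8) -> Optional[int]:
    """Return 1 (malignant), 0 (benign), or None if the patch is discarded.

    roi_fraction: share of the patch area covered by the ROI annotation.
    wsi_label:    slide-level label (1 malignant, 0 benign).
    strategy:     "supervised" or "weak" (names assumed for this sketch).
    """
    if strategy == "weak":
        # Weak supervision: every patch inherits its slide's label.
        return wsi_label
    if strategy != "supervised":
        raise ValueError(f"unknown strategy: {strategy!r}")
    if roi_fraction > threshold:
        return 1          # mostly inside the annotated region
    if roi_fraction > 0.0:
        return None       # partial overlap: ambiguous, so discarded
    return 0              # no overlap with the annotated region
```

Discarding partially overlapping patches trades training data for cleaner labels, whereas the weak rule keeps every patch but accepts that normal tissue inside a malignant slide is labeled malignant.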

In BRAFV600E gene mutation classification, convolutional neural network (CNN) models were trained using both conventional and extended strategies. Under the conventional strategy, only patches from the delineated ROI areas were used as input. In the extended strategy, all patches cropped from the entire WSI were input into the deep learning model for training, and the features learned by the best-performing deep learning model were extracted for WSI-level BRAFV600E gene mutation prediction.

In the classification of lymph node metastasis, all patches extracted from the entire WSI were used as input. Each patch was labeled as metastatic or non-metastatic based on its source WSI and preserved its affiliation with the WSI.

2.3 Patch training

In the classification of thyroid benignity/malignancy and BRAFV600E gene mutation, three CNN models—ResNet50, InceptionV3, and VGG16—were selected as patch-level classifiers. The number of fully connected layer nodes before the Softmax activation layer of these three CNNs was modified to 2 according to the requirements of the patch classification tasks. Additionally, transfer learning based on ImageNet dataset pre-trained weights was used to initialize model parameters. The experimental environment for this study consisted of an AMD Ryzen 7 3700X 8-Core Processor (3.59 GHz), 32.0 GB RAM, and an NVIDIA GeForce RTX 2080 Ti GPU with 11.0 GB VRAM. For the same classification tasks, the same parameter settings were applied. The optimizer used was Stochastic Gradient Descent, the loss function was Cross Entropy, the learning rate of the three deep CNNs was set to 0.001, the batch size was set to 32, and all models were trained for 20 epochs.

In the classification of lymph node metastasis, two methods were used. First, InceptionV3 was selected as the patch-level classifier and transfer learning based on the ImageNet dataset was applied. Second, the Vision Transformer (ViT) network was used as a feature extractor to better represent global dependencies in pathological images. ViT extracted a 768-dimensional feature representation for each patch, and all patches of a WSI were used as input to predict the category of that WSI under a weakly supervised strategy.

2.4 Feature fusion and machine learning

The Patch Likelihood Histogram (PLH) pipeline and Bag of Words (BoW) pipeline were used to aggregate features from multiple patches into a single WSI-level representation. In PLH, the patch likelihoods were discretized, and the resulting histogram of likelihoods was used to represent the WSI. The BoW pipeline was based on vocabulary techniques: each patch was mapped to a Term Frequency–Inverse Document Frequency (TF-IDF) floating-point value, and the TF-IDF feature vector was calculated to represent the WSI. The features obtained from the two methods were then fused, merging the initial patch-level prediction results into WSI-level features.
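A minimal sketch of this aggregation is given below. The text does not specify how patches are mapped to vocabulary terms, so this example assumes that each discretized likelihood bin acts as a "word" and each WSI as a "document", and it uses the plain tf × log(N/df) weighting; the function names and the default of 10 bins are likewise assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def to_bin(prob: float, n_bins: int = 10) -> int:
    """Discretize a likelihood in [0, 1] into one of n_bins equal-width bins."""
    return min(int(prob * n_bins), n_bins - 1)


def plh_features(probs: List[float], n_bins: int = 10) -> List[float]:
    """Patch Likelihood Histogram: fraction of a WSI's patches in each bin."""
    counts = Counter(to_bin(p, n_bins) for p in probs)
    return [counts[b] / len(probs) for b in range(n_bins)]


def bow_tfidf_features(wsi_probs: Dict[str, List[float]], n_bins: int = 10
                       ) -> Dict[str, List[float]]:
    """TF-IDF per bin, treating bins as words and WSIs as documents (assumed)."""
    docs = {name: Counter(to_bin(p, n_bins) for p in probs)
            for name, probs in wsi_probs.items()}
    n_docs = len(docs)
    # Document frequency: in how many WSIs does each bin occur at least once?
    doc_freq = Counter(word for counts in docs.values() for word in counts)
    features = {}
    for name, counts in docs.items():
        total = sum(counts.values())
        features[name] = [
            (counts[w] / total) * math.log(n_docs / doc_freq[w])
            if counts[w] else 0.0
            for w in range(n_bins)
        ]
    return features


def fused_features(wsi_probs: Dict[str, List[float]], n_bins: int = 10
                   ) -> Dict[str, List[float]]:
    """Concatenate PLH and TF-IDF vectors into one WSI-level feature vector."""
    tfidf = bow_tfidf_features(wsi_probs, n_bins)
    return {name: plh_features(probs, n_bins) + tfidf[name]
            for name, probs in wsi_probs.items()}
```

Both representations ignore where a patch sits on the slide, so a WSI with any number of patches is reduced to a vector of fixed length (here 2 × `n_bins`).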

The number of features output from the CNN model was excessive, so features with high contribution weights to the final WSI-level prediction were selected. First, features were normalized, and the Pearson correlation coefficient was used to statistically analyze the correlation between features, removing features with a correlation coefficient ≥0.9. Then, Least Absolute Shrinkage and Selection Operator (LASSO) regression was used for feature dimensionality reduction, selecting important features with non-zero coefficients under the optimal parameters.
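The correlation filter in the first step can be sketched as a greedy pass over the feature columns. The text does not say which member of a correlated pair is removed or whether the sign of the coefficient matters; this example assumes the earlier column is kept and the absolute value is compared with 0.9. The LASSO step is not reproduced here.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return cov / (norm_x * norm_y)


def drop_correlated(columns: List[Sequence[float]],
                    threshold: float = 0.9) -> List[int]:
    """Return indices of kept columns.

    A column is dropped when its absolute correlation with any column
    already kept reaches the threshold (assumed tie-breaking rule).
    """
    kept: List[int] = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept
```

Because the pass is greedy, the result depends on column order; sorting columns by a relevance score beforehand would be one way to make the choice less arbitrary.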

The features screened in the previous step were used as input features for three models: Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF). Parameter search was used to obtain the best parameters, and the machine learning models were used to predict the test set WSI images.

3 Results

3.1 Study framework

In constructing predictive models for tumor benignity/malignancy, BRAFV600E gene mutation, and lymph node metastasis, we employed a patch-level to WSI-level research pipeline (Figure 1A). First, entire WSIs were divided into patches of 512×512 pixels, and each patch was processed through CNN models to generate predictions. Subsequently, patch-level predictions were aggregated into features using the BoW and PLH pipelines. Redundant, highly correlated features were removed, and informative features were selected through LASSO regression. Finally, multiple machine learning models were trained using the selected features as input to perform WSI-level predictions. However, this framework underperformed in predicting lymph node metastasis due to insufficient capture of global features. To address this limitation, we adopted a ViT network as the feature extractor, inputting all patches of the WSI and predicting the class of the entire WSI in a weakly supervised manner (Figure 1B).

Figure 1
Diagram illustrating two processes for WSI analysis. Part A shows a patch-level prediction using a CNN model, transitioning to WSI-level prediction via an ML model. Feature selection involves BoW and PLH pipelines, leading to patch aggregation. Part B depicts a ViT Transformer handling input patches through encoders and transformer layers to produce 768 features, followed by MLP head output.

Figure 1. Schematic of the study framework. (A) Prediction framework based on CNN-machine learning models. (B) Prediction framework based on ViT models. CNN, Convolutional Neural Network; BoW, Bag of Words; PLH, Patch Likelihood Histogram; ML, Machine learning; ViT, Vision Transformer; LNM, Lymph Node Metastasis.

3.2 Benign/malignant prediction of thyroid WSI

The performance metrics of the three patch training models under supervised and weakly supervised strategies were compared (Table 1), and ROC curves for the different models under the two supervision modes were plotted (Figure 2). Overall, the three patch training models under the supervised strategy achieved good results, with AUC values on the test set reaching around 0.9. In contrast, under the same training parameters, the three models under the weakly supervised strategy had AUC values of around 0.7 on the test set, with other metrics also significantly lower than those under the supervised strategy. This may be because, under the weakly supervised strategy, every patch inherits the WSI label, which introduces a large amount of label noise that degrades model performance. Under the supervised strategy, the InceptionV3 model had the highest AUC, accuracy, and F1 score among the three CNN models. Under the weakly supervised strategy, the InceptionV3 model had higher AUC and accuracy than the other two CNN models, with an F1 score only slightly lower than that of the ResNet50 model, so it was again considered the best performer. Therefore, we used the features extracted from the trained supervised and weakly supervised InceptionV3 models.

Table 1. Performance of different models in benign/malignant classification.

Figure 2
Six ROC curve plots comparing model performance on training and validation datasets. Models shown are ResNet50 (A, B), Inception V3 (C, D), and VGG16 (E, F). Each plot displays sensitivity vs 1-specificity, with solid pink lines for training and dotted blue for validation. AUC values range: ResNet50 (0.961 train, 0.892 val in A; 0.849 train, 0.694 val in B), Inception V3 (0.956 train, 0.911 val in C; 0.864 train, 0.753 val in D), VGG16 (0.935 train, 0.910 val in E; 0.865 train, 0.674 val in F).

Figure 2. ROC Curves of Different Models in Benign/Malignant Classification. (A, B) ROC curves for the ResNet50 model under supervised (A) and weakly supervised (B) strategies. (C, D) ROC curves for the InceptionV3 model under supervised (C) and weakly supervised (D) strategies. (E, F) ROC curves for the VGG16 model under supervised (E) and weakly supervised (F) strategies.

We performed feature fusion using the PLH and BoW pipelines on features extracted from the supervised and weakly supervised InceptionV3 models to obtain features for WSI pathological images. After applying t-distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction, we found distinct differences between benign/malignant samples, with the weakly supervised strategy showing greater distinction (Figures 3A, B). Subsequently, we used a LASSO regression model for feature selection, determining the optimal λ value that minimized the Mean Squared Error (MSE). Features with non-zero coefficients at this λ value were selected as inputs for machine learning (Figures 3C–F). The supervised strategy retained 15 features with non-zero coefficients, while the weakly supervised strategy retained 13.

Figure 3
Panel A and B show t-SNE plots with two clusters, colored red and blue, labeled as 0 and 1. Panel C and E present line charts showing coefficients against Lambda values, highlighting specific points with vertical lines. Panels D and F depict line charts with error bars showing MSE against Lambda, with vertical lines marking key values.

Figure 3. Feature fusion from patch to WSI and LASSO feature selection. (A) Sample distribution after t-SNE dimensionality reduction under the supervised strategy. “0” denotes benign samples, and “1” denotes malignant samples. (B) Sample distribution after t-SNE dimensionality reduction under the weakly supervised strategy. “0” denotes benign samples, and “1” denotes malignant samples. (C, D) LASSO regression process for samples under the supervised strategy. “Coefficients” represent the magnitude of the feature coefficients, and “MSE” refers to Mean Squared Error. (E, F) LASSO regression process for samples under the weakly supervised strategy. “Coefficients” represent the magnitude of the feature coefficients, and “MSE” refers to Mean Squared Error.

We then constructed three models—LR, SVM, and RF—for predicting benign/malignant WSIs, evaluating their performance using accuracy and AUC values (Figures 4A, B). The RF model under the supervised strategy achieved the highest AUC of 0.985, while the LR model under the weakly supervised strategy performed best with an AUC of 0.998. Both strategies showed significant improvements from patch-level to WSI-level AUC, indicating that feature aggregation via BoW and PLH pipelines enhanced the WSI-level classification model. Notably, although the weakly supervised strategy had a lower patch-level AUC, it outperformed the supervised strategy at the WSI-level, demonstrating a more pronounced improvement in performance.
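Since the comparison above rests on AUC, a short sketch of how an AUC can be computed from labels and scores may help. It uses the rank interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, with ties counted as one half); the function name is an assumption, and this is not necessarily the implementation used in the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The same function applies at either level: with patch labels and patch probabilities it gives a patch-level AUC, and with slide labels and classifier outputs it gives a WSI-level AUC, which is why the two values can differ substantially for the same underlying CNN.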

Figure 4
Graph panel A features ROC curves comparing Logistic Regression (AUC 0.977), SVM (AUC 0.971), and Random Forest (AUC 0.985). Panel B shows improved results for the same models: Logistic Regression (AUC 0.998), SVM (AUC 0.989), and Random Forest (AUC 0.957). Panel C displays two rows of histopathology images. The first column in each row shows standard images, while the second column applies an activation map overlay.

Figure 4. WSI-level prediction results and regional heatmaps. (A) ROC curve for WSI-level supervised strategy. (B) ROC curve for WSI-level weakly supervised strategy. (C) Examples of Heatmaps for Four Regions of Interest. On the left are cropped patches with a size of 512×512 pixels, and on the right are the corresponding heatmaps, where red regions indicate areas attended by the model. LR, Logistic Regression; SVM, Support Vector Machine; AUC, Area Under the Curve.

To enhance model interpretability, we applied the Gradient-Weighted Class Activation Mapping (Grad-CAM) algorithm to visualize the activation of the last convolutional layer in the InceptionV3 network. The red regions on the heatmap indicate the areas the model focuses on within the image (Figure 4C). These regions primarily correspond to enlarged cell nuclei and areas of nuclear division, which are key features of thyroid cancer. Additionally, we aggregated patch-level prediction probabilities to generate category localization maps for the WSI, providing a clear visualization of regions with higher malignancy in thyroid frozen sections (Figure 5). Under both strategies, malignant regions were effectively highlighted, and benign samples were accurately predicted. These results demonstrate that classification models using only WSI-level labels can achieve strong performance, identify malignant tumor regions, and validate the feasibility and effectiveness of the weakly supervised strategy.
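The aggregation of patch-level probabilities into a slide-level localization map can be sketched as a simple grid fill. This assumes non-overlapping patches identified by their top-left pixel coordinates; the function name and data layout are hypothetical, and rendering the grid as a color map is left out.

```python
from typing import Dict, List, Optional, Tuple


def probability_map(patch_probs: Dict[Tuple[int, int], float],
                    patch_size: int = 512) -> List[List[Optional[float]]]:
    """Arrange patch probabilities on a grid with one cell per patch.

    patch_probs maps (top, left) pixel coordinates to the predicted
    malignancy probability. Cells without a prediction (for example,
    filtered background patches) stay None.
    """
    n_rows = max(top for top, _ in patch_probs) // patch_size + 1
    n_cols = max(left for _, left in patch_probs) // patch_size + 1
    grid: List[List[Optional[float]]] = [[None] * n_cols for _ in range(n_rows)]
    for (top, left), prob in patch_probs.items():
        grid[top // patch_size][left // patch_size] = prob
    return grid
```

Keeping background cells as `None` rather than 0 lets a plotting step distinguish "no tissue" from "tissue predicted benign".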

Figure 5
Two panels, A and B, each with four images comparing histological sections. Panel A shows two original tissue histology images and their corresponding processed images with color maps indicating intensity levels. Panel B presents a similar set of images with different structures. Bright yellow and green areas highlight different intensity levels.

Figure 5. Prediction probability maps of patch classification models. (A) Two malignant samples (Left) with prediction probability maps under supervised (Middle) and weakly supervised (Right) strategies. (B) Two benign samples (Left) with prediction probability maps under supervised (Middle) and weakly supervised (Right) strategies.

3.3 Prediction of BRAFV600E gene mutation in thyroid cancer

The BRAFV600E mutation status for all samples was determined using real-time PCR-based mutation detection. Among the 436 collected thyroid frozen digital pathology images, 330 contained BRAFV600E gene mutation information, including 301 mutated cases (label set to “1”) and 29 non-mutated cases (label set to “0”). Due to the imbalance between these two classes, 31 mutated WSIs were randomly selected from the mutated group to combine with the 29 non-mutated WSIs, forming a relatively balanced dataset. The 60 WSIs were then split into training and test sets at a 7:3 ratio, comprising 42 WSIs for training and 18 for testing.
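The undersampling and split arithmetic can be checked with a short sketch. The text does not state the random seed or whether the split was stratified by class, so this example assumes a plain random shuffle; the function name and slide identifiers are placeholders.

```python
import random
from typing import List, Tuple

Case = Tuple[int, int]  # (slide id, label)


def balanced_split(majority_ids: List[int], minority_ids: List[int],
                   n_majority: int, train_fraction: float = 0.7,
                   seed: int = 0) -> Tuple[List[Case], List[Case]]:
    """Undersample the majority class, then split into train and test sets."""
    rng = random.Random(seed)                      # seed is an assumption
    sampled = rng.sample(majority_ids, n_majority)
    cases = [(i, 1) for i in sampled] + [(i, 0) for i in minority_ids]
    rng.shuffle(cases)                             # unstratified (assumed)
    n_train = round(len(cases) * train_fraction)
    return cases[:n_train], cases[n_train:]


# Placeholder ids: 301 mutated slides and 29 non-mutated slides.
train, test = balanced_split(list(range(301)), list(range(1000, 1029)), 31)
print(len(train), len(test))
```

With 31 + 29 = 60 slides and a 7:3 ratio, the split yields 42 training and 18 test slides, matching the counts reported above. An unstratified shuffle can leave the 18-slide test set with an uneven class mix, which matters when each slide is worth about 5.6 percentage points of accuracy.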

The performance metrics of VGG16, InceptionV3, and ResNet50 models under conventional and extended strategies were compared (Table 2), and ROC curves were plotted (Figure 6). Overall, the performance of the three models under the extended strategy was superior to that under the conventional strategy. Under the conventional strategy, the InceptionV3 model performed the best (AUC = 0.794), while under the extended strategy, the ResNet50 model had the best performance (AUC = 0.831). Therefore, the InceptionV3 model was used as the patch-level classifier under the conventional strategy, while the extended strategy selected the highest-performing ResNet50 model as the classifier for feature extraction to predict WSI classification.

Table 2. Performance of different models in BRAFV600E gene mutation prediction.

Figure 6
Six ROC curves compare the performance of three models: resnet50 (A, B), inception v3 (C, D), and vgg16 (E, F). Solid lines represent training AUCs with values ranging from 0.984 to 1.000, while dotted lines represent validation AUCs ranging from 0.754 to 0.831. Sensitivity is plotted against 1 - Specificity.

Figure 6. ROC curves of different models in BRAFV600E gene mutation prediction. (A, B) ROC curves for the ResNet50 model under conventional (A) and extended (B) strategies. (C, D) ROC curves for the InceptionV3 model under conventional (C) and extended (D) strategies. (E, F) ROC curves for the VGG16 model under conventional (E) and extended (F) strategies.

Subsequently, we combined features extracted from the conventional InceptionV3 and extended ResNet50 models using both histogram and TF-IDF features. The fused features underwent LASSO regression to eliminate low-correlation or redundant features (Figures 7A–D). After LASSO selection, the conventional strategy retained 12 features with non-zero coefficients, while the extended strategy retained 6 features. In the test set of 18 WSIs, the conventional strategy misclassified two mutation-negative WSIs as mutated, whereas the extended strategy had only one false-positive prediction of BRAFV600E mutation (Figures 7E, F). This indicates higher prediction accuracy under the extended strategy.

Figure 7
Graphical representation featuring six panels: A and C show coefficient trajectories over varying lambda values; B and D depict MSE trends with confidence intervals. Panels E and F present confusion matrices with correct and incorrect classifications highlighted in varying intensities of blue, labeled with numbers indicating counts.

Figure 7. LASSO feature selection and WSI-level prediction results. (A, B) LASSO regression process for samples under the conventional strategy. “Coefficients” represent the magnitude of the feature coefficients, and “MSE” refers to Mean Squared Error. (C, D) LASSO regression process for samples under the extended strategy. “Coefficients” represent the magnitude of the feature coefficients, and “MSE” refers to Mean Squared Error. (E) WSI-level prediction results under the conventional strategy. On the left, “1” and “0” represent the true BRAFV600E gene mutation and non-mutation, respectively. (F) WSI-level prediction results under the extended strategy. On the left, “1” and “0” represent the true BRAFV600E gene mutation and non-mutation, respectively.

3.4 Prediction of lymph node metastasis in thyroid cancer

The InceptionV3 model was used as the patch classifier, and the ROC curve was plotted (Figure 8A). The model’s performance was not ideal, with an AUC value of only 0.561 on the test set. Therefore, various machine learning models were used for WSI-level prediction (Figure 8B). Although model performance improved slightly, the best-performing SVM among numerous machine learning algorithms only achieved an AUC value of 0.618.

Figure 8
Panel A shows a ROC curve for the Inception v3 model with a train AUC of 0.883 and validation AUC of 0.561. Panel B compares AUCs of various models, with SVM reaching the highest AUC of 0.618. Panel C depicts another ROC curve with a train AUC of 0.846 and test AUC of 0.671. Panel D displays a confusion matrix, showing values of 28 true negatives, 19 false positives, 5 false negatives, and 48 true positives.

Figure 8. Prediction results for lymph node metastasis in thyroid cancer. (A) ROC curve at the patch-level. (B) ROC curve at the WSI-level. (C) ROC curve at the WSI-level based on the ViT model. (D) Confusion matrix for WSI prediction results. On the left, “1” and “0” represent the true presence and absence of lymph node metastasis, respectively.

To achieve better performance in predicting lymph node metastasis in thyroid cancer, the ViT network was used as a feature extractor to capture global features, and all patches of the WSI were input to obtain local features. Finally, the ROC curve of the ViT-based prediction model was plotted (Figure 8C), with an AUC value of 0.671 on the test set, higher than the AUC value of the CNN model at the WSI-level. Among 100 samples, 24 were incorrectly predicted, with an accuracy rate of 76%, also higher than the CNN model (Figure 8D).
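The reported accuracy can be reproduced from the confusion-matrix counts given in the description of Figure 8D (28 true negatives, 19 false positives, 5 false negatives, 48 true positives). The sketch below is only a worked check; the function name is an assumption, and the sensitivity and specificity it returns are derived here rather than reported in the text.

```python
from typing import Dict


def confusion_metrics(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Basic metrics from the four cells of a binary confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # recall for metastatic cases
        "specificity": tn / (tn + fp),   # recall for non-metastatic cases
        "misclassified": fp + fn,
    }


metrics = confusion_metrics(tn=28, fp=19, fn=5, tp=48)
print(metrics)
```

These counts give 24 misclassified samples out of 100 and an accuracy of 0.76, in line with the text. They also imply a sensitivity of about 0.91 and a specificity of about 0.60, so most errors are non-metastatic cases predicted as metastatic.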

3.5 Training and validation with TCGA data

We applied the same pipeline to train and validate on thyroid cancer WSI images from the TCGA (The Cancer Genome Atlas) database, comprising 97 normal samples and 200 tumor samples, with a training-to-validation set ratio of 1:1. The deep learning model employed was InceptionV3, and the machine learning models selected were LR, SVM, and RF. The results demonstrated that the RF model achieved an accuracy of 93.3% and an AUC of 0.988, indicating that deep learning methods can effectively classify benign and malignant thyroid WSI samples (Supplementary Figure S1).

4 Discussion

Thyroid lesions are among the most common specimens requiring intraoperative consultation in current clinical practice. Accurately assessing, classifying, and evaluating the risk of malignancy is the most critical issue in intraoperative diagnosis. However, the sensitivity of frozen section diagnosis for thyroid nodules is only about 75% (22). Faced with this challenge, AI-assisted diagnosis may be a potential solution. With the development of network architectures and algorithms and the accumulation of medical data, AI has been applied in the diagnosis of thyroid cancer. Since ultrasound examination is the preferred diagnostic tool, AI-assisted diagnosis based on ultrasound images is the most common (for example, the ThyGPT model is a multimodal GPT model for thyroid cancer-assisted diagnosis based on ultrasound images (23)), followed by studies based on cytological pathological images. Studies on AI-assisted intraoperative frozen section diagnosis are relatively few (24–26), and these studies have all annotated the cancer regions in frozen sections.

We explored whether deep learning can diagnose the benignity/malignancy of thyroid lesions from intraoperative frozen sections with rough cancer region annotation or without cancer region annotation, using only WSI-level labels. At the patch prediction stage, three image block classifiers—VGG16, InceptionV3, and ResNet50—were trained, and the performance of these three CNN networks was comparable. These results validated the effectiveness of selecting the InceptionV3 model for thyroid image block classification through transfer learning. Specifically, for patch-level prediction, models using a supervised learning strategy significantly outperformed those using a weakly supervised strategy, likely because normal regions within cancer samples are prone to misclassification as benign regions under weak supervision.

Currently, studies on benign/malignant classification of WSIs often integrate patch-level predictions into WSI-level predictions using rules (24), probability heat maps (25), color moments, or voting and probability averaging. We adopted another aggregation method, extracting histogram features and TF-IDF feature vectors from patch-level deep learning models and fusing them. We also used three machine learning models (LR, SVM, and RF), all of which achieved good performance, with little difference in results between the two strategies. AUC improved significantly from the patch level to the WSI level under both strategies, and the improvement was even more pronounced under the weakly supervised strategy. In our comparative experiments, the weakly supervised strategy reached a level similar to the supervised strategy in the WSI-level classifier. This may be because when malignant tumor regions occupy a very low proportion of a malignant sample, WSI-level features under a supervised strategy learn more of the benign region features within malignant tumors, while weakly supervised strategies reduce the possibility of misclassifying malignant samples as benign in such situations. This indicates that the performance of the whole-slide classifier under the weakly supervised strategy is not heavily dependent on the patch-level classifier, and the success of the weakly supervised strategy shows that good prediction results can be achieved without reliance on pathologists’ annotations.

Studies have shown that BRAFV600E gene mutation detection has high diagnostic value in cytologically indeterminate thyroid nodules and is related to the prognosis of thyroid cancer (27). However, recent studies have raised concerns about the inconsistency between immunohistochemistry and the gold-standard DNA analysis for BRAFV600E gene mutation detection, as the former depends on the tissue staining scheme (28, 29). Pathologists have traditionally identified morphological features associated with BRAFV600E gene mutation in thyroid cancer through histopathology and cytology (30, 31). However, the ability to translate these morphological findings into clinically reliable and reproducible BRAFV600E gene mutation predictions is limited: the highest reported inter-observer agreement was 0.79, with an accuracy of 83%, a specificity of 71%, and a positive predictive value of 78% (30).
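For readers less familiar with these figures, the sketch below shows how accuracy, specificity, and positive predictive value follow from confusion-matrix counts. It also computes Cohen's kappa as one common measure of inter-observer agreement; whether the 0.79 figure above is a kappa value is an assumption made here for illustration only. All counts and ratings are made up.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Accuracy, specificity, and positive predictive value from confusion counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n) for label in labels
    )
    return (observed - expected) / (1 - expected)


# Made-up counts: 40 true positives, 10 false positives, 40 true negatives, 10 false negatives.
metrics = diagnostic_metrics(tp=40, fp=10, tn=40, fn=10)

# Made-up ratings from two observers (1 = mutation predicted, 0 = not predicted).
kappa_same = cohen_kappa([1, 1, 0, 0], [1, 1, 0, 0])
kappa_chance = cohen_kappa([1, 1, 0, 0], [1, 0, 1, 0])
```

Kappa corrects raw agreement for the agreement expected by chance, so two observers who agree on half the cases with balanced labels score 0 rather than 0.5.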

Studies have shown that deep learning can extract visual features related to molecular changes from histological images, and can thereby potentially predict molecular changes from routine pathological sections (32). We therefore compared a conventional strategy, which trains on patches from the ROI only, with an extended strategy, which trains on all patches. At the patch prediction stage, all models under the extended strategy outperformed those under the conventional strategy, with the ResNet50 model achieving the best AUC of 0.831. This comparison indicates that information related to BRAFV600E gene mutation in thyroid cancer is not limited to the tumor region of the WSI; pathological features in and around the tumor are also associated with BRAFV600E mutation status.

Currently, deep learning models based on thyroid pathological images mainly focus on identifying metastatic areas in lymph node sections. Few studies have predicted LNM from pathological images of the primary tumor, and the performance of these models still needs improvement. We therefore explored the prediction of lymph node metastasis status from intraoperative frozen section images of primary thyroid cancer. Initially, we used the same deep learning-machine learning workflow for model training, but the highest WSI-level AUC was only 0.618. This performance is comparable to that of Wessels et al. (33) and Brinker et al. (34) but falls short of the results achieved by Liu Y et al. (AUC = 0.80 on an internal test set). This may be related to the small, single-institution sample in this study and to the fact that primary tumor pathological images contain limited information about lymph node metastasis. Future studies should aim to collect large-scale slide data from multiple centers and to incorporate metastatic regions in lymph node slides as well as other clinical information related to lymph node metastasis. Additionally, Liu Y et al. trained their model on patches cropped at different magnifications. High-magnification images better reflect cellular morphology and internal structures, while low-magnification images show the overall shape of cells, the distribution pattern of tumor cells, and their surrounding environment. Histological features at different magnification levels therefore appear to be important for LNM prediction.
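To make the multi-magnification idea concrete, the sketch below computes, for one tissue location, a set of concentric windows that would each be resampled to the same patch size. It is a hypothetical illustration only: the patch size of 256 pixels and the downsample factors 1, 4, and 16 are assumptions, not the settings used by Liu Y et al. or in this study.

```python
def multiscale_windows(center_x, center_y, patch_size=256, downsamples=(1, 4, 16)):
    """Describe concentric windows around one point, in full-resolution pixel coordinates.

    Each window is read at its own downsample factor and yields a
    patch_size x patch_size image, so higher factors cover more tissue
    at lower magnification.
    """
    windows = []
    for ds in downsamples:
        span = patch_size * ds  # full-resolution pixels covered by this window
        windows.append(
            {
                "downsample": ds,
                "x0": center_x - span // 2,  # top-left corner, full-resolution frame
                "y0": center_y - span // 2,
                "span_level0": span,
                "out_size": patch_size,
            }
        )
    return windows


windows = multiscale_windows(center_x=10000, center_y=8000)
```

The window with the smallest factor captures cellular detail, while the largest one captures the arrangement of tumor cells and their surroundings. Windows near the slide border can have negative corner coordinates and would need clamping or padding in practice.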

We then used ViT as a feature extractor for model training and aggregated the features with a Transformer-based framework. Compared with CNN models, the ViT feature extractor can better capture global information. Ultimately, the ViT-based prediction model achieved an AUC of 0.671 and an accuracy of 76% on the test set. Although this model did not match the performance achieved on the other classification tasks, it demonstrated that deep learning techniques can model the association between lymph node metastasis and related histological features of the primary tumor. It also suggests that the Transformer framework may have certain advantages in LNM prediction tasks, providing insights for future research.
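The core operation behind Transformer-style aggregation is attention: each patch feature receives a weight, and the slide-level representation is the weighted sum. The sketch below is a deliberately simplified stand-in (a single query vector, one head, no learned projections or layers) rather than the model trained in this study; the function names and toy features are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(patch_features, query):
    """Pool patch features into one slide vector with scaled dot-product attention."""
    dim = len(query)
    scores = [
        sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
        for feat in patch_features
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * feat[j] for w, feat in zip(weights, patch_features))
        for j in range(dim)
    ]
    return pooled, weights


# Two toy patch features; the query favours the first feature dimension.
toy_features = [[1.0, 0.0], [0.0, 1.0]]
pooled_uniform, weights_uniform = attention_pool(toy_features, query=[0.0, 0.0])
pooled_focus, weights_focus = attention_pool(toy_features, query=[10.0, 0.0])
```

Because every patch can contribute to the pooled vector regardless of its position on the slide, this kind of aggregation can use global context, which is the property attributed to ViT above.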

This study has several limitations. First, existing research on benign-malignant classification using the EfficientNetV2-b0 model enables more nuanced classification of thyroid cancer (35), while research on lymph node metastasis prediction using ThyNet-LNM has demonstrated robust predictive performance in multicenter validation (36). Both studies have distinct characteristics that complement this research, which introduces a weakly supervised strategy for benign-malignant classification and compares CNN and ViT models for lymph node metastasis prediction; together they provide a more comprehensive picture of AI-based diagnosis on thyroid cancer frozen sections. Second, ultrasound is the preferred diagnostic method for thyroid cancer, and frozen section analysis of thyroid tissue has limited clinical application, being used only in specific scenarios (e.g., when fine-needle aspiration yields a diagnosis of “suspicious for malignancy”). Third, acquiring WSIs relies on whole-slide digital scanning and computational processing, which may increase time costs and delay intraoperative diagnosis. In addition, the equipment investment is relatively high, which may hinder implementation in low-volume medical centers.

5 Conclusions

In conclusion, we demonstrate the potential of AI-assisted diagnosis to address the challenges of thyroid lesion assessment, particularly in intraoperative frozen section diagnosis. Deep learning models such as VGG16, InceptionV3, and ResNet50 showed promise in classifying thyroid lesions, with comparable performance. The weakly supervised strategy achieved results similar to the supervised strategy, reducing reliance on pathologists’ annotations. Our exploration of BRAFV600E gene mutation and lymph node metastasis prediction indicates that these models can extract relevant features from histological images. Although the LNM prediction model did not perform as well as the other classification tasks, it still showed the potential of deep learning to model the association between histological features and clinical outcomes. Future research should continue to refine these models, potentially incorporating multi-magnification images and leveraging Transformer-based frameworks to improve prediction accuracy and reliability. Increasing data volume and diversity could further enhance the robustness and generalizability of these AI models in thyroid cancer diagnosis and prognosis.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.

Author contributions

CW: Writing – original draft. JH: Writing – review & editing. XL: Writing – original draft. YC: Writing – review & editing. SW: Writing – review & editing. XW: Writing – review & editing. HL: Writing – review & editing. ZH: Writing – review & editing, Writing – original draft. DH: Writing – original draft, Writing – review & editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This work was supported by the Shenzhen Health Development Research and Data Management Center Project (Grant No. ShenJianYanShuGuan sz20240324), the Central South University Startup Funding and Hunan Provincial Natural Science Foundation of China (Grant No. 2025JJ40079).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fonc.2025.1676360/full#supplementary-material

Supplementary Figure 1 | Prediction results for the TCGA dataset. (A–C) Confusion matrices for the WSI prediction results of the RF, SVM, and LR machine learning models, respectively. The left side shows the true sample labels. (D) WSI-level ROC curves of the machine learning models. RF, Random Forest; SVM, Support Vector Machine; LR, Logistic Regression.

Supplementary Table 1 | Final clinical diagnosis and benign/malignant label assignment of samples.

References

1. Grani G, Sponziello M, Filetti S, and Durante C. Thyroid nodules: diagnosis and management. Nat Rev Endocrinol. (2024) 20:715–28. doi: 10.1038/s41574-024-01025-4

2. Welker MJ and Orlov D. Thyroid nodules. Am Fam Physician. (2003) 67:559–66.

3. Santelli E, Ascoli V, D'Ippoliti D, Michelozzi P, and Cozzi I. Decreasing trend in thyroid cancer incidence: a study from central Italy (2007-2019). Endocrine. (2024) 86:510–4. doi: 10.1007/s12020-024-03995-x

4. Ghalandari M, Sheikhzade S, Zardosht K, Sadeghi G, and Soodejani MT. Spatial and temporal analysis of thyroid cancer incidence in Guilan Province, Northern Iran, 2009-2018. Cancer Epidemiol. (2024) 90. doi: 10.1016/j.canep.2024.102579

5. Kim D, Li G, Moon PK, Ma YF, Sim S, Park SY, et al. Thyroid cancer incidence among Korean individuals: A comparison of South Korea and the United States. Laryngoscope. (2024) 134:4156–60. doi: 10.1002/lary.31490

6. West J, Wiemann BZ, Esce AR, Olson GT, and Boyd NH. Thyroid cancer incidence and tumor size in New Mexico American Indians, Hispanics, and non-Hispanic Whites, 1992 to 2019. Ann Otol Rhinol Laryngol. (2024) 133:705–12. doi: 10.1177/00034894241256697

7. Lu MY, Williamson D, Chen TY, Chen RJ, and Barbieri M. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat Biomed Eng. (2021) 5:555–70. doi: 10.1038/s41551-020-00682-w

8. Echle A, Rindtorff NT, Brinker TJ, Luedde T, and Pearson AT. Deep learning in cancer pathology: a new generation of clinical biomarkers. Br J Cancer. (2021) 124:686–96. doi: 10.1038/s41416-020-01122-x

9. Bejnordi BE, Veta M, van Diest PJ, van Ginneken B, Karssemeijer N, Litjens G, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA. (2017) 318:2199–210. doi: 10.1001/jama.2017.14585

10. Mi WM, Li JJ, Guo YC, Ren XY, Liang ZY, Zhang T, et al. Deep learning-based multi-class classification of breast digital pathology images. Cancer Manag Res. (2021) 13:4605–17. doi: 10.2147/CMAR.S312608

11. Sakamoto T, Furukawa T, Lami K, Pham H, Uegami W, Kuroda K, et al. A narrative review of digital pathology and artificial intelligence: focusing on lung cancer. Transl Lung Cancer Res. (2020) 9:2255–76. doi: 10.21037/tlcr-20-591

12. Chen K, Wang M, and Song Z. Multi-task learning-based histologic subtype classification of non-small cell lung cancer. Radiol Med. (2023) 128:537–43. doi: 10.1007/s11547-023-01621-w

13. Raciti P, Sue J, Retamero JA, Ceballos R, Godrich R, Kunz JD, et al. Clinical validation of artificial intelligence-augmented pathology diagnosis demonstrates significant gains in diagnostic accuracy in prostate cancer detection. Arch Pathol Lab Med. (2023) 147:1178–85. doi: 10.5858/arpa.2022-0066-OA

14. Busby D, Grauer R, Pandav K, Khosla A, Jain P, Menon M, et al. Applications of artificial intelligence in prostate cancer histopathology. Urol Oncol. (2024) 42:37–47. doi: 10.1016/j.urolonc.2022.12.002

15. Sharma H, Zerbe N, Klempert I, Hellwich O, and Hufnagl P. Deep convolutional neural networks for automatic classification of gastric carcinoma using whole slide images in digital histopathology. Comput Med Imaging Graph. (2017) 61:2–13. doi: 10.1016/j.compmedimag.2017.06.001

16. Iizuka O, Kanavati F, Kato K, Rambeau M, Arihiro K, and Tsuneki M. Deep learning models for histopathological classification of gastric and colonic epithelial tumours. Sci Rep. (2020) 10:1504. doi: 10.1038/s41598-020-58467-9

17. Ma YX, Shi F, Sun TY, Chen H, Cheng HX, Liu XJ, et al. Histopathological auxiliary system for brain tumour (HAS-Bt) based on weakly supervised learning using a WHO CNS5-style pipeline. J Neurooncol. (2023) 163:71–82. doi: 10.1007/s11060-023-04306-6

18. Alzoubi I, Bao GQ, Zheng YQ, Wang XY, and Graeber MB. Artificial intelligence techniques for neuropathological diagnostics and research. Neuropathology. (2023) 43:277–96. doi: 10.1111/neup.12880

19. Rapp UR, Goldsborough MD, Mark GE, Bonner TI, Groffen J, Reynolds FHJ, et al. Structure and biological activity of v-raf, a unique oncogene transduced by a retrovirus. Proc Natl Acad Sci U S A. (1983) 80:4218–22. doi: 10.1073/pnas.80.14.4218

20. Bellevicine C, Migliatico I, Sgariglia R, Nacchio M, Vigliar E, Pisapia P, et al. Evaluation of BRAF, RAS, RET/PTC, and PAX8/PPARg alterations in different Bethesda diagnostic categories: A multicentric prospective study on the validity of the 7-gene panel test in 1172 thyroid FNAs deriving from different hospitals in South Italy. Cancer Cytopathol. (2020) 128:107–18. doi: 10.1002/cncy.22217

21. Xing ZC, Qiu YX, Yang QR, Yu Y, Liu JY, Fei Y, et al. Thyroid cancer neck lymph nodes metastasis: Meta-analysis of US and CT diagnosis. Eur J Radiol. (2020) 129:109103. doi: 10.1016/j.ejrad.2020.109103

22. Guevara N, Lassalle S, Benaim G, Sadoul JL, Santini J, and Hofman P. Role of frozen section analysis in nodular thyroid pathology. Eur Ann Otorhinolaryngol Head Neck Dis. (2015) 132:67–70. doi: 10.1016/j.anorl.2014.02.006

23. Yao JC, Wang YP, Lei ZK, Wang K, Feng N, Dong FJ, et al. Multimodal GPT model for assisting thyroid nodule diagnosis and management. NPJ Digital Med. (2025) 8. doi: 10.1038/s41746-025-01652-9

24. Li Y, Chen PJ, Li ZY, Su H, Yang L, and Zhong DR. Rule-based automatic diagnosis of thyroid nodules from intraoperative frozen sections using deep learning. Artif Intell Med. (2020) 108:101918. doi: 10.1016/j.artmed.2020.101918

25. Zhu XY, Chen CC, Guo Q, Ma JH, Sun FL, and Lu HZ. Deep learning-based recognition of different thyroid cancer categories using whole frozen-slide images. Front Bioeng Biotechnol. (2022) 10:857377. doi: 10.3389/fbioe.2022.857377

26. Chen PJ, Shi XS, Liang Y, Li Y, Yang L, and Gader PD. Interactive thyroid whole slide image diagnostic system using deep representation. Comput Methods Programs Biomed. (2020) 195:105630. doi: 10.1016/j.cmpb.2020.105630

27. Li XJ, Mao XD, Chen GF, Wang QF, Chu XQ, Hu X, et al. High BRAFV600E mutation frequency in Chinese patients with papillary thyroid carcinoma increases diagnostic efficacy in cytologically indeterminate thyroid nodules. Med (Baltimore). (2019) 98:e16343. doi: 10.1097/MD.0000000000016343

28. Szymonek M, Kowalik A, Kopczynski J, Gasior-Perczak D, Palyga I, Walczyk A, et al. Immunohistochemistry cannot replace DNA analysis for evaluation of BRAF V600E mutations in papillary thyroid carcinoma. Oncotarget. (2017) 8:74897–909. doi: 10.18632/oncotarget.20451

29. Singarayer R, Mete O, Perrier L, Thabane L, Asa SL, Van Uum S, et al. A systematic review and meta-analysis of the diagnostic performance of BRAF V600E immunohistochemistry in thyroid histopathology. Endocr Pathol. (2019) 30:201–18. doi: 10.1007/s12022-019-09585-2

30. Virk RK, Theoharis C, Prasad A, Chhieng D, and Prasad ML. Morphology predicts BRAF (V600E) mutation in papillary thyroid carcinoma: an interobserver reproducibility study. Virchows Arch. (2014) 464:435–42. doi: 10.1007/s00428-014-1552-3

31. Rossi ED, Bizzarro T, Martini M, Capodimonti S, Fadda G, Larocca LM, et al. Morphological parameters able to predict BRAFV600E-mutated malignancies on thyroid fine-needle aspiration cytology: our institutional experience. Cancer Cytopathol. (2014) 122:883–91. doi: 10.1002/cncy.21475

32. Kather JN and Calderaro J. Development of AI-based pathology biomarkers in gastrointestinal and liver cancer. Nat Rev Gastroenterol Hepatol. (2020) 17:591–2. doi: 10.1038/s41575-020-0343-3

33. Wessels F, Schmitt M, Krieghoff-Henning E, Jutzi T, Worst TS, Waldbillig F, et al. Deep learning approach to predict lymph node metastasis directly from primary tumour histology in prostate cancer. BJU Int. (2021) 128:352–60. doi: 10.1111/bju.15386

34. Brinker TJ, Kiehl L, Schmitt M, Jutzi TB, Krieghoff-Henning EI, Krahl D, et al. Deep learning approach to predict sentinel lymph node status directly from routine histology of primary melanoma tumours. Eur J Cancer. (2021) 154:227–34. doi: 10.1016/j.ejca.2021.05.026

35. Ma Y, Zhang XM, Yi ZL, Ding LY, Cai BJ, Jiang ZN, et al. A study of machine learning models for rapid intraoperative diagnosis of thyroid nodules for clinical practice in China. Cancer Med. (2024) 13. doi: 10.1002/cam4.6854

36. Liu YH, Lai FH, Lin B, Gu YQ, Chen LL, Chen G, et al. Deep learning to predict cervical lymph node metastasis from intraoperative frozen section of tumour in papillary thyroid carcinoma: a multicentre diagnostic study. eClinicalMedicine. (2023) 60. doi: 10.1016/j.eclinm.2023.102007

Keywords: frozen sections, pathological image, thyroid cancer, artificial intelligence, deep learning

Citation: Wang C, Hu J, Li X, Cai Y, Wang S, Wu X, Liu H, Hu Z and Hu D (2026) Deep learning for multitask prediction on thyroid nodule frozen sections. Front. Oncol. 15:1676360. doi: 10.3389/fonc.2025.1676360

Received: 30 July 2025; Revised: 18 November 2025; Accepted: 24 November 2025; Published: 12 January 2026.

Edited by:

Vincenzo L’Imperio, University of Milano-Bicocca, Italy

Reviewed by:

Antonio Maria Alviano, IRCCS San Gerardo dei Tintori Foundation, Italy
Na Feng, Zhejiang Cancer Hospital, China

Copyright © 2026 Wang, Hu, Li, Cai, Wang, Wu, Liu, Hu and Hu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Dehua Hu, hudehua@csu.edu.cn; Zhongliang Hu, huzhongliang@csu.edu.cn

†These authors have contributed equally to this work
