Ultrasound-based deep learning using the VGGNet model for the differentiation of benign and malignant thyroid nodules: A meta-analysis

Objective The aim of this study was to evaluate the accuracy of deep learning using the convolutional neural network VGGNet model in distinguishing benign and malignant thyroid nodules based on ultrasound images. Methods Studies that used the deep learning convolutional neural network VGGNet model to classify benign and malignant thyroid nodules based on ultrasound images were retrieved from PubMed, Embase, Cochrane Library, China National Knowledge Infrastructure (CNKI), and Wanfang databases. Cytology and pathology were used as gold standards. Study eligibility and risk of bias were assessed using the QUADAS-2 tool, and the diagnostic accuracy of deep learning VGGNet was analyzed with pooled sensitivity, pooled specificity, diagnostic odds ratio, and the area under the curve. Results A total of 11 studies were included in this meta-analysis. The overall estimates of sensitivity and specificity were 0.87 [95% CI (0.83, 0.91)] and 0.85 [95% CI (0.79, 0.90)], respectively. The diagnostic odds ratio was 38.79 [95% CI (22.49, 66.91)]. The area under the curve was 0.93 [95% CI (0.90, 0.95)]. No obvious publication bias was found. Conclusion Deep learning using the convolutional neural network VGGNet model based on ultrasound images showed good diagnostic efficacy in distinguishing benign and malignant thyroid nodules. Systematic Review Registration https://www.crd.york.ac.nk/prospero, identifier CRD42022336701.


Introduction
Thyroid nodules are the most common diseases of the endocrine system, with an ultrasound population detection rate of about 65%, of which approximately 10% is thyroid cancer (1). Thyroid cancer, despite the low incidence, is one of the fastest growing of all cancer types, having increased approximately 2.4 times in the last 30 years (2). It has become a public health concern in most parts of the world. Therefore, early detection and early accurate diagnosis of benign and malignant thyroid nodules are crucial to develop treatment plans and predict prognosis for patients with thyroid nodules, yet this is a great challenge for radiologists and physicians.
Ultrasound is currently the first-line examination of choice for the clinical diagnosis of thyroid nodules, and it is not only the main method for cancer risk stratification of thyroid nodules, but also usually used for the guidance of fine-needle aspiration biopsy. However, the differential diagnosis of thyroid nodules by 2D ultrasound has certain limitations. The quality of ultrasound images is susceptible to many factors, such as the cooperation of patients, the performance of ultrasound machines, and the operating techniques of radiologists (3). In addition, ultrasound diagnostic results are affected by the experience level of radiologists, and the recognition of ultrasound image characteristics of nodules differs among radiologists with different working experience, which is subjective (4). Therefore, there is an urgent need to explore a diagnostic tool that is noninvasive, accurate, and objective in the differential diagnosis of the benign and malignant thyroid nodules preoperatively.
In 2013, deep learning, a branch of artificial intelligence (AI), was ranked first among the top 10 breakthrough technologies by MIT Technology Review. Since then, deep learning has developed rapidly and has played a pivotal role in the medical field, especially in medical image recognition. Some studies have used deep learning convolutional neural networks to extract ultrasound features for identifying and diagnosing benign and malignant thyroid nodules; in some of them, diagnostic performance was comparable to or better than that of senior physicians, which could reduce unnecessary punctures and overtreatment and help grassroots and inexperienced physicians improve diagnostic efficiency and accuracy (5-7). In addition, Lee et al. (8) explored the use of deep learning convolutional neural networks in predicting the presence of lymph node metastasis in thyroid cancer on ultrasound, and their results indicated good predictive diagnostic accuracy (accuracy of 83.0%). Accordingly, ultrasound-based AI provides a new direction and method for radiologists to accurately and non-invasively identify and diagnose benign and malignant thyroid nodules and predict lymph node metastasis in the neck before surgery.
Previously published AI studies on thyroid disease can be broadly classified into two categories: traditional machine learning (ML) and deep learning (DL). Traditional ML relies on manual extraction of image features, but ultrasound images are highly variable and feature extraction depends on physician experience; therefore, the accuracy of diagnosing benign and malignant thyroid nodules varies with the experience of the practitioner. Deep learning is a development of machine learning that extracts image features automatically, independent of physician experience (9). Among deep learning methods, the convolutional neural network (CNN) is a well-known architecture in the field of medical image analysis; it is a fully trainable algorithm consisting of an input layer, hidden layers, and an output layer (10, 11). The hidden layers usually contain convolutional layers, pooling layers, and fully connected layers. Compared with traditional machine learning methods, CNN performs better in target detection and image classification, and can better extract semantic features (12). Nowadays, CNN is considered one of the most advanced methods among the representative algorithms of deep learning, and VGGNet is a widely used CNN model (10,11). VGGNet was the first network structure to be "deep" in a real sense: it took a different research direction from previous CNN models, namely, deepening the network, and showed that a deep network with small filters is superior to a shallow network with large filters (13). Therefore, the deep learning VGGNet model alone was selected as the research subject to avoid selection bias and ensure the stability and reliability of the results.
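As a rough illustration of why small filters scale well with depth, the following minimal sketch compares two stacked 3×3 convolutions with a single 5×5 convolution covering the same receptive field. The channel width of 64 and the helper names are illustrative assumptions and are not taken from any of the included studies.

```python
def conv_params(kernel: int, channels_in: int, channels_out: int) -> int:
    """Weights plus biases of one square 2D convolution layer."""
    return kernel * kernel * channels_in * channels_out + channels_out


def stacked_receptive_field(kernel: int, layers: int) -> int:
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)


channels = 64  # illustrative channel width (assumption)

two_small = 2 * conv_params(3, channels, channels)  # two stacked 3x3 layers
one_large = conv_params(5, channels, channels)      # one 5x5 layer

print("receptive field of two 3x3 layers:", stacked_receptive_field(3, 2))
print("parameters, two 3x3 layers:", two_small)
print("parameters, one 5x5 layer:", one_large)
```

Under these assumptions, the stacked small filters see the same 5×5 region with fewer parameters and one additional nonlinearity between the layers.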
At present, a number of studies have demonstrated that the deep learning VGGNet model can differentiate benign and malignant thyroid nodules on ultrasound and thereby assist physicians in reaching a diagnosis, but the reported sensitivity varies considerably between studies. The sensitivity was 93% in the study of Zhu et al. (5), but only 77% in the study of Zhou et al. (14). Moreover, no meta-analysis of ultrasound-based deep learning VGGNet models for determining the nature of thyroid nodules has been found. Therefore, this meta-analysis aims to evaluate the efficacy of the ultrasound-based deep learning VGGNet model in distinguishing and diagnosing the nature of thyroid nodules to help radiologists make more accurate diagnoses.

Search strategy
This meta-analysis summarized previously published literature on the differential diagnosis of thyroid nodules with an ultrasound-based deep learning convolutional neural network VGGNet model, and thus required no ethical approval or patient consent. The literature was independently searched in PubMed, Embase, Cochrane Library, China National Knowledge Infrastructure (CNKI), and Wanfang databases up to September 2021, and the search was updated in June 2022. The main search terms were as follows: ("deep learning" OR "DL" OR "neural network") AND ("ultrasonography" OR "ultrasound" OR "ultrasonic" OR "diagnostic imaging") AND ("thyroid" OR "thyroid gland" OR "thyroid nodules"). Moreover, the references of retrieved topic-related systematic reviews were manually searched, and other relevant studies were read and identified to make the search more comprehensive.

Study selection
Inclusion criteria were as follows: (1) studies that used the deep learning VGGNet model for the differential diagnosis of benign and malignant thyroid nodules; (2) studies that used at least one ultrasound imaging modality; (3) literature that provided true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN); (4) studies that reported test set data or validation set data: if both were present, we chose the test set; if there were both external and internal test sets, we also conducted a meta-analysis on the external test sets; if a paper reported results for more than one external test set, we removed the highest and lowest diagnostic performance results and selected the intermediate results; and (5) studies in which the gold standard was fine-needle aspiration (FNA), pathology, or both.
Exclusion criteria were as follows: (1) studies that did not match the gold standard; (2) studies of convolutional neural network models unrelated to the deep learning VGGNet model; (3) studies that did not provide the necessary 2×2 contingency data; (4) literature with only abstracts, reviews, conference reports, papers not published in journals, papers whose full text was not accessible online, and so on; and (5) duplicate studies.

Quality assessment and data extraction
The Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool is a recognized tool for quality assessment of diagnostic accuracy tests and is widely used in diagnostic meta-analyses because of its specific problem definition and clinical actionability (15, 16). The QUADAS tool was revised in 2011 as QUADAS-2, which consists of four main domains: patient selection, index test, reference standard, and flow and timing; all domains are evaluated in terms of risk of bias (17). The 11 included studies were independently evaluated by two reviewers using the QUADAS-2 tool, and any disagreement encountered during the assessment was resolved by discussion among the authors. QUADAS-2 results were output using RevMan 5.3, the dedicated software of the Cochrane Collaboration Network.
In this study, two authors independently read the titles and abstracts to screen eligible papers, and then read the full text to determine the included papers. The information obtained from each study was extracted independently, including first author, year of publication, country, gold standard, training set size, test set size, fourfold table data (TP, FP, FN, and TN), sensitivity, specificity, VGGNet type, and testing objects. Studies for which fourfold table data were not available were excluded.
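The per-study accuracy measures follow directly from the fourfold table. The sketch below shows one way to hold an extracted table and derive sensitivity, specificity, and the diagnostic odds ratio; the class name, the 0.5 continuity correction for zero cells, and the example counts are illustrative assumptions rather than details reported in this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FourfoldTable:
    """Counts extracted from one study (hypothetical container)."""
    tp: int
    fp: int
    fn: int
    tn: int

    @property
    def sensitivity(self) -> float:
        # Proportion of malignant nodules classified as malignant.
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # Proportion of benign nodules classified as benign.
        return self.tn / (self.tn + self.fp)

    @property
    def diagnostic_odds_ratio(self) -> float:
        cells = [self.tp, self.fp, self.fn, self.tn]
        # Assumption: add 0.5 to every cell when any cell is zero.
        if 0 in cells:
            tp, fp, fn, tn = (c + 0.5 for c in cells)
        else:
            tp, fp, fn, tn = cells
        return (tp * tn) / (fp * fn)


# Made-up counts, for illustration only.
example = FourfoldTable(tp=90, fp=15, fn=10, tn=85)
print(example.sensitivity, example.specificity, example.diagnostic_odds_ratio)
```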

Statistical analysis
Data from the included studies were collated in Excel 2019, and sensitivity, specificity, and diagnostic odds ratio (DOR), each with a 95% confidence interval (CI), were pooled using STATA software version 16.0. The area under the receiver operating characteristic (ROC) curve and its 95% CI were also calculated, and the value of the diagnostic test was assessed by the area under the curve (AUC), where AUC < 0.70 indicates low diagnostic accuracy, 0.70 ≤ AUC < 0.90 indicates moderate diagnostic accuracy, and AUC ≥ 0.90 indicates high diagnostic accuracy. Statistical heterogeneity between studies was assessed using the I² statistic and the Cochran Q test; if I² < 50%, a fixed-effects model was used to assess sensitivity and specificity, and if I² > 50%, a random-effects model was used. When statistical heterogeneity was large, meta-regression analysis was used to explore its sources. p < 0.05 was considered statistically significant.
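The decision rules in this section can be sketched as follows. The function names are hypothetical, the I² formula is the standard Higgins definition based on Cochran's Q, and the handling of exactly I² = 50% (treated here as random-effects) is an assumption, since the text does not specify it. The Q value in the example is made up.

```python
def i_squared(q: float, n_studies: int) -> float:
    """Higgins I² (in percent) from Cochran's Q and the number of studies."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


def choose_model(i2_percent: float) -> str:
    """Model choice by the I² threshold; I² = 50% is an assumed tie-break."""
    return "fixed-effects" if i2_percent < 50.0 else "random-effects"


def auc_category(auc: float) -> str:
    """Diagnostic accuracy category by AUC thresholds."""
    if auc < 0.70:
        return "low"
    if auc < 0.90:
        return "moderate"
    return "high"


i2 = i_squared(q=40.0, n_studies=11)  # hypothetical Q statistic
print(i2, choose_model(i2))
print(auc_category(0.93))  # pooled AUC reported in this meta-analysis
```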

Literature searches
Through a comprehensive search up to September 2021, updated in June 2022, 2,495 records were obtained for our study, including 544 papers from PubMed, 1,837 papers from Embase, 40 articles from Cochrane Library, 31 papers from CNKI, 37 papers from Wanfang database, and 6 papers from other sources. After duplicates were removed, two researchers independently read the titles and abstracts of the remaining records and excluded literature reviews, case reports, news items, and other ineligible research types. The full texts of the remaining papers were then read, and 11 studies were eventually included in the meta-analysis. The detailed selection procedure is shown in Figure 1.

Study characteristics
We registered this meta-analysis on the PROSPERO website; the registration number is CRD42022336701. Following the PRISMA-Diagnostic studies selection process, we eventually included 11 papers; all studies are included in Table 1. The following are some basic characteristics of the included literature. All studies were published within the last 5 years. Eight papers used the deep learning VGG-16 model (14,18-24). Four papers clearly indicated a retrospective study design (5,6,14,19). Two papers did not give an explicit number of training sets (19,20). Three papers compared the deep learning CNN algorithm with radiologists, and the results were comparable to or better than those of the senior radiologists (5,7,23).
FIGURE 1 Study flowchart. Eleven studies were included in this meta-analysis.

Methodology quality assessment
The results of evaluating the papers' quality assessed by the QUADAS-2 are shown in Figure 2. Most of the studies themselves were of high quality, but a few studies had potential risk of bias in flow and timing. In general, the included studies were considered as eligible.

Accuracy of the ultrasound-based deep learning VGGNet model in the differential diagnosis of benign and malignant thyroid nodules

Publication bias
Deeks' funnel plot drawn with STATA 16.0 showed no significant asymmetry, with a p-value of 0.84 (p > 0.05) (Figure 6), which indicated that there was no significant publication bias.
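For readers unfamiliar with the funnel plot asymmetry test, the sketch below follows its usual description: the log diagnostic odds ratio is regressed on the inverse square root of the effective sample size (ESS), weighted by ESS. It estimates only the slope; the p-value reported above comes from STATA and is not reproduced here. The helper names and the fourfold tables are hypothetical.

```python
import math


def effective_sample_size(n_malignant: int, n_benign: int) -> float:
    """ESS = 4 * n1 * n2 / (n1 + n2) for the two diagnostic groups."""
    return 4.0 * n_malignant * n_benign / (n_malignant + n_benign)


def weighted_slope(xs, ys, ws):
    """Slope of a weighted least-squares line y = a + b * x."""
    total = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / total
    y_bar = sum(w * y for w, y in zip(ws, ys)) / total
    num = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    den = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    return num / den


# Hypothetical (TP, FP, FN, TN) tables, for illustration only.
tables = [(90, 15, 10, 85), (40, 5, 10, 45), (70, 20, 30, 80), (120, 30, 20, 130)]

ess = [effective_sample_size(tp + fn, fp + tn) for tp, fp, fn, tn in tables]
x = [1.0 / math.sqrt(e) for e in ess]
ln_dor = [math.log((tp * tn) / (fp * fn)) for tp, fp, fn, tn in tables]

# A slope far from zero would suggest funnel plot asymmetry.
slope = weighted_slope(x, ln_dor, ess)
print(f"slope of lnDOR on 1/sqrt(ESS): {slope:.2f}")
```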

Heterogeneity detection
Given the heterogeneity of the studies included in the pooled statistics, this research used meta-regression analysis to examine several clinically relevant variables. The results showed that year of study publication (≤2020 or >2020), number or scale of the region of interest (ROI) (single or multiple), and type of deep learning VGGNet (VGG-16 or VGG-19) were all associated with heterogeneity and were statistically significant for sensitivity (p < 0.05). Results of the meta-regression analysis are shown in Table 2.

Sensitivity analysis
To explore whether individual studies affected the stability of the pooled sensitivity (PSEN) and pooled specificity (PSPE), this study excluded the included papers one at a time, and the results of the sensitivity analysis are shown in Table 3. The results demonstrated that with every single paper excluded, neither PSEN and PSPE nor Higgins I² had significant changes.
FIGURE The forest plot of sensitivity and specificity for diagnosing thyroid nodules. Horizontal lines illustrate 95% confidence intervals of the individual studies.
FIGURE The diagnostic odds ratios (DOR) for diagnosing thyroid nodules. Horizontal lines illustrate 95% confidence intervals of the individual studies.
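The one-by-one exclusion procedure can be sketched as below. This is a deliberate simplification: it pools by summing the fourfold counts across studies, whereas the pooled estimates in this meta-analysis were produced in STATA. The study labels and counts are hypothetical.

```python
# Hypothetical (TP, FP, FN, TN) counts per study, for illustration only.
studies = {
    "Study A": (90, 15, 10, 85),
    "Study B": (40, 5, 10, 45),
    "Study C": (70, 20, 30, 80),
}


def pooled_sens_spec(tables):
    """Naive pooling: sum the counts, then compute sensitivity and specificity."""
    tp = sum(t[0] for t in tables)
    fp = sum(t[1] for t in tables)
    fn = sum(t[2] for t in tables)
    tn = sum(t[3] for t in tables)
    return tp / (tp + fn), tn / (tn + fp)


def leave_one_out(all_studies):
    """Pooled estimates recomputed with each study omitted in turn."""
    return {
        omitted: pooled_sens_spec(
            [t for name, t in all_studies.items() if name != omitted]
        )
        for omitted in all_studies
    }


overall = pooled_sens_spec(list(studies.values()))
print("all studies:", overall)
for omitted, (sens, spec) in leave_one_out(studies).items():
    print(f"without {omitted}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

If the estimates move little when any single study is omitted, the pooled result is considered stable.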

Fagan plot analysis
The analysis of the Fagan plots showed that the ultrasound-based deep learning VGGNet model could provide some help for radiologists in the differential diagnosis of the nature of thyroid nodules (Figure 7). When the prior probability was 50%, the posterior probability of malignancy was 86% when the deep learning VGGNet model result was "positive", and it dropped to 13% when the result was "negative". When the prior probabilities were 25% and 75%, the post-test probabilities for positive were 67% and 86%, and the post-test probabilities for negative were 5% and 31%.
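A Fagan plot is a graphical form of Bayes' theorem: pre-test odds multiplied by a likelihood ratio give post-test odds. The sketch below derives the likelihood ratios from the pooled sensitivity and specificity reported in the abstract, which is an assumption; the likelihood ratios underlying Figure 7 come from the fitted model and may differ slightly, so the printed values need not match the figure exactly.

```python
def post_test_probability(pre_test: float, likelihood_ratio: float) -> float:
    """Convert a pre-test probability to a post-test probability via odds."""
    pre_odds = pre_test / (1.0 - pre_test)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Pooled estimates from this meta-analysis.
sensitivity, specificity = 0.87, 0.85

# Assumption: likelihood ratios derived directly from the pooled estimates.
lr_positive = sensitivity / (1.0 - specificity)
lr_negative = (1.0 - sensitivity) / specificity

for pre_test in (0.25, 0.50, 0.75):
    after_positive = post_test_probability(pre_test, lr_positive)
    after_negative = post_test_probability(pre_test, lr_negative)
    print(f"pre-test {pre_test:.0%}: "
          f"positive -> {after_positive:.1%}, negative -> {after_negative:.1%}")
```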

Discussion
This meta-analysis evaluated the efficacy of the ultrasound-based deep learning VGGNet model in the differential diagnosis of benign and malignant thyroid nodules. The results showed that the deep learning VGGNet model achieved satisfactory results in discriminating benign and malignant thyroid nodules on ultrasound images; the pooled sensitivity and specificity were 0.87 and 0.85, respectively.
Traditional machine learning usually involves feature extraction and classification of ROI. Although the popularity of machine learning has gradually increased in recent years, the ROI can only be selected manually and is analyzed using single-area information such as image texture, geometric shape, and statistical distribution (9). Ding et al. (25,26) extracted statistical and textural features from thyroid elastograms, and then trained a support vector machine (SVM) to detect malignancy of thyroid nodules with a maximum classification accuracy of 95.2%. However, the classification accuracy was affected by a hard threshold.
FIGURE 6 The publication bias of the included studies. No significant publication bias was found in the present meta-analysis. Each circle represents an eligible study. ESS, effective sample size.
FIGURE The receiver operating characteristic curve (ROC). SENS, sensitivity; SPEC, specificity; SROC, summary receiver operating characteristic curve; AUC, area under the SROC curve.
Compared with ML, deep learning can automatically extract the multi-level features of the ROI, and learn features from the nodule itself and the difference between the textures of different tissues, which greatly improves the image classification and detection performance (27). Buda et al. (28) used CNN for feature extraction and nodule classification of thyroid nodules, and also compared the diagnosis results with those of nine radiologists; the average sensitivity and average specificity of deep learning for diagnosis were higher than those of the nine radiologists, indicating that deep learning has a good clinical diagnostic value. Vasile et al. (29) used the fusion method of CNN-VGG for thyroid disease feature extraction and image classification, with an overall accuracy of 97.35%, showing that the integrative method is an excellent and stable classifier.
Previously, meta-analyses were published on cardiovascular disease (30), gastrointestinal disease (31), and colorectal polyposis disease (32), and their combined AUCs were equal to or greater than 0.9, showing the excellent performance of CNN in disease diagnosis. Meta-analyses of ultrasound-based artificial intelligence for thyroid nodules have also been conducted. Zhao et al. (33) included only five studies in their meta-analysis. Xu et al. (34) mainly evaluated the overall efficacy of computer-aided diagnosis (CAD) systems, including VGGNet, S-Detect, AlexNet, Inception, and so on. In addition, the number of studies of each single model was small, and none of these meta-analyses yielded the diagnostic efficacy of a single class of model. Through further retrieval and reading of papers, no meta-analysis using the ultrasound-based deep learning VGGNet model to differentially diagnose benign and malignant thyroid nodules has been found so far. Therefore, the authors conducted such a study.
The quality assessment indicated that most of the included studies were of high quality, and no significant publication bias was found. However, a small number of included studies did not report whether cases were selected consecutively or randomly, or did not fully implement the gold standard, resulting in a few studies with slightly poorer reporting quality, which may lead to implementation bias and measurement bias and, in turn, high heterogeneity. Therefore, this study used meta-regression to explain this high degree of heterogeneity. The results suggest that the year of study publication, number or scale of ROI, and type of deep learning VGGNet model may be important sources of this heterogeneity. These sources are analyzed separately in detail below.
Firstly, there were 6 sets of data from five papers published in 2020 and before (5,14,18,20,21) and 10 sets of data from six papers published after 2020 (6,7,19,22-24); the differences in sensitivity and specificity between the two groups were statistically significant (p < 0.05). The papers published after 2020 had a lower sensitivity than those published in 2020 and before (0.86 vs. 0.89). The reason may be that some papers published after 2020 included images of malignant thyroid nodules of different pathological types (6,19,23), which reduced their sensitivity. In addition, the total number of benign nodules included in papers after 2020 was smaller than that in 2020 and before, which reduced the specificity.
Secondly, an ROI depicted on a single scale can easily miss the rich details of ultrasound images of thyroid nodules (35). Therefore, the number or scale of ROIs was an important factor affecting study heterogeneity. The included studies differed in how many ROIs they delineated and at what scale: 12 sets of data from eight papers delineated one ROI (5-7, 19-21, 23, 24), and 4 sets of data from three papers delineated two or more ROIs at different scales (14,18,22); the difference in sensitivity was statistically significant (p < 0.05). Among them, Zhou et al. (14) delineated three target regions of thyroid nodules based on average size, located roughly inside, around, and outside the thyroid nodule, with all three ROIs containing the nodule. Their comparison of AUC between one ROI and three ROIs (0.82 vs. 0.87) indicated that classification using three ROIs was more accurate. Therefore, it is reasonable to believe that the number or scale of ROIs had an impact on the identification results of thyroid nodules.
Finally, all the included studies used the deep learning VGGNet model. The 10 sets of data from eight papers used the deep learning VGG-16 models (14,18-24), and 6 sets of data from four papers used the deep learning VGG-19 models (5-7, 24); the paper of Kim
FIGURE 7 Fagan plot analysis for VGGNet model in detecting thyroid nodules: (A) Pre-test probability at 25%. (B) Pre-test probability at 50%. (C) Pre-test probability at 75%. The Fagan plot is composed of the left vertical axis representing the pre-test probability, the middle vertical axis representing the likelihood ratio, and the right vertical axis representing the post-test probability.
Moreover, the performance of a DL model is closely connected with the amount of training data, and the model performs better when the training sample is sufficiently large (36). Among the 11 included studies, 2 sets of data from two papers did not give an explicit number of training sets, and 14 sets of data from nine papers did give the number of training sets, but the amount of pre-training and the amount of learning varied across studies; thus, it is difficult to assess the extent to which the models overfit. In addition, some researchers have explored the use of autonomously fine-tuned VGGNet models. Currently, there is no mature deep learning CNN model that can directly differentiate the nature of thyroid nodules on ultrasound, which may inevitably lead to high heterogeneity.
In addition, the Fagan plot was used to explore the clinical utility of ultrasound-based deep learning VGGNet models. The results showed that the ultrasound-based deep learning VGGNet model had the potential to differentiate benign and malignant thyroid nodules. When a patient was considered to have a 50% chance of having thyroid cancer after initial clinical assessment, the probability of thyroid cancer increased from 50% to 86% if the deep learning VGGNet model result was positive, a substantial gain in diagnostic certainty. In contrast, if the deep learning VGGNet result was negative, the patient had a 13% chance of thyroid cancer, which could help radiologists to exclude thyroid cancer. In real-world clinical practice, a biopsy of masses with a predicted 25% probability of malignancy will be performed regardless of the outcome of deep learning VGGNet. Therefore, the Fagan plot showed that the deep learning VGGNet model can aid radiologists in diagnosis.
This diagnostic meta-analysis has several limitations. Firstly, studies from Europe and America were excluded because they did not meet the inclusion criteria of using the deep learning VGGNet model to differentiate benign from malignant thyroid nodules, which might cause geographic bias. Secondly, this study only included papers published in English and Chinese, which might cause an unavoidable language bias. Thirdly, this meta-analysis only included 11 papers, and the small sample size of the test set in some studies may affect the accuracy of the results of the meta-analysis. To further assess the differential diagnostic efficacy of deep learning VGGNet models, large-scale, prospective, multicenter studies in different regions are necessary.

Conclusion
This meta-analysis suggests that the ultrasound-based deep learning VGGNet model is a suitable and effective method for radiologists to differentiate and diagnose benign and malignant thyroid nodules. However, given the limitations of the sample size and the varying quality of the included studies, additional prospective or multicenter studies are needed for further evaluation.

Data availability statement
The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding authors.