A Comparative Analysis of Six Machine Learning Models Based on Ultrasound to Distinguish the Possibility of Central Cervical Lymph Node Metastasis in Patients With Papillary Thyroid Carcinoma

Current approaches to predict central cervical lymph node metastasis (CLNM) in patients with papillary thyroid carcinoma (PTC) have failed to identify patients who would benefit from preventive treatment. Machine learning has offered the opportunity to improve accuracy by comparing the different algorithms. We assessed which machine learning algorithm can best improve CLNM prediction. This retrospective study used routine ultrasound data of 1,364 PTC patients. Six machine learning algorithms were compared to predict the possibility of CLNM. Predictive accuracy was assessed by sensitivity, specificity, positive predictive value, negative predictive value, and the area under the curve (AUC). The patients were randomly split into the training (70%), validation (15%), and test (15%) data sets. Random forest (RF) led to the best diagnostic model in the test cohort (AUC 0.731 ± 0.036, 95% confidence interval: 0.664–0.791). The diagnostic performance of the RF algorithm was most dependent on the following five top-rank features: extrathyroidal extension (27.597), age (17.275), T stage (15.058), shape (13.474), and multifocality (12.929). In conclusion, this study demonstrated promise for integrating machine learning methods into clinical decision-making processes, though these would need to be tested prospectively.


INTRODUCTION
According to the American Cancer Society's 2020 cancer statistics, thyroid cancer is the fifth most common cancer in women (1), and the situation is similar in China (2). Cervical lymph node metastasis is considered a risk factor for recurrence (3,4), and central cervical lymph node metastasis (CLNM) is its most common form (5). According to the American Thyroid Association (ATA) management guidelines for adult patients with thyroid cancer (6), the presence or absence of CLNM directly affects preoperative surgical planning.
Prophylactic central lymph node dissection (CLND) in cN0 papillary thyroid carcinoma (PTC) patients will undoubtedly lead to overtreatment. Therefore, distinguishing CLNM with non-invasive methods before surgery is of great significance for the treatment and prognosis of PTC patients.
Ultrasound remains the most critical imaging modality in the evaluation of thyroid cancer according to the ATA Statement on Preoperative Imaging for Thyroid Cancer Surgery (7) because it is convenient, non-invasive, and radiation-free. However, detecting CLNM is challenging because of interference from gas in the trachea and esophagus, and the diagnostic sensitivity of ultrasound is only about 20-40% (8-13).
In recent years, artificial intelligence (AI) in medicine has grown significantly as a state-of-the-art data analysis tool (14). In particular, radiology lends itself to AI because of its large digital data sets (15). Machine learning, a significant subset of AI, plays an important supporting role in improving diagnostic and prognostic accuracy (16). Recently, some studies have used machine learning to evaluate thyroid nodules (16-18) and lymph node metastasis in patients with thyroid cancer (19), with the highest reported area under the curve (AUC) reaching 0.953. The machine learning classifiers used in these studies include neural networks, decision trees, random forest (RF), and deep learning. However, few studies have compared the diagnostic performance of different machine learning classifiers in evaluating CLNM in PTC patients.
In the current study, based on published literature (20-23), we hypothesized that machine learning models based on ultrasound could achieve higher performance in predicting CLNM in PTC patients. The study had three aims: first, to develop machine learning-based models using six different classifiers to distinguish CLNM from non-CLNM based on preoperative ultrasound images; second, to validate and test the diagnostic performance of the six models; and third, to compare the performance of the six classifiers.

Patient Population
The research ethics committee of Binzhou Medical University Hospital approved this retrospective study (No. LW-024), and the requirement for written informed consent was waived owing to the retrospective nature of the study. All included data were anonymized. All patients' medical records and ultrasound images were stored in the picture archiving and communication system (PACS) of Binzhou Medical University Hospital, through which clinicians could access the data. To protect patients' privacy, the original patient data are available from the corresponding author on request.
The clinical records of 1,679 patients who visited Binzhou Medical University Hospital between January 2017 and June 2020 were retrospectively analyzed. The patients were treated for thyroid nodules classified as Bethesda category V (suspicious for malignancy, risk of malignancy 50-75%) or VI (malignant, risk of malignancy 97-99%) on ultrasound-guided fine-needle aspiration biopsy (US-FNAB).
All patients underwent total thyroidectomy or thyroid lobectomy. CLND was performed in patients whose preoperative ultrasound suggested the possible presence of CLNM; in cN0 patients, who had no evidence of CLNM on preoperative imaging, CLND was performed according to the patient's wishes after discussion with the patient. Patients who did not undergo CLND were excluded from this study. Meanwhile, according to the ATA guidelines (24), the decision to perform lateral cervical lymph node dissection was based on preoperative imaging data. The inclusion criteria were PTC confirmed by postoperative pathology and complete postoperative pathological results. We excluded patients with medullary thyroid carcinoma (MTC) or follicular thyroid carcinoma (FTC), patients younger than 18 years, and patients with previous thyroid or other neck surgery or a history of radiation therapy. After a strict inclusion and exclusion process (Figure 1), a total of 1,364 consecutive patients were included and randomly split into the training (70%), validation (15%), and test (15%) data sets using IBM SPSS Modeler software (version 18.0). Demographic and clinical data, including sex, age, and final surgical pathology diagnosis, were thoroughly reviewed from the medical records.

Image Acquisition and Analysis
All examinations were performed using six different Doppler ultrasound systems. The detailed ultrasound examination protocol is provided in the Appendix. Malignant thyroid lesions were assessed according to the diagnostic criteria in the white paper of the American College of Radiology (ACR) Thyroid Imaging, Reporting, and Data System (TI-RADS) committee (25). The evaluation parameters were location, background, T stage (diameter), margin, shape, composition, echogenicity, calcification, extrathyroidal extension (ETE), and multifocality. The ultrasound images were re-evaluated by four radiologists with 11, 12, 13, and 15 years of experience in thyroid cancer ultrasound diagnosis, who were blinded to clinical information and pathological diagnosis. A week later, the four radiologists repeated the evaluation, and intra-observer consistency was analyzed. When the four radiologists disagreed on an image feature, the disagreement was resolved through consultation with a fifth radiologist with more than 20 years of experience in ultrasound diagnosis. Inter-observer consistency was also analyzed. Consistency was assessed with Cohen's kappa.
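The consistency statistic named above can be illustrated with a short sketch. It shows Cohen's kappa for two sets of categorical ratings in plain Python; the function name `cohen_kappa` and the example ratings are hypothetical and are not taken from the study, whose analyses were run in SPSS. Cohen's kappa is defined for two raters, so with four radiologists it would presumably be applied pairwise or per reader across the two reading sessions; that detail is an assumption.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equally long lists of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: fraction of cases rated identically.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal frequencies per category.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of one feature (1 = present, 0 = absent) in ten nodules.
reader_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
reader_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohen_kappa(reader_1, reader_2)
print(round(kappa, 3))
```

In this toy case the readers agree on nine of ten nodules and chance agreement is 0.5, so kappa is about 0.8, which is the threshold the study reports exceeding.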

Models Construction and Evaluation
Six classifiers were used to establish the models in SPSS Modeler (version 18.0, IBM, Armonk, New York): decision tree (C5.0), logistic regression analysis (LRA), support vector machine (SVM), Bayesian network (BN), artificial neural network (ANN), and RF. A brief description of these machine learning classifiers is given in Table A.1. Because classifier parameters play an essential role in classification performance, we set them as follows to enhance the performance of each classifier: boosting was used in the C5.0 algorithm to improve model accuracy; binomial logistic regression with backward selection was used in the LRA model; the SVM used the radial basis function kernel with C = 16 and γ = 0.06 (1/number of features); the BN adopted a Markov chain structure with maximum likelihood parameter estimation; the ANN model was a single-layered perceptron neural network consisting of one input layer, one or more hidden layers, and one output layer; and the RF model used 100 random trees, with the maximum number of features set to all input features.
SPSS Modeler software randomly selected 70% of the data set as the training group and trained the model based on CLNM according to postoperative pathological results. 15% of the data set was applied as the validation group, and the remaining 15% was used as the test group to verify the trained model.
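As a rough illustration of the 70%/15%/15% partition described above, the sketch below shuffles patient indices and cuts them into three disjoint groups. It is written in plain Python rather than SPSS Modeler; the function name `split_indices`, the seed, and the rounding rule are assumptions, so the exact group sizes it yields need not match those obtained in the study.

```python
import random


def split_indices(n, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle 0..n-1 and cut the result into train/validation/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains (about 15%)
    return train, val, test


# 1,364 is the number of included patients reported in the text.
train_idx, val_idx, test_idx = split_indices(1364)
print(len(train_idx), len(val_idx), len(test_idx))
```

Keeping the three index lists disjoint is the point of the exercise: the test group must never contribute to training or model selection.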
The data of the validation and test cohorts was input into the six machine learning models by SPSS Modeler to evaluate the diagnostic efficiency. Predictive performance was assessed using the receiver operating characteristic (ROC) curve, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).
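The performance measures listed above follow directly from the 2×2 confusion matrix and from ranking the predicted scores. The sketch below is a minimal plain-Python illustration with hypothetical toy data; the function names and the 0.5 decision threshold are assumptions, and the study itself computed these values with SPSS Modeler and MedCalc. The AUC is computed here through its rank interpretation: the probability that a randomly chosen CLNM case scores higher than a randomly chosen non-CLNM case, with ties counted as one half.

```python
def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, and NPV for binary labels (1 = CLNM)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true CLNM status and model scores for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5
metrics = diagnostic_metrics(y_true, y_pred)
toy_auc = auc(y_true, scores)
print(metrics, toy_auc)
```

Unlike sensitivity and specificity, the AUC does not depend on the chosen threshold, which is why it is the natural quantity for comparing classifiers.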
In addition, we further compared the concordance between the CLNM status as assessed by the best classifier in the six machine learning algorithms and the radiological CLNM status set at the time of the patient's original treatment.
A power calculation was performed to ensure that both the validation and test data sets were sufficient to evaluate the AUC estimated from the training group.

Statistical Analysis
All statistical analyses were performed with SPSS (version 25.0, IBM, Armonk, New York), SPSS Modeler (version 18.0), and MedCalc Statistical Software (version 18.2.1). SPSS Modeler, GraphPad Prism 8.3.0, and MedCalc Statistical Software were used to draw graphs. A P-value <0.05 was considered statistically significant. The χ² test was used to compare differences in count data. MedCalc Statistical Software was used to calculate the six models' AUCs and evaluate the predictions. The DeLong method was used to compare the AUCs of the six machine learning classifiers. Cohen's kappa was used to analyze the concordance between the best classifier and the radiologist's assessment of CLNM. The "Tests for One ROC Curve" function of PASS 15 (Power Analysis and Sample Size Software, 2017, NCSS, LLC, Kaysville, Utah, USA, ncss.com/software/pass) was used to perform the power calculation.
Dummy variables were grouped according to the 8th American Joint Committee on Cancer (AJCC) staging system (26) and ACR TI-RADS (25), as follows: location (left lobe, right lobe, or isthmus), background (homogeneous or heterogeneous), diameter (T1a = "≤1 cm", T1b = "1-2 cm", T2 = "2-4 cm", ≥T3 = ">4 cm"), margin (smooth, ill-defined, or lobulated/irregular), shape (wider-than-tall or taller-than-wide), composition (cystic, spongiform, mixed, or solid), echogenicity (anechoic, hyper/isoechoic, hypoechoic or very hypoechoic), calcification (none/large comet-tail, macrocalcification, rim calcification, or microcalcification), ETE, and multifocality.
No statistically significant difference was observed among the three cohorts (P >0.05). Cohen's kappa values of the intra- and inter-observer consistency analyses were all >0.8, indicating strong consistency. Baseline epidemiologic and ultrasonic characteristics of the three cohorts are shown in Table 1.
Cohen's kappa value was 0.847, indicating strong concordance between the RF classifier and the radiologist's assessment of CLNM.

The Power Calculation
In the training set, the AUC of the RF classifier was 0.832 (95% CI: 0.807-0.855), so we set AUC0 = 0.83. In the validation and test sets, the AUCs were 0.754 (95% CI: 0.690-0.810) and 0.731 (95% CI: 0.664-0.791), respectively, so we set AUC1 = 0.73-0.75 and α = 0.05, with the false positive rate limited to 0.01-0.20. The results showed that, at a target power of 0.80, the sample sizes of the validation and test sets were 202 and 129, respectively. Therefore, both the validation and test sets were sufficient to evaluate the AUC estimated from the training set.

The Relative Importance of Each Feature Within the RF Algorithm
The diagnostic performance of the RF algorithm was most dependent on the following five top-rank features, according to their mean decrease in Gini: ETE (27.597), age (17.275), T stage (15.058), shape (13.474), and multifocality (12.929) (Figure 4). Together, these features were the most critical factors in the RF algorithm's diagnostic performance based on ultrasound features (Figure 5).
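The "mean decrease in Gini" used for this ranking builds on a simple quantity: how much a single split on a feature reduces the Gini impurity of a tree node. The sketch below illustrates that quantity for one hypothetical split in plain Python; the function names and the toy labels are assumptions. A random forest accumulates such decreases over every split that uses a feature and over all trees, and the scale of the final numbers depends on the software, so the values reported above cannot be reproduced from this toy example.

```python
def gini(labels):
    """Gini impurity of a node holding binary labels (1 = CLNM, 0 = non-CLNM)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # proportion of positive cases in the node
    return 2.0 * p * (1.0 - p)


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Hypothetical node of eight patients split on one feature (e.g., ETE yes/no).
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left = [1, 1, 1, 0]   # feature present
right = [1, 0, 0, 0]  # feature absent
decrease = gini_decrease(parent, left, right)
print(decrease)
```

A feature that separates CLNM from non-CLNM cases more cleanly produces purer child nodes and therefore a larger decrease.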

DISCUSSION
We developed six machine learning models in the current study to differentiate CLNM and non-CLNM in PTC patients based on preoperative ultrasound. There were three significant findings. First, the six machine learning models could distinguish CLNM from non-CLNM based on preoperative ultrasound images to some extent. Second, after comparing the six machine learning models, RF had the best prediction performance. Third, the five most important factors affecting RF's diagnostic performance were ETE, age, T stage, shape, and multifocality.
To date, many studies (27-32) have applied machine learning to the differential diagnosis of benign and malignant thyroid nodules, but only a few have applied machine learning models to predict lymph node metastasis in PTC patients. Some studies (19,33) used deep learning models to diagnose cervical lymph node metastasis in thyroid cancer patients on computed tomography (CT), with the highest AUC reaching 90.4%. However, whether CT should be performed routinely in patients with thyroid cancer remains controversial internationally because of its possible impact on subsequent radioactive iodine treatment (34). Lee et al. (35) used a deep learning-based computer-aided diagnosis system for the localization and diagnosis of metastatic lymph nodes in thyroid cancer patients on ultrasound, with an accuracy of up to 83.0%. However, they did not compare the performance of multiple machine learning models in distinguishing metastatic lymph nodes in patients with thyroid cancer.
By training six popular machine learning classifiers on preoperative ultrasound images to identify which one would best differentiate between CLNM and non-CLNM, we found that the RF algorithm performed best. Compared with single ultrasound features, the RF classifier was more reliable for determining CLNM. RF is a well-known ensemble learning algorithm for classification tasks with an inherent resistance to overfitting. It builds multiple decision trees from random samples of the data set and combines them to improve the final prediction performance. We applied stratified 10-fold cross-validation in the current study: the data were randomly divided into ten parts, and each part (10% of the data) was held out in turn for testing, so that the procedure was repeated ten times. ETE, age, T stage, shape, and multifocality were the five most important factors affecting the CLNM diagnostic performance of RF. In our study, ETE and multifocality were also associated with CLNM, indicating that the tumor was much more aggressive, which was consistent with previous studies (36,37). The age of 55 is considered a watershed age in the TNM staging system of the 8th AJCC (26). Li et al. (11) reported that life expectancy was reduced in patients with thyroid cancer aged ≥45 years (cut-off value determined by the 7th AJCC). CLNM is closely related to the patient's prognosis. In the current study, age ≥55 years was a significant independent risk predictor of CLNM, which was consistent with the literature (38). Diameter has been recognized as an independent risk factor for CLNM in patients with PTC (10,11,36), and our study obtained the same result. This may be because larger tumors tend to be more aggressive and proliferative. A taller-than-wide shape was another specific feature for distinguishing CLNM from non-CLNM in patients with PTC, reflecting that malignant nodules grow across normal tissue planes, whereas benign nodules grow parallel to them (36,39).
This study had five limitations. First, its retrospective design might have introduced selection bias; a prospective multi-center study with a much larger sample size is therefore required in the future. Second, the image evaluation was subject to some variability because four radiologists re-evaluated the ultrasound images; however, all radiologists had rich experience in ultrasound diagnosis of thyroid cancer and followed TI-RADS (25) for image evaluation. Third, only patients diagnosed with PTC for the first time were enrolled; the sonographic features of thyroid bed recurrences might be significantly different, and this is one of our future research directions. Fourth, the ultrasound reporting was not standardized prospectively; ideally, each examination should follow the same procedure and record the same data set. Fifth, external validation is needed in the future to verify the accuracy and reliability of the machine learning models.
In conclusion, our study revealed that machine learning-assisted ultrasound examination yielded a satisfactory performance in diagnosing CLNM in patients with PTC. Among the six machine learning models, RF had the best prediction performance. To our knowledge, this was the first study involving multiple machine learning classifiers specific for CLNM based on ultrasound. We expect that a machine learning model with better performance could help distinguish metastatic lymph nodes on ultrasound and provide a simple method for clinical and surgical decision-making in the future.
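The stratified 10-fold cross-validation mentioned in this paragraph can be sketched as follows. This is a minimal plain-Python illustration, not the procedure as implemented in SPSS Modeler; the function name `stratified_k_folds`, the seed, and the toy label counts are assumptions. Stratification means that each fold keeps roughly the same proportion of CLNM and non-CLNM cases as the full data set.

```python
import random
from collections import defaultdict


def stratified_k_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets an even share of it.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical labels: 30 CLNM (1) and 70 non-CLNM (0) cases.
labels = [1] * 30 + [0] * 70
folds = stratified_k_folds(labels, k=10)
for held_out in folds:
    held_out_set = set(held_out)
    training = [i for i in range(len(labels)) if i not in held_out_set]
    # A model would be fitted on `training` and evaluated on `held_out` here.
print([len(f) for f in folds])
```

Each case is held out exactly once, so the ten evaluations together cover the whole data set without any case being tested by a model that saw it during training.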

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Binzhou Medical University Hospital. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
YZ, YS, JL, and FS contributed to the conception and design of the study and wrote the draft of the manuscript. GC and ML organized the database. YZ performed the statistical analysis. All authors contributed to the article and approved the submitted version.

ACKNOWLEDGMENTS
Thanks to Professor Shuang Xia of Tianjin First Central Hospital for her valuable comments and suggestions on the conception, revision, and submission of this article.