Establishment of an Ultrasound Malignancy Risk Stratification Model for Thyroid Nodules Larger Than 4 cm

Background The incidence and mortality of thyroid cancer, including thyroid nodules > 4 cm, have been increasing in recent years. The current evaluation methods are based mostly on studies of patients with thyroid nodules < 4 cm. The aim of the current study was to establish a risk stratification model to predict risk of malignancy in thyroid nodules > 4 cm.
Methods A total of 279 thyroid nodules > 4 cm in 267 patients were retrospectively analyzed. Nodules were randomly assigned to a training dataset (n = 140) and a validation dataset (n = 139). Multivariable logistic regression analysis was applied to establish a nomogram. The risk stratification of thyroid nodules > 4 cm was established according to the nomogram. The diagnostic performance of the model was evaluated and compared with the American College of Radiology Thyroid Imaging Reporting and Data System (ACR TI-RADS), Kwak TI-RADS and 2015 ATA guidelines using the area under the receiver operating characteristic curve (AUC).
Results The analysis included 279 nodules (267 patients, 50.6 ± 13.2 years): 229 were benign and 50 were malignant. Multivariate regression revealed microcalcification, solid mass, ill-defined border and hypoechogenicity as independent risk factors. Based on the four factors, a risk stratified clinical model was developed for evaluating nodules > 4 cm, which includes three categories: high risk (risk value = 0.8-0.9, with more than 3 factors), intermediate risk (risk value = 0.3-0.7, with 2 factors or microcalcification) and low risk (risk value = 0.1-0.2, with 1 factor except microcalcification). In the validation dataset, the malignancy rate of thyroid nodules > 4 cm that were classified as high risk was 88.9%; as intermediate risk, 35.7%; and as low risk, 6.9%. The new model showed greater AUC than ACR TI-RADS (0.897 vs. 0.855, p = 0.040), but similar sensitivity (61.9% vs. 57.1%, p = 0.480) and specificity (91.5% vs. 93.2%, p = 0.680).
Conclusion Microcalcification, solid mass, ill-defined border and hypoechogenicity on ultrasound may be signs of malignancy in thyroid nodules > 4 cm. A risk stratification model for nodules > 4 cm may show better diagnostic performance than ACR TI-RADS, which may lead to better preoperative decision-making.


INTRODUCTION
Thyroid nodules occur in up to 68% of people in the general population worldwide, and 5-15% of nodules are malignant (1,2). The incidence and mortality of thyroid cancer, including thyroid nodules > 4 cm, have risen in recent years and warrant further research (3). Both the 2017 Thyroid Cancer Staging Manual of the American Joint Committee on Cancer (AJCC) and the 2015 Management Guidelines of the American Thyroid Association (hereafter, ATA) for adult patients with thyroid nodules and differentiated thyroid cancer list a nodule size > 4 cm as an important factor for surgical decision-making, as integrated into the Tumor, Node, Metastasis (TNM) staging system (4,5). Recent guidelines have proposed ultrasound risk stratification systems to assess the malignancy risk of thyroid nodules, including the American College of Radiology Thyroid Imaging Reporting and Data System (ACR TI-RADS) (4,6-9). However, these methods are based mostly on research of thyroid nodules < 4 cm. For example, fine-needle aspiration biopsy (FNAB) is considered to be the gold standard for preoperative diagnosis of thyroid cancer. However, FNAB shows lower sensitivity and higher rates of false negative results in the case of thyroid nodules > 4 cm (10-12). Decisions related to surgery and other treatments may be affected if pre-operative assessment of thyroid nodules is inaccurate or incomplete. Thus, distinguishing malignant from benign nodules pre-operatively would assist in diagnosis and decision-making.
The aim of the current study was to identify factors that predict malignancy in thyroid nodules > 4 cm and construct an applicable risk stratification model.

MATERIAL AND METHODS
Study protocols are shown in Figure 1.

Patients
Consecutive patients with at least one thyroid nodule > 4 cm who underwent thyroidectomy at the Peking Union Medical College Hospital (Beijing, China) between 2010 and 2017 were reviewed retrospectively. The inclusion criteria were as follows: (1) the size of the nodule was > 4 cm in its longest diameter, as determined by ultrasonography; and (2) the nodule had not previously been treated surgically. The exclusion criteria were: (1) pathology results from surgical tissue were unavailable for the patient, or (2) ultrasound images were poor or incomplete. A total of 279 thyroid nodules in 267 patients were included in the study (Figure 1). The patients comprised 185 women aged 60.0 ± 13.2 years and 82 men aged 49.9 ± 13.3 years (Figure 1).
The study was approved by the Institutional Review Board of Peking Union Medical College Hospital. All patients provided informed consent for their clinical data to be published anonymously for research purposes.

Ultrasound Examination
Relevant clinical and ultrasound data of all cases were extracted from the central hospital database. Ultrasound examinations were performed with Philips iU22, GE Logiq 9 or GE Logiq 7 devices equipped with a linear array probe of 8-15 MHz. A convex array probe of 5-12 MHz was used for larger thyroid nodules. Ultrasound images were retrospectively reviewed by two radiologists who had more than 5 years' experience analyzing thyroid ultrasound, and who were blinded to patients' clinical and pathological results. The age and sex of the patients were recorded, as were ultrasound features of each nodule, including size, composition, echogenicity, margin, shape, border, calcification and halo. The two radiologists resolved any inconsistencies in their reviews through discussion.
All thyroid nodules were also evaluated using ACR TI-RADS, Kwak TI-RADS and ATA (4,6,9). According to the ACR TI-RADS, points were given for all ultrasound features in a nodule. Features suggesting malignancy were awarded additional points. The total points determined the nodule's ACR TI-RADS level, which ranged from TR1 (benign) to TR5 (high probability of malignancy) (Table 1).

Statistical Analysis
Data analysis was performed using SPSS 19.0 (IBM, Chicago, IL, USA), with p < 0.05 defined as statistically significant. Nodules were randomly assigned to a training dataset or validation dataset (13). Continuous data were reported as means ± SD, and inter-group differences were assessed for significance using Student's t-test. Differences in categorical data were assessed using the χ² test or Fisher's exact test as appropriate. Categorical variables were classified based on clinical and ultrasound findings. The continuous variable age was transformed into a categorical variable (≥55 or < 55 years) based on a previous report (5).
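The random assignment described above can be illustrated with a minimal Python sketch. The text does not say how the split was generated, so the seed, the function name split_nodules and the placeholder identifiers 1-279 are assumptions made only for illustration; the group sizes (140 and 139) are those reported in the study.

```python
import random


def split_nodules(nodule_ids, n_train, seed=0):
    """Randomly partition nodule identifiers into training and validation lists."""
    shuffled = list(nodule_ids)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# ASSUMPTION: placeholder identifiers; the study's actual nodule IDs are not given.
nodule_ids = range(1, 280)  # 279 nodules
train_ids, valid_ids = split_nodules(nodule_ids, n_train=140, seed=2021)
print(len(train_ids), len(valid_ids))  # 140 139
```

Fixing the seed makes the partition reproducible, which matters when a model is later re-fitted or audited on the same two datasets.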
The variables that were identified as statistically significant prognostic factors were assessed in multivariate logistic regression analysis. A nomogram was constructed based on the results of multivariate analysis and validated using the validation dataset, using the rms package in R 3.6.0. The diagnostic performance of the nomogram was evaluated using the concordance index (C-index) and area under the receiver operating characteristic curve (AUC). Bootstrapping validation (1,000 bootstrap resamples) was used to calculate a relative corrected C-index (14). A calibration curve (1,000 bootstrap resamples) was generated to verify the calibration of the prediction nomogram.
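For a binary outcome such as benign versus malignant, the C-index equals the AUC: the proportion of malignant/benign pairs in which the malignant nodule receives the higher predicted risk. The sketch below shows only that pairwise definition. It is not the rms implementation used in the study and does not perform the bootstrap correction; the function name c_index and the toy scores are hypothetical.

```python
def c_index(scores, labels):
    """Concordance index for a binary outcome (1 = malignant, 0 = benign).

    Counts malignant/benign pairs in which the malignant nodule has the
    higher score; tied scores contribute one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one malignant and one benign nodule")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# ASSUMPTION: toy predicted risks and outcomes, not data from the study.
toy_scores = [0.9, 0.7, 0.4, 0.4, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_c = c_index(toy_scores, toy_labels)
print(f"C-index: {toy_c:.3f}")
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of malignant from benign nodules.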
A model for risk stratification of thyroid nodules > 4 cm was established according to the nomogram. Nodules were classified as high, intermediate, or low risk. The cut-off values for the three-level risk stratification were determined according to AUC, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Accuracy was calculated according to the cut-off value.
Similarly, the diagnostic performance of the ACR TI-RADS was evaluated in terms of AUC, sensitivity, specificity, PPV, NPV and accuracy. The results were compared between this reference standard and proposed model.
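The cut-off-based metrics named above all derive from a single 2x2 table of model prediction against pathology. The following sketch shows the arithmetic. The counts are an assumption: they were chosen to be consistent with the sensitivity (61.9%) and specificity (91.5%) reported for the 139 validation nodules, but the underlying table is not given in the text.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return summary metrics for one 2x2 table at a fixed cut-off.

    tp/fn: malignant nodules classified positive/negative;
    fp/tn: benign nodules classified positive/negative.
    Assumes every denominator is non-zero.
    """
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / total,
    }


# ASSUMPTION: illustrative counts (13 + 10 + 8 + 108 = 139), not reported data.
metrics = diagnostic_metrics(tp=13, fp=10, fn=8, tn=108)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Sensitivity and specificity depend only on the classifier, whereas PPV and NPV also depend on how common malignancy is in the dataset, so the latter two are not directly transferable to populations with a different malignancy rate.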

Model Construction and Validation
In the training dataset, 29 nodules (19.3%) were malignant (Table 2). We performed univariate logistic regression analysis using age, sex, composition, echogenicity, border, margin, shape, calcification, and halo. All variables except age and shape were identified as statistically significant risk factors (Table 2). These risk factors were then included in the multivariable analysis.
A nomogram that integrated all four significant independent factors was constructed (Figure 2). The model showed a C-index of 0.833 (95% CI 0.752-0.915) for predicting malignancy in the training dataset (Figure 3A) and 0.897 (95% CI 0.835-0.9591) for predicting malignancy in the validation dataset (Figure 3C). Calibration curves for the probability of malignancy showed a good correlation between the nomogram-predicted and observed values (Figures 3B, D).
The risk value of each factor was calculated using the nomogram. The risk value was 0.34 for microcalcification (100 points), 0.21 each for solid composition and ill-defined border (70 points each), and 0.14 for hypoechogenicity (50 points). Using the model, all nodules were assigned to one of three risk categories (Table 4): nodules with more than 3 factors were classified as high risk (0.8-0.9); nodules with 2 factors or microcalcification, as intermediate risk (0.3-0.7); and nodules with 1 factor except microcalcification, as low risk (0.1-0.2).
In the training dataset, the malignancy rate of thyroid nodules > 4 cm that were classified as high risk was 90.0%; intermediate risk, 42.1%; and low risk, 10.8%. The corresponding malignancy rates in the validation dataset were 88.9%, 35.7%, and 6.9% (Table 4). The risk stratification of the model differed significantly from that of ACR TI-RADS (p < 0.001, Table 5).
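The scoring described above can be sketched in Python as follows. The per-factor points and risk values are those quoted in the text. Two things are assumptions: that the per-factor risk values simply add (the text does not state this, and a nomogram maps total points to probability non-linearly), and that the category bands can be read as the quoted ranges rounded to one decimal place. The names FACTORS, score_nodule and risk_category are hypothetical.

```python
# Per-factor nomogram points and risk values (in hundredths) quoted in the text.
FACTORS = {
    "microcalcification": (100, 34),
    "solid": (70, 21),
    "ill_defined_border": (70, 21),
    "hypoechogenicity": (50, 14),
}


def score_nodule(present):
    """Return (total points, summed risk in hundredths) for the factors present."""
    present = set(present)
    unknown = present - set(FACTORS)
    if unknown:
        raise ValueError(f"unknown factors: {sorted(unknown)}")
    points = sum(FACTORS[f][0] for f in present)
    # ASSUMPTION: per-factor risk values are treated as additive.
    risk = sum(FACTORS[f][1] for f in present)
    return points, risk


def risk_category(risk):
    """Map a summed risk (hundredths) to a category.

    ASSUMPTION: bands follow the quoted ranges after rounding to one decimal:
    0.1-0.2 low, 0.3-0.7 intermediate, 0.8-0.9 high.
    """
    if risk == 0:
        return None  # no risk factor present; not covered by the three categories
    if risk < 25:
        return "low"
    if risk < 75:
        return "intermediate"
    return "high"


all_four = score_nodule(FACTORS)
print(all_four, risk_category(all_four[1]))
micro_only = score_nodule(["microcalcification"])
print(micro_only, risk_category(micro_only[1]))
```

Under these assumptions, microcalcification alone already falls in the intermediate band, while any other single factor falls in the low band, which matches the category definitions given in the text.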

Diagnostic Efficiency of the Model
Receiver operating characteristic curves demonstrated that the best cut-off value of the model was intermediate risk. In the training dataset, the model had a sensitivity of 58.8%, specificity of 84.6%, NPV of 93.7%, PPV of 34.5%, accuracy of 81.4% and AUC of 0.833 (95% CI 0.752-0.915) (Table 6). The AUC of the nomogram was higher than that of ACR TI-RADS (0.823, 95% CI 0.750-0.882, p = 0.011). However, the model was similar to ACR TI-RADS in sensitivity (58.8% vs. 58.8%, p = 0.181) and

DISCUSSION
In the present study, we established a model to predict the risk of malignancy for thyroid nodules > 4 cm. This model incorporated four factors that are relatively easy to determine from conventional ultrasound imaging of nodules: composition, echogenicity, border, and calcification. We observed that the model achieved satisfactory diagnostic performance in both the training and validation datasets. Furthermore, the proposed model predicted malignancy better than ACR TI-RADS in both datasets, although it showed similar specificity and sensitivity to the reference standard. Thyroid nodules > 4 cm thus appear to have their own ultrasound risk stratification. In our study, multivariate regression revealed the following independent risk factors for thyroid cancer: microcalcification (OR 8.37, 95% CI 1.641-42.724), solid composition (OR 1.49, 95% CI 0.391-2.566), ill-defined border (OR 4.40, 95% CI 1.074-18.031) and hypoechogenicity (OR 2.94, 95% CI 1.031-8.389). Our findings were consistent with the three suspicious signs of malignancy (microcalcification, solid composition and hypoechogenicity) in ACR TI-RADS and Kwak TI-RADS, which supports the effectiveness of the existing guidelines. These factors were combined to develop a nomogram, which could be a useful and convenient tool in clinical practice for evaluating the malignancy risk of thyroid nodules > 4 cm. The model showed a C-index of 0.833 (95% CI 0.752-0.915) for predicting malignancy in the training dataset and 0.897 (95% CI 0.835-0.9591) for predicting malignancy in the validation dataset. Calibration curves showed good agreement between nomogram-predicted and observed probabilities of malignancy, supporting the model's predictive capacity in the validation cohort. Using the model, all nodules were assigned to one of three risk categories: high risk (0.8-0.9), intermediate risk (0.3-0.7) and low risk (0.1-0.2). In the training dataset, the malignancy rate of thyroid nodules > 4 cm that were classified as high risk was 90.0%; intermediate risk, 42.1%; and low risk, 10.8%. The corresponding malignancy rates in the validation dataset were 88.9%, 35.7%, and 6.9%.
The risk stratification of the model for nodules > 4 cm differed from that of ACR TI-RADS (p < 0.001), indicating that thyroid nodules > 4 cm have distinct ultrasound risk stratification characteristics.
The proposed model may be convenient to implement in the clinic. Of the various factors linked to nodule malignancy, including solid composition, hypoechogenicity, microcalcification, taller-than-wide shape and irregular/lobulated margin (4,6,9), our model identified only four independent risk factors: solid composition, hypoechogenicity, microcalcification, and ill-defined border. This suggests that fewer ultrasound features can still provide reliable predictions of malignancy. Taller-than-wide shape is considered an insensitive but highly specific indicator of malignancy (9,15-18), especially in sub-centimeter thyroid nodules (15,16), and it is assigned more points (3 points) in ACR TI-RADS to reflect an association with malignancy (6). However, none of the thyroid nodules in either the training or validation datasets had a taller-than-wide shape, which may indicate that shape has no diagnostic value for thyroid nodules > 4 cm. This may reflect that ultrasonography assesses shape less accurately as nodule size increases. Conversely, our study identified ill-defined border as a predictor of malignancy, similar to a previous study (19), but this factor is assigned 0 points in ACR TI-RADS. Our model and associated nomogram may be clinically easier to use than ACR TI-RADS and more accurate for thyroid nodules > 4 cm. The model for nodules > 4 cm showed better diagnostic efficiency than ACR TI-RADS, Kwak TI-RADS and ATA.
According to Kim's meta-analysis (20), the overall diagnostic performance of the three risk stratification systems from representative society guidelines (ACR TI-RADS, Kwak TI-RADS and ATA) was comparable. In this study, the AUC did not differ significantly among ACR TI-RADS, Kwak TI-RADS and ATA in either dataset (p > 0.05). The AUC also did not differ significantly (p > 0.05) among the model for nodules > 4 cm, Kwak TI-RADS and ATA, similar to Shen's study (21).
The ATA guidelines cannot cover all nodules. For example, in this study, two hyperechoic thyroid nodules with microcalcification did not fit any ATA risk category. ACR-convened committees developed a set of standard terms (lexicon) for ultrasound reporting and proposed a TI-RADS based on the lexicon. All nodules can be scored in ACR TI-RADS. In addition, ACR TI-RADS showed the lowest rate of unnecessary FNAB and the highest rate of malignancy in FNAB (22-24). The model for nodules > 4 cm was therefore compared mainly with ACR TI-RADS. The AUC of the model was higher than that of ACR TI-RADS in both the training and validation datasets, and the difference was statistically significant (p < 0.05). The model was similar to ACR TI-RADS in sensitivity (61.9% vs. 57.1%, p = 0.480) and specificity (91.5% vs. 93.2%, p = 0.680), which were higher than in Ha's study (25). This predictive model uses fewer indicators to assess the malignancy risk of thyroid nodules larger than 4 cm, and its diagnostic efficiency was consistent with that of ACR TI-RADS.
One benign thyroid nodule predicted by our model to be malignant and classified as TR5 in ACR TI-RADS illustrates the shortfalls of both systems. Ultrasonography showed that the nodule was solid, hypoechoic, and microcalcified, and that it had irregular margins and an ill-defined border. The nodule received 290 points, and malignant risk was > 0.95 according to the model. Pathology analysis of surgical samples indicated Riedel's thyroiditis, a rare inflammatory process involving thyroid and surrounding cervical tissues that is associated with systemic fibrosis (26). The nodules associated with this condition show nonspecific ultrasound features and so are often misdiagnosed (27,28).
This study had limitations. First, it was a retrospective study, so confounding factors could not be controlled. Second, all patients underwent thyroidectomy, which may have led to selection bias. The clinical utility of this proposed model as a preoperative decision-making tool should be explored in prospective studies.

CONCLUSION
Thyroid nodules > 4 cm merit a unique ultrasonic risk stratification, and the model proposed here may outperform ACR TI-RADS. The model should be tested in large prospective studies for its ability to guide preoperative decisions.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding author.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Institutional Review Board of Peking Union Medical College Hospital. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
BZ conceived and designed the study. All the other authors collected the data. XY and SZ performed the analysis. XX prepared all the figures and tables. XX, YW, and LG were major contributors in writing the manuscript. BZ edited the manuscript. All authors contributed to the article and approved the submitted version.