Nomogram Combining Radiomics With the American College of Radiology Thyroid Imaging Reporting and Data System Can Improve Predictive Performance for Malignant Thyroid Nodules

Purpose: To develop and validate a nomogram combining radiomics of B-mode ultrasound (BMUS) images and the American College of Radiology (ACR) Thyroid Imaging Reporting and Data System (TI-RADS) for predicting malignant thyroid nodules and improving the performance of the guideline. Method: A total of 451 thyroid nodules referred for surgery and proven pathologically at an academic referral center from January 2019 to September 2020 were retrospectively collected and randomly assigned to training and validation cohorts (7:3 ratio). A nomogram was developed by combining the BMUS radiomics score (Rad-Score) with the ACR TI-RADS score (ACR-Score) in the training cohort; the performance of the nomogram was assessed with respect to discrimination, calibration, and clinical application in the validation and entire cohorts. Results: The ACR-Rad nomogram showed good calibration and yielded an AUC of 0.877 (95% CI 0.836–0.919) in the training cohort and 0.864 (95% CI 0.799–0.931) in the validation cohort, which were significantly better than those of the ACR-Score model (p < 0.001 and 0.031, respectively). Significantly improved AUC, net reclassification improvement (NRI), and integrated discrimination improvement (IDI) of the nomogram were found for both senior and junior radiologists (all p < 0.001). Decision curve analysis indicated that the nomogram was clinically useful. When cutoff values for 50% predicted malignancy risk (ACR-Rad_50%) were applied, the nomogram showed increased specificity, accuracy, and positive predictive value (PPV), and decreased unnecessary fine-needle aspiration (FNA) rates in comparison to ACR TI-RADS. Conclusion: The ACR-Rad nomogram has favorable value in predicting malignant thyroid nodules and improving the performance of the ACR TI-RADS for senior and junior radiologists.


INTRODUCTION
With the increasing number of imaging-detected thyroid nodules, overdiagnosis and overtreatment are major clinical challenges in the management of these nodules; therefore, an accurate and practical risk stratification tool is necessary (1). Because B-mode ultrasound (BMUS) is the most accurate imaging modality for assessing thyroid nodules, authoritative associations have formulated a number of risk classification systems based on BMUS images (2)(3)(4)(5). Previous studies have compared different guidelines to find the management guideline that is most beneficial to patients and demonstrated that the 2017 American College of Radiology (ACR) Thyroid Imaging Reporting and Data System (TI-RADS) showed accurate diagnostic performance and a meaningful reduction in the number of thyroid nodules recommended for biopsy (6)(7)(8). However, the relatively low specificity and positive predictive value (PPV) of the ACR guidelines, together with their interobserver variability, are impediments to achieving the desired clinical results (7,9).
Radiomics enables high-throughput mining of quantitative image features and can reveal information reflecting the underlying pathophysiology that cannot be assessed by visual interpretation (10,11). In recent years, radiomics has been applied to the thyroid, where it has helped predict malignancy in thyroid nodules and preoperative cervical lymph node staging in papillary thyroid carcinoma (12)(13)(14). However, radiomics features are usually extracted from a single-section image of the target nodule; therefore, radiomics alone might lose some important BMUS information, which makes it difficult to significantly improve the performance of risk stratification systems for radiologists at all proficiency levels (15).
A nomogram is a graphical tool that concisely and intuitively displays the predicted value of an individual outcome event based on multivariate regression analysis. We hypothesized that a nomogram could adequately combine the visual interpretation and radiomics of BMUS images to achieve better predictive performance. To the best of our knowledge, no published study has investigated the predictive performance of a nomogram combining radiomics with ACR TI-RADS scores for predicting malignant thyroid nodules.
Therefore, the purpose of our study was to develop and validate a nomogram that combines the radiomics score (Rad-Score) and ACR TI-RADS score (ACR-Score) for predicting malignant thyroid nodules and improving the performance of the ACR TI-RADS.

MATERIALS AND METHODS
Ethical approval and informed consent were waived because this retrospective study used de-identified data and required no protected health information. The study was conducted in accordance with the Declaration of Helsinki.

Patients
Between January 2019 and September 2020, patients with thyroid nodules (≥10 mm in maximum diameter) in the Head and Neck Otolaryngology Department of our institution were consecutively included. The nodules were enrolled using the following inclusion and exclusion criteria.
The inclusion criteria were as follows: 1) the target nodule had undergone surgical resection; 2) postoperative pathological results were obtained; and 3) BMUS was performed within 2 weeks before the resection. The exclusion criteria were as follows: 1) the pathological result of the nodule was ambiguous; 2) interventional procedures such as fine-needle aspiration (FNA) and radiofrequency ablation were performed before BMUS; and 3) the BMUS image of the target nodule was unclear.
A total of 451 patients (median age, 45 years, range, 20 to 81 years; 93 men and 358 women) were enrolled. If there were multiple nodules in one patient, the nodule with the largest diameter was selected as the target nodule. All nodules were randomly split into a training cohort (n = 315, median age, 45 years, range 20 to 81 years; 68 men and 247 women) and a validation cohort (n = 136, median age, 43 years, range 21 to 70 years; 25 men and 111 women) in a 7:3 ratio.
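The 7:3 random split described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `split_cohort`, the fixed seed, and the floor rounding of the training size are assumptions chosen so that 451 nodules yield the reported 315/136 partition.

```python
import random

def split_cohort(nodule_ids, train_fraction=0.7, seed=0):
    """Shuffle nodule IDs and split them into training and validation lists."""
    ids = list(nodule_ids)
    rng = random.Random(seed)  # fixed seed (assumption) for a reproducible split
    rng.shuffle(ids)
    n_train = int(train_fraction * len(ids))  # floor rounding (assumption)
    return ids[:n_train], ids[n_train:]

train_ids, val_ids = split_cohort(range(451))
print(len(train_ids), len(val_ids))  # 315 136
```

A purely random split does not guarantee equal malignancy proportions in the two cohorts, which is why the cohorts are compared statistically afterwards.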

Clinical and BMUS Information
Clinicopathological data, including age, sex, and nodule pathology, were obtained from medical records. BMUS images were acquired with a Philips iU Elite and Philips EPIQ7 (ultrasound system, Philips Medical System, Bothell, WA, USA) using a 5-12-MHz linear transducer by two radiologists (PX and ZW) with more than 8 years of experience. Images of each target nodule were obtained in transverse and longitudinal planes, and video clips were obtained in at least one plane.

Analysis of the ACR TI-RADS
Two radiologists (AZ and XH, with more than 10 years and 3 years of experience, respectively) who were unaware of the pathological results reviewed the BMUS images of all nodules. The five feature categories in the ACR TI-RADS lexicon (composition, echogenicity, shape, margin, and echogenic foci) were evaluated, and the ACR-Score of each nodule was calculated (referred to as ACR-Score 1 for AZ and ACR-Score 2 for XH) (16). The detailed process of calculating the ACR-Score is presented in the Supplement (Supplementary Tables E1 and E2). The diameter and location (subcapsular or intrathyroidal) of each nodule were determined by consensus between the two radiologists.
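For orientation, the way a summed ACR-Score maps to a TI-RADS level and an FNA recommendation can be sketched as below. The thresholds follow the published 2017 ACR TI-RADS chart (TR3 at 2.5 cm or larger, TR4 at 1.5 cm or larger, TR5 at 1.0 cm or larger); the function names are hypothetical, the handling of a 1-point total is an assumption, and the scoring details actually used in this study are those in the Supplement.

```python
# FNA size thresholds in millimetres for TR3, TR4, and TR5
# (2017 ACR TI-RADS chart); TR1 and TR2 carry no FNA recommendation.
FNA_SIZE_THRESHOLD_MM = {3: 25, 4: 15, 5: 10}

def tr_level(points):
    """Map a summed ACR TI-RADS point total to a TR level (1-5)."""
    if points >= 7:
        return 5
    if points >= 4:
        return 4
    if points == 3:
        return 3
    if points == 2:
        return 2
    return 1  # 0 points; a total of 1 is grouped with TR1 here (assumption)

def fna_recommended(points, max_diameter_mm):
    """Return True if the guideline would recommend FNA for this nodule."""
    threshold = FNA_SIZE_THRESHOLD_MM.get(tr_level(points))
    return threshold is not None and max_diameter_mm >= threshold
```

For example, a 12 mm nodule with 5 points (TR4) would not reach the 15 mm FNA threshold, whereas a 10 mm nodule with 7 points (TR5) would.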

Analysis of the Radiomics Features
The region of interest (ROI) was delineated manually on the BMUS DICOM image of the target nodule with the largest diameter in sagittal view using open-source software (3D Slicer, version 4.10.2; https://www.slicer.org) (Supplementary Figure E1) (17,18). The intra- and interobserver reproducibility of the radiomics features was measured using the first 130 nodules, which a radiologist (XH) redelineated twice within 2 weeks. The intraclass correlation coefficient (ICC) was used to evaluate the intra- and interobserver agreement; an ICC > 0.75 represented satisfactory agreement. XH delineated the remaining nodules if strong agreement (ICC > 0.90) was achieved. To ensure repeatability of the results, resampling and z-score normalization were performed as preprocessing steps (Supplementary E1). Open-source software (Pyradiomics; http://pyradiomics.readthedocs.io/en/latest/index.html) (19) was used to extract a total of 837 texture, intensity, and wavelet features (Supplementary Table E3). Then, dimensionality reduction and radiomics feature selection were performed successively by ICC, univariate, least absolute shrinkage and selection operator (LASSO), and linear dependence analyses (20,21). The methodology used to extract the radiomics features is further described in Supplementary E2 and Figure 1A. The radiomics score (Rad-Score) was generated using a linear combination of the selected features.
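The last step above, a Rad-Score formed as a linear combination of selected features, can be illustrated with a small sketch. Everything concrete here is hypothetical: the intercept, the weights, and the toy feature values are placeholders, and applying z-score normalization to feature columns (rather than to the images) is an assumption made for the example.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardize each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]  # assumes no constant column (sd > 0)
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]

def rad_score(features, intercept, weights):
    """Linear combination of selected features: intercept + sum(w_i * x_i)."""
    return intercept + sum(w * x for w, x in zip(weights, features))

# Toy data: 3 nodules x 2 features (hypothetical values).
raw = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
standardized = zscore_columns(raw)

# Hypothetical intercept and LASSO-style weights, for illustration only.
scores = [rad_score(row, intercept=0.5, weights=[2.0, 1.0]) for row in standardized]
```

In the study itself, the weights would come from the LASSO fit on the training cohort and the final formula contains the 10 retained features.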

Statistical Analysis
Statistical analysis was conducted with IBM SPSS 22.0 software (IBM, New York, USA) and R software (Version 4.0.1, https://www.r-project.org/). The R packages used are provided in Supplementary Table E4. The Shapiro-Wilk test was used to evaluate the normality of the distribution. Continuous data conforming to a normal distribution are expressed as the mean ± standard deviation (SD) and were compared using Student's t-test; nonconforming data are expressed as the median [interquartile range (IQR)] and were compared using the Mann-Whitney U test. Categorical data are expressed as numbers (%) and were compared using a chi-square test or Fisher's exact test, as appropriate. A p value < 0.05 was considered statistically significant.

Development of the ACR-Rad Nomogram
The ACR-Rad nomogram was developed based on the Rad-Score and the average of ACR-Score 1 and ACR-Score 2. For comparison, the ACR-Score model was built using univariate logistic regression.

Performance of the ACR-Rad Nomogram
Calibration was evaluated using the Akaike information criterion (AIC), the Bayesian information criterion (BIC), the calibration curve, and the Hosmer-Lemeshow test (22). Discrimination performance was evaluated using the area under the receiver operating characteristic (ROC) curve (AUC). The DeLong test was used to compare AUCs, and the likelihood ratio (LR) test was used to compare the nested logistic regression models.
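As a reminder of what the AUC measures, it equals the probability that a randomly chosen malignant nodule receives a higher predicted score than a randomly chosen benign one, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name and the toy data are illustrative, and the DeLong comparison itself is not implemented here.

```python
def auc(scores, labels):
    """AUC as the fraction of (malignant, benign) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Each correctly ordered pair counts 1, each tied pair counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: labels use 1 = malignant, 0 = benign.
print(auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))  # 0.75
```

The pairwise form is quadratic in the number of nodules, which is acceptable for a cohort of a few hundred cases.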

Clinical Utility of the ACR-Rad Nomogram
The interobserver agreement for the ACR-Score, the Rad-Score, and the malignancy risk predicted by the ACR-Rad nomogram was evaluated. The improvement in the predictive accuracy of the nomogram for radiologists with different levels of experience was evaluated by the AUC, integrated discrimination improvement (IDI), and net reclassification improvement (NRI). Decision curve analysis (DCA) was conducted to determine the clinical usefulness of the nomogram by quantifying the net benefits at different threshold probabilities in the entire cohort.
For clinical management, we compared the biopsy recommendations of the nomogram with those of ACR TI-RADS in the entire cohort. Two types of nomogram cutoff values, both determined in the training cohort, were used: the cutoff with the maximum Youden index (referred to as ACR-Rad_max) and the cutoffs for prespecified predicted risks of malignancy of 20%, 30%, 40%, and 50% (referred to as ACR-Rad_20%/30%/40%/50%, respectively).
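The net benefit that DCA plots at each threshold probability follows the standard definition: true positives per patient minus false positives per patient weighted by the odds of the threshold. A minimal sketch, with a hypothetical function name and toy data:

```python
def net_benefit(probs, labels, threshold):
    """Net benefit = TP/n - FP/n * pt / (1 - pt) at threshold probability pt."""
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

# Toy predicted risks and outcomes (1 = malignant, 0 = benign).
probs = [0.9, 0.6, 0.4, 0.1]
labels = [1, 0, 1, 0]

model_nb = net_benefit(probs, labels, threshold=0.2)
# "Treat all" is the same formula with every nodule flagged as positive.
treat_all_nb = net_benefit([1.0] * len(labels), labels, threshold=0.2)
# "Treat none" has a net benefit of 0 by definition.
```

A model is considered useful at a given threshold when its net benefit exceeds both the treat-all and the treat-none values.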
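The cutoff with the maximum Youden index (J = sensitivity + specificity - 1) can be found by scanning candidate thresholds, as in the sketch below. The function name and the toy data are illustrative, and treating a predicted risk at or above the cutoff as positive is an assumption.

```python
def youden_cutoff(probs, labels):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cut, best_j = None, -1.0
    for cut in sorted(set(probs)):
        # A nodule is called positive when its predicted risk is >= cut.
        tp = sum(1 for p, y in zip(probs, labels) if p >= cut and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < cut and y == 0)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:  # on ties, the lowest cutoff is kept
            best_cut, best_j = cut, j
    return best_cut, best_j

# Toy predicted risks (1 = malignant, 0 = benign).
probs = [0.9, 0.8, 0.7, 0.35, 0.25, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0]
cutoff, j = youden_cutoff(probs, labels)
```

The risk-based cutoffs (ACR-Rad_20% to ACR-Rad_50%) need no search: they simply apply the stated predicted probability as the threshold.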

RESULTS
Patient Characteristics
The study flowchart is shown in Figure 1B. Clinical and pathological characteristics in the training and validation cohorts are summarized in Tables 1 and 2. There was no significant difference between the training and validation cohorts for clinicopathological and BMUS characteristics (all p > 0.05). The proportions of malignant nodules in the two groups were 62.9% (198/315) and 70.6% (96/136) (p = 0.114). In both the training and validation cohorts, patients with malignant nodules were significantly younger, and malignant nodules had a significantly smaller diameter, a significantly lower frequency of nodular goiter, and significantly higher ACR-Score 1 and ACR-Score 2 than benign nodules (all p < 0.05).

Selecting Radiomics Features and Building the Rad-Score
The rates of intra- and interobserver agreement for the radiomics features reached 94.7% (794/837; mean ICC = 0.920) and 94.0% (787/837; mean ICC = 0.901), respectively (Supplementary Figure E2). Seventy-two features were excluded due to unsatisfactory agreement (ICC < 0.75); 192 features were excluded due to insignificant differences based on univariate analysis. Among the 14 features selected by LASSO, 4 were considered to have strong collinearity because their variance inflation factor (VIF) exceeded 10 (Supplementary Figure E3). The remaining 10 features were included in the Rad-Score formula (Supplementary Figure E4). The Rad-Score of malignant nodules was significantly higher than that of benign nodules in the training [1.265 (0.738-1.900) vs. Table 2). The Rad-Score yielded an AUC of 0.801 (95% CI 0.750-0.851) in the training cohort and 0.820 (95% CI 0.742-0.898) in the validation cohort (Figure 2).

Development and Performance of the ACR-Rad Nomogram
The ACR-Rad nomogram incorporated two predictors: the average ACR-Score [odds ratio (OR) 1.644, 95% CI 1.423-1.928] and the Rad-Score (OR 2.269, 95% CI 1.709-3.133) (both p < 0.001) (Figure 3A). The ACR-Score model was built using a univariate logistic regression equation. The LR test between the ACR-Rad nomogram and the ACR-Score model yielded χ² = 4.184 (p < 0.001). The AIC, BIC, calibration curve, and Hosmer-Lemeshow test statistic (p = 0.640) showed good calibration of the ACR-Rad nomogram in the training cohort (Table 3 and Figure 3B). An AUC of 0.877 (95% CI 0.836-0.919) also showed good discrimination, which was significantly higher than that of the ACR-Score model (0.833, 95% CI 0.785-0.880) in the training cohort (p < 0.001). The favorable calibration of the nomogram was confirmed in the validation cohort, in which the Hosmer-Lemeshow test yielded a p value of 0.736 (Figure 3C). The AUC (0.864, 95% CI 0.799-0.931) was significantly higher than that of the ACR-Score model (0.802, 95% CI 0.719-0.886) in the validation cohort (Figures 4A, B).
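The reported odds ratios can be read back as logistic regression coefficients, because an OR is the exponential of the coefficient per one-unit increase in the predictor. The form below is a sketch of the implied model rather than the authors' published equation: p denotes the predicted probability of malignancy, the overlined term is the average ACR-Score, and the intercept β₀ is not reported in this section and is left symbolic.

```latex
\ln\frac{p}{1-p}
  = \beta_0
  + \beta_{\mathrm{ACR}}\,\overline{\mathrm{ACR}}
  + \beta_{\mathrm{Rad}}\,\mathrm{Rad},
\qquad
\beta_{\mathrm{ACR}} = \ln 1.644 \approx 0.497,
\quad
\beta_{\mathrm{Rad}} = \ln 2.269 \approx 0.819
```

Under this reading, each additional point of the average ACR-Score multiplies the odds of malignancy by 1.644, and each one-unit increase in the Rad-Score multiplies them by 2.269, with the other predictor held fixed.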

Clinical Utility of the ACR-Rad Nomogram
The ICC of the ACR-Score (0.677) was considerably lower than that of the Rad-Score and the predicted malignancy risk (0.901 and 0.844, respectively). For senior and junior radiologists, the ACR-Rad nomogram significantly improved the prediction of malignant thyroid nodules in terms of AUC, NRI, and IDI compared to the ACR-Score model in the entire cohort (all p < 0.001) (Table 4 and Figure 4C). Moreover, favorable calibration of the nomogram was confirmed for both radiologists (Hosmer-Lemeshow test p = 0.715 and 0.415, respectively) (Supplementary Figure E5). The DCA demonstrated that the nomogram had a higher overall net benefit than the ACR-Score model and was more beneficial than either the treat-all or the treat-none strategy (Figure 5).
With ACR-Rad_max, specificity, accuracy, and PPV significantly increased and unnecessary FNA rates significantly decreased in comparison to ACR TI-RADS for both senior and junior radiologists, but at the expense of significantly decreased sensitivity. With ACR-Rad_20%/30%/40%, the improvement in specificity for the senior radiologist was not significant. With ACR-Rad_50%, significantly increased specificity, accuracy, and PPV and a significantly decreased unnecessary FNA rate were observed for the junior radiologist, and significantly increased specificity was observed for the senior radiologist as well (all p < 0.05), with no difference in sensitivity or negative predictive value (NPV) (p > 0.05) (Table 5).
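The diagnostic measures compared above all derive from a 2x2 table of biopsy recommendation against pathology. A minimal sketch with a hypothetical function name and made-up counts follows; the unnecessary FNA rate is omitted because its exact definition is not stated in this section.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard 2x2 diagnostic measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # malignant nodules correctly flagged
        "specificity": tn / (tn + fp),  # benign nodules correctly spared
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),          # flagged nodules that are malignant
        "npv": tn / (tn + fn),          # spared nodules that are benign
    }

# Made-up counts for illustration only.
metrics = diagnostic_metrics(tp=8, fp=2, tn=6, fn=4)
```

Raising the cutoff typically trades sensitivity for specificity and PPV, which is the pattern reported for ACR-Rad_max.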

DISCUSSION
In this study, we showed that BMUS radiomics and the ACR-Rad nomogram based on it and ACR TI-RADS can accurately predict malignancy in thyroid nodules, and the nomogram showed significantly better discrimination and calibration performance than the guideline alone. Excellent repeatability and clinical applicability of the nomogram were demonstrated in the entire cohort. With the 50% risk cutoff, the nomogram increased the specificity, accuracy, and PPV and decreased the unnecessary FNA rates of ACR TI-RADS for radiologists of different proficiency levels.
Predicting malignant thyroid nodules and reducing the number of unnecessary biopsies are the original intentions of many guidelines, and the ACR TI-RADS can meet them. The ACR guideline assigns points to the five feature categories of BMUS. The sum of the points is used to determine the probability of malignancy and the recommended management (16). However, the clinical application of the ACR guideline is strongly subjective (6,13,14). Hoang et al. (6) found that when the judgment of composition was wrong, malignant nodules would be misclassified. Although the ACR risk stratification system is fault-tolerant, in our study, the interobserver agreement was unsatisfactory (ICC = 0.677) and lower than that of the Rad-Score and the risk predicted by the nomogram (ICC = 0.901 and 0.844, respectively). This may be due to the weaker judgment of the junior radiologist in scoring the spongiform, very hypoechoic, and ill-defined features.

Table footnotes: Qualitative data were expressed as mean ± standard deviation or number and percentages (%); quantitative data were expressed as median (25%-75% quantiles). ACR, American College of Radiology; TI-RADS, Thyroid Imaging Reporting and Data System. a Nodules could have more than one type of echogenic foci. b B-mode ultrasound findings based on the senior interpretation. c ACR-Score 1 refers to the senior radiologist, ACR-Score 2 to the junior radiologist.

Huang et al. ACR-Rad Nomogram Predicted Thyroid Cancer
Frontiers in Oncology | www.frontiersin.org October 2021 | Volume 11 | Article 737847
Furthermore, the specificity of the ACR TI-RADS is low (38.85%-57.96% in our study). Previous studies have reported that combining clinical characteristics (such as age, thyrotropin, or sex) with ultrasound features (such as ACR TI-RADS lexicon, hypoechoic halo, or blood flow) slightly increased the accuracy in discriminating malignant from benign nodules compared with risk stratification systems (23,24). However, the abovementioned clinical characteristics were not significantly different in the study of Liang et al. (13). In our study, there was no significant difference in sex. Moreover, other subjective ultrasound features might contribute little to solving the current challenges.
With the recent development of radiomics, its application in predicting the malignancy of thyroid nodules has received attention. Previous studies reported that radiomics showed good performance in predicting thyroid cancer, which was even higher than that of the risk classification guidelines as interpreted by non-experts (13,25,26). In our study, both the ACR-Score and Rad-Score were independent predictive factors of malignant nodules, and the Rad-Score had favorable diagnostic performance for malignant nodules. However, radiomics alone cannot improve the performance of the ACR TI-RADS for senior radiologists, who are experienced in comprehensively evaluating the ultrasound features correlated with the properties of the nodules (15). Park et al. (14) demonstrated that when a 5% predicted malignancy risk cutoff of radiomics was combined with the ACR or American Thyroid Association (ATA) guidelines, the performance significantly increased and unnecessary FNA rates decreased; consequently, combining radiomics with ultrasound-based risk stratification systems is a potential approach to predicting malignant thyroid nodules. Luo et al. (27) constructed a nomogram including the Rad-Score and feature categories of the ACR TI-RADS and determined that the combination model was better than radiomics and the ACR TI-RADS alone for discriminating benign and malignant thyroid nodules. In our study, the ACR-Rad nomogram could be a more convenient tool to combine the ACR-Score and Rad-Score and a better predictive model for thyroid cancer. For senior and junior radiologists, the nomogram had significantly improved predictive performance in comparison with the ACR TI-RADS. The radiomics model has not been sufficiently evaluated by prior studies, which were limited to comparisons of discrimination performance or only had senior radiologists assigned to score and delineate nodules (28,29).
Our study evaluated the repeatability, discrimination, and clinical utility of the ACR-Rad nomogram as applied by senior and junior radiologists, showing that nodule texture information was processed with strong consistency and that the nomogram significantly increased the predictive performance among radiologists of different proficiency levels, which can compensate for the relatively low repeatability and accuracy of ACR TI-RADS. Moreover, an appropriate cutoff of the ACR-Rad nomogram can significantly reduce unnecessary FNA rates, increase the specificity and PPV, and maintain the high sensitivity of the guideline, especially for junior radiologists.
Our study had several limitations. First, this was a single-center retrospective study; thus, selection bias may be inevitable. The proportion of benign nodules in this study was much lower than that in other studies (13)(14)(15) because we chose nodules with postoperative pathology instead of FNA or follow-up. Second, BMUS images were acquired only with Philips ultrasound instruments; the influence of images from different ultrasound instruments should be investigated. Third, we did not analyze shape features because they overlap with the ACR TI-RADS. The predictive performance of shape features should be explored further.
In conclusion, the ACR-Rad nomogram, which combines ACR TI-RADS and BMUS radiomics, has the potential to be a convenient and accurate tool for predicting malignancy in thyroid nodules and improving performance for radiologists at different proficiency levels.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. The requirement for informed consent was waived. Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.

AUTHOR CONTRIBUTIONS
XH performed the formal analysis, contributed to the methodology, administered the project, and wrote the original draft of the manuscript. ZW curated the data and collected the resources. AZ conceived the study and wrote and reviewed the manuscript. XM curated the data. QQ curated the data. CZ curated the data. SC curated the data. PX supervised the study and wrote, reviewed, and edited the manuscript. All authors contributed to the article and approved the submitted version.

FUNDING
This study was supported by "the key research and development projects of Jiangxi Province" (no. 20181BBG70031).