A comprehensive comparative assessment of eight risk stratification systems for thyroid nodules in the elderly population

Objective: This study aims to investigate the diagnostic value of eight risk stratification systems (RSSs) for thyroid nodules in the elderly and to explore the underlying reasons through comparison with a younger group.
Methods: Cases of thyroid nodules that underwent ultrasound examination together with thyroidectomy or fine-needle aspiration (FNA) at our hospital between August 2013 and March 2023 were collected. The patients were categorized into two groups: an elderly group (aged ≥60 years) and a younger group (aged <60 years). Each of the eight RSSs was applied to evaluate these nodules.
Results: The malignant rate in the elderly group was significantly lower than that in the younger group (28.2% vs. 49.6%, P<0.001). There were statistically significant differences in nodule diameter, multiplicity, composition, echogenicity, orientation, margin, and echogenic foci between the elderly and younger groups (P<0.05). Among the eight RSSs evaluated in elderly adults, the artificial intelligence-based Thyroid Imaging Reporting and Data System (AI TIRADS) demonstrated the highest overall diagnostic efficacy, but with a relatively high unnecessary FNA rate (UFR) and missed cancer rate (MCR) of 55.0% and 51.3%, respectively. With modified size thresholds, the new AI TIRADS achieved the lowest UFR and MCR while maintaining nearly the lowest FNA rate (FNAR) among all the RSSs (P=0.172 and 0.162 compared to the ACR and original AI, respectively, and P<0.05 compared to the other six RSSs).
Conclusion: Among the eight RSSs, AI demonstrated the highest diagnostic efficacy in the elderly population. However, the size thresholds for FNA need to be adjusted.


Introduction
With the widespread application of imaging techniques, the prevalence of thyroid nodules in adults has reached approximately 19%-68%, and it tends to increase with age (1, 2). To aid clinicians in determining suitable management strategies for the growing number of thyroid nodules, various ultrasound (US)-based risk stratification systems (RSSs) have been developed in recent years. The commonly used RSSs can be broadly classified into two groups: "point-based" systems and "pattern-based" systems. The point-based systems comprise the Thyroid Imaging Reporting and Data Systems (TIRADS) established by Kwak et al. (Kwak) (3), the American College of Radiology (ACR) (4), Benjamin et al. using an artificial intelligence algorithm (AI) (5), and the Chinese guideline (C-TIRADS) (6). The pattern-based systems comprise the American Thyroid Association (ATA) guideline (2), the American Association of Clinical Endocrinologists, American College of Endocrinology, and Associazione Medici Endocrinologi (AACE/ACE/AME) guideline (7), the European Thyroid Association (EU) TIRADS (8), and the Korean Society of Thyroid Radiology (K-TIRADS) (9). All these systems have exhibited excellent diagnostic performance (10-14). However, studies have shown that age is a confounding factor that cannot be overlooked (15, 16).
Age is associated with an increased incidence of thyroid nodules, a lower malignancy rate, and a higher proportion of invasive nodules (17). This implies that RSSs designed for the general population may not necessarily be applicable to older patients. To the best of our knowledge, there is currently no comparative study of these eight systems focusing specifically on thyroid nodules in the elderly population. This study aims to analyze thyroid nodules in elderly patients using the eight RSSs, identify the optimal diagnostic system, and explore whether the established biopsy thresholds are applicable to older individuals.

Patients
The Scientific Research and Clinical Trials Ethics Committee of the First Affiliated Hospital of Zhengzhou University of China approved this retrospective study and waived the requirement for written informed consent for data usage. The study covered the period from August 2013 to March 2023 and started from a cohort of 5473 thyroid nodules in 3685 patients who received thyroid US examinations and thyroid surgery or fine-needle aspiration (FNA) at our hospital. After the exclusion criteria were applied, a total of 3914 thyroid nodules in 2638 patients were included. The nodules were then divided into two groups according to patient age: an elderly group (≥60 years old, 794 nodules in 504 patients) and a younger group (<60 years old, 3120 nodules in 2134 patients) (Figure 1). The choice of 60 years as the age cutoff was based on our country's regulations, the medical situation, and previous literature (18). The exclusion criteria were as follows: (I) age <18 years; (II) incomplete ultrasound images; (III) inconclusive pathological results. If surgery had been performed, the postoperative pathology result prevailed. If no surgery had been performed, the FNA result was used. Cytology was classified according to the Bethesda System (19). Bethesda V and VI were considered malignant, and Bethesda II was considered benign. Bethesda classes I, III, and IV were excluded as uncertain outcomes. In the elderly group, 456 nodules were confirmed by postoperative pathology, consisting of 271 benign and 185 malignant nodules. Additionally, 338 nodules were confirmed through FNA, with 299 benign and 39 malignant nodules. In the younger group, 2011 nodules were confirmed by postoperative pathology, comprising 713 benign and 1298 malignant nodules. Additionally, 1109 nodules were confirmed through FNA, including 861 benign and 248 malignant nodules.
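As a consistency check on the counts above, the malignancy rate in each group can be recomputed directly from the reported numbers. The following is a minimal sketch, not the authors' analysis: the function names are hypothetical, and the hand-coded Pearson chi-square test without continuity correction is an assumption, since the exact test variant applied to this comparison is not stated.

```python
import math

# Nodule counts reported above: (benign, malignant) by confirmation route.
elderly = {"surgery": (271, 185), "fna": (299, 39)}
younger = {"surgery": (713, 1298), "fna": (861, 248)}


def totals(group):
    """Return (benign, malignant) totals across confirmation routes."""
    benign = sum(b for b, _ in group.values())
    malignant = sum(m for _, m in group.values())
    return benign, malignant


def malignancy_rate(group):
    """Malignant nodules as a percentage of all nodules in the group."""
    benign, malignant = totals(group)
    return 100.0 * malignant / (benign + malignant)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]].

    Assumption for illustration: the source does not say which variant was used.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


e_benign, e_malignant = totals(elderly)
y_benign, y_malignant = totals(younger)
stat, p_value = chi_square_2x2(e_malignant, e_benign, y_malignant, y_benign)

print(f"elderly malignancy rate: {malignancy_rate(elderly):.1f}%")
print(f"younger malignancy rate: {malignancy_rate(younger):.1f}%")
print(f"chi-square = {stat:.1f}, P < 0.001: {p_value < 0.001}")
```

A P value that statistical software prints as 0.000 is conventionally reported as P<0.001, which is what this check supports.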

Thyroid ultrasound examination and image interpretation
Each examination was performed by one of two US specialists (with 33 and 11 years of experience in thyroid US) using an Aplio 300 or 500 system (Toshiba Corporation, Tokyo, Japan) equipped with a 5-12 MHz linear array transducer. Two sonographers specializing in superficial organs (with 8 and 12 years of experience in analyzing thyroid US images), blinded to the biopsy results and the final pathological diagnoses, assessed the ultrasonic features of the nodules and classified them according to the ATA guidelines, ACR, AI, Kwak, EU, AACE/ACE/AME, C-TIRADS, and K-TIRADS. US features included size (the maximal diameter on US), composition (solid or almost solid, mixed cystic and solid, cystic, spongiform), echogenicity (hyperechoic, isoechoic, hypoechoic, markedly hypoechoic, anechoic), orientation (taller-than-wide, wider-than-tall), margin (smooth, ill-defined, irregular or lobulated, extrathyroidal extension), and echogenic foci (punctate echogenic foci, peripheral calcifications, macrocalcification, comet-tail artifacts). Comet-tail artifacts were recorded only in the absence of microcalcification, whereas other types of calcification could be selected simultaneously. When the two readers disagreed, a third expert with 33 years of thyroid imaging experience participated in a joint discussion to reach a final decision. Before the ultrasonic features were assessed, an interactive case-based training session was conducted using 30 representative thyroid nodules not included in this study. FNA recommendations were then determined based on the size thresholds of each guideline. Nodules that could not be classified under the ATA guidelines were grouped with the intermediate-suspicion category, as in previous reports (20-23).

Statistical analysis
The FNA rate (FNAR) was determined by calculating the percentage of nodules recommended for FNA out of the total nodules. The unnecessary FNA rate (UFR) was computed as the proportion of benign nodules among the nodules that were recommended for FNA. The missed cancer rate (MCR) was derived by calculating the proportion of malignant nodules that were not recommended for FNA out of all malignant nodules.
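The three definitions above can be written out as a short calculation. This is a minimal sketch under stated assumptions: the function name `fna_metrics` and its argument names are hypothetical, and the counts in the usage example are invented for illustration and are not study data.

```python
def fna_metrics(n_total, n_malignant, n_fna, n_fna_benign):
    """Compute FNAR, UFR, and MCR (as percentages) from nodule counts.

    n_total      -- all nodules assessed
    n_malignant  -- all malignant nodules
    n_fna        -- nodules recommended for FNA
    n_fna_benign -- benign nodules among those recommended for FNA
    """
    if not 0 <= n_fna_benign <= n_fna <= n_total:
        raise ValueError("inconsistent FNA counts")
    n_fna_malignant = n_fna - n_fna_benign
    if not 0 < n_malignant <= n_total or n_fna_malignant > n_malignant:
        raise ValueError("inconsistent malignant counts")
    if n_fna == 0:
        raise ValueError("UFR is undefined when no nodule is recommended for FNA")

    n_missed = n_malignant - n_fna_malignant  # malignant, not sent to FNA
    return {
        "FNAR": 100.0 * n_fna / n_total,        # recommended / all nodules
        "UFR": 100.0 * n_fna_benign / n_fna,    # benign / recommended
        "MCR": 100.0 * n_missed / n_malignant,  # missed / all malignant
    }


# Hypothetical example: 100 nodules, 30 malignant, 40 recommended for FNA,
# of which 22 turn out to be benign.
example = fna_metrics(n_total=100, n_malignant=30, n_fna=40, n_fna_benign=22)
for name, value in example.items():
    print(f"{name}: {value:.1f}%")
```

Note that the three rates use different denominators (all nodules, nodules recommended for FNA, and all malignant nodules), so a system can lower its FNAR while its MCR rises.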
Statistical analysis was conducted using SPSS 26.0 (IBM Corp., Armonk, NY, USA) and MedCalc 18.2.1 (MedCalc Software Ltd, Ostend, Belgium) software. Continuous data were presented as mean ± standard deviation (SD) and compared using the independent two-sample t-test. Categorical data were compared using the Chi-square test or Fisher's exact test. Receiver operating characteristic (ROC) curves were constructed, and the area under the ROC curve (AUC) was compared using the DeLong method or Z-test. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, FNAR, UFR, and MCR with 95% confidence intervals (CI) were evaluated for the RSSs and compared using the McNemar or Chi-square test. A two-sided P value of <0.05 was considered statistically significant.
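Because every RSS was applied to the same nodules, comparisons between two systems are paired, which is the setting the McNemar test addresses: only nodules on which the two systems disagree carry information. The sketch below shows the exact (binomial) form of the test. It is an illustration only: the source does not say whether the exact or the chi-square approximation was used, the function name is hypothetical, and the counts are invented.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar P value.

    b -- nodules flagged by system A but not by system B
    c -- nodules flagged by system B but not by system A
    Under the null hypothesis, each discordant nodule is equally likely
    to fall in either cell, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts for two systems read on the same nodules.
print(f"P = {mcnemar_exact(3, 12):.4f}")
```

Concordant nodules (both systems agree) do not enter the calculation, which is why a paired test can detect differences that an unpaired Chi-square test on the same data would miss.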

Basic characteristics
The malignant rate in the elderly group was significantly lower than that in the younger group (P<0.001). There was no statistically significant difference in gender between the two groups (P=0.119).
However, the elderly group had a higher prevalence of multiple nodules and larger nodules, with a higher proportion of nodules measuring ≥20 mm and a lower proportion of nodules measuring <10 mm compared to the younger group (P<0.05) (Table 1).

Comparison of ultrasound features between elderly and younger groups
There were statistically significant differences in composition, echogenicity, orientation, margin, and echogenic foci between the elderly group and the younger group (Table 2). The proportion of solid nodules was lower in the elderly group, while the proportion of mixed cystic and solid nodules was higher than in the younger group. The proportions of hyperechoic, isoechoic, and unclassifiable nodules were higher in the elderly group, while the proportions of hypoechoic and markedly hypoechoic nodules were lower than in the younger group. A taller-than-wide orientation was less common in the elderly group than in the younger group. Smooth, ill-defined, and unclassifiable margins were more common in the elderly group, while irregular margins and extrathyroidal extension were less common than in the younger group. Peripheral calcifications, macrocalcification, and non-calcified nodules were more common in the elderly group, while punctate echogenic foci were less common than in the younger group.

Diagnostic efficacy of suspicious ultrasound features
As shown in Table 3, the elderly group demonstrated lower sensitivity for hypoechoic nodules, extrathyroidal extension, and punctate echogenic foci than the younger group (P=0.042, 0.028, and <0.001, respectively), but higher sensitivity for ill-defined margins (P=0.035). The elderly group demonstrated higher specificity for hypoechoic nodules than the younger group (P=0.036), but lower specificity for ill-defined margins (P=0.003). All ultrasound features in Table 3, except for markedly hypoechoic nodules, showed lower PPV in the elderly group than in the younger group (P<0.05). However, all ultrasound features in Table 3 exhibited higher NPV in the elderly group than in the younger group (P<0.05).

Comparison of diagnostic efficacy among different RSSs for elderly patients
ROC analysis showed that the cutoff value for C-TIRADS was 4B, for Kwak was 4C, for AACE/ACE/AME was 3, and for the remaining systems was 5. The highest area under the ROC curve was observed for AI, followed by Kwak, with AACE/ACE/AME exhibiting the lowest value. In terms of sensitivity, C-TIRADS demonstrated the highest level, followed by EU and AACE/ACE/AME, whereas K-TIRADS showed the lowest sensitivity. K-TIRADS displayed the highest specificity, followed by AI, while C-TIRADS exhibited the lowest specificity. The highest PPV was associated with AI, followed by K-TIRADS, while C-TIRADS presented the lowest PPV. The maximum NPV was achieved by C-TIRADS, followed by AACE/ACE/AME and EU, whereas K-TIRADS and ACR had the lowest NPV. AI showed the best accuracy, K-TIRADS came in second, and C-TIRADS exhibited the lowest accuracy (Table 4).

Comparison of diagnostic efficacy for elderly patients based on size thresholds
After considering the size thresholds for FNA from the various guidelines, we found that the FNAR for the eight RSSs ranged from 30.5% to 59.7%, with AI having the lowest rate and K-TIRADS the highest. The UFR ranged from 55.0% to 74.5%, with AI having the lowest rate and K-TIRADS the highest. The MCR ranged from 22.8% to 51.8%, with C-TIRADS having the lowest rate and ACR the highest (Table 5).

The modified version of AI-TIRADS with adjusted size thresholds
As shown in Table 5, all eight RSSs performed poorly once the size thresholds for FNA were incorporated. Among them, AI had the lowest FNAR and UFR, while C-TIRADS had the lowest MCR. Taken together, AI showed the best diagnostic efficacy among older adults, but its size thresholds for FNA needed to be adjusted. Considering the specific characteristics of nodules in elderly patients, we modified the size thresholds as follows: category 3 from ≥25 mm to no FNA for any nodule, category 4 from ≥15 mm to ≥25 mm, and category 5 from ≥10 mm to ≥5 mm. With these changes, the UFR of the modified AI TIRADS decreased from 55.0% to 34.3%. The MCR also significantly decreased from 51.3% to 21.4%, and the FNAR

Discussion
Among the various US-based RSSs, AI TIRADS demonstrated the best overall performance, with the largest AUC, the highest PPV and accuracy, nearly the highest specificity, and relatively high sensitivity and NPV, which suggested that AI TIRADS was more suitable for elderly individuals. AI TIRADS is a simplified version of ACR based on artificial intelligence algorithms. It shares the same risk stratification and the same FNA thresholds as ACR, thus maintaining its excellent diagnostic efficacy while reducing the UFR. Furthermore, it excludes the scoring of several ultrasound indicators, which simplifies the evaluation process and improves user-friendliness. Previous studies have confirmed that AI has similar or even higher diagnostic value compared to ACR (24, 25), which is consistent with our findings.
Combining the size thresholds for FNA, we found that the FNAR for the various guidelines ranged from 30.5% to 59.7%, the UFR ranged from 55.0% to 74.5%, and the MCR ranged from 22.8% to 51.8%. Although AI demonstrated the best performance, its UFR and MCR were as high as 55.0% and 51.3%, respectively. This indicated that the current size thresholds in existing guidelines were not suitable for the elderly, including those of ACR/AI, which had been reported to have the lowest UFR in the previous literature (12, 14). One possible reason was that thyroid nodules in the elderly population were generally larger and more often benign, which the results of this study corroborated. One study reported that the malignancy rate of thyroid nodules was 17.1%-22.9% in individuals aged 20-49 years but only 12.6% in those aged 70 years and above (17). Hence, size thresholds that are suitable for the general population may be relatively low for elderly individuals, resulting in a higher UFR. For elderly patients, surgical indications should be considered carefully, because surgery in this age group is not only a treatment but also carries a significant risk of morbidity, particularly in frail individuals (26). Advancing patient age should be a factor to consider when dealing with thyroid nodules (27). Because AI demonstrated the best overall diagnostic performance in elderly individuals but had a high UFR and MCR, we adjusted its size thresholds for FNA. In this study, the malignancy rate of category 3 nodules in the elderly group was only 7.6% (9/119), and only 33.3% (3/9) of these malignant nodules measured ≥25 mm. These nodules could be adequately monitored through follow-up (28). Therefore, we recommended follow-up instead of FNA for category 3 nodules. For category 4, the malignancy rate was 19.6% (31/158), and we recommended adjusting the size threshold from 15 mm to 25 mm. With these two changes, we reduced the number of nodules undergoing FNA by 60.7% (from 122 to 48) while also avoiding unnecessary FNA for 68.3% of benign nodules (from 101 to 32). The number of missed diagnoses increased only slightly, by 5 (from 19 to 24).
However, for category 5 nodules, given the high malignancy rate and the higher likelihood of aggressive cancer in elderly individuals, which accounts for almost all thyroid-related deaths (26, 29), we lowered the size threshold from 10 mm to 5 mm, thus reducing the number of missed cancers by 82.8% (from 87 to 15). With all the adjustments implemented, the modified AI TIRADS showed a significant decrease in the UFR and MCR (UFR before vs. after adjustment: 55.0% vs. 34.3%; MCR before vs. after adjustment: 51.3% vs. 21.4%; both P<0.001). Although the FNAR increased slightly, there was no statistically significant difference compared to the ACR and original AI (P=0.172 and 0.162, respectively), and it remained lower than that of the other six RSSs.
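The original and modified size thresholds described above amount to a simple lookup rule, which can be sketched as follows. This is an illustration rather than the authors' implementation: the names `ORIGINAL_AI`, `MODIFIED_AI`, `recommend_fna`, and `relative_reduction` are hypothetical, and treating categories 1 and 2 as never recommended for FNA is an assumption carried over from ACR TI-RADS, not something stated in this text. The last lines recompute the percentage reductions quoted above from the reported counts.

```python
# FNA size thresholds in mm by category; None means follow-up only (no FNA).
ORIGINAL_AI = {3: 25, 4: 15, 5: 10}
MODIFIED_AI = {3: None, 4: 25, 5: 5}


def recommend_fna(category, size_mm, thresholds):
    """Return True if a nodule meets the FNA size threshold for its category.

    Categories without an entry (assumed here: 1 and 2) never trigger FNA.
    """
    threshold = thresholds.get(category)
    return threshold is not None and size_mm >= threshold


def relative_reduction(before, after):
    """Percentage reduction when a count falls from `before` to `after`."""
    return 100.0 * (before - after) / before


# Hypothetical nodules as (category, maximal diameter in mm).
for category, size_mm in [(3, 30), (4, 20), (5, 7)]:
    old = recommend_fna(category, size_mm, ORIGINAL_AI)
    new = recommend_fna(category, size_mm, MODIFIED_AI)
    print(f"category {category}, {size_mm} mm: original={old}, modified={new}")

# Counts reported in the text for categories 3 and 4 combined, and category 5.
print(f"FNA nodules:     {relative_reduction(122, 48):.1f}% fewer")
print(f"unnecessary FNA: {relative_reduction(101, 32):.1f}% fewer")
print(f"missed cancers:  {relative_reduction(87, 15):.1f}% fewer (category 5)")
```

The modification moves the thresholds in opposite directions: it raises them for categories 3 and 4, where malignancy is uncommon, and lowers them for category 5, where most missed cancers occurred.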
In this study, the ROC analysis yielded a diagnostic threshold of 4A for C-TIRADS, which differed from the 4B threshold previously used in the general population (30). This difference in threshold contributed to the higher sensitivity and lower specificity observed in this research. A possible reason was that C-TIRADS uses only a few key suspicious US signs: solid composition, marked hypoechogenicity, ill-defined or irregular margin or extrathyroidal extension, vertical orientation, and microcalcifications. These features were generally less sensitive in the elderly population, especially microcalcifications. As a result, C-TIRADS tended to yield lower scores in the elderly population, leading to a lower diagnostic cutoff than in previous studies. Additionally, C-TIRADS does not account for the highly sensitive feature of hypoechoic nodules, which partly explained the superior diagnostic performance of Kwak (3), which uses a classification approach similar to that of C-TIRADS. Moreover, C-TIRADS assigns 1 point for an ill-defined margin, whereas in the elderly population an ill-defined margin exhibited low specificity and PPV (66.7% and 18.8%, respectively), which also contributed to the divergence between C-TIRADS and Kwak.
It is worth mentioning that in this study, nodules that could not be classified under the ATA guidelines were grouped with the intermediate-suspicion category, which is similar to the classification method of K-TIRADS (9). However, the AUC of K-TIRADS was superior to that of ATA in this study. Analysis of the data showed that the difference arose in mixed cystic and solid nodules with suspicious US features. ATA categorizes mixed cystic and solid nodules with suspicious US features into the high-suspicion category (TR-5). In contrast, K-TIRADS classifies them, along with isoechoic nodules with suspicious US features, into TR-4, with only solid hypoechoic nodules with malignant features classified into TR-5. This emphasizes the predictive ability of solid hypoechoic nodules for malignancy, which the data from this study also confirmed. In the elderly group, the PPV of solid nodules was 43.1%, whereas that of mixed cystic and solid nodules was only 5.8%. The PPV of hypoechoic nodules, although lower in the elderly group than in the younger group, still reached 44.1%, while that of hyperechoic or isoechoic nodules was only 8.4%. This indicates that in clinical practice, particular weight should be given to the predictive ability of solid hypoechoic nodules for malignancy.
This study had several limitations. Firstly, it was conducted at a single center, which may limit the generalizability of the findings. Multi-center studies will be necessary to validate and strengthen the results. Secondly, due to the limited number of patients aged 80 years and above, the study did not compare different age groups within the elderly population. Thirdly, as this study was retrospective, there might have been some limitations in image interpretation. Further prospective studies will be essential to establish more definitive conclusions.

Conclusion
All eight RSSs showed acceptable diagnostic efficacy in elderly patients, albeit lower than in younger patients. Among these RSSs, AI demonstrated the highest overall diagnostic efficacy. By adjusting the size thresholds, the AI TIRADS achieved the lowest UFR and MCR and nearly the lowest FNAR, thus offering enhanced guidance for clinical practice.

FIGURE 1
Flowchart of study subject inclusion. US, ultrasound; FNA, fine-needle aspiration; n, number.

TABLE 1
Basic characteristics of thyroid nodules according to age group.
*P<0.05 between elderly group and younger group. n, number; SD, standard deviation.

TABLE 2
Ultrasound features of thyroid nodules according to age group.

*P<0.05 between elderly group and younger group. # The comet-tail artifacts were recorded only in the absence of microcalcifications. Other types of calcifications could be selected simultaneously. n, number.

TABLE 3
Diagnostic performance of partial suspicious ultrasound features.
only increased from 30.5% to 33.8%. The modified AI demonstrated the lowest UFR (P<0.05 compared to all eight RSSs) and MCR (P=0.733 compared to C-TIRADS and P<0.05 compared to the other seven RSSs). Additionally, it achieved nearly the lowest FNAR (P=0.172 and 0.162 compared to the ACR and original AI, respectively, and P<0.05 compared to the other six RSSs).

TABLE 4
Diagnostic performance of eight RSSs for elderly patients.

TABLE 5
Comparison of therapeutic performance for elderly patients based on size thresholds.