Comparison of Diagnostic Performance of Five Different Ultrasound TI-RADS Classification Guidelines for Thyroid Nodules

Objectives We aimed to evaluate and compare the diagnostic performance of five ultrasound thyroid imaging reporting and data system (TI-RADS) classification guidelines for thyroid nodules through a systematic review and meta-analysis. Methods We searched PubMed for relevant studies published before February 2020. We then pooled the sensitivity, specificity, likelihood ratios, diagnostic odds ratios, and area under the summary receiver operating characteristic curves. The diagnostic odds ratios were used to compare performance between guidelines. Results We included 19 studies with 4,696 lesions. The pooled sensitivities of the American College of Radiology (ACR) guidelines, American Thyroid Association (ATA) guidelines, TI-RADS proposed by Kwak (Kwak TI-RADS), Korean Thyroid Association/Korean Society of Thyroid Radiology (KTA/KSThR) guidelines for malignancy risk, and European Thyroid Association (ETA) guidelines ranged from 0.84 to 0.94. The pooled specificities were 0.68, 0.44, 0.62, 0.47, and 0.61, respectively. The relative diagnostic odds ratio (RDOR) was 1.57 (ACR vs ATA), 1.37 (ACR vs ETA), 1.80 (ACR vs Kwak), and 1.74 (ACR vs KTA). Conclusions The results suggest that all five classification guidelines are effective methods for the differential diagnosis of benign and malignant thyroid nodules and that the ACR guideline is the better choice.


INTRODUCTION
Thyroid nodules are common in the general population, especially in women (1). About 10% of patients with thyroid nodules are at risk of malignancy, and this percentage keeps rising (2,3). Malignant and benign nodules are treated in completely different ways, and ruling out malignancy in thyroid nodules remains a major challenge for clinicians. At present, ultrasound is the primary tool for identifying thyroid nodules because it is cheap, noninvasive, and fast. For suspicious thyroid nodules, surgery or fine-needle aspiration cytology (FNAC) is recommended (4). However, benign and malignant nodules share some ultrasound features, from modulation to size. Ultrasound diagnosis also varies with the experience of radiologists and operators, and image acquisition and interpretation are subjective, which can easily lead to misdiagnosis or overtreatment (5).
To make assessment more objective, the thyroid imaging reporting and data system (TI-RADS) was proposed to classify thyroid nodules and recommend further management (6). Five classification systems are now in common clinical use. Among them, the American College of Radiology (ACR) guidelines, the Korean Thyroid Association/Korean Society of Thyroid Radiology (KTA/KSThR) guidelines, and the European Thyroid Association (ETA) guidelines are recommended by radiological associations, and the American Thyroid Association (ATA) guidelines are clinical guidelines (1,7-9).
Although all five guidelines have proved effective in managing thyroid nodules, no large body of reliable data shows which is best (10). Many ongoing clinical trials compare their effectiveness, but their results may be biased. The primary purpose of this research is to compare the diagnostic performance of the five guidelines for thyroid nodules, in order to address the lack of consistency and avoid wasting medical resources.

Literature Search Strategy
We followed the guidelines for systematic reviews and meta-analyses of diagnostic studies. We then searched PubMed for related English-language studies published before February 2020, using the following terms: "sensitivity", "specificity", "TI-RADS (or thyroid imaging reporting and data system)", "ACR (or American College of Radiology)", "ATA (or American Thyroid Association)", "Kwak (or TI-RADS proposed by Kwak)", "ETA (or EU TI-RADS)", "KTA (or Korean Thyroid Association/Korean Society of Thyroid Radiology)". Two reviewers (RN Yang and YN Zhao) independently reviewed the articles in accordance with the inclusion and exclusion criteria. Disagreements were resolved by consensus (XL Ma).

Inclusion and Exclusion Criteria
Studies meeting the following inclusion criteria were included: (a) the article provides sufficient general information; (b) one or more guidelines are used to evaluate the ultrasound features of thyroid nodules; (c) the study has definite diagnostic criteria; and (d) the article provides sufficient data to fill the diagnostic 2 × 2 table, either as measures reported directly in the article (sensitivity, specificity, and PPV) or as counts that can be calculated from the article [true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN)]. Studies were excluded if the data in the article were insufficient or the grading system was not designed to evaluate ultrasound features. In total, 19 articles were included.

Data Extraction
Two reviewers (RN Yang and YN Zhao) extracted the following main characteristics from the studies: author, year, country, number of patients, number of nodules, mean age, guideline involved, gold standard, number of malignant lesions, and number of benign lesions. We obtained the TP, TN, FP, and FN counts for each guideline in each study in one of two ways: (1) we took the counts directly from the article, or (2) we completed the diagnostic 2 × 2 table from the reported sensitivity, specificity, PPV, and NPV. CAL software was used for this step (11).
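The second route can be illustrated with a minimal sketch. It assumes that the numbers of malignant and benign lesions are known alongside the reported sensitivity and specificity; the function name and the input values are hypothetical, and this is not the CAL software cited above.

```python
def reconstruct_2x2(sensitivity, specificity, n_malignant, n_benign):
    """Rebuild TP, FN, TN, FP counts from reported accuracy figures.

    Assumption: counts are rounded to the nearest integer, so the
    rebuilt table can differ slightly from the true one when the
    reported sensitivity or specificity was itself rounded.
    """
    tp = round(sensitivity * n_malignant)  # malignant nodules called positive
    fn = n_malignant - tp                  # malignant nodules missed
    tn = round(specificity * n_benign)     # benign nodules called negative
    fp = n_benign - tn                     # benign nodules called positive
    return {"TP": tp, "FN": fn, "TN": tn, "FP": fp}


# Hypothetical study: 200 malignant and 300 benign nodules.
example = reconstruct_2x2(0.90, 0.60, n_malignant=200, n_benign=300)
print(example)  # {'TP': 180, 'FN': 20, 'TN': 180, 'FP': 120}
```

When a study reports PPV and NPV as well, they can serve as a consistency check on the rebuilt counts.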

Statistical Analysis
On the basis of TP, TN, FP, and FN, we computed the pooled sensitivity, specificity, positive and negative likelihood ratios (PLR and NLR), and diagnostic odds ratio (DOR), with 95% confidence intervals (CI), using the Meta-Disc version 1.4 statistical software (12).
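For a single study, sensitivity, specificity, PLR, NLR, and the diagnostic odds ratio (DOR) follow directly from the four cell counts. The sketch below shows these standard per-study definitions only; it does not reproduce the pooling performed by Meta-Disc, and the function name, the zero-cell correction, and the example counts are assumptions made for illustration.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Per-study accuracy measures from one diagnostic 2 x 2 table."""
    # Assumption: add 0.5 to every cell when any cell is zero, a common
    # continuity correction that keeps the ratios finite.
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "PLR": sensitivity / (1 - specificity),   # positive likelihood ratio
        "NLR": (1 - sensitivity) / specificity,   # negative likelihood ratio
        "DOR": (tp * tn) / (fp * fn),             # equals PLR / NLR
    }


# Hypothetical table: TP = 180, FP = 120, FN = 20, TN = 180.
metrics = diagnostic_metrics(tp=180, fp=120, fn=20, tn=180)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For this hypothetical table the sensitivity is 0.90, the specificity is 0.60, and the DOR is 13.5.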
Additionally, using the same software, we examined the relationship between sensitivity and specificity by constructing summary receiver operating characteristic (SROC) curves (13).
Finally, we made head-to-head comparisons using R 3.5.1 to calculate the relative diagnostic odds ratio (RDOR) with 95% CI, and we compared the diagnostic performance of the five guidelines according to the RDOR. Each comparison involved two guidelines, labeled A and B. In A vs B, an RDOR greater than 1 indicates that A performs better, and an RDOR smaller than 1 indicates that B performs better. For all studies, the inconsistency index (I²) and the χ² test were used to assess heterogeneity, which was considered high if the I² value was higher than 50% (14). A random-effects model was chosen in this research (15).
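The quantities named in this paragraph can be sketched as follows. This is an illustration under stated assumptions, not the authors' R code: it pools log DORs with the DerSimonian-Laird random-effects estimator, derives I² from Cochran's Q, and takes the RDOR as the ratio of two pooled DORs while treating the two guidelines as independent (which ignores the pairing when the same nodules are scored under both). All function names and study tables are hypothetical.

```python
import math


def log_dor(tp, fp, fn, tn):
    """Log diagnostic odds ratio and its variance for one study."""
    cells = [tp, fp, fn, tn]
    if 0 in cells:  # assumption: 0.5 continuity correction for zero cells
        cells = [c + 0.5 for c in cells]
    tp, fp, fn, tn = cells
    return math.log((tp * tn) / (fp * fn)), 1 / tp + 1 / fp + 1 / fn + 1 / tn


def pool_random_effects(tables):
    """DerSimonian-Laird pooled log DOR, its standard error, and I² (%)."""
    effects = [log_dor(*t) for t in tables]
    y = [e[0] for e in effects]
    v = [e[1] for e in effects]
    w = [1 / vi for vi in v]                       # fixed-effect weights
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    df = len(tables) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    w_star = [1 / (vi + tau2) for vi in v]         # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    return pooled, math.sqrt(1 / sum(w_star)), i2


def relative_dor(tables_a, tables_b):
    """RDOR (A vs B) with an approximate 95% CI, assuming independence."""
    log_a, se_a, _ = pool_random_effects(tables_a)
    log_b, se_b, _ = pool_random_effects(tables_b)
    diff = log_a - log_b
    se = math.sqrt(se_a ** 2 + se_b ** 2)
    return math.exp(diff), math.exp(diff - 1.96 * se), math.exp(diff + 1.96 * se)


# Hypothetical (TP, FP, FN, TN) tables for two guidelines, two studies each.
guideline_a = [(90, 20, 10, 80), (85, 25, 15, 75)]
guideline_b = [(90, 40, 10, 60), (85, 45, 15, 55)]
rdor, ci_low, ci_high = relative_dor(guideline_a, guideline_b)
better = "A" if rdor > 1 else "B"
print(f"RDOR = {rdor:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f}); better: {better}")
```

An RDOR whose confidence interval includes 1 would not support a difference between the two guidelines, even if the point estimate is above 1.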

Quality of Studies and Publication Bias
We used the Quality Assessment of Diagnostic Accuracy Studies version 2 (QUADAS-2) tool, implemented in Review Manager 5.2, to assess the quality of the studies included in this analysis. The tool evaluates articles in four domains: (a) patient selection, (b) index test, (c) reference standard, and (d) flow and timing (16). Each domain is rated as having a low, high, or unclear risk. Publication bias was evaluated by the funnel plot asymmetry test using Stata version 11.0 software.
The studies were published from 2015 to 2020. The number of patients per study ranged from 92 to 4,585, and the number of nodules ranged from 100 to 4,696, which means that some patients had more than one nodule. All data were calculated based on the number of nodules. The malignancy status of all thyroid nodules was determined by postoperative pathology or FNAC pathology. We included 19 articles in total: 12 involved ACR TI-RADS, 10 involved the ATA guidelines, and six involved Kwak TI-RADS. The data for the KTA guideline and EU TI-RADS were each obtained from four articles.
These characteristics are shown in Table 1.

Diagnostic Accuracy
After pooling the data of the 19 studies, we obtained the final results, which are shown in Table 2. The RDOR was greater than 1 when ACR was compared with each of the other guidelines. The specific results are listed in Table 3.

Quality Assessment
The results of the quality assessment are outlined in Figure 2. Overall, the quality of the studies was satisfactory.

Assessment of Publication Bias
There was no clear publication bias for the DOR of any of the five guidelines.

DISCUSSION
TI-RADS classification guidelines classify thyroid nodules according to their ultrasound imaging characteristics, including size, number, calcification, boundary, echo pattern, aspect ratio, and internal structure. The guidelines aim to help determine which thyroid nodules require FNAC, in order to reduce overdiagnosis and missed diagnosis. Reducing unnecessary FNAC avoids wasted cost and spares patients physical pain. The guidelines can also guide further treatment and help estimate the risk of recurrence. However, the recommended size thresholds for FNAC differ between guidelines. Many studies have examined the diagnostic efficacy of the five guidelines, but their results vary. These differences may be due in part to differences among observers and study populations, especially in retrospective studies. In this research, we included 19 studies to analyze the diagnostic efficacy of the five guidelines.
Our meta-analysis systematically estimated the diagnostic efficacy of five different ultrasound classification guidelines in detecting malignancy risk. The pooled sensitivities of the ACR TI-RADS, ATA guidelines, Kwak TI-RADS, KTA guidelines, and ETA ranged from 0.84 to 0.94. The pooled specificities were 0.68, 0.44, 0.62, 0.47, and 0.61, respectively. The AUC, which represents diagnostic performance, was 0.8553, 0.8976, 0.9101, 0.9022, and 0.8810 for the ACR TI-RADS, ATA guidelines, Kwak TI-RADS, KTA guideline, and ETA, respectively. In theory, an AUC above 0.8 is considered diagnostic (36). Our results therefore suggest that all five guidelines have adequate diagnostic performance. In addition, the ACR guidelines showed the best diagnostic performance in the head-to-head comparison.
Our results are similar to those of a previous meta-analysis published in 2019 (37). However, that article included only 12 studies with 18,750 thyroid nodules, and the data it included was used to [...] (10,38). Every guideline is divided into several categories for evaluating thyroid nodules. As the risk stratification category rises, the risk of malignancy increases, but the five guidelines differ in how they classify nodules. For example, a category 5 or 4 thyroid nodule in ETA may be classified as ACR T4/3 or KTA T4/3, and a nodule of KTA T3 (low suspicion) and ETA category 3 (low risk) may be classified as ACR T2, which means not suspicious. Such differences in classification criteria may lead to different specificities, and in our research the ACR guidelines had the highest specificity. Higher specificity means fewer recommendations for FNAC, but the rate of misdiagnosis may increase; more studies are needed on this point. As for FNAC recommendations, the five guidelines have different size thresholds, and the thresholds also change with category within each guideline. For example, for ACR TI-RADS, the thresholds for categories 3, 4, and 5 are 2.5, 1.5, and 1 cm, respectively (39). Some studies (17) have shown that ACR TI-RADS has the most effective criteria for avoiding unnecessary biopsies. Our results are consistent with this, with ACR TI-RADS showing the highest RDOR. Nodule size is an important criterion for guidelines and further treatment: surgery or FNAC may be suggested for very large thyroid nodules even when the malignancy risk is low.
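The ACR TI-RADS size thresholds quoted above can be written as a small lookup, shown here as a sketch. The function name is hypothetical, the use of "greater than or equal to" at the threshold is an assumption, and categories other than 3, 4, and 5 are left undecided because the text gives no threshold for them.

```python
# FNAC size thresholds (cm) for ACR TI-RADS categories 3, 4, and 5,
# as quoted in the text (39).
ACR_FNAC_THRESHOLD_CM = {3: 2.5, 4: 1.5, 5: 1.0}


def acr_fnac_recommended(category, max_diameter_cm):
    """Return True/False for categories 3-5, or None if no threshold is known.

    Assumption: a nodule exactly at the threshold counts as meeting it.
    """
    threshold = ACR_FNAC_THRESHOLD_CM.get(category)
    if threshold is None:
        return None  # category not covered by the thresholds quoted above
    return max_diameter_cm >= threshold


# Hypothetical nodules: (category, largest diameter in cm).
for category, size in [(3, 2.0), (4, 1.5), (5, 0.8), (2, 3.0)]:
    print(category, size, acr_fnac_recommended(category, size))
```

The same nodule size can therefore lead to different recommendations depending on the category, which is one reason the guidelines differ in how many biopsies they trigger.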
This research has several limitations. First, the final diagnosis was determined by cytology or pathology, which may be influenced by operators or observers and is therefore a possible source of bias. Especially in retrospective studies, we cannot be sure whether subjective factors affected the diagnosis, and this influence cannot be avoided. Second, patient selection differed among the included studies; some studies included more patients with malignant nodules, which could influence the sensitivity and specificity. Third, we did not have enough data to analyze the KTA guidelines and ETA. Lastly, all analyses are based on ultrasound, so intra-observer and inter-observer variability still exists.
In conclusion, our research indicates that all five classification guidelines are effective methods for the differential diagnosis of benign and malignant thyroid nodules. They can be used as an effective recommendation before further diagnosis or treatment. In the head-to-head comparison, the results suggest that the ACR guideline is the better choice for distinguishing benign from malignant nodules, with high diagnostic accuracy. However, more studies are needed to confirm our findings.

AUTHOR CONTRIBUTIONS
All authors directly participated in the planning, execution, or analysis of the study and wrote the manuscript. RY conducted the literature review and planned and performed all statistical analyses. HZ and XZ provided input and direction for the analytic strategy and edited the manuscript. YZ reviewed the included articles and edited the manuscript. XM provided technical quality control to ensure the accuracy of the reported results. All authors contributed to the article and approved the submitted version.