Thyroseq v3, Afirma GSC, and microRNA Panels Versus Previous Molecular Tests in the Preoperative Diagnosis of Indeterminate Thyroid Nodules: A Systematic Review and Meta-Analysis

Background: Molecular tests are increasingly used as an auxiliary diagnostic tool to avoid diagnostic surgery for cytologically indeterminate thyroid nodules (ITNs). Previous test versions, Thyroseq v2 and Afirma Gene Expression Classifier (GEC), have proven shortcomings in malignancy detection performance.
Objective: This study aimed to evaluate the diagnostic performance of the established Thyroseq v3, Afirma Gene Sequencing Classifier (GSC), and microRNA-based assays versus prior iterations in ITNs, in light of “rule-in” and “rule-out” concepts. It further analyzed the impact of noninvasive follicular thyroid neoplasm with papillary-like nuclear features (NIFTP) reclassification and Bethesda cytological subtypes on the performance of molecular tests.
Methods: PubMed, Scopus, and Web of Science were searched through September 2020. A random-effects bivariate model was used to estimate the summary sensitivity, specificity, positive (PLR) and negative likelihood ratios (NLR), and area under the curve (AUC) for each panel. Sensitivity analyses addressed different Bethesda categories and NIFTP thresholds.
Results: A total of 40 eligible studies were included, with 7,831 ITNs from 7,565 patients. Thyroseq v3 showed the best overall performance (AUC 0.95; 95% confidence interval: 0.93–0.97), followed by Afirma GSC (AUC 0.90; 0.87–0.92) and Thyroseq v2 (AUC 0.88; 0.85–0.90). In terms of “rule-out” abilities, Thyroseq v3 (NLR 0.02; 95%CI: 0.0–2.69) surpassed Afirma GEC (NLR 0.18; 95%CI: 0.10–0.33). Thyroseq v2 (PLR 3.5; 95%CI: 2.2–5.5) and Thyroseq v3 (PLR 2.8; 95%CI: 1.2–6.3) achieved superior “rule-in” properties compared to Afirma GSC (PLR 1.9; 95%CI: 1.3–2.8). Evidence for Thyroseq v3 seems to have higher quality, notwithstanding the paucity of studies. The performance of both Afirma GEC and Thyroseq v2 was affected by NIFTP reclassification. ThyGeNEXT/ThyraMIR and RosettaGX show promising preliminary results.
Conclusion: The newly emerged tests, Thyroseq v3 and Afirma GSC, although designed for a “rule-in” purpose, proved superior in their ability to rule out malignancy, thus surpassing the previous tests that are no longer available, Thyroseq v2 and Afirma GEC. However, Thyroseq v2 still ranks as the best rule-in molecular test.
Systematic Review Registration: http://www.crd.york.ac.uk/PROSPERO, identifier CRD42020212531.


INTRODUCTION
Thyroid cancer (TC) accounts for 2% of all cancers and it is the most frequent endocrine malignancy. In the last decades, its incidence has increased due to improved screening and ultrasound (US) surveillance of thyroid nodules (TNs) (1,2). Distinguishing benign from malignant disease is typically achieved by fine-needle aspiration (FNA) biopsy and cytologic evaluation of TNs based on US appearance and nodule size.
The Bethesda System for Reporting Thyroid Cytopathology (TBSRTC) provides a framework to standardize the reporting of FNA cytology results (3) and has therefore become an effective tool for identifying malignancy risks and types of neoplasms and for guiding clinical management. This approach reliably establishes a benign or malignant nodule diagnosis in 70 to 80% of all cases (4). However, for the remaining 20 to 30% of nodules, the FNA diagnosis falls in an interpretive gray zone, consisting of one of three indeterminate cytology categories (3,5), i.e., follicular lesion of undetermined significance/atypia of uncertain significance (FLUS/AUS, Bethesda category III), follicular neoplasm/suspicious for follicular neoplasm (FN/SFN, Bethesda category IV), and suspicious for malignancy (SM, Bethesda category V), with a predicted probability of cancer of 10-30, 25-40, and 50-75%, respectively (3).
Historically, indeterminate thyroid nodules (ITNs) commonly underwent repeat FNA or diagnostic surgery, typically lobectomy. Approximately three-quarters of these were benign on surgical pathology, indicating unnecessary surgical removal (6). Advances in the genetics of thyroid tumorigenesis have led to the development of a series of molecular tests to complement cytology and improve the risk-based stratification of ITNs (7).
Afirma Gene Expression Classifier (GEC) from Veracyte Inc. is a microarray-based test with a proprietary algorithm that analyses the mRNA expression of a panel of 167 genes (8). Previous works report a quite high sensitivity (SE) but low specificity (SP) for Afirma GEC, making it a good "rule-out" test (9).
The ThyroSeq panel is a next-generation sequencing (NGS)-based assay that underwent several iterations over the years (10)(11)(12). ThyroSeq v2 replaced in 2011 the so-called seven-gene panel (BRAF, RAS, RET/PTC, and PAX8/PPARγ) and queried 56 genes for point mutations, fusions, and abnormal gene expression. Its initial validation study claimed the potential for use as an all-around test of malignancy in ITNs given the reported positive predictive value (PPV) of 83% (13).
Recent refinements led to the development of novel analytic panels, such as Thyroseq v3 and Afirma Gene Sequencing Classifier (GSC), which became available for clinical use in 2017. Thyroseq v3 assays for a panel of 112 gene point mutations, insertions, deletions, copy number alterations, fusions, and gene expression alterations associated with TC (14,15). The next-generation molecular tool, Afirma GSC, was released to improve the GEC's SP and incorporated additional components for BRAFV600E mutation, RET/PTC fusion, parathyroid tissue, and medullary thyroid cancer (MTC) (16). Data from an academic center suggest an improved SP and PPV while maintaining high SE and NPV and reducing the surgery rate for GSC (17). In May 2018, Veracyte Inc. launched the Afirma Xpression Atlas (XA) which uses RNA sequencing to detect gene variants and fusions, being conceived for Afirma GSC suspicions and Bethesda V-VI lesions (18). Subsequent augmentation of the panel meant to include 905 variants and 255 fusions from 593 genes has broadened its initial use from surgical decision-making in ITNs to targeted therapies for metastatic TC (19,20).
A multiplatform approach (MPT, Interpace Diagnostics) combines a mutation panel (ThyGenX) and a microRNA (miRNA) classifier test (ThyraMIR) that has been shown to provide both high NPV and PPV (21,22). In the current MPT, designated MPTX, an analytically validated expanded NGS test (ThyGeNEXT), is combined with ThyraMIR. This multiplatform test demonstrated a high PPV of 75% and NPV of 97%, comparable with other marketed tests (14,16,23). RosettaGX Reveal (Rosetta Genomics) is a thyroid miRNAs classifier for the stratification of ITNs by evaluating the expression of 24 up and down-regulated miRNAs species, using the routinely stained cytology smears as testing substrate (24).
Currently, the AUS/FLUS category represents "the grey zone" of thyroid cytology, comprising a heterogeneous set of cases of uncertain interpretation. This feature can explain in part the more variable AUS/FLUS risk of malignancy compared to other indeterminate categories. Moreover, little is known about the impact of the molecular diagnosis on AUS/FLUS subcategorization. Recent studies have shown that the BRAFV600E mutation is more frequently associated with cytologic atypia than other qualifiers, whereas the molecular landscape of other AUS/FLUS subcategories is still evolving (25). The development of a hybrid AUS/FLUS subclassification system integrating the atypia qualifiers and molecular alterations could improve malignancy risk stratification and could also contribute to customizing the management of AUS/FLUS patients by selecting those more suitable for surgery or clinical follow-up (26). Thus, it was proposed that BRAF, RAS, and RET/PTC alterations could be analyzed first if cytological atypia predominates. Conversely, if the predominant cytological features are non-typical microfollicular structures, then RAS and PAX8/PPARγ alterations could be searched first (27).
Recently, a new histological category of Noninvasive Follicular Thyroid Neoplasms with Papillary-like Nuclear Features (NIFTP) was introduced to distinguish the noninvasive encapsulated follicular variant of papillary thyroid cancer (EFVPTC) from other aggressive forms of papillary thyroid carcinomas (PTC). In the original study, no adverse outcomes were found in 109 NIFTP patients; thus, NIFTP was considered a lesion with an excellent prognosis and is currently regarded as a low-risk thyroid neoplasm (28). Although two subsequent studies have reported a risk of lymph node and lung metastases in about 5 and 1% of the NIFTP cases, respectively (29,30), these findings were not confirmed in the majority of cohorts after a long follow-up (31)(32)(33)(34)(35)(36). Newly proposed additional diagnostic criteria for NIFTP reflect a joint effort by experts to further refine the NIFTP definition such that the histomorphology would correlate with an indolent outcome of this entity (37). Reliable criteria that could lead to a diagnosis of NIFTP on cytological specimens are awaited, to avoid overtreatment and additional follow-up. Also, given that some molecular tests were developed and validated before this reclassification, their performance measures have been shown to deteriorate significantly when the NIFTP designation is incorporated in the classification of ITNs (38)(39)(40).
A few previous meta-analyses have been done on this topic; most of them analyzed only a single molecular test, and none of them evaluated qualitatively the newest emerging panels, Thyroseq v3 and Afirma GSC (9,41-43). Therefore, the present study aimed to measure the accuracy of the recently developed Thyroseq v3, Afirma GSC, Interpace Multiplatform tests, and RosettaGX for the diagnosis of ITNs, compare them with the initial versions, and highlight each test's diagnostic potential in light of "rule-in" and "rule-out" concepts. The secondary aim was to perform an updated analysis of Thyroseq v2 and Afirma GEC and assess the impact of NIFTP reclassification, TBSRTC cytological subtypes, and industry sponsorship on the performance of these molecular tests.

Protocol and Registration
The protocol of the current systematic review and meta-analysis can be accessed on the PROSPERO website https://www.crd.york.ac.uk/prospero/ with the following registration number: CRD42020212531.

Search Strategy
The research followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (44). We used the PICO (population, index, comparator, outcomes) system to describe the essential items for framing this review and its objective and methodology. Papers published before September 05, 2020, were searched on PUBMED, Web of Science, and Scopus databases combining the concepts "molecular panels" with "thyroid nodules" and "indeterminate cytology". After that, we used the following search strategy on Medline: [Thyroseq OR (Afirma AND ("gene expression classifier" OR "Genomic Sequencing Classifier" OR GEC OR GSC)] OR Rosettagx OR Thyramir OR ThygenX OR (Multiplatform AND test*) OR MPTX OR ThyGeNEXT) AND [(thyroid AND (Nodule* OR tumor*)] OR indetermin* OR undetermin* OR "fine needle aspiration" OR FNAC* OR [(Bethesda OR categor*) adj6 (III OR IV OR V OR 3 OR 4 OR 5)] OR AUS/FLUS OR FN/SFN OR "suspicion of follicular neoplas*"). The search strategy in other databases was similar, following the same principles and steps. At the same time, the reference lists of review papers and original reports were handsearched for further relevant studies. No language, publication date, or status restrictions were used.

Inclusion Criteria
To be included in the meta-analysis, studies had to meet the following criteria: • longitudinal studies of individuals with nodular thyroid disease (solitary or multinodular), found by palpation or on US, in whom FNA biopsy was performed and categories III, IV, or V were identified according to TBSRTC; • studies evaluating at least one of the following molecular panels: Thyroseq, Afirma GEC or GSC, RosettaGX Reveal, ThyraMIR/ThyGenX, ThyraMIR/ThyGeNEXT (Interpace, MPTX), or miRInform; • studies that used histopathological examination of the thyroid surgery specimen as the reference standard.

Exclusion Criteria
• studies that used reference standards other than histopathological examination, such as clinical or US surveillance; • duplicates, reviews, comments, editorials, conference abstracts, and unpublished articles; • studies that enrolled patients with benign or malignant cytology of the TNs and participants with non-diagnostic results of the molecular tests.

Data Extraction
Two reviewers (SCA, LV), working independently, read the included articles' titles and abstracts and judged their eligibility. A third investigator (SH) adjudicated any discrepancies. After excluding papers that did not meet our inclusion criteria, we read the full texts, and relevant data were extracted and tabulated in a Microsoft Excel sheet framework.
The following items were eligible to collect and record for each manuscript: • publication information (first author, publication year, country of origin); • patients' characteristics (participants' and TNs number, mean age, gender ratio); • index test information (the molecular panel); • reference standard information (histological subtypes after surgical treatment, number of NIFTP cases and their index test results); • study flow and timing (number of FNA biopsies performed to confirm indeterminate cytology, percentage of resected nodules among the entire cohort, group with the positive and negative index test result, number of nodules with nondiagnostic test result); • statistical analysis (TPs, FPs, TNs, and FNs).
When the appropriate effect size was not available, we extracted the original data from the article to calculate it, or we contacted the authors to request the missing data.

Assessment of Methodological Quality
Two reviewers (SCA, LV) assessed the studies' quality using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) (45). The domains included in the risk of bias and applicability evaluation were participant recruitment, index test, reference standard, and flow and timing. We customized the signaling questions for each of the four QUADAS-2 domains (Supplementary Table 1). For each signaling question, reviewers were required to answer "yes," "no," or "unclear." According to the signaling questions, the risk of bias and applicability were evaluated as low, high, or unclear (Supplementary Table 2). Divergent answers among reviewers were resolved through discussion. No study was excluded as a result of findings from the risk of bias assessments. However, due to the limited number of studies labeled with a low risk of bias, we could not synthesize the results for this subgroup separately.

Statistical Methods
For each panel, the numbers of true-positive (TP), false-positive (FP), true-negative (TN), and false-negative (FN) results were used to compute SE, SP, PLR, NLR, and the diagnostic odds ratio (DOR). SE and SP with their corresponding 95% confidence intervals (CIs) were pooled using the bivariate random-effects model. The analyses were done using MIDAS in STATA software (version 16.0), which jointly models SE and SP. The pooled PLR describes how much more likely a positive test result is in nodules with cancer than in those without cancer, while the pooled NLR describes how much more likely a negative test result is in nodules with cancer than in those without cancer. The DOR, the ratio of PLR to NLR, ranging from zero to infinity, was derived to estimate the diagnostic accuracy. Also, PPV, the proportion of individuals with positive test results who are correctly identified as having malignant disease, and NPV, the proportion of patients with negative test results who truly have benign nodules, were calculated. When computing the PPV and NPV estimates, we quantified the prevalence in a given population by specifying a prior distribution, f(p), on p, following the recommendations described by Li et al. (46). Specifically, we estimated the prevalence in each study and used the lowest and highest prevalence rates as interval limits in the pddam command from midas (i.e., midas tp fp fn tn, pddam (lbp ubp)). Finally, we determined the benign call rate (BCR) as the percentage of molecular tests that result in a benign or negative test result.
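As an illustration of the per-study accuracy measures defined above, the following minimal Python sketch computes them from a single 2×2 table. The function name `diagnostic_metrics` and the example counts are hypothetical and are not taken from any included study. The pooled estimates in this review come from the bivariate model in MIDAS, not from this simple per-table calculation.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Per-study accuracy measures from a 2x2 table (illustrative sketch).

    Assumes no zero cells; studies with zero cells would need a
    continuity correction, which is not handled here.
    """
    se = tp / (tp + fn)      # sensitivity: positives among malignant nodules
    sp = tn / (tn + fp)      # specificity: negatives among benign nodules
    plr = se / (1 - sp)      # positive likelihood ratio
    nlr = (1 - se) / sp      # negative likelihood ratio
    dor = plr / nlr          # diagnostic odds ratio, equals (tp*tn)/(fp*fn)
    ppv = tp / (tp + fp)     # positive predictive value at the study prevalence
    npv = tn / (tn + fn)     # negative predictive value at the study prevalence
    return {"se": se, "sp": sp, "plr": plr, "nlr": nlr,
            "dor": dor, "ppv": ppv, "npv": npv}


# Hypothetical counts chosen only to make the arithmetic easy to follow.
m = diagnostic_metrics(tp=90, fp=40, fn=10, tn=60)
for name, value in m.items():
    print(f"{name.upper()}: {value:.2f}")
```

Note that PPV and NPV computed this way depend on the cancer prevalence among resected nodules in the study, which is why the review models prevalence explicitly when deriving predictive values.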
To provide inferences regarding diagnostic quality, we plotted a Summary Receiver Operating Characteristic (SROC) curve for each panel. The area under the curve (AUC) was used to estimate the panel's diagnostic accuracy. Furthermore, we conducted a series of sensitivity analyses looking at the pooled SE and SP when NIFTP was excluded from the malignant histologies, at different Bethesda categories, and in studies in which the authors were paid as employees of a pharmaceutical company.
We assessed heterogeneity across studies through the I² statistic, and we used a bagplot to examine the spread of the observed data and identify outliers. We examined each panel's clinical utility using Fagan plots with pre-specified probabilities of 25, 50, and 75% respectively. Evidence of publication bias was assessed through Deeks's funnel plot.
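The calculation underlying a Fagan plot can be sketched in a few lines of Python: the pre-test probability is converted to odds, multiplied by the likelihood ratio, and converted back to a probability. The function name `post_test_probability` is illustrative; the likelihood ratios used in the example are the pooled PLR of 3.5 and NLR of 0.18 that this review reports for Thyroseq v2.

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form, as read off a Fagan nomogram."""
    if not 0 < pre_test_prob < 1:
        raise ValueError("pre_test_prob must be strictly between 0 and 1")
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Pre-specified pre-test probabilities from the Methods: 25, 50, and 75%.
PLR, NLR = 3.5, 0.18  # pooled Thyroseq v2 values reported in this review
for prior in (0.25, 0.50, 0.75):
    pos = post_test_probability(prior, PLR)
    neg = post_test_probability(prior, NLR)
    print(f"prior {prior:.0%}: positive test {pos:.0%}, negative test {neg:.0%}")
# prior 25%: positive test 54%, negative test 6%
# prior 50%: positive test 78%, negative test 15%
# prior 75%: positive test 91%, negative test 35%
```

The 25% and 75% rows correspond to the low- and high-suspicion scenarios discussed for each panel.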

Ethical Approval
This article does not contain examinations performed on human participants. Thus, ethical approval was not necessary.

Literature Search
Our literature search in PUBMED, Web of Science, and Scopus databases until September 05, 2020, identified 485 potentially relevant publications. An additional seven studies were found by hand-searching the reference lists of review papers and original reports. After removing duplicates, we identified 207 abstracts. We excluded a total of 139 records: studies irrelevant to the current analysis, papers with clinical and US follow-up only as the reference standard, evaluations of different preparation smears, studies evaluating lymph nodes or residual FNA rinse samples, analytical validation studies, review articles, case reports, comments, letters, or replies. The remaining 68 articles were deemed relevant by title and abstract alone. Based on the readings of the full-text articles, we excluded 28 articles for the reasons detailed below. Figure 1 illustrates the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow-chart of the study selection process.

Participant and Study Characteristics
We included in the review a total of 40 articles with 50 assessments of association between seven molecular panels and postsurgical histological evaluation (8, 11, 14-17, 21, 23, 40, 47-77). Table 1 summarizes the characteristics of the included studies. All 40 articles were published in English. The publication year ranged from 2012 to 2019, while the populations were enrolled between September 2009 and June 2019. All but one study were conducted in the USA, the exception originating in Singapore (73). A minority of the studies had a prospective design (n = 10) (8,14,15,21,50,55,57,64,77), of which one was a parallel randomized study (61), and another two studies enrolled patients both retro- and prospectively (11,75).

Excluded Studies
Based on the readings of the full-text articles, we excluded 28 articles for the following reasons: only enrolled nodules with benign test results (n = 4) (78-81) or suspicious test results (n = 1) (82), evaluated nodules with benign or malignant cytology (n = 2) (83, 84), did not perform surgery and consequently did not provide a reference standard in nodules with benign test results (n = 7) (85-91), an overlap of the participants with other studies (n = 8) (92-99), used freshly collected FNA samples as the reference standard (n = 1) (100), unavailable statistical analysis (n = 4) (13,22,101,102), and unavailable full-text article (n = 1). Finally, 40 articles met initial eligibility criteria and were systematically reviewed and abstracted. We included all of them in the quantitative analysis.
The number of GSC negative results increased extensively to 72% compared with GEC. The surgery rate among the nodules with valid GSC results was lower than for GEC (53.3%). We noticed a significant gap between the percentage of resected nodules with suspicious (79.7%) and negative test results (36.3%). Following surgery and histological evaluation, 125 of 310 (40.3%) nodules were found malignant and two TNs were labeled as NIFTP. GSC's SE and SP across studies ranged from 90.6% to 100% and 28.6% to 68.3%, respectively.

Thyroseq Next Generation Sequencing (NGS)
Nine studies involving 1,549 Bethesda III, IV, and V TNs of 1,498 participants evaluated Thyroseq v2 (11,15,58,59,61,62,71,73,75). The recruitment period ranged from June 2012 until June 2017. The reported quality failure proportion of the Thyroseq v2 was exceptionally low (13, 0.8%). We have found negative test results in three quarters (74%) of the investigated nodules. The percentage of surgical resections among the nodules with valid Thyroseq results was 53%, with a significant gap between resections of those with high-risk test results (91%) and negative test results (39%). Following surgery and histological evaluation, 238 of 808 (29.4%) nodules were found malignant. Three studies reported the number of NIFTP lesions (14,59,65), of which 13 had a positive test result and for six the result was not reported. The SE and SP of Thyroseq v2 ranged from 70 to 100% and 44 to 93%, respectively, across studies.
An additional four studies, including 603 TNs from 549 patients, evaluated Thyroseq v3 (14,55,59,65). The reported number of non-diagnostic results of the Thyroseq v3 was 33 (5.5%), ranging from 1.  (23). In a similar two-step approach, MPTX is reported as negative when no mutations (ThyGeNEXT) are detected and the miRNA (ThyraMIR) test is negative; as positive when a strong driver mutation is detected or when the miRNA test is positive; and as moderate when a weak driver mutation is detected and the miRNA test is negative or moderate, or when no mutations are detected and the miRNA test is moderate. All included ITNs underwent surgical resection, which revealed 115 (58.3%) malignancies and 5 NIFTP cases, all with positive MPTX diagnosis. The calculated SE and SP of MPTX are 94.3% and 61.4%, respectively.

MiRNA-Based Platforms: RosettaGX Reveal and miRInform
Three studies enrolled 234 cytologically ITNs and tested them with RosettaGX Reveal molecular panel from 2015 to 2018 (60,67,68). The reported number of non-diagnostic results was 12 (5.1%), ranging from 0 to 6.0% among individual studies. The number of RosettaGX Reveal negative and positive results were approximately equal (53% vs. 47%). The nodules' surgery rate with a valid test result was 99%. After surgical treatment, the histological assessment revealed 72 of 120 (60%) malignant tumors. A single NIFTP case with a positive RosettaGX result was recorded following histological assessment (68

Quality Assessment
Two reviewers (SCA and LV) critically assessed the quality of the 40 studies in the qualitative analysis using the QUADAS-2 tool (45). We used graphs (Figure 2) and a table (Supplementary Table 3) to present results for each domain's risk of bias and applicability concerns. Since many studies evaluated multiple index tests, we divided them into several groups, one per index test, raising the total number of appraisals to 50.
If considering the risk of bias for each molecular test, studies evaluating miRNA-based platforms and Thyroseq v3 seem to outperform in terms of flow and timing, as the histological evaluation was available for the majority of included participants. However, this relative superiority is countered by the limited number of studies for these assays. Also, miRNA-based panel Interpace has shown the lowest risk of bias concerning index test, as in two of three studies the molecular testing was performed blind to histological diagnosis (23,67). For other criteria, the quality concerns were similarly high for all tests.
There is a low concern regarding applicability that the included patients do not match the review question as just a few manuscripts restricted the cohort to ITNs with Hurthle cell pattern (52) or Hashimoto thyroiditis (66). Besides, there is a low applicability concern that the conduct or interpretation of the index test differ from the review question in all but three articles in which the choice to order GEC or referral for surgical evaluation was made by the individual clinical provider (56) or molecular test results reported together, such as Afirma GSC with GEC (68) or Thyroseq v1 with Thyroseq v2 (71). Additionally, there is an unclear applicability concern in several studies that did not report the histological subtypes after surgical treatment (57,58,66,71).
Due to the limited number of studies labeled with a low risk of bias, we could not perform sensitivity analyses to explore the influence of the studies' quality on the results.
The AUC value from the SROC curve, displayed in Supplementary Figure 2, was 0.90 (95% CI: 0.87 to 0.92), indicating an excellent overall detection by the Afirma GSC panel. Also, Afirma GSC produced a modest magnitude of change in test-positive cases based on a PLR of 1.9 (95% CI: 1.3-2.8) but stronger evidence to change the probability in test-negative cases according to an NLR of 0.11 (95% CI: 0.04-0.27), as seen in Table 2. Besides, the DOR of 18 was lower than that of Thyroseq v3 but was associated with a narrower 95% CI (6-50) (103, 104). However, as we had only four studies on which to base our estimation, we could not perform the sensitivity analyses for the impact of NIFTP reclassification, TBSRTC categories, and declared conflicts of interest for this panel.

Diagnostic Performance of Thyroseq v2
A total of nine studies have looked at the diagnostic accuracy of Thyroseq v2 (11,15,58,59,61,62,71,73,75). The forest plot is shown in Figure 5 Figure 4) showed that in the low suspicion of malignancy scenario (25%), a PLR of 3.5 increases the post-test probability for a positive test result to 54%, whereas an NLR of 0.18 reduces the post-test probability to 6% for a negative test result. On the other hand, given a pre-test probability of 75% in the high suspicion scenario, a positive posterior probability of 91% could be considered to diagnose TC and the post-test probability was 35% for a negative test result. Also, we computed the DOR, which showed a similar value to Afirma GSC but a narrower 95% CI [19; (9-42)].

Impact of Bethesda Categories and Conflict of Interests on Thyroseq v2 Diagnostic Performance
Looking specifically at the Bethesda III category TNs, the four studies included (15,71,73,75) Figures 5-8).
It has not been possible to compute a sensitivity analysis for repeated FNAs due to the limited number of studies evaluating Thyroseq v2.

Diagnostic Performance of Afirma GEC
The forest plot summarizing the data from the 25 studies involving the Afirma GEC assay in diagnosing TC is shown in Figure 6. As high heterogeneity between studies was observed in SE and SP data (I² = 57%, 95% CI: 38 to 76, and I² = 85%, 95% CI: 80 to 90, respectively), a random-effects model was applied for the meta-analysis. The overall SE and SP were 0.97 (95% CI: 0.93 to 0.98) and 0.19 (95% CI: 0.15 to 0.24), and PPV and NPV were 0.39 (95% CI: 0.37-0.40) and 0.91 (95% CI: 0.88-0.93), respectively. Afirma GEC showed the lowest DOR of 7, in conjunction with a narrow 95% CI of 3 to 13, and a BCR of 42%. The SROC curve is presented in Supplementary Figure 11; the corresponding AUC was 0.61 (95% CI: 0.56 to 0.65), indicating a low overall detection.
In the low suspicion of TC scenario (25%) of Fagan's nomogram, the post-test probability for a positive test result was 28%, whereas an NLR of 0.18 reduced the post-test probability to 6% for a negative test result (Supplementary Figure 15). On the other hand, given a pre-test probability of 75% in the high suspicion scenario, the positive posterior probability increases to 78%, and the negative posterior probability decreases to 35%.
Considering the small study effects, the Deeks' funnel plot for the 25 studies included in our meta-analysis indicated no evidence of publication bias (p = 0.19 for Deeks' funnel plot asymmetry test; see Supplementary Figure 16).

Impact of NIFTP Cases Reclassification on Afirma GEC and Thyroseq v2 Diagnostic Performance
To investigate the impact of the revised nomenclature of encapsulated FVPTC and NIFTP reclassification on the molecular test performance, we included in the analysis only studies (17,40,51,56,58,61,68,72) where the NIFTPs and their test results were reported. Regarding Afirma GEC, we have observed a slight increase in SE (0.98, 95% CI: 0.85 to 1.00) and a decreased overall SP of 0.14 (95% CI: 0.11 to 0.19 Supplementary Figures 27, 28).
We could not perform analogous analysis for the rest of the molecular tests due to the limited number of studies.

DISCUSSION
Molecular tests are increasingly used as auxiliary diagnostic tools aimed to help avoid both diagnostic and completion surgeries in cytologically ITNs. Previous panels, Thyroseq v2 and Afirma GEC, have proven shortcomings in malignancy detection performance. To the best of our knowledge, the present study is the first one to provide a comprehensive analysis of the novel molecular tests, Thyroseq v3, Afirma GSC, multiplatform, and miRNA-based assays, for the malignancy assessment in ITNs. According to the predominant ability to exclude or confirm a malignancy, the molecular panels are classified as "rule-in" or "rule-out" tests (105). Vargas-Salas et al. showed that, considering the cancer prevalence range of 20-40%, a robust "rule-out" test would require an NPV of at least 94% and a minimum SE of 90%, while for a desirable test to predict or "rule-in" malignancy, an optimal standard would be a PPV of at least 60% and an SP above 80%. These parameters are associated with both optimal clinical accuracy and clinical effectiveness (105). A "rule-out" test will perform better in a low-risk TN at US or in a cytologic category of low cancer frequency such as Bethesda III or IV category (106). Sonographically high-risk TNs or categories of higher cancer frequency such as Bethesda V would benefit more from a "rule-in" test, in which case a positive test result would decrease the risk of completion surgery (106).
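The thresholds cited from Vargas-Salas et al. can be written as a simple check. The sketch below is only an illustration of those cut-offs: the function name `classify_test` and the input values are hypothetical, and the thresholds are meant to apply within the stated cancer prevalence range of 20-40%.

```python
def classify_test(se: float, sp: float, ppv: float, npv: float) -> str:
    """Apply the rule-out / rule-in cut-offs quoted in the text.

    rule-out: NPV of at least 94% and SE of at least 90%
    rule-in:  PPV of at least 60% and SP above 80%
    All inputs are proportions between 0 and 1.
    """
    rule_out = npv >= 0.94 and se >= 0.90
    rule_in = ppv >= 0.60 and sp > 0.80
    if rule_out and rule_in:
        return "rule-in and rule-out"
    if rule_out:
        return "rule-out"
    if rule_in:
        return "rule-in"
    return "neither"


# Hypothetical test profiles, not values from the included studies.
print(classify_test(se=0.95, sp=0.50, ppv=0.45, npv=0.96))  # rule-out
print(classify_test(se=0.70, sp=0.90, ppv=0.75, npv=0.85))  # rule-in
```

Because PPV and NPV shift with prevalence, the same assay can satisfy these criteria in one Bethesda category and fail them in another.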
Our results suggest that ThyroSeq v3 shows excellent diagnostic accuracy compared with its prior iteration based on an AUC of 0.95. Also, Thyroseq v3 showed the lowest NLR of 0.02, making it the most accurate test to exclude malignancy. However, the SE and NLR improved at the expense of decreasing SP and PLR, reducing the ability to confirm malignancy. The validity of these results is still questionable, considering the small number of studies evaluating this panel and data instability due to outliers; hence, the ability of Thyroseq v3 to "rule-in" malignancy should be confirmed in future studies. Besides, in theoretical modeling, Thyroseq v3 was slightly more cost-effective than Afirma GSC and considerably more cost-effective than diagnostic lobectomy (107).
Afirma GSC only partially achieved its original objective of improving on the "rule-in" properties of GEC, given the modest increase in SP and PLR. However, GSC substantially improved the NLR to 0.11 and the BCR from 42 to 73%, making GSC an even better "rule-out" test than its predecessor. These findings are in line with previous literature, which showed a significantly higher BCR for Afirma GSC than for Afirma GEC (65.3% vs 43.8%) (108). The overall performance of Afirma GSC is considerably improved, given the increase in the AUC to 0.90 and the DOR to 18. GSC could, therefore, be an excellent "rule-out" test. However, its "rule-in" properties have not been confirmed; the management of cases with suspicious test results should therefore also take into account other clinical, US, and cytological characteristics.
Based on the pooled results from nine studies, Thyroseq v2 shows a good overall performance, with an AUC of 0.88 and a DOR of 19, similar to Afirma GSC. Thyroseq v2 also showed the highest PLR, making it the first option among the available tests to confirm malignancy. However, a PLR of 3.5 considered alone produces only a small shift in malignancy probability. Therefore, the strength of Thyroseq v2 continues to lie in its "rule-out" features, considering the NLR of 0.18, which can shift the post-test probability in the low-suspicion scenario from 25 to 6%. When separate analyses by TBSRTC category were computed, a slight decrease in SE and an increase in SP were noticed among Bethesda IV compared with Bethesda III TNs, suggesting that Thyroseq v2 could be more effective at ruling in malignancy in TNs with a higher pre-test prevalence of malignancy. Industry sponsorship and conflicts of interest did not affect the results, except for a slight decrease in SP (42).
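The shift from 25 to 6% follows from converting the pre-test probability to odds, multiplying by the likelihood ratio, and converting back to a probability. The sketch below reproduces this arithmetic; the function name is hypothetical, and the 25% pre-test probability is the low-suspicion scenario mentioned above.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Update a pre-test probability with a likelihood ratio (odds form of Bayes' rule)."""
    if not 0 < pre_test_probability < 1:
        raise ValueError("pre_test_probability must lie strictly between 0 and 1")
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


if __name__ == "__main__":
    # Negative Thyroseq v2 result (NLR 0.18) at a 25% pre-test probability.
    after_negative = post_test_probability(0.25, 0.18)
    # Positive Thyroseq v2 result (PLR 3.5) at the same pre-test probability.
    after_positive = post_test_probability(0.25, 3.5)
    print(f"after negative test: {after_negative:.3f}")
    print(f"after positive test: {after_positive:.3f}")
```

With these inputs, a negative result lowers the probability of malignancy to about 6%, whereas a positive result raises it only to about 54%, which illustrates why a PLR of 3.5 on its own is a modest "rule-in" signal.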
However, the clinical utility of this molecular test remains controversial, mainly because the surgery rate did not decrease and the additional cost of Thyroseq v2 can increase the overall cost of care for patients with ITNs (13,109). Moreover, the introduction of Thyroseq v2 resulted in a shift toward indeterminate cytology results (13).
Regarding Afirma GEC, our analysis based on the pooled results across 25 articles showed unsatisfactory overall diagnostic performance (AUC 0.60) and a poor ability to confirm malignancy, given the PLR of 1.2. However, when patients were segregated by TBSRTC category, Afirma GEC reached an AUC of 0.83 for AUS/FLUS and 0.95 for FN/SFN. Performing the Afirma GEC test only in persistently indeterminate TNs could also increase the AUC of GEC. In this regard, several studies claim that 10% to 40% of AUS/FLUS and SUSP nodules are reclassified after repeat FNA, usually into a benign category (110-112), thus affecting the accuracy of the results. It seems that industry sponsorship and conflicts of interest could affect the results for Afirma GEC accuracy. Therefore, based on the optimal NLR, Afirma GEC could be helpful as a "rule-out" test, especially in Bethesda III and IV lesions. It might help in predicting benign TNs in cytologic categories with low cancer frequency, in TNs that are low risk at US, or when clinical follow-up is recommended instead of diagnostic surgery.
Recently, Liu et al. performed a meta-analysis assessing the diagnostic performance of Afirma GEC. Similar to our results, they showed that Afirma GEC has a relatively high SE of 95.5% but a low SP of 22.1% and a DOR of 5.25, concluding that the outcome for over half of the nodules with a GEC-suspicious result is still uncertain, which limits its use in clinical practice (42). Interestingly, the routine use of Afirma GEC in clinical practice seemed to increase the incidence of indeterminate FNA diagnoses, whereas the incidence of benign diagnoses significantly decreased. These results suggest that Afirma GEC may shift FNA interpretation toward Bethesda III/IV, the categories in which molecular testing is used. Moreover, the surgery rate did not appear to change in an institutional retrospective study, raising uncertainty regarding the benefits of this molecular assay in risk stratification (69). Other authors have shown overtreatment among patients whose management was decided on the basis of this test result (113).
Due to the limited number of studies, we could not compute separate analyses for Interpace's multiplatform tests, RosettaGX Reveal, and miRInform in the MIDAS framework, which requires a Gaussian quadrature (114). For this reason, we have reported the SE and SP ranges of the abovementioned molecular panels across the studies as preliminary evidence. In this regard, the Interpace multiplatform approach provided an optimal SE across studies but a slightly lower SP than that claimed for its predecessor, miRInform. Finally, the recently introduced RosettaGX Reveal reported an optimal diagnostic accuracy. However, there is serious concern about the instability of the results, especially for the Interpace platform, which combines two separate panels, and future studies are needed to validate these diagnostic tests and their clinical utility.
The secondary objective of the research was to investigate the impact of the revised nomenclature of encapsulated FVPTC and NIFTP reclassification on the performance of the aforementioned molecular tests (28). Our findings support that the performance of Afirma GEC and Thyroseq v2 was affected by NIFTP reclassification, due to the increase in the FP rate. As would be expected from a "rule-out" test, the SE and SP of Afirma GEC were not significantly affected, even though the AUC markedly dropped. For Thyroseq v2, however, a more critical change was noticed, especially in SP. Reflecting a trend similar to the present results, a recent analysis by Sahli et al. reported an insignificant decrease in SE and SP for Afirma GEC and a more critical change in the diagnostic performance of Thyroseq v2 after the addition of the new diagnostic entity (38). They also found a decrease in PPV from 47 to 38% for Afirma GEC and from 83 to 29% for Thyroseq v2 (38).
The reclassification of NIFTP lesions from malignant to premalignant has an important impact on the diagnostic performance of molecular tests. It was described previously that Afirma GEC and Thyroseq v2 can detect genetic alterations such as RAS gene mutations, THADA fusions, PPARγ-PAX8 fusions, and the BRAF K601 mutation (28,115). Due to the presence of RAS mutations in a significant number of NIFTPs (116), molecular panels will mark NIFTP as "suspicious" for malignancy (115). Moreover, because of the wide variability of genetic mutations among benign thyroid lesions, current genetic testing results require cautious interpretation (117) and recalibration to appropriately account for NIFTPs.
A first limitation of this review and meta-analysis was that the analyzed diagnostic tests could not be compared and ranked, due to the limited number of studies with direct head-to-head comparisons. Second, only patients with surgical pathology were considered, thereby excluding many nodules that were benign on molecular testing and managed conservatively. The rationale behind this decision is the inferior reliability of clinical and sonographic follow-up compared with histopathology, which is considered the diagnostic gold standard, especially because the mean follow-up was less than 2 years in most of the studies. Moreover, statistically, the evidence comparing an assay with the gold standard (i.e., surgery) and the evidence comparing it with conservative methods (i.e., sonographic follow-up) should be treated as different analyses, because mixing them could bias the results of pairwise meta-analyses (118). Thus, the decision to proceed otherwise would have led to differential reference bias. Third, final pathology was unavailable for some nodules, especially those with a benign test result, because conservative management was chosen. Fourth, all the studies were performed in the US population, raising some concerns regarding the extrapolation of the results to the rest of the world. Finally, the overall unclear methodological quality of the included studies could have led to inaccurate assumptions.
In most TCs, genetic alterations are mutually exclusive events (119). Some mutations, like BRAF V600E and TERT, are highly specific, showing almost a 100% risk of PTC (120,121). However, the understanding of the impact of RAS mutations or PAX8/PPARγ rearrangements is still evolving, since they show considerable overlap among different morphological entities. RAS mutations, RET/PTC, and PAX8/PPARγ rearrangements were detected in up to 48, 68, and 55% of all benign nodules, respectively, while some malignant lesions showed no mutations at all (122). The variable number and types of mutations among benign nodules may explain the low SP and PPV of Afirma GEC (122) and may also challenge the reported PPV of Thyroseq v2 (14). Newer products, Afirma GSC and Thyroseq v3, have begun to address the challenges discussed above (122). As experience accumulates, we will gain deeper insight into how well they mitigate these challenges.
The development of new biomarkers in TCs will most likely lead to enhanced versions of current tests or to the development of new ones. The ultimate goal of molecular testing of cytological samples from ITNs is to add evidence for or against the need for surgical treatment and to guide the extent of surgery, in order to achieve the best outcome for the individual patient. Thus, it will be necessary to determine whether negative test results indeed decrease the number of unnecessary surgeries and whether positive results reduce the rate of completion surgeries. In addition, new hopes are directed towards the updated Afirma GSC and XA reports. The impact of Afirma XA could extend beyond informing on the risk of cancer when the test result is negative or positive for a specific genomic alteration. It gives potential insights into the molecular analysis of FNA specimens, claiming to inform about the associated neoplasm types, prognosis, the identification of molecular targets for systemic therapy, and the recognition of potential hereditary syndromes (18,20). Future evidence is needed to validate the real-world performance of Afirma XA.

CONCLUSIONS
Summarizing the data obtained in this comprehensive meta-analysis, we conclude that there is currently no perfect molecular panel to discriminate malignancy in ITNs. However, each of the tests above has its strong points and can be used in particular situations. Our results suggest that Thyroseq v3 has the best overall diagnostic performance, followed by Afirma GSC and Thyroseq v2, which were similar in terms of AUC and DOR. In terms of "rule-out" performance, Thyroseq v3 showed the most noticeable results, being able to generate a large shift in cancer probability after a negative test result. However, optimal results to exclude malignancy can also be achieved with Afirma GSC and with the previous tests that are no longer available, Afirma GEC and Thyroseq v2. Regarding "rule-in" properties, the recently developed Thyroseq v3 and Afirma GSC failed to achieve a higher performance to confirm malignancy, being surpassed by Thyroseq v2. In addition, MPTX and RosettaGX show excellent preliminary results, and future studies are needed to validate them. The quality of evidence seems to be higher for Thyroseq v3, notwithstanding the limited number of studies.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.

AUTHOR CONTRIBUTIONS
CAS and VL conceived and designed the research, drafted the protocol, abstracted the data from the included articles, and participated in writing the manuscript. RDG and AD conducted the statistical analysis/meta-analysis. SS, BAN, and CEG participated in the search, screening, and analysis of the literature. HS supervised the research, contributed to project administration, and critically revised the manuscript. All authors contributed to the article and approved the submitted version.