A Serum Circulating miRNA Signature for Short-Term Risk of Progression to Active Tuberculosis Among Household Contacts

Biomarkers that predict who among recently Mycobacterium tuberculosis (MTB)-exposed individuals will progress to active tuberculosis are urgently needed. Intracellular microRNAs (miRNAs) regulate the host response to MTB, and circulating miRNAs (c-miRNAs) have been developed as biomarkers for other diseases. We performed machine-learning analysis of c-miRNA measurements in the serum of adult household contacts (HHCs) of TB index cases from South Africa and Uganda and developed a c-miRNA-based signature of risk for progression to active TB. In an independent test set, this c-miRNA-based signature significantly discriminated HHCs within 6 months of progression to active disease from HHCs who remained healthy [area under the ROC curve (AUC) 0.74 for progressors within 6 months of active TB; AUC 0.66 for progressors up to 24 months from active TB], and it complements the predictions of a previous cellular mRNA-based signature of TB risk.

INTRODUCTION
Almost one-fourth of the global population carries a latent Mycobacterium tuberculosis (MTB) infection (1) and is at risk of progressing to active tuberculosis. Known risk factors for progression, such as co-infection with HIV and potentially age of first exposure (2) can only explain a fraction of active disease, thus novel diagnostic and prognostic tests are needed to identify those most likely to progress (3). Accurate identification of individuals likely at high risk of active TB would facilitate prophylactic treatment strategies, potentially curing the TB infection before it progresses to its highly infectious symptomatic stage. As a first step toward this objective, we recently described a blood RNA-based correlate of risk (RNA-CoR) for progression to active TB based on splice-junction abundance from 16 interferon-response genes (4). This RNA-CoR was discovered in a South African cohort of MTB latently infected adolescents and validated using samples from South African and Gambian cohorts of household contacts (HHCs) of MTB index cases. While the results for the RNA-CoR are promising, the sensitivity and specificity of the signature were limited and there is a need to determine whether performance can be augmented using alternative approaches. The predictive power of the RNA-CoR is improved for patients close to progression to active TB. This is consistent with detection of subclinical incipient TB prior to the onset of disease symptoms. Other effective biomarkers could reflect underlying long-term risk factors that predispose individuals to develop active, rather than latent, TB after an exposure event. Exploring alternatives to whole-blood mRNA expression measurements may facilitate the discovery of these factors.
MicroRNAs (miRNAs) are small, non-coding RNAs that, as part of enzymatic protein complexes, execute post-transcriptional regulation of gene expression (5). Recent studies have demonstrated important roles for specific miRNAs during MTB infection (6). Although the established functions of miRNAs are intracellular, numerous studies have detected highly stable extracellular circulating miRNAs (c-miRNAs) in blood (7). These c-miRNAs have been explored as biomarkers for infectious diseases, including TB (8).
In this study, we evaluate c-miRNAs as candidate biomarkers for risk of TB disease progression in HHCs. These analyses make use of serum samples collected from prospective HHC cohort studies carried out in South Africa and Uganda as part of the Bill and Melinda Gates Foundation-funded Grand Challenges 6-74 program (GC6-74). Serum samples were collected from HHCs at enrollment (within 2 months of exposure) and at 6 and 18 months after enrollment if participants remained disease free. TB progressors were defined as individuals who developed intrathoracic TB within the study period based on one of the following two criteria: (1) positive TB sputum culture coupled with at least one of the following: positive chest X-ray, positive acid-fast bacilli (AFB) sputum smear, a second positive TB sputum culture from an independent sample, or clinical symptoms consistent with active TB; or (2) positive AFB sputum smear coupled with a positive chest X-ray or a second positive AFB sputum smear from an independent sample. Co-incident TB cases, defined as HHCs who developed TB within 3 months of exposure, were excluded from all further analyses. At study end, controls were selected from the individuals who had remained free of active TB for the 2-year study period and matched to cases by study site, sex, age (four age groups: <18, 18-25, 25-36, >36), and year of enrollment (three enrollment groups: 2006-2007, 2008, 2009-2010). Two to three matched controls were included for each progressor. Case-control assignment was performed prior to quantification of c-miRNA levels to ensure a blinded case-control design. Prior to analysis, South African samples were split into discovery and validation sets; all Ugandan samples were apportioned to the validation set.

Serum c-miRNA Profiling and Selection
Quantification of serum c-miRNA levels was performed by Exiqon Inc. (Vedbaek, Denmark) using qRT-PCR with locked nucleic acid primers as previously described (9). Briefly, total RNA was extracted from serum using the miRCURY™ RNA isolation kit-biofluids (Exiqon, Inc., Vedbaek, Denmark) as follows. Serum/plasma was thawed on ice and centrifuged at 3,000 × g for 5 min in a 4°C microcentrifuge. An aliquot of 200 µL of serum/plasma per sample was transferred to a new microcentrifuge tube and 60 µL of Lysis Solution BF containing 1 µg carrier-RNA per 60 µL Lysis Solution BF and RNA spike-in template mixture was added to the sample. The tube was vortexed and incubated for 3 min at room temperature, followed by addition of 20 µL Protein Precipitation solution BF. The tube was vortexed, incubated for 1 min at room temperature, and centrifuged at 11,000 × g for 3 min. The clear supernatant was transferred to a new collection tube, and 270 µL isopropanol was added. The solutions were vortexed and transferred to a binding column. The column was incubated for 2 min at room temperature and emptied using a vacuum-manifold. Next, 100 µL wash solution 1 BF was added to the columns. The liquid was removed using a vacuum-manifold, and 700 µL wash solution 2 BF was added. The liquid was removed using a vacuum-manifold. Then, 250 µL wash solution was added and the column was spun at 11,000 × g to dry the columns entirely. The dry columns were transferred to a new collection tube and 50 µL RNase free H2O was added directly on the membrane of the spin column. The column was incubated for 1 min at room temperature prior to centrifugation at 11,000 × g. The RNA was stored in a −80°C freezer.
For each sample, 2 µL RNA was reverse transcribed in 10 µL reactions using the miRCURY LNA™ Universal RT microRNA PCR, Polyadenylation, and cDNA synthesis kit (Exiqon, Inc., Vedbaek, Denmark). cDNA was diluted 50× and assayed in 10 µL PCR reactions according to the protocol for miRCURY LNA™ Universal RT microRNA PCR; each microRNA was assayed once by qPCR on the microRNA Ready-to-Use PCR, Pick-n-Mix using ExiLENT SYBR® Green master mix. Negative controls excluding template from the reverse transcription reaction were performed and profiled like the samples. The amplification was performed in a LightCycler® 480 Real-Time PCR System (Roche) in 384 well plates. The amplification curves were analyzed using the Roche LC software, both for determination of Cq (by the second derivative method) and for melting curve analysis. Two technical replicates were performed for each sample, and mean Ct values for each c-miRNA in each sample, along with experimental metadata, are provided in Table S1 in Supplementary Material.
An initial panel of 608 c-miRNAs was considered for analysis, based on miRNA primers suggested by Exiqon, Inc., including c-miRNAs previously suggested as potential biomarkers (Table S2 in Supplementary Material). This panel was down-selected to 164 c-miRNAs (Table S2 in Supplementary Material) based on detectable expression in >80% of samples and association with progression in a subset of 40 discovery set samples. The technical replicability of each of the 164 initial candidate miRNAs was then assessed by rerunning the PCR quantification of the candidate miRNA, resulting in two technical replicates for each sample. The quality of the replicates was assessed by measuring the Pearson correlation of individual miRNAs between technical replicates. We observed a strong, non-linear relationship between miRNA expression (as measured by Ct) and technical replicability. In particular, a sharp decline in replicability was observed for miRNAs with mean Ct values greater than 32, indicative of low levels of c-miRNA (Figure S4 in Supplementary Material). A final panel of 47 candidate miRNAs was thus selected, comprising miRNAs expressed at reliably detectable levels (Ct < 32) in serum. PCR quantification of these 47 miRNAs was then run on all samples, including the pilot study samples.
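The replicability check described above can be illustrated with a minimal Python sketch (the study's own analysis was carried out in R). It computes, for each miRNA, the Pearson correlation between two technical replicates and flags miRNAs whose mean Ct falls below a cutoff of 32. The miRNA names, Ct values, data layout, and the function name `select_reliable` are hypothetical and serve only as illustration.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def select_reliable(rep1, rep2, ct_cutoff=32.0):
    """rep1, rep2: dict of miRNA name -> Ct values per sample (same sample order).

    Returns, per miRNA, the mean Ct over both replicates, the replicate
    correlation, and whether the mean Ct passes the cutoff.
    """
    report = {}
    for mirna in rep1:
        a, b = rep1[mirna], rep2[mirna]
        mean_ct = mean(a + b)
        report[mirna] = {
            "mean_ct": mean_ct,
            "r": pearson(a, b),
            "keep": mean_ct < ct_cutoff,
        }
    return report

# Hypothetical Ct values: one abundant miRNA and one low-abundance miRNA.
rep1 = {"miR-A": [24.1, 25.3, 23.8, 26.0], "miR-B": [33.9, 35.2, 34.1, 36.5]}
rep2 = {"miR-A": [24.3, 25.1, 23.9, 26.2], "miR-B": [35.8, 33.7, 36.0, 34.2]}
report = select_reliable(rep1, rep2)
```

In this toy input the abundant miRNA replicates closely and is kept, whereas the low-abundance miRNA replicates poorly and is dropped by the Ct cutoff.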

Normalization of PCR c-miRNA Data
As the abundance of c-miRNAs in serum is relatively low and varies across conditions, there is currently no universally accepted set of reference "housekeeping" c-miRNAs or universally accepted approach for standardizing c-miRNA profiles in order to maximize comparability across samples. To address this issue, we explicitly evaluated multiple normalization approaches within the suite of machine-learning approaches employed to generate predictive signatures. If a particular normalization strategy were strongly superior or inferior to the others, this difference would be evident as increased or decreased predictive accuracy when assessed during cross-validation of the discovery set. The normalization strategies that we investigated were variants of two classes. In the first class, subsets of potential reference c-miRNAs were selected by ranking the final panel of 47 c-miRNAs by the magnitude of Spearman rank correlation between the c-miRNA and the overall sample mean of the Cts of all 47 miRNAs. The assumption behind this approach is that any universal difference in c-miRNA abundance between samples would be due to technical reasons (such as a smaller or less concentrated plasma aliquot) as opposed to biological reasons. The c-miRNAs with the top 1, 3, 5, 10, or 20 rank correlations to the overall sample mean would be selected as reference c-miRNAs and then averaged within each sample to generate per-sample normalization constants. Alternatively, for the second class of approaches, the per-sample normalization constants were generated by taking the mean, median, or 25% trimmed-mean computed from all 47 assayed c-miRNAs. The Cts for a given sample were then normalized by subtracting the value of the normalization constant from the Ct of each c-miRNA. This gave a total of eight normalized datasets: trimmed-mean, trimmed-median, 1-ref, 3
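The subtraction step shared by both classes of normalization can be sketched as follows. This is a minimal Python illustration, not the study's R code: the miRNA names and Ct values are hypothetical, the selection of reference c-miRNAs by rank correlation is not shown, and the trimmed mean is assumed here to drop 25% of values from each end, which the text does not specify.

```python
from statistics import mean

def trimmed_mean(values, trim=0.25):
    """Mean after dropping the lowest and highest `trim` fraction of values.

    Assumption: trimming is applied to each tail separately.
    """
    v = sorted(values)
    k = int(len(v) * trim)
    return mean(v[k:len(v) - k])

def normalize(sample_cts, constant_fn=mean, reference=None):
    """Subtract a per-sample normalization constant from every Ct.

    sample_cts: dict of miRNA name -> Ct for one sample.
    reference: optional list of reference miRNA names; if omitted, the
    constant is computed from all assayed miRNAs.
    """
    names = reference if reference is not None else list(sample_cts)
    constant = constant_fn([sample_cts[n] for n in names])
    return {m: ct - constant for m, ct in sample_cts.items()}

# Hypothetical Ct values for a single sample.
sample = {"miR-A": 24.0, "miR-B": 26.0, "miR-C": 28.0, "miR-D": 30.0}

by_mean = normalize(sample)                               # mean of all miRNAs
by_ref = normalize(sample, reference=["miR-A", "miR-B"])  # mean of references
by_tmean = normalize(sample, constant_fn=trimmed_mean)    # trimmed mean
```

Because every strategy reduces to a different choice of per-sample constant, each normalized dataset can be passed unchanged to the same downstream model-fitting code, which is what allows normalization to be compared by cross-validated accuracy.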

c-miRNA Signature Development
The predictive potential of candidate c-miRNA signatures of risk was estimated by leave-one-donor-out-cross-validation (LOOCV) of the discovery set measurements of the 47 c-miRNAs. To ensure unbiased cross-validation, all samples relating to one donor were held out, the machine-learning algorithm was fit to the remaining data, and the resulting fit used to make blind predictions on the held-out samples. This step was done for each donor, and repeated for every combination of machine-learning algorithm and normalization approach. Using the R package caret (10), a variety of machine-learning algorithms were assessed (Figure 1).
Five machine-learning algorithms were used to train predictive models on the miRNA datasets, with models trained using the R caret package (10) (13). Initial performance was assessed using LOOCV during training. During LOOCV, all samples relating to a single donor (i.e., samples taken at differing timepoints from that donor) were held out and predicted on together. In the discovery analysis, the optimal model was selected by examining LOOCV predictive performance considering only the sample most proximal to TB diagnosis.
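The donor-wise hold-out scheme can be sketched in Python as below. This is an illustration only: the study used R caret with five machine-learning algorithms, whereas the `fit_centroid_scorer` function here is a deliberately simple stand-in model, and the feature values, labels, and donor identifiers are hypothetical.

```python
from statistics import mean

def fit_centroid_scorer(X, y):
    """Toy stand-in model (not one of the study's algorithms).

    Scores a sample by how much closer it lies to the case centroid
    than to the control centroid; higher means more case-like.
    """
    n_feat = len(X[0])
    case = [mean(x[j] for x, lab in zip(X, y) if lab == 1) for j in range(n_feat)]
    ctrl = [mean(x[j] for x, lab in zip(X, y) if lab == 0) for j in range(n_feat)]

    def score(x):
        d_case = sum((a - b) ** 2 for a, b in zip(x, case))
        d_ctrl = sum((a - b) ** 2 for a, b in zip(x, ctrl))
        return d_ctrl - d_case

    return score

def leave_one_donor_out(X, y, donors, fit=fit_centroid_scorer):
    """Hold out all samples of one donor, fit on the rest, predict the held-out."""
    predictions = [None] * len(X)
    for held_out in sorted(set(donors)):
        train = [i for i, d in enumerate(donors) if d != held_out]
        test = [i for i, d in enumerate(donors) if d == held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in test:
            predictions[i] = model(X[i])
    return predictions

# Hypothetical data: two donors contribute samples from two timepoints each.
X = [[1.0, 0.0], [1.2, 0.1], [0.9, -0.1], [-1.0, 0.0], [-1.1, 0.2], [-0.8, 0.1]]
y = [1, 1, 1, 0, 0, 0]
donors = ["d1", "d1", "d2", "d3", "d3", "d4"]
predictions = leave_one_donor_out(X, y, donors)
```

Grouping by donor rather than by sample matters because repeated samples from the same person are correlated; leaving out single samples would let the model see the held-out donor during training.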
The R pROC (15) package was used to calculate ROC curves by applying a set of thresholds to numeric predictions from predictive models to predict the progressor or control status of the samples, and then calculating the sensitivity and specificity of the predictor at each threshold. ROC curves were plotted using the R ggplot2 (16) package. Accompanying positive and negative predictive values were calculated using the model prediction threshold that maximized the sum of sensitivity and specificity.
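As an illustration of the threshold sweep described above, the following Python sketch computes sensitivity and specificity at each observed score threshold, the AUC, and the positive and negative predictive values at the threshold that maximizes sensitivity plus specificity. It is a simplified stand-in for the pROC analysis, not a reproduction of it; the scores and labels are hypothetical, and thresholds are taken at observed score values as a simplifying assumption.

```python
def roc_points(scores, labels):
    """Sensitivity/specificity at each observed threshold (case if score >= t)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        fn, tn = n_pos - tp, n_neg - fp
        points.append({
            "threshold": t,
            "sens": tp / n_pos,
            "spec": tn / n_neg,
            "ppv": tp / (tp + fp),
            # NPV is undefined when no sample is predicted negative.
            "npv": tn / (tn + fn) if (tn + fn) > 0 else None,
        })
    return points

def auc(scores, labels):
    """AUC as the probability that a case outscores a control (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_threshold(points):
    """Point that maximizes the sum of sensitivity and specificity."""
    return max(points, key=lambda p: p["sens"] + p["spec"])

# Hypothetical signature scores: 1 = progressor, 0 = control.
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0]

points = roc_points(scores, labels)
best = best_threshold(points)
area = auc(scores, labels)
```

Note that PPV and NPV computed this way depend on the case-to-control ratio of the sample set, so values from a matched case-control design do not transfer directly to a population with a different prevalence.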
Prediction performance, as measured by ROC statistics, was assessed using the sample for each participant that was most proximal to TB diagnosis. The combination of algorithm and normalization that maximized the area under the ROC curve (AUC) was selected to construct the final signature and was then used to make blind predictions on the validation set. p-Values associated with each signature were calculated using a one-tailed Mann-Whitney U-test comparing signature scores between cases and controls and were adjusted for multiple testing using the Benjamini-Hochberg algorithm. Bootstrapping was used to estimate 95% confidence intervals (CIs) of the AUC.
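Two of the statistical steps named above, Benjamini-Hochberg adjustment and a bootstrap CI for the AUC, can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the study's procedure: the p-values and scores are hypothetical, the bootstrap resamples cases and controls separately (a stratified percentile bootstrap), and the number of resamples and the seed are arbitrary choices.

```python
import random

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def auc(pos, neg):
    """Probability that a case score exceeds a control score (ties count 0.5)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(pos, neg, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC, resampling within each class."""
    rng = random.Random(seed)
    stats = sorted(
        auc([rng.choice(pos) for _ in pos], [rng.choice(neg) for _ in neg])
        for _ in range(n_boot)
    )
    lower = stats[int(n_boot * alpha / 2)]
    upper = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper

# Hypothetical raw p-values, e.g., one per candidate signature.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])

# Hypothetical signature scores for progressors and controls.
case_scores = [0.9, 0.8, 0.7, 0.4, 0.35]
control_scores = [0.6, 0.5, 0.3, 0.2, 0.1, 0.05]
point_estimate = auc(case_scores, control_scores)
ci_low, ci_high = bootstrap_auc_ci(case_scores, control_scores)
```

With small sample sets such as these, bootstrap intervals are wide, which mirrors the broad CIs reported for the time-stratified subsets.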

Prediction Performance of Combined RNA + c-miRNA Signature
To determine whether combining the c-miRNA signature with the existing RNA-based risk signature (RNA-CoR) led to significant improvement in prediction accuracy, a linear combination of the RNA-CoR and c-miRNA scores was compared with the RNA-CoR alone using a χ² test. This approach takes into account the nested nature of these models. The significance of the improvement in the combined models' AUC was also evaluated using the highly conservative (17,18) DeLong (19) test, which assumes the independence of the models. These analyses were performed using samples for which both RNA-CoR scores (4) and c-miRNA signature scores were available (34 progressor samples, 79 control samples) from both the training and test sets. To conservatively estimate c-miRNA signature performance, c-miRNA scores from the cross-validation analysis were used for training set samples and from the blind prediction analysis for the test set samples. Spearman correlations between normalized RNA-CoR PCR data (4) and normalized c-miRNA data were also calculated using matching samples.
Figure 1 (caption, panels A and B): (A) Areas under the ROC curves (AUCs) from discovery set leave-one-donor-out-cross-validation (LOOCV) for five different machine-learning algorithms applied to data generated using eight different normalization approaches. Error bars represent the 95% confidence intervals. Normalization primers indicate the numbers of reference primers used to normalize the data ("all" = all 47 primers, and "tmean" and "tmedian" = 25% trimmed-mean or median of all primer expression, respectively). Horizontal red line indicates nondiscrimination (AUC = 0.5). The machine-learning algorithms employed are indicated on the x-axis. (B) LOOCV ROC curves for the optimal algorithm (elastic-net logistic regression-all), stratified by the time between collection of the sample and TB diagnosis (time to TB).
RESULTS
Establishment of Study Cohorts
In total, 43 and 11 HHCs from the South African and Ugandan cohorts, respectively, progressed to active TB ("progressors") and were matched to HHCs that had remained healthy ("controls") during the 2-year study period (summarized in Table S3 in Supplementary Material). Tuberculin skin test (TST) measurements at enrollment found 91% of participants to have TST indurations ≥10 mm and 75% ≥15 mm, suggesting that the vast majority of HHCs have a latent TB infection. TST induration size did not differ significantly between progressors and controls (U-test p = 0.78), indicating that the TST is an ineffective predictor of TB risk in these cohorts. This ineffective prediction is unlikely to be related to false positives caused by BCG vaccination or TST cross reactivity with non-tuberculous mycobacteria (20), and the large TST indurations are more likely to reflect latent M. tuberculosis infection. Compared with our previous study of progression in South African adolescents with latent TB, where 0.7% of individuals progressed to active TB over the course of 2 years (4), 3.6% of South African HHCs progressed to active TB. A panel of 47 high expression, technically replicable c-miRNAs was selected from 608 candidate miRNAs. These 47 c-miRNAs were then analyzed in parallel on the discovery (151 samples) and validation (120 samples) sets.

Generation and Validation of the c-miRNA Signature of TB Risk
To identify an optimal c-miRNA signature of risk for TB among HHCs, we evaluated five different machine-learning algorithms using eight different normalization strategies (see Materials and Methods, Figure 1A; Table S4 in Supplementary Material). The top algorithm was elastic-net logistic regression normalized by the average of all 47 c-miRNAs, which achieved a cross-validation AUC of 0.7 (95% CI: 0.58-0.82, FDR-adjusted p = 0.04, negative predictive value = 81%, positive predictive value = 59%) (Figure 1A). LOOCV performance stratified by time to TB is shown in Figure 1B (see also Figure S1 in Supplementary Material). The optimal final signature selected was trained on the entire discovery set (Figure 1C; Table S5 in Supplementary Material). Blind prediction of TB progression by the signature was successful when applied to all validation set samples (ROC AUC = 0.66, CI: 0.53-0.80, NPV = 90%, PPV = 30%; Figure 1D). Stronger performance was observed on samples under 6 months to TB (ROC AUC = 0.74, CI: 0.50-0.98, NPV = 96%, PPV = 35%), consistent with the discovery set. While the signature was not significantly predictive on the baseline validation samples, i.e., samples taken close to study enrollment (AUC: 0.55, CI: 0.32-0.77, NPV = 83%, PPV = 37%; Figure S1 in Supplementary Material), strong and significant predictive performance was seen on baseline validation set samples within 6 months of TB progression (AUC: 0.95, CI: 0.88-1, NPV = 100%, PPV = 50%; Figure S1 in Supplementary Material). These results demonstrate that a c-miRNA-derived signature significantly predicts TB risk for HHCs within 6 months of progression.

Drivers of the c-miRNA Signature of TB Risk
Having validated the c-miRNA signature of TB risk, we performed a retrospective analysis to determine which c-miRNAs were the drivers of prediction accuracy. By sequentially removing the c-miRNA with the smallest model weight, retraining on the discovery set, and predicting on the validation set, we were able to identify the most parsimonious predictive signature (Figure S2 and Table S6 in Supplementary Material). Although prediction performance fluctuated stochastically with an overall decline as the signature was reduced, a three-c-miRNA signature predicted comparably to the full signature (AUC: 0.67, CI: 0.55-0.80, NPV = 78%, PPV = 64%), indicating potential for model reduction. Figure 2A shows the combined discovery and validation set expression of the three c-miRNAs. Thus, it appears that signature predictions are dominated by the contribution of the three most important miRNAs.
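The sequential reduction described above follows a backward-elimination pattern, sketched here in Python. This is an illustration under loud assumptions: `fit_weights` is a toy stand-in (difference of class means per feature) rather than the elastic-net logistic regression used in the study, and all miRNA names and values are hypothetical.

```python
from statistics import mean

def fit_weights(X, y, features):
    """Toy stand-in for model fitting: weight = mean(cases) - mean(controls)."""
    return {
        f: mean(x[f] for x, l in zip(X, y) if l == 1)
        - mean(x[f] for x, l in zip(X, y) if l == 0)
        for f in features
    }

def auc(scores, labels):
    """Probability that a case outscores a control (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def backward_elimination(X_train, y_train, X_val, y_val):
    """Refit, score the validation set, then drop the smallest-weight feature."""
    features = list(X_train[0])
    path = []
    while features:
        w = fit_weights(X_train, y_train, features)
        val_scores = [sum(w[f] * x[f] for f in features) for x in X_val]
        path.append((list(features), auc(val_scores, y_val)))
        weakest = min(features, key=lambda f: abs(w[f]))
        features.remove(weakest)
    return path

# Hypothetical normalized expression values; each sample is a dict per miRNA.
X_train = [
    {"miR-A": 2.0, "miR-B": 0.5, "miR-C": 0.1},
    {"miR-A": 1.8, "miR-B": 0.7, "miR-C": -0.1},
    {"miR-A": 0.2, "miR-B": 0.3, "miR-C": 0.0},
    {"miR-A": 0.0, "miR-B": 0.1, "miR-C": 0.05},
]
y_train = [1, 1, 0, 0]
X_val = [
    {"miR-A": 1.9, "miR-B": 0.4, "miR-C": 0.3},
    {"miR-A": 0.1, "miR-B": 0.6, "miR-C": -0.2},
    {"miR-A": 1.5, "miR-B": 0.2, "miR-C": 0.0},
    {"miR-A": 0.3, "miR-B": 0.5, "miR-C": 0.1},
]
y_val = [1, 0, 1, 0]
path = backward_elimination(X_train, y_train, X_val, y_val)
```

Because each step is evaluated on the validation set, a reduced signature chosen from this path is selected retrospectively and would itself need independent validation.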

The c-miRNA Signature of TB Risk Complements the RNA-CoR Predictions
The c-miRNA signature of TB risk includes c-miRNAs that are up- and down-regulated in TB progression, in contrast with the transcriptomic RNA-CoR (4), which was composed of genes upregulated during progression. These distinct kinetics suggest that the c-miRNA and RNA-CoR signatures may contain independent information for predicting TB among HHCs. The South African samples used to validate the RNA-CoR form part of this study cohort, facilitating a direct comparison of the c-miRNA signature with the published qRT-PCR RNA-CoR measurements. A linear combination of the c-miRNA signature, including all 47 miRNAs, and the RNA-CoR signature shows a modest increase in predictive power, from an AUC of 0.77 (CI: 0.68-0.87, NPV = 88%, PPV = 48%) using RNA-CoR alone to 0.78 (CI: 0.69-0.88, NPV = 87%, PPV = 52%) for the combined signature (Figure 2B), with wide overlap of the 95% CIs between the RNA-CoR alone and the RNA-CoR + c-miRNA model. Although the AUC of the RNA-CoR + c-miRNA model did not significantly improve on that of the RNA-CoR when compared using the conservative DeLong test (p = 0.43), significant (p = 0.03) improvement in predictive performance was observed when the linear combination of RNA-CoR + c-miRNA was compared with RNA-CoR alone using the χ² test, which takes into account the nested nature of the models. Notably, predictions were strongly improved in the high-specificity region of the ROC curve: at a specificity of 90%, the RNA-CoR shows a sensitivity of 41%, which improves to 50% when the c-miRNA scores are added.
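The text does not state the exact form of the χ² test used for the nested comparison. One common choice, shown here purely as an illustrative assumption, is the likelihood-ratio test between a logistic model fit on the RNA-CoR score alone (maximized log-likelihood \ell_0) and a logistic model fit on the RNA-CoR score plus the c-miRNA score (maximized log-likelihood \ell_1). Under the null hypothesis that the c-miRNA score adds no information, the statistic is referred to a χ² distribution with one degree of freedom, because the larger model has one additional coefficient:

```latex
\Lambda = 2\left(\ell_1 - \ell_0\right), \qquad
\Lambda \sim \chi^2_{1} \quad \text{under } H_0:\ \beta_{\text{c-miRNA}} = 0
```

A test of this kind asks whether the added score improves model fit, which is a different question from whether two AUCs differ; this may help explain why it can reach significance when the DeLong comparison of AUCs does not.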
To further explore the relationship between the c-miRNA and cellular RNA expression changes, we performed a correlation analysis between the constituents of the two signatures. Figure 2C shows a network of significant (FDR < 0.05) correlations between the components of the c-miRNA and RNA-CoR signatures (Table S7 in Supplementary Material). Both positive and negative correlations between c-miRNAs and the interferon-response genes in the RNA-CoR were observed in a manner consistent with previous functional studies of the implicated RNAs (21-25) (Figure 2D). These results demonstrate that elements of the c-miRNA signature are correlated with the previously identified RNA-CoR, and that the c-miRNA signature may provide information complementary to the RNA-CoR.
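The Spearman correlation used in this analysis is the Pearson correlation of rank-transformed values, which the following minimal Python sketch makes explicit. The paired measurements are hypothetical, and the multiple-testing (FDR) step is not shown.

```python
from math import sqrt
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical matched measurements of one c-miRNA and one RNA-CoR gene.
mirna_values = [22.1, 23.4, 24.0, 25.2, 26.8]
gene_values = [5.0, 4.1, 3.9, 3.0, 2.2]
rho = spearman(mirna_values, gene_values)
```

Because only ranks are used, the coefficient is insensitive to monotone transformations of either measurement, which is convenient when the two data types are normalized on different scales.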

DISCUSSION
Several previous studies have identified c-miRNAs that are differentially expressed in active TB disease (8), but to our knowledge, this is the first to have prospectively validated a c-miRNA-based signature of risk of TB in an independent cohort. The c-miRNAs comprising the signature are abundant in blood and have established roles in inflammatory and infectious conditions (21, 23-25). This signature is highly predictive of HHCs likely to progress within 6 months of testing, including tests performed close to exposure, although predictive power is diminished for more distal samples. This increase in signal close to diagnosis suggests that the c-miRNA signature is likely to be detecting an immune response to subclinical or incipient TB, prior to the development of symptomatic active disease. We observed that most progressors developed TB within 6 months of exposure (Figure S3 in Supplementary Material), suggesting that the temporal resolution of this test may be sufficient for practical application. As our analysis was limited to previously characterized c-miRNAs, we could not have identified potentially important uncharacterized c-miRNAs. Future improvements in sequencing approaches have the potential to identify additional c-miRNAs that may be important in the context of TB progression.
The RNA-CoR signature has been shown to have over double the positive predictive value of an interferon-gamma release assay alone and meets the Stop TB Partnership's performance criteria for a prognostic TB test (26). Combined with the RNA-CoR, the c-miRNA signature displays only a slight improvement in AUC versus the RNA-CoR alone. However, the predictive performance shows a strong improvement in sensitivity at high specificities, suggesting that the combination of the RNA-CoR and c-miRNA signatures would act as an improved "rule-in" test to identify HHCs at higher risk and likely to benefit from isoniazid (INH) prophylaxis.
Correlating the components of the c-miRNA signature with components of the RNA-CoR signature suggests how the interferon response to TB disease may be regulated by c-miRNAs. miR-21, which is induced by mycobacteria (21) and is a marker of immune cell activation (24), was positively correlated with genes in the RNA-CoR. In contrast, miR-26a, which has been shown to suppress macrophage responsiveness to IFN-γ (23), and miR-30b, which has been shown to suppress pro-inflammatory cytokine secretion and Fc-receptor expression (25), were both negatively correlated with RNA-CoR genes, including FCGR1B (Figure 2D).
Recently, blood transcriptional signatures capable of evaluating TB risk (4) and effective response to TB treatment (27) have been developed, although the sensitivity and specificity of the risk signature are limited. Investigating alternative platforms to whole-blood transcription holds out the possibility of augmenting the performance of this initial work. The c-miRNA signature developed here demonstrates the potential of serum c-miRNAs to predict TB risk, despite being limited by a preselected pool of candidate miRNAs and by the difficulty of accurately quantifying low-abundance miRNAs in serum. In the future, the development of accurate, sensitive, and unbiased sequencing approaches for c-miRNAs would hold much promise for further improving prediction of TB risk.