The Evaluation and Validation of Blood-Derived Novel Biomarkers for Precise and Rapid Diagnosis of Tuberculosis in Areas With High-TB Burden

Tuberculosis (TB) remains a highly contagious public health threat. Precise and prompt diagnosis and monitoring of treatment response are urgently needed in the clinic. To identify novel and satisfactory host blood-derived biomarkers, we built a bioinformatic pipeline that integrates differentially expressed genes, a gene co-expression network, and short time-series analysis to mine published whole-blood transcriptomes of TB patients in the GEO database, and then validated the diagnostic performance of the biomarkers in both independent datasets and blood samples of Chinese patients using quantitative real-time PCR (qRT-PCR). We found that four genes, namely UBE2L6 (Ubiquitin/ISG15-conjugating enzyme E2 L6), BATF2 (Basic leucine zipper transcriptional factor ATF-like 2), SERPING1 (Plasma protease C1 inhibitor), and VAMP5 (Vesicle-associated membrane protein 5), had high diagnostic value for active TB. The combined transcription levels of these four genes reached 88% sensitivity and 78% specificity (average) for the diagnosis of active TB; the highest sensitivity, 100%, was achieved by parallel testing of BATF2 and VAMP5, and the highest specificity, 89.5%, by a combination of SERPING1, UBE2L6, and VAMP5. These values were significantly higher than the 75.3% sensitivity and 69.1% specificity of T-SPOT.TB in the same patients. Quite unexpectedly, the gene set can also assess the response to anti-TB treatment and differentiate active TB from latent TB infection. The data suggest that these four biomarkers have great potential and advantages over IGRAs in the diagnosis of TB.


INTRODUCTION
Despite decades of vaccine immunization and anti-TB chemotherapy, tuberculosis (TB) caused by Mycobacterium tuberculosis (MTB) remains a devastating disease and an enormous burden to global public health, with around one fourth of the world's population at risk of being infected, about 10 million new TB cases, and 1.2 million deaths worldwide in 2019 (WHO, 2019). Rapid and precise diagnosis of active TB largely remains an unmet clinical need (Pai and Schito, 2015). Traditional diagnostic methods have shortcomings, such as the low sensitivity (12-15%) of acid-fast bacilli (AFB) smears and the time-consuming nature of cultures (Akkaya and Kurtoglu, 2019). Molecular diagnostics such as Xpert MTB/RIF (Cepheid, Sunnyvale, CA, United States) can achieve a sensitivity of 34-66.7% for smear-negative pulmonary TB (PTB) and extrapulmonary TB (Qureshi et al., 2019; Wu et al., 2019). Xpert MTB/RIF Ultra improves the sensitivity for TB but has decreased specificity compared with Xpert MTB/RIF (Wu et al., 2019; Jiang et al., 2020). Current etiological methods have limited sensitivity in smear-negative active TB, especially paucibacillary TB (Steingart et al., 2013). Therefore, etiological methods are not suitable for fast diagnosis of TB. Blood-derived biomarkers for precise and rapid diagnosis of TB are being intensively studied to meet clinical needs.
The most widely applied blood-derived immunological methods for the diagnosis of TB are interferon-γ release assays (IGRAs) and tuberculin skin testing (TST); however, they cannot distinguish active TB from latent TB infection (LTBI) or HIV-positive patients (Rangaka et al., 2012). Mycobacteria-specific cytokines are being intensively explored as biomarkers to distinguish latent TB infection from active TB (Marc et al., 2015). Diagnostics based on biomarkers derived from blood samples have recently been intensively explored (Denkinger et al., 2015), and also meet WHO's target product characteristics. Blood-based biomarker diagnostics have great advantages (Walter et al., 2016) for quick sample collection and quantification, as well as for point-of-care tests (POCT) (Wallis et al., 2010). However, based on existing research results, effective biomarkers based on whole blood are still lacking.
The host transcriptome response to MTB infection is a valuable source for this end, as exemplified by the abundant data in the National Institutes of Health Gene Expression Omnibus (NIH GEO). To transform the transcriptome data into clinically actionable TB diagnostics, we curated nine whole-blood transcriptome datasets from NCBI that meet the statistical criteria for effective data analysis (Figure 1A). In combination with clinical sample analysis, we determined the specificity and sensitivity of the candidates in the diagnosis of active tuberculosis. The results showed that the effectiveness of the novel diagnostic biomarkers was significantly better than that of T-SPOT.TB. In summary, in this study, a four-gene set (UBE2L6, BATF2, SERPING1, and VAMP5) was validated as a novel method for the diagnosis of active PTB, as well as a biomarker for monitoring anti-TB treatment efficacy.

Microarray Data Information and Usage in Discovery/Validation Stage
Gene Expression Omnibus (GEO) datasets from published microarray-based studies of PTB versus LTBI or other diseases were collected for data mining. From the nine datasets (GSE19491, GSE40553, GSE56153, GSE42834, GSE39941, GSE37250, GSE103119, GSE94438, and GSE124548), 2804 samples were obtained (Table 1). After screening for the most relevant and comprehensive blood samples, 1654 samples were kept for further study. However, the samples are highly heterogeneous in their processing methods and cannot be directly pooled for analysis. GSE19491 has very comprehensive information with a large number of samples and was assayed by the same laboratory with uniform methods. Multiple subject groups were included in this dataset, such as PTB, LTBI, healthy controls (HC), and other pulmonary diseases. Additionally, the change of the transcriptome during treatment monitoring was also analyzed. Specifically, three subseries (GSE19439, GSE19442, and GSE19444) in GSE19491 containing the transcriptome data of TB, LTBI, and HC were used for differential gene expression and correlation analysis. Therefore, GSE19491 was used as the discovery dataset to find differential expression by limma, correlation by WGCNA, and time-course trends by STEM.
GSE40553 and GSE56153 contained the time-course transcriptome data of TB patients after treatment. GSE42834 contained patients with active TB or miscellaneous pulmonary diseases. GSE37250 and GSE39941 are samples from patients with TB and other diseases, with or without HIV co-infection. GSE94438 samples are from household contact subjects (Suliman et al., 2018). We therefore chose these six datasets for validation.
Another three datasets (GSE103119, GSE124548, and GSE42834) were used to validate the biomarker specificity. GSE103119 contained patients with pneumonia caused by bacteria or virus (except MTB) and healthy subjects, while GSE124548 samples are from cystic fibrosis patients, used to differentiate pulmonary diseases.
Datasets first underwent quantile normalization and were log2 transformed. We mapped the probes to gene symbols based on the probe data before Dec 5, 2018 from GEO.
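The preprocessing order described above (quantile normalization, then log2 transformation) can be illustrated with a minimal Python sketch. This is not the authors' R code; the function names and the toy intensities are assumptions made only for illustration, and ties are broken by position rather than averaged.

```python
import math

def quantile_normalize(columns):
    """Quantile-normalize equal-length sample columns (one list per sample)."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # probe indices, smallest first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace each value by the reference quantile
        normalized.append(out)
    return normalized

def log2_transform(columns):
    """Apply log2 to every (positive) intensity."""
    return [[math.log2(v) for v in col] for col in columns]

# Hypothetical raw intensities: two samples, three probes each.
samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(samples)
logged = log2_transform(normalized)
```

After normalization every sample shares the same set of values, so differences between samples reflect rank changes of individual probes rather than array-wide intensity shifts.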

Identification of Biomarkers From Multiple Datasets
To discover the molecules in the datasets most likely to serve as TB biomarkers, we combined three analysis methods: differentially expressed gene (DEG) analysis, co-expression network analysis, and time-series analysis.

Real-Time qPCR Validation of Differentially Expressed Genes by Prospective Clinical Study
Patients who met the inclusion criteria were prospectively enrolled into this study from January 1, 2019 to July 31, 2019 at Shanghai Pulmonary Hospital. The study was approved by the Institutional Review Board at the Hospital (K17-022). Quantitative real-time PCR (qRT-PCR) was used to validate the differential expression of the four shortlisted genes in blood samples from participants. Peripheral venous blood (2.0 mL) was drawn directly into PAXgene blood RNA tubes (PreAnalytiX, Hombrechtikon, Switzerland) and stored at −20 °C until use. RNA was extracted from the blood stored in the PAXgene tubes. Before analysis, all test samples and primers were assigned random numerical codes that masked the disease, control status, and the gene identity. The qRT-PCR-based validation and GEO data mining were done in a fully blinded manner. The primers used are listed in Table 2. Gene expression levels were quantified relative to the transcription of β-actin using an optimized comparative Ct (ΔΔCt) method.
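As a rough illustration of how a comparative Ct calculation relative to a reference gene such as β-actin typically works, the following Python sketch computes relative expression and fold change. It assumes the textbook form of the method with 100% amplification efficiency; the details of the authors' optimized variant are not given in the text, and the function names and Ct values are hypothetical.

```python
def relative_expression(ct_target, ct_reference):
    """Expression of a target gene relative to the reference gene: 2^-(delta Ct)."""
    delta_ct = ct_target - ct_reference
    return 2.0 ** (-delta_ct)

def fold_change(ct_target_case, ct_ref_case, ct_target_ctrl, ct_ref_ctrl):
    """Fold change of a case sample versus a control sample: 2^-(delta delta Ct)."""
    delta_delta_ct = (ct_target_case - ct_ref_case) - (ct_target_ctrl - ct_ref_ctrl)
    return 2.0 ** (-delta_delta_ct)

# Hypothetical Ct values: the target crosses threshold 2 cycles after beta-actin
# in a patient and 5 cycles after it in a control.
patient_level = relative_expression(22.0, 20.0)
patient_vs_control = fold_change(22.0, 20.0, 25.0, 20.0)
```

A lower Ct means more template, so a smaller gap to the reference gene translates into a higher relative expression value.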

Validation of TB Score
Based on the data mining and clinical validation, we defined the TB score as the geometric mean of the transcription levels of the four genes: $\mathrm{TB\ score} = \sqrt[4]{\mathrm{UBE2L6} \times \mathrm{BATF2} \times \mathrm{SERPING1} \times \mathrm{VAMP5}}$. This TB score was directly tested for diagnostic power by receiver operating characteristic (ROC) curves using the R package pROC. Violin plots showed the TB score for a dataset in response to treatment at specific time points. Violin plot error bars showed the inter-quartile range (IQR) between non-normal distributions within subsets. Between-group TB score comparisons were done with the Wilcoxon rank sum test. Significance levels were set at two-tailed p < 0.05. All computations were done in the R language (version 3.5.1).
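The TB score and its evaluation can be sketched in a few lines of Python. This is an illustrative stand-in for the R/pROC workflow described above, not the authors' code: the expression values are invented, and the AUC is computed directly as the probability that a case scores higher than a control (ties counted as one half), which is equivalent to the area under the empirical ROC curve.

```python
from statistics import geometric_mean

GENES = ("UBE2L6", "BATF2", "SERPING1", "VAMP5")

def tb_score(expression):
    """Geometric mean (fourth root of the product) of the four gene levels."""
    return geometric_mean([expression[g] for g in GENES])

def auc(case_scores, control_scores):
    """Empirical AUC: fraction of case/control pairs ranked correctly."""
    wins = 0.0
    for c in case_scores:
        for h in control_scores:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical relative expression levels for one subject.
subject = {"UBE2L6": 2.0, "BATF2": 8.0, "SERPING1": 4.0, "VAMP5": 4.0}
score = tb_score(subject)

# Hypothetical TB scores for three patients and two controls.
example_auc = auc([4.0, 3.0, 5.0], [1.0, 3.0])
```

Using the geometric rather than the arithmetic mean keeps a single strongly induced gene from dominating the score, since each gene contributes multiplicatively.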

T-SPOT.TB Assay
T-SPOT.TB was performed in accordance with the manufacturer's instructions (Oxford Immunotec Ltd.). Blood samples were collected immediately prior to the tests in order to avoid potential interferences, and patients who received blood transfusions or underwent positron emission tomography-computed tomography scans within 1 week of the test were recommended to undergo a second test 2 weeks later. Peripheral blood mononuclear cells (PBMCs) were separated from blood samples using Ficoll-Hypaque gradient centrifugation at 400 × g for 30 min at 20 °C. PBMCs were seeded on precoated IFN-γ ELISpot plates and incubated with media without an antigen (as a negative control), media containing peptide antigens derived from ESAT-6 (labeled panel A) or peptide antigens derived from CFP-10 (labeled panel B), or media containing phytohemagglutinin (as a positive control) in a 5% CO2 atmosphere at 37 °C for 20 h. After counting the number of spot-forming cells, results were reported with the negative control results subtracted (i.e., measured sfu number minus sfu number of the negative control), according to the recommendations of the manufacturer. The values for ESAT-6 (panel A) and CFP-10 (panel B) were scored individually using the same procedure, and the maximum of the two was regarded as the final result of T-SPOT.TB. All T-SPOT.TB testing was performed before the patients were prescribed anti-TB medications.
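The scoring rule described above (subtract the negative control from each panel, then take the larger of the two) can be written as a small Python sketch. The function name and the spot counts are hypothetical, and no positivity cut-off is applied because the text does not state one.

```python
def tspot_result(negative_control, panel_a, panel_b):
    """Final T-SPOT.TB value: the larger background-corrected spot count."""
    corrected_a = panel_a - negative_control  # ESAT-6 panel minus negative control
    corrected_b = panel_b - negative_control  # CFP-10 panel minus negative control
    return max(corrected_a, corrected_b)

# Hypothetical spot-forming unit counts for one patient.
final_value = tspot_result(negative_control=2, panel_a=10, panel_b=7)
```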

Statistical Analysis
Statistical and machine learning methods (R packages limma, WGCNA, and pROC, and the STEM software) were employed to discover and validate the biomarker genes for TB diagnosis and treatment response based on the mRNA levels in blood samples. The analyses were carried out using R scripts written in RStudio. The differences in gene expression levels between blood samples of TB patients and healthy controls were compared using the Wilcoxon test. Multiple comparisons among patients with lung cancer, pneumonia, and TB were carried out using the Kruskal-Wallis test. Significance levels were set at p < 0.05.

RESULTS
Four Candidate Biomarker Genes Were Found by Integrating the Results of DEGs, WGCNA, and STEM Analysis
The DEGs in the three subseries of GSE19491 were analyzed using the limma package following data preprocessing. A total of 555 DEGs were identified in PTB compared to HC, including 336 up-regulated genes and 219 down-regulated genes, and 175 DEGs were identified in PTB compared to LTBI, including 131 up-regulated genes and 44 down-regulated genes (Figure 1B). Finally, 117 DEGs were found to be shared by both the PTB-HC and PTB-LTBI comparisons, containing 98 up-regulated genes and 19 down-regulated genes (Figure 1C). A total of 4807 genes in 134 samples were analyzed by WGCNA to find modules of highly correlated genes. With a soft-thresholding power of 8, 14 modules were found (Figure 2A). Among all modules, the black module with 270 genes had the highest correlation coefficient (p = 5e-24; r = 0.74) with PTB (Figure 2B). An intra-modular analysis of gene significance (GS) and module membership (MM) of the genes in the black module found that GS and MM were significantly correlated (p = 6.6e-70; r = 0.83), further supporting that the genes in the black module were highly correlated (Figure 2D). To investigate whether these modules are conserved, two independent datasets, GSE37250 and GSE42834, were used to test the preservation of these modules. A Z-summary > 10 indicates high preservation. The black modules in GSE37250 and GSE42834 had Z-summary values of 20 and 24, respectively, indicating that they are well preserved in our network (Figure 2C).
To narrow down the candidate genes for biomarkers, we conducted a comparative analysis and found 63 genes (Supplementary Table 1) present in the DEGs, the WGCNA black module, and the STEM profiles (Figure 2E). Of the 63 key genes, 30, 19, 11, and 3 were assigned to profiles 3, 2, 0, and 4, respectively (Figure 2F). They were significantly enriched in immune response related terms in the GO, Reactome pathway, and UniProt keywords enrichment analyses performed on the STRING website (Szklarczyk et al., 2017). This result encouraged us to further explore the roles of the 63 genes in TB. Too many genes might be counterproductive for rapid and precise biomarker diagnosis. We therefore shortlisted four of the 63 genes (SERPING1, BATF2, UBE2L6, and VAMP5), because they had the highest GS and MM in the black module, decreased constantly in STEM profile 3, and were differentially expressed in both the PTB versus LTBI and PTB versus HC comparisons. Together, the transcriptional levels of the four genes correlated with TB and changed significantly during treatment. This implied that the four genes might play essential roles in the development of TB and could be candidates for new diagnostic biomarkers.
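The effect of raising correlations to a power of 8 in the co-expression network step can be illustrated with a minimal Python sketch. It assumes an unsigned adjacency, |r|^8, which is one common WGCNA setting; the text does not state which network type was used, and the gene names and expression profiles below are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def soft_threshold_adjacency(profiles, power=8):
    """Unsigned adjacency |r|^power for every ordered pair of distinct genes."""
    genes = list(profiles)
    return {
        (g, h): abs(pearson(profiles[g], profiles[h])) ** power
        for g in genes
        for h in genes
        if g != h
    }

# Hypothetical expression profiles across four samples.
profiles = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with gene_a
    "gene_c": [1.0, 3.0, 2.0, 4.0],  # correlation 0.8 with gene_a
}
adjacency = soft_threshold_adjacency(profiles)
```

With a power of 8, a correlation of 0.8 shrinks to an adjacency of about 0.17 while a perfect correlation stays at 1, so only strongly co-expressed gene pairs remain tightly connected when modules are formed.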
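The comparative analysis amounts to intersecting three gene lists. A minimal Python sketch, using invented placeholder lists rather than the study's actual gene sets, might look like this:

```python
def shared_candidates(deg_genes, module_genes, stem_genes):
    """Genes present in the DEG list, the WGCNA module, and the STEM profiles."""
    return set(deg_genes) & set(module_genes) & set(stem_genes)

# Hypothetical toy lists; GENE_X and GENE_Y are placeholders, not real results.
degs = {"BATF2", "VAMP5", "GENE_X"}
black_module = {"BATF2", "VAMP5", "GENE_Y"}
stem_profiles = {"BATF2", "VAMP5", "GENE_X", "GENE_Y"}

candidates = shared_candidates(degs, black_module, stem_profiles)
```

Requiring membership in all three lists is deliberately strict: a gene must be differentially expressed, sit in the disease-associated module, and follow a significant temporal profile before it is considered further.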

Four Genes Showed Good Clinical Performance by Real-Time qPCR Validation in Peripheral Blood From Patients
To validate the clinical efficacy of the four genes, 150 participants were included; of them, 14 cases were excluded due to obscure diagnosis, and a total of 126 participants were finally enrolled into the study. They were classified into four groups: 51 cases with active PTB, 30 cases with lung cancer (TUMOR), 30 cases with pneumonia (INFLA), and 15 healthy donors (HC). The patients' clinical characteristics are shown in Table 3.
The transcription levels of the four genes were detected by qRT-PCR. The results showed that BATF2, SERPING1, UBE2L6, and VAMP5 were significantly increased in PTB compared with HC (Wilcoxon test, P < 0.05).
We further plotted ROC curves to evaluate diagnostic power (Figure 3). The results showed that the diagnostic power of a single gene was lower than that of the gene combinations, and that the ROC curve of each gene differed from that of the TB score based on the gene combination (Venkatraman's method (Venkatraman, 2015), P < 0.05). In the patient samples, combinations of three genes (BATF2-SERPING1-VAMP5 and BATF2-SERPING1-UBE2L6) or two genes (BATF2-SERPING1 and SERPING1-VAMP5) improved the diagnostic performance, and there was no difference between the ROC curves of these combinations and that of the four-gene combination (Supplementary Image 1). The results from Chinese patients differed from those in the GEO data, in which the combination of four genes showed better performance. The four-gene combinations in Chinese patients reached up to 100% sensitivity or specificity; on average, AUC = 0.84, sensitivity = 88%, and specificity = 78%, similar to the results of the pure GEO dataset analysis from a non-Chinese population, which gave AUC = 0.86, sensitivity = 86%, and specificity = 81%.

FIGURE 3 | Performance of the genes by ROC curve and the difference between the ROC curve of each gene and that of the TB score.

The Diagnostic Efficacy of the Four Genes for Active TB Is Significantly Higher Than That of T-SPOT.TB Conducted Over the Same Patients
To compare the performance of the four candidate biomarkers with that of T-SPOT.TB for the diagnosis of active TB, all patients were tested by T-SPOT.TB. T-SPOT.TB showed 75.3% sensitivity and 69.1% specificity, suggesting that its diagnostic accuracy for active tuberculosis is significantly lower than that of the four genes in areas with high TB burden.

Four Genes Have Good Specificity for Active TB Diagnosis
In the clinic, other lung diseases often confound the diagnosis of tuberculosis to a large extent, so the diagnostic specificity of candidate markers is crucial. To verify the specificity of the four genes, we examined the TB score in independent gene expression datasets from clinical TB samples, comparing its efficacy among four types of comparisons by ROC curve (Figure 4) Figure 4C); and (4) active TB patients' response to treatment at specific time points (GSE40553 and GSE56153). The TB score did well across all conditions (mean AUC 0.86, sensitivity 86%, and specificity 81%) except sarcoidosis, which might be due to the common disease-related signatures in TB and sarcoidosis (Maertzdorf et al., 2012). We further examined the transcription levels of the four genes in blood samples from patients with tumors (TUMOR) and pneumonia (INFLA). The results showed that the transcription of the four genes in PTB was significantly higher than that in HC, TUMOR, and INFLA (Figure 4D). The transcription levels of SERPING1 in HC, TUMOR, and INFLA were almost identical, but were about 3-7 times higher in PTB than in HC, TUMOR, and INFLA. The transcription levels of BATF2, UBE2L6, and VAMP5 in TUMOR and INFLA were also similar and were slightly higher than those in HC. However, the transcription levels of these three genes in PTB were about 3-8 times higher than those in TUMOR and 3-15 times higher than those in INFLA (Supplementary Table 2).
In addition, the four genes can also predict whether close contacts of tuberculosis patients will develop active TB. According to GSE94438, TB score can effectively identify those who developed TB 18 months after contact with active tuberculosis [AUC 0.87 (95% CI 0.81-0.93)] (Figure 4E).

The Four Genes Can Also Be Biomarkers for Treatment Efficacy and Differential Diagnosis via Cross-Validation of TB Score in Independent Test Datasets
As expected, when we investigated TB score in the discovery datasets, we found the four-gene set can differentiate active TB from HC with AUC 0.97 (95% CI 0.95-0.99) and differentiate active PTB from LTBI with AUC 0.94 (95% CI 0.92-0.96) ( Figure 5A). The scores of TB patients decreased significantly after effective treatment ( Figure 5B).
The transcription of the four genes decreased gradually with effective TB treatment in the GSE40553 and GSE56153 datasets (Supplementary Image 2). Therefore, we tested whether the TB score of the four genes can be used to assess the treatment response in these datasets. For the active TB patients under lengthy treatment, the TB score was significantly decreased after treatment (Figure 5C). In GSE56153, the TB scores of patients returned to normal after treatment, with no significant difference between healthy controls and recovered patients (Figure 5D, Wilcoxon p > 0.05). The results indicate that the four genes can be biomarkers to monitor treatment efficacy. In summary, the four-gene signature showed excellent and specific TB diagnostic performance in this pilot test. However, multicenter clinical studies with more cases should be conducted in the future.

DISCUSSION AND CONCLUSION
Novel biomarkers for rapid and reliable TB diagnosis and treatment efficacy monitoring are urgently needed to reduce or eliminate the global burden of TB. Here, we used both a prospective study and public datasets with more than 1,000 whole blood patient samples across a range of ages and countries to find biomarker genes for the diagnosis of active TB. We found a four-gene set (UBE2L6, BATF2, SERPING1, and VAMP5), and cross-validated it in five additional independent whole blood datasets. The results showed that the four-gene set is robust for distinguishing active PTB from other pulmonary diseases and HC, while the diagnostic performance is not affected by HIV status based on the datasets. In addition, the four-gene set can help to distinguish active TB from LTBI, a task for which TST or IGRAs are usually used. More importantly, we have confirmed that the accuracy of the novel detection method is significantly higher than that of the IGRA test. The transcription levels of the four genes decreased stepwise upon effective treatment and could also be biomarkers to monitor treatment efficacy. Whether they can be biomarkers for treatment failure or relapse remains to be determined. Our data indicated that the combination of the four genes can reach 88% sensitivity and 78% specificity for PTB, which were significantly higher than the 75.3% sensitivity and 69.1% specificity of T-SPOT.TB in the same cohort. The novel biomarkers can reach as high as 100% sensitivity by parallel testing of BATF2 and VAMP5 and 89.5% specificity by a combination of SERPING1, UBE2L6, and VAMP5. The most effective three-gene combination is BATF2-SERPING1-VAMP5, with 77% specificity and 91% sensitivity for the diagnosis of PTB. The most effective two-gene combination is SERPING1-VAMP5 (76% specificity and 86% sensitivity).
The host immune response is crucial for the outcome of active TB. However, the genes and pathways involved in the host immune response to M. tuberculosis infection or persistence remain elusive. Based on GeneCards annotation, the four genes are all involved in well-established immune responses, but very few studies have associated them with TB. The protein-protein interaction network of the four genes constructed via the STRING database showed that they are strongly associated with ubiquitination, immune cell differentiation, complement activation, and vesicle trafficking (Supplementary Image 3), which are important cellular responses during host interaction with M. tuberculosis. BATF2, also called SARI, is a member of the BATF subfamily of basic leucine zipper proteins; it is regulated by interferon and is an inhibitor of AP-1 in human cells (Haiqing et al., 2011). The BATF subfamily controls the differentiation of lineage-specific cells in the immune system (Murphy et al., 2013), and Batf2/Irf1 induces inflammatory responses in mycobacterial infection (Sugata et al., 2015). Ubiquitin conjugating enzyme E2 L6 (UBE2L6) serves as an E2 enzyme for post-translational addition of the ubiquitin-like protein ISG15, which is vital for antiviral immunity (Skaug and Chen, 2010), and is involved in the type-I interferon response in active TB disease (Ottenhoff et al., 2012). Vesicle-associated membrane protein 5 (VAMP5) is a member of the SNARE protein family, which regulates the docking and fusion of intracellular membrane vesicles (Hong, 2005), and is involved in the development or function of the respiratory system (Ikezawa et al., 2018). VAMP5 controls intracellular transport events, including endocytosis, exocytosis, and internal recycling (Tajika et al., 2014). SERPING1 encodes the serpin peptidase inhibitor C1 inhibitor (C1Inh), a member of a large family of serine protease inhibitors, which can influence the levels of complement C1q, a marker of active disease in human tuberculosis (Cai et al., 2014; Horwitz et al., 2019).
Using single-cell RNA-seq transcriptomes of patients with tuberculosis, we found that the four genes are highly expressed in the white blood cells of these patients. In general, the levels of these four genes tend to be highest in CD14+ or CD16+ monocytes, among which VAMP5 is relatively higher. This is consistent with the role of monocytes in the host response to M. tuberculosis. VAMP5 is involved in vesicle transport and has the highest level in monocytes. In addition, the complement activation pathway may also be involved in the elimination of M. tuberculosis. The sequence, structure, and function of each encoded protein are closely related to its predicted role in tuberculosis. The specific high expression of these genes in TB patients may suggest that they play an important role in the immune response against tuberculosis. Our ongoing study found that the inhibition of BATF2 can benefit the host, suggesting a promising drug target. In addition to the four genes, the other key genes among the 63 we identified were strongly associated with the immune response by functional enrichment analysis. Further exploring the immune roles of the 63 genes is worthwhile and might provide more biomarker candidates.
The datasets used in our study have been used by other teams to explore TB diagnostic biomarkers. There is surprisingly little overlap between our results and other reports. Kaforou and colleagues (Myrsini et al., 2013) identified a 44-transcript signature which can distinguish PTB from other diseases (including only one of our genes, SERPING1) and a 27-transcript signature which can distinguish TB from latent TB (including only one of our genes, VAMP5). Berry et al. (2010) found an 86-gene signature which is related to neutrophil-driven type I interferon (no overlap with our four genes) and can discriminate PTB from other inflammatory and infectious diseases. Bloom et al. (2013) identified a 144-transcript signature which distinguished PTB from other lung diseases and controls (none of our four genes in it). Anderson et al. (2014) assessed transcript signatures in children and found a 51-transcript signature for distinguishing TB from other diseases (including only one of our genes, VAMP5) and a 42-transcript signature for distinguishing TB from latent TB infection (none of our four genes were in it). Bloom et al. (2012) reported an active TB 664-transcript signature and a treatment-specific 320-transcript signature that significantly diminished after 2 weeks of treatment. Zak et al. (2016) identified a 16-gene signature which can predict tuberculosis progression. These gene panels are too large to be clinically affordable or actionable for a rapid qRT-PCR-based assay. In contrast, our four genes can differentiate active TB from latent TB and other diseases. The four-gene set will reduce the cost of clinical qRT-PCR-based diagnosis. Similarly, Costa et al. (2015) found a three-gene set (GZMA, GBP5, and FCGR1A), Sutherland and colleagues (Maertzdorf et al., 2016) found a four-gene set (GBP1, IFITM3, P2RY14, and ID3), Ottenhoff et al. (2012) found a three-gene set (IL15RA, UBE2L6, and GBP4), another study found a three-gene set (GBP5, DUSP3, and KLF2), and Roe et al. (2019) found a three-gene set (BATF2, GBP5, and SCARF1) in blood samples that can distinguish TB. However, our biomarker genes are different and were validated in a Chinese population.
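To make the idea of parallel testing concrete, the Python sketch below combines two single-gene calls so that a subject is positive if either gene is positive, and then computes sensitivity and specificity. This is the standard epidemiological meaning of parallel testing and is assumed here; the text does not describe the exact decision rule or thresholds used, and the calls below are invented toy data.

```python
def parallel_positive(*gene_calls):
    """Parallel testing: positive when at least one single-gene test is positive."""
    return any(gene_calls)

def sensitivity_specificity(predictions, truths):
    """Return (sensitivity, specificity) from paired boolean calls and diagnoses."""
    pairs = list(zip(predictions, truths))
    tp = sum(1 for p, t in pairs if p and t)
    fn = sum(1 for p, t in pairs if not p and t)
    tn = sum(1 for p, t in pairs if not p and not t)
    fp = sum(1 for p, t in pairs if p and not t)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical data for five subjects (True = active TB / test positive).
has_tb = [True, True, True, False, False]
batf2_call = [True, False, True, False, True]
vamp5_call = [False, True, True, False, False]

combined = [parallel_positive(b, v) for b, v in zip(batf2_call, vamp5_call)]
single = sensitivity_specificity(batf2_call, has_tb)
parallel = sensitivity_specificity(combined, has_tb)
```

In this toy example the parallel rule raises sensitivity from 2/3 to 1.0 without changing specificity; in general, an either-positive rule can only keep or increase sensitivity and keep or decrease specificity, which is why the highest sensitivity and the highest specificity are reported for different gene combinations.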
The discrepancy between our results and other reports might have resulted from differences in ethnicity or in the bioinformatic pipelines. Our approach uniquely integrated three bioinformatics methods and validated the results by a prospective study in a Chinese population. We explored transcript signatures by integrating differentially expressed genes, co-expression networks, and expression trends, which can interpret the expression data from multiple dimensions. This rigorous pipeline might underlie the good performance of the four genes. However, this pipeline might miss some candidate biomarkers. There might be additional biomarker genes which can be included for better performance in regions with low incidence rates of active tuberculosis.
Although some reports clearly affirm that some of these genes can be used as biomarkers for TB diagnosis, the effectiveness of a single gene is limited. The flexible application of the four-gene set that we found offers a fast and effective diagnostic method for active TB disease. Moreover, this four-gene set can also be used to monitor the effect of TB treatment, and is expected to play an important role in quickly distinguishing PTB from LTBI.
In summary, we demonstrated that the four-gene set (BATF2, UBE2L6, VAMP5, and SERPING1) is a robust blood-based diagnostic for active TB across seven datasets containing more than 1,200 clinical samples, the sensitivity or specificity of which can reach 100%, though the mean AUC = 0.86, sensitivity = 86%, and specificity = 81%. They span a variety of age, infection or exposure status, ethnicity (Sutherland et al., 2014) and genetic backgrounds, and diverse circulating lineages of M. tuberculosis. This was further validated in 126 human blood specimens from a Chinese population. The four-gene set can serve as biomarkers to improve clinical diagnosis and treatment response monitoring of TB.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/ Supplementary Material.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Institutional Review Board at the Hospital (K17-022). All participants provided written informed consent prior to participation in the study.

AUTHOR CONTRIBUTIONS
YG, XK, ZG, JN, and RZ performed the experiments. YG, LF, and JX analyzed the data. BS and LF diagnosed the patients and collected samples for all clinically related ethical approval. ZG, YG, LF, and JX designed the study and wrote the manuscript. All authors have read and approved the manuscript.