Comprehensive Genetic Analysis of Tuberculosis and Identification of Candidate Biomarkers

Purpose: The purpose of this study is to use the data in the GEO database to analyze, screen biomarkers that can diagnose tuberculosis, and verification of candidate biomarkers. Materials and methods: GSE158767 dataset were used to process WGCNA analysis, differential gene analysis, Gene ontology and KEGG analysis, protein-protein network analysis and hub genes analysis. Based on our previous study, the intersect between WGCNA and differential gene analysis could be used as candidate biomarkers. Then, the enzyme-linked immunosorbent assay was used to validate candidate biomarkers, and receiver operating characteristic was used to assess diagnose ability of candidate biomarkers. Results: A total of 412 differential genes were screened. And we obtained 105 overlapping genes between DEGs and WGCNA. GO and KEGG analysis showed that most of the differential genes were significantly enriched in innate immunity. A total of 15 hub genes were screened, and four of them were verified by Enzyme-linked immunosorbent assay. CCL5 performed well in distinguishing the healthy group from the TB group (AUC = 0.723). And CCL19 performed well in distinguishing the TB group from the ORD groups (AUC = 0.811). Conclusion: CCL19, C1Qb, CCL5 and HLA-DMB may play important role in tuberculosis, which indicated four genes may become effective biomarkers and could be conveniently used to facilitate the individual tuberculosis diagnosis in Chinese people.


INTRODUCTION
Tuberculosis has been accompanied by human for thousands of years, and it is still a major public health problem that threatens human health. According to the World Health Organization tuberculosis report, about a quarter of the world's population is infected with M. tuberculosis and thus at risk of developing TB disease (WHO, 2019). In 2018, about 10 million people were infected with TB. Meanwhile, there were 1.2 million (range, 1.1-1.3 million) TB deaths among HIV-negative people in 2018 (a 27% reduction from 1.7 million in 2000), and an additional 251,000 deaths (range, 223,000-281,000)3 among HIV-positive people (a 60% reduction from 620,000 in 2000). Geographically, eight countries accounted for two thirds of the global total: India (27%), China (9%), Indonesia (8%), the Philippines (6%), Pakistan (6%), Nigeria (4%), Bangladesh (4%) and South Africa (3%). Therefore, curbing the spread of tuberculosis is an urgent problem to be solved (WHO, 2019).
Tuberculosis is mainly transmitted by respiratory tract. Therefore, the early diagnosis of tuberculosis is very important. However, there are still some problems in the diagnosis of tuberculosis (Jakhar et al., 2020). For example, the gold standard sputum culture takes too long, and the false positive rate of tuberculosis antibody is too high Mansoori et al., 2020). According to the World Health Organization, 55% of lung cases were confirmed bacteriologically in 2018 (WHO, 2019). And we should increase the percentage of cases confirmed bacteriologically by scaling up the use of recommended diagnostics (e.g., rapid molecular tests) that are more sensitive than smear microscopy. The biomarkers produced by the immune reaction in the process of infection with Mycobacterium tuberculosis is a relatively accurate and rapid molecular test. Therefore, there is an urgent need to find novel biomarkers to solve the problem of tuberculosis diagnosis.
With the development of bioinformatics technology, more and more new techniques are applied to analyze expression profile data. Differential gene analysis is the most classical analysis method, which plays an important role in this field of biomarkers by a series of statistical algorithms to find differential genes between different subgroups (Liu et al., 2022;Zhao et al., 2022). WGCNA (weighted gene coexpression network analysis) is a topological network that can establish the linkage between gene modules and clinical traits, and the genes classified into the same module are all linked to selected clinical traits, which can then be used for subsequent analysis and experiments (Nguyen et al., 2021;Ye et al., 2021). The combination of differential gene analysis with WGCNA allows the screening of genes that are differentially expressed and associated with selected clinical traits, which can be used for further screening to identify the final biomarkers.
In this study, we used the data from the GEO database to conduct a comprehensive bioinformatics analysis to screen out possible biomarkers for the diagnosis of tuberculosis. Sequencing data from lung tissues were used to analyze and biomarker were screened for lung tissue of high metabolic activity and lung tissue of low metabolic activity, which can reveal specific features of host lung immunity, and can screen for biomarkers associated with immune and metabolic activity, thus improving the sensitivity and specificity of the biomarkers. Enzyme-linked immunosorbent assay (ELISA) was used to validate the screened biomarkers. The results of the verification are used for statistical analysis and the establishment of predictive models to help doctors diagnose tuberculosis.

Acquisition of RNA Data
The gene expression data of tuberculosis patients were obtained from GEO database (Barrett et al., 2013). The selection criteria for the GEO database were set as follows: tissue; RNA highthroughput sequencing; release time descending order. Finally, GSE158767 was selected as a candidate dataset. Gene expression data of 10 lung tissue samples were obtained from GSE158767, of which 5 were metabolic high and five were metabolic low. According to the description of GSE158767, standard uptake values (SUV) of PET-CT greater than three were considered metabolic high, while SUV less than three were considered metabolic low. In addition, the platform of GSE158767 was Illumina NovaSeq 6,000 (GPL24676). According the guideline of edgeR package, gene with low read counts cannot be used for further analysis. Therefore, the gene with cpm (count per million) ≥ 1 was kept in this study. All the data obtained have been normalized for further analysis. We used the function rpkm in edgeR package to conduct normalization process.

WGCNA Analysis
Co-expression networks have facilitated the development of network-based gene screening methods that can be used to identify candidate biomarkers and therapeutic targets. In this study, we constructed a gene expression data map of GSE158767 to construct a gene expression network based on the WGCNApackage (Zuo et al., 2018). WGCNA was used to identify the genes which were related with clinical phenotype. In order to build the scale-free network, we used the function pickSoftThreshold to select soft powers β = 3 (Liang et al., 2020;Zhao et al., 2021). Then, we used the following formula to create the adjacency matrix: aij Sij β Sij: similarity matrix which is done by Pearson correlation of all gene pairs β: softpower value And then the adjacency matrix was transformed into a topological overlap matrix (TOM) as well as the corresponding dissimilarity (1-TOM). Then, a hierarchical clustering tree diagram of the 1-TOM matrix was constructed to classify similar gene expression into different gene coexpression modules (Xu M et al., 2020). To further identify functional modules in the co-expression network, module-trait associations between modules and clinical feature information were calculated based on previous studies. As a result, modules with high correlation coefficients were considered as candidates for correlation with clinical features and were selected for subsequent analysis. are as follows:|logFC|>1, adjust p-value <0.05 (Bos et al., 2017). The p-value was adjusted by the Benjamini-Hochberg method to control for the false discovery Rate (FDR). The DEGs were visualized as volcano plot by using R package ggplot2. Then, the intersect between DEGs and co-expression genes that were extracted from the interesting module were used to identify potential biomarkers, which were visualized as a Venn diagram using the R package VennDiagram.

Enrichment Analysis
The metascape database (https://metascape.org/gp) is a gene annotation and analysis database (Zhou Y et al., 2019). The Kyoto Encyclopedia of Genome and Genome (KEGG) is a database resource for understanding advanced functions and biological systems from large-scale molecular data generated by high-throughput experimental techniques (Kanehisa et al., 2017). Gene ontology (GO) analysis, including annotations of biological process (BP), molecular functional (MFS) and cellular module (CCS), is the main bioinformatics tool for annotating genes and analyzing the biological processes of these genes. We used metascape online database for bioinformatics analysis of overlapping genes. And a false discovery rate (FDR) of less than 0. 05 were considered statistically significant (Bos et al., 2017). The results obtained are visualized by R-package ggplot2.

Protein-protein Interaction Network Construction and Module Analysis
We used the Search Tool for the Retrieval of Interacting Genes (STRING) database (http://string-db.org) to obtain the Proteinprotein interaction (PPI) network (Szklarczyk et al., 2015). Import the network data into Cytoscape (Version: 3.8.0) for further analysis (Smoot et al., 2011). CytoHubba, a plug-in of Cytoscape, was used to filter hub genes (Chin et al., 2014). And 15 hub genes were screened out by MCC method and sequentially ordered. More forward rankings are represented by redder color. Taking the intersection of 15 hub genes and our previous research results (Wang et al., 2021), the obtained genes are used for experimental verification.

Verification of the Candidate Biomarkers
Enzyme-linked immunosorbent assay (ELISA) is a qualitative and quantitative detection method that uses antigen-antibody specific binding for immune response. The experimental arrangement mainly involves the sandwich method and the competition method. Candidate biomarkers for TB were validated using the ELISA kits (USCN Life Sciences; Wuhan, China). Protein levels in plasma were detected according to the manufacturer instructions. Collect plasma using EDTA as an anticoagulant. Centrifuge samples for 15 min at 1,000×g at 2-8°C within 30 min of collection. Remove plasma and assay immediately or store samples in aliquot at -80°C for later use. Avoid repeated freeze/thaw cycles. We measured protein levels of candidate biomarkers in plasma from 88 patients with pulmonary tuberculosis, 88 healthy controls and 88 ORDs (Other respiratory diseases). Student's t-test was used to compare the differences between the two groups, and p < 0.05 was considered statistically significant. Graphpad Prism (version 8.0) was used to visualize the results as well as for statistical analysis. To determine the diagnostic efficacy of biomarkers, the R package pROC was used to perform receiver operating characteristic (ROC) curve.

RESULTS
The work flow of this study was shown in Figure 1.

WGCNA Analysis and Interesting Modules
To find biomarkers associated with focal metabolic activity in TB patients, we constructed a gene co-expression network using the WGCNA package. All genes were divided into different modules and each module was assigned a different color (Figure 2A). The correlation between each module and two clinical features was assessed by plotting a heat map of module-trait relationships. The results of the module-trait relationship were shown in Figure 2B, indicating that the lightcyan module in GSE158767 had the highest correlation with the metabolism-high tissue (r = 0.61, p = 0.007).

Intersect Between the DEGs and Interesting Module
A total of 412 DEGs were identified by differential gene analysis ( Figure 3A). The list of DEGs obtained was intersected with the genes in the lightcyan module and a total of 105 overlapping genes were identified ( Figure 3B). These overlapping genes were dysregulated in expression in foci of increased metabolism and could be used as candidate biomarkers.

Enrichment Analysis for Overlapping Genes
The results of GO enrichment analysis were shown in Figure 4A. In BP, the most genes were enriched in regulation of lymphocyte proliferation, regulation of mononuclear cell proliferation, and regulation of leukocyte proliferation. In CC, the most genes were enriched in collagen-containing extracellular matrix. In MF, the most genes were enriched in receptor ligand activity, and signaling receptor activator activity. The results of KEGG enrichment analysis were visualized as bubble plot ( Figure 4B). The most genes were enriched in small molecule catabolic process, regulation of lymphocyte proliferation, regulation of mononuclear cell proliferation, regulation of leukocyte proliferation, and regulation of T cell proliferation.

Protein-Protein Interaction Network and Hub Genes
The STRING database (https://www.string-db.org/) was used to construct PPI network among the overlapped genes ( Figure 5A). The hub genes selected from the PPI network using the MCC of CytoHubba plugin were presented in Figure 5B. Based on MCC scores, the top 15 highest score genes were selected as hub genes (MMP9, CCL5, SPP1, CCL19, APOE, CXCL3, CHI3L1, IDO1, TDO2, GPC3, LTB, FCGR1A, LYVE1, C1QB, HLA-DMB). According to our previous study, we found CCL5, CCL19, C1QB and HLA-DMB were high expression in TB lung tissue, therefore we finally selected the above biomarkers for experimental verification (Wang et al., 2021).

Experimental Verification
We used ELISA to validate the four candidate biomarkers screened and the experimental results were shown in Figure 6. Among them, C1QB, CCL5 and CCL19 all showed good discriminatory properties, with statistically significant differences in the healthy, TB and ORD groups. While HLA-DMB was statistically significantly different in the healthy and TB groups, but not in the TB and ORD groups. We then assessed the diagnostic efficacy of the four markers

DISCUSSION
Tuberculosis is an infectious disease that can involve organs and tissues throughout the body and is caused by Mycobacterium tuberculosis. At present, TB remains the largest pathogenic cause of mortality worldwide, apart from COVID-19. Although the diagnostic criterion for tuberculosis is sputum positivity, the low rate of positivity by pathogenic methods has resulted in a large number of potential tuberculosis patients not being diagnosed in a timely manner, thus delaying treatment. Therefore, the development of biomarkers that can diagnose TB is demanded. The second-generation RNA sequencing technique  was used to detect the specimens of tuberculosis patients taken out by operation. After that, the four candidate biomarkers were screened by bioinformatics. And then, using ELISA technology to analyze the plasma of tuberculosis group, health control group and ORD group, four biomarkers were verified. The results showed that four biomarkers were the most statistically significant. According to the results of ROC analysis, CCL19 may become a biomarker for the diagnosis of tuberculosis. And the diagnostic efficacy of the combination of these four biomarkers was also high according to the results of logistic regression analysis, but it still needs to be further confirmed by prospective studies.
The C1qB encodes the B-chain polypeptide of serum complement subcomponent C1q, which associates with C1r and C1s to yield the first component of the serum complement system (Bos et al., 2017). C1q is composed of 18 polypeptide chains which include 6 A-chains, 6 B-chains, and 6 C-chains. Each chain contains an N-terminal collagen-like region and a C-terminal C1q globular domain (Lubbers et al., 2020). Some previous studies (Chen et al., 2011;Radanova et al.,  2012) demonstrated that c1q deficiency is associated with lupus erythematosus and glomerulonephritis. There is a certain discrepancy between this and our research results. Our results show that when the expression of C1qB is increased, it indicates that there may be inflammation caused by tuberculosis infection. This may be due to the different mechanisms of inflammation and autoimmune inflammation caused by Mycobacterium tuberculosis infection. The specific mechanism of C1qB in tuberculosis needs to be further studied.
CCL5 is a chemokine that participates in immune regulation and inflammation (Singh et al., 2020), and has chemotaxis to monocytes, memory T helper cells and eosinophils (Tavares et al., 2020). It has been widely studied in HIV infection. It mainly participates in cellular immune response and plays an important role in CD8+T cells. . Some studies (Lee et al., 2019;Fujimoto et al., 2020;Yu-Ju Wu et al., 2020) have pointed out that CCL5 is a key target for the treatment of AIDS, tumors and even inflammation. In these results, the expression of CCL5 is increased, which is consistent with our study. We confirmed that the expression of CCL5 is also increased in inflammation caused by Mycobacterium tuberculosis infection. Combined with the results of other studies, CCL5 is likely to become a new target for the treatment of tuberculosis.
CCL19 is one of several CC cytokine genes clustered on the p-arm of chromosome 9. Cytokines are a family of secreted proteins involved in immunoregulatory and inflammatory processes (Yan et al., 2019). The CC cytokines are proteins characterized by two adjacent cysteines. The cytokine encoded by this gene may play a role in normal lymphocyte recirculation and homing. It also plays an important role in trafficking of T cells in thymus, and in T cell and B cell migration to secondary lymphoid organs (Saxena et al., 2019). It specifically binds to chemokine receptor CCR7 . Some studies (Yan et al., 2019;Xu H et al., 2020) have shown that the increased expression of CCL19 is related to virus infection. This is consistent with our research results. The host immunity of tuberculosis is mainly T cell immunity. However, there has not been any research on the CCL19 of tuberculosis. CCL19 may become a new target for tuberculosis treatment, and its diagnostic value has been confirmed by our research.
Since 2008 to date, more than nine thousand studies of TB biomarkers exist, and more than five thousand of them are diagnostic biomarkers. However, most of the diagnostic biomarkers have not been subsequently validated with largescale data, and there are currently four biomarkers that have been validated with large-scale data (Zimmer et al., 2021). These four biomarkers are CRP, IL-6, IP-10, and TNF-α. CRP was the first to appear and is now in clinical use as an adjunct to the diagnosis of TB and to determine the efficacy of anti-TB therapy (Fusani et al., 2021). IL-6 and IP-10 have considerable potential to diagnose tuberculosis, and biologic companies already exist to develop related products (Zimmer et al., 2021). TNF-α is similar to above three biomarkers and can diagnose tuberculosis, showing a high diagnostic efficacy (Zimmer et al., 2021;Huang et al., 2022). However, the above biomarkers are still lacking in terms of diagnostic specificity, and no biomarkers with high specificity for TB can be identified yet, which may require more in-depth studies targeting host immunity.
In summary, the diagnostic efficacy of each biomarker alone is not satisfactory, but the diagnostic model formed by the combination of the four biomarkers is very effective. The diagnostic model had an AUC of 0.788 in distinguishing TB from HC and an AUC of 0.880 in distinguishing TB from ORD. In the field of diagnostic biomarkers, diagnostic models formed by the combination of several biomarkers are becoming more dominant because they can combine the advantages of each biomarker (Deng et al., 2022;Huang et al., 2022). In this study, new diagnostic biomarkers for tuberculosis were identified in blood using expression profile data obtained from lung tissue. Prior to validation at the plasma protein level, we also performed validation at the RNA level, but were unable to graph the four biomarkers due to their low RNA expression in whole blood. This further suggests that these four biomarkers are released into the blood after being synthesized as proteins in the lung tissue. This study had some limitations, firstly, only sequencing data from lung tissue were used in the screening stage, so it was difficult to state that the screened biomarkers were specific for TB. Secondly, the small number of samples sequenced made it prone to statistical bias. Further research of the four biomarkers and other biomarkers which were found by previous studies were needed to illustrate their diagnostic efficacy.
At present, the diagnosis of tuberculosis is still a major problem to be solved. We have developed an effective diagnostic tool to help clinicians identify TB patients early. In addition, it can also be used as a tool to judge the therapeutic effect of tuberculosis. In addition, the determination of the plasma content of the four biomarkers is a simple detection method. It can reduce the difficulty of tuberculosis diagnosis and reduce the economic burden of patients. These four biomarkers can even be made into kits to promote the use of tuberculosis, so as to diagnose tuberculosis as soon as possible.

CONCLUSION
The study developed four novel biomarkers (CCL5, C1Qb, CCL19 and HLA-DMB) for diagnose of TB. Through the early diagnosis of tuberculosis, clinicians and patients can take more necessary measures in terms of treatment and follow-up.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material further inquiries can be directed to the corresponding author/s.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Shanghai public health clinical center's ethics committee. The patients/participants provided their written informed consent to participate in this study.