Prognostic Signatures Based on Thirteen Immune-Related Genes in Colorectal Cancer

Background The immunosuppressive microenvironment is closely related to tumorigenesis and cancer development, including colorectal cancer (CRC). The aim of the current study was to identify new immune biomarkers for the diagnosis and treatment of CRC. Materials and Methods CRC data were downloaded from the Gene Expression Omnibus and The Cancer Genome Atlas databases. Sequences of immune-related genes (IRGs) were obtained from the ImmPort and InnateDB databases. Gene set enrichment analysis (GSEA) and transcription factor regulation analysis were used to explore potential mechanisms. An immune-related classifier for CRC prognosis was conducted using weighted gene co-expression network analysis (WGCNA), Cox regression analysis, and least absolute shrinkage and selection operator (LASSO) analysis. ESTIMATE and CIBERSORT algorithms were used to explore the tumor microenvironment and immune infiltration in the high-risk CRC group and the low-risk CRC group. Results By analyzing the IRGs that were significantly associated with CRC in the module, a set of 13 genes (CXCL1, F2RL1, LTB4R, GPR44, ANGPTL5, BMP5, RETNLB, MC1R, PPARGC1A, PRKDC, CEBPB, SYP, and GAB1) related to the prognosis of CRC were identified. An IRG-based prognostic signature that can be used as an independent potentially prognostic indicator was generated. The ROC curve analysis showed acceptable discrimination with AUCs of 0.68, 0.68, and 0.74 at 1-, 3-, and 5- year follow-up respectively. The predictive performance was validated in the train set. The potential mechanisms and functions of prognostic IRGs were analyzed, i.e., NOD-like receptor signaling, and transforming growth factor beta (TGFβ) signaling. Besides, the stromal score and immune score were significantly different in high-risk group and low-risk group (p=4.6982e-07, p=0.0107). Besides, the proportions of resting memory CD4+ T cells was significantly higher in the high-risk groups. Conclusions The IRG-based classifier exhibited strong predictive capacity with regard to CRC. The survival difference between the high-risk and low-risk groups was associated with tumor microenvironment and immune infiltration of CRC. Innovative biomarkers for the prediction of CRC prognosis and response to immunological therapy were identified in the present study.


INTRODUCTION
Colorectal cancer (CRC) is one of the most common malignant tumors, and its morbidity and mortality are on the rise worldwide. More than 1 million new cases of CRC are diagnosed globally every year (1), as are approximately 492,000 deaths (2). Although treatment techniques such as surgery, radiotherapy, and chemotherapy have been greatly improved, the prognosis remains poor. In the USA the respective 5-year survival rates of patients who underwent surgery to remove tumors for localized (stage I), regional (stages II and III), and distant (stage IV) CRC were 91.1%, 71.7%, and 13.3% (3). Most patients are at a progressive stage at the time of diagnosis and have thus missed the opportunity to undergo standard treatment, so more precise diagnoses and more effective treatments are urgently needed.
Currently the TNM classification system compiled up by the American Joint Committee on Cancer is the most robust prognostic indicator for stratifying patients (4). It is also used to guide clinical treatment for CRC. Because of tumor heterogeneity however, even patients at the same TNM stage may exhibit different survival times (5). Therefore, other auxiliary indicators are needed to predict prognoses more accurately, and provide an additional basis for therapy choices. Galon et al. (6) first reported that different subgroups of tumorinfiltrating lymphocytes could predict the prognosis of CRC patients in 2006. Numerous studies have subsequently revealed that tumor-infiltrating lymphocytes are closely related to the prognosis of CRC, and the degree of tumor regression after neoadjuvant radiotherapy in patients with locally advanced rectal cancer (7).
The main components of the tumor microenvironment include vascular cells, mesenchymal stem cells, tumor-associated fibroblasts, immune cells, inflammatory cells, and extracellular matrix, among others (8,9). Most tumor cells express antigens recognized by host CD8 + T cells, but those that evade antitumor immune responses grow progressively (10). In the last decade immunotherapy-based drugs have been intensely investigated in cancer treatment, and immunotherapy has now become an effective therapy for several cancers (11,12). In CRC immune checkpoint therapy can be effective in tumors that are mismatch-repair-deficient or have high levels of microsatellite instability, but ineffective in tumors that are mismatch-repair-proficient, microsatellite-stable, or have low levels of microsatellite instability (13). Therefore, characterizing the function of immunity in different responsive populations contributes to improving the efficacy of immunotherapy for CRC.
In the current study a CRC immune signature based on 13 prognostic immune-related genes (IRGs) was constructed, and its prognostic efficacy was verified using external validation datasets. The role of abnormal immune infiltration and tumor microenvironment heterogeneity in immunotherapy for CRC was also investigated.

Data Processing
The Cancer Genome Atlas (TCGA) CRC expression data, the corresponding phenotype, and survival data were downloaded from the xena database (http://xena.ucsc.edu/). The dataset contained a total of 434 samples, of which 383 were CRC samples and 51 were normal samples. The IRG dataset was downloaded from the ImmPort database (https://www.immport. org/shared/home) and the InnateDB database (https://www. innatedb.com/). After discarding repeated genes the ImmPort database contained 1,811 immune genes and the InnateDB database contained 1,226 immune genes. GSE72970 was downloaded from the Gene Expression Omnibus database to verify the efficacy of the survival prognosis model. GSE72970 contained 124 CRC disease samples. The limma package in R was used to conduct difference analysis on the expression profile data, and p < 0.05 and logFC > 1 as the threshold value were used to identify 4,793 differentially expressed genes (DEGs). A total of 569 IRGs were then identified via the intersection of the ImmPort and InnateDB immune databases and the DEGs.

Functional Enrichment Analysis
Using DAVID (http://david.ncifcrf.gov/), the gene ontology function and CRC IRG pathways were enriched. Gene set enrichment analysis was used for functional analysis of candidate prognostic IRGs in key modules.

Weighted Gene Co-Expression Network Analysis
The co-expression of IRGs in CRC was analyzed via weighted gene co-expression network analysis (WGCNA) using R software. A WGCNA algorithm was used to mine the gene modules that were synergistically expressed, then the correlation between those modules and the sample phenotype was analyzed to identify the modules that were most strongly related to the disease phenotype.

Least Absolute Shrinkage and Selection Operator Analysis
Univariate Cox regression analysis was performed using the "Survival" package (https://CRAN.R-project.org/package= survival, Version:2.41-3), and 16 candidate IRGs associated with CRC prognosis were identified. Least absolute shrinkage and selection operator (LASSO) regression was conducted using the R package "glmnet" (14) to further screen the potential prognostic risk characteristics, and an immune-related CRC prognosis signature was generated. The risk score was then calculated as follows: Coef is the regression coefficient and Exp is the expression value of the corresponding gene in each sample. CRC samples were divided into a high-risk group and a low-risk group based on the median risk score. The significance of the difference between the survival curves in the high-risk group and the lowrisk group was tested via Kapan-Meier analysis. The area under the curve (AUC) was calculated to evaluate the predictive efficiency of the model in 1, 3, and 5 years. To verify whether the constructed risk predictor signature was an independent prognostic indicator, univariate analysis of clinical factors for CRC and multivariate Cox regression analysis of risk scores for CRC were performed. To test the predictive power of the prognostic model the GSE72970 dataset was downloaded from the Gene Expression Omnibus cohort to verify the predictive power of the prognostic model via Kaplan-Meier curve analysis and determine the 5-year area under the receiver operating characteristic (ROC) curve.

Transcription Factor-mRNA Interaction Network Construction
Regulatory relationships between transcription factors and mRNA were downloaded from the TRRUST version 2 database (https://www.grnpedia.org/trrust/), and transcription factors with interactional relationships with the prognostic IRGs were screened out. A network incorporating transcription factors and IRGs was built using Cytoscape (15).

Calculation of Immune Score and Matrix Score
Immune cells and stromal cells are two major types of nontumor components in the tumor microenvironment, and it has been suggested that they are valuable in the diagnosis and prognosis of tumors. Gene expression characteristics of immune cells and stromal cells in the high-risk group and the low-risk group were calculated using the R package ESTIMATE.

Assessment of Proportions of Immune Cell Types
CIBERSORT (http://cibersort.stanford.edu/) was used to characterize cell composition based on the gene expression profile of complex tissues. A characteristic white blood cell gene matrix (LM22) consisting of 547 genes was used to identify 22 immune cell types, including myeloid subsets, natural killer cells, plasma cells, naive and memory B cells, and T cells. CIBERSORT was combined with the LM22 eigenmatrix to estimate the proportions of 22 cell phenotypes in the high-risk group and the low-risk group. The ratio of all estimated immune cell types in each sample adds up to 1.

WGCNA Identified Survival-Related Modules
A total of 4,793 differentially expressed genes were obtained via the limma package in R ( Figure 1A), of which 1,574 were upregulated and 3,219 were downregulated ( Figure 1B). The immune genes in the immune databases ImmPort and InnateDB were then merged, and 2,668 immune genes were identified. By intersecting the 2,668 immune genes and the 4,793 DEGs, 569 overlapping IRGs were identified ( Figure 1C). A heat map of the IRGs is shown in Figure 2A. DAVID analysis indicated that the IRGs were mainly enriched in the Biological process (BP) like delayed rectifier potassium ( Figure 2B), were mainly associated with the cellular component post (CC) like synaptic membrane ( Figure 2C), were mainly included in molecular function (MF) like lipoprotein particle binding ( Figure 2D). In Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis the IRGs were mainly enriched in the mineral absorption region, among others ( Figure 2E).
In WGCNA, the optimal threshold value was 4 if the correlational coefficient was > 0.85 ( Figures 3A, B). The genes were clustered via the average-linkage hierarchical clustering method, and five modules were obtained ( Figure 3C). The blue module were negatively correlated with the disease ( Figure 3D).

Prognostic IRG Acquisition and Its Potential Functions
Sixteen survival-associated IRGs were acquired from the blue module (containing 124 IRGs) via univariate Cox regression ( Table 1). In gene set enrichment analysis conducted to further investigate the possible roles of these genes with potential prognostic functions in gene ontology and KEGG pathways, the enriched biological functions identified included cell activation, defense responses, dendritic development, and negative regulation of leukocyte migration. The enriched KEGG pathways included the hedgehog pathway, neuroactive ligand receptor interaction, NOD-like receptor signaling, and transforming growth factor beta (TGFb) signaling ( Figures  4A-H).

Prognostic IRG Transcriptional Regulatory Factors in CRC
Relationships between transcription factors and mRNA were downloaded from the TRRUST database, and screen out the transcription regulation factors which in relationship with the 16 candidate IRGs. A total of 21 interactional relationships were detected, and they involved the transcription factors TP53, GATA3, and breast-cancer susceptibility gene 1 (BRCA1) ( Figure 5).

Construction and Verification of Prognostic Classifier Based on IRGs
Sixteen prognostic IRGs were selected for LASSO regression analysis, and 13 genes were used to construct the prognostic classifier; C-X-C motif chemokine ligand 1 (CXCL1), F2R-like trypsin receptor 1 (F2RL1), leukotriene B4 receptor (LTB4R), GPR44, angiopoietin-like 5 (ANGPTL5), bone morphogenetic protein 5 (BMP5), resistin-like beta (RETNLB), melanocortin-1 receptor (MC1R), peroxisome proliferator-activated receptor g    Figure 6A). The regression coefficient of each gene was calculated ( Table 2). Risk scores were calculated based on regression coefficients obtained via the LASSO algorithm, and survival times in TCGA corresponding to risk scores were determined ( Figures 6B, C). The median risk score was generated to separate the high-risk and low-risk groups. The risk group and the profile of each clinical feature are shown in Figure 6D. TCGA cohort patients with high risk scores exhibited a lower survival rate than those with low risk scores based on the Kaplan- Meier curve analysis ( Figure 7A). In analysis of time-dependent ROC curves to assess the effects of the classifier the AUCs were 0.68 at 1 year, 0.68 at 3 years, and 0.74 at 5 years ( Figures 7B-D). In GSE72970 analysis patients with high risk scores exhibited a lower survival rate than those with low risk scores ( Figure 7E). The AUC was 0.729 at 5 years ( Figure 7F). Univariate and multivariate Cox regression analysis revealed that age, TNM stage, and risk score in the prognosis model were significantly associated with survival ( Table 3).

Stromal Scores and Immune Scores in the High-Risk and Low-Risk Groups
In GSE72970, stromal scores and the immune scores were significantly higher in the high-risk group than in the low-risk group (Figure 8).

Leukocyte Subsets in the High-Risk and Low-Risk Groups
The proportions of resting memory CD4 + T cells and eosinophils differed significantly in the high-risk and low-risk groups ( Figures  9A, B). In GSE72970 analysis the proportions of naive B cells, memory B cells, plasma cells, CD8 + T cells, CD4 + resting memory T cells, follicular helper T cells, regulatory T cells, resting natural killer cells, and activated natural killer cells differed significantly in the high-risk and low-risk groups (Figures 10A, B).

DISCUSSION
CRC is the third most common cancer in the world, with approximately 1.4 million cases diagnosed worldwide in 2012 (16). Remarkable progress has recently been made in two key areas at the immunology-cancer interface and microenvironment (17), and this may have substantial effects on future CRC diagnoses and treatments. In our study, we identified the prognostic signature based on the thirteen IRGs could categorize CRC patients into two subgroups with statistically different survival outcomes, which was   validated in both TCGA and GSE72920 datasets. Additionally, we explored the underlying mechanisms using ESTIMATE and CIBERSORT analysis between risk groups. Sixteen survival-associated IRGs were acquired from the key module and were significantly enriched in hedgehog signaling, NOD-like receptors and the TGFb-signaling pathway. Aberrant hedgehog signaling in tumor cells can induce abnormal proliferation and invasion (18), and hedgehog signaling in the tumor microenvironment that targets cancer-associated fibroblasts can lead to angiogenesis (19), fibrosis (20), immune evasion (21), and neuropathic pain (22). Hedgehog-related genetic alterations mostly occur in basal cell carcinoma (85%) and sonic hedgehog-subgroup medulloblastoma (87%), and less frequently in breast cancer, CRC, and gastric cancer (23). NODlike receptors are a relatively recent addition to the pattern recognition receptor superfamily (24). Increasing evidence suggests that chronic inflammation caused by aberrant NODlike receptor signaling is a powerful driver of carcinogenesis, genetic mutation, tumor growth, and cancer progression (25). The TGFb-signaling pathway is one of the important pathways in the tumorigenesis of CRC (26), and TGFb activation in the tumor microenvironment can promote tumor-stromal interaction and lead to a malignant CRC phenotype and a poorer prognosis (27).
To investigate underlying molecular mechanisms, a transcription factor-mediated network was constructed to identify vital transcription factors that could regulate identified hub IRGs. TP53, GATA3, and BRCA1 were prominent in this network. TP53 can mediate several cellular stress responses such as DNA repair, cell-cycle arrest, and apoptosis, and suppress tumor formation (28). GATA3 is one of six members of the GATA family of transcription factors, and contains zinc-finger DNA binding domains that bind to 5′-(A/T) GATA (A/G)-3′ motifs (29). It regulates the specification and differentiation of various tissue types, and immunohistochemistry for GATA3 expression is primarily used in surgical pathology diagnosis for carcinomas originating from breast (30) or urothelial (31) tissue. BRCA1 and breast cancer 2 (BRCA2) are tumor suppressor genes that control aberrant cell proliferation and prevent tumor development (32). BRCA1 and/or BRCA2 mutation carriers are at a lifetime risk of developing breast cancer of up to 85%, and for ovarian cancer their lifetime risk is reportedly between 20% and 40% (33,34).
In the present study the 13 IRGs that were strongly associated with CRC prognosis-CXCL1, F2RL1, LTB4R, GPR44, ANGPTL5, BMP5, RETNLB, MC1R, PPARGC1A, PRKDC, CEBPB, SYP, and GAB1-were used in the classifier investigation. Le Rollel et al. (35) reported that human CRC epithelia and myofibroblasts secrete elevated CXCL1 that facilitates blood vessel formation and recruitment of stromal and inflammatory cells, and promotes in vivo tumorigenic growth. There are two types of LTB4R; leukotriene B4 receptor 1 (BLT1) and leukotriene B4 receptor 2. BLT1 is a high-affinity LTB4R that is expressed by various subsets of leukocytes, and is responsible for LTB4-dependent leukocyte migration (36). BLT1 deficiency in Apc min/+ mice reportedly resulted in increased tumor size and increased numbers of intestinal tumors due to altered microbiota and increased chronic inflammation (37).
The tumor suppressor gene BMP5 has been investigated in myeloma, adrenocortical carcinoma, and breast cancer, and Chen et al. (38) reported that loss of BMP5 is an early event in CRC, and that low BMP5 expression was associated with recurrence and poorer prognoses. The intestinal goblet cellspecific protein RETNLB is markedly over-expressed in a  human colon cancer cell line, and its expression is reportedly associated with histological grade of differentiation and lymph node metastasis in CRC patients (39). MC1R expression is associated with a higher risk of melanoma, and has been used as a target in melanoma therapy (40). In another bioinformatic study, MCR1 was one of the five immune genes used in the prognostic risk model of colon cancer (41). PPARGC1A 1a is a prominent regulator of mitochondrial biogenesis and metabolism (42). It has been reported that it can regulate cell proliferation and invasion via the AKT/GSK-3b/b-catenin pathway in human CRC SW620 and SW480 cells (43). PRKDC mediates DNA repair and maintains genomic stability, and it is reportedly upregulated in CRC cancerous tissues compared with normal tissues, and associated with chemoresistance (44). Wang et al. (45) reported that CEBPB is a critical effector of autophagy via regulation of autolysosome formation, and that forkhead box protein O1/CEBPB/nuclear factor kappa B signaling is required for C-C motif chemokine ligand 20 expression to augment chemoresistance in CRC. GAB1 belongs to the Grb2-associated binder family, which includes scaffolding adapter molecules that participate in transducing key signals from multiple receptors such as growth factors, cytokines, and antigen receptors (46). Bai et al. (47) identified GAB1 as a target of miR-409-3p in CRC, and demonstrated its unique function in CRC cell migration and invasion. In the current study, the area under the ROC curve confirmed the satisfactory predictive efficacy of the risk model based on the 13 prognosisrelated IRGs identified, and the prognosis model risk score was an independent factor in multivariate Cox analysis. This innovative IRG-derived risk score model provides a new theoretical basis for predicting prognoses in CRC patients, and is expected to be applied in future clinical treatment. The immune microenvironment affects the progression and prognosis of different cancers. The ESTIMATE algorithm was first presented by Yoshihara et al. (48) in 2013. ESTIMATE algorithm-derived immune scores were calculated in clear cell renal cell carcinoma, and higher immune scores, stromal scores, and ESTIMATE scores were associated with worse survival outcomes, advanced tumor grades and higher pathological stages (49). The same results were evident in patients with lower-grade glioma (50) and gastric cancer (51). In the current study immune scores were significantly higher in the high-risk group and were associated with shorter overall survival. Different degrees of risk may therefore be associated with differences in immune infiltration, and different patients may derive different benefits from immunotherapy.
The level of immune cell infiltration into the tumor is related to tumor growth, progression, and prognosis, and this has been a focus of research in recent years (52,53). The biological software CIBERSORT developed in 2015 can calculate immune cell composition based on the gene expression profile of complex tissues (54). In the present study the expression profiles of CRC in the high-risk and low-risk groups were used to calculate immune cell compositions using CIBERSORT. In TCGA analysis CD4 + resting memory T cells were significantly higher in the low-risk group. CD4 + memory T cells impede the progression of tumor cells by supporting the proliferation of CD8 + cells, which move to tumor-related tissues and differentiate into effector cells. In one study increased disease-free survival was directly associated with higher proportions of resting and activated CD4 + memory T cells in breast cancer, implying an anti-tumor role of CD4 + memory T cells (55).In gene set enrichment analysis there were higher proportions of memory B cells, activated natural killer cells, CD8 + T cells, follicular helper T cells, and regulatory T cells in the high-risk group, and comparatively larger fractions of naive B cells, resting natural killer cells, CD4 + resting memory T cells, and plasma cells in the low-risk group. Lohr et al. (56) reported that mature plasma cells in tumor tissues were associated with a better prognosis in small cell lung cancer. Flammiger et al. (57) reported that Prostatespecific antigen recurrence-free survival was lower in patients with higher densities of FOXP3 + regulatory T cells, and that high levels of FOXP3 + regulatory T cells were associated with advanced prostate cancer tumor stage. We conclude that to an extent differences in immune infiltration may explain the differences in prognoses in high-risk and low-risk patients. The limitation of our study is that the cohort did not consisted of patients who treated with immune checkpoint inhibitors, therefore, although we find the potential immune IRGs but it's not clear if their classification using the immune-related genes are useful for predicting IO therapy in CRC.
In conclusion, in the current study an immune risk score model for CRC was established that could provide effective survival predictions in patients with CRC. Risk score was also significantly associated with immune score, stromal score, and immune cell infiltration. The study generated an alternative tool for survival prediction and treatment guidance in CRC.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding authors.

AUTHOR CONTRIBUTIONS
X-BM designed the study, conducted the experimental process and literature search, and generated the figures. Y-YX, M-XZ, and LW wrote and edited the manuscript. All authors contributed to the article and approved the submitted version.