The Bioinformatics-Based Analysis Identifies 7 Immune-Related Genes as Prognostic Biomarkers for Colon Cancer

Colon cancer poses a great threat to human health. Currently, there is no effective treatment for colon cancer due to its complex causative factors. Immunotherapy has now become a new method for tumor treatment. In this study, 487 DEGs were screened from The Cancer Genome Atlas (TCGA) database and ImmPort database, and GeneOntology (GO) functional enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis was performed. Hierarchical clustering of all samples revealed a significant correlation between colon cancer and immunity. The weighted gene co-expression network analysis (WGCNA) algorithm was used to identify key gene modules associated with immunity in colon cancer, here, module grey60 showed the highest correlation. A protein-protein interaction (PPI) network was constructed using the STRING database to screen hub genes, and subsequently, 7 immune-related genes the most closely associated with colon cancer were identified by differential expression in cancer and paracancer. Finally, a risk prediction model was developed using least absolute shrinkage and selection operator (LASSO) COX analysis, and the accuracy of the model was validated by GSE14333. This study determined that IRF4 and TNFRSF17 were immune-related genes in colon cancer, providing immune-related prognostic biomarkers for colon cancer.


INTRODUCTION
Colon cancer (COAD) is a malignant tumor of the gastrointestinal tract. Colon cancer is often classified together with rectal cancer such as colorectal cancer (CRC), which is one of the 5 most frequently diagnosed cancers worldwide, and COAD ranked the 2 rd highest cause leading to cancerrelated deaths worldwide in 2020 (1). The main risk factors cause COAD are poor dietary behavior, obesity, age, and hereditary mutations (2,3). Colorectal cancer incidence and mortality are showing an increasing tread, and new causes of colorectal cancer worldwide is expected to reach 2.2 million with 1.1 million deaths by 2030 (4,5). Colon cancer is a metastatic cancer with common distant metastatic sites in the liver, lung, bone, and brain (6). Patients with early-stage colorectal cancer have a prognostic survival rate close to 90% in contrast to a 5-year survival rate of only 14% for advance-stage patients. In clinical practice, most patient show metastasis by the time diagnosis (7), which is normally associated with a significantly poor prognosis (8). Therefore, developing an effective treatment method for colorectal cancer is currently an urgent task.
In recent years, cancer immunotherapy (CIT) has emerged as a new research hotspot, showing an important link between human immune system and cancer interactions (9). CIT has been implemented to improve the prognosis of patients with malignant solid tumors by mediating tumor destruction through activating anti-tumor immune responses (10,11). The tumor microenvironment (TME), which plays a crucial role in CIT, consists of immune cells, stromal cells, extracellular matrix, cytokines, and chemokines that could promote immune escape, tumor growth, and metastasis (12,13). Immune checkpoint inhibitors have been reported to enhance tumorspecific immune responses and reduce immune escape of cancer cells, thereby inhibiting tumor growth (14,15). Immune-related genes (IRG) have been validated as prognostic biomarkers in non-small cell lung cancer (16), liver cancer (17), clear cell renal cell carcinoma (18), and other cancers.
In this study, immune-related DEGs were screened from the TCGA database and ImmPort database, and the potential functions of DEGs were analyzed using the GO function and KEGG pathway. The WGCNA method was applied to construct a gene co-expression network to find key gene modules significantly associated with colon cancer. A PPI network was constructed to screen hub genes, and a prognostic prediction model was established using LASSO COX analysis. Moreover, the accuracy of the model was verified based on the association between immune-related genes and immune infiltrating cells to provide prognostic biomarkers and potential targets for immunotherapy of COAD.

Clinical Samples and Data Collection
RNA sequencing (RNA-seq) data and clinicopathological information of 455 colon cancer samples and mRNA data of 41 normal tissues were obtained from the TCGA database(see Supplementary Table 1 for details). A list of immune-related genes was downloaded from the Immunology Database and Analysis Portal (ImmPort) database (https://www.immport. org) (see Supplementary Table 2 for details). The score data of the 6 immune-infiltrating cells downloaded from the TIMER database. The expression matrix and clinical information of GSE14333 were acquired from GEO (https://www.ncbi.nlm. nih.gov/geo/) for model verification.

Differentially Expressed Genes (DEGs) and Functional Enrichment Analysis
Differential analysis of immune-related genes was performed using the R package limma package, with adjust P<0.05, |log2FC| > 1 as the screening conditions, and the genes were shown in volcano plots. Venn diagrams were used to take the intersection of the screened DEGs and immune-related genes. GO functional enrichment and KEGG pathway analysis of the intersected genes were conducted using the R package clusterprofiler package, and adjust P < 0.05 was considered as statistically significant. GSEA was performed between high and low expression of IRF4 and TNFRSF17. NOM p-value <0.05, FDR q-value <0.25, and | NES |> 1 were the threshold.

Hierarchical Clustering
Consistency analysis was conducted using the R package ConsensusClusterPlus to classify COAD into different subgroups using immune gene set and to observe the relationship between different subgroups and immunity. The maximum number of clusters was 6. The optimal number of clusters was determined by choosing an appropriate K value, and the clustering heat map was analyzed by the R package pheatmap. Heat maps showing the distribution of immune checkpoints in subgroups were plotted by R software ggplot2 and pheatmap.
Weighted Gene Co-Expression Network Analysis (WGCNA) WGCNA can predict genes associated with carcinogenesis (Yang et al., 2018). The co-expression network of genes in COAD was constructed using the R package WGCNA to analyze immunerelated hub genes and identify gene modules significantly associated with immunity. Soft threshold power (b) was set according to the scale-free topology criterion. The topological overlap matrix dissimilarity was calculated to construct a hierarchical clustering dendrogram to cut a dynamic tree to classify genes with similar expression features and to identify modules. The relationship between co-expression modules and phenotype data was shown by the module-feature relationship heat map, and key gene modules were selected.

Protein-Protein Interactions (PPI)
The STRING database (http://string-db.org/) is one of the online resources dedicated to the whole-object protein network, which can be used to analyze the functional interaction relationship between proteins (19). The screened modular genes were imported into the STRING database, PPI networks were constructed after isolating nodes and visualized using the MCC (Maximal Clique Centrality) algorithm in cytoHubba, a plugin for cytoscape. The top 10 ranked genes were filtered as hub genes. The expression levels of the 10 hub genes were compared in COAD and paracancer, and the DEGs with significant differences were selected.

Construction of a Prognostic Feature Model
The relationship between prognostic immune-related gene expression and OS (overall survival) was assessed using LASSO COX analysis. LASSO analysis was performed using the package glmnet. Prognostic risk prediction model for COAD was developed based on LASSO risk score calculation formula. Patients with COAD were divided into high-risk and low-risk groups according to the median risk score. KM curves were plotted to compare the OS between the two risk groups. ROC survival analysis was performed using the R package SURVIVAL, and the "rmda" package was employed for decision curve analysis. The association between the risk score model and tumor immune infiltrating cells was also investigated using spearman correlation analysis. Statistically significant was defined when P<0.05.

DEGs in COAD and Immune-Related Genes
DEGs were screened from 455 colon cancer tissues and 41 normal tissues in TCGA, and we obtained a total of 1404 upregulated genes and 1246 down-regulated genes(see Supplementary Table 3 for details). Volcano and heat maps are shown in Figures 1A, B, respectively. 487 overlapping DEGs were identified using venn plots to take the intersection of the screened DEGs and immune gene sets ( Figure 1C). GO function enrichment and KEGG difference analysis on overlapping DEGs revealed that in immune system processes, the immune response was the most significant biological process (BP), cytokine activity was the most significant molecular function (MF), the extracellular region was the most significant cellular composition (CC), cytokine-cytokine receptor interaction was the most significant KEGG pathway (Figures 2A-D). Tables 1 and 2 display the top 20 GO functions and KEGG pathways, respectively.

COAD Subgroups Were Significantly Associated With Immunity
Hierarchical clustering was performed on all samples ( Figures 3A, B). As shown in Figure 3C, the consensus matrix heat map was more neatly stratified when the optimal     number of clusters k=3, which indicated a better reliability and stability when the samples were divided into 3 subgroups. The relationship between the 6 immune infiltrating cells and each subgroup was presented in the heat map ( Figure 3D), and the percentage abundance of each cluster is shown in Figure 3E, from which myeloid dendritic cells showed the highest content. Figure 3F shows the distribution of the scores of the 6 immune infiltrating cells in the 3 subgroups. Kruskal-Wallis was used to analyze the expression distribution of immune checkpoint genes in the three subgroups. Each gene was significantly expressed, and its expression was upregulated in group2 and down-regulated in the other two groups ( Figure 3G).

WGCNA Identified Key Modules for COAD
A soft threshold of b=5 was chosen to create a scale-free topological network (Figures 4A, B). A total of 36 gene modules were generated and expression was shown using different colors ( Figure 4C). The number of genes contained  in each module is presented in Table 3. Among them, the grey60 module (R=0.35, P=2e-16) was the most closely associated with tumor immunity, therefore it was considered as the key gene module ( Figure 3D).

PPI Network Construction
A total of 218 genes in the key gene module grey60 were imported into the STRING database to obtain their interactions ( Figure 5A). The 10 top hub genes were screened ( Figure 5B), and the score of each gene was displayed in Table 4, from which IGJ (JCHAIN) showed the highest score. Subsequent identification of the hub genes revealed that 7 genes were significantly different in COAD versus paracancer and all their expression was down-regulated ( Figure 6).

Immune-Related Genes Were Associated With Prognosis
Based on LASSO Cox analysis, a DFS prognostic feature model was established using the 7 IRGs ( Figures 7A, B). When l min =0.0236, a prognostic model containing two genes was obtained. Riskscore=(-0.3063)*IRF4+(-0.0477)*TNFRSF17 was employed to calculate the risk score value of each sample, and divide all the samples into high-risk groups and low-risk groups with the best cut-off value (cut=-0.3). The KM curve showed the difference in survival between the two risk groups (P=0.00095), and the ROC curve showed the predictive ability of the model (Figures 7C, D). GSE14333 served as a validation set to classify the samples according to the best cutoff value (cut=-1.2), and draw the KM curve and ROC curve validation model ( Figures 7E, F). Univariate COX analysis showed that the risk score was statistically significant (P=0.022).
To verify whether the risk score model could reflect the status of the tumor immune microenvironment in patients, the relationship between the risk score model and immune infiltrating cells was investigated, as shown in Figures 8A-F. All the 6 types of immune infiltrating cells were found to be negatively correlated with the risk score.

IRF4 and TNFRSF17 Participated in the Immune Response of COAD
To understand the specific biological pathways of IRF4 and TNFRSF17, we conducted GSEA to explore their enriched gene sets in different subgroups. The top 10 KEGG pathways and hallmarks of each gene are shown in Tables 5 and 6. When IRF4 and TNFRSF17 were low-expressed and were significantly enriched in immune-related pathways (Figure 9), such as receptor interaction, inflammatory response, and hedgehog signaling pathway. The results indicated that IRF4 and   (JCHAIN)  133  2  IGLL5  127  3  CD79A  112  4  IGLL1  96  5  POU2AF1  84  6  MZB1  30  6  IRF4  30  8  IGHV3-11  24  8  IGHV3-15  24  10  TNFRSF17  16 Pan et al.  TNFRSF17 were involved in the immune response and may be potential indicators for COAD immunotherapy.

DISCUSSION
Tumor microenvironment has been increasingly important in cancer treatment. Tumor microenvironment consists of tumor cells and surrounding non-tumor cells, such as immune cells and fibroblasts (20). The composition of tumor immune cells is fundamental in determining the fate of tumors and their ability to invade and metastasize (21). Previous studies found that immune cells with different compositions behaves differently in colorectal cancer and normal intestinal tissues and differs at different stages of the tumor (22). Despite the advances in the treatment modalities of COAD, there is still an urgent need to address the issues related to immunotherapy of the cancer. The aim of this study was to identify immune-related genes in COAD and to develop a prognostic risk score model and validate its accuracy. Immune-related genes play an important role in tumor immunotherapy. As immune-related genes can be quantified in multiple cell types, their expression could serve as a better tumor biomarker (23). In this study, an initial screening of immunerelated genes in COAD was performed using the TCGA database and the ImmPort database. Subsequently, the function and pathway enrichment analysis of DEGs found that the molecular function was significantly enriched in cytokine activity, and it was significantly related to the cytokinecytokine receptor interaction pathway. Cytokines are products of immune cells that regulate the proliferation, differentiation, effector functions, and survival of leukocytes, and have the ability to enhance immune responses and destroy cancer cells (24,25). It has been shown that cytokines may be associated with tumor aggressiveness (26) and could directly or indirectly inhibit tumor cell growth (27). Cytokines also play important role in COAD, for example, interleukin-34 (IL-34) expression is upregulated in colorectal cancer and promotes cancer cell growth (28); interleukin-23 (IL-23)-induced immune cell activation exacerbates intestinal inflammation and promotes COAD growth (29).  Tumor microenvironment can lead to increased systemic inflammatory responses and oxidative stress fibrosis, and will affect cancer treatment and prognosis through its participation in metastasis, immune infiltration, some other pathways and functions of cancer cells (30,31). In recent years, the association between tumor immune infiltrating cells and cancer has received much attention from scholars. In this study, we performed stratified clustering on COAD and analyzed the relationship between the cancer and immunity. The results showed that COAD was significantly associated with immune infiltrating cells, which was consistent with the existing findings (32). Subsequently, 7 immune-related genes of COAD (CD79A, IGLL1, IRF4, JCHAIN, MZB1, POU2AF1, TNFRSF17) were screened by WGCNA algorithm and PPI network construction. Immune-related genes can promote tumor cell proliferation, invasion, and migration (33) and are associated with prognosis (34,35). In non-squamous non-small cell lung cancer and papillary thyroid cancer, immune-related genes demonstrated its prognostic significance (16,36). Prognostic prediction models have been developed based on immune infiltration for COAD (37). Also, prognostic correlation between COAD with different TNM stages and immune-related genes has been investigated using COX regression analysis (22). In this study, LASSO COX analysis was applied to establish a prognostic risk score model for COAD based on immune-related genes, and two prognostic signatures (IRF4, TNFRSF17) were obtained. The GSE14333 verification showed that the prognosis prediction model was relatively accurate, and the two prognostic signatures were significantly related to poor prognosis of the cancer. GSEA results demonstrated that a low expression of IRF4 and TNFRSF17 is related to the immune response signaling pathway.
In conclusion, this study revealed the pathways of immunerelated genes in the tumor microenvironment of COAD through bioinformatics analysis, and finally identified 7 immune-related genes the most closely associated with COAD development and progression. These findings provide immune-related prognostic biomarkers for COAD and provide effective targets for the clinical treatment of the cancer.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are publicly available. This data can has been deposited to NCBI [accession number: PRJNA303175 and GSE14333].