Identification of an Immune-Related Gene Signature to Improve Prognosis Prediction in Colorectal Cancer Patients

Background Despite recent advance in immune therapy, great heterogeneity exists in the outcomes of colorectal cancer (CRC) patients. In this study, we aimed to analyze the immune-related gene (IRG) expression profiles from three independent public databases and develop an effective signature to forecast patient’s prognosis. Methods IRGs were collected from the ImmPort database. The CRC dataset from The Cancer Genome Atlas (TCGA) database was used to identify a prognostic gene signature, which was verified in another two CRC datasets from the Gene Expression Omnibus (GEO). Gene function enrichment analysis was conducted. A prognostic nomogram was built incorporating the IRG signature with clinical risk factors. Results The three datasets had 487, 579, and 224 patients, respectively. A prognostic six-gene-signature (CCL22, LIMK1, MAPKAPK3, FLOT1, GPRC5B, and IL20RB) was developed through feature selection that showed good differentiation between the low- and high-risk groups in the training set (p < 0.001), which was later confirmed in the two validation groups (log-rank p < 0.05). The signature outperformed tumor TNM staging for survival prediction. GO and KEGG functional annotation analysis suggested that the signature was significantly enriched in metabolic processes and regulation of immunity (p < 0.05). When combined with clinical risk factors, the model showed robust prediction capability. Conclusion The immune-related six-gene signature is a reliable prognostic indicator for CRC patients and could provide insight for personalized cancer management.


INTRODUCTION
Colorectal cancer (CRC) is the third most common malignant tumor worldwide and ranks second in tumor-related deaths (Bray et al., 2018). In China, CRC is third highest in annual incidence and is the fifth leading cause of cancer-related deaths . For operable disease, resection offers the best chance of long-term survival and potential cure (Adam et al., 2004). For inoperable patients, chemotherapy (mostly 5fluorouracil-or oxaliplatin-based) and target therapy (epidermal growth factor receptor or vascular endothelial growth factortargeted) have been the standard of care (Xie et al., 2020). However, despite recent advancements in chemo-regimens and clinical management, the overall survival of CRC remains unsatisfactory: The 5 years overall survival is just over 50% (Frampton and Houlston, 2017). More disconcertingly, there is great heterogeneity in individuals not only regarding tumor development but also in the response to uniform treatment: In those receiving surgeries, while some enjoyed disease-free survival, many suffered from tumor recurrence (Stelzner et al., 2019). The same is seen during non-surgical management, where tumor reactions vary: Less than 60% of patients had objective treatment response (Okuno et al., 2017), and adverse tumor response remains a strong predictor for unfavorable survival (Saskia et al., 2019). Thus, identifying reliable biomarkers for prediction of tumor behavior and outcome will benefit personalized modification in clinical management.
Recently, immune checkpoint blockade therapies that provide revolutionary treatments in multiple solid tumors (melanoma, non-small cell lung cancer, head-and-neck squamous cancer, colorectal cancer, etc.) have brought the community's attention to tumor-related immunology Pagni et al., 2019). It is increasingly recognized that immune conditions play a decisive role in the genesis and progression of malignant tumors. The host's immune dysfunction significantly impairs the body's anti-tumor surveillance, along with cells' immuneavoiding mechanisms acquired from the accumulation of gene mutations, marking a vital step toward tumor development (Croci et al., 2007;Shi et al., 2015;García-Albéniz et al., 2019). The most widely recognized prognostic biomarkers for immune therapy are programmed death ligand (PD-1), tumor mutation burden (TMB), and microsatellite instability/mismatch repair deficiency (dMMR) (Duffy and Crown, 2019). However, throughout the published research, these solitary biomarkers only showed moderate stratification efficacy (Snyder et al., 2014;Patel and Kurzrock, 2015;Van Allen et al., 2015;Mansfield et al., 2016), more so in CRC (Ciardiello et al., 2019), and a universal immunerelated gene (IRG) panel as prognostic signature in CRC has not been scored.
In the last decade, several limited-scale studies have attempted to develop a predictive gene signature to stratify high-risk populations using high-throughput technology (Ito et al., 2013;Abdul Aziz et al., 2016;Li et al., 2020). However, most suffer from overfitting due to insufficient sample pools, and external validation is rarely presented. In addition, differences among high-throughput protocols often lead to inconsistency in expression values among studies, presenting a challenge to comprehensive meta-data analysis. From this perspective, the publicly available large-scale genomic databases provide sufficient samples, comparable gene expression at the probe level, and solid follow-up information, and thus have been recognized as ideal platforms for gene signature construction and validation.
In this study, we aimed to identify and validate an IRG signature to stratify CRC patients with significantly worse survival in two independent public databases. The signature was then incorporated with clinical risk factors to provide robust prediction efficacy regarding long-term survival.

MATERIALS AND METHODS
A schematic of this study is shown in Figure 1.

Gene Expression Data Acquisition of CRC Patients
Two sets of colorectal cancer patients with clinical information, including survival status and survival time, were retrospectively enrolled from the publicly available The Cancer Genome Atlas (TCGA) 1,2 and Gene Expression Omnibus (GEO, GSE39582, GSE17538) data pool as training and external validation sets, respectively. CRCs with clinical variables and genes were comprehensively extracted using the following procedures. Samples with complete survival information were selected, and genes with missing expression values in >20% samples were removed.

Selection of Immune-Related Genes
To filter genes that actively participate in immune activity, a third comprehensive data set of immune genes was acquired from the Immunology Database and Analysis Portal database (ImmPort) 3 . After cross-referencing with the ImmPort database, a pack of 2,112 immune-related genes (IRGs) was obtained. As some genes showing no expression value in the above three gene expression profiles, a panel of 1,684 expressing IRGs was further selected for survival analysis (Supplementary Table S1).

Development of Prognostic Gene Models in the Training Set
Univariate Cox regression was performed for each gene regarding survival status to screen for prognostic immune-related genes. For those showing statistical significance (p < 0.05) in Cox regression, the random survival forest algorithm (RSFA) was adopted for dimensional reduction. Further, the risk scores of the prognostic models were determined as follows: where N is the number of genes, Expression is the gene expression value, and coefficient is the gene coefficient value in the Cox FIGURE 1 | Schematic of this study. Survival and relevant clinical information, along with IRG expression data, were acquired from an online database. The TCGA database (n = 487) was used as the training group. An IRG prediction panel was developed via the aforementioned procedures and was tested for its function via GO and KEGG analyses. The model's discrimination was assessed and validated in GSE39582 (n = 578) and GSE17538 (n = 224). TCGA, The Cancer Genome Atlas; IRG, Immune-related gene; RSFA, random survival forest algorithm.
regression analysis, while the median risk score was utilized to group the patients as Low-Risk and High-Risk population.
To rule out overfitting, we constructed full-scale combinations of genes yielded in the RSFA. Time-dependent receiver operating characteristic (time-ROC) analysis was used to test the performance. The C-index, which by value equals the area under curve (AUC), was used to evaluate the concordance between the prediction model and reality. The combination with the highest C-index was designated as the optimal prediction model, which was subsequently verified for performance in internal and external verification (GSE39582 and GSE17538).

Construction and Assessment of a Novel Nomogram Incorporating IRGs and Clinical Factors
We then sought to develop a comprehensive model with clinical features. Via univariate and multivariate Cox regression analyses, independent clinical risk factors (p < 0.05) were incorporated into the IRG panel. Based on these, a comprehensive prediction nomogram was formulated. Subsequently, we used a time-ROC test at different time points to test its performance in the training and validation groups. In addition, the nomogram's prediction bias was evaluated.

Statistical and Bioinformatics Analysis
Statistical analysis was performed with R Software (version 3.6.2), while pROC, TimeROC, randomForestSRC, and survival packages were utilized. Data distribution was validated using the Kolmogorov-Smirnov test. The statistical significance of continuous variables between the training and validation sets was measured using Student's t-test. Chi-square or ranksum tests were performed for layered variables. Kaplan-Meier analysis was used to assess the high-and lowrisk groups. A Z-test was adopted for statistical differences between ROC curves. The co-expressing genes (CEGs) of the selected IRGs were then screened using co-expression network analysis by Pearson test (| Pearson coefficient| 0.6, p < 0.001). To explore the function of the selected coexpressing genes, gene enrichment, namely gene ontology (GO) analysis and KEGG analysis, was analyzed by ClueGO (Bindea et al., 2009), a Cytoscape plug-in to perform GO and KEGG analysis.

Identifying the Prognostic Signature in the Training Set
Following the aforementioned criteria, three datasets with a total of 1,290 patients with CRC were enrolled: one training set (TCGA, n = 487) and two validation sets (GSE39582, n = 579; GSE17538, n = 224). The clinical characteristics are presented in Table 1. The median age of the patients in TCGA was 68 years. At the time of enrollment, most patients were alive (77.8% in the training and 66.6% and 59.8% in the two independent validation sets, respectively), and the median surveillance times were 699, 1,582, and 1,401 days in TCGA, GSE39582, and GSE17538, respectively. Most patients did not have lymph node involvement (stages I-II). The clinical characteristics of the training group (TCGA database, n = 487) and the two test groups (GSE39582, n = 578; GSE17538, n = 224). The median age of the training group was 68 years. By the end of surveillance, most patients were alive (77.8, 66.6, and 59.82%). CRC, Colorectal Cancer.
After cross-comparison with the ImmPort database, 2,112 immune-related genes were selected. Then, repeated, not, or inconsistently expressed genes were excluded, leaving 1,684 genes as candidates. For each gene, univariate COX regression was performed, and 143 IRGs suggested significant protective or risk effects (Figure 2A and Supplementary Table S1). Via RSFA, nine immune-related genes were identified as independent prognostic predictors ( Figure 2B).
Next, to explore the optimal IRG signature and preclude overfitting, we formed a panel of full-size combinations of these nine genes (2 9 − 1 = 511 combinations, Supplementary  Table S2). Using the previously discussed risk formula, 511 candidate predictive signatures were calculated. The performance of each signature was verified via a time-ROC curve. The AUC values of each were rated. A combination of the following genes: CCL22, LIMK1, MAPKAPK3, FLOT1, GPRC5B, IL20RB was screened out with the highest prediction precision (AUC = 0.746; Figures 2C,D). Each IRG's hazard ratio (HR) and p-value is listed in Table 2. Thus, the designated risk model was: Risk score = (−0.421 × expression value of CCL22) The Performance of the Signature in Predicting Overall Survival Using the IRG model, a risk score was calculated for individuals. In the training set, the Kaplan-Meier (KM) test was performed to verify the survival difference between the high-(n = 243) and lowrisk (n = 244) populations, divided by median risk-score-value. The method was consistent with other studies (Song et al., 2019;Wang et al., 2020a). As shown in Figure 3A, significant longevity was observed in low-risk patients in the training set (log-rank p < 0.001). The median survival time was 8.46 years in lowrisk patients vs. 4.12 years in high-risk populations. To explore this in other independent databases, the same methodology was then adopted for the GSE39582 validation set with a relatively larger sample pool (Figure 3B), and the model showed significant differentiation capability (median survival time: 8.83 years in the low-vs. 4.67 years in the high-risk group, n = 579, logrank p < 0.001). Finally, the survival prediction performance was tested in the GSE17538 dataset, and it could also distinguish the CRCs into high-or low-risk groups (5 years survival: 45.90 vs. 63.68% (n = 224), log-rank p < 0.05, Figure 3C).
When the patients in the training and two validation datasets are queued by risk score, clusters in gene expression level and survival information can be observed in Figure 4. In the training dataset (Figure 4A), patients with shorter survival times had higher risk scores, and genes with adverse prognostic effects, namely LIMK1, FLOT1, GPRC5B, and IL20RB, showed consistent elevation in expression in high-risk populations ( Figure 4A). In addition, consistent trends were confirmed in the two external validation sets (GSE39582 and GSE17538, Figures 4B,C, respectively).

The Relationship Between the Signature and Clinical Characteristics
We further explored the potential relationship between gene signature and clinical characteristics in TCGA and GEO databases ( Table 3). Neither patient age (stratified by 68 years) nor gender showed a correlation with gene signature via Pearson's χ 2 -test. Tumors' TNM staging was significantly advanced in the high-risk population in TCGA and GSE39582 (p < 0.001). In univariate Cox regression analysis, old age (>68 years), more advanced tumor stage (stage III and IV), and immunerelated gene signature showed statistically significant effects, confirming them as independent adverse predictors via the following multivariate COX regression. In all three groups, the gene signature suggested great predictive potential regarding the clinical outcomes of CRC patients (High-vs. Low-risk, HR training-TCGA = 4.56, 95% CI 2.81-7.40, p < 0.001, n = 487; HR test1-GSE39582 = 1.55, 95% CI 1.16-2.07, p < 0.001, n = 579;

Development of a Predictive Gene-Clinical Nomogram for Clinical Outcome
To achieve comprehensive outcome prediction, the six-gene prediction model was combined with clinical independent risk factors, namely tumor stage and age, and transformed into a predictive nomogram (Figure 5A) to provide a straightforward estimation of survival at 1, 3, and 5-year intervals. For instance, old-aged (>68 years) advanced-staged (stages III-IV) patients with a gene signature value of 4 would have a total risk score of roughly 60, and the odds of survival would be 80, 55, and   (Figure 5C). It could be judged from the nomogram that the six-gene signature The links between clinical factors and the IRG signature were investigated. The patients were divided into low-and high-risk groups according to median risk factor. Age and sex were verified via χ 2 -test, while tumors' T, N, and M stages were checked by the rank-sum test. Tumor pathological state and disease stage were significantly correlated with the IRG signature. IRG, Immune-related gene; CRC, Colorectal Cancer. ***P < 0.001.
was the most prominent predictor of patient survival, and the performance of the gene-clinical nomogram was consistent over various time points. To assess how this nomogram mimics a real situation, calibration curves using a 1,000-time bootstrap test were plotted. As shown in Figure 5D, in the training set, the nomogram presents good agreement between prediction and real situation. Furthermore, in the external calibration (Figure 5E), the calibration curve showed a minor wobble, but still in the near proximity of the 45-degree-dashed-dashed line. These results suggest that our nomogram closely predicts real-life situations, and via internal and external validation in two independent large-scale databases, the nomogram showed great utility.

Exploring the Function of the Signature
We then explored the potential genetic functions of the IRG panel. In the TCGA dataset (n = 487), the co-expressed relationships of the six genes with the protein-coding genes were computed using Pearson correlation test. The expressions of 446 protein-coding genes were highly associated with at least one of the genes in the signature (|Pearson correlation coefficient| > 0.60, p < 0.001). Next, we performed GO and KEGG pathway function enrichment analysis for these coexpressing protein-coding genes (ClueGo plug-in, Cytoscape). Clusters including 104 GO terms and 3 KEGG functionally pathways were identified (Supplementary Table S3, p < 0.05). The results of these analyses implied that the signature might be involved in tumorigenesis by interacting with proteincoding genes that affect important biological processes such as regulation of immune and inflammatory responses and metabolic processes ( Figure 6B).

DISCUSSION
Recent advances in immune checkpoint blockade therapy warrant further understanding of immune gene variation, and there is an imminent need for robust prognostic biomarkers to guide selective management strategies. In this study, we used In both the training and validation groups, patients' disease stage, age, sex, and IRG signature were tested in univariate and multivariate COX regression, in which age, tumor stage, and IRG signature were independent prognostic factors. The hazard ratios, 95% CIs, and corresponding p-values are given. IRG, Immune-related gene. *P < 0.05; **P < 0.01; ***P < 0.001. three independent, large-scale international genome databases for the exploration and validation of a prognostic IRG panel.
We performed dimension-reduction of acquired IRG data and ruled out overfitting, which is commonly seen in other studies.
We developed a full-scale recombination of nine figures (511 combinations). Finally, the most accurate six-gene prediction There have been a few recently published studies using IRG signatures to predict prognosis in CRC patients. However, not all of these were conducted with a reasonable sample size, and only moderate performance was achieved. Zuo et al. (2019) developed a six-gene signature model to forecast patient prognosis without external validation. In their study, gene selection was based on multivariate regression, and no recombination was performed to rule out overfitting. Indeed, the AUC was only 0.711 and 0.683 for the 3 and 5 years survival, respectively, and inconsistency was seen in the subgroup analysis. Bai et al. (2020) also reported a 14-IRG panel using the TCGA cohort with the absence of any external validation. In addition, they conducted GO and KEGG analyses not with CEG of the selected IRG, but with the CEG of the whole set of 676 IRGs. From this perspective, the accuracy of the analysis could be biased. Our proposed link between IRG signature and disease characteristics was further confirmed by Wang et al. earlier this year (Wang et al., 2020b). Regrettably, none of the abovementioned studies incorporated IRG signatures with clinical risk factors for outcome predictions, so their clinical utilities were largely limited. In contrast, the present study enrolled a large number of patients. The 1,290 CRC patients' IRG sequencing data and clinical characteristics were downloaded from three independent international databases that include patients of various regions and ethnicities, which adds to the utility and credibility of the IRGs signature.
The included immune-related genes for signature were CCL22, LIMK1, MAPKAPK3, FLOT1, GPRC5B, and IL20RB. Through a literature search, CCL22 was identified as an upstream regulator of the PI3K/AKT pathway. Secreted by M2 macrophages, CCL22 regulates the epithelial-mesenchymal transition (EMT) of CRC cells and promotes tumor resistance to chemotherapy Wei et al., 2019). FLOT1 also induces EMT and alters the cell cycle by modulating the Erk/Akt signaling axis . In addition, the prognostic value of IL20RB has been actively discussed in multiple tumors including glaucoma, anal cancer, and lung adenocarcinoma (Wirtz and Keller, 2016;Jeannot et al., 2018;Zhang M. et al., 2019). Moreover, MAPKAPK3 is a member of stress-responsive kinases that induce autophagy in terms of stress (inflammation, infection, and starvation) and thus determines cell fate (Wei et al., 2015;Menon et al., 2017).
The clinical application of gene mutations as prognostic biomarkers is largely limited thus far. The Ras family (KRas and NRas) has been recognized as an indicator of epithelial growth factor receptor status (Amodio et al., 2020). The BRAF V600E mutation was identified as an indication for anti-vascular endothelial growth factor (VEGF) treatment (Apte et al., 2019). Additionally, microsatellite instability has gained increasing attention with the introduction of immune therapy (Hause et al., 2016). However, using individual gene status as prognostic biomarkers did not yield very reliable efficiency (Sjoquist et al., 2018). Independent studies of single gene mutations tend to result in conflicting conclusions. The selected genes in this study involve multiple pathways that play a critical role in cancer development. This is understandable because oncogenesis is the result of several altered pathways that cannot be concluded with one single biomarker. By proposing a multi-gene prediction panel, this study provided insight regarding immunology in cancer development and progression.
While gene status only represents part of the bigger picture, patients' clinical features are also closely linked to their oncological outcomes. When the IRG signature was incorporated with independent clinical risk factors, the model presented good performance. The calibration curve also showed good agreement between model prediction and reality in the training set. Compared to traditional clinical risk scoring, incorporating our IRG signature with clinical risk factors would benefit prognostic prediction. For those anticipated to have significantly inferior survival, a more close-up surveillance strategy should be made to identify early onset of tumor recurrence after resection or tumor progression during non-surgical intervention. In addition, surgeons would be more informed when making treatment decisions.
We used bioinformatics tools to explore the high-dimensional connections and functions of the selected IRGs. We introduced 446 CEGs of the selected IRGs via a co-expression network and conducted a comprehensive interpretation of these genes regarding cellular functions and pathway enrichment. Immune cell adhesion, immune cell function regulation, and cytokine regulation were the most enriched functions based on GO analyses (Figure 6). The CEG showing the highest correlation was enriched in interleukin regulation (regulation of interleukin-10 and interleukin-10 regulation) and the most enriched cell functions were closely linked with RNA processing (GO terms, RNA binding, and ncRNA metabolic process) and immune regulation (GO terms, response to cytokine, regulation of cytokine production, and regulation of defense response).
Our study also has several limitations. First, the gene levels in different cohorts were not measured via universal sequencing protocols, which might have led to some inconsistencies, and the minor drift of calibration in the validation group to some extent might explain the slight decrease in C-index in the test group. It is important to recognize that microarray protocols among databases were not consistent and that different ethnic and geographical variations could result in reasonable inter-cohort bias. Second, in contrast to the volume of gene data, the clinical information in these databases was relatively limited, and it is best to combine the gene signature with more comprehensive clinical factors for optimal prognostic prediction.

CONCLUSION
Taken together, we developed a predictive IRG panel that can legitimately forecast CRC patients with CRC, and the gene signature was more robust when incorporated with clinical risk factors. Our model could potentially benefit individualized clinical management for patients with CRC. For instance, a shorter check-up interval should be considered for patients with adverse survival, as timely medical intervention would be ideal for tumor progression or recurrence.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
The data gathering was the joint effort of SD and SX. The bioinformatics analysis was performed by SD and YY.