Identification of an Immune Signature Predicting Prognosis Risk and Lymphocyte Infiltration in Colon Cancer

Increasing studies have highlighted the effects of the tumor immune micro-environment (TIM) on colon cancer (CC) tumorigenesis, prognosis, and metastasis. However, there is no reliable molecular marker that can effectively estimate the immune infiltration and predict the CC relapse risk. Here, we leveraged the gene expression profile and clinical characteristics from 1430 samples, including four gene expression omnibus database (GEO) databases and the cancer genome atlas (TCGA) database, to construct an immune risk signature that could be used as a predictor of survival outcome and immune activity. A risk model consisting of 10 immune-related genes were screened out in the Lasso-Cox model and were then aggregated to generate the immune risk signature based on the regression coefficients. The signature demonstrated robust prognostic ability in discovery and validation datasets, and this association remained significant in the multivariate analysis after controlling for age, gender, clinical stage, or microsatellite instability status. Leukocyte subpopulation analysis indicated that the low-risk signature was enriched with cytotoxic cells (activated CD4/CD8+ T cell and NK cell) and depleted of myeloid-derived suppressor cells (MDSC) and regulatory T cells. Further analysis indicated patients with a low-risk signature harbored higher tumor mutation loads and lower mutational frequencies in significantly mutated genes of APC and FBXW7. Together, our constructed signature could predict prognosis and represent the TIM of CC, which promotes individualized treatment and provides a promising novel molecular marker for immunotherapy.


INTRODUCTION
Colon cancer (CC) is one of the most common cancers and remains one of the leading causes of cancer death worldwide (1). Despite continuous achievements in early CC detection, treatment, and management leading to reductions in the incidence and mortality, 30-50% of patients develop recurrence or metastasis within five years of treatment (2). Therefore, in addition to the current clinical and pathological factors for determining the disease prognosis and patient survival, reliable and robust new molecular markers are urgently needed to improve the personalized therapy for CC patients.
Numerous studies have recently highlighted the effects of the immune microenvironment on cancer tumorigenesis, development, and metastasis (3)(4)(5). Indeed, assessment of the enrichment of tumor-infiltrating lymphocytes (TIL) was demonstrated to be an important complement role to the TNM staging system for relapse and mortality prediction in CC (6)(7)(8). Besides, recent immunotherapies targeting specific immune checkpoints such as PD-1/L1 have demonstrated a remarkably durable response in CC treatment (9,10). Patients with certain histopathologic patterns, such as intratumoral infiltration by cytotoxic lymphocytes and tumor neoepitope burden, have also been reported with a better clinical prognosis (11)(12)(13). Conventional methods for detecting the tumor immune infiltrate, such as flow cytometry or immunohistochemistry (IHC), cannot comprehensively evaluate the immune effects due to the limitation of the number of immune markers. As an alternative, continuously accumulating transcriptomics data provides an ideal resource for largescale immune landscape analysis (14). However, there has been no appropriate signature that can systematically evaluate the tumor immune micro-environment (TIM) based on immune-related genes and predict the patients' survival or response to immunotherapies of CC patients. Therefore, it's essential to develop a reliable immune signature on the basis of a comprehensive list of immune-related genes to represent the immune status of TIM and have the prognostic ability of CC.
In this study, we concentrated on constructing an immune signature with survival prediction and lymphocyte infiltration estimation ability based on the comprehensive list of immunerelated genes curated from The Immunology Database and Analysis Portal (ImmPort) database (15). The microarray data from the gene expression omnibus database (GEO) database and RNA sequencing (RNA-seq) data from the cancer genome atlas (TCGA) database were used for analysis and validation. We then evaluated whether this signature was associated with survival outcomes and clinicopathological factors in the CC subgroups. Furthermore, we tried to figure out the relationship between the signature and tumor immune-related indexes including TIL and tumor mutation load (TML) in CC. And finally, we evaluated the effects of this risk signature in identifying the immune responders from immune check-point inhibitors (ICI) therapy. Findings gleaned from this study may be valuable for predicting patients' prognosis and guiding immunotherapy treatment for CC.

Publicly Attainable Expression Datasets and Immune-Related Genes
Gene expression data and clinical features of CC samples were retrospectively collected from publicly available datasets at the NCBI GEO database 1 and at TCGA 2 . The selection criteria of CC datasets were adopted from the workflow of the Dai et.al study (16). A total of 1430 patients were enrolled for analysis, including GSE39582 (N = 557) (17), GSE17538 (N = 200) (18), GSE37892 (N = 130) (19), and GSE33113 (N = 90) (20), and TCGA (N = 453). The GSE14333 (21) dataset was excluded from this analysis owing to the fact that its probe cell intensity (CEL) files overlapped extensively with the GSE17538 series. Among these cohorts, GSE39582 was the largest set consisting of 557 CC samples, and hence, it was marked as a discovery series and used for constructing the gene signature. Considering the small sample sizes of the GSE17538, GSE33113, and GSE37892 cohorts, and the fact that they shared the same microarray sequencing platform (Affymetrix HG-U133 plus 2.0), we integrated the three datasets into a combined large cohort and regarded this as external validation. The ComBat method from the SVA R package was used to remove the batch effects among different GEO datasets (16). The clinical and survival information of the included datasets was summarized in Supplementary Table S1. An immunotherapeutic cohort of advanced urothelial cancer (IMvigor210 cohort) treated with atezolizumab (anti-PD-L1 McAb) were utilized to further validate the efficiency of the immune risk signature (22). The detailed clinical annotations and complete gene expression profile of the anti-PD-L1 cohort were obtained from http://research-pub. gene.com/IMvigor210CoreBiologies. The comprehensive list of immune-related genes, containing a total of 1508 genes, was downloaded from the ImmPort database 3 (15).

Tumor Mutational Load and Neoantigen Analysis
The somatic mutational profile of colon adenocarcinoma (COAD) in the Cancer Genome Atlas (TCGA) mutation annotation format (MAF) were downloaded from Genomic Data Commons (GDC) 4 . Non-synonymous mutations counts were recognized as TML and used for investigating the relationship with the immune signature. Non-synonymous mutations included splice site mutation, missense mutation, nonsense mutation, in-frame mutation, and frameshift mutation. Neoantigens of COAD were collected from previously published studies (23). The antigen peptides resulting from non-synonymous mutated HLA sequences with predicted binding affinities below 500 nM are defined as neoantigens (24).

Gene Set Enrichment Analysis
The R packages limma (25) were used to evaluate differential expression of more than 21,000 genes in samples with different risk groups. To specify, the expression data were background corrected and quantile normalized and probe sets were summarized using RMA with the affy R package. Subsequently, the normalized expression data were then fed into lmFit and eBayes functions to calculate the differential statistics with the limma package. The logFC produced by limma was used as an input to perform gene set enrichment analysis (GSEA) (26) against the REACTOME reference gene set (MSigDB database v7.1). The fast GSEA algorithm implemented in the Bioconductor R package fgsea was used.

Immune Cell Infiltration Estimation With ssGSEA
The relative infiltration of 28 immune cell types in the CC tumor microenvironment were quantified by the single sample gene set enrichment analysis (ssGSEA) (27). Special feature gene panels for each immune cell subset were curated from a recent research (4,28). The relative abundance of each immune cell type was represented by an enrichment score in the ssGSEA analysis. The ssGSEA score was normalized to unify distribution from 0 to 1 for each immune cell type. The bio-similarity of the immune cell filtration was estimated by multidimensional scaling (MDS) and a Gaussian fitting model.

Quantify the Immunotherapy Response Predictor: Immunophenoscore
The superior immune response molecular marker, Immunophenoscore (28), was used to characterize the intratumoral immune landscapes and the cancer antigenomes. The scoring scheme was created from a panel of immune-related genes belonging to the four clusters: major histocompatibility complex (MHC)-related molecules, checkpoints or immunomodulators, effector cells, and suppressor cells. For each class, a sample-wise Z score from gene expression data was extracted and calculated. The weighted averaged Z score was then calculated by averaging the Z scores within the respective category leading to four values, and the sum of the weighted averaged Z score of the four categories.

Significantly Mutated Genes
We identified significantly mutated genes (SMG) by using the MutSigCV algorithm (29). MutSigCV measures the significant enrichment of non-silent somatic mutations in a gene by addressing mutational context-specific background mutation rates. Candidate SMGs were required to meet these criteria: statistically significant (q < 0.1) and expressed in the human cancer cell lines Encyclopedia (CCLE) (30).

Deciphering Mutational Signature Operative in the Genome
The R package Maftools proposed by Mayakonda et al. (31) was used to extract mutational signatures from the TCGA genomic data. The ExtractSignatures function based on Bayesian variant non-negative matrix factorization, factorized the mutation portrait matrix into two non-negative matrices "signatures" and "contributions, " where "signatures" represented mutational processes and "contributions" represented the corresponding mutational activities. Specifically, the number of columns of matrix "signatures" indicated the number of extracted signatures and the rows indicated the 96 mutational contexts (i.e., C > G, C > A, C > T, T > C, T > A, T > G and combined their 5 and 3 adjacent bases). The SignatureEnrichment function can automatically determine the optimal number of extracted mutational signatures and assign them to each sample based on the mutational activities. The extracted mutational portrait of CC was compared and annotated by cosine similarity analysis against the Catalogue of Somatic Mutations in Cancer (COSMIC) (32).

Statistical Analysis
The statistical analysis in this study was generated by R-3.6.1. For quantitative data, statistical significance for comparisons of two groups or more than two groups was estimated by the Wilcoxon rank-sum test or the Kruskal-Wallis H test, respectively. Fisher's exact test was employed for comparisons of qualitative variables. Logistic regression analysis was used to test the association between TML and risk signature. For the genes with prognostic ability, the Cox proportional hazards model with a Lasso penalty (iteration = 1000) was employed to find the best gene model utilizing the R package "glmnet." The immune signature was based on the linear combination of the selected mRNA expression level and weighted by their Lasso-Cox regression coefficients. The association between immune signature and prognosis were analyzed by the Cox proportional hazards model and the Kaplan-Meier survival analysis with the R survival package (Survminer 0.4.7). The receiver operating characteristic (ROC) curve was used to assess the prognosis classification performance of the immune risk signature and tumor stage, and the area under the curve (AUC) was compared by DeLong's test. All comparisons were two-sided with an alpha level of 0.05, and multiple hypothesis testing with the Benjamini-Hochberg method was used to control false discovery rate (FDR) (33).

Construction of Immune Risk Signature
After removing samples without necessary clinicopathological or follow-up data, a total of 1430 CC patients were included for this analysis, including GSE39582 (N = 557), GSE17538 (N = 200), GSE37892 (N = 130), and GSE33113 (N = 90), and TCGA (N = 453). Considering that the GSE39582 cohort from the Marisa et al. contained the largest sample size (N = 557) and included detailed clinical features, we therefore selected it as the discovery dataset to identify the immune risk signature Patients were divided into high-risk and low-risk groups and the median risk score was utilized as the cutoff value. that is associated with CC patients' prognosis. The univariate analysis was performed in all of the 1508 immune-related genes for the discovery dataset (GSE39582). Through the univariate analysis and log-rank test, a total of 161 genes with prognostic ability were identified (P < 0.05). The 161 immune-related genes were then subjected to the Lasso-Cox proportional hazards regression and tenfold cross-validation to generate the best gene model. The Lasso coefficient profile plot was produced against the log(k) sequence, and the minimize k method resulted in 10 optimal coefficients (Figures 1A,B). Finally, a gene model with 10 immune-related genes reached the optimal regression efficiency to speculate the prognostic ability.
The identified immune-related genes included antigen processing and presentation related genes (LAG3, PSMD11, TAP2), defense response to infection (CEBPB, CXCL9, IRF8, RNASE7), epithelial cell migration (ITGB1, SPARC), and MCFD2. Furthermore, we constructed an immune risk signature to estimate the risk score of each patient based on the linear combination of the 10 mRNA expression levels weighted by their Lasso-Cox regression coefficients: Immune Risk Figure 1C, Supplementary Table S1). A heatmap of the identified 10-gene expression level and the scatterplot of relapse-free survival (RFS) with corresponding risk score were illustrated in Figure 1C.
The Prognostic Value of 10-mRNA Immune Signature To identify the immune signature responsible for CC survival prediction, we divided the discovery cohort samples into a lowrisk group (N = 279) and a high-risk group (N = 278) by using the median risk score as a cutoff point. Patients with low-risk were significantly associated with better RFS compared with those of high-risk (P < 0.001, log-rank test; Figure 2A). This association remained markedly significant in the multivariate Cox model after controlling for age, gender, clinical stage, and mismatch repair (MMR) status (HR, 0.41 [95% CI: 0.30-0.57], P < 0.001; Figure 2B).
To confirm that the 10-mRNA-based immune signature classifier had similar prognostic value in different populations, we further corroborate this association in the TCGA dataset and another combined-GEO microarray dataset (including GSE17538, GSE33113, and GSE37892; "Materials and Methods" section). Heatmaps of the signature consisting of 10 immunerelated genes and the scatterplot of RFS time with corresponding risk score in two external validation cohorts were shown in Supplementary Figure S1. In the TCGA and combined-GEO datasets, we also found that patients with low-risk scores demonstrated a better prognosis than those with highrisk scores (TCGA: P = 0.003, Figure 2C; Combined-GEO cohort: P = 0.013, Figure 2E; log-rank test). Multivariate Cox proportional hazards regression analysis further revealed that the Frontiers in Immunology | www.frontiersin.org  Figure 2F). Further analysis confirmed that the higher immune risk signature score was associated with significantly worse tumor staging in CC cohorts (Kruskal-Wallis H test, GSE39582, and TCGA cohorts, both P < 0.001, Supplementary Figures S2A,B) Supplementary Figures S2C,D).

Extracted Immune Risk Signature Associated With Leukocytes Infiltration and Tumor Immunogenicity
Since the prognosis-related risk signature was extracted from the immune-related genes database, we speculated that its status may regulate the leukocyte infiltration and gene pathways enrichment. Therefore, we composed a heatmap with ssGSEA to visualize the relative abundance of 28 immune infiltrating cell subpopulations from the discovery dataset ( Figure 3A). Anti-tumor lymphocyte cell subpopulations, like activated CD4 + /CD8 + T cells, effector memory CD4 + /CD8 + T cells, and natural killer T cells were enriched in the low-risk signature group (P < 0.05). Nevertheless, myeloid-derived suppressor cells (MDSC), immature dendritic cells, neutrophils, and regulatory T cells, which belonged to protumor leukocytes, were elevated in the high-risk signature group (P < 0.05). We also further characterized the immune infiltration profile in TCGA and the combined-GEO validation cohort, and a similar tendency was observed in these cohorts of such risk signature stratification (Supplementary Figure S3).
Furthermore, GSEA on the CC gene expression profile against REACTOME reference datasets revealed the risk signature related biological signaling pathway. Genes involved in antigen processing cross-presentation, B cell/T cell receptor and immune cytokine signaling pathways were significantly enriched in the low immune risk signature group (Figure 3B, Supplementary Figure S4A). However, chromatin organization and RNA processing and modification were enriched in the high-risk group (Figure 3C,  Supplementary Figure S4B).  Immunophenoscore (IPS) was known to determine the tumor immunogenicity and predict response to ICI therapy in multitypes of tumors. Here, we utilized IPS to investigate the relationship between the newly identified signature and immune response. In the discovery cohort, the low-risk signature group had a significantly higher IPS compared with the high-risk group (Wilcoxon rank-sum test, P < 0.001, Figure 4A), and this association was also verified in TCGA and the combined-GEO cohort (TCGA, P < 0.001, combined-GEO cohort, P = 0.001; Figures 4B,C). These findings indicated that CC patients with the immune signature may be more sensitive to ICI treatment.

Immune Signature Determined the Colon Cancer Genomic Landscape
Genomic characteristics, such as tumor non-synonymous mutation load (TML) and mutational signatures (e.g., MMR, POLE signature) have shown a strong correlation with clinical response to ICI treatment (34). Therefore, we investigated the association between the immune signature and the genomic mutational landscape. Patients with a low-risk immune signature exhibited a higher mutation load than those with a high-risk signature in the TCGA dataset (P = 0.030, Figure 5A). We further compared the tumor neoantigen counts and observed similar results in the group classification (P = 0.005, Figure 5B). As high microsatellite instability (MSI-H) tumors accumulated substantial numbers of somatic mutations and significantly affected the TML, we removed the samples with MSI-H and obtained a significantly higher TML in the low-risk signature (P < 0.001, Figure 5C).
To gain further insights into the mutational processes operative in CC samples, we extracted the mutational signatures (i.e., signatures 1, 6, 10, Supplementary Figure S5) against the COSMIC database with varying numbers of somatic mutations from the genomic data ( Figure 5D) (Figure 5E). Hence, mutational counts attributed to signature 6 were significantly higher than other signatures (Kruskal-Wallis H test, P = 0.019). To rule out the possibility that associations between immune signature and TML were affected by these confounding factors, we included all mutational signatures and clinical factors in the multivariate logistic regression model. Associations between the immune risk signature and TML remained statistically significant (OR, 0.15 [95% CI, 0.06-0.24], P < 0.001, Figure 5F).
We also performed SMG analysis for CC samples in the low-risk versus the high-risk subgroup. The SMG mutational landscapes of these two subgroups (Figure 6)  The Immune Risk Signature in the Role of ICI Treatment Immune checkpoint inhibitors (ICI) therapy represented by anti-PD-1/L1 agents have undoubtedly made a great breakthrough in anti-tumor therapy. Therefore, we curated the gene expression profile and clinical features from an immunotherapy cohort (Imvigor210) of urothelial cancer (UC) treated by anti-PD-L1 agent, so as to investigate the relationship between the constructed risk signature and immune response. In this anti-PD-L1 cohort, patients with a low-risk immune signature score exhibited markedly clinical benefits and a significantly prolonged survival rate (HR, 0.71 [95% CI: 0.55-0.92], P = 0.009, Figure 7A). The significant therapeutic advantages and immune response to PD-L1 blockades were observed in samples with a low-risk score compared to those with a high-risk score (Fisher extract test, P = 0.008, Figure 7B; Kruskal-Wallis H test, P < 0.001, Figure 7C). Further analysis revealed that TML and neoantigen burden were significantly elevated in tumors with low-risk score, which closely linked to immunotherapeutic efficacy (Figures 7D,E). Besides, the association between immune risk score and immunotherapy survival remained statistically significant after taking into account gender, smoking, ECOG score, immunophenotype and, TML status (HR, 0.60 [95%CI, 0.40-0.90], P = 0.015; Figure 7F).

DISCUSSION
Although it has long been recognized that immune contexture plays a vital role in tumor initiation and development (35), these insights have not formed a significant impact on routine clinical application. This highlights the important role of TIM estimation in predicting clinical development and progression of CC patients. In this investigation, we established a reliable prognostic risk signature based on 10 immune-related genes in an independent microarray dataset and proved its efficacy in the TCGA and combined GEO datasets across different platforms. This signature stratified the patients into subgroups with different immune risk, representing distinct tumor immune infiltration level and neoantigen burden. Therefore, the newly identified immune risk signature presumably represented the status of TIM for CC patients and served as a potential biomarker for prognosis estimation and clinical response prediction to immunotherapy. This study confirmed that the immune risk signature was significantly associated with CC patients' RFS, and this association remained significant after controlling for clinicalpathological features. More importantly, our signature was based on immune-related genes and revealed a correlation with nonsynonymous mutation load and neoantigen counts. Considering the importance of mutational load in predicting the response to anti-PD-1/L1 treatment (12,36), we speculated that patients with a low-risk immune signature may be more sensitive to ICI therapy. Actually, the IPS and neoantigen load, which determined the tumor immunogenicity and antitumor immune response, also demonstrated a strong connection with this signature. To clarify the effects of this immune signature, we took TML and IPS as the confounding factors into the multivariate Cox regression models, and identified that the immune risk score remained statistically significant in the TCGA cohort (HR, 0.59 [95%CI, 0.39-0.87], P = 0.008; Supplementary Figure S7). These findings further indicated its practical implication in precision immunotherapy.
In recent years, numerous studies focused on the immune landscape have brought attention to biological and clinical cancer research. Individual immune cell markers such as CD3 + and CD8 + T cells have shown prognostic impacts in patients suffering from CC (6). The immune cell subpopulations estimation algorithm (e.g., ssGSEA, CIBERSORT) was frequently utilized to characterize the immune infiltration profiles and analyze the association with clinical therapy (37). Our research also leveraged the aforementioned method and demonstrated enhanced effector T-cells (CD4/CD8 + T cell, NK cells), reduced suppressive regulatory T-cells, and MDSC infiltration in low immune risk signature. Meanwhile, signaling pathways involved in the antigen processing and presentation, B cell/T cell receptor and immune cytokine were significantly altered in different risk subgroups, suggested that our signature was a superior prediction determinant of tumor immune infiltration.
Comprehensive knowledge of the mutated driver genes underlying human cancers is a critical foundation for cancer diagnostics, therapeutics, and selection of rational therapies. Here, we used MutSigCV algorithms followed by further filter criteria and identified that SMGs of APC and FBXW7 mutations were enriched in high-risk groups, TP53 and MSH6 were enriched in low-risk groups. APC was the most common mutational gene in colorectal cancer, and its mutation has indicated a highly significant association with immune resistance (38). FBXW7 is a critical tumor suppressor of human cancers, missense mutations in this gene show a shorter overall survival rate when compared with wild-type patients in CC (39). TP53 and MSH6 mutations may lead to a higher TML owing to the dysregulation of DNA damage repair function (40). Recent research suggested that TP53 mutations significantly induced the expression of immune checkpoint molecules and activated T-effector and interferon-γ signatures, indicating TP53 mutation patients would be more sensitive to checkpoint blockade (41).
Nevertheless, there were several limitations in our investigation. The main limitation stemmed from using a public dataset for different cohorts which can be somewhat heterogeneous in data processing and patient population. The risk signature was identified by using retrospective datasets, therefore, the expression profiles of the 10 genes combined with clinical validation in the patients of CC prospective cohort are needed to prove its efficacy. Besides, mutational results derived from the TCGA COAD genomic landscape were not validated in independent datasets owing to the unavailability of mutation data. Finally, due to a lack of CC cohorts being treated by ICIs, we are unable to verify the association between the signature and the immunotherapeutic responsiveness and believe further research is needed.
To summarize, this study identified a new immune risk signature that can not only predict CC patients' survival outcomes but also represent the immune infiltration status. This signature can be clinically utilized for the improvement of CC patients' survival, personalize therapy methods based on the risk score, and provide new clues for enrolling CC patients in ICI treatment. However, further randomized control trials are required to validate the significance of the generated signature.

DATA AVAILABILITY STATEMENT
The datasets generated for this study can be found in the GEO -GSE39582, GSE17538, GSE37892, and GSE33113.

AUTHOR CONTRIBUTIONS
HC, WC, and XYL: conception and design. XYL and HC: development of methodology. XYL, DW, XKL, and CY: acquisition of data (provided animals, acquired and managed patients, and provided facilities). XYL, HC, and WC: analysis and interpretation of data (e.g., statistical analysis, biostatistics, and computational analysis). HC and WC: writing, reviewing, and/or revising the manuscript, administrative, technical, or material support (i.e., reporting or organizing data, and constructing databases), and study supervision. All authors contributed to the article and approved the submitted version.

ACKNOWLEDGMENTS
We thank Xiangchun Li from Tianjin Cancer Institute for protocols and helpful discussions.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fimmu. 2020.01678/full#supplementary-material FIGURE S1 | Heatmap of the signature consisting of 10 immune related genes and the risk score curve in the TCGA and combined-GEO colon cancer cohort.
FIGURE S2 | Immune risk score associated with tumor stage. Distribution of immune risk score with respect to clinical tumor stage was shown in the GSE39582 (A) and TCGA (B) cohort. A receiver operating characteristic (ROC) curve was used to assess the prognosis classification performance of the immune risk signature vs tumor stage vs risk signature plus stage in the GSE39582 (C) and TCGA (D) cohort. The area under the curve (AUC) in different subgroups was calculated by DeLong's test.