A Combined Epithelial Mesenchymal Transformation and DNA Repair Gene Panel in Colorectal Cancer With Prognostic and Therapeutic Implication

Background Epithelial mesenchymal transformation (EMT) and DNA repair status represent intrinsic features of colorectal cancer (CRC) and are associated with patient prognosis and treatment responsiveness. We sought to develop a combined EMT and DNA repair gene panel with potential application in patient classification and precise treatment. Methods We comprehensively evaluated the EMT and DNA repair patterns of 1,652 CRC patients from four datasets. Unsupervised clustering was used for classification. The clinical features, genetic mutation, tumor mutation load, and chemotherapy as well as immunotherapy sensitivity among different clusters were systematically compared. The least absolute shrinkage and selection operator regression method was used to develop the risk model. Results Three distinct CRC clusters were determined. Clustet1 was characterized by down-regulated DNA repair pathways but active epithelial markers and metabolism pathway and had intermediate prognosis. Clustet2 was characterized by down-regulated both epithelial markers and DNA repair pathways and had poor outcome. Clustet3 presented with activation of DNA repair pathway and epithelial markers had favorable prognosis. Clustet1 might benefit form chemotherapy and Clustet3 had a higher response rate to immunotherapy. An EMT and DNA repair risk model related to prognosis and treatment response was developed. Conclusions This work developed and validated a combined EMT and DNA repair gene panel for CRC classification, which may be an effective tool for survival prediction and treatment guidance in CRC patients.


BACKGROUND
Colorectal cancer (CRC) remains a major cause of cancer-related mortality worldwide despite advancements in tumor screening, early diagnosis, and curative resection (1). Staging based on the tumor, nodule, and metastasis (TNM) is generally considered as the main tools for routine prognostication of CRC patients in treatment practice (2,3). However, heterogeneity of clinical process and treatment response are often observed between individuals in the same stage, which are often attributed to diversity of CRC (4). The diversity of tumors is also manifested at the molecular level. Tumors of the same histological subtype may have different genetic backgrounds and gene expression profile. Tumors of different histological subtype may share common genetic backgrounds and molecular features. Identifying tumor subtypes with different molecular characteristics and clinical outcome is important for the precise treatment of cancer.
In recent years, the molecular classification of CRC has received increasing attention. The international CRC Subtyping Consortium developed a transcriptomic classification of colorectal cancer, which classifies CRC into four biologically distinct consensus molecular subtypes (CMSs). CMS1 and CMS4 tumors have high levels of immune infiltration but antagonistic functional orientation. CMS2 and CMS3 are devoid of immune cell infiltration (5). CMS4 subtype has the worst prognosis. The French national Cartes d'Identite´des Tumeurs (CIT) program identified six molecular subtypes with distinct clinicopathological characteristics and molecular alterations (6). C1 (CIN ImmuneDown ) is more frequently chromosomal instability (CIN) and immunosuppression. C2 (dMMR) contains most deficient mismatch repair (dMMR) tumors. C3 (KRASm) is enriched for KRAS-mutant tumors. C4 (CSC) is characterized by presenting cancer stem cell (CSC) phenotype-like gene expression profile as well as up-regulating of the bottom crypt signature. C5 (CIN WntUp ) has frequency CNI with up-regulation of Wnt pathway. C6 is enriched for "normal-like'" tumor (7). Nevertheless, some defect limits the clinical application of the above-mentioned classification. There is no consensus on whether classification is associated with treatment response. Besides, tumor classification is based on whole-genome gene expression patterns, which increases the complexity of classification and decreases the feasibility of clinical application. And there is overlap between pathways enriched in different classification, increasing the uncertainty of the interpretation of the results. Selecting characteristic pathways for tumor classification may be a way to simplify the classification process and improve clinical utility, and assess the correlation between classification and treatment response.
Epithelial-mesenchymal transition (EMT) facilitates the acquisition of stem cell characteristics and sustains stem celllike populations (8). During the process of EMT, cancer cells lose their epithelial morphology and adopt a spindle-shaped and mesenchymal appearance progressively. Activation of EMT provides cancer cells with the enhanced plasticity required for invasion and metastasis (9). In CRC, EMT is strongly associated with tumor proliferation, infiltration, metastasis, tumor budding and drug resistance (10). Patients with active EMT tumor have poor prognosis. However, EMT is a reversible process, which offers new insight for the treatment of tumors (11).
Incorporating EMT gene expression profiles into CRC classification may identify a subtype of cancer with high malignancy and therapeutic implications.
DNA repair is a critical system for recognizing and repairing abnormalities in the structure or sequence of DNA. Mutations in DNA repair genes, including mismatch repair (MMR), can impair cells' ability to repair damaged DNA, leading to cell death or genome instability (12). Tumors with aberrant DNA repair pathway have increased mutational and neoantigen burden (13), which in turn were linked with greater tumor infiltration by activated T cells. DNA repair defects are associated with improved clinical response to PD-1 blockade, specifically, in CRC patients with deficient mismatch repair (dMMR) (14).
Therefore, we integrated EMT and DNA repair genes for CRC classification. Three CRC clusters with distinct prognosis and molecular characteristic were determined.

Clinical Specimens
In the present study, eight cases of CRC samples including two cased of metastatic CRC samples and six cased of non-metastatic CRC samples were obtained from patients at the Guangxi Medical University Cancer Hospital. The samples were subjected to RNA sequencing. All of the patients were pathologically diagnosed as CRC without chemotherapy or radiotherapy before the collection of the tissues. Written informed consents were obtained from all patients. The study was approved by the Ethics and Human Subject Committee of Guangxi Medical University Cancer Hospital. All experiments and methods were performed according to relevant guidelines and regulations formulated by the Guangxi Medical University.

RNA-Seq Analysis
Total RNA was extracted using Trizol reagent (Invitrogen). The construction of RNA-seq library was based on the protocol of the IlluminaTruSeq RNA Sample Preparation Kit (illumina). Finally, RNA-seq analysis was performed by GENE+ company (Beijing, China) using Illumina HiSeqX Ten platforms. After quality control and trimming adaptor, reads were mapped onto human genome GRCh38. RNA-seq data have been deposited in the China National Center for Bioinformation (ID: PRJCA003751).

Data Acquisition and Pre-Processing
Multiplatform genomics data was included in the study, including mRNA expression data, gene somatic mutation data, DNA copy data, and clinical information. For mRNA expression data, we collected the TCGA COAD AND READ datasets and three GEO datasets [GSE39582 (6), GSE17536 (15), and GSE14333 (16)] which meeting the following standard: samples were hybridized to the Affymetrix HGU133 Plus 2·0 (GPL570) platforms, each dataset contains more than 150 cases CRC patients, and information about the prognosis could be gathered. Besides, to analyze the efficiency of immunotherapy, we also included the "IMvigor" dataset using "IMvigor" package, which was generated from patients with metastatic urothelial cancer treated with anti-PD-L1 drugs (atezolizumab) (17). For TCGA mRNA datasets, the FPKM (fragments per kilobase of exon model per million reads mapped) normalized expression matrix was download form the Genomic Data Commons (GDC, https://portal.gdc.cancer.gov/). For microarray data, the raw "CEL" files were downloaded from GEO (http://www.ncbi. nlm.nih.gov/geo/) and subjected to a robust multiarray averaging method to perform background adjustment and quantile normalization using the "affy" packages (18). The corresponding clinical data was download at the same time. The gene somatic mutation data (MAF files) and DNA copy data (segment file) of TCGA COAD AND READ cohorts were download from GDC.

Generation of EMT and DNA Repair Gene Panel and Unsupervised Clustering
EMT related genes were obtained from published studies and DNA repair related genes were obtained from Molecular Signatures Database (MSigDB) (4,19,20). Univariate cox regression was used to screening prognostic genes using GSE39582. Genes with a p-value less than 0.1 was selected for further analysis. Unsupervised clustering analysis was applied to identify characteristic expression patterns based on the expression of EMT and DNA repair gene panel, and patients were classified for further analysis. We use a consensus clustering algorithm to determine the number and stability of clusters (21). The "ConsensuClusterPlus" package was used to perform the above steps with 500 times repetitions to guarantee the stability of classification (22).

Gene Set Variation Analysis (GSVA) and Functional Annotation
To investigate the biological pathways and processes enriched in different clusters, we applied GSVA which reckons the variation of pathway and bioprocess activity in the sample population by adopting unsupervised clustering method (23). The gene set files of "c2.cp.kegg.v6.2.symbols" and "c5.bp.v7.0.symbols" were downloaded from the MSigDB for running GSVA analysis using "GSVA" packages in R software. Adjusted P less than 0.05 was considered as statistically significance.

Development and Validation of EMT and DNA Repair Risk Model
In order to reduce the dimension and pick the most meaningful prognostic indicators, we applied the least absolute shrinkage and selection operator (LASSO) Cox regression model to the EMT and DNA repair gene panel. LASSO is a penalized regression method that determines the regression coefficients by maximizing the log-likelihood function, while limiting the sum of the absolute values of the regression coefficients (24). The regression coefficients estimated by LASSO are sparse and many components are exactly zero. Thus, LASSO automatically deletes unnecessary covariates (25,26). 10-fold cross validation was used to confirm the suitable tuning parameter (l) for LASSO regression. The significant genes selected by LASSO were subsequently subjected to stepwise cox regression. The eventual regression model was selected based on the Akaike information criterion (AIC). GSE39582 cohort was served as the training set and the TCGA cohort was served as the validation set. A predicted value was calculated for every patient in the validation set on the basis of the risk model constructed in the training set. The ROC and AUC were used to assess the predictive discrimination ability of the risk model.

Statistical Analysis
The statistically significant differences between non-normally distributed variables was analyzed by the Mann-Whitney U test, and normally distributed variables were reckoned adopting the unpaired Student's t-test. In order to compare more than two groups, used Kruskal-Wallis as non-parametric methods, and adopted one-way ANOVA tests as parametric methods. Spearman and distance correlation analysis were used to calculate the correlation. The survival curves for the prognostic analysis were generated via the Kaplan-Meier method and log-rank tests were utilized to identify significance of differences. Use Cox proportional risk model and the "LR forward" stepwise approach to perform univariate and multivariate analyses. Evaluate the survival prediction of accuracy of the prognostic model via a time-related receiver operating characteristic curve (ROC) analysis. The R software (version 3.5.0) was used to conduct all statistical analyses, and all statistical P values were two-side, with p < 0.05 as statistically significance.

Patient Characteristics and Prognostic Gene Identification
The patient characteristics contained in the datasets used in this study is summarized in Table 1. A total of 1,652 CRC patients from TCGA dataset and three GEO datasets (GSE39582, GSE17536, and GSE14333) were retrospectively analyzed in this study. Median age at diagnosis in different datasets ranged from 62 to 68 years. Male patients accounted for 54.48% (900/ 1652). EMT related genes were obtained from published studies (4,20) and DNA repair related genes were obtained from MSigDB. We used GSE39582 as training set to identified prognostic gene. 98 genes (DNA repair: 41; EMT: 57) were eventually identified and defined as prognostic EMT and DNA repair genes for further study. Interestingly most of the EMT genes are epithelial markers, which were down-regulated in mesenchymal cells. Detailed information of the 98 genes was listed in Supplemental Table 1. The protein interaction network of the 98 genes were shown in Supplemental Figure 1.

Identification of Distinct Molecular Clusters
Based on EMT and DNA Repair Genes repair genes. The optimal number of clusters was set at 3 ( Figure 1A), as suggested by Elbow method. The consensus matrix heatmap revealed the identified three clusters ( Figure  1B). It must be noted that the eventually incorporated EMT genes were principally epithelial cell markers whose expression levels negatively correlate with EMT. As shown in Figure 1D, CRC patients of different clusters possessed specific expression patterns of EMT and DNA repair genes. Cluster 1 (EPI H /DNA repair L ) had increased expression of epithelial markers but down-regulated DNA repair genes. Cluster2 (EPI L /DNA repair L ) was characterized by low expression of epithelial markers and DNA repair genes. Cluster3 (EPI H /DNA repair H ) presented with apparent increased expression of epithelial markers and DNA repair genes. We selected recognized DNA repair genes (MLH1, MSH2, PMS1, and PMS2), which are key genes for determining MMR status and widely used in clinical practice (27), and epithelial genes (CDH1 and DSP) as well as mesenchymal genes (VIM, SNAI1, SNAI2, TWIST1, MMP2, and FN1) to analyze their expression among the three clusters (28). As shown in Supplementary Figure 2, the expression of DNA repair genes (MLH1, MSH2, PMS1, and PMS2) and epithelial genes (CDH1 and DSP) were significantly increased in the Cluster 3(EPI H /DNA repair H ) while significantly decreased in the Cluster 2(EPI L /DNA repair L ). The expression of mesenchymal genes (VIM, SNAI1, SNAI2, TWIST1, MMP2, and FN1) were significantly decreased in the Cluster 3(EPI H / DNA repair H ) but increased in the Cluster 2(EPI L /DNA repair L ). These results indicated that DNA repair was active but the EMT was suppressive in Cluster 3, which contrasts with gene expression pattern in Cluster 2. The three Cluster had different survival profiles, with the Cluster 3 having the best prognosis but Cluster 2 having the worst prognosis ( Figure 1C).
We further validated the 98 genes panel in independent cohort. The first cohort was from TCGA comprised 619 cases of CRC. Three distinct molecular clusters were identified as described above (Cluster 1 (EPI H /DNA repair L ), Cluster 2(EPI L /DNA repair L ), and Cluster 3(EPI H /DNA repair H ), Figure 1E). Survival analysis confirmed that cluster have distinct outcomes. Here again, cluster 2 having the worst prognosis ( Figure 1F). The second cohort was from GSE14333 receive adjuvant chemotherapy.   As shown in Supplementary Figure 3A, three distinct molecular clusters were identified and Cluster 2 having the worst prognosis (Supplementary Figure 3B). The third validation cohort was from GSE17536 comprised 177 cases of CRC. We also identified three distinct molecular clusters as described above (Supplementary Figure 3C). Kaplan-Meier analysis revealed that the three subgroups have distinct outcome, that the Cluster 2 had the worst prognosis while Cluster 1 and Cluster 3 had similar outcome (Supplementary Figure 3D).

Correlation of the Clusters With Clinical Characteristics and Classical Classification
The relationships between CRC classifications and clinical characteristics were then investigated by using the GSE39582 (Figure 2A and Supplementary  Figure 2B summarized the relationship between CLT subtype and different clusters. There was no significant difference in the distribution of KRAS mutation, Tp53 mutation and gender among different clusters. We further validated the association by using TCGA dataset. As shown in Supplementary Figure 4, We again found that Cluster 2 was associated with a higher proportion of T4 and stage III-IV. But, node-negative CRC and patients without lymphatic invasion (LV) and vessel invasion (VL) have higher percentage in Cluster 3.

Characteristics of Tumor Genome Variation in Different Clusters
TCGA has completed a comprehensive molecular characterization of CRC, thus we analyzed the distribution differences of somatic single nucleotide variants (SNVs) among different clusters based on TCGA dataset. As shown in Figures 3A-C, the top three genes with the highest frequency of mutations in cluster1 were APC (82%), TP53 (58%), and KRAS (51%), and those in Cluster2 are APC (72%), TTN (51%), and TP53 (51%), and those in Cluster3 are APC (81%), TP53 (66%), and TTN (47%). There was no significant difference in the frequency of somatic mutations in the three clusters. Tumor mutation burden (TMB) is a measurement of somatic mutation carried by cancer cells and high TMB status presented a durable clinical response to anti-PD-1/PD-L1 immunotherapy in CRC (29). We compared the TMB among different clusters. as shown in Figure 3D, the Cluster2 and 3 had the highest TMB while the Cluster1 had the lowest TMB. These results indicated that Cluster2 and 3 might benefit from immunotherapy. Copy number variants (CNVs) are a key component of genetic variation and have a greater impact in the genome than SNVs. We investigated alteration frequency of CNVs among different clusters. A total of 352 genes with significant differences in amplification frequency or deletion frequency among the three clusters were identified. The genes location, amplification frequency and deletion frequency in each cluster was summarized in Figure 3E. Supplementary Figure 5 presented representative genes with significant differences in amplification frequency or deletion frequency among the three clusters. We performed gene enrichment analysis to explore biological processes and pathways involved in aberrant amplification or deletion of genes (Supplementary Figure 6). Genes significantly amplified in the Cluster3 were enriched in Defense response to bacterium and Focal adhesion, which indicated that Cluster3 might associate with immune and metastasis. Pathways enrichment analysis suggested that significantly amplified genes in Cluster2 were enriched in Cell cycle and Cell adhesion molecules, indicated that Cluster2 might associate with cell proliferation and metastasis.

Clusters Predicts Therapeutic Benefit of Chemotherapy and Immunotherapy
Adjuvant chemotherapy (ADJC) is the primary treatment strategy for patients with non-metastatic CRC cancer (30). Given that the GSE39582 dataset provided information on chemotherapy in patients, we utilized this dataset to analyze the relationship between EMT and DNA repair gene clusters and ADJC benefit. We used OS to assess treatment outcome. Interestingly, only patients in the Cluster 1 had improved OS after receiving ADJC ( Figure 4A). No significant difference in OS of patients in Cluster 2 and 3 regardless of whether they received ADJC ( Figures 4B, C). These results indicated that patients in the Cluster 1 might benefit from chemotherapy. Immunotherapy has recently emerged as an effective new therapy for CRC. However, immunotherapy is currently indicated only for CRC patients with dMMR, which only account for about 5%-15%. It is crucial to identify CRC patients benefit from immunotherapy. We collected an immunotherapy data set (Imvigor210) to explore whether the clusters could predict the immune treatment benefit. As shown in Figure 4D, the proportion of patients achieved a complete response (CR) or partial response (PR) was significantly increased in the Cluster3. These results indicated that patients in the Cluster 3 benefited from immunotherapy at a higher rate.

Biological Pathways and Processes Enriched in Different Clusters
To explore the biological characteristics among these distinct clusters, we performed GSVA enrichment analysis. It should be noted that this was a pathway-level comparison for exploring the biological significance behind the different clusters. It was not a re-phenotyping using a new set of genes. The enrichment analysis results of KEGG pathway showed that Cluster1 was markedly enriched in metabolic pathways such as Retinol Metabolism, Linoleic acid Metabolism, and Arachidonic acid Metabolism ( Figure 5A). Cluster2 presented enrichment pathways associated with EMT including ECM receptor interactions and Cell adhesion molecules (CAMs). While Cluster3 was prominently related to DNA repair pathways such as DNA Replication, Mismatch Repair and Base excision Repair. Figure 5B presented representative pathways and its enrichment scores in different clusters. Again, metabolic pathways had the highest enrichment scores in the Cluster1 and EMT related pathways including extracellular matrix (ECM), Wnt pathways, and TGF-b pathways had the highest enrichment scores in the Cluster2. DNA repair pathways had the highest enrichment scores in the Cluster3. The enrichment scores for the above pathways were significantly different (all P <0.05, Figure 5B). We further explored biological processes enriched in distinct clusters. Different clusters had characteristic biological processes (Supplementary Figure 7). Biological processes associated with Amino acid transport, Ion transport and Transmission of neural signal were significantly enriched in Cluster1 (Supplementary Figure 8A). Cluster2 were enriched in Mesenchymal formation, Immune response and Amino acid transport (Supplementary Figure 8B). Besides, biological processes significantly enriched in Cluster3 including RNA processing and DNA repair (Supplementary Figure 8C). Based on the above analyses, we were surprised to learn that three clusters had significantly distinct biological characteristics. Cluster1 was characterized by activation of metabolic pathways and Cluster2 was characterized by EMT activation. Cluster3 was characterized by activation of DNA repair.

Construction of EMT and DNA Repair Risk Scores Related to Prognosis and Treatment Response
To develop clinically useful prognostic and efficacy assessment models for individual, we applied the LASSO Cox regression model to the 98 EMT and DNA repair genes for dimension reduction. GSE39582 cohort was served as training set and TCGA cohort were served as validation cohort. As shown in Figures 6A, B, the most appropriate tuning parameter l for LASSO Cox regression analysis was determined to be 0.036 when the partial likelihood deviance was the smallest. The 16 genes with non-zero coefficients in the tuning parameter were selected and subject to stepwise cox regression. Ultimately, nine genes were used to constructed the scoring system. The hazard ratios and P-values of the nine genes in the scoring model were summarized in Figure 6C. We compared the expression of nine genes in different clusters, and interestingly, these nine genes were significantly differentially expressed in different clusters ( Supplementary Figure 9), suggesting that these genes represent characteristics of different clusters. Patients were divided into high-risk and low-risk groups according to the risk score predicted. And survival analysis demonstrated that the EMT and DNA repair risk scores had significant power to distinguish good from poor outcomes in CRC patients (P<0.001) ( Figure 6D). We further validated the scoring model in TCGA cohort. Patients with high-risk had worse outcomes compared with low-risk ( Figure 6E). ROC curve analysis revealed that the EMT and DNA repair risk scores had similar degree of discrimination in GSE39582 cohort and TCGA cohort (GSE39582: AUC= 0.714; TCGA: AUC=0.696, Figure 6F). The correlation between risk scores, gene expression and survival state were present in the Figures 6G, H. Next, we analyzed the association between risk scores and cluster. The Cluster 3, with a better prognosis, had the lowest risk score, while Cluster2, with the worst prognosis, had the highest risk score. And Cluster1, with intermediate prognosis, had medium risk score ( Figure 6I). We further validate the risk scores using in-house data. We found that patients with metastatic CRC had higher risk scores than patients with non-metastatic CRC, but the difference was not statistically significant, possibly because of the small sample size ( Figure 6J). These results indicated that the risk scores were closely related to prognosis and different clusters had distinct risk scores.
Since the EMT and DNA repair genes clusters were associated with immunotherapeutic response, we investigated whether the risk scores can predict immunotherapeutic benefit. Cluster 3 benefited from immunotherapy at a higher rate. We first compared the levels of risk scores in different clusters based on Imvigor210 cohort. Cluster 3 had lowest risk scores, which indicated that low risk scores predicated immunotherapeutic benefit (Supplementary Figure   10A). Besides, the proportion of CR or PR was significantly increased in patients with low risk (Supplementary Figure 10B). In patients receiving immunotherapy, patients with low risk had better prognosis than those with high risk (Supplementary Figure  10C). These findings suggested that low risk scores predicated immunotherapeutic benefit.

DISCUSSION
With the development of research, we gain a deeper understanding of the biological and molecular characteristics of CRC (31). CRC classification based on characteristic pathways may be a promising way to simplify the classification process and improve clinical utility. Activation of EMT pathways is associated with malignant behavior and drug resistance (32). While activation of DNA repair pathways is a key feature of "hot tumor" and a predictor of immunotherapy (33). In the present study, we identified three distinct CRC clusters based on a combined EMT and DNA repair gene panel.
The three CRC clusters differ significantly in clinical characteristics, prognosis, genomic variation, active pathways, and response to chemotherapy and immunotherapy ( Figure 7). Clustet1 (EPI H /DNA repair L ) was characterized  Chemotherapy is one of the main treatment strategies for CRC, which is critical for creating surgical opportunities and preventing tumor recurrence (34). Detecting patients who may benefit from chemotherapy is an important step in precision treatment. Activation of EMT is a recognized factor in the induction of chemotherapy resistance (35). 5-fluorouracil (5-Fu) based chemotherapy is commonly used in convention chemotherapy of CRC (36). The 5-Fu resistance is partially induced by EMT via the Akt gene or mediated by Twist, miR-200c, miR-141 (26,34). Besides, down-regulation of EMTrelated miR-200c and miR-141 could induced resistance to oxaliplatin, which is one of the most common drugs in CRC chemotherapy (37). Moreover, EMT is strongly associated with tumor proliferation, infiltration, metastasis, tumor budding (10). Given that Cluster2 presents with activation of EMT, we have reasons to infer that Cluster2 has a poor prognosis and does not benefit from chemotherapy. Metabolic reprogramming is a hallmark of malignancy (38). To support the rapid proliferation, progression, and metastasis, cancer cells rewire metabolic pathways via increased generation of adenosine triphosphate (ATP), macromolecule synthesis, and antioxidant regeneration (39). Abnormal metabolic pathways provide new targets for the treatment of cancer and sensitize cancer chemotherapy (40). For example, increased expression of MUC1 enhanced glycolysis, nonoxidative PPP, and pyrimidine biosynthesis (41). Inhibition of MUC1 sensitizes cancer cell lines to 5-FU (24,42). Combination of antimetabolic therapy and chemotherapy may yield better response rates (43). Based on our analysis, Cluster1 present with increased metabolism pathways, we speculated that Cluster1 patients may benefit from antimetabolic therapy and chemotherapy.
Currently, benefits of immunotherapy have received immense research interest because of the impressive long-lasting response seen in several solid tumors (33,44). In CRC, immune response and survival benefit were limited to mismatch-repair-deficient and microsatellite instability-high (dMMR-MSI-H) CRC patients, who account for only a small percentage of CRC patients (around 8%-15%) (3,45). Thus, the selection criteria for candidates who are likely to benefit from such regimens requires further investigation. In the present study, we found that patients in the Cluster3 had the highest response rate to immunotherapy (around 40%). Besides, Cluster3 was present with high proportion of dMMR and TMB, which were recognized immunotherapeutic response prediction marker. We infer that patients in Cluster3 may benefit from immunotherapy. In addition, an interesting phenomenon we found was that although Cluster3 had a higher proportion of dMMR, the expression of key MMR genes was elevated. The MMR gene expression products are called MMR proteins, and they exist as heterodimeric complexes for mismatch base identification and subsequent repair (45). Most mutations in the MMR gene interfere with dimerization, leading to proteolytic degradation of the heterodimer, resulting in the loss of obligatory and secondary proteins (27). This assumption may explain why mRNA is elevated but protein expression is down-regulated. Further research is needed to confirm this assumption.
In recent years, the availability of clinical-grade, rapid, and inexpensive benchtop next-generation sequencers, as well as prepackaged analytical software and reagents, has driven the rapid growth and popularity of gene panel assays in clinical laboratories (46). The gene panel amplifies only specific genes and therefore has the advantage of lower cost and faster speed (47). The limitations of gene panel assay are the high investment in equipment and the cost of sequencing reagents, making it impractical in the case of too small a total specimen volume. In addition, despite the wide application of the technology in recent years, there is still a shortage of experienced professionals. This lack of expertise results in variable quality of analysis and interpretation of the complex data. It is also unclear how to validate, control and charge for these tests, limiting their deployment in hospital laboratories (48). This study has some limitations. First, the patient population is heterogeneous due to the retrospective nature of this study. Second, the robustness of the predictive value of the gene panel needs further validation in large prospective clinical trials. Third, experimental studies are needed to further elucidate the biological significances of the gene panel. Fourth, although our proposed EMT and DNA repair gene panel has potential clinical applications, such as the development of molecular typing kits for colorectal cancer, many issues remain unresolved, such as further identification of target genes, design of probes and determination of expression thresholds.

CONCLUSION
In conclusion, the present study developed and validated a combined EMT and DNA repair gene panel for CRC classification. Three CRC clusters with distinct characteristics were identified. This gene panel may have clinical application for prognosis estimation and guiding chemotherapy as well as checkpoint inhibitors.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by The Ethics and Human Subject Committee of Guangxi Medical University Cancer Hospital. The patients/ participants provided their written informed consent to participate in this study.