Systematical Analysis of the Cancer Genome Atlas Database Reveals EMCN/MUC15 Combination as a Prognostic Signature for Gastric Cancer

Digestive cancers-including gastric cancer (GC), colorectal cancer, hepatocellular carcinoma, esophageal cancer, and pancreatic cancer-accounted for 26% of cancer cases and 35% of cancer deaths worldwide in 2018. It is crucial and urgent to develop biomarkers for the diagnosis, prognosis, and therapeutic benefits of digestive cancers, especially for GC, since the incidence of GC is lower only than lung cancer in China, is hard to detect at an early stage, and is associated with poor prognosis. Mucins, glycoproteins encoded by MUC family genes, act as a part of a physical barrier in the digestive tract and participate in various signaling pathways. Some mucins have been used or proposed as biomarkers for carcinomas, such as MUC16 (CA125) and MUC4. However, there are no systematic investigations on the association of MUC family members with diagnoses and clinical outcomes even though relevant data have been largely accumulated in the past decade. By analyzing transcriptomic and clinical data of digestive cancer samples from TCGA involving colon adenocarcinoma (COAD), esophageal carcinoma (ESCA), liver hepatocellular carcinoma (LIHC), stomach adenocarcinoma (STAD), and pancreatic adenocarcinoma (PAAD), it was found that expressions levels of MUC15, MUC13, and MUC21 were individually associated with survival for digestive cancers, and high expressions of EMCN (MUC14) and MUC15 were correlated with poor survival for STAD. Cox regression analysis indicated the predictive power of an EMCN/MUC15 combination for overall survival (OS) of GC patients, which was validated on an independent dataset from GEO. EMCN/MUC15 correlated genes were identified to be enriched in cancer-related processes, such as vasculature development, mitosis, and immunity. Therefore, we propose that an EMCN/MUC15 combination could be a potential prognostic signature for gastric cancer.


INTRODUCTION
Digestive cancers are a group of cancers that occur in the digestive tract, and include gastric cancer (GC), colorectal cancer, hepatocellular carcinoma, esophageal cancer, and pancreatic cancer. Digestive cancers accounted for around 26% of cancer cases and 35% of cancer deaths in the world in 2018 (Bray et al., 2018). Among them, the morbidity and mortality of GC in Eastern Asia is much higher than the worldwide average level. In China, the incidence of GC is only lower than lung cancer, and the mortality is third to lung cancer and liver cancer (Chen et al., 2014). Most patients suffering from early stage GC are asymptomatic and always develop distant metastasis at the time of diagnosis (Van Cutsem et al., 2016;Bray et al., 2018). Surgery is the main treatment for GC. Adjuvant or neoadjuvant therapy combined with surgery is commonly used to treat advanced GC, while targeted drugs for advanced GC, such as the HER2 (also known as ERBB2) antibody trastuzumab, and the VEGFR-2 antibody ramucirumab, are still in clinical trials (Van Cutsem et al., 2016). Therefore, developing biomarkers for the diagnosis, prognosis, and therapeutic response of digestive cancers, especially of GC, is necessary and urgent for reducing the mortality rate.
Mucins represent a group of glycoproteins encoded by MUC family genes. These high-molecular weight and filamentous glycoproteins could be classified into secreted mucins and membrane-bound mucins. In the digestive tract, secreted mucins form a mucus layer and act as part of a physical defensive barrier against external aggressive forces (Dekker et al., 2002;Dhanisha et al., 2018); membrane-bound mucins possess membrane specific domains which enable their diverse roles in signaling pathways (Dekker et al., 2002;Dhanisha et al., 2018). Not surprisingly, dysfunction of mucins in their fundamental roles is implicated in disease development at mucosal surfaces (Corfield, 2015;Dhanisha et al., 2018), and some mucins have been reported to display diagnostic or prognostic significance in different types of cancer. For example, MUC16, also known as CA125, is a widely used biomarker for the diagnosis of ovarian cancer (Yonezawa et al., 2011;Jonckheere and Van Seuningen, 2018) and was also found to be over-expressed in several other human malignancies, including pancreas, breast, and lung (Aithal et al., 2018). MUC4 promotes carcinogenetic progression and has been proposed as a promising biomarker for pancreatic, ovarian, esophagus, and lung cancers (Kaur et al., 2013;Jonckheere and Van Seuningen, 2018). MUC15 overexpression is significantly correlated with several types of cancers, including colon cancer, hepatocellular carcinoma, and thyroid cancer (Huang et al., 2009;Nam et al., 2011;Wang et al., 2013;Choi et al., 2018). Moreover, MUC4/MUC16/MUC20 highexpression signature was very recently reported to be correlated with poor overall survival (OS) in several types of digestive cancers including pancreatic, colon, and GCs (Jonckheere and Van Seuningen, 2018). However, there are no systematic investigations, so far, on the association of MUC family members with diagnosis, prognosis, and/or therapeutic benefits, even though the Cancer Genome Atlas (TCGA) project is producing massive genomic, transcriptomic, proteomic, and clinical data involving more than 11,000 patients of 33 different types of tumors (Weinstein et al., 2013), and meanwhile, a number of web tools, such as GEPIA (Tang et al., 2017) and cBioPortal for Cancer Genomics (Cerami et al., 2012;Gao et al., 2013), have been developed that enable users to easily and effectively mine TCGA data.
In the present study, by analyzing digestive cancer samples from TCGA involving colon adenocarcinoma (COAD), esophageal carcinoma (ESCA), liver hepatocellular carcinoma (LIHC), stomach adenocarcinoma (STAD), and pancreatic adenocarcinoma (PAAD), we found that expression levels of MUC15, MUC13, and MUC21 were individually associated with survival for all these digestive cancers, and high expressions of EMCN (MUC14) and MUC15 were correlated with poor survival for STAD. Cox regression analysis showed that EMCN/MUC15 combination still exhibited a significant correlation with the OS of GC patients. The prognostic prediction power of signature EMCN/MUC15 was further validated on an independent GC dataset, GSE84437. EMCN/MUC15 top 50 correlated genes were identified to be enriched in cancer-related processes, including vasculature development, mitosis, immunity, and so on. Taken together, we propose EMCN/MUC15 combination as a potential prognostic signature for GC.

Datasets
Datasets were collected from TCGA 1 and GEO 2 (Barrett et al., 2012). Specifically, gene expression data (TPM, Transcripts Per Kilobase Million) and clinical data for digestive cancers including COAD, ESCA, LIHC, STAD, and PAAD, were analyzed with the online webserver GEPIA 1.0 (Tang et al., 2017). Among them, MUC family mRNA expression data (mRNA expression z-scores, which is based on RNASeqV2 processed and normalized using RSEM) and clinical profiles involving 407 STAD samples were extracted by using an online web tool cBioPortal for Cancer Genomics (Cerami et al., 2012;Gao et al., 2013). Additionally, GSE84437 were extracted from the GEO database, which involves mRNA microarray data and clinical profiles of 433 GC samples.

Survival Analysis
Kaplan-Meier (KM) survival analysis for digestive cancer samples as a whole was carried out by using the webserver GEPIA 1.0 (Tang et al., 2017), and for GC samples (TCGA-STAD from cBioPortal and GSE84437 R package survival 3 was used. KM analysis was based on individual gene expression value and survival data. By using the median expression value of a query gene in a certain sample group as a cutoff, the samples were split into high and low expression groups with the expression level of the query gene not less than and less than the cutoff. The Cox proportional hazard model was built by using R package survival, fitted with two genes' expression values for OS or disease free survival (DFS). Similar to the individual gene analysis, the median value of weighted expression value (shortened as WEV) of a gene combination in a certain cohort were used as a group cutoff, where WEV was calculated as the sum of cox-regression coefficient weighted expression value of each gene involved in the combination. Log rank p-values, cox proportional hazard ratios (HRs), and HR p-values were calculated to compare the survival between two groups split by the median value of gene expression or WEV. A p-value of less than 0.05 and HR greater than 1.05 or less than 0.95 suggest statistical significance of the survival difference between high and low groups, which indicates the corresponding gene or gene combination has a prognostic potential.

Gene Co-expression Analysis and Enrichment Analysis
Gene co-expression analysis was carried out using webserver cBioPortal, and the top 25 positively correlated and top 25 negatively correlated genes were selected according to Spearman correlation coefficients, which were taken together and simplified as "top 50 correlated genes" in our results. Here, correlated genes met two criteria: the absolute value of Spearman correlation coefficient is greater than 0.25, and the p-value is less than 0.01. Gene set enrichment analysis (GSEA) was performed by using R package clusterProfiler (Yu et al., 2012). The pathways enriched for GO (Gene Ontology) (Ashburner et al., 2000; The Gene Ontology Consortium, 2019) were plotted based on the negative logarithm of p-value.

MUC15, 13, and 21 Display Prognostic Potential for Digestive Cancer on TCGA
Aiming to assess the prognostic potentials of every MUC gene, KM survival analysis was applied to TCGA digestive cancer samples as a whole involving COAD, ESCA, LIHC, STAD, and PAAD by using the webserver GEPIA 1.0 (Tang et al., 2017). Among the 14 MUC family members with expression data available, the expression levels of MUC1, MUC5AC, MUC6, OVGP1 (MUC9), MUC13, EMCN (MUC14), MUC15, MUC16, MUC17, and MUC21 individually exhibited significant correlations with OS, with HR p-values less than 0.05 and HR greater than 1.05 or less than 0.95; similarly, MUC2, MUC3A, MUC12, MUC13, MUC15, MUC17, MUC20, and MUC21 were significantly correlated with DFS (Table 1 and Supplementary Figure S1). MUC13, MUC15, MUC17, and MUC21 were significant for both OS and DFS, among which MUC15 performed best for OS correlation and the second best for DFS correlation. In comparison, MUC13 displayed the best performance in DFS analysis, while ranked relatively lower (9th) in OS analysis; MUC21 ranked 3rd for OS, and 8th for DFS (Table 1 and Supplementary Figure S1). These indicate that MUC15 represents a promising candidate for developing strategies for prognosis prediction for digestive cancers.

MUC14 (EMCN) and 15 Display Prognostic Potential for Gastric Cancer on TCGA-STAD
To investigate the prognostic potentials of MUC family genes for STAD, we performed KM survival analysis exclusively on STAD samples from TCGA with R package survival. It was found that the expression levels of EMCN (MUC14) and MUC15 individually showed significant correlations with both OS and DFS, and MCAM (MUC18) was significant only with OS ( Table 2). KM survival plots, together with log rank p-values,   cox proportional HRs, and HR p-values summarized in Figure 1 indicated that EMCN performed better than MUC15 in both OS and DFS analyses. Overall, EMCN and MUC15 could be potential biomarkers for STAD prognosis.

EMCN/MUC15 Combination Could Serve as Prognostic Signature for Gastric Cancer
So far we have observed that high expressions of both EMCN and MUC15 were associated with poor prognosis in GC, and that EMCN and MUC15 displayed the strongest correlation to survival for GC and digestive cancers, respectively (Table 2 and Figure 1). Thus, we set out to investigate whether EMCN/MUC15 combination could be a prognostic signature for GC. Cox proportional hazards regression analysis was performed based on the two genes' expression values and OS data derived from TCGA STAD dataset. As expected, the expression of EMCN/MUC15 combination exhibited significant correlation with OS, with log rank p-value of 0.00299 and HR p-value of 0.00301 (Figure 2A). We then separately tested the prognostic prediction power of EMCN, MUC15 and their combination on an independent dataset, GSE84437, which involved 433 GC samples. Again, significant results of EMCN/MUC15 combination (HR = 1.33) were obtained with log rank p-value being 0.0419 and HR p-value being 0.0413 ( Figure 2B); while one single gene, EMCN (HR p-value of 0.0807, HR = 1.27) or MUC15 (HR p-value of 0.156, HR = 0.82), had no significant prognostic prediction power, as shown in Supplementary Figure S2. We therefore proposed that EMCN/MUC15 combination could be a potential prognostic signature for GC.

EMCN/MUC15 Correlated Genes Are Functionally Enriched in Cancer Related Processes
By using webserver cBioPortal, the top 50 EMCN-( Table 3) or MUC15-( Table 4) correlated genes were identified based on mRNA expression data of TCGA STAD samples, including the top 25 positively correlated genes and top 25 negatively correlated genes. It is noticeable that there is no intersection between the two top 50 gene lists at all and no coexpression between EMCN and MUC15 (Spearman's Correlation of 0.0264 with p-value of 0.592) either, implying the functional complementarity between EMCN and MUC15 and thus the rationality of the combination of the two genes in predicting prognosis for GC.
We then performed functional enrichment analysis with the two top 50 correlated genes as a whole. GSEA identified a total of 22 GO terms (Figure 3 and Supplementary Table S1). Among them, the most significant pathways were associated with vasculature development, such as glomerulus vasculature development and renal system vasculature development. Some enriched pathways are associated with mitosis, such as mitotic sister chromatid segregation and mitotic metaphase plate congression. Some pathways were associated with immunity, such as inflammatory cell apoptotic process and response to interferon-gamma. The other enriched pathways were involved in DNA binding, cell cycle phase transition, cell polarity, phosphatase activity, and side of plasma membrane (Figure 3 and Supplementary Table S1). These indicate that genes correlated with EMCN and MUC15 in GC tend to be enriched in cancer related processes, such as vasculature development, mitosis, and immunity.

DISCUSSION
In the present study, by systematically analyzing mRNA expression and clinical data of TCGA digestive cancer samples and GEO GC samples, we propose MUC15 as a promising candidate for prognosis prediction of digestive cancers, and EMCN/MUC15 combination as a potential prognostic signature for GC.
Gene signature identification is essentially a process of dimension reduction of high dimensional data. On one hand, a signature involving less features or genes obviously has more practicality; on the other hand, a signature is also expected to have sufficient interpretability, although it is far from achieved. In this sense, a good signature is supposed to consist of orthogonal or mutually exclusive features which are able to hold a testable hypothesis from a systematic viewpoint while also sustaining the robustness and reliability of the signature. However, most current efforts in this field focus on reducing dimension over enhancing explanatory power of the signature. In our work, although EMCN and MUC15 coding genes belong to the same gene family, it is noted that there is no expression correlation between the two genes and no intersection between their top 50 correlated genes, implying the orthogonality and functional complementarity between EMCN and MUC15. As we expected, the combination of EMCN/MUC15 shows more robust prognostic power than the individual genes in GC according to the testing result implemented on an independent dataset GSE84437. These observations not only support the rationality of the combination of the two genes in predicting prognosis, but also indicate the explanatory power of EMCN/MUC15 signature, which is supposed to play an important role in the robustness improvement. EMCN, i.e. MUC14, encodes a membrane-bound protein, endothelial sialomucin or mucin-like sialo glycoprotein, which was reported to inhibit cell and extracellular matrix interaction, interfere with leukocyte-endothelial cell adhesion, and even promote the peritoneal metastasis process of GC cells (Liu et al., 2001;Zahr et al., 2016;Dhanisha et al., 2018;   (Table 3). CD34, a marker of vascular endothelial cells, is capable of supporting cell adhesion by increasing surface expression (Nielsen and McNagny, 2008). PECAM1, also known as CD31, encodes platelet endothelial cell adhesion molecule 1 that is necessary for leukocyte transendothelial migration (TEM) (Dasgupta et al., 2009). It is noteworthy that EMCN/COL4A5/CCL11 combination was very recently reported as a prognostic signature for diffuse type GC (Bao et al., 2019). In our study, among MUC family members, EMCN exhibits the strongest correlation with survival for GC. Taken together, EMCN may play crucial roles in tumorigenesis and progression of GC via cell adhesion and TEM of lymphocytes. MUC15 also encodes a membrane-bound protein, which could promote cell proliferation, cell-extracellular matrix adhesion, colony forming ability, and invasion in colon cancer cells (Huang et al., 2009). Its overexpression is significantly correlated with diverse cancers (Pallesen et al., 2002;Shyu et al., 2007;Huang et al., 2009;Nam et al., 2011;Wang et al., 2013;Choi et al., 2018). However, it was also found that the expression of MUC15 decreased in hepatocellular carcinoma cells and negatively regulated metastasis of hepatocellular carcinoma (Wang et al., 2013). This suggests that MUC15 may perform diverse functions in tumorigenesis and progression. In our study, MUC15 displays the strongest correlation among the MUC family with survival for digestive cancers and MUC15 overexpression seems to be a promising candidate for a prognosis biomarker of digestive cancers. Combined with EMCN, the two genes provide a potential prognostic signature for GC and show more robustness in the prognostic prediction power than individual genes. As far as we know, the association of MUC15 with GC is rarely reported.
In summary, we propose EMCN/MUC15 combination as a prognostic signature with mechanistic interpretability. It not only possesses prognostic capability for GC, but also offers clues for further exploring systematic mechanisms of carcinogenesis of GC and other digestive cancers.

AUTHOR CONTRIBUTIONS
Y-YL and WD designed the study. WD and JL implemented the data analysis. BL, QL, and QS provided the valuable suggestions. JL and WD drafted the manuscript. Y-YL revised the manuscript and coordinated the study. All authors read and approved the final manuscript.