Differential Expression Analysis Revealing CLCA1 to Be a Prognostic and Diagnostic Biomarker for Colorectal Cancer

Colorectal cancer (CRC) is a common malignant tumor of the digestive tract and lacks specific diagnostic markers. In this study, we utilized 10 public datasets from the NCBI Gene Expression Omnibus (NCBI-GEO) database to identify a set of significantly differentially expressed genes (DEGs) between tumor and control samples and WGCNA (Weighted Gene Co-Expression Network Analysis) to construct gene co-expression networks incorporating the DEGs from The Cancer Genome Atlas (TCGA) and then identify genes shared between the GEO datasets and key modules. Then, these genes were screened via MCC to identify 20 hub genes. We utilized regression analyses to develop a prognostic model and utilized the random forest method to validate. All hub genes had good diagnostic value for CRC, but only CLCA1 was related to prognosis. Thus, we explored the potential biological value of CLCA1. The results of gene set enrichment analysis (GSEA) and immune infiltration analysis showed that CLCA1 was closely related to tumor metabolism and immune invasion of CRC. These analysis results revealed that CLCA1 may be a candidate diagnostic and prognostic biomarker for CRC.


INTRODUCTION
Colorectal cancer (CRC) is the third most common cancer and the second most deadly cancer worldwide (1). The incidence and mortality of CRC continue to increase because of the lack of diagnostic biomarkers and inadequate understanding of the molecular mechanism (2). Detection and monitoring of CRC occurrence and progression are dependent on a combination of radiologic examinations and serum biomarker measurements (3); however, these methods have some limitations. In some cases, the levels of biomarkers do not change. In other diseases, the levels of biomarkers can change (4,5). In addition, some patients do not undergo colonoscopy because of the discomfort of this procedure (6). In the past few decades, advanced gene microarray and highthroughput sequencing technologies have been used to explore novel gene expression, treatment targets, and pathogenesis in CRC (7).
Robust rank aggregation (RRA) has been utilized in various recent cancer studies to overcome the limitations of substantial interstudy variability and the different statistical analysis methods used with different technological platforms (8,9). In our study, we used RRA to analyze 10 microarray datasets from the Gene Expression Omnibus (GEO) database and explored data from The Cancer Genome Atlas (TCGA) through WGCNA to identify differentially expressed genes (DEGs). Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses were used to explore the potential functions of these DEGs. We utilized regression analyses and the random forest method to develop and validate the prognostic model. Among the genes included in the model, we used MCC to calculate the top 20 hub genes. We explored biological functions through GO and KEGG analyses and utilized ROC curves to explore diagnostic value. We also explored the relationship between them. Based on the 20 hub genes, we utilized Kaplan-Meier (K-M) analysis to explore relationships with prognosis, and only CLCA1 had a close relationship with prognosis. We continued to explore the potential biological value of CLCA1. In addition, we utilized the online tool TISIDB and R packages to explore the functions of these genes in immunity and performed gene set enrichment analysis (GSEA) to investigate their potential functions in CRC.

Identification of Significant DEGs in CRC Samples
We downloaded the series matrix files from GEO and screened them with the R package "limma" for normalization and DEG identification. Then, the RRA method was utilized to integrate the results of these 10 datasets to identify the most significantly upregulated and downregulated genes (Supplementary file 1). Genes with an adjusted P value of <0.05 were considered significantly differentially expressed. For TCGA database analysis, we first separated mRNA and lncRNA data and used the R package "edgeR" (19) to identify DEGs. The following criteria were used to select DEGs: |log(foldchange)|>2 and P value<0.01 (Supplementary file 2). After obtaining the 2 sets of DEGs, we used the R package "WGCNA" to identify clinical trait-related modules (20). We used the online tool "VENN" (http:// bioinformatics.psb.ugent.be/webtools/Venn/) to generate a Venn diagram to identify genes shared between the key modules from the TCGA and GEO datasets (21). We ultimately obtained 129 DEGs.

GO and KEGG Functional Enrichment Analyses
We conducted GO enrichment analysis using the online tool Database for Annotation, Visualization, and Integrated Discovery (DAVID; https://david.ncifcrf.gov/) (22) and the R packages "digest" and "GOplot"; an adjusted P value of <0.05 was considered statistically significant. For KEGG pathway analyses, we used the R packages "clusterprofiler" (23), "org.Hs.eg.db", "enrichplot", and "ggplot2", with an adjusted P value of <0.05 considered statistically significant. Both GO enrichment and KEGG analysis results were visualized using the R package "GOplot".

Hub Genes from the DEG Network
We utilized the online STRING database (https://string-db.org/cgi/ input.pl/) to explore connections among the DEGs and visualized these connections by constructing a PPI network with Cytoscape software (version 3.6.1) (24). We utilized cytoHubba MCC to calculate the top 20 hub genes. We analyzed relationships between the 20 hub genes using the R package "psych".

Development and Validation of the Prognostic Model
We utilized R (version 3.6.1) to generate a matrix that included the clinical information and DEG expression. We used Cox regression analysis to build the prognostic model using the R package "survival" (25, 26) and online tool "SangerBox". Then, we utilized the R package "randomForest" to validate the prognostic model through risk score and calculate the accuracy, rrror rate, sensitivity and precision from a confusion matrix. The prognostic model was based on the TCGA database.

Diagnostic and Prognostic Value of the Hub Genes
We utilized SPSS to explore the diagnostic value of the genes for CRC and K-M analysis to determine the prognostic value. We validated the differential expression levels between control tissue and tumor tissue with the R packages "limma" and "beeswarm"

Analysis of the Association of Hub Gene Expression With Tumor-Infiltrating Immune Cell Infiltration
We utilized TISIDB (http://cis.hku.hk/TISIDB/) to explore the relationship between the expression of genes and infiltration of tumor-infiltrating immune cells, including CD4 + T cells, CD8 + T cells, B cells, neutrophils, monocytes, eosinophils, mast cells, DCs, NKT cells, NK cells, MDSCs, and CD56 cells (27,28). TISIDB is an online tool that includes genomic, transcriptomic and clinical data for 30 cancer types from the TCGA database.

GSEA of Hub Genes
We utilized GSEA, which was downloaded from (https://www. gsea-msigdb.org/gsea/msigdb), to explore the functions of the hub genes. We performed GSEA of the hub genes with the R package "clusterprofiler" (29) in data downloaded from the TCGA-COAD and TCGA-READ datasets and divided 482 samples into two groups: high expression and low expression. We utilized "c2.cp.kegg.v6.2.symbols.gmt" for analysis and to select the top 5 genes. Then, we used the R packages "plyr", "ggplot2", "grid", and "gridExtra" to integrate different significant pathways into a single diagram.

Validation of Protein Expression and Prognostic Value of CLCA1
We utilized GEO online tools PROGgene online database (http:// genomics.jefferson.edu/proggene/), The Human Protein Atlas (https://www.proteinatlas.org/), and Kaplan-Meier Plotter (http:// kmplot.com/analysis/) to explore the protein expression and prognostic value of CLCA1. The Human Protein Atlas is the online database which provides the distribution of human proteins in tissues and cells, and immunohistochemical techniques are used to examine the distribution and expression of each protein in 48 normal tissues and 20 tumor tissues. Kaplan-Meier Plotter is the online database which including the data from GEO, EGA and TCGA. Figure 1 shows the workflow of our study. We downloaded CRC samples from the TCGA-COAD and TCGA-READ datasets and identified DEGs between the control tissues and tumor tissues. In total, 2097 genes in the TCGA-COAD dataset and 2887 genes in the TCGA-READ dataset were differentially expressed between tumor and control tissues. The volcano plots of these genes are shown in Figure 2. According to the selection criteria for the GEO data, we selected 10 eligible CRC datasets for exploration. The characteristics of all datasets are shown in Table 1. RRA analysis of the GEO datasets identified 212 significantly downregulated and 136 significantly upregulated genes. The top 20 downregulated and top 20 upregulated genes are shown in a heatmap ( Figure 3).

WGCNA and Identification of DEGs
To identify the key modules most associated with CRC clinical traits, we performed WGCNA on the significant genes in the TCGA-COAD and TCGA-READ datasets ( Figures 4A-E).
Clinical information such as age, TNM grade, and survival time was retrieved from TCGA. By setting a soft-thresholding power of 5 (scale free R 2 = 0.89), we eventually identified 5 modules. From the heatmap of module-trait correlations, we found that the bule module was the most highly correlated with clinical traits, especially the futime (P=5.2e-10; Figure 4F). The  blue module contained a total of 299 genes, as shown in Figure  4F. We combined the genes from the blue module and the RRA analysis and used Venn diagrams to identify significantly DEGs common to the 2 datasets, as shown in Figure 5A. The 129 DEGs were visualized using STRING and Cytoscape software, and CytoHubba MCC was used to calculate the top 20 hub genes, as shown in Figure 5B.

Biological Value of the Hub Genes
Through CytoHubba MCC calculation, we obtained 20 hub genes. The 20 hub genes, which are shown in Figure 6A, were also closely related to each other. We utilized SPSS to explore their diagnostic value. ROC curve analysis showed that these 20 Figure 6B). We also explored the prognostic value of the hub genes, and only CLCA1 was closely related to survival time ( Figure 6C); the other genes are shown in Supplementary Figure 1. To further explore the functions of the hub genes, we conducted GO and KEGG analyses. The most significant GO terms for BPs, CCs, and MFs, as well as KEGG pathways, are shown in Figures 6D, E.

Development and Validation of a Prognostic Model Based on the Hub Genes
We utilized Cox proportional hazards regression analysis of the survival-related genes to develop the prognostic model   ( Figures 7A-C). According to the prognostic risk score value, CRC patients were divided into a low-risk and a high-risk group. The risk score distribution was analyzed and is shown in Figure  7A. The risk scores reflected the 1-year, 3-year and 5-year survival rates of CRC patients. The AUCs for 1-year, 3-year and 5-year survival are shown in Figure 7B. K-M curves were used to show the relationship of the risk score with overall survival (OS) in the low-risk and high-risk groups and verified that a low risk score had a stronger positive association with OS (P=0.0079; Figure 7C).  We utilized randomForest to validate the prognostic model. The training group contained 50 died and 263 living patients, and the validation group contained 23 died and 112 living patients. From the confusion matrix, we obtained the following values: accuracy = 79.3%%, error rate = 20.7%, sensitivity = 85%, and precision = 91.1%.

Assessment of the Clinical Significance of the Hub Genes
Among the 20 hub genes, CLCA1 was associated with survival time. We explored correlations between gene expression levels ( Figures 8A, B) and clinical features (Figures 8C-F). CLCA1 was downregulated during CRC, and no differences were identified in its expression across different stages and TNM grades. The persistently downregulated expression of CLCA1 underscores its diagnostic effectiveness.

GSEA for Hub Genes
We performed GSEA to investigate the potential functions of CLCA1 in CRC in the TCGA-COAD and TCGA-READ datasets ( Figure 8G). The top 5 upregulated pathways in which CLCA1 was enriched included "ascorbate and aldarate metabolism", "butanoate metabolism", "fatty acid metabolism", "starch and sucrose metabolism", and "valine, leucine, and isoleucine degradation".

Relationship of Hub Genes with Immune Infiltration of CLCA1
We utilized TISIDB to explore the relationship between hub gene expression levels and lymphocyte levels in colon and rectal cancer. CLCA1 exhibited no relationship or only a weak relationship with immune infiltration (Figure 9). CLCA1

Validation of Protein Expression and Prognostic Value of CLCA1
We utilized GEO datasets to valdiate the CLCA1,P value was >0.05 (Supplementary Figure 2). We also utilized The Human Protein Atlas and Kaplan-Meier Plotter database to validate the  protein expression and prognostic value of CLCA1 ( Figure 10). The protein expression in normal colon tissue was significantly higher than that in colon cancer tissue, and the same was true in the rectum. Both of the two databases showed the good prognostic value for CRC, the P value of The human protein atlas is <0.001 and KM plotter database is 0.0064.

DISCUSSION
CRC is the most frequently diagnosed gastrointestinal cancer (30), and the current colonoscopic diagnosis of CRC has limitations (31); therefore, identifying a significant biomarker for CRC is necessary. Gene microarrays were utilized to discover novel biomarkers or therapeutic targets for CRC (32). To our knowledge, our work is the first to combine RRA analysis of GEO datasets with WGCNA of TCGA datasets to explore the significant genes associated with CRC. Some diseases of the intestinal tract, such as intestinal polyps (33) and inflammatory bowel disease (34), can have symptoms similar to those of CRC and can also develop into cancer. To explore potential DEGs between tumor tissue and noncancerous tissue, we compared normal tissue, normal matched tissue and paratumor tissue with CRC tissue as control tissue. We integrated 10 datasets from GEO, TCGA-COAD and TCGA-READ and identified robust DEGs, such as SST (35), SLC26A3 (36), and SLC4A4 (37), which have been reported to be diagnostic biomarkers or therapeutic targets for CRC. We used GO and KEGG enrichment analyses to explore the functions of the DEGs identified by overlapping the DEGs in the 3 datasets. GO analysis indicated that negative regulation of growth, bicarbonate transport, and transporter activity (38)(39)(40) were closely related to the development and growth of cancer; some KEGG pathways, such as nitrogen metabolism and retinol metabolism, were also linked to the pathogenesis of CRC. Nitrogen is an essential biomolecule in humans and regulates  cellular metabolism, and retinol is a form of vitamin A closely related to immune functions (41,42). Based on the results of the GO and KEGG enrichment analyses, the DEGs were closely associated with CRC occurrence and development. Cytohubba can extract key sub-networks, and MCC is a newer algorithm of cytohubba. To identify the key genes among 129 DEGs, we utilized MCC to determine the top 20 hub genes (CLCA1, TMIGD1, SLC30A10, MT1F, MT1M, MT1G, MT1H, MT1E, GUCA2B, GUCA2A, SLC26A3, CLCA4, MS4A12, SI, SLC9A2, GCG, PYY, SST, SLC4A4, and SLC16A9). To explore the potential functions of the hub genes, we utilized R packages, ROC curves, K-M analysis, and GO and KEGG analyses. According to the results, all the hub genes were closely related to each other and had high diagnostic value, but only CLCA1 was associated with survival time. In addition, the hub genes were closely associated with the development of CRC.
To determine the hub genes significantly associated with overall survival, we utilized Cox proportional hazards regression analysis to develop a prognostic model. We explored each gene's characteristics. Used ROC curve and random forest analysis to verify the model. The AUC values were high for 1-year, 3-year, and 5-year survival, all of which demonstrated the intermediate value of the prognostic model. Then, we calculated the risk score of each patient and divided the patients into a high-risk group and a low-risk group. K-M risk survival analysis showed that the model can predict survival time. Then we utilized the random forest method to validate the prognostic model for CRC, which showed high prognostic value for CRC. The yielded the following values: accuracy = 79.3%, sensitivity = 85% which reflecting good prognositc value for CRC. Among the hub genes, only CLCA1 was associated with a good prognosis in CRC but their dignostic value is very high. In accordance with the expected results, the expression of CLCA1 protein was down-regulated in colorectal cancer tissues. To demonstrate the prognostic value of CLCA1, we conducted the external validation in three online databases and the results of The Human Protein Atlas and KMplotter showed that CLCA1 has a high prognostic value. The results were inconsistent with GEO database, and the unsatisfactory results of validation of GEO database may be related to the possible influencing factors such as sample size, experimental environment and methods.
Although CLCA1 has a high prognostic value for CRC, the mechanism of its influence is unclear. To further explore its characteristics, we analyzed differences in CLCA1 expression levels between tumor and normal tissues, across clinical stages, and across TNM stages. There was a large difference between tumor and normal tissues, but no significant differences were found across the different stages of CRC. This pattern indicates that CLCA1 levels decrease starting from the initial development of CRC and have diagnostic value at every stage of CRC. The characteristic expression of CLCA1 may provide a new perspective for exploring CRC at the gene level and serve as a useful diagnostic biomarker for CRC.
To explore the mechanisms of the hub genes in CRC, we utilized TISIDB and the R package "estimate" to assess immune infiltration and GSEA data of biological functions for CLCA1. TISIDB and the estimated score analysis indicated that CLCA1 had a weak relationship with lymphocyte expression and was expressed mainly in CRC cells. GSEA indicated that CLCA1 was enriched in "ascorbate and aldarate metabolism", "butanoate metabolism", "fatty acid metabolism", "starch and sucrose metabolism", and "valine, leucine, and isoleucine degradation", suggesting that CLCA1 can influence CRC development and progression through different metabolic pathways. This result provides new insight into the mechanism and pathology of CRC.
In summary, we determined that CLCA1 could be used as a prognostic marker for CRC and correlated with immune infiltration. It may be a potential therapeutic target for CRC to improve the prognosis of patients. However, our work has some limitations. First, more work needs to be done on the pathogenic immune responses and gene expression in CRC cells to identify the mechanism linking the immune response with the development of CRC. Second, validation in GEO datasets is not ideal and pure bioinformatics analysis cannot well prove the prognostic significance of CLCA1 in colorectal cancer, in future research we will focus on large-scale population for further investigation. Furthermore, basic research needs to be done to verify our model and the regulatory mechanism in vitro and in vivo.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

AUTHOR CONTRIBUTIONS
F-ZW designed the research. Z-JW, S-WM, JL, and H-YS organized the data. J-NC, F-QZ, and ZL analyzed and visualized the data. F-ZW drafted the article. QL revised the paper. All authors contributed to the article and approved the submitted version.
Marathon Special Fund (No. LC2017L07), and the Medical and Health Science and Technology Innovation Project of the Chinese Academy of Medical Sciences (No. 2017-12M-1-006) of China. In addition, the authors wish to thank all teachers in the Department of Colorectal Surgery, National Cancer Center/Cancer Hospital.