Screening and Identification of Key Biomarkers of Gastric Cancer: Three Genes Jointly Predict Gastric Cancer

Background Gastric cancer (GC) is one of the most common cancers all over the world, causing high mortality. Gastric cancer screening is one of the effective strategies used to reduce mortality. We expect that good biomarkers can be discovered to diagnose and treat gastric cancer as early as possible. Methods We download four gene expression profiling datasets of gastric cancer (GSE118916, GSE54129, GSE103236, GSE112369), which were obtained from the Gene Expression Omnibus (GEO) database. The differentially expressed genes (DEGs) between gastric cancer and adjacent normal tissues were detected to explore biomarkers that may play an important role in gastric cancer. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses of overlap genes were conducted by the Metascape online database; the protein-protein interaction (PPI) network was constructed by the STRING online database, and we screened the hub genes of the PPI network using the Cytoscape software. The survival curve analysis was conducted by km-plotter and the stage plots of hub genes were created by the GEPIA online database. PCR, WB, and immunohistochemistry were used to verify the expression of hub genes. A neural network model was established to quantify the predictors of gastric cancer. Results The relative expression level of cadherin-3 (CDH3), lymphoid enhancer-binding factor 1 (LEF1), and matrix metallopeptidase 7 (MMP7) were significantly higher in gastric samples, compared with the normal groups (p<0.05). Receiver operator characteristic (ROC) curves were constructed to determine the effect of the three genes’ expression on gastric cancer, and the AUC was used to determine the degree of confidence: CDH3 (AUC = 0.800, P<0.05, 95% CI =0.857-0.895), LEF1 (AUC=0.620, P<0.05, 95%CI=0.632-0.714), and MMP7 (AUC=0.914, P<0.05, 95%CI=0.714-0.947). The high-risk warning indicator of gastric cancer contained 8


BACKGROUND
Gastric cancer (GC) is one of the most common cancers all over the world and causes high mortality. Especially in China, where almost a third of the world's new cases of GC occur (1). The treatment of gastric cancer has limited effect. Although the survival of some patients with advanced gastric cancer can be prolonged through chemotherapy, most chemotherapy has limited efficacy and a short maintenance time. The 2-year survival rate is less than 10% (2). The rapid development of tumor transcriptome data based on the second-generation high-throughput sequencing technology comprehensively reveals the multi-genetic and highly heterogeneous measuring points of the tumor. The diagnosis of the tumor molecular and targeted treatment is an effective means for improving the early diagnosis rate (3). It is also the direction and goal of the development of the clinical treatment of the tumor, and the achievement of this goal will ultimately depend on the search for specific tumor biomarkers (4).
Detection through serum tumor markers is a noninvasive diagnostic method commonly used in the clinic. However, conventional assays for carcinoembryonic antigen (CEA) and carbohydrate antigen 19-9 (CA19-9) are not specific or sensitive enough for accurate diagnosis of GC, and it is necessary to develop some novel biomarkers (5). Bioinformatics technology has been increasingly used to authenticate the differentially expressed genes (DEGs) and underlying pathways that are related to the occurrence and progression of GC, which can help researchers excavate the potential genetic targets of diseases (6).
However, it is difficult to obtain credible results when using the independent microarray technology because of the higher false-positive rates (7). Therefore, this study reanalyzed four expression profiling datasets downloaded from the Gene Expression Omnibus (GEO) dataset. T1he DEGs were searched for by GEO2R tools. The Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) analyses of overlap genes among four datasets were conducted in the Metascape database. The molecular mechanisms of the occurrence and progression of gastric cancer were detected by enrichment analysis of functions and pathways and protein-protein interaction (PPI) network analysis. The hub genes were analyzed by Cytoscape. The survival curve analysis was performed by km-plotter and the stage plots of hub genes were generated by the GEPIA online database. CDH3, LEF1, and MMP7 were used as candidate biomarkers to construct a back propagation neural network model.

Downloaded Public Data
The GEO (8) contains a variety of high-throughput sequencing experimental data. Four expression profiling datasets [GSE118916 (9), GSE54129 (10), GSE103236 (11), and GSE112369 (12)], which all screen genes associated with the formation of gastric cancer from normal tissue, were downloaded from the GEO human gene expression array.

Intra-Group Data Repeatability Test
The statistical analysis and graphic drawing were performed via the R programming language. The correlation between all samples from the same dataset was tested using a correlation heat map, which was constructed in R. The method of principal component analysis was used to analyze the dimensionality of the data and observe the distribution of the data (13).

Identification of DEGs
GEO2R (8) was used to identify DEG between gastric cancer and normal tissues adjacent to the cancer. We set the threshold as logFC≥1 or ≤-1, P value < 0.05. A volcano mapping tool (https:// shengxin.ren) was used to map the volcano. Four datasets were then introduced into Fun Rich (a feature rich analysis tool) (http://www.funrich.org/) to filter DEGs. The DEGs shared between the four datasets were obtained via the Venn diagram, which was depicted by the Venn tool. The Circos diagram shows gene overlap and function overlap between different gene lists.
WGCNA performed module identification. Then, WCGNA established a gene co-expression network and we studied the module relationship. Finally, modules were associated with clinical characteristics.
Visual Analysis of Gene Expression Networks DAVID (15) was used for function and pathway enrichment analysis of differential genes. Metascape (16) was again used for DEG enrichment analysis. STRING (17) was used to construct the protein-protein interaction network. Gene Set Enrichment Analysis (18) (GSEA) is an analysis method for genome-wide expression profile microarray data that compares genes with predefined gene sets. That is, based on the existing information base of gene location, properties, functions, biological significance, etc., a molecular tag database is constructed. In this database, known genes are classified according to chromosomal positions, established gene sets, patterns, and tumor-related genes. The group and GO gene set is used to group and classify multiple functional gene sets. By analyzing the gene expression profile data, we can understand their expression status in a specific functional gene set, and whether this expression status has some statistical significance. After analyzing all sequenced genes of gastric cancer tissue and normal gastric tissue by GSEA software and importing gene annotation files, the software analyzed the expression networks between genes. We therefore fully understand the effect of the richness of the feature set on the biological function of genes.

RT-qPCR
CDH3, MMP7, and LEF1-specific primers of human and mouse were designed ( Table 1). Gastric cancer and normal gastric tissue samples were used to extract total RNA. Then, mDNA was used as the template and cDNA was transcribed with random primers (HiScript III 1st Strand cDNA Synthesis Kit, Beijing, China). Then a qPCR fluorescence kit (Taq Pro Universal SYBR qPCR Master Mix, Beijing, China) was used to quantitatively analyze the expression of the target gene. 2-DDCt was expressed as a fold change in gene expression relative to the experimental group compared to the control. GAPDH was used as an internal reference.

Animal Models
Ten BALB/c-nu-nu nude mice were used to construct a gastric cancer mouse model. MGC-803 cells were injected into the gastric serosal layer of nude mice to establish the model of gastric tumor in situ. The cultured cells were resuspended to the concentration of 1.0×10^8/ml with a 1:1 mixture of matrix adhesive.

Immunohistochemistry
The expression of MMP7, CDH3, and LEF1 in gastric tissues was analyzed by immunohistochemistry. After routine sectioning, the slides were dewaxed, dehydrated by gradient alcohol, blocked and inactivated by endogenous peroxidase, repaired by an antigen, and blocked by goat serum. Primary antibody was added and the mixture was incubated at 4°C. Labeled secondary antibody was added and the mixture was incubated at 37°C. Horseradish peroxidase labeling solution was added, the mixture was stained with DAB/H2O2, counterstained with hematoxylin, conventionally dehydrated, made transparent, and observed with a microscope after mounting.

Neural Network Model
We randomly divided the data into a training set and validation set, with a ratio of 1:7. We set 35 samples as the training set and 5 samples as the validation set. We used Matlab (version 10) for machine learning, trained 2678 steps, and established a predictive model when the true value was close to the predicted value. The output variable was the relative expression of MMP7. The training error was 0.033031, and the R value was 0.9624.

Statistical Analysis
The unpaired Student's t-test was used to compare the two sets of data to determine statistical significance. Pearson's test was used to compare the correlation between gene expression and gastric cancer. Receiver operator characteristic (ROC) curve analysis was performed to determine the usefulness of MMP7, CDH3, and LEF1 in predicting gastric cancer. The high-risk early warning range of gastric cancer was analyzed by the cubic spline interpolation algorithm. The Pearson-rho test was used to calculate the expression of hub genes and status of GEA for the correlation analysis.

Checking Data Quality
Pearson's correlation test and the analysis of principal component analysis (PCA) were used to verify the distribution of data within a group. Based on Pearson's correlation test, the data within the group were well distributed in the gastric cancer

Primer
Sequence (5′-3′) group and the control group in the GSE54129 dataset ( Figure 1A). Based on PCA analysis, the repeatability of the data in the GSE54129 group also met the analysis requirements ( Figure 1B). According to Pearson's correlation test, the strong correlation between the GSE112369 in each group was shown ( Figure 1C). PCA showed that the GSE112369 data set was well distributed ( Figure 1D).

The Identification of DEGs
GSE118916, GSE54129, GSE103236, and GSE112369 datasets were used to draw a volcano map (Figures 2A-D). Circos analysis was used to find that GSE54129 and GSE112369 had an overlap of DEGs ( Figure 3A). Concurrently, there was also an overlap in the function of genes ( Figure 3B). The threshold was

Construction of Co-Expression Modules by Weighted Gene Co-Expression Network Analysis (WGCNA)
The GSE54129 datasets were used for WCGNA network analysis. We set the power value as 12 ( Figure 4A). To generate important functional modules, the parameter was set as 0.2 ( Figure 4B). Different modules were found to have cooperative or antagonistic relationships ( Figure 4C). The redder the module, the more likely it is to develop gastric cancer, and the bluer the module, the less likely it is to develop gastric cancer ( Figures 4C, D).

Functional Annotation
GSE54129 datasets were used for GO and KEGG analyses. Biological processes (BP) of GO analysis showed that there were variations including angiogenesis, oncostatin-M-mediated signaling pathway, and so on ( Figure 5A). The analysis of cell components (CC) showed major variations including the extracellular space, apical plasma membrane, and the cell surface ( Figure 5B). The analysis of molecular function (MF) variations included oncostatin-M receptor activity, heparin binding, alcohol dehydrogenase activity, and zinc-dependent ( Figure 5C). In the analysis of KEGG, genes were found to be mainly enriched in glycolysis/ gluconeogenesis, metabolism of xenobiotics by cytochrome P450, and cytokine-cytokine receptor interaction ( Figure 5D). The results of GO enrichment are shown in Figure 5E. GSEA was used to detect the functional enrichment and pathway analysis of DEGs ( Figures 6A-F). The analysis of GO of gastric cancer (111 samples) showed that 2992/4374 genes were upregulated. When the threshold was set as p value < 0.05, 326 genes were identified. When the threshold was set as p value <0.01, 31 genes were identified. In addition, the analysis of GO of normal tissue (21 samples) showed that 1382/4374 gene sets were downregulated. When the threshold was set as p value < 0.05, 32 genes were identified, and when the threshold was set as p value < 0.01, 4 genes were identified. Table 2 lists the upregulated and downregulated genes. GSEA also revealed upregulated gene sets in gastric cancer, of which 99/176 genes were upregulated in gastric cancer compared with control. When the threshold was set as p value <0.05, the result showed that nine genes were enriched. When the threshold was set as p value <0.01, one gene set was enriched significantly in control. In total, 77/176 genes were downregulated in gastric cancer, and 2 gene sets were enriched when p value <0.05. The nine important gene sets correlated with gastric cancer are displayed in Table 3 according to NES, including "KEGG_FOCAL_ADHESION", "KEGG _ECM_RECEPTOR _INTERACTION" " K E G G _ V E G F _ S I G N A L I N G _ P A T H W A Y " , and "KEGG_HYPERTROPHIC_CARDIOMYOPATHY_HCM", and so on.
The GO enriched terms of genes functions are shown in Figure 6G. To further explore the relationships between the terms, we selected items with the best p value from 20 clusters, each cluster did not exceed 15 items, and a total of not more than 250 items. We used Metascape for visual analysis, where each node represented a term, and was colored first by its cluster ID ( Figure 7A). Then, these terms were analyzed by their p values ( Figure 7B).

The Analysis of Hub Genes
A PPI network was constructed for the common differentially expressed genes in the four datasets ( Figure 8A) and major nodes were analyzed ( Figure 7C). The hub genes were obtained by Cytoscape analysis ( Figure 8B). MCODE analysis was used to identify the most important modules ( Figure 8C). The threshold was set as degree of ≥10 to obtain 10 hub genes. These genes may serve as important biomarkers and need to be further validated.

Pathological Analysis and Survival Curve of Hub Genes
We conducted a literature survey on the hub genes, and performed survival analysis for gastric cancer biomarkers with little or no research in the field. We found that aldo-keto reductase family 1 member C1 (AKR1C1), CDH3, LEF1, SLIT3, MMP7, and 15-hydroxyprostaglandin dehydrogenase (HPGD) genes were significantly correlated with prognosis ( Figure 9). Subsequently, we conducted a pathological staging study on hub genes ( Figure 10), and found that HPGD had a significant difference in the progression of gastric cancer (stage3-4, P<0.05), suggesting that it could be a new biomarker for the diagnosis and prevention of gastric cancer metastasis.

Verification of the Expression of MMP7, CDH3, and LEF1
Based on the above bioinformatics analysis, three genes (MMP7, CDH3, LEF1) were markedly upregulated in gastric cancer samples and there was a good interaction among the three genes. The results of western blotting showed that the relative expression level of CDH3, LEF1, and MMP7 was significantly higher in gastric samples, compared with the normal groups (p<0.05) ( Figure 11A). And MMP7, CDH3, and LEF1 levels were verified from the relative expression level by RT-qPCR ( Figures 11B-D). The result demonstrated that MMP7, CDH3, and LEF1 might be identified as biomarkers for gastric cancer. The animal experiment also showed that these three genes were significantly increased in the gastric cancer group by RT-qPCR (Supplementary Figure 1).

Strong Associations Between the Relative Expression of the Three Gene and Gastric Cancer
ROC curves were constructed to determine the effect of the three genes' expression on gastric cancer, and the AUC was used to determine the degree of confidence: CDH3 (AUC = 0.800, P<0.05, 95% CI =0.857-0.895), LEF1 (AUC=0.620, P<0.05, 95% CI=0.632-0.714), and MMP7 (AUC=0.914, P<0.05, 95% CI=0.714-0.947) ( Figure 12A). Immunohistochemical results showed that the expression level of MMP7, CDH3, and LEF1 in gastric cancer tissues was higher than in the control group ( Figure 12B). A western blotting experiment was carried out, and the results showed that the relative expression levels of MMP7, CDH3, and LEF1 in gastric cancer tissues were increased compared with the control group ( Figures 13A-C). These results suggest that MMP7, CDH3, and LEF1 may be biomarkers for gastric cancer.
In order to further explore the correlation between hub genes and CEA which indicates the severity of gastric cancer, we calculated the linear correlation between the hub genes and CEA. CEA was positively associated with the relative expression of MMP7 (Pearson Rho=0.801, P<0.001) ( Figure 13D). CEA was positively associated with the relative expression of CDH3 (Pearson Rho=0.883, P<0.001) ( Figure 13E). CEA was positively associated with the relative expression of LEF1 (Pearson Rho=0.753, P<0.001) ( Figure 13F).

Association Between Three Genes and Gastric Cancer by Pearson's Correlation Test and Univariate Linear Regression
Pearson's correlation coefficient displayed that gastric cancer outcome was significantly correlated with the expression of MMP7, CDH3, and LEF1 (p<0.05). Gastric cancer remained related to MMP7, CDH3, and LEF1 (p<0.05) in the univariate linear regression model ( Table 4).

The Neural Network Prediction Model and High-Risk Warning Range of Gastric Cancer
The neural network model was trained with 35 gastric samples as the training set, and 5 gastric samples as the validation set. After training, the neural network prediction model achieved the best results. The best training performance was 0.033031 at epoch 2678 ( Figure 14A). The difference between the predicted value and the true value was very small (Figure 14B), and the residual plot showed the same result ( Figure 14C). The predicted value was fitted to the true value, and the correlation coefficient was 0.9624 ( Figure 14D). In summary, the expression of the three genes may be joint predictors of gastric cancer.
Through the cubic spline interpolation algorithm, we found that the high-risk warning indicator of gastric cancer was 8<CDH3<15 and 10<expression of LEF1<16 ( Figure 14E). Therefore, the 3D stereogram can better represent the early warning range ( Figure 14F).

DISCUSSION
Gastric cancer is one of the diseases with the highest mortality rate and has a high incidence in China. Recently, although many comprehensive treatments have been used to improve the efficacy, the 5-year survival rate of gastric cancer is still only 20%-30% (19). Actively exploring relevant biomarkers, early diagnosis, reasonable assessment of their prognosis, and timely intervention are of great significance for clinical treatment. With the advancement of molecular biology research methods, new types of prognostic biomarkers have emerged in an endless stream, making prognostic evaluation more objective (20).
In previous studies, CD44v9 was found to be highly expressed in mouse gastric cancer proliferating cells, and CD44v9 positive immune expression can be considered as a prognostic indicator of early gastric cancer, but not as a prognostic indicator of advanced gastric cancer (21). Therefore, CD44v9 can be considered as a prognostic biomarker for early gastric cancer (22). Epidermal growth factor receptor 1 (HER1) is the growth factor of epidermal factor receptor (EGFR) gene coding, and is one of the four members of the human epidermal growth factor receptor family in the receptor tyrosine kinase superfamily (23). It works by binding specific ligands including epidermal growth factor and transforming growth factor-a which are then are activated. EGFR over expression is not only a prognostic indicator in gastric cancer, but also can be used as a basis for personalized treatment. HER1 can also be used as a therapeutic target for gastric cancer (24). HER2 is encoded by ERBB2 and is a member of the HER family. Unlike other members of the HER family, HER2 does not contain sites and signals that bind to other heterodimer ligands (25). In gastric cancer, 4%-7% of tumors were found to have ERBB2 amplification or HER2 overexpression. Studies have shown that the expansion of ERBB2 is associated with a poor prognosis of gastric cancer. Therefore, the identification and research of potential biomarkers are crucial for early diagnosis and prognosis (26).
By analyzing four microarray datasets in the current study, we found the differences between gastric cancer and normal tissues adjacent to the cancer. The four datasets contained a total of 70 DEGs, and their interactions were explored through KEGG and GO analysis. DEGs are mainly concentrated in angiogenesis, the oncostatin-M-mediated signaling pathway, cytokine-cytokine receptor interaction, and glycolysis/gluconeogenesis. Studies have reported that angiogenesis and glycolysis have an important influence on the occurrence and progression of gastric cancer (27,28). In addition, recent studies have found that cytokine-cytokine receptor interactions play a significant part in the development of gastric cancer (29). The expression of  nerve growth factor receptor p75 (p75NTR) in gastric cancer cells is significantly lower than in adjacent tissues, suggesting that p75NTR may play a significant role in gastric cancer metastasis (30,31). The results of this paper are consistent with the above studies.
By analyzing public and private datasets, we found that MMP7 expression is highly expressed in gastric tissues. Matrix metalloproteinase is a protease secreted by endothelial cells, fibroblasts, and smooth muscle cells, which can degrade all extracellular matrix proteins (32). MMP-7 degrades IGFBP-3 by interacting with insulin-like growth factor binding protein 3, so that insulin-like growth factor 1 exerts an anti-apoptosis effect and promotes cell mitosis (33). MMP-7 can also degrade the Fas ligand on the cell membrane, inhibit Fas-mediated apoptosis, and promote tumor growth. The encoded preproprotein is proteolytically processed to generate the mature protease (34). The gene is highly expressed in digestive, urinary, breast, and lung cancer tissues (35,36).
In a case-control study, Fu et al. investigated whether the MMP7 promoter (A-181G and C-153T) polymorphism genotype was a risk factor for gastric cancer in Taiwan. The GG genotype of MMP7 A-181G was identified as a risk factor for gastric cancer (37).
Li et al. found that MMP7 induced T-DM1 resistance and resulted in a poor prognosis of gastric adenocarcinoma in a DKK1-dependent manner. Exogenous overexpression of MMP7 promoted T-DM1 resistance and tumor growth, while MMP7 knockdown was associated with the opposite phenotype. Moreover, DKK1 knockout can lead to decreased expression of MMP7 and resistance to T-DM1 (38). All these results indicate that MMP7 is a very important biomarker for gastric cancer. However, since gastric cancer is not caused by a single gene, it often leads to multi-gene changes. Looking for a promising combination of multiple genes to predict gastric cancer will be more conducive to the early diagnosis of gastric cancer.  CDH3 can bind to cells and the extracellular matrix (39). CDH3 is highly expressed in cancer tissues such as colorectal cancer, thyroid cancer, and pancreatic cancer, but the study of CDH3 in gastric cancer is not completely clear. Hibi K et al. found that CDH3 gene demethylation occurred in 69% of GC patients, and CDH3 demethylation was significantly associated with increased TNM staging (39). CDH3 plays a role in cell-cell adhesion in epithelial tissue and plays a key role in maintaining tissue integrity and morphology. Alterations of CDH3 can lead to tissue disruption, cell dedifferentiation, increased tumor cell aggressiveness, and ultimately metastasis. Our research found that CDH3 is highly expressed in gastric cancer tissues, and it is included in the neural network model for prediction (40). In the future, it may be used as a new biomarker for the early diagnosis of gastric cancer or the prognostic judgment of its progression. LEF1, located on the q23-q25 region of human chromosome 4, is the downstream nuclear transcription factor of the Wnt/b-catenin signaling pathway (41). LEF1 binds to b-catenin to regulate the expression of downstream molecules, thereby regulating the signal pathway (42). LEF1 is highly expressed in acute myelogenous leukemia, prostate cancer, small lymphocytic lymphoma, and other cancers (43,44). microRNA-6852 has been shown to inhibit the progression of stomach and colorectal cancer. Wang et al. found that the expression of LEF1 was negatively correlated with the expression of miR-6852. miR-6852 inhibited proliferation, migration, and invasion of glioma cells by inhibiting LEF1 (45). In our study, LEF1 was found to predict gastric cancer jointly with CDH3 and MMP7, which may be used as a diagnostic marker for gastric cancer in the future. In view of the complexity of the pathogenesis factors of gastric cancer, we should focus on multiple targets to accurately judge the diagnosis and prognosis of the disease, which is better than single gene predictions.

Limitations
Despite the rigorous bioinformatics analysis performed in this study, some limitation are still present. The small sample size of our study may cause some bias in the results.

Future Directions
In subsequent studies, we will conduct MMP7, CDH3, and LEF1 validation and molecular mechanism studies at the animal level. Multi-center randomized controlled clinical trials should be conducted and more subjects recruited to verify the role of the levels of the three genes expressed in the progression of gastric cancer.
The bioinformatics analysis was used to predict the function and expression of 10 hub genes, providing guidance for subsequent research and exploration. This study shows that MMP7, CDH3, and LEF1 are highly expressed in gastric cancer tissues. Also, we found a correlation between the three by constructing a neural network model, and the disease status could be judged through the high-risk early warning range.

CONCLUSION
The bioinformatics analysis was used to predict the function and expression of 10 hub genes, providing a possible mechanism for subsequent research and exploration. This study showed that MMP7, CDH3, and LEF1 are highly expressed in gastric cancer tissues. We selected CDH3, LEF1, and MMP7 as candidate biomarkers to construct a back propagation neural network model from hub genes, which may be helpful for the early diagnosis of cancer through the high-risk early warning range.   The high-risk warning range of gastric cancer at the level of the three-dimensional stereogram. The color represents MMP7 expression: "yellow" represents "high" and "blue" represents "low".

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/ Supplementary Material.

ETHICS STATEMENT
All experiments were approved by the Ethics Committee of the Fourth Hospital of Hebei Medical University. And written informed consent was obtained from all the patients.

AUTHOR CONTRIBUTIONS
The bioinformatics data were analyzed by M-jS. M-jS was also a major contributor in writing the manuscript. L-bM and Y-mZ proofread and submitted the article. L-bM, DK, Y-bL and PG designed the draft of the research process and offered funding for the experiment. All authors contributed to the article and approved the submitted version.

FUNDING
The present study was supported by the Youth Science and Technology Project of Health and the Health Commission of Hebei Province (Hebei, China; grant nos. 20170732, 20180558).