Construction and Analysis of the Tumor-Specific mRNA–miRNA–lncRNA Network in Gastric Cancer

Weighted correlation network analysis (WGCNA) is a statistical method that has been widely used in recent years to explore gene co-expression modules. Competing endogenous RNA (ceRNA) is commonly involved in the cancer gene expression regulation mechanism. Some ceRNA networks are recognized in gastric cancer; however, the prognosis-associated ceRNA network has not been fully identified using WGCNA. We performed WGCNA using datasets from The Cancer Genome Atlas (TCGA) and the Genotype-Tissue Expression (GTEx) to identify cancer-associated modules. The criteria of differentially expressed RNAs between normal stomach samples and gastric cancer samples were set at the false discovery rate (FDR) < 0.01 and |fold change (FC)| > 1.3. The ceRNA relationships obtained from the RNAinter database were examined by both the Pearson correlation test and hypergeometric test to confirm the mRNA–lncRNA regulation. Overlapped genes were recognized at the intersections of genes predicted by ceRNA relationships, differentially expressed genes, and genes in cancer-specific modules. These were then used for univariate and multivariate Cox analyses to construct a risk score model. The ceRNA network was constructed based on the genes in this model. WGCNA-uncovered genes in the green and turquoise modules are those most associated with gastric cancer. Eighty differentially expressed genes were observed to have potential prognostic value, which led to the identification of 12 prognosis-related mRNAs (KIF15, FEN1, ZFP69B, SP6, SPARC, TTF2, MSI2, KYNU, ACLY, KIF21B, SLC12A7, and ZNF823) to construct a risk score model. The risk genes were validated using the GSE62254 and GSE84433 datasets, with 0.82 as the universal cutoff value. 12 genes, 12 lncRNAs, and 35 miRNAs were used to build a ceRNA network with 86 dysregulated lncRNA–mRNA ceRNA pairs. Finally, we developed a 12-gene signature from both prognosis-related and tumor-specific genes, and then constructed a ceRNA network in gastric cancer. Our findings may provide novel insights into the treatment of gastric cancer.


INTRODUCTION
Gastric cancer is a major cause of cancer-related mortality worldwide (Van Cutsem et al., 2016). It is a serious form of cancer characterized by limited chemotherapy regimens and complex patterns of tumorigenesis and progression in different subtypes (Erdem et al., 2018;Kubota et al., 2020). There have been exceptional advancements in the interpretation of the molecular pattern of gastric cancer through research projects including the Cancer Genome Atlas (TCGA) (Cancer Genome Atlas Research N, 2014) and the Asian Cancer Research Group (ACRG) (Cristescu et al., 2015) in recent years; however, current classifications are not sufficient to describe the vast differences in prognoses and summarize overall genomic characteristics, even for patients who are recognized as belonging to the same molecular subtypes.
Integrated analysis of transcriptomes is believed to provide peculiar insights into diseases; in this respect, weighted gene coexpression network analysis (WGCNA) may be the most popular approach for detecting co-expressed RNAs from RNA-seq data to microarray data (van Dam et al., 2018;Kakati et al., 2019). WGCNA can identify select groups of significant genes with similar biological functions and with strong correlations to specific traits. Many recent surveys have used WGCNA for both non-neoplastic and neoplastic diseases, including gastric cancer. Salmena et al. (2011) speculated that the expression of many RNA transcripts is regulated by competing endogenous RNAs (ceRNAs) competing for the same sequences in miRNAs. They established the groundwork for a significant discovery of a communication network between coding and non-coding RNAs. Their theory has been supported by studies on the pathological processes of many malignancies, including breast, colon, and gastric cancers (Qi et al., 2015;Shuwen et al., 2018;Abdollahzadeh et al., 2019).
Thus, finding key genes that will serve as drug targets is crucial to the treatment of gastric cancer. In this study, we used WGCNA to construct a solid cancer-associated ceRNA network in gastric cancer for the first time. We hypothesized that identifying gene co-expression patterns would provide additional insight into disease-associated biological pathways. Finally, we explored a lncRNA-miRNA-mRNA network based on the survival-related hallmark genes of gastric cancer, providing candidate targets for its management and surveillance.

RNA-Sequencing and Microarray Data Collection
RNA-seq data were obtained from TCGA repository (https:// portal.gdc.cancer.gov/) and the Genotype-Tissue Expression (GTEx) portal (https://www.gtexportal.org/) and sequenced on the Illumina HiSeq 2000 RNA Sequencing platform. TCGA offers a comprehensive database of cancer genomic profiles of specific cancer types. GTEx is another project that recruits postmortem donors without diseases, which has made genetic traits of healthy people open to the public. We performed a combined analysis of the stomach data from these two projects in the present study. The transcript per million (TPM) expression values for 625 stomach samples and the RNA-seq by Expectation Maximization (RSEM) expected counts for 624 stomach samples were preserved for further analysis (raw counts data of a patient did not exist).
The microarray data were downloaded from the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/). The datasets were obtained from human gene expression microarray profiles of gastric cancer using fresh-frozen specimens. Finally, we included GSE62254 and GSE84433 for further analysis, which are the two largest datasets of microarraybased gene expression profiling. GSE62254 was profiled on the Affymetrix Human Genome U133 Plus 2.0 Array platform (Affymetrix, Inc., Santa Clara, CA, USA), including 300 gastric cancer tumor samples and 100 normal samples, which is the largest number of normal gastric samples among datasets (Cristescu et al., 2015).
We downloaded raw data (CEL files) generated by the Affymetrix platform. The R package oligo was utilized for format conversion, missing data filling, background correction, and data normalization (Parrish and Spencer , 2004). GSE84433 was profiled on the Illumina HumanHT-12 V3.0 Expression Beadchip (Illumina, Inc., San Diego, CA, USA); it includes 357 gastric cancer samples, and thus has the highest number of gastric cancer patients with survival information (Cheong et al., 2018). We downloaded data in the format of raw counts, performed quantile normalization, followed by log transformation.
Annotation of the above datasets was performed according to the different platforms using the official R package downloaded from Bioconductor (http://www.bioconductor.org/packages/). When the same RNA name appeared, the probe with the highest signal value was stored. To facilitate the analysis, only the overlapped RNAs qualified for survival-related ceRNA network construction. To keep the data updated, clinical and survival information of patients were obtained from the websites on March 5th, 2020. All original data were retrieved from the open database; thus, the documents of medical ethics were exempted as all had been approved when first published.

Differentially Expressed RNAs
The Limma package was used for screening differentially expressed RNAs (Ritchie et al., 2015). The RSEM expected counts retrieved from the GTEx database and TCGA database were utilized to discern differentially expressed RNAs between gastric cancer and control groups, which comprised normal stomach tissue both from dead non-cancerous subjects and adjacent stomach tissue in gastric cancer patients. For microarray-based profiling, expression values retrieved from GSE62254 were utilized to distinguish differentially expressed RNAs between 300 gastric cancer samples and 100 normal tissue samples. The R package Limma was used to process the data with the standard of false discovery rate (FDR) < 0.01, and |fold change (FC)| > 1.3. After filtering out RNAs with low expression values, the overlapping, differentially expressed RNAs were considered suitable for further analysis.

WGCNA
WGCNA is a bioinformatics method for dealing with highthroughput gene expression data, which can be used for the construction of a co-expression network (Langfelder and Horvath, 2008). The expression values of genes were preprocessed in the form of log2 (TPM + 0.001). The genes were then chosen in order of descending variance of their expression in the datasets. Finally, 13000 genes were put through WGCNA. Pairwise Pearson coefficients were used to assess the weighted co-expression relationships between all genes to produce an adjacency matrix. The least value for which the scale-free topology fit R^2 index > 0.75 was chosen as the softthreshold power. Pearson coefficients were produced for all paired genes; thus, the co-expression matrix was rendered into an adjacency matrix using soft-threshold power. The softthreshold power was selected according to the standard scalefree distribution. Scale-free co-expression networks were created with 30 RNAs as the minimal module size and 0.25 as the dendrogram cut height for module merging. The soft threshold was used to ensure a scale-free network. Genes with high correlations were clustered into the same module after forming a co-expression network.

Gene Function Analysis
Upregulated and downregulated differentially expressed genes were put into Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG), respectively. For the gene set enrichment analysis (GSEA) and KEGG-GSEA, all genes were incorporated into the analysis. A pathway (term) with an adjusted p-value < 0.05 was considered a functional enriched pathway (term) using the R package ClusterProfiler (Yu et al., 2012).

ceRNA Network
RNAinter was employed to observe the relationships among mRNA, lncRNA, and miRNA. RNAinter contains 24 databases demonstrated by experiment and 14 databases forecasted by calculation, including miRTarBase and starBase. We used lncRNA-miRNA relationships and mRNA-miRNA relationships with the least confidence score limit of 0.5 for further analysis (Lin et al., 2020). To ensure mRNA-lncRNA competing relationship pairs that have the shared miRNAs in gastric cancer, the Pearson correlation test and the hypergeometric test were utilized using the expression profiles. Quantile normalization and correction of batch effects between TCGA dataset and the GTEx dataset were performed for the expression profiles before further analysis (Leek et al., 2012). If the P-values of both tests were less than 0.05, the mRNA-lncRNA competing interaction pairs were saved to create the ceRNA network (Paci et al., 2014). The co-expression network with the overlapped lncRNA-miRNA-mRNA relationship was visualized with Cytoscape software (Su et al., 2014).

Construction and Validation of the Survival Model
Training datasets for the prognostic score system were performed according to the sequencing data from TCGA. Validation of the prognostic score system was performed on the two microarray datasets (GSE62254 and GSE84433). Patients who died within 30 postoperative days and without 30-day follow-up were excluded. Expression profiles of eligible gastric cancer patients were normalized and non-overlapping genes were removed from the analysis. Before survival analysis, the gene expression values in the datasets were processed with a standardization with 0 mean value and standard deviation of 1. Based on the mRNAs obtained in the clinical cancer-associated modules, univariate Cox regression analysis was performed to identify prognosis-related mRNAs; then, all the survival-related genes were used to perform multivariate Cox regression. The backward selection method was used to select the most suitable survival gene group to construct a risk score system (Donithan et al., 1992). The samples in the datasets were divided into high-risk and low-risk groups according to the universal cutoff value of their risk scores. Kaplan-Meier survival curves were used to evaluate correlations between the overall survival of the two groups in datasets using the survival package (Therneau, 2020).

Differential Gene Expression Analysis
Sequencing data (TCGA + GTEx) and microarray data (GSE62254) were downloaded and processed as previously described. A total of 13007 genes and 292 lncRNAs were identified after screening out less-expressed ones. Thirty pathways were generated after the genes were applied to the KEGG-GSEA ( Figure 1A). A total of 3895 differentially expressed genes and 79 lncRNAs were determined by comparing each gastric cancer group to both control groups. In total, 2347 upregulated and 1548 downregulated genes were subjected to GO term and KEGG pathway enrichment analyses. In the GO analysis for upregulated genes, neutrophil activation, neutrophil mediated immunity, and T cell activation were rendered as the top three terms of biological processes (BP); the chromosomal region, condensed chromosome, and centromeric region on chromosome were rendered as the top three terms of cellular components (CC); cell adhesion molecule binding, cadherin binding, and DNA helicase activity were rendered as the top three terms of molecular functions (MF) ( Figure 1B). In the GO analysis for downregulated genes, regulation of neuron projection development, axonogenesis, and regulation of cell morphogenesis were rendered as the top three terms of the BP; the collagen-containing extracellular matrix, nuclear speck, and axon part were rendered as the top three terms of CC; actin binding, coenzyme binding, and extracellular matrix structural constituent were rendered as the top three terms of the MF ( Figure 1C). In the KEGG analysis, human papillomavirus infection, human T−cell leukemia virus 1 infection, and Epstein−Barr virus infection were rendered as the top three enriched pathways for upregulated genes ( Figure 1D); herpes simplex virus 1 infection, MAPK signaling, and oxytocin signaling were rendered as the top three pathways enriched for downregulated genes ( Figure 1E).

WGCNA
WGCNA was performed to ascertain the most strongly cancerassociated genes. When the soft-power b was set to 4, the scalefree topology fit index was over 0.75 ( Figure 2A). The created network included nine modules ( Figure 2B). Figure 2C shows that the green module was recognized as the most specific module with a coefficient of correlation of 0.89 (p = 2 × 10 −217 ); the turquoise module came in second with a coefficient of correlation of 0.61 (p = 8 × 10 −65 ). Genes in both modules showed a high correlation with each other according to the heatmap of the topological overlap plot ( Figure 2D).

Functional Analysis of the Green and Turquoise Modules
A total of 805 genes in the green module and 5213 genes in the turquoise module were subjected to KEGG-GSEA, generating 12 pathways ( Figure 3A). The GO and KEGG enrichment results of 3312 upregulated differentially expressed genes in both modules are plotted in Figures 3B, C, in which enrichment results of the downregulated differentially expressed genes were not found. In total, 817 upregulated and eight downregulated differentially expressed genes were found to be qualified, with appropriate ceRNA relationships in the stomach, and appeared in the two cancer-associated modules ( Figure 3D).

Survival Model Construction and Validation
The clinical and pathological data from the construction and validation cohorts are shown in Table 1. We used Cox univariate regression analysis for the 825 genes for identifying survival-related genes in 332 TCGA gastric cancer samples. The univariate analysis screened 83 predictors based on prognosis. Because three genes did not appear in GSE84433, 80 genes were included in the multivariate analysis, which further led to the identification of 12 upregulated mRNAs for constructing a risk score model, containing KIF15, FEN1, ZFP69B, SP6, SPARC, TTF2, MSI2, KYNU, ACLY, KIF21B, SLC12A7, and ZNF823 ( Table 2). The risk scores for individual samples was calculated using the formula: (HR = 0.801) were less than 1. We further discovered that the best cutoff risk score to differentiate low-risk from high-risk groups was 0.82. The risk score for each individual were calculated and was categorized into two groups according to the cutoff value. Kaplan-Meier analysis showed that there was a significant difference between high-risk (n=51) and low-risk patients (n=281) in the training dataset (log-rank test, p < 0.0001, Figure 5A). Risk stratification, survival information, and expression values of 12 genes of 332 patients were shown in the risk score panel. We observed that both the survival information and the 12-gene expression of patients in the highrisk group varied from those in the low-risk group ( Figure 5D). Using GSE62254 and GSE84433 as validation datasets, the survival curves (Figures 5B, C) and the risk score panels ( Figures 5E, F) between the high-risk groups and the low-risk groups were distinctively different. The results in the validation datasets were similar to those in the training dataset, therefore robustness of the 12-gene signature risk score system in predicting sample risk was supported.

DISCUSSION
Currently, although there have been some investigations concentrating on non-coding RNA-involving gene expression regulation networks in gastric cancer, our research is the first to employ WGCNA to produce a co-expression network of lncRNAs-miRNAs-mRNAs in gastric cancer. Compared with the previous risk score-based system in gastric cancer (Cho et al., 2011;Cheong et al., 2018;Duan et al., 2019;Hu et al., 2019), the advantages of our ceRNA network are that it provided 12 both cancer-associated and prognosis-related genes and was constructed using rigorous calculation of the ceRNA regulation relationships.
KIF15 is a gene involved in immune diseases and cancer progression. Whether it is mutated is related to the sensitivity of immune checkpoint inhibitors. Gyorffy et al. reported that in both wild-type PIK3CA patient groups, individuals with KIF15 mutations displayed substantially increased expression of PD-L1, whereas those without mutations displayed decreased expression (Menyhart et al., 2018). Zhang et al. found that knockdown of KIF15 resulted in mitochondrial damage and ROS-JNK-p53 axis activation, thus promoting apoptosis and inhibiting cell proliferation in gastric cancer cells (Tao et al., 2020). The DNA replication and repair pathway is an important mechanism in gastric cancer. FEN1 plays an important role in apoptotic fragmentation of DNA, maintenance of telomere stability, and rescue of stalled replication forks (Shen et al., 2005;Zheng et al., 2011). Unsurprisingly, genetic variants and changes in the expression of FEN1 will alter patients' sensitivity to chemotherapy and prognosis (Liu et al., 2012;Wang et al., 2014;Xie et al., 2016). The role of SPARC in gastric cancer has not been fully elucidated, although its diagnostic and prognostic value has been confirmed by multiple studies (Zhao et al., 2010;Liao et al., 2018). SPARC can act as both an inhibitor and promoter in cancer (Tai and Tang, 2008;Said, 2016;Li et al., 2019); the real effect of SPARC can be altered by other genes owing to its epistatic effects . However, some oncologists have shown that SPARC is mostly produced by gastric cancer-associated fibroblasts rather than gastric cancer  cells (Ma et al., 2018;Ma et al., 2019). Thus, the concrete mechanism of SPARC in gastric cancer needs further investigation. MSI2 is an oncogene associated with differentiation, resulting in the preservation of cancer stem cells. According to a previous study based on microarray and RT-PCR, MSI2 expression increased slightly relative to normal tissue, but it is still used as a biomarker for gastric cancer (Emadi-Baygi et al., 2013;Yang et al., 2019). SLC12A7 (Solute Carrier Family 12 Member 7) acts as a potassium/chloride cotransporter for maintaining a stable osmotic pressure, which can be activated by insulin-like growth factor (IGF) resulting in cell invasion and progression in breast cancer, adrenocortical cancer, cervical cancer, and ovarian cancer (Shen et al., 2004;Hsu et al., 2007;Chen et al., 2009;Brown et al., 2019). It has been reported that amplification of SLC12A7 is mainly within HER2− patient samples in gastric cancer, while the precise mechanism of SLC12A7 is still unknown (Zhou et al., 2018). ACLY is the integral enzyme responsible for the formation of cytosolic acetyl-CoA, and high expression of ACLY was shown to be associated with a poor prognosis (Qian et al., 2015). Citrate and inhibitors of ACLY can reduce its expression, thus protecting against gastric cancer cell progression (Guo et al., 2016;Icard et al., 2020). In addition, ACLY can be downregulated by miR-133b via PPARg (Cheng et al., 2017) and lncRNA FLJ22763  in gastric cancer. The functions of ZNF823, ZFP69B, SP6, KYNU, KIF21B, and TTF2 have not been reported in gastric cancer. KYNU is a pyridoxal-5´-phosphate dependent enzyme that catalyzes the cleavage of kynurenine into anthranilic acid (AA) and the c l e a v a g e o f 3 -h y d r o x y k y n u r e n i n e ( 3 -H K ) i n t o 3hydroxyanthranilic acids (3-HAA) (Badawy, 2017). Drugs have been developed to regulate the expression of KYNU to suppress tumor growth through the Kynurenine pathways, such as in breast cancer Liu et al., 2020) and melanoma (Rad et al., 2019). It has been demonstrated that TTF2 is able to terminate RNA polymerase II transcription, which has an important function in promoting chromosome segregation and altering protein-DNA interactions Cheng et al., 2012). KIF21B is an ATP-dependent microtubule-based motor protein that participates in the intracellular transfer of membranous organelles. KIF21B is a potential oncogene that resists the induction of apoptosis and facilitates malignant tumorigenesis, tumor development, intrusion, and metastasis. Patients with high expression of KIF21B have been demonstrated to have a poorer prognosis in hepatocellular carcinoma (Zhao et al., 2020) and non-small cell lung cancer . SP6 is an important gene that regulates odontogenesis, belonging to a family of transcription factors that contain 3 classical zinc finger DNAbinding domains (Aurrekoetxea et al., 2016;Smith et al., 2020;Nakamura et al., 2020).
In summary, we identified a survival-related gene-based ceRNA network using the WGCNA algorithm, and the constructed lncRNA-miRNA-mRNA ceRNA interactive network will probably provide a basis for additional inspection of the regulatory mechanisms of gastric cancer. Gastric cancer has high heterogeneity among its different histological and molecular subtypes. Thus, while we have selected the expression profiles with the largest sample size that we could obtain currently, we must admit that further experimental works and large cohorts are needed to verify our results and elucidate the prognostic value of ceRNA networks in gastric cancer.

AUTHOR CONTRIBUTIONS
XZ and YX contributed to study conception. XZ, XW, LZ, and HZ contributed to data collection and analysis. XZ, WL, BW, LX, and YT contributed to manuscript writing. All authors contributed to the article and approved the submitted version.