Co-expression Network Analysis Identifies Four Hub Genes Associated With Prognosis in Soft Tissue Sarcoma

Background: Soft tissue sarcomas (STS) are heterogeneous tumors derived from mesenchymal cells that differentiate into soft tissues. The prognosis of patients who present with an STS is influenced by the regulation of a complex gene network. Methods: Weighted gene co-expression network analysis (WGCNA) was performed to identify gene modules associated with STS (Samples = 156). Results: Among the 11 modules identified, the black and blue modules were highly correlated with STS. However, using preservation analysis, the black module demonstrated low preservation, therefore the blue module was chosen as the module of interest. Furthermore, a total of 20 network hub genes were identified in the blue module, 12 of which were also hub nodes in the protein-protein interaction network of the module genes. Following additional verification, 4 of 12 genes (RRM2, BUB1B, CENPF, and KIF20A) demonstrated poorer overall survival and disease-free survival rate in the test datasets. In addition, gene set enrichment analysis (GSEA) demonstrated that samples with a high level of blue module eigengene (ME) were enriched in cell cycle and metabolism associated signaling pathways. Conclusion: In summary, co-expression network analysis identified four hub genes associated with prognosis for STS, which may diminish the prognosis by influencing cell cycle and metabolism associated signaling pathways.


INTRODUCTION
Soft tissue sarcoma (STS) is a rare group of tumors that accounts for approximately 1% of adult cancers. In 2009, it was estimated that 3,300 new cases were diagnosed in Britain and 10,000 in the United States (Linch et al., 2014). There are approximately 50 STS subtypes, which differ significantly in their disease presentation, response to currently available treatments and risk of tumor progression (Casali et al., 2018). Multiple factors have been reported to be related to the progression of STS, including capillary morphogenesis gene 2 (CMG2) (Greither et al., 2017), HIF-2α protein (Nakazawa et al., 2016), epidermal growth factor receptor (EGFR) protein (Yang et al., 2017) and microRNAs (Smolle et al., 2017). However, no molecular biomarkers have been defined for predicting the prognosis of the disease in clinical. Therefore, a better understanding of the molecular pathogenesis is required.
To date, microarray-based expression data have been used to identify genes related to tumor progression and prognosis. Takahashi et al. (2014) identified 25 survival-associated genes using a knowledge-based filtering and multiple testing approach. Beck et al. (2010) has reviewed the manner in which gene expression profiling has been used to understand sarcoma pathobiology and identify clinically useful biomarkers. However, most studies have focused on screening genes that have different patterns of expression with explanations gained from gene ontology (GO) analysis. Such approaches, however, have failed to address the large number of interconnections between genes, because genes with similar expression profiles are most likely to function closely together. Therefore, weighted gene co-expression network analysis (WGCNA) clusters genes co-expressed in a network, based on similarities in expression profiles among samples and in clinical traits, to define sub-network regions (known as modules) (Langfelder and Horvath, 2008).
In this study, we utilized WGCNA to identify the most relevant module in STS. Key genes in the module were identified and validated using survival and protein-protein interaction (PPI) analyses. These key genes may shed new light on the biological mechanisms underlying STS progression and could potentially be used as prognostic biomarkers or therapeutic targets.

Study Design and Data Collection
Study design, data preparation, preprocessing, analysis and validation are described in a flowchart (Figure 1). Core codes used to reproduce the results were provided in Supplementary  Table S1. Firstly, normalized RNAseq data and associated clinical data were downloaded from the NCBI Gene Expression Omnibus (GEO). Dataset GSE21122 (Barretina et al., 2010), which was generated using an Affymetrix human genome U133A microarray (HG-U133A), was used as a training set to construct the co-expression network and identify key modules in this study. This dataset included 149 STS samples and 9 normal fat tissue samples. The STS samples contained 116 different types of liposarcoma and 34 malignant fibrous histiocytomas (MFHs). Most STSs (68.8%) were primary tumors at the time of sample procurement from patients whose mean age was 56 years. In addition, two test datasets were used to test the preservation of identified modules and survival significance of hub genes. The first one, which included RNA sequencing data and associated clinical information of 265 STS samples, were downloaded from The Cancer Genome Atlas (TCGA) database 1 . The other one, 1 https://genome-cancer.ucsc.edu/ GSE21050 dataset (Chibon et al., 2010), which included RNA sequencing data and associated clinical information of 310 STS samples were downloaded from the NCBI GEO.

Data Preprocessing
Firstly, we extracted training expression data from the GSE21122 MINiML file. The expression data was background corrected using the Robust Multi-array Average (RMA) algorithm and log base 2 normalized. The data were then checked to ascertain whether there was a batch effect. No apparent batch effect was observed after analysis of expression clusters, box plots and principal components analysis (PCA) (Supplementary Figure S1). In order to detect outliers for WGCNA analysis, sample network was calculated based on squared Euclidean distance. The connectivity of each sample was defined as the sum of the connectivity of that sample with all other samples. Outliers were identified after normalization of the connectivity of each sample, by use of the threshold z.k < 0.6. Generally, genes whose expression varies greatly are more biologically relevant. To reduce background noise, we selected genes that were varied expressed across samples and removed those whose expression was the same across samples. The median absolute deviation (MAD) was calculated for each gene as a robust measure of variability. Then, genes were sorted based on the MAD value and the top 3,000 ranked genes were used for the subsequent WGCNA analysis.

Co-expression Network Construction and Module Preservation Analysis
The WGCNA package (Langfelder and Horvath, 2008) was used to construct the co-expression network. The concordance of genes in the expression dataset was measured with Pearson correlation, then the Pearson correlation matrix was transformed to weighted network with the power adjacency function. The first step in this process was selection of an appropriate soft power, in which strong connections between genes are promoted and weak connections penalized, so as to transform the network into one meeting the requirements of a scale-free network. Modules were identified using the dynamic tree-cutting function with a deepSplit argument value of 2 and a minimum size cutoff of 30. To test whether the identified modules were stable in the test TCGA dataset, the downloaded fragments per million (FPKM) expression data of 265 samples were transformed to the transcripts per million (TPM). A total of 2704 common genes in the training and TCGA datasets were used for preservation analysis. The module Preservation function (nPermutations = 200) of the WGCNA package (Langfelder et al., 2011) was utilized, in which the preservation statistic Zsummary was used to quantify the preservation of gene modules between datasets.

Finding Modules of Interest and Functional Annotation
Because the module eigengene (ME) provides the most appropriate synopsis of gene expression profiles of any given module, we correlated MEs with clinical traits. In this study, clinical traits refer to whether the sample was a STS or normal fat tissue. Correlations were then calculated using linear regression model. The modules for which the eigengenes showed high correlation were chosen as the modules of interest. In an attempt to ascertain possible mechanisms of genes within a module affecting STS progression, functional enrichment analyses using the KEGG and GO databases of the hub module was performed with the "clusterProfile" package in R (Yu et al., 2012).

Identification of Hub Genes and Correlation Analysis
Hub genes are those that have a high degree of intra-module connectivity. In this study, hub genes were defined as the 20 module genes with highest connectivity in the interested module. A PPI network was constructed in order to identify hub nodes by uploading all genes in the hub module to the Search Tool for the Retrieval of Interacting Gene (STRING) database 2 . The PPI network was then imported into the Cytoscape software platform and a comprehensive analysis of the relationship between nodes was performed using the Maximal Clique Centrality (MCC) function, reported to be the most effective method of finding hub nodes in a co-expression network (Chin et al., 2014), within the "cytoHubba" application. In this way, the most cohesive genes were marked as "first stage nodes." In the PPI network of blue module genes, the 30 most highly ranked nodes were identified as "first stage nodes." Genes that were defined as both hub genes in the module and "first stage nodes" in the PPI network were chosen as primary hub genes.

Survival Analysis and Efficacy Evaluation
The internet tool, Gene Expression Profiling Interactive Analysis (GEPIA) 3 , was used to perform overall survival and diseasefree survival analyses for all hub genes. The platform utilizes all expression data and survival information of the TCGA database. Users are able to accomplish survival analysis by simply submitting a gene name and selecting a tumor type. Patients were divided into two groups (high vs. low) based FIGURE 3 | Color coding of co-expression network modules for mRNAs. on the hub gene expression level in comparison to the mean expression level of that hub gene. Furthermore, dataset GSE21050, which includes 310 STS samples in which metastasis status and survival time were provided, was used to test the significance of hub genes for metastasis survival. A Kaplan-Meier survival plot was constructed using the "survival" package in R (Li, 2003). Differential expression between STS and normal tissue in the training set was plotted as a box plot graph.

Gene Set Enrichment Analysis (GSEA)
In the training data set, 156 samples were dichotomized into two groups (High vs. Low) based on the ME value of blue module in comparison to the mean ME level of blue module of all samples. GSEA was then performed between the two groups. The 3,000 most variable genes from the WGCNA were imported for enrichment. In this way, GSEA was used to validate the results of GO and KEGG analysis of the blue module. The cut-off criterion for GSEA was FDR < 0.05.

Co-expression Network Construction and Module Preservation Analysis
After discarding two outlier samples (GSM528297 and GSM528333), WGCNA was performed on the 3,000 most variable genes of 156 samples. Soft threshold power was set to 6, in which R 2 was 0.916, ensured a scale-free network (Figure 2). Following this, 11 co-expression modules were identified, ranging in size from 43 to 669 genes (with each module assigned a color) (Figure 3).
By comparing the training dataset GSE21122 with the TCGA test dataset, we were able to establish whether the co-expression modules produced in the training dataset could be reproduced in the test dataset through summary preservation statistics. Three modules (black, brown, and magenta) demonstrated poor preservation with each Zsummary statistic < 10. The remaining modules, including the blue module were stable enough, suggesting they were preserved between the training data set and the test data set (Figure 4).

Finding Modules of Interest and Functional Annotation
It is important to identify the most significant modules related to STS. Both black and blue modules showed a significantly high FIGURE 4 | medianRank and Zsummary statistics of the most variant gene modules in module preservation. In the preservation medianRank graph (left), a medianRank value close to zero indicates a high degree of module preservation. In the preservation Zsummary graph (right), the dashed black lines indicate the thresholds Z = 2, 10. These horizontal lines indicate Zsummary thresholds for strong evidence of conservation (above 10) and for low to moderate evidence of conservation (above 2).
Frontiers in Genetics | www.frontiersin.org correlation with sarcomas (Figures 5, 6). However, due to the lack of stability of the statistical data (Zsummary < 10), the black module was not further analyzed. Therefore, the blue module was defined as an important module of clinical significance and extracted for further analysis.
For the sake of exploration of the biological relevance of the blue module, GO functional and KEGG pathway enrichment analyses were performed on 414 genes in the blue module. The biological processes of the genes in the blue module were found to associate with the cell cycle, such as mitotic nuclear division, chromosome segregation and sister chromatid segregation. In the KEGG pathway analysis, cell cycle associated signaling pathways such as DNA replication, cell cycle, p53 signaling pathway, oocyte meiosis, mismatch repair and metabolism associated pathways such as pyrimidine metabolism and purine metabolism were enriched (Figure 7).

Identification of Sarcoma Hub Genes in the Blue Module
Highly connected hub genes within a module perform important roles in tumor biological processes. Therefore, the 20 genes with greatest module relevance in the blue module were selected as candidate hub genes for STS (Supplementary Data Sheet S1). In addition, a PPI network in the blue module was constructed in accordance with the STRING database (Figure 8). Twelve of   the 20 candidate genes in the co-expression network were also identified as hub nodes of the PPI network. Finally, these 12 genes were considered "primary" hub genes associated with STS and therefore selected for additional analyses.

Survival Analysis and Efficacy Evaluation
While testing the TCGA dataset, four out of 12 hub genes demonstrated significant connectivity with overall and diseasefree survival (Figure 9). When testing the GSE21050 dataset, these four hub genes showed significant correlation with metastasis free survival (Figure 10). Furthermore, they were significantly highly expressed in STS tissue compared to normal fat tissue (Figure 11).

Gene Set Enrichment Analysis
In order to find out the potential function of both blue module and hub genes, GSEA was performed to identify KEGG pathways enriched in samples with higher level of ME of blue module. In GSEA analysis, five signaling pathways were significantly enriched, including ubiquitin mediated proteolysis (FDR = 0.01), pyrimidine metabolism (FDR = 0.03), oocyte meiosis (FDR = 0.02), cell cycle (FDR = 0.04) and DNA replication (FDR = 0.04) (Figure 12). Moreover, the last four pathways were consistent with the results of KEGG pathway analysis ( Figure 7D).

DISCUSSION
Soft tissue sarcomas remain among the most challenging diseases for medical oncologists to treat. STSs are mesenchymal neoplasms that can arise from any site within the body, including extremities, the trunk, retroperitoneum, head, and neck. These are biologically heterogeneous diseases of which greater than 50 subtypes exist, varying by molecular, histological and clinical characteristics.
In this study, WGCNA was utilized to construct a coexpression network for identification of gene co-expression modules associated with STS. The blue module was positively identified and 20 hub genes selected from this module. In addition, as a result of the PPI network, 12 genes were identified as hub nodes of the co-expression module and PPI network, indicating that these 12 hub genes were closely   related to STS and had important biological significance. Subsequent survival analysis established that four of the 12 hub genes (RRM2, BUB1B, CENPF, and KIF20A) were significantly associated with survival. We, therefore, focused on these four genes.
The ribonucleotide reductase regulatory subunit M2 (RRM2) is one of two subunits that constitute ribonucleotide reductase, the enzyme responsible for catalyzing the conversion of ribonucleotides into deoxyribonucleotides, and thus performing an important role in DNA synthesis. RRM2 is important in controlling cellular function in a number of human malignant tumors, including DNA repair, cell proliferation and senescence. Importantly, RRM2 functions as a driver in a variety of tumors, with in vivo and in vitro experiments confirming that knocking down expression using siRNA significantly inhibits tumor cell proliferation (Fang et al., 2016).
The BUB1 mitotic checkpoint serine/threonine kinase B (BUB1B) is a member of the spindle assembly checkpoint protein family, crucial for ensuring correct chromosome separation during cell division (Fu et al., 2016). BUB1B perfoms a role in the inhibition of APC expression, established as a tumor suppressor gene in most colorectal cancers. Accordingly, many reports have shown that upregulation of BUB1B is related to the recurrence and progression of bladder cancer (Yamamoto et al., 2007), gastric cancer (Ando et al., 2010), esophageal squamous cell carcinoma (Tanaka et al., 2008), breast cancer (Yuan et al., 2006), hepatocellular carcinoma (Liu et al., 2009) and others.
Centromere protein F (CENPF) is another important protein involved in chromosome segregation during mitosis.
Upregulation of CENPF protein expression, especially through a gene amplification effect, suggests that high levels of CENPF protein may affect the occurrence of tumors, especially in the early stages of tumor development (Varis et al., 2006). Clinical research has demonstrated that high expression levels of CENPF results in poor prognosis in nasopharyngeal carcinoma (Cao et al., 2010), colorectal gastrointestinal stromal tumors (Chen et al., 2011), esophageal squamous cell carcinoma (Mi et al., 2013) and prostate cancer (Zhuo et al., 2015). It has also been shown to play an important role in driving hepatocellular carcinoma (Dai et al., 2013).
Kinesin family member 20A (KIF20A, also known as RAB6KIFL) belongs to the kinesin superfamily-6, located in the Golgi apparatus and contributes to intracellular organelle transport and cell division (Echard et al., 1998). Recently, it has been reported that KIF20A is associated with mitosis, cell adhesion, migration and proliferation. Furthermore, recent studies have demonstrated that KIF20A is involved in tumor progression and angiogenesis. High expression of KIF20A results poor prognosis in glioma patients (Duan et al., 2016;Saito et al., 2017), nasopharyngeal cancer (Liu et al., 2017), hepatocellular carcinoma , melanoma (Yamashita et al., 2012) and early-stage cervical squamous cell carcinoma .
Regarding GSEA, it was found that cell cycle and metabolism associated pathways were significant enriched in samples with higher level of ME of blue module. This is consistent with the initial GO and KEGG analysis results of the blue module and are related to the physiological function of these four hub genes.
In summary, through WGCNA and other related analysis methods, we identified four genes (RRM2, BUB1B, CENPF, and KIF20A) related to the progression and prognosis of STS. These genes may play a role by regulating the cell cycle and metabolism associated signaling pathways.

AUTHOR CONTRIBUTIONS
ZZ and DS designed the study. ZZ and ZJ performed the data collection. ZJ and LW performed the data analysis. ZZ and MZ drafted the manuscript. All authors read and approved the final version of the manuscript.

FUNDING
This study was supported by the Special Projects of Health in Jilin Province (3D5148273428).

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene. 2019.00037/full#supplementary-material FIGURE S1 | Data quality examination. DATA SHEET S1 | Hub genes of blue module.