Prognostic Genes of Breast Cancer Identified by Gene Co-expression Network Analysis

Breast cancer is the mostly diagnosed malignance in female worldwide. However, the mechanisms of its pathogenesis remain largely unknown. Methods In this study, we used weighted gene co-expression network analysis (WGCNA) to identify novel biomarkers associated with the prognosis of breast cancer. Gene expression profiles were obtained from the Gene Expression Omnibus (GEO) database. Results A total of 5 modules were identified via the average linkage hierarchical clustering. And a module significantly with the pathological grade was screened out. 33 genes with high connectivity in the clinically significant module were identified as hub genes. Among them, CASC5 and RAD51 were negatively associated with the overall survival and disease-specific survival. Similar results were observed in the validation dataset. Protein levels of CACS5 and RAD51 were also significantly higher in tumor tissues compared with normal tissues based on the analysis of the Human Protein Atlas. Convincingly, qRT-PCR analysis of breast cancer tissues and matched paracancerous tissue demonstrated that CACS5 and RAD51 were significantly upregulated in in breast cancer compared to paracancerous tissues. Further cell proliferation assay indicated that CACS5 and RAD51 depletion decreased cell proliferation capability. In conclusion, our findings suggested that CASC5 and RAD51 could serve as biomarkers related to the prognosis of breast cancer and may be helpful for revealing pathogenic mechanism and developing further research.


Introduction
Breast cancer is the most common cancer and is the second leading cause of cancer death in women worldwide. Despite the incidence of breast cancer is dramatically higher in developed countries, almost half of newly diagnosed cases and approximately 60% of deaths are occurring in developing world (1). Survival rates of breast cancer are largely different in the world. In developed countries, 5year survival was estimated at 80%, while it is thought to below 40% for developing countries (2).
Breast cancer-related deaths have been decreased in recent decades as a result of improving strategies to diagnose and treatment. However, a minority of breast cancer patients initially diagnosed with advanced stage are nearly incurable. Some patients present early-stage disease suffer distant or locoregional recurrence (3). Thus, it is important to understand the underlying mechanisms during the initiation and progression of breast cancer.
Breast cancer is a disease with great heterogeneity. Based on the expression of the three biomarkers, estrogen receptor (ER), progesterone receptor (PR) and human epidermal growth factor receptor 2 (HER2), molecular studies have demonstrated that breast cancer can be categorized into three subtypes: luminal (hormone receptor positive), HER2-enriched (high expression of HER2 gene), and triple-negative (ER-negative, PR-negative, HER2-negative). Luminal type of breast cancer is characterized by the expression of estrogen receptor and progesterone receptor. Patients with luminal breast cancer often have the best prognosis due to the application of endocrine treatment, while half of these patients will develop acquired resistance. Triple-negative and HER2-enriched breast cancers have more aggressive behaviors. Patients with triple-negative and HER2-enriched breast cancers are thought to have poorer prognosis, with trend to early relapse and metastatic spread to the liver, lung and central nervous system(4, 5). BRCA1 and BRCA2 are used to assess the risk of inherited breast cancer. The risks of breast cancer associated with BRCA1 and BRCA2 mutations are 65% and 45% in population, while in cancer-prone families, the risks are estimated to be 87% and 84%(6, 7).
Based on the histopathological features of breast cancer, such as tumor size, lymph node status and grade, patients could be stratified into high-and low-risk groups of recurrence and mortality(8).
However, even in patients with histologically similar tumors, clinical outcomes of breast cancer patients are largely different due to the heterogeneity of breast cancer(9). Several studies have established multiple gene prognostic signatures to predict the prognosis of patients with breast cancer. Some prognostic models have been validated and are in clinical application(10-12). It is important to detect these biomarkers to support early diagnosis, therapeutic strategies determination and prognosis prediction after treatment.
WGCNA (Weighted Gene Co-expression Network Analysis) is a systems biology method to analyze the correlation patterns among genes (13). It can be used to identify candidate biomarker genes or therapeutic targets according to the interconnectedness of gene sets and the association between gene sets and phenotypes. In the present study, we used WGCNA to explore candidate biomarkers of patients with breast cancer.

Materials And Methods Data Processing
Gene expression profiles and clinical data of breast cancer patients (GSE37751) were downloaded from the GEO database. The GSE37751 was based on GPL6244 platform (Affymetrix Human Gene 1.0 ST Array [transcript (gene) version]). This dataset included 47 normal breast tissues and 61 breast cancer tissues. We used the Robust Multi-array Average (RMA) method to normalize raw microarray datasets, including background correction, log2 transformation and normalization. Probes were changed into gene symbols using corresponding annotation files. the expression value of each gene was compared between cancer samples and normal samples to identify differentially expressed genes (DEGs) by Linear Models for Microarray Data (LIMMA) package. The False Discovery Rate method was used to adjust the p-value. Adjusted P-value < 0.05 and |log2 foldchange (FC)| > 1 were set as the cutoff criteria to select genes for further network construction.

Construction Of Co-expression Network
The gene co-expression networks of DEGs were constructed using R WGCNA package. In the present study, an appropriate soft-threshold power β was selected to build a weighted adjacency matrix. We calculated the network connectivity of genes by transforming the adjacency matrix into topological overlap matrix (TOM). In order to classify the genes with similar expression profiles into gene modules, the average linkage hierarchical clustering was performed according to the TOM-based dissimilarity measure with a minimum size of 30 for the genes dendrogram.

Identification Of Clinically Significant Modules
Module eigengenes (MEs) are the first principle component of each gene module and were considered as the expression patterns of all genes within a given module. Correlations between clinical trait and MEs were calculated to identify the clinically significant module. In addition, module significance (MS) of each module was calculated. The higher absolute value of module significance of a given module was considered as more biologically significant.

Pathway Enrichment And Functional Analyses
In order to identify the biological function of the clinically significant module, gene ontology and KEGG pathway enrichment analyses were performed using a functional annotation tool in the database for annotation, visualization and integrated discovery (DAVID, http://david.abcc.ncifcrf.gov/, version 6.8).
Adjusted P < 0.05 was considered statistically significant.

Identification And Validation Of Hub Genes
Hub genes are thought to be highly interconnected with nodes in a module and are more likely to have functional significance. Module connectivity of each gene was measured by absolute value of the module membership (MM). And the absolute value of gene significance (GS) was also calculated to identify the relationship between genes and clinical trait. Hub genes are considered to have higher absolute value of MM and GS. In the present study, absolute MM > 0.8 and absolute GS > 0.4 were defined as hub genes in co-expression network. GSE21653 was downloaded to validated the prognostic value of our hub genes. Survival analysis was also performed using Kaplan Meier-plotter (www. kmplot.com) (14). Immunohistochemistry data were obtained to validate protein levels of

Statistical Analysis
Kaplan-Meier method with log-rank test and Cox regression model were used to perform survival analysis. We separated patients into low-and high-expression groups based on the median expression value of each hub gene. Student's t test and one-way ANOVA were used to compare the difference between two or more groups. We performed all statistical analysis using R software 3.6.1 and GraphPad Prism software version 7.0 (GraphPad Software, San Diego, CA, USA), a P < 0.05 was set as the probability value of statistical significance.

Result
Weighted co-expression network construction and key modules identification Based on the threshold of adjusted P-value < 0.05 and |log2 foldchange (FC)| > 1, a total of 3668 DEGs (2756 up-regulated and 912 down-regulated) were selected for WGCNA analysis. Nine tumor samples without complete clinical information and 47 normal samples were removed from our subsequent co-expression analysis. To assess the microarray quality, we performed sample cluster of GSE37751 in Pearson's correlation matrices and average linkage method (Fig. 1A). Fifty-two samples were included in co-expression analysis. In this study, we selected the power of β = 16 (scale free R 2 = 0.87) as the soft-thresholding to ensure a scale-free network (Fig. 1B, C). A total of 5 modules were identified via the average linkage hierarchical clustering. Turquoise module was found to have the highest association with pathological grade (Fig. 2), which was selected as the clinically significant module to be studied for further analysis.

Gene Ontology And Pathway Enrichment Analysis
In order to identify the biological functions of turquoise module, genes in this module were categorized into 3 functional groups: biological process (BP), molecular function (MF) and cellular component (CC). In the BP group, turquoise module genes were mainly enriched in cell division, mitotic nuclear division, DNA replication, sister chromatid cohesion, mitotic nuclear division and G1/S transition of mitotic cell cycle. In the MF group, they were mainly enriched in protein binding, ATP binding, histone binding, DNA binding and protein heterodimerization activity. In the CC group, they were significantly enriched in nucleoplasm, nucleus, nucleosome, kinetochore and cytosol ( Fig. 3A-C).
KEGG pathway analysis demonstrated that these genes were mainly involved in cell cycle, DNA replication, alcoholism, systemic lupus erythematosus and Fanconi anemia pathway (Fig. 3D). These results showed that genes in the clinically significant module were mainly involved in mitotic cell cycle process.

Identification and validation of hub genes
Based the cut-off criteria (|MM| > 0.8 and |GS| > 0.4), a total of 33 genes with high connectivity in the clinically significant module were identified as hub genes. Among them, CASC5 and RAD51 were negatively associated with the overall survival and disease-specific survival. Consistently, high expression of CASC5 and RAD51 indicated poor relapse-free survival in GSE21653 (Fig. 4). The same results were observed in Kaplan Meier-plotter database (Fig. 5). In GSE37751, CASC5 and RAD51 were upregulated in triple-negative breast cancer. Higher expression levels of CASC5 and RAD51 were associated with advanced tumor stage and grade (Fig. 6). Similar results were found in the validation set GSE21653 (Fig. 7). Protein levels of CACS5 and RAD51 were also significantly higher in tumor tissues compared with normal tissues based on the Human Protein Atlas (Fig. 8). Convincingly, we performed qRT-PCR using breast cancer tissues and matched paracancerous tissue to validate their expression levels. CACS5 and RAD51 were significantly upregulated in in breast cancer compared to paracancerous tissues. Further cell proliferation assay demonstrated that CACS5 and RAD51 depletion decreased cell proliferation capability (Fig. 9). To further analyze the function of CACS5 and RAD51, GSEA was conducted to search KEGG pathways enriched in CACS5 or RAD51 highly expressed samples. Based on the cut-off criteria (FDR < 0.05), 3 functional gene sets were enriched in CACS5 highly expressed samples: "Proteasome", "Spliceosome", "RNA polymerase", and "Oxidative phosphorylation". Six functional gene sets were enriched in RAD51 highly expressed samples: "Pyrimidine metabolism", "Spliceosome", "Oxidative phosphorylation", "RNA polymerase", "DNA replication", and "Basal transcription factors" ( Figure S1).

Discussion
Breast cancer is the most frequent malignant tumor in females. It is still easy to recur even after combined therapy. In the era of precise medicine, microarray has been widely used to analyze the expression changes of mRNA in breast cancer and predict the potential therapeutic targets. However, better biomarkers for cancer specific prognosis and progression were still required.
In the present study, we performed WGCNA to explore gene co-expression modules associated with progression of breast cancer. A total of 3668 DEGs were used to construct co-expression network and 5 modules were identified. Turquoise module was found to have the highest association with tumor grade, ER status and triple-negative tumor. Thirty-three genes with high connectivity were screened from the module. Among them, CASC5 and RAD51 were negatively associated with the prognosis of patients.
CASC5, also known as D40, encodes a protein that functions as a scaffold for proteins influencing the spindle assembly checkpoint during the eukaryotic cell cycle. It is required for creation of kinetochoremicrotubule attachments and chromosome segregation. CASC5 is widely expressed in various cultured human cancer cell lines and primary tumors of different origins. Yuri N et al. characterized CASC5 as a member of the cancer/testis gene family, and CASC5 knockdown significantly inhibited the growth of human cancer cell lines both in vitro and in vivo (16). In poorly differentiated primary lung cancer, CASC5 expression level was significantly higher. It was the first gene in cancer/testis gene family for which expression is associated with smoking habits of lung cancer patients (17). In patients with malignancies, some of the cancer/testis genes could encode antigens on tumor cells and cause immune responses by cytotoxic T cells which were called cancer/testis antigens. However, it was not clear whether CASC5 protein elicits immune responses in breast cancer patients (18,19). A sequence homology search of public database revealed that the CASC5 sequence was identical to a gene on human chromosome 15, AF15q14. AF15q14 is a partner that fuses to mixed-lineage leukemia genes, which were involved in the development of acute leukemias (20)(21)(22). In dataset GSE37751, CASC5 was highly expressed in breast cancer tissues compared with normal breast tissues. In addition, ROC curve indicated that CASC5 exhibited excellent diagnostic efficiency for normal and tumor tissues. One-way ANOVA and t test demonstrated that the expression level of CASC5 was higher in triple-negative tumor and associated with tumor progression. Besides, the Oncomine database also showed that the expression level of CASC5 was significantly higher in breast cancer samples. To obtain further insight of translational level of CASC5, we used the Human Protein Atlas database to examine the immunohistochemistry staining of CASC5 in both normal breast and breast cancer, and discovered that the protein level of CASC5 was significantly up-regulated in breast cancer tissues compared with normal breast tissues. Survival analysis revealed that high expression of CASC5 was associated with the worse overall survival and relapse free survival. CASC5 has the potential to be a prognostic biomarker. GSEA demonstrated that high expression of CASC5 was substrates for anabolism and energy production. Guppy and colleagues stated that breast cancer cells generated 80% of their ATP by the mitochondrion, introducing the concept of oxidative tumors(30). It was reported that the typical "glycolytic" type of cancer cells includes enhanced glycolytic machinery confronted to a low efficiency oxidative phosphorylation system, while the "oxidative phosphorylation" type of cancer cells relies mainly on mitochondrial respiration to produce ATP from glucose and glutamine oxidation(31-33).
The protein encoded by RAD51 is known to be involved in the repair and homologous recombination of DNA. RAD51 interact with BRCA1 and BRCA2, which regulate both the intracellular localization and DNA-binding ability of this protein. Loss of these controls may be a key event causing genomic instability and tumorigenesis(34, 35). In breast cancer, RAD51 expression correlates with high-grade metastatic breast tumor and poor prognosis. Cells with RAS51 overexpress exhibit disruption of cell cycle, resistance to apoptotic signals and associated resistance to radiotherapy and chemotherapy(36, 37). In dataset GSE37751, RAD51 was highly expressed in breast cancer tissues compared with normal breast tissues. In addition, ROC curve indicated that RAD51 exhibited excellent diagnostic efficiency for normal and tumor tissues. One-way ANOVA and t test demonstrated that RAD51 was highly expressed in triple-negative tumor and associated with high-grade tumor.
Oncomine database showed that the expression level of RAD51 was significantly higher in breast cancer samples. Human Protein Atlas database with the immunohistochemistry staining of RAD51 was used to further investigate the translational level of RAD51. The protein level of RAD51 was significantly up-regulated in breast cancer tissues compared with normal breast tissues. Survival analysis revealed that high expression level of RAD51 was associated with the worse overall survival and relapse free survival. GSEA demonstrated that high expression level of RAD51 was associated with the pathway of pyrimidine metabolism, spliceosome, oxidative phosphorylation, RNA polymerase, DNA replication, and basal transcription factors.

Conclusion
In conclusion, our WGCNA analysis revealed that CASC5 and RAD51 were associated with the progression and poor prognosis of breast. Enrichment analysis demonstrated that CASC5 and RAD51 might promote tumor progression via spliceosome, RNA polymerase and oxidative phosphorylation pathways. Our results may be helpful for revealing pathogenic mechanism and developing further research.

Ethics approval and consent to participate
The research was carried out according to the World Medical Association Declaration of Helsinki and was approved by the Ethics Committee at Zhongnan Hospital of Wuhan University.

Availability of data and materials
Not applicable

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.