Identification of Diagnostic Markers for Breast Cancer Based on Differential Gene Expression and Pathway Network

Background: Breast cancer is the second largest cancer in the world, the incidence of breast cancer continues to rise worldwide, and women’s health is seriously threatened. Therefore, it is very important to explore the characteristic changes of breast cancer from the gene level, including the screening of differentially expressed genes and the identification of diagnostic markers. Methods: The gene expression profiles of breast cancer were obtained from the TCGA database. The edgeR R software package was used to screen the differentially expressed genes between breast cancer patients and normal samples. The function and pathway enrichment analysis of these genes revealed significant enrichment of functions and pathways. Next, download these pathways from KEGG website, extract the gene interaction relations, construct the KEGG pathway gene interaction network. The potential diagnostic markers of breast cancer were obtained by combining the differentially expressed genes with the key genes in the network. Finally, these markers were used to construct the diagnostic prediction model of breast cancer, and the predictive ability of the model and the diagnostic ability of the markers were verified by internal and external data. Results: 1060 differentially expressed genes were identified between breast cancer patients and normal controls. Enrichment analysis revealed 28 significantly enriched pathways (p < 0.05). They were downloaded from KEGG website, and the gene interaction relations were extracted to construct the gene interaction network of KEGG pathway, which contained 1277 nodes and 7345 edges. The key nodes with a degree greater than 30 were extracted from the network, containing 154 genes. These 154 key genes shared 23 genes with differentially expressed genes, which serve as potential diagnostic markers for breast cancer. The 23 genes were used as features to construct the SVM classification model, and the model had good predictive ability in both the training dataset and the validation dataset (AUC = 0.960 and 0.907, respectively). Conclusion: This study showed that the difference of gene expression level is important for the diagnosis of breast cancer, and identified 23 breast cancer diagnostic markers, which provides valuable information for clinical diagnosis and basic treatment experiments.

2020; Kanathezath et al., 2021). Therefore, based on bioinformatics analysis methods, pathway and network analysis methods (Yu et al., 2021a) were used to systematically analyze breast cancer related genes, aiming to accurately understand the pathogenesis of breast cancer incidence and development, identify the most important pathogenic genes related to breast cancer, and provide valuable information for clinical diagnosis, treatment and control (Yu et al., 2018;Cao et al., 2021).
Previously, the use of bioinformatics methods to explore differentially expressed genes has made significant progress in pancreatic cancer (Xu et al., 2020), gastric cancer (Cheng et al., 2020;Gu et al., 2021;Yuan et al., 2021), colorectal cancer , prostate cancer (Rode et al., 2021) and other diseases (Ao et al., 2021;Yu et al., 2021b;Laudisi et al., 2021;Zhao et al., 2021). The occurrence and development of breast cancer is a process involving and synergistic action of multiple genes in multiple stages (Kretschmer et al., 2011). Therefore, the differential change of gene expression profile has always been a hot topic in breast cancer research (Manjili et al., 2012). At present, some studies have screened differentially expressed genes in breast cancer patients. Studies have shown that Grb14 is highly expressed in 23.1% of breast cancers, and this high expression is associated with better overall and disease-free survival, and can be used as a better independent prognostic factor (Huang et al., 2013). In adipose tissue of mammary gland, the overexpression of leptin can promote cell proliferation and cancer (Jardé et al., 2011). Abnormal growth factor signaling between stromal cells and epithelial cells can promote malignant cell growth (Wiseman and Werb, 2002;Haslam and Woodward, 2003).
In recent years, many researchers devote themselves to the research of breast cancer and have made great breakthroughs. Staub et al. have shown that patients with a low expression module co-expressed with the WIPF1 gene generally have a good prognosis in three tumor types including colorectal cancer, glioma and breast cancer (Staub et al., 2009). Protease has also been studied as a biomarker for prognosis and diagnosis of breast cancer (Jardé et al., 2011). Song et al. also demonstrated that the expression of transforming acid curly spiral egg (TACC3) gene is an independent prognostic factor in breast cancer patients (Song et al., 2018).
This study aims to identify diagnostic markers of breast cancer by differential expression genes screening and pathway networks constructing. The gene expression profile was used to screen the genes related to breast cancer, pre-screen the diagnostic markers of breast cancer, and analyze the function of these genes. The potential breast cancer markers were further explored through the construction of KEGG pathway network, and reliable breast cancer diagnostic markers were screened by combining with differentially expressed genes. The diagnostic prediction model was constructed using these markers, and the predictive power of the model was verified. In addition, these markers are documented to provide targets and references for clinicians as well as biological experimentalists.

Data Acquisition and Processing
In this study, expression profile data of breast cancer were obtained from The Cancer Genome Atlas (TCGA) database (Cancer Genome Atlas Research et al., 2013;Li et al., 2020). The TCGA database is one of the most ambitious and valuable cancer genomics projects currently under way. It is a joint initiative of the National Institute of Cancer Research and the National Human Genome Research Institute. It had the molecular signatures of more than 20,000 primary tumors and matched normal samples covering 33 cancer types. It contains clinical data on a variety of human cancers, genomic mutations, mRNA expression, miRNA expression, methylation, and other data, which is a very important source of information for cancer researchers (Tang et al., 2018;Tian and Wang, 2021;Zhu et al., 2021). These data have improved our ability to diagnose, treat and prevent cancer and will continue to be publicly available to anyone in the research community.
The level 3 gene expression data of RNA-SeqV2 in breast cancer were downloaded from the TCGA database and included 1218 samples, including 1104 breast cancer samples and 114 normal control samples, containing 20,530 genes. The gene expression profile was measured experimentally using the Illumina HiSeq 2000 RNA Sequencing platform by the University of North Carolina TCGA genome characterization center. This dataset shows the gene-level transcription estimates, as in log2(x+1) transformed RSEM normalized count. Genes are mapped onto the human genome coordinates using UCSC Xena HUGO probeMap.

Screening of Differentially Expressed Genes
In our study, the potential related genes of breast cancer were obtained by screening the differentially expressed genes. Differentially expressed genes were screened using edgeR software package. EdgeR software package is a Bioconductor software package for differential expression analysis of RNAseq expression profiles with biological replication, which is used to identify differential expressions or differential markers using read counts from different technology platforms (including RNA-Seq, ChIP-seq, ATAC-seq, Bisulfite-seq, SAGE, etc.). It implements a range of statistical methodology based on the negative binomial distributions, including empirical Bayes estimation, exact tests, generalized linear models and quasilikelihood tests (Robinson et al., 2010).
The screening principle of differentially expressed genes was based on the Fold Change (FC) and Pvalue of breast cancer genes between breast cancer patients and the control group. Set filter conditions when Pvalue <0.05, and the Fold Change of gene expression value was satisfied with FC > 2, or FC < 0.5, the genes that satisfy both conditions were identified as differentially expressed gene.
These differentially expressed genes were displayed using heat map and volcano plot. The heat map was plotted using gplots R package and the volcano plot was plotted using ggplot2 R package.

Biological Functions and Pathways Enrichment Analysis of Differentially Expressed Genes
In this study, using DAVID (database for Annotation, Visualization and Integrated Discovery) (Huang et al., 2009a;Huang et al., 2009b), a database used for annotation, visualization and integration of discoveries, we conduct a GO (Gene Ontology) biological functions enrichment analysis and a KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways enrichment analysis towards the list of differentially expressed genes, with p controlled within 0.05, which could find out the biological characteristics and pathways related.
DAVID is a bioinformatic database that combines information tools to provide a structured and complete description of biological functions for a large number of genes or proteins and help users obtain useful biological information.

Construction of KEGG Pathway Gene Interaction Network and Extraction of Key Genes
KEGG (Kyoto Encyclopedia of Genes and Genomes) is a database tool for understanding the advanced functions and availability of biological systems (such as cells, organisms, and ecosystems) based on information at the genome and molecular levels (Kanehisa and Goto, 2000). It is a computer representation of a biological system formed by molecular blocks of genes, proteins and chemicals that are integrated into molecular wiring diagrams of an information system of interactions, reactions and relationships. It also contains disease and drug information as disturbances to biological systems.
In this study, the KGML which were organized as XML files of significant pathways obtained from the previous enrichment analysis was downloaded from the KEGG website. Extract relation, entry and group in these XML files using "XML" R package. The entry element contains information about a node of the pathway. Relationship between two proteins (gene products) or two KOs (ortholog groups) or protein and compound, which is indicated by an arrow or a line connecting two nodes in the KEGG pathways. Group stands for complex of KOs group. The types associated between nodes include ECrel, PPrel, GErel, PCrel and maplink. Respectively represent enzyme-enzyme relation, protein-protein interaction, gene expression interaction, proteincompound interaction and link to another map. We integrate the real protein-protein interaction by connecting these three types of files and construct the KEGG path gene interaction network, and analyze the topology properties of the network. The network was analyzed and visualized using Cytoscape software (Shannon et al., 2003). Cytoscape is an open-source software platform that allows users to visualize molecular interaction networks and biological pathways and integrate these networks with annotated information, gene expression profiles and other data.
The size of a node in the network is expressed by the degree of the node. The gene with a large degree of node is called Hub gene in the network. The larger its value is, the more edges are connected to the node, and it may play a more important role in maintaining the overall structure of the network. Its change may affect more genes that interact with it. Therefore, in this study, genes with a degree greater than 30 were extracted from the network as key genes affecting breast cancer.

Literature Mining Confirms the Genes Screened
Next, we take the intersection of the obtained differentially expressed genes and the Hub genes obtained in the previous step to obtain a narrow and reliable biomarker list for breast cancer diagnosis.
To test whether the biomarkers screened in our study are indeed associated with breast cancer, we use PubMed (www. pubmed.gov) to conduct a literature review, and analyze whether the genes are indeed related to breast cancer in previous reports, so as to prove that the screening of tumor-related genes using our methods is effective. PubMed (White, 2020) is a widely used search engine, built and maintained by the National Biotechnology Information Center (NCBI) of the National Library of Medicine (NLM), which can provide more than 28 million academic biomedical publications.
This method is simple and feasible. The selected genes and breast cancer will be searched as the keywords in the literature database, and then consult the literature to see if there is a strong or weak relationship between the screened genes and the occurrence and development of breast cancer.

The Construction of Diagnosis and Prediction Model of Breast Cancer
According to the expression profile data of breast cancer, the corresponding expression values of 23 breast cancer diagnostic markers obtained in the previous step were found, and the expression profile data of these 23 genes were obtained. To verify the accuracy of the model, cancer patients and normal control samples in this dataset were randomly divided into two sets, one for training set and another for test set. The principle of division was to ensure that the proportion of cancer patients and normal control samples in each set was the same. The training set included 609 samples, including 552 samples from breast cancer patients and 57 normal control samples. The validation dataset consisted of 609 samples, including 552 samples from breast cancer patients and 57 normal controls.
Next, we use the training set samples to build the support vector machine model. Support Vector machines (SVM) is a kind of dichotomous classification model. Its basic model is a linear classifier, which defines the maximum interval in the feature space, which is the biggest difference between it and perceptron. The SVM learning strategy is to maximize the interval, which can be formalized as a problem for solving convex quadratic programming.
The purpose of the SVM model is to find the maximum distance between each sample point and the hyperplane, that is, to find the hyperplane with the largest interval. Any hyperplane can be described by the following linear equation: Now let's start to calculate the interval. In fact, the interval is equal to the projection of the difference of two heterogeneous support vectors on w: x satisfy: Get: After the interval is solved, the idea of SVM is to maximize the interval, so: max and min to obtain the optimization problem: The classification model was tested internally using a tenfold cross validation method, and the model was tested externally using test sets to verify the accuracy of the model in classifying new patients. The performance of the model was also measured using sensitivity and specificity (Cheng et al., 2018;Tao et al., 2020;Zhao et al., 2020).

Acquisition of Differentially Expressed Genes in Breast Cancer
By screening for differentially expressed genes, we obtained 1060 genes that were differentially expressed between breast cancer patients and normal control samples. These genes can be seen as potentially related to breast cancer. Among them, 516 were downregulated genes and 544 were up-regulated genes. They were displayed using a volcano plot ( Figure 1A), where the red dots represent the genes that are down-regulated in breast cancer, the blue dots represent the genes that are up-regulated in breast cancer, and the green dots represent the genes that are not significantly different.
In addition, we also used a heat map to show the expression levels of these differentially expressed genes in cancer and normal samples ( Figure 1B). The rows in the figure represent differentially expressed genes and the columns represent patients, where the red bar represents breast cancer patients and the blue bar represents normal control samples. As can be seen from this figure, there are indeed significant differences in these genes between breast cancer patients and normal control samples, which can clearly distinguish cancer patients from normal control samples. This suggests that the genes we screened are indeed associated with breast cancer.  Pathways and Biological Functions Differentially Expressed Genes Involved DAVID bioinformatics tool was employed to complete the Gene Ontology and KEGG pathway enrichment analysis. From the result of enrichment, we can see that the differentially expressed genes are involved in a number of cancer-related biological functions, such as cellular protein metabolic process, negative regulation of gene expression, epigenetic, response to drug, cell development, transcriptional activator activity, RNA polymerase II transcription regulatory region sequence-specific binding et al. (Figure 2A). Through the bubble diagram, it can be seen intuitively that the GO function and KEGG pathway enriched by the 1060 differentially expressed genes. It is significant that these genes annotate the transcriptional misregulation in cancer pathway, suggesting that these genes may be closely related to cancer development and may serve as potential biomarkers for breast cancer.
In addition, a total of 28 significantly enriched pathways were identified (Table 1; Figure 2B), including Neuroactive ligandreceptor interaction, Systemic lupus erythematosus, Chemical carcinogenesis, Metabolism of xenobiotics by cytochrome, Drug metabolism-cytochrome P450, Transcriptional misregulation in cancer, cAMP signaling pathway et al. This suggests that the occurrence and development of breast cancer is a complex physiological process, with abnormal functions in a variety of pathways.

Acquisition of Potential Diagnostic Markers for Breast Cancer
Next, we downloaded the pathways with significant enrichment of differentially expressed genes obtained above from the KEGG website. Then a network is constructed by integrating the real gene interaction information from the KEGG pathways. The network contains 1277 nodes and 7345 edges, and the size of the nodes is represented by the degree (Figure 3A). Of the 1277 nodes (genes), 175 were differentially expressed genes (a total of 1060) previously screened between breast cancer and normal control samples, and the remaining 1102 genes were obtained by pathway enrichment analysis. If the key genes (nodes with large degree) are removed, the stability of the KEGG pathway gene interaction network will be seriously threatened, and the network topology will be lax. Therefore, although the nodes with large degree only account for a small part of the nodes in the network, which conforms to the power law distribution of the degree distribution of the biomolecular network ( Figure 3B), they are essential key nodes. The nodes with large degree in the network are called Hub nodes. These nodes are generally the key nodes in the network, because their changes may affect more genes that interact with them. We tend to select the top 10% nodes in the network as Hub nodes. However, under this threshold, there are many nodes with the same degree and their degree is 38. Therefore, in order to ensure less omission of breast cancer diagnostic markers, we select nodes with a degree greater than 30 on the basis of this threshold. So, we filtered the degree attribute according to the attribute table exported after topology analysis and set the filter condition as degree >30, 154 genes, known as Hub genes, were identified as candidate genes potentially associated with breast cancer.
In this study, the common gene set of the previously obtained differentially expressed genes in breast cancer and key genes in the pathway network were selected, including 23 genes ( Table 2). As candidate markers for the diagnosis of breast cancer, to ensure the accuracy of these markers.

Breast Cancer Related Genes Were Confirmed by Literature Review
Next, to verify that our method of screening for biomarkers for breast cancer is reliable. We selected the genes for literature mining and verification. To see if there is any evidence in the literature that these genes are indeed involved in the development and progression of breast cancer.
After the intersection of the two sets of genetic data sets, 23 diagnostic markers of breast cancer were obtained. After literature mining in PubMed literature retrieval system, we found that some of these 23 genes have been confirmed to be related to breast cancer, including 12 genes, ADH1A, ADH1C, AKR1C4, ALDH3A1, CYP1A2, CYP2B6, CYP2C18, CYP2C19, CYP3A4, CYP3A7, GSTA1, RXRG. For example, THE TT genotype of CYP3A4 polymorphism is associated with increased risk of breast cancer (Liu et al., 2019). RXRG protein is an independent predictor of breast cancer specific survival and distant metastasis-free survival (Joseph et al., 2019). ADH1A has been found to be a potential biomarker for diagnosis, treatment and prognosis of breast cancer (Wu and Yu, 2021). It is worth noting that the remaining genes, such as GNGT1 and UGT1A7, have not been documented to be associated with breast cancer. However, the literature has linked these genes to other cancers. Therefore, the prediction of the correlation between these genes and breast cancer will provide clinicians and biological experimentalists with targets and references for future research directions.

Construction and Prediction of Breast Cancer Diagnostic Model
According to the expression profile data of breast cancer, the corresponding expression profile data of 23 diagnostic markers of breast cancer obtained in the previous step were obtained. Cancer patients and normal control samples were divided into training sets and test sets. The proportion of breast cancer patients and normal control samples in both datasets was also ensured to be the same. The training set included 609 samples, including 552 samples from breast cancer patients and 57 from normal control samples; the validation dataset also consisted of 609 samples, including 552 samples from breast cancer patients and 57 from normal controls.
The training set was used to construct the classification model of support vector machine. The accuracy of the model was evaluated using a tenfold cross validation method. The confusion matrix was shown in Figure 4A. According to the matrix, 600 out of 609 samples were classified correctly, and the classification accuracy was 98.5%. The sensitivity and specificity of the model were 99.1 and 93%, respectively. The ROC curve (receiver operating characteristic curve) of the model is shown in Figure 4C, and the AUC reaches 0.960.
Next, the established model is used to predict the test set to test the prediction ability of the model. The confusion matrix is shown in Figure 4B, 593 out of 609 samples in test set are correctly classified, and the classification accuracy is 97.4%. The sensitivity and specificity of the model were 98.9 and 82.5%, respectively. Compared to the training set, there was a decrease in specificity and a certain increase in the misclassified normal control samples. The ROC curve of the model is shown in Figure 4D, and the AUC reaches 0.907. In other words, for new patients, once we have data on the expression levels of these 23 genes, we can use the classification model constructed in this study to predict whether they are likely to develop breast cancer.
These results indicate that the diagnostic prediction model constructed in this study can effectively distinguish between breast cancer patients and normal control population, and these 23 genes can be used as reliable biomarkers for breast cancer diagnosis, but further experiments are still necessary. Especially for genes that have not been reported in the literature. In addition, we analyzed the biological functions of these 23 diagnostic markers and found that they were involved in many cancer-related biological processes and pathways, such as chemical carcinogenesis, drug metabolism, xenobiotic metabolic process, metabolic pathways, oxygen binding etc.

DISCUSSION
Breast cancer is a heterogeneous cancer with the highest incidence among women in the world, which is a serious threat to women's health. The occurrence of breast cancer is a complex biological process involved and regulated by multiple genes, and the difference of gene expression levels in tumor cells of different patients determines the different treatment and prognosis of patients (Saad et al., 2010). Therefore, to explore the characteristic changes of breast cancer from the level of genes and the discovery of biomarkers of breast cancer diagnose will play an important role in guiding the treatment of breast cancer. Advances in high-throughput sequencing technology have made it easier for researchers to understand how diseases occur and develop at the genome-wide level. RNA-Seq refers to transcriptome sequencing technology, which is a highthroughput sequencing technology to reflect the expression level of mRNA, small RNA and noncoding RNA or some of them by determining their sequences with high-throughput sequencing technology. TCGA is an open and free database resource that integrates multiple cancer data types. A large sample size of RNA-Seq data for breast cancer can be obtained. In this study, the Level 3 gene expression data of RNA-SeqV2 in 1104 breast cancer tumor samples and 110 normal control samples adjacent to cancer were downloaded from the TCGA database, and the genetic characteristics of breast cancer were analyzed at the whole genome level using these data. Identify genes that are differentially expressed in breast cancer and molecular markers for cancer diagnosis.
In this study, a total of 1060 differentially expressed genes were screened from the tumor and normal samples of breast cancer by using edgeR R package, of which 544 genes were up-regulated and 516 were down-regulated in the cancer samples. Through the enrichment analysis of GO function, it was found that these genes were mainly enriched in biological processes such as cellular protein metabolic process, negative regulation of gene expression, epigenetic, response to drug, cell development, transcriptional activator activity, RNA polymerase II transcription regulatory region sequence-specific binding et al. In addition, KEGG pathway enrichment analysis of these genes showed that they were significantly enriched in 28 pathways such as Neuroactive ligand-receptor interaction, Systemic lupus erythematosus, Chemical carcinogenesis, Metabolism of xenobiotics by cytochrome, Drug metabolism -cytochrome P450, Transcriptional misregulation in cancer, cAMP signaling pathway et al. By integrating the interaction information of genes in these pathways, a KEGG pathway network was constructed, and the hub nodes in the network were extracted, and 154 candidate genes potentially related to breast cancer were obtained. By integrating with the list of differentially expressed genes, 23 potential diagnostic markers of breast cancer were finally obtained. Some of these genes have been shown to play important roles in the development of breast cancer, which confirms the reliability of the results of this study. However, the remaining genes have not been studied in relation to the risk of breast cancer. They may be new diagnostic factors or risk genes for breast cancer, so further analysis and experimental confirmation of these genes are necessary.
Using gene expression profile of the 23 diagnostic marker genes, we constructed the breast cancer diagnosis prediction model based on SVM classifier, and analyzed the model prediction ability. The results proved that the model can effectively distinguish breast cancer patients and normal people, and further identifying these 23 genes can be used as diagnostic markers of breast cancer. It provides targets and reference for clinical doctors and biological experimentalists to treat breast cancer. It is worth noting that many of the 23 genes come from the same gene family, and further research is needed to determine whether there is redundant information between them. In addition, the methods in this paper can also be applied to other patient data to guide the diagnosis and treatment of cancer patients.
In this study, bioinformatics research methods were used to systematically and comprehensively analyze the differentially expressed genes related to breast cancer, and explore the pathogenesis, therapeutic targets and target prediction of breast cancer.
In conclusion, the use of computer technology and mathematical modeling and other methods can effectively analyze the differential expression genes related to breast cancer, so as to understand the pathogenesis of breast cancer and predict the prevention and treatment targets. It can also effectively solve the problems in biomedical science, and further provide meaningful information for future experimental research.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/ Supplementary material.

AUTHOR CONTRIBUTIONS
GW conceived and designed the experiments. SZ and HJ conducted all the data processing work described in the section of methods and performed the analysis. SZ prepared and edited the manuscript. BG checked and proofread the entire manuscript. WY participated in the revision of the manuscript.