Bioinformatics Identified 17 Immune Genes as Prognostic Biomarkers for Breast Cancer: Application Study Based on Artificial Intelligence Algorithms.

An increasing body of evidence supports the association of immune genes with tumorigenesis and prognosis of breast cancer (BC). This research aims at exploring potential regulatory mechanisms and identifying immunogenic prognostic markers for BC, which were used to construct a prognostic signature for disease-free survival (DFS) of BC based on artificial intelligence algorithms. Differentially expressed immune genes were identified between normal tissues and tumor tissues. Univariate Cox regression identified potential prognostic immune genes. Thirty-four transcription factors and 34 immune genes were used to develop an immune regulatory network. The artificial intelligence survival prediction system was developed based on three artificial intelligence algorithms. Multivariate Cox analyses determined 17 immune genes (ADAMTS8, IFNG, XG, APOA5, SIAH2, C2CD2, STAR, CAMP, CDH19, NTSR1, PCDHA1, AMELX, FREM1, CLEC10A, CD1B, CD6, and LTA) as prognostic biomarkers for BC. A prognostic nomogram was constructed on these prognostic genes. Concordance indexes were 0.782, 0.734, and 0.735 for 1-, 3-, and 5- year DFS. The DFS in high-risk group was significantly worse than that in low-risk group. Artificial intelligence survival prediction system provided three individual mortality risk predictive curves based on three artificial intelligence algorithms. In conclusion, comprehensive bioinformatics identified 17 immune genes as potential prognostic biomarkers, which might be potential candidates of immunotherapy targets in BC patients. The current study depicted regulatory network between transcription factors and immune genes, which was helpful to deepen the understanding of immune regulatory mechanisms for BC cancer. Two artificial intelligence survival predictive systems are available at https://zhangzhiqiao7.shinyapps.io/Smart_Cancer_Survival_Predictive_System_16_BC_C1005/ and https://zhangzhiqiao8.shinyapps.io/Gene_Survival_Subgroup_Analysis_16_BC_C1005/. These novel artificial intelligence survival predictive systems will be helpful to improve individualized treatment decision-making.


INTRODUCTION
As the most common malignant tumor in women, breast cancer (BC) resulted in 2,088,849 new cases and 626,679 deaths in 2018 (1). Although advances in diagnosis and treatments improved the survival rate of patients with early BC, but the survival rate of patients with advanced BC was still poor (2). Early identification of BC patients with poor prognosis and timely individualized treatments were helpful to improve the prognosis of BC patients.
The tremendous progress of bioinformatics has provided tremendous support for exploring the intrinsic mechanism of tumorigenesis and prognosis. (3)(4)(5)(6). Tumor-infiltrating immune cells were reported to be associated with tumorigenesis and prognosis (7,8). It was reported that there was a significant correlation relationship between tumor-infiltrating immune and prognosis in BC patients (9). Immune-related genes could be used to calculate the immune scores and evaluate the tumor infiltration of immune cells to analyze the tumor immune characteristics (10). There were several prognostic models for prediction of prognosis in BC patients (11)(12)(13). However, these prognostic models could only provide mortality risk prediction for patients in different subgroups, but not individual mortality risk prediction for a specific patient at the individual level. From a specific patient's point of view, the patient's own mortality risk prediction was more important than that of patients in different subgroups. Therefore, a prognostic model that can provide individualized mortality risk prediction for a specific patient is helpful to optimize individualized treatment and improve clinical prognosis.
The current research aimed at exploring the relationship of immune-related genes with transcription factor, immuneinfiltrating cells, and disease-free survival (DFS) of BC patients. Based on different artificial intelligence algorithms, the current study focused on developing artificial intelligence survival predictive systems for providing individual mortality risk prediction for BC patients.

Study Datasets
The original gene expression dataset from The Cancer Genome Atlas (TCGA) database contained 21,205 mRNAs from 1,109 tumor specimens to 113 normal specimens. After removal of patients with survival time <1 month and duplicate samples, 1,030 were included in further survival analyses. The original gene expression values have been log 10 transformed for TCGA dataset. GSE31448 dataset (GPL570 platform) contained 246 patients and 23,319 mRNAs.
Gencode.v29 was used for converting probe IDs name to gene symbols.
Abbreviations: BC, breast cancer; TCGA, The Cancer Genome Atlas; GEO, the Gene Expression Omnibus; ROC, receiver operating characteristic; DFS, disease free survival; HR, hazard ratio; CI, confidence interval; AJCC, the American Joint Committee on Cancer; SD, standard deviation.

Differentially Expressed Analyses
Differentially expressed analyses were conducted with cutoff values of log 2 |fold change| >1 and P < 0.05 by "edgeR" (14). Data were normalized by Trimmed mean of M values method.

Immune Gene and Transcription Factor
Immune genes were identified through Immunology Database and Analysis Portal database (15). Transcription factors were defined through Cistrome Cancer database (16).

Tumor Immune Infiltration
Six tumor-infiltrating immune cell data were obtained from Tumor IMmune Estimation Resource database (16). Single sample gene set enrichment analysis was used to evaluate tumor immune infiltration scores for 28 immune categories (17,18).

Statistical Analyses
Statistical analyses were conducted by SPSS Statistics 19.0 (SPSS Inc., Chicago, IL, USA). Artificial intelligence algorithms were performed by Python language 3.7.2 and R software 3.5.2. Artificial intelligence algorithms were carried out according to the original articles: Cox survival regression (19), multitask logistic regression (20,21), and random survival forest (22,23). Threshold for statistically significant difference was P < 0.05.

Study Datasets
Details of research steps are displayed in Supplementary Figure 1. Table 1 displays the basic information of patients in the model dataset and validation dataset. The mortality rate in the validation dataset was 32.1% (79/246), which was significantly higher than 19.6% (202/1,030) in the model dataset.

Differentially Expressed Analyses
Volcano plots of 21,205 mRNAs and 3,627 immune genes are presented in Figures 1A,B. There were 265 up-regulated and 185 down-regulated immune genes in differentially expressed analyses.

Functional Enrichment Analyses
To explore biological functions of immune genes, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) were performed. Bar plot (Figure 1C), GO chord chart (Figure 2), and KEGG chord plot (Supplementary Figure 2) showed that biological functions of immune genes were mainly enriched in leukocyte migration, positive regulation of cell adhesion, regulation of inflammatory response, regulation of immune effector process, T cell activation, regulation of lymphocyte activation, positive regulation of leukocyte cell-cell adhesion, leukocyte chemotaxis, positive regulation of cell-cell adhesion, and leukocyte cell-cell adhesion. The top five KEGG items were as follows: cytokine-cytokine receptor interaction, hematopoietic cell lineage, viral protein interaction with cytokine and cytokine receptor, human T cell leukemia virus 1 infection, and PI3K-Akt signaling pathway.

Immune Regulatory Network
Univariate Cox regression determined 179 prognostic genes for DFS. The current research adopted three methods to explore the relationship between immune genes and transcription factors. First, with thresholds of correlation coefficient >0.5 and P < 0.01, the current study identified transcription factors that were highly correlated with prognostic immune genes. Second, prognostic immune genes and their highly correlated transcription factors were put in STRING database (medium confidence, 0.70) to

Construction of Prognostic Model
Multivariate Cox regression identified 17 genes as independent influence factors for BC ( Table 2 and Figure 4). The formula of prognostic model was as follows: Prognostic nomogram is shown in Figure 5.

Artificial Intelligence Survival Prediction System
Artificial intelligence survival prediction system was constructed to provide individual mortality risk prediction for BC patients (Figure 8). This tool could provide three individual mortality risk predictive curves by using random survival forest algorithm ( Figure 8A), multitask logistic regression algorithm ( Figure 8B), and Cox survival regression algorithm ( Figure 8C). Artificial intelligence survival prediction system is available at https://zhangzhiqiao7.shinyapps.io/Smart_Cancer_Survival_ Predictive_System_16_BC_C1005/.

Gene Survival Analysis Screen System
The Gene Survival Analysis Screen System was constructed for exploratory research of immune genes (Supplementary Figure 8). The Gene Survival Analysis Screen System is available at https://zhangzhiqiao8.shinyapps.io/Gene_ Survival_Subgroup_Analysis_16_BC_C1005/.

Independence Assessment
Prognostic signature, AJCC PT, and AJCC PN were independent risk factors for DFS in the model dataset ( Table 3). In the validation dataset, prognostic signature was proven to be an independent risk factor for DFS.
Clinical Correlation Analyses Figure 9 shows a correlation coefficient heatmap between prognostic genes and clinical variables. Supplementary Figure 9 presents correlation significance heatmap between prognostic genes and clinical variables. Tumor Immune Infiltration Figure 11 demonstrates expression levels of six tumor immune infiltration in the high-risk group and low-risk group. Figure 12 presents scatterplots between six tumor immune infiltrations and prognostic score.

Gene Differential Expression Between Normal Samples and Tumor Samples
To demonstrate the gene differential expression between normal samples and tumor samples at the molecular level, the current study performed group differential expression analyses between normal samples and tumor samples obtained from TCGA database. There were 1,109 tumor samples and 113 normal samples for group differential expression analyses. Supplementary Figure 11 presents the gene differential expression between normal samples and tumor samples at the molecular level.

Clinical Performance in Different Cancers
To explore the clinical performance of the current prognostic model for other cancers, four tumor datasets were obtained from TCGA database as external validation datasets. There were 348 patients in the hepatocellular carcinoma dataset, 265 patients in the colorectal cancer dataset, 494 patients in the lung cancer dataset, and 370 patients in the ovarian cancer dataset. The prognostic scores in external validation datasets were calculated according to the previous formula derived from the model dataset. Survival curve analyses indicated good diagnostic performance of the current prognostic model for hepatocellular carcinoma, colorectal cancer, lung cancer, and ovarian cancer (Supplementary Figure 12), suggesting that the current prognostic model might be useful for other malignant tumors.

External Validation of Accuracy and Clinical Validity
To validate the accuracy and clinical validity of the current prognostic model in other external validation dataset, hepatocellular carcinoma dataset, colorectal cancer dataset, lung cancer dataset, and ovarian cancer dataset were downloaded from TCGA database. These four malignant-tumor datasets were merged into one joint dataset as external validation dataset with 1,640 tumor patients. Supplementary Figure 13A demonstrates that concordance indexes were 0.832, 0.781, and 0.778 for 1-, 3-, and 5-year survival, respectively. Supplementary Figure 13B suggests that the current prognostic model could distinguish tumor patients with high mortality risk from those with low mortality risk. Calibration curves of external validation dataset showed good accordance between actual mortality percentage and predicted mortality percentage (Supplementary Figure 14).

DISCUSSION
The current research determined 17 immune genes as prognostic biomarkers for DFS. Then the current research depicted regulatory relationships among transcription factors and immune genes through correlation analyses and STRING database. Based on these 17 immune genes, the current research created a prognostic nomogram to predict the DFS for BC patients. Based on the previous prognostic nomogram, the current research developed two artificial intelligence survival predictive systems for individual mortality risk prediction. These two artificial intelligence survival predictive systems were helpful to provide precise individual mortality risk prediction and improve individual treatment decision-making.
Several prognostic models have been built for predicting the prognosis in BC patients (11)(12)(13). The previous prognosis models could only provide the mortality curves for two groups of tumor patients with different characteristics, failing to provide the individual mortality curve for a special patient. The progress of artificial intelligence algorithms provides necessary basic conditions for the realization of individualized mortality risk prediction of cancer patients. Random survival forest algorithm (25)(26)(27), multitask logistic regression (28,29), and Cox survival regression algorithm (30) have been proposed and used to improve the predictive performance of prognostic models. Based on three artificial intelligence algorithms above, we develop an artificial intelligence survival predictive system. Our artificial intelligence survival predictive system could display three individual mortality risk predictive curves by using random survival forest algorithm, multitask logistic regression algorithm, and Cox survival regression algorithm. At present, there are few prognostic models that can provide individual mortality risk prediction. The current study provides an interesting and feasible way for the transformation and application of artificial intelligence algorithm in the field of medicine. Tumor immune infiltration acted an important role in oncogenesis and prognosis (7,31). Immune genes could be used to predict the prognosis of BC patients (32,33). Three prognostic models have been developed to predict the prognosis for BC patients (11)(12)(13). Compared with three previous prognostic models, the current prognostic model could provide individualized mortality risk prediction and online calculation function, which were of great significance for clinical application by patients and clinicians.
Biological processes of immune genes were explored via TISIDB databases (http://cis.hku.hk/TISIDB/index.php). Top biological processes of CD1b molecule (CD1B) were adaptive  immune response, antigen processing and presentation, and antigen processing and presentation via major histocompatibility complex class Ib. Top biological processes of lymphotoxin α (LTA) were adaptive immune response, lymphocyte-mediated immunity, leukocyte-mediated immunity, and inflammatory response to antigenic stimulus. Top biological processes of CD6 molecule (CD6) were immunological synapse formation, cell recognition, acute inflammatory response, and inflammatory response to antigenic stimulus. Top biological processes of cathelicidin antimicrobial peptide (CAMP) were cell killing, antibacterial humoral response, innate immune response in mucosa, mucosal immune response, and organ-or tissue-specific immune response. Top biological processes of interferon γ9 (IFNG) were cell killing, neutrophil homeostasis, leukocyte homeostasis, and neutrophil apoptotic process. Top biological processes of ADAM metallopeptidase with thrombospondin type 1 motif 8 (ADAMTS8) were phosphate ion transport, anion transport, and inorganic anion transport. Top biological processes of apolipoprotein A-V (APOA5) were receptormediated endocytosis, tissue regeneration, and positive regulation of receptor-mediated endocytosis. Top biological processes of siah E3 ubiquitin protein ligase 2 (SIAH2) were proteasomal protein catabolic process, regulation of cysteinetype endopeptidase activity involved in apoptotic process, and negative regulation of cysteine-type endopeptidase activity involved in apoptotic process. Top biological processes of steroidogenic acute regulatory protein (STAR) were response to molecule of bacterial origin, response to oxidative stress, and response to reactive oxygen species. Top biological processes of cadherin 19, type 2 (CDH19), were homophilic cell adhesion via plasma membrane adhesion molecules and cell-cell adhesion via plasma-membrane adhesion molecules. Biological processes of C-type lectin domain family 10, member A (CLEC10A), were adaptive immune response. ADAMTS8 was related with poor prognosis for breast invasive ductal carcinoma patients (34). ADAMTS8 could regulate invasion and apoptosis of hepatocellular carcinoma through ERK signaling pathway (35). The interaction between CD4 + T cells and lung cancer cells could up-regulate expression of DNMT and methylation of IFNG promoter (36). CpG methylation of IFNG gene could induce immunosuppression of tumor-infiltrating lymphocytes (37). High expression level of XG in Ewing sarcoma cell line could promote tumor migration and invasiveness (38). Highly expressed SIAH2 was associated with poor progressionfree survival after tamoxifen treatment (39). SIAH2 participated in the regulation of EAF2 polyubiquitin in prostate cancer cells as E3 ligase of EAF2 polyubiquitination (40). Cathelicidin antimicrobial peptide directly activated exchange protein, which regulated migration and apoptosis of BC cells (41). Interleukin 24 enhanced apoptosis of BC cell via cAMP-dependent PKA pathway (42). Low expression of NTSR1 was associated with non-invasive growth of colorectal cancer (43). Interaction of CLEC10A with macrophages and dendritic cells might play an important role in tumor progression (44). As a functional ligand of CLEC10A, sv6D could induce the maturation of immune cells (45). Variation of STXBP6 might affect the response of TNFα inhibitors in rheumatoid arthritis patients (46). There was a significant correlation between LTA RS909253GA genotype and the development of Asian gastric cancer (47). These previous studies revealed possible immune regulatory mechanisms and biological roles of previous 17 immune genes in tumorigenesis and progression.
Toll-like receptor-activated plasma-like dendritic cells inhibited growth of BC cells (48). CD56 enhanced formation of cytotoxic immune synapses and strengthened sensitivity of cytotoxicity mediated by natural killer cells (49). CD4 and CD8 T cell tumor infiltration driven by HER2-dendritic cells improved survival of BC mice (50). CD4 + T cells inhibited CD8 + T cell failure at the initiation stage of immune response in BC (51). V delta 2 + gamma delta T lymphocyte had cytotoxicity to MCF 7 BC cells (52). High expression of SEMA4C was correlated with the proliferation of tumor cells and the aggregation of macrophages in BC (53). Interleukin 32θ suppressed the growth of BC by regulating CCL18 secreted by macrophages (54). Macrophage adhesion regulated by integrin induces lymphovascular dissemination in BC (55). CCL5 induced recurrence of BC via aggregation of macrophages in residual tumors (56). High expression of mast cells induced the tumor size and the incidence of spontaneous metastasis in BC mice (57). Tumor-infiltrating myeloid-derived suppressor cells (MDSCs) was related with therapeutic effect and prognosis of neoadjuvant chemotherapy of BC (58). Neutrophil-lymphocyte ratio could predict prognosis of triple-negative BC patients (59). Interleukin 10 and interleukin 2 promoted proliferation and cytotoxicity of CD8+T cells (60).
Advantages of current research: First, two artificial intelligence survival predictive systems were developed based on immune genes for BC patients. These two tools could provide online individual mortality risk prediction and provide valuable prognostic information for optimizing individual treatment decision. Second, artificial intelligence survival prediction system provided three individual mortality risk predictive curves based on different artificial intelligence algorithms, providing different valuable individual mortality curves as the references of individual medical decision-making. Third, artificial intelligence survival prediction system provided predicted median survival time and 95% confidence interval of predicted mortality, which were of clinical practical values for optimizing individual medical decision-making.
Shortcomings of current research: First, the current research explored potential biological functions and regulatory mechanisms of immune genes in BC based on public databases, but the conclusion was not validated by confirmative experimental studies. Second, estrogen receptor, progesterone receptor, and erb-b2 receptor tyrosine kinase 2 are closely related to prognosis of BC patients. Subgroup studies based on these biomarkers are helpful to provide more accurate individual mortality risk prediction for BC patients in different subgroups. Third, the current studies did not include and analyze the impacts of several clinical factors, such as radiotherapy, chemotherapy, and targeted drug therapy, which should be taken into account for future studies. Fourth, the calculation process of artificial intelligence algorithms are too complex to perform and cannot be present through conventional formula. The operation process of artificial intelligence algorithm is opaque, just like a black box, which limits the clinical application and verification of artificial intelligence algorithm. Because it is difficult for artificial intelligence algorithm to perform repeated verification research, we provided three different artificial intelligence algorithms as references for each other. As the inherent deficiency of artificial intelligence, opaque computing process and lack of verification research need to be solved by future artificial intelligence algorithm research. Fifth, the current study identified 17 immune genes were correlated with prognosis of BC patients. However, the associations of these immune genes with tumor heterogeneity and tumor resistance were still unclear. Further basic immune research is helpful to clarify the associations of these immune genes with tumor heterogeneity and tumor resistance. Sixth, although the current study demonstrated the gene differential expression between normal samples and tumor samples, the current study was lack of external validation at the cell level and the animal model level. Further validation studies at the cell level and the animal model level were helpful to ascertain the differences of immune regulatory mechanism of BC patients compared with normal people.
In conclusion, comprehensive bioinformatics identified 17 immune genes as potential prognostic biomarkers, which might be potential candidates of immunotherapy targets in BC patients. The current study depicted regulatory network between transcription factors and immune genes, which was helpful to deepen the understanding of immune regulatory mechanisms for BC cancer. Two artificial intelligence survival predictive systems are available at https://zhangzhiqiao7. shinyapps.io/Smart_Cancer_Survival_Predictive_System_16_ BC_C1005/ and https://zhangzhiqiao8.shinyapps.io/Gene_ Survival_Subgroup_Analysis_16_BC_C1005/. These novel artificial intelligence survival predictive systems will be helpful to improve individualized treatment decision-making.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: https://zhangzhiqiao8.shinyapps.io/Gene_ Survival_Subgroup_Analysis_16_BC_C1005/.

ETHICS STATEMENT
The studies in TCGA database and GEO database have received ethical approval from ethics committees of their respective research institutes. These studies obtained informed consent from patients before admission. The current study is a second study based on public datasets from TCGA database and GEO database. Details of all patients in public datasets have been anonymously processed and therefore the current research does not involve patients' privacy information. This study was performed according to public database policy and declaration of Helsinki. Ethical approval and informed consent were not applicable.

AUTHOR CONTRIBUTIONS
ZZ, JD, JL, and TH: conceptualization, methodology and resources. ZZ, JD, JL, and TH: investigation, data curation, formal analysis, validation, software, project administration, and supervision. ZZ, and JD: writing and visualization. ZZ: funding acquisition.

FUNDING
The current research was funded by Medical Science and Technology Foundation of Guangdong Province (A2016450 and B2018237). and molecular marker identification in glioblastoma multiforme based on mRNA/microRNA/long non-coding RNA analysis using random survival forest method. Neoplasma.