Identification of a Six-Immune-Related Long Non-coding RNA Signature for Predicting Survival and Immune Infiltrating Status in Breast Cancer

Long non-coding RNAs (lncRNAs) play critical roles in tumor immunity; however, the functional roles of immune-related lncRNAs in breast cancer (BC) remain elusive. To further explore the immune−related lncRNAs in BC, whole genomic expression data and corresponding clinical information were obtained from multiple BC datasets. Based on correlation with the immune-related genes within the training set, we screened out the most promising immune-related lncRNAs. Subsequently, Lasso penalized Cox regression analysis followed by stepwise multivariate Cox regression analysis identified six survival-related lncRNAs (AC116366.1, AC244502.1, AC100810.1, MIAT, AC093297.2, and AL356417.2) and constructed a prognostic signature. The cohorts in the high−risk group had significantly poor survival time compared to those in the low−risk group. In addition, a nomogram integrated with clinical features and the prognostic signature was developed on the basis of the training set. Importantly, all the findings had a similar performance in three validated datasets. In the following studies, our integrative analyses indicated that the infiltration of CD8-positive (CD8) T cells associated with a good prognosis was strikingly activated in the low−risk group. To further provide an interpretation of biological mechanisms for the prognostic signature, we performed weighted gene co−expression network analysis (WGCNA) followed by KEGG pathway-enrichment analysis. Our results showed that the antigen presentation pathway involved in protein processing in endoplasmic reticulum and antigen processing and presentation was markedly altered in the high-risk group, which might promote tumor immune evasion and associate with poor clinical outcomes in BC patients with high risk scores. In conclusion, we aimed to take advantage of bioinformatics analyses to explore immune−related lncRNAs, which could function as prognostic indicators and promising therapeutic targets for BC patients.


INTRODUCTION
Breast cancer (BC) is one of the most frequent malignant tumors and a leading cause of cancer death among females (Siegel et al., 2019). Although comprehensive treatments including surgery, radiotherapy, and chemotherapy have been well supplied, the prognosis of BC patients is not satisfactory (Schwartz and Erban, 2017;Emens, 2018). Therefore, it is urgent to have an insight into novel prognostic and diagnostic markers to further develop novel therapeutic approaches for BC patients.
The tumor microenvironment (TME) is a complicated system, consisting of not only tumor cells but also tumor-associated normal epithelial and stromal cells, immune cells, and vascular cells (Junttila and de Sauvage, 2013). Emerging evidence has identified that dysfunctional immune status in TME is a hallmark of cancers (Yu et al., 2018). Infiltrating immune cells in TME play critical roles in the initiation and progression of cancer, and positively or negatively influence patient prognosis, which is primarily due to tumor heterogeneity and distinct populations of tumor-infiltrating immune cells (Talmadge et al., 2007). The infiltration of regulatory T (Treg) cells inhibits endogenous immune responses against tumors and correlates with poor prognosis (Joshi et al., 2015), whereas dendritic cell (DC) infiltration is typically associated with a positive clinical outcome associated with their capacity to initiate and regulate both innate and adaptive immunity (Saxena and Bhardwaj, 2018). Myeloidderived suppressor cells (MDSCs) could prevent cytotoxic T lymphocytes (CTLs) from binding to the peptide-MHC complex and therefore inhibit antitumor activity (Nagaraj et al., 2007). In contrast, infiltration of CD8-positive cytotoxic T lymphocytes (CD8 T cells) into solid tumors is associated with good prognosis in various types of cancer, including BC (Jiang and Shapiro, 2014;Li et al., 2019). CD4-positive T cells, namely, T follicular helper (TFH) cells, have two predominant cell subtypes, Th1 and Th2. In general, infiltration of TH1 cells predicts a positive outcome, whereas TH2 cells predict a negative outcome (Giraldo et al., 2014;Pradhan et al., 2014). Natural killer (NK) cells, lymphocytes of the innate immune system, are essential for defense against infectious pathogens and nascent malignant cells. Classically, the CD56dim NK cell subset is considered to mediate antitumor responses, whereas the CD56bright NK cell subset is involved in immunomodulation. Nevertheless, current studies demonstrated that brief priming with IL-15 strikingly enhanced the antitumor response of CD56 bright NK cells (Wagner et al., 2017). Given that clinical outcomes of cancer patients were frequently associated with the infiltration of immune cells in TME, current diagnostic and therapeutic strategies are mainly focused on the analysis and manipulation of Abbreviations: AUC, the area under the ROC curve; BC, breast cancer; CD8 T cells, CD8-positive cytotoxic T lymphocytes; DFS, disease-free survival; FDR, false discovery rate; GEO, the Gene Expression Omnibus; GSEA, gene set enrichment analysis; GSVA, gene set variation analysis; HR, hazard ratio; KEGG, Kyoto Encyclopedia of Genes and Genomes; lncRNAs, long non-coding RNAs; LR: logistic regression analysis; MSigDB, the Molecular Signatures Database; OS, overall survival; RF: random forest; ROC curve, receiver operating characteristic curve; ssGSEA, single sample gene set enrichment analysis; TCGA, The Cancer Genome Atlas; TME, the tumor microenvironment; WGCNA, weighted gene co-expression network analysis. infiltrating immune cellular populations (Talmadge et al., 2007;Tamborero et al., 2018).
Long non-coding RNAs (lncRNAs), longer than 200 nucleotides, play crucial roles in the occurrence and progression of cancers (Iyer et al., 2015). Previous studies have shown that lncRNAs could interact with other biomolecules, such as proteins, regulatory DNA regions, and miRNAs, and therefore regulate gene expression at multiple levels, including epigenetic, transcriptional, and post-transcriptional (Wu et al., 2018). Recently, mounting studies found that the lncRNAs also play a pivotal role in cancer immunity (Hu et al., 2013;Heward and Lindsay, 2014). For instance, lincRNA-Cox2, proximal to the prostaglandin-endoperoxide synthase 2 (Ptgs2/Cox2) gene, regulated both the activation and repression of diverse types of immune genes (Carpenter et al., 2013;Hu et al., 2016). LncRNA FENDRR had an impact on the immune escape of hepatocellular carcinoma cells through interaction with miR-423-5p and GADD45B (Yu et al., 2019). Another research revealed that lncRNA NKILA overexpression in tumor-specific CTLs and TH1 promoted their apoptosis and contributed to shorter survival time in patients with BC or lung cancer (Huang et al., 2018). However, the immune−related lncRNAs associated with BC progression remain poorly understood.
In this study, we integrated bioinformatics analyses and took advantages of lncRNA expression profiles to explore novel prognostic marks and therapeutic targets for BC patients.

Sample Datasets and Data Processing
The GSE20685 dataset was downloaded from the GENE EXPRESSION OMNIBUS (GEO) which contains 327 BC cases. This dataset was based on the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array). Meanwhile, patients' clinical information was also obtained from the GEO database, including age, overall survival time (OS), survival status, and clinical stage. We re-annotated probe sets of Affymetrix Hg-U133 Plus 2.0 array by mapping all probes to the human genome (GRCh37) using SeqMap (Jiang and Wong, 2008). In the analysis, we ensure that those probes mapped to the genome were unique with no mismatch, and all probes mapped to pseudogene transcripts were removed. The analysis procedure was conducted by the previous study (Du et al., 2013).

Construction and Validation of Immune-Related LncRNAs Prognostic Signature
Firstly, the immune−related genes were extracted from immune system process and immune response gene sets sourced from the Molecular Signatures Database (MSigDB, 1 ) (Liberzon et al., 2015). Then, a cohort of immune−related lncRNAs was identified according to Pearson correlation analysis between immune genes and lncRNA expression level in BC samples (|r| > 0.5, p < 0.001). Next, the univariate Cox regression analysis and Kaplan-Meier method were separately performed to sort out survival−related lncRNAs with significant prognostic value (p < 0.05) on the basis of the training set. Following these two independent analyses, the overlapping lncRNAs with significant difference (p < 0.05) were sent for further analyses. Finally, Lasso penalized Cox regression analysis followed by stepwise multivariate Cox regression analysis identified the most promising survival-related lncRNAs and constructed a prognostic signature.
The enrichment scores (ES) of immune genes associated with the prognostic signature were calculated using the Single-sample Gene Set Enrichment Analysis (ssGSEA) in Gene Set Variation Analysis (GSVA) package (Hanzelmann et al., 2013), and then the immune signature based on enrichment scores was developed. The Pearson correlation analysis was employed to evaluate the relationship between the prognostic signature and the immune signature. Subsequently, random forest (RF) and logistic regression (LR), two well machine learning algorithms, were applied to evaluate the performance of the six-lncRNA signature for stratifying the immune status based on the median enrichment score of immune signature according to the area under the ROC curve (AUC) through five-fold cross validation (Han et al., 2014).

Risk Score Construction
To confirm the reliability of the detected immune−related lncRNAs, the risk score was designed to establish a unitive signature for the prognostic analysis, and the coefficient for each lncRNA was obtained through the multivariate cox regression following bidirectional stepwise selection. A risk score formula was established as follows:

Nomogram Construction
A nomogram integrated clinical features and the six-lncRNA signature was constructed using the "rms" R package. Calibration curve were performed to assess the predictive accuracy of the nomogram. The predicted outcomes and actual observed outcomes of the nomogram were presented in the calibrate curve, and the 45 • line represents the best prediction.

Gene Set Enrichment Analysis
To further explore the relationship between immune-related gene sets and the six-lncRNA signature, GSEA was performed with the JAVA program. Genes were ranked on the basis of differential significance between the high-and low-risk groups, which were stratified via the median risk score in BC patients. After performing 1000 permutations, gene sets enrichment with nominal p < 0.05 and FDR < 0.25 were considered significant.

Weighted Gene Co-expression Network Analysis (WGCNA)
The WGCNA package (Langfelder and Horvath, 2008) was applied for further analysis on the basis of the training set, and the analysis procedure was conducted as previously described (Giulietti et al., 2016). In brief, two weighted gene co-expression networks were constructed via the WGCNA package according to quartile of risk scores, with the bottom group serving as the reference network and the top group serving as the test network. The conservation of gene pairs between two networks was evaluated by module preservation function from WGCNA. In order to make analysis more accurate, 1000 times of permutation were performed. Subsequently, median rank was used to confirm non-preservative modules and Z-summary score (Z-score) was applied to evaluate the significance of non-preservative modules. Note that Z-scores greater than 10 indicate a high preservation and Z-scores less than 10 indicate a weak preservation. Finally, the genes in non-preservation modules were enrolled to send for DAVID tool (version 6.8) (Huang et al., 2007) to map these hub genes to KEGG pathways. Enriched KEGG pathways with p-values < 0.05 were extracted, and then the enrichment results were visualized by ggraph package.

External Data Validation
To further validate predictive capability of the six-lncRNA signature, we enrolled three independent datasets, including GSE21653, GSE88770, and TCGA with a total of 245, 117, and 888 cases, respectively. GSE21653 and GSE88770 were based on the platform GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array). The TCGA dataset (level3, raw count) was obtained from The Cancer Genome Atlas (TCGA) via the TCGAbiolinks package. Patients without complete information in training and validation sets had been excluded from this study. The inclusion patient ids are listed in Supplementary Table S1.

Statistical Analysis
In Kaplan-Meier survival analysis, the "surv_cutpoint" function in survminer package was applied to determine the optimal cutoff value of risk scores. Note that the amount of samples in any group could not be less than 30% of the total samples. According to the optimal cutoff value, BC patients were divided into high− and low−risk groups to perform survival analysis. The time−dependent receiver operating characteristic (ROC) analysis was performed to investigate the prognostic accuracy of the six−lncRNA signature using the survivalROC package. After the ssGSEA method calculated the enrichment scores of 24 immune cell types (Bindea et al., 2013), Pearson correlation analysis was performed to evaluate the relationship between risk scores and infiltrating immune cells. Logistic regression analysis was carried out to evaluate whether combination of the six lncRNAs have higher predictive accuracy for infiltrating status of CD8 T-cells than each predictor alone. Differences with p < 0.05 were considered significant.

Construction of Immune-Related LncRNAs Prognostic Signature
A flow chart described our analysis procedure was presented in Figure 1A. In our study, 327 samples sourced from GSE20685 served as the training set. A total of 313 immune−related genes were obtained from the Molecular Signatures Database, which were associated with immune response and immune system process (Supplementary Table S2). Next, 82 immune−related lncRNAs were extracted using Pearson correlation analysis (|r| > 0.5, p < 0.001) (Supplementary Table S3). Univariate Cox regression analysis and Kaplan-Meier method was separately performed to identify survival-related lncRNAs (p < 0.05), of which 13 were overlapped (Figure 1B and Supplementary  Table S4). After primary filtration, Lasso penalized Cox regression analysis following stepwise multivariate Cox regression analysis was carried out to sort out six survival-related lncRNAs and developed a prognostic signature (Figures 1C,D;  Supplementary Table S5). Moreover, a regulatory network Frontiers in Genetics | www.frontiersin.org associated with 6 lncRNAs and 86 signature-related immune genes was established and visualized using Cytoscape3.7.1 (Supplementary Figure S1). Table S6), we constructed an immune signature via the ssGSEA method. The enrichment scores of the immune   Supplementary Table S7. Subsequently, random forest (RF) and logistic regression (LR) revealed that the prognostic signature could significantly stratify two immune subgroups on the basis of the median enrichment score (LR, AUC score = 0.919, RF, AUC score = 0.898) ( Figure 1E). Moreover, the interplay among immune signature, prognostic signature, and six lncRNAs was evaluated by Pearson correlation analysis (Figure 1F; Supplementary Table S8).

Based on 86 model-related immune genes (Supplementary
We observed that the six-lncRNA signature was negatively associated with the immune signature (r = -0.342, p = 1.99E-10). In the prognostic signature, AC116366.1, AC244502.1, AC100810.1, MIAT, and AC093297.2 acted as the protective factors with an HR value less than 1, and AL356417.2 acted as the deleterious factor with an HR value greater than 1 ( Table 1).

Primary Evaluation of the Six-LncRNA Signature in the Training Set
Overall survival curves of six immune-related lncRNAs are shown in Figure 2A. Risk scores, survival status, OS period, and six-lncRNA expression level of each patient are presented in Figure 2B. Moreover, patients in the low−risk group showed increased survival time in contrast to those in the high−risk group ( Figure 2C). Univariate following multivariate Cox regression analysis was conducted to identify that N-stage and the six-lncRNA signature functioned as independent prognostic factors (Supplementary Figures S2, S3). The predictive accuracy of the six-lncRNA signature was assessed by time−dependent ROC curves at 1, 3, and 5 years on the basis of AUC scores. The AUC scores at 1, 3, and 5 years were 0.88, 0.758, and 0.725, respectively, which indicated that the prognostic signature had a moderate sensitivity and specificity in the training set ( Figure 2D).

Further Evaluation of the Six-LncRNA Signature in Validation Sets
To further evaluate the predictive efficacy of the six−lncRNA signature, three independent external datasets (GSE21653, GSE88770, and TCGA) were used to validate our findings. Kaplan-Meier analysis showed that patients in the low-risk group had evidently longer survival time than those in the high-risk group in three validation sets (Figures 3A-C), which was consistent with the results from the training set.
The ROC curves at 1, 3, and 5 years indicated that the signature also had a good predictive performance in validation sets (Figures 3D-F).

Construction and Evaluation of Nomogram
To enhance the predictive accuracy, four clinical features and the six-lncRNA signature were integrated and a nomogram was constructed on the basis of the training set ( Figure 4A). The calibration curves presented a good agreement between prediction by nomogram and actual observation for predicting survival probability at 3 and 5 years in training and validation datasets (Figures 4B-E).

Immune Status Analysis for High-and Low-Risk Groups
In the following study, GSEA was applied to show that immune system process and immune response gene sets were significantly enriched in the low−risk group (Figures 5A,B). The ssGSEA followed by Pearson correlation analysis was conducted to validate the relationships between risk scores and infiltration immune cells. As presented in Figure 5C, risk scores had the strongest correlation with CD8 T cells compared to other types of immune cells. Based on the median risk score, BC patients in the training set were stratified into two subgroups, high-and low-risk groups. High-risk cohorts displayed lower enriched levels of the immune signature and CD8 T cells compared to low-risk cohorts ( Figure 5D). Meanwhile, the immune signature was strikingly positively related with CD8 T cells on the basis of Pearson correlation analysis (r = 0.72, p = 1.54E-53; Supplementary Figure S4). Survival curve revealed that infiltration of CD8 T cells was associated with a favorable prognosis in BC patients ( Figure 5E).
In training and validation sets, risk scores were negatively related with enrichment scores of CD8 T cells on the basis of Pearson correlation analysis (Figures 6A,D). To further identify predictive efficacy of the signature for infiltration of CD8 T cells, the patients of training and validation sets were divided into highand low-infiltration groups according to the median enrichment score of CD8 T cells. Logistic regression analysis showed that the combination of six lncRNAs had higher predictive efficacy than each predictor alone (Figures 7A-D).

Construction of Weighted Gene Co-expression Network and Identification of Key Modules
In the following study, we applied the WGCNA method to identify non-preservation modules, whose characteristics could  distinguish the test network from the reference network. Firstly, we confirmed a soft threshold power value of 16 as best optimum index in the high-risk group (Supplementary Figure S5) and 2 as the best optimum index in the low-risk group (Supplementary Figure S6). Next, we identified 6 modules in the high-risk group and 22 modules in the low-risk group except for the gray module via hierarchical clustering. Finally, the module preservation function in the WGCNA package was applied to identify three non-preservation modules, the light green module (Z-score = 8.2), the midnight blue module (Z-score = 9.8) and the tan module (Z-score = 6.0) ( Figure 8A).
To further investigate the biological mechanisms of nonpreservation modules, we mapped internal genes of nonpreservation modules to KEGG pathways using the DAVID tool (version 6.8) (Huang et al., 2007). Enriched KEGG pathways with p values < 0.05 were extracted. KEGG pathway-enrichment FIGURE 8 | Construction of weighted co-expression network and identification of key modules. (A) Each module is represented by its color-code and name. Left plot shows the preservation median rank, which tends to be independent from module size with high median ranks indicating low preservation. Right plot shows Z-score value. The green dotted line indicates that Z-score is equal to 10. Z-score < 10 represents weak preservation. (B) Enriched KEGG pathways with p-values < 0.05 were presented. KME value (eigengene connectivity) of module internal hub genes was indicated by dots ' size, respectively. analysis demonstrated that the midnight blue module was mainly involved in protein processing in endoplasmic reticulum and antigen processing and presentation; the light green module was mainly involved in sphingolipid signaling pathway, hippo signaling pathway, and tight junction; and the tan module was mainly involved in circadian rhythm ( Figure 8B). In summary, the midnight blue module could be a crucial module, whose disorganized functions affected immune cell infiltration and contribute to shorter survival time in patients with high risk scores.

DISCUSSION
Infiltration of immune cells in TME has dual roles on cancer progression and patient prognosis, and deciphering their complex interplay is of seminal importance for exploring novel prognostic markers and therapeutic targets in cancer patients (Fridman et al., 2011;Klemm and Joyce, 2015). LncRNAs have been reported to function as key regulators in immune response, inflammation, and distinct immune cell types (Das et al., 2018;Li et al., 2018;Wang et al., 2018;Xu and Cao, 2019), and play crucial roles in cancer malignant progression (Bartonicek et al., 2016;Jiang et al., 2016;Schmitt and Chang, 2016). Therefore, insight into immune-related lncRNAs could bring profound effects on cancer-related therapeutics and survival prognosis prediction (Chen et al., 2017).
Since the lncRNA expression level and biological functions were markedly associated with immune-related genes (Roux et al., 2017), we employed immune-associated gene sets to screen out highly immune-related lncRNAs by co-expression analysis. Subsequently, a series of bioinformatics analyses were integrated to identify a six-immune-related lncRNA signature, which could well stratify patients into favorable or unfavorable prognoses. Moreover, the six-lncRNA signature was considered as an independent risk factor by multivariate Cox regression analysis. To further improve the predictive accuracy, a nomogram consisting of age, TMN stage, and the six-lncRNA signature was constructed. A good performance of the nomogram was identified by the calibration curve. Meanwhile, our findings were validated in multiple BC datasets. Additionally, functional annotation conducted by GSEA revealed that the immunerelated gene sets, the immune system process, and immune response were significantly activated in the low-risk group.
Previous studies have shown that different types of infiltrating immune cells have distinct effects on tumor progression (Fridman et al., 2012;Bindea et al., 2013). In order to investigate the association between infiltration immune cells and the lncRNAs signature, the ssGSEA method was used to calculate enrichment scores of 24 immune cell types (Bindea et al., 2013), and then Pearson correlation analysis was conducted to evaluate their correlations. Our integrated analyses indicated that risk scores were significantly associated with the enrichment level of CD8 T cells. In addition, the combination of six lncRNAs had higher prognostic accuracy for CD8 T cell infiltrating status than each predictor alone, indicating that the signature could provide promising immunotherapeutic targets for the treatment of BC. CD8 T cells function as a tumor suppressor and play crucial roles in cancer progression (Barry and Bleackley, 2002;Vesely et al., 2011;Jansen et al., 2019). In BC, previous studies have demonstrated that CD8 T cells are the primary effector immune cells and associated with favorable clinical outcomes of BC patients (Mahmoud et al., 2011). Taken together, dysregulation of CD8 T cells might be a key mechanism associated with a higher mortality of patients in the high-risk group.
To further provide an interpretation of biological mechanisms for the six-lncRNA signature, we performed WGCNA analysis to explore non-preservative modules in the high-risk group compared with the low-risk group. Subsequent KEGG pathwayenrichment analysis demonstrated that the midnight blue module was strongly associated with immune-related pathways mainly involved in protein processing in endoplasmic reticulum and antigen processing and presentation. In the immune system, the antigen presentation pathway plays central roles for the immune surveillance and elimination of malignant transformations (Hu et al., 2019). Nascent malignant cells may reduce antigenicity so that anti-tumor lymphocytes fail to detect transformed cells to escape immune surveillance (Hu et al., 2019). Endoplasmic reticulum (ER) could process and present mutation-derived tumor-related nascent antigens to CD8 T cells, which contribute to tumor suppression (Johnsen et al., 1999;Hu et al., 2019). In our study, the antigen presentation pathway involved in protein processing in endoplasmic reticulum and antigen processing and presentation was disorganized in the highrisk group, which might promote tumor immune evasion and associate with poor clinical outcomes in the patients with high risk scores.
In the study, we integrated bioinformatics analyses to provide a solid theoretical basis for potential value of the six-lncRNA signature. However, in the process of public database mining, some unknown factors are usually difficult to extrapolate. In subsequent studies, we will further demonstrate and improve our findings by more rigorous experimental evidence in immunerelated lncRNAs associated with BC.

CONCLUSION
In conclusion, we integrated informatics analyses to identify a six immune-related lncRNAs signature. Moreover, CD8 T cell infiltration was significantly activated in the low-risk group, which could contribute to a better prognosis of BC patients. Our findings aim to help improve clinical outcome prediction and provide promising immunotherapeutic targets for BC patients.