Identification of a Six-lncRNA Signature With Prognostic Value for Breast Cancer Patients

Breast cancer (BRCA) is the most common cancer and a major cause of death in women. Long non-coding RNAs (lncRNAs) are emerging as key regulators and have been implicated in carcinogenesis and prognosis. In this study, we aimed to develop a lncRNA signature of BRCA patients to improve risk stratification. In the training cohort (GSE21653, n = 232), 17 lncRNAs were identified by univariate Cox proportional hazards regression, which were significantly associated with patients’ survival. The least absolute shrinkage and selection operator-penalized Cox proportional hazards regression analysis was used to identify a six-lncRNA signature. According to the median of the signature risk score, patients were divided into a high-risk group and a low-risk group with significant disease-free survival differences in the training cohort. A similar phenomenon was observed in validation cohorts (GSE42568, n = 101; GSE20711, n = 87). The six-lncRNA signature remained as independent prognostic factors after adjusting for clinical factors in these two cohorts. Furthermore, this signature significantly predicted the survival of grade III patients and estrogen receptor-positive patients. Furthermore, in another cohort (GSE19615, n = 115), the low-risk patients that were treated with tamoxifen therapy had longer disease-free survival than those who underwent no therapy. Overall, the six-lncRNA signature can be a potential prognostic tool used to predict disease-free survival of patients and to predict the benefits of tamoxifen treatment in BRCA, which will be helpful in guiding individualized treatments for BRCA patients.


INTRODUCTION
Breast cancer (BRCA) is the second leading cause of cancer death among women. More than 268,000 new patients are diagnosed with BRCA each year and 41,760 patients will die from BRCA (DeSantis et al., 2019;Siegel et al., 2019). The current treatment for BRCA, which can improve survival of BRCA patients, includes mastectomy, hormone therapy (Early Breast Cancer Trialists Collaborative et al., 2011), surgery with adjuvant radiation therapy (Bradley and Mendenhall, 2018;Chargari et al., 2019), and chemotherapy (Oikonomou et al., 2019). Immunotherapy of BRCA patients is a recent emerging area of treatment (Greenlee et al., 2017;Jia et al., 2017;Adams et al., 2019). Although the TNM stage system is a valuable resource for the classification of BRCA patients, it does not predict the prognosis of patients. Therefore, the molecular markers need to be identified so that the survival of BRCA patients can be evaluated (Giuliano et al., 2017;Zhang et al., 2017).
Long non-coding RNAs (lncRNA, >200 nucleotides in length) are a class of non-coding RNAs transcribed from mammalian genomes . Some lncRNAs are found to be deregulated between cancer and normal tissues, such as BRCA (Liu et al., 2015), lung cancer (Jen et al., 2017), gastric cancer (Liu et al., 2017), and prostate cancer (Xu et al., 2018). Furthermore, lncRNAs have been confirmed to participate in diverse biological processes by acting as key regulators in cancers. Gupta et al. (2010) found that dysregulated HOTAIR increased cancer invasiveness and metastasis through dependence on PRC2, and lncRNA HOXD-AS1 regulated the Rho GTPase activating protein 11A (ARHGAP11A), which resulted in induced metastasis . In recent years, some lncRNAs have been found to be biomarkers of predicting BRCA patient outcomes, such as lncRNA BCYRN1 (Booy et al., 2017) and HOTAIR (Zhang et al., 2013a), which has attracted increasing attention.
In this study, we developed a six-lncRNA signature based on lncRNA expression, with the ability to predict disease-free survival of patients with BRCA, and we assessed its prognostic value in the training and validation cohorts. This signature had an independent prognostic value after adjusting for clinical factors. Furthermore, the lncRNA signature also significantly predicted survival of grade III and estrogen receptor (ER)-positive BRCA patients. Moreover, the signature predicted survival benefits of tamoxifen therapy in BRCA patients.

Study Samples
Breast cancer gene expression data generated by the Affymetrix HG-U133 Plus 2.0 microarray platform and corresponding clinical information were obtained from the publicly available GEO database 1 . To analyze the correlation of lncRNA expression with disease-free survival (DFS) for BRCA, we selected those data sets that included patients with survival status information. In total, 232 samples from GSE21653 (Sabatier et al., 2011a,b), 101 samples from GSE42568 (Clarke et al., 2013), and 87 samples from GSE20711 (Dedeurwaerder et al., 2011) were obtained. The GSE21653 data set was defined as the training cohort, and the GSE42568 and GSE20711 data sets were treated as the validation cohort. Another dataset, GSE19615 (n = 115) (Li et al., 2010), which contained 62 patients treated with tamoxifen, was obtained to validate the prognostic value of the signature for patients after hormone treatment. Detailed clinical information of patients with BRCA in this study is shown in Table 1.

Microarray Data Processing and lncRNA Re-annotation
All the raw microarray data (CEL files) of BRCA patients were downloaded from the GEO database and background adjusted and normalized using the Robust Multichip Average (RMA) algorithm (Irizarry et al., 2003a,b) and "Affy" package 1 https://www.ncbi.nlm.nih.gov/geo/ (Gautier et al., 2004). The probe sequences of Affymetrix HG-U133 Plus 2.0 array were downloaded from the Affymetrix website 2 and uniquely mapped to the human genome (hg19). Specific probes of lncRNAs were obtained by matching the chromosomal position of probes to the chromosomal position of lncRNA genes based on the annotations from GENCODE (release 23) according to the previous studies (Du et al., 2013;Zhou et al., 2015). When multiple probes were mapped to the same lncRNA, expression values of these probes were integrated using the median value to represent the expression value of the single lncRNA. As a result, 2,673 lncRNAs were obtained for further analysis.

Identification of a Survival-Related lncRNA Signature Set Associated With Breast Cancer
A univariate Cox proportional hazards regression analysis was carried out to evaluate the association between expression levels of lncRNAs and patients' disease-free survival in the training cohort. Only those lncRNAs with a p-value of <0.01 were considered statistically significant. We then conducted the least absolute shrinkage and selection operator (LASSO) penalized Cox proportional hazards regression analysis to select the prognostic markers of the above lncRNAs (Tibshirani, 1997;Zhang et al., 2013b). We created a risk-score formula by a linear combination of the expressions of these six lncRNAs, weighted by their respective Cox regression coefficients as follows (Zhang et al., 2012(Zhang et al., , 2013c: where N is the number of prognostic genes, Exp i is the expression value of the i gene, and Coef i is the estimated regression coefficient of the i gene in the univariate Cox regression analysis. Using the median signature risk score in each cohort as the cutoff point, BRCA patients in every cohort were divided into low-and high-risk groups.

Statistical Analysis
The association between the lncRNA gene expression and the patient's survival was assessed by univariable Cox regression analysis. LASSO logistic regression analysis was used to identify the lncRNAs comprising the prognostic signature with nonzero coefficients in the training cohort using "glmnet" package (Friedman et al., 2010). Kaplan-Meier survival analysis and the log-rank test were used to compare the difference in disease-free survival between the high-risk group and lowrisk group using the R package "survival." Furthermore, we used Cox multivariate analysis to test whether the lncRNA signature was independent of patient age and histological grade. Hazard ratio (HR) and 95% confidence intervals (CI) were estimated by the Cox proportional hazards regression model. The time-dependent receiver operating characteristic (ROC) curves were used to compare the prognostic accuracy of the six-lncRNA signature for survival. Statistical significance was defined as two-tailed p-values being less than 0.05. All of the statistical analyses were performed using R program 3.5.2 3 and Bioconductor.

RESULTS
Identifying a Six-lncRNA Signature in the Training Cohort As summarized in the workflow (Figure 1), we first performed an univariable Cox proportional hazards regression analysis to assess the association between lncRNA expression and disease-free survival of patients with BRCA in the training cohort. A set of 17 lncRNAs that were significantly correlated with patients' survival (p ≤ 0.01, Table 2) was identified. We found six lncRNAs (LINC00917, AL391840.1, TRIM52-AS1, AL355075.4, AC093802.2, and AC091544.4) to comprise a prognostic signature using a LASSO-penalized Cox proportional hazards regression analysis for the above 17 lncRNAs with optimal tuning parameters. All six lncRNAs have positive coefficients, which indicates that their high expressions are associated with shorter survival. Finally, we calculated the signature risk score based on a linear combination of the expression levels of six prognostic lncRNAs, weighted by the coefficients derived from the univariable Cox regression analysis as follows:

The Six-lncRNA Signature Predicts Disease-Free Survival of Patients With Breast Cancer
We calculated the six-lncRNA signature risk score for each patient in the training cohort (GSE21653, n = 232). The patients were divided into a high-risk group (n = 116) and a low-risk group (n = 116) using the median risk score as the cutoff. Compared with the low-risk patients, the high-risk patients had shorter disease-free survival (median survival 62.4 months vs greater than 200 months, HR = 1.67, 95% CI = 1.05-2.66, p = 0.028, Figure 2A). The prognostic value of the six-lncRNA signature was then evaluated in the validation cohort (GSE42568, n = 101). The signature classified patients into two groups, including a high-risk group (n = 50) and a low-risk group (n = 51), based on the median risk score. The disease-free survival of the high-risk group was significantly shorter than that of the low-risk group (median survival 69.7 months vs greater than 100 months, HR = 2, 95% CI = 1.09-3.66, p = 0.022, Figure 2B). Similarly, in another validated cohort (GSE20711, n = 87), the high-risk group still had a poorer prognosis than the low-risk group (median survival 77.8 months vs 122.5 months, HR = 1.54, 95% CI = 1.02-2.91, p = 0.040, Figure 2C). Next, we assessed whether the prognostic value of the six-lncRNA signature was independent of other clinical factors. We performed univariate and multivariate Cox proportional hazards regression analysis for factors, including age, ER status, histological grade, and the signature. In the training cohort, the high-risk six-lncRNA signature (HR = 1.789, 95% CI = 1.122-2.852, p = 0.015), grade III (HR = 3.174, 95% CI = 1.314-7.666, p = 0.010) and grade II (HR = 2.881, 95% CI = 1.181-7.028, p = 0.020) were significantly correlated with DFS of patients ( Table 3). We found that the signature (HR = 2.327, 95% CI = 1.256-4.311, p = 0.007) and ER status (HR = 0.472, 95% CI = 0.234-0.877, p = 0.017) significantly independently predicted patients' disease-free survival in the validation cohort GSE42568 (Table 3). Moreover, the six-lncRNA signature was also an independent prognostic factor associated with diseasefree survival in the GSE20711 dataset (HR = 1.631, 95% CI = 1.037-3.105, p = 0.043). These results indicate that the six-lncRNA signature is an independent prognostic factor for BRCA patients' disease-free survival.

The Six-lncRNA Signature Predicts Survival of Patients During Diverse BRCA Groups
We explored whether the six-lncRNA signature was effective for patients within different histological grades using a Kaplan-Meier survival analysis. For grade III patients, the signature significantly classified patients into two groups with distinctively different survival times (median survival 55.2 months vs greater than 150 months, HR = 2.39, 95% CI = 1.26-4.51, p = 0.0057, Figure 3A), including the high-risk group (n = 56) and the low-risk group (n = 61) in the training cohort. The signature showed a similar prognostic value for grade III patients in the validation cohort (median survival 25.2 months vs greater than 69.3 months, HR = 3.01 95% CI = 0.96-9.46, p = 0.048, Figure 3B). In grade I patients, there were no significant survival differences among the high-risk groups and the low-risk groups in two cohorts (Supplementary Figure S1A,B). A similar phenomenon was observed in grade II patients from the GSE21653 data set (Supplementary Figure S1C). However, in grade II patients from the GSE42568 data set, the high-risk and low-risk groups had significant survival differences (HR = 5.29, 95% CI = 1.17-23.9, p = 0.015, Supplementary Figure S1D).
Furthermore, Kaplan-Meier survival analysis was performed after patient stratification according to ER status. The ERpositive patients were divided into high-risk and low-risk groups. The high-risk ER-positive patients had shorter diseasefree survival than low-risk ER-positive patients in the training cohort (HR = 1.77, 95% CI = 0.93-3.38, p = 0.078, Figure 4A) and the validation cohort (HR = 3.32, 95% CI = 1.31-8.38, p = 0.0072, Figure 4B). There were no significant survival differences between the high-risk and low-risk ER-negative patients in these two cohorts when using the same risk formula (Supplementary Figure S2).

The Six-lncRNA Signature Predicts Patient Outcome After Tamoxifen Therapy
We further tested whether the six-lncRNA was useful to guide therapy in an independent cohort (GSE19615). In this cohort, FIGURE 2 | Kaplan-Meier survival curves of disease-free survival between high-risk (red) and low-risk (blue) patients in the (A) training cohort (GSE21653, n = 232), and the (B,C) validation cohort (GSE42568, n = 101; GSE20711, n = 87). The differences between the two curves were determined by the two-sided log-rank test. The number of patients at risk is listed below the survival curves. HR, hazard ratio. there were 62 patients who received tamoxifen therapy and 47 who did not. We classified each patient into high-and lowrisk groups based on the lncRNA signature risk score. Among the 58 low-risk patients, tamoxifen treatment could prolong the disease-free survival of these patients (HR = 0.08, 95% CI = 0.01-0.62, p = 0.0018, Figure 5A), while there were no significant survival differences between patients with and without tamoxifen therapy in the high-risk group (Figure 5B). This  result revealed that tamoxifen treatment was only beneficial for low-risk BRCA patients.

Comparison of the Survival Prediction Power Between Clinical Factors and the Six-lncRNA Signature
To compare the sensitivity and specificity in survival prediction between clinical factors (histological grade and ER status) and the six-lncRNA signature, we performed a time-dependent ROC analysis in the training cohort. We also constructed a prognostic model by combining our signature with histological grade or ER status. There were no significant differences between histological grade and the lncRNA signature (p = 0.171). A similar result was found between the signature and ER status (p = 0.997). Moreover, for the histological grade, we observed that the histologic grade combined with the six-lncRNA signature (AUC = 0.73) had a higher area under the ROC curve than the histological grade alone (AUC = 0.68, Figure 6A). The six-lncRNA signature could also improve the prognostic accuracy of the ER status (0.63 vs 0.59, Figure 6B). In addition, for further clinical utility, we constructed a full clinical prognostic model by combining  all clinical factors including age, histological grade, and ER status. After adding the six-lncRNA signature into the clinical prognostic model, the prediction accuracy of the model was effectively improved (0.74 vs 0.69, Figure 6C). These results suggest that our six-lncRNA signature can add a complementary value to known clinical factors.

DISCUSSION
In the current study, we developed and validated a prognostic six-lncRNA signature based on lncRNA expression, which stratified BRCA patients into two groups (high-risk group and low-risk group) with different disease-free survival. We demonstrated that this signature could predict the survival of grade III BRCA patients. The ER-positive patients who were classified as the lowrisk group achieved better survival benefits. Furthermore, by using this signature, we can find a subgroup of patients who are likely to benefit from tamoxifen therapy. In sum, the six-lncRNA signature for BRCA patients may be a prognostic tool that is helpful in guiding individualized treatment of patients. Histological classification of BRCA into grades I, II, and III, determines the treatment of BRCA patients (Cortes et al., 2012;Harris et al., 2016;Waks and Winer, 2019). The tumor cells of grade III cancer tend to grow more quickly and look different from normal breast cells (Wani et al., 2010). We observed that the six-lncRNA signature significantly predicted the survival of grade III BRCA patients. This finding suggests that this lncRNA signature predicted survival in patients with invasive cancer. In addition, we found that high-risk ER-positive BRCA patients had shorter disease-free survival than low-risk ER-negative patients. Some studies have confirmed that ER is an essential predictor for responding to therapy, such as tamoxifen therapy, in metastatic BRCA (Fisher et al., 1997).
Given the heterogeneity of cancer, reliable prognostic biomarkers are needed to identify patients who can benefit from therapy Zhang et al., 2018). There is growing research on several gene signatures to improve decision-making and individualization of BRCA therapy (Cronin et al., 2007;Cardoso et al., 2016;Yu et al., 2019). However, it is difficult to apply all of them for clinical management. Our prognostic signature could identify a group of patients at low risk, where the use of tamoxifen therapy led to significantly extended diseasefree survival. This suggests that our signature may hold special clinical value by separating responders to tamoxifen treatment, from non-responders, independent of pathological stage. Such separation could spare non-responders from therapy that is not beneficial and could promote the exploration of more effective therapeutic regimens.
The six-lncRNA (LINC00917, AL391840.1, TRIM52-AS1, AL355075.4, AC093802.2, and AC091544.4) signature in BRCA suggests that lncRNAs can be used as prognostic factors for the survival of patients. To avoid the influence of proteincoding genes, we annotated these probes with protein-coding genes, and found that only one lncRNA overlapped with proteincoding gene RPPH1. This gene had no predictive performance for survival, whether by itself or in combination with other lncRNAs (p = 0.39 and 0.16 respectively, Supplementary Figure S3). In addition, among these lncRNAs, TRIM52-AS1 was dominantly up-regulated in triple-negative breast cancer (TNBC) tissues compared to non-TNBC tissues by a RT-PCR (Lv et al., 2016). Moreover, another study found that the overexpression of TRIM52-AS1 suppressed cell migration and proliferation and induced cell apoptosis in renal cell carcinoma (Liu et al., 2016). However, these six lncRNAs have not been studied in BRCA. Thus, this is a novel study on the association between lncRNA expression and the disease-free survival of patients with BRCA.
Although the signature demonstrated an accurate survival prediction, several limitations should be noted. Because the sample size of our study was limited, large-scale cohort studies should be performed to investigate the prognostic value of this six-lncRNA signature. In addition, we only used a bioinformatics method to predict the six-lncRNA signature in BRCA, thus, further in vitro or in vivo experiments need to be conducted. Third, we investigated the efficacy of tamoxifen therapy in a low-risk BRCA group, thus more examinations are required to evaluate its efficacy and safety.
In conclusion, the six-lncRNA signature that we identified predicted the disease-free survival of patients with BRCA. This signature also predicted the survival of grade III and ER-positive patients. Furthermore, our findings revealed that the six-lncRNA signature could predict the benefits to patients treated with tamoxifen therapy. Further validation studies are needed to test the prognostic power of this signature before it is used clinically.

DATA AVAILABILITY STATEMENT
All datasets used in this study were publicly available from the GEO database (https://www.ncbi.nlm.nih.gov/geo) at accession numbers GSE21653, GSE42568, GSE20711, and GSE19615.

AUTHOR CONTRIBUTIONS
JH conceived the idea and conceptualized the study. EZ, YL, FQ, and JX conducted the bioinformatics analysis and interpreted the results. EZ and JX collected and pre-processed data. EZ, YL, SA, and LW generated the figures and tables. EZ and YL wrote the manuscript. JH and JX supervised the whole study process and revised the manuscript. All authors have read and approved the final version of manuscript.