A Clinicogenetic Prognostic Classifier for Prediction of Recurrence and Survival in Asian Breast Cancer Patients

Background Several prognostic factors affect the recurrence of breast cancer in patients who undergo mastectomy. Assays of the expression profiles of multiple genes increase the probability of detecting overexpression of certain genes and thus can potentially characterize the risk of metastasis. Methods We propose a 20-gene classifier for predicting patients with high/low risk of recurrence within 5 years. Gene expression levels from a quantitative PCR assay were used to screen 473 luminal breast cancer patients treated at Taiwan Hospital (positive for estrogen and progesterone receptors, negative for human epidermal growth factor receptor 2). Gene expression scores, along with clinical information (age, tumor stage, and nodal stage), were evaluated for risk prediction. The classifier could correctly predict patients with and without relapse (logistic regression, P<0.05). Results A Cox proportional hazards regression analysis showed that the 20-gene panel was prognostic with hazard ratios of 5.63 (95% confidence interval 2.77-11.5, univariate) and 5.56 (2.62-11.8, multivariate) for the “genetic” model, and of 8.02 (3.52-18.3, univariate) and 19.8 (5.96-65.87, multivariate) for the “clinicogenetic” model during a 5-year follow-up. Conclusions The proposed 20-gene classifier can successfully separate patients into two risk groups, and the two risk groups had significantly different relapse rates and prognoses. This 20-gene classifier can provide a better estimation of prognosis, which can help physicians make better personalized treatment plans.


INTRODUCTION
Breast cancer is a leading cause of death in women around the world (1). Surgical options for breast cancer treatment include partial mastectomy with sentinel lymph node biopsy/axillary lymph node dissection and radiation therapy or modified radical mastectomy. Complete surgical resection is the gold standard in breast cancer treatment (2); however, there exists a wide variation in prognosis, and patterns of recurrence vary extensively in survivors (3,4). Recurrence can be local (in the same breast or in the surgery scar), regional (in nearby lymph nodes), or distant (metastasis). Patients who do not experience recurrence within 5 years usually enjoy a relatively low risk of recurrence (5). However, late recurrence may occur after a 5-year time span, and potential risk factors include the patient's age, stage at diagnosis, hormone receptor status, genetic variants, and lymph node involvement (6). Breast cancer survivors with luminal type tumors (i.e., estrogen receptor positive [ER+], progesterone receptor positive [PR+], and human epidermal growth factor receptor 2 negative [HER2-]) are at a higher risk of late recurrence (7).
Breast cancer has various histopathological features and diverse responses to systemic treatment. Clinicopathological variables such as tumor size, lymph node metastasis, histological grade, ER and PR expression, and HER2 status (also known as ERBB2), are prognostic and thereby drive decision making for breast cancer treatment (8). However, they are not sufficient for implementation of individualized therapy. In fact, about 60% of early stage breast cancer patients still receive adjuvant chemotherapy, of which only a small proportion, 2-15% of them, will derive benefit, while all will suffer an increased risk of side effects. Breast cancer is a polygenic disorder, and a complex interplay of genetic factors governs the etiology and evolution of the disease (9). Therefore, to enhance the understanding of breast cancer heterogeneity at the molecular level and to optimize and individualize treatment, gene expression profiling (10) has emerged as an important prognostic indicator and has been extensively studied by breast cancer researchers. The findings can potentially guide treatment in women with early stage breast cancer and are embraced by clinical oncologists in their daily practice.
Prognostic multi-gene expression assays such as Oncotype DX (11), EndoPredict (12), and RecurIndex (13-15) are used in breast cancer to estimate the risk of recurrence after surgery and endocrine therapy and to determine the necessity of chemotherapy. Estimating distant recurrence risk among women with ER+/HER2- early breast cancer helps with decisions on using adjuvant chemotherapy. The most widely used test is Oncotype DX, which reports a recurrence score based on 21 genes that predict the risk of distant recurrence for patients who are node-negative. The EndoPredict assay combines the expression of 3 proliferative and 5 ER-signaling/differentiation-associated genes and provides a risk score that ranges between 0 and 15 (16). RecurIndex integrates information from recurrence-relevant genes in Asian patients and clinical factors to predict the 5-year risk of local recurrence and distant metastasis. The test results can serve as an important reference in the determination of appropriate treatment.
Despite extensive racial and geographic variations in breast cancer incidence, progression, presentation, and outcomes, studies on risk factors in relation to tumor subtypes and survival have mainly been conducted in Caucasians (17,18). Meanwhile, the incidence of breast cancer is continuously increasing in Asia (19). We have used a 34-gene and an 18-gene classifier to conduct risk stratification of Asian breast cancer patients regarding loco-regional recurrence post-mastectomy on a microarray platform (13-15). In addition to gene expression, regional lymph node status and pathological stage contribute to this risk (20). Understanding the therapeutic consequences of a previously identified gene, whose expression correlates with outcomes in a heterogeneous group of primary breast cancer patients, is vital. Racial differences resulting from genetic and biological factors might impact disease incidence and prognosis. Hence, in this study we conducted genomic profiling of Asian breast cancer patients to predict the risk of relapse within 5 years of surgery. The primary purpose of this study was to assess the clinical utility of a 20-gene classifier model in stratifying women with breast cancer into distinct risk groups to predict 5-year recurrence.

Study Population
The Amwise data set (Amwise Diagnostics PTE. LTD) comprised breast cancer patients from 8 hospitals in Taiwan, including China Medical University Hospital-Radiation Oncology, MacKay Memorial Hospital, National Taiwan University Hospital, Taiwan Adventist Hospital, Taipei Veterans General Hospital, China Medical University Hospital-Surgery, Chia-Yi Christian Hospital, and Cheng Hsin General Hospital. All patients in the Amwise database underwent breast-conserving surgery or mastectomy. Informed consent was obtained from all patients. The Institutional Review Board of each participating medical center approved the study protocol, and this approval covered all patients eligible for the study. Patients enrolled in the study were of luminal type (ER+/PR+/HER2-). Patients with (i) T4 or N3 disease, (ii) pre-operative chemotherapy or radiotherapy, (iii) distant metastasis at initial presentation, or (iv) inadequate formalin-fixed, paraffin-embedded (FFPE) tumor samples were excluded. Patients with missing clinical or genetic data were also excluded. Figure 1 illustrates the modeling workflow. The proposed 20-gene classifier was used to stratify patients into high risk and low risk groups based on a cut-off determined by receiver operating characteristic analysis. The model based on the 20-gene signature is referred to as the "genetic" model. We further evaluated the discriminatory ability of the 20-gene classifier (13) along with clinical factors such as age at surgery, tumor stage (T1, T2, T3), and nodal stage (N0, N1, N2) to predict 5-year survival. The risk assessment model based on both genetic and clinical predictors is referred to as the "clinicogenetic" model. The whole population was analyzed by both models.
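The text states that the cut-off came from receiver operating characteristic analysis but does not name the selection criterion. As an illustration only, the sketch below assumes one common choice, Youden's J (sensitivity + specificity - 1), and assumes that a patient is called high risk when the score is at or above the cut-off. The function name, the toy scores, and the labels are hypothetical and are not taken from the study.

```python
def youden_cutoff(scores, relapsed):
    """Return (cutoff, sensitivity, specificity) maximizing Youden's J.

    Assumed convention: score >= cutoff means "high risk".
    relapsed[i] is 1 if patient i relapsed, 0 otherwise.
    """
    positives = sum(relapsed)
    negatives = len(relapsed) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one relapsed and one relapse-free patient")

    best = None
    for cutoff in sorted(set(scores)):
        # True positives: relapsed patients flagged as high risk.
        tp = sum(1 for s, r in zip(scores, relapsed) if r and s >= cutoff)
        # True negatives: relapse-free patients flagged as low risk.
        tn = sum(1 for s, r in zip(scores, relapsed) if not r and s < cutoff)
        sens = tp / positives
        spec = tn / negatives
        j = sens + spec - 1
        # Strict comparison keeps the lowest cutoff when several tie.
        if best is None or j > best[0]:
            best = (j, cutoff, sens, spec)
    return best[1], best[2], best[3]


# Hypothetical risk scores and outcomes, for illustration only.
scores = [0.05, 0.08, 0.10, 0.12, 0.20, 0.30, 0.40, 0.60]
relapsed = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff, sensitivity, specificity = youden_cutoff(scores, relapsed)
print(cutoff, sensitivity, specificity)
```

Other criteria (for example, fixing a minimum sensitivity) would give a different cut-off from the same curve, so the choice of criterion matters when comparing thresholds across models.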

20-Gene Classifier
The 20-gene panel consists of BLM, BUB1B, CCR1, CKAP5, CLCA2, DDX39, DTX2, ERBB2, ESR1, MKI67, OBSL1, PGR, PHACTR2, PIM1, PTI1, RCHY1, SF3B5, STIL, TPX2, and YWHAB along with 3 housekeeping genes, ACTB, RPLP0, and TFRC. In previous studies, we performed LASSO regression to identify the best combination of genes (13) that now constitute the 20-gene signature of this study. However, all previous studies were conducted on the microarray platform (13-15), whereas in this study we aimed to transfer the assay to a PCR platform. Therefore, we reanalyzed the gene set to obtain the best gene combinations with the approaches summarized in Figure 1. Considering the operation time and cost, we investigated one patient per PCR plate. All 23 genes were simultaneously measured in different wells. More specifically, we put primer pairs of the target genes into the 96-well plates and performed reverse-transcriptase (RT) quantitative polymerase chain reaction (qPCR) using the total RNA isolated from the FFPE tumor tissues. The experimental platform was the ABI-7500Fast real-time PCR system. Quantitative PCR was used to measure the expression of each of the 23 genes in the FFPE samples. Normalized gene expression was calculated as ΔCT = 25 - CT (gene of interest) + CT (mean of housekeeping genes). RNA was extracted from FFPE tissue sections (5-10 μm in thickness) with the RNeasy FFPE Kit (Qiagen, Valencia, CA, USA). The extracted RNA was stored at -80°C until use after the concentration was determined by OD with a Nanodrop spectrophotometer (Agilent RNA 6000 Nano kit, Agilent Technologies, Santa Clara, CA, USA). A total of 2 μg RNA was used for RT-PCR using the RT² First Strand and RT² SYBR Green ROX qPCR MM kits (Qiagen, Valencia, CA, USA). Briefly, the RT reaction was performed at 42°C for 15 min before the reaction was terminated at 95°C for 5 min. PCR was performed on the ABI7500Fast instrument (Thermo Fisher, CA, USA) using the Standard mode with 40 cycles at 95°C for 15 sec and 60°C for 45 sec.
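The normalization formula can be made concrete with a short sketch. It assumes that "mean of housekeeping genes" refers to the arithmetic mean of the CT values of ACTB, RPLP0, and TFRC measured on the same plate; the function name and the CT values are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean

HOUSEKEEPING = ("ACTB", "RPLP0", "TFRC")


def normalize_ct(raw_ct, housekeeping=HOUSEKEEPING, offset=25.0):
    """Apply delta CT = offset - CT(gene) + mean CT(housekeeping genes).

    raw_ct maps gene symbol -> raw CT value for one patient (one plate).
    Only target genes are returned; housekeeping genes are used as reference.
    """
    hk_mean = mean(raw_ct[gene] for gene in housekeeping)
    return {
        gene: offset - ct + hk_mean
        for gene, ct in raw_ct.items()
        if gene not in housekeeping
    }


# Hypothetical CT values for one patient, for illustration only.
raw_ct = {"ACTB": 20.0, "RPLP0": 22.0, "TFRC": 24.0, "MKI67": 28.0, "ESR1": 23.0}
normalized = normalize_ct(raw_ct)
print(normalized)  # {'MKI67': 19.0, 'ESR1': 24.0}
```

With this sign convention a lower raw CT (more transcript) gives a higher normalized value, so the normalized score increases with expression.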

Model Training and Validation
The genetic and clinicogenetic models were built with a leave-one-out cross-validation (LOOCV) strategy. Logistic regression with a logit link was used for a binary response (Y=0 for recurrence and Y=1 for no recurrence), while X was the vector of predictors. The predictors for the genetic model were the expression profiles of the 20-gene panel, and the predictors for the clinicogenetic model were the 20-gene panel, age at surgery, nodal stage, and tumor stage. The best-fit model was obtained with the glm.fit() function in R using all n samples, and LOOCV was used to internally validate the model. In each LOOCV iteration, n-1 samples were used to train the model while the remaining 1 sample was used for testing. This process was repeated n times, so that each sample was held out once, to calculate the accuracy.
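The LOOCV procedure described above can be sketched as follows. The study fitted its models in R; this Python version is only an illustration of the hold-one-out loop, with a small gradient-descent logistic regression standing in for the fitting step. All function names, the learning rate, the number of steps, the 0.5 decision threshold, and the one-predictor toy data are assumptions, not details from the study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, steps=500):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(model, xi, threshold=0.5):
    w, b = model
    prob = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
    return 1 if prob >= threshold else 0


def loocv_accuracy(X, y):
    """Hold out each sample once, train on the other n-1, score the held-out one."""
    hits = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit_logistic(X_train, y_train)
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)


# Hypothetical, well-separated toy data with a single predictor.
X = [[-4.0], [-3.0], [-2.0], [2.0], [3.0], [4.0]]
y = [0, 0, 0, 1, 1, 1]
accuracy = loocv_accuracy(X, y)
print(accuracy)
```

Because every sample is held out exactly once, the procedure is deterministic: no random partitioning is involved, and the reported accuracy is the fraction of the n held-out predictions that were correct.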

Recurrence Analysis
Prediction of the risk of 5-year recurrence was evaluated independently for both the genetic and the clinicogenetic model. Subjects were excluded from the recurrence study if (i) they had no follow-up information or (ii) they reported recurrence before the surgery date. Prediction of recurrence during the 5-year follow-up period in patients who underwent breast conserving surgery or mastectomy was conducted.

Statistical Analysis
Recurrence analyses using univariate and multivariable Cox proportional hazards regression models were conducted on 5-year follow-up data for breast cancer patients that underwent surgery. Survival prediction for patients with the indicated risk classification (high or low) was done based on clinicogenetic factors such as age at diagnosis, lymph node stage, tumor stage, and the 20-gene panel. R packages survminer and survival were used to conduct all survival analyses.
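The survival analyses were run with R packages; as a language-neutral illustration of what a survival curve for one risk group represents, the sketch below implements the standard Kaplan-Meier product-limit estimator in plain Python. The function name and the follow-up times are hypothetical, and this is not the code used in the study.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times[i]  : follow-up time of patient i (e.g., years since surgery)
    events[i] : 1 if relapse was observed at times[i], 0 if censored
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = times[order[i]]
        relapses = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(order) and times[order[i]] == t:
            relapses += events[order[i]]
            leaving += 1
            i += 1
        if relapses:
            # Product-limit step: multiply by the fraction still relapse-free.
            survival *= 1.0 - relapses / at_risk
            curve.append((t, survival))
        # Censored and relapsed patients both leave the risk set.
        at_risk -= leaving
    return curve


# Hypothetical follow-up data for four patients, for illustration only.
times = [1, 2, 3, 4]
events = [1, 0, 1, 1]
curve = kaplan_meier(times, events)
print(curve)  # [(1, 0.75), (3, 0.375), (4, 0.0)]
```

The censored patient at time 2 does not lower the curve but does shrink the risk set, which is why the next relapse at time 3 halves the survival estimate.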

Patient Demographics
A total of 473 patients with luminal type breast cancer who underwent modified radical mastectomy or breast-conserving surgery were included in this study (Figure 1). A total of 100 patients and 119 patients were excluded, respectively, from the genetic model and the clinicogenetic model due to missing data. Finally, 373 patients were used for the genetic model building and 354 for the clinicogenetic model. To determine the recurrence and survival rates of the patients, 5-year follow-up studies were conducted on a total of 370 patients (genetic model) and 351 patients (clinicogenetic model) after censoring 3 patients from each analysis (for recurrence before the surgery date). The patients were classified as high risk and low risk based on cut-offs of risk scores 0.155 and 0.135 for the genetic and clinicogenetic models, respectively. Table 1

Performance of the Models
The final trained genetic model that was used to predict recurrence in breast cancer patients is shown below. (Table S1(a)). The positive predictive value (PPV) was 25.2% and the negative predictive value (NPV) was 95.0%. Similarly, the quality of the clinicogenetic model was judged through the LOOCV accuracy of 73.7%, with sensitivity, specificity, PPV, and NPV of 85.0%, 72.3%, 28.1%, and 97.4%, respectively (Table S1(b)).

5-Year Follow-Up Analysis of Recurrence
The demographic and clinical details of the patient samples included in the 5-year follow-up analysis for the genetic model are reported in Table 2. A total of 370 samples were used in the 5-year recurrence study, in which 130 patients were classified as high risk and 240 as low risk. High risk patients had a mean age of 52.29 years, and 32 (24.6%) relapsed within 5 years. Low risk patients had a mean age of 53.58 years, and 10 (4.2%) relapsed within 5 years. The results for the clinicogenetic model agree with the findings from the genetic model. Table 3 summarizes the follow-up statistics for the patients used to build the clinicogenetic model, in which 351 samples were retained. A total of 121 patients (mean age = 55.13 years) were stratified as high risk, of which 34 (28.1%) relapsed within 5 years, and 230 (mean age = 52.23 years) were stratified as low risk, of which 3 (1.3%) relapsed within 5 years. Figure 2 shows the Kaplan-Meier curves for patients at high risk versus low risk of recurrence within a follow-up period of 5 years, post-mastectomy. The 20-gene classifier successfully identified the risk groups for luminal type breast cancer patients (P<0.0001 for both models). The survival curves indicate that patients with high risk scores displayed lower survival rates than those with low risk scores.
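The follow-up counts for the genetic model (130 high risk patients with 32 relapses, 240 low risk patients with 10 relapses) are enough to work through the five performance metrics by hand. The sketch below does that arithmetic; the function name is hypothetical. Because these counts come from the 370-patient follow-up cohort rather than the LOOCV predictions, the resulting values are not expected to match the LOOCV figures exactly.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute standard 2x2 metrics, treating 'high risk' as a positive call.

    tp: high risk and relapsed      fp: high risk, no relapse
    fn: low risk but relapsed       tn: low risk, no relapse
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Genetic model, 5-year follow-up counts as stated in the text.
high_risk, high_risk_relapsed = 130, 32
low_risk, low_risk_relapsed = 240, 10

metrics = diagnostic_metrics(
    tp=high_risk_relapsed,
    fp=high_risk - high_risk_relapsed,
    fn=low_risk_relapsed,
    tn=low_risk - low_risk_relapsed,
)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

The PPV here equals the 24.6% relapse rate in the high risk group, and the NPV is the complement of the 4.2% relapse rate in the low risk group, which shows how strongly both depend on the overall relapse prevalence.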
To investigate treatment effects, we summarized the characteristics of patients who did and did not receive treatment in Table S2 and used a multivariate Cox regression model to evaluate the effect of receiving treatment (Table S3).

DISCUSSION
This work focused on testing the efficacy of (i) a 20-gene classifier and (ii) a 20-gene classifier + clinical characteristics for risk stratification of luminal type breast cancer patients, post-mastectomy or breast-conserving surgery, by the risk of recurrence within 5 years. The primary question was whether the results predicted by the classifier model corresponded to the actual prognosis of patients with breast cancer. This was evaluated via the accuracy, sensitivity, specificity, PPV and NPV of the prediction models.
Accuracy alone is not sufficient to understand the complete picture when a dataset has a significantly different number of positive (high risk of recurrence) and negative labels (low risk of recurrence). It is argued that sensitivity and specificity can be used for making decisions preventing recurrence only if they are extremely high; hence, information about all five metrics (accuracy, sensitivity, specificity, PPV, and NPV) is vital for evaluating the predictions and achieving maximum benefit for patients. Predictions based on our genetic and clinicogenetic classifiers should ideally prevent patients from undergoing unnecessary treatments, thereby enhancing quality of life for patients.
Gene expression profiling studies have put forth a perspective that breast cancer is not a singular condition but consists of a collection of different diseases with different risk factors, clinical presentations, histopathological features, outcomes, and responses to systemic therapy. These studies also revealed that response to treatment is not only determined by anatomical prognostic factors such as tumor size or nodal stage, but also by intrinsic molecular characteristics of the tumors that can be probed with molecular methods (21). Commercially available, quantitative PCR-based multi-marker assays increase the probability of detecting the risk of tumor recurrence or metastasis, and thus can potentially characterize relapse on a molecular level. This enhances the determination of the prognosis, monitoring of the disease, and may allow us to individualize therapeutic strategies in the future. Our 20-gene classifier, after extensive evaluation, was found to accurately classify patients at high and low risk for relapse in both genetic and clinicogenetic models, thus predicting recurrence within 5 years after mastectomy.
Such prognostic assays have proven to work well for hormone receptor-positive breast cancers. However, signatures derived from similar tumor cohorts have been reported to share few overlapping genes (22,23). Nevertheless, it is argued that the prognostic concordance among multiple gene expression signatures suggests potential functional equivalence between the signatures (24). Therefore, despite the absence of many overlapping genes, the aforementioned signatures may share common functions or similar pathways that facilitate the prediction of recurrence and survival.
Gene expression profiling of breast cancer appears to be a promising new strategy for prognosis of luminal-like breast cancer. Such information can help to guide therapeutic decisions and clinical trial design. It is not yet completely clear what specific factors determine the progression toward metastatic disease, i.e., why some patients present with metastatic cancer, while in other patients many years may elapse before the disease advances to this stage. One possible explanation is that different cells-of-origin have specific differentiation programs that strongly predispose a person to an aggressive malignancy. This study identifies a 20-gene signature that, in combination with clinically relevant prognostic factors, can help determine the probability of relapse at 5 years in HR+/HER2-negative breast cancer in the Asian population. However, this study has some limitations. First, treatment benefits were difficult to estimate directly because patients who did and did not receive treatment differed in demographic and tumor characteristics. Second, we were mostly unaware of the adjuvant chemotherapy regimens administered to the patients analyzed in this study, because they were recruited from different hospitals and the information was retrospectively unavailable in most cases. Hence, to estimate prognosis, we were only able to introduce a dichotomous variable indicating whether or not a patient received adjuvant chemotherapy. Third, menopausal status is a well-known confounding factor that needs to be controlled. However, this variable was not collected for all patients, and thus we followed the approach used in the TAILORx trial (25) and used age as the confounding factor in the analyses. Overall, the model used in this study might be beneficial for accurately classifying patients' prognosis. Further studies are warranted to draw more definitive conclusions with respect to its applicability to clinical practice for therapeutic decision-making.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by

AUTHOR CONTRIBUTIONS
T-HC, J-YC, and K-HS conceptualized and designed the study. JL provided experimental support. K-HS and JW provided administrative support and study materials or patients. T-HC, JW, and K-HS collected and assembled the data. T-HC, K-HS, JW, and J-YC analyzed and interpreted the data. T-HC wrote the manuscript. All authors contributed to the article and approved the submitted version.

FUNDING
Research grants were provided by Amwise Diagnostics Pte. Ltd. during the conduct of the study.