Identification of Tamoxifen-Resistant Breast Cancer Cell Lines and Drug Response Signature

Breast cancer cell lines are frequently used to elucidate the molecular mechanisms of the disease. However, a large proportion of cell lines are affected by problems such as mislabeling and cross-contamination. Therefore, it is of great clinical significance to select optimal breast cancer cell lines models. Using tamoxifen survival-related genes from breast cancer tissues as the gold standard, we selected the optimal cell line model to represent the characteristics of clinical tissue samples. Moreover, using relative expression orderings of gene pairs, we developed a gene pair signature that could predict tamoxifen therapy outcomes. Based on 235 consistently identified survival-related genes from datasets GSE17705 and GSE6532, we found that only the differentially expressed genes (DEGs) from the cell line dataset GSE26459 were significantly reproducible in tissue samples (binomial test, p = 2.13E-07). Finally, using the consistent DEGs from cell line dataset GSE26459 and tissue samples, we used the transcriptional qualitative feature to develop a two-gene pair (TOP2A, SLC7A5; NMU, PDSS1) for predicting clinical tamoxifen resistance in the training data (logrank p = 1.98E-07); this signature was verified using an independent dataset (logrank p = 0.009909). Our results indicate that the cell line model from dataset GSE26459 provides a good representation of the characteristics of clinical tissue samples; thus, it will be a good choice for the selection of drug-resistant and drug-sensitive breast cancer cell lines in the future. Moreover, our signature could predict tamoxifen treatment outcomes in breast cancer patients.


INTRODUCTION
The overall recurrence rate of estrogen receptor positive (ER+) early breast cancer can be reduced by adjuvant treatment with tamoxifen. However, approximately 30-40% of ER + breast cancer patients receiving adjuvant tamoxifen therapy still would relapse or progress to deadly advanced metastatic stages within 15 years follow-up; this is largely attributed to tamoxifen resistance (Ye et al., 2019). Therefore, it is of great clinical significance to identify the efficacy of tamoxifen in ER + breast cancer patients. Cell lines are a common modeling tool in cancer research (Domcke et al., 2013); they can help us to better understand the biological processes and molecular mechanisms of cancer and aid in the development of anticancer drugs (Kong and Yamori, 2012;Knudsen et al., 2014). However, whether cell line models could adequately reflect the characteristics of clinical tissue samples is controversial (American Type Culture Collection Standards Development Organization Workgroup ASN-0002, 2010;Liedtke et al., 2010;Bayer et al., 2013;Capes-Davis et al., 2019;Wass et al., 2019). It is well known that tumor cell lines might lose some of their tumor-related characteristics owing to the culture environment (Masters, 2000). Cross-contamination (International Cell Line Authentication Committee, 2014) and misidentification (American Type Culture Collection Standards Development Organization Workgroup ASN-0002, 2010) of cell lines exacerbates such issues. Moreover, there is no unified gold standard for the identification of drugresistant cell lines, which also results in some cell lines poorly reflecting the characteristics of clinical tissue samples (Liedtke et al., 2010). Thus, it is of great value to find resistant/sensitive cell line models that are more representative of clinical tissue samples.
Considering tamoxifen survival-related genes from breast cancer tissue samples as the gold standard, we screened for the optimal cell line model. In the survival-related analysis of tissue samples, we assumed that genes that were positively (negatively) correlated with survival risk in tissue samples were comparable with genes that are upregulated (downregulated) in resistant compared with sensitive cell lines. In this study, through evaluating the consistency of prognosis-related genes in tissue samples from patients undergoing tamoxifen treatment with drug-resistance genes in cell lines, we selected the optimal cell line model to represent the characteristics of clinical tissue samples; the consistent genes between tissues and cell lines were identified as clinical drug-resistance-related genes.
Moreover, the relative expression orderings (REOs) of gene pairs within individual samples, also called qualitative transcriptional characteristics, are robust against experimental batch effects and can be directly applied to samples at an individual level (Eddy et al., 2010;Guan et al., 2019). The robustness property of the qualitative transcriptional characteristics enables integration of multiple datasets from different sources to develop disease signatures or classifiers, which improves the probability of finding robust signatures (Xu et al., 2008;Guan et al., 2019). Thus, based on qualitative transcriptional characteristics and the clinical drug-resistancerelated genes that we identified, we developed a tamoxifenresistance signature for ER + breast cancer and verified it in independent data.

Data and Preprocessing
Breast cancer gene expression data and corresponding clinical information were downloaded from the GEO database (Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo/). Relapse-free survival (RFS) time was defined as the interval between the first day of surgery and the date of death from any cause or of recurrence (local and/or distant) (Punt et al., 2007;Merok et al., 2013). Breast cancer tissue samples from ER+ patients who had received post-operative tamoxifen treatment were selected from the seven datasets, as described in Table 1. Nine gene expression datasets for breast cancer tamoxifen-resistant/sensitive cell lines were also downloaded from the GEO database, as shown in Table 1.
For the array data measured by Affymetrix platform, raw mRNA expression data (.CEL files) were downloaded, and the Robust Multi-array Average algorithm was used for normalization with Affy package in R software Irizarry et al., 2003). For sequence-based data, the processed data were directly downloaded.

Identification of Survival-related Genes in Tissue
The Cox proportional hazard model was used to study the relationships between gene expression levels and survival (Kreike et al., 2010). For the coefficient β obtained from the Cox model, if β > 0 for a certain gene, this gene was considered to be positively correlated with survival risk and was comparable with the upregulated gene between resistant and sensitive cell lines. Similarly, if β < 0, the gene was comparable with the downregulated gene between resistant and sensitive cell lines.

Identification of Differentially Expressed Genes (DEGs) in Cell Lines
In this study, the SAM (significance analysis of microarrays) algorithm (Tusher et al., 2001) was used to identify DEGs between resistant and sensitive cell lines.

Consistency Evaluation Between Tissues and Cell Lines
In this study, we hypothesized that genes positively (negatively) associated with survival in tissues corresponded to those genes upregulated (downregulated) between resistant and sensitive cell lines.
The consistency ratio, which is the number of overlapping and consistent DEGs/number of overlapping DEGs, was used to evaluate the similarity between tissues and cell lines. The significance was evaluated by the binomial distribution test as follows: where n denotes the number of overlapping DEGs between tissue and cell line, and k denotes the number of those overlapping DEGs with the same dysregulation direction. Then, the p-values were adjusted using the Benjamini-Hochberg method (Benjamini and Hochberg, 1995). Method denotes the production process for tamoxifen-resistant breast cancer cell lines. # High-throughput sequencing data.

KEGG Pathway Enrichment
The hypergeometric distribution model was used to determine the significance of KEGG (Kanehisa and Goto, 2000) (Kyoto Encyclopedia of Genes and Genomes) pathways enriched with the genes of interest using the following statistical model: where N denotes the number of background genes, n denotes the number of genes of interest, m denotes the number of genes in a given pathway, and k denotes the number of genes of interest in that pathway.

Identification of REO-based Tamoxifen-resistance Signature
Taking the consistent DEGs between tissues and cell lines as candidate genes, we used the Cox model and C-index analysis (Harrell et al., 1984) to develop a tamoxifen-resistance signature. The detailed process was described as follows.
Step 1: Selecting Survival-related Gene Pairs (1) For the n candidate DEGs, pairwise comparisons were performed for all genes (generating a total of C 2 n gene pairs), and this gene pair set was defined as Set 1.
(2) From all gene pairs (G i , G j ) in Set 1, the Cox model was used to select those that were significantly correlated with RFS of the tamoxifen-treated breast cancer patients. The set of significantly correlated gene pairs (FDR < 10%) was defined as Set 2.
Step 2: Optimizing the Gene Pair Signature First, we enumerated all the gene pair combinations in Set 2. For each gene pair combination in a sample, if at least half of the gene pairs in the combination were consistent with tamoxifen sensitivity, the sample was identified as low risk; otherwise, it was considered high risk. Then, we calculated the C-index value for each gene pair combination, and selected the combination with maximum C-index as our tamoxifen-resistance signature (Set 3).

Identification and Evaluation of DEGs in Cell Lines
A flowchart of the analysis procedure is shown in Figure 1. We identified the DEGs between tamoxifen-resistant and tamoxifensensitive cell line samples within each of the nine datasets using the SAM method (FDR < 20%). We also evaluated the consistency of DEGs among different datasets (a total of C 2 9 = 36 combinations). Among the 36 combinations, only 16 showed significant consistency (p < 0.05), as described in Table 2. These results indicate that there is greater heterogeneity among cell lines from different sources.

Identification of Tamoxifen Survival-related Genes in Tissues
Based on the univariate Cox regression model with FDR < 20%, 893 and 968 tamoxifen survival-related genes were identified in datasets GSE17705 and GSE6532, respectively; 235 genes were common to the two groups, all of which had the same dysregulation direction (which could not occur by chance; binomial test, p < 1.0E-16), further verifying the reliability of the results. These 235 genes were considered to be breast cancer tissue candidate genes.
Owing to the heterogeneity among cell lines, we evaluated the consistency between tissue candidate genes and DEGs from different cell line datasets (resistant vs sensitive) to select an optimal cell line model that could well represent the characteristics of clinical tissue samples. We found that only the DEGs from dataset GSE26459 were well reproduced among tissue candidate genes; the consistency ratio was above 73%, indicating that this did not occur by chance (binomial test, p = 2.13E-07). The DEGs from the other cell line datasets were not well reproduced among the tissue candidate genes ( Table 3). These results demonstrate that the cell line data from dataset GSE26459 could well represent the characteristics of clinical breast cancer tissue samples.

KEGG Pathway Enrichment
KEGG pathway enrichment analysis was performed for the 235 tissue candidate genes from datasets GSE17705 and GSE6532 using a threshold of FDR < 0.2, and for the DEGs from cell line dataset GSE26459 using the same threshold (Table 4). There was no pathway commonly enriched between tissues and the cell line, possibly owing to the low statistical power  or to partial differences between resistant and sensitive cell lines induced by tamoxifen treatment (Dancik et al., 2011). Thus, taking the pathways enriched in tissues as the gold standard, we obtained the p-values of these pathways in dataset GSE26459 (Table 4). With p < 0.2, the cell cycle, p53 signaling pathway, oocyte meiosis, and progesterone-mediated oocyte maturation were recurring themes in the pathway analysis for both tissues and cell lines. These pathways have been reported to be correlated with tamoxifen resistance.
Studies have shown that tamoxifen could affect the cell cycle of human breast cancer cell lines, the major sensitivity to tamoxifen in terms of both inhibition of cell cycle progression and drug cytotoxicity occurring particularly in the G0-G1 stage (Taylor et al., 1983). Tamoxifen could also affect the mitosis of oocytes and lead to premature centromere separation (London and Mailhes, 2001). The PTEN protein, encoded by the gene, in the p53 signaling pathway has been shown to be associated with tamoxifen resistance (Shoman et al., 2005). Similarly, the PGR protein in the progesterone-mediated oocyte maturation signaling pathway has been shown to be associated with tamoxifen response (Elledge et al., 2000). In summary, the pathways found to be enriched in tissues and also in cell line dataset GSE26459 (p < 0.2) were correlated with tamoxifen resistance, further demonstrating that the cell line model from dataset GSE26459 could represent the characteristics of clinical tissue samples.
Moreover, with FDR < 20%, the DEGs from cell line dataset GSE26459 were enriched in 31 pathways, compared with only seven pathways for the genes from tissue samples. However, as shown in Table 4, many of the pathways enriched for the cell lines from dataset GSE26459 are associated with tamoxifen treatment. For example, the prolactin signaling pathway and neurotrophin signaling pathway are related to side effects of tamoxifen (Lamberts et al., 1982;El-Ashmawy and Khalil, 2014), indicating that some of the differences between resistant and sensitive cell lines were due to tamoxifen treatment.

Identification of Tamoxifen Response Signature
First, we considered the 84 consistent DEGs between tissues and cell line dataset GSE26459 to be clinical tamoxifenresistance-related genes. In the training dataset GSE12093, pairwise comparisons were performed for all clinical tamoxifenresistance-related genes, and all the gene pairs were analyzed with a univariate Cox regression model. With FDR < 10%, 20 gene pairs were identified that were significantly associated with RFS. Then, among the 20 gene pairs, we enumerated all the gene pair combinations to calculate their C-index values, and selected the gene combination with the maximum C-index as the tamoxifen response signature. Finally, two gene pairs (TOP2A, SLC7A5; NMU, PDSS1) were identified. Based on our signature and the majority vote rule, the training dataset samples could be divided into high-and low-risk samples, which had significantly different RFS (hazard ratio [HR] = 9.509, logrank p = 1.98E-07). Our signature was also verified in an independent validation test using combined data from datasets GSE4922 and GSE2990 (HR = 2.191, logrank p = 0.009909), as shown in Figure 2A. Moreover, we searched public databases again for breast cancer tissue samples treated only with post-operative tamoxifen, for which associated RFS information was available, to further verify the performance of our signature. Finally, two new independent datasets were obtained. For the breast cancer tissue samples  from dataset GSE42568, 37 samples were identified as high risk, and 30 were identified as low risk (HR = 1.804, logrank p = 0.2), as shown in Figure 2B. For the breast cancer tissue samples from dataset GSE9195, 41 samples were identified as high risk and 36 as low risk (HR = 1.516, logrank p = 0.5), as shown in Figure 2C. Although the difference between the groups was not significant according to statistical tests, there was a clear trend indicating a difference in RFS between the highand low-risk groups identified by our signature (Figure 2B-C). Moreover, we combined the above two datasets to further verify the performance of our signature. In the combined data from datasets GSE42568 and GSE9195, 78 samples were identified as high risk and 66 samples were identified as low risk (HR = 1.7, logrank p = 0.1), as shown in Figure 2D. In summary, the results indicate that our signature (consisting of two gene pairs) can predict drug efficacy to some extent.

DISCUSSION
Cell line models are widely used in various fields of medical research, especially in basic cancer research and drug discovery (Masters, 2000;Mirabelli et al., 2019). Despite the successful application of cell lines in basic research, their use as model systems remains controversial (Masters, 2002;Sandberg and Ernberg, 2005;Peng et al., 2018;Hallas-Potts et al., 2019).
Owing to issues such as cross-contamination, mislabeling, or the identification of drug resistance, some cell line models do not adequately represent the characteristics of clinical tissues. In this study, based on evaluation of the consistency of DEGs between tissues and cell lines, we selected the optimal cell line model to represent the characteristics of clinical tissue samples; this was further verified by pathway analysis. Our analysis method is also suitable for other types of cell line modes. The tamoxifen survival-related genes identified in tissue samples from different datasets were significantly consistent, suggesting that the results were reliable. However, the DEGs found in tamoxifen-resistant and tamoxifen-sensitive cell lines from different sources were less reproducible, indicating that cell line models from different sources show more heterogeneity. Therefore, it will be of great clinical significance to screen  for drug-resistant and drug-sensitive cell line models that better represent the characteristics of clinical tissue samples. According to our results, the DEGs from cell line dataset GSE26459 were reproducible in tissue samples, indicating that the cell line model from this dataset was representative of the characteristics of clinical tissue samples. Tissue samples were obtained by surgical resection before tamoxifen therapy. Thus, the survival-related genes obtained from tissues were intrinsic to the patient and not induced by tamoxifen treatment. The resistant and sensitive cell lines from dataset GSE26459 were selected from MCF subclones (Gonzalez-Malerva et al., 2011); this might partly explain why the cell lines from GSE26459 could represent the characteristics of clinical tissue samples. The pathways enriched in tissues and in cell line dataset GSE26459 (p < 0.2) have been reported to be associated with tamoxifen resistance (Lamberts et al., 1982;El-Ashmawy and Khalil, 2014). Moreover, the clinical tamoxifen-resistance gene-pair signature we developed was verified in independent validation dataset, which indicates that our signature has some power to predict response to tamoxifen therapy, and further demonstrates that we have selected appropriate tamoxifen-resistant and tamoxifen-sensitive cell line models. Although the cell line models identified by our analytical method could well reflect the information of clinical tissue samples, there were some limitations. As patients with breast cancer usually have good prognosis, the endpoint of their followup is usually survival or recurrence time. Furthermore, as well as the effects of drugs, many factors including mood, marital status, and economic status could affect the survival of patients. The above factors might cause that some of the survival-related genes that we have identified are not involved in tamoxifen resistance. In future work, use of more tissue sample data or an improved algorithm should be considered. Moreover, as DNA methylation patterns, genomic changes, etc., might also predict sensitivity to drugs, the use of other types of data (such as microRNAs,

DATA AVAILABILITY STATEMENT
All datasets presented in this study are included in the article/supplementary material.

AUTHOR CONTRIBUTIONS
QZG and XKS conceived the study, analyzed the data, produced the figures, performed the statistical analysis, and drafted the manuscript. ZZZ participated in the revision of the manuscript. YZZ and YTC searched the data and participated in the statistical analysis. JL conceived the study and participated in its design and coordination, helped to draft the manuscript, and supervised the work. All authors contributed to the article and approved the submitted version.