Bioinformatics-Based Identification of HDAC Inhibitors as Potential Drugs to Target EGFR Wild-Type Non-Small-Cell Lung Cancer

Patients with EGFR-mutant non-small-cell lung cancer (NSCLC) greatly benefit from EGFR-tyrosine kinase inhibitors (EGFR-TKIs) while the prognosis of patients who lack EGFR-sensitive mutations (EGFR wild type, EGFR-WT) remains poor due to a lack of effective therapeutic strategies. There is an urgent need to explore the key genes that affect the prognosis and develop potentially effective drugs in EGFR-WT NSCLC patients. In this study, we clustered functional modules related to the survival traits of EGFR-WT patients using weighted gene co-expression network analysis (WGCNA). We used these data to establish a two-gene prognostic signature based on the expression of CYP11B1 and DNALI1 by combining the least absolute shrinkage and selection operator (LASSO) algorithms and Cox proportional hazards regression analysis. Following the calculation of risk score (RS) based on the two-gene signature, patients with high RSs showed a worse prognosis. We further explored targeted drugs that could be effective in patients with a high RS by the connectivity map (CMap). Surprisingly, multiple HDAC inhibitors (HDACis) such as trichostatin A (TSA) and vorinostat (SAHA) that may have efficacy were identified. Also, we proved that HDACis could inhibit the proliferation and metastasis of NSCLC cells in vitro. Taken together, our study identified prognostic biomarkers for patients with EGFR-WT NSCLC and confirmed a novel potential role for HDACis in the clinical management of EGFR-WT patients.


INTRODUCTION
Lung cancer has the highest morbidity and mortality in China and around the world. Most patients presented with lung cancer at a late stage owing to hidden onset and unspecific symptoms associated with the disease (1, 2). Lung cancer is generally classified into nonsmall-cell lung cancer (NSCLC) and small cell lung cancer (SCLC). However, this traditional classification according to histological assessment fails to account for the complex prognosis and drug resistance associated with the disease (3).
Radiotherapy combined with chemotherapy is the major treatment strategy for SCLC, whereas targeted therapy has become the first-line treatment for NSCLC patients carrying specific driver mutations (4)(5)(6). Epidermal growth factor receptor (EGFR)-tyrosine kinase inhibitors (TKI) such as gefitinib and erlotinib were the first targeted therapy for NSCLC. They have been widely applied in the clinical application for NSCLC patients carrying EGFR-sensitive mutations such as in-frame deletions at exon 19 and exon 21 point mutations (L858R). Also, EGFR-TKIs have significantly prolonged disease-free survival (DFS) compared with platinum-based chemotherapy (7,8). However, only 20-30% of all NSCLC patients with EGFR-sensitive mutations can benefit from EGFR-TKIs. For patients with no EGFR gene mutations or an unknown mutation status, platinum-based doublet chemotherapy regimens remain the standard first-line therapy (9,10). In these cases, the tumor response rate is estimated to be less than 10% and overall survival (OS) is only slightly improved (11). There is an unmet need to develop a novel therapy and to improve the prognosis for patients with EGFR wild-type (EGFR-WT) NSCLC.
The rapid development of bioinformatics analysis has allowed the development of novel biomarkers that can predict prognosis in patients with lung cancer (such as PD-L1 (12,13), GLUT1 (14), and Ki-67 (15,16)). However, little effort has been focused on the identification of specific biomarkers for EGFR-WT patients. Thioredoxin reductases 1 (TrxR1) has been reported to be related to the poor prognosis in EGFR-WT and ALKnegative NSCLC (17). As the statistical power of individual biomarkers is considered to be weak, it is necessary to establish a gene signature biomarker to improve the accuracy of prognosis prediction (18)(19)(20). Weighted gene co-expression network analysis (WGCNA) is a systems biology approach that clusters genes with a high co-expression relationship into the same module (21). WGCNA has been widely used to assess the functions of transcriptome systems (22), to identify gene modules related to clinical parameters and to investigate cancer biomarkers (23)(24)(25). However, WGCNA has not yet been reported to reveal the prognostic prediction of biomarkers in EGFR-WT NSCLC patients.
In this study, WGCNA was conducted on the expression profiles of EGFR-WT NSCLC patients and a two-gene prognosis signature was obtained by LASSO COX regression. We performed connectivity map (CMap) database analysis to identify HDACis as potential drugs to effectively target EGFR-WT NSCLC patients with a high risk score (RS). Our findings provided a further understanding for prognosis prediction and clinical treatment of EGFR-WT NSCLC patients.

MATERIALS AND METHODS
The flow chart for this research is shown in Figure 1.

Data Acquisition and Consolidation
GSE31852 database expression profile and clinical data were downloaded from the GEO database (Table S1 ) as a training set, and 62 EGFR-WT patients with complete survival data were selected for further analysis (Table S2). Gene expression profiles of these samples were annotated by using the Human Gene 1.0 ST Array (Table S3, Affymetrix, Santa Clara, CA) according to Affymetrix protocols (Table S4). Probes with no gene or duplicate-gene annotation were excluded.

WGCNA Network Construction
R package "WGCNA" was used for the automatic construction of a co-expression network. Firstly, a hierarchical clustering analysis of the samples was undertaken to ensure that there is little difference between the samples in the GSE31852 dataset ( Figure S1A). The co-expression similarity matrix of gene expression was defined according to the Pearson correlation coefficient. Following the selection of an appropriate soft threshold b, the unweighted co-expression similarity matrix was converted into a weighted adjacency matrix. Then, the topological overlap matrix (TOM) was constructed using the degree of dissimilarity between the nodes and the dissimilarity index was defined between the nodes (26). Finally, by using the dynamic tree-cutting algorithm, the TOM was modified and the network modules were initially identified by satisfying conditions such that the difference between these modules is less than 0.25, or the similarity exceeds 0.75 (Table S10).
R package "WGCNA" was used sequentially to visualize the constructed network module and elucidate the correlation between external information. Modules with a significance P < 0.05 in the correlation test were defined being related to the trait. All genes in modules related to prognosis (time and status) were included in the construction of a prognostic risk signature for EGFR-WT patients.

LASSO Regression and Multivariate COX Regression
R packages "glmnet" and "survival" were used to perform COX regression analysis through the LASSO algorithm. Those parameters with non-zero regression coefficients in the LASSO regression results were further included in the multivariate COX regression analysis. Genes with statistical significance in the multivariate Cox regression analysis were used to calculate their weighted gene expression values to establish RSs for each patient. The RS formula was established as follows: Internal and External Verification of the Prognostic Risk Score Signature X-tile software was used to calculate the best cut-off value of the patient's RS. According to the best cut-off value, all patients were divided into a high-RS group and a low-RS group. Kaplan-Meier survival analysis was performed using the "survival" package with the "log-rank" method. Both the consistency parameter C-index of the survival model and the accuracy of the prediction model in the training set were validated by the resampling method for internal cross-validation using R package "boot." R package "survivalROC" was used to plot the ROC curve and calculate the area under the curve (AUC).

Gene Set Enrichment Analysis for Biological Function
GSEA Version 3.0 software was employed to enrich the main biological function pathways in the high-RS group, referring to "c2.cp.kegg.v6.2.symbols.gmt" and "h.all.v7.1.symbols.gmt" gene sets taken from the MsigDB database. All processes were performed according to the default parameters of the GSEA software. The number of random combinations was set to 1,000 and the results were sorted according to normalized enrichment scores (NES).

Differential Gene Screening and Targeted Drug Prediction
Differentially expressed genes (DEGs) of the high-RS group versus low-RS group with |log fold change (log FC)| > 0.585 and p-value < 0.05 were analyzed by the R package "limma" (Table S11). Then, to find those drugs targeting high-RS patients, the differential gene sets were input into the CMap drug database (http://www.broadinstitute.org/cmap). The results included genes, diseases, or drug networks that were similar with, or opposite to, the expression profile. A positive score meant that the change in the expression profile caused by a drug was similar to the input gene expression profile. Conversely, a negative score indicated that the change in the expression profile caused by a drug was opposite that in the input gene expression profile. A drug with a negative score may reverse the corresponding gene expression in the disease and thus serves as a potential targeting drug for the disease (Table S12). Potential compound drugs were selected for verification according to the correlation score (less than 90) of the drugs (Table S12) based on published data from the literature (26).

MTT Assay
Approximately 2,000 cells per well in 96-well plates were treated with various concentrations of TSA or SAHA for 48 or 72 h. Then, we added 20 ml of 3-(4,5-dimethylthiazolyl-2-yl)-2,5diphenyltetrazolium bromide (MTT) (5 mg/ml) to each well and the cells were incubated for another 4 h at 37°C. After removing the medium, cells were lysed in 200 ml dimenthylsulfoxide (DMSO) at room-temperature, and the optical density (OD) was measured at a wavelength of 570 nm with a microplate reader (Bio-Rad Laboratories, Hercules, CA, USA).

Transwell Assay
Cells were treated with HDACis for 24 h, collected, and resuspend in serum-free media. Transwell chambers (Corning, NY, USA) were plated into a 24-well plate; 2 × 10 4 cells in 200 ml of serum-free medium were seeded onto the upper chamber and 500 ml of medium with 10% FBS was added to the lower chamber with or without HDACis. After incubation for 24 h, the chambers were fixed with methanol and cells on the upper membrane were removed. Cells in the lower membrane were stained with Wright-Giemsa dye and the number of cells counted.

Statistical Analysis
All statistical analyses and visualization were performed using GraphPad Prism 7.0 and R (version 3.6.2). Student's t-test was used to compare the differences between the two groups: P < 0.05 was considered statistically significant.

Data Collation and Patient Characteristics
In order to identify the key prognostic biomarkers for EGFR-WT NSCLC patients, we performed a systematic analysis of the GSE31852 dataset that included 62 patients with complete progression-free survival (PFS) information. The general information of these patients is summarized in Table 1. These patients included 24 lung adenocarcinomas, accounting for 38.7% of the samples, and six lung squamous cell carcinomas, accounting for 9.7% of the sample. The cohort consisted of 21 men and 16 women, accounting for 33.9 and 25.8% of the samples respectively. Six patients had no previous history of smoking, and 31 patients were former smokers. According to the expression profiles, all of the patients were clustered gene expression datasets ( Figure S1A). As no obvious outliers in the expression profiles were observed, all of the 62 EGFR-WT patients were included in the subsequent analysis.

Construction of a Scale-Free Network by WGCNA
WGCNA was conducted on the gene expression profile data to identify the gene network modules co-expressed in EGFR-WT patients and to explore the relationship between these gene network modules and prognosis. Firstly, a soft threshold (b = 18) with an R 2 > 0.9 was defined to establish an adjacency matrix ( Figures 2A, B). All genes were clustered into 21 modules with different colors ( Figure 2C). The gray module in which the genes were not clustered was excluded from the analysis. The similarity between each module was less than 0.75 ( Figures 2D, E).
We calculated the module eigengenes (ME) value of the samples which represented the gene expression pattern of each module. We found that the gene expression of every sample in each module was different whereas the gene expression of each sample in every module was similar. These results suggested that the modules composed of genes with similar expression patterns may have important biological functions ( Figure S1B). Then, according to gene significance (GS) which was defined as the association of a single gene with external information (clinical pathology parameters), the correlation between the survival parameters (PFS time and state) and related modules was explored (Table S10, Figure 2F). Among each of the 20 modules, the turquoise and green modules were both significantly associated with PFS (P < 0.05) compared to the other traits. These modules were most closely related to PFS time and PFS status with correlation coefficients of −0.42 (P < 0.0001) and −0.35 (P < 0.0001), respectively. These results indicated that the gene expression patterns in the turquoise and green modules were most closely correlated with prognosis in the EGFR-WT patients.

Construction of Two-Gene Prognostic Signature-Based RS for EGFR-WT NSCLC Patients
As each module in the WGCNA network based on gene expression profiles could be regarded as a characteristic for EGFR-WT patients, the turquoise module ( Figure S2A) and green modules ( Figure S2B) may contain the prognostic prediction genes for EGFR-WT patients. The genes of these two modules were combined as candidate gene sets for prognostic markers in EGFR-WT patients. To further select the key genes related to prognosis, we performed the Least Absolute Shrinkage and Selection Operator (LASSO) regression combined with multivariate COX regression analysis ( Figures 3A and S2C, D). CYP11B1and DNALI1 were identified as independent prognostic factors for PFS of EGFR-WT NSCLC patients by univariate and multivariate Cox regression ( Table 2). Based on the expression of independent prognostic factors and their corresponding coefficients in the regression analysis, the two-gene signature-based RS of each EGFR-WT patient was derived based on the following formula: According to the best cut-off value (RS = 19.1), the patients were divided into high-and low-RS groups ( Figure 3B). Kaplan-Meier survival analysis showed that the PFS of patients in the high-RS group was significantly shorter than that in the low-RS group ( Figure 3C, HR = 3.99, 95% CI = 2.04-7.82, P < 0.0001). The C-index was 0.8625 (95% CI = 0.7579-0.9671, P < 0.001) by internal cross-validation.
To assess the predictive ability of the model for short-term PFS, the receiver operating characteristic (ROC) curves of the two-gene signature-based on the RS and gene expression were drawn at 1, 2, and 6 months, and the AUC was determined. The results showed that the two-gene RS model was a good indicator of short-term PFS ( Figure 3D). The AUCs for the two-gene RS at 1, 2, and 6 months were 0.736, 0.781, and 0.848, respectively. These values were significantly better than those obtained based on simple gene expression for predicting PFS at each time point (Figures S2E, F).
Taken together, these results indicated that the two-gene RS signature established by the PFS-related modules of WGCNA reflected the survival of EGFR-WT patients and had better predictive power than independent prognostic gene expression.

External Validation of Two-Gene RS for EGFR-WT Patients
To validate the significance of the two-gene signature for the prognosis of PFS in EGFR-WT NSCLC patients, the GSE31210 database was selected for external verification. All patients were divided into high-or low-RS groups according to the best cut-off value (RS = 10.1) and analyzed based on the Kaplan-Meier survival curves. In patients with EGFR mutations, the PFS of low-RS group was significantly shorter than that of high-RS group (Figures S3A, B). In contrast, in patients with EGFR-WT (including KRAS mutations, ALK fusion, and none of three mutations named as triple-negative type), the PFS of high-RS patients tended to be shorter although it failed to reach statistical significance (P > 0.05) potentially due to the small sample size ( Figures 4A, B). Further refinement of the grouping indicated that the two-gene RS signature had better performance in predicting PFS for patients with triple-negative disease (HR = 3.11, 95% CI = 1.35-7.19, P < 0.01) (Figures 4C, D). The C-index results of the prognostic model in the three classifications of populations also showed that the RS could fit the true situation of PFS for patients with triple-negative lung cancer (Table 3). However, no significant difference in the long-term OS was found between the high-and low-RS groups regardless of the types of mutations ( Figure S4). Our results demonstrated that the two-gene RS was significant for survival prediction in EGFR-WT patients particularly in those who did not have EGFR/KRAS/ALK mutations. The two-gene signature was also verified in TCGA database ( Figures S3C, D and 4E, F). All NSCLC patients were also divided into high-and low-RS groups and analyzed from the Kaplan-Meier survival curves. The survival time of high-RS patients was significantly shorter than that of low-RS patients (HR = 2.1, 95% CI = 1.51-2.91, P < 0.0001), suggesting that the two-gene signature had a generalized and adaptive capacity in predicting the prognosis of EGFR-WT NSCLC patients.

Verification of HDAC Inhibitors as Potential Targeted Drugs for EGFR-WT NSCLC
To clarify the biological characteristics of high-RS populations, we performed GSEA on the gene expression profile of EGFR-WT NSCLC patients using the tumor-related HALLMARK pathway    Figure 5A).
To ascertain potential drug targeting in the EGFR-WT high-RS population, we performed differential gene analysis between the high-and low-RS groups, and identified 25 up-regulated differential expression genes (DEGs) and 36 down-regulated DEGs (Table S11, Figure 5B). We compared these DEGs with the CMap database, aiming to identify the drugs that interacted with the DEGs as detailed in the Materials and Methods ( Figure  5C). Twenty-four candidate drugs were identified, and   interestingly, nine were classified as HDACi (Tables 4, S12).
Considering the safety and feasibility of HDCAis for clinical applications, we verified two common HDACis, TSA, and SAHA in EGFR-WT lung cancer cells (A549 and H1299). Both HDACis significantly inhibited the proliferation and migration of these cell models ( Figures 6A, B), indicating the potential of TSA and SAHA as novel treatment strategies in EGFR-WT lung cancer.

Discussion
In the current study, we established a RS based on a two-gene signature for EGFR-WT NSCLC patients and found that EGFR-WT patients with a high RS had a worse prognosis. Mechanistically, a high RS might cause poor prognosis by activating multiple metastasis-related pathways such as epithelial-mesenchymal transition and ECM receptor interaction. Furthermore, HDACis were screened out as potential targeted drugs for EGFR-WT NSCLC and found to inhibit cell proliferation and migration of EGFR-WT NSCLC cells. Although well-known driver genes such as EGFR and ALK with sensitive mutations could strongly predict the efficacy of targeted drugs in NSCLC patients, about 50% of NSCLC patients without driver gene mutations do not benefit from the personalized targeted therapy (27,28). For those patients that do not have corresponding mutations or have unknown mutation status, platinum-based doublet chemotherapy remains the standard first-line regimen which has poor efficacy (9,10). In these patients, docetaxel or pemetrexed could be used as second-line single-agent chemotherapies. However, the tumor response rate was less than 10% and the OS was only slightly improved with these treatments (11). The current known tumordriver mutations are not sufficient to fully predict the drug response of lung cancer. This study aimed to ascertain the characteristics of gene expression profiles in EGFR-WT NSCLC patients and provide evidence for the clinical treatment for NSCLC patients without specific mutations.
The rapid development of NGS technology and bioinformatics has allowed major progress in the prediction of NSCLC prognosis. In addition to the TNM stage, multi-gene prognostic signatures based on transcriptome sequencing have been developed for prognosis. However, the majority of studies of gene signatures for predicting prognosis in lung cancer have focused on RSs related to specific mechanisms, starting with functional molecules or cell components, such as immune infiltration (29)(30)(31), EMT scores (32), hypoxic or metabolic catabolites (33,34), and non-coding RNA (35,36). Little research has considered specific mutations.
Few gene signature studies have considered a scale-free property of the gene interaction network using WGCNA (37). WGCNA bridges the gap from individual genes to systems biology networks and can be utilized to identify hub genes that play key roles in the disease by converting the relationship between genes from a constant probability to the value adding a correlation weight (38). This allows the conversion from a random network to a scale-free network (39). Unfortunately, WGCNA on EGFR-WT NSCLC patients has not yet been reported.
In the present study, we used WGCNA to cluster the gene expression pattern of the EGFR-WT population, studied the coexpressed gene modules and correlated the gene modules with patient prognostic information for the first time. Due to the limited omics data containing information on mutations, the accuracy and comprehensiveness of our two-gene signature need to be validated in a large cohort of lung cancer patients. Also, we observed that the two-gene signature predicted the prognosis of patients without known driver gene mutations (EGFR/KRAS/ ALK-WT) better than that of patients with EGFR-WT alone. Thus, the accumulation of next-generation sequencing (NGS) data in clinical tumor assessment and treatment may allow the further verification of a two-gene signature in a larger multiomics database.
Considering the genes in WGCNA modules might be interrelated, they were not suitable for direct prognostic signature construction as this may result in multi-collinearity. We performed LASSO regression on the genes in the WGCNA modules to screen for the prognostic factors for the subsequent multivariate COX regression model. These analyses led to the selection of CYP11B1 and DNALI1 genes to construct a prognostic risk signature for EGFR-WT patients.
CYP11B1 encodes the cytochrome P450 family 11 subfamily B member 1 protein that is mostly expressed in the adrenal glands and is related to excessive cortisol secretion (40). In cancer, CYP11B1 is not only related to aldosterone-and cortisol-co-secreting adrenal tumors (41,42), but it also affects the drug response of breast cancer (43), gastrointestinal tumors (44), leukemia (45), and other tumors. DNALI1 encodes dynein axonemal light intermediate chain 1 protein which is a human homologue of p28 in Chlamydia. DNALI1 is widely expressed in the human testis, ovaries and other tissues (28,46), but the functions of DNALI1 in physiological processes and tumor development remain unclear. Limited studies have reported that DNALI1 is down-regulated in breast cancer (47) and negatively correlated with poor prognosis (48,49). However, the expression pattern and function of both genes in lung cancer, as well as their influence on prognosis are unknown.
In this study, we analyzed the predictive significance of both molecules in the prognosis of NSCLC patients from online databases ( Figure S5). The prognosis of patients with high expression of CYP11B1 or low expression of DNALI1 was significantly poor (P < 0.0001). Furthermore, the efficiency of prognostic prediction in the EGFR-WT population from RS based on the two-gene signature was better than that by individual genes (Figures S2E, F). The RS based on the twogene signature had good predictive significance for the prognosis of EGFR-WT patients. This may provide a good reference value for clinical decision-making in EGFR-WT patients, particularly in patients without known driver mutations.
From the online drug database named CMap, HDACis were screened out as potential drugs to improve the prognosis of EGFR-WT patients with a high RS based on the two-gene signature.
CMap is a database developed by Broad Research Institute to reveal the functional relationships between small molecule compounds, genes, and disease states (50). CMap was employed to compare the differentially expressed gene list with the reference gene sets after specific treatments in the database. A correlation score (−100 to 100) was obtained according to the enrichment of differentially expressed genes in the reference gene expression profile. A positive score indicated that the upregulation or down-regulation pattern of input genes was similar to the pattern of reference gene expression treated with different drugs. In contrast, a negative score indicated that the drugs regulated the expression of genes in an opposite direction. Finally, all treatments in the database were ranked according to the correlation score with the reference gene expression profile.
We found that among a total of 24 drug candidates with scores less than −90, 9 were HDACis, accounting for more than one-third of those identified. HDACis, such as romidepsin and vorinostat, are approved by the Food and Drug Administration (FDA) in cutaneous T cell lymphoma therapy (51,52). However, although multiple preclinical studies have shown that HDAC inhibitors play a significant anti-cancer role in vitro or in animal models (53,54), HDACis monotherapy clinical trials for lung cancer have failed (55)(56)(57). The results in the present study independently predicted that HDACis may be potential drugs for patients with high-RS EGFR-WT NSCLC and implicated the possibility of HDACis as single drugs for lung cancer therapy. Furthermore, HDACis were found to inhibit the proliferation and migration capacity of EGFR-WT NSCLC cells which was consistent with previous studies. Our research provides support for the independent application of HDAC inhibitors in EGFR wild-type NSCLC. Large-scale clinical trials should be carried out to confirm the efficacy of HDACis in EGFR-WT NSCLC patients.
In conclusion, the current study showed that a two-gene signature could effectively predict the survival of EGFR-WT patients, especially in NSCLC patients without known gene mutations. Our data also indicate that HDACis, such as TSA or SAHA, might be potentially effective clinical drugs for high-RS EGFR-WT patients. This study may fill the gap in lung cancer data analysis on one-specific mutant population, highlights the need for differential analysis of different oncomutations in cancer and also provides clues for the clinical treatment of EGFR-WT NSCLC patients.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.

AUTHOR CONTRIBUTIONS
YW and CZ analyzed the data and drafted the manuscript and all figures. WL and DW helped interpreted the data. YCheng and YChen completed analysis of drug prediction. KH and YL completed the in vitro experiments. JQ edited language grammars and all tables. XC and XH designed the study and revised the manuscript. All authors contributed to the article and approved the submitted version.