Identification of Serum MicroRNAs as Novel Biomarkers in Esophageal Squamous Cell Carcinoma Using Feature Selection Algorithms

Introduction: Circulating microRNAs (miRNAs) are promising molecular biomarkers for the early detection of esophageal squamous cell carcinoma (ESCC). We investigated serum miRNA expression profiles obtained with microarray-based technology and evaluated the diagnostic value of serum miRNAs as potential biomarkers for ESCC by using feature selection algorithms. Methods: Serum miRNA expression profiles were obtained from 52 ESCC patients and 52 age- and sex-matched controls with a high-throughput microarray assay. Five representative feature selection algorithms, namely the false discovery rate procedure, the family-wise error rate procedure, Lasso logistic regression, the hybrid huberized support vector machine (SVM), and the SVM using the squared-error loss with the elastic-net penalty, were applied jointly to select significantly differentially expressed miRNAs from the miRNA profiles. Results: Three miRNAs, miR-16-5p, miR-451a, and miR-574-5p, were identified as powerful biomarkers for the diagnosis of ESCC. The diagnostic accuracy of the combination of these three miRNAs was evaluated by using logistic regression and the SVM. The average area under the receiver operating characteristic curve (AUC) and the average classification accuracy across the different classifiers exceeded 0.80 and 0.79, respectively. The cross-validation results suggested that the three-miRNA-based classifiers could clearly distinguish ESCC patients from healthy controls. Moreover, the miRNA panel retained its classification performance when discriminating the healthy group from patients with ESCC stage I-II (AUC > 0.76) and from patients with ESCC stage III-IV (AUC > 0.80). Conclusions: These results advance the identification of novel biomarkers for the diagnosis of ESCC.


INTRODUCTION
Esophageal cancer (EC) was the seventh most common malignancy and the sixth leading cause of cancer-related death worldwide in 2018 (1). Esophageal squamous cell carcinoma (ESCC) accounts for over 90% of all EC cases in lower income countries, especially in parts of Asia (1). Although improvements in surgical techniques and perioperative management have substantially prolonged the survival of ESCC patients, ESCC remains one of the deadliest carcinomas of the gastrointestinal tract. The 5-year survival rate of late-stage ESCC is <15% (2). Early detection is therefore essential for reducing the morbidity and mortality of ESCC. Currently, the diagnosis of ESCC depends mainly on endoscopic examination with biopsy, meaning that ESCC can be detected accurately only once patients have obvious symptoms and lesions. However, most early-stage ESCC patients are asymptomatic and their lesions are confined to the mucosa or submucosa, so they usually miss the opportunity for early diagnosis. Moreover, the invasiveness of endoscopic biopsy and its potential for sampling error limit its effectiveness (3). Thus, novel non-invasive biomarkers with high efficiency for the early detection of ESCC are urgently needed.
MicroRNAs (miRNAs) are a class of small non-coding RNAs of about 18-25 nucleotides in length. MiRNAs regulate gene expression by binding directly to the 3′ untranslated region (3′ UTR) of target messenger RNAs (mRNAs) through base-pair complementarity, promoting their degradation and/or translational inhibition (4). The altered expression of specific miRNAs has been associated with various diseases, including cancers (5-8). In recent years, several studies have analyzed the serum miRNA profiles of human ESCC patients by microarray-based techniques and have examined their potential clinical relevance (9-11). However, a comprehensive evaluation of the value of circulating miRNAs as potential biomarkers for screening ESCC has yet to be performed. Moreover, in microarray data the number of samples is usually much smaller than the number of measured miRNAs. This limitation, known as the high-dimension, low-sample-size (HDLSS) problem, may lead to over-fitting and degrade the diagnostic performance of traditional statistical models. The identification of miRNAs as biomarkers is therefore tightly linked with the curse of dimensionality. Feature selection algorithms, including multiple hypothesis testing and Lasso-type variable selection methods, have been proposed to reduce the dimension and, owing to their simplicity and efficiency, are well suited to high-dimensional miRNA microarray data (12).
In this case-control study, we profiled the serum miRNA expression of a relatively large number of samples using the Agilent miRNA array. We then applied in combination the false discovery rate (FDR) procedure (13), the family-wise error rate (FWER) procedure (14), Lasso logistic regression (15), the hybrid huberized support vector machine (HHSVM) (16), and the support vector machine using the squared-error loss (SESVM) with the elastic-net penalty (17), and identified three serum miRNAs, miR-16-5p, miR-451a, and miR-574-5p, as candidate biomarkers for diagnosing ESCC. We then built classifiers on these three miRNAs with logistic regression and SVM models to predict ESCC. We assessed the predictive accuracies of these classifiers and found that they were highly efficient in discriminating ESCC patients from healthy controls, suggesting that these miRNAs are potential diagnostic biomarkers for ESCC.

Patients and Healthy Controls
The participants were 104 Han Chinese individuals, comprising 52 unrelated ESCC patients and 52 healthy controls. All ESCC cases were pathologically diagnosed for the first time and were recruited consecutively from the Endoscopy Center of Cancer Hospital, Linzhou City, Henan province, from October 2014 to October 2015. During the same period, healthy controls were randomly selected from an early EC screening program and were frequency-matched to the ESCC cases by age (±5 years) and gender. All enrolled ESCC cases and control subjects were residents of Linzhou City. This study was approved by the Institutional Review Board of Capital Medical University and was conducted in accordance with the principles of the Declaration of Helsinki. All subjects recruited for this study provided written informed consent prior to participation. Detailed characteristics of the ESCC patients and the healthy volunteers are summarized in Table 1.

Total RNA Extraction
Whole blood samples were loaded into serum collection tubes and left to stand for 1 h at room temperature, then centrifuged at 820 g for 10 min at 4 °C. The resulting serum was transferred into new tubes and centrifuged again at 16,000 g for 10 min at 4 °C.

Statistical Analysis
The intersection of the significant miRNAs selected by the different statistical methods was taken as the panel of candidate biomarkers for the diagnosis of ESCC (Figure 1). For each miRNA, the nonparametric Wilcoxon rank-sum test was used to compare expression levels between the ESCC group and the control group, yielding one raw p-value per miRNA. Adjusted p-values were then computed with the FDR method (13) and with Holm's procedure for FWER control (14); miRNAs with an adjusted p-value less than or equal to 0.05 were considered significant and retained as candidate biomarkers of ESCC. In parallel, Lasso logistic regression (15), HHSVM (16), and SESVM (17) were performed to select the most useful diagnostic biomarkers from all ESCC-associated miRNAs based on the microarray data. The selected biomarkers were then integrated in logistic regression and SVM models to construct multi-miRNA classifiers for predicting ESCC. The performance of the multi-miRNA classifiers was measured by classification accuracy and the area under the receiver operating characteristic (ROC) curve (AUC). Sensitivities and specificities of the multi-miRNA classifiers were determined at the cutoff with the highest Youden index. Cross-validation was applied to evaluate the diagnostic performance.
Each split of the data set in cross-validation assigned 2/3 of the samples to training and 1/3 to testing. The heatmap of the significant miRNAs identified by the FDR method was plotted with MultiExperiment Viewer (MeV, version 4.9, TM4, Boston, MA, USA). Each significant miRNA was standardized independently by a Z-score transformation, which scales the log base 2 expression levels to a mean of zero and a standard deviation of one. All statistical analyses were performed with R software (version 3.4.1, R Foundation for Statistical Computing, Vienna, Austria). The R packages "glmnet" and "gcdnet" were used to implement the biomarker screening by Lasso logistic regression, HHSVM, and SESVM. The package "e1071" was used to perform SVM, and the package "pROC" was employed to plot the ROC curve and to determine the AUC.
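The two multiplicity adjustments described above can be illustrated with a short sketch. It is written in Python rather than the R used in the study, and it assumes that the FDR method of reference (13) is the Benjamini-Hochberg step-up procedure; Holm's procedure is named in the text. The function names `holm_adjust` and `bh_adjust` are illustrative only, and the inputs are raw p-values such as those from the per-miRNA Wilcoxon rank-sum tests.

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values (controls the FWER)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # ascending p-values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        value = min(1.0, (m - rank) * pvals[idx])
        running_max = max(running_max, value)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


def bh_adjust(pvals):
    """Benjamini-Hochberg step-up adjusted p-values (assumed FDR method)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, idx in enumerate(order):
        rank = m - pos  # rank of this p-value in ascending order
        value = min(1.0, pvals[idx] * m / rank)
        running_min = min(running_min, value)  # keep adjusted values monotone
        adjusted[idx] = running_min
    return adjusted


def significant(adjusted, alpha=0.05):
    """Indices of tests whose adjusted p-value is at most alpha."""
    return [i for i, p in enumerate(adjusted) if p <= alpha]
```

Because Holm's adjustment is never smaller than the Benjamini-Hochberg adjustment for the same test, a miRNA can pass the FDR threshold and still fail the FWER threshold at the same level.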

Identification of Significant MiRNAs by Controlling FDR and FWER
For each of the 842 human miRNAs encoded within three digits, we compared the miRNA expression levels of the ESCC patients and the normal controls. By controlling the FDR and FWER at 0.05, we identified seven significantly differentially expressed miRNAs, including miR-16-5p, miR-92-3a, miR-107, miR-320c, miR-451a, miR-486, and miR-574-5p (Table 2). The heatmap of the seven differential miRNAs indicated that their expression levels were consistent within each group but clearly different between the two groups (Figure 2). In addition, as shown in Figure 3, the serum levels of these identified miRNAs were all significantly higher in ESCC patients than in healthy controls.

Selection of Significant MiRNAs by Lasso Logistic Regression
We also applied Lasso logistic regression to screen these 842 human miRNAs encoded within three digits. The regularization parameter in Lasso logistic regression was selected by 10-fold cross-validation, in which the cross-validated binomial deviance and the misclassification error were used as the criteria of predictive performance (Supplementary Figure S1). In the cross-validation process, the array data were randomly split into 10 subsets. Because the selected miRNAs sometimes differed slightly from one analysis to the next, we repeated the cross-validation 100 times and retained as candidates the miRNAs that were selected more than 30 times out of the 100 cycles. To obtain a more manageable set of miRNAs, we sacrificed one more cross-validation error and finally identified six miRNAs, including miR-7b-5p, miR-107, miR-16-5p, miR-191-3p, miR-451a, and miR-574-5p, all of which had a selection frequency higher than 90%.
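The idea of keeping only miRNAs that are selected repeatedly can be sketched as follows. This is not the procedure used in the study: it draws random subsamples instead of re-running 10-fold cross-validation, and `mean_gap_selector` is a hypothetical stand-in for a fitted Lasso logistic regression, used only so that the example runs without external packages. The names `selection_frequency`, `train_fraction`, and `threshold` are likewise assumptions made for illustration.

```python
import random
from collections import Counter


def selection_frequency(samples, labels, select_features, n_repeats=100,
                        train_fraction=0.9, seed=0):
    """Fraction of random subsamples in which each feature is selected."""
    rng = random.Random(seed)
    counts = Counter()
    indices = list(range(len(samples)))
    n_train = int(round(train_fraction * len(indices)))
    for _ in range(n_repeats):
        subset = rng.sample(indices, n_train)  # subsample without replacement
        chosen = select_features([samples[i] for i in subset],
                                 [labels[i] for i in subset])
        counts.update(set(chosen))  # count each feature once per repeat
    return {feature: count / n_repeats for feature, count in counts.items()}


def mean_gap_selector(rows, labels, threshold=1.0):
    """Hypothetical selector: keep features whose group means differ by
    more than `threshold`. A real analysis would fit a penalized model."""
    cases = [r for r, y in zip(rows, labels) if y == 1]
    controls = [r for r, y in zip(rows, labels) if y == 0]
    if not cases or not controls:
        return []
    selected = []
    for j in range(len(rows[0])):
        case_mean = sum(r[j] for r in cases) / len(cases)
        control_mean = sum(r[j] for r in controls) / len(controls)
        if abs(case_mean - control_mean) > threshold:
            selected.append(j)
    return selected


def stable_features(frequencies, min_frequency=0.30):
    """Features selected in more than `min_frequency` of the repeats."""
    return sorted(f for f, v in frequencies.items() if v > min_frequency)
```

Passing the selector as a function keeps the counting logic independent of the model, so the same loop could wrap any selection method that returns a list of feature indices.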

Identification of Significant MiRNAs With Penalized SVM
The SVM is one of the most powerful classification techniques and is widely used for analyzing microarray data. Since the standard SVM cannot automatically select significant genes, we performed the HHSVM and the SESVM to identify candidate miRNA biomarkers. In these two methods, the tuning parameters that control the elastic-net penalty were selected by performing multiple rounds of 5-fold cross-validation. The margin-based loss function and the misclassification error (ME) were employed as cross-validation error criteria (Supplementary Figures S2, S3) for the HHSVM and the SESVM. We iterated the cross-validation 100 times for the high-dimensional microarray datasets and computed the selection frequency of each miRNA. MiRNAs with a selection frequency of more than 30% were considered significant. The HHSVM and the SESVM identified the same set of six miRNAs as Lasso logistic regression, including miR-7b-5p, miR-107, miR-16-5p, miR-191-3p, miR-451a, and miR-574-5p. The selection frequencies of these six miRNAs by the HHSVM and SESVM methods using the two types of cross-validation errors are summarized in Table 3. As shown in Table 3, five miRNAs were selected more than 80 times out of 100 random splits by HHSVM and SESVM.
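For orientation, the three penalized classifiers can be written as one optimization problem that differs only in the loss function. The formulation below follows the standard presentation of elastic-net penalized large-margin classifiers and is an assumption about notation, not a transcription from the study: labels are coded as y_i in {-1, +1}, x_i is the vector of miRNA expression levels for sample i, lambda_1 and lambda_2 are the elastic-net tuning parameters, and delta is the huberization parameter of the HHSVM.

```latex
\min_{\beta_0,\,\beta}\;
  \frac{1}{n}\sum_{i=1}^{n}\phi\bigl(y_i(\beta_0 + x_i^{\top}\beta)\bigr)
  + \lambda_1\lVert\beta\rVert_1
  + \frac{\lambda_2}{2}\lVert\beta\rVert_2^2,
\qquad y_i \in \{-1,+1\}

\phi_{\mathrm{HHSVM}}(t) =
\begin{cases}
0, & t > 1,\\[2pt]
\dfrac{(1-t)^2}{2\delta}, & 1-\delta < t \le 1,\\[6pt]
1 - t - \dfrac{\delta}{2}, & t \le 1-\delta,
\end{cases}
\qquad
\phi_{\mathrm{SESVM}}(t) = \bigl[(1-t)_+\bigr]^2,
\qquad
\phi_{\mathrm{logistic}}(t) = \log\bigl(1 + e^{-t}\bigr)
```

Under this notation, Lasso logistic regression corresponds to the logistic loss with lambda_2 = 0. The L1 term is what sets coefficients exactly to zero, which is why these methods select miRNAs while the standard SVM does not.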

The Diagnostic Values of Selected Serum MiRNAs in ESCC
Notably, four miRNAs, miR-107, miR-16-5p, miR-451a, and miR-574-5p, were identified as differentially expressed between ESCC patients and controls by all feature selection methods performed in this study. Although miR-107 was significant when the FDR was controlled at a level of 0.05, it was not significant when the FWER was controlled at the same level. Furthermore, miR-107 had two outliers with extremely high levels, resulting in very large standard deviations in both the ESCC and control groups. Therefore, miR-107 was excluded from further analysis and only the remaining three miRNAs, miR-16-5p, miR-451a, and miR-574-5p, were defined as potential biomarkers for the diagnosis of ESCC.
To evaluate the efficiency of this novel panel of three biomarkers in the detection of ESCC, logistic regression and SVM were applied to develop three-miRNA-based classifiers for predicting ESCC. We also built classifiers on the panel of three significant miRNAs together with the main clinical characteristics, including age, sex, smoking status, and drinking status. We randomly split the miRNA profile data into training and testing sets 100 times. In each split, 35 ESCC patients and 35 healthy controls were randomly assigned to the training group and the remaining 17 ESCC patients and 17 controls formed the testing group; the ROC curves were plotted and the AUCs were calculated with logistic regression, linear SVM, and SVM with the Radial Basis Function kernel (Figure 4 and Supplementary Figure S4). As shown in Table 4, the average AUCs and average accuracies of the three classifiers with the three significant miRNAs all exceeded 0.80 and 0.79, respectively; the average sensitivities and specificities of the two methods were both larger than 0.74.
Furthermore, the diagnostic performance of the three-miRNA-based classifiers was very similar to, and in some settings only slightly worse than, that of the classifiers built on the three significant miRNAs plus the main clinical characteristics. These results indicated that the panel of three miRNAs had high predictive accuracy for distinguishing ESCC patients from the healthy group, with relatively high sensitivity and specificity.
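The evaluation metrics used above can be made concrete with a minimal sketch. It assumes that each classifier outputs a continuous score for every test sample (for example a predicted probability), with label 1 for ESCC and 0 for control; the function names are illustrative and the study itself used the R package "pROC" for these computations.

```python
import random


def auc_score(scores, labels):
    """AUC as the probability that a random case outscores a random
    control, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_cutoff(scores, labels):
    """Return (threshold, sensitivity, specificity) maximizing the Youden
    index J = sensitivity + specificity - 1; predict case if score >= threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    best = None
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
        sens, spec = tp / n_pos, tn / n_neg
        if best is None or sens + spec - 1 > best[0]:
            best = (sens + spec - 1, threshold, sens, spec)
    return best[1:]


def stratified_split(labels, n_train_per_class, rng):
    """Randomly assign n_train_per_class cases and controls to training;
    all remaining samples form the testing set."""
    train = []
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        train.extend(rng.sample(members, n_train_per_class))
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test
```

With 52 cases and 52 controls, calling `stratified_split(labels, 35, random.Random(seed))` once per repetition reproduces the 35/17 per-group split sizes described in the text; averaging `auc_score` over 100 such splits gives the kind of summary reported in Table 4.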
Multiple comparisons of the levels of the three miRNAs among patients with ESCC stage I-II (ESCC I-II), patients with ESCC stage III-IV (ESCC III-IV), and the healthy group were further performed (Figure 5A and Supplementary Table S1). In both patient subgroups, the levels of miR-16-5p, miR-451a, and miR-574-5p were significantly increased compared with the healthy group (p < 0.005 for all three miRNAs). The diagnostic performance of the miRNA panel in different ESCC stages was further evaluated by using 100 rounds of cross-validation (

DISCUSSION
In this study, we systematically investigated the serum miRNA profiles of a relatively large cohort of 52 ESCC patients and 52 well-matched healthy controls using a high-throughput microarray-based assay. Using the FDR method, the FWER method, Lasso logistic regression, HHSVM, and SESVM together, this study identified three miRNAs, miR-16-5p, miR-451a, and miR-574-5p, as potential biomarkers for the diagnosis of ESCC. The expression levels of these three miRNAs were significantly up-regulated in the serum of ESCC patients compared with normal controls. We developed classifiers based on these three miRNAs with logistic regression and SVM models to predict ESCC. The classification accuracy and AUC were used to assess the performance of the three miRNAs as a diagnostic tool for detecting ESCC. The average AUCs and classification accuracies based on the different classifiers were higher than 0.80 and 0.79, respectively, indicating that the three-miRNA-based panel had a high potential to distinguish ESCC patients from healthy controls.
This study has major strengths. To the best of our knowledge, this is the first study to identify potential serum miRNAs as diagnostic biomarkers in ESCC from microarray data on more than 100 samples, which increased the statistical power of each test and the sensitivity of quantitative detection of the serum miRNAs. Furthermore, five representative feature selection algorithms, all appropriate for high-dimension, low-sample-size data analysis, were applied together to select significant miRNAs with high credibility. Lastly, validation with 100 random splits of the samples showed that the miRNA-based classifiers were powerful for the diagnosis of ESCC.
The biological functions of the three miRNAs identified in our study have been investigated in previous studies. MiR-16 was one of the earliest miRNAs found to be involved in cancers; it could suppress apoptosis and promote cell growth by downregulating reversion-inducing-cysteine-rich protein with Kazal motifs (RECK) and sex-determining region Y-box (Sox) 6, two genes that play important roles in the pathogenesis of ESCC (18). MiR-451 is a key factor in regulating erythroid differentiation and in maintaining homeostasis of erythroid cells (19). It has also been associated with cell proliferation, migration and apoptosis via regulating different target genes (20), such as calcium-binding protein 39 (CAB39) (21), Ras-related protein 14 (RAB14) (22), and macrophage migration inhibitory factor (MIF) (23). MiR-574-5p has been reported to act as an oncogene in various types of cancers, including ESCC (24,25).
This study has some limitations. First, the microarray experiment in this study was not designed specifically to identify biomarkers distinguishing between ESCC and normal esophagus, but rigorous dimensionality reduction techniques were employed to strengthen its ability to screen miRNAs. Second, although our study design is a population-based case-control study, potential drawbacks such as selection bias may still occur, because the ESCC cases were selected from the Cancer Hospital of Linzhou City, which accounted for more than 60% of the esophageal cancer patients in the study population. However, the healthy controls were recruited from the general population in Linzhou City. Moreover, this case-control study design does not permit causal inference about the relationship between miRNA expression levels and ESCC. Third, the levels of the three candidate miRNAs selected from the microarray were not further quantified by quantitative real-time PCR (qRT-PCR). However, miRNA microarray expression has been shown to be highly concordant when re-analyzed with qRT-PCR (26), with correlation coefficients ranging from r = 0.986 to 0.994, depending on the normalization method (27). Further validation of the newly identified miRNAs in a larger independent population is necessary before they can be used in clinical applications.

CONCLUSION
In summary, five feature selection algorithms were jointly applied to select common serum miRNA signatures for the diagnosis of ESCC based on microarray data. Three serum miRNAs were identified, and classifiers based on their combination were constructed with logistic regression and the SVM. The cross-validation results showed that the three-miRNA-based classifiers can accurately distinguish ESCC patients from healthy controls. This study provides a resource and an impetus for further investigation of these novel serum miRNAs as biomarkers in ESCC.