Early Diagnosis of Pancreatic Ductal Adenocarcinoma by Combining Relative Expression Orderings With Machine-Learning Method

Pancreatic ductal adenocarcinoma (PDAC) is an aggressive and lethal cancer deeply affecting human health. Diagnosing early-stage PDAC is the key point to PDAC patients’ survival. However, the biomarkers for diagnosing early PDAC are inexact in most cases. Therefore, it is highly desirable to identify an effective PDAC diagnostic biomarker. In the current work, we designed a novel computational approach based on within-sample relative expression orderings (REOs). A feature selection technique called minimum redundancy maximum relevance was used to pick out optimal REOs. We then compared the performances of different classification algorithms for discriminating PDAC and its adjacent normal tissues from non−PDAC tissues. The support vector machine algorithm is the best one for identifying early PDAC diagnostic biomarker. At first, a signature composed of nine gene pairs was acquired from microarray gene expression data sets. These gene pairs could produce satisfactory classification accuracy up to 97.53% in fivefold cross-validation. Subsequently, two types of data from diverse platforms, namely, microarray and RNA-Seq, were used to validate this signature. For microarray data, all (100.00%) of 115 PDAC tissues and all (100.00%) of 31 PDAC adjacent normal tissues were correctly recognized as PDAC. In addition, 88.24% of 17 non-PDAC (normal or pancreatitis) tissues were correctly classified. For the RNA-Seq data, all (100.00%) of 177 PDAC tissues and all (100.00%) of 4 PDAC adjacent normal tissues were correctly recognized as PDAC. Validation results demonstrated that the signature had a good cross-platform effect for early detection of PDAC. This work developed a new robust signature that might be a promising biomarker for early PDAC diagnosis.

Pancreatic ductal adenocarcinoma (PDAC) is an aggressive and lethal cancer deeply affecting human health. Diagnosing early-stage PDAC is the key point to PDAC patients' survival. However, the biomarkers for diagnosing early PDAC are inexact in most cases. Therefore, it is highly desirable to identify an effective PDAC diagnostic biomarker. In the current work, we designed a novel computational approach based on withinsample relative expression orderings (REOs). A feature selection technique called minimum redundancy maximum relevance was used to pick out optimal REOs. We then compared the performances of different classification algorithms for discriminating PDAC and its adjacent normal tissues from non−PDAC tissues. The support vector machine algorithm is the best one for identifying early PDAC diagnostic biomarker. At first, a signature composed of nine gene pairs was acquired from microarray gene expression data sets. These gene pairs could produce satisfactory classification accuracy up to 97.53% in fivefold cross-validation. Subsequently, two types of data from diverse platforms, namely, microarray and RNA-Seq, were used to validate this signature. For microarray data, all (100.00%) of 115 PDAC tissues and all (100.00%) of 31 PDAC adjacent normal tissues were correctly recognized as PDAC. In addition, 88.24% of 17 non-PDAC (normal or pancreatitis) tissues were correctly classified. For the RNA-Seq data, all (100.00%) of 177 PDAC tissues and all (100.00%) of 4 PDAC adjacent normal tissues were correctly recognized as PDAC. Validation results demonstrated that the signature had a good cross-platform effect for early detection of PDAC. This work developed a new robust signature that might be a promising biomarker for early PDAC diagnosis.

INTRODUCTION
Pancreatic ductal adenocarcinoma (PDAC) is one of the deadliest malignant carcinomas and it accounts for at least 95% of all pancreatic cancer cases (Tanaka, 2016). PDAC has a poor survival outcome (Zhang et al., 2018b) by reason of the difficulty of diagnosing and assessing PDAC at an early stage. Most patients with PDAC do not present any specific early characteristics during the early stage, which means that early PDAC cannot be detected timely and thus causes missed chances for surgery. At present, the most commonly and widely used tumor biomarker for early PDAC diagnosis is carbohydrate antigen 19-9 (CA19-9) (Goggins, 2005), but it is not an ideal biomarker because of its relatively low level of sensitivity and specificity (70% with a 5% error rate, for diagnosis of PDAC) (Goonetilleke and Siriwardena, 2007;Datta and Vollmer, 2014). Therefore, a reliable signature with exquisitely high sensitivity and specificity is urgently needed to facilitate early PDAC diagnosis.
The main shortcoming of the existing diagnostic signatures is that they are basically obtained by using signature genes' absolute expression value (Klett et al., 2018;Liao et al., 2018;Lu et al., 2018;Cheng et al., 2019b;Zou and Ma, 2020). Therefore, the batch effects could influence the choice of diagnostic signatures. Luckily, we could obtain diagnostic signatures with qualitative transcriptional information through exploiting the relative expression ordering (REO) method. The REO method is highly robust to experimental batch effects (Eddy et al., 2010;Cai et al., 2015;Zhao et al., 2016) and platform differences (Guan et al., 2016;Cheng, 2019). Therefore, it is possible to find robust and reliable disease signatures by using the datasets integrated from different platforms. Moreover, the REO strategy has been successfully used to identify the early diagnosis signature of malignant carcinoma, such as gastric cancer , hepatocellular carcinoma (Ao et al., Abbreviations: PDAC, pancreatic ductal adenocarcinoma; REOs, relative expression orderings; mRMR, maximum relevance minimum redundancy; IFS, incremental feature selection; SVM, support vector machine. 2018), and colorectal cancer (Guan et al., 2019). Consequently, it is worth employing the within-sample REO method to develop a robust qualitative signature for diagnosing earlystage PDAC.
Machine-learning techniques, which can be used to uncover biological principles and mechanism, is a good choice for biological knowledge mining (Liu et al., 2013Cao et al., 2017;Cheng et al., 2018a,b;Du et al., 2018;Zou et al., 2018;Stephenson et al., 2019). Hence, this work was devoted to develop an artificial intelligence-based approach to identify early-stage PDAC diagnostic signature. In the first step, all REOs were used for initial diagnosis descriptor. Subsequently, the minimum redundancy maximum relevance (mRMR), a features selection technique, was utilized to remove redundant REOs. The support vector machine (SVM), decision tree, logistic regression, random forest, naïve Bayes, and Bayes net algorithms were used for classification. Finally, 9 salient and genuine gene pairs including 16 genes were screened as the diagnostic signature for diagnosing early-stage PDAC. The nine gene pairs' signature displayed good diagnosis performance for early-stage PDAC in different diagnosis platforms by combining with SVM.

The Construction of Datasets
The microarray gene expression data and RNA-seq data used in current paper were collected from two Total 573    databases: the GEO database 1 and TCGA database 2 .
The detailed description of all data sets is elucidated in Table 1. Microarray data performed by the platform of Affymetrix and Illumina were freely downloaded from the GEO database. It contained 573 PDAC samples (Set1), 153 PDAC adjacent normal samples (Set2), 10 pancreatitis samples (Set3), and 74 normal samples (Set4). For data performed by Affymetrix, the raw data were directly downloaded from GEO and then the robust multi-array averaging (Bolstad et al., 2003;Irizarry et al., 2003a,b) was used to do the background correction and normalization. For data performed by Illumina, the originally processed data (series matrix files) were used. For all microarray data, the mapping information of probe IDs and Entrez gene IDs can be found in the corresponding platform files. For one gene with multiple probes, we used the arithmetic mean of these probes' values as this gene's expression value.
The RNA-Seq data set included 177 PDAC and 4 adjacent normal samples. We downloaded the free RNA-Seq profiles from TCGA (up to November 19, 2019) website using the TCGAbiolinks R package (Colaprico et al., 2016). The gene symbol expression matrix was obtained by using Ensembl gene IDs. To make the evaluation of the model more objective, each category of samples (Set1, Set2, Set3, Set4) were divided into two subsets of data: training set (80% of each category of samples) and testing set (20% of each category of samples). Ultimately, the training set contained 580 tumor samples (458 PDAC samples and 122 PDAC adjacent normal samples) and 67 non-tumor samples (59 normal samples and 8 pancreatitis samples). The testing set contained 146 tumor samples (115 PDAC samples and 31 PDAC adjacent normal samples) and 17 nontumor samples (15 normal samples and 2 pancreatitis samples) for the performance evaluation of the signature. In addition, the RNA-Seq data set and testing set belong to the independent test data sets.

Relative Expression Orderings (REOs)
To obtain a more robust and reliable signature from gene expression profiles, REO methodology was utilized for feature construction. The REO for gene pair (i and j) is formulated as Gi > Gj or Gi < Gj, where Gi and Gj represent the expression values of gene i and j. For the gene pair, if more than 85% of the samples have the same REO, we deem this REO as a stable REO of this gene pair. The stable reversal gene pairs represent the gene pairs that have stable REOs in both tumor tissues and control tissues, but REO patterns are different (Gi > Gj or Gi < Gj in tumor tissues but Gi > Gj or Gi < Gj in non-tumor tissues). Also, the reversal stable gene pairs between tumor and control samples were chosen as the candidate REO-based qualitative diagnostic signature. We later gained the consistent genes of all preprocessed data sets and its corresponding gene expression profiles. Whereafter, based on the reversal gene pairs and gene expression profiles, we gained new profiles by using 0, 1, and −1 to denote Gi > Gj, Gi < Gj, and other cases (Gi or Gj does not exist), respectively.

Minimum Redundancy Maximum Relevance (mRMR)
The mRMR (Peng et al., 2005) approach can omit the redundant features and choose the high-relevancy features to the target class, and thus significantly improve the classification accuracy. mRMR is on the base of information theory and it can be The absolute expression value of gene i is higher than that of gene j in PDAC patients compared with non-PDAC patients.
accomplished through mutual information (MI) operation, and the MI is formulated as follows: where v represents the feature vector and C represents the class to be targeted. The mRMR is estimated as where denotes the set of ranked features, MI (v i , C) denotes mutual information between the v i feature and class C, and IM (v i ,v j ) denotes mutual information between v i and v j . In this work, a reversal stable gene pair was considered as a feature. The feature selection process was essential for exact classification between tumor samples (positive samples) and non-tumor samples (negative samples). Thus, we utilized mRMR method to pick out effective features (gene pairs).

Incremental Feature Selection (IFS)
Based on mRMR techniques, we gained a list of ranked features (gene pairs). The incremental feature selection (IFS)  strategy was adopted to find the optimal feature subset which could produce the best diagnosis for PDAC. During IFS process, the gene pair was added one by one to feature subset and the optimal features (gene pairs) were determined when the highest accuracy was obtained.

Classification Algorithms
As a popular supervised learning approach, SVM was first introduced by Vapnik and has been widely used in various bioinformatics classification problems (Song et al., 2009;Shoombuatong et al., 2012;Win et al., 2017Win et al., , 2018Chen et al., 2019b;Laengsri et al., 2019;Manavalan et al., 2019b;Schaduangrat et al., 2019;Hasan et al., 2020;Liu and Chen, 2020). Herein, the free LibSVM (version 3.23) package (Chang and Lin, 2011) was employed to execute SVM. The LibSVM with fivefold cross-validation and radial basis function was employed to perform classification. The grid search with fivefold cross-validation was used to determine the C and γ values for SVM. As a result, we obtained the optimal values 32 and 0.03125 for C and γ, respectively. Apart from SVM, decision tree, logistic regression, random forest, naïve Bayes, and Bayes net were also utilized as classification algorithm and performed by using Weka (version 3.8.3) (Frank et al., 2004). Within this research, the aforementioned six classification algorithms with fivefold cross-validation were used.

Performance Measurements
In the current paper, six indexes were used to measure the effectiveness of our model. They are accuracy (ACR), sensitivity (SES), specificity (SPF), Matthews correlation coefficient (MCC) Bao et al., 2019;Chen et al., 2019a;Cheng et al., 2019aCheng et al., , 2020, the receiver operating characteristic (ROC) curves, and the area under the ROC curve (AUC). Especially, taking into consideration the class imbalance of tumor tissues and non-tumor tissues, we appointed MCC as the major performance measurement in this work. The details about ACR, SES, SPF, and MCC can be found from Tang

Derivation of PDAC Diagnostic Signature
The whole procedure of deriving the diagnostic signature is provided in Figure 1. First, with the relative expression orderings elaborated in Materials and Methods section, for 458 PDAC samples and 122 PDAC adjacent normal samples in the training set, there were 30,865,512 and 49,177,748 stable gene pairs, respectively. Also, there were 17,842,291 consistent stable gene pairs in total. Likewise, for 8 pancreatitis samples and 59 normal samples in the training set, there were 53,719,117 and 44,523,890 stable gene pairs, respectively. There were 25,687,362 consistent stable gene pairs in total. Among 17,842,291 and 25,687,362 gene pairs, there were 16 stable reversal gene pairs between the two sets of samples. Then, on the basis of the novel profiles (see Materials and Methods), we captured the optimal feature set from the 16 gene pairs by using mRMR with SVM, decision tree, logistic regression, random forest, naïve Bayes, and Bayes net. The comparison results of the aforementioned six classification algorithms are listed in Table 2. It was obvious that the SVM algorithm was the best one for identifying early PDAC diagnostic biomarker. The accuracy, sensitivity, specificity, and MCC of SVM were, respectively, 97.53%, 97.96%, 93.22% and 0.8615. Therefore, the final model used for early PDAC diagnostic biomarker identification was built based on SVM algorithm. The blue curve in Figure 2 displayed the process of IFS method. As we could see from  signature could identify PDAC with up to 97.53% accuracy on training set. That is to say, nine gene pairs illustrated in Table 3 were deemed as the optimal signature for diagnosing the earlystage PDAC.

Examination of the Signature
We then assessed the classification ability of nine gene pairs in independent test data sets, and the test results with fivefold cross-validation are shown in  Figure 3). According to independent tests on testing set and RNA-Seq data set, it was concluded that the signature could discriminate PDAC (PDAC and adjacent normal tissues) patients from non-PDAC (pancreatitis and normal tissues) patients.

DISCUSSION
Pancreatic carcinoma is a life-threatening malignant tumor of the digestive system with bad prognosis due to late diagnosis. The current imaging techniques and existing tumor signatures have insufficient sensitivity and/or specificity for early PDAC diagnosis. Herein, new strategies for diagnosis at an early stage of the disease are urgently needed. In the current work, we found a robust qualitative diagnostic signature 9 gene pairs (16 genes), which can discriminate PDAC (PDAC and adjacent normal tissues) patients from non-PDAC (pancreatitis and normal tissues) and might be a promising biomarker for early diagnosis of PDAC. Database PubMed was searched and retrieved appropriate journal articles on the association between 16 genes in 9 gene pairs and PDAC published before August 18, 2020. Seven genes in the nine gene pairs' signature, including UBE2C, SERPINB5, LAMC2, CTSE, HOXB7, RRM2, and ONECUT1, had been reported to be related to PDAC. The description of the association between seven genes and PDAC is displayed in Table 5. They might play a vital role in PDAC tumorigenesis and were critical genes for cancer. Notably, CTSE (Keliher et al., 2013), HOXB7 (Chile et al., 2013;Nguyen Kovochich et al., 2013), and RRM2 (Bhutia et al., 2013) were overexpressed in PDAC. UBE2C could encode a ubiquitin-conjugating enzyme which correlated with the PDAC development and progression. Also, the proliferation and epithelial-mesenchymal transition in PDAC could be inhibited by silencing UBE2C (Wang et al., 2019). SERPINB5 had been found to link to the prognosis of PDAC (Cheng et al., 2019b). LAMC2 has relation with the occurrence and progression of PDAC patients (Pan et al., 2018;Yang et al., 2018;Zhang et al., 2018a). Furthermore, the high expression level of LAMC2 could facilitate the invasion of PDAC cell and thus increase the risk of tumor recurrence (Yang et al., 2018). Patients with pancreatic diseases (chronic pancreatitis) had a higher risk of developing PDAC and thus the expression of CTSE in pancreatic diseases might be the key to early PDAC detection and PDAC progression. HOXB7, frequently overexpressed in PDAC, closely connected with lymph node metastasis (Nguyen Kovochich et al., 2013) and worse survival in PDAC patients (Zhang et al., 2014). Knockdowning HOXB7 could cause cell apoptosis and cell cycle arrest (Chile et al., 2013). RRM2   (Wang et al., 2019) was involved in the process of deoxyribonucleotide synthesis. Gene expression of RRM2 was significantly higher in PDAC tissues than in normal pancreatic tissues, which brought about the chemoresistance of PDAC to nucleoside analogs (Bhutia et al., 2013). A loss of ONECUT1 expression in PDAC cells implied its tumor suppressor function in this malignant tumor (Jiang et al., 2008).
To further study the detailed information and functions of the 9 gene pairs, we analyzed 16 genes (9 gene pairs) via using online tools in Metascape 3 (Tripathi et al., 2015). The enrichment analysis included GO terms functional enrichment and KEGG pathway enrichment. Pathways with P-value were less than 0.05 and the number of enriched genes greater than or equal to 3 was considered significant. Ultimately, based on the GO enrichment, the 16 genes (9 gene pairs) enriched in two terms in the category BP, including "regulation of cell cycle process" and "regulation of mitotic nuclear division." UBE2C, RRM2, TNKS, NUSAP1, and MYO19 were included in the genes enriched in regulation of cell cycle process, whereas TNKS, UBE2C, NUSAP1, and MYO19 enriched in the regulation of mitotic nuclear division. Collecting the aforementioned results, the genes of the nine gene pairs might play a significant part in the tumorigenesis of PDAC.
In conclusion, we had identified nine gene pairs' signature for early-stage PDAC diagnosis that could correctly distinguish PDAC (PDAC and PDAC adjacent normal tissues) tissues from non−PDAC (normal and pancreatitis tissues) patients at individual level. Because the number of normal and pancreatitis samples used in the current work for distinguishing early-stage PDAC is relatively small, we will try to collect more samples from more public databases to further obtain a novel diagnostic signature with higher accuracy on larger numbers of such specimens. Moreover, we hope that some RNA signature (Fang et al., 2019;Vaschetto, 2019;Wu et al., 2019) can be found and applied in related fields.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: https://portal.gdc.cancer.gov/ and http://www. ncbi.nlm.nih.gov/geo/.