Identifying Drug Targets in Pancreatic Ductal Adenocarcinoma Through Machine Learning, Analyzing Biomolecular Networks, and Structural Modeling

Pancreatic ductal adenocarcinoma (PDAC) is one of the leading causes of cancer-related death and has an extremely poor prognosis. Thus, identifying new disease-associated genes and targets for PDAC diagnosis and therapy is urgently needed. This requires investigations into the underlying molecular mechanisms of PDAC at both the systems and molecular levels. Herein, we developed a computational method of predicting cancer genes and anticancer drug targets that combined three independent expression microarray datasets of PDAC patients and protein-protein interaction data. First, Support Vector Machine–Recursive Feature Elimination was applied to the gene expression data to rank the differentially expressed genes (DEGs) between PDAC patients and controls. Then, protein-protein interaction networks were constructed based on the DEGs, and a new score comprising gene expression and network topological information was proposed to identify cancer genes. Finally, these genes were validated by “druggability” prediction, survival and common network analysis, and functional enrichment analysis. Furthermore, two integrins were screened to investigate their structures and dynamics as potential drug targets for PDAC. Collectively, 17 disease genes and some stroma-related pathways including extracellular matrix-receptor interactions were predicted to be potential drug targets and important pathways for treating PDAC. The protein-drug interactions and hinge sites predication of ITGAV and ITGA2 suggest potential drug binding residues in the Thigh domain. These findings provide new possibilities for targeted therapeutic interventions in PDAC, which may have further applications in other cancer types.


INTRODUCTION
Pancreatic ductal adenocarcinoma (PDAC) is one of the most malignant solid tumors (Bailey et al., 2016). PDAC is difficult to treat due to the stage of diagnosis, severe cachexia and poor metabolic status, the resistance of cancer stem cells (CSCs) to current drugs, and the marked desmoplastic response that facilitates growth and invasion, provides a physical barrier to therapeutic drugs, and prevents immunosurveillance (Al Haddad and Adrian, 2014). PDAC is also a drug-resistant disease, and the response of pancreatic cancer to most chemotherapy drugs is poor. Until now, most of research effort in PDAC has been directed at identifying the important disease-driving genes and pathways (Waddell et al., 2015). These studies have shown that KRAS, CDKN2A, TP53, and SMAD4 are the four most common driver genes in PDAC (Carr and Fernandez-Zapico, 2019). With the development of multi-omics data, a series of new regulators that are strongly correlated with survival have been proposed to be PDAC biomarkers (Rajamani and Bhasin, 2016;Mishra et al., 2019), including genes (e.g., IRS1, DLL1, HMGA2, ACTN1, SKI, B3GNT3, DMBT1, and DEPDC1B) and lncRNAs (e.g., PVT1 and GATA6-AS). The integrated transcriptomic analysis of five PDAC datasets identified four-hub gene modules, which were used to build a diagnostic risk model for the diagnosis and prognosis of PDAC (Zhou et al., 2019). Integrated genomic analysis of 456 PDAC cases identified 32 recurrently mutated genes that aggregate into 10 pathways: KRAS, TGF-b, WNT, NOTCH, ROBO/SLIT signaling, G1/S transition, SWI-SNF, chromatin modification, DNA repair, and RNA processing (Bailey et al., 2016). Previous treatments for pancreatic cancer have focused on targeting some of these PDAC-associated pathways, including TGFb (Craven et al., 2016), PI3K (Conway et al., 2019), Src (Parkin et al., 2019), and RAF!MEK!ERK (Kinsey et al., 2019) and NFAT1-MDM2-MDMX  signaling, as well as cell-cell communication within the tumor microenvironment (Shi et al., 2019). The discovery of novel drug targets provides extremely valuable resource towards the discovery of drugs. Although the human genome comprises approximately 30,000 genes, proteins encoded by fewer than 400 are used as drug targets in disease treatments. A range of therapeutic targets in PDAC have been proposed, including suppressing the abovementioned genes and pathways (Tang and Chen, 2014). However, the current drug targets for PDAC will not be 100% effective due to the heterogeneous nature of the disease. To tackle this challenge, a complete understanding of the molecular mechanism of PDAC is urgently needed.
Improving PDAC therapy will require a greater knowledge of the disease at both the systems and molecular levels. At the systems level, protein-protein interaction (PPI) networks provide a global picture of cellular function and biological processes (BPs); thus, the network approach is used to understand the molecular mechanisms of disease, particularly in cancer (Conte et al., 2019;Sonawane et al., 2019). Some proteins act as hub proteins that are highly connected to others, thus cancer drug targets can be predicted by hubs in PPI networks Lu et al., 2018;Zhu et al., 2019). However, there are some conflicting results that suggest disease genes or drug targets have no significant degree of prominence (Mitsopoulos et al., 2015), but higher betweenness, centrality, smaller average shortest path length, and smaller clustering coefficient (Zhao and Liu, 2019). Recent advances in systems biology have led to a plethora of new network-based methods and parameters for predicting essential genes , disease genes, and drug targets (Csermely et al., 2013;Vinayagam et al., 2016;Zhang et al., 2017;Fotis et al., 2018;Liu et al., 2018). Additionally, the structural annotation of PPI networks that has highlighted key residues has enriched the fields of both systems biology and rational drug design (Kar et al., 2009;Winter et al., 2012). The prediction of binding sites, allosteric sites, and genetic variations based on systems-level data is critical for suggesting therapeutic approaches to complex diseases and personalized medicine (Duran-Frigola et al., 2013;Yan et al., 2018). Combined with PPI network analysis, molecular docking studies of target genes can further help to find drug molecules and protein-drug interactions for lung adenocarcinoma (Selvaraj et al., 2018).
Together with advances in "-omics" data, including gene expression and PPI data, machine learning (ML), and artificial intelligence (AI) techniques are powerful tools that can assess gene and protein "druggability" from such massive and noisy datasets (Kandoi et al., 2015;Zhavoronkov, 2018). As the most used ML method, support vector machine (SVM) has been used for cancer genomic classification or subtyping, which may be useful for obtaining a better understanding of cancer driver genes and discovering new biomarkers and drug targets . ML-based methods have been applied to study PDAC for different purposes. By applying ML algorithms to proteomics and other molecular data from The Cancer Genome Atlas (TCGA), two subtypes of pancreatic cancer can be classified (Sinkala et al., 2020). A meta-analysis of PDAC microarray data could help predict biomarkers that can be used to build AI-based computational predictors for classifying PDAC and normal samples , as well as predicting sample status (Almeida et al., 2020). To predict and validate novel drug targets for cancer, including PDAC, a ML-based classifier that integrates a variety of genomic and systems datasets was built to prioritize drug targets (Jeon et al., 2014).
In this study, we developed a computational framework that integrates various types of high-throughput data, including transcriptomics, interactomics, and structural data, for the genome-wide identification of therapeutic targets in PDAC. A novel centrality metric, referred to as SVM-REF and Network topological score (RNs), was proposed for the identification of disease genes and drug targets. This method incorporates gene expression and network topology information from ML and PPI analyses. Moreover, the predicted genes were validated by "druggability" prediction, survival, and comparative network analyses, as well as functional enrichment analysis. Finally, the structural and dynamic properties of two integrins (ITGAV and ITGA2) as drug targets were investigated. The workflow of these methods is shown in Figure 1.

Identification of DEGs
In this study, three independent PDAC expression microarray datasets with 184 pancreas samples (95 cancer and 89 nonmalignant samples) were used. The datasets were obtained from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih. gov/geo/). Details of each dataset are listed in Table 1. The GSE15471 dataset included 36 PDAC samples and matching normal pancreas samples from pancreatic cancer patients in Romania (Badea et al., 2008). There were also matched samples in the GSE28735 dataset, which contains gene expression profiles of 45 matched pairs of pancreatic tumor and adjacent non-tumor tissues from PDCA patients in Germany (Zhang et al., 2012;Zhang et al., 2013). The GSE71989 dataset contained expression profiles of eight normal pancreas and 14 PDAC tissues (Jiang et al., 2016). The normalized data were downloaded from GEO and then analyzed to identify DEGs using t-tests, with p-values adjusted by the Benjamini-Hochberg method. Only genes with adjusted p-values < 0.01 and |FC| > 1.5 were chosen as DEGs.

Gene Prioritization Pipeline
Disease genes and drug targets usually have large degree in PPI networks, but there is no single network parameter that can accurately predict them (Li et al., 2016). Protein targets do not exert their function in isolation; rather they are affected by interactions within their PPI network, which are governed by protein localization and environment. In the same way, topological information from PPI networks alone is not enough to identify disease genes and drug targets without biological information. To overcome these limitations, we developed a new three-step pipeline to identify cancer-related genes that may be candidate drug targets in PDAC. The pipeline integrated information from gene expression data and local and global topological characteristics of genes in PPI networks.
Step 1: For each gene expression dataset, we employed SVM methods based on a Recursive Feature Elimination (SVM-RFE) algorithm (Guyon et al., 2002), which is an embedded method to specifically deal with gene selection for cancer classification (Boloń-Canedo et al., 2014), rank DEGs, and select the most relevant features (Jeon et al., 2014). SVM-RFE can remove redundant features (genes) to generalize performance, implement backward feature elimination, search an optimal subset of genes, and provide a ranking for each gene. We ranked genes by SVM-RFE score (R s ), according the following formula: where n is the number of DEGs and r i is the rank of gene i.
Step 2: A PPI network of DEGs was constructed with the STRING database (von Mering et al., 2003;Szklarczyk et al., 2017) using scores > 0.9. The topological parameters degree and shortest path length for each gene in the PPI network were FIGURE 1 | The computational pipeline proposed in this work included three steps. Overall, a machine learning method was used to identify DEGs in PDAC, which were then combined with two parameters of the PPI network to define a new score that predicted disease genes and drug targets in PDAC. All potential targets were then further verified by other bioinformatics analyses and investigated by a "druggability" analysis of structural and dynamic properties.
calculated. The degree (K) of a node in the PPI network is the number of links attached to that node, which is one of the measures of centrality of a node in the network. The average path length (L) of node v in the network is the average length of the shortest paths between v and all other nodes and was defined as: is the length of the shortest path between nodes v and I, and n is the node number in the network.
Step 3: Finally, we incorporated Network topological properties into R s and defined a new score (RNs) for each gene as: Accordingly, this new RNs score (SVM-RFE and Network topological score) considers the cancer status of each gene by including information about gene expression and two levels of topological features in PPI networks, namely, degree K indicates the importance of the node, while the shortest path length L shows the effects from other nodes. The code for gene prioritization is freely available on GitHub for download at: https://github.com/CSB-SUDA/RNs.

PPI Network Analysis
Once the PPI network was constructed, two other analyses were performed. The first analysis was the calculation of two commonly used centrality parameters: betweenness and closeness centrality. The betweenness centrality (BC) (Freeman, 1977) of node v was defined as: where g ivj is the number of the shortest paths from i to j that pass through node v, and g ij is the number of shortest paths from i to j. The closeness (CC) of node v is the reciprocal of the average shortest path length, which was calculated as: Proteins are often incorporated into modules that can be shared between several different cellular activities. The second analysis was module detection of PPIs by integrating a Gaussian network (GN) algorithm (Newman and Girvan, 2004) and functional semantic similarity (Wang et al., 2007). In general, this involved using the GN algorithm to detect the module of PPI networks, and then applying functional semantic similarity to filter links. Thus, the genes in the detected modules not only had topological similarity, but also functional similarity.

Survival Analysis
To evaluate the prognostic value of candidate genes, a survival analysis was performed using data from the human protein atlas (Uhlen et al., 2017), which contains gene expression data and clinical information of 176 pancreatic cancer patients. P-values < 0.01 were considered significantly correlated with overall survival.

Functional Enrichment Analysis
Functional enrichment analysis, including cellular component (CC), molecular function (MF), and BP, from the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways of genes was performed using the R package cluster Profiler (Yu et al., 2012). Terms with adjusted p-value < 0.05 were considered significant.

Structural Modeling and "Druggability" Analysis
The protein structures of potential drug targets were retrieved from the Protein Data Bank (PDB) if they were available. The Swiss model (Waterhouse et al., 2018) and I-TASSER (Roy et al., 2010) were used for the structural modeling of genes if protein structures were unavailable. We choose the Swiss model when the sequence similarity between searched models was >30%; otherwise, we used I-TASSER, which predicts protein structure using modeling by iterative threading assembly. Based on model structures, Fpocket (Le Guilloux et al., 2009) was used to detect druggable pockets and calculate "druggability" scores, which were based on several physicochemical descriptors on a genomic scale. The pocket with the highest score in the entire PDB was defined as the reference druggable score. The score of each pocket was classified as: 0.0-0.5: non-druggable; 0.5-0.7: druggable; and 0.7-1.0: highly druggable.

Molecular Docking and GNM Modeling
To study the interactions and binding mode of small molecules with the potential drug targets, molecular docking was performed using AutoDock 4.2 (Khodade et al., 2007). The target, drug, and related disease information were collected from the Drug Bank database (Version 5.0) (Wishart et al., 2018) and the Therapeutic Target Database 2020 (Wang et al., 2020). A normal mode analysis of the GN model (GNM) was performed to investigate collective dynamics via the DynOmics online tool (Danne et al., 2017). The default cutoff distance of 7.3 Å between GNM model nodes was used.

Identification of Disease Genes and Drug Targets in PDAC
From the three datasets GSE28735, GSE71989, and GSE15471, we identified 3,079, 1,225, and 2,257 DEGs between PDAC and adjacent tissues, respectively. The top 10 genes with the smallest p-values are marked in Figure 2. In GSE28735, 1,724 genes showed increased expression in PDAC tissues, while 1,355 genes showed decreased expression (Figure 2A). In GSE71989, 766 genes were upregulated and 459 genes were downregulated in PDAC tissues compared with normal tissues ( Figure 2B). In GSE15471, 1713 genes were overexpressed, while 544 genes showed decreased expression in tumor tissues ( Figure 2C). Together, there were 313 common DEGs between PDAC and adjacent tissues in all three datasets ( Figure 2D).
Additionally, we evaluated gene expression as an input feature for ML and selected the most relevant genes for PDAC using SVM-RFE (Almeida et al., 2020), which provided a ranking for the genes. Then, each DEG was assigned an R s value (see Materials and Methods), which was used to further rank all genes. As an illustration, the top 100 R s values of the DEGs in each dataset are listed in Table S1. This shows that there is little overlap of results between the different datasets. This means that calculating R s based on SVM-RFE can provide information for classification, but not enough for ranking.
The DEGs were next mapped to the STRING database, which yielded a PPI network with 144 genes and 440 links ( Figure 3). Then, degree and shortest path length of each gene in the network were calculated. Finally, we ranked the genes according to our designed score RNs, which integrated these two topological parameters and was based on gene expression profile. The top 20 genes predicted based on at least two datasets were considered potential drug targets. As shown in Table 2 and  Table S2, eight genes (ADAM10, TIMP1, MATN3, PKM, APLP2, ACTN1, CALU, and VCAN) were identified in all three datasets, and nine genes (LGALS1, ITGA2, BST2, MFGE8, ITGAV, EGF, APOL1, ALB, and MSLN) were identified in two of three datasets. We propose that genes predicted by at least two datasets could serve as disease genes and/or drug targets. Taken together, 17 genes predicted by RNs score are listed in Table 3, and most have been previously reported to be PDAC-associated genes. There are only four that have not been previously associated with PDAC. This suggests that our metric RNs is useful for identifying novel disease genes and drug targets.
It is also useful to compare our results predicted by RNs with other common network parameters. The genes predicted by calculating betweenness and closeness centrality are also listed in Table S2. Among our 20 predicted potential drug targets, six and nine were also found by betweenness and closeness centrality, respectively. Notably, ADAM10, ACTN1, and TIMP1 were in all three lists, which suggested they had important roles in PDAC. Moreover, two other genes (ITGAV and ITGA2) were in the top 20 of two datasets, which suggested they should be investigated. Overall, compared with the top 20 genes predicted by these two common network parameters, our RNs parameter identified more extracellular matrix (ECM) proteins, including integrins and collagens. The other  interesting finding was that four common genes (ALB, EGF, ITGA2, and VCAN) were identified by isolating the nodes with large degrees (hubs) in PPI network construction based on other PDAC GSE datasets (Lu et al., 2018). Survival analysis was also performed to evaluate whether the expression of our 17 identified candidates was related to the prognosis of PDAC. Using Kaplan-Meier analysis with the logrank test for 176 pancreatic cancer patients from the human protein atlas (Uhlen et al., 2017), we found that higher expression levels of 11 genes were significantly correlated with decreased overall survival (p < 0.01, Figure 4). For the eight genes identified in all three datasets, five (ADAM10, PKM, APLP2, CALU, and VCAN) were associated with poor prognosis when highly expressed. The other six highly expressed genes (LGALS1, ITGA2, BST2, ITGAV, APOL1, and MSLN) associated with poor prognosis that were identified in two of three datasets are shown in Table 2. Accordingly, the survival analysis showed significant prognostic values for most of the predicted genes. Table 3 shows the genes predicted above shortlisted based on our RNs criteria. After searching the drug bank, these 17 predicted genes were classified into two types: 11 genes were drug targets, while six were non-drug targets. We also annotated drug targets in the drug bank by their related drugs and diseases. It should be noted that MSLN was the only proven drug target for PDAC, and there are many drugs that inhibit ALB. Thus, we concluded that these two genes had been studied widely and would not give us more insight regarding discovering new targets. Considering the potential of other predicted genes as drug targets for PDAC, we performed functional and "druggability" annotations for all. Among the 15 genes, 11 (ADAM10, TIMP1, EGF, APLP2, ITGAV, VCAN, ITGA2, PKM, APOL1, ACTN1, and BST2) have been reported to be contributing factors in PDAC invasion, growth, or metastasis, which indicated that our pipeline had good performance for finding potential drug targets for PDAC.

Characterization of Predicted Drug Targets for PDAC
The protease ADAM10 was predicted as the highest ranked gene, and it has been reported that ADAM10 influences the  FIGURE 3 | Potential drug targets in the PPI network. The genes that were predicted by our pipeline are marked with red labels. The node size denotes the average RNs of the gene in two or three datasets.
FIGURE 4 | Kaplan-Meier survival curves of overall survival from the human protein atlas datasets for potential drug targets divided by high (red) or low (green) expression level. *"YES" means drug target, and "NO" means non-drug target; # "NA" means no drug and disease information, or no druggable pockets.
progression and metastasis of cancer cells, as it promotes PDAC cell migration and invasion (Gaida et al., 2010). Inhibiting ADAM10 could be a novel approach for natural killer (NK) cell-based immunotherapy (Pham et al., 2017). Tissue inhibitor of metalloproteinases-1 (TIMP-1) correlated with tumor progression, and elevated levels of TIMP-1 in tumor tissue and peripheral blood were associated with poor clinical outcomes in numerous malignancies, including PDAC (Prokopchuk et al., 2018). The third gene was epidermal growth factor (EGF), which was a common disease gene for many cancers, and EGF mutations were associated with PDAC (Grapa et al., 2019). Amyloid precursor-like protein 2 (APLP2) affects the actin cytoskeleton and also increases PDAC growth and metastasis (Pandey et al., 2015). ITGAV (Villani et al., 2019), VCAN (Skandalis et al., 2006), and ITGA2 (Nones et al., 2014) are matrix proteins that have been shown to contribute to pancreatic cancer cell migration, invasion, and metastasis. PKM2 is one of the isoforms of pyruvate kinase muscle isozyme (PKM) and promotes the invasion and metastasis of PDAC through the phosphorylation and stabilization of PAK2 (Cheng et al., 2018). The final three genes, APOL1 (Liu et al., 2017), ACTN1 (Rajamani and Bhasin, 2016), and BST2 (Grutzmann et al., 2005) have previously been reported to be effective biomarkers for PDAC. Although 11 genes were already known drug targets, "druggability" annotations based on protein structures can improve our knowledge and understanding of the mechanisms of proteins as drug targets. The "druggability" of proteins is a measure of their ability to bind drug-like molecules based on molecular shapes. For the "druggability" of all 17 genes, we first obtained their structural modes by retrieved data from the PDB database or homology modeling. The PDB codes of proteins or their templates are listed in Table 3. Then, Fpocket was used to compute all possible pockets and their corresponding "druggability score" (DS). The "druggability" of the protein was defined as the DS of the highest scoring pocket. As expected, most of the predicted proteins were druggable (DS ≥ 0.5), except VCAN, IGALS1, and MFGE8. ALB had the largest DS (1.00), which can partially explain why so many ALB inhibitors exist. Among the six non-drug targets, TIMP1, ITGA2, and BST2 were predicted as highly druggable (DS ≥ 0.5), which meant that these three genes had the structural abilities to be drug targets. In particular, the non-drug target ITGA2 had a larger DS than ITGAV, suggesting that a more detailed structural comparison between these two integrin proteins is needed.

Identification of Functional Modules and Pathways
Within PPI networks, cancer targets interact with different modules to perform biological functions. A module within a network is defined a set of nodes that are densely connected within subsets of the network but may not all directly interact with each other. To get further insight into the topological and biological functions of potential targets, we performed module detection in the PPI network using a GN algorithm and functional semantic similarity. As shown in Figure 5, we FIGURE 5 | Four modules were discovered within PPI networks. Genes that were predicted in at least two datasets are marked red, while genes that were predicted in only one dataset are marked blue.
identified four modules (the pink, yellow, green, and blue nodes) and labeled the genes that were predicted in at least two datasets (red) or in only one dataset (blue). Except PKM and ACTN1, 15 of the 17 predicted genes were detected by the modular analysis and are included in these four modules. The top module (pink) was formed of 19 genes, including the most of our predicted genes (12/17, ADAM10, CALU, ALB, APLP2, MSLN, LGALS1, TIMP1, MATN3, VCAN, EGF, MFGE8, and APOL1). Most of these genes have been previously reported as disease genes in PDAC or drug targets in other cancers. Another three predicted genes were included in two other modules, while ITGAV and ITGA2 were detected in the second largest module (yellow). Although there were only two predicted genes, this module deserves more attention, as it primarily contains two types of gene targets: integrins (ITGA5, ITGA3, ITGB5, ITGA2, and ITGAV) and collagens (COL6A3, COL11A1, COL1A1, COL10A1, COL5A1, COL1A2, and COL3A1). Research into integrins and collagens and their interactions may provide more insights into the molecular mechanisms of PDAC.
We next performed an enrichment analysis on genes in the PPI network ( Figure 6 and Table 4). The genes were enriched for the GO terms related to extracellular structure and matrix, such as extracellular structure and matrix organization in BP, ECM in CC, and ECM structural constituent and binding in MF. Table 4 shows the top 10 most significantly enriched KEGG pathways. Most of the pathways are associated with cancer, such as ECM-receptor interaction, focal adhesion, and proteoglycans in cancer. Moreover, integrins were enriched in most of the carcinogenesisassociated pathways, such as focal adhesion, which play essential roles in important BPs, including cell motility, proliferation, and differentiation. Interestingly, several altered molecular pathways were identified, which suggests that genes in the secondary module were involved in these pathways. These modules and pathways not only contained integrins, but also another group of collagens. In particular, two predicted integrins (ITGAV and ITGA2) were involved in nine out of the top 10 pathways, while the top four pathways (ECM-receptor interaction, focal adhesion, proteoglycans in cancer, and human papillomavirus infection) also contained collagens, especially COL1A1 and COL1A2. Except for these pathways, the list of integrins and collagens was used to define the traditional cancer-related PI3K/ AKT pathway. It was previously known that collagen is a major component of the tumor microenvironment that participates in cancer fibrosis, which can influence tumor cell behavior through integrins (Xu et al., 2019). Our results indicated that ITGAV, ITGA2, and their interactions with COL1A1 and COL1A2 may play important roles in PDAC, suggesting they could serve as potential drug targets. For example, the predicted genes and their interactions were highlighted in the ECM-receptor interaction pathway ( Figure S1). This systems biology evidence of gene cluster-and pathway-based distributions suggested that targeting several key genes together could be a more promising approach.

ITGAV and ITGA2 as Potential Drug Targets for PDAC
By combining SVM-RFE, PPI network, and survival analysis, 11 out of 17 candidate genes have been predicted as biomarkers in pancreatic cancer patients. Among them, two integrins of ITGAV and ITGA2 were further screened as two potential drug targets according to the following evidences: 1) Both ITGAV and ITGA2 are involved in all PDAC-related pathways include ECM-receptor interaction and focal adhesion pathways, suggesting that ITGAV and ITGA2 may play an important role in PDAC progression; 2) Based on the druggability criteria, ITGAV and ITGA2 have relatively high DS. In addition, ITGAV is already a drug target for other cancer. Due to the structural similarity, ITGA2 can also be considered as a potential drug target; 3) Current experimental data suggest that several other integrins are overexpressed in various cancer types, being involved in tumor progression through tumor cell invasion and metastases. For example, the therapeutic potential of ITGA5 in the PDAC stroma has been proved efficacy (Kuninty et al., 2019). Collectively, our data together with some know results point towards ITGAV and ITGA2 as two potential drug targets for PDAC. Thus, the emerging understanding of their structural properties will guide the development of new strategies for anticancer therapy.
Integrins are transmembrane receptors that are central to the biology of many human pathologies. Classically, integrins are known for mediating cell-ECM and cell-cell interaction, and they have been shown to have an emerging role as local activators of TGF-b, influencing cancer, fibrosis, thrombosis, and inflammation (Raab-Westphal et al., 2017). Integrins are composed of a and b subunits to form a complete signaling molecule. Their ligand binding and some regulatory sites are extracellular and sensitive to pharmacological intervention, as proven by the clinical success of seven drugs that target integrins (Hamidi et al., 2016). Although peptides and small molecules are generally designed to target integrin ab dimers, the individual integrin a subunits may also be therapeutic targets. ITGAV always bind with five b subunits that form receptors for vitronectin, cytotactin, fibronectin, fibrinogen, and laminin. ITGAV has mostly been investigated for its role in malignant tumor cells and tumor vasculature (Xiong et al., 2001;Xiong et al., 2009). ITGAV recognizes the Arg-Gly-Asp (RGD) sequence in a wide array of ligands at the interface between the a and b subunits (Xiong et al., 2002). ITGA2 forms with b 1 and belongs to the collagen receptor subfamily of integrins (Emsley et al., 2000).
The structure of ITGAV was taken from chain A of the x-ray structure of complete integrin aVb 3 (PDB code: 3IJE). It contains a b-propeller domain of seven 60-amino-acid repeats, and three other domains including the Thigh, Calf-1, and Calf-2 domains ( Figure 7A). The PDB repository contains no crystal structure for full-length ITGA2. The highest sequence similarity between ITGA2 and searched models (PDB code: 5ES4) was 28%, so we employed I-TASSER to generate a composite model of ITGA2 based on several templates. A subsequent analysis of the structure of ITGA2 revealed similar domain structures with ITGAV but with the addition of an I domain (Emsley et al., 1997) and a WKp GfFkR helix tail, which may suggest more drugtargeting possibilities for ITGA2. Based on the structures of ITGAV and ITGA2, Fpocket was used to detect their druggable pockets. For ITGAV, there were two highly druggable pockets, both located within the b-propeller domain. The largest druggable pocket was located on the outer side of the b-barrel, consisted of Val192, Lys104, Ala189, Asp132, Val188, Ala189, Asp167, Leu130, Gln187, Glu190, Lys135, Val137, and Gln131, and had a DS of 0.663 ( Figure 7A). The second largest druggable pocket was located at the hole of the b-barrel, consisted of Trp93, Leu111, Gln156, Phe159, Pro110, Ala96, Phe21, Tyr406, Tyr224, and Phe278, and had a DS of 0.599 ( Figure S2A). For ITGA2, only one highly druggable pocket was found at the b-propeller domain and had a DS of 0.92. This pocket consisted of His416, Phe162, His414, Ser159, Phe156, Leu417, Ser161, Val409, Leu396, Lys411, Leu158, Gln157, Leu394, Ala160, Leu417, Asp155, Asp392, Val381, Gly415, and Ser413 ( Figure 7B). Despite progress in the development of drugs that target different integrins, there are only two clinical approved drugs in the drug bank for ITGAV (Levothyroxine and Antithymocyte immunoglobulin) ( Table 3). Thymoglobulin is a polyclonal antibody, while Levothyroxine is currently the only approved small molecule that targets ITGAV. The small ligand Levothyroxine was docked to the two druggable pockets in ITGAV to study the stability of the complex and protein-drug interactions. When docked to the largest druggable pocket, Levothyroxine formed hydrogen bonds with Asp167, Thr134, Lys135, and Val192, and a hydrophobic interaction with Ala189, and the binding free energy was −8.3 kcal/mol ( Figure 7C). For the other pocket, hydrogen bonds were formed between Levothyroxine and Phe21, Trp93, Ala96, and Pro110 with the binding free energy of −10.08 kcal/mol ( Figure S2B). We further docked Levothyroxine to ITGA2 at its druggable pocket. The binding free energy of −9.09 kcal/mol suggested a good interaction between ITGA2 and Levothyroxine, with the potential binding sites at Phe162, Lys411, Asp392, and Leu158 ( Figure 7D).
To determine residues that play a key role in the global dynamics of ITGAV and ITGA2, we performed a GNM analysis. GNM analysis provides information on the mechanisms of collective movements intrinsically accessible to the structure, which usually enable structural changes relevant to function (Bahar et al., 2010). The most discriminative feature in dynamic analysis is hinge prediction, which are expected to be sites for drug development (Sumbul et al., 2015). We predicted hinges sites by the minima of corresponding GNM slow modes. By applying GNM to ITGAV ( Figure 7E), GNM mode 1 highlights the hinge region located in the Thigh domain, especially at Asn455, Ser471, Arg553, and Gly594, which are located at the interface between the Thigh and Calf-1 domains. We also note that the b-propeller domain became the major hinge region in GNM mode 2, while Ile286, Asn287, Asp352, Phe377, Ser389, Thr413, Asp414, Pro421, and Tyr436 have minimal fluctuations. Hinge sites located at the b-propeller domain in GNM mode 2 may correspond to pocket sites, as the first and second largest druggable pockets were within the bpropeller domain. For ITGA2 ( Figure 7F Phe681 to Ser737. Accordingly, our GNM modeling suggested that both the b-propeller domain and the Thigh domain play important roles in modulating the collective movements of ITGAV and ITGA2. The b-propeller domain has been indicated to be a druggable domain by pocket detection. Here, some hinge sites located within the Thigh domain offer other reasonable starting points for inhibitor design.

CONCLUSIONS
In this study, we developed a computational framework that integrated ML (SVM-RFE), biomolecular networks (PPI network analysis), and structural modeling analysis (homology modeling, molecular docking, and GNM modeling) to help future drug targets for PDAC. The core of the new method was that we defined a new score, termed RNs, based on cancer-related information from gene expression data and topological information obtained from PPI network analysis. Research using three GEO datasets (GSE28735, GSE71989, and GSE15471) yielded 17 genes (ADAM10, TIMP1, MATN3, PKM, APLP2, ACTN1, CALU, VCAN, LGALS1, ITGA2, BST2, MFGE8, ITGAV, EGF, APOL1, ALB, and MSLN) that were predicted to be potential drug targets. The survival and "druggability" analysis of these genes showed that most of the identified genes had poor survival associations and good DS values, further providing evidence that they can be used as therapeutic targets in PDAC. The important roles of integrins as well as their interactions with collagens were highlighted by combining network modules and KEGG pathway analysis, in term of four pathways, ECM-receptor interaction, focal adhesion, proteoglycans in cancer, and human papillomavirus infection pathways. By focusing on ITGAV and ITGA2, we identified druggable pockets, drug binding sites, and hinge sites that are potential sites for designing small molecules. In summary, this new methodology will provide new avenues for discovering drug targets in PDAC and other cancers.
Of course, our method in this work has some limitations. Firstly, our method only used SVM-REF to the gene expression data to rank the DEGs. With the growth of other omics data, we need to apply our method by including more kinds of data, such as RNA-Seq data for PDAC (Raphael et al., 2017), which will make our method more practical. Secondly, our method just combined the systems level analysis of PPI construction and analysis and the molecular level analysis of "druggability" prediction, and thus, the drug target prediction needs some structural research experience to some extent. To address this, the real integration of structure knowledge into PPI networks is still needed.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation, to any qualified researcher.

AUTHOR CONTRIBUTIONS
WY, XYL, and GH analyzed the data and wrote the manuscript. XYL and WY conducted the SVM calculation and network analysis. FW assisted in network analysis. FX and SH conducted the structural modeling and docking. XL assisted in molecular docking. WY, FX, and GH conceived and designed all experiments, and interpreted all results. GH revised the manuscript. All authors contributed to the work.