Identification of a Nine-Gene Signature and Establishment of a Prognostic Nomogram Predicting Overall Survival of Pancreatic Cancer

Background: Pancreatic cancer is highly lethal and aggressive with increasing trend of mortality in both genders. An effective prediction model is needed to assess prognosis of patients for optimization of treatment. Materials and Methods: Seven datasets of mRNA expression and clinical data were obtained from gene expression omnibus (GEO) database. Level 3 mRNA expression and clinicopathological data were obtained from The Cancer Genome Atlas pancreatic ductal adenocarcinoma (TCGA-PAAD) dataset. Differentially expressed genes (DEGs) between pancreatic tumor and normal tissue were identified by integrated analysis of multiple GEO datasets. Univariate and Lasso Cox regression analyses were applied to identify overall survival-related DEGs and establish a prognostic gene signature whose performance was evaluated by Kaplan-Meier curve, receiver operating characteristic (ROC), Harrell's concordance index (C-index) and calibration curve. GSE62452 and GSE57495 were used for external validation. Gene set enrichment analysis (GSEA) and tumor immunity analysis were applied to elucidate the molecular mechanisms and immune relevance. Multivariate Cox regression analysis was used to identify independent prognostic factors in pancreatic cancer. Finally, a prognostic nomogram was established based on the TCGA PAAD dataset. Results: A nine-gene signature comprising MET, KLK10, COL17A1, CEP55, ANKRD22, ITGB6, ARNTL2, MCOLN3, and SLC25A45 was established to predict overall survival of pancreatic cancer. The ROC curve and C-index indicated good performance of the nine-gene signature at predicting overall survival in the TCGA dataset and external validation datasets relative to classic AJCC staging. The nine-gene signature could classify patients into high- and low-risk groups with distinct overall survival and differentiate tumor from normal tissue. Univariate Cox regression revealed that the nine-gene signature was an independent prognostic factor in pancreatic cancer. The nomogram incorporating the gene signature and clinical prognostic factors was superior to AJCC staging in predicting overall survival. The high-risk group was enriched with multiple oncological signatures and aggressiveness-related pathways and associated with significantly lower levels of CD4+ T cell infiltration. Conclusion: Our study identified a nine-gene signature and established a prognostic nomogram that reliably predict overall survival in pancreatic cancer. The findings may be beneficial to therapeutic customization and medical decision-making.

Background: Pancreatic cancer is highly lethal and aggressive with increasing trend of mortality in both genders. An effective prediction model is needed to assess prognosis of patients for optimization of treatment.
Materials and Methods: Seven datasets of mRNA expression and clinical data were obtained from gene expression omnibus (GEO) database. Level 3 mRNA expression and clinicopathological data were obtained from The Cancer Genome Atlas pancreatic ductal adenocarcinoma (TCGA-PAAD) dataset. Differentially expressed genes (DEGs) between pancreatic tumor and normal tissue were identified by integrated analysis of multiple GEO datasets. Univariate and Lasso Cox regression analyses were applied to identify overall survival-related DEGs and establish a prognostic gene signature whose performance was evaluated by Kaplan-Meier curve, receiver operating characteristic (ROC), Harrell's concordance index (C-index) and calibration curve. GSE62452 and GSE57495 were used for external validation. Gene set enrichment analysis (GSEA) and tumor immunity analysis were applied to elucidate the molecular mechanisms and immune relevance. Multivariate Cox regression analysis was used to identify independent prognostic factors in pancreatic cancer. Finally, a prognostic nomogram was established based on the TCGA PAAD dataset.
Results: A nine-gene signature comprising MET, KLK10, COL17A1, CEP55, ANKRD22, ITGB6, ARNTL2, MCOLN3, and SLC25A45 was established to predict overall survival of pancreatic cancer. The ROC curve and C-index indicated good performance of the nine-gene signature at predicting overall survival in the TCGA dataset and external validation datasets relative to classic AJCC staging. The nine-gene signature could classify patients into high-and low-risk groups with distinct overall survival and differentiate tumor from normal tissue. Univariate Cox regression revealed that the nine-gene signature was an independent prognostic factor in pancreatic cancer. The nomogram incorporating the gene signature and clinical prognostic factors was superior to AJCC staging in predicting overall survival. The high-risk group was enriched with multiple oncological signatures and aggressiveness-related pathways and associated with significantly lower levels of CD4 + T cell infiltration.

INTRODUCTION
Pancreatic cancer is lethal and aggressive with a 5-year survival rate of only 2-9% (1). Despite its low incidence, pancreatic cancer is the fourth leading cause of cancer-related death in the United States. Its mortality is increasing for both genders and it is expected to become the second most common cause of cancerrelated death by 2030 after lung cancer and surpassing colorectal and breast cancers (2). Surgical resection is the only curative treatment and it significantly improves the five-year survival rate to 20-30%. However, only <20% of all patients are eligible for resection as most patients are diagnosed at an advanced stage when there is metastasis (3). Poor prognosis is caused by the rapid progression, early metastasis, and lack of typical clinical presentation or sensitive screening methods for earlystage pancreatic cancer (4). Neoadjuvant therapy, radiotherapy, chemotherapy, targeted molecular therapy, and immunotherapy have been used for treatment and have achieved certain therapeutic effects. However, for individual patients, the survival benefits of these treatments are questionable and side effects occur. Pancreatic cancer should be managed by individualized systemic treatment, which may prolong survival and improves quality of life. Therefore, an effective prediction model is needed for the accurate assessment of patient's prognosis. In this way, efficacious treatments may be selected to balance side effects and survival benefits and to decide whether to administer more aggressive treatment. Clinicopathological parameters such as AJCC TNM staging have been used for predicting prognosis of patients (5). The advancement of tumor molecular biology has facilitated the development of new prediction tools based on prognosis-related genes. These prognostic markers reflecting tumor progression at molecular level may be beneficial to realize individualized survival predictions with better accuracy.
Advances in gene chips and high-throughput sequencing have demonstrated that prognostic gene signatures at the mRNA level are able to predict overall survival in pancreatic cancer. Birnbaum et al. proposed a 25-gene signature based on clinicopathological and gene expression data that predicts post-operative overall survival independent of classical factors and molecular subtypes (6). Raman et al. reported a fivegene prognostic model (ADM, ASPM, DCBLD2, E2F7, and KRT6A) that accurately predicts overall survival from the TCGA PAAD dataset (7). Yan et al. identified a survivalrelated four-gene signature (LYRM1, KNTC1, IGF2BP2, and CDC6) significantly associated with progression and prognosis of pancreatic cancer (8). In-depth exploration of the public datasets (GEO and TCGA etc.) may reveal other prognosticrelated genes and establish a reliable prognostic gene signature which, in combination with clinicopathological parameters, may be a powerful tool for predicting prognosis of pancreatic cancer and individualized treatment.
Here, we integrated seven pancreatic cancer datasets from the GEO database to identify differentially expressed genes (DEGs). Univariate and Lasso-Cox regression analyses were applied to identify overall survival-related DEGs and propose a prognostic gene signature based on gene expression and clinical data from the TCGA PAAD dataset. The prognostic gene signature was validated with external datasets. The molecular mechanism and tumor immunity relevance of the gene signature and its potential in guidance of immune therapy were also investigated. Independent prognostic factors of overall survival were identified by multivariate Cox survival analysis. A prognostic nomogram incorporating the prognostic gene signature and clinical prognostic factors was established to predict overall survival. Overall, our prognostic gene signature and nomogram may accurately predict overall survival of pancreatic cancer.

Acquisition of Gene Expression and Clinical Data
The mRNA expression and clinical data for pancreatic ductal adenocarcinoma were searched and downloaded from the Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih. gov/geo/) using the keywords "pancreatic cancer, " "PAAD, " and "pancreatic adenocarcinoma." "Homo sapiens" and "Expression profiling by array" were included in the next round of screening. "Cell line" and "xenograft" were excluded from the search. The gene expression microarray datasets GSE71729, GSE62165, GSE62452, GSE28735, GSE15471, GSE16515, and GSE32676 were selected and downloaded for DEG analysis (9)(10)(11)(12)(13)(14)(15). The datasets met the following criteria: (1) human pancreatic tissue samples; (2) tumor-and non-tumor pancreatic control tissue samples; (3) ≥30 samples. GSE57495 had 63 tumor tissues that were downloaded with their associated follow-up information for subsequent validation of the prognostic gene signature (16). Probes were matched to the gene symbols using the annotation files provided by the manufacturer. The median ranking value accounted for the expression value if multiple probes matched a single gene. Robust multi-array average (RMA)-normalized data were log2-transformed for further analysis.
Normalized RNA-sequencing data as transcripts per million (TPM) and the associated clinical information of the PAAD samples were downloaded from The Cancer Genome Atlas (TCGA) dataset (https://portal.gdc.cancer.gov/; ≤May 20, 2019). They included 185 cases, 182 samples, and four normal tissue samples. Eight cases without corresponding tumor samples, one case missing pathological information, fives cases with a pathological diagnosis of colloid (mucinous non-cystic) carcinoma or undifferentiated carcinoma, six cases with followup period ≤30 days and two samples with metastasis were eliminated. Thus, 165 cases with corresponding tumor tissues and clinical information were included in the study. Normalized gene expression data for the TCGA PAAD dataset were log2transformed for further analysis.

DEG Identification and Integrated Microarray Dataset Analysis
DEGs between tumor-and non-tumor tissues were identified using Limma package in R. |Log2FC| > 1, P < 0.05, and false discovery rate (FDR) < 0.05 were set as the cutoffs for the DEGs. The robust rank aggregation (RRA) method-based R package "RobustRankAggreg" was used for the integrated analysis of the DEGs identified from the seven GEO datasets. P < 0.05 was considered statistically significant. GEPIA (http:// gepia.cancer-pku.cn) is a newly developed interactive web server analyzing RNA sequencing expression data for 9,736 tumors and 8,587 normal tissues in the TCGA and Genotype-Tissue Expression (GTEx) projects with a standard processing pipeline (17). As there were few normal pancreatic tissues in TCGA, the expression level of a specific DEG identified by integrated analysis of the GEO datasets were validated by GEPIA using TCGA PAAD tumor data and matched data of normal tissue from TCGA and GTEx. |Log2FC| > 1 and P < 0.01 were considered statistically significant. Protein expression of the DEGs in pancreatic tumor and non-tumor tissues was evaluated by the human protein atlas (https://www.proteinatlas.org/) (18). Mutation data was obtained from the cBioPortal for Cancer Genomics (https://www.cbioportal.org/) (19).

Bioinformatic DEG Analysis
GO enrichment and KEGG pathway analyses were used to explore the potential biological processes, cellular components, and molecular functions of DEGs. Significantly relevant signal pathways were identified with DAVID (https://david.ncifcrf.gov/) (20). P < 0.05 was considered statistically significant. The STRING database (https://string-db.org) was used to explore potential interactions between DEGs with confidence score ≥0.4 (21). The PPI network of DEGs was constructed and visualized with Cytoscape v. 3.7.1 (https://cytoscape.org/). The Cytoscape plugin cytoHubba was used to identify hub nodes by the maximal clique centrality (MCC) method. Densely connected clusters in the PPI network were identified with the Cytoscape plugin MCODE and the default parameters. GO enrichment analysis was performed on DEG clusters.

Identification of Survival-Related DEGs and Establishment of the Prognostic Gene Signature
The TCGA PAAD dataset was used to identify DEGs associated with overall survival. The expression levels of the DEGs identified by integrated analysis of GEO datasets were analyzed with a univariate Cox proportional hazards regression model. DEGs with P < 0.01 were considered statistically significant and included in subsequent analyses. Lasso-penalized Cox regression analysis was performed to further reduce the number of DEGs in the selected panel with best predictive performance using 10-fold cross validation based on glmnet package in R. A prognostic gene signature of pancreatic cancer patients was constructed based on a linear combination of the regression coefficients (β) derived from the Lasso Cox regression model multiplied with its mRNA expression level. Patients were divided into high-and low-risk groups based on the optimal cutoff of the prognostic gene signature determined using X-Tile software (22). Kaplan-Meier analysis, area under the curve (AUC) of the receiver operating characteristic (ROC) curve, Harrell's concordance index, and a calibration plot comparing predicted and observed overall survival were used to evaluate the performance of the prognostic gene signature. AJCC stage performance was used as a control. The performance of the prognostic gene signature was also compared with three previously defined gene signatures (8,23,24). The GSE62452 and GSE57495 datasets with complete clinical information were used for external validation. Risk scores were calculated using the prognostic gene signature. Optimal cutoffs for each dataset were determined using X-Tile. Performance of the risk score at predicting overall survival was validated using the AJCC stage as control.

Identification of Independent Prognostic Parameters of Pancreatic Cancer
To identify independent prognostic parameters and to validate the independent prognostic value of the gene signature, univariate-, and multivariate Cox regression analyses were performed in the TCGA dataset on the prognostic gene signature and clinicopathological parameters including age, sex, tumor size, tumor site, histological subtype, grade, AJCC TNM stage, residual tumor status, surgical treatment, histories of chemotherapy, histories of radiation therapy, histories of targeted molecular therapy, tobacco smoking histories, alcohol drinking histories, histories of chronic pancreatitis, diabetes, and prior malignancy. P < 0.05 was considered statistically significant. Parameters with P < 0.05 based on the univariate analysis were further included in the multivariate Cox regression analysis.

Predictive Nomogram Construction and Validation
After testing for collinearity, all independent prognostic parameters and relevant clinical parameters were included in the construction of a prognostic nomogram via a stepwise Cox regression model to predict 1-, 2-, and 3-year overall survival of pancreatic cancer patients in the TCGA dataset. Nomogram performance in predicting overall survival was validated using AJCC stage as control. Kaplan-Meier analysis, AUC of the ROC curve, Harrell's concordance index, and a calibration plot comparing predicted and observed overall survival were used to evaluate the performance of the prognostic nomogram. Harrell's concordance index was calculated to assess nomogram discrimination using a bootstrap method with 1,000 resamples. The nomogram calibration curve was plotted to compare predicted vs. observed overall survival. Based on the total points of the nomogram, the patients were divided into three groups by optimal cutoffs determined in X-Tile. Survival curves for the high-, medium-, and low-risk groups were plotted using Kaplan-Meier analysis.

Gene Set Enrichment and Tumor Immunity Analyses
Gene set enrichment analysis (GSEA) was performed to elucidate the molecular mechanisms of the prognostic gene signature (25). The TCGA samples were divided into high-and lowrisk groups according to the optimal cutoffs determined by X-Tile. GSEA was performed in javaGSEA v. 3.0 based on the Molecular Signatures Database v. 6.2. C2 (curated gene sets), C5 (GO gene sets), and C6 (oncogenic signatures) were searched to identify enriched KEGG pathways, biological processes, cellular components, molecular functions, and dysregulated oncogenic signatures associated with poor survival of the highrisk group. |NES| > 1 and FDR < 0.05 were considered statistically significant. Stromal, immune, and estimate scores were calculated with the ESTIMATE (estimation of stromal and immune cells in malignant tumor tissues using expression data) algorithm applied to the expression data downloaded from the TCGA PAAD dataset (https://bioinformatics.mdanderson.org/ public-software/estimate/) (26). The abundances of B, CD4 + T, CD8 + T, and dendritic cells; neutrophils; and macrophages were estimated using the TIMER (tumor immune estimation resource) algorithm (https://cistrome.shinyapps.io/timer/) (27). Survival analysis of immune cell infiltration and correlation of gene expression with immune cell infiltration level in pancreatic cancer were evaluated with TIMER.

Statistical Analysis
Statistical analysis was performed in R v. 3.4.3 and GraphPad Prism v. 8.01 (GraphPad Software, La Jolla, CA, USA). Categorical variables were analyzed by the χ 2 or Fisher's exact test. Continuous variables were analyzed using Student's t-test for paired samples. Multiple groups of continuous variables were analyzed by one-way ANOVA. Univariate-and multivariate Cox regression analyses were performed to evaluate survival. The hazard ratio (HR) and 95% confidence interval (CI) were calculated to identify genes associated with overall survival. Unless otherwise stipulated, P < 0.05 was considered statistically significant.

Identification of DEGs
This study was conducted according to the flow chart shown in Figure 1. Details of the GEO datasets in this study are shown in Table 1. Seven sets of DEGs (GSE71729, GSE62165, GSE62452, GSE28735, GSE15471, GSE16515, and GSE32676) comprised of 453, 2,449, 285, 395, 948, 1,238, and 472 DEGs were identified between tumor and normal tissues (Figure 2A;  Supplementary Figures 1A-G). A total of 234 DEGs including 160 upregulated and 74 downregulated genes were identified after integrated analysis by robust rank aggregation (RRA) method (Supplementary Table 1). The top 20 up-and downregulated DEGs identified by integrated analysis of microarrays are shown in Figure 2B. Hierarchical clustering analysis revealed differences in DEG expression pattern between tumor and normal tissues, which could distinguish tumor from non-tumor tissues ( Figure 2C; Supplementary Figures 2A-F).

Functional Enrichment and PPI Network
Analysis of the DEGs GO and KEGG pathway enrichment analyses were applied to discover the functions of the DEGs (Supplementary Table 2). The DEGs were significantly enriched in biological processes related to interactions between the extracellular matrix and cellular migration. This finding is consistent with the highly invasive and metastatic nature of pancreatic cancer ( Figure 3A). Loss of adhesion and dissociation from in situ of tumor cells is the first step in invasion and metastasis. Significantly enriched biological processes included extracellular matrix organization, cell adhesion, collagen catabolic process, extracellular matrix disassembly, hemidesmosome assembly, proteolysis, and cell migration. Enrichment analyses of the cellular compartment and molecular functions are shown in Supplementary Figures 3A,B. KEGG pathway analysis revealed that the DEGs participated in PI3K-Akt signaling pathway, pathways in cancer and pathways related to cellular dissociation from in situ, including ECMreceptor interaction, focal adhesion, and protein digestion and absorption ( Figure 3B). Furthermore, the DEGs participated in the axon guidance pathway indicating their involvement in the neurological invasion of pancreatic cancer. The interactive network of cancer-related pathways and corresponding DEGs were visualized to elucidate their associations ( Figure 3C).
A PPI network of DEGs that included 186 nodes and 691 interactions was constructed to identify gene interactions. Node degree and betweenness were calculated by the MCC method to obtain hub nodes. The top 25 candidate hub genes were identified which may play a central role in this network (Supplementary Figure 3C). Module analysis identified significant clustering modules in the PPI network. The three highest-scoring clustering modules were obtained (Figures 4A-C). Each hub gene was found in ≥1 module. Thus, the three clustering modules may represent key biological roles of the PPI network. Function enrichment analysis revealed that Module 1 with a score of 8.400 was associated with cell adhesion and extracellular matrix organization. Module 2 with a score of 8.125 was correlated with blood vessel and smooth muscle development, indicating its involvement in tumor-related angiogenesis. Module 3 had a score of 4.727 and was related to cell adhesion and junction assembly. The PPI network analysis showed that the DEGs participated in pancreatic cancer progression especially in terms of invasion and metastasis.

Identification of Survival-Related DEGs and Establishment of the Nine-Gene Prognostic Signature
One hundred sixty-five patients from the TCGA PAAD dataset with a follow-up period >30 d were included in Frontiers in Oncology | www.frontiersin.org subsequent survival analyses. The baseline characteristics of these patients are listed in Table 2. A total of 130 DEGs were identified to be significantly associated with overall survival based on the univariate Cox regression model (P < 0.01, Supplementary Table 3 Table 4). The downregulated MCOLN3 and SLC25A45 with HR < 1 were considered as tumor suppressors, whereas the upregulated COL17A1, CEP55, KLK10, MET, ITGB6, ANKRD22, and ARNTL2 with HR > 1 were regarded as oncogenes. The risk score was calculated as follows: The optimal cutoff values for the risk scores were calculated with X-Tile software. Patients from the TCGA dataset were stratified into two (cutoff value = 2.33) or three (cutoff values = 2.01 and 2.45) groups. The Kaplan-Meier survival curves revealed significantly favorable overall survival in all groups with lower risk scores (P < 0.0001) (Figures 5D,E). Time-dependent ROC and C-index were applied to determine the prognostic values of the nine gene risk scores compared with the AJCC stage (Figures 5A-C). The AUCs for 1-, 2-, and 3-year overall survival predictions for the risk scores were 0.699, 0.637, and 0.621, respectively. The AUCs for 1-, 2-, and 3-year overall survival predictions for the AJCC stage were 0.523, 0.630, and 0.674, respectively. The C-index of the risk score was 0.673 (95% CI; 0.614-0.732), while that of the  Figure 5F). The performance of the risk score was also compared with three previously defined gene signatures. The risk score had the highest C-index (0.673 vs. 0.625, 0.612, and 0.544) indicating a superior prognostic value (Supplementary Figures 5A-C). The performance of the risk score was further explored in different subgroups of patients (Supplementary Figure 6). The risk score performed well in predicting overall survival of patients in subgroup of stage I and II and subgroups without the history of chemotherapy, molecular targeted therapy, and radiation therapy, forecasting the natural course of pancreatic cancer. We further explore whether the relative treatment benefit varies according to the values of the risk score. Kaplan-Meier analyses reveal that patients with higher risk score (top 50%) have better response to chemotherapy, molecular targeted therapy, and radiation therapy than patients with lower risk score (bottom 50%; Supplementary Figure 7). In general, the ninegene signature performed well at predicting overall survival of pancreatic cancer.

External Validation of the Prognostic Performance of the Nine Gene Signature
Two external datasets GSE62452 and GSE57495 were used to validate the prediction performance of the nine-gene prognostic signature (Figure 6; Supplementary Figure 8). Risk scores were calculated with the same formula for each patient. Patients were divided into high-and low-risk groups according to the optimal cutoffs determined for each dataset. The Kaplan-Meier survival curves revealed significant difference in overall survival between groups in both datasets. High-risk groups had markedly poorer outcomes than low-risk groups ( Figure 6D and Supplementary Figure 8D). Prognostic power was then assessed by time-dependent ROC and C-index. In both datasets, the nine-gene signature had a comparable or superior performance to that of the AJCC stage. In GSE62452, the AUCs for 1-, 2-, and 3-year overall survival predictions for the risk scores were 0.544, 0.737, and 0.814, respectively.  Figures 8A-C).
External validation indicated that the nine-gene signature performed well at predicting overall survival in pancreatic cancer patients.

Validation of Expression and Alteration of the Nine Genes
The expression levels of the nine genes were validated using GEPIA. The mRNA expression levels of COL17A1, CEP55, KLK10, MET, ITGB6, ANKRD22, and ARNTL2 were significantly increased in PAAD tumor tissue. In contrast, MCOLN3 and SLC25A45 were significantly decreased in compare with non-tumor tissues ( Figure 7A). Human protein atlas database was used to explore protein expression levels. Typical IHC of eight genes (except KLK10, not included in the database) in tumor and normal pancreatic tissues are shown in Figure 7B. (Images are available from v18.proteinatlas.org).
Of the 165 PAAD patients included in the current study, 15 (9%) presented with alterations in the nine genes. Amplification was the most common type of mutation in the upregulated genes ( Figure 7C).

Evaluation of Prognostic Factors in PAAD
Ninety-one patients from the TCGA PAAD dataset for which complete clinical information was provided, including age, sex, tumor size, tumor site, histological subtype, grade, AJCC TNM stage, residual tumor status, surgical treatment, histories of chemotherapy, histories of radiation therapy, histories of targeted molecular therapy, tobacco smoking histories, alcohol drinking histories, histories of chronic pancreatitis, diabetes, and prior malignancy, were included in the analysis ( Table 3). Reasons of    of chemotherapy (P = 0.0202), history of radiation therapy (P = 0.0228), and history of targeted molecular therapy (P < 0.0001) were significantly correlated with overall survival of pancreatic cancer ( Table 4). Multivariate analysis revealed that risk score, tumor size, and history of targeted molecular therapy were independent risk factors of overall survival (P < 0.05; Table 5).

Building and Validating a Predictive Nomogram
The 91 patients with complete clinical information from the TCGA dataset were used to establish a prognostic nomogram predicting 1-, 2-, and 3-year overall survival based on the stepwise Cox regression model ( Figure 8A). Risk score, age, tumor size, tumor site, histological subtype, T stage, and histories of targeted molecular therapy were parameters included in the nomogram. The AUCs of the 1-, 2-, and 3-year overall survival predictions for the nomogram were 0.793, 0.842, and 0.851, respectively. The AUCs of the 1-, 2-, and 3-year overall survival predictions for the AJCC stage were 0.565, 0.685, and 0.735, respectively. The C-index of the risk score was 0.779 (95% CI; 0.714-0.845), while that for the AJCC stage was 0.587 (95% CI; 0.520-0.654). Thus, the nomogram was superior to the AJCC stage in terms of predicting overall survival of pancreatic cancer ( Figure 8B). The patients were divided into three groups of equal size according to scoring of nomogram. The Kaplan-Meier plot effectively discriminated these groups of various risk ( Figure 8C). Those with higher scores had significantly poorer overall survival (P < 0.0001). Calibration plots showed that the nomogram performed well at predicting overall survival in pancreatic cancer patients ( Figure 8D). When the predicted overall survival was >80 or <60%, the nomogram may underestimate the mortality.

Gene Set Enrichment Analysis (GSEA)
To elucidate the molecular mechanism of the nine-gene signature, 165 patients from the TCGA PAAD dataset were divided into high-and low-risk groups according to the optimal cutoff for the nine-gene risk score determined by X-tile software. GSEA compared the high-and lowrisk groups. In the former, 23 oncological signatures were significantly enriched including the MAL, AGR, HIF, RAS, ECM, ATRBRCA, PTC1, and other pathways ( Figure 9A). KEGG pathways enriched in the high-risk group included regulation of the actin cytoskeleton, ubiquitin-mediated proteolysis, axon guidance, focal adhesion, and tight junction. These enriched KEGG pathways revealed that molecular alteration in the high-risk group was closely associated with the malignant properties of pancreatic cancer, especially invasion and metastasis. Results of the GSEA are shown in Supplementary Table 6.

Clinical-and Tumor Immunity Relevance of the Nine-Gene Signature
Relationships between the nine-gene signature and the clinical characteristics of pancreatic cancer (including AJCC stage, grade, metastasis, and key gene mutation state) were analyzed in datasets providing necessary clinical information. In terms of AJCC stage, stage II patients had higher risk scores than stage I patients in the TCGA dataset. But the risk scores for the stage III and IV patients were not higher than those for the stage II patients (Figure 9B). The risk scores were also comparable across different AJCC stages in GSE63452, GSE62165, and GSE57495 ( Figure 9D; Supplementary Figures 9B,C). In terms of grade, patients of G2, G3, and G4 had higher risk scores than G1 (Figure 9C), which was consistent with the finding from GSE63452 ( Figure 9E). Moreover, data from GSE71729 revealed that the risk scores for the metastases were higher than the primary tumors (Supplementary Figure 9A). In terms of mutation, risk scores were identified to be highly associated with mutation state of key genes. The risk scores for the KRAS, TP53, and CDKN2A mutant groups were significantly higher than those for the wild type (Figures 9F-H). The risk score in the SMAD4 mutant group was non-significantly higher than that for the wild type ( Figure 9I). Finally, the performance of the nine-gene signature at differentiating pancreatic cancer from normal tissue were explored across all the seven GEO datasets ( Figure 9O). Tumor tissues could be effectively identified based on the risk score.
To investigate tumor immunity relevance of the nine-gene signature, the associations of the gene signature with tumor purity and the presence of infiltrating stromal/immune cells in tumor tissues were evaluated. Stromal-, immune-, and estimate scores were calculated by applying the ESTIMATE algorithm to the expression data downloaded from the TCGA PAAD dataset. The stromal-and estimate scores were comparable between the high-and low-risk groups. However, the immune score was significantly lower in the high-risk group, indicating fewer infiltration of immune cells in the tumor tissue (P < 0.05; Figure 9J). The abundances of B, CD4 + T, CD8 + T, and dendritic cells; neutrophils; and macrophages were further estimated using the TIMER algorithm. The high-risk group was associated with relatively lower levels of CD4 + T cell infiltration ( Figure 9K). Consistently, the downregulated SLC25A45 was positively correlated with CD4 + T cell infiltration level, whereas the upregulated MET was negatively associated with it (P < 0.0001) (Figures 9L,M). In addition, a lower level of CD4 + T cell infiltration was associated with poor survival (P = 0.055; Figure 9N).

DISCUSSION
Pancreatic cancer is highly malignant with very poor prognosis. The 5-year survival rate is about 5% (1). Accurate prediction of prognosis can identify patients benefiting from more radical treatment, including neoadjuvant therapy, more intensive surgery, chemotherapy, radiation therapy, targeted molecular therapy, and immunotherapy. Therefore, treatments can be tailored to individual patients to improve prognosis. Traditional clinicopathological parameters have been applied to reflect and   prognosticate disease progression. AJCC staging is currently the most effective tool for prognostic prediction of pancreatic cancer. Besides, molecular prognostic markers may be used as a beneficial supplement to AJCC staging to further improve the accuracy of prognosis prediction. Molecular prognostic markers which can be quantified by standardized detection procedures vary with tumor progression and may dynamically reflect the prognosis of patients. Moreover, they may also play  important roles in progression of pancreatic cancer and serve as new therapeutic targets. By combining with the detection of tumor-associated exosomes and circulating tumor cells (CTC), real-time detection of disease recurrence and treatment response in patients after surgical resection may be achieved. Molecular prognostic markers may also have potential value in early diagnosis of the highly heterogeneous pancreatic cancer, progression of which involves a complex network of multiple signaling pathways. To overcome the hinder of heterogeneity, a panel of molecular markers may be more accurate in reflecting pancreatic cancer prognosis than a single one. Nomograms are widely used in clinical oncology to evaluate prognosis (28). They can integrate several prognostic determinants including molecular and clinicopathological parameters. The numerical probabilities of clinical events can be calculated and visualized with relatively simple output. Compared to conventional staging, nomograms may effectively improve the prediction of prognosis, which is beneficial for the clinical decision-making and personalized treatment.
In the current study, we identified 234 reliable DEGs of pancreatic cancer by integrative analysis of multiple datasets. Functional enrichment analysis revealed that the DEGs were closely related to invasion and metastasis of pancreatic cancer. PI3K-Akt was the most enriched signaling pathway. Survival analysis revealed 130 DEGs associated with overall survival. A novel nine-gene signature predicting overall survival of  pancreatic cancer was established via Lasso-Cox regression. MCOLN3 and SLC25A45 were downregulated and identified as protective genes whereas MET, KLK10, COL17A1, CEP55, ANKRD22, ITGB6, and ARNTL2 were upregulated and associated with poor survival. This nine-gene signature was an independent prognostic factor of pancreatic cancer. Patients in low-risk groups had significantly better prognoses than those in high-risk groups. Prognostic performance of the ninegene signature was validated in the TCGA dataset and the external datasets GSE62452 and GSE57495. The AUC, Cindex, and calibration curves confirmed that the nine-gene signature was comparable or superior to AJCC staging at predicting 1-, 2-, and 3-year overall survival. A nomogram integrated with the nine-gene signature and clinicopathological parameters was established and accurately predicted overall survival. GSEA disclosed that 23 oncological signatures were significantly enriched in the high-risk group defined by the ninegene signature. This group was enriched with pancreatic cancerrelated oncogenic pathways and mutations and was associated with invasion, metastasis, poor survival, and significantly lower levels of CD4 + T cell infiltration. As a supplement to AJCC staging, the nine-gene signature and the nomogram may be useful as progression indicators and predictors of overall survival. Five of the genes in the nine-gene signature were previously reported to be associated with pancreatic cancer. MET is a receptor tyrosine kinase that transduces signals from the extracellular matrix to the cytoplasm by binding its ligand HGF. MET is dysregulated in pancreatic cancer and activated by genetic mutation and gene amplification, participating in pancreatic cancer cell interactions with the tumor microenvironment (29). It establishes the pre-metastatic microenvironment promoting the metastatic phenotype. MET expression is closely associated with clinical stage and activates the RAS-ERK and PI3K-Akt pathways by recruiting downstream effector molecules mediating tumorigenesis, progression, metastasis, and gemcitabine chemotherapy resistance. KLK10 is a serine protease and a member of the kallikrein family. Human tissue kallikreins regulate cancer cell growth, angiogenesis, invasion and metastasis, and either promote or suppress cancer (30). They are also used as cancer biomarkers. KLK10 is downregulated in breast, prostate, and other cancers functioning as a tumor suppressor. In contrast, KLK10 is upregulated in thyroid, gastric, and colorectal cancers and promotes tumors. In pancreatic cancer, KLK10 is highly expressed in pancreatic intraepithelial neoplasia and cancer tissues. KLK10 is upregulated in pancreatic cancer patients with lymph node involvement and remote metastasis (31). KLK10 and KLK6 are co-expressed in pancreatic cancer tissues, positively correlated with R1-resection status and poor prognosis and are independent risk factors (32). KLK10 knockdown attenuated pancreatic cancer cell migration, invasion, and metastasis in vitro and in vivo. KLK10 also mediates pancreatic cancer invasion and metastasis by activating the FAK-SRC-ERK signaling pathway. COL17A1 maintains hemidesmosome integrity and is the direct target of autoantibody in bullous dermatosis. In breast cancer, COL17A1 is hypermethylated and downregulated (33). Downregulation of COL17A1 is associated with poor prognosis in this case. COL17A1 may be a target of wild type p53 in breast tissue. It inhibits cell migration and invasion. In contrast, COL17A1 is hypomethylated and upregulated in cervical cancer, head, neck, and lung squamous cell carcinoma, and lung adenocarcinoma (34). COL17A1 upregulation was associated with poor prognosis in these cancers. COL17A1 enhances invasive squamous cell carcinoma migration and invasion via the FAK/PI3K signal pathway. Thus, COL17A1 may play a dual role in certain cancers. COL17A1 is upregulated in pancreatic cancer and positively associated with poor prognosis (35). However, the mechanisms by which it promotes pancreatic cancer have not been clarified. The methylation levels and specific mechanisms by which COL17A1 promotes pancreatic cancer  The ROC curves of the risk score differentiating pancreatic cancer from normal tissues in the seven GEO datasets. *P < 0.05, ***P < 0.001, ****P < 0.0001.
under various p53 mutation states merit further investigation. CEP55 recruits PDCD6IP and TSG101, playing an important role in cytokinesis. CEP55 is upregulated in pancreatic cancer and is associated with poor survival (36). CEP55 upregulation induces invasion-related matrix metalloproteinase (MMP) and proliferation-related cyclin D1. CEP55 promotes pancreatic cancer proliferation, migration, and invasion in vitro and in vivo by activating the NF-κB signaling pathway and the PI3K/AKT signaling pathway. CEP55 was also reported to be upregulated in gastric, liver, lung, nasopharyngeal, and bladder cancers (37). Upregulation of CEP55 activates the PI3K/AKT signaling pathway in a concentration-dependent manner and promotes tumor proliferation, invasion, and metastasis. Its expression level is closely related to clinical stage and poor prognosis. Therefore, CEP55 is considered an ideal predictor of cancer prognosis. The biological function of ANKRD22 has not yet been fully elucidated. It was thought to be associated with the transition steps of somatic reprogramming, human ovulatory cascade, and T cell-mediated allograft rejection (38). Transcriptional profiling of peripheral blood in pancreatic cancer patients revealed that ANKRD22 mRNA was upregulated and could serve as a diagnostic biomarker in patients with AUC = 0.933 (39). In non-small cell lung cancer (NSCLC), ANKRD22 was upregulated in the tumor and correlated with relapse and overall survival. It promotes NSCLC proliferation by upregulating E2F1 transcription (40). The role of ANKRD22 in pancreatic cancer progression requires further study. The roles of ITGB6 and ARNTL2 in pancreatic cancer have not been reported. ITGB6 forms a complex with ITGA5, which is the receptor for fibronectin and cytokinin. The ITGA5B6 complex recognizes the RGD sequence, mediating cell adhesion, and RGD-dependent TGFB1 release (41). ITGB6 was upregulated and promoted invasion and metastasis in multiple cancers. Internalization of the ITGA5B6 complex promotes cancer cell invasion. ITGB6 upregulation in breast cancer was associated with poor survival and metastasis. Anti-ITGA5B6 antibody alone or with trastuzumab halted tumor growth (42). ITGB6 was also considered as a novel serum marker and a highly efficient target for immunoliposome-mediated drug delivery in colon cancer (43). ARNTL2 is a transcriptional activator and a core component of the circadian clock. Interruption of the circadian rhythm induces cardiovascular disease, cancers, metabolic syndromes, and aging. Variations of the genes governing the circadian pathway may be associated with cancer predisposition. ARNTL2 participates in tumor progression. It is deregulated in B leukemia and repressed by RelB and RelA in EBV-transformed B cells (44). It is associated with colorectal and breast cancer invasiveness, metastasis, and aggressiveness (45). It induces a complex prometastatic secretome and enables self-sufficient lung adenocarcinoma metastasis (46). Invasion and metastasis are common and occur early in pancreatic cancer. The roles of ARTL22 in these processes deserve further investigation.
The roles of MCOLN3 and SLC25A45 in cancer development have not yet been elucidated. MCOLN3 is a non-selective ligandgated cation channel that regulates membrane trafficking and mediates Ca 2+ release from the endosome to the cytoplasm (47). Current research on MCOLN3 focuses on sensory modalities. MCOLN3 resides mainly in endosome membranes, regulates autophagy, and may participate in autophagosome formation (48). Autophagy promotes and suppresses cancer occurrence and progression. Therefore, MCOLN3 may mediate autophagy in these processes. SLC25A45 is a transport protein in the mitochondrial membrane, containing active thyroid-responsive elements (49). SLC25A45 mutations are associated with chronic kidney disease and preterm birth (50). The SNP of SLC25A45 was associated with the mucinous histological subtype of epithelial ovarian cancer (51). Although MCOLN3 and SLC25A45 have not been studied intensively in cancer, data from the TCGA database revealed that MCOLN3 was downregulated in adrenocortical (ACC), breast invasive (BRCA), uterine corpus endometrial (UCEC), kidney renal clear cell (KIRC), and kidney renal papillary cell (KIRP) carcinomas, colon (COAD), lung (LUAD), lung squamous cell (LUSC), rectal (READ), and stomach (STAD) adenocarcinomas, pheochromocytoma, and paraganglioma (PCPG), thymoma (THYM), and uterine carcinosarcoma (UCS) (|LOG2FC| > 1 and P < 0.01) and was associated with relatively better survival in KIRC (HR < 1 and P < 0.05). SLC25A45 was downregulated in lower grade glioma (LGG), LUSC, and thyroid carcinoma (THCA). The roles of MCOLN3 and SLC25A45 in cancer are worthy of further investigation.
Immunotherapy is currently a routine cancer treatment option. CD8 + cytotoxic T lymphocytes recognize MHC Ipresenting antigens and are preferred for targeting tumor cells. On the other hand, CD4 + T lymphocytes play complex and important roles in tumor immunity. It is generally considered that CD4 + T cells compromise the majority of T cells in pancreatic cancer and are positively associated with metastasis and negatively associated with survival (52). The pioneer study of Zhang et al. further revealed that Kras-driven oncogenesis of pancreatic cancer established an immunosuppressive microenvironment via recruitment and activity of CD4 + T lymphocytes (53). Elimination of CD4 + T lymphocytes restored the antitumor function of CD8 + T lymphocytes and blocked carcinogenesis. The specific subsets of CD4 + T lymphocytes that play major immunosuppressive role remain to be elucidated. On the other hand, certain subsets of CD4 + T lymphocytes may also be needed for antitumor immunity. CD4 + helper T cells may promote and maintain cytotoxic T lymphocyte (CTL) memory, amplify T-and B cells, and help CTL overcome negative regulation (54). CD4 + T lymphocytes may eliminate tumor cells by cytolysis or by regulating the tumor microenvironment (55). In the current study, results calculated by algorithms indicated that CD4 + T lymphocyte infiltration was significantly downregulated in highrisk tumor tissues and was associated with poor prognosis. These results require verification by further experimental studies. Considering the differential infiltration level of immune cells between high-risk and low-risk groups of pancreatic cancer, highrisk patients may benefit from more accurate immunotherapy strategies. More detailed studies are also required to elucidate the specific role of each CD4 + T lymphocyte subset in order to enhance the antitumor efficacy of CD8 + cytotoxic T lymphocyte.
Our predictive model is based on the expression levels of genes in a selected panel. This approach is more economical and clinically practical than whole-genome sequencing. Our nomogram incorporating nine-gene signature and clinicopathological parameters may enable clinicians to determine individual patient's prognosis. Its graphical scoring system is easy to understand facilitating the customized treatment and making of medical decisions. To the best of our knowledge, the nine-gene prognostic signature described herein and the nomogram based on it have not been reported previously. Three previously defined prognostic signatures with published algorithms were used as controls in the current study. Yan et al. identified a four-gene signature (LYRM1, KNTC1, IGF2BP2, and CDC6) significantly associated with progression and prognosis of pancreatic cancer (8). Chen et al. proposed a 3-gene signature (SULT1E1, IGF2BP3, and MAP4K4) based on DNA methylation data that predicts poorer overall survival of pancreatic cancer (23). Liao et al. reported a nine-gene prognostic model (ARHGAP30, HCLS1, CD96, FAM78A, ARHGAP15, SLA2, CD247, GVINP1, and IL16) using weighted gene co-expression network analysis that may predict overall survival of pancreatic cancer patients after pancreaticoduodenectomy (24).These articles explored prognosis-related genes from different perspectives to establish prognostic signatures. No overlap was identified between the nine-gene prognostic signature we developed and the one previously defined. Our prognostic signature was identified to be superior or comparable to the previous defined signatures. Our study provides new insight into the molecular mechanism of pancreatic cancer and prediction of prognosis. Moreover, the DEGs obtained in this study were derived from the integrated analysis of multiple datasets, which is highly reliable. Four of the genes in the nine-gene signature had not yet been reported to be associated with pancreatic cancer prior to this study. These DEGs may be potential molecular targets to fight pancreatic cancer.
However, the present study had certain limitations. First, the main sources of our clinical information were datasets from the TCGA and GEO databases. Most of the patients therein are Whites, Africans, or Latinos. Caution must be taken when extrapolating our findings to patients from other ethnicities. The current study is driven by statistics of available retrospective data and the optimal cutoff is required to be determined before clinical application. Second, the establishment and verification of the nomogram were based on the TCGA database. Therefore, it will be necessary to verify using external datasets with complete clinical information and gene expression information in the future. Moreover, the protein expression levels of the prognosis related DEGs and their molecular mechanisms in pathogenesis and progression of pancreatic cancer depend on further experimental studies to elucidate.

CONCLUSION
Our study identified a nine-gene signature and a prognostic nomogram incorporating the gene signature and clinical prognostic factors to predict overall survival of pancreatic cancer. The nine-gene signature was closely associated with the progression, aggressiveness, and prognosis of pancreatic cancer and its constituents are potential therapeutic targets. The prognostic nomogram reliably predicted overall survival in pancreatic cancer and may facilitate individualized treatment and making of medical decisions.

AUTHOR CONTRIBUTIONS
ZL and YZ: conception and design, and study supervision. MW and XL: development of methodology, analysis and interpretation of data, and writing of the manuscript. ZL, TZ, and YZ: review of the manuscript.