Systematic Analysis of an Invasion-Related 3-Gene Signature and Its Validation as a Prognostic Model for Pancreatic Cancer

Background Pancreatic adenocarcinoma (PAAD) is a malignant tumor of the digestive system that is associated with a poor prognosis in patients owing to its rapid progression and high invasiveness. Methods Ninety-seven invasive-related genes obtained from the CancerSEA database were clustered to obtain the molecular subtype of pancreatic cancer based on the RNA-sequencing (RNA-seq) data of The Cancer Genome Atlas (TCGA). The differentially expressed genes (DEGs) between subtypes were obtained using the limma package in R, and the multi-gene risk model based on DEGs was constructed by Lasso regression analysis. Independent datasets GSE57495 and GSE62452 were used to validate the prognostic value of the risk model. To further explore the expression of the hub genes, immunohistochemistry was performed on PAAD tissues obtained from a large cohort. Results The TCGA-PAAD samples were divided into two subtypes based on the expression of the invasion-related genes: C1 and C2. Most genes were overexpressed in the C1 subtype. The C1 subtype was mainly enriched in tumor-related signaling pathways, and the prognosis of patients with the C1 subtype was significantly worse than those with the C2 subtype. A 3-gene signature consisting of LY6D, BCAT1, and ITGB6 based on 538 DEGs between both subtypes serves as a stable prognostic marker in patients with pancreatic cancer across multiple cohorts. LY6D, BCAT1, and ITGB6 were over-expressed in 120 PAAD samples compared to normal samples. Conclusions The constructed 3-gene signature can be used as a molecular marker to assess the prognostic risk in patients with PAAD.


BACKGROUND
Pancreatic adenocarcinoma (PAAD) is a malignant tumor of the digestive tract and is the fourth leading cause of cancer-related deaths worldwide (1). Since the early symptoms of PAAD are not obvious, its diagnosis is often difficult, and the primary tumor exhibits vascular invasion. Approximately 80-85% of patients with pancreatic cancer present with distant metastases at the time of diagnosis, thus making radical resection ineffective (2). Therefore, the search for more accurate and effective diagnostic and prognostic markers is of great significance for the stratification and individualization of patients with pancreatic cancer in the clinical setting.
At present, the prognostic prediction of patients with pancreatic cancer is mainly based on clinicopathologic features. However, the prognosis of patients with the same clinical stage and grade differs because of the high heterogeneity of pancreatic cancer. Moreover, the malignant progression of pancreatic cancer is accompanied by genetic changes. Therefore, the study of the molecular mechanisms underlying pancreatic cancer progression is key to prolonging the overall survival of patients with pancreatic cancer (3). However, the effect of a single gene in predicting the prognosis of a pancreatic tumor is often unsatisfactory and presents with some limitations; the combined detection of multiple genes is expected to facilitate the prognostic prediction of patients with pancreatic cancer. With the rapid development of bioinformatics and sequencing technology, an increasing number of studies have provided potential prognostic assessments for patients with pancreatic cancer. Li et al. (4) constructed a 9-gene signature using macrophage phenotypic switch-related genes in patients with pancreatic cancer. Wang et al. (5) constructed a 9-gene signature for predicting PAAD based on the expression of immune-related genes. However, most prognostic models include a large number of genes, which greatly increases the cost of medical treatment in clinical practice. Moreover, most studies are based on a comprehensive analysis of public databases and lack experimental data to verify and explore the role of the identified genes in the development of pancreatic cancer.
In this study, a molecular subtype of pancreatic cancer was constructed based on invasion-related genes using gene expression data from The Cancer Genome Atlas (TCGA), Gene Omnibus Expression (GEO), and other public databases. The relation between molecular subtypes, prognosis, and clinical features was further analyzed. A 3-gene prognostic model, composed of LY6D, BCAT1, and ITGB6, constructed with differentially expressed genes (DEGs) between the PAAD subtypes, could be used to evaluate the prognosis of patients with PAAD.

Data Source and Preprocessing
RNA-sequencing (RNA-seq) data and clinical follow-up information data from TCGA-PAAD samples were downloaded from the TCGA database. The expression data and clinical information from the GSE57495, GSE62452 and GSE28735 datasets were downloaded from the GEO database. A total of 97 invasion-related genes were collected from the CancerSEA website (Supplement Table 1).
The RNA-seq data from the TCGA-PAAD dataset was processed through the following steps: 1) Samples with no clinical follow-up information were removed; 2) The ENSEMBL gene IDs were converted to the Gene Symbol format; 3) The median value was calculated with multiple Gene Symbol expressions.
The following steps were used to process the GEO dataset: 1) Samples without clinical follow-up information were removed; 2) The probe IDs were converted to the Gene Symbol format; 3) Probes that corresponded to multiple genes were removed. 4) When multiple probes correspond to one gene, take the average value as the gene expression.
After preprocessing, we enrolled 176 samples from TCGA-PAAD, 63 samples from GSE57495 data set, 66 samples from GSE62452 data set, and 42 samples from GSE28735 dataset. The clinical characteristics of the patient samples are listed in Table 1.

Consistency Clustering Algorithm and Gene Set Enrichment Analysis (GSEA)
The expression profiles of 97 invasion-related genes were extracted from the TCGA-PAAD dataset, and univariate Cox regression analysis was performed to select significant prognostic genes using coxph function in R (p < 0.05). Next, the genes with significant results from the univariate Cox analysis were clustered using ConsensusClusterPlus (V1.48.0; parameters: reps = 100, pitem = 0.8, pfeature = 1, and distance = "Canberra"). The Pam and Canberra distances were used as a clustering algorithm and distance measure, respectively.
The gene set c2.cp.kegg.v7.0.symbols.gmt was selected, and significantly enriched pathways between different molecular subtypes were analyzed by GSEA. PAAD samples were divided into either a C1 or C2 subtype based on gene expression data from the TCGA-PAAD dataset in the GSEA input file. The thresholds for pathway enrichment analysis were p < 0.05 and false discovery rate (FDR) < 0.25.

Identification of DEGs
DEGs between C1 and C2 subtypes were calculated using the limma package (6), and the filtering thresholds were FDR < 0.05 and | log 2 fold-change (FC) | > 1. The identified DEGs were subjected to Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis and Gene Ontology (GO) enrichment analysis using the WebGestaltR (v0.4.2) package in R software.

Random Grouping of Training Set Samples
The 176 samples in the TCGA-PAAD dataset were divided into a training set and validation set. To avoid the effect of random assignment bias on the stability of subsequent modeling, 200 samples were assigned to random groups. The samples were grouped according to a training set: validation set ratio of 3:2. After dividing the samples, there were 106 samples in the training set and 70 samples in the validation set.

Lasso Regression Analysis and Stepwise Regression Analysis of Training Set Data
Univariate Cox regression analysis was performed for each DEG (538 in total) using the coxph function in R to identify prognostic genes, and p < 0.05 was selected as the threshold for filtering. Lasso regression analysis was performed to further reduce the number of genes in the risk model using the glmnet package in R (7). In stepwise regression analysis, the selection of the model starts with the most complex model from which one variable is removed at a time to reduce the number of parameters according to the Akaike Information Criterion (AIC). The smaller the pvalue of the regression model, the more superior the model. This indicates that the regression model fits the data well with fewer parameters. The prognostic model is made fit for clinical applications by performing stepwise regression to further reduce the number of genes.
The prognostic model was constructed based on the following equation: where n refers to the number of genes identified for the multivariate Cox regression model; exp(Gi) is the expression value of gene i; and bi is the coefficient for gene i.

Immunohistochemistry
To verify the expression of the candidate three genes, tissue microarrays (TMA) comprised of 120 PAAD tissues and 30 normal samples were obtained from Shanghai Outdo Biotech Co., Ltd. (Shanghai, China). The clinicopathological details of 120 PAAD tissues were shown in Table 2. The studies were conducted in accordance with the International Ethical Guidelines for Biomedical Research Involving Human Subjects (CIOMS), and the research protocols were approved by the  The TMA slides were dried overnight at 37°C, dewaxed in xylene, and dehydrated in a gradient ethanol series. Antigens retrieval was performed by heating the tissue sections in a microwave oven inside a vessel filled with EDTA antigen retrieval buffer (pH 9.0). Subsequently, the tissue sections were immersed in 3% hydrogen peroxide for 25 min to block the activity of endogenous peroxides. Next, the TMA tissues were coated with 3% bovine serum albumin (BSA) and sealed at room temperature for 30 min to reduce non-specific staining. Then, the TMA slides were incubated with anti-LY6D (1: 200 dilution; Novus Biologicals, NBP1-84029), anti-BCAT1 (1:50 dilution; Abcam, ab197941), and anti-ITGB6 (1:10 dilution; Abcam, ab197672) overnight at 4°C.
The tissues were rinsed with 0.01 mol/L phosphate buffer saline (PBS; pH = 7.4) for 5 min each. The tissues were incubated at room temperature for 50 min with horseradish peroxidase (HRP)-labeled goat anti-rabbit secondary antibody (1:200 dilution, ServiceBio, GB23303). Then, the tissues were washed in PBS and stained with 3,3-diaminobenzidine (DAB). Finally, the TMA sections were counterstained with Mayer's hematoxylin, dehydrated, and fixed. To evaluate IHC staining, semi-quantitative scoring criteria were used.

Identification of Molecular Subtypes Based on Invasion-Related Genes
Thirty-five genes were found to be significantly associated with the prognosis of pancreatic cancer using univariate Cox analysis (Supplement Table 2). Consistent cluster analysis showed that the samples could be clustered together at k=2 ( Figures 1A, B). The expression levels of the invasion-related genes were significantly different between the C1 and C2 subtypes, and most genes were overexpressed in the C1 subtype ( Figure 1C). The relationship between the subtypes and prognosis was further analyzed, and results showed that there were significant differences in survival times between the C1 and C2 subtypes ( Figures 1D, E, log-rank p < 0.05).
The results of the GSEA analysis showed the activation of more tumor-related pathways in the C1 subtype, such as pathways in cancer, notch signaling pathway, focal adhesion, extracellular matrix (ECM)-receptor interaction, and TGF-b signaling pathway ( Figure 1F), suggesting that the C1 subtype is more closely related to cancer than the C2 subtype.

Analysis of DEGs Between Subtypes
According to the thresholds mentioned in the methods section, 538 DEGs were obtained, of which 531 genes were upregulated and 7 genes were downregulated (Supplement Table 3). The results demonstrated that the C1 subtype contains more upregulated genes than the C2 subtype. The volcano map of upregulated and downregulated DEGs between the two subtypes is shown in Supplementary Figure

Comparison of Immune Score Between Molecular Subtypes
To identify the relationship between molecular subtypes and immune scores in the TCGA-PADD dataset, the ESTIMATE package was used to evaluate the three immune scores: stromal, immune, and estimate scores. MCPcounter was used to evaluate 10 types of immune cells, and the single-sample GSEA (ssGSEA) method in the GSCA package was used to evaluate 28 types of immune cells (8). Meanwhile, the difference in immune scores between the two molecular subtypes was compared. The results showed that the immune scores of the C1 subtype were higher than those of the C2 subtype (Figures 2A-C). The heatmap of the immune scores of the two subtypes is shown in Figure 2D.

Risk Model of Pancreatic Cancer Based on Invasion-Related Genes
By performing univariate Cox analysis of the DEGs between the C1 and C2 subtypes, 18 prognostic genes were identified. Lasso regression analysis was performed to further reduce the number of prognostic genes. The locus of each independent variable is shown in Supplementary Figure 2A. As the value of lambda (l) increased, the number of independent variables tending to zero also increased. A 10-fold cross-validation was performed to construct the model, and the confidence interval under each l is shown in Supplementary Figure 2B. The model was found to be optimal when l = 0.05667557, so a l of 0.0567557 was chosen for further analysis of the prognostic genes. Six genes, namely LY6D, DKK1, Risk scores of each sample were calculated based on the expression level of LY6D, BCAT1, and ITGB6, and a risk score distribution was plotted for each sample, as shown in Figure 3A.
The results showed that a higher risk score was associated with worse outcomes, and high expression levels of LY6D, BCAT1, and ITGB6 were associated with a higher risk score. The timeROC package was used to analyze the receiver operating characteristic (ROC) curve of risk score; the 1-, 2-, and 3-year predictive classification efficiencies were 0.76, 0.78, and 0.75, respectively, as shown in Figure 3B. The samples were divided into a high-risk group and a low-risk group based on the risk scores. Finally, 50 and 56 samples were placed into the high-and low-risk groups, respectively. The KM curve showed a significant difference in the expression of DEGs between the high-and lowrisk groups (p < 0.01) ( Figure 3C).

Verification of Robustness of the 3-Gene Prognostic Model Using Internal and External Datasets
Verification of the Robustness of the 3-Gene Prognostic Model Using Internal Datasets To determine the robustness of the model, the risk score distribution of the TCGA validation set and all dataset samples was calculated using the same coefficients as those of the training set. The risk score distribution of the TCGA validation set suggested that samples with a high risk score are associated with a worse prognosis, as shown in Figure 4A. The 1-, 2-, and 3-year predictive classification efficiencies of the risk scores were 0.67, 0.76, and 0.87, respectively ( Figure 4B). These results demonstrated that the prognosis of the high-risk group was significantly worse than that of the low-risk group ( Figure 4C).
The risk score distribution trend of all TCGA datasets was consistent with those of the training set ( Figure 5A). The predictive classification efficiencies of the 1-, 2-, and 3-year ROCs were 0.73, 0.77, and 0.81, respectively ( Figure 5B). According to the above classification, 89 and 87 samples were categorized into the high-and low-risk groups, respectively, in all TGGA datasets. The prognosis of the high-risk group was significantly worse than that of the low-risk group ( Figure 5C).

Validation of the Robustness of the 3-Gene Prognostic Model Using Three Independent Cohorts
The robustness of the model was further verified with three independent validation cohorts GSE57495, GSE62452 and GSE28735. The 1-, 3-, and 5-year ROCs in the GSE57495 dataset were 0.63, 0.74, and 0.78, respectively ( Figure 6A). The 1-, 3-, and 5-year ROCs in the GSE62452 were 0.56, 0.71, and 0.84, respectively ( Figure 6C). The 1-, 3-, and 5-year ROCs in the GSE28735 were 0.61, 0.72, and 0.68, respectively ( Figure 6E). Therefore, the predictive performance of the model was stable in different cohorts. Finally, the samples with a risk score greater than zero after zscore method were classified into the high-risk group and those with a risk score less than 0 were classified into the low-risk group. In the GSE57495 cohort, 30 and 33 samples were categorized into the high-and low-risk groups, respectively, with significant prognostic differences between the two groups ( Figure 6B). In the GSE62452 cohort, 33 samples each were categorized into the high and low-risk groups, respectively, with significant prognostic differences between the two groups ( Figure 6D). In the GSE28735 cohort, 21 samples each were categorized into the high and low-risk groups, respectively, with significant prognostic differences between the two groups ( Figure 6F).

Risk Model and Prognostic Analysis of Clinical Features
Further analysis of the relationship between the risk score and clinical features showed that the 3-gene prognostic model could significantly distinguish between age, sex, TNM stage, clinical stage, tumor grade, alcohol consumption, chemotherapy, and radiation therapy between the high-and low-risk groups ( Figures 7A-P, p < 0.05). This suggests that the model also has good predictive power in distinguishing different clinical features. The distribution of the risk score among the clinical features of the two groups was further compared. The results demonstrated that the risk score is significantly different between the N-stage and tumor grade (p < 0.05). The higher the tumor grade, the higher the risk score ( Figure 7Q). The risk score of N1 was significantly higher than that of N0 ( Figure 7R). The risk score of the C1 subtype with a poor prognosis was significantly higher than that of the C2 subtype with a good prognosis ( Figure 7T). Moreover, the risk score was significantly different among existing immune molecular subtypes ( Figure 7S).

Construction of the Nomogram
In the TCGA-PAAD dataset, the univariate Cox regression analysis showed a significant correlation between the risk type and survival, while the multivariate Cox regression analysis showed a significant correlation between the risk score (Hazard ratio [HR] = 1.94, 95% confidence interval [CI] = 1.25-3.01, and p = 0.003) and survival. These results demonstrate the good predictive performance of the identified 3-gene prognostic model in clinical applications. Furthermore, the N stage (HR = 4.17, 95% 1. 16-14.93, and p = 0.028) and grade (HR = 3.06, 95% CI = 1.3-7.21, and p = 0.011) were identified as independent prognostic risk factors for patients with pancreatic cancer. Chemotherapy (HR = 0.13, 5% CI = 0.05-0.36, and p < 0.001) was identified as an independent prognostic protective factor ( Figures 8A, B).
The nomogram, which displays the results of the risk model directly and effectively, can be conveniently applied to the prediction of an outcome. The nomogram uses the length of the line to indicate the degree of influence that different variables have on the result and the influence of different values of variables on the result. According to the results of the univariate and multivariate analyses, the nomogram was constructed with the following clinical features: N stage, tumor grade, chemotherapy, and risk score ( Figure 8C). The results showed that the risk score has the greatest effect on survival prediction, indicating that the risk model based on the 3-gene signature can accurately predict the prognosis of patients with pancreatic cancer. A calibration diagram was used to visualize the nomogram. The results showed that the nomogram performed well in determining the prognostic risk of patients with pancreatic cancer ( Figure 8D).

Comparison of Risk Model With Other Models
Four prognostic risk models, including 15-gene signature (Chen) (9), 7-gene signature (Cheng) (10), and 6-gene signature (Stratford) (11) models, were compared with the identified 3gene prognostic model. To facilitate comparison among the models, the risk score of each TCGA-PAAD sample was calculated using the same method, and the risk score was zscored according to the corresponding gene in all three models. Genes with a risk score greater than zero were categorized into a high-risk group and those with a risk score less than zero were categorized into a low-risk group. The prognosis difference between the two groups was further analyzed. There were significant differences in outcomes between the high-risk and low-risk groups in all three risk models ( Figures 9B, D, F, logrank p < 0.05), the area under the curve (AUC)s at 1-, 2-, and 3year of Cheng and Stratford models were lower than that of our model ( Figures 9C, E). Although our 1-year AUC is smaller than the Chen model (0.73 vs 0.74), the AUC at 2 and 3 years is larger  Table 4). Therefore, the 3-gene signature identified in this study represents a more reasonable and efficient model to determine the prognostic risk of patients with pancreatic cancer with the use of fewer genes.

Expression of LY6D, BCAT1, and ITGB6 in Pancreatic Cancer
The differences in the expression of the LY6D, BCAT1, and ITGB6 genes in PAAD and adjacent tissues were investigated. The expressions of LY6D, BCAT1, and ITGB6 in 120 cases of pancreatic cancer and 30 cases of para-carcinoma were detected by immunohistochemistry. The results showed that BCAT1, LY6D, and ITGB6 were significantly overexpressed in cancer tissues ( Figures 10A-C). Many cases in the TMA cohort were not effectively followed up. Therefore, to compensate for this limitation, the Kaplan-Meier plotter database was used to obtain 177 samples with overall survival data and 69 cases with recurrence-free survival data. The results showed that patients with high expression of LY6D, BCAT1, and ITGB6 genes have a significantly worse prognosis than those with a low expression both in terms of overall survival and recurrence-free survival ( Figures 10D-I). Our immunohistochemical results demonstrated that LY6D, BCAT1, and ITGB6 proteins were all overexpressed in PAAD samples compared to normal samples. Therefore, it can be speculated that these genes act as oncogenes in pancreatic cancer, and the upregulation of these genes is associated with a significantly worse prognosis in patients with pancreatic cancer.

Flow Chart of Research Methodology
A flowchart has been drawn to allow readers to better understand the research process of this study ( Figure 11).

DISCUSSION
Pancreatic cancer is a highly aggressive malignancy that is associated with a high mortality rate and poor prognosis. The 5-year survival rate for patients with pancreatic cancer is less than 10% (12). In 2021, 60,430 new pancreatic cancer cases and 48,220 pancreatic cancer-related deaths are expected in the United States (12). By 2030, pancreatic cancer is estimated to be the second most common cause of cancer-related deaths in the United States (13). The malignant progression of pancreatic cancer is often accompanied by changes in the expression of multiple genes, and the abnormal expression of specific genes may affect the prognosis of patients with pancreatic cancer. These genes may also serve as effective targets for personalized cancer therapy (14,15). In recent years, with the rapid development of sequencing technology, high-throughput genomics has allowed for the exploration of key genes involved in cancer tumorigenesis and development. Moreover, highthroughput genomics allows for further analysis of the mechanisms related to tumorigenesis and development.
In this study, 176 TCGA-PAAD samples were genotyped based on 97 invasion-related genes, and two subtypes (C1 and C2) were obtained. The C1 subtype with a poor prognosis was more associated with the involvement of tumor-related pathways such as the Notch signaling pathway and ECM-receptor interaction. The Notch signaling pathway plays an important role in the maintenance of pancreatic tumor phenotypes (16), and the downregulation of the Notch receptor is associated with decreased proliferation, increased apoptosis, anchor-dependent growth, and decreased invasiveness of pancreatic cancer cells (17). However, matrix proteins derived from tumor cells may promote the development and metastasis of ductal adenocarcinoma of the pancreas (18). Five hundred and thirtyeight DEGs between the C1 and C2 subtypes were identified using the limma package, of which 531 genes were upregulated, and 7 genes were downregulated. We constructed a 3-gene signature using the LY6D, BCAT1, and ITGB6 genes out of the 538 identified DEGs. Lymphocyte 6 (Ly6) complex is a group of alloantigens, and LY6D is an important member of the Ly6 family. LY6D plays an important role in the maintenance of phenotypic and transcriptome heterogeneity of progenitor cells and the proliferation and differentiation of lymphocyte B during the early stages of lymphogenesis (19,20). LY6D also plays an important role in cancer; it serves as a prognostic marker for advanced prostate cancer (21) and stage I non-small cell lung carcinoma (NSCLC) (22), drug resistance-associated marker for laryngeal squamous cell carcinoma (23), long-range metastasis marker for patients with ESR1-positive breast cancer (24), and a marker of urothelial and squamous cell differentiation (25). Apart from its involvement in cell adhesion, LY6D also regulates important interactions between endothelial cells and head and neck squamous cell carcinoma cells (26). In addition to glucose and fatty acid metabolism, amino acid metabolism plays an important role in tumor metabolic reprogramming.
The study has shown that the metabolism of Branchedchain amino acids (BCAA) is potentially linked with development of pancreatic ductal adenocarcinoma (27), and BCAT1, an enzyme involved in the degradation of branchedchain amino acids, is responsible for initiating the catabolism of such amino acids (28).   It has been reported that pancreatic ductal adenocarcinoma cells reprogram fibroblasts to upregulate the expression of BCAT1, to meet the cancer cells' demand for branched-chain a-ketoacid (BCKAs) under BCAA deprivation (29).
The expression of BCAT1 is also upregulated in hepatocellular carcinoma (HCC) (30), breast cancer (31), and NSCLC (32), and indicates a poor prognosis. In HCC, BCAT1 plays a pathogenic role by promoting cell proliferation and chemoresistance (33). BCAT1 regulates mTOR-mediated autophagy via branched-chain amino acid metabolism, thus reducing the sensitivity of cancer cells to cisplatin (34).
As a member of the integrin b (ITGB) superfamily, the overexpression of ITGB6 is associated with the upregulation of the Notch signaling pathway in pancreatic cancer and is associated with immunosuppression in pancreatic cancer (35). Nine genetic markers, including ITGB6, can be used to predict the overall survival of patients with pancreatic cancer (36). ITGB6, which is highly expressed in colorectal cancer, is associated with a poor prognosis (37). ITGB6 can also be used as a tumor-specific surface antigen (TSA) to identify cell surface targets of CAR-T cell therapy and antibody-drug conjugates in breast cancer (38). Studies have shown that ITGB6 was a liver-metastasis-related gene for PAAD patients (39) and the overexpression of ITGB6 was significantly associated with advanced AJCC stage and histologic grade, and w o r s e p r o g n o s i s i n p a n c r e a t i c c a n c e r ( 4 0 ) . O u r immunohistochemical results showed that LY6D, BCAT1, and ITGB6 were all overexpressed in pancreatic cancer, which was consistent with the previous results.
Although there are many multi-gene prognostic models for PAAD, there is no model based on invasion-related gene signature to predict the prognosis of pancreatic cancer. Invasion genes play an important role in metastasis as well as the development of cancer. Moreover, some prognostic signatures contain multiple genes (15-gene signature, 7-gene signature, and 6-gene signature), indicating that it is necessary to assess the expression profile of more genes in a patient-specific manner, which adds extra cost to medical care. Our 3-gene prognostic model has a higher ROC than the above models in terms of prediction of 1-, 2-, and 3-year survival rates of patients with pancreatic cancer, while having fewer genes. Therefore, our model has certain advantages in PAAD. However, our model also presents certain limitations. First, information in the TCGA database is primarily limited to Caucasian and African populations; therefore, and data from the Asian population are missing from this study. Additionally, our study was a retrospective study of patients with pancreatic cancer, and prospective studies should be conducted to validate the prognostic characteristic and confirm the stable performance of the 3-gene prognostic model. Finally, the molecular mechanisms by which LY6D, BCAT1, and ITGB6 drive the malignant progression of pancreatic cancer require further verification.

CONCLUSIONS
In this study, we divided the TCGA-PAAD samples into two subtypes based on the differential expression of the invasionrelated genes and constructed a prognostic molecular signature consisting of three genes, including LY6D, BCAT1, and ITGB6, based on the DEGs between the two subtypes. The LY6D, BCAT1, and ITGB6 genes were upregulated in pancreatic cancer samples. The 3-gene prognostic model also exhibited a good AUC in both the training and validation sets. Therefore, this 3-gene prognostic model, based on the expression of three invasion-related genes, may be used to assess the prognosis of patients with pancreatic cancer. This will help in the stratification of patients for personalized cancer therapy.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.