Qualitative Transcriptional Signature for the Pathological Diagnosis of Pancreatic Cancer

It is currently difficult for pathologists to diagnose pancreatic cancer (PC) using biopsy specimens because samples may have been from an incorrect site or contain an insufficient amount of tissue. Thus, there is a need to develop a platform-independent molecular classifier that accurately distinguishes benign pancreatic lesions from PC. Here, we developed a robust qualitative messenger RNA signature based on within-sample relative expression orderings (REOs) of genes to discriminate both PC tissues and cancer-adjacent normal tissues from non-PC pancreatitis and healthy pancreatic tissues. A signature comprising 12 gene pairs and 17 genes was built in the training datasets and validated in microarray and RNA-sequencing datasets from biopsy samples and surgically resected samples. Analysis of 1,007 PC tissues and 257 non-tumor samples from nine databases indicated that the geometric mean of sensitivity and specificity was 96.7%, and the area under receiver operating characteristic curve was 0.978 (95% confidence interval, 0.947–0.994). For 20 specimens obtained from endoscopic biopsy, the signature had a diagnostic accuracy of 100%. The REO-based signature described here can aid in the molecular diagnosis of PC and may facilitate objective differentiation between benign and malignant pancreatic lesions.


INTRODUCTION
Pancreatic cancer (PC) is the fourth leading cause of cancer deaths in the United States and the sixth leading cause in China. Patients with PC have a 5-year survival rate of 8.5% in the United States and 7.2% in China (Siegel et al., 2019; Zhao et al., 2019). The diagnosis and treatment of pancreatic cancer remain challenging. Patients with early-stage PC are usually asymptomatic, and only about 10% of patients are diagnosed at an early stage (Singhi et al., 2019). Serum cancer antigen 19-9 (CA 19-9) is the only marker approved by the United States Food and Drug Administration for use in the routine management of PC. Imaging techniques, such as computed tomography (CT), magnetic resonance imaging (MRI), endoscopic ultrasonography (EUS), and endoscopic retrograde cholangiopancreatography (ERCP), can help in the diagnosis of PC. EUS is currently the most effective imaging method for diagnosis and is superior to CT and MRI (Costache et al., 2017). However, the accuracy of EUS for early detection of PC is still unsatisfactory, and it is often difficult to distinguish benign and malignant pancreatic lesions (Singhi et al., 2019).
Therefore, it is sometimes necessary to perform a tissue biopsy for the pathological diagnosis of PC. In clinical practice, more accurate and definitive pathological diagnoses can be made using biopsies from EUS-guided fine needle aspiration (EUS-FNA), and this method has an overall diagnostic accuracy of 91% (Banafea et al., 2016). However, biopsy samples may be collected from incorrect locations, and this can lead to false-negative results. Repeated EUS-FNA increases the diagnostic accuracy to 96.3% but also increases the risk of complications (Suzuki et al., 2013). Thus, it is vital to develop molecular signatures to complement the present histological methods for diagnosis of early PC, especially when the locations of biopsy samples are incorrect. Recent studies reported that within-sample relative expression orderings (REOs) of genes are insensitive to experimental batch effects and can provide qualitative transcriptional signatures that can be applied to samples at an individual level (Eddy et al., 2010; Wang et al., 2015). Within-sample REOs are also insensitive to variable proportions of tumor epithelial cells sampled from different tumor locations in the same patient, RNA degradation during specimen storage and preparation, and amplification bias for minimal specimens, all of which can lead to failure of quantitative transcriptional signatures in clinical applications. Thus, within-sample REOs may provide a robust qualitative signature for the early diagnosis of PC.
In this study, based on the REOs of 12 gene pairs, we identified a qualitative transcriptional signature for the early diagnosis of PC. The signature can accurately discriminate both PC tissues and adjacent-normal tissues from normal pancreatic tissues and non-PC pancreatitis in both biopsy and surgical resection samples.
Transcriptome HTSeq-counts data of the TCGA-Pancreatic Adenocarcinoma (PAAD) project were downloaded from the Genomic Data Commons using the R package "TCGAbiolinks," including 183 samples of primary pancreatic tumors that were not formalin-fixed and paraffin-embedded (FFPE). Ensembl IDs for protein-coding messenger RNAs (mRNAs) were annotated to gene symbols using GENCODE27. The number of fragments per kilobase of non-overlapped exons per million fragments mapped (FPKM) was calculated first and was then transformed into transcripts per kilobase million (TPM) values. All mRNAs with TPM values < 1 in more than 90% of the samples were considered to be noise and removed prior to downstream analysis. Data of the GTEx project were retrieved from the UCSC Xena browser. For microarray datasets measured by Affymetrix platforms, raw CEL files were processed with the robust multiarray average (RMA) procedure for background adjustment using the R package "affy" (Irizarry et al., 2003). For other platforms, the processed data were obtained from GEO and utilized for subsequent analyses.
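The FPKM-to-TPM transformation and the noise filter described above can be illustrated with a minimal Python sketch. It assumes the standard relation TPM = FPKM / sum(FPKM) × 10^6 within each sample and a genes-by-samples matrix layout; the function names and the toy values are hypothetical and are not taken from the study's own pipeline, which was run in R.

```python
import numpy as np

def fpkm_to_tpm(fpkm):
    """Convert a genes x samples FPKM matrix to TPM, one sample (column) at a time."""
    fpkm = np.asarray(fpkm, dtype=float)
    # Standard rescaling: each sample's values are made to sum to one million.
    return fpkm / fpkm.sum(axis=0, keepdims=True) * 1e6

def drop_noise_genes(tpm, min_tpm=1.0, max_low_fraction=0.9):
    """Remove genes whose TPM is below min_tpm in more than max_low_fraction of samples."""
    low_fraction = (tpm < min_tpm).mean(axis=1)
    keep = low_fraction <= max_low_fraction
    return tpm[keep], keep

# Toy matrix (hypothetical values): 3 genes x 2 samples.
fpkm = np.array([[10.0, 20.0],
                 [30.0, 20.0],
                 [0.0, 0.0]])
tpm = fpkm_to_tpm(fpkm)
filtered, keep = drop_noise_genes(tpm)
```

In this toy case the third gene is below 1 TPM in every sample and is therefore dropped, while each column of `tpm` sums to one million.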

Identification of Qualitative REO-Based PC Diagnosis Signature
For the purpose of this study, tumor samples were identified as "cancer" or "cancer-adjacent normal" because the transcriptional characteristics of apparently normal tissue adjacent to a tumor differ from those of healthy normal tissue (Aran et al., 2017), whereas non-tumor samples comprised "healthy normal" or "pancreatitis." Data analysis consisted of several sequential steps (Figure 1). First, for each pair of genes i and j within a sample, the REO was assigned a value of "1" if gene i had higher expression and "0" if gene j had higher expression. A "reversal gene pair" (RGP) was defined by the presence of the same REO pattern in more than 85% of the tumor samples ("cancer" and "cancer-adjacent normal") and a reversed pattern in more than 85% of the non-tumor samples ("healthy normal" and "pancreatitis") in the training dataset. Then, RGPs were filtered using the tuning dataset to establish a candidate REO signature of PC. The rank difference for each RGP was computed for each sample as:
$$R_{ij} = R_i - R_j$$
where $R_i$ and $R_j$ represent the ranks of genes i and j within a sample, and $R_{ij}$ represents the rank difference.
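As an illustration of the RGP definition, the following Python sketch screens gene pairs against the 85% threshold in both sample groups. It is a simplified sketch under stated assumptions: expression matrices are genes by samples, tied values count toward neither ordering, and the function name `find_rgps` and the toy data are hypothetical. Each returned pair is oriented so that the first gene is the higher one in tumor samples.

```python
from itertools import combinations
import numpy as np

def find_rgps(tumor, nontumor, pairs, threshold=0.85):
    """Return gene pairs whose REO is stable in tumor samples and reversed in non-tumor samples."""
    rgps = []
    for i, j in pairs:
        t_up = np.mean(tumor[i] > tumor[j])        # fraction of tumor samples with i > j
        t_down = np.mean(tumor[i] < tumor[j])      # fraction of tumor samples with i < j
        n_up = np.mean(nontumor[i] > nontumor[j])
        n_down = np.mean(nontumor[i] < nontumor[j])
        if t_up > threshold and n_down > threshold:
            rgps.append((i, j))
        elif t_down > threshold and n_up > threshold:
            rgps.append((j, i))                    # flip so the first gene is higher in tumor
    return rgps

# Toy expression matrices (hypothetical values): 3 genes x 4 samples each.
tumor = np.array([[9, 8, 9, 7],
                  [1, 2, 1, 3],
                  [5, 5, 5, 5]])
nontumor = np.array([[1, 2, 1, 2],
                     [9, 8, 9, 7],
                     [5, 5, 5, 5]])
pairs = list(combinations(range(3), 2))
rgps = find_rgps(tumor, nontumor, pairs)
```

Enumerating all pairs is quadratic in the number of genes, so a genome-wide screen would in practice need a vectorized or chunked implementation rather than this explicit loop.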
Then, $\overline{R_{ij}}$, the geometric mean of the absolute mean values of $R_{ij}$ in tumor samples and in non-tumor samples, was used to assess the extent of reversal of the gene pair between tumor and non-tumor samples:
$$\overline{R_{ij}} = \sqrt{\left|\overline{R^{T}_{ij}}\right| \times \left|\overline{R^{N}_{ij}}\right|}$$
where $\overline{R^{T}_{ij}}$ is the mean value in tumor samples, and $\overline{R^{N}_{ij}}$ is the mean value in non-tumor samples. Ideally, the sign of each $R^{X}_{ij}$ should be uniform within each sample type $X \in \{T, N\}$. However, in most cases, some samples of a given type X have an $R^{X}_{ij}$ with a different sign, and this can cause bias because of the absolute value operation. Therefore, an $R^{X}_{ij}$ with the "wrong" sign was set to zero before subsequent analysis. Thus, a higher $\overline{R_{ij}}$ value corresponds to a larger reversal of the REO of the gene pair between tumor and non-tumor samples.
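The reversal score can be sketched in Python as follows. This is an illustrative sketch, not the study's code: it assumes that ranks increase with expression (rank 1 is the lowest), that the pair (i, j) is oriented so that gene i is the higher one in tumor samples, and that ties are broken arbitrarily by `argsort`. The function names and toy values are hypothetical.

```python
import numpy as np

def within_sample_ranks(expr):
    """Rank genes within each sample (column); rank 1 is the lowest expression."""
    return expr.argsort(axis=0).argsort(axis=0) + 1

def reversal_score(ranks_tumor, ranks_nontumor, i, j):
    """Geometric mean of the absolute mean rank differences in tumor and non-tumor samples."""
    d_t = (ranks_tumor[i] - ranks_tumor[j]).astype(float)
    d_n = (ranks_nontumor[i] - ranks_nontumor[j]).astype(float)
    # Rank differences with the "wrong" sign are set to zero before averaging:
    # positive is expected in tumor samples, negative in non-tumor samples.
    d_t[d_t < 0] = 0.0
    d_n[d_n > 0] = 0.0
    return np.sqrt(abs(d_t.mean()) * abs(d_n.mean()))

# Toy expression matrices (hypothetical values): 3 genes x 2 samples each.
tumor = np.array([[9, 8], [1, 2], [5, 5]])
nontumor = np.array([[1, 2], [9, 8], [5, 5]])
score = reversal_score(within_sample_ranks(tumor),
                       within_sample_ranks(nontumor), 0, 1)
```

Here gene 0 outranks gene 1 by two positions in every tumor sample and trails it by two positions in every non-tumor sample, so the score equals 2.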
The selected candidate RGPs were then sorted in descending order according to their $\overline{R_{ij}}$ values, and the RGP with the largest value was set as the seed. Then, a forward selection procedure was used, with one RGP entered at a time, to evaluate the classification accuracy based on a voting rule: if more than half of the RGPs in the signature showed the tumor REO in a sample, the sample was classified as "tumor"; otherwise, it was classified as "non-tumor". This procedure selected the smallest set of RGPs that achieved the highest accuracy.
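The voting rule and the forward selection over ranked RGPs can be sketched as follows. This Python sketch assumes that pairs are already sorted by reversal score and oriented so that the first gene is higher in tumor samples, and it uses plain accuracy as the selection criterion (the Results report the geometric mean of sensitivity and specificity instead). The names `vote` and `forward_select` and the toy data are hypothetical.

```python
import numpy as np

def vote(expr, pairs):
    """Call a sample 'tumor' (True) if more than half of the pairs show the tumor REO."""
    votes = np.array([expr[i] > expr[j] for i, j in pairs])
    return votes.sum(axis=0) > len(pairs) / 2

def forward_select(expr, is_tumor, ranked_pairs):
    """Add ranked pairs one at a time; keep the smallest prefix with the highest accuracy."""
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranked_pairs) + 1):
        acc = float(np.mean(vote(expr, ranked_pairs[:k]) == is_tumor))
        if acc > best_acc:  # strict '>' keeps the smaller set when accuracies tie
            best_k, best_acc = k, acc
    return ranked_pairs[:best_k], best_acc

# Toy data (hypothetical values): 6 genes x 4 samples (two tumor, two non-tumor).
expr = np.array([[9, 2, 1, 1],
                 [1, 8, 9, 9],
                 [7, 7, 2, 8],
                 [3, 3, 6, 2],
                 [5, 5, 1, 1],
                 [2, 2, 4, 4]])
is_tumor = np.array([True, True, False, False])
ranked_pairs = [(0, 1), (2, 3), (4, 5)]
selected, accuracy = forward_select(expr, is_tumor, ranked_pairs)
```

In this toy example the first two pairs each misclassify one sample, and only the three-pair vote classifies all four samples correctly, so all three pairs are retained.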

Performance Evaluation
All samples in the training, tuning, and validation datasets were first pooled together to assess the general predictive performance in different samples (tumor, cancer-adjacent normal, pancreatitis, and healthy normal) using receiver operating characteristic (ROC) analysis and calculation of the area under the curve (AUC). The accuracy was defined as the proportion of correctly identified samples in the entire cohort, and the AUC, sensitivity, and specificity were calculated using the R package "pROC". The diagnostic performance was further evaluated in each independent dataset by calculation of sensitivity and specificity. In this procedure, sensitivity refers to the proportion of true positives (tumor samples) that were correctly identified, and specificity to the proportion of true negatives (non-tumor samples) that were correctly identified. Sensitivity and specificity were recorded at the median cutoff of the voting threshold.
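For illustration, sensitivity, specificity, and their geometric mean (the summary measure reported in the Results) can be computed from binary calls as in the Python sketch below. The study itself used the R package "pROC"; the function name and the toy labels here are hypothetical.

```python
import math

def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity, geometric mean) for boolean tumor calls."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    fp = sum(p and (not a) for p, a in zip(predicted, actual))
    sensitivity = tp / (tp + fn)   # correctly identified tumor samples
    specificity = tn / (tn + fp)   # correctly identified non-tumor samples
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)

# Toy labels (hypothetical): four tumor and two non-tumor samples.
actual = [True, True, True, True, False, False]
predicted = [True, True, True, False, False, False]
sens, spec, gmean = sensitivity_specificity(predicted, actual)
```

The geometric mean penalizes a classifier that performs well on only one class, which is useful when tumor samples greatly outnumber non-tumor samples.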

Identification of the Qualitative Gene Pair Signature for PC
We first used the training cohort from a merger of five microarray datasets (GSE101462, GSE71989, GSE91035, E-MEXP-1121, and E-MTAB-1791) to identify common gene pairs with stable REOs in 74 normal pancreatic tissues (18,476,925 gene pairs) and 72 pancreatitis samples (17,395,594 gene pairs). Then, we identified 14,633,175 gene pairs with identical REO patterns in at least 85% of normal pancreas and pancreatitis samples as stable gene pairs of non-tumor samples. We also identified 18,300,104 gene pairs with stable REO patterns in at least 85% of the 269 PC samples in the training cohort. Next, we used these data to identify 269 RGPs between the non-tumor and tumor tissues (Supplementary Table S1). After a tuning procedure using 6 normal pancreas and 6 PC tissues, we selected 20 gene pairs with identical REO patterns in the testing dataset (GSE41368). Then, we sorted the 20 RGPs into descending order based on the rank difference ($\overline{R_{ij}}$) between PC and non-tumor tissues (normal pancreas and pancreatitis) in the merged data from the training set and utilized the top-ranked k gene pairs for sample classification using the majority voting rule. The results indicated that for k ranging from 1 to 20, the largest geometric mean of sensitivity and specificity (93.79%) occurred for k = 12 (Figure 2). We thus identified these 12 gene pairs (Table 2) as the transcriptional signature for discriminating tumor and non-tumor samples.

Validation of the Diagnostic Signature in External Validation Datasets
Next, we assessed the performance of the 12-gene pair signature to discriminate PC (including cancer-adjacent tissue) from non-tumor samples. For the 1,007 PC tissues and 257 non-tumor samples from the 9 external validation databases, the geometric mean of sensitivity and specificity was 96.7% and the area under receiver operating characteristic curve was 0.978 (95% confidence interval, 0.947–0.994; Figure 3).
FIGURE 2 | Accuracy of top-ranked gene pairs among the 20 reversal gene pairs (RGPs) in the training data. Twenty RGPs were sorted in a descending order according to the extent of reversal between tumor and non-tumor tissues in the training datasets. Twelve gene pairs provided the highest classification accuracy according to the "majority voting rule" and were used for the qualitative diagnostic signature.
normal pancreatic tissue. The other validation sets consisted of samples from surgically resected tissues. For data measured by microarray, our signature correctly identified 96.07% of 842 tumor tissues as tumor samples. The details of diagnostic performance for the transcriptional signature are shown in Table 3 and Supplementary Table S2.

DISCUSSION
PC is expected to become the leading cause of cancer-specific mortality in Western countries by 2030, and yet early diagnosis remains difficult (Rahib et al., 2014). It is therefore necessary to identify a molecular diagnostic signature to reduce the uncertainty of pathological diagnosis due to sampling error. In this study, we developed and validated a qualitative REO-based signature consisting of 12 different gene pairs with 17 genes for the early and accurate molecular diagnosis of PC. The proposed transcriptional signature can discriminate malignant tissues and most PC-adjacent tissues from benign tissues. Because the signature is effective even when the sampling site is inaccurate due to imperfect biopsy procedures, it may help to avoid the need for a second procedure.
In contrast to previous studies that used a transcriptomic diagnostic signature based on complicated procedures of data normalization and parameter fitting (Klett et al., 2018; Long et al., 2019), we analyzed the relative expression of gene pairs, instead of the expression of single genes, to differentiate PC from benign lesions. Our transcriptional REO-based signature employed relative ranking of gene expression across multiple gene pairs and can be used without confounding from batch effects or the use of different sequencing platforms (Eddy et al., 2010). Previous studies successfully used this approach in the molecular diagnosis of colorectal cancer, gastric cancer, and hepatocellular carcinoma (Ao et al., 2018; Guan et al., 2019; Yan et al., 2019). Recent developments in high-throughput sequencing technologies have been accompanied by dramatic decreases in price. Given the limited amount of tissue sampled from biopsies, it is more efficient to measure the expression of a set of genes as markers for aiding pathological diagnosis, molecular subtype classification (Danilova et al., 2019), and chemoresistance of PC as part of the approach of "whole genome sequencing for all" (Xuan et al., 2013; Shao et al., 2018).
Several genes in the transcriptional signature proposed here have established roles in the carcinogenesis of PC. For instance, laminin γ2-chain (LAMC2) is a well-known PC-related gene whose level is elevated in the circulation of PC patients (Katayama et al., 2005). This gene upregulates mesenchymal markers in the microenvironment by activating the Akt/NHE1 signaling pathway and thus mediates the invasion and metastasis of cancer cells (Wang et al., 2020). Cystatin 6 (CST6) is overexpressed in pancreatic ductal adenocarcinoma (PDAC) cells and can stimulate PDAC cell growth by reducing the activity of intracellular cathepsin B (Hosokawa et al., 2008). There is evidence that S100 calcium-binding protein P (S100P) can be used as a biomarker in duodenal fluid and fine needle aspiration biopsies for detection of PDAC, and this protein has a diagnostic sensitivity of 84.8% in biopsy samples (Matsunaga et al., 2017; Aksoy-Altinboga et al., 2018). S100P secretes matrix metalloproteinase 9 (MMP9) and regulates the invasion of PC cells into the lymphatic endothelial monolayer, thereby promoting tumor cell invasion and metastasis (Dakhel et al., 2014; Nakayama et al., 2019). Cadherin-3 (CDH3) regulates cell migration and tumor growth by interacting with cadherin-1 in PC and is upregulated during early-stage PC (Siret et al., 2018).
There were some limitations in our study. First, due to the limited number of biopsy tissues, we only used samples collected from surgery (not from biopsy) in the training set, and this may have led to selection bias. Second, our study design was retrospective, with genomic data derived from publicly available databases. Prospective clinical studies are needed to validate our findings.
In summary, we constructed and validated an REO-based signature consisting of 12 gene pairs and 17 genes that can aid in the early diagnosis of PC.

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the article/Supplementary Material.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
Y-JZ and C-JY obtained data from the TCGA, GEO, and GTEx databases, designed the study, and wrote the manuscript. Y-JZ, X-FL, J-LM, X-YW, Q-WW, Y-JG, and H-MC analyzed and interpreted the data. X-FL was responsible for the statistical analyses. X-BL and F-RY contributed to conception, design, and funding. All authors were involved in revising and proofreading the manuscript and approved the manuscript.