A Pre-operative Nomogram for Prediction of Lymph Node Metastasis in Bladder Urothelial Carcinoma

The status of lymph node (LN) metastases plays a decisive role in the selection of surgical procedures and post-operative treatment. Several histopathologic features, known as predictors of LN metastasis, are commonly available post-operatively. Medical imaging improved pre-operative diagnosis, but the results are not fully satisfactory due to substantial false positives. Thus, a reliable and robust method for pre-operative assessment of LN status is urgently required. We developed a prediction model in a training set from the TCGA-BLCA cohort including 196 bladder urothelial carcinoma samples with confirmed LN metastasis status. Least absolute shrinkage and selection operator (LASSO) regression was harnessed for dimension reduction, feature selection, and LNM signature building. Multivariable logistic regression was used to develop the prognostic model, incorporating the LNM signature, and a genomic mutation of MLL2, and was presented with a LNM nomogram. The performance of the nomogram was assessed with respect to its calibration, discrimination, and clinical usefulness. Internal validation was evaluated by the testing set from the TCGA cohort and independent validation was assessed by two independent cohorts. The LNM signature, which consisted of 48 selected features, was significantly associated with LN status (p < 0.005 for both the training and testing sets of the TCGA cohort). Predictors contained in the individualized prediction nomogram included the LNM signature and MLL2 mutation status. The model demonstrated good discrimination, with an area under the curve (AUC) of 98.7% (85.3% for testing set) and good calibration with p = 0.973 (0.485 for testing set) in the Hosmer-Lemeshow goodness of fit test. Decision curve analysis demonstrated that the LNM nomogram was clinically useful. This study presents a pre-operative nomogram incorporating a LNM signature and a genomic mutation, which can be conveniently utilized to facilitate pre-operative individualized prediction of LN metastasis in patients with bladder urothelial carcinoma.


INTRODUCTION
Bladder cancer is the 9th most prevalent cause of cancer worldwide and the 2nd most common genitourinary malignancy, with transitional cell carcinomas comprising about 90% of primary bladder tumors (1). In 2019, ∼80,470 new cases and 17,670 deaths of bladder cancer were estimated to occur in the United States (2). Previous research has revealed that Lymph node (LN) involvement-which is frequently found in bladder cancer-possesses prognostic implications, and both the pathological stage of primary bladder tumor and the presence of LN metastasis are considered the most important determinants of survival in bladder cancer patients undergoing radical cystectomy (3). Early and accurate identification of LN metastasis holds significance in improving patient triage and management (4) and may suggest potential alteration of the lymphadenectomy template in patients who undergo surgery. In cases where local staging is equivocal, it also expedites care for patients, particularly when nodal disease can be definitively identified (5)(6)(7). Pre-operative knowledge of LN metastasis provides valuable information about the necessity of adjuvant therapy and the adequacy of surgical resection, thereby aiding pretreatment decision-making, but unfortunately, most histopathologic findings identified as predictors of LN metastasis cannot be observed pre-operatively. That is to say, the status of LN metastases plays a decisive role in the selection of surgical procedures and post-operative treatment. Reliable and robust methods for pre-operative assessment of LN status (8) have been continuously explored. Fine-needle aspiration lymphangiography has been evaluated in several investigations but failed to show reliability due to a high false negative rate (9). Only a few studies have appraised positron emission tomography (PET) and its ability to detect LN metastases in bladder cancer, but the conclusions have been largely disappointing (10). Additionally, computed tomography (CT) revealed a high false negative rate of 21% (11). Next-generation sequencing technology has brought massively high throughput sequencing data to bear on research questions with low cost, which enables us to decipher the difference of bladder cancer in terms of status of LN metastasis in a genomic level. Currently, the analysis strategy for multiple biomarkers has evolved from individual analyses to combined analysis of a panel of biomarkers that constitute a signature, which appears to be a most promising approach and powerful enough to innovate clinical management (12,13). Therefore, this study aims to develop and validate a pre-operative nomogram that incorporates a LNmetastasis signature and genomic mutations for individualized pre-operative prediction of LN metastasis in patients with bladder cancer.

Patients and Samples
Molecular data were obtained from The Cancer Genome Atlas Project (TCGA) patients diagnosed with bladder urothelial carcinoma. Transcriptome HTSeq-counts data of TCGA-BLCA project was obtained from the Genomic Data Commons (https:// portal.gdc.cancr.gov/) using the R package "TCGAbiolinks" (14). Somatic mutation profiling and detailed clinicopathological information were downloaded from cBioPortal (http://www. cbioportal.org/datasets). For the purpose of the present study, 196 samples were selected as the TCGA cohort including 49 samples with LN metastasis only (LN+) and 147 without any metastasis (LN-). Two independent cohorts were gathered for validation including one obtained from Gene Expression Omnibus (https://www/ncbi.nlm.nih.gov/geo/) (GEO cohort) by using R package "GEOquery" with a query of GSE106534 and another that is publicly available through the Memorial Sloan-Kettering Cancer Center (MSKCC cohort) cBioPortal for Cancer Genomics. The GEO cohort contains five LN+ and five LNbladder tissues, of which RNA was extracted and hybridized on an Illumina Hiseq 2500 (13). The MSKCC cohort contains 58 tumor samples with Agilent microarray mRNA expression profiling (15). Survival information of the MSKCC cohort was obtained from the cBioPortal.

Data Preprocessing for Transcriptome HTSeq-Counts
Ensembl ID for genes (protein-coding mRNAs) was annotated in GENCODE27 to generate Gene Symbol name. Gene type of protein-coding was selected for mRNAs. We calculated the number of fragments per kilobase of non-overlapped exons per million fragments mapped (FPKM) first and subsequently transferred FPKM into transcripts per kilobase million (TPM) values, which are more similar to those resulting from microarrays and more comparable between samples (16). To reduce noise, only mRNAs with TPM value equal to or above one in at least 10% of the samples were kept for downstream analysis.

Differential Expression and Functional Enrichment Analysis
Differential expression analysis was performed by R package "DESeq2" with the standard comparison mode between the two experimental conditions (17). P-values were adjusted for multiple testing with an embedded Benjamini-Hochberg procedure in the package. Gene set enrichment analysis (GSEA) was performed by R package "clusterProfiler" (18,19) to impute functional pathway enrichment for the LN+ and LN-groups by mRNA expression profile.

Genetic Analysis on Somatic Mutation
We used MutSigCV_v1.41 (20) (www.broadinstitute.org) to infer significant cancer mutated genes (q < 0.05) across the two classes currently identified with default parameters. Tumor mutation burden was computed by summing all kinds of nonsilent mutation. Mutation landscape oncoprint was drawn by R package "ComplexHeatmap" (21). Significant frequent non-silent  mutations were identified by independent test between the LN+ and LN-groups with a p < 0.05.

Feature Selection
The TCGA cohort of 196 samples was randomized into two sets based on 10-fold stratified sampling, where the training set included 9 folds of LN+ samples and LN− samples and the testing set included the rest, 1 fold with 5 LN+ samples and 15 LN− samples. Least absolute shrinkage and selection operator (LASSO) regression, which is often applied as a dimensionally reduction technique, was performed on the training set to select primary predicative features. Ten-fold cross validation was performed to tune the optimal value of lambda (λ) that gives the minimum mean cross-validated error. A score was calculated for each sample via a linear combination of the selected features, namely LNM signature, weighted by the corresponding coefficients. The potential association of the LNM signature with LN status was first assessed in the training set and then validated in the testing set by using the Mann-Whitney U-test.

Development of an Individualized Prediction Model
Multivariable logistic regression analysis on the training set began with the following candidate predictors: LNM signature and significant frequent mutations. Those with respective p < 0.05 were retained in the prognostic model. A LNM nomogram was built by R package "regplot" as a quantitative tool for clinicians for individualized prediction of LN metastasis probability.

Validation of the LNM Nomogram and LNM Signature
Internal validation was performed using the testing set of the TCGA cohort with 20 samples. The logistic regression formula formed in the training set was applied to all samples in the testing set, with total points for each sample calculated. Logistic regression in the testing cohort was then performed by using the total points as a factor. Finally, the receiver operating characteristic curve (ROC) with area under the curve (AUC) and calibration curve were derived based on the regression analysis by using R packages "pROC" and "rms". Independent validation for LNM-score was tested in the GEO and MSKCC cohorts. Since the mutation data were absent and several genes of the LNM signature failed to be mapped in independent cohorts, we harnessed unsupervised clustering to determine if LNM signature could help distinguish LN metastasis status in the GEO cohort and whether it was associated with overall survival (OS) or progression-free survival (PFS) in the MKSCC cohort. Supervised hierarchical clustering based on mapped LNM signature was performed by using R function hclust() via the Ward.D clustering method 1-Pearson's correlation distance, with k = 2 as the number of clusters. Expression profiling of mRNAs was transformed by log 2 (x+1) and median-centered before clustering.

Statistical Analysis
All statistical tests were executed by R/3.5.2, with a χ 2 or Fisher's exact test for categorical data when appropriate, a twosample student's t-Test or Mann-Whitney U-test for continuous data when appropriate, a log-rank test Kaplan-Meier curve (22) and Cox regression (23) for survival analysis performed by R package "survival." Survival of patients belonging to different defined groups was compared by the Kaplan-Meier Method, with p-value determined by the log-rank (Mantel-Cox) test.
Fisher's exact test of independence was used to statistically test the association between categorical clinical information and LN metastasis status. For all statistical analysis, a two-sided p < 0.05 was considered statistically significant. Decision curve analysis was conducted to determine the clinical usefulness of the LNM nomogram by quantifying the net benefits at different threshold probabilities by using R package "rmda" (24,25).  Figure 1A) and a tendency could be observed where LN+ tumors presented a higher recurrence rate (p = 0.083, HR = 1.58, 95% CI = [0.88-2.82], Figure 1B).

Overview of Differential Expression Results From LN+ and LN-Tumors
Supervised differential expression analysis using LN metastasis status as the group variable identified 180 differentially expressed genes (p < 0.05, false discovery rate (FDR) < 0.05, Figure 2A; Supplementary Table S1). GSEA manifested a universal down-regulation of immune-related pathways in LN+ tumors as compared to LN-tumors ( Figure 2B; Supplementary Table S2).

Somatic Mutation Landscape Between LN+ and LN-Tumors
After filtering out non-silent mutation, tumors with LN+ exhibited a significant lower burden of mutation load as compared to LN-tumors (p = 0.008) ( Figure 2C). MutSigCV identified 16 significantly mutated genes (SMGs, q < 0.05) for the present 193 samples with available mutation data (Supplementary Table S3), all of which were reported from previous research (26) (Figure 2D). We identified 716 genes with a mutation rate >5%, among which 26 genes were found to be differentially mutated between LN+ and LN-tumors by  independent test (Figure 2E; Supplementary Table S4). By intersecting with SMGs, mutation of MLL2 (also known as KMT2D) and PSIP1 were identified for constructing a predictive model.

Feature Selection and LNM Signature Building
Of 180 differentially expressed genes, 48 features were selected on the basis of the training set of the TCGA cohort (see Materials and methods for more details), including 22 up-regulated genes and 26 down-regulated genes, as they were features with non-zero coefficients in the LASSO logistic regression model (Figures 3A,B;

Supervised Clustering by Using LNM Signature in Two Independent Cohorts
Independent validation for LNM signature was performed in the GEO ( Figure 4A) and MSKCC cohorts ( Figure 4B) by supervised hierarchical clustering where samples in the GEO cohort could be distinguished according to LN metastasis status (p = 0.048), and a tendency could be observed that LNM signature was associated with OS (p = 0.075) ( Figure 4C) and PFS (p = 0.098) (Figure 4D) in the MSKCC cohort.

Development of an Individualized Prognostic Model
Logistic regression analysis identified the pre-operative features of LNM signature and MLL2 mutation as independent predictors ( Table 3). We also considered other pre-operative clinical variables when designing the prognostic model, and interestingly, only LNM signature and MLL2 mutation survived in the full model with p < 0.05 in the logistic regression (Supplementary Table S6). Thus, we further removed these variables in this study not only due to the insignificance of other variables, but also because we hope that patients could accept the acquisition of these measurements relatively easily. For this reason, we considered that in most cases, pathologic stage and detailed TNM classification should be detected by biopsy, an invasive procedure that may be much less acceptable than predesigned multi-gene assay that may only need a small amount of blood. Hence, a model incorporating these two features was developed and presented as a LNM nomogram ( Figure 5).  Predictions made by calibration curve of the nomogram for LN metastasis conformed well to observations in the training set, with a Hosmer-Lemeshow test suggesting no departure from perfect fit (p = 0.973).

Validation of the LNM Nomogram and Its Clinical Usefulness
Total points calculated by LNM nomogram for each sample in the testing set was determined to be a significant predictor when performing logistic regression (p = 0.032), and no departure from perfect fit was identified (p = 0.485) (Figure 6A, see Supplementary Figures S1A-T for total point of each testing sample). Internal validation obtained AUCs of 98.7 and 85.3% when deploying the LNM nomogram to the training set and the testing set ( Figure 6B). The decision curve analysis showed that the LNM nomogram offered a net benefit over the "treat-all" or "treat-none" strategies at a really small threshold probability of a patient or doctor, which indicated that the LNM nomogram was clinically useful. For example, if the personal threshold probability of a patient is 60% (i.e., the patient would opt for treatment if his probability of LNM was >60%), then using the LNM nomogram to predict LN metastases could provide an added net benefit of 0.7386 compared to the "treat-all" or "treat-none" strategies ( Figure 6C).

DISCUSSION
Bladder cancer ranks fourth in men and eighth in women among the most common malignancies in terms of frequency (27). However, little progress was made in the past decades toward prolonged survival of high-grade bladder cancer, leaving it still a lethal disease (28). Since a considerable amount of research has recognized LN involvement as the strongest independent prognostic variable for patient outcomes, proper identification of LN metastasis is of paramount importance (29). To date, several histopathologic findings that are known to be predictors of LN metastasis are commonly available post-operatively. Medical imaging has made great strides in pre-operative diagnosis, but the results are not fully satisfactory due to substantial false positives. Therefore, we sought here to develop and validate a diagnostic, LNM signature-based nomogram for pre-operative individualized prediction of LN metastasis in patients with bladder cancer. The nomogram offers an easy-to-use tool for pre-operative individualized prediction of LN metastasis and incorporates only two pre-operative items, LNM signature which   stratifies patients by their risk of LN metastasis, and mutation status of MLL2. For the construction of the LNM signature, 180 candidate genes were reduced to 48 potential predictors by examining the predictor-outcome association by shrinking the regression coefficients with the LASSO algorithm. Compared to predictor selection based on strength of univariable association between predictor and outcome, this method further enables the combination of all selected features and creates a single signature, i.e., marker panels. Marker panels have been embraced in recent studies for multi-marker analysis (30,31), such as a novel 6-microRNA-based model that was proposed to improve prognosis prediction of breast cancer (32), and a 6-DNA methylation signature that was recognized as a novel prognostic biomarker in ovarian serous cystadenocarcinoma (32). Moreover, the Oncotype DX is a 21-gene assay that represents the first clinically validated multi-gene assay which can quantify the likelihood of breast cancer recurrence (33,34). Another 70-gene assay, MammaPrint, was developed by the Netherlands Cancer Institute and was used to predict the risk of developing metastasis within 5 years for breast cancer (35). Similarly, the LNM signature that combined multiple genes demonstrated adequate discrimination in the training set of TCGA cohort (AUC = 98.7%) and was satisfactory in the testing set (AUC = 85.3%). LNM signature was also presented as an independent predictor for overall survival in the TCGA cohort. In addition, supervised clustering using LNM signature enabled us to distinguish LN metastasis status in an independent GEO cohort (p = 0.048) and associated with patients' outcomes to some extent (p < 0.1). Thus, the noninvasive LNM signature allows for more convenient prediction of LN metastasis. Note that mutation of MLL2, which was differentially mutated between LN+ and LN-tumors (7:46, p = 0.0166), was also a significant variable in the predictive model (p = 0.012). MLL2, also known as KMT2D (Lysine Methyltransferase 2D), of which mutation is a driver in numerous different cancer types resulting in transcription stress and genome instability (36), and a recent study demonstrated that MLL2 could sustain prostate carcinogenesis and metastasis (37). Because the LNM signature and the mutation of MLL2 are available pre-operatively, our nomogram which generates individual probability by integrating the two factors provides clinicians and patients with a preoperative individualized prediction of the LN metastasis risk, which is in line with the current trend toward personalized medicine (38). Finally, and most importantly, the nomogram was designed to interpret individualized patient need for additional treatment or care. While the clinical consequences of a particular level of discrimination or degree of miscalibration could hardly be captured by risk prediction, discrimination, or calibration, a decision curve analysis assessing whether nomogram-assisted decision making improves patient outcomes was performed to justify the clinical usefulness of the LNM nomogram. This novel method offers insights into clinical consequences on the basis of threshold probability by deriving the net benefit (defined as the proportion of true positives minus the proportion of false positives weighted by relative harm of false-positive and false-negative results) (38, 39). Results showed that decisions based on the LNM nomogram yielded more favorable clinical consequences than the treat-allpatient scheme and the treat-none scheme, even given an extremely small threshold probability. However, our study harbored limitations, including the fact that radiomics characteristics and other pre-operative clinical features (e.g., hematuresis or not) were not considered under the existing framework. There has been tremendous growth in radiomics research in the past few years for assisting clinical diagnosis and improving predictive accuracy. An emerging field that is closely related to radiomics is radiogenomics, which integrates imaging and genomics data with the goal of improving patient stratification for precision medicine.
In short, this study presents a LNM nomogram incorporating a LNM signature and a genomic mutation, which can be conveniently used to facilitate pre-operative individualized prediction of LN metastasis in patients with bladder cancer.

DATA AVAILABILITY
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

AUTHOR CONTRIBUTIONS
XL and YW proposed the conception and design of this research. XL and LJ developed methodology. XL, BZ, and FY collected data and performed preprocessing. XL, WH, YZ, JW, XR, ZX, and XM analyzed and interpreted the data including statistical analysis, biostatistics, bioinformatics, and computational analysis. XL, JG, and YW were major contributors in writing the manuscript. All authors read and approved the final manuscript. funders had no role in the study design, data collection, and analysis, decision to publish, or preparation of the manuscript.

ACKNOWLEDGMENTS
The main results published here are based upon data generated by TCGA, managed by the NCI and NHGRI. We are grateful to TCGA for this source of data. Information about TCGA can be found at http://cancergenome.nih.gov. We also sincerely thank Sergio Chavez of the Department of Bioinformatics and Computational Biology (The University of Texas, MD Anderson Cancer Center) for editing the manuscript.