Deep Neural Networks Outperform the CAPRA Score in Predicting Biochemical Recurrence After Prostatectomy

Background: The use of predictive models for biochemical recurrence (BCR) is gaining attention in prostate cancer (PCa). BCR occurs in approximately 20–40% of patients five years after radical prostatectomy (RP), and the ability to predict BCR may help clinicians make better treatment decisions. We aim to investigate the accuracy of the CAPRA score compared with other models in predicting the 3-year BCR of PCa patients. Material and Methods: A total of 5043 men who underwent RP were analyzed retrospectively. The CAPRA score, Cox regression analysis, logistic regression, K-nearest neighbor (KNN), random forest (RF) and a densely connected feed-forward neural network (DNN) classifier were compared in terms of 3-year BCR predictive value. The area under the receiver operating characteristic curve (AUC) was the main measure used to assess the performance of the predictive models in predicting the 3-year BCR of PCa patients. Pre-operative data such as PSA level, Gleason grade, and T stage were included in the multivariate analysis. To measure potential improvements to the model performance due to additional data, each model was trained once more with an additional set of post-operative surgical data from definitive pathology. Results: Using the CAPRA score variables, the DNN predictive model showed the highest AUC value of 0.7, compared with 0.63, 0.63, 0.55, 0.64, and 0.64 for the CAPRA score, logistic regression, KNN, RF, and Cox regression, respectively. After including the post-operative variables, the AUC values of logistic regression, KNN, RF, Cox regression, and DNN improved to 0.77, 0.58, 0.74, 0.75, and 0.84, respectively. Conclusions: Our results showed that the DNN has the potential to predict the 3-year BCR and outperformed the CAPRA score and other predictive models.


INTRODUCTION
Radical prostatectomy (RP) with a concomitant pelvic lymph node dissection is one of the standard treatments for patients with intermediate-risk prostate cancer (PCa) according to the D'Amico classification (1). However, in this population, extremely heterogeneous definitions and outcomes have been reported, and more precise stratification is desirable to guide decision making (2,3). In this context, the CAncer of the Prostate Risk Assessment (CAPRA) score was developed in 2005 with a patient population from the Cancer of the Prostate Strategic Urologic Research Endeavor (CaPSURE) cohort, which included 1,439 men who had undergone RP and were followed in a longitudinal, community-based disease registry of patients with prostate cancer (4). The CAPRA score is a pre-treatment scoring system that substratifies patients into 8 risk categories according to five variables drawn from clinical, biochemical and histopathological data. It was built to further assess the risk of biochemical and metastatic recurrence among patients treated with RP (5). The same team has similarly developed a post-operative score, the CAPRA-S score, with improved accuracy via incorporation of pathologic data from the RP specimen (6). Notably, the CAPRA score outperforms counterparts such as the D'Amico classification or the National Comprehensive Cancer Network (NCCN) score at predicting several endpoints (7). Risk nomograms offer more precise risk stratification and prediction, but the calculations can be cumbersome (7). Consequently, an automatic tool based on machine learning (ML) algorithms is needed to predict outcomes following RP and to guide adjuvant or salvage treatment.
ML algorithms such as logistic regression and Cox proportional hazards regression have been employed in healthcare statistics for several decades (8,9). Specifically, logistic regression uses a logit transform to provide event probabilities from input variables, while Cox regression models the risk of an event occurring based on a linear combination of the covariates. ML models such as random forest and nearest neighbors cannot be applied directly to predict survival outcomes because they do not account for censored data (10). To address this issue, imputation techniques can be considered, for example imputing survival time with a random forest model to predict survival (11). Recently, deep neural network algorithms have shown promising results in medical applications (12), including improved diagnostic accuracy (13,14). For example, the Memorial Sloan Kettering Cancer Center (MSKCC) in the United States offers a tool, probably less frequently used in Europe than the CAPRA score, to predict the probability of 2-, 5-, 7-, 10-, and 15-year BCR-free survival after prostate cancer surgery. This tool relies on predictive models such as linear regression, logistic regression, and survival progress models to predict cancer recurrence (15,16).
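For reference, the logistic and Cox models mentioned in this paragraph can be written in their standard textbook form. The symbols used here ($x$ for the covariate vector, $\beta_0$ and $\beta$ for the coefficients, $h_0(t)$ for the baseline hazard) are notation introduced for illustration and are not taken from the study.

```latex
% Logistic regression: the logit of the event probability is linear in the covariates
\operatorname{logit} p(x) = \ln\frac{p(x)}{1 - p(x)} = \beta_0 + \beta^{\top} x
\quad\Longleftrightarrow\quad
p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta^{\top} x)}}

% Cox proportional hazards: a baseline hazard scaled by a linear combination of covariates
h(t \mid x) = h_0(t)\, \exp\!\left(\beta^{\top} x\right)
```

In the Cox model the ratio of hazards between two patients, $\exp(\beta^{\top}(x_1 - x_2))$, does not depend on time, which is the proportional hazards assumption.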
In this study, using data from a multicentric national database, we aim to compare the accuracy of the CAPRA score and other models in predicting biochemical recurrence (BCR)-free survival for patients treated with RP. We also aim to evaluate ML algorithms that use the pathological data considered in the CAPRA-S score to improve the accuracy of the predictive models.

Patients
A total of 5,043 patients who underwent RP between 2000 and 2015 for clinically localized prostate carcinoma in six French university hospitals were analyzed retrospectively. All patients underwent a multicore transrectal ultrasound-guided prostate biopsy after digital rectal examination. The Gleason score and percentage of involved biopsies were assigned by dedicated pathologists. Pretreatment PSA was recorded in all men. The clinical stage was assigned by the attending urologist according to the American Joint Committee on Cancer TNM guidelines in effect at the time of inclusion. All patients were preoperatively staged for metastases with a contrast-enhanced abdominal and pelvic computed tomography (CT) and bone scan. The patients received no neoadjuvant/adjuvant hormone therapy or radiation therapy. The CAPRA score was calculated from the available pretreatment variables, and the patients were grouped according to the resulting CAPRA score for analysis (5). Biochemical recurrence after RP was defined according to the American Urological Association (AUA) guidelines as two consecutive PSA values ≥ 0.2 ng/mL at any time post-operatively or any additional treatment more than 6 months after RP (17). The analysis was restricted to patients with a follow-up duration of longer than 12 months.
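The BCR definition above can be expressed as a small rule. The following is a minimal sketch, assuming PSA values are supplied in chronological order and in ng/mL; the function name `has_bcr` and its parameters are hypothetical and are not part of the study's code.

```python
from typing import Optional, Sequence

PSA_THRESHOLD = 0.2          # ng/mL, threshold stated in the BCR definition
TREATMENT_WINDOW_MONTHS = 6  # additional treatment must occur later than this


def has_bcr(psa_values: Sequence[float],
            months_to_additional_treatment: Optional[float] = None) -> bool:
    """Return True if a follow-up record meets the BCR definition.

    psa_values: post-operative PSA measurements in chronological order.
    months_to_additional_treatment: months from RP to any additional
        treatment, or None if no additional treatment was given.
    """
    # Criterion 1: two consecutive PSA values at or above the threshold.
    for previous, current in zip(psa_values, psa_values[1:]):
        if previous >= PSA_THRESHOLD and current >= PSA_THRESHOLD:
            return True
    # Criterion 2: any additional treatment more than 6 months after RP.
    if months_to_additional_treatment is not None:
        return months_to_additional_treatment > TREATMENT_WINDOW_MONTHS
    return False
```

A single isolated value above the threshold is deliberately not counted, since the definition requires two consecutive measurements.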

Statistical Analysis
The 3-year BCR probability associated with the CAPRA score assigned to each patient in our cohort was compared with the corresponding 3-year BCR probability from the original CaPSURE cohort, using Kaplan-Meier survival analysis. The 3-year BCR probability corresponding to the CAPRA score from the original CaPSURE cohort was assigned to each patient and compared with the actual BCR outcome at 3 years. Non-recurring patients who were lost to follow-up before 3 years were handled by inferring the survival probability through Kaplan-Meier actuarial estimation according to the split-and-weighting method described in Zupan et al. (10). Then, a multivariate predictive model using Cox regression under the assumption of proportional hazards was built using the variables required for CAPRA score computation (pre-operative PSA, Gleason score and T stage).
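The handling of censored patients can be illustrated as follows. This is a sketch of one common reading of the split-and-weighting idea, not the study's implementation: a patient censored at time t before the horizon T is entered twice, as "no recurrence" with weight S(T)/S(t) and as "recurrence" with the complementary weight, where S is the Kaplan-Meier estimate. The function names, the 36-month horizon parameter and the exact weighting rule are assumptions.

```python
from typing import List, Sequence, Tuple

Curve = List[Tuple[float, float]]  # (event time, survival probability) steps


def kaplan_meier(times: Sequence[float], events: Sequence[bool]) -> Curve:
    """Kaplan-Meier estimate; events[i] is True if recurrence was observed."""
    curve: Curve = []
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        failed = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1.0 - failed / at_risk
        curve.append((t, survival))
    return curve


def survival_at(curve: Curve, t: float) -> float:
    """Evaluate the step function S(t); S is 1.0 before the first event."""
    s = 1.0
    for step_time, step_survival in curve:
        if step_time > t:
            break
        s = step_survival
    return s


def split_and_weight(follow_up: float, recurred: bool, curve: Curve,
                     horizon: float = 36.0) -> List[Tuple[int, float]]:
    """Return (label, weight) training instances for one patient.

    Label 1 means recurrence by the horizon, label 0 means recurrence-free.
    """
    if recurred and follow_up <= horizon:
        return [(1, 1.0)]  # recurrence observed within the horizon
    if follow_up >= horizon:
        return [(0, 1.0)]  # known to be recurrence-free at the horizon
    # Censored before the horizon: split into two weighted instances.
    p_free = survival_at(curve, horizon) / survival_at(curve, follow_up)
    return [(0, p_free), (1, 1.0 - p_free)]
```

The two weights for a censored patient always sum to one, so each patient contributes the same total weight to training regardless of follow-up length.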

Machine Learning Algorithms and Models Definitions
The results were compared with predictions of BCR by a set of ML models. We performed a binary classification using KNN, RF, logistic regression and DNN. The DNN had a sequential architecture comprising several fully connected layers with a varying number of nodes. An input layer takes numeric and one-hot encoded categorical variables and propagates information through the layers. The last layer comprises a single node that outputs the probability of BCR at three years. All the details of the considered ML models are reported in Supplementary Material. We used a single split: the sample was divided randomly into a training set (80%) and a test set (20%), the classifier models were trained on the training set, and they were evaluated on the test set. The outcome classes in the training set were weighted to compensate for the initial imbalance in survival status. To obtain the best trade-off in performance metrics on the test subset, we tuned the hyperparameters on the training subset using a step-by-step grid search. The area under the curve (AUC) of the receiver operating characteristic (ROC) was measured to assess the performance of the predictive models in predicting 3-year BCR on the test set.
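The evaluation protocol described above (random 80/20 split, class weighting, AUC on the test set) can be sketched with the standard library alone. The study does not state its weighting formula, so the inverse-frequency rule n / (n_classes × n_c) used here is an assumption, as are all function names; the grid search is not shown.

```python
import random
from typing import Dict, List, Sequence, Tuple


def train_test_split(n_samples: int, test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly partition sample indices into training and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def balanced_class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Weight each class inversely to its frequency: n / (n_classes * n_c)."""
    counts: Dict[int, int] = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return {c: len(labels) / (len(counts) * k) for c, k in counts.items()}


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The pairwise AUC is quadratic in the number of cases, which is acceptable for a test set of this size; a rank-based formulation would be preferable for much larger samples.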
To measure potential improvements to the model performance due to additional data, each model was trained once more with an additional set of post-operative surgical data from definitive pathology. Note that the available post-operative variables were not sufficient to compute the CAPRA-S score. For this reason, we combined the pathological tumor stage (pT), pathological lymph node dissection status (pN), margin status, prostate volume and surgical Gleason score and used them as input to the predictive models. The performance on the test set was compared with the previous results.
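To make the network description in this section concrete, the sketch below encodes a few numeric and categorical variables and passes them through a small fully connected network with a single output node. It is illustrative only: the category list, the layer sizes, the ReLU hidden activation, the sigmoid output and the random untrained weights are all assumptions, since the actual architecture is reported in the Supplementary Material.

```python
import math
import random
from typing import List, Sequence, Tuple

PT_CATEGORIES = ["pT2", "pT3a", "pT3b"]  # hypothetical category list
Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def one_hot(value: str, categories: Sequence[str]) -> List[float]:
    """Encode a categorical value as a 0/1 vector over the known categories."""
    return [1.0 if value == c else 0.0 for c in categories]


def encode_patient(psa: float, volume: float, pt_stage: str,
                   positive_margin: bool) -> List[float]:
    """Concatenate numeric inputs with the one-hot encoded pT stage."""
    return [psa, volume, float(positive_margin)] + one_hot(pt_stage, PT_CATEGORIES)


def dense(inputs: List[float], weights: List[List[float]],
          biases: List[float]) -> List[float]:
    """One fully connected layer; weights[j][i] links input i to node j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(x: List[float], layers: List[Layer]) -> float:
    """ReLU on hidden layers, sigmoid on the single output node."""
    for weights, biases in layers[:-1]:
        x = [max(0.0, z) for z in dense(x, weights, biases)]
    weights, biases = layers[-1]
    (z,) = dense(x, weights, biases)
    return 1.0 / (1.0 + math.exp(-z))


def random_layers(sizes: Sequence[int], seed: int = 0) -> List[Layer]:
    """Small random weights for illustration; a real model would be trained."""
    rng = random.Random(seed)
    return [([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes, sizes[1:])]


features = encode_patient(psa=7.5, volume=40.0, pt_stage="pT3a", positive_margin=True)
probability = forward(features, random_layers([len(features), 8, 4, 1]))
```

Adding the post-operative variables only lengthens the input vector; the rest of the network is unchanged apart from the size of the first weight matrix.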

Patient Characteristics
Among 5,043 patients, 803 were excluded, either because of missing clinical (n=83), biochemical (n=9), pathological (n=338) or follow-up (n=98) data or because they underwent subsequent adjuvant therapy (n=275). Thus, the complete records of 4,246 patients were available for analysis, as reported in Table 1. The characteristics of our cohort were compared with those of the CaPSURE cohort, which was initially used to build the CAPRA score. Results including all variables used in our data set and in the CaPSURE cohort are presented in Table 2. The distribution of CAPRA scores in our cohort and in the CaPSURE cohort is summarized in Table 3. The median CAPRA score of our cohort was 3, compared with 2 for the CaPSURE cohort. The median follow-up duration was 49 months, and the minimum follow-up duration was 12 months. Overall, biochemical recurrence occurred in 817 (19%) of the patients in our cohort at a median of 25 months after RP, compared with 15% at a median of 22 months in the CaPSURE cohort.

CAPRA Score and Multivariate Analysis
Patients with CAPRA scores of 2 and 3 accounted for 64% of the population in our cohort (Table 2). Patient survival according to the CAPRA score is shown in Figure 1. For predicting biochemical recurrence at 3 years, the CAPRA score had a c-index of 0.63. Similarly, Cox regression analysis using the same variables (age, Gleason score, involved biopsy percentage, clinical tumor stage, and PSA) predicted recurrence with a c-index of 0.64. Figure 2 illustrates the AUC-ROC for the predictive models when the input features are restricted to CAPRA score variables. With these pre-operative variables, the DNN model yielded the highest AUC value of 0.7, compared with 0.63, 0.63, 0.55, 0.64, and 0.64 for the CAPRA score, logistic regression, KNN, RF, and Cox regression, respectively.
Again, using the combined pre- and post-operative variables (pT, pN, margin status, prostate volume and surgical Gleason score), the DNN model showed the highest AUC value of 0.84, compared with 0.77, 0.58, 0.74 and 0.75 for logistic regression, KNN, RF, and Cox regression, respectively (Figure 3).
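The c-index reported in this section can be illustrated with Harrell's pairwise definition. The following is a minimal sketch, not the study's code; the function name and argument layout are assumptions. A pair of patients is comparable when the one with the shorter time had an observed recurrence, and it is concordant when that patient was also assigned the higher risk score.

```python
from typing import Sequence


def concordance_index(times: Sequence[float], events: Sequence[bool],
                      risk_scores: Sequence[float]) -> float:
    """Harrell's c-index for right-censored data (tied scores count as 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable
```

For a binary outcome assessed at a fixed time point, without censoring, this quantity coincides with the AUC of the ROC curve.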

DISCUSSION
In this retrospective multi-institutional study, we investigated and compared the potential of CAPRA score and predictive models in predicting the BCR risk after RP using routine variables. We note that the CAPRA score is a commonly used prediction model for the occurrence of biochemical and clinical recurrences developed from the CaPSURE registry (5) with many studies providing external validation with other cohorts (7,19).
In this study, we found that the CAPRA score showed a c-index of 0.63 for predicting 3-year BCR rates, with prognostic variables that clearly differed from those of the original CaPSURE cohort. Overall, the median CAPRA score of our patient cohort was higher than that of the CaPSURE cohort, suggesting a worse prognosis in our series. However, survival in our cohort was better. Other factors may further limit the performance of the CAPRA score: despite substratification of our cohort according to the CAPRA score, most patients (64%) remained in CAPRA score groups 2 and 3, thus reducing the discriminatory power of the score. The heterogeneous nature and prognosis of this intermediate-risk population (2) are not accurately captured by the D'Amico classification and the CAPRA score, thus reducing the c-index. Interestingly, while the original study reported a c-index of 0.66 for this score, almost all validation studies published thereafter have reported much higher c-indexes, up to 0.81, raising concerns of bias (20).
With the same restricted set of 5 input variables, predictive models were able to provide more accurate predictions on a test set after training and hyperparameter tuning. Specifically, the DNN model achieved the best performance metrics compared with logistic regression, KNN, RF, and Cox regression. Our findings are consistent with previously published data. For example, ML models showed higher c-indexes, ranging from 0.92 to 0.94, than conventional statistical methods in predicting biochemical recurrence after prostatectomy (21). Unfortunately, that study considered a limited dataset without imputing the censored cases. Other sophisticated models based on active learning have been used to improve Cox regression and to predict prostate cancer survival among patients in the Surveillance, Epidemiology, and End Results (SEER) database, with c-indexes over 0.8 (22,23).
The use of predictive models for clinical outcomes (e.g., survival, grade, treatment) has become popular (24–26). However, to ensure a common understanding, data scientists and clinical researchers need to define a common set of outcome metrics. Defining 'accuracy' as the ratio of correct predictions to the total number of predictions is seldom appropriate when comparing predictive models, especially for survival analysis (27,28). So far, the AUC, the c-index, sensitivity and specificity provide better performance metrics.
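A small worked example shows why plain accuracy can mislead on imbalanced outcomes. The labels below are a hypothetical 100-patient sample with a 19% recurrence rate, chosen to mirror the prevalence reported in our cohort; the "classifier" simply predicts no recurrence for everyone. Function names are illustrative.

```python
from typing import Dict, Sequence


def summarize(labels: Sequence[int], predictions: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity for binary labels (1 = recurrence)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


labels = [1] * 19 + [0] * 81   # hypothetical sample, 19% recurrence
always_negative = [0] * 100    # trivial model: never predicts recurrence
metrics = summarize(labels, always_negative)
# accuracy 0.81, sensitivity 0.0, specificity 1.0
```

The trivial model reaches 81% accuracy while detecting none of the recurrences, which is why sensitivity, specificity and threshold-free measures such as the AUC are more informative here.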
Whether deep learning performs better than conventional ML and statistical models in survival analysis remains unclear. The binary classification of tabular data is not the strength of neural network models (23). Recent breakthroughs based on deep learning (e.g., convolutional neural networks) and neural network algorithms rely primarily on deep analysis of medical images for computer-aided diagnosis (29,30). Furthermore, the development of rigorous methods, such as neural networks that handle censored data together with follow-up imaging, may provide much better survival analyses in the future. Thus, the accuracy of our model is modest and could be enhanced by using a more contemporary approach such as MRI-guided biopsies, with a central pathology review and a validation cohort.
The main asset of such models lies in their ability to be fed with prospectively acquired data in order to gradually improve predictions. Moreover, a model could be shaped "locally" (learning from specific local databases) to take local specificities into account and thus be better applicable to certain patient populations. Nevertheless, there is still a need for prospective validation of these models before their translation from bench to bedside. One downside may be the "black box" nature of algorithms such as DNN: it is very difficult for an observer to decipher how the model combines the variables to arrive at a prediction, which may generate reluctance among clinicians to use such tools. In daily practice, in light of recent studies (31–33) published in the post-operative setting, such models can enhance clinicians' confidence in decision making when proposing adjuvant or salvage radiotherapy.
Our study has some limitations that should be noted. First, mpMRI data might be a promising additional dataset for improving the accuracy of the predictive models. The data analysis is limited by the fact that a standard ultrasound-guided prostate biopsy was used for most cases in this cohort. This does not reflect current standard practice, as MRI is now recommended in the first-line biopsy setting. Second, the median follow-up time was relatively short considering the natural history of the biochemical progression of intermediate-risk PCa. Third, the modifications to the Gleason score grading system in 2005 could also have introduced bias. In addition, the pathology data were not centralized among the different tertiary centers. However, only dedicated uropathologists reviewed the RP specimens at these referral centers, and to limit potential bias, we restricted our analyses to contemporary patients. Finally, we must acknowledge that the differences found between AUC results are small.

CONCLUSIONS
The results of this study indicate that predictive models could improve the prediction of 3-year BCR after RP based on routine variables used in the CAPRA score in a population presenting with intermediate-risk disease. Specifically, a deep neural network model achieved the highest performance metrics for predicting BCR. This model may help clinicians to achieve the goal of personalized medicine and develop a strategic approach for prostate cancer treatment.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.