Artificial Intelligence Combined With Big Data to Predict Lymph Node Involvement in Prostate Cancer: A Population-Based Study

Background: A more accurate preoperative prediction of lymph node involvement (LNI) in prostate cancer (PCa) would improve clinical treatment and follow-up strategies for this disease. We developed a predictive model based on machine learning (ML) combined with big data to achieve this.
Methods: Clinicopathological characteristics of 2,884 PCa patients who underwent extended pelvic lymph node dissection (ePLND) were collected from the U.S. National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) database from 2010 to 2015. Eight variables were included to establish an ML model. Model performance was evaluated with receiver operating characteristic (ROC) curves and calibration plots for predictive accuracy, and decision curve analysis (DCA) and cutoff values were used to estimate clinical utility.
Results: Three hundred and forty-four patients (11.9%) were identified with LNI. The five most important factors were the Gleason score, T stage of disease, percentage of positive cores, tumor size, and prostate-specific antigen level, with 158, 137, 128, 113, and 88 points, respectively. The XGBoost (XGB) model showed the best predictive performance and the highest net benefit compared with the other algorithms, achieving an area under the curve of 0.883. With cutoff values of 5%-20%, the XGB model performed best at reducing omissions and avoiding overtreatment of patients, with a lower false-negative rate and a higher percentage of ePLND procedures avoided. In addition, DCA showed that it had the highest net benefit across the whole range of threshold probabilities.
Conclusions: We established an ML model based on big data for predicting LNI in PCa; it could spare approximately 50% of ePLND procedures, with only ≤3% of patients misdiagnosed at cutoff values ranging from 5% to 20%. This promising approach warrants further validation with a larger prospective dataset.


INTRODUCTION
Prostate cancer (PCa) is the most common type of malignant tumor in American men and accounts for nearly 15% of all cancer cases. Recurrence and metastasis are the most common causes of death in PCa patients (1). Radical prostatectomy (RP) is the gold-standard treatment for patients with PCa and those with either organ-confined or locally advanced PCa can benefit from it (2)(3)(4). Because of the biological characteristics of this disease and its response to effective treatment, patients with PCa generally have an excellent long-term prognosis with the vast majority of patients surviving over 5 years (5). However, if a PCa patient is diagnosed with lymph node involvement (LNI), the probability of tumor recurrence will increase, and the prognosis will deteriorate significantly (6,7).
For a lymph node (LN)-positive PCa patient, extended pelvic lymph node dissection (ePLND) can often offer a better cancer-specific outcome, either with or without adjuvant androgen deprivation therapy (ADT) (8,9). As LNI is a significant component of a patient's prognosis, it can influence the clinical decisions made by the surgeon; it is therefore essential to know the precise clinical status of LNI in patients with PCa before surgery. To this end, researchers have reported several approaches and tools that can help clinicians estimate the occurrence of LNI, including prostate-specific membrane antigen-positron emission tomography (PSMA-PET) scans, multiparametric magnetic resonance imaging (mpMRI), and some emerging molecular biomarkers (10)(11)(12)(13). However, these imaging techniques are not sufficiently accurate, and most of the molecular biomarkers are unproven. Therefore, the most commonly used tools remain the Briganti, Partin, and Memorial Sloan Kettering Cancer Center (MSKCC) nomograms, which have an accuracy of less than 80% (14)(15)(16)(17). Machine learning (ML) is an emerging interdisciplinary approach that allows accurate prediction of outcomes from multiple unrelated datasets that would otherwise be discrete and difficult to associate (18).
With the rapid development of evidence-based medicine, vast and complex medical datasets require more advanced techniques for their interpretation, and ML is becoming a promising option for the diagnosis and prognosis/prediction of many diseases (19,20). In addition, ML has demonstrated excellent predictive performance and good potential for application in several areas of medicine (21). Our goal was to develop a new decision-support ML model based on big data for predicting the risk of LNI in PCa patients. We used areas under the curve (AUCs), calibration plots, and decision curve analysis (DCA) to evaluate the performance of the model, and further validated its accuracy using a validation set.

MATERIALS AND METHODS

Data Source and Study Population
In this study, we used the Surveillance, Epidemiology, and End Results (SEER) database (https://seer.cancer.gov/) from the National Cancer Institute, a freely available cancer registry in the United States. We obtained permission to access the files of the SEER database and all authors followed the SEER database regulations throughout the study. No personally identifiable information was used in this study and informed consent was not necessary from individual participants. The Medical Ethics Committee at Jinan University's First Affiliated Hospital examined and approved this work.
Data of the patients were downloaded from the SEER 18 Regs Research Data Nov 2018 Sub (1975-2016) using the SEER*Stat 8.3.9.1 software. The selection criteria included men aged 35-90 years who were diagnosed with histologically confirmed prostatic adenocarcinoma (site code C61.9, morphology code 8140/3). The patients were diagnosed between 2010 and 2015, and PCa was the first malignant tumor found. All cases were treated with radical prostatectomy (RP) and ePLND without neoadjuvant systemic therapy. However, as the data obtained did not define ePLND by anatomical location, we followed the literature and defined it as the removal of 10 or more LNs (22,23). The T stage of the tumors was derived from preoperative examination and postoperative specimens. The number of needle cores examined was between 4 and 24, and autopsy cases were excluded. Data were downloaded for 3,257 cases. The exclusion criteria were as follows: unknown information on the American Joint Committee on Cancer seventh edition TNM stage, race, marital status, or prostate-specific antigen (PSA) level before biopsy, and the absence of a Gleason score (GS). Of the 3,257 initial cases, 2,884 patients remained in the final cohort for model development. The population selection procedure is shown in Figure 1. The validation set was derived from the same SEER database and comprised cases diagnosed in 2016. In addition to the above criteria, one further criterion from the Briganti nomogram was applied: patients with PSA >50 ng/ml were excluded from the validation set. A total of 535 patients were included in the analysis as the validation set.

Variable Selection
Several readily available clinical and demographic characteristics were chosen as independent variables for analysis. Since the SEER database does not include tumor size based on preoperative imaging, this parameter is prone to some error. However, the tumor size of postoperative specimens in most patients does not differ significantly from the tumor size evaluated by imaging, and we therefore considered this error to be within an acceptable range (24)(25)(26)(27). Ultimately, eight demographic and clinicopathological variables, namely, age at diagnosis, race, marital status, T stage, tumor size, PSA level before biopsy, GS on biopsy, and percentage of positive cores (PPC), were selected as independent variables for analysis.

Model Establishment and Development
All statistical analyses in this study were performed using SPSS (version 22, IBM Corp.) and Python (version 3.8.1, Python Software Foundation). All variables were tested for Pearson correlations, and the results are presented as a heat map (Figure 2). All patients were randomly divided into a training set and a test set at a ratio of 7:3 (Table 1). The chi-square test was used to analyze the differences between the training and test sets. The training set was used to establish the extreme gradient boosting (XGB) and multivariate logistic regression (MLR) models, and the test set was then used to evaluate both of them. We used 600 trees in XGB to build the ML model. For MLR, we used an entry variable selection method to establish the model. The 5%, 10%, 15%, and 20% cutoff values of the XGB and MLR models were then calculated and used to test their clinical value for directing possible treatment options.
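The split-and-fit procedure above can be sketched as follows. This is a minimal illustration on synthetic data, not the study's actual pipeline: the study used the xgboost package, for which scikit-learn's GradientBoostingClassifier is substituted here so the sketch is self-contained, and all variable names are hypothetical.

```python
# Minimal sketch of the 7:3 split and the two models described above,
# run on synthetic data (GradientBoostingClassifier stands in for XGB).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 8))  # stand-ins for the eight predictors (age ... PPC)
y = (X[:, 0] + X[:, 3] + rng.normal(size=n) > 1.5).astype(int)  # synthetic LNI label

# Random 7:3 split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Gradient-boosted trees (600 trees, as in the study) and logistic regression
gbt = GradientBoostingClassifier(n_estimators=600, random_state=42).fit(X_train, y_train)
mlr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predicted LNI probabilities on the held-out test set
p_gbt = gbt.predict_proba(X_test)[:, 1]
p_mlr = mlr.predict_proba(X_test)[:, 1]
```

The held-out probabilities `p_gbt` and `p_mlr` are what the ROC, calibration, and cutoff analyses below operate on.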

Model Improvement
To ensure that the model was stable, 10-fold cross-validation was adopted to evaluate its predictive capability. The training set was randomly divided into 10 groups; in each iteration, nine groups were selected for training and the remaining group was used as the test set. Each group was thus chosen as the test set in turn, ensuring that the evaluation results were not accidental. We then averaged the results of the 10 evaluations to reduce errors caused by any unrepresentative split of the test set. To achieve the overall optimum in the XGB model, we used the learning curve method to find the optimal parameters. The learning curve is shown in Figure 3, where the abscissa represents the number of trees and the different learning rates, and the ordinate represents the average AUC of the 10-fold cross-validation. The final optimal parameter combination was as follows: the number of trees ("n tree") = 851 (Figure 3A), the learning rate ("eta") = 0.16 (Figure 3B), the maximum depth from root node to leaf node ("max depth") = 6, the minimum sum of instance weights in a leaf node ("min child weight") = 1, and the L2 regularization parameter ("reg lambda") = 120. All other parameters were left at their default values.

FIGURE 1 | Flowchart of the study population selected from the SEER database, based on the inclusion and exclusion criteria outlined above; 2,884 patients were included in this study.
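The cross-validated tuning loop described above can be sketched as follows, on synthetic data. The parameter names follow the xgboost convention used in the text ("n tree", "eta"), but the estimator here is scikit-learn's GradientBoostingClassifier and the grid is deliberately tiny, so this is an illustration of the procedure rather than a reproduction of the study's search.

```python
# Hedged sketch of 10-fold cross-validated grid search over tree count
# and learning rate, mirroring the learning-curve tuning described above.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + rng.normal(size=500) > 1.0).astype(int)

best_auc, best_params = -np.inf, None
for n_trees in (50, 150):            # candidate "n tree" values
    for eta in (0.05, 0.16):         # candidate learning rates ("eta")
        clf = GradientBoostingClassifier(
            n_estimators=n_trees, learning_rate=eta, random_state=0)
        # Mean AUC over 10 folds: each fold serves once as the held-out set
        auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean()
        if auc > best_auc:
            best_auc, best_params = auc, (n_trees, eta)
```

Averaging the AUC over all 10 folds, as in the last line of the loop, is what makes the comparison between parameter combinations robust to any single unlucky split.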

Evaluation of Model Performance
The performance analysis comprised three components. First, model discrimination was quantified with receiver operating characteristic (ROC) curve analysis, and predictive accuracy was assessed with the resulting AUCs. Second, we used calibration plots to evaluate how far the predictions of the model deviated from the actual events. Third, clinical usefulness and net benefit were assessed with DCA, which estimates the net benefit by calculating the difference between the true- and false-positive rates, weighting these by the odds of the selected threshold probability of risk. Additional ML algorithms, namely, decision tree (DT) and support vector machine (SVM), were also introduced for comparison. ROC curves and calibration plots were used to further evaluate the appropriateness and generalizability of our model.
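The net-benefit quantity behind DCA can be written down directly: net benefit = TP/N - (FP/N) x pt/(1 - pt), where pt is the threshold probability. Below is a minimal sketch with made-up predictions; the function and example values are purely illustrative.

```python
# Net-benefit calculation underlying decision curve analysis (DCA).
import numpy as np

def net_benefit(y_true, y_prob, pt):
    """Net benefit of intervening on patients whose predicted risk >= pt."""
    y_true = np.asarray(y_true)
    treat = np.asarray(y_prob) >= pt
    n = len(y_true)
    tp = np.sum(treat & (y_true == 1))  # true positives among those treated
    fp = np.sum(treat & (y_true == 0))  # false positives among those treated
    return tp / n - fp / n * pt / (1 - pt)

y = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
p = np.array([0.9, 0.1, 0.2, 0.7, 0.05, 0.3, 0.15, 0.8, 0.1, 0.25])

nb_model = net_benefit(y, p, 0.2)               # model-guided decision
nb_all = net_benefit(y, np.ones_like(p), 0.2)   # "treat all" strategy
```

Sweeping `pt` across a range of thresholds and plotting `net_benefit` for each model, together with the "treat all" and "treat none" extremes, yields the decision curves reported in the Results.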

RESULTS

Demographic and Clinicopathological Characteristics
A total of 2,884 PCa patients were analyzed in this study. Three hundred and forty-four patients (11.9%) had LNI, and 2,540 (88.1%) did not and were classified as non-LNI (nLNI). All patients were randomly divided at a ratio of 7:3 into a training set (n = 2,018) and a test set (n = 866). The demographic and clinicopathological variables of the patients are detailed in Table 1.

Model Analysis and Variable Feature Importance of the Prediction
Pearson correlation analysis was performed for all factors. The correlation heat map showed weak correlations between several clinicopathological variables (T stage, tumor size, PSA, GS, and PPC) and a moderate correlation between T stage and tumor size (Figure 2). For the MLR model, five of the eight variables were identified as independent risk factors, namely, T stage (p < 0.001), tumor size (p < 0.001), PSA before biopsy (p < 0.001), GS (p < 0.001), and PPC (p = 0.006) (Table 2). For the XGB model, we ranked feature importance by the gain value of each variable, with higher values indicating greater importance for the prediction target: GS (158 points), T stage (137 points), PPC (128 points), tumor size (113 points), PSA (88 points), race (64 points), age at diagnosis (51 points), and marital status (36 points) (Figure 4).
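A ranking of this kind can be sketched as follows. With the xgboost package, the gain values come from `booster.get_score(importance_type="gain")`; here scikit-learn's impurity-based `feature_importances_` stands in as an approximation, on synthetic data with hypothetical variable names.

```python
# Sketch of ranking predictors by importance, as in the gain ranking above.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 8))
# Give feature 3 ("T_stage") the strongest signal and feature 0 a weaker one
y = (2 * X[:, 3] + X[:, 0] + rng.normal(size=400) > 0).astype(int)

names = ["age", "race", "marital", "T_stage", "size", "PSA", "GS", "PPC"]
clf = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)
ranking = sorted(zip(names, clf.feature_importances_),
                 key=lambda t: t[1], reverse=True)  # most important first
```

Note that scikit-learn's `feature_importances_` are normalized to sum to 1, whereas xgboost gain values are unnormalized; only the relative ordering is comparable between the two.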

Model Performance
ROC curves, calibration plots, and DCA for the training set (n = 2,018) and the test set (n = 866) were constructed to determine the accuracy of our models. The XGB model showed the best discrimination in both the training and test sets (AUC = 0.908 and 0.883, respectively). The sensitivity, specificity, and cutoff values of the predictions of the two models were also calculated for patients with and without LNI. In addition, the predictive accuracy and error when patients were predicted to be at risk of LNI are given in Table 3. The results showed that whether 5%, 10%, 15%, or 20% was chosen as the cutoff value, the XGB model was better than the MLR model at reducing omissions and avoiding overtreatment, with a lower false-negative rate and a higher percentage of ePLND procedures avoided. With a 5%-20% cutoff value, the XGB model kept the risk of missing patients below 3% (1.2%-2.9%). The calibration plots of the two case sets indicated that the predicted probabilities showed excellent concordance with the observed LNI rates in the XGB model, followed by the SVM and DT models, respectively. The calibration of the MLR model tended to underestimate the LNI risk across the entire range of predicted probabilities compared with the other models (Figure 6).
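The per-cutoff quantities reported in Table 3 can be sketched as follows: at a candidate cutoff, classify patients by predicted risk, then tabulate the LNI patients missed (false negatives) and the fraction of dissections avoided. All values here are made up for illustration; the function name is hypothetical.

```python
# Sketch of the cutoff analysis: false negatives, ePLND avoided, and NPV.
import numpy as np

def cutoff_metrics(y_true, y_prob, cutoff):
    y_true = np.asarray(y_true)
    pred_pos = np.asarray(y_prob) >= cutoff       # would receive ePLND
    fn = int(np.sum(~pred_pos & (y_true == 1)))   # LNI patients missed
    spared = float(np.mean(~pred_pos))            # fraction of ePLND avoided
    npv = float(np.sum(~pred_pos & (y_true == 0)) / max(np.sum(~pred_pos), 1))
    return fn, spared, npv

y = np.array([1, 0, 0, 0, 1, 0, 0, 0])
p = np.array([0.40, 0.02, 0.03, 0.10, 0.30, 0.04, 0.01, 0.06])
fn, spared, npv = cutoff_metrics(y, p, 0.05)  # the 5% EAU-style cutoff
```

Repeating this at the 5%, 10%, 15%, and 20% cutoffs for each model reproduces the kind of comparison summarized in Table 3.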
DCA for the four models was subsequently constructed (Figure 7). The y-axis of the decision curve represents the net benefit, a decision-analytic measure of whether a particular clinical decision results in more benefit than harm. Each point on the x-axis represents a threshold probability that differentiates between patients with and without LNI (LNI vs. nLNI). All models achieved a net clinical benefit compared with a treat-all or treat-none strategy. With a risk threshold of less than 80%, the ML models showed a greater net benefit for patient interventions in the test set than the MLR model, and the XGB model had the highest net benefit across the whole range of threshold probabilities.

Model Validation
The validation set was used for model validation, and the clinical and pathological characteristics of the validation set are detailed in Table 4. In addition, ROC curves and calibration plots were constructed to compare the accuracy of the XGB model with that of the Briganti nomogram; the XGB model achieved a higher AUC (0.850 vs. 0.816) (Figure 8A). Moreover, the calibration plots indicated that the XGB model had better consistency across the 0% to 60% range of predicted probability, whereas the calibration of the nomogram tended to underestimate the LNI risk in the same range (Figure 8B).

DISCUSSION
In this study, we developed a more accurate model to predict the risk of LNI in patients with PCa by combining eight clinicopathologic parameters. To our knowledge, this is the first ML model for predicting LNI established using big data and readily available clinicopathological parameters. LNI is found in up to 15% of PCa patients upon postoperative pathological examination and is associated with the recurrence and prognosis of PCa (28). As a standard treatment for PCa patients with LNI, ePLND can help accurately diagnose occult micrometastases, allowing PCa patients to receive effective treatment and a more accurate disease stage. This is important for postsurgical follow-up and the subsequent selection of adjuvant and salvage therapies (29,30). However, PSA screening increases the rate of detection of earlier-stage or low-risk tumors at diagnosis, which leads to a higher rate of surgery in patients with low- and intermediate-risk tumors and a lower likelihood of finding positive LNs after surgery.
Our study found that the detection rate of postsurgical positive LN was lower than that reported in the literature, at approximately 12% (28). This indicated that some patients were overtreated, resulting in complications such as pelvic lymphocele, ileus, thrombosis, scrotal swelling, nerve injury, and so on (31,32).
Therefore, determining the LN stage of PCa before surgery is critical for deciding whether a patient should receive ePLND. As routine imaging examinations such as CT scans and MRIs are currently ineffective at detecting nodal metastases (33), and there are only a few promising biomarkers in the preclinical stages of PCa (12,13), ePLND remains the only accurate way of detecting nodal metastases. To weigh the benefits and drawbacks of ePLND, the National Comprehensive Cancer Network (NCCN) and the European Association of Urology (EAU) guidelines both recommend the use of nomograms, such as the Briganti, MSKCC, and Partin nomograms. These nomograms are largely based on several clinicopathologic characteristics that can be readily acquired during the clinical procedure and may predict the risk of LNI before surgery (34,35). Because of their convenience and practicality, nomograms are the most commonly used tools for LNI prediction. However, the performance of the older nomograms is not always reliable, with a prediction accuracy below 80% (36), and the newer nomograms are less used because they require some non-conventional variables (37). Hence, a more advanced prediction model based on basic variables is needed.

FIGURE 7 | Decision curve analysis graph showing the net benefit against threshold probabilities based on decisions from model outputs. Three curves were obtained based on predictions of the four different models, and two curves were based on two kinds of extreme decisions. The curves referred to as "All" represent the prediction that all the patients would progress to LNI, and the curves referred to as "None" represent the prediction that all the patients were nLNI.
XGB is a gradient boosting algorithm that is among the most powerful techniques for constructing prediction models, and it has been widely used in medical studies (38,39). We found that the five independent risk factors identified by the MLR model were identical to the top five most important factors calculated by the XGB model: T stage, tumor size, PSA before biopsy, GS, and PPC. GS is an evaluation method for determining the differentiation state of PCa tumor cells; a higher score represents a less differentiated tumor. This parameter is highly correlated with the aggressiveness of the malignancy, and highly aggressive tumors progress more rapidly and are usually associated with LN metastasis. In our study, the XGB model assigned the highest weight value to GS, showing the importance of this parameter and indicating that it contributed most to the results obtained (40,41). As our results demonstrated, the XGB model assigned weighted values to all variables and ranked them by order of importance, thus allowing more variables to be involved in the analysis and helping physicians better understand the risk factors.
In this study, the predictive accuracy of our XGB model was the highest in both the training and test sets (AUC = 0.908 and 0.883, respectively). By comparison, the MLR model was less accurate in both sets (AUC = 0.769 and 0.763, respectively). The XGB model was also more accurate than the Briganti nomogram in the validation set (AUC: 0.850 vs. 0.816). Previous studies have reported a predictive accuracy of 0.798 for the Briganti nomogram, which is close to our result (0.798 vs. 0.816) (42). Considering that a nomogram is a visualization of an MLR model, established from the MLR algorithm, this result was not surprising. In addition, the calibration plots for the validation set indicated that the XGB model had better consistency across the 0% to 60% range of predicted probability. In actual clinical practice, physicians are most concerned with accurately identifying LNI in low-risk PCa patients. It is encouraging that our model has better consistency within the low-risk range, whereas the nomogram tends to underestimate the risk of LNI in this range; our model is therefore better at avoiding omissions among low-risk patients. These results indicate that the MLR model has a weakness in its accuracy when analyzing the linkages within multiple data sources, whereas the XGB model excels at accurately predicting outcomes from multiple unrelated datasets.
Other researchers have tried to use ML to predict LNI. Hou et al. used an ML algorithm combined with mpMRI to build a model for predicting lymph node metastasis in PCa patients. It had a very high prediction accuracy (AUC = 0.906), but the sample size was small and its practical use might be limited because mpMRI is not an easily available parameter (43). Our model, by contrast, was established using a more advanced algorithm, a large sample, and basic parameters. Similar results were obtained from the calibration plots in both the test and validation sets, which compared predicted probabilities against observed average LNI rates, indicating that the XGB model had excellent consistency compared with the MLR model. In addition, the XGB calibration curve was closer to the ideal line. The MLR model tended to underestimate the LNI risk across the entire range of predicted probabilities, indicating that nomograms based on the MLR model might have a higher false-negative rate and lead to some LNI patients being missed.
The significance of determining a cutoff value for a predictive model is to guide the physician during clinical decisions, but it also implies that a certain number of patients below that cutoff value may be missed. The current EAU guidelines indicate ePLND when the risk of LN metastasis exceeds 5% according to the Briganti nomogram (44). Using this cutoff, our model could spare about 50% (428/866) of ePLND procedures, and only 1.2% (5/428) of patients would be missed. Compared with the MLR model, XGB had a higher positive predictive value (34.5% vs. 27.1%) and a higher negative predictive value (98.9% vs. 97.5%), with a lower percentage of overtreated patients (65.5% vs. 72.9%). In addition, our model had low false-negative rates at all cutoff values from 5% to 20%. Choosing 20% as the cutoff value can largely reduce the number of ePLND procedures while keeping the probability of missing patients as low as 2.9%. These results imply that our model has a low miss rate and that more LNI patients would be identified, consistent with the results of previous studies (44).
Considering that physicians focus on different goals in different clinical situations, DCA was developed by Vickers and Elkin to evaluate clinical effectiveness; the method uses the net benefit at different thresholds to evaluate clinical utility (45). In our study, within a threshold of 5% to 20%, the net benefit of our XGB model was higher than that of the other models, indicating that the value of the benefits (i.e., the correct identification of LNI) minus the drawbacks (i.e., overtreatment and omissions) is greatest.
However, this study has several limitations. First, the model is based on the SEER database, which collects data from the North American population, so there may be gaps in population applicability, necessitating the use of broader populations in future studies. Second, tumor size determined by preoperative imaging was not available in our data, and this could introduce some errors; we will further improve our model with more complete external validation data in future studies. Third, we excluded patients with <10 LNs examined so as to exclude patients treated with limited PLND, which is not the ideal definition of ePLND.
This is the first model to predict LNI in PCa patients based on standard clinicopathological features and a novel AI algorithm. Our model performed exceptionally well in predicting LNI in presurgical PCa patients and could potentially assist clinicians in making more accurate and personalized medical decisions. In practice, this model could help surgeons predict the probability of LNI in PCa patients and thus guide surgical choices. For patients, it could make those at higher risk more vigilant, thereby improving their prognosis.

CONCLUSIONS
We established an ML model based on big data for predicting LNI in PCa patients. Our model showed excellent predictive accuracy and practical clinical utility, which might help guide the decisions of urologists and help patients improve their long-term prognosis. This study is in line with the trend toward precision medicine.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The software packages of AI algorithms are available at Scikit-learn (https://scikit-learn.org/stable/) and matplotlib (https://matplotlib.org/). The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by The Medical Ethics Committee at Jinan University's First Affiliated Hospital. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
YZ, LW, and YH conceived and designed the study. YZ and LC provided administrative support. LW, ZC, and XQ collected and assembled the data. YH, LW, HL, and LC analyzed and interpreted the data. All authors contributed to the article and approved the submitted version.

FUNDING
This study was supported by the Natural Science Foundation of China (81902615) and the Leading Specialist Construction Project-Department of Urology, the First Affiliated Hospital, Jinan University (711006).