Machine Learning Improves Upon Clinicians' Prediction of End Stage Kidney Disease

Background and Objectives Chronic kidney disease progression to ESKD is associated with a marked increase in mortality and morbidity. Its progression is highly variable and difficult to predict. Methods This is an observational, retrospective, single-centre study. The cohort was patients attending hospital and nephrology clinic at The Canberra Hospital from September 1996 to March 2018. Demographic data, vital signs, kidney function test, proteinuria, and serum glucose were extracted. The model was trained on the featurised time series data with XGBoost. Its performance was compared against six nephrologists and the Kidney Failure Risk Equation (KFRE). Results A total of 12,371 patients were included, with 2,388 were found to have an adequate density (three eGFR data points in the first 2 years) for subsequent analysis. Patients were divided into 80%/20% ratio for training and testing datasets. ML model had superior performance than nephrologist in predicting ESKD within 2 years with 93.9% accuracy, 60% sensitivity, 97.7% specificity, 75% positive predictive value. The ML model was superior in all performance metrics to the KFRE 4- and 8-variable models. eGFR and glucose were found to be highly contributing to the ESKD prediction performance. Conclusions The computational predictions had higher accuracy, specificity and positive predictive value, which indicates the potential integration into clinical workflows for decision support.


INTRODUCTION
Chronic kidney disease (CKD) is a major cause of morbidity and mortality globally, having a reported prevalence of 11-13% (1). The prevalence of CKD is rising, especially in developed nations where lifestyle related diseases are endemic (2). Whilst CKD may ultimately culminate in end stage kidney disease (ESKD), rate of progression is highly variable and difficult to predict. ESKD is associated with a marked increase in mortality and morbidity: it is a terminal condition without renal replacement therapy (RRT) in the form of haemodialysis, peritoneal dialysis, or kidney transplantation. As ESKD approaches, patients and clinicians are required to make difficult decisions (3). RRT requires the formation of permanent dialysis access and/or evaluation of suitability for transplantation. Preparation for RRT is associated with significant cost and risks of complications, such as post-operative infection and bleeding (4). Premature access formation exposes patients to these risks without benefit and, in some cases, result in RRT access not being used at all (5). However, the capacity of physicians to correctly predict patient outcomes is poor (6). Therefore, any method which will improve the ability to correctly identify patients who will require RRT is highly desirable.
Data-driven predictive modeling is a rapidly advancing field and has been employed in a range of clinical scenarios such as opioid overdose (7) and acute kidney injury (8). Recent advances have demonstrated the capacity of predictive modeling to robustly predict acute kidney injury in individuals with varying levels of kidney function (8). For CKD, multiple risk factors for initiation and progression to ESKD have been characterized (9). The most pronounced risk factors include male gender, proteinuria, excess weight, hypertension and diabetes (10,11). Statistical modeling of these data has been shown to be predictive of an individual developing CKD (12,13). Application of machine learning (ML) approaches in small clinical cohorts has been shown to be capable of predicting accurate estimated glomerular filtration rate (eGFR) values (14,15). Analysis of over sixty thousand electronic medical records (EMRs) and prediction with a random forest regression method has shown eGFR to be predicted with a correlation coefficient of better than 0.95 (16). However, to date, ML has not been used to predict the clinical outcomes of ESKD or death in CKD. It has also not been compared pragmatically to the predictions of expert clinicians.
In this work, we describe the training of a ML model that was capable of predicting which CKD patients will progress to ESKD within 2 years. We compare the predictive power of this model against the prospective predictions of six nephrologists in a cohort of fifty CKD patients. The predictions for ESKD occurring within a 2-year period were better than the most experienced clinician. The work here shows that predictive models built with machine learning can be accurate and have a potential role in providing decision support in a clinical environment.

Data Cleaning
The raw clinical time-series data included much manually entered information, and included erroneous extreme outliers in some variables, particularly, height, weight, standing systolic pressure, standing heart rate and protein-creatinine. These were outliers detected and removed using chi-squared outlier tests with an inclusion threshold of 99.999999%, excluding points with a p-value of lesser than 10 −8 for distribution membership. A total of 538 outlying points were flagged, out of 869,901 data points (0.062%).

Data Filtering
Out of 12,371 patients extracted 7,565 patients recorded an eGFR value. We filtered the primary data to remove patients with fewer than three pre-ESKD eGFR measurements. This filtering step retained 4,477 patients. Primary measures of creatinine level were excluded from subsequent study due to the output predictor, eGFR, being derived from it.

Machine Learning Model
The Training Span (TS) and the Test Timepoint (TT) values were chosen as a trade-off between accuracy and the clinical usefulness of the size of the prediction window. Here, TS = 2 and TT = 8 years provided the most clinically relevant predictor, with a large amount of useful training data and relatively high accuracy. With this selection of TS, we removed patients who had reached ESKD within 2 years of commencement of their data collection. This reduced our sample size to 2,388 patients.

Data Partitioning
To ensure that the model was trained on distinct data to the test data, individual patients were fist randomly partitioned into two groups. 80% (N = 1,910) of the patients formed the training set and the remaining 20% (N = 478) was used for testing. To ensure that a similar proportion of ESKD patients were randomly included in each set, a two-sample test for equality of proportions with continuity correction was performed, which confirmed that the proportions were not significantly different (χ² = 0.000538, p = 0.9812).

Time-Series Featurisation
Data from 19 different time-series from each of the 1,910 training and 478 test patients was featurised with the tsfresh v0.12.0 python library (17), using the internal Comprehensive calculator setting to produce 764 features per individual timeseries (including power spectral density, Fourier, cosine and wavelet transforms). Four case studies with varying degrees of Comprehensive featurisation was performed: zero (all Minimal), 2 (only eGFR and glucose), 6 (adding standing heart rate, systolic and diastolic blood pressure and heart rate), and 19 (all Comprehensive).

Model Training and Hyperparameter Tuning
A model was trained on the featurised time series data for the 1,910 training set patients with XGBoost (18) (version 0.90.0.2), as implemented in R 3.6.1. This method is particularly suited to efficiently handle sparse data. Since the prediction accuracy of XGBoost can vary greatly according to various hyperparameters, we performed 10-fold cross-validation (CV) and hyperparameter grid search across five frequently-cited (19) tunable XGBoost parameters (nrounds, eta, max_depth, subsample, colsample_bytree). In order to properly handle the over 8-fold class imbalance between ESKD cases (N = 263) and non-ESKD cases (N = 2,125), the metric for tuning and evaluating performance of the predictive models was changed from accuracy to Area Under the Precision-Recall Curve, AUCPR (20). Upon subsequent prototyping, the Matthews Correlation Coefficient (MCC) metric, reported to be optimal for imbalanced binary classification (21), indeed performed better than AUCPR on our dataset and was solely employed in all subsequent analyses. The highest 10-fold CV mean MCC of 0.4327 corresponded with nrounds = 200, eta = 0.01, max_depth = 4, subsample = 1, colsample_bytree = 0.8, and this hyperparameter set was used to build the final optimal model employing the entire training dataset. The fully-featurised (all-Comprehensive) run had a 10-fold CV mean MCC of 0.4303 with the corresponding best hyperparameter set: nrounds = 500, eta = 0.04, max_depth = 4, subsample = 1, colsample_bytree = 0.6. SHapley Additive exPlanation (SHAP) values (22) were calculated and graphed using the SHAPforxgboost 0.0.2 R library (23). Our software and trained models are available for download from (https://github.com/catlyst/trendal).

Dataset
We obtained pathology and clinical records for patients attending hospital and outpatient clinics at The Canberra Hospital Department of Renal Medicine, Australian Capital Territory, Australia. This included 12,371 patients spanning 21.5 years from September 1996 to March 2018. In addition to birth year, sex and date of death, this dataset consists of 17 timestamped clinical measures including creatinine and derived eGFR (168,500 points each), glucose (145,961 points), sitting and standing blood pressure (42,818 points), heart rate (28,741 points), weight (47,981 points), height (35,421 points) and derived Body Mass Index (BMI), HbA1c (15,349 points), urine protein-creatinine ratio (14,777 points) and 24-h proteinuria (883 points). All data was observed in a normal distribution with the exception of creatinine, eGFR, glucose, HbA1c, proteinuria and urine protein/creatinine ratio (Supplementary Figure 1). In addition to the clinical measures, we derived two features, sitting and standing pulse pressure. With the addition of these derived measures, a total of 19 time-based measurements were obtained. We applied a threshold of three eGFR data points obtained in the first 2 years for each individual as a required minimum data density for subsequent analysis. From 7,565 patients, 2,388 were found to have adequate density for ML prediction (31.6%) (Supplementary Figure 2).
Out of 2,388 patients, 1,910 were used to train the model and 478 were used as a holdout test dataset. Patients baseline characteristics are shown in Table 1. There was no difference in the age at presentation, gender proportion, diagnosis, and CKD stage. The majority of patients were in CKD stage 2 and 3.

Machine Learning Model Identifies EGFR and Cardiovascular Risk Factors as Key Predictors of ESKD
We aimed to develop a model capable of predicting whether a patient would reach ESKD and to estimate the timeframe in which this would occur. We first aggregated individual patients' clinical measurements into a single, initial predictive model, including potentially interacting and confounding variables (see Methods). Throughout this analysis, we defined ESKD as the composite endpoint of an eGFR below a threshold of 10 mL/min/1.73 m 2 (24) or the commencement of RRT, whichever came first. Using the initial model, we then incorporated data for each patient with an observation period, or training span (TS), of 2 years to predict if the patient would reach ESKD by a test timepoint (TT) of 8 years. The initial model incorporated all 19 time series measures present in the primary data and this was augmented with full featurisation derived from these timeseries (see Methods). When predicting if ESKD would occur by 8 years follow-up, the initial model had a prediction accuracy for ESKD of 84.5% with a sensitivity and specificity of 55.8 and 88.0%, respectively. To understand the relative contribution of each feature in the initial model, SHapley Additive exPlanation (SHAP) values (25) were calculated, representing the weight of a particular feature in the model's ESKD prediction outcome. A SHAP value >0 suggests the feature value is a risk for ESKD, and <0 suggests the feature value is protective against the risk of ESKD.
For the prediction of ESKD, as expected eGFR-based features (the minimum value and values at the 80, 20, and 10th percentiles) had the highest contributions, comprising six of the top 10 features ( Table 2). Other quantities present in the top 10 predictive features included the patient's minimum standing heart rate, their age at first consultation and several derived or transformed values related to blood glucose levels and eGFR (Figure 1; Supplementary Figure 3). Further, blood glucose was a major contributor to the prediction of ESKD (6 of top 25 features) when >7 mmol/L (Figure 1; Table 2). Other identified features contributing to risk include known clinical risk factors such as blood pressure features including minimum standing diastolic blood pressure where a value of >75 mmHg increases risk. The model also identified other features such as minimum heart rate (increased risk at more than ∼85 bpm), age at initial attendance (increased risk, <55 yrs) and mean pulse pressure difference between sitting and standing (increased risk, >10) which are not known as clinical risk factors for ESKD but appear to have cutoff values associated with ESKD risk. We hypothesized that increased cardiovascular disease explained the association of increased HR and pulse pressure difference, and that the competing interest of death at higher age groups made youth a risk factor for progression to ESKD.
Given the contribution of eGFR and glucose we modified our ML algorithm to emphasize these variables. We hypothesized that the prominence of eGFR and glucose suggested these variables contributed to the majority of the prediction. Further, we noted that the greatest data density existed for eGFR. We trained a further, optimized model utilizing all 19 time-series measures but only comprehensively-featurised two timeseries values: that of eGFR and glucose, which were featurised each into 764 numerical features. We then only minimally-featurised the remaining 17 timeseries values (eight numerical features each). In predicting ESKD at 8 years on 2 years of patient data, this optimized model improved prediction accuracy from 84.5 to 86.2%. Sensitivity improved from 55.8 to 65.4% and specificity from 88 to 88.7%. The most predictive quantities identified from this optimized model are only slightly different (Supplementary Table 1) from the initial model. Therefore, we concluded that unbiased predictions of ESKD by our ML algorithm is based on classic ESKD risk factors and has a strong predictive capability.
To contrast optimized model performance by disease severity, we grouped patients in the test set by their eGFR at presentation (either below, or at least, 30 ml/min/1.73 m 2 ) (Supplementary Table 2 The model's accuracy in predicting ESKD in the former group was 67.8% (84.6% sensitivity, 60.9% specificity). In contrast, the accuracy of the model in the latter group was 90.4% (46.2% sensitivity, 93.6% specificity). These numbers highlight the sensitivity-specificity trade-off, which our single optimized model attempts to finely balance (overall accuracy 86.2% at 65.4% sensitivity and 88.7% specificity), but also suggests room for stratified ensemble models for future work.

Comparison of ML Algorithm Against Clinician Prediction
Forty-nine patients with CKD at risk of ESKD were selected as a test cohort. We considered the risk of ESKD at 2 years to be of greater clinical significance for RRT planning. Therefore, 2 years of clinical data was utilized by ML algorithm to provide a probability of reaching ESKD within the next 2 years. Six consultant nephrologists were given baseline demographics and diagnosis, medical comorbidities, proteinuria, blood pressure, and eGFR measurements over 2 years for the same cohort. The nephrologists predicted prospectively for each individual, within a 10-year window, the number of years (to the nearest halfyear) it would likely take for the patient to reach ESKD. The ML algorithm and clinician predictions were compared to the outcomes in the 49 patients at the 2-year mark (the time of  writing). The variability in the clinician predicted ESKD date for each patient was high (Figure 2A). As expected each clinician appeared to have a systematic bias, predicting longer or shorter times until ESKD was reached for the same cohort ( Figure 2B). Variation was also high in the number of predictions made, with some clinicians predicting that all patients will reach ESKD, while others did not. The pairwise-correlations in the predicted dates of each clinician with correlation-coefficients ranging from 0.257 to 0.757 (Supplementary Figure 4). Clinician performance compared to the predictive model was poor ( Table 3). The ML algorithm demonstrated a 2-year predictive accuracy of 93.9%, with sensitivity of 60%, specificity of 97.7% and precision (or positive predictive value) of 75%. This was superior to all clinicians, with the best performance being Clinician 4, who had an accuracy of 79.6% (sensitivity 80%, specificity 79.5%, precision 30.8%). All other clinicians performed significantly worse, primarily due to higher false positive counts (Table 3).
Cumulatively, this data suggests that trained predictive ML modeling may assist nephrologists in making predictions of ESKD.

Comparison of ML Algorithm Against Kidney Failure Risk Equations
The Kidney Failure Risk Equations (KFRE) (26) have been extensively used and validated to predict ESKD within 2 and 5 years. We used the updated 4-and 8-variable KFRE (13) to predict ESKD within 2 years for the test cohort of 49 patients, however as these equations rely on serum albumin tests, 3 patients had to be excluded ( Table 4). The 4-variable KFRE (91.3% accuracy, 25% sensitivity, 97.6% specificity, 50% precision) performed better than the 8-variable KFRE (89.1% accuracy, 0% sensitivity, 97.6% specificity, 0% precision), which was consistent with validation studies (27), but both were inferior to our model.

Other Metrics of Classification Performance
Like many other clinical predictive settings, one of the main challenges in building a reliable model was its class imbalance, e.g., in the clinician test cohort, only 5 out of the 49 patients actually reached ESKD, and even a null-predictor that predicts that none would reach ESKD would achieve a misleading accuracy of 89%. Commonly, F1 score (the harmonic mean of precision and recall) has been used to provide a fairer measure of predictor performance, but recently the Matthews correlation coefficient has been shown (28) to be better than both F1 and accuracy as a binary classification metric. Revisiting our results (Tables 3, 4) shows that our model performs better at both F1 (0.667) and MCC (0.638) than the best of 6 clinicians (F1 = 0.444, MCC = 0.408) and the better of the KFRE predictors (F1 = 0.333, MCC = 0.271).

DISCUSSION
The prediction of hard outcomes of CKD such as death and ESKD is often inaccurate, especially progressing toward RRT (29,30). Here we developed a predictive machine learning model that in our optimized model, comprising values from 19 dynamic measures including two fully-featurised time-series (eGFR and blood glucose), is able to predict the incidence of ESKD within an 8-year timeframe with an accuracy of 86.2%.
In a trial cohort of forty-nine patients assessed by six clinicians, the model was retrained to predict ESKD within a 2-year timeframe. The model proved to be more accurate and precise than all clinicians, however sensitivity was lower than some clinicians. Our model was trained from prospectively collected clinical data, and we showed that the 25 most predictive measures overlap with existing recognized clinical risk factors and recapitulate accepted clinical thresholds. For example, the inflection point for eGFR values is ∼60 mL/min/1.73 m 2 (stage 3A CKD) below which SHAP values increase sharply (31) and the inflection point of blood glucose (7 mmol/L) associated with risk of CKD progression is also accepted diabetic glucose treatment targets (32). Importantly, some measurements of accepted risk factor values (serum potassium, bicarbonate and uric acid) were not necessary or not included in our trained predictive models. This may indicate they closely replicate other information and trends already included in the models, or that fluctuations in these values are only weakly correlated with progression to ESKD.
The superior performance of the machine learning model compared to assessment by six experienced renal physicians leads supports the recognized variability of clinician performance. On the short-term outcome of the occurrence of ESKD within 2 years, clinician performance ranged from an accuracy of 47 to 81%.
Risk factors for ESKD have been extensively investigated in clinical studies and elevated blood pressure, proteinuria and kidney function are closely linked to ESKD (33). The description of risk factors, however, does not translate into an accurate assessment of future risk for an individual patient by clinicians, especially when accounting for the synergistic effects of these risk factors. Increasingly vast amounts of data accrued through medical records are difficult for clinicians to absorb, integrate and analyse into meaningful predictions.
We have purposefully restricted our analysis to factors known to be associated with an increased risk of ESKD to demonstrate the feasibility of using ML in a clinical setting to improve upon clinical prediction of hard endpoints. ML has the advantage of the ability to examine the features of the data that are most influential in the risk prediction. In our current model eGFR and glucose appeared to the be most influential. It is encouraging that specific parameters of risk factors appear to be linked to ESKD risk, such as diastolic blood pressure >75 mmHg and blood glucose over 7 mmol/L. This would suggest that the ML approach is able to detect appropriate clinical parameters.
The ML approach is also able to identify features that are not conventional risk factors for ESKD. In our current analysis, a sitting minimum heart rate below 75 bpm and an overall minimum heart rate below 85 bpm appears to be potentially protective, likely reflecting superior underlying cardiovascular health. This suggests ML may identify potential new clinical indicators of ESKD risk. Currently we have excluded other potential risk factors such as serum potassium, bicarbonate and uric acid, each of which have data suggesting they may be associated with risk of CKD progression.
Therefore, we conclude that unbiased ML modeling is capable of integrating large amounts of data points, establishing a predictive model that agnostically identifies classic cardiovascular ESKD risk factors to create robust predictions of ESKD. This forms the basis for improving prediction of ESKD to assist with RRT planning and prognostication, as well-potential identification of novel risk factors for ESKD.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by ACT Health Research Ethics and Governance Office. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.