Systematic review of externally validated machine learning models for predicting acute kidney injury in general hospital patients

Acute kidney injury (AKI) is one of the most common and consequential complications among hospitalized patients. Timely AKI risk prediction may allow simple interventions that can minimize or avoid the harm associated with its development. Given the multifactorial and complex etiology of AKI, machine learning (ML) models may be best placed to process the available health data to generate accurate and timely predictions. Accordingly, we searched the literature for externally validated ML models developed from general hospital populations using the current definition of AKI. Of 889 studies screened, only three were retrieved that fit these criteria. While most models performed well and had a sound methodological approach, the main concerns relate to their development and validation in populations with limited diversity, comparable digital ecosystems, use of a vast number of predictor variables and over-reliance on an easily accessible biomarker of kidney injury. These are potentially critical limitations to their applicability in diverse socioeconomic and cultural settings, prompting a need for simpler, more transportable prediction models which can offer a competitive advantage over the current tools used to predict and diagnose AKI.


Introduction
An estimated one in five adults admitted to hospital develop acute kidney injury (AKI) (1).Acute kidney injury is associated with an increased risk of in-hospital death and long-term development of chronic kidney disease (CDK) and other non-kidney complications, such as HTN, cardiovascular disease and stroke (2).
Multiple prediction models and risk scores have been developed over the years to predict the occurrence of hospital-acquired AKI (HA-AKI) in order to improve short-and long-term management (3).In patients presenting to the emergency department (ED) these interventions may be as simple as performing a fluid assessment to optimize hydration status and avoiding nephrotoxic medications.AKI prediction models or risk scores have been developed for various populations, from undifferentiated general hospital patients to critically ill or post-operative patients.Accordingly, predictors used have ranged from general admission observations to laboratory data and specific procedural interventions.Acute kidney injury is most commonly defined using the Kidney Disease Improving Global Outcomes (KDIGO) criteria which uses the rise in serum creatinine or decrease in urine output (4), however, other definitions -such RIFLE, AKIN -have been used in AKI models.
Over the past decade as larger and more complex patient databases have become available, development of machine learning (ML) models for risk prediction has become increasingly popular.Compared to traditional regression models, ML models have the advantage of not being constrained by assumptions between variables and outcomes allowing more subtle relationships in the data to be explored.Additionally, their computational power means that a greater number of input variables and data points can be processed to make more accurate and individualized forecasts.
Past systematic reviews of prediction models in AKI have identified a high risk of bias in most studies resulting from variability in the use of performance measures (in particular calibration reporting), sparse interpretability considerations, possibility of overfitting, narrow scope for reproducibility and a lack of external validation limiting the generalizability and applicability of the models (3,5).Moreover, while AKI risk prediction is useful in sub-specialized patient populations, the majority of AKI occurs in the general wards, placing a premium on flexible risk prediction models that can perform well in these settings (6).
Hence, the research question of this systematic review is: what are the available machine learning models which can assist a health professional to quantify the risk of KDIGO-defined AKI during a hospital stay in patients admitted to a general hospital or clinic?Accordingly, the aim of this systematic review was to evaluate and critically appraise studies of models developed from general hospital populations, using the current KDIGO definition of AKI, and which have been externally validated.

Methods
This review was designed and conducted according to Preferred Reporting Items and Systematic Review and Meta-Analyses (PRISMA) Protocol guideline (7) and the Checklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS) (8).

Study identification
We used PubMed (pubmed.ncbi.nlm.nih.gov) and the Institute of Electrical and Electronic Engineers (IEEE) Xplore Digital library (https://ieeexplore.ieee.org/Xplore/home.jsp)for our search.In PubMed we searched titles or abstract with a combination of string and Mesh terms (Table S1).The search was conducted on July 13 th , 2022 and was limited to studies published in the last five years (up to July 13 th 2017).

Study inclusion and selection
We included studies that validated machine learning models, with or without initially developing those models, for the prediction of AKI (stage 1, 2, 3 or dialysis) in a general hospital population.We excluded studies using the following criteria: (i) studies in patients under 18 years of age; (ii) studies which used an outcome definition of AKI other than KDIGO; (iii) studies presenting models based on regression, such as linear, logistic or penalized regression, or those which are well understood theoretically and by clinicians, such as Naïve Bayes; (iv) studies presenting models with no external validation; (v) any comments or editorials or meta-analysis or systematic reviews.
We considered external validation to exist if the developed model was applied to new individuals, either from the same institution but a different time period (temporal validation), from a different institution or country (geographic validation) or on very different individuals (domain validation) (9).
Titles and abstracts were screened by two reviewers (MW and EE), and full articles were reviewed if they were eligible or if reviewers disagreed in the screening process.Disagreements were resolved by a third reviewer (SS).

Data extraction and critical appraisal
A data extraction form was created based on the Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS) checklist (8) and the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) (10).Items extracted included study type, number and location of centers recruited from, patient characteristics (AKI incidence, staging and mortality), model characteristics including prediction windows, predictors used, types of models evaluated, type of external validation performed, method used to calculate baseline serum creatinine and metrics used to evaluate the models among others.Risk of bias was assessed using the Prediction Model Risk of Bias Assessment Tool (PROBAST), which includes four domains and 20 signaling questions (11).Domains were rated as high risk, low risk or unclear risk.Models would need be low risk on all domains to be evaluated overall as low risk.It should be noted that this tool was created for traditional regression models and is currently being updated to include newer model methodology.Data extraction and appraisal were done by EE and MW.Disagreements were resolved through discussion with SS.

Results
Our search returned 889 studies.A majority of studies were excluded initially either because they did not present or validate a prediction model (this included commentaries, editorial letters, narrative reviews and systematic reviews), the subject was irrelevant and/or the population studied was not comprised of general hospital patients (Figure 1).Forty-three were selected for full text review either because they fit the inclusion criteria or because insufficient information was given in the title or abstract to decide on eligibility.A further 40 studies were excluded following full text review of which 16 were excluded for not having external validation.Of the remaining three studies, two developed and validated a model (1 and 3) and one validated a previously developed model ( 12) (study 2).A list of the types of models excluded based on criteria (iii) can be seen in Table S2.

Data source and population characteristics
All studies were from urban or suburban tertiary, referral centers in high income countries, two from the United States and one from South Korea (Table 1).Data was sourced either from the individual hospital electronic health records (EHR) (studies  All validation cohorts had close or over 50,000 patients, except for site 5 in study 3 which had 19,542.Gender distribution was even in all studies.The median or average age was close to 60 in most cohorts with study 3 exhibiting wider variability between the six sites.Black or African American patients comprised 50% of the internal validation cohort of study 2 and up to 20% in the remaining US study.Only study 3 reported the prevalence of Hispanic, Native American and Asian patients, who accounted for no more than 10% of all patients.Incidence of AKI was lowest in study 1 (3-6%), followed by study 2 (8.3 -14.3%), and study 3 (15.1%).

Outcome
Acute kidney injury was defined by the serum creatinine criteria of the KDIGO definition in all studies.Studies 1 and 3 used AKI of any grade as the primary outcome while study 2 predicted AKI stage 2 or higher (Table 2).All studies used a 48-hr prediction window except for study 1 that predicted AKI within seven days of the present.Prediction was continuous throughout a patient's admission in all studies with data binned into time windows between 6 to 24 hrs.Two of three studies (1 and 3) prioritized a pre-admission serum creatinine to represent the baseline value, followed by admission measurements (study 2) or imputation using the Modification of Diet in Renal Disease (MDRD).

General model characteristics Predictor data
The lowest number of predictor variables was 59 (study 2) and the highest was 1933 (study 3) (Table 2).Study 3 used all available variables in their health system or network EHR, while study 1 selected those known to be associated with AKI risk and study 2 did not specify the variable selection process.Most variable categories included demographic information, vital signs, laboratory and diagnostic data, comorbidities, medications, clinical conditions (usually labelled by ICD 9 or 10 codes) procedures or interventions as well as nursing documentation (study 2) and length of stay.All studies included sCr and blood urea nitrogen (BUN) as predictors but only study 3 evaluated model performance with and without the inclusion of

Type of prediction models and performance
In the external validation cohorts and for the prediction of stage 2 or higher AKI within 48 hrs, Study 2, which used a Gradient Boosting Model with discrete time survival framework (DS-GBM) had the best performing model (AUROC 0.85-0.86)followed by study 3 (AUROC 0.68 -0.81), which also used DS-GBM.Study 1, which used a Recurrent Neural Network (RNN) to predict AKI within 7 days had excellent performance (AUROC all AKI 0.84 and stage ≥ 2 AKI 0.90).Performance was compared to a GBM in study 1 and to a Least Absolute Shrinkage and Selection Operator (LASSO) model in study 3.
Beyond discrimination, studies evaluated models using calibration (1 and 2), sensitivity, specificity, positive and negative predictive value (1-3), area under the precision receiver curve (3) and F1 score (1).All studies performed some form of interpretability analysis to contextualize predictions which showed that serum creatinineeither as a baseline absolute value, in the form of change over time or integrated within an eGFR equation -was consistently the strongest predictor of future AKI risk.Further study and model information can be found in Data Sheet 1 of the Supplemental Material.

Type of external model validation
All three studies performed geographic external validation and study 3 also performed temporal validation using the later encounters from the source population.

Assessment of bias, applicability and reproducibility
All except for study 3 were found to have a high risk of bias in the participant and outcome domains.The former resulting from the use of retrospective EHRs as primary data source and the latter to the incorporation bias inherent in the inclusion of serum creatinine in the outcome definition.We considered the development of a second model without serum creatinine and BUN information in Study 3, as an attempt to mitigate this bias.The predictors and analysis domains were uniformly deemed as having low risk of bias (Table 3).Applicability was a concern for studies 1 and 3 that used more than 100 predictor variables.With regards to reproducibility, only study 3 shared code in a publicly accessible domain.Individual study PROBAST assessment forms can be found in the Supplemental Material.

Discussion
We screened 889 studies from the last five years in search of externally validated ML prediction models of AKI based on general hospital patients.Only three studies were retrieved that fit these criteria.While most performed well and had a sound methodological approach, the main concerns relate to their development and validation in populations with limited diversity, comparable digital ecosystems and use of large number of predictor variables.These are potentially critical limitations to their applicability in diverse socioeconomic and cultural settings, prompting a need for simpler and more transportable prediction models.In addition, all models rely heavily on serum creatinine, which questions their added utility and advantage over the current tools used to predict and diagnose AKI.
Of the 886 studies not included, a majority were excluded due to lack of external validation, use of sub-populations (mainly ICU and surgical cohorts) or predictions based on standard regression, parametric models.The finding of a growing number of prediction models generated on clinical sub-populations, often from publicly available datasets, without any external validation has been noted by previous systematic reviews (3,5).
Most models performed well, especially when predicting stage 2 or higher AKI, which is often regarded as the more meaningful outcome in the acute setting (16).Discrimination remained high between internal and external validation sites in studies 1 and 2 but varied widely across the 6 sites of study 3 which had the highest diversity in demographic profiles and EHR structures.The main barriers to independent validation of a prediction model include patient heterogeneity, clinical process variability, EHR configuration and data warehouse heterogeneity leading to noninteroperable databases across hospitals (15).Models should be ideally trained and validated on diverse populations from healthcare sites with a spectrum of maturity and sophistication in their digital ecosystems.There is little point in training a model with thousands of variables that it is unlikely to ever encounter when applied to a From applicability to implementation, there is a growing concern that much of the attention in the field of machine learning risk prediction has centered around model development and not on the transition to implementation and deployment (17).Providing a framework on the lifecycle management of a model is critical to its success as a clinical decision support tool.This includes information on how often a model should be re-validated, potentially re-trained and fine-tuned as well as how it should be integrated into clinical workflows (18,19).While this information may be available in the coming years, at present none of the studies specifically addressed this.It should also be said that the reluctance of most groups to share their development code, hinders efforts to make the models more reproducible and transportable.
The use of serum creatinine as a predictor and as part of the outcome definition is problematic.The latter results in incorporation bias whereby the association between the predictor and the outcome is overestimated rendering model performance evaluations optimistic (11).In reality, however, any model that uses the current serum creatinine based KDIGO definition of AKI and has access to repeated measures of serum creatinine, will suffer from this bias.With regards to the use of creatinine as a predictor, the interpretability analyses of all studies revealed that either the baseline serum creatinine and eGFR (which includes serum creatinine in its calculation) or the change in serum creatinine over time were among the top three heaviest contributors to the prediction.The question then becomes: what can these models offer that a single or trend in serum creatinine readings cannot?Study 3 trained a model without any information on serum creatinine or BUN which performed competitively with an AUROC of 0.75 for any AKI and 0.82 for stage 2 or higher AKI.The main predictors of AKI in this model were vancomycin exposure, blood pressure change, age, BMI, height, having a chest x-ray and receiving Tazosin antibiotic.It should be noted, however, that this model used almost 2000 predictor variables to arrive at this prediction.
Perhaps the real advantage of these models lies not in their capacity to process large and complex health data in order make timely predictions but in their potential for integration within clinical workflows.A health professional must know to order a serum creatinine test on a particular at-risk patient, they must then remember to check it once it has been reported and to act on it using evidence-based guidelines.A disruption to this process, which is time-consuming and often falls to the most junior medical staff in the hospital, may result in a missed opportunity to avoid or minimize the severity and consequences of AKI.Automating this process with a prediction algorithm that can seamlessly aggregate the relevant historical and admission information and present the clinician with a real-time risk profile for each patient on any given day, may go a long way towards decreasing the incidence of hospital-acquired AKI.It remains to be seen, however, whether the same can be achieved with more comprehensive education of junior doctors and clearer accountability structures within healthcare teams.
The main limitation of our study is the narrow search and selection criteria.In wanting to explore only machine learning as a novel technology in AKI risk prediction we have not compared or evaluated the more traditional regression models and risk scores, such as linear and logistic regression or Naïve Bayes classifier.It would be prudent to say that, at this stage, the benefits of ML in AKI are not clear.Relying on more explainable, less complex statistical models while this technology matures, may be the wiser approach.In addition, by selecting only models that had been externally validated we may have missed studies with sound methodology that were in the early stages of validation.
The use of ML models for AKI prediction requires large and complex data structures to support them, have potentially limited applicability beyond the health systems they were trained on and, at present, rely heavily on a cheap and readily accessible biomarker of kidney injury with little apparent benefit from the other predictors.
In the future, simpler and more transportable models that offer a significant predictive advantage to using serum creatinine measurements may find greater clinical utility and acceptance within the healthcare community.

1 and 2 )
or from clinical research networks, such as the Greater Plains Collaborative (GPC) (study 3).Studies included hospitalized patients with at least two documented serum creatinine measures and/or a minimum of 48 hours of admission (studies 1 and 3) Patients with severe forms of AKI or on kidney replacement therapy (KRT) were uniformly excluded (Data Sheet 1).Patient data was variably collected between 2006 and 2018.The experimental design used to develop and/or validate model predictions varied between studies.Study 1 split the population of one hospital into training and internal validation cohorts, and the population of the second hospital, from the same city, was used for external validation.Study 2 was based on a previously developed model which was internally validated in the original hospital population and externally validated in a four-hospital health care

TABLE 1
Study population characteristics.

TABLE 2
Prediction model characteristics and results.

TABLE 2 Continued
Hence, the concerns of applicability in most of the presented studies.To address this, Song et al. (study 3) explored the source of performance variability between the original and target sites and proposed a metric for model transportability.Interestingly, the study brings to light how differences in data representation and format between individual EHRs can have a significant impact on what a model learns and how it performs (15).Overall, this would suggest that a model which uses fewer and more commonly available clinical variables, like the one from study 2, is likely to have better uptake and be more easily validated.

TABLE 3
Traffic light plot of risk of bias assessment., risk of bias; ?, unclear risk.Red = high risk, Green = low risk, Yellow = Unclear risk. ROB