Edited by: Belén Rodriguez-Sanchez, Gregorio Marañón Hospital, Spain
Reviewed by: Axel Nierhaus, University of Hamburg, Germany; Gilbert Greub, University of Lausanne, Switzerland
This article was submitted to Infectious Diseases – Surveillance, Prevention and Treatment, a section of the journal Frontiers in Medicine
†These authors have contributed equally to this work
‡These authors jointly directed this work
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Sepsis is a life-threatening organ dysfunction triggered by a dysregulated host response to infection (
In addition to the conventional approaches,
Considering the rapid pace at which the research in this field is moving forward, it is important to summarize and critically assess the state of the art. Thus, the aim of this review was to provide a comprehensive overview of the current state of machine learning models that have been employed in the search for digital biomarkers to aid the early prediction of sepsis in the intensive care unit (ICU). To this end, we systematically reviewed the literature and performed a quality assessment of all eligible studies. Based on our findings, we also provide some recommendations for forthcoming studies that plan to use machine learning models for the early prediction of sepsis.
The study protocol was registered with and approved by the international prospective register of systematic reviews (PROSPERO) before the start of the study (registration number: CRD42020200133). We followed the Preferred Reporting Items for Systematic reviews and Meta-Analysis (PRISMA) statement (
Five bibliographic databases were systematically searched, i.e., EMBASE, Google Scholar, PubMed/Medline, Scopus, and Web of Science, using the time range from their respective inception dates to July 20, 2020. Google Scholar was searched using the tool “Publish or Perish” (version 7.23.2852.7498) (
Two investigators (MM and CRJ) independently screened the titles, abstracts, and full texts retrieved from Google Scholar in order to determine the eligibility of the studies. Google Scholar was selected by virtue of its promise of an inclusive query that also captures conference proceedings, which are highly relevant to the field of machine learning but not necessarily indexed by other databases. In a second step, two investigators (MM and MH) queried EMBASE, PubMed, Scopus, and Web of Science for additional studies. Eligibility criteria were also applied to the full-text articles during the final selection. In case multiple articles reported on a single study, the article that provided the most data and details was selected for further synthesis. We quantified the inter-rater agreement for study selection using Cohen's kappa (κ) coefficient (
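Cohen's kappa corrects raw percent agreement for the agreement two raters would reach by chance alone. A minimal, self-contained sketch of the statistic (the reviewer decisions below are invented for illustration and do not reflect the actual screening data):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n                # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] / n * cb[l] / n for l in set(a) | set(b))  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical screening decisions of two reviewers (1 = include, 0 = exclude)
reviewer_1 = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
reviewer_2 = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(reviewer_1, reviewer_2), 2))  # -> 0.6
```

Here the raters agree on 8 of 10 items (p_o = 0.8) but would agree on half by chance (p_e = 0.5), giving κ = 0.6, i.e., "moderate" to "substantial" agreement on common scales.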
All full-text, peer-reviewed articles
The following information was extracted from all studies: (i) publication characteristics (first author's last name, publication time), (ii) study design (retrospective, prospective data collection and analysis), (iii) cohort selection (sex, age, prevalence of sepsis), (iv) model selection (machine learning algorithm, platforms, software, packages, and parameters), (v) specifics on the data analyzed (type of data, number of variables), (vi) statistics for model performance (methods to evaluate the model, mean, measure of variance, handling of missing data), and (vii) methods to avoid overfitting as well as any additional external validation strategies. If available, we also reviewed supplementary materials of each study. A full list of extracted variables is provided in
Owing to its time sensitivity, setting up the early sepsis prediction task in a clinically meaningful manner is a non-trivial issue. We extracted details on the prediction task as well as the alignment of cases and controls. Given the lack of standardized reporting, the implementation strategies and their reporting vary drastically between studies. Thus, after gathering all the information, we attempted to create new categories for the sepsis prediction task as well as the case–control alignment. The goal of this new terminology and these categories is to increase the comparability between studies.
Based on 14 criteria relevant to the objectives of the review, which we adapted from Qiao (
The funding sources of the study had no role in study design, data collection, data analysis, data interpretation, or writing of the report. The corresponding author had full access to all the data in the study and had final responsibility for the decision to submit for publication.
The results of the literature search, including the numbers of studies screened, assessments for eligibility, and articles reviewed (with reasons for exclusions at each stage), are presented in
PRISMA flowchart of the search strategy. A total of 22 studies were eligible for the literature review and 21 for the quality assessment.
Overview of included studies.
1 | Abromavičius et al. ( |
Emory University Hospital, MIMIC-III | Sepsis-3 (with modified time windows) | 2,932 | 7.3 | Yes | No | No | AdaBoost and Discriminant Subspace Learning | – | – | No | Demographics, labs, vitals | 11 |
2 | Barton et al. ( |
MIMIC-III, UCSF | Sepsis-3 | 3,673 | 3.3 | No | No | No | XGBoost | 0.88 | 0 | No | Vitals | 6 |
3 | Bloch et al. ( |
RMC | Sepsis-2 related | 300 | 50.0 | No | No | No | Neural Networks, SVM, logistic regression | 0.88 | 4 | No | Vitals | 4 |
4 | Calvert et al. ( |
MIMIC-II | Sepsis-2 related | 159 | 11.4 | No | No | No | InSight Algorithm | 0.92 | 3 | No | Demographics, labs, vitals | 9 |
5 | Desautels et al. ( |
MIMIC-III | Sepsis-3 | 1,840 | 9.7 | No | No | No | InSight Algorithm | 0.88 | 0 | No | Demographics, vitals | 8 |
6 | Futoma et al. ( |
Duke University Health System | Sepsis-2 related | 11,064 | 21.4 | No | No | No | MGP-RNN | 0.91 | 0 | No | Comorbidities, demographics, labs, medications, vitals | 77 |
7 | Kaji et al. ( |
MIMIC-III | Sepsis-2 related | 36,176 | 63.6 | Yes | Yes | Yes | LSTM | 0.88 | “Next day” | No | Demographics, labs, medications, vitals | 119 |
8 | Kam and Kim ( |
MIMIC-II | Sepsis-2 related | 360 | 6.2 | No | No | No | SepLSTM | 0.99 | 0 | No | Demographics, labs, vitals | 9 |
9 | Lauritsen et al. ( |
Danish EHR | Sepsis-2 related | – | – | No | No | No | CNN-LSTM | 0.88 | 0.25 | No | Diagnoses, labs, imaging, medications, vitals, procedures | – |
10 | Lukaszewski et al. ( |
Queen Alexandra Hospital | Sepsis-2 related | 25 | 53.2 | No | No | No | MLP | – | – | No | Clinical parameters, cytokine mRNA expression | – |
11 | Mao et al. ( |
MIMIC-III, UCSF | Sepsis-2 related | 1,965 | 9.1 | Yes | No | No | InSight Algorithm | 0.92 | 0 | Yes | Vitals | 30 |
12 | McCoy and Das ( |
CRMC | Sepsis-3, Severe Sepsis | 407 | 24.4 | No | No | No | InSight Algorithm | 0.91 | – | – | Labs, vitals | – |
13 | Moor et al. ( |
MIMIC-III | Sepsis-3 | 570 | 9.2 | Yes | Yes | Yes | MGP-TCN | 0.91 | 0 | No | Labs, vitals | 44 |
14 | Nemati et al. ( |
Emory Healthcare system, MIMIC-III | Sepsis-3 (modified time windows) | 2,375 | 8.6 | No | No | No | Weibull-Cox proportional hazards model | 0.85 | 4 | Yes | Demographics, vitals | 48 |
15 | Reyna et al. ( |
Emory University Hospital, MIMIC-III | Sepsis-3 (modified time windows) | 2,932 | 7.3 | Yes | No | No | – | – | – | Yes | Demographics, labs, vitals | 40 |
16 | Schamoni et al. ( |
University Medical Centre Mannheim | Sepsis tag by ICU clinicians | 200 | 32.3 | No | No | No | Non-linear ordinal regression | 0.84 | 4 | No | Comorbidities, demographics, labs, vitals | 55 |
17 | Scherpf et al. ( |
MIMIC-III | Sepsis-2 related | 2,724 | 7.7 | No | No | No | RNN-GRU | 0.81 | 3 | No | Labs, vitals | 10 |
18 | Shashikumar et al. ( |
Emory Healthcare system | Sepsis-3 | 242 | 22.0 | No | No | No | ElasticNet | 0.78 | 4 | No | Comorbidities, clinical context, demographics, vitals | 17 |
19 | Shashikumar et al. ( |
Emory Healthcare system | Sepsis-3 | 100 | 40.0 | No | No | No | SVM | 0.8 | 4 | No | Demographics, comorbidity, clinical context, vitals | 2 |
20 | Sheetrit et al. ( |
MIMIC-III | Sepsis-2 related | 1,034 | 41.4 | No | No | No | Temporal Probabilistic Profiles | – | – | No | Demographics, labs, vitals | – |
21 | van Wyk et al. ( |
MLH System | Sepsis-2 related | – | 50.0 | No | No | No | Random Forests, RNN | – | – | No | Labs, vitals | 7 |
22 | van Wyk et al. ( |
MLH System | Sepsis-2 related | 377 | 50.0 | No | No | No | Random Forests | 0.79 | 0 | No | Vitals | 7 |
Of the 22 included studies, 21 employed solely retrospective analyses, while one study used both retrospective and prospective analyses (
A boxplot of the sepsis prevalence distribution of all studies, with the median prevalence being highlighted in
As shown in
Approximately 80% of the studies employed one type of cross-validation (e.g., 5-fold, 10-fold, or leave-one-out cross-validation) to avoid overfitting. Additional validation of the models on out-of-distribution ICU data (i.e., external validation) was only performed in three studies (
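For reference, k-fold cross-validation partitions the cohort into k disjoint folds, repeatedly trains on k−1 of them, and evaluates on the held-out fold. A minimal stdlib sketch (not taken from any reviewed study; the fold count and the patient-level split are our assumptions):

```python
import random

def k_fold_indices(n_patients, k, seed=0):
    """Partition patient indices into k shuffled, near-equal folds.
    Splitting at the patient level (rather than the observation level)
    avoids leaking data from the same ICU stay into both train and test."""
    idx = list(range(n_patients))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(n_patients=100, k=5)
for test_fold in folds:
    held_out = set(test_fold)
    train = [i for i in range(100) if i not in held_out]
    # ... fit the model on `train`, evaluate on `test_fold` ...
```

Note that cross-validation on a single dataset only estimates in-distribution generalization; it is not a substitute for the external validation on out-of-distribution ICU data discussed above.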
In this review, we identified two main approaches to implementing sepsis prediction tasks on ICU data. The most frequent setting (
An overview of experimental details: the used sepsis definition, the exact prediction task, and which type of temporal case–control alignment was used (if any).
1 | Abromavičius et al. ( |
Online training, online evaluation | Sepsis-3 (with modified time windows) | – | – |
2 | Barton et al. ( |
Offline training, horizon evaluation | Sepsis-3 | Random onset matching | Inpatients, age ≥18 years, at least one observation per measurement, prediction times between 7 and 2,000 h |
3 | Bloch et al. ( |
Offline training, horizon evaluation | Sepsis-2 related: SIRS criteria plus diagnosis of infection | Random onset matching (at least 12 h after admission to the ICU) | age >18 years, admitted to ICU; minimum stay of 12 h in the ICU; patients did not meet SIRS criteria at time of admission to the ICU; Continuous documented measurements were available for at least 12 h for vital signs |
4 | Calvert et al. ( |
Offline training, horizon evaluation | Sepsis-2 related: ICD-9 code 995.9 and a 5-h persisting window of fulfilled SIRS | – | Medical ICU, age >18 years, SIRS not fulfilled upon admission, measurements for set of nine variables available |
5 | Desautels et al. ( |
Offline training, horizon evaluation, but retrained for each prediction horizon | Sepsis-3 | – | Age ≥15 years, any measurements present, Metavision logging, for cases: sepsis onset between 7 and 500 h after ICU admission, all variables at least once measured, excluded patients that received antibiotics before ICU |
6 | Futoma et al. ( |
Offline training, horizon evaluation | Sepsis-2 related: SIRS fulfilled and blood culture drawn and 1 abnormal vital (time windows not stated) | Relative onset matching | Entire EHR cohort included |
7 | Kaji et al. ( |
Offline training, horizon evaluation | Sepsis-2 related: SIRS criteria plus ICD-9 code consistent with infection | Fixed length of 14 days in ICU (truncation if longer, zero filling, and masking if shorter) | Individual patient ICU admissions 2 days or longer were identified |
8 | Kam and Kim ( |
Offline training, horizon evaluation | Sepsis-2 related: ICD-9 code 995.9 and the first 5-h persisting window of fulfilled SIRS | Insufficient detail: during training, 5-h windows are randomly extracted from the case before sepsis onset and from the entire control stay; during testing, it is not stated which data are used for controls | Medical ICU, age >18 years, patient can be checked for 5-h SIRS window plus ICD-9 995.9 code (if only one of the two was available, patients were excluded) |
9 | Lauritsen et al. ( |
Offline training, horizon evaluation | Sepsis-2 related: SIRS criteria plus clinically suspected infection | Random onset matching (excluding the first and last 3 h) | Inpatients, admissions ≥3 h, hospital departments with sepsis prevalence ≥2%, ≥1 observations for each vital sign measurement |
10 | Lukaszewski et al. ( |
Offline training, offline evaluation (fixed 24-h horizon) | Sepsis-2 related: SIRS criteria plus positive microbiological culture | Insufficient detail (but age-matching between cases and controls; healthy volunteers used as controls) | Blood samples taken daily; last sample on day of diagnosis or last stay in ICU |
11 | Mao et al. ( |
Offline training, offline evaluation (single fixed 4-h horizon) | Sepsis-2 related (suspected infection and first hour of fulfilled SIRS criteria), Severe Sepsis: ICD-9 plus SIRS plus organ dysfunction criteria; Septic Shock: ICD-9 plus manually defined conditions | – | Inpatients, age ≥18 years, ≥1 observations for each vital sign measurement, prediction time between 7 and 2,000 h |
12 | McCoy and Das ( |
Offline training, evaluation on retrospective dataset, prospective evaluation implemented as risk score | Sepsis-3, Severe Sepsis (SIRS criteria plus 2 organ dysfunction lab values) | – | Age >18 years; two or more SIRS criteria during stay (unclear whether “Patient encounters were included in the sepsis-related outcome metrics if they met two or more SIRS criteria at some point during their stay.” is an inclusion criterion or the label definition) |
13 | Moor et al. ( |
Offline training, horizon evaluation | Sepsis-3 | Absolute onset matching | Age ≥15 years, chart data including ICU admission/discharge time available, Metavision logging, cases: onset at least 7 h into ICU stay |
14 | Nemati et al. ( |
Offline training, horizon evaluation | Sepsis-3 (with modified time windows) | – | Age ≥18 years; sepsis onset not earlier than 4 h within ICU admission |
15 | Reyna et al. ( |
Online training, online evaluation | Sepsis-3 (with modified time windows) | – | ≥8 h of measurements |
16 | Schamoni et al. ( |
Offline training, horizon evaluation as well as prediction of severity (ordinal regression) | Sepsis tag by ICU clinicians via electronic questionnaire | – | Sepsis onset not earlier than on the second day after ICU admission |
17 | Scherpf et al. ( |
Offline training, horizon evaluation | Sepsis-2 related: ICD-9 codes plus SIRS criteria | Random onset matching via drawing fixed size time windows | Age ≥18 years, at least one measurement for SIRS parameters, no sepsis on admission, at least 5 h plus prediction time of measurements |
18 | Shashikumar et al. ( |
Offline training, offline prediction (single fixed 4-h horizon) | Sepsis-3 | – | – |
19 | Shashikumar et al. ( |
Offline training, offline prediction (single fixed 4-h horizon) | Sepsis-3 | – | – |
20 | Sheetrit et al. ( |
Offline training, horizon evaluation on two prediction windows (12 and 1 h) | Sepsis-2 related: ICD-9 Codes 995.91 or 995.92 plus antibiotics administered. Onset time is defined as the earliest of either antibiotics prescription or fulfilled qSOFA criteria | Insufficient detail: the paper uses the “equivalent time” as the feature window of the control group | ICU admission, age ≥15 years, for sepsis cases: onset not before third day |
21 | van Wyk et al. ( |
Offline training, horizon evaluation | Sepsis-2 related: SIRS criteria plus suspicion of infection, indicated by the presence of a blood culture and the administration of antibiotics during the encounter, along with relevant ICD10 | Insufficient detail: the paper uses “a given 6-h observational period” for the control group | At least 8 h of continuous data, absence of cardiovascular disease |
22 | van Wyk et al. ( |
Offline training, horizon evaluation | Sepsis-2 related: SIRS criteria plus suspicion of infection, indicated by the presence of a blood culture and the administration of antibiotics during the encounter, along with relevant ICD10 | Insufficient detail: the paper uses “a given 3-h observational period” for the control group | Age >18 years, physiological data available for at least 3 or 6 h, respectively; absence of cardiovascular disease |
Online training and evaluation scenario. Here, the model predicts at regular intervals during an ICU stay (we show predictions in 1-h intervals). For sepsis cases, there is no prima facie notion of the point in time at which positive predictions ought to be considered true positive (TP) or false positive (FP) predictions (mutatis mutandis, this applies to negative predictions). For illustrative purposes, here we consider positive predictions up until 1 h before or after sepsis onset (for a case) to be TP.
Selecting the “onset” for controls (i.e., case–control alignment) is a crucial step in the development of models predicting the onset of sepsis (
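Under random onset matching, each control is assigned a pseudo-onset drawn from its own stay, so that feature windows for controls are extracted in the same way as for cases. A minimal sketch of one way this could be implemented (the function name and the minimum-offset constraint are hypothetical, loosely mirroring the minimum-onset inclusion criteria several reviewed studies apply to cases):

```python
import random

def random_matched_onset(admission_h, discharge_h, min_offset_h=7.0, rng=None):
    """Draw a pseudo-onset for a control patient, uniformly over the stay,
    at least `min_offset_h` hours after ICU admission (times in hours)."""
    rng = rng or random.Random(0)
    earliest = admission_h + min_offset_h
    if earliest >= discharge_h:
        return None  # stay too short to assign a matched onset
    return rng.uniform(earliest, discharge_h)
```

Absolute onset matching would instead copy the actual onset time of a matched case to the control; either way, reporting the exact procedure (ideally as code) is what makes the resulting prediction task reproducible.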
The results of the quality assessment are shown in
Quality assessment of all studies.
1 | Abromavičius et al. ( |
✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | 50% |
2 | Barton et al. ( |
✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | 57% |
3 | Bloch et al. ( |
✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | 71% |
4 | Calvert et al. ( |
✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | 43% |
5 | Desautels et al. ( |
✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ | 50% |
6 | Futoma et al. ( |
✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | 50% |
7 | Kaji et al. ( |
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | 93% |
8 | Kam and Kim ( |
✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | 36% |
9 | Lauritsen et al. ( |
✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ | 57% |
10 | Lukaszewski et al. ( |
✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ | 43% |
11 | Mao et al. ( |
✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | 64% |
12 | McCoy and Das ( |
✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | 36% |
13 | Moor et al. ( |
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | 93% |
14 | Nemati et al. ( |
✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | 50% |
15 | Schamoni et al. ( |
✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | 57% |
16 | Scherpf et al. ( |
✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | 43% |
17 | Shashikumar et al. ( |
✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | 50% |
18 | Shashikumar et al. ( |
✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | 50% |
19 | Sheetrit et al. ( |
✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | 43% |
20 | van Wyk et al. ( |
✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | 36% |
21 | van Wyk et al. ( |
✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | 43% |
100% | 95% | 19% | 81% | 10% | 10% | 19% | 29% | 95% | 81% | 62% | 14% | 38% | 86% | |||
Study | Unmet need | Reproducibility | Stability | Generalizability | Clinical significance | Total |
In this study, we systematically reviewed the literature for studies employing machine learning algorithms to facilitate early prediction of sepsis. A total of 22 studies were deemed eligible for the review and 21 were included in the quality assessment. The majority of the studies used data from the MIMIC-III database (
While initial studies employing machine learning for the prediction of sepsis have demonstrated promising results (
Concerning the comparability of the reviewed studies, we note that there are several challenges that have yet to be overcome, namely the choice of (i) prediction task, (ii) case–control onset matching, (iii) sepsis definition, (iv) implementation of a given sepsis definition, and (v) performance measures. We subsequently discuss each of these challenges.
As described in section 3.5, we found that the vast majority of the included papers follow one of two major approaches when implementing the sepsis onset prediction task: Either an offline training step was followed by a horizon evaluation, or both the training and the evaluation were conducted in an online fashion. As one of our core findings, we next highlight the strengths but also the intricacies of these two setups. Considering the most frequently used strategy, i.e., offline training plus horizon evaluation, we found that the horizon evaluation provides valuable information about how early (in hours before sepsis onset) the machine learning model is able to recognize sepsis. However, in order to train such a classifier, the choice of a meaningful time window (and matched onset) for controls is an essential aspect of the study design (for more details, please refer to section 4.2.2). By contrast, the online strategy does not require a matched onset for controls (see
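The horizon evaluation described above can be made concrete with a small sketch: features are drawn from a window that ends a fixed number of hours before the (matched) onset, and the model is evaluated at each horizon. All names, values, and window sizes below are illustrative and not taken from any specific study:

```python
def horizon_window(measurements, onset_h, horizon_h, window_h):
    """Return the measurements falling in a feature window of length
    `window_h` hours that ends `horizon_h` hours before (pseudo-)onset.
    `measurements` is a list of (time_in_hours, value) pairs."""
    end = onset_h - horizon_h
    start = end - window_h
    return [(t, v) for t, v in measurements if start <= t < end]

# Hourly heart-rate samples (illustrative), onset at hour 12,
# 4-h prediction horizon, 6-h feature window -> data from hours [2, 8):
hr = [(0, 80), (3, 84), (6, 90), (9, 97)]
print(horizon_window(hr, onset_h=12, horizon_h=4, window_h=6))  # -> [(3, 84), (6, 90)]
```

For controls, `onset_h` would be the matched onset, which is precisely why the case–control alignment strategy directly shapes the difficulty of the task.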
Futoma et al. (
A heterogeneous set of existing definitions (and modifications thereof) was implemented in the reviewed studies. The choice of sepsis definition will affect studies in terms of the prevalence of patients with sepsis and the level of difficulty of the prediction task (due to assigning earlier or later sepsis onset times). We note that it remains challenging to fully disentangle all of these factors: on the one hand, a larger absolute count of septic patients is expected to be beneficial for training machine learning models (in particular deep neural networks). On the other hand, including more patients could make the resulting sepsis cohort a less severe one and harder to distinguish from non-septic ICU patients. Then again, a more inclusive sepsis labeling would result in a higher prevalence (i.e., class balance), which would be beneficial for the training stability of machine learning models. To further illustrate the difficulty of defining sepsis, consider the prediction target
Another factor exacerbating comparability is the heterogeneous sepsis prevalence. This is partially influenced by the training setup of a given study, because certain studies prefer balanced datasets for improving the training stability of the machine learning model (
A boxplot of the number of sepsis encounters reported by all studies, with the median number of encounters being highlighted in
The last obstacle impeding comparability is the choice of performance measures. This is entangled with the differences in sepsis prevalence: simple metrics, such as accuracy, are directly impacted by class prevalence, rendering a comparison of two studies with different prevalence values moot. Some studies report the area under the receiver operating characteristic curve (AUROC, sometimes also reported as AUC). However, AUROC is known to be less informative if the classes are highly imbalanced (
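Why the precision–recall view is more sensitive to class imbalance can be illustrated with a short calculation: a classifier operating at a fixed ROC operating point (fixed sensitivity and false-positive rate) yields very different precision depending on prevalence. The numbers below are illustrative back-of-the-envelope values, not results from any reviewed study:

```python
def precision_at_operating_point(tpr, fpr, prevalence):
    """Expected precision of a classifier with fixed sensitivity (TPR)
    and false-positive rate (FPR) applied to a cohort with the given
    prevalence of positives."""
    tp = tpr * prevalence           # expected true-positive fraction
    fp = fpr * (1 - prevalence)     # expected false-positive fraction
    return tp / (tp + fp)

# Identical ROC operating point (TPR = 0.8, FPR = 0.1):
print(round(precision_at_operating_point(0.8, 0.1, 0.50), 2))  # balanced cohort -> 0.89
print(round(precision_at_operating_point(0.8, 0.1, 0.07), 2))  # ~7% prevalence -> 0.38
```

The ROC curve (and hence AUROC) is identical in both scenarios, yet in the low-prevalence cohort most alarms are false, which is exactly what precision-based metrics such as AUPRC expose.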
Our findings indicate that quantitatively comparing studies concerned with machine learning for the prediction of sepsis in the ICU is currently a nigh-impossible task. While one would like to perform meta-analyses in these contexts to aggregate an overall trend in performance among state-of-the-art models, at the current stage of the literature this would carry little meaning. Therefore, we currently cannot ascertain the best performing approaches by merely assessing numeric results of performance measures. Rather, we had to resort to
Reproducibility, i.e., the capability of obtaining similar or identical results by independently repeating the experiments described in a study, is the foundation of scientific accountability. In recent years, this foundation has been shaken by the discovery of failures to reproduce prominent studies in several disciplines (
Considering that the exact sepsis onset is usually unknown, most of the existing works have approximated a plausible sepsis onset via clinical criteria, such as Sepsis-3 (
A limitation of this review is that our literature search was restricted to articles listed in Embase, Google Scholar, PubMed/Medline, Scopus, and Web of Science. Considering the pace at which the research in this area—in particular, in the context of machine learning—is moving forward, it is likely that the findings of the publications described in this paper will be quickly complemented by further research. The literature search also excluded gray literature (e.g., preprints and reports), the importance of which to this topic is unknown.
This section provides recommendations on how to harmonize the experimental designs and reporting of machine learning approaches for the early prediction of sepsis in the ICU. This harmonization is necessary to warrant meaningful comparability and reproducibility of different machine learning models, ensure continued model development as opposed to starting from scratch, and establish benchmark models that constitute the state of the art.
As outlined above, only few studies score highly with respect to reproducibility. This is concerning, as reproducibility remains one of the cornerstones of scientific progress (
As for the datasets used in a study, different rules apply. While some authors suggest that peer-reviewed publications should come with a waiver agreement for open-access data (
Moreover, we urge authors to report additional details of their experimental setup, specifically the selection of cases and controls and the label generation/calculation process. As outlined above, the case–control matching is crucial as it affects the difficulty (and thus the significance) of the prediction task. We suggest either following the absolute onset matching procedure (
This study performed a systematic review of publications discussing the early prediction of sepsis in the ICU by means of machine learning algorithms. Briefly, we found that the majority of the included papers investigating sepsis onset prediction in the ICU are based on data from the same center, MIMIC-II or MIMIC-III (
Make code publicly available or usable | A prerequisite of being able to replicate the results of any study, or to use any model in a comparative setting, is having access to the raw code or a binary variant thereof that was used to perform the experiments. Authors are encouraged to share their code, for example via platforms, such as GitHub, or their binaries using container technologies like Docker. | GitHub, Docker |
Use external validation for the machine learning model | External validation of a classifier is crucial for assessing the model's generalizability. Several publicly available data sources exist that can be used for this purpose. | MIMIC-II, MIMIC-III, eICU, HiRID |
Provide exact definition of sepsis label | Implementations vary drastically in terms of prevalence and number of sepsis encounters. Thus, reporting the label generation process is essential, particularly when labels deviate from the international definitions of sepsis. For instance, when using the eICU dataset, microbiology measurements are under-reported for defining suspected infection, yet the exact modifications of sepsis implementations have not explicitly been stated ( |
Provide code showing how the sepsis label was determined.
Provide a detailed description of a control and, if applicable, its matched onset | While there is a defined point in time for an event in the sepsis cohort, it is much more challenging to determine at what time to extract data for a control, for which there is no event. For transparency and replication reasons, it is crucial to provide details on how controls were defined and how their onset was determined. | Provide code showing how a control was defined and, if applicable, how its matched onset was determined. |
Make data available | If possible and in compliance with international data protection laws, data sources should be made accessible to bona fide researchers. There are multiple data repositories, which researchers can use to make their data accessible, while complying with data protection laws. | Harvard Dataverse, PhysioNet, Zenodo |
Ensure comparability of models and their performances | To advance the field, it is important that researchers compare their models to existing models in order to evaluate and compare the performance across different studies. This necessitates improvements in prevalence reporting as well as the choice of different performance metrics. | Report prevalence and AUPRC in addition to other metrics. |
Use licenses for code | Licenses protect the creators and the users of code. Numerous open source licenses exist, making it possible to satisfy the constraints of most authors, including companies that want to protect their intellectual property. | Apache license, BSD licenses, GPL |
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at:
MM, BR, and CJ contributed substantially to the data acquisition, extraction, analysis (i.e., quality assessment), and interpretation. Furthermore, they drafted the review article. MH made substantial contributions to data interpretation (i.e., quality assessment) and participated in revising the review article critically for important intellectual content. KB made significant contributions to the study conception and revised the review article critically for important intellectual content. All authors contributed to the article and approved the submitted version.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
This manuscript has been released as a preprint at
The Supplementary Material for this article can be found online at:
1This includes peer-reviewed journal articles and peer-reviewed conference proceedings.
2The dataset was not publicly available. However, with the 2019 PhysioNet Computing in Cardiology Challenge, a pre-processed dataset from Emory University Hospital has been published (
3In the machine learning community, for example, it is common practice to use preprints to disseminate knowledge about novel methods early on.