Meta-Analysis for the Prediction of Mortality Rates in a Pediatric Intensive Care Unit Using Different Scores: PRISM-III/IV, PIM-3, and PELOD-2

Introduction: The risk of mortality is high in pediatric intensive care units (PICU). To prevent mortality in critically ill infants, optimal clinical management and risk stratification are required. Aims and Objectives: To assess the accuracy of PELOD-2, PIM-3, and PRISM-III/IV scores to predict outcomes in pediatric patients. Results: A total of 29 studies were included for quantitative synthesis in meta-analysis. PRISM-III/IV scoring showed a pooled sensitivity of 0.78 (95% CI: 0.72–0.83) and a pooled specificity of 0.75 (95% CI: 0.68–0.81), with 84% discrimination performance (SROC 0.84, 95% CI: 0.80–0.87). For PIM-3, a pooled sensitivity of 0.75 (95% CI: 0.71–0.79) and a pooled specificity of 0.76 (95% CI: 0.73–0.79) were observed, with good discrimination power (SROC 0.82, 95% CI: 0.78–0.85). The PELOD-2 scoring system had a pooled sensitivity of 0.78 (95% CI: 0.71–0.83) and a pooled specificity of 0.75 (95% CI: 0.68–0.81), as well as good discriminating ability (SROC 0.83, 95% CI: 0.80–0.86) for mortality prediction in PICU patients. Conclusion: PRISM-III/IV, PIM-3, and PELOD-2 performed well for mortality prediction in PICU, but with low to moderate certainty of evidence. More well-designed studies are needed to validate these results.


INTRODUCTION
The main aim of the pediatric intensive care unit (PICU) is to decrease mortality in infants by both monitoring and treating critically ill patients who are considered at risk of dying. To provide better quality of care with the available resources, "at-risk" patients must first be identified, followed by a suitable management plan and prioritization of resource use (1). In China, mortality rates associated with PICU admission are approximately two to three times higher than in America and Europe (2). It is, therefore, essential to identify predictors and determinants of death in PICU for the risk stratification and optimal management of such patients. Death prediction scores have been explored continually by critical care providers since the establishment of PICU.
Scoring systems aim to predict the outcome during treatment and to support better quality of care with the available resources. Many mortality prediction scoring systems are used for predicting outcomes in PICU patients. Although it is difficult to assess an individual patient's outcome precisely, there have been efforts to develop and validate prediction models, such as Pediatric Risk of Mortality (PRISM) III/IV, Pediatric Index of Mortality-3 (PIM-3), and Pediatric Logistic Organ Dysfunction-2 (PELOD-2). However, their predictive accuracy has varied significantly across populations worldwide (3)(4)(5).
The PIM was developed by Shann et al. from data collected from PICUs in three prospective studies, from 1988 to 1995, and a cohort study conducted from 1994 to 1997 (6). PIM is a simple 10-variable model that is assessed at the time of admission to the PICU. Apart from the prediction of mortality, this model also helps in the assessment of medical care quality and the use of resources. The revised version, PIM-3, reported in 2013, has better calibration and discrimination capability than the previous model, PIM-2 (7,8).
PRISM score is another widely used model that was developed using data collected from PICUs in the United States. PRISM was later updated to PRISM-III and PRISM-IV with better calibration and discrimination efficiency (9) and is used to predict the risk of mortality during admission at PICU.
Several recent studies have evaluated various prediction models for outcomes in PICU patients but have shown inconsistent findings, such as underestimation or overestimation of mortality, poor discriminatory power, and absence of reporting of calibration statistics (4,11). To date, there is no pooled evidence on the accuracy of these scores for PICU patients. The main goal of the current study is to conduct a systematic review and meta-analysis to evaluate the accuracy of PRISM-III/IV, PIM-3, and PELOD-2 scores in predicting mortality in pediatric patients in the PICU.

MATERIALS AND METHODS
Study Design: Systematic review and meta-analysis.
Ethical Clearance: Not required.

Search Strategy
The present meta-analysis was conducted according to the reporting guidelines suggested in the PRISMA 2020 and Cochrane library. Search engines and electronic databases, such as Google Scholar, PubMed, and CENTRAL (Cochrane Central Register of Controlled Trials) were used to retrieve English language papers published up to May 2021. Free text words and medical subject heading (MeSH) terms were used, and the reference lists of potentially eligible studies and relevant review articles on a similar topic were scanned for additional possible studies. The following search key words were used: ((("pediatrics"[All Fields] OR "pediatrics"[MeSH Terms] OR "pediatrics"[All Fields] OR "pediatric"[All Fields] OR "pediatric"[All Fields]) AND ("pediatric Risk of Mortality" [All Fields] ("prism"[All Fields] OR "prism s"[All Fields] OR "prisms"[All Fields])) OR "Pediatric Logistic Organ Dysfunction-2" OR "PELOD"[All Fields] OR "Pediatric index of mortality" OR "PIM"

Participants
We included studies on patients admitted to PICU for any condition.

Prognostic Tests
Studies with PRISM-III/IV, PIM-3, and PELOD-2 models.
Comparator: Threshold values reported in the published articles.
Outcome: The outcome assessed was mortality. Mortality was defined as death at hospital or follow-up.

Inclusion Criteria
Study design: All studies evaluating the accuracy of PELOD-2, PIM-3, or PRISM-III/IV scores to predict outcomes in pediatric patients admitted to the ICU. These prognostic models should aim to predict mortality at any time point in PICU patients aged <18 years.

Exclusion Criteria
Not reporting relevant outcome (mortality) in PICU patients, case reports, review articles.

Data Collection
Two authors independently screened the titles, shortlisted the relevant articles, and extracted the data from the potentially eligible articles that met the inclusion criteria of the study. Disagreements were resolved by discussion. The data extraction form consisted of the following information: first author of the published article, publication year, details of participants, sample size, details of the prediction scoring system, setting, and country from which the data were reported.

Statistical Analysis
STATA software version 13 was used to analyze the data. A random-effects model was used to calculate pooled sensitivity and pooled specificity with a 95% confidence interval (CI), and the summary area under the curve with 95% CI. Heterogeneity was calculated with the I² statistic. An I² value of 50% was taken as the threshold for significant heterogeneity.
The methodological quality of studies was assessed using PROBAST (Prediction Model Risk of Bias Assessment Tool) (12) on four domains: (a) participant selection, (b) predictor selection and measurement, (c) outcome definition and measurement, and (d) statistical analysis. Together these domains comprise a total of 20 signaling questions to assess the risk of bias. The signaling questions are rated as yes, probably yes, no, probably no, or no information. If all signaling questions were rated yes or probably yes, the study was rated as low risk of bias, whereas a rating of no or probably no on one or more questions was rated as potential risk of bias. Studies in which there was insufficient information to judge one or more questions were rated as unclear risk of bias. For the mortality outcome, all studies were rated as low risk of bias, on the assumption that there would be no bias in its measurement.
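The rating rule described above can be written out as a small Python sketch. This is an illustration only, not the tool used by the authors: the function name `rate_risk_of_bias` is hypothetical, "potential risk of bias" is mapped to the label "high", and the precedence (a "no" or "probably no" answer outranks "no information") is an assumption that the text does not state explicitly.

```python
from typing import Iterable

# Answer categories for a signaling question, as listed in the text.
FAVOURABLE = {"yes", "probably yes"}
UNFAVOURABLE = {"no", "probably no"}
MISSING = {"no information"}


def rate_risk_of_bias(answers: Iterable[str]) -> str:
    """Return 'low', 'high', or 'unclear' for one set of signaling answers.

    Assumed precedence (not stated in the source):
    any unfavourable answer -> 'high', else any missing answer -> 'unclear',
    else 'low'.
    """
    normalized = [a.strip().lower() for a in answers]
    if not normalized:
        raise ValueError("at least one signaling question is required")

    unknown = set(normalized) - FAVOURABLE - UNFAVOURABLE - MISSING
    if unknown:
        raise ValueError(f"unrecognized answers: {sorted(unknown)}")

    if any(a in UNFAVOURABLE for a in normalized):
        return "high"
    if any(a in MISSING for a in normalized):
        return "unclear"
    return "low"
```

For example, a study answered "yes" on every question except one "no information" would be rated "unclear" under these assumptions, while a single "probably no" would be enough to rate it "high".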

GRADE Evidence
An adapted GRADE framework for determining the certainty of evidence in predictive accuracy studies was used (13). The certainty of evidence was judged on risk of bias, indirectness, inconsistency, imprecision, publication bias, large effect, and possible confounding effects.

Study Characteristics
Study characteristics are shown in Table 1. The study flow diagram is shown in Figure 1 (27), one from Italy (31), and one multicentric (25). A total of 18 studies reported sufficient data to compute pooled sensitivity and pooled specificity for the PRISM-III/IV scoring system. Sixteen studies used PRISM-III and two used PRISM-IV. The meta-analysis of combined PRISM-III/IV studies showed a pooled sensitivity of 0.78 (95% CI: 0.72-0.83) and a pooled specificity of 0.75 (95% CI: 0.68-0.81) (Figure 2). The pooled analysis indicated good test performance for PRISM-III/IV (diagnostic odds ratio 11, 95% CI: 7-18).
Studies including only PRISM-III reported a pooled sensitivity of 0.79 (95% CI: 0.72-0.85) and a specificity of 0.75 (95% CI: 0.68-0.82). The summary area under the curve suggested 84% discriminatory power of PRISM-III/IV for mortality (SROC 0.84, 95% CI: 0.80-0.87) (Figure 3). We could not compute the pooled sensitivity and pooled specificity of PRISM-IV because the number of studies was too small for subgroup analysis. There was significant heterogeneity between the studies in both the pooled sensitivity (p < 0.001) and pooled specificity (p < 0.001) analyses (Figure 2), with no significant publication bias (p = 0.81) (Supplementary Figure 1). The risk of bias assessment showed moderate to high risk of bias across studies, mainly in the statistical analysis domain (Supplementary Figures 2A,B). Our meta-regression analysis found no significant influence of differences in mortality rates among populations, study design, mean age of PICU patients, female gender, setting (specialized children's hospital/tertiary care hospital), study period, or length of hospital stay on the discriminatory and predictive performance of PRISM-III/IV (Supplementary Figure 3). The certainty of evidence according to GRADE criteria was very low (Supplementary Table 1).
For PIM-3, no significant heterogeneity was observed for sensitivity (p = 0.14, I² = 32.85%), but significant heterogeneity was noted in pooled specificity (p < 0.001, I² = 91%) (Figure 4). There was no significant publication bias in the combined sensitivity and specificity (p = 0.36) (Supplementary Figure 4). The summary area under the curve indicated that the PIM-3 scoring system had 82% power to predict mortality (SROC 0.82, 95% CI: 0.78-0.85) (Figure 5). The pooled analysis indicated good test performance for PIM-3 (diagnostic odds ratio 9, 95% CI: 7-13). In the assessment of the methodological quality of studies using the PROBAST tool, we observed moderate to high risk of bias, mainly due to inadequate statistical analysis (Supplementary Figures 5A,B).
The meta-regression analysis found no significant effect of differences in mortality rates or length of stay on the pooled effect size (Supplementary Figure 6). The certainty of evidence was moderate for sensitivity and very low for specificity (Supplementary Table 2).
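For readers less familiar with the I² values quoted in these results, the following Python sketch shows the conventional definition of I² from Cochran's Q, namely I² = max(0, (Q − df)/Q) × 100 with df equal to the number of studies minus one. It assumes this standard definition; the source does not state how the software derived its values, and the Q value used in the usage note below is hypothetical, not taken from the included studies.

```python
def i_squared(q: float, n_studies: int) -> float:
    """Percentage of total variation attributed to between-study heterogeneity.

    q          Cochran's Q statistic (must be non-negative)
    n_studies  number of studies pooled (must be at least 2)
    """
    if n_studies < 2:
        raise ValueError("at least two studies are needed")
    if q < 0:
        raise ValueError("Cochran's Q cannot be negative")

    df = n_studies - 1  # degrees of freedom of Q
    if q == 0:
        return 0.0  # no observed variation at all
    # Negative values are truncated to zero by convention.
    return max(0.0, (q - df) / q) * 100.0
```

With a hypothetical Q of 20 across 11 studies, `i_squared(20.0, 11)` gives 50, which is the threshold this review treats as significant heterogeneity.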
Nine studies reported sufficient data for pooled analysis of the sensitivity and specificity of the PELOD-2 scoring system. The analysis showed a pooled sensitivity of 0.78 (95% CI: 0.71-0.83) and a pooled specificity of 0.75 (95% CI: 0.68-0.81) (Figure 6). Heterogeneity was significant for both sensitivity and specificity (p < 0.001, I² = 65.53% for sensitivity and 92.3% for specificity). Discriminatory performance was good (SROC 0.83, 95% CI: 0.80-0.86) (Figure 7), with no statistically significant publication bias (p = 0.07) (Supplementary Figure 7). The pooled analysis indicated good test performance for PELOD-2 (diagnostic odds ratio 11, 95% CI: 7-17). Methodological quality was moderate to high (Supplementary Figures 8A,B). Our meta-regression analysis found no significant influence of differences in mortality rates, study design, mean age of PICU patients, female gender, study period, or length of hospital stay on the discriminatory and predictive performance of PELOD-2 (Supplementary Figure 9).
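As a rough plausibility check on the diagnostic odds ratios reported in this section, the Python sketch below computes a diagnostic odds ratio from a single sensitivity and specificity pair as the ratio of the positive to the negative likelihood ratio. This is a back-of-the-envelope calculation, not the authors' method: the pooled odds ratios in the text come from the fitted meta-analysis model, so small differences from these point calculations are expected. The function name and dictionary are illustrative.

```python
def diagnostic_odds_ratio(sensitivity: float, specificity: float) -> float:
    """DOR = LR+ / LR-, from one sensitivity and specificity pair."""
    for name, value in (("sensitivity", sensitivity), ("specificity", specificity)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must lie strictly between 0 and 1")

    lr_positive = sensitivity / (1.0 - specificity)   # positive likelihood ratio
    lr_negative = (1.0 - sensitivity) / specificity   # negative likelihood ratio
    return lr_positive / lr_negative


# Pooled (sensitivity, specificity) pairs as reported in this article.
pooled = {
    "PRISM-III/IV": (0.78, 0.75),
    "PIM-3": (0.75, 0.76),
    "PELOD-2": (0.78, 0.75),
}

for score, (sens, spec) in pooled.items():
    print(f"{score}: DOR ~ {diagnostic_odds_ratio(sens, spec):.1f}")
```

The point calculations come to about 10.6 for PRISM-III/IV and PELOD-2 and about 9.5 for PIM-3, in line with the reported values of 11, 9, and 11.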

DISCUSSION
In this study, we investigated the predictive accuracy and discriminating power of commonly used scoring systems, namely PRISM-III/IV, PELOD-2, and PIM-3, for predicting mortality risk in patients admitted to PICU. In China, mortality rates associated with PICU admission are approximately two to three times higher than in America and Europe. There is a pressing need to identify predictors or prediction models of death in the PICU. Death risk prediction scores continue to be explored as a means of providing optimal management to PICU patients with the available resources.
Accurate and reliable information about predicted mortality improves communication with patients about possible prognoses and optimal stratification of patients at risk. These three scoring systems have potential to provide the predictive accuracy for prognosis in PICU patients.
We found evidence of good performance for these models; however, the risk of bias assessment showed moderate to high risk of bias among the studies. This bias was mainly due to inadequate presentation and reporting of statistical analysis, and failure to conduct internal and external validation of the models. Calibration is an essential component in the evaluation of a model; however, in our analysis, 36% of PRISM-III/IV, 33% of PELOD-2, and 9% of PIM-3 studies did not report the calibration of the model, which leads to bias in the statistical analysis domain. Regarding events per variable, 68% of PRISM-III/IV studies, 88% of PELOD-2 studies, and 72% of PIM-3 studies had <100 death events, which is rated as high risk of bias by the PROBAST tool and carries a risk of overfitting of the model in the validation studies. The most commonly used method to report calibration was the Hosmer-Lemeshow test, although a limitation of this test is that it indicates neither the presence nor the magnitude of miscalibration (12). To overcome this, it is recommended to present a calibration plot, but most of the studies considered in the present meta-analysis did not do so.
The development of valid and reliable models for predicting mortality in PICU patients is ongoing. We noted that the PRISM-III/IV score had the best predictive accuracy and discrimination in an individual patient (sROC 0.84), closely followed by PELOD-2 and PIM-3. Overall, the discriminatory performance of the three scoring systems was very similar.
Each prediction score is applied within a specific timeframe in which reliable and optimal predictive performance is to be expected. For PRISM-III/IV scores, the optimal time point for prediction is after 24 h, while PIM-3 scores show the best performance and discrimination in the early hours after admission. The delayed timeframe of PRISM-III/IV carries the risk that a patient dies before the PRISM-III/IV score, which could have indicated the prognosis, is assessed (38). On the other hand, examination in the first few hours may give an inaccurate prediction of prognosis. A study assessing the predictive ability of PRISM-III, PIM-3, and PELOD-2 in a PICU setting demonstrated that PIM-3 had better discrimination power and calibration than PRISM-III and PELOD-2 (3).
The PELOD-2 score may serve as an optimal measure to monitor the development of disease conditions and predict the outcome when evaluated at continuous time intervals during disease progression (39). Zhong et al. (32) reported that the PELOD-2 score was effective in assessing the prognosis of PICU patients with sepsis and showed excellent discriminatory power (0.916). On the other hand, the performance of the PRISM-III/IV and PELOD-2 scores improves when sepsis is pronounced (16). Another study, by Mathews et al., showed that a PELOD-2 score of over 20 was able to predict mortality in 72.2% of PICU patients, and a cutoff score >16 showed a sensitivity of 100% and specificity of 54.1% (40). The study by Karam  A large study (21,335 subjects in the entire cohort) published by Christoper et al. (41) used a retrospective, single-center cohort derived from structured electronic health record data in the large quaternary PICU at a freestanding, university-affiliated children's hospital. The findings of this study demonstrated good to excellent discrimination measured by area under the curve (0.90, 95% CI 0.86-0.94, for electronic-PRISM-IV and 0.97, 95% CI 0.96-0.98, for PELOD-2), further strengthening the validity and reliability of scoring systems for accurate prediction of mortality in PICU patients. However, the findings of this study were limited largely by the inclusion of only structured electronic data. The study also reported that bias associated with the entry of diagnostic codes by physicians could not be excluded.
Our meta-regression analysis aimed to explore the sources of variation in discriminatory and predictive performance. None of the variables examined had a significant influence, indicating the need for well-designed studies with additional clinically relevant variables to explore the source of heterogeneity between the studies. Using GRADE analysis, we rated the certainty of evidence as very low for PRISM-III/IV, low for PELOD-2, and moderate for PIM-3 for predicting mortality in PICU patients. This means that the true effects are likely to be close to the estimated prognostic performance, but there is a possibility that they are substantially different.

LIMITATION
This study has several limitations. A high degree of heterogeneity was noted in the pooled analysis, which can originate from differences in study population, setting, and methodological quality of the studies. Given this heterogeneity, further research will be necessary to obtain more consistent findings. A large study reported by Christoper et al. could not be included in the analysis because the required data were insufficient, which may have led to underestimation or overestimation of the performance of some of the scoring systems. The studies included in the meta-analysis were conducted in a wide range of conditions and settings, which contributed to the heterogeneity in the findings.

CONCLUSION
PIM-3, PELOD-2, and PRISM-III/IV demonstrated good discriminatory power for mortality prediction in PICU patients, with low to moderate certainty of evidence. Further well-designed studies are needed to provide a more accurate judgment of the performance of these models.