Considerations in the reliability and fairness audits of predictive models for advance care planning

Multiple reporting guidelines for artificial intelligence (AI) models in healthcare recommend that models be audited for reliability and fairness. However, there is a gap in operational guidance for performing such audits in practice. Following guideline recommendations, we conducted a reliability audit of two models based on model performance and calibration, as well as a fairness audit based on summary statistics, subgroup performance, and subgroup calibration. We assessed the Epic End-of-Life (EOL) Index model and an internally developed Stanford Hospital Medicine (HM) Advance Care Planning (ACP) model in three practice settings: Primary Care, Inpatient Oncology, and Hospital Medicine, using clinicians' answers to the surprise question ("Would you be surprised if [patient X] passed away in [Y years]?") as a surrogate outcome. For performance, the models had positive predictive value (PPV) at or above 0.76 in all settings. In Hospital Medicine and Inpatient Oncology, the Stanford HM ACP model had higher sensitivity (0.69 and 0.89, respectively) than the Epic EOL model (0.20 and 0.27), and better calibration (O/E 1.5 and 1.7) than the Epic EOL model (O/E 2.5 and 3.0). The Epic EOL model flagged fewer patients (11% and 21%, respectively) than the Stanford HM ACP model (38% and 75%). There were no differences in performance or calibration by sex. Both models had lower sensitivity for Hispanic/Latino male patients with Race listed as "Other." Ten clinicians were surveyed after a presentation summarizing the audit: 10/10 reported that summary statistics, overall performance, and subgroup performance would affect their decision to use the model to guide care, and 9/10 said the same for overall and subgroup calibration. The most commonly identified barriers to routinely conducting such reliability and fairness audits were poor demographic data quality and lack of data access. This audit required 115 person-hours across 8–10 months. Our recommendations for performing reliability and fairness audits include verifying data validity, analyzing model performance on intersectional subgroups, and collecting clinician-patient linkages as necessary for label generation by clinicians. Those responsible for AI models should require such audits before model deployment and mediate between model auditors and impacted stakeholders.

O/E = 5.3. There was significantly lower sensitivity for Age: (60,70] at 0.09 and Age: (70,80] at 0.07. The model also underpredicted events more for Age: (40,50] by a factor of O/E = 15.0, Age: (60,70] by a factor of O/E = 19.3, and Age: (70,80] by a factor of O/E = 9.8 (Supplemental Tables 27-30). For several other groups, there were statistically significant differences in prevalence, performance, or O/E, but these subgroups had fewer than 10 patients available to calculate the metric, making results inconclusive.

Comparison of Class Balanced Analysis with Unbalanced Analysis
Prevalence was higher in the class balanced analysis (0.5) compared with the unbalanced analysis (0.2), due to the random oversampling of the positive label used to generate the class balanced data set. In both the class balanced analysis and the unbalanced analysis, prevalence was significantly higher for Age: (80,90]. In the class balanced analysis only, prevalence was higher for Age: (90,100] and lower for Age: (40,50], Age: (50,60], Hispanic patients with Race listed as "Other", and Hispanic female patients with Race listed as "Other". In the unbalanced analysis only, prevalence was lower for Age: (20,30] and Age: (30,40].
In the class balanced analysis, the Epic EOL Low Threshold model for Primary Care flagged more patients (20% vs 9%), had a higher PPV (0.95 vs 0.85), and had a higher O/E (5.3 vs 4.1) than in the unbalanced analysis. Otherwise, sensitivity was similar (0.38 vs 0.37), as was specificity (0.98 for both). In both the class balanced analysis and the unbalanced analysis, sensitivity was significantly lower for Age: (60,70] and Age: (70,80], and the model underpredicted events more for Age: (60,70]. In the class balanced analysis only, the model underpredicted events more for Age: (40,50] and Age: (70,80].

Epic EOL High Threshold in Inpatient Oncology - Class Balanced Analysis
Before oversampling, the data set size for the Epic EOL High Threshold model in Inpatient Oncology was 150 with 105 positive labels. After oversampling the negative labels, the data set size was 210.
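For concreteness, below is a minimal sketch of how such a class-balanced data set could be built by randomly oversampling the minority label with replacement; the `class_balance` helper and the NumPy-based implementation are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

def class_balance(labels: np.ndarray, seed: int = 0) -> np.ndarray:
    """Return row indices of a class-balanced data set built by randomly
    oversampling the minority label with replacement until both classes
    match the majority class size.

    Example: 150 records with 105 positives and 45 negatives -> 60 extra
    negatives are drawn with replacement, giving 105 + 105 = 210 records.
    """
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    return np.concatenate([majority, minority, extra])

labels = np.array([1] * 105 + [0] * 45)  # Inpatient Oncology example above
print(len(class_balance(labels)))        # 210
```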
The overall prevalence was 0.5. There was significantly lower prevalence for Age: (20,30] at 0.11. There were no significant differences in prevalence by Sex, Ethnicity/Race, and the intersection of Ethnicity/Race and Sex (Supplemental Tables 31-34).
The model flagged 40 patients out of 210 (19%) with a sensitivity of 0.27, specificity of 0.89, and PPV of 0.7. The model predicted fewer events than the number of positive clinician labels, with an O/E ratio of 2.4. There was significantly lower sensitivity for Ethnicity: Hispanic or Latino, Race: Other at 0.09; and Ethnicity: Hispanic or Latino, Race: Other, Sex: Male at 0. There was significantly lower specificity for Age: (60,70] at 0.33 and Ethnicity: Not Hispanic or Latino, Race: White, Sex: Male at 0.59. The model significantly underpredicted events for Ethnicity: Hispanic or Latino, Race: Other at 5.3; and Ethnicity: Hispanic or Latino, Race: Other, Sex: Male at 6.3. Several other subgroups exhibited statistically significant differences in model sensitivity, specificity, or O/E, but these subgroups had fewer than 10 patients available to calculate the metric, making such claims inconclusive. See Supplemental Tables 31-34 for details.
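As a reference for how these figures relate to the underlying confusion matrix, here is a hedged sketch of the audit metrics. The function name is hypothetical, and the O/E definition shown (observed positive clinician labels divided by the sum of model-predicted probabilities, so that O/E > 1 indicates underprediction) is our assumption about how the calibration ratio could be computed.

```python
import numpy as np

def audit_metrics(y_true: np.ndarray, y_flag: np.ndarray,
                  y_prob: np.ndarray) -> dict:
    """Performance and calibration metrics of the kind reported in this audit.

    y_true: binary clinician labels from the surprise question.
    y_flag: binary model flags at the chosen threshold.
    y_prob: model-predicted probabilities (used for O/E; treating the sum
    of predicted probabilities as 'expected' is our assumption).
    """
    tp = np.sum((y_flag == 1) & (y_true == 1))
    fp = np.sum((y_flag == 1) & (y_true == 0))
    fn = np.sum((y_flag == 0) & (y_true == 1))
    tn = np.sum((y_flag == 0) & (y_true == 0))
    return {
        "flag_rate": (tp + fp) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        # O/E > 1 means the model underpredicts events relative to clinicians.
        "o_e": y_true.sum() / y_prob.sum(),
    }
```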

Comparison of Class Balanced Analysis with Unbalanced Analysis
Prevalence was lower in the class balanced analysis (0.5) compared with the unbalanced analysis (0.7), due to the random oversampling of the negative label used to generate the class balanced data set. In both the class balanced analysis and the unbalanced analysis, prevalence was significantly lower for Age: (20,30].
In the class balanced analysis, the Epic EOL High Threshold model in Inpatient Oncology flagged fewer patients (19% vs 21%), had a lower PPV (0.7 vs 0.88), and had a lower O/E (2.4 vs 3.0) than in the unbalanced analysis. Otherwise, sensitivity was similar (0.27 vs 0.27), as was specificity (0.89 vs 0.91). In both the class balanced analysis and the unbalanced analysis, sensitivity was significantly lower for Hispanic or Latino patients with Race "Other" and in particular Hispanic or Latino males with Race "Other"; these two groups also both had significant underprediction of events. In the class balanced analysis only, the model had lower specificity for Age: (60,70] and Ethnicity: Not Hispanic or Latino, Race: White, Sex: Male.

Stanford HM ACP in Inpatient Oncology - Class Balanced Analysis
Before oversampling, the data set size for the Stanford HM ACP model in Inpatient Oncology was 114 with 79 positive labels. After oversampling the negative labels, the data set size was 158.
The overall prevalence was 0.5. There was significantly lower prevalence for Age: (20,30] at 0.16, and significantly higher prevalence for Age: (60,70] at 0.85. There were no significant differences in prevalence by Sex, Ethnicity/Race, or the intersection of Ethnicity/Race and Sex.
The Stanford HM ACP model flagged 105 patients out of 158 (66%) with sensitivity 0.89, specificity 0.56, and PPV 0.67. The model moderately underestimated events relative to clinicians, with an O/E of 1.4. For Age: (40,50], there was significantly lower specificity at 0.18 and significantly lower PPV at 0.26.
Model performance and O/E appeared to differ for some other subgroups, but these subgroups had fewer than 10 patients available to calculate the metric, making any associated claims inconclusive. See Supplemental Tables 35-38 for details.
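One plausible implementation of this small-subgroup screen is sketched below, reusing the hypothetical `audit_metrics` helper from the earlier sketch; the `MIN_N` cutoff of 10 mirrors the rule applied throughout this section.

```python
import pandas as pd

MIN_N = 10  # subgroups below this size are reported as inconclusive

def subgroup_metrics(df: pd.DataFrame, group_cols: list[str]) -> pd.DataFrame:
    """Audit metrics per subgroup (e.g., Age bin, Ethnicity/Race, Sex, and
    their intersections). Expects binary 'label' and 'flag' columns plus a
    'prob' column of model-predicted probabilities."""
    rows = []
    for key, g in df.groupby(group_cols):
        m = audit_metrics(g["label"].to_numpy(), g["flag"].to_numpy(),
                          g["prob"].to_numpy())
        m["subgroup"] = key
        m["n"] = len(g)
        m["conclusive"] = len(g) >= MIN_N  # else flag the row as inconclusive
        rows.append(m)
    return pd.DataFrame(rows)
```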

Comparison of Class Balanced Analysis with Unbalanced Analysis
Prevalence was lower in the class balanced analysis (0.5) compared with the unbalanced analysis (0.69), due to the random oversampling of the negative label used to generate the class balanced data set. In the class balanced analysis only, prevalence was significantly lower for Age: (20,30] and significantly higher for Age: (60,70].
In the class balanced analysis, the Stanford HM ACP model in Inpatient Oncology flagged fewer patients (66% vs 75%), had a lower PPV (0.67 vs 0.82), and had a lower O/E (1.4 vs 1.7) than in the unbalanced analysis. Otherwise, sensitivity was similar (0.89 vs 0.89), as was specificity (0.56 vs 0.57). In the class balanced analysis only, the model had significantly lower specificity and significantly lower PPV for Age: (40,50].

Model Comparison in Inpatient Oncology - Class Balanced Analysis
Comparing model performance in Inpatient Oncology, the Stanford HM ACP model flagged more patients than the Epic EOL High Threshold model (66% vs 19%), had significantly higher sensitivity (0.89 vs 0.27), and exhibited similar PPV (0.67 vs 0.7; 95% confidence intervals overlap). The Epic EOL High Threshold model had significantly higher specificity (0.89 vs 0.56). Comparing model calibration, the Stanford HM ACP model had significantly better calibration in terms of O/E (1.4 vs 2.4).
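The section does not detail how the 95% confidence intervals behind these "similar" and "significantly different" calls were computed; a percentile bootstrap over patients, as sketched below, is one plausible approach (the `bootstrap_ci` function and its parameters are our invention, not the authors' method).

```python
import numpy as np

def bootstrap_ci(y_true, y_flag, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a binary-classification
    metric; two models would then be called 'similar' when their intervals
    overlap on the same cohort."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample patients with replacement
        stats.append(metric(y_true[idx], y_flag[idx]))
    return tuple(np.quantile(stats, [alpha / 2, 1 - alpha / 2]))

# Example usage for PPV (guarding against resamples with no flagged patients):
ppv = lambda t, f: ((f == 1) & (t == 1)).sum() / max((f == 1).sum(), 1)
# lo, hi = bootstrap_ci(labels, model_flags, ppv)
```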

Comparison of Class Balanced Analysis with Unbalanced Analysis
In both the class balanced analysis and unbalanced analysis in Inpatient Oncology, the Stanford HM ACP model flagged more patients, had significantly higher sensitivity, exhibited similar PPV, and had significantly better calibration in terms of O/E compared to the Epic EOL High Threshold model, while the Epic EOL High Threshold model had significantly higher specificity than the Stanford HM ACP model.

Epic EOL High Threshold in Hospital Medicine - Class Balanced Analysis
Before oversampling, the data set size for the Epic EOL High Threshold model in Hospital Medicine was 305 with 133 positive labels. After oversampling the positive labels, the data set size was 344.
The overall prevalence was 0.5. Prevalence did not differ by sex, but was significantly lower for younger patients (0.22 for Age: (20,30] and 0.15 for Age: (30,40]). Prevalence was also significantly higher for Non-Hispanic Asian patients (0.76) and, in particular, Non-Hispanic Asian Males (0.79). Prevalence was significantly lower for Hispanic or Latino patients with Race "Other" (0.23) and, in particular, Hispanic or Latino Males of Race "Other" (0.17) (Supplemental Tables 39-42).
The model flagged 44 out of 344 patients (13%). The model demonstrated a sensitivity of 0.21, specificity of 0.95, and PPV of 0.82. The model underpredicted events relative to clinicians (O/E ratio of 2.6). The model underestimated events significantly more for Non-Hispanic White patients (O/E = 3.4) and, in particular, Non-Hispanic White Females (O/E = 4.2). Differences in performance and O/E were statistically significant for some other subgroups, but these subgroups had fewer than 10 patients available to calculate the metric, preventing conclusive statements regarding disparate performance. See Supplemental Tables 39-42 for details.

Comparison of Class Balanced Analysis with Unbalanced Analysis
Prevalence was higher in the class balanced analysis (0.5) compared with the unbalanced analysis (0.44), due to the random oversampling of the positive label used to generate the class balanced data set. In both the class balanced analysis and the unbalanced analysis, prevalence was significantly lower for Age: (20,30], Age: (30,40], and Hispanic or Latino patients with Race "Other" (in particular, Hispanic or Latino Males of Race "Other"), and was significantly higher for Non-Hispanic Asian patients. In the class balanced analysis only, prevalence was significantly higher for Non-Hispanic Asian Males. In the unbalanced analysis only, prevalence was significantly higher for older patients (Age: (80,90] and Age: (90,100]).

Stanford HM ACP in Hospital Medicine - Class Balanced Analysis
Before oversampling, the data set size for the Stanford HM ACP model in Hospital Medicine was 225 with 99 positive labels. After oversampling the positive labels, the data set size was 252.
The Stanford HM ACP model flagged 106 out of 252 patients (42%), with sensitivity 0.71, specificity 0.87, and PPV 0.84. Relative to clinicians, the model underestimated events by a factor of O/E = 1.6. For patients Age: (80,90] and Age: (90,100], this underestimation was even more substantial, with O/E ratios of 2.2 and 2.7, respectively. Specificity was lower (0.57) for Age: (70,80]. Model performance disparities in other subgroups were inconclusive because those subgroups had fewer than 10 patients available to calculate the metric. See Supplemental Tables 43-46 for details.

Comparison of Class Balanced Analysis with Unbalanced Analysis
Prevalence was higher in the class balanced analysis (0.5) compared with the unbalanced analysis (0.44), due to the random oversampling of the positive label used to generate the class balanced data set. In both the class balanced analysis and the unbalanced analysis, prevalence was significantly higher for older patients (Age: (80,90] and Age: (90,100]) and for Non-Hispanic Asian patients, especially Non-Hispanic Asian Males, and significantly lower for younger patients (Age: (30,40]) and for Hispanic or Latino patients with Race "Other." In the class balanced analysis only, prevalence was significantly lower specifically for both Hispanic or Latino Females with Race "Other" and Hispanic or Latino Males with Race "Other".
In the class balanced analysis, the Stanford HM ACP model in Hospital Medicine flagged more patients (42% vs 38%) and had a higher PPV (0.84 vs 0.8) than in the unbalanced analysis. Otherwise, sensitivity was similar (0.71 vs 0.69), as were specificity (0.87 for both) and O/E (1.6 vs 1.5). In both the class balanced analysis and the unbalanced analysis, the model had lower specificity for Age: (70,80] and a greater underestimation of events for Age: (90,100]. In the class balanced analysis only, the model had a greater underestimation of events for Age: (80,90]. In the unbalanced analysis only, the model had significantly lower PPV for Hispanic or Latino patients with Race "Other."

Model Comparison in Hospital Medicine - Class Balanced Analysis
Comparing model performance in Hospital Medicine, the Stanford HM ACP model flagged more patients than the Epic EOL High Threshold model (42% vs 13%), had significantly higher sensitivity (0.71 vs 0.21), similar specificity (0.87 vs 0.95; 95% confidence intervals overlap), and similar PPV (0.84 vs 0.82; 95% confidence intervals overlap). Comparing model calibration, the Stanford HM ACP model had significantly better calibration in terms of O/E (1.6 vs 2.6).

Comparison of Class Balanced Analysis with Unbalanced Analysis
In both the class balanced analysis and unbalanced analysis in Hospital Medicine, the Stanford HM ACP model flagged more patients, had significantly higher sensitivity, similar specificity, similar PPV, and better calibration in terms of O/E compared to the Epic EOL High Threshold model.

Supplemental Table 23: Survey responses to "What are some key drivers to making these reliability and fairness audits standard practice?"

Drivers to make these reliability and fairness audits standard practice | Responses
Findings that AI models are not fair | 10
Findings that AI models are not reliable | 9
Academic medicine's push toward racial equity | 9

Supplemental Table 24: Survey responses to "What are some key barriers to making these reliability and fairness audits standard practice?"

Barriers to make these reliability and fairness audits standard practice | Responses
Poor demographic data quality | 8
Poor data quality | 6
Lack of data access | 5
Audits are not built into our incentives | 4
Lack of knowledge about how to do an audit | 3
The reliability of deployed AI models is not prioritized | 3
The fairness of deployed AI models is not prioritized | 3
Lack of data science expertise in my practice setting | 2
Other: "I don't see us as designing them, but if teams want to engage providers in helping with these audits, I think the most significant barrier is the time, but if there is incentive/appreciation/protected time to do the audit, I can't think of any other barriers" | 1
Other: "Death data" | 1
I do not see any barriers to making reliability and fairness audits standard practice. | 0

Pros in using AI to support my work | Responses
Helps triage patients and identify who would benefit the most | 10
Shared understanding of patients for our whole care team | 9
Reduces work for me | 3
I do not see any pros to using an AI model to support my work. | 0

Cons in using AI to support my work | Responses
Lack of transparency of the model | 5
Takes effort to maintain | 4
I disagree with the model | 3
Loss of my decision-making autonomy | 2
Pressure to act even if I disagree with the model | 1
Other: "Worry that the model may miss some patients who might benefit" | 1
Other: "The HM model although is more sensitive - so many patients flag. Is it possible to risk stratify who is highest risk (using green, yellow, red) like the Epic AI models" | 0
I do not see any cons to using an AI model to support my work. | 1