ORIGINAL RESEARCH article

Front. Psychiatry, 12 January 2026

Sec. Public Mental Health

Volume 16 - 2025 | https://doi.org/10.3389/fpsyt.2025.1679618

This article is part of the Research Topic: Prevention-Oriented Suicide Risk Assessment.

Navigating extreme class imbalance in suicide risk prediction

Christopher Kitchen1*, Anas Belouali2, Paul S. Nestadt3,4, Holly C. Wilcox3,4 and Hadi Kharrazi1,2
  • 1Center for Population Health Information Technology (IT), Johns Hopkins School of Public Health (JHSPH), Baltimore, MD, United States
  • 2Section of Biomedical Informatics and Data Science, Johns Hopkins School of Medicine (JHSOM), Baltimore, MD, United States
  • 3Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine (JHSOM), Baltimore, MD, United States
  • 4Center for Suicide Research, Johns Hopkins School of Public Health (JHSPH), Baltimore, MD, United States

Background: The implementation of suicide risk models is challenging because the conditions in which they are developed often do not reflect those in which they are used. Setting an arbitrary classification threshold limits the interpretability of predictions and of their associated performance statistics. This work explores different class imbalance ratios across training sample compositions, time horizons, and patient characteristics to understand how the degree of imbalance affects the performance of regression-based predictive models of suicide.

Patients and setting: The study population included 1,649,577 patients who were selected from the Maryland Suicide Data Warehouse (MSDW) between 2016 and 2020. The MSDW contains clinical and demographic features derived from commercial claims (Maryland Health Care Commission, MHCC) and hospital discharge records (Health Services Cost Review Commission, HSCRC) for decedents and living patients within the state of Maryland. Suicide death was the primary outcome of interest in a cross-validated framework stratified by sources of data in the MSDW.

Results: Cross-validated AUROC did not vary consistently with training sample imbalance or time horizon, but both had a direct association with AUPRC. Indeed, AUPRC increased with greater sample imbalance in training or a longer outcome horizon (AUPRC 0.246, 0.246, and 0.593 for all decedents, HSCRC, and MHCC, respectively). Stratified samples showed no significant difference in cross-validated AUROC relative to the overall sample (0.832, 0.913, and 0.927 for decedents, HSCRC, and MHCC). However, AUPRC was significantly greater when limiting the HSCRC and MHCC samples to patients seen in the emergency room (AUPRC 0.417; 0.782) or in the inpatient setting (0.371; 0.773), or to patients who had ICD-10-CM coded social needs (0.479, HSCRC only). Performance was significantly worse when restricting samples to patients aged less than 18 years (AUPRC 0.036 and 0.208 for HSCRC and MHCC, respectively).

Conclusion: Low precision of estimated suicide risk can be understood as a consequence of certain tradeoffs made during model development, particularly training models with matched cases, balanced classes, or short time horizons. This work demonstrates improved AUPRC for regression models in a cross-validated framework when these conditions are made more realistic (in the case of class imbalance) or less restrictive (in the case of time horizon). Additionally, we illustrate using the same data that training models in certain clinical cohorts (e.g., defined by age, care utilization, and social need) can lead to robustly different estimates of precision and recall, but not of AUROC.

Introduction

Clinical researchers often struggle with implementing suicide risk and prevention models in health care settings because these prediction models often lack adequate practical precision (1–3). Clinicians also have difficulty using such predictive tools for decision support, both because of alert fatigue and a lack of clarity in how models score risk (4). Although some suicide prediction models have yielded clinically reasonable performance, such models are designed for narrow use cases among specific subpopulations (e.g., patients with chronic psychiatric conditions, substance use disorders, or military veterans) (5–8). Indeed, despite a few studies suggesting that existing tools are sufficiently actionable, and multiple systematic reviews trying to make sense of very diverse development methods, a general risk score that accommodates all patients and settings with sufficient precision and sensitivity remains elusive (9–13).

The assessment of suicide risk, like most clinical support tasks, constitutes an imbalanced classification problem. A severe case of imbalance exists when generalizable, health system-wide suicide risk prediction is attempted, leaving fewer options for fairly assessing performance in observational research (9, 14). This strategy suffers from the classic tradeoff between precision and recall (i.e., making more precise estimates at the cost of sensitivity). However, the decision threshold is hardly the only parameter that controls precision. The C-statistic, widely used in model assessment, ceases to be meaningful in such scenarios, and the task shifts from binary classification to anomaly detection (i.e., sorting both needles and hay, rather than just needles) (15, 16). A better strategy for improving precision may be to evaluate the context in which models are developed or applied (e.g., training sample composition, observation period, or cohort characteristics) (1, 5, 10, 15, 17).

The variation in scale and scope of data used for training has a major influence on performance. A comprehensive meta-analysis noted that the positive predictive value (PPV) corresponding to near-perfect sensitivity and specificity (both at 0.99) would be around 0.02, assuming a mortality rate of 20 per 100,000 patients, but that PPV would increase to 0.33 for a mortality rate of 50 per 100,000. Indeed, the reported PPVs reveal a very strong concordance with the suicide mortality ratio across 11 of the included studies (r = 0.971, p < 0.001), and no significant association between sensitivity and suicide mortality (r = 0.262, p = 0.437). In other words, precision is tied to the volume of non-events (i.e., no suicides), and with a larger control group there is less precision when classifying the top 1st and 5th percentiles of risk. Thus, non-events confound the appropriateness of most resampling methods to control for class imbalance (16, 18–20). Nonetheless, reducing the number of controls or training on extreme cases will lead to algorithmic bias and overfitting (21, 22).
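
As a concrete illustration of how precision depends on the outcome base rate, the short sketch below applies Bayes' rule to convert sensitivity, specificity, and prevalence into a PPV. It is an illustrative calculation only, not output from this study or the cited meta-analysis; the prevalence value is taken from the 20-per-100,000 scenario described above.

```python
# Illustrative sketch: PPV as a function of sensitivity, specificity, and
# outcome prevalence (Bayes' rule). Not taken from the study or the cited
# meta-analysis; shown only to make the dependence on prevalence explicit.
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Near-perfect sensitivity and specificity (0.99) with a suicide mortality of
# 20 per 100,000 still yields a PPV of roughly 0.02.
print(round(ppv(0.99, 0.99, 20 / 100_000), 3))
```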

Clinical characteristics that are strongly associated with suicide risk are also known to vary between cohorts and across time horizons (2, 7, 23–25). For example, a recent meta-analysis found that suicide ideation predicted eventual death over 9 years roughly 0.3% of the time in non-psychiatric samples but 3.9% of the time among psychiatric patients (2). This difference in precision mirrors the sensitivities found for the same cohorts: 22% of suicide deaths were ascertained by ideation among non-psychiatric patients versus 46% among psychiatric patients. Notably, the odds ratio of death by suicide, given prior ideation, was roughly the same between groups (i.e., 3.86 for non-psychiatric and 3.23 for psychiatric patients), suggesting that the balance of cases to controls explains these differences in classification, not the association between ideation and mortality.

Risk factors aside, it is important to consider denominator constraints that bias training data in a way that yields models that are poorly calibrated for a particular use case. As such, we propose evaluating the precision-recall tradeoff under different circumstances, using multiple samples of differing size and scope of features, but an identical cut point for classifying the response. Each model is then fitted using a cross-validation framework to parameterize uncertainty and illustrate trends for different rates of suicide during training, observation periods, and clinical subpopulations. We hypothesize that differential precision-recall can guide decisions about training data to maximize performance for anomaly detection, using a combination of sensitivity, PPV, F1 and area under the precision recall curve (AUPRC) as guides.

Methods

Participants and setting

A total of 1,649,577 patients from the Maryland Suicide Data Warehouse (MSDW) were selected for predictive modeling tasks. The MSDW is a statewide repository of deidentified patient-level health records, linked across hospital discharges (HSCRC), commercial claims (MHCC), and decedent information from the Maryland Office of the Chief Medical Examiner (OCME) (26–29). Patients were selected based on a database-wide assessment of data quality: those with duplicative identification mappings, missing or invalid age or sex, or a missing or invalid Maryland Census tract or 5-digit ZIP code were removed, and at least one recorded encounter in either HSCRC or MHCC between January 1, 2016, and December 31, 2020 was required.

Across the selected denominator of the MSDW, 1,944 (0.1%) patients had been identified as suicide decedents who died between January 1, 2017, and December 31, 2020. Three control groups were identified by overlapping sources of information: 45,585 (2.8%) deceased patients from the OCME who experienced a manner of death other than suicide, 847,716 (51.4%) living patients from HSCRC, and 849,323 (51.5%) living patients from MHCC. There were 94,991 (5.8%) patients found in both the HSCRC and MHCC living control groups, which accounts for proportions not adding up to 100%. For living patients, a random index date was selected within the same outcome window as a proxy for date of death when determining windowed observations. The offset of January 1, 2016, to January 1, 2017, ensured at least 1 year of observation per living patient, even if no encounters occurred during that period.
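
A minimal sketch of how such a proxy index date could be assigned is shown below; this is an assumed implementation for illustration only (the authors' actual code is available in the project repository): each living control receives a uniformly random date within the January 1, 2017 to December 31, 2020 outcome window.

```python
# Assumed illustration only: assign each living control a uniformly random
# index date within the outcome window (January 1, 2017 - December 31, 2020)
# as a proxy for date of death.
import random
from datetime import date, timedelta

rng = random.Random(0)  # fixed seed so the assignment is reproducible

def random_index_date(start: date = date(2017, 1, 1),
                      end: date = date(2020, 12, 31)) -> date:
    """Uniformly sample a calendar date in [start, end]."""
    return start + timedelta(days=rng.randint(0, (end - start).days))

print(random_index_date())  # e.g., a date somewhere in the 2017-2020 window
```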

Feature definitions and selection

Several key predictive features were identified based on expert input and a review of relevant literature on suicide prediction modeling. Given the underlying data represented in the MSDW, these features were limited to demographic, diagnostic, procedure, medication, and social need information. Diagnostic and procedure coded information included ICD-10-CM and CPT-4 coded observations categorized using the 2022 Clinical Classifications Software Refined (CCSR) tool (30–32) and the ‘comorbidity’ package of R, version 1.1.0 (33). Established pharmacologic classes (EPC) were identified using MHCC data and the openFDA API lookup tool for matching with the national drug code (NDC) (34). Social needs were identified using ICD-10-CM codes identified and validated in prior research (35). Geospatial clusters corresponding to regions of Maryland where suicide death is statistically elevated were identified in earlier work and attached to these data by matching FIPS codes at the Census tract level of residence (36).

Only 7 of the 180 standalone attributes used in modeling, corresponding to demographics or patient-level characteristics, did not change during the observation period (e.g., age at index date, region of last residence). All other variables were selected to reflect observations up to 7 days before the patient’s index date (i.e., date of death for decedents, and a random date for living patients). This is analogous to a health system having recorded information only up to the moment of assessment, even if some time has elapsed since the patient’s most recent encounter. Healthcare utilization was reflected by the number of encounters at certain points of care and within 30 or 365 days of the index date. Points of care included emergency room encounters, all-cause hospitalization, outpatient or ambulatory care, and psychiatric hospitalization. Specific details on the selection and cleaning of our data can be obtained from the project GitHub repository (https://github.com/ckitche2/MSDW).

Several variables were combined as interaction terms in our models to control for moderating effects. As controlling for all interactions was not feasible, we selected a small number of conditions known to have a strong association with suicide risk that could be combined with other observations (1, 10, 25, 37). These interactions included CCSR primary categories for depressive, anxiety, bipolar and psychotic disorders, post-traumatic stress, attention deficit disorder, alcohol and opiate use disorders, and prior ICD-10-CM coded observations for suicide ideation, attempt, or intentional self-poisoning (1, 10, 25, 30, 37). Missingness was generally handled through zero imputation of binary features, but complete missingness in several key factors required the development of source-specific feature sets (38, 39). Feature selection consisted of a single-stage LASSO filter applied to the full decedent denominator to reduce the number of collinear or highly associated predictors; a sketch of this filtering step follows. This cohort was chosen as it was the only one with fully available clinical and geographic features for these tasks (described below).
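
The following is a minimal, hypothetical sketch of a single-stage L1-penalized (LASSO) logistic filter of the kind described above. The feature matrix, labels, and names are placeholders, and the authors' actual implementation is available in their GitHub repository.

```python
# Minimal sketch (assumed, not the authors' code): fit an L1-penalized logistic
# regression on the full decedent cohort and keep only features whose
# coefficients are not shrunk to zero.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def lasso_filter(X: np.ndarray, y: np.ndarray, feature_names: list, C: float = 0.1) -> list:
    """Return names of features retained by a single-stage LASSO filter."""
    X_std = StandardScaler().fit_transform(X)          # put features on one scale
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=C, max_iter=1000)
    lasso.fit(X_std, y)
    keep = np.flatnonzero(lasso.coef_.ravel() != 0.0)  # nonzero coefficients survive
    return [feature_names[i] for i in keep]
```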

Predictive modeling tasks

The modeling task consisted of 5-fold cross validation across 5 iterations to produce more meaningful estimates of generalizable performance. In each fit, a logistic regression model for suicide death was used with three sets of inputs, one for each source of data. This approach was used to account for differences in available attributes between HSCRC and MHCC, the former lacking pharmacy-based observations and the latter lacking marital status or geographic characteristics mapped to Census tract of residence. Decedents had the highest number of total available attributes, as they were selected on the basis of being found in one or both records, while source-specific cohorts were required for HSCRC-only and MHCC-only analyses. Living controls from the MHCC record were not robustly linked to those in the HSCRC, and vice versa.

Model performance was evaluated using a combination of classification and response metrics, assuming a decision threshold probability of 0.5 across all cases to facilitate interpretation. This means that for a model to predict a case of suicide, the predicted response must indicate that the outcome is more likely than not. The actual likelihood of suicide, contrasted with the predicted likelihood from a model, is small at virtually any threshold for risk (1). This tends to bias classification towards greater precision at this threshold, as a very small number of observations are expected to reach a probability greater than 0.5, especially when class imbalance is most pronounced (1, 2, 14).

Cross-validated performance metrics, including area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), point sensitivity, and positive predictive value (PPV), were recorded and aggregated across all iterations and folds, and with respect to the contrast tested. Estimates were considered significantly different if the average value fell outside the 95% confidence interval of a comparison distribution of that same score.
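
Putting the preceding paragraphs together, the sketch below illustrates one way the repeated, stratified cross-validation and per-fold metrics could be implemented. It is an assumed outline for illustration (X and y are placeholders for a source-specific feature matrix and suicide-death indicator), not the authors' code.

```python
# Assumed illustrative outline of the evaluation loop: 5x5 repeated stratified
# cross-validation, logistic regression, a fixed 0.5 decision threshold, and
# per-fold AUROC, AUPRC, sensitivity, and PPV summarized with a 95% CI.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import (average_precision_score, precision_score,
                             recall_score, roc_auc_score)

def cross_validated_scores(X, y, n_splits=5, n_repeats=5, threshold=0.5, seed=0):
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    folds = []
    for train_idx, test_idx in cv.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        prob = model.predict_proba(X[test_idx])[:, 1]
        pred = (prob > threshold).astype(int)           # fixed 0.5 threshold
        folds.append({
            "auroc": roc_auc_score(y[test_idx], prob),
            "auprc": average_precision_score(y[test_idx], prob),
            "sensitivity": recall_score(y[test_idx], pred, zero_division=0),
            "ppv": precision_score(y[test_idx], pred, zero_division=0),
        })
    return folds

def summarize(folds, metric):
    """Mean and normal-approximation 95% CI across all folds and repeats."""
    vals = np.array([f[metric] for f in folds])
    half = 1.96 * vals.std(ddof=1) / np.sqrt(len(vals))
    return vals.mean(), (vals.mean() - half, vals.mean() + half)
```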

Sensitivity analyses

Three different contrasts were evaluated for the modeling task and for each cohort: (1) Non-suicide decedents: suicide versus other manners of death, (2) HSCRC: suicide versus living controls represented in hospital discharges, and (3) MHCC: suicide versus living controls represented in commercial insurance claims. These contrasts assessed training sample balances, prospective outcome windows, and clinical cohort types/stratified samples.

Sample balance refers to the ratio of control group patients to suicide decedents for each data source, where the base ratio consists of the full sample in each source. These approximate rates of suicide were 4.1% (roughly 1 in 24) among decedents, 0.2% (1 in 436) in HSCRC, and 0.2% (1 in 437) in MHCC. While these rates are not directly comparable to those observed at the level of the general population (i.e., approximately 14 per 100,000, or 1 in 7,143), they reflect a substantial imbalance much like what is seen in prior suicide risk modeling research. Indeed, prior studies have used a variety of population samples, with one model using a sample of 1 suicide in 2,778 U.S. veterans and another using a sample with 1 suicide in 13 high-risk patients with prior suicidal behavior (1, 40–43). Each training sample composition was used to fit a model that was then validated against a test data set with a natural case-control composition reflective of the overall sample (e.g., 1 in 24; 1 in 436; and 1 in 437). This mimics a situation where researchers train models using a convenience sample of screened at-risk patients, or by downsampling controls, and then apply the model to a real-world sample or setting.
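
A minimal sketch of how this contrast could be constructed is shown below (assumed for illustration, not the study code): controls are downsampled to a target case-to-control ratio within the training fold only, while the held-out test fold keeps the full-denominator composition.

```python
# Assumed illustration: thin the control group in the training fold to a chosen
# case-to-control ratio; the test fold is left at its natural composition.
import numpy as np

rng = np.random.default_rng(0)

def downsample_controls(train_idx: np.ndarray, y: np.ndarray, controls_per_case: int) -> np.ndarray:
    """Return training indices with controls reduced to the requested ratio."""
    cases = train_idx[y[train_idx] == 1]
    controls = train_idx[y[train_idx] == 0]
    n_keep = min(len(controls), controls_per_case * len(cases))
    kept = rng.choice(controls, size=n_keep, replace=False)
    return np.concatenate([cases, kept])

# Example: a 1-to-1 training composition; the model is still evaluated on the
# untouched test fold, e.g. fit on X[balanced], y[balanced], then score on test_idx.
# balanced = downsample_controls(train_idx, y, controls_per_case=1)
```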

Outcome windows were identified by using the difference between index date and date of most recent encounter, ignoring cases where the two are identical. A binary variable was created as the outcome of interest: suicide death within 7 days (16.7% of suicide deaths), 30 days (32.5%), 90 days (46.6%), 180 days (57.4%) and 365 days (70.2%). Only the outcome of interest was varied for these tasks, while the ratio of case-to-control was constant across the cross-validation process (Supplementary Figure 1).
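
The windowed label can be illustrated with a short sketch (an assumed implementation for clarity, not the study code): a patient counts as a positive for a given horizon only if they are a suicide decedent and the gap between last encounter and index date falls within that horizon; identical dates are dropped, as noted above.

```python
# Assumed illustration of the windowed outcome definition described above.
from datetime import date

HORIZONS_DAYS = (7, 30, 90, 180, 365)  # horizons evaluated in the study

def windowed_outcome(is_suicide: bool, last_encounter: date, index_date: date,
                     horizon_days: int):
    """True/False label for one horizon; None when the two dates are identical."""
    gap = (index_date - last_encounter).days
    if gap == 0:
        return None                     # identical dates are ignored
    return bool(is_suicide and gap <= horizon_days)

# Example: a decedent last seen 45 days before death is a positive for the
# 90-day horizon but not for the 7- or 30-day horizons.
print([windowed_outcome(True, date(2019, 1, 1), date(2019, 2, 15), h) for h in HORIZONS_DAYS])
```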

Finally, clinical cohort types (i.e., stratified samples) were contrasted to document meaningful differences in performance based on clinical subgroupings: patients aged 65 or older, aged 17 or younger, having 1+ ICD-10-CM coded social need, having 1+ ICD-10-CM coded psychiatric condition (among those identified for interaction terms), having 1+ ICD-10-CM coded condition contributing to the Charlson comorbidity index (33), 1+ emergency room visit, 1+ all-cause hospitalization, and 1+ psychiatric hospitalization during the observation period. For these strata, performances may differ both in terms of clinical characteristics and the relative ratios of case-to-control; however, these strata highlight how expected values change as a function of clinical use case.

Results

Several patient characteristics were summarized across each of the predictive modeling tasks, including demographics and select diagnostic information used for establishing our interaction terms (Table 1). Compared with either randomly selected living control group, suicide decedents were somewhat older (mean age 49.7 years) and considerably more likely to be male (76.9%), to have 1 or more depressive disorder diagnoses (40.3%), bipolar disorder (11.9%), anxiety disorders (32.2%), psychotic disorders (8.1%), post-traumatic stress (13.3%), ADHD (7.5%), prior suicide ideation or attempt (18.2%), opiate use disorder (9.9%), or alcohol use disorder (19.6%). However, compared with decedents whose manner of death was other than suicide (mean age 64.7 years; 59.3% male; 5.4% prior suicide ideation or attempt), few of these differences were as pronounced, with the exceptions of average age, male sex, and prior suicide ideation or attempt (Table 1).

Table 1. Specification of the population cohorts used for predictive modeling tasks, by status as suicide case.

The average estimated performance for each contrast in the training sample ratio and prospective time window tasks was measured using AUROC, AUPRC, PPV, and sensitivity (Table 2) and further visualized with 95% confidence intervals (Figure 1). For both tasks (i.e., varying sampling ratios and time windows), AUROC did not change significantly from the initial fit (i.e., 1-to-1 training ratio and 7-day prospective window) to using the full control group denominator in training and prospective suicide prediction without a window. The one exception was for non-suicide decedents, and only in the prospective time window task, where AUROC improved by roughly 0.1 (i.e., from 0.737 to 0.832; Table 2).

Table 2. Average cross-validated performance of models based on sampling ratios and time windows when compared to a base model.

Average model performance is plotted with 95% confidence interval (AUROC, AUPRC, PPV, Sensitivity) for different training sample ratios (top) and time windows (bottom). Three groups are compared: Decedents (green), HSCRC (orange) and MHCC (purple).

Figure 1. Mean and 95% confidence interval of predictive model performances contrasting suicide decedents with different control groups based on sampling ratios (top row) and time windows (bottom row).

AUPRC improved significantly for all data sources when moving toward the full denominator in training and toward predicting any prospective suicide. For non-suicide decedents, AUPRC shifts from 0.162 when the training composition is 1-to-1 to 0.246 for the full denominator. AUPRC shifts from 0.045 to 0.246, and from 0.111 to 0.593, for living controls of HSCRC and MHCC, respectively. As overall AUPRC increases across this progression, PPV (precision) increases while sensitivity decreases, at different rates for each data source (Table 2).

AUPRC, PPV, and sensitivity tended to increase uniformly with window size, suggesting no precision-recall trade-off exists for these horizons. For time windows of both non-suicide decedents and HSCRC’s living controls, a slight tendency was observed for point-sensitivity to peak between 6 and 12 months before declining for any prospective suicide. This reduced sensitivity corresponds to the addition of suicide cases that have a gap greater than 12 months between last encounter and date of death and might therefore reflect patients for whom we have insufficient clinical information or no routine care. Indeed, roughly 34.5% of suicide decedents had dates of death beyond 12 months from their last clinical encounter (Supplementary Table 1).

Performances of suicide prediction models for stratified samples were also measured along with full denominator counts and respective suicide rates (Table 3). Model performances, along with their respective 95% confidence intervals, were also visualized for each data source (Figure 2). AUROC was not significantly different from the full cohort estimates across all three data sources, except for patients with a history of psychiatric hospitalization where it was significantly lower (non-suicide decedents: 0.606; living HSCRC: 0.740; and living MHCC: 0.509). Considerable variability was observed in AUROC for both psychiatric patients and those aged younger than 18 years (Table 3).

Table 3. Performance of models contrasting stratified samples by source of data.

Average model performance is plotted with 95% confidence interval (AUROC, AUPRC, PPV, Sensitivity) to reflect calibration for specific sub-populations within Decedents (green), HSCRC (orange), and MHCC (purple). Each trace includes a horizontal dashed line to depict average full cohort performance.

Figure 2. Mean and 95% confidence interval of predictive model performances contrasting suicide decedents with different control groups based on select stratified samples*. * Horizontal dashed line for each trace represents performances for the full cohort in each data source.

Compared with total cohort performance, patients with 1 or more psychiatric conditions were found to have significantly higher AUPRC (non-suicide decedents: 0.293; living HSCRC: 0.349; and living MHCC: 0.781) and point-sensitivities (0.129, 0.203, and 0.649), but not point-precision (0.579, 0.730, and 0.934). Notably, the rates of suicide death increased for both HSCRC (from 229 to 947 per 100,000) and MHCC (from 228 to 1,405 per 100,000), but changed minimally for non-suicide decedents (from 4,090 to 3,855 per 100,000). A similar case was observed for patients with 1 or more ICD-10-CM coded social needs. AUPRC and point-sensitivity were significantly higher compared with the total cohort for decedents (0.303 and 0.226, respectively) and living HSCRC (0.479 and 0.361), and just point-sensitivity for living MHCC (0.660; Table 3).

Overall, AUPRC was greater than that of the total sample for patients with social needs, psychiatric conditions, emergency visits, and all-cause hospital stays, and only for the data sources using living controls (Table 3). Wherever AUPRC was significantly higher, point-sensitivity was higher as well, and PPV was generally lower than or on par with total sample performance (Figure 2). Though only 8 samples were evaluated using this approach (after excluding rare psychiatric hospitalizations), AUPRC was not significantly associated with rate of suicide, except where models were derived from MHCC records (r = 0.759, p < 0.05). In all three sources of data, point-estimated sensitivity was strongly and positively correlated with rate of suicide (decedents: r = 0.828, p < 0.05; HSCRC: r = 0.938, p < 0.05; MHCC: r = 0.869, p < 0.05), but PPV was not. However, the rate of suicide death was uniformly the greatest among patients with a psychiatric hospitalization, making this stratum an outlier for these correlations; it was therefore excluded from the aforementioned statistics.

Discussion

Predicting suicide death is often a challenging task due to the extreme class imbalance between suicide deaths and the rest of the population (1–3). The low PPV of suicide death prediction models often results in large numbers of false positives being identified as suicidal, thus limiting the generalizability as well as the cost-effectiveness of such models in clinical settings (1, 4). To address these issues, this study evaluated the precision-recall tradeoff under different circumstances, using multiple samples of differing size and scope of features, but an identical cut point for classifying the response. Each model was fitted using a cross-validation framework to parameterize uncertainty and illustrate trends for different rates of suicide during training, observation periods, and clinical subpopulations.

Several findings were notable in this study. First, the practice of systematically undersampling control groups to account for substantial imbalance in real-world suicide events should be viewed with caution (1, 44). Many researchers using machine learning algorithms encounter this concern but then fail to consider trade-offs in precision and recall, noting instead that the AUROC does not change much with undersampling. Negatively biased AUPRC was observed in all three data sources as a consequence of undersampling, along with progressively worsened precision (Tables 2, 3). Risk estimates obtained this way will tend to overstate true risk, leading to many false positives and poor application in decision support (e.g., alert fatigue).

Second, and conversely, relying on imbalanced samples for training and test cases favored point-precision at the cost of missed events. Furthermore, the progressive decay in sensitivity was very different for each of the three sources of data (Table 2). For example, compared to the decedents and HSCRC, estimates in the MHCC had the shallowest decay in sensitivity. Such a difference can be attributed both to the large number of living controls in MHCC and to the availability of pharmacy records alongside other clinically relevant information (e.g., accounting for an interaction between medication and psychiatric condition among living individuals). In other words, predictions can achieve enhanced precision and sensitivity simultaneously when an appropriate imbalance between cases and controls is used and more comprehensive clinical attributes predictive of suicide are available.

Third, temporal proximity to date of death and availability of clinical information can affect precision-recall performance. The more temporally distant death is from the last encounter, the better the AUPRC and point-precision of estimates (Table 2). However, when all suicide deaths, including those beyond 12 months, are considered, the improvements plateau or reverse. We hypothesize that the roughly 34% of suicide decedents with no encounters within a year of their death likely represent a portion of the cohort with poor healthcare access and very low utilization (Supplementary Table 1). This calls into question the completeness of their clinical information and the appropriateness of their inclusion in the prediction task. After all, if a decedent’s record consisted only of a single urgent care visit, we can hardly expect much information by way of mental health screening or treatment. These information gaps are attributable to data quality or missingness and pose a challenge for statistical risk assessment (45).

Finally, all else being equal, the clinical characteristics of a population can substantially affect model performance, making it a challenge in the first place to identify a one-size-fits-all risk model. AUPRC is significantly improved by subsetting each data source to patients having one or more psychiatric diagnoses (Table 3). This is hardly surprising, since the rate of suicide death is higher among psychiatric patients (and therefore the outcome is less imbalanced), but sensitivity appears to improve more than precision, mirroring the same trend as training sample imbalance (i.e., more balance, more sensitivity). Those with social needs, emergency room visits, or hospitalizations seem to fit this pattern in the HSCRC and MHCC cohorts as well (Table 3).

Our study findings are largely consistent with those of a meta-analysis conducted by Belsher et al. and with observations by others (1, 2, 10, 11). Cross-validated precision is improved by increasing the classification threshold for risk (e.g., from the 95th to the 99th percentile of risk), but is also tied to control group size. By increasing class imbalance in the training data and thereby decreasing the rate of suicide, our model learned a response that yields better PPV in test data. The reverse is also true in our analysis, in which sensitivity is improved by lowering the classification threshold (e.g., from the 99th to the 95th percentile of risk) and decreasing control group imbalance (i.e., an increased rate of suicide). The main difference between our experiment and Belsher's meta-analysis is that we used a static classification threshold for all analyses (i.e., a response likelihood above 0.5). To our knowledge, this analysis is the only endeavor to identify a trend of AUPRC improving with increasing imbalance in suicide mortality, whether through modifying training sample imbalance, changing the outcome horizon, or selecting a clinical substratum for model development.

The findings of this study can be used to increase the generalizability of suicide prediction models; however, this study has a few limitations. First, although the data used in this study (i.e., the MSDW) are considered the gold standard for identifying suicide cases in Maryland, a considerable number of deaths are categorized as ‘undetermined’ by the OCME even when suicide is suspected in post-mortem findings. Our prior work has suggested that a disproportionate number of these cases affect minority populations and those with limited access to care, thus our results may not show the true effect of class imbalance in such populations (46, 47). Additionally, the MHCC data were limited to commercial claim records only, potentially biasing our sample towards higher-income, working-age individuals.

Thus, our findings might require replication in an external, nationally representative source of data. Second, we identified the list of features used in our prediction models using expert feedback and a review of the literature; however, a more systematic review of the literature may result in a different set of predictors. Future research should explore the effect of other data types, especially ones not represented in our data (e.g., lab results), on the effect of class imbalance on predicting suicide. Furthermore, administrative claims and discharge records lack any characterization of symptom severity, toxicology findings, screening, assessment, or clinical narratives. We plan to both leverage and validate electronic health records containing such relevant predictors in our future work. Third, we used a regression model to assess the effect of sampling, time windows, and population strata on the performance and generalizability of the models. Future research should test these effects on non-parametric and ensemble AI algorithms, and examine how fine-tuning such models affects the outcomes of interest. We conducted a parallel set of analyses using an extreme gradient-boosted tree and found very similar results, especially the incremental improvement in AUPRC (Supplementary Table 2), suggesting that these class imbalance findings translate to other families of algorithms. Finally, our study focused only on model generalizability and performance as they relate to class imbalance challenges and did not consider the potential effect on the cost-effectiveness of such models in practice (48). Indeed, the balance of costs has not yet been explored thoroughly in our work but might be documented by varying the weight of recall relative to precision as a series of F-beta statistics to support stakeholder decisions on whether to adopt a statistical model, as sketched below (49, 50). When there is a clear idea of the costs associated with type I and type II errors, the same model can be viewed as working well for some clinical populations over others. Put differently, the tolerance for misclassifying cases of suicide will depend on what kinds of interventions and routine care are involved, as well as the size of the cohort and the timing of risk.
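
As an illustration of the F-beta idea (an assumed sketch, not an analysis performed in this study), a beta greater than 1 weights recall (missed suicides) more heavily than precision (false alerts), so a stakeholder's cost tolerance can be expressed as a choice of beta.

```python
# Assumed illustration of F-beta: beta > 1 favors recall, beta < 1 favors precision.
def f_beta(precision: float, recall: float, beta: float) -> float:
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A hypothetical operating point (precision 0.30, recall 0.10) scored under
# precision-weighted (0.5), balanced (1.0), and recall-weighted (2.0) settings.
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(0.30, 0.10, beta), 3))
```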

Conclusion

Prediction of rare events, like death by suicide, in real-world observational data is challenging because the probability of the event is far lower than that of non-events at every level of risk. In a series of experiments, we have demonstrated that AUPRC is improved by training models under the same imbalance as the eventual use case (i.e., the test data) and by maximizing the time horizon over which outcomes are predicted. A careful selection of training samples is required for enhanced model precision. Higher cohort-specific rates of death (i.e., greater class balance) tended to result in highly sensitive performance, but often at the cost of lower average precision. Finally, the precision and recall of fitted models are affected by idiosyncratic risks, such that different points of care or patient populations might benefit from specially tuned models to fit their use case.

Data availability statement

The datasets presented in this article are not readily available because previously established agreements for data use and access with the respective custodians and providers prohibit use and sharing of these data beyond the scope of our original research. Requests to access the datasets should be directed to ckitche2@jh.edu.

Ethics statement

The studies involving humans were approved by Institutional Review Board - Johns Hopkins School of Public Health. The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required from the participants or the participants’ legal guardians/next of kin in accordance with the national legislation and institutional requirements.

Author contributions

CK: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Validation, Visualization, Writing – original draft, Writing – review & editing. AB: Conceptualization, Data curation, Investigation, Methodology, Writing – review & editing. PN: Conceptualization, Data curation, Funding acquisition, Investigation, Supervision, Validation, Writing – review & editing. HW: Data curation, Funding acquisition, Investigation, Supervision, Validation, Writing – review & editing. HK: Conceptualization, Data curation, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Validation, Writing – original draft, Writing – review & editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This work is primarily supported by awards (R01MH124724; R56MH117560) from the National Institute of Mental Health.

Conflict of interest

The authors declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that Generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyt.2025.1679618/full#supplementary-material

References

1. Belsher BE, Smolenski DJ, Pruitt LD, Bush NE, Beech EH, Workman DE, et al. Prediction models for suicide attempts and deaths: a systematic review and simulation. JAMA Psychiatry. (2019) 76:642–51. doi: 10.1001/jamapsychiatry.2019.0174

2. McHugh CM, Corderoy A, Ryan CJ, Hickie IB, and Large MM. Association between suicidal ideation and suicide: meta-analyses of odds ratios, sensitivity, specificity and positive predictive value. BJPsych Open. (2019) 5:e18. doi: 10.1192/bjo.2018.88

3. Barak-Corren Y, Castro VM, Nock MK, Mandi KD, Madsen EM, Seiger A, et al. Validation of an electronic health record-based suicide risk prediction modeling approach across multiple health care systems. JAMA Netw Open. (2020) 3:e201262. doi: 10.1001/jamanetworkopen.2020.1262

4. Bentley KH, Zuromski KL, Fortgang RG, Madsen EM, Kessler D, Hyunjoon L, et al. Implementing machine learning models for suicide risk prediction in clinical practice: focus group study with hospital providers. JMIR Form Res. (2022) 6:e30946. doi: 10.2196/30946

5. Kessler RC, Hwang I, Hoffmire CA, McCarthy JF, Petukhova MV, Rosellini AJ, et al. Developing a practical suicide risk prediction model for targeting high-risk patients in the Veterans health Administration. Int J Methods Psychiatr Res. (2017) 26:e1575. doi: 10.1002/mpr.1575

6. Kessler RC, Bauer MS, Bishop TM, Bossarte RM, Castro VM, Demler OV, et al. Evaluation of a model to target high-risk psychiatric inpatients for an intensive post discharge suicide prevention intervention. JAMA Psychiatry. (2023) 80(3):230–40. doi: 10.1001/jamapsychiatry.2022.4634

7. Corke M, Mullin K, Angel-Scott H, Xia S, and Large M. Meta-analysis of the strength of exploratory suicide prediction models; from clinicians to computers. BJPsych Open. (2021) 7:e26, 1–11. doi: 10.1192/bjo.2020.162

8. Matarazzo B, Brenner LA, and Reger MA. Positive predictive values and potential success of suicide prediction models. JAMA Psychiatry. (2019) 76:869–70. doi: 10.1001/jamapsychiatry.2019.1519

9. Kitchen C, Zirikly A, Belouali A, Kharrazi H, Nestadt P, and Wilcox HC. Suicide death prediction using the maryland suicide data warehouse: A sensitivity analysis. Arch Suicide Res. (2024) 2024:1–15. doi: 10.1080/13811118.2024.2363227

10. Simon GE, Johnson E, Lawrence JM, Rossom RC, Adhmedani B, Lynch FL, et al. Predicting suicide attempts and suicide deaths following outpatient visits using electronic health records. Am J Psychiatry. (2018) 175:951–60. doi: 10.1176/appi.ajp.2018.17101167

11. Simon GE, Shortreed SM, and Coley RY. Positive predictive values and potential success of suicide prediction models. JAMA Psychiatry. (2019) 76:868–9. doi: 10.1001/jamapsychiatry.2019.1516

12. Nock MK, Millner AJ, Ross EL, Kennedy CJ, Al-Suwaidi M, Barak-Corren Y, et al. Prediction of suicide attempts using clinician assessment, patient self-report, and electronic health records. JAMA network Open. (2022) 5:e2144373–e2144373. doi: 10.1001/jamanetworkopen.2021.44373

13. Kusuma K, Larsen M, Quiroz JC, Gillies M, Burnett A, Qian J, et al. The performance of machine learning models in predicting suicidal ideation, attempts, and deaths: A meta-analysis and systematic review. J Psychiatr Res. (2022) 155:579–88. doi: 10.1016/j.jpsychires.2022.09.050. ISSN 0022-3956.

14. Luu J, Borisenko E, Przekop V, Patil A, Forrester JD, and Choi J. Practical guide to building machine learning-based clinical prediction models using imbalanced datasets. Trauma Surg Acute Care Open. (2024) 9:e001222. doi: 10.1136/tsaco-2023-001222

15. Hancock JT, Khoshgoftaar TM, and Johnson JM. Evaluating classifier performance with highly imbalanced Big Data. J Big Data. (2023) 10:42. doi: 10.1186/s40537-023-00724-5

16. Kong J, Kowalczyk W, Menzel S, and Bäck T. Improving imbalanced classification by anomaly detection. In: Bäck T, et al, editors. Parallel Problem Solving from Nature – PPSN XVI. PPSN 2020. Lecture Notes in Computer Science(), vol. 12269 . Springer, Cham (2020). doi: 10.1007/978-3-030-58112-1_35

17. Cartus AR, Samuels EA, Cerdá M, and Marshall BDL. Outcome class imbalance and rare events: An underappreciated complication for overdose risk prediction modeling. Addiction. (2023) 118:1167–76. doi: 10.1111/add.16133

18. Saito T and Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. (2015) 10:e0118432. doi: 10.1371/journal.pone.0118432

19. Harrell F. Classification vs. Prediction. In: Blog (2020). (Nashville TN: Vanderbilt University). Available online at: https://www.fharrell.com/post/classification/ (Accessed April 27, 2022).

20. Hassanat AB, Tarawnch AS, Altarawneh GA, and Almuhaimeed A. Stop oversampling for class imbalance learning: A critical review. ArXiv:2202.03579v2. (2022) 1–19. doi: 10.48550/arXiv.2202.03579

21. Yang J, Triendl H, Soltan AAS, Prakash M, and Clifton DA. Addressing label noise for electronic health records: insights from computer vision for tabular data. BMC Med Inform Decis Mak. (2024) 24:183. doi: 10.1186/s12911-024-02581-5

22. GianFrancesco MA, Tamang S, Yazdany J, and Schmajuk G. Potential biases in machine learning algorithms using electronic health record data. JAMA Intern Med. (2018) 178:1544–7. doi: 10.1001/jamainternmed.2018.3763

23. Fernández A, García S, Galar M, Prati RC, Krawczyk B, and Herrera F. Learning from imbalanced data sets Vol. 10. (Cham Switzerland: Springer) (2018).

24. Branco P, Torgo L, and Ribeiro RP. A survey of predictive modeling on imbalanced domains. ACM Computing Surveys (CSUR). (2016) 49:1–50. doi: 10.1145/2907070

25. Walsh CG, Johnson KB, Ripperger M, Sperry S, Harris J, Clark N, et al. Prospective validation of an electronic health record–based, real-time suicide risk model. JAMA Netw Open. (2021) 4:e211428. doi: 10.1001/jamanetworkopen.2021.1428

26. CRISP Health (CRISP). Clinical data. Available online at: https://crisphealth.org/applications/clinical-data/ (Accessed April 27, 2022).

27. Health Services Cost Review Commission (HSCRC). Available datasets for public use. Available online at: http://www.hscrc.state.md.us/Pages/data.aspx (Accessed April 27, 2022).

28. Maryland Department of Health and Mental Hygiene (DHMH). Maryland Health Care Commission (MHCC). Available online at: http://mhcc.maryland.gov/ (Accessed April 27, 2022).

29. Maryland Department of Health Office of Chief Medical Examiner (OCME). OCME annual reports. Available online at: https://health.maryland.gov/come/Pages/Home.aspx (Accessed April 27, 2022).

30. HCUP Clinical Classifications Software Refined (CCSR) for ICD-10-CM diagnoses, v2021.2. Healthcare Cost and Utilization Project (HCUP). Rockville, MD: Agency for Healthcare Research and Quality. Available online at: www.hcup-us.ahrq.gov/toolssoftware/ccsr/dxccsr.jsp (Accessed April 4, 2021).

31. HCUP Procedure Classes Refined for ICD-10-PCS, v2021.2. Healthcare Cost and Utilization Project (HCUP). Rockville, MD: Agency for Healthcare Research and Quality (2021). Available online at: www.hcup-us.ahrq.gov/toolssoftware/procedureicd10/procedure_icd10.jsp (Accessed April 2, 2021).

32. HCUP Clinical Classifications Software (CCS) for Services and Procedures, v2020.1. Healthcare Cost and Utilization Project (HCUP). Rockville, MD: Agency for Healthcare Research and Quality. (2020). Available online at: www.hcup-us.ahrq.gov/toolssoftware/ccs_svcsproc/ccssvcproc.jsp.

33. Gasparini A. Comorbidity: An R package for computing comorbidity score. J Open Source Software. (2018) 3:648. doi: 10.21105/joss.00648

34. U.S. Department of Health and Human Services. U.S. Food & Drug Administration, openFDA (2022). Available online at: https://open.fda.gov/ (Accessed April 27, 2022).

35. Arons A, DeSilvey S, Fichtenberg C, and Gottlieb L. Compendium of Medical Terminology Codes for Social Risk Factors. San Francisco, CA: Social Interventions Research and Evaluation Network (2018). Available online at: https://sirenetwork.ucsf.edu/sites/default/files/Compendium%2520Social%2520Risk%2520Factors%2520Codes%25206.20.18.xlsx (Accessed June 8, 2023).

36. Otani T and Takahashi K. Flexible scan statistics for detecting spatial disease clusters: the rflexscan R package. J Stat Software. (2021) 99:1–29. doi: 10.18637/jss.v099.i13

37. Franklin JC, Ribeiro JD, Fox KR, Bentley KH, Kleiman EM, Huang X, et al. Risk factors for suicidal thoughts and behaviors: A meta-analysis of 50 years of research. Psychol Bull. (2017) 2:187–232. doi: 10.1037/bul0000084

38. Belouali A, Kitchen C, Zirikly A, Nestadt P, Wilcox HC, and Kharrazi H. Identifying and characterizing suicide decedent subtypes using deep embedded clustering. Sci Rep. (2025) 15:23069. doi: 10.1038/s41598-025-07007-4

39. Flores JP, Desjardins MM, Kitchen C, Belouali A, Kharrazi H, Wilcox HC., et al. Use of vital records to improve identification of suicide as manner of death for opioid-related fatalities. Crisis. (2025). doi: 10.1027/0227-5910/a001033

40. Centers for Disease Control and Prevention. Web-based Injury Statistics Query and Reporting System (WISQARS) Fatal Injury Reports (2020). Available online at: https://webappa.cdc.gov/sasweb/ncipc/mortrate.html (Accessed November 12, 2024).

41. US Department of Health and Human Services, Centers for Disease Control and Prevention. Facts about suicide (2022). Available online at: https://www.cdc.gov/suicide/suicide-data-statistics.html.

42. Amini P, Ahmadinia H, Poorolajal J, and Moqaddasi Amiri M. Evaluating the high risk groups for suicide: A comparison of logistic regression, support vector machine, decision tree and artificial neural network. Iran J Public Health. (2016) 45:1179–87.

43. McCarthy JF, Bossarte RM, Katz IR, Thompson C, Kemp J, Hannemann CM, et al. Predictive modeling and concentration of the risk of suicide: implications for preventive interventions in the US department of veterans affairs. Am J Public Health. (2015) 105:1935–42. doi: 10.2105/AJPH.2015.302737

44. Ryu S, Lee H, Lee DK, and Park K. Use of a machine learning algorithm to predict individuals with suicide ideation in the general population. Psychiatry Investig. (2018) 15:1030–6. doi: 10.30773/pi.2018.08.27

45. Kharrazi H, Wang C, and Scharfstein D. Prospective EHR-based clinical trials: the challenge of missing data. J Gen Internal Med. (2014) 29:976–78. doi: 10.1007/s11606-014-2883-0

46. Adams LB, Kitchen CA, Nestadt P, Thorpe RJ, Boyd R, Kharrazi H, et al. Racial differences in suicide and undetermined deaths in Maryland. JAMA Psychiatry. (2025) 82(10):1020–4. doi: 10.1001/jamapsychiatry.2025.1907

47. Adams LB, Brooks Stephens JR, Cubbage J, and Bernard DL. Racial, ethnic, and cultural expressions of interpersonal psychological theory of suicide (RECEIPTS): An integrated model of structural racism and suicide risk. Am Psychol. (2025). doi: 10.1037/amp0001545

48. Ross EL, Zuromski KL, Reis BY, Nock MK, Kessler RC, and Smoller JW. Accuracy requirements for cost-effective suicide risk prediction among primary care patients in the US. JAMA Psychiatry. (2021) 78:642–50. doi: 10.1001/jamapsychiatry.2021.0089

49. Sokolova M, Japkowicz N, and Szpakowicz S. Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation. In: AI 2006: Advances in Artificial Intelligence, Lecture Notes in Computer Science, (Heidelberg Germany: Springer) vol. 4304. (2006). p. 1015–21. doi: 10.1007/11941439_114

50. Brownlee J. A gentle introduction to the Fbeta-measure for machine learning. Mach Learn Mastery. (2020). Available online at: https://machinelearningmastery.com/fbeta-measure-for-machine-learning/ (Accessed July 25, 2025).

Keywords: suicide risk estimation, classification metrics, class imbalance, machine learning, clinical decision support

Citation: Kitchen C, Belouali A, Nestadt PS, Wilcox HC and Kharrazi H (2026) Navigating extreme class imbalance in suicide risk prediction. Front. Psychiatry 16:1679618. doi: 10.3389/fpsyt.2025.1679618

Received: 04 August 2025; Revised: 19 November 2025; Accepted: 15 December 2025;
Published: 12 January 2026.

Edited by:

Wulf Rössler, Charité University Medicine Berlin, Germany

Reviewed by:

Honglei Yin, Southern Medical University, China
Ian Cero, University of Rochester Medical Center, United States

Copyright © 2026 Kitchen, Belouali, Nestadt, Wilcox and Kharrazi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Christopher Kitchen, ckitche2@jhu.edu
