Fatal crashes and rare events logistic regression: an exploratory empirical study

Objective: Fatal road accidents are statistically rare, posing challenges for accurate estimation through the classic logit model (LM). This study seeks to validate the efficacy of a rare events logistic model (RELM) in enhancing the precision of fatal crash estimations. Methods: Both LM and RELM were employed to examine the relationship between pertinent risk factors and the incidence of fatal crashes. Crash-injury datasets sourced from Hillsborough County, Florida, served as the empirical basis for evaluating the performance metrics of both LM and RELM. Results: The analysis revealed that RELM yielded more accurate predictions of fatal crashes compared to LM. Receiver operating characteristic (ROC) curves were constructed, and the area under the curve (AUC) for each model was computed to offer a comparative performance assessment. The empirical evidence notably favored RELM over LM, as substantiated by superior AUC values. Conclusion: The study offers empirical validation that RELM is demonstrably more proficient in predicting fatal crashes than the LM, thereby recommending its application for nuanced traffic safety analytics.


Introduction
The persistently high mortality rates from traffic crashes have intensified their classification as a significant global public health issue (1, 2). According to the World Health Organization (3), fatalities attributed to traffic crashes witnessed a 25% increase, rising from 1.08 million in 1990 to 1.35 million in 2016. This uptick not only represents a societal tragedy but also imposes considerable economic strain on communities and families.
Among the models utilized, the binary logit model (LM) is predominant. However, this approach has limitations when dealing with rare events, such as fatal crashes. For instance, the Hong Kong Transport Department's statistics from 2015 reveal that, of 16,170 injury-related crashes, only 117 were fatal, representing a meager 0.72% of the total dataset (17). Extant literature corroborates that LM tends to significantly underestimate the occurrence of such rare events (18).
Against this empirical backdrop, the present study deploys a rare events logistic model (RELM) to enhance the precision of fatal crash estimations. The RELM has been successfully applied in other domains such as geomorphology, social science, and epidemiology (19-21). To the authors' best knowledge, this study involves the inaugural application of RELM in the specific field of fatal crash estimation.

Methodology

Logit model
Logistic regression is the predominant method employed in analyses of crash injury severity. To model the relationship between fatal crashes and the associated risk factors, we define the outcome variable y_i for the ith crash as binary: y_i = 1 signifies a fatal crash, while y_i = 0 indicates a non-fatal crash. The probability that y_i = 1, denoted Pr(y_i = 1), is calculated using the logistic function:

Pr(y_i = 1) = π_i = 1 / (1 + e^(−x′_i β))    (1)

In Equation (1), the exponent x′_i β is the linear combination of predictor variables, known as the utility function, which is expressed as:

x′_i β = β_0 + β_1 x_1i + … + β_K x_Ki    (2)

where x_ki represents the value of the kth variable for the ith observation and β_k is the corresponding coefficient.

Alternatively, one may conceptualize the problem using a latent variable y*_i, which signifies the propensity for a crash to be fatal. This latent variable follows a logistic distribution, which, despite its mathematical distinctiveness, is practically akin to a normal distribution. The impact of the predictors x_i is typically assessed by regressing them against this unobserved variable. The determination of the crash outcome (fatal or otherwise) is contingent upon whether the propensity surpasses a specified threshold. As highlighted by King and Zeng (19), this threshold mechanism introduces a primary source of bias in the presence of rare events. The logistic regression coefficients β are estimated by the maximum-likelihood method applied across a dataset comprising n observations:

ln L(β | y) = Σ_{i=1}^{n} [ y_i ln π_i + (1 − y_i) ln(1 − π_i) ]    (3)

It is imperative to acknowledge that, in the analysis of rare events data, additional occurrences of the event of interest (coded as "1") provide greater informational value than non-occurrences (coded as "0"). During the estimation phase, the standard error of the estimated coefficients β is derived from the variance:

V(β̂) = [ Σ_{i=1}^{n} π_i (1 − π_i) x′_i x_i ]^(−1)    (6)

In Equation (6), the summation Σ_{i=1}^{n} π_i(1 − π_i) is notably influenced by the rarity of the event under study. The term π_i(1 − π_i) attains its maximum when π_i = 0.5 and approaches zero as π_i converges to either extremity of the probability spectrum. Given that rare events data typically yield minuscule estimates of π_i for all observations, these estimates will be substantially smaller than 0.5. Nonetheless, if the logit model possesses explanatory significance, the estimated probabilities π_i corresponding to occurrences of "1" will be markedly higher than those associated with "0" and will lie nearer to the apex of informational value at 0.5. Consequently, additional occurrences of "1" are more informative for the model than additional occurrences of "0".
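As an illustrative sketch (not the authors' code), the logistic probability of Equation (1) and the log-likelihood of Equation (3) can be evaluated directly with NumPy; the toy design matrix and coefficients below are hypothetical:

```python
import numpy as np

def logistic_prob(X, beta):
    # Equation (1): Pr(y_i = 1) = 1 / (1 + exp(-x'_i beta))
    return 1.0 / (1.0 + np.exp(-X @ beta))

def log_likelihood(beta, X, y):
    # Equation (3): sum over i of y_i ln(pi_i) + (1 - y_i) ln(1 - pi_i)
    pi = logistic_prob(X, beta)
    return float(np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi)))

# Hypothetical toy data: a column of ones for the intercept plus one predictor.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
beta = np.array([-2.0, 1.0])
pi = logistic_prob(X, beta)  # utilities -2, -1, 0 map to probabilities below 0.5, up to 0.5
```

A maximum-likelihood fit would then search for the β that maximizes `log_likelihood`, which is what standard logit software does internally.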

Rare events logistic model
To ameliorate the estimation bias attributed to the use of LM on rare events data, King and Zeng (18) introduced the RELM. RELM not only mitigates underestimation bias but also enhances the efficiency of data collection and reduces the requirements for data storage space during the sample selection phase.

Sample selection
As highlighted in the preceding discussion, the LM exhibits suboptimal performance when instances of y_i = 1 are infrequent within the dataset. To address this limitation, a strategic alteration in data collection is proposed. By archiving all observations where a fatal crash occurred (y_i = 1) and a random subset of non-fatal crash observations (y_i = 0), we can refine the accuracy of the standard logit model's estimations.
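A minimal sketch of this case-control sampling step, with a hypothetical `keep_ratio` parameter governing the fraction of non-fatal records retained:

```python
import numpy as np

def rare_event_sample(y, keep_ratio=0.1, seed=0):
    """Keep every fatal crash (y == 1) and a random fraction of the
    non-fatal crashes (y == 0); returns the retained row indices."""
    rng = np.random.default_rng(seed)
    ones = np.flatnonzero(y == 1)
    zeros = np.flatnonzero(y == 0)
    kept = rng.choice(zeros, size=int(len(zeros) * keep_ratio), replace=False)
    return np.sort(np.concatenate([ones, kept]))

# Hypothetical dataset: 10 fatal crashes among 1,000 observations.
y = np.array([0] * 990 + [1] * 10)
idx = rare_event_sample(y, keep_ratio=0.05)
```

Because all "1" observations are kept, their share in the selected sample rises well above the population rate, which is exactly the bias the corrections in the next subsection undo.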

Adjustment of estimates for selection bias
To correct for selection bias inherent in choice-based sampling, two primary methods are employed: prior correction and weighting correction. The subsequent sections elucidate these approaches in detail.
Research by King and Zeng (18) demonstrates that the logit model coefficients remain statistically consistent between population estimates and those derived from selected data. The objective of the prior correction method is to adjust the intercept β_0 in the logit model using the following formula:

β̃_0 = β̂_0 − ln[ ((1 − τ)/τ) · (ȳ/(1 − ȳ)) ]

where τ represents the proportion of y_i = 1 within the population, while ȳ signifies the proportion of y_i = 1 within the sampled dataset. The calculation of the probability of rare events occurrence is contingent upon accurate estimations of both β_0 and β_k, as indicated in Equation (1).
It is essential to note that the prior correction method necessitates knowledge of τ, the population proportion of y_i = 1. In the context of this study, τ can be directly ascertained from the initial dataset of crash data. A principal benefit of the prior correction method lies in its user-friendliness; it can be readily implemented with any statistical software capable of fitting standard logistic models. For instance, the study by Ren et al. (22) leveraged this method to adjust estimates concerning the influence of various factors on red-light running behavior. Next, we delineate an alternative approach that can augment the efficacy of the logistic model relative to prior correction.
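A sketch of the prior correction, following the intercept formula above; the fitted intercept is hypothetical, and `tau` uses the 0.34% fatal share reported later for the Florida data:

```python
import math

def prior_correct_intercept(beta0_hat, tau, ybar):
    # beta0_tilde = beta0_hat - ln[((1 - tau)/tau) * (ybar/(1 - ybar))]
    return beta0_hat - math.log(((1 - tau) / tau) * (ybar / (1 - ybar)))

# With a balanced selected sample (ybar = 0.5) but a rare event in the
# population (tau = 0.0034), the intercept is pulled sharply downward.
b0 = prior_correct_intercept(-0.2, tau=0.0034, ybar=0.5)
```

Note that when the sample proportion already equals the population proportion (τ = ȳ), the logarithm term vanishes and the intercept is left unchanged, as expected.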
The weighting correction involves assigning weights to the data to balance the discrepancies in the proportions of y_i = 1 between the sample and the population, which arise from choice-based sampling. This method entails optimizing a weighted log-likelihood function rather than the conventional log-likelihood function:

ln L_w(β | y) = ω_1 Σ_{y_i = 1} ln π_i + ω_0 Σ_{y_i = 0} ln(1 − π_i)

In this context, the weights are defined as ω_1 = τ/ȳ and ω_0 = (1 − τ)/(1 − ȳ), where the parameters τ and ȳ retain their definitions from the prior correction section.
Although this method may appear more complex than the prior correction technique, the weighted log-likelihood is formulated so that researchers can apply it with any standard logit software package that accepts observation weights.
Xie and Manski (23) posited that weighting correction could surpass prior correction in effectiveness when the available sample is substantial and there is mis-specification of the functional form. Conversely, Amemiya and Vuong (24) indicated that, while weighting correction may be marginally less efficient than prior correction, the difference in efficiency is typically negligible.
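The weighting correction can be sketched as follows: build per-observation weights ω_1 = τ/ȳ for events and ω_0 = (1 − τ)/(1 − ȳ) for non-events, then maximize the weighted log-likelihood. The toy numbers are hypothetical, not the study's data:

```python
import numpy as np

def case_control_weights(y, tau):
    """w_i = tau/ybar for events, (1 - tau)/(1 - ybar) for non-events."""
    ybar = y.mean()
    return np.where(y == 1, tau / ybar, (1 - tau) / (1 - ybar))

def weighted_log_likelihood(beta, X, y, w):
    # The objective to maximize in place of the ordinary log-likelihood.
    pi = 1.0 / (1.0 + np.exp(-X @ beta))
    return float(np.sum(w * (y * np.log(pi) + (1 - y) * np.log(1 - pi))))

# Selected sample with 20% events, drawn from a population where tau = 5%.
y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
w = case_control_weights(y, tau=0.05)
```

The weights rescale the selected sample so that the weighted event share matches the population rate τ, which is why any logit routine accepting observation weights can carry out the correction.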

Computing probability estimates
Subsequent to implementing the prior correction and weighting methods, we adopt modifications suitable for both cohort and choice-based sampling designs in rare events logistic models. The bias in the estimated coefficients β̂ is appraised using the weighted least-squares method, formulated as:

bias(β̂) = (X′WX)^(−1) X′Wξ

where ξ_i = 0.5 Q_ii [(1 + ω_1) π̂_i − ω_1] symbolizes an adjustment factor, Q_ii are the diagonal constituents of the matrix Q = X(X′WX)^(−1)X′, and W = diag{π̂_i(1 − π̂_i)ω_i}. Consequently, the adjusted coefficients β̃ are calculated as follows:

β̃ = β̂ − bias(β̂)

The final corrected probability P_i can be approximated by the following expression:

Pr(y_i = 1) ≈ π̃_i + C_i

where the correction term C_i is delineated as follows:

C_i = (0.5 − π̃_i) π̃_i (1 − π̃_i) x_i V(β̃) x′_i

Here, V(β̃) denotes the estimated variance-covariance matrix of the adjusted coefficients β̃, x_i = (1, x_1i, …, x_Ki) represents the vector of predictors, including the intercept, for the ith observation, and x′_i is its transpose. Collectively, these amendments constitute the methodology of the RELM. To the authors' knowledge, this is the first instance of applying RELM within the domain of fatal crash estimation.
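The final probability correction can be sketched in a few lines; the predictor vector `x` and variance-covariance matrix `V` below are hypothetical placeholders standing in for the fitted quantities:

```python
import numpy as np

def corrected_probability(pi_tilde, x, V):
    # C_i = (0.5 - pi) * pi * (1 - pi) * x' V x, added to the raw estimate.
    Ci = (0.5 - pi_tilde) * pi_tilde * (1 - pi_tilde) * float(x @ V @ x)
    return pi_tilde + Ci

x = np.array([1.0, 1.0])   # hypothetical (intercept, predictor) vector
V = 0.1 * np.eye(2)        # hypothetical variance-covariance matrix
p = corrected_probability(0.1, x, V)  # correction is positive when pi < 0.5
```

Since x′Vx is non-negative, the correction term raises probability estimates below 0.5 and lowers those above it, which is the direction needed to offset the rare-events underestimation described earlier.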

Data description
Data on crash-related injuries that occurred in the year 2006 in Florida were procured from the Florida Department of Highway Safety and Motor Vehicles (DHSMV). The dataset encompasses 107,464 driver-vehicle units implicated in 53,732 traffic incidents. A meager 0.34% of these incidents resulted in fatalities, highlighting their infrequency. The variables under scrutiny encompass critical attributes, such as those associated with the driver, the vehicle, the roadway, and the environmental context, as delineated in prior research (25-28). Table 1 delineates the variables and their corresponding characteristics as encapsulated within the Florida dataset.
Notably, the "speed ratio", defined as the quotient of the estimated pre-collision speed and the statutory speed limit, is posited to correlate positively with injury severity (25). Furthermore, the analysis includes "points of impact" (POIs) on the vehicle, enumerated in the Florida crash reports and illustrated in Figure 1. These POIs are categorized in alignment with the schema proposed by Huang et al. (29), where Level 1 encompasses nine POIs (nos. 1-2, 5-7, 9-10, 14, and 21) located peripherally relative to the driver's seat, such as the front and rear passenger sides. Level 2 consists of five POIs (nos. 3, 8, 11, 15, and 17) situated in closer proximity to the driver than those in Level 1. Level 3 includes POIs (nos. 4, 12-13, 18, and 20), which are nearest to the driver, comprising the windshield and the front passenger and driver sides. The final category, Level 4, is assigned to two POIs (nos. 16 and 19).

Model evaluation
In the evaluation of our models, namely the RELM and the LM, we quantify predictive performance using the area under the receiver operating characteristic curve (AUC-ROC). The AUC is a widely accepted metric for model performance evaluation, particularly in binary classification problems. It provides an aggregate measure of performance across all possible classification thresholds. The calculation of the AUC involves plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity) at various threshold settings (30). The AUC value ranges from 0 to 1, where an AUC of 1 indicates perfect predictive accuracy and an AUC of 0.5 suggests performance no better than random chance.
To estimate the AUC accurately, we employ the trapezoidal rule for numerical integration, as this method is well suited to the discrete data points that characterize an empirical ROC curve (31). Furthermore, we validate the robustness of our AUC estimates through K-fold cross-validation, which mitigates the potential for overfitting by ensuring that each observation is used for both training and validation. This process involves partitioning the data into K equal-sized segments, training the model on K − 1 segments, and validating it on the remaining segment. This is repeated K times, with each segment used exactly once for validation. The average AUC across all K iterations provides a reliable estimate of the predictive performance of the models. In this study, K was set to 5.
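A self-contained sketch of the trapezoidal AUC computation described above (tie handling in the scores is omitted for brevity; the labels and scores are synthetic):

```python
import numpy as np

def roc_auc(y_true, scores):
    """Empirical ROC AUC via the trapezoidal rule (no tie handling)."""
    order = np.argsort(-scores)   # descending scores = sweeping the threshold down
    y = y_true[order]
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / (1 - y).sum()])
    # Trapezoidal rule over the (fpr, tpr) points.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.8, 0.9])   # perfectly separating scores
auc = roc_auc(y_true, scores)
```

In the K-fold procedure, this function would simply be applied to each held-out fold's predicted probabilities and the resulting K values averaged.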

Data sampling
As previously mentioned, the initial step involves the partial extraction of the complete dataset for regression analysis. This entails retaining all instances of fatal crashes while selectively including a subset of non-fatal crashes. To ascertain the optimal proportion of "1" events in the newly constituted dataset, this study computes the coefficients employing both the prior correction and weighting correction methods, incrementing the ratio by 1% across a range from 0.05 to 0.95. The variation in classification accuracy is further assessed using two metrics: the accurate classification rate (ACR), defined as the ratio of correctly identified fatal accidents to the total number of actual fatal accidents; and the false classification rate (FCR), computed as the ratio of erroneously classified incidents to the total number of events.
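One plausible reading of these two metrics (assuming FCR counts all misclassified observations over the total, which the text leaves slightly ambiguous) can be sketched as:

```python
import numpy as np

def acr_fcr(y_true, y_pred):
    """ACR: correctly identified fatal crashes / actual fatal crashes.
    FCR (as read here): misclassified observations / all observations."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    acr = np.sum((y_true == 1) & (y_pred == 1)) / np.sum(y_true == 1)
    fcr = np.mean(y_true != y_pred)
    return float(acr), float(fcr)

# Toy example: two fatal crashes, one caught, plus one false alarm.
acr, fcr = acr_fcr([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
```

Sweeping the "1" event ratio from 0.05 to 0.95 then amounts to recomputing these two numbers for each resampled dataset.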
Figure 2 delineates the interplay between the three aforementioned variables: ACR, FCR, and the ascending fraction of "1" events in the sampled data.The depiction includes red dots representing outcomes via the prior correction method and blue stars indicating results from the weighting method.A 3D subgraph within Figure 2A visualizes the pairwise interactions among these factors, with the remaining panels (Figures 2B-D) presenting projections along different axes.

Analysis of Figure 2 reveals a close alignment between the trajectories of ACR and FCR across both correction methodologies.
A trend emerges where an elevated ACR correlates with a heightened FCR. Notably, the ACR ascends more precipitously than the FCR within the "1" event ratio spectrum from 0.05 to 0.5, while this growth rate inverts for ratios between 0.5 and 0.95.
Figure 3 presents the AUC for both methods across varying proportions of fatal to non-fatal crashes. The diagram indicates that the AUC for the prior correction method remains unaffected by the percentage of "1" events post-selection. In contrast, the weighting method demonstrates superior predictive performance at most "1" event ratios. Green stars mark the coordinates with the maximum AUC values, which inform the selection of rates for the weighting method in the rare events logistic model; specifically, 43% in the corrected dataset. For the implementation of the rare events logistic model, the Stata statistical software package was employed.

Model parameters
The parameter estimates for the RELM and the LM are consolidated in Tables 2 and 3, respectively. These tables encapsulate the significant parameters deduced from the empirical analysis, illustrating that the magnitude and direction of the coefficients for both models are largely consistent. The significance and impact of the variables, with the salient exception of the POI, are in concordance with the injury severities reported in antecedent research, notably by Zeng and Huang (26).
Our analysis of driver demographics indicates a heightened risk of fatality for older drivers following a collision, corroborating findings from the existing literature that underscore age as a critical determinant in traffic injury severity. In relation to vehicular and environmental factors, the data suggest that more recent vehicle models correlate with a reduction in injury severity, supporting the premise that advancements in vehicular safety technologies have ameliorated crash outcomes. In contrast, while operators of medium/heavy trucks exhibit a lower fatality likelihood, drivers of passenger cars show an increased propensity for fatal outcomes. This disparity may be attributable to inherent variations in vehicle safety features, structural mass, and design specifications.

Comparative analysis of classification efficacy
Table 4 delineates the predicted outcomes derived from both the RELM and the LM, incorporating statistically significant variables at the 0.05 level into the classification procedure. The predictive classifications of the models are juxtaposed against the actual incident outcomes, with Table 4 providing a comprehensive summary of these predictions. The data articulated in Table 4 highlight the superior performance of RELM in comparison with LM. A notable deficiency of LM is its significant underestimation of fatal accident risk: it fails to identify any incident as fatal. In contrast, RELM achieves an accurate classification rate of 77.7%. Despite a 12.8% increase in the false alarm rate, this tradeoff is deemed tolerable when juxtaposed against the grave implications of underestimating fatal accidents; for instance, Aguero-Valverde (32) equates the impact of one fatal crash to that of 20 property-damage-only (PDO) crashes.

Figure 3. AUC values for the weighting method and the prior correction method.

An extended evaluation of the performance of the two models was conducted through the ROC curves, as exhibited in Figure 4. The predictive accuracy for fatal and non-fatal cases is contingent upon a predetermined probability threshold. An observation is designated as a fatal accident if its predicted probability transcends this threshold; otherwise, it is categorized as non-fatal. The ROC curves graphically represent the tradeoff between the true positive rate and the false positive rate as the threshold varies from 0 to 1. The AUC for each model was computed, revealing that the ROC curve for the RELM generally resides above that of the LM for thresholds below 0.8, indicative of the enhanced predictive accuracy of RELM. Moreover, a juxtaposition of the AUC values in Figure 4 further corroborates the superiority of RELM.

Discussion
This study employs the rare events logistic model to scrutinize the relationship between various risk factors and the incidence of fatal road accidents in Florida. The analysis identifies six variables (older adult casualties, substance abuse, non-usage of safety equipment, passenger car, POI at level 3, and rural accidents) as positively correlated with driver fatalities. Conversely, five variables (vehicle age, speed ratios 1 and 2, driver at fault, and daylight incidents) exhibited a negative correlation with accident risk.
The findings unequivocally show that RELM supersedes LM in estimating fatal crash risks. As hypothesized, LM systematically underestimates these risks, a shortfall that RELM substantially rectifies, achieving an accuracy rate of ∼80%. While a slight increase in false classification is noted, this tradeoff is deemed acceptable given the enormity of losses associated with each fatal accident. The AUC values further corroborate the superior performance of RELM over LM in this context.

The findings of this study have several implications for stakeholders involved in road safety. It is recognized that annual inspections cannot alter the fundamental crashworthiness of older vehicles; however, ensuring that aging vehicles are well maintained can help mitigate risks where possible. Nevertheless, the intrinsic limitations in safety offered by older vehicle designs compared with their modern counterparts must be acknowledged. Thus, stakeholders should focus on enhancing public awareness regarding the potentially increased risks associated with older vehicles and should advocate for policies that encourage the use of vehicles with advanced safety features.

For demographic groups such as older adult drivers and men, who are statistically at greater risk, targeted safety campaigns and driving aids could be beneficial. This could involve educational initiatives that promote defensive driving techniques and raise awareness about the increased risk factors these demographics face.

Furthermore, urban planners and transportation authorities should take into account the findings regarding speed limits. While not the sole factor, the data suggest that higher speed limits can contribute to the severity of crashes. Therefore, a holistic approach to road design that incorporates traffic calming measures and considers the impact of speed on traffic incident severity is warranted. These measures could help reduce the likelihood of fatal outcomes in crashes.
This study is subject to certain constraints that warrant acknowledgment. The classification of POIs into predefined levels, a method predicated on established literature, may not capture the entirety of POIs that significantly influence crash severity. The dataset utilized provided a finite array of POIs, thereby omitting potentially crucial impact points not recorded within it. This omission could lead to a partial portrayal of crash dynamics. Moreover, spatial correlation, a factor that could yield valuable insights into the patterns and causes of fatal crashes, was not incorporated into the RELM used in this analysis. Other influential variables, such as law enforcement strategies and traffic volume data, were also not included in our dataset. The absence of these variables limits the breadth of our analysis, potentially affecting the robustness of our findings. Acknowledging these limitations, future investigative efforts in this field should endeavor to integrate a more detailed classification of POIs, alongside variables capturing spatial correlation, law enforcement efforts, and traffic metrics. Such enhancements in data collection and model sophistication would provide a more holistic understanding of the factors contributing to fatal crash outcomes.

Figure 1. An illustration of the points of impact.

Figure 2. The relationship between measurements and the ratio of rare events. (A) The relationship between the accurate classification rate, the false classification rate, and the fatal event ratio. (B) The relationship between the false classification rate and the fatal event ratio. (C) The relationship between the accurate classification rate and the fatal event ratio. (D) The relationship between the accurate classification rate and the false classification rate.

Figure 4. ROC curves for the RELM and LM methods.

Table 1. Variables contained in the dataset.

Table 2. Model parameters of the RELM.

Table 3. Model parameters of the LM.

Table 4. The prediction results of the LM and RELM.