Development of a Machine-Learning-Based Classifier for the Identification of Head and Body Impacts in Elite Level Australian Rules Football Players

Background: Exposure to thousands of head and body impacts during a career in contact and collision sports may contribute to current or later life issues related to brain health. Wearable technology enables the measurement of impact exposure. The validation of impact detection is required for accurate exposure monitoring. In this study, we present a method of automatic identification (classification) of head and body impacts using an instrumented mouthguard, video-verified impacts, and machine-learning algorithms. Methods: Time series data were collected via the Nexus A9 mouthguard from 60 elite level male (mean age = 26.33 years; SD = 3.79) and four female (mean age = 25.50 years; SD = 5.91) Australian Rules Football players from eight clubs, participating in 119 games during the 2020 season. Ground truth data labeling of the captures used in this machine learning study was performed through the analysis of game footage by two expert video reviewers using SportCode and Catapult Vision. The visual labeling process occurred independently of the mouthguard time series data. True positive captures (captures where the reviewer directly observed contact between the mouthguard wearer and another player, the ball, or the ground) were defined as hits. Spectral and convolutional kernel-based features were extracted from the time series data. The performances of untuned classification algorithms from scikit-learn, in addition to XGBoost, were assessed to select the best performing baseline method for tuning. Results: Based on performance, XGBoost was selected as the classifier algorithm for tuning. A total of 13,712 video-verified captures were collected and used to train and validate the classifier. True positive detection ranged from 94.67% in the test set to 100% in the holdout set. True negatives ranged from 95.65 to 96.83% in the test and rest sets, respectively.
Discussion and conclusion: This study suggests the potential for high performing impact classification models to be used for Australian Rules Football and highlights the importance of frequencies <150 Hz for the identification of these impacts.


INTRODUCTION
Concussion is a common injury in contact and collision sports (Donaldson et al., 2013; Gardner et al., 2014a,b; Makdissi and Davis, 2016; Dai et al., 2018; Ramkumar et al., 2019). There has been considerable medical interest in improving the identification and management of sport-related concussion (McCrea et al., 2013; McCrory et al., 2017). A number of professional sporting leagues, for example, the Australian Football League (AFL) (Davis et al., 2019a), National Football League (Ellenbogen et al., 2018; Davis et al., 2019a), National Hockey League (Davis et al., 2019a), professional rugby union (Gardner et al., 2018), and the National Rugby League (Davis et al., 2019a), have implemented sideline video surveillance as a strategy for improving the identification of concussion (Davis et al., 2019a,b). This is an important strategy. However, concern has also been raised that it may not only be concussion that poses a risk to the health of contact and collision sport athletes, but also the career-long accumulation of subconcussive impacts, which may result in current or future health issues (Gavett et al., 2011; Baugh et al., 2012). Researchers have reported that subconcussive head impacts are associated with modest elevations of blood biomarkers over a single practice session of American football (Rubin et al., 2019), and that college football players might sustain 1,000 or more subconcussive impacts to the head over the course of a season (Gysland et al., 2011; Bazarian et al., 2014). Cumulative exposure to repetitive head impacts over a single season might be a risk factor for sustaining a concussion during that season in elite American college football players (Stemper et al., 2018), but cumulative exposure to head impacts was not associated with concussion risk in high school football players (Eckner et al., 2011).
Researchers have reported that repetitive head impacts are correlated with changes on experimental brain imaging over the course of a season (Merchant-Borna et al., 2016), and cumulative repetitive head impact exposure is associated with later in life deficits in cognitive functioning and symptoms of depression (Montenigro et al., 2017).
A strategy for evaluating player impact loads as part of an injury prevention program is the use of instrumented technology (Wu et al., 2018; Patton et al., 2020). However, implementation in the field has been limited by the reliability and validity of such technology (Patton et al., 2020). Using simple peak linear acceleration thresholds to differentiate impacts from normal motion is likely to be insufficient and is fraught with challenges. For example, setting a low magnitude acceleration threshold will increase the likelihood of false positive data, whereas setting a high acceleration threshold will likely filter out some true impacts while high acceleration false positives still remain (Wu et al., 2018). In addition, there are concerns that the majority of the research using sensor-recorded events lacks a verification method to confirm the accuracy of the instrumented technology in identifying impact loads (Patton et al., 2020). As a result, the absence of a verification method to confirm sensor-recorded events and remove false positives may be a factor in the overestimation of head impact exposures (Press and Rowson, 2016; Cortes et al., 2017; Carey et al., 2019; Patton et al., 2020).
Video review, while not infallible and reliant on the skill of the reviewer, has been shown to be a reasonable method for impact detection, with the significant drawback of being labor-intensive (Caswell et al., 2017; Carey et al., 2019; Bailey et al., 2020; Patton et al., 2020). Other detection methods include using filtering algorithms or statistically modeling the impact signature/characteristics (Baugh et al., 2012) to determine whether the data are consistent with an impact. This "classification" step, together with video identification, can be used to evaluate reliability, validity, and accuracy. With the introduction of machine-learning-based models, high performance for impact identification (>90% accuracy) can be achieved (Baugh et al., 2012). However, these models tend to be trained using single sport data [e.g., American football (Baugh et al., 2012)], which may not generalize to other contact or collision sports. The Nexus A9 mouthguard is capable of capturing kinematic data from collisions, such as those that occur in contact sport. To date, there are no statistical models in the literature to identify impacts in Australian Rules Football. Therefore, the aim of this study was to develop and validate an impact classification method (classifier) for Australian Rules Football, with mouthguard events (captures) recorded using the Nexus A9 mouthguard. Given previous work in this area (Wu et al., 2018; Gabler et al., 2020), we hypothesized that it would be possible to develop a machine-learning-based classifier with high performance for delineating hit and non-hit captures.

METHODS

Study Design
This study was conducted with elite level Australian Football League (AFL) and Women's AFL (AFLW) players. Consenting participants were provided with custom-fit, instrumented mouthguards at the beginning of the season and were requested to wear them during match play. Each team was assigned an account manager, whose role was to distribute the mouthguards to the correct players before each match and then collect them post-match for cleaning, storage, and uploading of the data from the mouthguards while they were housed in the storage and recharging unit. All matches were televised via the league's contracted broadcasters, and footage from the broadcasters was reviewed as part of the video verification process (described in detail below). This study was approved by the University of Newcastle Human Ethics Committee (H-2019-0341).

Data Collection
Data for the classifier were collected from 64 elite level athletes from eight clubs across 119 matches for which consenting players were participating during the 2020 Australian Football League (AFL) season. There were 60 male AFL players (mean age = 26.33; SD = 3.79) and four female players from the Women's AFL (AFLW; mean age = 25.50; SD = 5.91). A total of 21,348 potential impacts (captures) were generated of which 13,744 were used for training and validation purposes (see Data Preprocessing).

Mouthguard Specifications
The HitIQ Nexus A9 instrumented mouthguard (HitIQ Pty. Ltd.) used in this study contained three triaxial accelerometers (Analog Devices ADXL372, range: ±200 g, 12-bit) and a gyroscope (Bosch BMG250, ±2,000 dps range, 16-bit), sampled at 3,200 and 800 Hz, respectively. The circuit board and components such as the battery and antenna system were embedded in the mouthguard using a proprietary process. The three-accelerometer array, located in the left, central, and right regions of the mouthguard, provided an estimate of angular acceleration independent of the gyroscope and allowed for a crosscheck to remove spurious readings, such as those originating from mouthguard deformation rather than head kinematics. The Nexus A9 mouthguard has been shown to have good concordance with reference sensors in drop tests [LCCC = 0.997 (Stitt et al., 2021)].

Capture Recording
Recorded mouthguard events (captures) were identified by thresholding the normed signal from the left linear accelerometer at 10 g or greater. This magnitude threshold was chosen because accelerations below 10 g have been reported to be indicative of non-impact events (e.g., walking, sitting, etc.) (King et al., 2016). A capture consisted of a lead-in period of 20 ms prior to the 10 g threshold being reached and ended 80 ms after the last trigger event. This allowed multiple impact events to be recorded in a single capture. The capture was then stored in the mouthguard's onboard memory.
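The capture-windowing logic described above can be sketched as follows. This is a minimal illustration of the stated behavior (20 ms lead-in, 80 ms tail after the last threshold crossing); the actual onboard firmware is proprietary, and all names here are illustrative:

```python
import numpy as np

FS = 3200          # accelerometer sampling rate (Hz)
LEAD_IN_S = 0.020  # 20 ms recorded before the first threshold crossing
TAIL_S = 0.080     # 80 ms recorded after the last threshold crossing
THRESH_G = 10.0    # trigger threshold on the normed accelerometer signal

def extract_capture(norm_g, fs=FS):
    """Return (start, end) sample indices of a capture window, or None.

    `norm_g` is the vector norm of the left triaxial accelerometer in g.
    The window spans 20 ms before the first sample >= 10 g to 80 ms after
    the last sample >= 10 g, so multiple impacts close together fall into
    a single capture.
    """
    over = np.flatnonzero(norm_g >= THRESH_G)
    if over.size == 0:
        return None
    start = max(0, over[0] - int(LEAD_IN_S * fs))
    end = min(len(norm_g), over[-1] + int(TAIL_S * fs))
    return start, end

# Synthetic example: a 1 s trace at ~1 g baseline with a 25 g spike.
trace = np.full(FS, 1.0)
trace[1600:1610] = 25.0   # simulated impact
window = extract_capture(trace)
```

On the synthetic trace, the returned window starts 64 samples (20 ms) before the spike and ends 256 samples (80 ms) after it.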

Data Processing
Due to individual variation in linear accelerometer sampling rates, time series for each axis of the three linear accelerometer sensors were resampled to 3,200 Hz. Gyroscope data were upsampled from 800 to 3,200 Hz. All resampling was carried out using polyphase filtering, as implemented in scipy's resample_poly function.
Resampled data were triaged to decrease the number of vocalization signals or those consisting of high frequency noise (Wu et al., 2018). The normed signal from the left linear accelerometer was low-pass filtered at 300 Hz using a second-order, non-phase-corrected Butterworth filter and subjected to a 10 g threshold. Captures that passed the triage were included in the final training/validation data.
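The resampling and triage steps can be sketched with scipy. The function structure is illustrative (the production pipeline is not specified beyond the text above), but the filter design matches the description: polyphase resampling, then a second-order Butterworth low-pass at 300 Hz applied causally (i.e., not phase corrected), then the 10 g check:

```python
import numpy as np
from scipy.signal import butter, lfilter, resample_poly

def passes_triage(accel_xyz_g, fs_in=3200, fs_target=3200, thresh_g=10.0):
    """Return True if a capture survives triage.

    Resample to the target rate with polyphase filtering, low-pass the
    vector norm at 300 Hz with a second-order Butterworth filter applied
    causally, and keep the capture only if the filtered norm reaches 10 g.
    """
    if fs_in != fs_target:
        # polyphase resampling, e.g., gyroscope data from 800 to 3,200 Hz
        accel_xyz_g = resample_poly(accel_xyz_g, fs_target, fs_in, axis=0)
    norm = np.linalg.norm(accel_xyz_g, axis=1)
    b, a = butter(2, 300, btype="low", fs=fs_target)
    filtered = lfilter(b, a, norm)  # lfilter is causal: no phase correction
    return bool(np.any(filtered >= thresh_g))
```

A sustained 30 g capture passes the triage, while a quiet ~1 g trace does not.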

Data Labeling
Ground truth data labeling of the captures used in this machine learning study was performed through analysis of game footage by two expert video reviewers using SportCode (https://www.hudl.com/en_gb/products/sportscode) and Catapult Vision (https://www.catapultsports.com/products/vision). The visual labeling process occurred independently of the mouthguard time series data. Reviewers were provided with video footage (720p, 50 frames per second) from four angles, namely a broadcast view, a view with a tight field of view on the ball, a side view, and footage from behind the goals, to determine whether a capture represented a legitimate impact (hit). Time stamps of captures were chronologically synchronized with the video footage using match start and end times provided by the AFL. Obvious hit events were used to make fine adjustments to the synchronization (within ±1 s). Capture events were viewed and labeled according to several predefined labels. Study participants (i.e., those wearing the mouthguards) were identified from their AFL guernsey numbers and from known physical characteristics. Captures where the reviewer directly observed contact between the mouthguard wearer and another player, the ball, or the ground were labeled as hits. Captures where no contact was observed were given a general label (non-hit) and a sublabel based on the activity observed: biting, chewing, drinking, mouthguard insertion, mouthguard removal, mouthguard in hand, mouthguard in sock, yelling, no video footage (on sideline), or unknown (if video footage was available but insufficient to directly observe the event). Quantification of hits that failed to reach the 10 g capture trigger threshold (see section Capture Recording) was not undertaken.

Datasets
Data that passed the triage process (13,712 captures) were divided into two sets: a classifier training and validation set (Set 1) and a separate holdout set (Set 2). Set 1 contained 13,417 captures (1,580 hits, 11,837 non-hits), which were balanced by downsampling the majority class (non-hit) to the minority class size, with the included captures selected through pseudorandom sampling using a uniform distribution. The balanced set (3,160 captures) was divided into training (70% of the balanced data), validation (15%), and test (15%) subsets. Set 2 consisted of captures acquired from a single match that were not included in Set 1 (holdout; 57 hits, 238 non-hits).
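A minimal sketch of the balancing and splitting procedure, assuming uniform pseudorandom downsampling as described (function and variable names are illustrative; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def balance_and_split(hit_idx, nonhit_idx, train=0.70, val=0.15):
    """Downsample the majority class (non-hits) to the minority class
    size via uniform pseudorandom sampling, then split the balanced set
    70/15/15 into training, validation, and test subsets."""
    kept_nonhits = rng.choice(nonhit_idx, size=len(hit_idx), replace=False)
    balanced = np.concatenate([hit_idx, kept_nonhits])
    rng.shuffle(balanced)
    n_train = round(len(balanced) * train)
    n_val = round(len(balanced) * val)
    return (balanced[:n_train],
            balanced[n_train:n_train + n_val],
            balanced[n_train + n_val:])

# Illustrative capture counts from Set 1: 1,580 hits, 11,837 non-hits.
hits = np.arange(1580)
nonhits = np.arange(1580, 1580 + 11837)
train_set, val_set, test_set = balance_and_split(hits, nonhits)
```

With these counts, the balanced set holds 3,160 captures, split 2,212/474/474.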
The validation set was used to estimate the unbiased error of the tuned hyperparameters. The test, rest, and holdout sets were used to examine how the final model would perform in hypothetical scenarios. The test dataset was used as an additional estimate of model performance given reasonably balanced classes. Conversely, the rest subset consisted of the non-hit captures that were not included in the training, validation, or test subsets (10,257 non-hit captures). Since our data showed a large imbalance toward non-hits (roughly 10:1), the rest dataset was used to examine the real-world specificity profile of the model. The holdout set was used to examine model performance on unseen data.

Feature Generation
Features were calculated on signals from all axes of the three linear accelerometers and the gyroscope (12 signals in total). Signals were first aligned to cardinal axes using rotation matrices derived from a proprietary calibration process unique to each mouthguard.
Two families of features were generated to capitalize on the shape and spectral characteristics of the signals. Random convolutional kernels were generated (Rahimi and Recht, 2007), with each signal standardized to the signal mean and standard deviation. Three hundred kernels were generated, with the maximum value of the kernel and number of values greater than zero extracted per kernel. A total of 600 features were generated per signal.
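The random-kernel family can be illustrated in the spirit of ROCKET-style methods (Dempster et al., 2020): standardize the signal, convolve with random kernels, and keep two summaries per kernel, the maximum of the convolution output and the count of output values greater than zero. The kernel lengths and bias distribution below are assumptions, as they are not specified above:

```python
import numpy as np

rng = np.random.default_rng(42)

def random_kernel_features(signal, n_kernels=300):
    """Random convolutional kernel features: per kernel, extract the
    maximum of the convolution output and the number of output values
    greater than zero (2 features per kernel, 600 per signal)."""
    x = (signal - signal.mean()) / signal.std()   # standardize the signal
    feats = []
    for _ in range(n_kernels):
        length = rng.choice([7, 9, 11])           # assumed kernel lengths
        weights = rng.normal(0.0, 1.0, length)    # assumed weight distribution
        bias = rng.uniform(-1.0, 1.0)             # assumed bias distribution
        out = np.convolve(x, weights, mode="valid") + bias
        feats.extend([out.max(), np.sum(out > 0)])
    return np.asarray(feats)

feats = random_kernel_features(np.sin(np.linspace(0, 20, 512)))
```

With 300 kernels and two summaries each, the function returns 600 features per signal, matching the count stated above.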
Spectral characteristics were examined by calculating the power spectral density of each signal using scipy's (Oliphant, 2007) implementation of Welch's method (Welch, 1967). Power spectral densities were split into 10-Hz bins, the characteristic value of each bin was extracted and then natural log transformed. The 1,908 power spectral density and 720 convolutional kernel features were then standardized to the mean and standard deviation of the training set.
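The spectral features can be sketched with scipy's Welch estimator. The per-bin "characteristic value" is taken here as the bin mean, which is an assumption (the statistic is not specified above), and a small epsilon guards against log of zero:

```python
import numpy as np
from scipy.signal import welch

def psd_bin_features(signal, fs=3200, bin_hz=10):
    """Welch power spectral density, aggregated into 10-Hz bins and
    natural-log transformed. The bin statistic (mean) is an assumption."""
    freqs, pxx = welch(signal, fs=fs, nperseg=min(256, len(signal)))
    edges = np.arange(0, freqs[-1] + bin_hz, bin_hz)
    bin_idx = np.digitize(freqs, edges)
    feats = [np.log(pxx[bin_idx == k].mean() + 1e-30)   # eps avoids log(0)
             for k in np.unique(bin_idx)]
    return np.asarray(feats)

feats = psd_bin_features(np.random.default_rng(0).normal(size=2048))
```

In the full pipeline these per-signal features would then be standardized to the training-set mean and standard deviation.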

Classifier Selection
Selection of a classification algorithm to use for final modeling was achieved by assessing performance of untuned algorithms on the training dataset. All available classification methods present in Scikit-learn (Abraham et al., 2014) were examined. Due to its popularity and performance, the eXtreme gradient boosting (XGBoost) algorithm was also included (Chen and Guestrin, 2016). Default settings for each algorithm were used. Performance was assessed based on the number of hits correctly classified as hits (true positive; TP) and the number of non-hits correctly classified (true negative; TN).
The estimator with the highest TP and TN performance and the least difference between performance metrics in the validation set was chosen for further tuning. The least difference was included as a selection criterion to select a classification algorithm that would be unbiased toward label type.

Classifier Training and Evaluation
RandomizedSearchCV was used to tune the highest performing estimator, optimizing the Matthews correlation coefficient. Fifty candidate combinations of parameters were evaluated using 5-fold cross validation, for a total of 250 fits. The highest performing combination of hyperparameters was used for further performance validation.
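The tuning step can be sketched as follows. The search space shown is an assumption (the tuned hyperparameters are not listed above), and `GradientBoostingClassifier` stands in for XGBoost only to keep the sketch dependency-free; with the `xgboost` package installed, `XGBClassifier()` drops in directly:

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import RandomizedSearchCV

# Stand-in data in place of the capture feature matrix.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

param_dist = {                      # assumed search space
    "n_estimators": randint(50, 200),
    "max_depth": randint(2, 8),
    "learning_rate": uniform(0.01, 0.3),
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_dist,
    n_iter=5,                       # the study evaluated 50 candidates
    cv=5,                           # 5-fold cross validation
    scoring=make_scorer(matthews_corrcoef),  # optimize Matthews corr. coef.
    random_state=0,
)
search.fit(X, y)
```

`search.best_params_` then holds the highest performing hyperparameter combination under the Matthews correlation coefficient.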
Generalizability of classifier performance was assessed using TP and TN metrics and the F1 score on the validation, test, rest, and holdout data. Performance bounds were calculated using bootstrapped 95% confidence intervals generated across 10,000 shuffles, with data selected pseudorandomly using a uniform distribution.
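A percentile-bootstrap confidence interval of the kind described above can be sketched as follows (the metric function and synthetic predictions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(y_true, y_pred, metric, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap confidence interval for a classification
    metric: resample label/prediction pairs uniformly with replacement
    n_boot times and take the alpha/2 and 1 - alpha/2 quantiles."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        stats[i] = metric(y_true[idx], y_pred[idx])
    return np.quantile(stats, alpha / 2), np.quantile(stats, 1 - alpha / 2)

def tp_rate(y_true, y_pred):
    # true positive rate (sensitivity): TP / (TP + FN)
    pos = y_true == 1
    return (y_pred[pos] == 1).mean()

# Synthetic predictions that agree with the labels ~90% of the time.
y_true = rng.integers(0, 2, 200)
y_pred = np.where(rng.random(200) < 0.9, y_true, 1 - y_true)
lo, hi = bootstrap_ci(y_true, y_pred, tp_rate, n_boot=2000)
```

The same routine applies unchanged to the TN rate and F1 score by swapping the metric function.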

Model Interpretation
To assist with model interpretation, including insights into feature importance and the impact of features on individual observation, SHapley Additive exPlanations (SHAP)'s TreeExplainer method was used (Lundberg and Lee, 2017;Lundberg et al., 2020). The validation dataset was used to generate SHAP values.

RESULTS

Classifier Selection Analysis
True positive (TP), true negative (TN), and the absolute difference between the two metrics (|TP − TN|) for all valid classifier algorithms in scikit-learn and XGBoost are presented in Table 1. The mean classifier performance for TPs and TNs was 77.84% (standard deviation = 31.77%) and 89.55% (standard deviation = 11.77%), respectively, with TP values ranging from 0% (Gaussian process classifier, label propagation, label spreading, quadratic discriminant analysis) to 98.04% (passive aggressive classifier, perceptron) and TNs ranging from 47.53% (dummy classifier) to 100% (Gaussian process classifier, label propagation, label spreading, quadratic discriminant analysis). Due to the mixed results between histogram-based gradient boosting (HGB) and XGBoost on our selection criteria, a second comparison was performed on fully tuned models. The results (Table 2) showed no clearly superior algorithm between the two. HGB was also, at the time of analysis, an experimental method within scikit-learn that had not been fully tested. Therefore, we decided to continue with analysis of the XGBoost-based model.

XGBoost Model (Classifier Performance Output)
Estimated TP and TN metrics and the F1 score were calculated from labels estimated by the trained XGBoost model against the video-derived ground truth labels (Table 3). Point estimate performance of the classifier was above 95% for all hit labeled impacts across all the data subsets (excluding the rest set, where no TPs were present). Confidence intervals ranged from 92.51% for the test set to 99.60% for the validation set. Point estimate true negative values ranged from slightly below 95% (94.54%) for the holdout set to 98.65% for the validation set, while 95% CIs ranged from 91.49% (holdout set) to 100% (validation set). TP CIs suggest that there was no difference between validation and test sets, while performance on the holdout set was superior (not corrected). Overlapping CIs for TNs suggest no significant difference in classifier performance across datasets.

Figure 1 presents SHAP values for the top 50 features, with individual impacts plotted from left to right and color representing whether the value for that feature and observation was high (above the feature mean, red) or low (below the mean, blue), with intensity proportional to the distance from the mean. The x axis shows the impact on the model: values above 0 indicate a contribution toward a positive label (hit), while values below 0 are contributions toward a non-hit label. As seen in Figure 1, the top 50 features were predominantly spectral in nature, with dominant frequency bands under 150 Hz. Gyroscope and central linear accelerometer sensors contributed the majority of information to the classifier.

DISCUSSION
Our classification model that was developed using smart mouthguard technology and video verified impacts of elite Australian Rules Football players showed good performance, being able to correctly identify over 90% hits and non-hit captures. Additionally, we showed the importance of sub-150 Hz frequencies in developing the model from rotational and linear information.
A recent systematic review of the impact sensor literature (Patton et al., 2020) reported that the majority of eligible articles (64%) did not employ an observer or video verification for sensor-recorded events. This raises substantial concern that the head impact sensor literature may inaccurately identify impact events. Those articles that did not apply a verification method may be overestimating the head impact exposure by including false positive data (Patton et al., 2020). While 74% of eligible articles applied a filtering algorithm to automatically remove false positives, they did not also use an observer or video verification method to reaffirm false positives that were removed from the dataset (Patton et al., 2020). The sole use of a filtering algorithm has not been considered to be a valid replacement for observer and/or video verification of head impacts (Nevins et al., 2018), largely because the algorithms were not derived from on-field (i.e., game-play) data (Patton et al., 2020).
While most processing algorithms used to remove false positives and other spurious events remain proprietary, the Head Impact Telemetry System (Simbex, Lebanon, NH) has previously been reported to compare the sensor-recorded kinematics to the expected acceleration signals for rigid body head acceleration (Crisco et al., 2004). Another method adopted by some authors in the field is to apply a threshold as a filter, e.g., removing all impacts <10 g peak linear acceleration and all impacts that surpass 200 g peak linear acceleration (Rahimi and Recht, 2007; O'Connor et al., 2017). Optimal recording thresholds can only be determined via comprehensive video confirmation approaches using unfiltered data (Patton et al., 2020).
Video verification methods are not often used to establish reliability and validity of impacts. Wu et al. (2018) have previously reported on the training and validation of an impact classifier for an instrumented mouthguard in a small sample of seven collegiate football players and a total of 387 impacts collected from practice and games. The authors reported achieving 87.2% sensitivity and 93.2% precision and emphasized the importance of accurate impact detection. By way of comparison, the current study employed a similar methodology in a larger sample of male and female Australian Rules Football players with a larger dataset of 13,712 video verified body and head impacts, with the lowest sensitivity value of 94.67% (sensitivity = true positives/(true positives + false negatives)) and lowest precision of 94.42% (precision = true positives/(true positives + false positives)). The between-study differences in results can be related to multiple factors, including sample size, sports (helmeted vs. unhelmeted), technology, and statistical methodology.
A second study to report results of a mouthguard-based classifier, by Kieffer et al. (2020), utilized head injury metrics as features with a support vector machine and an artificial neural network as their base algorithms (Benzel et al., 2016). The performance of their impact classifier was evaluated using positive predictive value (PPV; otherwise known as precision) for non-helmeted rugby players both on and off the field (combined) and on field only. Combined PPV was 91.2%, while on-field PPV alone was 96.4%. The presented classifier showed PPV ranging from 94.42% (test) to 100% (holdout) with both on- and off-field data.

Machine-learning-based models have been criticized for their opaque nature compared with traditional statistical modeling methods (e.g., decision trees, logistic regression) (Lundberg et al., 2020). Exploration of feature importance using SHAP showed low frequency (<150 Hz) rotational power spectral density features to contribute most to model performance. Wu et al. (2018) showed that their support vector machine-based classifier also utilized low frequency components; however, they reported most of the importance to lie in linear accelerometer-derived features at much lower frequencies (<30 Hz). This difference in reported feature importance may be due to several factors, including the classifier method used, the different feature types used, and potentially different impact characteristics of helmeted compared with non-helmeted sports. In this study, we used a randomized tree-based boosting method (XGBoost), which makes inherent use of interactions within the data and generates a multidimensional decision boundary in a stepwise fashion. Conversely, Wu et al. (2018) used a radial kernel support vector machine, which attempts to find a linear decision boundary between classes after the features have been projected to a higher dimensional space; this tends to produce a smoother decision boundary (Cortes and Vapnik, 1995). Additionally, Wu et al. (2018) utilized several different groups of features, including power spectral density and wavelet-transformed time-frequency information, time domain peak information, and biomechanical-based features. In comparison, our classifier used power spectral density-based features and randomized kernels that can be used to examine both frequency and shape-based characteristics of the signal (Dempster et al., 2020).
Finally, Wu et al. (2018) developed a classifier for use in American football, where players wear protective helmets to guard against head injuries, while the presented classifier was produced using AFL data, in which helmets are not worn. There may be important differences in how impacts are represented in the frequency domain between helmeted and non-helmeted sports, with the helmet absorbing more of the higher frequency components and allowing only sub-50 Hz kinetic energy to be transferred into the head.

Limitations
This study has several limitations. While video-reviewed hits and non-hit captures were noted by the video reviewers, hits that were not of sufficient magnitude to reach the mouthguard's 10 g threshold, and were therefore not captured, could have been present but not noted. We did not attempt to identify these impacts during video review. Although our classifier showed high performance in distinguishing between hits and non-hits, this performance may only reflect impacts at 10 g or greater. The classifier was also developed specifically for the Nexus A9 mouthguard for classifying the impacts of male and female Australian Rules Football players. Whether a similar performance level is achievable for other contact sports using this classifier is unknown. The classifier was applied to adult athletes and may not generalize to child and adolescent athletes. Additionally, the classifier was applied to elite level players and may not generalize to amateur and community level athletes. Training of the classifier was accomplished using data from four female and sixty male Australian Rules Football players. Bias in machine learning/artificial intelligence methods of classification is widely known. The lack of recorded data from women in the training set may lead to inaccurate results when attempting to classify impacts in professional women athletes. Although standardization of the time series during processing may remove bias that is present due to magnitude differences between the sexes based on body mass, other potential sources of bias, such as differences in impact behavior, cannot be ruled out. Finally, the classifier did not include data from all possible positions in Australian Rules Football. Because different positions have different impact probabilities and the form of impact may vary, it is possible that some types of impact may not have been recorded.

Practical Implications
Video verification is an essential element for collecting reliable kinematic data from sensors. Establishing a method of synchronization between video footage and the output from the sensors is critical to the video verification process. An algorithm-driven classifier, in conjunction with the video verification method, optimizes the integrity of the data. Validating a classifier that applies these two principles improves the reliability of the data for making decisions about the suitability of an athlete to remain in play, or to be removed for a medical assessment, following a hard blow to the head. Additionally, this study showed the importance of frequency ranges <150 Hz in creating a tree-based model to accurately identify impacts. Both linear and rotational spectral information provided the majority of information to develop the classifier, as evidenced by SHAP, suggesting that temporally based metrics (e.g., peak values) may not be required.

CONCLUSIONS
It is essential for a valid verification method to be used to confirm sensor-recorded events and to remove false positives. Video verification in combination with an algorithm-driven classifier can provide an accurate method for filtering data and optimizing the integrity of the dataset. The current study showed that the classifier for the Nexus A9 mouthguard is an accurate system for identifying impacts to the body and head in elite level Australian Rules Football players. Future research should focus on further validation of impact sensor classifiers in other contact and collision sports, and across the various levels of sport.

DATA AVAILABILITY STATEMENT
The datasets presented in this article are not readily available because of contractual obligations surrounding privacy of data as well as the need to protect intellectual property. Consideration will be made for access requests to data not implicated by these obligations. Requests to access the datasets should be directed to peter@hitiq.co.

ETHICS STATEMENT
The study involving human participants was reviewed and approved by the University of Newcastle Human Ethics Committee (H-2019-0341). The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
PG conceptualized the study, wrote the first draft of the manuscript, and performed the statistical analyses. PG, BN, and SA conceptualized the statistical analyses. All authors critically reviewed the manuscript. All authors read and approved the last version of this manuscript.

FUNDING
Funding for this study was provided by HitIQ Limited.