Can heart rate variability data from the Apple Watch electrocardiogram quantify stress?

Chronic stress has become an epidemic with negative health risks including cardiovascular disease, hypertension, and diabetes. Traditional methods of stress measurement and monitoring typically rely on self-reporting. However, wearable smart technologies offer a novel strategy to continuously and non-invasively collect objective health data in the real world. A novel electrocardiogram (ECG) feature has recently been introduced to the Apple Watch device. Interestingly, ECG data can be used to derive Heart Rate Variability (HRV) features commonly used in the identification of stress, suggesting that the Apple Watch ECG app could potentially be utilized as a simple, cost-effective, and minimally invasive tool to monitor individual stress levels. Here we collected ECG data using the Apple Watch from 36 healthy participants during their daily routines. HRV features from the ECG were extracted and analyzed against self-reported stress questionnaires based on the DASS-21 questionnaire and a single-item LIKERT-type scale. Repeated measures ANOVA tests did not find any statistical significance. Spearman correlation found very weak correlations (p < 0.05) between several HRV features and each questionnaire. The results indicate that the Apple Watch ECG cannot be used for quantifying stress with traditional statistical methods, although future directions of research (e.g., use of additional parameters and Machine Learning) could potentially improve stress quantification with the device.


Introduction
According to the WHO, stress is the "Health Epidemic of the 21st Century" (1). Over a quarter of U.S. adults report such high levels of daily stress that they are not able to function properly (2). Stress, as a survival mechanism, is normal and healthy: stress allows the body to generate more energy to deal with a potential threat. The stress response is modulated by the sympathetic nervous system (SNS) and parasympathetic nervous system (PNS). The SNS is responsible for triggering a response to unexpected threats to generate energy and resources for the body (the fight-or-flight response) by signalling adrenal glands to release adrenalin and cortisol, which lead to several physiological changes including increased heart rate, blood pressure, and respiration (3,4). Once the acute stressors are removed, the PNS functions to relax the body, returning it to its normal state (3,4).
Despite the necessity of a stress response to survival, chronic exposure to stressors can lead to severe health consequences including cardiovascular diseases, hypertension, obesity, and diabetes (3,5,6). Chronic stress is an increasingly observed condition worldwide. High levels of daily stress are reported by 38% of United States adults aged 40-49 years and 33% of adults aged 50-59 years (7). In Canada, daily stress was highest amongst individuals between 35-49 years (27.8%) followed by individuals aged 50-64 years (22%) and 18-34 years (21.9%) (8). Individuals over 65 years reported the lowest levels of stress (8). Chronic stress is estimated to cost over USD 300 billion annually in associated healthcare expenses, reduced job performance, and absenteeism (1,9). Workplace stress is connected with 120,000 premature deaths annually (10). The COVID-19 pandemic has amplified this crisis: a recent survey by the American Psychological Association discovered that approximately 80% of respondents identify the pandemic as a major source of stress in their life and almost 70% reported increased levels of stress owing to .
The identification of stress and the application of interventions should be a public health priority. Research data on stress is typically collected through self-reporting surveys, which may have limitations such as low response rates, recall and social bias, cost and delays (12). Smart technologies, such as mobile and wearable devices, have recently been identified as useful tools to measure health parameters. Several of these technologies have embedded sensors that collect objective health data such as sleep, blood pressure, and heart rate (13, 14). In particular, an electrocardiogram (ECG) feature for detecting atrial fibrillation has been introduced to the Apple Watch device (13-15). Unlike the standard 12-lead ECGs, which use electrodes connected to the body, the Apple Watch ECG collects a 30-s 1-lead ECG when users place their finger on an electrode located in the digital crown of the device (16). Interestingly, ECG data can be used to derive Heart Rate Variability (HRV) features which are commonly used in the identification of stress (17). This suggests that the Apple Watch ECG app could potentially identify and monitor individual stress. Apple Watch applications could use this information to provide instant user feedback and interventions, such as suggesting the use of meditation apps (18). Furthermore, the use of a wearable data collection device would improve stress research data by eliminating recall biases and increasing population sample sizes. However, compared to longer measurements, there is not a large amount of evidence suggesting that ultra-short HRV measurements are reliable (19).
The goal of this paper was to explore the associations between HRV data collected from the Apple Watch ECG app and perceived stress levels in a real-life study. To the best of our knowledge, this is the first paper that provides statistical analyses of data derived from the Apple Watch ECG for stress detection, studying the reliability of these short-term measurements, and it is a continuation of previous work that uses a set of the same data, from 40 participants, to create Machine Learning (ML) stress prediction models (12). ECG data from the Apple Watch ECG app was collected from 36 participants in a real-world setting over 2 weeks. We were able to identify significant, albeit weak, correlations between several HRV features and self-reported stress states, as well as significant differences between groups. Results from this study support the continued development of wearable ECG sensors as tools to measure stress.
The paper is organized as follows: section 2 describes related work, including previous studies that used different sets of the same data for creating Machine Learning models; section 3 describes the methods, while section 4 presents the results and section 5 discusses our findings. Finally, section 6 presents the conclusions.

Related work
This paper extends previous work by the authors that used data from 40 participants to derive HRV features from the Apple Watch ECG data and to create machine learning (ML) models for stress prediction, specifically using Random Forest and Support Vector Machines (12). The models, trained on subsets of the data according to age, gender, income, profession, and health status, achieved weighted f1-scores of approximately 55-65%, which is in line with the state-of-the-art for stress prediction using ML, although towards the low end. The models possessed high specificity (i.e., in general they were capable of successfully predicting when an individual is not stressed) but were less successful when predicting the stressed state. Notably, feature importance of the Random Forest models was calculated to determine, for each model, which features were most important in determining the prediction results. Although they vary per model, in general the heart's acceleration (AC) and deceleration (DC) capacity were some of the most important features, present in most of the models. Another noteworthy feature is the standard deviation of interbeat intervals (SDNN). A more detailed explanation of HRV features and the feature extraction process is provided in the methods section.
Data from the same study, this time from 27 participants, was used by Benchekroun et al. (20), although in this case the HRV data was derived from the Empatica E4 device rather than from the Apple Watch ECG. The Empatica E4 device collects data continually as opposed to cross-sectionally, providing larger datasets. Random Forests trained on this data resulted in an area under the receiver operating characteristic (ROC) curve (ROC AUC) of 0.79 and a macro f1-score of 75%. Further, a cross-dataset analysis was performed in which models were trained on a laboratory dataset and tested on the Empatica E4 data, achieving a ROC AUC of 65% and an f1-macro score of 62%.
McCraty et al. (21) performed repeated measures ANOVA analysis on HRV metrics of 24 patients with panic disorder and healthy controls, finding differences in features such as the SDNN index, Total Power of VLF, and Normalized LF/HF ratio, among others. Hong et al. (22) conducted repeated measures ANOVA analyses for participants, finding changes in HF and RMSSD.
Seipäjärvi et al. (23) studied stress and HRV in a laboratory setting among participants in different age groups and health status, finding that differences in HRV could be observed when stressors were applied. Föhr et al. (24) investigated the association between physical activity, HRV and subjective stress measured with the perceived stress scale (PSS), finding significant changes between physical activity and HRV with stress. However, using ecological momentary assessments, Martinez et al. (25) found a significant but small relationship between HRV and stress, where only a small amount of variance was explained by models. The authors concluded that HRV might be a good proxy for stress in controlled settings with specific stressors applied, but not in real life. Silva et al. (26) conducted Spearman correlation analysis between the perceived stress scale (PSS-14) and 5-min HRV variables at rest, and found weak to moderate correlation for the low frequency (LF) band. A similar Spearman correlation analysis was done in this study between HRV features and stress. The Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology provide widely used guidelines for the analyses of HRV data and were of great help in guiding this research (27). Acharya et al. provided an extensive review of HRV metrics (17), while several papers explored the feasibility and characteristics of analyzing HRV data. For example, Benchekroun et al. (28) discussed the impact of missing data on several HRV-related metrics and the best interpolation techniques to handle this situation.
It is important to note that there is limited research on the reliability of ultra-short-term HRV measurements (less than 5 min) when compared to long-term methods. Baek et al. studied ultra-short-term measurements to define recommended minimum intervals for each of these metrics to be valid (19). In general, each metric has a different recommended interval, varying from seconds to minutes. Shaffer et al. (29) conducted a review of ultra-short-term heart rate variability norms, finding that most studies did not use criterion validity to test whether the procedures produce results comparable with validated measurement procedures, instead applying other metrics (e.g., Pearson correlation) which may be insufficient to provide evidence of comparable methods. Studies that did use more appropriate metrics [such as Baek et al. (19) mentioned previously] typically found that different metrics require different minimum intervals. Munoz et al. (30), for example, found that a minimum of 10 s was required for RMSSD and 30 s for SDNN. The authors also found that ultra-short-term measurements are extremely sensitive to artifacts. For example, a single false heartbeat can alter the HRV metrics, and so special care must be taken when analyzing the data. In short, while ultra-short-term recordings such as the ones used in this study have potential due to their increased accessibility and ease of use, there is a lack of a robust evidence base to assert that these recordings can be used as proxies for longer recordings. In this study, as will be described, the Kubios Premium software was used to process the data to mitigate issues with noise or artifacts.
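To make the sensitivity to artifacts concrete, the following minimal Python sketch computes SDNN and RMSSD from a list of interbeat (RR) intervals using their standard textbook definitions, and compares a clean series with one in which a single falsely detected beat splits an interval in two. The RR values are hypothetical illustration data, not study data, and this is not the Kubios implementation.

```python
import math

def sdnn(rr_ms):
    """Sample standard deviation of RR intervals, in milliseconds."""
    mean = sum(rr_ms) / len(rr_ms)
    return math.sqrt(sum((x - mean) ** 2 for x in rr_ms) / (len(rr_ms) - 1))

def rmssd(rr_ms):
    """Root mean square of successive RR-interval differences, in milliseconds."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Hypothetical RR intervals (ms) from a short, clean recording.
clean_rr = [800, 810, 790, 805, 795, 800, 815, 785]

# Same recording, but one 800 ms interval is split into 300 ms + 500 ms
# by a single falsely detected beat (assumed artifact, for illustration).
noisy_rr = [800, 810, 790, 805, 795, 300, 500, 815, 785]

print("clean: SDNN=%.1f ms, RMSSD=%.1f ms" % (sdnn(clean_rr), rmssd(clean_rr)))
print("noisy: SDNN=%.1f ms, RMSSD=%.1f ms" % (sdnn(noisy_rr), rmssd(noisy_rr)))
```

With only a few dozen beats in a 30-s recording, one spurious interval dominates both statistics, which is why beat correction and sample exclusion matter for recordings of this length.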
In addition, while Apple Watch ECG data was shown to be successful in detecting atrial fibrillation (31), there is also a lack of a robust evidence base on how the HRV data derived from the Apple Watch ECG compares to gold standards. A study by Saghir et al. (32) found good results, showing the agreement between the Apple Watch ECG and a standard 12-lead ECG to be moderate to strong in healthy adults. In other words, there is promising but limited evidence both on ultra-short-term recordings and on how Apple Watch ECG data compares to more traditional, longer-term measurement methods.
It should also be noted that, while in this work we focus specifically on HRV derived from ECG (HRV being an essential parameter in stress quantification), other metrics, such as electrodermal activity (EDA), can also be considered for analyses (33).

Participant recruitment
Healthy participants (n = 36) were recruited from the University of Waterloo as well as through Facebook Ads and Kijiji (a Canadian website that allows users to advertise products and services). Participants had to live close to the Kitchener-Waterloo region in Ontario for devices to be delivered in person. Participants were offered CAD 100.00 for 2 weeks of data collection. This study was approved by the University of Waterloo Research Ethics Board (REB [43612]). Data collection took place between December 2021 and December 2022. Table 1 shows the characteristics of the study participants. Participants were aged 18 years or older. For the analyses described in this paper, we considered only healthy participants, i.e., those who did not drink or smoke and did not have any chronic conditions or take prescription medications.

Data collection
This study followed the Ecological Momentary Assessment (EMA) methodology to obtain self-reports closer to the event to approximate real-life scenarios (34). Participants were given an iPhone 7 with iOS 15.0 and an Apple Watch Series 6 with watchOS 8.3 for 2 weeks. The Apple Watch contained the ECG app and a Mobile Health Platform (MHP) was installed on the iPhone. The MHP was used to collect health data, including ECG recordings, from the iPhone's Apple Health app data repository (12-14, 20).
Users were instructed to perform an ECG measurement on the Apple Watch ECG app 6 times during the day in approximately three-hour intervals, followed by the stress questionnaire (below) on the iPhone. Figure 1 shows the study protocol (times are included for reference purposes; participants were asked to collect data as soon as they woke up).
The app installed in the iPhone, termed the Mobile Health Platform (MHP), can collect health data saved on the iPhone's health data repository, the Apple Health app, including the ECG recordings. The MHP collected this data, which were then saved in our database using the JSON format (for each ECG reading there are 15,360 voltage measurements and associated timestamps in milliseconds, forming the 30-s ECG). The MHP also contains a tab with the stress questionnaires to be completed, which will be described next. Figure 2 shows the interface of the MHP, including the additional variables collected in the study.
We noticed that several participants had difficulty managing the study protocol with their daily life responsibilities. Therefore, we asked participants to use the devices for additional days to compensate as applicable.
Of note, this study is part of a larger cross-sectional study that investigates the use of smart technologies for stress detection. As part of this larger study, in addition to the Apple Watch and iPhone, participants were also given additional devices capable of collecting other data, such as the Withings Blood Pressure Monitor and the Empatica E4. Since this is not the focus of the paper, we will not describe the use of these devices further, but more information on these expanded protocols is provided in Velmovitsky et al. (12-14) and Benchekroun et al. (20).
Frontiers in Public Health 04 frontiersin.org

Stress questionnaires
As there are a limited number of validated stress questionnaires for the EMA with a validation period relevant to this study, we used the stress subscale of the Depression, Anxiety, and Stress Scale (DASS-21) for our stress questionnaire. While the DASS-21 is usually applied over a week, there is promising evidence of using DASS-21 with EMA (35). In addition, Wang et al. (36) used a single-item measure that, while lacking validation in the literature, was used successfully for stress prediction and is moderately correlated with robust stress questionnaires. The following questionnaire on a LIKERT-type scale was used for our study. Questions 1-7 are related to the DASS-21 and question 8 comprises the single-item measure used by Wang et al.
1. I found it hard to wind down
2. I felt that I was using a lot of nervous energy
3. I found myself getting agitated
4. I found it difficult to relax
5. I tended to over-react to situations
6. I was intolerant of anything that kept me from getting on with what I was doing
7. I felt that I was rather touchy
8. Right now, I am…
Questions 1-7 have the options: "Not at all," "To some degree," "To a considerable degree," and "Very much," while Question 8 has "Stressed Out," "Definitely stressed," "A little stressed," "Feeling good," and "Feeling great." The questions were displayed to the user in a random order each time the questionnaire was filled in on the MHP, and together they measure perceived stress, i.e., the degree to which a stressful situation affects an individual.
In addition to self-reporting stress throughout the day, participants were asked to self-report their stress levels at the beginning of the study with the single-item measure (results shown in Table 1).

Data pre-processing
To obtain the HRV features from the ECG readings, we made use of Kubios Premium 3.5.0, a widely used software that analyzes and extracts features from several heart-related signals (17,37). The JSON ECG data was exported into a CSV format and each voltage measurement was sorted by timestamp. The CSV file was imported into Kubios.
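The export step can be sketched as follows. The source states only that each reading holds 15,360 voltage measurements with millisecond timestamps over 30 s (which implies roughly 512 samples per second); the JSON field names used here (`samples`, `timestamp_ms`, `voltage`) and the CSV column layout are assumptions for illustration, not the actual MHP schema or the format Kubios requires.

```python
import csv
import io
import json

def ecg_json_to_csv(json_text):
    """Convert one ECG reading (hypothetical JSON layout) to CSV text,
    with rows sorted by timestamp."""
    reading = json.loads(json_text)
    # Sort so that samples are in chronological order before export.
    samples = sorted(reading["samples"], key=lambda s: s["timestamp_ms"])
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["timestamp_ms", "voltage"])
    for sample in samples:
        writer.writerow([sample["timestamp_ms"], sample["voltage"]])
    return out.getvalue()

# Tiny made-up reading with out-of-order samples (a real one would have 15,360).
example = json.dumps({"samples": [
    {"timestamp_ms": 4, "voltage": 0.12},
    {"timestamp_ms": 0, "voltage": 0.10},
    {"timestamp_ms": 2, "voltage": 0.11},
]})
csv_text = ecg_json_to_csv(example)
print(csv_text)
```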
The Kubios automatic beat correction feature was used, and any samples that contained more than 5% of corrected beats were removed. In addition, any ECG sample classified as Poor Recording or Inconclusive by the ECG app was also removed from the analysis (16). Frequency features were calculated using both the Fast Fourier Transform (FFT) and Autoregressive Spectral Analysis (AR). A list of the features generated by Kubios based on the 30-s ECG signal is presented in Table 2 (17,37).
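The two exclusion rules above can be expressed as a simple filter. This is a minimal sketch: the record fields (`corrected_beats`, `total_beats`, `classification`) are hypothetical names for values that would come from the Kubios output and the ECG app, and the example records are invented.

```python
MAX_CORRECTED_FRACTION = 0.05  # more than 5% corrected beats leads to exclusion
EXCLUDED_CLASSIFICATIONS = {"Poor Recording", "Inconclusive"}

def keep_sample(sample):
    """Return True if an ECG sample passes both exclusion rules."""
    if sample["total_beats"] == 0:
        return False  # nothing to analyze
    corrected_fraction = sample["corrected_beats"] / sample["total_beats"]
    if corrected_fraction > MAX_CORRECTED_FRACTION:
        return False
    return sample["classification"] not in EXCLUDED_CLASSIFICATIONS

# Invented example records (field names are assumptions).
samples = [
    {"id": "a", "corrected_beats": 0, "total_beats": 35, "classification": "Sinus Rhythm"},
    {"id": "b", "corrected_beats": 3, "total_beats": 35, "classification": "Sinus Rhythm"},
    {"id": "c", "corrected_beats": 0, "total_beats": 35, "classification": "Inconclusive"},
    {"id": "d", "corrected_beats": 1, "total_beats": 40, "classification": "Poor Recording"},
]
kept_ids = [s["id"] for s in samples if keep_sample(s)]
print(kept_ids)
```

Sample "b" is dropped because 3 of 35 beats (about 8.6%) were corrected, while "c" and "d" are dropped because of their classification.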
The scores of the DASS-21 questions were summed and multiplied by 2. If the resulting score was greater than 14, the sample was classified as "stress" according to DASS-21 guidelines (38). For the single-item measure, the sample was classified as "stress" if the score was greater than 2, as that would represent the user being at least "a little stressed." If either the DASS-21 score or the single-item score was classified as "stress," the measurement was classified as the "stress" state.
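The labelling rule can be sketched in a few lines of Python. The cut-offs (14 and 2) and the doubling of the summed items come from the text; the 0-3 coding of each DASS-21 item is the conventional one and is an assumption here, as is the numeric coding of the single-item answer, which the text specifies only through the "greater than 2" threshold.

```python
DASS_CUTOFF = 14         # doubled stress-subscale score above this is "stress"
SINGLE_ITEM_CUTOFF = 2   # single-item score above this is "stress"

def dass_stress_score(item_scores):
    """Sum the 7 stress-subscale items (assumed coded 0-3) and double the total."""
    if len(item_scores) != 7:
        raise ValueError("expected 7 DASS-21 stress items")
    return 2 * sum(item_scores)

def is_stress(dass_items, single_item_score):
    """A measurement is 'stress' if either questionnaire exceeds its cut-off."""
    return (dass_stress_score(dass_items) > DASS_CUTOFF
            or single_item_score > SINGLE_ITEM_CUTOFF)

# Invented answers for illustration.
print(is_stress([1, 1, 1, 1, 1, 1, 1], 1))  # doubled score 14, not above the cut-off
print(is_stress([2, 1, 1, 1, 1, 1, 1], 1))  # doubled score 16, above the cut-off
print(is_stress([0, 0, 0, 0, 0, 0, 0], 3))  # single-item score above 2
```

Because the two criteria are combined with a logical OR, a measurement is labelled "stress" whenever either questionnaire exceeds its own cut-off.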

Statistical analysis
Statistical analyses were performed using the Statistical Package for Social Sciences (v. 28.0; SPSS, Chicago, IL, United States). Using baseline stress scores from the single-item measure at the beginning of the study, repeated measures ANOVA analyses were conducted, followed by Tukey's Post-Hoc test in case of statistically significant features. In addition, Spearman's non-parametric correlation test was applied to detect the correlation between each ECG variable and the quantitative DASS-21 and single-item questionnaire scores. For all analyses, p < 0.05 was considered statistically significant. While correlations were performed for every feature, to limit the potential for bias, ANOVA analyses were conducted with a subset of the features (Table 3) as seen in other works (21,22). In addition, for the analyses, we considered 13 days of data for each participant (the minimum number of days across all participants in the study).
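For readers unfamiliar with Spearman's test, the coefficient is the Pearson correlation computed on the ranks of the two variables, with tied values sharing their average rank. The sketch below implements that definition in plain Python; it is an illustration only (the study used SPSS), it does not compute p-values, and the feature and score values are invented.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented values: one HRV feature and one questionnaire score per measurement.
hrv_feature = [42.0, 35.5, 50.1, 29.8, 38.2, 44.7]
stress_score = [8, 12, 6, 14, 12, 4]
print(round(spearman(hrv_feature, stress_score), 3))
```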

Results
To determine whether HRV data collected from an Apple Watch ECG was associated with perceived stress level, we recruited 36 healthy participants to participate in a real-life study. Using the Apple Watch ECG app and an iPhone app developed for this study, users were instructed to collect ECG readings and complete a stress questionnaire 6 times during the day in approximately three-hour intervals for 2 weeks, as well as fill an initial survey about perceived stress levels prior to data collection. Table 2 lists the HRV features captured by the Apple Watch ECG. Questionnaires comprised 8 questions based on the DASS-21 (38) and the measure used by Wang et al. (36) as mentioned in the previous section.
Participants were predominantly female (64%; Table 1). 61% were employed and 36% were students. Participants were mostly South Asian, White, or Latin American (31, 25, and 19%, respectively), and reported low to medium income (44 and 36%, respectively). Participants spent an average of 17.1 (±2.5) days in the study and contributed an average of 59 (±16.0) ECG recordings. Participants were also asked to self-report their stress levels at the beginning of the study with the single-item measure (results shown in Table 1).
As described in the previous section, using the questionnaire score, measurements were designated as self-perceived "stress" if (a) the DASS-21 score was greater than 14; or (b) the single-item score was greater than 2. Measurements that did not meet either cut-off were designated as "no stress." A repeated measures ANOVA test was performed to compare differences recorded by the Apple Watch ECG and self-perceived stress. No statistical significance was revealed (Table 3).
To determine which ECG variables correlated with stress, we applied a Spearman's non-parametric correlation analysis between HRV features and self-perceived stress, divided by each of the stress scores (DASS-21 and single-item measure). Spearman correlation coefficients (r) and p-values were calculated and are shown in Table 4.
Figure 1. Study protocol.

Discussion
Overall, some HRV features captured by the Apple Watch weakly correlate to the stress questionnaires. Repeated measures ANOVA test and Tukey's Post-Hoc test indicated that Apple Watch ECG features in the current study design cannot statistically differentiate between stress states in a real-world setting. Therefore, the answer of "can Heart Rate Variability data from the Apple Watch ECG Quantify Stress" with the use of the statistical methods investigated in this work seems to be no. Regarding Spearman correlations, while several features in the domains (time domain, frequency domain, non-linear) were shown to have a significant correlation with the DASS-21 and single-item measure, all were weak. Nevertheless, interesting points can be made by comparing the differences between the two questionnaires.
In general, the significant correlations between HRV features and the single-item measure are a subset of the ones from DASS-21. One of the main differences in the correlations between the DASS-21 and the single-item measure is that the latter does not seem to be significantly correlated to the absolute power high frequency components (FFT Absolute Power HF and AR Absolute Power HF). In this way, the two questionnaires seem to complement each other in capturing differing dimensions of self-perceived stress, although it should be noted that the weak correlations may limit the validity of these results.
Interestingly, Silva et al. (26) also found weak to moderate correlations using the Spearman test while comparing HRV metrics with stress from the PSS-14 questionnaire, but failed to find any significant correlations except for the LF band. Given that participants' measurements were taken at rest and the PSS-14 stress scores were in the mid to low range, it is possible that physiological changes owing to stress affected the correlation values in our current work.
Indeed, several factors may have affected the quality of the data. First, being a "real-life" experiment, data may be subjected to noise and errors in measurements. For example, respondents may forget to take measurements throughout the day, take the measurements incorrectly, or be influenced by the Hawthorne Effect, in which respondents change their behaviour because they are being monitored. By the same token, elements such as sweat or movement may affect the measurement. These factors may have influenced the results, leading to potentially inaccurate data. Future work should explore data collection of ECG in controlled conditions, potentially with an intervention (e.g., applying stressors in a lab), to evaluate the robustness of this data. This recommendation is also in line with the conclusion of Martinez et al. that HRV may be best represented in controlled environments with specific stressors (25). While this would diminish the validity of ECG data to be used in real-life scenarios to identify stress, it could provide further clues as to how the relationship between these variables works and new directions of research. Further, future work on this dataset can consider the distribution of the data per day and HRV diurnal fluctuations, which could provide more significant and illuminating results. In addition, a convenience sample was used in this pilot study, and as can be seen in Table 1, there is a predominance of females and participants with low to medium socioeconomic status (SES), which may affect the external validity of the results. Finally, since we used the EMA methodology, we decided to combine both the DASS-21 and the single-item measure for stress classification, which can potentially affect how individuals report stress and may lead to some of the contradictory findings in terms of group differences presented here. On that note, this study focused on perceived stress, i.e., the degree to which a situation perceived as stressful affects individuals. In this context, subjective ratings of stress may be affected by each participant's internalized definition of stress, which in turn may influence responses (39). Nevertheless, the fact that several significant, albeit weak, correlations were found is encouraging, and additional, more controlled, and stratified experiments should be conducted to confirm and clarify these relationships between the HRV features from the Apple Watch ECG and self-perceived stress.
As described in the Related Work section, there is promising but limited evidence on the reliability of ultra-short-term measurements and the Apple Watch ECG when compared to traditional measurement methods and data. It is possible that inaccuracies in the Apple Watch ECG led to a lack of statistical differences between stress states in this study. In addition to controlled experiments, future research could also consider using different methods of ultra-short-term data collection to verify the results. Given that weak correlations were found, the use of parameters beyond the Apple Watch ECG alone might also help with quantifying stress. Indeed, several physiological and behavioural variables have been widely used in stress research. These could include brain activity measured through electroencephalogram (EEG), electrodermal activity (EDA), speech, and mobile phone usage, among others (5). Physical activity (24,40,41) and sleep (40,42,43) could also potentially be used to discriminate stress and can also be collected passively with the Apple Watch sensors. If ECG and other Apple Watch data were successfully used in conjunction to differentiate between stressed states, potential solutions could focus simply on the Apple Watch for stress quantification, which would be of great value in studying the prevalence of these conditions and providing feedback to users. Finally, the use of Machine Learning for prediction, as previously mentioned, has shown promising results (12), and further studies also using other parameters could help improve prediction accuracy and realize the potential of the Apple Watch for stress studies.

Conclusion
The use of an Apple Watch ECG to quantify individual stress was piloted in a real-world scenario. Significant but weak correlations were found between several HRV features and measures of self-perceived stress. This study highlights the potential usefulness of the Apple Watch ECG as a minimally invasive tool for stress monitoring, quantification, and intervention, although more robust evidence is needed to establish the relationships between the data and its relevancy.

Data availability statement
The original contributions presented in the study are included in the article/supplementary materials; further inquiries can be directed to the corresponding author: plinio.morita@uwaterloo.ca.

Ethics statement
The studies involving human participants were reviewed and approved by the University of Waterloo Research Ethics Board (REB [43612]). The patients/participants provided their written informed consent to participate in this study.

Author contributions
PV was responsible for conducting the study, collecting and analysing the data, and writing the manuscript. ML provided help in the statistical analyses and in the writing of the manuscript. PM, PA, SL, and DC provided direction to the manuscript's writing and preparation as well as editing and revision. All authors contributed to the article and approved the submitted version.

Funding
This work was supported by an Ontario Trillium Scholarship from the Ontario Government.