Criterion-Validity of Commercially Available Physical Activity Tracker to Estimate Step Count, Covered Distance and Energy Expenditure during Sports Conditions

Background: In recent years, the development of physical activity trackers (Wearables) has increased. For recreational people, testing of these devices under walking or light jogging conditions might be sufficient. For (elite) athletes, however, scientific trustworthiness needs to be given for a broad spectrum of velocities or even fast changes in velocities reflecting the demands of the sport. Therefore, the aim was to evaluate the validity of eleven Wearables for monitoring step count, covered distance and energy expenditure (EE) under laboratory conditions with different constant and varying velocities. Methods: Twenty healthy sport students (10 men, 10 women) performed a running protocol consisting of four 5 min stages of different constant velocities (4.3; 7.2; 10.1; 13.0 km·h−1), a 5 min period of intermittent velocity, and a 2.4 km outdoor run (10.1 km·h−1) while wearing eleven different Wearables (Bodymedia Sensewear, Beurer AS80, Polar Loop, Garmin Vivofit, Garmin Vivosmart, Garmin Vivoactive, Garmin Forerunner 920XT, Fitbit Charge, Fitbit Charge HR, Xaomi MiBand, Withings Pulse Ox). Step count, covered distance, and EE were evaluated by comparing each Wearable with a criterion method (Optogait system and manual counting for step count, treadmill for covered distance and indirect calorimetry for EE). Results: All Wearables, except Bodymedia Sensewear, Polar Loop, and Beurer AS80, revealed good validity (small MAPE, good ICC) for all constant and varying velocities for monitoring step count. For covered distance, all Wearables showed a very low ICC (<0.1) and high MAPE (up to 50%), revealing no good validity. The measurement of EE was acceptable for the Garmin, Fitbit and Withings Wearables (small to moderate MAPE), while Bodymedia Sensewear, Polar Loop, and Beurer AS80 showed a high MAPE of up to 56% for all test conditions.
Conclusion: In our study, most Wearables provide an acceptable level of validity for step counts at different constant and intermittent running velocities reflecting sports conditions. However, the covered distance, as well as the EE, could not be assessed validly with the investigated Wearables. Consequently, covered distance and EE should not be monitored with the presented Wearables in sport-specific conditions.


INTRODUCTION
In recent years, the development of physical activity trackers (Wearables) has increased, which earned them first place in the ACSM Worldwide Survey of Fitness Trends in 2016 and 2017, leaving popular topics like "High-intensity interval training" and "strength training" behind (Thompson, 2015, 2016). Besides having applications for physical fitness and health in the general population by monitoring a plethora of different variables like step count, covered distance and energy expenditure (EE), Wearables may be useful for (elite) athletes as well. In these populations, Wearables might be used to monitor aspects of training load (Düking et al., 2016) as well as physical activity during leisure time and provide biofeedback to optimize exercises (Düking et al., 2017).
However, before Wearables can be used beneficially, the parameters they provide need to be scientifically trustworthy. This implies that Wearables have sufficient validity, which unfortunately is often an issue, especially with commercially available Wearables. Several studies, recently summarized by Evenson et al. (2015) and Düking et al. (2016), tackled this issue and investigated the scientific trustworthiness of different Wearables under a variety of conditions like walking, jogging, cycling, or resistance exercise under laboratory as well as free-living conditions. Yet, scientific evaluations are, strictly speaking, only meaningful for the specific conditions the device was tested in, and the results of these studies should be transferred carefully (Bassett et al., 2012). For recreational people, testing under walking or light jogging conditions might be sufficient. For (elite) athletes, however, scientific trustworthiness needs to be given for a broad spectrum of velocities or even fast changes in velocities reflecting the demands of the sport. Literature on the validity of consumer-level Wearables under sport-specific conditions is scarce, even though some of the Wearables analyzed herein have been validated in the general population (El-Amrawy and Nounou, 2015; Alsubheen et al., 2016; An et al., 2017; Price et al., 2017).
Therefore, the aim of the present study was to investigate the (concurrent) criterion-validity of eleven consumer Wearables concerning step count, covered distance and EE during running at four different velocities, an intermittent profile reflecting conditions in a soccer match and a 15-min outdoor trial at a constant velocity.

MATERIALS AND METHODS
For the determination of the validity of step count, covered distance and EE, the criterion measures are described below. In order to test the validity of the eleven Wearables in a standardized situation under laboratory conditions, participants performed a running protocol of a total duration of 25 min, which consisted of four stages of different constant velocities lasting 5 min each, as well as a 5 min period of intermittent velocity. Validity for outdoor conditions was subsequently tested during a 15-min run at a constant velocity. The validity of the Wearables for step count, covered distance and EE was assessed during a single session of treadmill walking and running, using methods similar to previous validation studies (Takacs et al., 2014).

Subjects and Ethics Statement
A total of 20 healthy and active sport students (10 male and 10 female) volunteered to participate in this study. All subjects gave written informed consent to participate in the study. The study was performed in accordance with the Declaration of Helsinki and approved by the Ethics Committee of the German Sport University Cologne.

Criterion Measures
The Optogait system (OPTOGait, Microgate Srl, Bolzano, Italy) was used as the criterion measure for monitoring step count on the treadmill. The system is integrated within the sidebars of the treadmill (Pulsar, h/p/cosmos sports and medical GmbH, Traunstein, Germany) and uses a photoelectric cell system to precisely measure step count, which is a reliable (ICC = 0.962) and valid (ICC = 0.997) method for measuring step counts during treadmill trials (Lee et al., 2014).
Step count was additionally assessed by a manual counter, which was also used in the outdoor condition.
The covered distance measured by the treadmill was used as a criterion measure and was determined based on the calibrated treadmill output (displayed on the electronic output of the treadmill in meters, based on the speed of the treadmill belt and time for each revolution of the belt) according to Takacs et al. (2014). The slope of the treadmill was automatically set at 1%.
The Metamax 3B (Metamax 3B, CORTEX Biophysik GmbH, Leipzig, Germany) is a portable gas analyzer allowing measurements of oxygen uptake under laboratory and free-living conditions, which was used in this study to calculate EE via indirect calorimetry as the criterion measure for EE. For the calculation of EE, oxygen uptake (VO2) was measured continuously breath by breath during the whole exercise and calculated according to previous reports (Scott et al., 2006). Before each session, the Metamax 3B flowmeter and gas analyzers were calibrated using a 3-liter syringe and a known gas mixture (15% O2 and 5% CO2). During calibration of the gas analyzer (O2 and CO2 sensors), the Metamax 3B alternates sampling of the known gas mixture and ambient air. The Metamax 3B is a valid and reliable system for measuring oxygen uptake (Vogler et al., 2010). Methods of indirect calorimetry are the most commonly used to quantify human EE in both laboratory and field settings, typically by measuring oxygen uptake (Hills et al., 2014).
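As an illustration of how EE can be derived from breath-by-breath gas exchange, the following sketch uses the abbreviated Weir equation. This is an assumption made for illustration only: the study computed EE according to Scott et al. (2006), and the exact conversion used there may differ. The function names and the sample format are hypothetical.

```python
def ee_kcal_per_min(vo2_l_min, vco2_l_min):
    """Abbreviated Weir equation (assumed here, not confirmed by the study).

    vo2_l_min and vco2_l_min are gas exchange rates in litres per minute.
    """
    return 3.941 * vo2_l_min + 1.106 * vco2_l_min


def total_ee_kcal(samples):
    """Sum EE over samples given as (duration_s, vo2_l_min, vco2_l_min)."""
    return sum(
        ee_kcal_per_min(vo2, vco2) * duration_s / 60.0
        for duration_s, vo2, vco2 in samples
    )


# Hypothetical 5 min stage: 300 one-second samples at constant gas exchange.
stage = [(1.0, 2.0, 1.8)] * 300
stage_ee = total_ee_kcal(stage)
```

Summing per-sample contributions weighted by their duration keeps the sketch independent of the breathing frequency.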

Exercise Study Protocol
After arriving in the laboratory, anthropometric (weight, height, body fat) and personal data (date of birth, sex, handedness) of the participants were collected and transferred to all devices. Afterward, eleven Wearables were fixed at the wrist in a randomized order. The Bodymedia Sensewear armband and one Withings Pulse Ox device were placed on the backside of the upper arm and the hip, respectively. For the measurement of heart rate of the Garmin Wearables, the participants were fitted with a heart rate chest belt.
First, the participants were asked to lie down for 20 min. After the first 10 min, the measurement of resting EE was started using indirect calorimetry. Second, the running protocol was started, consisting of four 5 min stages of different constant velocities (walking: 4.3; 7.2; running: 10.1; 13.0 km·h−1), each separated by 5 min of passive rest. After these constant velocity stages, a 5 min period of intermittent velocity followed. This protocol was extracted from a smoothed running trial during a real soccer match (Amisco data from a match of the first German soccer league). The mean running velocity was 9.1 km·h−1, including twelve sprints with a maximal velocity of 22.4 km·h−1. Maximal acceleration and deceleration were 5.47 km·h−1·s−1 (1.52 m·s−2) and −4.88 km·h−1·s−1 (−1.36 m·s−2), respectively. The remaining time was covered with walking, defined by velocities smaller than 7.33 km·h−1, which is considered the preferred transition speed between walking and running (Rotstein et al., 2005). Besides the tests under laboratory conditions, ten participants (5 men, 5 women) performed a run of 2.4 km at a constant velocity of 10.1 km·h−1 under free-living conditions (Figure 1).
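The unit relations in this protocol can be checked with a short sketch. It assumes only the standard conversion 1 m·s−1 = 3.6 km·h−1; the function names are hypothetical. Under this conversion, an acceleration in m·s−2 corresponds to a change in velocity in km·h−1 per second, and the 2.4 km outdoor run at 10.1 km·h−1 lasts a little over 14 min, consistent with the 15-min outdoor trial mentioned in the Introduction.

```python
def ms2_to_kmh_per_s(acceleration_ms2):
    """Convert an acceleration from m/s^2 to km/h per second."""
    return acceleration_ms2 * 3.6


def run_duration_min(distance_km, velocity_kmh):
    """Duration in minutes of a run at constant velocity."""
    return distance_km / velocity_kmh * 60.0


max_acceleration = ms2_to_kmh_per_s(1.52)    # about 5.47 km/h per second
outdoor_duration = run_duration_min(2.4, 10.1)  # about 14.3 min
```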

Statistical Analysis
Descriptive statistics (mean ± SD) summarize the characteristics of the participants, including age, weight, height and percentage of body fat. All data were tested for normality, and no further transformation was needed. The validity of the Wearables was determined, as in previous validation studies (Kooiman et al., 2015; Bai et al., 2016; An et al., 2017), by several statistical tests: 1) Systematic differences between the Wearables and the criterion measurement: mean absolute percentage error (MAPE) compared to the criterion measurement, calculated as (mean difference [Wearable − criterion measurement] × 100 / mean criterion measurement). 2) Correlation between the Wearables and the criterion measurement: Intraclass Correlation Coefficient (ICC) (two-way random, absolute agreement, single measure, 95% confidence interval) (Shrout and Fleiss, 1979), common cutoff points for validity assessment:
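The two statistics named above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: `percentage_error` follows the formula given in the text (which keeps the sign of the difference), and `icc_2_1` implements the two-way random, absolute agreement, single measure ICC from the mean squares of Shrout and Fleiss (1979). Both function names are hypothetical, and confidence intervals are omitted.

```python
def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def percentage_error(wearable, criterion):
    """Mean difference (wearable - criterion) as a percentage of the mean criterion."""
    diffs = [w - c for w, c in zip(wearable, criterion)]
    return _mean(diffs) * 100.0 / _mean(criterion)


def icc_2_1(ratings):
    """ICC(2,1) for a table with one row per subject and one column per method."""
    n, k = len(ratings), len(ratings[0])
    grand = _mean(v for row in ratings for v in row)
    row_means = [_mean(row) for row in ratings]
    col_means = [_mean(row[j] for row in ratings) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)              # between-subject mean square
    ms_cols = ss_cols / (k - 1)              # between-method mean square
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
```

For a single Wearable, `ratings` would hold two columns per participant (Wearable value and criterion value). Because the absolute agreement form penalizes systematic offsets between the columns, a device with a constant bias receives a lower ICC than a consistency-type ICC would give.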

DISCUSSION
The aim of the present study was to investigate the criterion-validity of eleven Wearables for step count, covered distance and EE over a large spectrum of constant and intermittent velocities reflecting sports conditions. The results indicate that most Wearables, except Beurer AS80, Polar Loop, and Bodymedia Sensewear, provide an acceptable level of validity concerning step count for all constant velocities, the intermittent protocol, and the outdoor condition. The parameters covered distance and EE, however, exhibited low validity in all conditions for most of the Wearables. The Xaomi Miband was missing a large amount of data, and we therefore discourage using this Wearable to monitor step count, distance, and EE in sports conditions.
Step Count
In line with the present study, other laboratory-based studies also showed generally high correlations for step count between the criterion measure and Wearables (Takacs et al., 2014; Diaz et al., 2015; Evenson et al., 2015). Tudor-Locke et al. (2006) stated that Wearables generally should not exceed a MAPE of 1% compared to the criterion measure during walking on a treadmill at a speed of 4.8 km·h−1 in order to be considered accurate. Garmin Vivosmart, Garmin Vivoactive, Garmin Forerunner 920XT, Fitbit Charge HR, and Withings Pulse Ox (Hip) had a MAPE <1% over all test conditions. Fitbit Charge and Garmin Vivofit had a slightly higher MAPE of <3%, still representing good results. Bodymedia Sensewear, Polar Loop, and Beurer AS80 had a MAPE between 3.7 and 15.5%, whereby all devices underestimated the number of steps taken. When errors were higher, the direction tended to be an underestimation of step count by the tracker compared to the criterion. This may be particularly problematic at slow walking speeds (Evenson et al., 2015). Garmin Vivosmart, Garmin Vivoactive, Fitbit Charge HR, and Withings Pulse Ox indicated the narrowest LoA (less than 50 steps for the constant velocities). This can be considered a relatively small range. The range between the upper and lower LoA of Bodymedia Sensewear, Polar Loop, and Beurer AS80 (up to 200 steps) is considered too large for these devices to be used interchangeably with the criterion measure. In a sport-specific condition such as a marathon run at an average velocity of 10.1 km·h−1, with an average step count of 60,000 steps, this represents an error of +60 steps for Fitbit Charge HR or −7,500 steps for Bodymedia Sensewear.
For the intermittent velocities, which are typical for most sport disciplines, the discrepancy was high, revealing an underestimation for all Wearables ranging from −14 ± 40 steps (Garmin Vivosmart) to −198 ± 91 steps (Withings Pulse Ox Wrist). In intermittent sports, such as a 90 min competitive soccer game, players will take on average about 13,000 steps, which represents a small error of −143 steps for Fitbit Charge HR/Garmin Vivosmart up to a high underestimation of 2,106 steps for Beurer AS80.
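The extrapolations to a marathon and a soccer game amount to proportional scaling of a relative error to a total step count. The sketch below back-calculates the relative errors implied by the totals quoted in the text; these percentages are derived here for illustration and are not values reported by the study. The function names are hypothetical.

```python
def implied_percentage_error(error_steps, total_steps):
    """Relative error (in %) implied by an absolute error over a total."""
    return error_steps / total_steps * 100.0


def extrapolated_error(percentage_error, total):
    """Absolute error expected when a relative error is scaled to a total."""
    return percentage_error / 100.0 * total


# Marathon, 60,000 steps in total.
marathon_fitbit = implied_percentage_error(60, 60000)        # +0.1 %
marathon_sensewear = implied_percentage_error(-7500, 60000)  # -12.5 %

# 90 min soccer game, about 13,000 steps in total.
soccer_beurer = implied_percentage_error(-2106, 13000)       # -16.2 %
```

The same scaling applies to any other total, which makes clear why a relative error that looks small over a 5 min stage can add up to several thousand steps over a full competition.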
The outdoor condition, which was run at the same velocity as the third stage on the treadmill (10.1 km·h−1), showed results similar to the laboratory testing at constant velocities.
In summary, the step count of most Wearables, except Bodymedia Sensewear, Polar Loop, and Beurer AS80, proved to be valid. However, there is a general tendency to underestimate the number of steps. One might speculate that reduced arm movement while walking or running leads to an underestimation of the step count. Furthermore, the underestimation might result from the sensitivity settings of the accelerometers and from differences between algorithms. Manufacturers face the problem that Wearables should not count every single arm movement during daily life as a step. Therefore, the acceleration needs to exceed a certain threshold to be processed by the algorithm and counted as a step.
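The threshold idea described above can be illustrated with a minimal sketch. It does not reproduce any manufacturer's algorithm, which is not disclosed in the study; the function name `count_steps`, the threshold values, the refractory window and the example signal are all hypothetical.

```python
def count_steps(acc_magnitude, threshold=1.2, refractory=2):
    """Count upward crossings of `threshold` in an acceleration magnitude signal.

    A crossing is ignored if it occurs fewer than `refractory` samples
    after the previously counted step.
    """
    steps = 0
    last_step = -refractory  # lets the very first crossing count
    for i in range(1, len(acc_magnitude)):
        crossed_up = acc_magnitude[i - 1] < threshold <= acc_magnitude[i]
        if crossed_up and i - last_step >= refractory:
            steps += 1
            last_step = i
    return steps


# Hypothetical signal with two clear peaks and one weak movement (1.1).
signal = [1.0, 1.3, 1.0, 1.0, 1.0, 1.5, 1.0, 1.1, 1.0]
strict_count = count_steps(signal, threshold=1.2)
lenient_count = count_steps(signal, threshold=1.05)
```

With the stricter threshold the weak movement is not counted, which mirrors the trade-off described in the text: a threshold high enough to reject incidental arm movements will also miss genuine steps that produce little wrist acceleration.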

Covered Distance
The measurement of covered distance showed no consistent discrepancy over the different velocities between the Wearables and the criterion measure. The Wearables mainly showed an overestimation of distance for constant slower velocities (4.3 and 7.2 km·h−1) and an underestimation of distance for higher velocities (13.0 km·h−1). This is in line with the study of Takacs et al. (2014), showing an overestimation for slower speeds (3.2-4.7 km·h−1) and an underestimation for faster speeds (6.4 km·h−1). In elite sport, fast running velocities often occur, and consequently, the covered distance will be underestimated in these instances with the presented Wearables. The highest MAPE (−18.1 to 58.3%) of all Wearables was reached at the velocity of 7.2 km·h−1, whereas the lower walking velocity (4.3 km·h−1) showed a better MAPE (1.3 to 19%). The ICC ranged from 0.0 to 0.2 for all tested conditions, indicating poor agreement with the criterion measure. This is in line with the study of Takacs et al. (2014), showing small ICC between 0.0 and 0.05. Although Garmin Vivosmart, Garmin Vivoactive, Fitbit Charge, and Fitbit Charge HR showed the narrowest LoA, the range is still too wide. In sport-specific situations, like a marathon run at 10.1 km·h−1, covered distance will be overestimated by ∼2.94 km with Garmin Forerunner 920XT, or underestimated by ∼16.9 km with Beurer AS80.
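The limits of agreement (LoA) referred to above are not defined in the surviving text. The sketch below assumes the conventional Bland-Altman 95% limits, that is, the mean difference ± 1.96 standard deviations of the differences; whether the study used exactly this definition is an assumption. The function name and the example values are hypothetical.

```python
from statistics import mean, stdev


def limits_of_agreement(wearable, criterion):
    """Return (lower LoA, bias, upper LoA) for paired measurements."""
    diffs = [w - c for w, c in zip(wearable, criterion)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of the differences
    return bias - spread, bias, bias + spread


# Hypothetical distances in metres for three participants.
lower, bias, upper = limits_of_agreement([98, 100, 102], [100, 100, 100])
```

The width `upper - lower` is the quantity compared between devices in the text: a narrow interval means individual readings stay close to the criterion even if a small systematic bias remains.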
In the intermittent protocol, the covered distance derived from the Wearables shows a high discrepancy compared to the criterion measure, with some Wearables overestimating (Withings Pulse Ox Hip, Garmin Forerunner 920XT, Garmin Vivoactive, Garmin Vivosmart) and others underestimating this parameter (Fitbit Charge HR, Fitbit Charge, Garmin Vivofit, Beurer AS80). For intermittent sports, like a 90 min soccer game (mean distance 12 km), the covered distance will be underestimated by ∼1,080 m using Withings Pulse Ox Hip up to ∼5,076 m using Beurer AS80 based on our findings.
The outdoor condition (10.1 km·h−1) showed a similarly high MAPE compared to the laboratory condition, with the same Wearables overestimating (Withings Pulse Ox Wrist and Hip, Garmin Forerunner 920XT, Garmin Vivoactive, Garmin Vivosmart) or underestimating (Fitbit Charge HR, Fitbit Charge, Garmin Vivofit, Beurer AS80) the covered distance.
In summary, for monitoring the covered distance, no Wearable could achieve good validity for all laboratory-based constant and intermittent velocities as well as in the outdoor condition. We acknowledge that the covered distance can be assessed by other Wearables employing, for example, receivers for Global Navigation Satellite Systems such as Global Positioning Systems (Cummins et al., 2013) and it seems that this technology is superior
The outdoor condition showed a completely contrary pattern compared to the laboratory condition (10.1 km·h−1). While all devices underestimate the EE in the outdoor condition, most of the devices overestimate EE in the comparable laboratory condition. This is surprising, but may be an issue of reliability, an aspect we intentionally did not target in our study. To clarify this, we encourage researchers to conduct reliability studies on the presented Wearables. In summary, the presented Wearables should be used very cautiously to assess EE.

LIMITATIONS
Generally, we have to acknowledge some limitations of the present study. First, there might be some limitations arising from calculating EE via indirect calorimetry using the Metamax 3B (Lighton, 2008). Even though the experiments were conducted within 2 weeks, which might limit the degradation of the oxygen sensor, previous studies showed that the Metamax 3B produces acceptably stable and reliable results but is not adequately valid during moderate and vigorous exercise without some further correction of VO2 and VCO2 (Macfarlane and Wong, 2012). As in every validation study, we cannot be entirely sure whether some error arises from the criterion measure, and we encourage readers to interpret the results of this study in light of these limitations.
Second, the velocities on the treadmill were not randomized, as we expected that higher velocities would influence slower velocities more than the other way round. Therefore, we decided not to randomize the velocities, but to gradually increase the velocity. Additionally, during the 5 min rest periods, spirometric and heart rate values decreased to resting levels. Nevertheless, we cannot completely rule out a cardiovascular drift.
Third, in comparison to several previous validation studies (Kooiman et al., 2015; Bai et al., 2016; An et al., 2017), we investigated a similar number of subjects. However, the relatively small sample size might limit the statistical power of the present results. There are several statistical approaches for validation studies. However, possibly no statistical approach will remain uncriticised, and every approach has its advantages and drawbacks. In line with previously published validation studies (Kooiman et al., 2015; Bai et al., 2016; An et al., 2017), we adopted the statistical approach used in these studies.

CONCLUSION
In our study, most Wearables provide an acceptable level of validity for step counts at different constant and intermittent running velocities reflecting sports conditions. The most valid Wearables, represented by the smallest MAPE, to monitor step count were Garmin Vivosmart, Garmin Vivoactive, Garmin Forerunner 920XT, Fitbit Charge, Fitbit Charge HR and Withings Pulse Ox (Hip). Yet, the covered distance, as well as the EE, could not be assessed validly with the investigated Wearables. Especially in sport specific conditions, like a marathon run or a 90 min soccer game, covered distance and EE showed high errors for nearly all Wearables. Consequently, covered distance and EE should not be monitored with the presented Wearables.