Comparing the reliability of muscle oxygen saturation with common performance and physiological markers across cycling exercise intensity

Introduction: Wearable near-infrared spectroscopy (NIRS) measurements of muscle oxygen saturation (SmO2) demonstrated good test–retest reliability at rest. We hypothesized that SmO2 measured with the Moxy monitor at the vastus lateralis (VL) would demonstrate good reliability across intensities, and that its relative reliability would be lower than that of volume of oxygen consumption (V̇O2) and heart rate (HR), and higher than that of concentration of blood lactate accumulation ([BLa]) and rating of perceived exertion (RPE). We aimed to estimate the reliability of SmO2 and common physiological measures across exercise intensities, as well as to quantify within-participant agreement between sessions.
Methods: Twenty-one trained cyclists completed two trials of an incremental multi-stage cycling test with 5 min constant workload steps starting at 1.0 watt per kg bodyweight (W·kg−1) and increasing by 0.5 W·kg−1 per step, separated by 1 min passive recovery intervals until maximal task tolerance. SmO2, HR, V̇O2, [BLa], and RPE were recorded for each stage. Continuous measures were averaged over the final 60 s of each stage. Relative reliability at the lowest, median, and highest work stages was quantified as intraclass correlation coefficient (ICC). Absolute reliability and within-subject agreement were quantified as standard error of the measurement (SEM) and minimum detectable change (MDC).
Results: Comparisons between trials showed no significant differences within each exercise intensity for all outcome variables. ICC for SmO2 was 0.81–0.90 across exercise intensity. ICC for HR, V̇O2, [BLa], and RPE were 0.87–0.92, 0.73–0.97, 0.44–0.74, 0.29–0.70, respectively. SEM (95% CI) for SmO2 was 5 (3–7), 6 (4–9), and 7 (5–10)%, and MDC was 12%, 16%, and 18%.
Discussion: Our results demonstrate good-to-excellent test-retest reliability for SmO2 across intensity during an incremental multi-stage cycling test. V̇O2 and HR had excellent reliability, higher than SmO2. [BLa] and RPE had lower reliability than SmO2. Muscle oxygen saturation measured by wearable NIRS was found to have similar reliability to V̇O2 and HR, and higher than [BLa] and RPE across exercise intensity, suggesting that it is appropriate for everyday use as a non-invasive method of monitoring internal load alongside other metrics.


Background
Muscle oxygenation is becoming an increasingly popular physiological marker of internal load during exercise for athletes, practitioners, and sport scientists (1). With the introduction of wearable near-infrared spectroscopy (NIRS), monitoring muscle oxygen saturation (SmO 2 ) in real time at the location where the activity occurs is becoming increasingly viable. For measurements intended to be used in everyday training environments, establishing reliability between sessions (i.e., test-retest reliability) is a necessary step before use. Despite its importance, deciphering what change over time is expected in common physiological markers remains challenging. Part of this challenge is due to the variety of ways reliability can be assessed using the same measure for different sporting applications (2).
Both relative and absolute test-retest reliability of muscle oxygenation signals measured by the Moxy muscle oxygen monitor (Fortiori Design LLC., Hutchinson, MN, USA), in particular muscle oxygen saturation, have been investigated in the conditions of rest, vascular occlusion, and exercise (3-5). Relative reliability evaluates the observed measurement error relative to observed differences between participants and relates to the ability to consistently differentiate participants. Absolute reliability evaluates the consistency of repeated measurements (repeatability) within a single individual (2, 6). Agreement evaluates how close any two measurements are, in the original units of the parameter being evaluated (2, 6). For this sensor, comparing the reliability of SmO 2 with other common physiological measures such as heart rate (HR), rate of systemic oxygen uptake (V̇O 2 ), concentration of blood lactate ([BLa]), and rating of perceived exertion (RPE), will provide a useful contrast for each of these common markers during the same exercise protocol.
The behaviour of SmO 2 has been studied in both incremental ramp exercise and constant workload (CW) protocols at various intensities (7-14). SmO 2 is a measure of the relative concentration of oxygenated and deoxygenated hemoglobin and myoglobin in the tissue under the sensor, and reflects the balance of local O 2 delivery and O 2 extraction (i.e., O 2 supply to match energetic demand) (3, 4, 15). A higher rate of O 2 delivery than extraction, all else equal, will tend to produce rising SmO 2 (excess supply). A higher rate of O 2 extraction over delivery will produce falling SmO 2 (excess demand). At higher exercise intensity SmO 2 at locomotor muscles tends to be lower, reflecting the greater difficulty matching O 2 delivery to extraction. With increasing workload, a desaturation is expected for SmO 2 , indicating increased oxygen extraction at the observed tissues (19). During incremental cycling exercise, SmO 2 measured at the vastus lateralis (VL) typically desaturates monotonically as a function of increasing exercise intensity, generally following a sigmoidal or segmented linear profile (16-20).
During CW cycling exercise, SmO 2 at the VL will rapidly deoxygenate before reaching a quasi-steady state within 60-120 s, similar to the primary onset phase of systemic V̇O 2 kinetics (21,22). At this quasi-steady state, SmO 2 may gradually rise at lower intensity, and gradually fall at higher intensity (15,23,24).
An incremental multi-stage cycling test combines aspects of both incremental ramp and CW exercise in a single protocol. Constant workload stages are performed at progressively increasing workloads up to maximal exercise tolerance. Work steps are ideally 3 min in duration or longer to permit physiological measures to reach a relative steady-state at each workload (25,26). Between each work step, a 1 min passive recovery interval is added to observe the onset desaturation response of each work step. By employing this protocol, the progressive effects of each workload on all performance and physiological markers can be observed in relative isolation in a time efficient single session (27,28). Recently, this protocol has been used to compare SmO 2 with [BLa], using a similar rationale outlined previously (28). Despite the benefits of segmenting each intensity with an integration of passive rest periods for identifying the internal load response of each work step, reliability of SmO 2 during such a protocol remains unclear.
Multiple studies have validated various stationary NIRS sensors with each of the physiological and performance markers outlined above, during various modes of exercise (12, 17, 22, 29-31). Despite these reports, how the test-retest reliability of portable, wearable NIRS compares with that of each of these markers during a common exercise assessment protocol remains unclear. If wearable NIRS is intended as a commercially available, noninvasive instrument to be used by athletes and coaches to provide useful information about internal load during exercise, addressing this question is of great importance. The purpose of our study was to estimate the reliability of muscle oxygenation alongside common performance and physiological markers across exercise intensity, as well as to provide absolute agreement in original units that can be expected between sessions for each marker.
We hypothesized that SmO 2 measured with the Moxy muscle oxygen monitor at the VL would demonstrate good test-retest reliability across intensity during an incremental multi-stage cycling test (3-5, 16, 28). We also hypothesized that relative reliability scores for SmO 2 would be lower than for V̇O 2 and HR, higher than for [BLa] and RPE, and that SmO 2 would have larger thresholds for minimal detectable change compared with V̇O 2 and HR, but not RPE and [BLa] (32-39).

Participants
Twenty-one trained cyclists (10 females and 11 males; 29.1 ± 7.8 years of age, 69.6 ± 11.3 kg, 174 ± 11 cm, 58.6 ± 7.9 ml kg −1 min −1 peak oxygen uptake) volunteered and provided written informed consent to participate in this experiment. Descriptive statistics are presented in Table 1. To obtain a statistical power of 0.8 (1 − β) with α = 0.05, an a priori sample size calculation was made in G*Power software (version 3.1.9.7, Kiel, Germany) using previously reported data from other groups that compared SmO 2 values within and between sessions during ramp incremental tests and severe intensity efforts (5, 15, 16, 40, 41). This study was conducted in accordance with the principles established in the Declaration of Helsinki and approved by the research ethics committee of The University of British Columbia (H21-00446). The inclusion criteria were healthy athletes with at least 2 years of training experience in cycling, aged 18-48 years, V̇O 2 peak ≥45 ml kg −1 min −1 for females and ≥50 ml kg −1 min −1 for males, non-smokers, with no history of cardiovascular disease, and no injuries requiring time away from training within the previous six months (42-44). Sex was classified by self-report and all participants identified as either male or female.
Yogev et al. 10.3389/fspor.2023.1143393 Frontiers in Sports and Active Living

Experimental design
Participants visited the laboratory on two occasions one or two weeks apart, at the same time of day (±1 h) to assess test-retest reliability of NIRS parameters during an incremental multi-stage cycling test. Trials were performed outside of the participants' competitive season. Participants were instructed to avoid strenuous exercise for 24 h, avoid alcohol for 12 h, avoid caffeine for 4 h, maintain the same diet for 12 h, and get at least 8 h of sleep prior to both trials.
At the start of the first trial, participants' height and mass were recorded, and skinfold measurements (Harpenden Skinfold Caliper, Baty International, West Sussex, England) were taken at the right vastus lateralis at the placement of the NIRS sensor (detailed below).
Each trial consisted of an incremental multi-stage cycling test, with 5 min CW stages, each followed by a 1-min passive rest interval. The protocol began at 1.0 W kg −1 and progressed by 0.5 W kg −1 per stage. Participants performed the first trial at a self-selected cadence and maintained cadence within ±5 rpm of that value throughout the second trial. Participants performed the first trial to maximal exercise tolerance or until cadence fell by more than 10 rpm for 10 s, despite strong verbal encouragement. This cutoff was selected to minimize the effect of changing cadence on SmO 2 (45). In the second trial, the protocol was repeated to the same duration as the first trial (n = 18) or to the limits of tolerance if they were unable to reach the same end point (n = 3). Participants maintained a seated position with both hands on the handlebars in their preferred riding position for the entire exercise protocol.
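As an illustration of the protocol structure described above, the following minimal sketch generates the absolute workload and timing of each work stage for a given body mass. The function names and the rounding to 0.1 W are hypothetical choices made for this example and are not taken from the study.

```python
def stage_workloads(body_mass_kg, n_stages, start_w_per_kg=1.0, step_w_per_kg=0.5):
    """Absolute workload (W) of each stage: 1.0 W/kg start, +0.5 W/kg per stage."""
    return [
        round((start_w_per_kg + i * step_w_per_kg) * body_mass_kg, 1)
        for i in range(n_stages)
    ]


def stage_schedule(body_mass_kg, n_stages, work_s=300, rest_s=60):
    """Return (start_s, end_s, watts) per work stage.

    Each 5 min work stage is followed by a 1 min passive rest interval,
    so consecutive work stages start 360 s apart.
    """
    schedule = []
    t = 0
    for watts in stage_workloads(body_mass_kg, n_stages):
        schedule.append((t, t + work_s, watts))
        t += work_s + rest_s
    return schedule


if __name__ == "__main__":
    # Hypothetical 70 kg rider completing three stages.
    for start, end, watts in stage_schedule(70, 3):
        print(f"{start:>4}-{end:<4} s  {watts} W")
```

The number of stages is left as an input because, in the protocol, it depends on each participant's limit of tolerance.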

Data collection
The cycling protocol was performed on the participant's own bicycle mounted to an electronically controlled stationary trainer (Tacx NEO 2T, Garmin International Inc., Olathe, KS, USA). Resistance was controlled and cycling power output (watts) and cadence [revolutions per minute (rpm)] were recorded at 1 Hz using PerfPRO Studio Software (Hartware Technologies, Rockford, MI, USA) installed on a laptop computer.
Heart rate [beats per minute (bpm)] was recorded from a chest strap monitor (Garmin International Inc., Olathe, KS, USA) at 1 Hz. Capillary [BLa] (mmol·L −1 ) was sampled from the fingertip during the last 30 s of each work stage and measured with an electrochemical biosensor analyzer (Edge USA, USA). RPE was reported at the same time using the Borg category-ratio 0-10 scale (46). Expired gases were collected and sampled by an open circuit metabolic analyzer (TrueOne 2400, ParvoMedics Inc., Sandy, UT, USA) from a 4 L mixing chamber and V̇O 2 was recorded as 15 s average values.
Muscle oxygen saturation was measured at the right vastus lateralis (VL) using a wearable NIRS sensor (Moxy Monitor, Fortiori Design LLC., Hutchinson, MN, USA). The Moxy monitor is a self-contained, relatively low cost, wearable continuous wave NIRS sensor that can resolve an arbitrarily scaled heme volume that represents both hemoglobin and myoglobin concentration in the tissue under illumination, as well as SmO 2 on a 0%-100% scale (4). The sensor was positioned on the right VL muscle belly at ⅓ the distance from the patella to greater trochanter with the participant seated with knee bent to 90°. The sensor was secured with adhesive tape along with the manufacturer-supplied light shield to minimize signal interference from ambient light and movement.
Moxy employs four wavelengths of near-infrared light (680, 720, 760, and 800 nm), with a single LED source and two detectors at 12.5 and 25 mm separation, giving a maximum penetration depth of approximately 12.5 mm (3). It is recommended that participant skinfold thickness (SF) be less than the maximum penetration depth (SF < ½ the maximum inter-optode distance); however, to better represent real-world use of NIRS with competitive athletes, no exclusion was made on the basis of skinfold measurements, nor was a correction performed for SF (47). The SmO 2 signal was used as the primary muscle oxygenation variable in this study (3-5). SmO 2 was recorded every 2 s (0.5 Hz) and smoothed with a 5 s moving average as per manufacturer default settings.

Data analysis
V̇O 2 peak was determined as the highest two consecutive 15 s measurements within each trial. Peak workload (Wpeak) was determined for each trial using the following formula:
$$W_{peak} = W_C + \frac{t}{300} \cdot \Delta W$$
where WC is the workload of the prior completed stage, ΔW is the final incremental workload (0.5 W kg −1 ), and t is the time completed at the final work stage (in seconds, out of the 300 s stage duration). HRpeak was determined as the highest 1 s value recorded during each trial.
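A minimal sketch of the peak calculations defined above, under two stated assumptions: that Wpeak interpolates linearly over the fraction of the final stage completed (WC plus t divided by the stage duration, times ΔW), and that the two highest consecutive 15 s V̇O 2 values are averaged. The function names are hypothetical.

```python
def peak_workload(w_completed, t_final_s, delta_w=0.5, stage_s=300):
    """Wpeak = WC + (t / stage duration) * deltaW (assumed linear interpolation).

    w_completed : workload of the last fully completed stage (W/kg)
    t_final_s   : time completed in the final, unfinished stage (s)
    delta_w     : workload increment per stage (W/kg)
    stage_s     : stage duration (s); 300 s for the 5 min stages
    """
    if not 0 <= t_final_s <= stage_s:
        raise ValueError("t_final_s must lie within one stage duration")
    return w_completed + (t_final_s / stage_s) * delta_w


def vo2_peak(vo2_15s):
    """Highest mean of two consecutive 15 s VO2 values (averaging is assumed)."""
    if len(vo2_15s) < 2:
        raise ValueError("need at least two 15 s values")
    return max((a + b) / 2 for a, b in zip(vo2_15s, vo2_15s[1:]))


if __name__ == "__main__":
    # Hypothetical rider: completed the 4.0 W/kg stage, then 150 s of the next.
    print(peak_workload(4.0, 150))
    print(vo2_peak([50, 56, 58, 54]))
```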
To evaluate test-retest reliability of each parameter across exercise intensity between the two trials, we analyzed the first, median, and last work stages performed by each participant. These will be referred to as lowest, median, and highest workloads. Continuous variables (SmO 2 , V̇O 2 , and HR) were averaged from the last 60 s within each work stage, omitting the final 10 s to exclude influence from the end of work stage transition to rest.
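The end-of-stage averaging described above can be sketched as follows. The exact window boundaries are an assumption: here the 60 s window is taken to end 10 s before the stage finishes. Both values are parameters so that a different reading of the procedure can be substituted. The function name and the (time, value) sample layout are hypothetical.

```python
def stage_mean(samples, stage_end_s, window_s=60, omit_s=10):
    """Mean of a signal over the end-of-stage window.

    samples     : iterable of (time_s, value) pairs for one signal
    stage_end_s : time at which the work stage ends (s)
    window_s    : length of the averaging window (s)
    omit_s      : final seconds of the stage excluded from the window (s)

    Assumed window: [stage_end - omit - window, stage_end - omit).
    """
    upper = stage_end_s - omit_s
    lower = upper - window_s
    values = [v for t, v in samples if lower <= t < upper]
    if not values:
        raise ValueError("no samples fall inside the averaging window")
    return sum(values) / len(values)


if __name__ == "__main__":
    # Synthetic 0.5 Hz signal (one sample every 2 s) over a 300 s stage,
    # where each value simply equals its timestamp.
    signal = [(t, float(t)) for t in range(0, 300, 2)]
    print(stage_mean(signal, stage_end_s=300))
```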
During analysis, three sets of measurements were excluded due to measurement error or methodological inconsistencies. First, gas exchange measurements for two participants were excluded: all V̇O 2 data from one participant, and the data from the final stage of the other. Second, technical issues with the ergometer resistance occurred for two participants, which affected the lowest and median work stages, respectively. These issues led to the [BLa], V̇O 2 , and SmO 2 data having to be discarded. Lastly, the HR signal during the first stage in one participant was not acquired.

Statistical analysis
Data analysis and statistical analysis were carried out in R (v4.1.2, R Foundation for Statistical Computing, Vienna, Austria). Descriptive results are presented as mean ± standard deviation (SD) or 95% confidence interval (CI95%), as indicated. A two-way repeated measures ANOVA was performed to evaluate the effects of workload (lowest, median, highest) and trial (1, 2) on each variable. Pairwise post hoc comparisons were made to evaluate for a main effect of trial and adjusted for multiple comparisons with the Bonferroni method. Significance was set at p < 0.05. Normality was assessed by Shapiro-Wilk tests.
Absolute reliability and within-subject agreement were quantified as the SEM in original units, which is appropriate for homoscedastic data (2, 49, 50). All outcome variables evaluated in the present study were found to be homoscedastic from visual analysis and Levene's test, except for [BLa]. Absolute reliability was also quantified as a coefficient of variation (CV) from within-subject repeated measurements (Atkinson & Nevill, 1998).

Test-retest comparisons
Data from a representative participant performing the full incremental multi-stage cycling protocol, with muscle oxygen saturation (SmO 2 ) response from two trials are displayed in Figure 1. Data from the same participant at the lowest, median, and highest workloads are displayed in Figure 2. Descriptive group results for trials 1 and 2 can be found in Table 2. Test-retest comparisons between trials at the lowest, median, and highest workloads are displayed in Figure 3. Between-trials comparisons showed no significant differences within each workload for all outcome variables. Workload at the lowest, median, and highest stages were 71 ± 11, 199 ± 47, and 303 ± 80 W, respectively. Cadence showed an effect of workload between lowest and median (p < 0.01), and lowest and highest workloads (p < 0.001), with no difference between median and highest (94 ± 8) workloads (p = 0.062) (Table 2).

Relative reliability
ICC for SmO 2 were good to excellent. ICC for HR and V̇O 2 were good to excellent, apart from V̇O 2 at the lowest workload, which was moderate. ICC for [BLa] and RPE were poor to moderate (Table 3).
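For readers who want to reproduce this kind of analysis, the sketch below computes a two-way random-effects, absolute-agreement, single-measurement ICC, often written ICC(2,1), from a participants-by-trials table and maps it to qualitative labels. Both the choice of ICC form and the cut-offs (0.50, 0.75, 0.90) are assumptions made for this example. They are commonly used conventions and are not confirmed by the text above. The function names are hypothetical.

```python
def icc_2_1(data):
    """ICC(2,1) from a list of rows (one row per participant, one value per trial)."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between participants
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between trials
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def interpret_icc(icc):
    """Qualitative label using assumed cut-offs of 0.50, 0.75 and 0.90."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"


if __name__ == "__main__":
    # Synthetic SmO2 values (%) for three participants over two trials.
    smo2 = [[60, 62], [48, 47], [35, 38]]
    value = icc_2_1(smo2)
    print(round(value, 3), interpret_icc(value))
```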

Discussion
The aim of this study was to estimate and compare the reliability of wearable NIRS with common performance and physiological measures across an incremental multi-stage cycling test on a stationary cycling ergometer. Our results for test-retest analysis supported our hypothesis that SmO 2 would demonstrate good reliability across intensity, with excellent ICC scores at the lowest and median workloads, and a good ICC at the highest workload. With regards to relative reliability compared with the remaining variables, SmO 2 provided similar ICC values to both HR and V̇O 2 , scoring higher than V̇O 2 at the lowest workload, but not as high as both of these variables at the highest workload. Compared with [BLa] and RPE, SmO 2 had better ICC scores across intensity (Table 3). This suggests that for use as an everyday tool for monitoring training load, SmO 2 is similarly sensitive to physiological variability as these other measures.
Compared to previous studies of reliability of wearable NIRS (3-5), our results demonstrated higher absolute agreement
Figure 1. Data shown for a representative participant performing two trials of an incremental multi-stage cycling test. Work stages are 5 minutes with 1-min passive rest intervals. Workload starts at 1.0 W·kg −1 and increases by 0.5 W·kg −1 each stage. Muscle oxygen saturation (SmO 2 ) overlaid from Trial 1 (light blue) and Trial 2 (dark blue). Participants performed the first trial to the maximal limits of tolerance. The second trial was repeated to the same duration, or to the limits of tolerance if they were unable to reach the same duration as the first trial.
Figure 2. Data shown for a representative participant at the lowest, median, and highest work stages from two trials, with workloads displayed in W·kg −1 . Muscle oxygen saturation (SmO 2 ) recorded from Trial 1 (light blue) and Trial 2 (dark blue). Mean SmO 2 data from the last 60 seconds of each work stage was used for test-retest comparisons (red shaded areas).
They reported less variability with CV scores of 6, 8, and 10% at baseline, post exercise recovery, and maximal task tolerance, respectively. Despite the lower variability, their relative reliability scores were lower than the ones detected in our results. Unlike the Moxy sensor used in our study, the PortaMon sensor is more expensive, offering greater illumination depth and sampling rate. A study by McManus et al., compared between the PortaMon and Moxy during rest, exercise, and arterial occlusion (3). They found that Moxy displayed a greater dynamic range during exercise and arterial occlusion. Therefore, it is possible that the differences in both absolute and relative reliability between van Hooff's group and our results is explained by sensor differences.
Our results in Table 3 provide useful information for athletes, practitioners, and sport scientists who wish to monitor changes in VL SmO 2 over time using a Moxy wearable sensor. Having SEM and MDC quantified across intensity can aid in understanding the expected day to day variance for SmO 2 during regular training and testing. Changes within a range of ±SEM may be meaningful for an individual athlete but should not be considered to represent a significant difference. A longitudinal intervention such as a structured training program can be considered to have a significant effect on SmO 2 if the values measured at a given absolute workload change by more than the MDC. Additionally, having all other common measures presented in conjunction with SmO 2 improves our understanding of what each expected change should be for specific exercise workloads over time.
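The interpretation rule described above can be sketched in a few lines. The formulas used are common conventions and are assumptions here, since the text does not state how SEM and MDC were computed: SEM as SD × √(1 − ICC), and the 95% MDC as SEM × 1.96 × √2. The function names and example values are hypothetical.

```python
import math


def sem_from_icc(sd, icc):
    """Standard error of measurement, assuming SEM = SD * sqrt(1 - ICC)."""
    return sd * math.sqrt(1 - icc)


def mdc95(sem):
    """Minimum detectable change at 95% confidence, assuming SEM * 1.96 * sqrt(2)."""
    return sem * 1.96 * math.sqrt(2)


def exceeds_mdc(change, sem):
    """True if an observed between-session change is larger than the MDC."""
    return abs(change) > mdc95(sem)


if __name__ == "__main__":
    # Hypothetical example: SmO2 at a fixed workload falls by 15 percentage
    # points after a training block, with an SEM of 5 percentage points.
    sem = 5.0
    print(round(mdc95(sem), 1))
    print(exceeds_mdc(-15, sem))
```

Under these assumptions, a change smaller than the MDC cannot be distinguished from measurement error and day-to-day variation at the chosen confidence level.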

SmO 2 compared to V̇O 2 and HR
As expected, V̇O 2 presented excellent reliability at all but the lowest workload, potentially due to a signal-to-noise ratio issue. Since the early 1980s, V̇O 2 has been shown to demonstrate good-to-excellent test-retest reliability scores with a CV of approximately 10% across intensity during incremental exercise tests under controlled conditions (35, 37). In a recent investigation on the reliability of V̇O 2 during an incremental exercise test (IET), Pallarés et al. (2016) observed the reliability of V̇O 2 associated with the gas exchange threshold and the respiratory compensation point (37). These are common thresholds used to demarcate the transitions from moderate to heavy and heavy to severe intensity domains, respectively. They reported excellent test-retest ICC scores for both the gas exchange threshold and the respiratory compensation point. In their report, within-subject reliability for both thresholds was found to have slightly lower CV (3.6% and 2.1%, respectively) than those found in our study at the median and highest workload (6.4% and 6.5%) (37).
In another study, reliability of V̇O 2 and ventilatory measurements were estimated for the ParvoMedics TrueOne 2400 metabolic cart during a cycling ergometer IET (52). The protocol included 10-12 min work steps with 50 W increments, starting at a resistance of 50 W and progressing to 250 W. For V̇O 2 , their results showed a CV of approximately 5%, providing further strength to our findings, especially during the highest workload.
As for HR, Montoye et al. observed the reliability of HR using a chest strap during an IET. The protocol included 3 min work steps with 50 W increments, starting at a resistance of 50 W to maximal task tolerance. Their results showed excellent test-retest reliability scores, with CV less than 5% as seen in our results (36). In our results, HR showed CV ranging from 2.6% to 4.7% across intensity, and good to excellent ICC scores, which was higher than SmO 2 only at the highest workload.
As expected, both V̇O 2 and HR did show better reliability compared with SmO 2 , but only during the highest intensity. Despite difference in the mechanisms related to each measurement, comparing them side-by-side during an incremental exercise test provides an interesting contrast to better understand how reliable SmO 2 is between sessions, in reference to these two common measures.
Table 2. The number of participants (n) with data from two trials used for test-retest comparisons at lowest, median, and highest workloads with mean Trial 1 and 2, mean of both trials together (Grand mean) and difference between trials (Mean diff). Muscle oxygen saturation (SmO 2 ), heart rate (HR), systemic oxygen uptake (V̇O 2 ), concentration of blood lactate ([BLa]), and rating of perceived exertion (RPE). a Indicates significant differences between lowest and median work stages. b Indicates significant differences between lowest and highest work stages. c Indicates significant differences between median and highest workloads. There were no significant differences between trials.
(38). It is worth highlighting that both [BLa] and RPE were taken using single measurements at the end of each stage, unlike the other continuous measures. They are also more prone to subjective interpretation of RPE scores, and investigator error in accurately acquiring capillary blood samples during exercise (34,53). Despite [BLa] and RPE measuring different mechanisms, they are commonly used in exercise testing. Thus, contrasting them with SmO 2 aids in highlighting how reliable each of these measures are in this context.

Study limitations
First, our experimental design did not include a familiarization trial to limit participant burden, which may have negatively affected the test-retest reliability outcomes. As previously mentioned, RPE scales are known to require participant familiarization periods, to ensure accurate self-reporting of exercise exertion (53). They are also dependent on other external factors such as psychological stress, emotional state, and readiness (53). As such, it is possible that the lack of familiarization negatively impacted the RPE reliability outcomes compared with the other measures used in our study. Second, the SmO 2 signal of the wearable, self-contained Moxy sensor may be more sensitive to changes in blood flow, and changes in tissue properties with small positioning inconsistencies compared to NIRS signals measured with more advanced stationary technologies (21, 47, 54, 55). These sensor limitations may have affected the quality of the signal, and as a result, the test-retest results presented in our study. In order to include a more representative sample size of healthy, trained male and female cyclists, we did not exclude any participants based on skinfold thickness, even though previous reports suggest signal quality may be affected by skinfold thickness greater than 12.5 mm (3, 4, 56). Our rationale for providing a more generalizable subject group was a compromise that possibly affected reliability outcomes. And third, as muscle recruitment is specific to exercise modalities, NIRS responses will be specific to the location and modality (57, 58). Other locations may demonstrate higher physiological variation, although the measurement uncertainty of the Moxy device itself may be consistent across modalities (59).

Conclusions
Our main objectives were to estimate the reliability of muscle oxygen saturation alongside common performance and physiological markers across exercise intensity, and to provide the absolute agreement in original units that can be expected between sessions for each marker. Our findings show that the commercially available Moxy wearable NIRS sensor provides good-to-excellent test-retest reliability during an incremental multi-stage cycling protocol. Compared to other common physiological metrics, reliability was highest for V̇O 2 and HR, followed by SmO 2 , with [BLa] and RPE having lower reliability. Additionally, estimated values of absolute agreement reported in our results provide practitioners with the ability to differentiate measurement error from "true" physiological change. Athletes and coaches should expect a certain range of day-to-day variability when assessing and monitoring performance. This knowledge can be applied by practitioners who use this non-invasive, affordable technology for training prescription or monitoring internal load alongside other common measures. Wearable NIRS is an important instrument that, if used appropriately, can add valuable insight into understanding muscle metabolic responses to exercise.

Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement
The studies involving human participants were reviewed and approved by the Research Ethics Committee of The University of British Columbia. The patients/participants provided their written informed consent to participate in this study.