Test–Retest Reliability and Reliable Change Estimates for Four Mobile Cognitive Tests Administered Virtually in Community-Dwelling Adults

Objective: Remote mobile cognitive testing (MCT) is an expanding area of research, but psychometric data supporting these measures are limited. We provide preliminary data on test–retest reliability and reliable change estimates in four MCTs from SWAY Medical, Inc. Methods: Fifty-five adults from the U.S. Midwest completed the MCTs remotely on their personal mobile devices once per week for 3 consecutive weeks, while being supervised with a video-based virtual connection. The cognitive assessment measured simple reaction time (“Reaction Time”), go/no-go response inhibition (“Impulse Control”), timed visual processing (“Inspection Time”), and working memory (“Working Memory”). For each cognitive test except Working Memory, we analyzed both millisecond (ms) responses and an overall SWAY composite score. Results: The mean age of the sample was 26.69 years (SD = 9.89; range = 18–58). Of the 55 adults, 38 (69.1%) were women and 49 (89.1%) used an iPhone. Friedman’s ANOVAs examining differences across testing sessions were nonsignificant (ps > 0.31). Intraclass correlations for Weeks 1–3 were: Reaction Time (ms): 0.83, Reaction Time (SWAY): 0.83, Impulse Control (ms): 0.68, Impulse Control (SWAY): 0.80, Inspection Time (ms): 0.75, Inspection Time (SWAY): 0.75, and Working Memory (SWAY): 0.88. Intraclass correlations for Weeks 1–2 were: Reaction Time (ms): 0.75, Reaction Time (SWAY): 0.74, Impulse Control (ms): 0.60, Impulse Control (SWAY): 0.76, Inspection Time (ms): 0.79, Inspection Time (SWAY): 0.79, and Working Memory (SWAY): 0.83. Natural distributions of difference scores were calculated and reliable change estimates are presented for 70, 80, and 90% CIs. Conclusion: Test–retest reliability was adequate or better for the MCTs in this virtual remote testing study. Reliable change estimates allow for the determination of whether a particular level of improvement or decline in performance is within the range of probable measurement error. Additional reliability and validity data are needed in other age groups.


INTRODUCTION
Mobile cognitive testing (MCT), which involves brief, repeated cognitive tests delivered through mobile devices, is of considerable interest to the neuropsychology community. Its rapidly growing popularity is due in part to the downstream effects of the global COVID-19 pandemic (i.e., physical distancing), coupled with the increasing availability of wireless networks (Internet World Stats - Usage and Population Statistics, 2020) and smart phones (Pew Research Center, 2019). MCTs have a number of advantages over traditional neuropsychological testing, including remote, automated administration and scoring, sensitivity to fluctuating physiological states (e.g., arousal and mood), and the potential for improved ecological validity (Allard et al., 2014; Moore et al., 2017a; Sliwinski et al., 2018; Koo and Vizer, 2019; Weizenbaum et al., 2020). Moreover, due to the ease of repeated testing, MCT data are frequently aggregated, thereby enhancing stability in the estimation of cognitive functioning (Allard et al., 2014; Sliwinski et al., 2018). In other words, MCTs could allow for a repeatable, dynamic, real-world assessment of cognitive functioning, which has the potential for benefits in a wide variety of healthy and clinical populations, given the importance of understanding cognitive functioning outside of controlled clinical environments. However, MCTs are intended as an adjunct to, rather than a replacement for, traditional neuropsychological testing, which has several advantages, including the precise control of an examinee's environment, an in-depth assessment of multiple cognitive domains, and a variety of available tests with large normative datasets.
Despite the need for both physically distant cognitive assessment and brief, repeatable, automated tests, a recent systematic search of available MCTs reported that only seven out of 25 included any psychometric data, with only one out of 25 having extensive supporting data (i.e., norms, reliability, validity, sensitivity, and specificity; Charalambous et al., 2020). For clinical scientists to begin using MCTs in neuropsychological research, rigorous psychometric evaluations of the tests must first be conducted.
SWAY Medical, Inc., offers an app that includes a suite of four MCTs assessing reaction time, impulse control, timed visual processing, and working memory. Data are measured via touch screen as well as tri-axial accelerometry (i.e., motion detection), which can reduce latencies from 50-200 ms (in conventional touch-screens) down to 1-2 ms (Plant and Quinlan, 2013; Patterson et al., 2014; Amick et al., 2015; Woods et al., 2015; Burghart et al., 2019; SWAY Medical, LLC., 2020; VanRavenhorst-Bell et al., 2021). Therefore, SWAY MCTs might have an advantage over other mobile tests due to less variation in response time measurements (i.e., error variance) across devices and operating systems. However, little psychometric evidence is currently available for these tests. Burghart et al. (2019) reported good test-retest reliability data for the reaction time test, but the other three SWAY MCTs were not included in their study. They also reported a significant correlation (r = 0.59) between the SWAY reaction time measure and a validated desktop-based test of reaction time. VanRavenhorst-Bell et al. (2021) investigated psychometric properties of the SWAY reaction time, timed visual processing, and impulse control tests in a sample of 88 healthy adults (aged 18-48). The authors reported preliminary evidence pertaining to convergent and discriminant validity of the SWAY tests when correlated with the ImPACT Quick Test. Half (12 of 24) of the bivariate correlation coefficients were statistically significant, with r values ranging from 0.22 to −0.46.
The prior studies discussed above administered SWAY tests in person, and neither reported on estimates of reliable change.
The purpose of the current study is to examine test-retest reliability and reliable change estimates in the four SWAY MCTs, administered remotely in a sample of community-dwelling adults. We hypothesized that test-retest reliability estimates would be at least adequate in all four MCTs.

Participants
Participants were 61 adults, aged 18 and older, recruited through print materials and technology-based communications dispersed across a university and the U.S. Midwest I-35 corridor. The current study data were collected together with SWAY balance test data (to be presented in a separate paper). The study was conducted during the COVID-19 pandemic. The original recruitment goal was to enroll at least 50 participants, and as many as 100. After recruiting 61 people, data collection was discontinued due to study personnel limitations. Exclusion criteria were the following self-reported medical conditions, assessed with the use of the Physical Activity Readiness Questionnaire Plus: musculoskeletal injury impacting movement/balance, neurological dysfunction, uncorrected vision, or a vestibular condition. Participants were also excluded if they were unable to maintain a videoconferencing connection during the testing sessions, or if they did not have a smart device capable of downloading and running the SWAY application. Of the 61 original participants, one withdrew from the study due to unforeseen medical issues, one withdrew due to a time commitment, and four were removed due to equipment failure that prevented their data from being recorded. The final sample included 55 adults. All participants provided informed consent to participate and the study procedures were approved by the affiliate university's Institutional Review Board.

Materials/Procedures
The four SWAY tests were administered remotely, on participants' personal mobile devices. The SWAY tests can be used on any device with an iOS version of 9.3 or higher or an Android version of 7.0 or higher. A prior study showed that, within these constraints, the SWAY application can be administered on different mobile devices and operating systems without affecting measured data (VanRavenhorst-Bell et al., 2021). In order to improve adherence to the study protocol, all sessions were supervised by a research assistant who connected to the participant using video-based virtual connections. Participants completed the four SWAY MCTs twice per week for 3 consecutive weeks. Week 1 (but not Week 2 or Week 3) also included an unscored practice administration for all four MCTs, in order to allow for familiarization. This led to three scores for each test in Week 1 (two of which were retained for data analysis), two scores for each test in Week 2, and two scores for each test in Week 3. The two test administrations within each week were averaged to create a mean score for each of the four tests, which allows for more stable estimates of cognitive functioning (see, e.g., Allard et al., 2014; Lange and Süß, 2014; Moore et al., 2017b; Sliwinski et al., 2018). The one-week test-retest interval is consistent with reliability studies of the SWAY balance tests (Amick et al., 2015).
The procedure described above matches recommendations for administration of SWAY MCTs. With longer test-retest intervals (e.g., 1-2 months or more), it is typically recommended that examiners administer a practice test before each administration. In the current study, we elected not to include practice tests in Weeks 2 and 3 due to the short test-retest interval.
The SWAY protocol consisted of four MCTs. For the Simple Reaction Time test, the examinee holds their mobile device horizontally (landscape) and moves the device as rapidly as possible in any direction when the screen color changes from white to orange. The test starts after a variable delay of 2-4 s in order to prevent the examinee from anticipating the stimulus ahead of time. The examinee completes five total trials. The most rapid and the slowest trial reaction times are both excluded in order to remove outliers and better capture the examinee's typical response times. Following those exclusions, the values from the three remaining trials are averaged to calculate the score for the test.
For the Impulse Control (go/no-go) test, the examinee again holds their device horizontally and then moves it as rapidly as possible in any direction when a green circle with a white check mark is displayed on a blank screen. They do not move the device if a red circle with a white "X" is presented on a blank screen. The test begins after a variable delay of 2-4 s. Eight total trials are administered (five "go" trials and three "no-go" trials). The five "go" trials are retained for scoring. Of these five trials, the most rapid and the slowest reaction times are both excluded and the values from the three remaining trials are averaged to calculate the score for the test.
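The trial-level scoring described for the Reaction Time and Impulse Control tests (five retained trials, with the fastest and slowest dropped and the remaining three averaged) can be sketched as follows. This is a minimal illustration of that rule only, not SWAY's implementation; the function name and the example trial values are hypothetical.

```python
def trimmed_trial_score(reaction_times_ms):
    """Average the trials that remain after dropping the fastest and slowest.

    Assumes exactly five retained trials, as described in the text
    (five total trials for Reaction Time; the five "go" trials for
    Impulse Control).
    """
    if len(reaction_times_ms) != 5:
        raise ValueError("expected exactly five trial reaction times")
    ordered = sorted(reaction_times_ms)
    retained = ordered[1:-1]  # drop the fastest (first) and slowest (last) trial
    return sum(retained) / len(retained)


# Hypothetical trial values in milliseconds, for illustration only.
example_trials = [240, 310, 255, 270, 520]
score = trimmed_trial_score(example_trials)
print(f"Trimmed mean reaction time: {score:.1f} ms")
```

Dropping both extremes means a single anticipatory or lapsed trial (such as the 520 ms value above) does not affect the score.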
During the Inspection Time test, examinees hold their device horizontally. They see two T-shaped lines, one on each side of the screen. One of the two lines is long and one is short. The long end of two "Ts" is quickly hidden and the examinee taps the device screen on the side where the longer line was presented. They do not tap the device screen if they are unsure about which of the two lines is longer (see Figure 1). The test begins after a variable delay of 1-2 s. The display interval begins at ~102 ms and reduces by 1 screen refresh (~17 ms) for each correct response until the user reaches 1 screen refresh. An additional trial is completed at 17 ms to verify the score and the test is completed. When an incorrect answer occurs, 1 screen refresh (~17 ms) is added to the next trial until a correct response is recorded. After an incorrect response, the examinee must earn two correct responses at a given interval before reducing by 1 screen refresh again. If an examinee gets every trial correct, including two trials at 1 screen refresh, they have completed the test. If the examinee makes two incorrect responses at any refresh interval, they must repeat and get two in a row correct at that interval with one more screen refresh to complete the test. The maximum number of trials is 20. An examinee's score is the screen refresh rate at the conclusion of the test.
Finally, for the Working Memory test, examinees hold the device vertically. They first see three letters (all consonants) on the screen for 3 s and then are asked to remember the letters. Next, the letters disappear and a 2 (columns) × 4 (rows) grid of squares appears. One of the squares briefly flashes orange and then the examinee touches the square that flashed orange. Two squares then turn orange, one at a time, and then the examinee reproduces the sequence on the grid. The sequence continues to lengthen until the examinee makes one mistake; at that point, the grid disappears, they type in the three letters shown at the beginning of the test, and the test concludes (see Figure 1). The score for this test is created with a formula that accounts for both accurate recall of the three letters and progress through the grid sequence. The Working Memory SWAY Score is calculated in two steps. First, the maximum sequence length achieved is assigned a SWAY score, as follows: 0 = 0, 1 = 25, 2 = 50, 3 = 64, 4 = 67, 5 = 70, 6 = 73, 7 = 76, 8 = 79, 9 = 82, 10 = 85, 11 = 88, 12 = 91, 13 = 94, 14 = 97, and ≥ 15 = 100. Second, three points are subtracted for each consonant that is incorrectly recalled. For example, a Sequence Length of 6, with three out of three consonants correctly recalled, would be a SWAY score of 73. A Sequence Length of 4, with one out of three consonants correctly recalled, would be a SWAY score of 61 (67 - 3 - 3).
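The two-step Working Memory scoring rule can be written out as a short sketch. The lookup values and the three-point penalty are taken from the description above; the function and variable names are hypothetical, and no lower bound is applied to the score because the text does not describe one.

```python
# Sequence length -> base SWAY score, as listed in the text (>= 15 maps to 100).
SEQUENCE_SCORES = {
    0: 0, 1: 25, 2: 50, 3: 64, 4: 67, 5: 70, 6: 73, 7: 76,
    8: 79, 9: 82, 10: 85, 11: 88, 12: 91, 13: 94, 14: 97,
}
POINTS_PER_MISSED_CONSONANT = 3
TOTAL_CONSONANTS = 3


def working_memory_score(max_sequence_length, consonants_correct):
    """Return the Working Memory SWAY score for one administration."""
    if max_sequence_length < 0:
        raise ValueError("sequence length cannot be negative")
    if not 0 <= consonants_correct <= TOTAL_CONSONANTS:
        raise ValueError("consonants_correct must be between 0 and 3")
    # Step 1: convert the maximum sequence length to a base score.
    base = 100 if max_sequence_length >= 15 else SEQUENCE_SCORES[max_sequence_length]
    # Step 2: subtract three points for each incorrectly recalled consonant.
    missed = TOTAL_CONSONANTS - consonants_correct
    return base - POINTS_PER_MISSED_CONSONANT * missed


# The two worked examples from the text.
print(working_memory_score(6, 3))  # 73
print(working_memory_score(4, 1))  # 61
```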
The Reaction Time, Impulse Control, and Inspection Time tests all consist of two indices: (a) millisecond (ms) reaction times, and (b) an overall SWAY score. The Working Memory test does not have a reaction time component and is summarized with a single SWAY score.

Statistical Analyses
Descriptive statistics for continuous variables are presented as the mean, standard deviation (SD), median (Md), range, and interquartile range (IQR); categorical variables include the sample size for each variable (n) and the proportion of the overall sample (%). We examined distributional characteristics of the four SWAY tests through a visual inspection of the histograms and skewness/kurtosis statistics. Several SWAY variables were not normally distributed, so nonparametric statistics are presented where appropriate.
We report intraclass correlation coefficients (ICCs) as an estimate of test-retest reliability. Interclass correlations such as Pearson's and Spearman's r measure relationships between variables in different classes of measurement. That is, the Pearson product-moment correlation is used to assess the strength and direction of the linear relationship between two variables, and the Spearman rank-order correlation is used to measure the strength and direction of the monotonic relationship between two variables. Conceptually, the ICC is often recommended over Pearson or Spearman correlations to evaluate test-retest reliability because test-retest reliability examines two or more scores within the same class of measurement (Kroll, 1962; McGraw and Wong, 1996; Bédard et al., 2000; Weir, 2005; Koo and Li, 2016). ICC values can be interpreted as the proportion of variance in observed scores that can be ascribed to true score variance. In other words, if the ICC is 0.80, then 80% of the observed score variance results from true score variance and 20% results from error. The first step in assessing test-retest reliability using the ICC is to test for systematic error (e.g., practice effects) using a repeated measures analysis; we did so using Friedman's ANOVAs. Next, we calculated ICCs using a mean-rating, absolute-agreement, two-way mixed effects model, which is appropriate for test-retest reliability where multiple measurements are averaged to produce a composite score (Shrout and Fleiss, 1979; McGraw and Wong, 1996; Field, 2005; Weir, 2005; Koo and Li, 2016). We used data from Weeks 1, 2, and 3 to calculate ICCs for test-retest reliability.
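For readers who want to see how a mean-rating, absolute-agreement, two-way ICC is obtained from a participants-by-sessions table, the sketch below computes it from the two-way ANOVA mean squares, using the form usually written as ICC(A,k) = (MSR - MSE) / (MSR + (MSC - MSE) / n) (McGraw and Wong, 1996). It is an illustration under that assumption and is not the SPSS procedure used in the study; the function name and the example scores are hypothetical.

```python
def icc_absolute_agreement_average(scores):
    """ICC(A,k): two-way model, absolute agreement, average of k measurements.

    `scores` is a list of rows, one row per participant and one column
    per testing session, with no missing values.
    """
    n = len(scores)       # number of participants (rows)
    k = len(scores[0])    # number of sessions (columns)
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Partition the total sum of squares into rows, columns, and residual error.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                  # between-participants mean square
    ms_cols = ss_cols / (k - 1)                  # between-sessions mean square
    ms_error = ss_error / ((n - 1) * (k - 1))    # residual mean square

    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


# Hypothetical scores for five participants across three weekly sessions.
example = [
    [70, 73, 73],
    [85, 82, 85],
    [64, 67, 64],
    [91, 88, 94],
    [76, 79, 76],
]
print(round(icc_absolute_agreement_average(example), 2))
```

Because the between-sessions mean square appears in the denominator, a systematic shift across sessions (for example, a practice effect) lowers an absolute-agreement ICC, which is one reason to check for session differences first.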
Following the assessment of test-retest reliability, we calculated the natural distribution of the difference scores for Week 2 minus Week 1. For example, a 10% difference score for Week 2 minus Week 1 refers to the score that occurs in ≤10% of the full sample. These data are presented in order to examine how closely the calculated reliable change estimates align with the actual distribution of difference scores.
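A natural distribution of difference scores of this kind can be tabulated as cumulative percentages. The sketch below is one simple way to do so, assuming paired Week 1 and Week 2 scores in the same participant order; the function name and example values are hypothetical and do not come from the study data.

```python
def difference_score_distribution(week1, week2):
    """Return (difference, cumulative percent) pairs for Week 2 minus Week 1.

    The cumulative percent is the share of participants whose difference
    score is at or below the listed value.
    """
    if len(week1) != len(week2):
        raise ValueError("week1 and week2 must have the same length")
    diffs = [later - earlier for earlier, later in zip(week1, week2)]
    n = len(diffs)
    table = []
    for value in sorted(set(diffs)):
        at_or_below = sum(1 for d in diffs if d <= value)
        table.append((value, 100.0 * at_or_below / n))
    return table


# Hypothetical paired scores for five participants.
week1_scores = [70, 73, 67, 76, 73]
week2_scores = [73, 70, 67, 82, 76]
for value, cumulative_pct in difference_score_distribution(week1_scores, week2_scores):
    print(f"{value:+d} points: {cumulative_pct:.0f}% at or below")
```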
A reliable change methodology was used to estimate measurement error surrounding the test-retest difference scores. The "Reliable Change Index" was originally proposed by Jacobson and Truax (1992) and a number of authors proposed modifications and refinements over the years (Chelune et al., 1993; Speer and Greenbaum, 1995; Hsu, 1999; Iverson, 2001). In order to define reliable change for the SWAY MCTs, we calculated reliable change confidence intervals (CIs) at the 70, 80, and 90% levels using the standard error (SE) of the difference score. The reliable change method used in the current study has a similar purpose to the minimal detectable change (MDC; Stratford et al., 1996) and the minimal clinically important difference: identifying clinically meaningful change over time.
Both the current reliable change method and the MDC rely on the SE of measurement in their calculation. However, unlike the current reliable change approach, the MDC does not use the SE of the difference score in its calculation: 90% MDC = 1.64 × SEM × √2.
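The relationship between the two approaches can be illustrated with a short sketch. The MDC line follows the formula given above. The reliable change portion is an assumption: it uses the commonly described form in which the SE of measurement is SD × √(1 − r) for each session, the SE of the difference is √(SEM1² + SEM2²), and the CI half-width is that value multiplied by a z value (1.04, 1.28, and 1.64 are conventional rounded values for 70, 80, and 90%). The exact formula used in the study is not reproduced in this section, so this should not be read as the authors' calculation; all names and input values are hypothetical.

```python
import math

# Conventional rounded z values for two-sided 70, 80, and 90% intervals.
Z_VALUES = {70: 1.04, 80: 1.28, 90: 1.64}


def sem(sd, reliability):
    """Standard error of measurement for one session (assumed form)."""
    return sd * math.sqrt(1.0 - reliability)


def se_difference(sd1, sd2, reliability):
    """Standard error of the difference, combining both sessions' SEMs."""
    return math.sqrt(sem(sd1, reliability) ** 2 + sem(sd2, reliability) ** 2)


def reliable_change_threshold(sd1, sd2, reliability, confidence=90):
    """Smallest change that falls outside the chosen confidence interval."""
    return Z_VALUES[confidence] * se_difference(sd1, sd2, reliability)


def mdc90(sd, reliability):
    """90% minimal detectable change: 1.64 x SEM x sqrt(2)."""
    return 1.64 * sem(sd, reliability) * math.sqrt(2.0)


# Hypothetical inputs: session SDs of 10 and 12 points, reliability of 0.84.
for level in (70, 80, 90):
    threshold = reliable_change_threshold(10.0, 12.0, 0.84, level)
    print(f"{level}% reliable change threshold: {threshold:.1f} points")
print(f"90% MDC using the first session's SD: {mdc90(10.0, 0.84):.1f} points")
```

Under these assumptions, the two quantities coincide at the 90% level when both sessions have the same SD, and they diverge as the session SDs differ.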
Statistical significance was set a priori at p < 0.05. All analytic procedures, with the exception of the reliable change calculations, were carried out in IBM SPSS, Version 27.0.

RESULTS
The mean age of the sample was 26.69 years (SD = 9.89; Md = 23.00; range = 18-58; IQR = 20-30). Of the 55 adults, 38 (69.10%) were women and 49 (89.09%) used an iPhone for the testing (the remaining six reported using phones with an Android operating system). Descriptive data for the five tests and indices are presented in Table 1. Histograms for each test at each time interval are presented in Figure 2. Friedman's ANOVAs were all nonsignificant (p values, range = 0.32-0.62; Table 2), suggesting that test scores did not differ across Weeks 1, 2, and 3. As seen in Table 3, test-retest ICCs for Weeks 1-3 ranged from 0.68 (Impulse Control, ms) to 0.88 (Working Memory), and ICCs for Weeks 1-2 ranged from 0.60 (Impulse Control) to 0.83 (Working Memory).
The natural distributions of difference score data (Week 2-Week 1 and Week 3-Week 2) are presented in Tables 4 and 5. ICCs (from Weeks 1 and 2) are presented in Table 3, and reliable change values, based on the reliable change CIs formula, are presented in Table 6. The reliable change values allow the reader to determine, with varying degrees of confidence, whether a particular level of improvement or decline in performance is within the range of measurement error. For example, for an examinee completing the Working Memory (SWAY) test across two testing sessions, an improvement of greater than about 11 points is unlikely to be due to measurement error at the 70% (liberal) confidence level, and an improvement of about 17 points is unlikely to be due to measurement error at the 90% (conservative) confidence level. For an examinee completing the Impulse Control (ms) test, a decline of about 77 ms (slower reaction times) is unlikely to be due to measurement error at the 70% confidence level, and a decline of about 94 ms is unlikely to be due to measurement error at the 80% confidence level. As seen in Table 7, we extracted several cutoff scores (from Tables 3 and 6) for each test and then computed the percentages of the sample that scored below (worsened) or above (improved) those cutoff scores. These percentages were calculated from the frequency distributions of Week 2-Week 1 and Week 3-Week 2 difference scores.

Van Patten et al. | Reliability of Mobile Cognitive Tests | Frontiers in Psychology | www.frontiersin.org | October 2021 | Volume 12 | Article 734947

DISCUSSION
The current study provides a preliminary examination of test-retest reliability and reliable change estimates in four novel MCTs, assessing reaction time, impulse control, timed visual processing, and working memory in 55 community-dwelling adults. There were no statistically significant practice effects on the four MCTs. The test-retest reliability coefficients were adequate or better. Practical information relating to the interpretation of change on these tests is provided in Tables 4-7.
The data provided in these tables allow the reader, on a preliminary basis, to determine whether or not particular scores are statistically reliable at different levels of certainty. That is, if an examinee does not improve or decline to a greater degree than is reflected in the confidence bands, the change in that person's performance may be attributable to measurement error, or normal variability, and the probability that the change is clinically meaningful is reduced. When examining Tables 4-7, it can be seen that a change of five or more points is relatively uncommon for both the Reaction Time score and the Impulse Control score, although the Reaction Time score has more variability. A change of four or more points is relatively uncommon for the Inspection Time score. Worsening by seven or more points, or improving by 10 or more points, is relatively uncommon for the Working Memory score. Of course, in individual cases, clinical judgment is necessary to determine whether or not a particular improvement or decline is meaningful for that examinee.
Overall, results from the current study provide preliminary support for clinical scientists to begin considering the SWAY MCTs as outcome measures in a variety of research settings. More studies are still needed, however, to establish more definitive recommendations for how to interpret change on these four SWAY MCTs.

Psychometric Support for MCTs
Although scientific interest in MCTs is growing (Allard et al., 2014; Moore et al., 2017a; Sliwinski et al., 2018; Koo and Vizer, 2019; Weizenbaum et al., 2020), there is a dearth of psychometric data supporting these instruments (Charalambous et al., 2020), particularly with respect to test-retest reliability and reliable change. This leaves researchers with few options if they want to incorporate MCTs into their study designs. Results of Brouillette et al. (2013) supported the test-retest reliability in one MCT measuring processing speed, and Timmers et al. (2014) found high test-retest correlations in an MCT assessing short-term memory. Other investigations have reported good convergent and discriminant validity of MCTs compared to laboratory tests, with more shared variance between tests of similar constructs than between tests of distinct constructs (Moore et al., 2017a, 2020; Dupuy et al., 2018; Sliwinski et al., 2018). However, no studies to our knowledge have reported on estimates of reliable change in MCTs, and this is a limitation of the current literature, given that a major advantage of MCTs is their brevity and potential for repeatability (Allard et al., 2014; Sliwinski et al., 2018).

Scientific Utility of MCTs
Research opportunities incorporating MCTs include delivering compensatory cognitive training in a telehealth format (e.g., Lawson et al., 2020), assessing the cognitive impact of pharmacotherapy for epilepsy (e.g., Eddy et al., 2011; Adams et al., 2017), measuring cognitive improvement following mindfulness-based treatment for ADHD (Poissant et al., 2019), and measuring long-term cognitive trajectories after mild traumatic brain injury and concussion (McAllister and McCrea, 2017), among others. In each of these cases, replacing traditional paper-pencil neuropsychological tests with remotely delivered MCTs would greatly reduce the time burden and overall cost associated with running the study. Alternatively, augmenting a conventional neuropsychological battery with MCTs would allow for both repeated, real-world assessments of cognition in context (MCTs), and standard evaluations of optimal cognitive performance in the laboratory (traditional testing), possibly reducing the likelihood of Type II (false negative) errors and providing a more complete picture of participants' cognitive functioning.

Limitations
Although promising, the current study has several limitations. First, important demographic data such as participants' level of education, race, or ethnicity were not recorded, and, therefore, external validity is reduced. Relatedly, SWAY requires a smart phone with either an iOS version of 9.3 or higher, or an Android version 7.0 or higher, so the MCTs in this study are not accessible to everyone. Second, our sample is comprised primarily of younger adults ( (Table 4). We are not certain as to why these scores vary across Week 3-Week 2 compared to Week 2-Week 1, but it could be related to our relatively modest sample size in that a small number of people with more variable scores could influence these CIs. Sixth, the current study took place in a remote, supervised setting (by design). This context is less controlled and structured than conventional in-person, laboratory-based neuropsychological testing, while also being less ecologically valid than unsupervised remote testing (which is how MCTs are often examined). In other words, it lies somewhere between laboratory-based neuropsychological testing and unsupervised remote MCT with respect to the level of control and structure vs. ecological validity. Consequently, any inferences made from the current study results should account for the uniqueness of the setting. Seventh, our test-retest interval was only 1 week. This brief interval is not representative of typical clinical practice in neuropsychology and may constrain the degree to which the test-retest reliability coefficients and reliable change indices generalize to other settings. Eighth, the current study presents reliability but not validity data for the SWAY MCTs. VanRavenhorst-Bell et al. (2021) reported preliminary support for the convergent validity of three of the current tests (Reaction Time, Impulse Control, and Inspection Time) with the ImPACT Quick Test, but additional construct and criterion validity data are needed (e.g., convergent/discriminant validity with laboratory neuropsychological tests). Finally, although the current procedure is consistent with recommendations for administration of SWAY MCTs, other options are available. For example, a SWAY "Screening Test" involves the administration of each test one time, rather than two administrations being averaged together as was done in the current study. Our results should be generalized with caution outside of current study procedures.

Conclusion
In summary, we report promising preliminary results with respect to test-retest reliability in four MCTs measuring reaction time, impulse control, timed visual processing, and working memory, in a sample of community-dwelling adults. Reliable change metrics allow for the estimation of statistically reliable improvements and declines in performance. However, the psychometric support for these MCTs remains limited, and more reliability and validity data are needed across the lifespan. Ultimately, following the accumulation of additional psychometric data, large, representative normative datasets, and research in clinical populations, we believe that these tests may be useful in clinical practice as an adjunct to traditional paper-pencil neuropsychological testing.

DATA AVAILABILITY STATEMENT
The statistical analyses and underlying data supporting the conclusions of this article will be made available by the authors to qualified researchers for research purposes, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Wichita State University Institutional Review Board. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
RVP performed the literature review, helped conceptualize the statistical analyses, helped manage the database, conducted the statistical analyses, and wrote portions of the manuscript. GI conceptualized the study, helped with the literature review, helped conceptualize and interpret the statistical analyses, and wrote portions of the manuscript. MM collected the data and helped manage the database. HV-B supervised data collection and helped conceptualize the study. All authors critically reviewed drafts of the manuscript, contributed to the article, and approved the submitted version.