Assessment of Smartphone-Based Spiral Tracing in Multiple Sclerosis Reveals Intra-Individual Reproducibility as a Major Determinant of the Clinical Utility of the Digital Test

Technological advances, lack of medical professionals, high cost of face-to-face encounters, and disasters such as the COVID-19 pandemic fuel the telemedicine revolution. Numerous smartphone apps have been developed to measure neurological functions. However, their psychometric properties are seldom determined. It is unclear which designs underlie the eventual clinical utility of the smartphone tests. We have developed the smartphone Neurological Function Tests Suite (NeuFun-TS) and are systematically evaluating their psychometric properties against the gold standard of complete neurological examination digitalized into the NeurEx™ app. This article examines the fifth and most complex NeuFun-TS test, the "Spiral tracing." We generated 40 features in the training cohort (22 healthy donors [HD] and 89 patients with multiple sclerosis [MS]) and compared their intraclass correlation coefficients, fold changes between HD and MS, and correlations with relevant clinical and imaging outcomes. We assembled the best features into machine-learning models and examined their performance in the independent validation cohort (45 patients with MS). We show that by involving multiple neurological functions, complex tests such as spiral tracing are susceptible to intra-individual variations, decreasing their reproducibility and clinical utility. Simple tests, reproducibly measuring single function(s) that can be aggregated to increase sensitivity, are preferable in app design.


INTRODUCTION
Expert neurological examination is an art that is slowly but surely disappearing (1). The skilled neurologist can reliably identify deficits in neurological function(s) and localize them to the specific part of the central (CNS) or peripheral nervous system (PNS). An expert examiner can also differentiate a deficit that lacks an anatomical substrate by examining the identical neurological function in different ways, noting inconsistencies, and motivating the patient to provide adequate effort. Such neurological examination takes between 30 and 60 min to perform and years to master. Because of examiner dependency, the quantitative aspect of neurological examination, especially when performed by different raters, is less precise. Traditional neurological disability scales nonalgorithmically aggregate semi-quantitative ratings of different neurological functions, usually selected by an individual expert [e.g., in the Expanded Disability Status Scale, EDSS; (2)] or by teams of experts [e.g., in the Scripps Neurological Rating Scale, SNRS; (3)], into a single number. This is suboptimal for two reasons: (1) the features of the neurological examination aggregated into the disability scale are not data-driven and therefore may not be optimal and (2) the lack of a defined algorithm causes errors during the translation of the examination into a scale. These drawbacks are eliminated by data-driven scales [such as the Combinatorial Weight-Adjusted Disability Scale, CombiWISE; (4)] and digital tools that allow convenient documentation of the neurological examination in its entirety with automated, algorithmically codified computation of relevant disability scales [such as the NeurEx™ app; (5)].
However, these solutions are useless when the lack of expert medical professionals, limited time for patient encounters, or inability to examine patients in person due to pandemics deprives patients of the benefit of this historically validated tool. Therefore, there is a strong movement to supplement the neurological examination or, in some instances, to replace it, by patient-autonomous tests of neurological functions (both cognitive and physical) acquired via smartphones, tablets, or web interfaces (6–15).
While some of these apps are already marketed to patients, they often lack careful assessment of their psychometric properties against the gold standard of neurological examination and imaging or electrophysiological measures of CNS (or PNS) injury. Even a simple assessment of test-retest reproducibility may be missing. For instance, while the work of Creagh et al. (9) demonstrated the potential of a smartphone-based test to predict the 9-Hole Peg Test (9HPT) in a training cohort of subjects with multiple sclerosis (MS), no evaluation of test-retest reproducibility (or of the accuracy of 9HPT prediction in an independent validation cohort) was provided.
Many of these apps use tests adopted from standard neurological examination and modified to self-administered digital tests. This is true for the Spiral tracing test examined in this paper. Spiral tracing has been used in movement disorders to identify tremors and quantify their severity. Its digitalization offers automated identification and quantification of the tremor frequency and amplitude by Fourier transformation (16). Furthermore, digitalization of the shape(s) tracing allows other quantitative measurements of speed and precision of the tracing (by finger or stylus) which may reflect neurological (dys)functions.
We reviewed previous studies of digitalized spiral/object tracing (9–11, 17) to derive a comprehensive set of digital features (40 total) and determined their psychometric properties (i.e., reproducibility, ability to differentiate patients with MS from healthy donors [HD], and correlation with relevant features of neurological examination, disability scales, and CNS tissue destruction visible on brain MRI) in the training and independent validation cohorts of patients with MS. We hypothesized that by aggregating multiple neurological functions (i.e., vision, fine finger motoric, proprioceptive, and cerebellar functions) in the test performance, spiral tracing would outperform simpler smartphone tests that we evaluated previously in the Neurological Function Tests Suite (NeuFun-TS), such as finger or foot tapping, balloon popping, and the level test, which demonstrated comparable or even stronger sensitivity and specificity than traditional non-clinician-acquired disability measures, such as the 9HPT (7,18,19). Specifically, finger tapping, a simple motoric test consisting of tapping a finger on the surface of a smartphone for 10 s as rapidly as possible, achieved Pearson correlation coefficients of up to 0.75 with NeurEx-derived cerebellar functions, 0.73 with motoric functions, and 0.69 with the strength subscore of the motoric functions. Analogous correlations were observed for the balloon popping test, in which a subject was required to tap a balloon that randomly appeared at different locations on the smartphone screen. The level test, in which a subject tilts the smartphone to guide a "ball" that appears at random locations on the periphery of the screen to the designated center and holds the ball in the center for the duration of the test, achieved Spearman correlations of up to 0.4 with proprioception, 0.42 with motor functions, 0.49 with the muscle atrophy subscore of motor functions, and 0.63 with cognitive functions.
Because we were unsure of the optimal size/thickness of the spiral in the spiral-tracing digital adaptation, we tested three different levels of increasing difficulty. However, contrary to our expectation, we observed comparatively weak correlations (Spearman Rho up to 0.33) of spiral-tracing-derived outcomes with simultaneously measured features from the neurological examination documented in the NeurEx™ app. Furthermore, intra-individual test reproducibility and the correlation of the best spiral-tracing outcome (the sum of Hausdorff distances) with clinician-derived disability outcomes decreased with increasing test difficulty, leading us to conclude that the poor clinical utility of the spiral-tracing outcomes is due to their poor intra-individual reproducibility.
Because other authors (9) aggregated multiple features from the spiral-tracing test to achieve a stronger correlation with traditional disability outcomes such as 9HPT using supervised machine-learning (ML) algorithms, we performed the same analyses here. We reproduced the ability to derive models with strong cross-validation performance in the training cohort (i.e., R² up to 0.73 for correlation with 9HPT). In contrast to the performance of the best spiral-tracing outcome (i.e., the sum of the Hausdorff distances), the cross-validation performance of the ML models in the training cohort increased with increasing test difficulty. However, this apparent increase in the performance of the ML models was entirely due to overfitting; when applied to the true independent validation cohort, all ML models performed poorly, and none outperformed the sum of Hausdorff distances.

Participants
The data were collected from participants enrolled in the Natural History protocol: Comprehensive Multimodal Analysis of Neuroimmunological Diseases in the Central Nervous System (ClinicalTrials.gov identifier NCT00794352). The study was approved by the National Institute of Allergy and Infectious Diseases (NIAID) scientific review and by the National Institutes of Health (NIH) Institutional Review Board. All methods were performed in accordance with the relevant guidelines and regulations. All study participants gave informed consent. HD were recruited in two ways: (1) full participants in the Natural History protocol who underwent comprehensive neurological/imaging evaluation and (2) participants in a substudy of the Natural History protocol to obtain normative data for smartphone apps (without neurological/imaging evaluation). The MS dataset comprises two groups: a cohort that was tested in a clinic approximately every 6 months (non-granular testing sub-cohort) and those who had the smartphone at home and did the test more than 5 times during a period of 2 years (granular testing sub-cohort). Prior to all analyses, the MS datasets were split into 2/3 training and 1/3 test sets, weighted by one of the clinical features (average 9HPT; see Table 2). A summary of the demographic information is provided in Table 1.

Test Design and Data Collection
The Spiral test was written in Java and Kotlin using the Android Studio integrated development environment. The test is distributed as an Android Package (APK) over email or installed directly onto the device over USB, and updates are sent out over the air. The testing devices are Google Pixel XL and Google Pixel 2 XL phones running Android 11, with the intent of keeping them up to date. Results are uploaded to Firebase Firestore, a commercial cloud database, with alphanumeric identifiers to avoid Personally Identifiable Information. Spirals are generated using physical dimensions and rendered using the individual device's screen characteristics and configuration, to ensure that spirals with the same parameters look the same across all devices. The Spiral tracing test consisted of tracing, with a finger of each hand, an orange spiral shown on the screen of the smartphone at three difficulty levels: level 1 (simplest) used the thickest spiral of shortest length, while level 3 (most difficult) used the thinnest spiral of longest length (Figure 1). Each participant was instructed to trace the spiral as accurately and as fast as possible. At each test date and for each difficulty level, four trials were conducted: two drawings done clockwise, from and to the center of the spiral, with the dominant hand, and two drawings done counterclockwise (likewise from and to the center of the spiral) with the non-dominant hand. As previously stated, the non-granular testing sub-cohort was tested approximately every 6 months, while the granular testing sub-cohort was tested more frequently during the 2-year study period; hence, the number of tests conducted by each subject over the study period varies.
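Because the spirals are specified in physical dimensions and rendered per-device so that identical parameters produce identically sized spirals on every screen, the conversion can be illustrated as follows. This is a minimal sketch, not the actual NeuFun-TS rendering code: the Archimedean spiral form, the function name, and the parameter values are our assumptions.

```python
import math

def archimedean_spiral_px(n_turns, outer_radius_mm, dpi, points_per_turn=100):
    """Generate (x, y) pixel coordinates of an Archimedean spiral.

    The spiral is specified in physical units (mm) and converted to
    pixels with the device's reported DPI, so the rendered size is the
    same on every screen.  Illustrative only; the real NeuFun-TS
    parameters are not published here.
    """
    px_per_mm = dpi / 25.4                    # 1 inch = 25.4 mm
    max_theta = 2 * math.pi * n_turns
    b = outer_radius_mm / max_theta           # Archimedean spiral: r = b * theta
    n = n_turns * points_per_turn
    pts = []
    for i in range(n + 1):
        theta = max_theta * i / n
        r_px = b * theta * px_per_mm          # radius in device pixels
        pts.append((r_px * math.cos(theta), r_px * math.sin(theta)))
    return pts
```

A device with a different DPI simply yields a different pixel radius for the same physical radius, which is what keeps the on-screen size constant across handsets.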
During the experiment, raw sensor data was collected from the smartphone touchscreen as x- and y-screen coordinates with a corresponding timestamp in milliseconds and an estimated pressure of the tap based on the surface area of the finger on the touchscreen.

Clinical Assessments of Motor Symptoms
The complete neurological examination, lasting 30-60 min and performed by an MS-trained clinician, was transcribed into the NeurEx™ app (5). NeurEx™ computes traditional disability scales such as EDSS (2), SNRS (3), and others. We also extracted relevant subsystem scores of those neurological functions that, based on domain expertise, contribute to spiral tracing (i.e., pyramidal and motor functions of the hands, cerebellar functions, and proprioceptive functions). Finally, we extracted semi-quantitative MRI data of CNS tissue destruction, focusing on the brainstem, cerebellum, and medulla/upper cervical spinal cord. These are features previously validated as important in determining physical disability (20,21). The details of the MRI sequences and the computation of the selected MRI features have been previously published (20,21). Thus, together we tested 22 disability features in the MS training set (Table 2). Though spiral data were obtained from 22 HD, we note that clinical features were generated for only 9 HD (see Section Participants for details). We highlight that the clinical features extracted from the NeurEx™ app are later used to validate features obtained from the spiral tracing test.

Feature Extractions
The raw sensor data was processed with signal and time-series analysis methodologies to compute temporal, spatial, and spatiotemporal features. Where appropriate, features were calculated following the work of Creagh et al. (9), supplemented by new features computed in this work, for a total of 40 spiral-derived features (i.e., digital features). To measure temporal irregularities in upper extremity function in neurological patients, previous research used speed and velocity as signals in the objective quantification of motor symptoms (9, 22–24). Thus, we initially computed the velocity (v), radial velocity (rv), and angular velocity (av) of the drawn spirals as follows:

$$v_i = \frac{\sqrt{(x_{i+1} - x_i)^2 + (y_{i+1} - y_i)^2}}{t_{i+1} - t_i},$$

where $x_i, y_i$, $i = 1 \ldots N$, are the horizontal and vertical coordinates of pixels on the screen, respectively, with $N$ representing the total number of touch data points, and $t_i$, $i = 1 \ldots N$, is the timestamp converted to seconds. The radial velocity is computed as follows:

$$rv_i = \frac{r_{i+1} - r_i}{t_{i+1} - t_i}, \qquad \text{where } r_i = \sqrt{x_i^2 + y_i^2}.$$

If we denote by $\theta$ the four-quadrant inverse tangent, i.e., $\theta_i = \tan^{-1}\!\left(\frac{y_i}{x_i}\right)$, then the angular velocity takes the following form:

$$av_i = \frac{\theta_{i+1} - \theta_i}{t_{i+1} - t_i}.$$

The sum, coefficient of variation, skewness, and kurtosis were computed for the velocity, radial velocity, and angular velocity, respectively.
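The velocity definitions above can be sketched in code. This is an illustrative reimplementation (the original feature-extraction code is not published); `np.unwrap` is added here to guard against ±π jumps in the four-quadrant inverse tangent, which the original pipeline may or may not have done.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def velocity_features(x, y, t):
    """Velocity, radial velocity, and angular velocity from touch
    coordinates (pixels) and timestamps (seconds), each summarised as
    in the paper: sum, coefficient of variation, skewness, kurtosis."""
    x, y, t = map(np.asarray, (x, y, t))
    dt = np.diff(t)
    v = np.hypot(np.diff(x), np.diff(y)) / dt        # tangential velocity
    r = np.hypot(x, y)                               # r = sqrt(x^2 + y^2)
    rv = np.diff(r) / dt                             # radial velocity
    theta = np.arctan2(y, x)                         # four-quadrant inverse tangent
    av = np.diff(np.unwrap(theta)) / dt              # angular velocity

    def summarise(s):
        return {"sum": s.sum(),
                "cv": s.std() / s.mean() if s.mean() != 0 else np.nan,
                "skew": skew(s),
                "kurtosis": kurtosis(s)}

    return {"v": summarise(v), "rv": summarise(rv), "av": summarise(av)}
```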
To calculate the degree of resemblance between the reference spiral and the cohort's drawing, we introduced features related to the Hausdorff distance, which quantifies the extent to which each point in the reference spiral lies near the points in the cohort's drawing, following the procedures illustrated in (9, 25–27). Similar to Figure 3 in Jeong and Srinivasan (27), a detailed example procedure to calculate the Hausdorff distance between the reference and the cohort's drawing is presented in Supplementary Figure S1. We point out that the Hausdorff distance was calculated using the "metric.hausdorff" function in the fda.usc R software package (28). Prior to calculating the Hausdorff distance, the x and y screen coordinate points of the reference spiral were interpolated to the length of the cohort's drawing's coordinates using cubic spline interpolation (29–31). Several Hausdorff distance-related features were then calculated (e.g., maximum of Hausdorff distance, interquartile range of Hausdorff distances, etc.). All Hausdorff distance-related features are provided in Table 3.
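A sketch of the interpolation-plus-Hausdorff step described above. The paper uses `metric.hausdorff` from the R package fda.usc; this Python analogue, and the reading of the "sum of Hausdorff distances" as the sum of per-point nearest-neighbor distances from the resampled reference to the drawing, are our assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.spatial.distance import cdist, directed_hausdorff

def hausdorff_features(ref_xy, drawn_xy):
    """Resample the reference spiral to the length of the drawing with
    cubic splines (as in the paper), then compute Hausdorff-type
    features from the per-point distance profile."""
    ref = np.asarray(ref_xy, float)
    drawn = np.asarray(drawn_xy, float)
    # cubic-spline resampling of the reference to the drawing's length
    s_ref = np.linspace(0.0, 1.0, len(ref))
    s_new = np.linspace(0.0, 1.0, len(drawn))
    ref_i = np.column_stack([CubicSpline(s_ref, ref[:, k])(s_new) for k in (0, 1)])
    # distance from each reference point to the nearest drawn point
    per_point = cdist(ref_i, drawn).min(axis=1)
    # classical symmetric Hausdorff distance
    h_sym = max(directed_hausdorff(ref_i, drawn)[0],
                directed_hausdorff(drawn, ref_i)[0])
    q75, q25 = np.percentile(per_point, [75, 25])
    return {"hausdorff_max": h_sym,
            "hausdorff_sum": per_point.sum(),   # candidate for feature F24
            "hausdorff_iqr": q75 - q25}
```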
Two approaches were utilized to compute the error-related features between the reference spiral and the cohort's drawings.
The first error was computed using the trapezoidal rule to integrate the two spiral regions. This error was calculated as the difference in magnitude between the two areas. Specifically, we use the following trapezoidal formula from Aghanavesi et al. (22):

$$\int_a^b f(x)\,dx \approx \frac{b-a}{N}\left[\frac{f(x_1)}{2} + f(x_2) + \cdots + f(x_{N-1}) + \frac{f(x_N)}{2}\right],$$

where $N$ is the total number of x or y screen coordinate points and $\frac{b-a}{N}$ is the spacing between points. Let us suppose that the reference spiral and the cohort's spiral are denoted by $f_{ref}(x, y)$ and $f_{coh}(x, y)$, respectively. Then the error based on the trapezoidal rule becomes

$$E_{trap} = \left| AUC_{ref} - AUC_{coh} \right|,$$

where $|\cdot|$ is the absolute value of the difference between the two Areas Under the Curve (AUC). We now proceed with the second error, calculated using the following two-dimensional (2D) Mean Square Error [MSE; (32)]:

$$MSE = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( A_{ij} - B_{ij} \right)^2,$$

where $M$ and $N = 2$ are the numbers of data points and coordinate points, respectively. Again, spline interpolation was used to bring the cohort's data points to the length $M$ of the reference data points prior to error calculation. Furthermore, to obtain the similarity between the reference spiral and the cohort's drawing, the following 2D correlation coefficient from Aljanabi, Hussain, and Lu (33) was utilized:

$$r = \frac{\sum_{i}\sum_{j} \left(A_{ij} - \bar{A}\right)\left(B_{ij} - \bar{B}\right)}{\sqrt{\left(\sum_{i}\sum_{j} \left(A_{ij} - \bar{A}\right)^2\right)\left(\sum_{i}\sum_{j} \left(B_{ij} - \bar{B}\right)^2\right)}},$$

where $A_{MN}$ and $B_{MN}$ are the reference and cohort's spiral coordinate points with dimension $M \times 2$, respectively, and $\bar{A} = \frac{\sum_i x_i + \sum_i y_i}{2M}$ and $\bar{B}$ (defined analogously) are the reference and cohort's spiral means. Note that the reference and cohort's spiral means (i.e., $\bar{A}$ and $\bar{B}$) are not necessarily equal, as the x and y coordinate points differ. A full list of calculated features is provided in Table 3. When applicable, references for the calculated spiral-derived features are provided in Supplementary Table S1 in the supplemental material.
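The three error/similarity measures can be sketched as follows, assuming both curves have already been resampled to equal length as described above; the helper names are ours. The trapezoid is written out explicitly rather than via `np.trapz`, which was removed in NumPy 2.0.

```python
import numpy as np

def _trapz(y, x):
    """Composite trapezoidal rule for the area under y(x)."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1])) / 2.0)

def error_features(ref_xy, coh_xy):
    """AUC error, 2D MSE, and 2D correlation between the reference
    spiral and the cohort's drawing, both (M, 2) arrays of equal length.

    Note: A.mean() over the whole (M, 2) matrix equals the paper's
    mean definition (sum of x plus sum of y, divided by 2M)."""
    A, B = np.asarray(ref_xy, float), np.asarray(coh_xy, float)
    auc_err = abs(_trapz(A[:, 1], A[:, 0]) - _trapz(B[:, 1], B[:, 0]))
    mse = np.mean((A - B) ** 2)                       # averages over M*2 entries
    Am, Bm = A - A.mean(), B - B.mean()
    corr = (Am * Bm).sum() / np.sqrt((Am ** 2).sum() * (Bm ** 2).sum())
    return {"auc_error": auc_err, "mse_2d": float(mse), "corr_2d": float(corr)}
```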

Statistical Analysis
Statistical analyses were used to evaluate the validity and strength of the features (i.e., clinical disability scales and spiral-derived features). A Benjamini-Hochberg (BH)-adjusted p-value < 0.001 was used to establish statistically significant differences when comparing the HD and MS cohorts. All features that were not statistically significant using the unpaired two-sample Wilcoxon test (36) in the training set were removed from subsequent analysis. Moreover, the average fold change (FC) between HD and MS was computed at all difficulty levels and for both the dominant and non-dominant hand. An FC > 2 was used as the cutoff for a significant difference between HD and MS.
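The screening step can be sketched as follows. The unpaired two-sample Wilcoxon test is the Wilcoxon rank-sum (Mann-Whitney U) test, and the BH step-up adjustment is implemented by hand here; the function names are ours.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(pvals, float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)       # p_(i) * m / i
    # enforce monotonicity from the largest p-value downwards
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.clip(ranked, 0, 1)
    return adj

def screen_feature(hd_values, ms_values):
    """Raw p-value of the unpaired two-sample Wilcoxon (Mann-Whitney U)
    test for one feature; in the paper these are BH-adjusted across all
    features and screened at adjusted p < 0.001."""
    return mannwhitneyu(hd_values, ms_values, alternative="two-sided").pvalue
```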
Test-retest reliability of the spiral-derived features was measured using the intraclass correlation coefficient [ICC; (15)] of features obtained from the granular testing HD and MS sub-cohorts. The ICC was calculated using the ICC function from the irr R package (37). As stated in (38,39), there are several versions of the ICC that can give different results when used on the same dataset; however, these authors point out that the two-way mixed-effects model with absolute agreement is the most appropriate for test-retest reliability studies. Thus, an ICC with a two-way mixed-effects model and absolute agreement was used in this study. Following the guideline of Koo and Li (38) that an ICC between 0.5 and 0.75 is considered moderate, a Spearman correlation matrix between the clinical disability scales (see Table 2) and the spiral-derived features (see Table 3) with ICC > 0.5 and FC > 2 was constructed. A BH-adjusted p-value < 0.05 was used to assess the statistical significance of the correlation tests.
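The ICC variant used here (two-way model, absolute agreement, single measurement; McGraw and Wong's ICC(A,1)) can be computed directly from the ANOVA mean squares of a subjects-by-sessions matrix. This hand-rolled version is for illustration only; in practice the irr package does this.

```python
import numpy as np

def icc_a1(ratings):
    """ICC, two-way model, absolute agreement, single measurement
    (ICC(A,1)), from an (n subjects x k sessions) matrix."""
    Y = np.asarray(ratings, float)
    n, k = Y.shape
    grand = Y.mean()
    row_means = Y.mean(axis=1)                  # per-subject means
    col_means = Y.mean(axis=0)                  # per-session means
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_err = np.sum((Y - grand) ** 2) - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                     # between-subjects mean square
    msc = ss_cols / (k - 1)                     # between-sessions mean square
    mse = ss_err / ((n - 1) * (k - 1))          # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Perfectly reproduced measurements give ICC = 1; session-to-session noise that is large relative to between-subject spread pushes the ICC toward 0, which is exactly the failure mode the paper identifies for the harder spiral levels.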
To determine the relationship between the significant spiral-derived features (i.e., BH-adjusted p-value < 0.001 between HD and MS, FC > 2, and ICC > 0.5) and the statistically significant clinical disability scales, four different regression models (Elastic Net or ElasticNet, Support Vector Regression with Radial Basis Function Kernel or SVR Radial, Random Forest or RF, and Stochastic Gradient Boosting or GBM) were used, with the clinical disability scales as the dependent variables and the spiral-derived features as the independent variables. For all regression models, the "caret" library (40) in the R software was used along with other libraries such as "glmnet" (41) for the ElasticNet model, "randomForest" (42) for the RF model, and "xgboost" (43) for the GBM model. Prior to the regression modeling, outliers in the spiral-derived features were identified as feature values that lie outside of ±2(q₀.₉ − q₀.₁), where qₚ is the p-quantile (44). To reduce variability in the features, all variables were bi-symmetric log-transformed using the transformation y = sgn(x) log₁₀(1 + |x/C|), where y is the transformed value of the variable x, C has a default value of 1/ln(10), and sgn(x) is the mathematical signum function [as presented in (45)].
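These two preprocessing steps can be sketched as follows. The paper does not state the center of the outlier interval, so centering on the median is an assumption of this sketch, as is the function naming.

```python
import numpy as np

def bisymlog(x, C=1 / np.log(10)):
    """Bi-symmetric log transform y = sgn(x) * log10(1 + |x/C|):
    sign-preserving, approximately linear near zero, logarithmic in the
    tails, which damps the influence of extreme feature values."""
    x = np.asarray(x, float)
    return np.sign(x) * np.log10(1 + np.abs(x / C))

def outlier_mask(x):
    """Flag values outside +/- 2*(q0.9 - q0.1).  The paper gives the
    interval width; centring it on the median is our assumption."""
    x = np.asarray(x, float)
    q10, q50, q90 = np.percentile(x, [10, 50, 90])
    return np.abs(x - q50) > 2 * (q90 - q10)
```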
Moreover, a linear regression model was used to assess the relationship between selected clinical disability scales and the sum of the Hausdorff distances (feature F24 in Table 3). During the analysis, adherence to the normality assumptions of the residuals was tested using histograms and quantile plots. All models were evaluated using the Root Mean Square Error (RMSE; measured in seconds) and the coefficient of determination (R²) of the prediction. Apart from the linear regression model, all model parameters (i.e., the penalty strength parameter λ and the L1/L2 mixing parameter α in ElasticNet; the cost value C and γ of the SVR Radial; the number of variables randomly sampled at each split in the RF model; and the number of trees, the interaction depth, the minimum number of samples in tree terminal nodes, and the learning rate in the GBM model) were tuned via grid search. Five-fold cross-validation (CV) with 10 repetitions was used to assess model suitability in the training cohort. The out-of-sample test performance was evaluated by using the final model from the 5-fold CV, based on the RMSE, to predict the clinical disability scales in the test datasets (i.e., the independent validation cohort).
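The tuning protocol (grid search inside 5-fold CV with 10 repetitions, selecting by mean RMSE) can be sketched without caret. Ridge regression stands in here for the paper's four ML models purely to keep the example dependency-free; the CV protocol, not the model, is the point of the sketch, and all names are ours.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression; the intercept is unpenalised."""
    Xb = np.column_stack([np.ones(len(X)), X])
    P = lam * np.eye(Xb.shape[1])
    P[0, 0] = 0.0                                    # do not shrink the intercept
    return np.linalg.solve(Xb.T @ Xb + P, Xb.T @ y)

def repeated_cv_grid_search(X, y, lambdas, n_splits=5, n_repeats=10, seed=0):
    """Pick the hyper-parameter with the lowest mean RMSE over
    n_repeats repetitions of n_splits-fold cross-validation."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    rng = np.random.default_rng(seed)
    scores = {lam: [] for lam in lambdas}
    for _ in range(n_repeats):
        idx = rng.permutation(len(y))                # reshuffle folds each repeat
        for fold in np.array_split(idx, n_splits):
            train = np.setdiff1d(idx, fold)
            for lam in lambdas:
                w = ridge_fit(X[train], y[train], lam)
                pred = np.column_stack([np.ones(len(fold)), X[fold]]) @ w
                scores[lam].append(np.sqrt(np.mean((y[fold] - pred) ** 2)))
    return min(lambdas, key=lambda lam: float(np.mean(scores[lam])))
```

Note that, as the paper's validation results show, even this repeated-CV estimate remains an optimistic measure of how the chosen model will perform on a truly independent cohort.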

Feature Evaluation
To determine the clinical disability scales and spiral-derived features that are relevant for further analysis, statistical significance between HD and MS was calculated using the unpaired two-sample Wilcoxon test. With the exception of four features (NeurEx™ vision score, brainstem atrophy, medulla/upper c-spine atrophy, and cerebellum atrophy), all clinical disability scales and MRI features were found to be statistically significant at p < 0.001 after adjusting the p-values using the BH approach (Table 2). There were differences in which spiral-derived features were statistically significant between the dominant and non-dominant hands. In the dominant hand category, for instance, spiral-derived features 21 and 28 were statistically significant (BH-adjusted p-value < 0.001) at difficulty levels 1, 2, and 3. However, in the non-dominant hand category, these features (i.e., 21 and 28) were not statistically significant at any difficulty level. We also found spiral-derived feature 22 to be statistically significant at all difficulty levels in the non-dominant hand (BH-adjusted p-value < 0.001) but not in the dominant hand. While the spiral-derived features that were statistically significant between HD and MS vary between dominant and non-dominant hands and across difficulty levels, 19 of these features were consistently statistically significant at all levels and for both hands (see bold feature descriptions in Table 3).
Furthermore, examining the FC between HD and MS together with the ICC indicated that only three spiral-derived features (kurtosis of velocity, kurtosis of angular velocity, and the sum of the Hausdorff distances) had FC > 2 and ICC > 0.5 when the ICC was calculated using the granular data of HD (Figure 2). When the ICC was computed for patients with MS, kurtosis of radial velocity, kurtosis of angular velocity, and the sum of Hausdorff distances had FC > 2 and ICC > 0.5 (Figure 3). In total, four spiral-derived features (i.e., kurtosis of velocity, kurtosis of radial velocity, kurtosis of angular velocity, and the sum of Hausdorff distances) were found to be statistically significantly different between HD and MS, had FC > 2, and had ICC > 0.5 in HD or patients with MS. These features have moderate test-retest reliability and differ significantly between HD and MS, as indicated by their FC and ICC (Figures 2, 3). Statistically significant differences between HD and patients with MS in selected clinical features and the four most impactful spiral-derived features are depicted by violin and boxplots (see Figure 4 for boxplots of selected clinical features and Figures 5-7 for boxplots of the most impactful spiral-derived features at difficulty levels 1, 2, and 3, respectively). Most spiral-tracing features have a higher median value in MS than in HD (Figures 4-7). Also, for many patients with MS, the ICC is higher than that of HD (Figure 8). This is expected given the higher inter-individual variance of spiral-tracing outcomes in MS vs. HD.
To determine the relationship between the clinical disability scales and the most impactful spiral-derived features (kurtosis of velocity, radial velocity, and angular velocity, and the sum of Hausdorff distances), a Spearman Rho correlation matrix among the features was constructed (see Figure 9 for difficulty levels 1 and 2, and Supplementary Figure S6 for difficulty level 3). In general, the highest correlations were seen among the 9HPT average, EDSS, CombiWISE, NeurEx™, Lesion Load Brainstem, and the spiral-derived features for the dominant hand (Figure 9 and Supplementary Figure S6). Among the spiral-derived features, the sum of the Hausdorff distances had the highest correlations in both cohorts, but the strength of the correlations was weak to moderate.
Of the neurological examination subdomains, only motoric and cerebellar functions (but not proprioception) were reliably correlated with the best spiral-tracing features. We observed a very high positive correlation between the sum of Hausdorff distances and the time taken to complete the drawing in both HD and patients with MS (Supplementary Figures S7-S9; Spearman Rho > 0.97 for both dominant and non-dominant hands and at all difficulty levels). This is counterintuitive, as we expected that increasing the speed of spiral drawing would negatively affect tracing accuracy. Instead, it appears that disability and/or the patient's confidence in his/her ability to trace the spiral affected both the speed and the accuracy of the tracing congruently.

ML Models of Best Spiral Tracing Features and Their Independent Cohort Validation
Four regression models (ElasticNet, SVR Radial, RF, and GBM) were used to evaluate the relationship between clinical disability scales and our four most impactful features (kurtosis of velocity, kurtosis of angular velocity, kurtosis of radial velocity, and the sum of the Hausdorff distances). The models performed best when predicting the clinical disability scales in the (small; 22 subjects, as presented in Table 1) HD cohort, based on the mean RMSE and R² across the 5-fold CV with 10 repetitions (see Supplementary Tables S2, S5, S8 for difficulty levels 1, 2, and 3, respectively). Among the HD at difficulty level 1, the ElasticNet model performed best, explaining up to 85% of the variance in the clinical disability scales when the dependent variables were CombiWISE or EDSS (Supplementary Table S2). When the dependent variables were 9HPT average or NeurEx, SVR Radial had the best performance at difficulty level 1, with R² values between 0.69 and 0.79 (Supplementary Table S2). The percent variance explained in model outcomes at difficulty levels 2 and 3 in HD was lower compared to difficulty level 1 but still ranged between 30 and 76% (Supplementary Tables S5, S8).
The models' performance in the (much larger; 89 subjects, as presented in Table 1) MS training cohort was much lower at all difficulty levels in comparison to the results from the HD. The SVR Radial performed best, yet explained only 6-23% of the variance in the clinical disability scales, depending on the hand used (dominant or non-dominant), the difficulty level, and the clinical disability scale (see Supplementary Tables S3, S6, S9 for difficulty levels 1, 2, and 3, respectively).
However, the out-of-sample test performance (i.e., in the independent validation cohort) was much lower compared to the cross-validation performance in the training cohort (Supplementary Tables S4, S7, S10 for difficulty levels 1, 2, and 3, respectively). Of these, models predicting the 9HPT average yielded the best performance, with an R² between 0.0304 and 0.1914, but only for the dominant hand. In the independent test set, the GBM models generally validated the worst. Thus, we conclude that cross-validation of the training set is overly optimistic and does not reliably predict performance in an independent test cohort.
Given that the sum of Hausdorff distances had the highest correlation with the clinical disability scales at all difficulty levels (Figure 9 and Supplementary Figure S6), linear regression models were constructed to measure the relationship between the disability scales and the sum of Hausdorff distances alone. In general, all clinical disability scales were positively correlated with the sum of Hausdorff distances (see Figure 10 for the correlation with average 9HPT and Supplementary Figures S10-S12 for correlations with EDSS, CombiWISE, and NeurEx, respectively). The validation cohort performance of the sum of Hausdorff distances alone (Figure 10 for 9HPT) was comparable to that of the more complex ML-based models (i.e., R² between 0.00834 and 0.1593 in Supplementary Table S11).
Overall, we observed better predictive accuracy (based on R²) in the dominant hand category than in the non-dominant hand category, and for difficulty levels 1 and 2 compared to difficulty level 3 in the dominant hand (Figure 11). These results remained consistent when controlling for age and gender in all models (see Supplementary Tables S12, S13 for cross-validation and out-of-sample test performance of the ML models at difficulty level 1 for MS subjects, respectively). Similar results were found in the CV of the training cohort in Creagh et al. (9), where the authors observed that the mean absolute error (MAE) was higher in non-dominant hand models than in dominant hand models for HD subjects. However, in the MS subjects, non-dominant hand models predicted 9HPT more accurately than dominant hand regression models (9).

DISCUSSION
Test reproducibility (measured as ICC, signal-to-noise ratio, or concordance coefficient) is generally given less importance in test design than test sensitivity. The presented analyses of the Spiral tracing test support the notion that test reproducibility is an essential determinant of clinical utility; achieving high reproducibility of digital tests should be at the forefront of medical app developers' priorities. Spiral tracing is a complex test that engages many neurological functions: fine finger movements, negatively affected by motoric dysfunction, proprioceptive loss, and cerebellar dysfunction; eye-hand coordination, affected by vision, oculomotor, and cerebellar dysfunctions; and cognition or anxiety associated with anticipated test difficulty. Additionally, the precision of the test is affected by the time of test execution. Nevertheless, we observed, counterintuitively, a strong positive correlation between measures of test accuracy, such as the sum of Hausdorff distances, and the time it took to perform the tracing, even in HD. This suggests that time was not the primary driver of tracing inaccuracy. Rather, combined neurological disability and/or lack of confidence in the ability to perform the test both fast and accurately negatively affected test performance. This explains why Spiral tracing features that adjusted tracing accuracy for velocity performed worse than the most successful accuracy measure, the sum of Hausdorff distances.
The comparison of cross-validation performance in the MS training cohort with the performance of the ML-based models in the truly independent validation cohort demonstrated that training cohort cross-validation overestimates test performance in subjects who did not contribute to model development: when the performance of the strongest ML-based models (i.e., those modeling 9HPT) was compared between cross-validation of the training cohort and the independent validation cohort, all four ML algorithms greatly overestimated model performance in the independent validation cohort. The best feature of the Spiral tracing (the sum of Hausdorff distances) performed comparably to the ML-based models in the independent validation (Figure 8). This overestimation of model performance from training cohort data, even when training cohort results are based on cross-validation, is a rule we have observed uniformly over the past decade of our experience with independent validation of complex models. We used to omit training cohort results from our publications altogether, considering them irrelevant. However, after realizing that the vast majority of ML studies in the biomedical literature do not use an independent validation cohort, and that most readers and reviewers consider cross-validation of the training cohort equivalent to independent validation, we now routinely publish training cohort data to demonstrate the level of overfitting in comparison to the truly independent validation cohort.
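The gap between training-cohort cross-validation and truly independent validation is easy to reproduce on synthetic data. The sketch below is our illustration, not the study's pipeline: it selects features on the full training cohort before cross-validating (a common leak), so pure-noise data yield an optimistic CV R² that collapses when the model is scored on subjects it has never seen. Cohort sizes loosely mimic the study's scale; all other parameters are arbitrary.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def ols_predict(X_tr, y_tr, X_te):
    """Ordinary least squares with intercept."""
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([np.ones(len(X_te)), X_te]) @ coef

rng = np.random.default_rng(0)
n_tr, n_val, n_feat = 60, 45, 2000
X = rng.normal(size=(n_tr, n_feat))        # pure-noise "features"
y = rng.normal(size=n_tr)                  # pure-noise "disability score"
X_val, y_val = rng.normal(size=(n_val, n_feat)), rng.normal(size=n_val)

# Leaky step: pick the 10 features most correlated with y on the FULL
# training cohort *before* cross-validating.
Xc, yc = X - X.mean(axis=0), y - y.mean()
corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
top = np.argsort(corr)[-10:]

# 5-fold cross-validation on the training cohort (selection already leaked)
cv_scores = []
for hold in np.array_split(rng.permutation(n_tr), 5):
    mask = np.ones(n_tr, dtype=bool)
    mask[hold] = False
    pred = ols_predict(X[mask][:, top], y[mask], X[hold][:, top])
    cv_scores.append(r2(y[hold], pred))
cv_r2 = float(np.mean(cv_scores))

# Truly independent validation: entirely new "subjects"
ind_r2 = r2(y_val, ols_predict(X[:, top], y, X_val[:, top]))

print(f"training-cohort CV R^2: {cv_r2:.2f}  vs  independent R^2: {ind_r2:.2f}")
```

Because the features carry no real signal, the honest (independent) R² hovers at or below zero, while the leaked cross-validation score looks respectable.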
We also point out that the cross-validation performance of the training cohort does not faithfully predict even the ranking of the models.
In this regard, because the COVID-19 pandemic precluded us from recruiting an independent validation cohort of HD, we consider the cross-validation performance of the HD models unrealistically optimistic (especially because of the small number of HD) and fully expect that those models represent overfitting. Therefore, the HD models should not be considered promising without independent validation. The poor performance of these ML-based models was, in our experience, expected based on the poor ICCs and weak univariate correlations of individual Spiral tracing features with the gold standard of neurological disability measures and brain MRI markers of CNS injury in the MS cohorts. In comparison, much simpler tests, such as rapid tapping on the smartphone screen, correlated much more strongly with analogous disability measures (i.e., up to a Spearman Rho of 0.76) and were also much more intra-individually stable (7). Interestingly, both 9HPT and smartphone finger tapping differentiated MS from HD better for the non-dominant hand; this observation has been reproduced for 9HPT in multiple studies (4, 7, 19). We interpret this observation as reflecting functional repair: even though MS likely affects both hands equally, the daily use of the dominant hand promotes repair, both as remyelination and as the establishment of new synaptic circuits by remaining neurons. Therefore, digital tests of the non-dominant hand, which undergoes less rehabilitation/repair, are more sensitive for measuring the difference between patients with MS and HD and for measuring the progression of disability over time. Surprisingly, the non-dominant hand performed much worse in the Spiral tracing test in both MS cohorts. We believe this was due to higher intra-individual variance/greater noise, which has little to do with disability and more to do with test complexity.
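Because ICC figures prominently in this argument, a brief sketch of one common variant may help: ICC(2,1), the Shrout–Fleiss two-way random-effects, absolute-agreement, single-measurement coefficient, computed from repeated test sessions per subject. Whether this is the exact ICC variant used in the study is our assumption; the sketch illustrates the general idea that ICC compares between-subject variance to measurement noise.

```python
import numpy as np

def icc_2_1(Y):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    Y is an (n_subjects, k_sessions) matrix of repeated test scores.
    Computed from the two-way ANOVA mean squares (Shrout & Fleiss).
    """
    Y = np.asarray(Y, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    ss_rows = k * np.sum((Y.mean(axis=1) - grand) ** 2)   # between subjects
    ss_cols = n * np.sum((Y.mean(axis=0) - grand) ** 2)   # between sessions
    ss_err = np.sum((Y - grand) ** 2) - ss_rows - ss_cols # residual
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
```

A test whose repeated sessions reproduce each subject's score exactly yields ICC = 1; as intra-individual noise grows relative to between-subject differences, ICC falls toward zero, which is the failure mode described above for the complex Spiral tracing features.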
As test complexity increased to Level 3, the reproducibility and clinical relevance of the Spiral tracing features decreased quite dramatically.
We recognize that the poor reproducibility of the Spiral tracing observed in our study may be mitigated when spiral tracing is performed on tablets, where the traced spiral is much larger (10). We developed NeuFun-TS for smartphones rather than tablets because of the former's larger worldwide prevalence and greater availability of different sensors. Test selection must consider the screen size difference for apps targeting different mobile devices.
In conclusion, for self-administered digital measurements of neurological functions, designers should strive to develop tests that are easy to perform and therefore highly reproducible, yet still reflect a specific neurological (dys)function. These simpler tests will likely be (by design) less sensitive than tests that depend on multiple neurological functions, but sensitivity can be restored by aggregating results from multiple simple tests, as is done in NeuFun-TS. However, the total time necessary to complete all tests in NeuFun-TS will likely determine compliance with longitudinal testing. Therefore, because Spiral tracing does not add clinical value beyond the existing tapping (19), balloon popping (7), and level tests (18), we plan to drop Spiral tracing from the NeuFun-TS standard tests. Fourier analysis of Spiral tracing, to identify tremor presence, frequency, and severity, may still be very useful in patients with movement disorders.
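The tremor analysis mentioned above could, for instance, take the radial deviation of the traced point from the ideal spiral, uniformly resampled in time, and read off the dominant frequency in a physiological tremor band. The sampling rate, band limits, and preprocessing below are our assumptions for illustration, not the study's protocol.

```python
import numpy as np

def tremor_peak(deviation, fs, band=(3.0, 12.0)):
    """Dominant frequency (Hz) and its power within a tremor band.

    deviation: 1-D radial deviation of the traced point from the ideal
    spiral, uniformly sampled at fs Hz (assumed preprocessing).
    """
    x = np.asarray(deviation, dtype=float)
    x = x - x.mean()                          # remove DC offset
    spec = np.abs(np.fft.rfft(x)) ** 2        # one-sided power spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    i = np.argmax(np.where(in_band, spec, 0.0))
    return freqs[i], spec[i]

# Synthetic example: a 6 Hz tremor riding on slow drift plus sensor noise
fs = 60.0
t = np.arange(0.0, 5.0, 1.0 / fs)
sig = (0.5 * np.sin(2 * np.pi * 6.0 * t)
       + 0.2 * t
       + 0.05 * np.random.default_rng(1).normal(size=t.size))
f_peak, power = tremor_peak(sig, fs)
```

Restricting the peak search to the 3–12 Hz band keeps the slow drift of the hand along the spiral from masquerading as tremor, which is why the raw spectrum alone is not sufficient.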

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The name of the repository and accession number can be found at: GitHub, https://github.com/bielekovaLab/Bielekova-Lab-Code/tree/master/FormerLabMembers/Messan_Komi.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by National Institute of Allergy and Infectious Diseases (NIAID) scientific review and the National Institutes of Health (NIH) Institutional Review Board. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
BB conceived the study design. BB, LP, TH, YK, VM, and PK performed the app construction and clinical data collection. BB and KM conceived the analytical methodologies. KM performed the statistical analysis and ML modeling with their respective visualizations. All authors participated in writing and editing the manuscript.

FUNDING
This research was supported by the Intramural Research Program of the National Institutes of Health (NIH) and the National Institute of Allergy and Infectious Diseases (NIAID). KM was supported in part by an appointment to the NIAID Research Participation Program, administered by the Oak Ridge Institute for Science and Education (ORISE) through an interagency agreement between the U.S. Department of Energy (DOE) and NIAID. ORISE is managed by Oak Ridge Associated Universities (ORAU) under DOE contract number DE-SC0014664.