Motor Impairment Estimates via Touchscreen Typing Dynamics Toward Parkinson's Disease Detection From Data Harvested In-the-Wild

Parkinson's Disease (PD) is a neurodegenerative disorder with early non-motor/motor symptoms that may evade clinical detection for years after the disease onset due to their mildness and slow progression. Digital health tools that process densely sampled data streams from the daily human-mobile interaction can objectify the monitoring of behavioral patterns that change due to the appearance of early PD-related signs. In this context, touchscreens can capture micro-movements of fingers during natural typing; an unsupervised activity of high frequency that can reveal insights for users' fine-motor handling and identify motor impairment. Subjects' typing dynamics related to their fine-motor skills decline, unobtrusively captured from a mobile touchscreen, were recently explored in-the-clinic assessment to classify early PD patients and controls. In this study, estimation of individual fine motor impairment severity scores is employed to


INTRODUCTION
Parkinson's Disease (PD) is the second most common neurodegenerative disorder after Alzheimer's Disease (Shulman et al., 2011), with a wide clinical spectrum of motor and non-motor symptoms (Chaudhuri et al., 2006) that are mild in the early stages and cause progressive disability in the later ones. The underlying neuropathological process precedes the onset of relevant PD motor symptoms by up to decades, leaving the disease undiagnosed for years (Hawkes et al., 2010; Schrag et al., 2015). PD has a significant impact on patients' quality of life, caused in part by a wide variety of motor impairments, such as brady-/hypokinesia (B/H-K) and rigidity (R), which are nevertheless less evident to the person concerned because of their mildness in the early stages of the disease. Furthermore, degradation in motor function is reflected in the patient's motor behavioral patterns, e.g., fine motor movements, speed of reflex movements, and intermittent tremor. Diagnosis of PD is made by a movement disorders specialist who assesses, usually clinically, the patient's overall condition using questionnaires and standardized scales, such as the Unified Parkinson's Disease Rating Scale (UPDRS) (Fahn et al., 1987). UPDRS Part-III (Goetz et al., 2008) consists of 14 single items qualitatively measuring the range of PD motor symptomatology, evaluated by experts during the examination of specific tasks.
Objective and frequent evaluation with quantitative measures can assist the clinical decision-making process on PD diagnosis and patients' monitoring. Nevertheless, clinical examination frequently relies on the subject's self-reports as a source of information, and the assessment of PD symptom severity depends on the experience of the physician. Information and Communication Technology (ICT)-based solutions (Mellone et al., 2012) and the plethora of related data can help the relevant stakeholders to better understand the disease's impact on daily habits, even in the early stages, as well as the patient's response to drug therapy. An emerging field of ICT in which large-scale data streams are acquired from users' habitual patterns is human-mobile interaction. The latter can reveal everyday information that could be transformed into useful behavioral indices, built in a dynamic and personalized way across the time of interaction. The design of digital monitoring tools for PD with diagnostic value has been a research field with a great variety of applications, due to the wide spectrum of PD clinical symptoms. Efforts that process data streams captured from ICT devices have proven robust in distinguishing populations facing motor symptoms from healthy ones in different in-the-clinic assessment sub-tasks, such as voice (Orozco-Arroyave et al., 2016), gait, or tremor (Abdulhay et al., 2018).
Digital health solutions with potential transferability to real-life environments (in-the-wild) face the challenge of both capturing useful disease indicators and achieving long-term adherence. Bot et al. (2016) were the first to report a large-scale smartphone-based PD-related study, namely mPower, with over 9,000 participants (both PD and healthy users), aiming at remote PD screening by asking participants to perform designed digitized tests that assess motor functionalities and to complete self-reported questionnaires. However, drop-out rates highlighted that the specific tests are not viable for long-term user engagement. Moreover, Zhan et al. (2018) recently used a mobile application to assess PD patients longitudinally via tests on five scheduled scenarios and, using sensor data analysis, proposed an aggregated index that was correlated with the total UPDRS Part III score. Although both aforementioned studies paved the way for smartphone-based PD assessment, they required users' active interaction with the mobile application. This requirement, however, makes drop-outs hard to avoid, and subjects are possibly subjected to the Hawthorne effect (Monahan and Fisher, 2010). Non-obtrusive and passive sensing of data could overcome these barriers in designing such monitoring tools. One example is a recent study (Arroyo-Gallego et al., 2018) designed to unobtrusively collect data from keystroke typing on the physical keyboard of subjects' PCs, in order to detect subjects with PD by using a machine learning approach. In fact, a numerical index was produced that related keystroke hold times to the total UPDRS Part-III score. A possible drawback of this approach (Giancardo et al., 2016) was the use of the total UPDRS Part-III score as the regression target, as it encapsulates items that are both relevant (e.g., B/H-K) and irrelevant (e.g., voice degradation, gait) to keystroke typing and fine-motor movement. The produced index achieved 0.83 Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) with 0.77/0.72 sensitivity/specificity in classifying PD patients and healthy users when they were typing in the in-the-clinic setting, and 0.76 AUC with sensitivity/specificity of 0.73/0.69 during the remote at-home assessment.
Fine-motor skills decline can also be detected from typing patterns on a mobile touchscreen, as shown by recent in-the-clinic studies of touchscreen typing pattern analysis (Arroyo-Gallego et al., 2017; Iakovakis et al., 2018). In our latest study, we proposed a feature vector representation of enriched keystroke information and a two-stage machine learning-based pipeline to process multiple typing sessions captured from a mobile touchscreen, achieving 0.92 AUC with 0.82/0.81 sensitivity/specificity in classifying early PD patients and healthy subjects. Subjects completed multiple typing sessions during an in-the-clinic medical examination, the derived typing sequences were analyzed, and the study findings resulted in four keystroke features with high discriminative power and a plausible connection with symptoms.
Motivated by the aforementioned, the current study goes a step further by analyzing keystroke information with respect to the specific motor symptoms that possibly cause the variations in PD patients' typing patterns. This analysis direction aims to increase the interpretability of the produced indexes by targeting the UPDRS Part III single item scores that are related to the PD fine motor symptoms. The first part of this study exploits the best performing features from mobile touchscreen typing in the binary setting (control or PD patient) as predictors of specific UPDRS Part III single items, in order to predict each symptom's severity on the standardized medical scale, so that it can be easily interpreted by medical experts. From a methodological point of view, regression models were employed to produce numerical indexes that describe the severity of the motor symptomatology on typing kinetics. These were tested with a leave-one-subject-out (LOSO) validation of symptom severity estimation, using keystroke features as independent variables; the target variables were the UPDRS Part III single item scores of each symptom. The conceptualization of this analysis direction was reinforced by the correlation results between the employed features and the specific UPDRS Part III single items, and by the variation that the symptoms had plausibly caused to the keystroke distributions of PD patients.
Furthermore, the second contribution of the present work is the testing of the generalization power of these trained regressors from the in-the-clinic to the in-the-wild data analysis. Mobile touchscreen typing data were unobtrusively captured from users via a dedicated smartphone application. More specifically, the developed models were applied in the in-the-wild setting to further investigate their diagnostic performance and their response over time in a longitudinal manner. Multi-modal data were collected through a research data donation application developed for this purpose (i-Prognosis, 2017), and a third-party keyboard was included to capture keystroke dynamics from routine typing. The variance and noise induced in the data by daily activities in the uncontrolled setting is a real-life challenge when screening in an unobtrusive manner. However, touchscreen typing is a high-frequency activity that usually involves cognitive attention, factors that contribute to the retention of user-specific patterns in keystroke dynamics across time.
The present work is in line with the efforts toward predictive analytics approaches, in both in-the-clinic and in-the-wild data analysis, for capturing PD-related early signs. This could potentially contribute to building an effective PD prediction system that takes into consideration the pragmatic conditions of everyday living, for automatic remote inference and recommendation of PD diagnosis and management.

MATERIALS AND METHODS
Based on the findings of our previous work (Iakovakis et al., 2018), this study exploits the most discriminative features as the representation of typing patterns on a mobile touchscreen, and makes use of regression models to estimate individual UPDRS Part III single item scores relevant to fine-motor impairment. As depicted in Figure 1, the selected features are used as the independent variables for estimating the symptoms' severity, i.e., the UPDRS Part III single item scores, which are used as the target variables for the training of the different regression models. A LOSO evaluation with nested cross validation for the regressors' optimization is used on the in-the-clinic development set (DSet) (Figure 1A) to identify the models and the symptoms that can be predicted, and the generalization of the method is tested in the in-the-wild scenario (Figure 1B). The estimators are evaluated for their diagnostic properties using ROC-based performance in the binary setting of classifying PD and healthy users. The goal is to investigate the diagnostic properties of the method as well as its transferability potential.

In-the-Clinic Data
The DSet consisted of data acquired from 33 subjects who provided data on the day of their visit to the clinic. The collection protocol included a typing experiment of multiple text excerpts on smartphones and a clinical evaluation. These data were logged in a spreadsheet file and mapped to the subjects' coded IDs by the neurologist. The specific UPDRS Part III single item scores used in the regression analysis were items 31/21/22/23 for Bradykinesia-Hypokinesia/Tremor/Rigidity/Finger Taps, respectively. The DSet study protocol was approved by the Aristotle University of Thessaloniki, Greece (Bioethics Committee of Medical School, approval no. 359/3.4.17). Written informed consent was obtained from all subjects prior to their participation in the study, and recruitment and study procedures were carried out according to institutional and international guidelines on research involving adult human beings. Subjects held the right to withdraw from the procedure at any time, without providing any justification. PD patients under medication (14) had a mean/std Levodopa equivalent dose of 237/156 and were asked to refrain from taking it for at least 8 h before their visit. More detailed information about the dataset acquisition and study cohort can be found in Iakovakis et al. (2018).

In-the-Wild Data
The data captured in-the-wild (GData) were collected by the i-PROGNOSIS remote data collection study (GData study). Subjects from four countries across the EU contributed pseudo-anonymised multi-modal data remotely (e.g., voice, handling) by downloading the mobile application from the Google Play store and enrolling in the study. The application provides information regarding the study details on its first launch and gives subjects the option to communicate with medical representatives in each country in case of additional questions. Electronic informed consent was obtained from all enrolled subjects within the smartphone application, by digitally signing a dated consent form. Due to the remote nature of the study, obtaining written consent was impractical. Subjects held the right to withdraw from the procedure at any time via the available option within the application, and even to request the deletion of the collected data. Subjects of GData have the option to use the third-party keyboard included in the application, named iPrognosis App, so as to capture the keystroke dynamics during their routine typing activities. All the experimental and ethical protocols were approved by the Ethik-Kommission an der Technischen Universität Dresden, Dresden, Germany (EK 44022017), and the Bioethics Committee of the Aristotle University of Thessaloniki Medical School, Thessaloniki, Greece (359/3.4.17). Two hundred and ten total users provided 42,812 typing sessions (mean/std 204/460 typing sessions per user). However, only subjects who self-reported to be aged between 48 and 80 years old were included in the analysis, in order to be in the same age group as the subjects in the DSet, resulting in a total of 48 subjects.

Data Capturing
The application for capturing keystroke-related data includes a custom software keyboard developed for the Android Operating System (OS) by three authors (DI, SH, and VH). The users have to enable the software keyboard after the application installation and set it as the default input method. Subjects of the in-the-clinic assessment performed the typing task using the custom keyboard and a common smartphone provided by the authors, whereas in-the-wild subjects used the keyboard with their own mobile devices. In the background, a class of the software keyboard captured the timestamps of press and release touch events for each key tap, as well as the normalized pressure (0.000-1.000) on each press event as output by the OS. Each key tap was also flagged as either a long-press event, corresponding to deliberate special keyboard actions, or not. The characters typed were not captured, as the content of what is being typed is not required for the analysis, rendering our data collection process privacy-aware. For each typing session (keyboard shown and afterwards hidden, with at least one key tap in the meantime), sequences of captured data were stored in JSON format and indexed as a database entry in a local SQLite database, available only to the application. The application periodically transmitted database entries to a remote cloud server (Microsoft Azure) when the user's device was connected to Wi-Fi and charging. Each entry was accompanied by a unique coded ID of the user to ensure privacy.

Keystroke Dynamic Features
The differences of the time-stamp sequences of touchscreen key press $t^p_n$ and release $t^r_n$ produce the so-called hold time (HT) and flight time (FT) sequences. The HT sequences, containing the values $HT_n = t^r_n - t^p_n$, $n = 1, 2, \ldots, N$, i.e., the differences between the time-stamp a key was released and the time-stamp the key was pressed, are pre-processed in order to discard deliberate long-presses. Similarly, the FT sequences, containing the values $FT_n = t^p_{n+1} - t^r_n$, $n = 1, 2, \ldots, N-1$, i.e., the differences between the time-stamp a key is pressed and the time-stamp the previous key was released, are also pre-processed to minimize the effect of typing dexterity and subjective factors. In particular, the filtering process included an upper bound of 3 s for sequence elements, a normalization procedure per typing session, and a conditional filtering step applied to the produced normalized FT (NFT) sequences, as given in Equation (1).

Each typing session is represented by a feature vector consisting of the mean of the mean ($\mu$) of HT, the mean of the standard deviation ($\sigma$) of HT, and the skewness ($S$) of the normalized flight times, as derived by aggregating non-overlapping windows of 15 s of keystroke sequences. The aforementioned typing features are drawn from our previous study (Iakovakis et al., 2018), in which statistical representations of sequences of HT, FT, and normalized pressure (NP) values were examined on in-the-clinic data. However, in some cases, operating system implementations do not provide the value of NP; therefore, the NP-related feature was omitted from the in-the-wild data analysis, as the GData users did not share the same smartphone devices and consistency could not be ensured.
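The hold-time and flight-time definitions and the windowed aggregation above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, timestamps are assumed to be in seconds, raw FT values stand in for the normalized FT (the normalization and conditional filtering of Equation 1 are not reproduced), the minimum of four taps per window is taken from the session-level analysis described later, and averaging the per-window skewness is an assumption.

```python
from statistics import mean, pstdev

WINDOW_S = 15.0   # non-overlapping window length from the text (seconds)
MAX_FT_S = 3.0    # upper bound on sequence elements from the text (seconds)


def hold_times(press, release, long_press=None):
    """HT_n = t^r_n - t^p_n, skipping taps flagged as deliberate long-presses."""
    flags = long_press or [False] * len(press)
    return [r - p for p, r, f in zip(press, release, flags) if not f]


def flight_times(press, release):
    """FT_n = t^p_{n+1} - t^r_n for n = 1..N-1, dropping values above the bound."""
    fts = [press[n + 1] - release[n] for n in range(len(press) - 1)]
    return [ft for ft in fts if ft <= MAX_FT_S]


def skewness(values):
    """Population skewness; returns 0.0 when the spread is zero."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        return 0.0
    return mean(((v - m) / s) ** 3 for v in values)


def window_indices(press, window_s=WINDOW_S):
    """Group key-tap indices into consecutive non-overlapping windows."""
    if not press:
        return []
    windows = {}
    for i, t in enumerate(press):
        windows.setdefault(int((t - press[0]) // window_s), []).append(i)
    return [windows[k] for k in sorted(windows)]


def session_features(press, release, min_taps=4):
    """Return [mean of window HT means, mean of window HT stds,
    mean of window FT skewness], or None if no window is usable."""
    mu, sigma, skew = [], [], []
    for idx in window_indices(press):
        if len(idx) < min_taps:
            continue  # assumed: sparse windows are skipped
        p = [press[i] for i in idx]
        r = [release[i] for i in idx]
        ht, ft = hold_times(p, r), flight_times(p, r)
        if len(ht) < 2 or len(ft) < 2:
            continue
        mu.append(mean(ht))
        sigma.append(pstdev(ht))
        skew.append(skewness(ft))
    if not mu:
        return None
    return [mean(mu), mean(sigma), mean(skew)]
```

Calling `session_features(press, release)` on the press and release timestamps of one session yields the three-element vector used as regression input in this sketch.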

Regression
For each of the UPDRS Part III items related to upper-extremity fine-motor symptoms, whose scores can be directly associated with the symptom severity, a regression model (transformation) was trained/tested under a leave-one-subject-out (LOSO) scheme, with inner cross validation for the optimization of the regressors' parameters, to estimate the score of the corresponding item (target) on a typing session level. Each of the 275 typing sessions of the DSet was assigned the quantized target score of the UPDRS Part III single items of the corresponding subject. By design, UPDRS scores are quantized (integers between 0 and 4). Quantization levels span the severity of the underlying symptom, with the lowest value (0) denoting normal behavior, values between 0 and 2 denoting mild symptoms, and values from 3 up to 4 denoting severe impairment or inability. Regression can granularize the target domain and provide an index of higher resolution, based on the continuous input predictors (i.e., the keystroke dynamics features of X). The UPDRS single item scores under investigation are B/H-K (B), tremor of the right/left hand ($T_r/T_l$), rigidity of the right/left hand ($R_r/R_l$), and alternative finger tapping of the right/left hand ($AFT_r/AFT_l$). In general, regression models involve: (a) the unknown parameters, denoted as $\beta$, (b) the independent variables, $X$, and (c) the dependent variable, $Y$. In our case, the regression analysis aims to investigate whether the regressors $f_i$ can approximate the scale $Y_i$ of the $i$-th symptom's severity on each subject, i.e., $Y_i \approx f_i(X, \beta)$. An inner cross validation loop is used to optimize the $\beta$ parameters of the different models under test. The models that were evaluated for the regression training were Support Vector Regression (Smola and Schölkopf, 2004), Lasso and Ridge Regression (Tibshirani, 1996), Random Forests (Liaw and Wiener, 2002), and Bagging of Linear Regressors (Breiman, 1996). We evaluated the accuracy of each regressor on a subject level to measure its ability to capture the scale of the severity, as well as the test error, by employing Pearson's correlation coefficient and the mean absolute error (MAE). The regression analysis was applied to the in-the-clinic data, and the learned functions $f_i$ that can explain a significant part of the underlying symptom and whose mean absolute test error is below 0.5 (half the distance between the quantized scores) were further evaluated in the in-the-wild setting, as explained in the succeeding section.
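The LOSO scheme with an inner optimization loop can be illustrated with the sketch below. It is a simplified stand-in under stated assumptions: a closed-form single-feature ridge regressor replaces the models listed above, the penalty grid `alphas` is hypothetical, the inner loop is itself leave-one-subject-out over the training subjects, and the subject-level estimate is taken as the median of the session-level predictions.

```python
from statistics import mean, median


def fit_ridge_1d(xs, ys, alpha):
    """Closed-form ridge fit y = a + b*x with penalty alpha on the slope."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / (sxx + alpha) if (sxx + alpha) > 0 else 0.0
    return my - b * mx, b


def mae(y_true, y_pred):
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def pearson(a, b):
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den if den else 0.0


def predict_held_out(sessions, held_out, alpha):
    """Train on all subjects except `held_out`; return the median of its
    session-level predictions. `sessions` holds (subject, feature, score)."""
    train = [(x, y) for s, x, y in sessions if s != held_out]
    a, b = fit_ridge_1d([x for x, _ in train], [y for _, y in train], alpha)
    return median(a + b * x for s, x, _ in sessions if s == held_out)


def loso(sessions, alphas=(0.01, 1.0, 100.0)):
    """Outer LOSO loop with an inner LOSO loop that selects alpha.
    Returns (Pearson r, MAE) computed on subject-level estimates."""
    subjects = sorted({s for s, _, _ in sessions})
    target = {s: y for s, _, y in sessions}  # one clinical score per subject
    y_true, y_pred = [], []
    for test in subjects:
        outer_train = [row for row in sessions if row[0] != test]
        inner_subjects = [s for s in subjects if s != test]

        def inner_error(alpha):
            preds = [predict_held_out(outer_train, s, alpha) for s in inner_subjects]
            return mae([target[s] for s in inner_subjects], preds)

        best_alpha = min(alphas, key=inner_error)
        y_pred.append(predict_held_out(sessions, test, best_alpha))
        y_true.append(target[test])
    return pearson(y_true, y_pred), mae(y_true, y_pred)
```

The held-out subject never contributes to model fitting or to the choice of `alpha`, which is the property the nested scheme is meant to guarantee.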

Regression Models Exploitation In-the-Wild
Each subject $K_d \in G$, where $G$ denotes the set of GData subjects, has produced a sequence of typing sessions $s_j$ and the corresponding feature vectors $X_{s_j}$, where $j \in \{1, \ldots, n_d\}$ and $n_d$ is the total number of sessions for subject $K_d$.

Session Level Analysis
The feature extraction pipeline for typing sessions captured in-the-wild (GData), depicted in Figure 2A, includes a post-hoc filtering component that discards recordings with fewer than eight keystrokes, to foster the statistical validity of the subsequent feature extraction. Each remaining typing session is then subjected to a windowing process that discards windows with fewer than four keystrokes, to be consistent with the DSet processing pipeline. The values of $X_{s_j}$ are computed via aggregation across valid windows. Each session is represented by a feature vector $X^d_j$ for the subject $K_d \in G$. Three features are extracted from each valid typing session from GData, and the learned mappings ($f_i$) are applied to each session.
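The two filtering thresholds and the application of the learned mappings can be sketched as below. The names are hypothetical, the 15 s window length is carried over from the feature definition, and `estimators` is assumed to be a dictionary of already trained callables, one per symptom.

```python
WINDOW_S = 15.0          # window length used in the DSet pipeline (seconds)
MIN_SESSION_TAPS = 8     # sessions with fewer keystrokes are discarded
MIN_WINDOW_TAPS = 4      # windows with fewer keystrokes are discarded


def split_windows(press_times, window_s=WINDOW_S):
    """Group press timestamps into consecutive non-overlapping windows."""
    if not press_times:
        return []
    start = press_times[0]
    buckets = {}
    for t in press_times:
        buckets.setdefault(int((t - start) // window_s), []).append(t)
    return [buckets[k] for k in sorted(buckets)]


def valid_windows(press_times):
    """Return the windows that survive both the session and window filters."""
    if len(press_times) < MIN_SESSION_TAPS:
        return []
    return [w for w in split_windows(press_times) if len(w) >= MIN_WINDOW_TAPS]


def apply_estimators(feature_vector, estimators):
    """Apply each learned mapping f_i to one session feature vector."""
    return {name: f(feature_vector) for name, f in estimators.items()}
```

A session that yields no valid window produces no feature vector and therefore no estimate in this sketch.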

Subject Level Analysis
From a subject's perspective, an aggregation mechanism $F(\cdot)$ over the estimates $f_i(X_{s_j})$ that belong to a period of time $\delta$ is used to characterize the subject's distribution during this time period (see Figure 2B). Each time window $\delta$ can contain a different number of estimates, each associated with a time-stamp $t$. A subject's contribution with fewer than 10 valid feature vectors $X^d_j$ during the period $\delta$ is omitted from the analysis, so that each $\delta$ contains a sufficient number of typing sessions. In the current context of analysis, we examine the median as the aggregation mechanism $F(\cdot)$, to get the most representative sample from the distribution. Moreover, two time windows, $\delta \in \{1, 52\}$ weeks, are used for validating the discrimination power of the estimators. In particular, the time frame of a week ($\delta = 1$) is chosen to include all patterns and habits that can vary during a week (micro-level), whereas all 52 weeks ($\delta = 52$) are set as the global time frame of the analysis for a macro-level perspective.
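A minimal sketch of the median aggregation over a period δ follows. It assumes that each estimate carries a timestamp in seconds and that periods are counted from the subject's first session; both choices, and the function name, are illustrative rather than taken from the study.

```python
from collections import defaultdict
from statistics import median

WEEK_S = 7 * 24 * 3600  # seconds per week


def aggregate(estimates, delta_weeks=1, min_sessions=10):
    """Median of session-level estimates per period of `delta_weeks` weeks.

    `estimates` is a list of (timestamp_seconds, value) pairs for one subject.
    Periods holding fewer than `min_sessions` estimates are omitted.
    Returns {period_index: median_value}.
    """
    if not estimates:
        return {}
    t0 = min(t for t, _ in estimates)  # assumed origin: first session
    buckets = defaultdict(list)
    for t, value in estimates:
        buckets[int((t - t0) // (delta_weeks * WEEK_S))].append(value)
    return {k: median(v) for k, v in sorted(buckets.items())
            if len(v) >= min_sessions}
```

With `delta_weeks=1` the function returns one value per sufficiently populated week, while `delta_weeks=52` collapses a year of contribution into a single subject-level index.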

PD Classification Performance Evaluation
The motor estimates resulting from the aforementioned subject-level analysis are evaluated for their binary classification performance (PD vs. control) by estimating the area under the ROC curve (AUC). The ROC-based discrimination power of the indexes is computed together with a Confidence Interval (CI) obtained from 1,000 bootstraps. Additionally, the sensitivity/specificity metrics corresponding to the optimal ROC-based cut-off point (decision threshold) are estimated by maximizing the Youden Index (Fluss et al., 2005), under the assumption that the costs of false positives and false negatives are equal.
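The evaluation steps above can be sketched in plain Python as follows. The sketch assumes that higher index values indicate PD, computes the AUC through its rank (Mann-Whitney) formulation, and uses a percentile bootstrap that resamples within each group; the resampling scheme and the function names are assumptions, since the text does not specify them.

```python
import random


def auc(scores_pos, scores_neg):
    """P(positive score > negative score); ties count as 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))


def youden_cutoff(scores_pos, scores_neg):
    """Return (threshold, sensitivity, specificity) maximizing
    J = sensitivity + specificity - 1; a score >= threshold is called PD."""
    best = None
    for thr in sorted(set(scores_pos) | set(scores_neg)):
        sens = sum(p >= thr for p in scores_pos) / len(scores_pos)
        spec = sum(n < thr for n in scores_neg) / len(scores_neg)
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, thr, sens, spec)
    return best[1:]


def bootstrap_auc_ci(scores_pos, scores_neg, n_boot=1000, level=0.95, seed=0):
    """Percentile CI of the AUC, resampling each group with replacement."""
    rng = random.Random(seed)
    aucs = sorted(
        auc(rng.choices(scores_pos, k=len(scores_pos)),
            rng.choices(scores_neg, k=len(scores_neg)))
        for _ in range(n_boot)
    )
    lower = aucs[int((1 - level) / 2 * n_boot)]
    upper = aucs[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper
```

Weighting sensitivity and specificity equally in `youden_cutoff` corresponds to the equal-cost assumption stated above.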

Statistical Analysis
Logistic regression tests using the subject status (PD or control) as the dependent variable and the symptom predictions, sex, age, years of education, and usability of smartphones as independent variables are performed to evaluate the statistical significance of the variables' discrimination power. The two groups (PD and controls) of the GData setting are compared in terms of demographics using a two-sided Mann-Whitney U test. Moreover, a two-sided Kolmogorov-Smirnov test for the null hypothesis that two samples are drawn from the same continuous distribution is used to examine whether the raw keystroke dynamics differ significantly between PD patients and controls. The level of statistical significance is set at p < 0.05.

Keystroke Dynamics Distributions
In Figure 3, the raw keystroke dynamics variables under investigation are compared with respect to their distributions as drawn from the DSet and the GData settings. From this figure, a clear similarity in the trend between the two settings can be observed. Moreover, both HT and FT values are statistically different (p < 0.001) in group-wise comparisons of PD patients and controls in the two settings. This supports the initial assumption that the proposed approach is robust enough to transfer knowledge from the in-the-clinic analysis to the in-the-wild one, as described below.

In-the-Clinic Setting
The LOSO analysis showed that the best performing regression models achieved Pearson's correlation coefficients (MAE) of 0.83 (0.39), 0.69 (0.41), and 0.68 (0.55) for predicting dominant-hand R, B/H-K, and right-hand finger tapping, respectively. As the MAE was greater than 0.5 for right-hand finger tapping, overlapping the quantized levels of the UPDRS Part III scores, only dominant-hand R and B/H-K were involved in the subsequent analysis. The predictions of the latter two symptoms and the UPDRS Part III single item scores are visualized in Figure 4. In particular, the dominant-hand R and B/H-K median predicted scores are depicted along with the medical scores, using error bars of 0.5 height. The produced indexes achieve 78 and 70% in predicting the quantized UPDRS medical scores for R and B/H-K, respectively, during the LOSO experiment. The remaining UPDRS single items ($T_{l/r}$, $R_l$, $AFT_l$) related to motor activity cannot be predicted from the keystroke features. This is due to the low Pearson's correlation coefficient values (< 0.35), which are probably caused by the use of the dominant hand during typing (all subjects were right-handed) and the possibly subtle relation of finger movement coordination with hand tremor (see also section 4).

In-the-Wild Setting
As mentioned in section 2.1, after an appropriate subject filtering process, the GData subjects have demographic characteristics matched to those of the DSet participants, in order to avoid any inhomogeneity across the two data settings. Moreover, the results of the statistical analysis tests, tabulated in Table 1, show that the PD patients' group and the healthy controls' group are matched in terms of demographics. Furthermore, ROC curves of the two indexes estimated for each subject are depicted in Figure 5, considering a time frame of δ = 52 weeks. In particular, the estimation of R achieves 0.84 AUC (95% CI: 0.75/0.93) with 0.77/0.8 sensitivity/specificity and 0.79 accuracy, whereas the estimation of B/H-K achieves 0.80 AUC (95% CI: 0.7/0.92) with 0.92/0.63 sensitivity/specificity and 0.70 accuracy in the GData cohort (more diagnostic properties are tabulated in Table 2). In addition, when assessing the diagnostic properties of subjects' contributions within a time frame of one week (δ = 1), the indexes achieve lower discrimination performance, with 0.80 AUC and 0.82/0.65 sensitivity/specificity for R, and 0.78 AUC and 0.86/0.60 sensitivity/specificity for B/H-K. Finally, the discriminative performance of the estimated indexes was found to be statistically significant (p < 0.001) in logistic regression models that included gender, age, years of education, and mobile usability time as co-variates, whereas the other independent variables (see section 2.3.4) did not show any statistical significance.

DISCUSSION
Digital Health is an emerging field that could enhance disease detection and management via the realization of objective and accessible tools that quantify behavioral characteristics. Unobtrusive capturing of data via the natural interaction with digital devices is a key factor in the design of digital tools that meet the need for long-term adherence. Users' habitual patterns are influenced by motor symptoms, even in the early stage of PD, where the motor manifestation is subtle. The underlying behavior can be detected via algorithmic transformation of high-frequency sampled data streams into useful medical indicators that can be interpreted by physicians and assist the longitudinal process of passive monitoring, diagnosis, and treatment. The current study design aims to combine the aforementioned requirements, while the results contribute to the interpretation and the real-life transferability of the developed methods, by exploiting keystroke dynamics during routine typing on a mobile touchscreen, a fine-motor activity of high frequency due to the boom in mobile technology (Sarwar and Soomro, 2013). A machine learning-based estimation of dominant-hand rigidity and bradykinesia/hypokinesia severity is employed using captured keystroke typing features.
The individual items of R and B/H-K, related to fine-motor skills decline in PD, are used as regression targets to provide a granular, data-driven estimation of the specific fine motor symptom, which is more interpretable than a high-level label for the subject (e.g., a binary label or the total UPDRS Part III score). The regression results show that dominant-hand R and B/H-K predictions achieve a low test error and can sufficiently capture the severity of the symptom. Hand tremor and non-dominant hand UPDRS Part III single item scores, however, could not be predicted with the proposed analysis, which resulted in low correlation coefficients (< 0.35). The latter can be explained by the nature of the finger typing information, which can be disentangled into the finger reflexes of pressing the keys (expressed via HT) and the finger movements across the digital screen (FT); actions that can be influenced by R and B/H-K but not directly by tremor. In addition, PD patients were at the early stage of the disease and their UPDRS Part III single item scores were in the lower range (0-2) for both symptoms. The latter reinforces the added value of the results toward capturing PD-specific fine-motor impairment at the early stage of the disease, where the symptoms are mild. Moreover, the expansion of the developed in-the-clinic method to the uncontrolled in-the-wild setting constitutes a step toward remote passive monitoring of users' fine-motor symptoms. The unobtrusive capturing of GData, containing more than 40,000 typing sessions (mean/std: 204/406 per subject) from 210 total users (mean/std weeks of each subject's data contribution: 7/14), highlights the positive effect of unobtrusive data collection on long-term adherence. This overcomes the dropouts seen in recent smartphone-based studies for PD, which

FIGURE 5 | Classification performance of the median predicted estimates of R and B/H-K during the time frame of δ = 52. ROC curves demonstrating the classification performance of the proposed models for estimating individual motor symptoms on the GData (13 PD patients/35 controls). Solid lines represent the mean ROC curve, while shadowed areas delimit the 95% confidence intervals, computed over 1,000 bootstraps. The corresponding AUC values are 0.84 (95% CI: 0.75/0.93) for R and 0.80 (95% CI: 0.7/0.92) for B/H-K.
The diagnostic properties of the produced indexes reached up to 0.8/0.77 sensitivity/specificity in classifying PD and healthy subjects in the in-the-wild setting, when aggregating over the whole time period of data contribution, which matches the satisfactory performance seen in in-the-clinic analyses, i.e., 0.81/0.82 in Iakovakis et al. (2018) and 0.81/0.81 in Arroyo-Gallego et al. (2017). The results are also consistent with the findings of Arroyo-Gallego et al. (2018), who suggest that keystroke dynamics on a physical keyboard can be used for remote PD screening with sensitivity/specificity of 0.73/0.69. Estimated indexes were also aggregated across the time frames of hour, weekday, and week, so as to compute the variance and consistency of the indices over time. Figure 6 exemplifies the longitudinal estimates of the median of the indices coming from a PD patient and a healthy GData user for hour/weekday and for six continuous weeks of data contribution. The estimated indexes of both subjects behave consistently over the time frames of week and weekday, whereas intra-day data are more variable.
Additionally, Figure 7 depicts group-wise comparisons of the time response of the indexes, with an obvious discrimination of the two groups across the different time frame resolutions. The corresponding standard deviations for PD/controls are 0.37/0.34 for hours, 0.27/0.3 for weekday, and 0.24/0.3 for week. These values denote more variable behavior within the day than across weekdays and weeks for both groups. In fact, intra-day variations of controls' fine-motor skills have been previously reported (Van Vugt et al., 2013) to be affected by the circadian rhythm, which can also be considered a factor that might influence the findings here, due to its relation with fine motor movements during smartphone interaction. Also, dopamine plays a substantial role in circadian regulation and timing behavior (Agostino et al., 2011), whereas recent works (Videnovic and Golombek, 2017) present increasing evidence of disruption of circadian function in PD, where a dopamine-based therapy may increase the circadian oscillations. The healthy population tends to reach the peak of its motor coordination and fast reflexes between 14:00 and 16:00 (Bass, 2012; Smolensky and Lamberg, 2015), which may explain the divergence of the groups' median indexes during that specific period of time, as can be seen in Figure 7. Although circadian rhythm in PD is a novel area of research and recent studies identify it as a new therapeutic target (Videnovic and Willis, 2016), interpretable digital health applications can enhance the understanding of the underlying patterns of human behavior, setting the direction for future work.
From a wider perspective, smartphone interaction has been a promising research direction (Pan et al., 2015) for detecting individual PD-related symptoms, such as gait difficulties and hand tremor, through accelerometer recordings during the execution of specific scenarios using a smartphone. Furthermore, fusion of data associated with different PD symptoms captured via smartphone-based tests (Arora et al., 2015) and machine learning have been explored toward PD screening, resulting in a classification performance of 0.96/0.97 sensitivity/specificity. Although the latter study provides evidence for the feasibility of assessing a wide range of motor symptoms through smartphone interaction, data were recorded during guided scenarios, which constrains the scalability of data collection because users must actively participate. The novelty of the current study is that it sets up an interpretable framework for unobtrusive assessment of individual PD symptoms, which can be further used in combination with other data sources, e.g., background and privacy-aware capturing of accelerometer data for tremor assessment during typing or microphone data (voice) for dysarthrophonia assessment during phone calls, to broaden symptom assessment and pave the way for a holistic, objective PD detection tool. Following the same approach proposed in this work, the additional data sources can potentially yield explainable symptom severity indicators that, if combined, can form a fused behavioral vector on which the final decision on the subject's PD status can be based, in a similar way to how diagnosis takes place in clinical practice. The fusion approach can include feeding the time-aggregated (e.g., weekly) behavioral vectors to a decision system, similar to that of Arora et al. (2015), allowing for high-frequency monitoring of the time evolution of both the overall decision and the individual PD symptom indicators. This is the direction that the i-PROGNOSIS European research project (http://www.i-prognosis.eu) follows toward early PD screening in daily living, in the context of which this study has been carried out.
One possible limitation of our study is that patients under dopaminergic therapy who were included in the DSet setting refrained from taking their medication for at least 8 h before their participation in the experiment, instead of the 12 h that usually secures the "practically off" condition. The latter, combined with potential effects of the long-duration response to Levodopa, may have improved the psycho-motor state of these patients and, consequently, their typing cadence, leading to a reduced discrimination performance across the classification methods tested. Nevertheless, the promising results of the symptom estimation in the DSet setting show limited "echoing" effects of dopaminergic therapy on certain study participants' fine motor skills. A second possible limitation of this study is the validity of the users' self-reported demographics in the GData setting, which may introduce noise in the data evaluation. However, using the time frame of one week within the range of 52 weeks creates a longitudinal user profile at the micro and macro levels of analysis, which reveals a data-driven behavior that can be compared across different users. In this way, noticeable deviations could lead to group reorganization; yet, this was not the case here.
FIGURE 7 | Response of the estimated indices across time using δ of (A) daily hours and (B) weekdays, for all subjects grouped according to their health status, i.e., PD (green graph) and controls (blue graph). The solid lines represent the group medians, whereas the shadowed areas denote the upper (75th)/lower (25th) quartiles range, respectively.
Considering the future adoption and extension of the current methods, the effectiveness of the approach will not be affected by subjects' reorganization, because the symptom severity estimators were trained/evaluated on data captured from a medically valid cohort. However, reorganization could happen in the in-the-wild cohort, which was used for reporting the diagnostic properties of the indices. Such group reorganization will not severely affect the findings: since the diagnostic properties of the indices are reported with a confidence interval, the true value of the diagnostic performance will probably lie within the reported confidence intervals if more participants join the study. Moreover, demographic characteristics did not show any statistical significance in the logistic regression tests. Scalability is the main reason the study is designed in this manner, and re-analyzing the data arising from a larger pool of subjects will yield even more robust values for the diagnostic performance, which will further strengthen the medical interpretation of the proposed approach. Toward this, research plans include sampling of subjects for medical evaluation in order to fine-tune the time-aggregation function used in this study toward better modeling of time-related variations of the proposed indicators.
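To illustrate how diagnostic properties can be reported with a confidence interval, the following minimal sketch computes sensitivity/specificity at a fixed threshold and a percentile bootstrap interval over subjects. The percentile bootstrap is an assumed choice for illustration; the interval method actually used in the study is not specified in this section, and all function names and the threshold convention (score at or above the threshold means PD) are hypothetical.

```python
import random


def sensitivity_specificity(labels, scores, threshold):
    """labels: 1 = PD, 0 = control. A subject is flagged as PD when its
    aggregated index is at or above `threshold` (assumed convention)."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def bootstrap_ci(labels, scores, threshold, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for sensitivity and specificity.

    Subjects are resampled with replacement; resamples that miss one of the
    two groups are redrawn, so `labels` must contain both classes.
    """
    rng = random.Random(seed)
    n = len(labels)
    sens, spec = [], []
    while len(sens) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # both PD and controls are present
            se, sp = sensitivity_specificity(ys, [scores[i] for i in idx], threshold)
            sens.append(se)
            spec.append(sp)

    def interval(values):
        values = sorted(values)
        last = len(values) - 1
        return values[int(alpha / 2 * last)], values[int((1 - alpha / 2) * last)]

    return interval(sens), interval(spec)
```

With a small cohort the resulting intervals are wide, which is consistent with the argument that a larger pool of subjects would yield more robust estimates of the diagnostic performance.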
Regarding the transparency of the developed machine learning methods, the current approach aims to quantify the fine-motor skills of the user and transform daily behavior into indices that can be linked to the scores of the physician's assessments. The produced indices can also be reverse-explained by the physician, owing to the initial compact feature representation and plausible correlations with standardized UPDRS items. The use of specific features in the pipeline, which are naturally linked to fine-motor impairment, further enhances the interpretability of the resulting estimations. Specifically, the keystroke dynamics-related features fed to the regression mechanisms are naturally affected by rigidity (muscle stiffness) and bradykinesia (slowness of movement), which cause longer (mean of HT) and more variable (standard deviation of HT) presses of virtual keyboard keys and slower finger coordination across the screen (skewness of FT) during PD patients' typing, when compared to controls. The latter can be interpreted by the physician as an objective projection of the scale of symptom severity onto the specific body part used to perform the task. In a nutshell, the developed approach aims to support physicians, not replace them, and to accelerate PD diagnosis by providing objective tools for remote quantification of the symptoms' footprint on the patient.
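The three features named above (mean of HT, standard deviation of HT, skewness of FT) can be sketched from raw key press/release timestamps as follows. This is an illustrative sketch under stated assumptions: HT is taken as release minus press of the same key, FT as the interval from releasing one key to pressing the next, and skewness as the population (Fisher-Pearson) coefficient. The function names and the input format are hypothetical, and the 15 s windowing described in the caption of Figure 2 is omitted for brevity.

```python
from statistics import mean, pstdev


def hold_times(presses, releases):
    """HT: how long each key is held (release time minus press time)."""
    return [r - p for p, r in zip(presses, releases)]


def flight_times(presses, releases):
    """FT (assumed definition): release of key i to press of key i + 1."""
    return [presses[i + 1] - releases[i] for i in range(len(presses) - 1)]


def skewness(values):
    """Population skewness; returns 0.0 for a constant sequence."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return 0.0
    return mean([((v - m) / s) ** 3 for v in values])


def session_features(presses, releases):
    """Compact feature vector for one typing session (timestamps in ms)."""
    ht = hold_times(presses, releases)
    ft = flight_times(presses, releases)
    return {
        "ht_mean": mean(ht),
        "ht_std": pstdev(ht),
        "ft_skew": skewness(ft),
    }
```

Because each feature maps to a tangible typing behavior (longer or more irregular key presses, asymmetric transitions between keys), a physician can trace an elevated index back to the feature that drives it.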

CONCLUSIONS
In this study, evidence was provided for the real-life use of unobtrusive detection of fine-motor symptoms in an undiagnosed population, using prior information on symptom severity estimation from the in-the-clinic setting. The presented results validate the initial hypothesis that individual symptom severity can be approximated using keystroke dynamics information and that, based on this, PD can be detected via keystroke pattern analysis when data are captured in-the-wild. Separation between PD patients and healthy controls based purely on smartphone keystroke dynamics is possible, and these results could be seen consistently over a longer time frame. Furthermore, the severity expressed by smartphone touchscreen typing corresponds well with the severity evaluated by the neurologists, on which the training algorithms are based. Potential future extensions of the method can include the use of deep learning (LeCun et al., 2015), accompanied by explainable methods (Gunning, 2017), which may reveal better representations of the raw keystroke dynamics and capture more efficiently the latent factors of the symptoms' digital footprint, while also considering other factors in the analysis, such as the circadian rhythm. Finally, embedding such analyses in the operating systems of smartphones could support the growth of mobile health, provided that all ethical guidelines and data regulations regarding privacy and security, such as the General Data Protection Regulation (GDPR), are respected.

FIGURE 1 | Schematic representation of the methodological steps referring to: (A) the DSet keystroke data from touchscreen typing captured in the in-the-clinic setting and (B) the extension of the best-performing models from the DSet to the GData. LOSO-based training/testing is used to evaluate which regressor achieves the best estimation of motor symptoms, using keystroke dynamics features as inputs and UPDRS Part III single-item scores as targets. The regressors that efficiently capture the scale of symptom severity are used in the in-the-wild setting to characterize subjects as PD or control.

FIGURE 2 | Procedural pipeline for processing GData (A) per typing session and (B) per subject. Typing sessions with at least eight keystrokes are considered valid for processing, whereas the rest are omitted. The keystroke dynamics consist of hold time (HT) and flight time (FT) sequences, both of which are split into non-overlapping 15 s windows (W_j). Only windows with at least four keystrokes within the 15 s interval are used further, and features are extracted by computing the mean feature distribution over the corresponding valid windows. Each subject K_d has contributed typing features that are grouped by a time window δ, which is considered valid if it contains more than 10 sessions. Each session X_j^d is transformed by a learned mapping f_k, k ∈ {B/H-K, R}, which was previously trained on the DSet and computes a single numerical score from each typing session. An aggregation mechanism F(•) is applied to each time window δ to characterize the subject's contribution over time.

FIGURE 3 | Box plots of keystroke dynamic variables (FT and HT) distributions when comparing the two data capturing settings (DSet and GData).

FIGURE 4 | Regression estimates (green dots) of the LOSO experiment using keystroke dynamics features and UPDRS Part III single-item scores. The median of the predicted values across the typing sessions is also plotted per subject. Moreover, error bars of height ±0.5 are superimposed to show whether the symptom estimate lies within the span of the physician's score.

FIGURE 6 | Responses of the estimated indices across time using different time resolutions (hours, weekdays, and weeks) for two cases, i.e., a PD patient (blue graph) and a control (green graph). The blue/green line represents the median of the estimated indices, whereas the blue/green dots are the estimates for single typing sessions contributed by the PD patient/control in each period of time.

TABLE 1 | Summary of GData study cohort (48 subjects) demographic and clinical characteristics with respect to each group (PD patients and Healthy).