Machine Learning for the Diagnosis of Parkinson's Disease: A Review of Literature

Diagnosis of Parkinson's disease (PD) is commonly based on medical observations and assessment of clinical signs, including the characterization of a variety of motor symptoms. However, traditional diagnostic approaches may suffer from subjectivity, as they rely on the evaluation of movements that are sometimes subtle to the human eye and therefore difficult to classify, leaving room for misclassification. Meanwhile, early non-motor symptoms of PD may be mild and can be caused by many other conditions. These symptoms are therefore often overlooked, making diagnosis of PD at an early stage challenging. To address these difficulties and to refine the diagnosis and assessment of PD, machine learning methods have been implemented for the classification of PD and healthy controls or patients with similar clinical presentations (e.g., movement disorders or other Parkinsonian syndromes). To provide a comprehensive overview of the data modalities and machine learning methods that have been used in the diagnosis and differential diagnosis of PD, we conducted a literature review of studies published until February 14, 2020, using the PubMed and IEEE Xplore databases. A total of 209 studies were included, extracted for relevant information, and presented in this review, together with an investigation of their aims, sources of data, types of data, machine learning methods, and associated outcomes. These studies demonstrate a high potential for the adoption of machine learning methods and novel biomarkers in clinical decision making, leading to increasingly systematic, informed diagnosis of PD.


INTRODUCTION
Parkinson's disease (PD) is one of the most common neurodegenerative diseases, with a prevalence rate of 1% in the population above 60 years of age, affecting 1-2 people per 1,000 (Tysnes and Storstein, 2017). The estimated global population affected by PD more than doubled from 1990 to 2016 (from 2.5 million to 6.1 million), a result of the growing number of elderly people and increasing age-standardized prevalence rates (Dorsey et al., 2018). PD is a progressive neurological disorder associated with motor and non-motor features (Jankovic, 2008) and compromises multiple aspects of movement, including planning, initiation and execution (Contreras-Vidal and Stelmach, 1995).
As the disease progresses, movement-related symptoms such as tremor, rigidity and difficulties in movement initiation can be observed, prior to cognitive and behavioral alterations including dementia (Opara et al., 2012). PD severely affects patients' quality of life (QoL), social functions and family relationships, and places heavy economic burdens at the individual and societal levels (Johnson et al., 2013; Kowal et al., 2013; Yang and Chen, 2017).
The diagnosis of PD is traditionally based on motor symptoms. Despite the establishment of cardinal signs of PD in clinical assessments, most of the rating scales used in the evaluation of disease severity have not been fully evaluated and validated (Jankovic, 2008). Although non-motor symptoms (e.g., cognitive changes such as problems with attention and planning, sleep disorders, sensory abnormalities such as olfactory dysfunction) are present in many patients prior to the onset of PD (Jankovic, 2008; Tremblay et al., 2017), they lack specificity, are complicated to assess and/or vary from patient to patient (Zesiewicz et al., 2006). Therefore, non-motor symptoms alone do not yet allow for a diagnosis of PD (Braak et al., 2003), although some have been used as supportive diagnostic criteria (Postuma et al., 2015).
Machine learning techniques are increasingly applied in the healthcare sector. As its name implies, machine learning allows a computer program to learn and extract meaningful representations from data in a semi-automatic manner. For the diagnosis of PD, machine learning models have been applied to a multitude of data modalities, including handwritten patterns (Drotár et al., 2015; Pereira et al., 2018), movement (Yang et al., 2009; Wahid et al., 2015; Pham and Yan, 2018), neuroimaging (Cherubini et al., 2014a; Choi et al., 2017; Segovia et al., 2019), voice (Sakar et al., 2013; Ma et al., 2014), cerebrospinal fluid (CSF) (Lewitt et al., 2013; Maass et al., 2020), cardiac scintigraphy (Nuvoli et al., 2019), serum (Váradi et al., 2019), and optical coherence tomography (OCT) (Nunes et al., 2019). Machine learning also allows for combining different modalities, such as magnetic resonance imaging (MRI) and single-photon emission computed tomography (SPECT) data (Cherubini et al., 2014b; Wang et al., 2017), in the diagnosis of PD. By using machine learning approaches, we may therefore identify relevant features that are not traditionally used in the clinical diagnosis of PD and rely on these alternative measures to detect PD in preclinical stages or in atypical forms.
In recent years, the number of publications on the application of machine learning to the diagnosis of PD has increased. Although previous studies have reviewed the use of machine learning in the diagnosis and assessment of PD, they were limited to the analysis of motor symptoms, kinematics, and wearable sensor data (Ahlrichs and Lawo, 2013; Ramdhani et al., 2018; Belić et al., 2019). Moreover, some of these reviews only included studies published between 2015 and 2016 (Pereira et al., 2019). In this study, we aim to (a) comprehensively summarize all published studies that applied machine learning models to the diagnosis of PD, providing an exhaustive overview of data sources, data types, machine learning models, and associated outcomes, (b) assess and compare the feasibility and efficiency of different machine learning methods in the diagnosis of PD, and (c) provide machine learning practitioners interested in the diagnosis of PD with an overview of previously used models and data modalities and the associated outcomes, together with recommendations on how experimental protocols and results could be reported to facilitate reproducibility. As shown in this review, the application of machine learning to clinical and non-clinical data of different modalities has often led to high diagnostic accuracy in human participants, and may therefore encourage the adoption of machine learning algorithms and novel biomarkers in clinical settings to support more accurate and informed decision making.

METHODS

Search Strategy
A literature search was conducted on the PubMed (https://pubmed.ncbi.nlm.nih.gov) and IEEE Xplore (https://ieeexplore.ieee.org/search/advanced/command) databases on February 14, 2020, considering all returned results. The Boolean search strings used are shown in Table 1. No additional filters were applied in the literature search. All retrieved studies were systematically identified, screened and extracted for relevant information following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Moher et al., 2009).

Inclusion and Exclusion Criteria
Studies that used machine learning methods and satisfied one or more of the following criteria were included: (1) Classification of PD from healthy controls (HC), (2) Classification of PD from Parkinsonism (e.g., progressive supranuclear palsy (PSP) and multiple system atrophy (MSA)), and (3) Classification of PD from other movement disorders (e.g., essential tremor (ET)).
Studies falling into one or more of the following categories were excluded: (1) Studies related to Parkinsonism and/or diseases other than PD that did not involve classification or detection of PD (e.g., differential diagnosis of PSP, MSA, and other atypical Parkinsonian disorders), (2) Studies not related to the diagnosis of PD (e.g., subtyping or severity assessment, analysis of behavior, disease progression, treatment outcome prediction, identification and localization of brain structures, or parameter optimization during surgery), (3) Studies related to the diagnosis of PD that performed analysis and assessed model performance at the sample level (e.g., classification using individual MRI scans without aggregating scan-level performance to the patient level), (4) Classification of PD from non-Parkinsonism (e.g., Alzheimer's disease), (5) Studies that did not use metrics measuring classification performance, (6) Studies that used organisms other than humans (e.g., Caenorhabditis elegans, mice or rats), (7) Studies that did not provide sufficient or accurate descriptions of the machine learning methods, datasets or subjects used (e.g., did not provide sample size, or incorrectly described the dataset(s) used), (8) Publications that were not original journal articles or conference proceedings papers (e.g., review and viewpoint papers), and (9) Publications in languages other than English.
For studies published online first and archived in another year, "year of publication" was defined as the year during which the study was published online. If this information was unavailable, the year in which the article was copyrighted was regarded as the year of publication. For studies that introduced novel models and used existing models merely for comparison, information related to the novel models was extracted. Classification of PD and scans without evidence for dopaminergic deficit (SWEDD) was treated as subtyping (Erro et al., 2016).

Study Objectives
To outline the different goals and objectives of included studies, we have further categorized them based on the type of diagnosis and their general aim. From the perspective of diagnostics, these studies could be divided into (a) the diagnosis or detection of PD (which compares data collected from PD patients and healthy controls), (b) differential diagnosis (discrimination between patients with idiopathic PD and patients with atypical Parkinsonism), and (c) sub-typing (discrimination among subtypes of PD).
Included studies were also analyzed for their general aim. Studies focusing on the development of novel technical approaches to be used in the diagnosis of Parkinson's disease, e.g., new machine learning and deep learning models and architectures, data acquisition devices, and feature extraction algorithms that had not been previously presented and/or employed, were defined as (a) "methodology" studies. Studies that validated and investigated the application of previously published and validated machine learning and deep learning models, and/or the feasibility of introducing data modalities that are not commonly used in the machine learning-based diagnosis of PD (e.g., CSF data), were defined as (b) "clinical application" studies.

Model Evaluation
In the present study, accuracy was used to compare the performance of machine learning models. For each data type, we summarized the type of machine learning model that led to the per-study highest accuracy. However, in some studies, only one machine learning model was tested. Therefore, we defined the "model associated with the per-study highest accuracy" as (a) the only model implemented and used in a study or (b) the model that achieved the highest accuracy, or that was highlighted, in studies that used multiple models. Results are expressed as mean (SD).
For studies reporting both training and testing/validation accuracy, testing or validation accuracy was considered. For studies that reported both validation and test accuracy, test accuracy was considered. For studies with more than one dataset or classification problem (e.g., HC vs. PD and HC vs. idiopathic hyposmia vs. PD), accuracy was averaged across datasets or classification problems. For studies that reported classification accuracy for each group of subjects individually, accuracy was averaged across groups. For studies reporting a range of accuracies or accuracies given by different cross validation methods or feature combinations, the highest accuracies were considered. In studies that compared HC with diseases other than PD or PD with diseases other than Parkinsonism, diagnosis of diseases other than PD or Parkinsonism (e.g., amyotrophic lateral sclerosis) was not considered. Accuracy of severity assessment was not considered.
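To make these aggregation rules concrete, the following minimal sketch (in Python, with entirely hypothetical study names and accuracy values) derives a per-study accuracy by averaging each model across datasets or classification problems, keeping the best model, and then computing the mean (SD) across studies:

```python
# A minimal sketch of the aggregation rules described above, using
# hypothetical study names and accuracy values (not actual extracted data).
import statistics

# Accuracy of each tested model on each dataset/classification problem.
studies = {
    "study_A": {"svm": [0.91]},                              # one model, one dataset
    "study_B": {"svm": [0.85, 0.89], "knn": [0.80, 0.82]},   # two problems
    "study_C": {"cnn": [0.78, 0.80, 0.97]},                  # three datasets
}

per_study = []
for models in studies.values():
    # Average each model across datasets/problems, then keep the best model.
    best = max(statistics.mean(accs) for accs in models.values())
    per_study.append(best)

mean_acc = 100 * statistics.mean(per_study)
sd_acc = 100 * statistics.stdev(per_study)
print(f"Accuracy: {mean_acc:.1f} ({sd_acc:.1f}) %")  # mean (SD), as reported
```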

RESULTS

Literature Review
Based on the search criteria, we retrieved 427 (PubMed) and 215 (IEEE Xplore) search results, leading to a total of 642 publications. After removing duplicates, we screened 593 publications by title and abstract, following which we excluded 313 based on the exclusion criteria and examined 280 full-text articles. Overall, we included 209 research articles for data extraction (Figure 1).
The 209 studies had an average sample size of 184.6 (289.3), with a smallest sample size of 10 (Kugler et al., 2013), and a largest sample size of 2,289 (Tracy et al., 2019; Figure 2A). For studies that recruited human participants (n = 93), data from an average of 118.0 (142.9) participants were collected (range: 10-920; Figure 2B). For other studies (n = 116), an average sample size of 238.1 (358.5) was reported (range: 30-2,289; Figure 2B). For a description of average accuracy reported in these studies in relation to sample size, see Figure 2C.

Study Objectives
In the included studies, although "diagnosis of PD" was used as the search criterion, machine learning had been applied for diagnosis (PD vs. HC), differential diagnosis (idiopathic PD vs. atypical Parkinsonism) and sub-typing (differentiation of sub-types of PD) purposes. Most studies focused on diagnosis (n = 168, 80.4%) or differential diagnosis (n = 20, 9.6%). Fourteen studies performed both diagnosis and differential diagnosis (6.7%), 5 studies (2.4%) diagnosed and subtyped PD, and 2 studies (1.0%) included diagnosis, differential diagnosis, and subtyping.
Among the included studies, a total of 132 studies (63.2%) implemented and tested a machine learning method, a model architecture, a diagnostic system, a feature extraction algorithm, or a device for non-invasive, low-cost data acquisition that had not been established for the detection and early diagnosis of PD (methodology studies). In 77 studies (36.8%), previously proposed and validated machine learning methods were tested in clinical settings for the early detection of PD, the identification of novel biomarkers, or the examination of uncommonly used data modalities for the diagnosis of PD (e.g., CSF; clinical application studies).

Source of Data
In the 132 studies that proposed or tested novel machine learning methods (i.e., methodology studies), a majority used data from publicly available databases (n = 89, 67.4%). Data collected from human participants were used in 41 studies (31.1%) and the two remaining studies (1.5%) used commercially sourced data or data from both existing public databases and local participants specifically recruited for the study. Out of the 77 studies that used machine learning models in clinical settings (i.e., clinical application studies), 52 (67.5%) collected data from human participants, 22 (28.6%) used data from public databases. Two (2.6%) studies obtained data from a database and a local cohort, and 1 (1.3%) study collected data postmortem.

Number of Subjects
The average sample size was 137.1 for the 132 methodology studies (Figure 3B). For the 41 out of 132 studies that used data from recruited human participants, the average sample size was 81.7 (Figure 3C). In the 77 studies on clinical applications, the average sample size was 266.2 (Figure 3B). For the 52 out of 77 clinical studies that collected data from recruited participants, the average sample size was 145.9 (Figure 3C).

Machine Learning Methods Applied to the Diagnosis of PD
We divided the 448 machine learning models used across the 209 studies into 8 categories: (1) support vector machine (SVM) and variants (n = 132 from 130 studies), (2) neural networks (n = 76 from 62 studies), (3) ensemble learning (n = 82 from 57 studies), (4) nearest neighbor and variants (n = 33 from 33 studies), (5) regression (n = 31 from 31 studies), (6) decision tree (n = 28 from 27 studies), (7) naïve Bayes (n = 26 from 26 studies), and (8) discriminant analysis (n = 12 from 12 studies). A small percentage of models did not fall into any of these categories (n = 28, used in 24 studies). On average, 2.14 machine learning models were applied to the diagnosis of PD per study; one study may have used more than one category of models. For a full description of the data types used to train each type of machine learning model and the associated outcomes, see the Supplementary Materials and Supplementary Figure 2.
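For readers less familiar with these model families, the sketch below instantiates one representative scikit-learn estimator per category; the specific estimators and hyperparameter values are our own illustrative assumptions and are not prescribed by the reviewed studies:

```python
# One illustrative scikit-learn representative per model category; the
# chosen estimators and hyperparameters are assumptions for illustration.
from sklearn.svm import SVC                          # (1) SVM and variants
from sklearn.neural_network import MLPClassifier    # (2) neural networks
from sklearn.ensemble import RandomForestClassifier # (3) ensemble learning
from sklearn.neighbors import KNeighborsClassifier  # (4) nearest neighbor
from sklearn.linear_model import LogisticRegression # (5) regression
from sklearn.tree import DecisionTreeClassifier     # (6) decision tree
from sklearn.naive_bayes import GaussianNB          # (7) naive Bayes
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis  # (8)

models = {
    "SVM": SVC(kernel="rbf", C=1.0),
    "NN": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500),
    "EL": RandomForestClassifier(n_estimators=100),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "regr": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(max_depth=5),
    "NB": GaussianNB(),
    "DA": LinearDiscriminantAnalysis(),
}
# All estimators share the same interface, e.g.:
#   models["SVM"].fit(X_train, y_train); models["SVM"].predict(X_test)
```

Because all of these estimators expose the same fit/predict interface, several categories can be evaluated on the same extracted features with little additional code, which may partly explain the average of more than two models per study.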

Performance Metrics
Various metrics have been used to assess the performance of machine learning models (Table 3). The most common metric was accuracy (n = 174, 83.3%), which was used individually (n = 55) or in combination with other metrics (n = 119) in model evaluation. Among the 174 studies that used accuracy, some combined accuracy with sensitivity (i.e., recall) and specificity (n = 42), with sensitivity, specificity and AUC (n = 16), or with recall (i.e., sensitivity), precision and F1 score (n = 7) for a more systematic understanding of model performance. A total of 35 studies (16.7%) used metrics other than accuracy. In these studies, the most frequently used performance metrics were AUC (n = 19), sensitivity (n = 17), and specificity (n = 14), and the three were often applied together (n = 9), with or without other metrics.
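As a reference for how these metrics relate to one another, the following sketch computes them from a set of hypothetical predictions (labels and prediction scores are invented for illustration; 1 = PD, 0 = HC):

```python
# Computing the metrics named above from hypothetical predictions.
import numpy as np
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1, 0.85, 0.45])
y_pred = (y_score >= 0.5).astype(int)  # threshold the scores at 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy   ", accuracy_score(y_true, y_pred))
print("sensitivity", recall_score(y_true, y_pred))  # recall = sensitivity
print("specificity", tn / (tn + fp))                # no direct sklearn helper
print("precision  ", precision_score(y_true, y_pred))
print("F1         ", f1_score(y_true, y_pred))
print("AUC        ", roc_auc_score(y_true, y_score))  # uses scores, not labels
```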
Given that studies used different data modalities and sources, and sometimes different samples of the same database, a summary of model performance, instead of direct comparison across studies, is provided.
Voice Data (n = 55)
Voice recordings from the UCI machine learning repository were used in 42 studies (Table 4). Among these 42 studies, 39 used accuracy to evaluate classification performance, with an average accuracy of 92.0 (9.0) %. The lowest accuracy was 70.0% and the highest was 100.0%. Eight out of the 9 studies that collected voice recordings from human participants used accuracy as the performance metric, and the average, lowest and highest accuracies were 87.7 (6.8) %, 77.5%, and 98.6%, respectively. The 4 remaining studies used data from the Neurovoz corpus (n = 1), the mPower database (n = 1), the PC-GITA database (n = 1), or data from both the UCI machine learning repository and human participants (n = 1). Two of these 4 studies used accuracy to evaluate model performance and reported accuracies of 81.6 and 91.7%.

Movement Data (n = 51)
Of the 51 studies, 43 used accuracy to assess model performance and achieved an average accuracy of 89.1 (8.3) %, ranging from 62.1% (Prince and de Vos, 2018) to 100.0% (Surangsrirat et al., 2016; Joshi et al., 2017; Pham, 2018; Pham and Yan, 2018; Figure 4A). One study reported three machine learning methods (SVM, nearest neighbor and decision tree) that each achieved the per-study highest accuracy (Félix et al., 2019). Out of the 51 studies, the per-study highest accuracy was achieved with SVM in 22 studies (41.5%), with ensemble learning in 13 studies (24.5%), with neural network in 9 studies (17.0%), with nearest neighbor in 4 studies (7.5%), with discriminant analysis in 1 study (1.9%), with naïve Bayes in 1 study (1.9%), and with decision tree in 1 study (1.9%). Models that do not belong to any of the given categories were associated with the per-study highest accuracy in two studies (3.8%; Figure 4B).
Among the 33 studies that collected movement data from recruited participants, 25 used accuracy in model evaluation, leading to an average accuracy of 87.0 (7.3) % (Table 5). The lowest and highest accuracies were 64.1% and 100.0% (Surangsrirat et al., 2016), respectively. Fifteen studies used data from the PhysioNet database (Table 5) and had an average accuracy of 94.4 (4.6) %, a lowest accuracy of 86.4% and a highest accuracy of 100%. Three studies used data from the mPower database (n = 2) or data sourced from another study (n = 1), and the average accuracy of these studies was 80.6 (16.2) %.

MRI (n = 36)
The average accuracy of the 32 studies that used accuracy to evaluate the performance of machine learning models was 87.5 (8.0) %. In these studies, the lowest accuracy was 70.5% and the highest accuracy was 100.0% (Cigdem et al., 2019; Figure 4A). Out of the 36 studies, the per-study highest accuracy was obtained with SVM in 21 studies (58.3%), with neural network in 8 studies (22.2%), with discriminant analysis in 3 studies (8.3%), with regression in 2 studies (5.6%), and with ensemble learning in 1 study (2.8%). One study (2.8%) obtained the highest per-study accuracy using models that do not belong to any of the given categories (Figure 4B). In 8 of 36 studies, neural networks were directly applied to MRI data, while the remaining studies used machine learning models to learn from extracted features, e.g., cortical thickness and volume of brain regions, to diagnose PD.

FIGURE 4 | Data type, machine learning models applied, and accuracy. (A) Accuracy achieved in individual studies and average accuracy for each data type. Error bar: standard deviation. (B) Distribution of machine learning models applied per data type. MRI, magnetic resonance imaging; SPECT, single-photon emission computed tomography; PET, positron emission tomography; CSF, cerebrospinal fluid; SVM, support vector machine; NN, neural network; EL, ensemble learning; k-NN, nearest neighbor; regr, regression; DT, decision tree; NB, naïve Bayes; DA, discriminant analysis; other: data/models that do not belong to any of the given categories.
Out of the 17 studies that used MRI data from the PPMI database, 16 used accuracy to evaluate model performance, with an average accuracy of 87.9 (8.0) %. The lowest and highest accuracies were 70.5 and 99.9%, respectively (Table 6). In 16 out of the 19 studies that acquired MRI data from human participants, accuracy was used to evaluate classification performance, and an average accuracy of 87.0 (8.1) % was achieved. The lowest reported accuracy was 76.2% and the highest was 100% (Table 6).

SPECT (n = 14)
The average accuracy of the 12 out of 14 studies that used accuracy to measure the performance of machine learning models was 94.4 (4.2) % (Table 7). The lowest reported accuracy was 83.2% (Hsu et al., 2019) and the highest was 97.9% (Oliveira F. et al., 2018; Figure 4A). SVM led to the highest per-study accuracy in 10 out of 14 studies (71.4%). The highest per-study accuracy was obtained with neural networks in 3 studies (21.4%) and with regression in 1 study (7.1%; Figure 4B).

PET (n = 4)
All 4 studies used sensitivity and specificity (Table 7) in model evaluation, while 3 also used accuracy. The average accuracy of the 3 studies was 85.6 (6.6) %, with a lowest accuracy of 78.16% (Segovia et al., 2015) and a highest accuracy of 90.72% (Figure 4A). Half of the 4 studies (50.0%) obtained the highest per-study accuracy with SVM (Segovia et al., 2015; Wu et al., 2019) and the other half (50.0%) with neural networks (Figure 4B).

CSF (n = 5)
All 5 studies used AUC, instead of accuracy, to evaluate machine learning models (Table 7). The average AUC was 0.8 (0.1); the lowest AUC was 0.6825 (Maass et al., 2020) and the highest was 0.839 (Maass et al., 2018). Two studies obtained the highest per-study AUC with ensemble learning, 2 studies with SVM and 1 study with regression (Figure 4B).

Other Types of Data (n = 10)
Only 5 of these 10 studies used accuracy to measure the performance of machine learning models (Table 7). An average accuracy of 91.9 (6.4) % was obtained, with a lowest accuracy of 84.85% (Shi et al., 2018) and a highest accuracy of 100% (Nuvoli et al., 2019; Figure 4A). Out of the 10 studies, 5 (50%) achieved the per-study highest accuracy with SVM, 3 (30%) with ensemble learning, 1 (10%) with decision trees and 1 (10%) with machine learning models that do not belong to any given categories (Figure 4B).

Combination of More Than One Data Type (n = 18)
Out of the 18 studies that used more than one type of data, 15 used accuracy in model evaluation (Table 7). An average accuracy of 92.6 (6.1) % was obtained, and the lowest and highest accuracies among the 15 studies were 82.0% (Prince et al., 2019) and 100.0% (Cherubini et al., 2014b), respectively (Figure 4A). The per-study highest accuracy was achieved with ensemble learning in 6 studies (33.3%), with neural network in 5 studies (27.8%), with SVM in 4 studies (22.2%), with regression in 1 study (5.6%) and with nearest neighbor in 1 study (5.6%). One study (5.6%) obtained the highest per-study accuracy using machine learning models that do not belong to any given categories (Figure 4B).

TABLE 7 | Studies that applied machine learning models to handwritten patterns, SPECT, PET, CSF, other data types and combinations of data to diagnose PD (n = 67).

DISCUSSION

Principal Findings
In this review, we present results from published studies that applied machine learning to the diagnosis and differential diagnosis of PD. Since the number of included papers was relatively large, we focused on a high-level summary rather than a detailed description of methodology and direct comparison of outcomes of individual studies. We also provide an overview of sample size, data source and data type, for a more in-depth understanding of methodological differences across studies and their outcomes. Furthermore, we assessed (a) how large the participant pools/datasets were, (b) to what extent new data (i.e., unpublished, raw data acquired from locally recruited human participants) were collected and used, and (c) the feasibility of machine learning and the possibility of introducing new biomarkers in the diagnosis of PD. Overall, methodology studies that proposed and tested novel technical approaches (e.g., machine learning and deep learning models, data acquisition devices, and feature extraction algorithms) have repeatedly shown that features extracted from data modalities including voice recordings and handwritten patterns could lead to high patient-level diagnostic performance, while facilitating accessible and non-invasive data acquisition. Nevertheless, only a small number of studies further validated these technical approaches in clinical settings using local human participants recruited specifically for these studies, indicating a gap between model development and clinical application. A per-study diagnostic accuracy above chance level was achieved in all studies that used accuracy in model evaluation (Figure 4A). Apart from studies using CSF data, which measured model performance with AUC, classification accuracy associated with the 8 other data types ranged between 85.6% (PET) and 94.4% (SPECT), with an average of 89.9 (3.0) %. Therefore, although the small number of studies for some data types may not allow for a generalizable prediction of how well these data types can help differentiate PD from HC or atypical Parkinsonian disorders, the application of machine learning to a variety of data types led to high accuracy in the diagnosis of PD. In addition, an accuracy significantly above chance level was achieved with all machine learning models (Supplementary Table 1), while SVM, neural networks and ensemble learning were among the most popular model choices, all showing broad applicability to a variety of data modalities. Moreover, when compared with other models, they led to the per-study highest classification accuracy in >50% of all cases (50.7, 51.9, and 52.3%, respectively; Supplementary Table 1). Despite the high diagnostic accuracy and performance reported, in a number of studies, data splitting strategies and the use of cross-validation were not specified.
For data modalities such as 3D MRI scans, when 2D slices are extracted from 3D volumes, multiple slices are generated for one subject. Having data from the same subject across training, validation and test sets leads to a biased data split (Wen et al., 2020), causing data leakage and overestimation of model performance, thus compromising the reproducibility of published results. As previously discussed (Belić et al., 2019), although satisfactory diagnostic outcomes could be achieved, the sample size in a few studies was extremely small (<15 subjects). The application of some machine learning models, especially neural networks, typically relies on a large dataset. Nevertheless, collecting data from a large pool of participants remains challenging in clinical studies, and the data generated are commonly of high dimensionality and small sample size (Vabalas et al., 2019). To address this challenge, one solution is to combine data from a local cohort with public repositories, including PPMI, the UCI machine learning repository, PhysioNet and many others, depending on the type of data that have been collected from the local cohort. Furthermore, when a great difference in group size is observed (i.e., the class imbalance problem), labeling all samples as the majority class may lead to an undesired high accuracy. In this case, evaluating machine learning models with other metrics, including precision, recall and F1 score, is recommended (Jeni et al., 2013).
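A subject-wise split can be enforced with group-aware cross-validation. The sketch below is a minimal illustration, assuming synthetic data in which each subject contributes 10 samples (e.g., slices or gait segments); GroupKFold guarantees that no subject appears in both the training and test folds:

```python
# A minimal sketch of subject-wise cross-validation to avoid the leakage
# described above; data are randomly generated for illustration only.
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))           # e.g., 200 slices/segments, 30 features
y = rng.integers(0, 2, size=200)         # hypothetical PD (1) vs. HC (0) labels
subjects = np.repeat(np.arange(20), 10)  # 20 subjects, 10 samples each

# GroupKFold keeps all samples of a subject within a single fold.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(SVC(), X, y, cv=cv, groups=subjects)
print(f"subject-wise accuracy: {scores.mean():.2f} ({scores.std():.2f})")
```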
High diagnostic accuracy of PD has been achieved in clinical settings, and as shown in the present study, machine learning approaches have also reached high accuracy; models including SVM and neural networks are particularly useful for (a) diagnosing PD using data modalities that have been overlooked in clinical decision making (e.g., voice), and (b) identifying features of high relevance from these data. For example, the use of machine learning models with feature selection techniques allows for assessing the relative importance of features in a large feature space in order to select the most differentiating ones, which is conventionally challenging using manual approaches. For the discovery of novel markers allowing for non-invasive diagnostic options with relatively high accuracy, e.g., handwritten patterns, only a small number of studies have been conducted, mostly using data from published databases. Given that these databases generally included handwritten patterns from a small number of diagnosed PD patients, sometimes under 15, it would be of great importance to validate the use of handwritten patterns for the early diagnosis of PD in clinical studies of a larger scale. In the meantime, diagnosing PD using more than one data modality has led to promising results. Accordingly, supplying clinicians with non-motor data and machine learning approaches may support clinical decision making in patients with ambiguous symptom presentations, and/or improve diagnosis at an earlier stage.
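As a hedged illustration of such machine learning-based feature selection, the sketch below ranks features with recursive feature elimination (RFE) wrapped around a linear SVM on a synthetic dataset; the choice of technique and all parameters are our own assumptions rather than those of any specific reviewed study:

```python
# An illustrative sketch of feature selection with recursive feature
# elimination and a linear SVM; the dataset is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# 50 candidate features, of which only 5 carry class information.
X, y = make_classification(n_samples=120, n_features=50, n_informative=5,
                           random_state=0)

# Iteratively discard the least informative features until 5 remain.
selector = RFE(SVC(kernel="linear"), n_features_to_select=5, step=5)
selector.fit(X, y)
print("selected feature indices:", np.where(selector.support_)[0])
```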
An issue observed in many included studies was the insufficient or inaccurate description of methods or results. Some studies failed to provide accurate information on the number and type of subjects used (for example, methodology studies on the early diagnosis of PD missing a table summarizing subject characteristics, making it challenging to determine the stage of PD in recruited patients), or on how machine learning models were implemented, trained and tested. Occasionally, authors omitted basic information such as the number of subjects and their medical conditions and instead referred to another publication. Although we attempted to list model hyperparameters and cross-validation strategies in the data extraction table, many included studies did not make this information available in the main text, leading to potential difficulties in replicating the results. In addition, rounding errors and inconsistent reporting of results were observed. Furthermore, although we treated the differentiation of PD from SWEDD as subtyping, there is ongoing controversy regarding whether it should be considered differential diagnosis or subtyping (Erro et al., 2016; Chou, 2017; Kwon et al., 2018). Given these limitations, clinicians interested in adopting machine learning models or implementing diagnostic systems based on novel biomarkers are advised to interpret published results with care. Further, in this context we would like to stress the need for uniform reporting standards in studies using machine learning.
In both machine learning research and clinical settings, appropriately interpreting published results and methodologies is a necessary step toward an understanding of state-of-the-art methods. Therefore, vagueness in reporting not only compromises the interpretation of results but also makes further methodological developments based on published research unnecessarily challenging. Moreover, for medical doctors interested in learning how machine learning methods could be applied in their domains, insufficient description of methods may lead to incorrect model implementation and failure of replication.
To enable efficient replication of published results, detailed descriptions of (a) model and architecture (hyperparameters, number and type of layers, layer-specific parameter settings, regularization strategies, activation functions), (b) implementation (programming language, machine learning and deep learning libraries used, model training and testing, metrics and model evaluation, validation strategy, optimization), and (c) version numbers of software/libraries used for both preprocessing and model implementation, are often desirable, as newer software versions may lead to differences in pre-processing and model implementation stages (Chepkoech et al., 2016).
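One possible way to operationalize such reporting is a machine-readable methods summary accompanying the article; the field names and values below are merely a suggested sketch, not an established reporting standard:

```python
# An illustrative, machine-readable methods summary; all field names and
# values are suggestions for reporting, not an established standard.
import json
import platform
import numpy
import sklearn

report = {
    "python": platform.python_version(),
    "numpy": numpy.__version__,
    "scikit-learn": sklearn.__version__,
    "model": {"type": "SVC", "kernel": "rbf", "C": 1.0, "gamma": "scale"},
    "validation": "10-fold cross-validation with subject-wise splits",
    "metrics": ["accuracy", "sensitivity", "specificity", "AUC"],
}
print(json.dumps(report, indent=2))  # archive alongside code and results
```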
Given the prevalence of imbalanced datasets in the medical sciences, reporting model performance with a confusion matrix may give rise to a more comprehensive understanding of a model's ability to discriminate between PD and healthy controls. In the meantime, due to the costs associated with the acquisition of patient data, researchers often need to expand data collected from a local cohort with data sourced from publicly available databases or published studies. Nevertheless, the unclear description of data acquisition and pre-processing protocols in some published studies may lead to challenges in the integration of newly acquired and previously published data. Taken together, to facilitate early, refined diagnosis of PD and efficient application of novel machine learning approaches in clinical settings, and to allow for improved reproducibility of studies on machine learning-based diagnosis and assessment of PD, higher transparency in reporting data collection, pre-processing protocols, model implementation, and study outcomes is required.
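The toy example below illustrates why accuracy alone can mislead when classes are imbalanced and how a confusion matrix exposes the problem (class sizes and predictions are invented for illustration):

```python
# Why a confusion matrix is informative on imbalanced data: a classifier
# that labels everyone "PD" still reaches 90% accuracy here (toy numbers).
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([1] * 90 + [0] * 10)  # 90 PD patients, 10 healthy controls
y_pred = np.ones(100, dtype=int)        # degenerate majority-class classifier

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.9, yet misleading
print(confusion_matrix(y_true, y_pred, labels=[0, 1]))
# [[ 0 10]   -> all 10 HC misclassified as PD (specificity = 0)
#  [ 0 90]]  -> 90 PD correctly identified
```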

Limitations
In the present study, we excluded research articles in languages other than English and results published in the form of conference abstracts, posters, and talks. Despite the ongoing discussion of the advantages and importance of including conference abstracts in systematic reviews and reviews (Scherer and Saldanha, 2019), conference abstracts often do not report sufficient key information, which is why we had to exclude them. However, this may lead to a publication and result bias. In addition, since the aim of the present review was to assess and summarize published studies on the detection and early diagnosis of PD, a few large-scale, multi-centric studies on subtyping and/or severity assessment of PD were excluded. Given the current challenges in the subtyping, severity assessment and prognosis of PD, a further step toward a more systematic understanding of the application of machine learning to neurodegenerative diseases would be to review these studies.
Moreover, due to the high inter-study variance in data sources and in the presentation of results, it was challenging to directly compare the outcomes associated with each type of model across studies, as some studies did not indicate whether model performance was evaluated on a test set, and/or did not report results for models that did not yield the best per-study performance. Results of published studies were therefore discussed and summarized based on the data and machine learning models used. For data modalities such as PET (n = 4) or CSF (n = 5), the number of studies was too small, despite the high total number of studies included, to reliably assess the general performance of machine learning techniques applied to these data.

CONCLUSIONS
To the best of our knowledge, the present study is the first review to include results from all studies that applied machine learning methods to the diagnosis of PD. Here, we presented the included studies in a high-level summary, providing access to information including (a) machine learning methods that have been used in the diagnosis of PD and the associated outcomes, (b) types of clinical, behavioral and biometric data that could be used for rendering more accurate diagnoses, (c) potential biomarkers for assisting clinical decision making, and (d) other highly relevant information, including databases that could be used to enlarge and enrich smaller datasets. In summary, machine learning-assisted diagnosis of PD shows high potential for a more systematic clinical decision-making process, while the adoption of novel biomarkers may give rise to easier access to PD diagnosis at an earlier stage. Machine learning approaches therefore have the potential to provide clinicians with additional tools to screen, detect or diagnose PD.

DATA AVAILABILITY STATEMENT
The original contributions generated for the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
JM conceived and designed the study, collected the data, performed the analysis, and wrote the paper. CD and JF supervised the research. All authors contributed to the article and approved the submitted version.