The differential diagnosis value of radiomics-based machine learning in Parkinson’s disease: a systematic review and meta-analysis

Background: In recent years, radiomics has been increasingly utilized for the differential diagnosis of Parkinson's disease (PD). However, the application of radiomics in PD diagnosis still lacks sufficient evidence-based support. To address this gap, we carried out a systematic review and meta-analysis to evaluate the diagnostic value of radiomics-based machine learning (ML) for PD.
Methods: We systematically searched the Embase, Cochrane, PubMed, and Web of Science databases up to November 14, 2022. The Radiomics Quality Score (RQS) was used to evaluate the quality of the included studies. The outcome measures were the c-index, which reflects the overall accuracy of the model, as well as sensitivity and specificity. In this meta-analysis, we assessed the differential diagnostic value of radiomics-based ML for Parkinson's disease and various atypical parkinsonism syndromes (APS).
Results: Twenty-eight articles with a total of 6,057 participants were included. The mean RQS score of the included articles was 10.64, corresponding to a relative score of 29.56%. The pooled c-index, sensitivity, and specificity of radiomics for predicting PD were 0.862 (95% CI: 0.833–0.891), 0.91 (95% CI: 0.86–0.94), and 0.93 (95% CI: 0.87–0.96) in the training set, and 0.871 (95% CI: 0.853–0.890), 0.86 (95% CI: 0.81–0.89), and 0.87 (95% CI: 0.83–0.91) in the validation set, respectively. Additionally, the pooled c-index, sensitivity, and specificity of radiomics for differentiating PD from APS were 0.866 (95% CI: 0.843–0.889), 0.86 (95% CI: 0.84–0.88), and 0.80 (95% CI: 0.75–0.84) in the training set, and 0.879 (95% CI: 0.854–0.903), 0.87 (95% CI: 0.85–0.89), and 0.82 (95% CI: 0.77–0.86) in the validation set, respectively.
Conclusion: Radiomics-based ML can serve as a potential tool for PD diagnosis and performs excellently in distinguishing PD from APS. The support vector machine (SVM) model exhibits excellent robustness when the number of samples is relatively abundant. However, given the diverse implementation processes of radiomics, we expect that more large-scale, multi-class image data will be included to develop radiomics-based intelligent tools with broader applicability, promoting the application and development of radiomics in the diagnosis and prediction of Parkinson's disease and related fields.
Systematic review registration: https://www.crd.york.ac.uk/PROSPERO/display_record.php?RecordID=383197, identifier ID: CRD42022383197.


Introduction
Parkinson's disease (PD) is the second most common neurodegenerative illness, and its prevalence is anticipated to more than double over the next 30 years (GBD 2016 Parkinson's Disease Collaborators, 2018; Tolosa et al., 2021). The increasing number of patients will impose a significant medical and economic burden on society. Currently, the diagnosis of PD depends on a set of criteria proposed by the International Parkinson and Movement Disorder Society (MDS) in 2015 (Postuma et al., 2015). During this process, clinicians rely on limited support and exclusion criteria, as well as "red flags," to evaluate patients, which is time-consuming and labor-intensive and depends on the experience of clinical experts. Moreover, in the early stages, it is challenging to identify PD accurately and in a timely manner because its symptoms overlap with those of atypical parkinsonism syndromes (APS) (Respondek et al., 2019). Studies have shown that about 20-30% of patients with multiple system atrophy (MSA) or progressive supranuclear palsy (PSP) were initially misdiagnosed as idiopathic Parkinson's disease (IPD) in clinical practice (Saeed et al., 2020). In addition, among the motor subtypes of PD, the postural instability and gait difficulty subtype (PIGD) causes greater damage to neurological function than the tremor-dominant subtype (TD) and responds relatively poorly to deep brain stimulation (DBS) and levodopa therapy. Given the above, early and accurate identification of PD and differentiation of its subtypes have profound clinical significance for developing individualized treatment plans and predicting prognosis.
Radiomics has emerged from the development of artificial intelligence and precision medicine. It extracts minable high-dimensional data from clinical images (such as PET, MRI, and CT) (Lambin et al., 2012, 2017). By analyzing these data and constructing classification models, radiomics can be utilized alone or in conjunction with histological, demographic, genomic, or proteomic data to support evidence-based clinical decision-making (Rizzo et al., 2018). In recent years, radiomics has gradually demonstrated significant clinical utility in the diagnosis, differential diagnosis, severity assessment, and prediction of disease progression in Parkinson's disease (PD), parkinsonian syndromes, and other neurodegenerative disorders, through the utilization of various imaging techniques (Adeli et al., 2016; Klyuzhin et al., 2016; Rahmim et al., 2016).
However, radiomics encompasses diverse methods in its implementation and is highly dependent on the expertise of clinical experts. The diagnostic performance of radiomics therefore needs to be comprehensively evaluated from an evidence-based perspective. Systematic reviews, as a component of evidence-based medicine, can provide relevant guidance to some extent in formulating clinical strategies. Therefore, we conducted this study to evaluate the accuracy of radiomics-based machine learning in diagnosing Parkinson's disease (PD) and to summarize some of the challenges currently faced by radiomics, in order to provide a reference for its future applications.

Materials and methods
Our systematic review and meta-analysis were conducted based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA 2020) guidelines (Moher et al., 2009). The PRISMA checklist is provided in Supplementary Table 1. This study was registered on PROSPERO (ID: CRD42022383197).

Inclusion and exclusion criteria

Inclusion criteria
(1) Patients clinically diagnosed with Parkinson's disease (PD) with complete imaging data.
(2) Fully constructed radiomics ML models for the diagnosis of PD.
(3) Studies without external validation were also included.
(4) Published studies employing the same or different machine learning (ML) algorithms on a single dataset.
(5) Studies reported in English.

Exclusion criteria
(2) Studies that only performed differential factor analysis and did not construct a complete ML model.
(3) Studies that lacked outcome indicators for ML model prediction accuracy (ROC, c-statistic, c-index, sensitivity, specificity, accuracy, recall, precision, confusion matrix, diagnostic fourfold table, F1 score, calibration curve).

Literature search strategy
We performed a comprehensive search of the PubMed, Cochrane, Embase, and Web of Science databases for all available literature up to November 14th, 2022, utilizing a combination of subject headings and free-text terms. Our search was not restricted by language or geographic region. The detailed search strategy is shown in Supplementary Table 2.

Study selection and data extraction
We imported the retrieved literature into EndNote and removed duplicate articles. The remaining articles were screened based on their titles and abstracts. For potentially relevant studies, we downloaded and read the full-text articles to determine their eligibility according to the inclusion and exclusion criteria. Before extracting the data, a standardized electronic spreadsheet was developed. The extracted information included the title, first author, publication year, country, study type, patient source, PD diagnostic criteria, radiomics source, whether complete imaging protocols were recorded, number of imaging reviewers involved, whether pre-experiments were conducted under different imaging parameters, whether repeated measurements were performed at different times, imaging segmentation software, texture extraction software, number of PD cases/images, total number of cases/images, number of PD cases/images in the training set, number of cases/images in the training set, method of generating the validation set, number of PD cases in the validation set, number of cases in the validation set, variable selection method, type of model used, modeling variables, whether radiomics scores were constructed, overfitting evaluation, whether the code and data were made publicly available, and model evaluation indicators.
The literature screening and data extraction were independently conducted by two researchers (JB and XW), and cross-checking was performed afterward. In cases of disagreement, a third researcher (WH) was consulted to resolve the issue.

Quality assessment
The methodological quality of the included studies was assessed by the two researchers (JB and XW) using the Radiomics Quality Score (RQS), and cross-checking was conducted afterward (Lambin et al., 2012). If there was a dispute, a third researcher (WH) was asked to assist in the decision-making process. The RQS is a radiomics-specific quality assessment tool that scores the quality of the original study design based on 16 items (e.g., whether the image acquisition method and data were described in detail, whether measures were taken to prevent overfitting or multiple segmentations were performed, whether the study was prospective, and whether and how the model was validated). Each criterion is assigned a numerical value that corresponds to its impact on radiomics research, and the total score ranges from −8 to 36, which is then converted into a percentage score (0–100%). This score represents the rigor of model development and the evaluation of the study's impact on the field.

Notes to Table 2: The MDS PD criteria, the Movement Disorder Society PD criteria; the UK PD SBB criteria, the UK PD Society Brain Bank criteria. In articles 8, 13, and 27, different research subjects were used for training and validation, and the specific sample numbers are listed in Table 2. Articles 5, 11, 17, 25, and 28 adopted cross-validation, so there is no specific validation-set sample number. Article 5 used the same dataset as articles 4 and 15. Articles 20 and 24 were based on the same dataset. PCA, principal component analysis; RFE, recursive feature elimination; RFECV, recursive feature elimination with cross-validation; ICC, intraclass correlation coefficient; ANOVA, analysis of variance; MSA-c, multiple system atrophy-cerebellar type; MSA-p, multiple system atrophy-parkinsonian type.
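For concreteness, the raw-to-percentage conversion described above can be sketched as follows (a minimal illustration; the function name is ours, and flooring negative totals at 0% is our assumption about the convention):

```python
def rqs_percentage(raw_score: float, max_score: float = 36.0) -> float:
    """Convert a raw RQS total (range -8 to 36) into a 0-100% relative score.

    Negative totals are floored at 0% (our assumption about the convention).
    """
    return round(max(raw_score, 0.0) / max_score * 100.0, 2)
```

For example, `rqs_percentage(10.64)` yields 29.56, matching the mean relative score reported in this review for a mean raw score of 10.64.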

Outcome measures
The primary outcome measure of our systematic review was the c-index, which reflects the overall accuracy of the ML model. However, when there is a severe imbalance in the number of cases between the observation group and the control group, the c-index may not be sufficient to reflect the accuracy of the ML model for disease diagnosis. As a result, our primary outcome measures also included sensitivity and specificity. Outcomes were assessed for three tasks: (a) diagnosis of Parkinson's disease, (b) differential diagnosis of Parkinson's disease (comparing idiopathic PD patients and APS patients), and (c) classification of Parkinson's disease subtypes (comparing TD and PIGD). This study reported the c-index with a 95% confidence interval (CI), which reflected the accuracy of the ML models. In cases where the original literature lacked a 95% confidence interval or standard error for the c-index, these were estimated using the formula proposed by Debray et al. (2019). The meta-analysis of sensitivity and specificity requires the diagnostic fourfold table (true negatives, true positives, false negatives, and false positives), but few original studies directly reported one. Thus, we calculated the fourfold table by combining sensitivity and specificity with the number of cases. Where sensitivity and specificity were missing, Origin 2020 was used to extract them from the ROC curve. A random-effects model was used to perform the meta-analysis of the overall accuracy of the ML models, as reflected by the c-index, while a bivariate mixed-effects model was used for the meta-analysis of sensitivity and specificity (Reitsma et al., 2005). Statistical analysis was performed using Stata 15.0 (Stata Corporation, USA). A p-value < 0.05 was considered statistically significant.

Bar chart (quality evaluation table).
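The back-calculation of the diagnostic fourfold table from reported sensitivity, specificity, and group sizes can be sketched as follows (a minimal illustration; the function name is ours, and rounding to the nearest whole count is our assumption):

```python
def fourfold_table(sensitivity: float, specificity: float,
                   n_positive: int, n_negative: int) -> dict:
    """Reconstruct TP/FN/TN/FP counts from reported sensitivity,
    specificity, and the number of diseased and non-diseased cases."""
    tp = round(sensitivity * n_positive)   # true positives among diseased
    tn = round(specificity * n_negative)   # true negatives among non-diseased
    return {"TP": tp, "FN": n_positive - tp,
            "TN": tn, "FP": n_negative - tn}
```

For instance, a model reporting sensitivity 0.90 and specificity 0.80 on 50 patients and 100 controls yields TP = 45, FN = 5, TN = 80, FP = 20.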

Quality analysis
Radiomic differences that could be related to underlying gene-protein expression patterns broaden the perception of radiomics and biology. Nine studies (…) There were 42 fourfold tables for diagnosis that were available and could be directly or indirectly extracted in the training set, and the pooled sensitivity and specificity were 0.91 (95% CI: 0.86–0.94) and 0.93 (95% CI: 0.87–0.96), respectively. There were 60 models in the validation set, and the sensitivity and specificity for disease diagnosis were 0.86 (95% CI: 0.81–0.89) and 0.87 (95% CI: 0.83–0.91), respectively, as depicted in Figure 3, Table 2, and Supplementary Table 5.
Among all the constructed ML models, the support vector machine (SVM) and logistic regression (LR) showed ideal predictive performance in the training and validation sets with larger sample sizes. Meanwhile, attention should also be paid to other models, such as CNN and LASSO, which demonstrated good diagnostic performance despite the limited number of such models included in this study. Including more models in future studies can help verify their diagnostic potential.
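To make the comparison concrete, here is a minimal, hedged sketch of how an SVM classifier of the kind pooled above might be trained and evaluated (scikit-learn assumed; the features are simulated for illustration and do not come from any included study):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Simulated "radiomics-like" feature matrices for two well-separated
# groups (e.g., PD vs. control) -- purely illustrative data
X = np.vstack([rng.normal(0.0, 1.0, (100, 5)),   # class 0
               rng.normal(3.0, 1.0, (100, 5))])  # class 1
y = np.repeat([0, 1], 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = SVC(kernel="rbf").fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # held-out classification accuracy
```

In practice the held-out accuracy, sensitivity, and specificity of such a model are what the pooled estimates above summarize across studies.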

Differential diagnosis of PD and APS
Regarding the differential diagnosis between PD and APS, a total of 41 ML models reported a c-index, with a pooled c-index of 0.866 (95% CI: 0.843-0.889) in the training set, while in the validation set, 43 ML models reported a c-index, with a pooled c-index of 0.879 (95% CI: 0.854-0.903). The training set of 41 models had a pooled sensitivity and specificity of 0.86 (95% CI: 0.84-0.88) and 0.80 (95% CI: 0.75-0.84), respectively. In the validation set, the pooled sensitivity and specificity were 0.87 (95% CI: 0.85-0.89) and 0.82 (95% CI: 0.77-0.86), respectively. These results are detailed in Figure 4, Table 3 and Supplementary Table 6. Notably, the SVM model maintained good discrimination accuracy even though a relatively large number of SVM models were included in the analysis.

Overfitting evaluation
For the diagnosis and differential diagnosis of PD, no overfitting was observed in the ML models. Likewise, within each differential diagnosis, no overfitting was observed for the most commonly used ML model when a relatively sufficient number of models was available. Detailed information is shown in Supplementary Tables 5–9.

Discussion
Our meta-analysis results indicated that radiomics demonstrated excellent diagnostic accuracy in PD diagnosis, with a pooled sensitivity and specificity of 0.91 and 0.93 in the training set, and 0.86 and 0.87 in the validation set, respectively. Furthermore, radiomics-based ML has good discrimination performance in differentiating PD from APS and classifying PD subtypes.
In recent years, researchers have made significant progress in exploring biomarkers for the diagnosis of Parkinson's disease (PD) (Parkinson Progression Marker Initiative, 2011; Tolosa et al., 2021). A meta-analysis of ML based on blood gene features for the prediction of idiopathic PD reported a sensitivity of 0.72 and a specificity of 0.67 (Falchetti et al., 2020). Kalyakulina et al. (2022) conducted a meta-analysis of ML based on DNA methylation for differentiating PD cases from controls, with a classification accuracy of 0.76 using uncoordinated data and over 0.95 using coordinated data. A review by di Biase et al. (2020) reported an accuracy of over 0.83 for PD diagnosis using ML based on gait feature testing. A review by Kwon et al. (2022) demonstrated that integrating clinically relevant biomarkers, such as metabolomics, proteomics, and microRNA omics data from cerebrospinal fluid, can serve as a powerful method for identifying PD and MSA. These research results demonstrate that diagnostic models based on different variables perform well in PD diagnosis. However, there have been no studies evaluating or integrating radiomics, and the differentiation of PD from atypical parkinsonian syndromes (APS), as well as the classification of PD subtypes, is rarely discussed. Previous studies have used conventional neuroimaging methods such as PET (Brajkovic et al., 2017), MRI, and molecular imaging (Atkinson-Clement et al., 2017; Loftus et al., 2023) for PD diagnosis based on visual assessment or statistical parametric mapping (SPM) analysis. Despite their high diagnostic accuracy, combining radiomics with artificial intelligence can save time and energy, reduce examination costs, and even improve diagnostic accuracy (Wu et al., 2019).
Previous studies have demonstrated that clinical factors, such as olfactory function (Alonso et al., 2021a,b), speech features, motor data, handwriting patterns, cardiac scintigraphy, cerebrospinal fluid (CSF), and serum markers, are closely associated with the diagnosis and severity assessment of Parkinson's disease (PD) and should not be disregarded when constructing diagnostic models (Mei et al., 2021; Rana et al., 2022). Halligan et al. (2021) have recommended that multivariable models include clinical imaging biomarkers to evaluate their cumulative contribution to overall outcomes. A review by Zhang (2022) has shown that multimodal data, based on ML using imaging and clinical features, can enhance the accuracy of PD diagnosis and early detection. Additionally, Makarious et al. (2022) have demonstrated in their review that ML models combining multimodal data are superior to single-biomarker models, and their model has been validated on the PD Biomarker Program (PDBP) dataset. The ten studies included in this meta-analysis (Cao et al., 2020; Pang et al., 2020, 2022; Shu et al., 2020; Hu et al., 2021; Li et al., 2021; Zhang et al., 2021; Sun et al., 2022; Zhao et al., 2022) also revealed that …

Meta-analysis results of c-index for differential diagnosis between PD and PSP based on radiomics-based machine learning (validation set).
TABLE 5 Meta-analysis results of sensitivity and specificity for differential diagnosis between PD and PSP based on radiomics-based machine learning.

Therefore, future radiomics analyses should incorporate other relevant variables to build more reliable models, and radiomic features can be added to existing diagnostic models to improve their diagnostic accuracy. This study is the first systematic review and meta-analysis of radiomics-based ML in the diagnosis of PD and the differentiation of PD from APS. This study revealed that the main brain regions commonly used for the diagnosis of PD were located in the substantia nigra-corpus striatum system and some related areas such as the cerebral cortex, which is consistent with the pathological mechanisms and features of PD. Some non-motor symptoms (olfactory disorder, depression, cognitive impairment, etc.), used as non-radiomics variables in ML models, had good value in diagnosing PD. Furthermore, we found that the major brain regions currently and commonly used to differentiate PD from APS were located in the basal ganglia system, especially the putamen. UPDRS scores, as non-radiomics variables in ML models, were of good value in distinguishing PD from APS. The radiomics features commonly used to build ML models include first-order statistics, shape features, and textural features, such as the gray-level co-occurrence matrix (GLCM), gray-level difference matrix (GLDM), and gray-level run-length matrix (GLRLM).
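As an illustration of one texture-feature family named above, a single-offset gray-level co-occurrence matrix (GLCM) and its contrast can be computed from a quantized image as follows (a minimal sketch with illustrative names; established pipelines typically rely on dedicated packages such as pyradiomics):

```python
import numpy as np

def glcm(img: np.ndarray, levels: int, dx: int = 1, dy: int = 0) -> np.ndarray:
    """Count co-occurrences of gray levels at pixel offset (dx, dy)."""
    m = np.zeros((levels, levels), dtype=int)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            i2, j2 = i + dy, j + dx
            if 0 <= i2 < h and 0 <= j2 < w:
                m[img[i, j], img[i2, j2]] += 1
    return m

# Tiny 3x3 image quantized to 3 gray levels
img = np.array([[0, 0, 1],
                [1, 2, 2],
                [2, 2, 0]])
m = glcm(img, levels=3)  # horizontal neighbor pairs

# GLCM contrast: sum over (i, j) of p(i, j) * (i - j)^2
p = m / m.sum()
i_idx, j_idx = np.indices(p.shape)
contrast = float((p * (i_idx - j_idx) ** 2).sum())
```

For this image, six horizontal pairs are counted and the contrast evaluates to 6/6 = 1.0; real pipelines aggregate such statistics over many offsets and gray levels.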
We attempted to categorize models by type to determine the best model, but the number of some model types, such as CNN, was limited owing to their recent emergence as deep learning (DL) techniques and possible biases (Ching et al., 2018; Choi et al., 2020). DL has demonstrated greater potential for very large datasets containing thousands or millions of cases (Camacho et al., 2018), whereas research datasets typically contain only hundreds of patients, making conventional ML more suitable and cost-effective for building models for research purposes. In our study, DL also demonstrated good diagnostic prediction performance, but we cannot draw definitive conclusions owing to the limited number of included studies; further research is needed to confirm these findings. Meanwhile, the SVM model demonstrated excellent robustness when the number of samples was relatively abundant. Additionally, we found that MRI was the main imaging modality used with radiomics to predict PD diagnosis in clinical practice. In future work, incorporating data from various imaging modalities could further enhance diagnostic capabilities. Our findings may advance the field of digital therapy and provide theoretical evidence for developing ML models for diagnosing PD in the future.
However, this study has certain limitations. First, radiomics currently lacks a standardized operational guideline, which leads to variations in region-of-interest (ROI) delineation and texture feature extraction among researchers. Even when multiple researchers are involved, it appears challenging to eliminate the impact of these variations. Additionally, the use of diverse dimensionality-reduction or variable-selection methods may contribute to high heterogeneity among radiomics studies targeting the same clinical question. These factors may therefore introduce significant heterogeneity into systematic reviews of radiomics, and such heterogeneity is difficult to avoid until standardized operational guidelines are widely adopted.
Second, we observed that the included studies had relatively low RQS scores, mainly because the RQS scale leans toward critical appraisal of radiomics research. Additionally, the RQS scale may be unsuitable for some models in clinical practice, making it difficult for some studies to obtain high scores. Moreover, many related studies have a retrospective, single-center design and use internal validation or resampling methods (e.g., cross-validation), resulting in poor generalizability of the models and limiting the integration of ML models into clinical environments. Therefore, in the future, images from different hospitals and research centers are needed to externally validate prediction models so that they can adapt to a wider range of clinical scenarios. Furthermore, not all models are suitable for clinical practice, so the clinical effectiveness of diagnostic models must be rigorously evaluated against current diagnostic standards.
Imaging plays an indispensable role in the clinical diagnosis and treatment process. However, the interpretation of imaging data currently relies primarily on the expertise of clinical experts. In this regard, developing an intelligent radiomics reading tool based on standardized criteria would provide significant assistance to novice clinicians, especially in the diagnosis and treatment of complex diseases. This assistance in radiomics-based interpretation is crucial for clinical practice. Furthermore, promoting the development of radiomics can bring substantial value to the initial screening and diagnosis of many diseases, particularly in economically and medically underdeveloped regions.
However, radiomics currently faces several unavoidable challenges, with significant biases present in certain aspects of its implementation. The development of radiomics has not adequately guarded against excessive parameter tuning, nor has it involved repeated measurements of the same patient at different time points (although this incurs certain costs, it is necessary for the development of such a tool). Moreover, ROI delineation heavily relies on the expertise and knowledge of clinical experts. Therefore, in the development process, it is essential to incorporate ROI delineations from clinicians at different levels of experience to generate imaging data, followed by the extraction of radiomics features using specific software. We observed strong correlations among some of the extracted radiomics variables, making the selection of modeling variables a challenging task. Hence, it is crucial to compare different methods and identify the optimal variable-selection approach to build ML models while avoiding overfitting. Additionally, in constructing ML models, it may be advantageous to prioritize logistic regression (LR), as it offers good visualization and relatively straightforward nomograms. We hope that better standards for radiomics and ML will be established in the future, covering the standardization of image acquisition, segmentation, feature extraction, statistical analysis, and reporting formats, to achieve reproducibility and facilitate clinical application.
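The correlated-feature problem described above can be illustrated with a simple greedy correlation filter that drops any feature highly correlated with an already-kept one (a minimal sketch; the 0.9 threshold and the feature names are illustrative assumptions, and more principled methods such as LASSO or RFE are common in the included studies):

```python
import numpy as np

def drop_correlated(X: np.ndarray, names: list, threshold: float = 0.9) -> list:
    """Greedily keep features whose absolute Pearson correlation with
    every previously kept feature is at or below the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Simulated radiomics features: f1 is a rescaled copy of f0 (|r| = 1)
rng = np.random.default_rng(1)
f0 = rng.normal(size=50)
f2 = rng.normal(size=50)
X = np.column_stack([f0, 2.0 * f0, f2])
selected = drop_correlated(X, ["f0", "f1", "f2"])
```

Here `f1` is dropped because it is perfectly correlated with `f0`, leaving `["f0", "f2"]`; on real radiomics matrices the threshold and traversal order both affect which of a correlated pair survives.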

Conclusion
Our study suggested that radiomics-based ML exhibited high sensitivity and specificity in diagnosing Parkinson's disease (PD), discriminating PD from atypical parkinsonian syndromes (APS), and distinguishing different subtypes of PD. This approach can serve as a potential method for screening, detecting, and diagnosing PD, making a significant contribution to clinical decision-making systems. However, owing to the current lack of standardized operational guidelines, radiomics still faces numerous challenges in its applications.

Data availability statement
The original contributions presented in this study are included in the article/Supplementary material; further inquiries can be directed to the corresponding author.