Iron Imaging as a Diagnostic Tool for Parkinson's Disease: A Systematic Review and Meta-Analysis

Background: Parkinson's disease (PD) is a progressive neurodegenerative disease whose main neuropathological feature is the loss of dopaminergic neurons of the substantia nigra (SN). An increase in iron content in the SN has also been reported in postmortem studies and in imaging studies using iron-sensitive MRI techniques. However, MRI results are variable across studies. Objectives: We performed a systematic meta-analysis of SN iron imaging studies in PD to better understand the role of iron-sensitive MRI quantification in distinguishing patients from healthy controls. We also studied the factors that may influence iron quantification and analyzed the correlations between demographic and clinical data and iron load. Methods: We searched PubMed and ScienceDirect databases (from January 1994 to December 2019) for studies that analyzed iron load in the SN of PD patients using T2*, R2*, susceptibility-weighted imaging (SWI), or quantitative susceptibility mapping (QSM) and compared the values with healthy controls. Details for each study regarding participants, imaging methods, and results were extracted. The effect size and 95% confidence interval (CI) were calculated for each study, as well as the pooled weighted effect size for each marker over studies. The correlations of technical and clinical metrics with iron load were then analyzed. Results: Forty-six articles fulfilled the inclusion criteria including 27 for T2*/R2* measures, 10 for SWI, and 17 for QSM (3,135 patients and 1,675 controls). Eight of the articles analyzed both R2* and QSM. A notable effect size was found in the SN in PD for R2* increase (effect size: 0.84, 95% CI: 0.60 to 1.08), for SWI measurements (1.14, 95% CI: 0.54 to 1.73), and for QSM increase (1.13, 95% CI: 0.86 to 1.39). Correlations between imaging measures and Unified Parkinson's Disease Rating Scale (UPDRS) scores were mostly observed for QSM.
Conclusions: The consistent increase in MRI measures of iron content in PD across the literature using R2*, SWI, or QSM techniques confirmed that these measurements provided reliable markers of iron content in PD. Several of these measurements correlated with the severity of motor symptoms. Lastly, QSM appeared more robust and reproducible than R2* and more suited to multicenter studies.


INTRODUCTION
Parkinson's disease (PD) is a progressive neurodegenerative disease whose main neuropathological characteristic is the loss of dopaminergic neurons of the substantia nigra (SN) pars compacta (SNc) (1). Degeneration of dopaminergic neurons in the SN of PD patients is accompanied by an increase in iron content. Iron is necessary for body homeostasis, oxygen transport, and central nervous system development, but its capacity to produce reactive oxygen species that lead to oxidative stress can have a deleterious effect on the SN of PD patients. Iron also plays an important role in the neurodegenerative processes associated with PD (2-4).
Iron is a paramagnetic element that induces magnetic field inhomogeneities, that is, differences in the local magnetic field relative to the mainly diamagnetic surrounding brain tissues. Iron-induced local field inhomogeneities increase spin-spin interactions, thus accelerating the transverse relaxation of the MRI signal (5). This property can be exploited to estimate iron content using MRI based on a reduction in T2* relaxation time or an increase in R2* (1/T2*), phase changes in susceptibility-weighted imaging (SWI), or increased susceptibility values on quantitative susceptibility mapping (QSM). Based on these techniques, iron-sensitive MRI provides a noninvasive estimation of iron content as shown in primate and postmortem studies in humans (6,7). Recent studies using iron-sensitive MRI in PD have investigated whether iron increase in basal ganglia, particularly in the SN, can be used as a biomarker in PD diagnosis and follow-up of iron content in the disease.
Studies using iron-sensitive MRI to quantify iron content in PD have reported variable results. Some have reported increased iron content over the global SN (8-11). Others have only reported increased iron content in some SN subregions (12,13) or have not reported any increase in iron levels (14). Moreover, the role of iron for monitoring disease progression in PD and its correlation with clinical symptoms is still under debate (15-17). Finally, to estimate iron content, a wide variety of techniques have been employed, and no systematic comparison of the results from these studies has yet been carried out. Thus, to elucidate the current role of iron-sensitive MRI in PD and its potential application as a biomarker of PD diagnosis, we carried out a systematic review of publications employing iron-sensitive MRI to study the SN in PD. We sought to determine whether MRI of iron using R2*, SWI, or QSM measurements could successfully distinguish PD patients from healthy controls (HCs), showing the pathological increase in iron of the SN. We also investigated the factors that could influence iron quantification and analyzed the correlations between demographic and clinical data and iron load in the SN.

Articles Review
The study was performed in accordance with the "Preferred Reporting Items for Systematic Reviews and Meta-analysis" (PRISMA) statement and checklist (Supplementary Table 1) (18).
To identify all the relevant literature on iron-sensitive MRI in the SN in PD, PubMed and ScienceDirect databases (from January 1994 to December 2019) were searched. A combination of the following search terms was used: ("Parkinson" OR Parkinsonism OR substantia nigra) AND ("magnetic resonance imaging" OR MRI) AND (R2* OR SWI OR QSM OR susceptibility). All titles and abstracts from the retrieved articles were screened, and the full text of those that could be eligible was obtained. Reference lists of identified studies were also screened for additional studies. Two independent assessors (CSM and NP) performed this literature search, selected all relevant studies based on the Patient, Intervention, Comparison, Outcome, and Study type (PICOS) guidelines, and extracted all the information on iron estimation from the selected studies (18).
Criteria for study inclusion were publication as a full-text original article written in English, the use of iron-sensitive MRI as a diagnostic tool (T2*, R2*, SWI, or QSM), availability of iron level estimation in the SN, and the differentiation of participants with PD from HCs. For PD, a probable diagnosis based on standard diagnostic criteria was considered sufficient for inclusion (19). Articles that also used non-iron-sensitive MRI techniques, analyzed regions of interest other than the SN, or investigated pathologies other than PD were not rejected, provided that measurements in the SN were available. In the case of multiple publications on the same or overlapping populations, the study describing results in the largest number of subjects was included in the meta-analysis (20,21). Studies in the same study populations were included when they reported results in different parts of the SN or in the ipsilateral and contralateral hemispheres separately. For longitudinal trials, only the data from the first period were included. Criteria for study exclusion comprised the unavailability of any iron-sensitive MRI analysis (R2*, QSM, or SWI) in the SN, the unavailability of numerical results, or the absence of an HC group.
For each study, when available, the following information on the subject population was extracted: mean age of PD and HC subjects, disease duration, severity of PD [Hoehn and Yahr (HY) stage and Unified Parkinson's Disease Rating Scale (UPDRS)], and medication dose. For iron imaging, the following information was extracted: magnetic field strength, scanner vendor, number of echoes for T2* measurements, type of measurements (R2* or T2*, SWI, or QSM), and region-of-interest (ROI) location. Mean and standard deviation (SD) of the different metrics were recorded (R2*, SWI, and QSM). One article only provided the ranges of the values (22). Four articles did not report values as mean ± SD but as median (range) (10,16,23,24).

Statistical Meta-Analysis
Statistical parameter computation was performed using in-house software written in MATLAB R2016a (The MathWorks, Inc., Natick, MA, USA). The meta-analysis was conducted in R version 3.5.1 (R Development Core Team, 2018) using the meta package (25). A random-effects model based on the restricted maximum likelihood estimator of the between-study variance was used.

Data Extraction
The means and SDs of R2*, SWI, and QSM measurements in the SN were extracted from the tables or, when tables were unavailable, from the body of the manuscript. If values were only available as part of a diagram, they were extracted by manual measurement in an image editing tool (GIMP, version 2.8) on three separate occasions (1 day apart) and averaged.
In the studies where only the median value and range of values were available, the mean and SD were estimated as previously described (26).
To combine T2* and R2* values from separate studies into a single analysis, all T2* values were converted to R2* using the formula R2* = 1/T2*.
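The conversion above can be illustrated with a minimal sketch. It assumes that T2* is reported in milliseconds and that R2* is wanted in s⁻¹ (the unit used for R2* elsewhere in this article); the function name is hypothetical. Only the conversion of a single value is shown, since the text does not state how SDs were converted.

```python
def t2star_ms_to_r2star(t2star_ms: float) -> float:
    """Convert a T2* relaxation time (ms, assumed unit) to R2* (s^-1).

    Implements R2* = 1/T2*; the factor 1000 converts ms to s.
    """
    if t2star_ms <= 0:
        raise ValueError("T2* must be positive")
    return 1000.0 / t2star_ms


# Hypothetical example value: a T2* of 25 ms corresponds to an R2* of 40 s^-1.
example_r2star = t2star_ms_to_r2star(25.0)
```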

Effect Size
Effect size was computed as the standardized mean difference (Hedges' g): the mean of the HC group was subtracted from that of the PD patients, and the difference was divided by the pooled standard deviation (25). Each g was weighted by the inverse of its variance and adjusted for small sample bias (27). The pooled effect size was calculated separately for R2*, SWI, and QSM. To allow comparability within the meta-analysis, when the SN was subdivided into several ROIs and R2*, SWI, or QSM values for the entire SN were not available, the mean value over the SN region was calculated (10,17,28-33). If the mean values were presented separately for PD patients with different severity levels, they were weighted and averaged. Because SWI articles used different techniques, Hedges' g was taken as the absolute value of the result. An effect size of g > 0.70 was considered a large effect. The 95% confidence interval (CI) was calculated using the standard error (SE). A fixed-effect or random-effects (restricted maximum likelihood) model was used based on the Q statistic.
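As an illustration of the effect size defined above, the following sketch computes Hedges' g with the usual small-sample correction factor and a normal-approximation 95% CI. The variance formula is the standard textbook approximation and is an assumption here, since the article does not spell it out; the function name and the example numbers are hypothetical.

```python
import math


def hedges_g(mean_pd, sd_pd, n_pd, mean_hc, sd_hc, n_hc):
    """Return (g, se, ci_low, ci_high) for the PD minus HC difference."""
    df = n_pd + n_hc - 2
    # Pooled standard deviation of the two groups.
    pooled_sd = math.sqrt(
        ((n_pd - 1) * sd_pd ** 2 + (n_hc - 1) * sd_hc ** 2) / df
    )
    d = (mean_pd - mean_hc) / pooled_sd
    # Small-sample bias correction factor (assumed standard form).
    j = 1.0 - 3.0 / (4.0 * df - 1.0)
    g = j * d
    # Approximate sampling variance of d, scaled by j^2 to get that of g.
    var_d = (n_pd + n_hc) / (n_pd * n_hc) + d ** 2 / (2.0 * (n_pd + n_hc))
    se = math.sqrt(j ** 2 * var_d)
    return g, se, g - 1.96 * se, g + 1.96 * se


# Hypothetical R2* values (s^-1) for one study: PD 40 ± 5, HC 35 ± 5, n = 20 each.
g, se, ci_low, ci_high = hedges_g(40.0, 5.0, 20, 35.0, 5.0, 20)
```

The inverse of `se ** 2` is the weight that such a study would receive when effect sizes are pooled.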

Between-Study Heterogeneity
The across-study heterogeneity for all the articles included in the meta-analysis was analyzed by calculating Cochran's Q and the I² statistic. I² values range from 0% (no heterogeneity) to 100% (high heterogeneity); values of 25%, 50%, and 75% have been suggested as benchmarks of low, moderate, and high heterogeneity, respectively (34).
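The heterogeneity statistics can be sketched as follows. The code computes Cochran's Q and I² from per-study effect sizes and variances, then a random-effects pooled estimate. Note that the between-study variance is estimated here with the DerSimonian-Laird method as a simple stand-in, not with the restricted maximum likelihood estimator used in this meta-analysis; the function name and example numbers are hypothetical.

```python
def heterogeneity(effects, variances):
    """Return (Q, I2 in percent, tau2, random-effects pooled effect)."""
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    # Fixed-effect (inverse-variance) pooled estimate.
    fixed = sum(w * g for w, g in zip(weights, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(w * (g - fixed) ** 2 for w, g in zip(weights, effects))
    df = len(effects) - 1
    # I2: share of total variability attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    # DerSimonian-Laird between-study variance (stand-in for REML).
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c)
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * g for w, g in zip(re_weights, effects)) / sum(re_weights)
    return q, i2, tau2, pooled


# Two hypothetical studies with g = 0.2 and g = 1.0, each with variance 0.04.
q, i2, tau2, pooled = heterogeneity([0.2, 1.0], [0.04, 0.04])
```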

Outliers
Effect sizes greater than three standard deviations from the mean were considered outliers. Results were reported with and without outliers (35).
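A minimal sketch of this outlier rule is given below. It assumes that "the mean" refers to the unweighted mean of the study effect sizes and that the sample standard deviation is used; neither detail is stated in the text, and the function name is hypothetical.

```python
import statistics


def flag_outliers(effects, n_sd=3.0):
    """Return one boolean per study: True if |g - mean| exceeds n_sd * SD."""
    mean = statistics.mean(effects)
    sd = statistics.stdev(effects)  # sample SD (assumption)
    return [abs(g - mean) > n_sd * sd for g in effects]


# Hypothetical example: twenty studies with g = 0.5 and one with g = 10.
flags = flag_outliers([0.5] * 20 + [10.0])
```

With this rule, a single extreme value can only be flagged when enough studies are available, because one value in a small sample cannot lie more than three sample SDs from the mean.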

Risk of Bias
The risk-of-bias analysis in individual studies was performed with a tool for the quality assessment of studies of diagnostic accuracy (QUADAS). The rating was performed by two independent raters (NP and CSM), and discordant ratings were resolved by consensus. The QUADAS questionnaire included 14 items covering the following issues: reference standard, covered patient spectrum, verification bias, disease progression bias, review bias, incorporation bias, clinical review bias, test execution, indeterminate results, and study withdrawals (Supplementary Table 2). Publication bias across studies for each outcome measure was examined by visually inspecting the funnel plot for asymmetry and by applying the Egger regression intercept test.
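The Egger regression intercept test can be sketched as follows, in its common formulation: the standardized effect (g/SE) is regressed on precision (1/SE), and an intercept far from zero points to funnel plot asymmetry. This is an illustrative sketch with a hypothetical function name, not the code used for this meta-analysis. It returns the intercept and its t statistic (k − 2 degrees of freedom); the p-value lookup is left out because the Python standard library has no t distribution.

```python
import math


def egger_intercept(effects, std_errors):
    """Return (intercept, slope, se_intercept, t_stat) of Egger's regression."""
    k = len(effects)
    if k < 3:
        raise ValueError("at least three studies are required")
    x = [1.0 / se for se in std_errors]                  # precision
    y = [g / se for g, se in zip(effects, std_errors)]   # standardized effect
    mean_x = sum(x) / k
    mean_y = sum(y) / k
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance and standard error of the intercept (ordinary least squares).
    resid_ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s2 = resid_ss / (k - 2)
    se_intercept = math.sqrt(s2 * (1.0 / k + mean_x ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else 0.0
    return intercept, slope, se_intercept, t_stat
```

When all studies share the same true effect regardless of their precision, the regression line passes through the origin, so the intercept is zero and the slope equals the common effect.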

Other Statistical Tests
Between-group differences; differences in R2*, SWI, and QSM values between SN subregions; the effect of magnetic field strength; the effect of MRI vendor; and the effect of ROI delineation methods were assessed using the nonparametric Wilcoxon, Mann-Whitney, and Kruskal-Wallis tests.

Correlations
To assess the relationship of R2*, SWI, and QSM values with the clinical characteristics of patients (age, disease duration, UPDRS scores, and H&Y stage) and with technical parameters (number of echoes and voxel size), a correlation analysis was performed. Correlation coefficients were computed between R2*, SWI, and QSM values and the clinical scores. To correct for multiple comparisons across several clinical scores, an approximate multivariate permutation test was conducted: the sampling distribution was built, and the corrected p-value was calculated as the proportion of values that were larger than the observed correlation coefficient (36).
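One way to read the permutation procedure described above is a maximum-statistic permutation test, sketched below. The imaging values are shuffled across studies, the largest absolute correlation over all clinical scores is recorded for each shuffle, and the corrected p-value of each observed correlation is the proportion of shuffles whose maximum is at least as large. The use of Pearson's r, the max-statistic construction, the function names, and the example data are all assumptions; the exact procedure of reference 36 may differ.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_stat_permutation(imaging, clinical, n_perm=2000, seed=0):
    """Return (observed r per score, corrected p-value per score).

    clinical maps a score name to values aligned with the imaging values.
    """
    rng = random.Random(seed)
    observed = {name: pearson_r(imaging, vals) for name, vals in clinical.items()}
    shuffled = list(imaging)
    max_null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Largest |r| over all clinical scores for this permutation.
        max_null.append(
            max(abs(pearson_r(shuffled, vals)) for vals in clinical.values())
        )
    corrected = {
        name: sum(m >= abs(r) for m in max_null) / n_perm
        for name, r in observed.items()
    }
    return observed, corrected


# Hypothetical per-study values, for illustration only.
imaging_values = [float(v) for v in range(1, 11)]
clinical_scores = {
    "updrs": [2.0 * v + 1.0 for v in imaging_values],
    "age": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0],
}
observed_r, corrected_p = max_stat_permutation(imaging_values, clinical_scores)
```

Taking the maximum over all scores within each permutation is what controls the family-wise error rate across the clinical scores.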

RESULTS
The database search returned 479 results in PubMed and 1,425 in ScienceDirect. After applying inclusion and exclusion criteria on the basis of titles and/or abstracts, 86 full-text articles were reviewed. Of these, 40 articles were excluded for the following reasons: the average R2*, QSM, or SWI results in the SN were not measured or not explicitly reported (n = 28), presence of duplicated data (n = 4), and review articles (n = 8). Forty-six studies were included in the meta-analysis: 3 T2*-based, 24 R2*-based, 10 SWI-based, and 17 QSM-based (Supplementary Figure 1). Eight of these studies presented measurements for both R2* and QSM. We show the relevant publications, population characteristics, and technical details of the included studies in Tables 1, 2 for R2*, Tables 4, 5 for SWI, and Tables 7, 8 for QSM.
No significant publication bias was identified by the funnel plots or the Egger regression intercept test. The funnel plots were symmetrical, and the Egger regression intercept test showed no significant publication bias for the meta-analysis of R2*, SWI, and QSM changes (p = 0.13, p = 0.58, and p = 0.45, respectively, Supplementary Figures 2-4). The risk-of-bias analysis in individual studies is presented in Supplementary Table 2.

R2* Meta-Analysis
Searching the database returned 27 R2*/T2*-based articles. The R2* meta-analysis included a population of 1,629 subjects with 879 PD patients and 750 HCs. Among all studies, the mean age of the patients (62.8 ± 3.7 years, range 54 to 72 years) did not [...] 0.6 ± 0.2, p = 0.07), but these differences were not significant. As for the SN subregions, increased iron content was observed at the level of the SNc in PD, while there were no significant changes in iron content in the SNr (p = 0.003 vs. p = 0.07), with effect size significantly higher in the SNc compared to the SNr (1.78 vs. 1.23, respectively, p = 0.09). R2* was increased in the lateral compared to the medial SNc (p = 0.02 and 0.06, respectively), with a nonsignificantly higher effect size in the lateral than in the medial SNc (0.86 vs. 0.52, respectively, p = 0.11), possibly due to the low number of articles reporting distinct measurements, while the ipsilateral and contralateral sides had similar effect sizes (0.95 vs. 0.93, p = 0.89) with no difference in R2* (p = 0.31) (Figure 2).

SWI Meta-Analysis
Searching the database returned 10 SWI-based articles. The SWI meta-analysis included a population of 655 subjects with 361 PD patients and 294 HCs. The heterogeneity of SWI values was I² = 89%, which indicated a high heterogeneity between studies. Most SWI-based studies calculated the phase of images, but others used the relative susceptibility, the SWI contrast, or the SWI hypointensity (Table 5); consequently, Hedges' g was calculated as the absolute value. In the SWI-based studies, the standardized mean difference was 1.14 with a 95% CI between 0.54 and 1.73 (range 0.36 to 3.47), confirming the difference in susceptibility values between HCs and PD patients (Table 6, Figure 3). There were not enough SWI-based articles to assess the quality of SWI-based discrimination of different SN subregions, contralateral or ipsilateral SN sides, or field strengths.

Correlation Analyses
Neither R2* values nor Hedges' g correlated with clinical characteristics (age, disease duration, UPDRS, and LEDD) or imaging parameters (voxel size and number of echoes). For SWI, effect size correlated positively with UPDRS values (r = 0.84, p = 0.04) and voxel size (r = 0.65, p = 0.04).

Scanner Effects
As expected, R2* values in the SN were lower at 1.5 T (mean 24.45 s⁻¹) (9,17) than at 3 T (mean 38.26 s⁻¹), due to the known increase of R2* with magnetic field strength. However, no statistical comparison could be performed due to the low number of data acquired at 1.5 T. For R2*, there was a statistically significant difference between the four vendors (p = 0.0006), with Philips providing higher R2* values (mean 50.25 s⁻¹) than the other vendors, that is, Siemens (mean 34.68 s⁻¹, p = 0.0001), Magnex Scientific (mean 34.99 s⁻¹, p = 0.015), and General Electric (mean 37.77 s⁻¹, p = 0.0008), while there was no significant difference between Siemens and General Electric (p = 0.22). There was no longer a statistical difference between the four manufacturers (p = 0.93) when R2* values were normalized using control values. For SWI, there was no significant between-vendor difference in effect size or phase values (p = 0.22 and p = 0.21, respectively).
For QSM values and Hedges' g, there were no significant between-vendor differences (p = 0.78 and p = 0.07, respectively), and all the studies were similar in ROI definition.

DISCUSSION
Overall, the studies included in this meta-analysis consistently detected iron overload in the SN of PD patients compared to HCs using iron-sensitive MRI, even in the early stages of the disease. This result was in agreement with the abnormal iron metabolism in PD that is associated with SN cell loss (62).
Studies differed in iron measurement techniques, the studied ROIs, methods of image acquisition and analysis, and patient population. The main measures used to quantify iron load were R2*, the phase and relative phase of the images for SWI, and the susceptibility for QSM. For R2* and QSM, positive values of Hedges' g were related to an increase in iron in PD patients as compared to the HCs (6,7). For SWI, studies used different image analysis techniques, so we used the absolute value of Hedges' g.
Regarding R2* values, there was a high variability between studies that was due to several factors. First, the age range of PD patients was wide (58 to 72 years). Second, we observed a dependence of the R2* value on the vendor, magnetic field strength, and ROI placement methods. Moreover, previous studies have shown that R2* depends not only on iron content but is also affected by the orientation of the head in the scanner, the surrounding iron distribution (blooming artifacts), the magnetic field strength, and imaging parameters such as the echo time or voxel size (53,63,64). Yet all studies demonstrated a significant increase in R2*, suggesting that R2* was a robust biomarker of SN changes in PD, especially when using a given protocol on a given scanner. Nevertheless, R2* values in PD became comparable across scanners after normalization by the HC values. Multicenter studies should consider the strong difference in R2* between vendors, especially for Philips, and consider normalizing data.
As for manual or semiautomated segmentation methods, the effect size was higher when the segmentation was performed using anatomical images such as 3D T2-weighted and T2*-weighted images rather than R2* or QSM maps, although the difference was not statistically significant. This difference may be related to the better resolution of the anatomical images. However, the constant improvements in resolution and contrast of parametric mapping, particularly for QSM and at ultra-high field strength, may modify this observation in future studies. There were no significant correlations of R2* values or R2* effect size with clinical parameters. The lack of correlations may be due to the large variability of R2* results as well as to a high number of possible confounding factors. In addition, while most articles reported UPDRS values, a large proportion of them did not specify whether the values were obtained on or off medication. In PD, degeneration of dopaminergic neurons predominates in the lateral part of the SNc, particularly in nigrosome 1 (65); therefore, greater differences were expected in this region. Consistent with this hypothesis, the highest effect size was observed in this lateral part of the SN. Some authors have separately analyzed the SN contralateral and ipsilateral to the most affected side, under the hypothesis that the contralateral SN would show greater nigral damage (12,31,32). However, we found no difference in either R2* values or effect sizes. This suggests that at PD onset, both SNc and SNr are already affected with increased iron overload.
SWI measurements were the most variable across studies. SWI is a purely qualitative method, although some studies have used it to carry out local measurements (66). This is intrinsically incorrect, since phase measurements do not reflect purely local changes: the phase shift of the signal inside a voxel comes not only from sources of susceptibility inside this voxel but also from neighboring sources outside it (67). This may explain the high variability across studies (I² = 89%), even though the pooled effect size was rather high. Additional reasons for the variability of SWI values in these studies may be the presence of blooming artifacts on phase and SWI images (similar to R2*) and the dependence of SWI on tissue geometry and on orientation relative to the direction of the main magnetic field. Limitations of SWI induced by the nonlocal, geometry-dependent, and orientation-dependent nature of the signal phase are overcome by QSM, which directly estimates the tissue susceptibility distribution based on the local perturbation of the magnetic field (64). Still, there was a positive correlation between the UPDRS scores and the effect size of phase measurements, suggesting a correlation between iron load and the severity of motor symptoms.
The effect size observed for QSM was higher than the one for R2*, although the difference was not significant. Moreover, in studies that compared R2* and QSM values in the same patients, the effect size was significantly higher for QSM than for R2*, suggesting that QSM might be a more robust marker than R2*. In addition, there were no significant differences in QSM values between the vendors. The QSM values also correlated significantly with age and UPDRS. The variability of susceptibility measurements was lower than that for SWI; however, this variability was still high, suggesting that more work needs to be done to standardize QSM image processing pipelines between centers. There were no significant differences in QSM values between the different subdivisions of the SN or between the SN contralateral or ipsilateral to the most affected side. Overall, while we would expect more significant changes in the SNc than the SNr based on pathological studies, this difference was not found in a high proportion of studies. This lack of difference could be due to the segmentation techniques, since the SNc was not clearly visible on the R2* or QSM images due to the relatively weak presence of iron in this region and its segmentation most often using a probabilistic method. A neuromelanin-based segmentation method could help better delineate the SNc (56).
Finally, the results of this meta-analysis suggest that QSM has some advantages over R2* because it provides a quantification of magnetic susceptibility, which might better reflect the underlying tissue iron content compared to R2*. Moreover, both QSM and R2* are preferable to SWI because they provide quantitative values, unlike SWI.
In terms of implementation, however, all methods pose some issues. Firstly, QSM and SWI require acquiring images of the MRI signal phase, whereas R2* only requires acquiring images of the MRI signal magnitude. Although protocols for SWI are now available for most vendors, for QSM, the correct acquisition of the phase might pose a technical issue in a clinical setting, as not all vendors correctly combine the phase from multiple elements of a phased-array coil, and dedicated software is required in these cases (68). Secondly, SWI only requires a single-echo acquisition, whereas multiecho acquisitions are required for R2* and are in general preferable for QSM too (69-71). This might pose a time issue in clinical practice. Therefore, when deciding which methods to use, one should consider the need for quantitative vs. qualitative imaging, the ease of implementation on clinical systems, and time constraints.
Our study has several limitations. First, despite the careful search across several databases, some studies could have been missed (72). Second, as with all literature searches and meta-analyses, the common practice of publishing positive rather than negative studies might have biased the results. Consequently, the mean effect size could be somewhat overestimated, as unpublished negative data were probably underrepresented. Third, there was a large heterogeneity of R2* and SWI values, indicating a large variability of these measurements, which depended on blooming artifacts and on the fact that, in contrast with QSM, neither R2* nor SWI directly quantifies the changes in magnetic susceptibility due to iron deposition in the SN. Moreover, the values varied with the scanner vendor and other technical parameters.
In summary, we have observed a consistent increase in MRI measures of iron content in PD across the literature using R2*, SWI, or QSM techniques, confirming that these measurements provide reliable markers of iron content in PD. Several of these measurements correlated with the severity of motor symptoms. Lastly, QSM appeared to be a more robust biomarker than R2*. However, image processing pipelines for QSM are not yet fully standardized, although efforts in this direction are being made (68,73,74). Therefore, QSM is a promising biomarker of disease-related iron accumulation in PD, but further work is needed to establish it as a robust biomarker in multicenter clinical studies and its usefulness as a longitudinal marker.

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the article/Supplementary Material.

AUTHOR CONTRIBUTIONS
NP, CS-M, and SL contributed to the conception and design of the study. NP, RG, and CS-M organized the database. LY-C performed the statistical analysis. NP wrote the first draft of the manuscript. NP, CS-M, EB, MS, and SL wrote sections of the manuscript. All authors contributed to manuscript revision and read and approved the submitted version.