Inter- and Intra-Scanner Variability of Automated Brain Volumetry on Three Magnetic Resonance Imaging Systems in Alzheimer’s Disease and Controls

Magnetic resonance imaging (MRI) has become part of the clinical routine for diagnosing neurodegenerative disorders. Since acquisitions are performed at multiple centers on multiple imaging systems, a detailed analysis of brain volumetry differences between MRI systems and between scan-rescan acquisitions can provide valuable information for correcting scanner effects in multi-center longitudinal studies. To this end, five healthy controls and five patients at various stages of the Alzheimer's disease (AD) continuum underwent brain MRI acquisitions on three different MRI systems (Philips Achieva dStream 1.5T, Philips Ingenia 3T, and GE Discovery MR750w 3T) with harmonized scan parameters. Each participant underwent two subsequent MRI scans per imaging system, with all scans on the three systems completed within 2 h. Brain volumes computed by icobrain dm (v5.0) were analyzed using absolute and percentual volume differences, Dice similarity coefficients (DSC), intraclass correlation coefficients, and coefficients of variation (CV). Harmonized scans obtained with different scanners of the same manufacturer had a measurement error closer to the intra-scanner performance. The gap between intra- and inter-scanner comparisons grew when comparing scans from different manufacturers. This was observed at image level (image contrast, similarity, and geometry) and translated into a higher variability of automated brain volumetry. Mixed effects modeling revealed a significant effect of scanner type on some brain volumes, and of the scanner combination on DSC. The study concluded that intra- and inter-scanner reproducibility was good, as illustrated by an average intra-scanner (inter-scanner) CV below 2% (5%) and an excellent overlap of brain structure segmentations (mean DSC > 0.88).


INTRODUCTION
The elderly population is increasing rapidly, and so is the prevalence of dementia. In the last decade, magnetic resonance imaging (MRI) has become an important tool in the diagnostic work-up of patients with Alzheimer's disease (AD), where MRI scans are essential for detecting brain atrophy and atrophy patterns for the differential diagnosis of dementia subtypes (Frisoni et al., 2010). In this context, the introduction of visual rating scales to assess hyperintensities, global cortical atrophy, posterior cortical atrophy, and medial temporal lobe atrophy on brain MRI scans has helped standardize radiological reading in AD and the differential diagnosis of dementia (Jack et al., 2011; Dubois et al., 2014; Niemantsverdriet et al., 2018; Struyfs et al., 2020). However, these rating scales are based on the visual assessment of 3D structures through 2D slices. As a result, despite their important role in the clinical setting, they are known to be time-consuming, subjective, not uniformly adopted, and dependent on the expertise of the radiologist (Vernooij et al., 2019). In response, recent developments in the field of imaging artificial intelligence (AI) have enabled the automatic extraction of clinically relevant measures from brain MRI scans (Niemantsverdriet et al., 2018; Struyfs et al., 2020; Wittens et al., 2021). It has since been shown that automated volumetry, combined with the expertise of radiologists, can improve the sensitivity and specificity of assessing AD-related atrophy (Pemberton et al., 2021). Since combining modern AI technology with radiological expertise has the potential to detect and monitor abnormalities more accurately, a strict validation of AI tools is necessary to assess their validity in a clinical setting. However, within- and between-scanner variability can at least partially neutralize the added diagnostic value of AI-based automated volumetry for the (differential) diagnosis of AD.
Several studies testing the repeatability and reproducibility of different automated volumetric tools, for white matter hyperintensity (WMH) quantification as well as for brain volumetric measurements in other neurological disorders such as multiple sclerosis, have shown the importance of accounting for variation when comparing scans from multi-center and longitudinal studies. In addition, the consistent application of metrics such as the coefficient of variation (CV), absolute volume differences (AVD), intraclass correlation coefficients, and Dice similarity coefficients (DSC) in studies assessing intra- and inter-scanner variability facilitates between-study comparisons (Gasperini et al., 2001; Huppertz et al., 2010; Biberacher et al., 2016; Shinohara et al., 2017; Guo et al., 2019). We have therefore set up a study to assess the effect of within- and between-scanner variability on three different MRI systems with harmonized scan parameters, using automated volumetry computed by the CE-labeled and FDA-cleared post-processing software icobrain dm (Struyfs et al., 2020). To this end, 10 subjects were scanned twice within a time interval of 2 h, and automatically computed brain volumes were statistically analyzed to assess intra- and inter-scanner agreement. The purpose of this study is to determine the extent of intra- and inter-scanner variability with harmonized acquisition parameters and its implications for routine clinical practice.

Study Population
The study population included five healthy controls and five patients at different stages of the AD continuum, resulting in a total of ten participants (Table 1). All participants were recruited at the Neurology and Radiology departments of UZ Brussel between April 2020 and August 2020. The exclusion criteria consisted of defibrillators, neurostimulators, pacemakers, and all other standard MRI contraindications, advanced AD (defined as a Mini-Mental State Examination score < 10/30), and brain tumors or other neurological disorders that could cause cognitive impairment. Patients were classified in compliance with the National Institute on Aging-Alzheimer's Association criteria for "MCI due to AD" and "dementia due to AD" (Jack et al., 2011; McKhann et al., 2011; Sperling et al., 2011; Dubois et al., 2014). Four MCI patients and one dementia due to AD patient were included in this study.

Study Design
For this prospective study, three different MRI systems were used (section "Image Acquisition," Table 2). All three MRI scanners are located at the radiology department of the VUB university hospital (UZ Brussel), Brussels, Belgium. To test the intra-scanner variability, each participant underwent two MRI scans per imaging system in a randomized order. Between the two scans on the same MRI system, the participants were repositioned, so that positioning differences were reflected in the evaluation. The duration of each MRI scan varied between 6 and 10 min per scanner. The time between scans was 3-5 min.
To test the inter-scanner variability, this workflow was repeated on three different MRI systems. A total of sixty MRI scans were used for downstream comparative analysis. The total scan time for all systems combined was a maximum of 90 min per participant. To minimize the variability in brain volume, all scans were performed in a time span of 2 h.

Ethical Committee
This randomized prospective study was approved by the Ethical Committee of UZ Brussel in Brussels, Belgium (Reference nr: 2020-079). Written informed consent of all participants and/or legal representatives (in case of dementia) was obtained.

Image Acquisition
All subjects were scanned twice on each of the three following scanners: a 1.5T Achieva dStream (Philips Medical Systems, Best, Netherlands), a 3T Ingenia (Philips Medical Systems, Best, Netherlands), and a 3T Discovery MR750w (GE Medical Systems, Milwaukee, WI, United States), further referred to as "Achieva," "Ingenia," and "GE." Every MRI scan study consisted of a sagittal 3D T1-weighted (T1w) MR sequence and a sagittal 3D fluid-attenuated inversion recovery (FLAIR) sequence. Sequence parameters were harmonized as much as possible between the vendors, limited by the constraints of each manufacturer. Priority was given to the harmonization of acquisition parameters over reconstruction parameters. The parameters are based on the existing clinical routine scans for AD within UZ Brussel. The 3T Ingenia scanner parameters were taken as a starting point and were subsequently modified on the 1.5T Achieva and 3T Discovery MR750w systems to be as alike as possible, by harmonizing resolution, timings, flip angles, and bandwidth. Test scans were performed on volunteers on the various systems and checked by two radiologists (G-JA and TV). All scans were visually inspected and feedback was provided; if necessary, the parameters were adapted and the procedure was repeated. No quantification was performed during testing, since T1 and FLAIR data were already being quantified by Icometrix as part of the clinical routine and positive feedback on the scan quality had been obtained. Lastly, voxel interpolation during reconstruction did not exceed a factor of 2 in any dimension. All scan parameters are listed in Table 2.

Image Processing
A visual assessment was performed to exclude possible causes of inaccurate measurements, including, but not limited to, motion artifacts, metal artifacts, and head-coil artifacts.

Statistical Analysis
All data processing was performed in the R environment (RStudio, v.1.0.136) for statistical computing and graphics; the packages and functions used are noted in square brackets throughout. Demographic information was reported as mean and standard deviation (SD; where applicable), with a significance level of <0.05 [R package: "arsenal" (tableby and write2word)].

Measures of Agreement at Image Level
The similarity between pairs of T1w scans is reported using an affine similarity index, contrast difference between two T1w images of the same subject, and a maximum scaling factor.

Affine similarity index
An affine similarity index is defined as the normalized mutual information (NMI) between any two T1w images of the same subject after affine registration between the two images (Studholme et al., 1999). This measure expresses how well two images match without requiring that the image intensities are similar, thus it is a measure of scan similarity that can be assessed in a clinical setting and can be related to the measurement error. An affine registration allows an image to be mapped to another image using rotation, translation, scaling and skewing. As it is a global transformation, the same rotation, translation, scaling, and skewing parameters are applied for the entire image, meaning that there are no different parameters for different voxels or structures. Post alignment, the NMI is calculated, where higher NMI values express stronger similarity and lower values express more mismatch. Previously, it was found that the alignment between two T1w images can be considered reliable when the affine similarity index is above 0.2 (Sima et al., 2019).
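As an illustration of the underlying measure, the NMI between two already-aligned images can be estimated from their joint intensity histogram. The exact normalization used by the processing pipeline is not specified in the text; the sketch below (in Python, not the study's actual implementation) uses the Studholme et al. form (H(A) + H(B)) / H(A, B), shifted by -1 so that values near 0 indicate unrelated intensities, which is consistent with a reliability threshold around 0.2.

```python
import numpy as np

def normalized_mutual_information(img_a, img_b, bins=32):
    """NMI between two (already affinely aligned) images of the same subject.

    Joint-histogram estimate of (H(A) + H(B)) / H(A, B), shifted by -1 so
    that ~0 means unrelated intensities and 1 means a perfect one-to-one
    intensity mapping. This normalization choice is an assumption.
    """
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()

    def entropy(p):
        # Shannon entropy over the non-zero probabilities only.
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    h_a = entropy(pxy.sum(axis=1))   # marginal entropy of image A
    h_b = entropy(pxy.sum(axis=0))   # marginal entropy of image B
    h_ab = entropy(pxy.ravel())      # joint entropy
    return (h_a + h_b) / h_ab - 1.0
```

An image compared with itself yields a value of 1, while two images with unrelated intensities give a value near 0.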

Maximum scaling factor
A maximum scaling factor is also reported as the maximal stretching along any of the three spatial axes when affinely registering two T1w images of the same subject. A value above one might indicate that there are geometric differences between the two T1w images, while a maximum scaling factor of 1 indicates that no scaling is needed in any of the three directions to perfectly align the two images. For the sake of simplicity, we do not differentiate between stretching and shrinking because these are inverse operations, depending on which image is considered as reference.
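One way to extract such a factor from an estimated affine transform is via the singular values of its 3x3 linear part, which give the scaling along the principal axes. The sketch below is a hypothetical Python illustration (the study does not specify its decomposition method); stretching and shrinking are folded together, as in the text, by taking max(s, 1/s).

```python
import numpy as np

def max_scaling_factor(linear_part):
    """Maximal stretch implied by the 3x3 linear part of an affine transform.

    Singular values give the scaling along the principal axes; stretch and
    shrink are treated symmetrically by taking max(s, 1/s), so a value of
    1.0 means no scaling is needed in any direction.
    """
    s = np.linalg.svd(np.asarray(linear_part, dtype=float), compute_uv=False)
    return float(max(s.max(), 1.0 / s.min()))
```

For the identity transform this returns exactly 1.0; a 2% stretch along one axis returns 1.02, as does a 2% shrink (up to rounding), since the two are inverse operations.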

Contrast difference
Besides measuring global image similarity between pairs of T1w scans with the affine similarity index, the contrast difference between two T1w images of the same subject is also computed. The image contrast is defined as the contrast-to-noise ratio (CNR) between WM and GM image intensities (Magnotta and Friedman, 2006), computed as CNR = |µ_WM − µ_GM| / σ_noise, with µ_WM and µ_GM the mean WM and GM intensities and σ_noise the standard deviation of the image noise. It is expected that images with similar contrast, as indicated by a lower absolute difference in WM/GM CNR, would be segmented more consistently.
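Given tissue masks and a noise estimate, the CNR and the between-scan contrast difference follow directly from the definition above. This minimal Python sketch assumes the WM/GM masks and the noise standard deviation (e.g., from a background region) are available; how the pipeline estimates σ_noise is not stated in the text.

```python
import numpy as np

def wm_gm_cnr(image, wm_mask, gm_mask, noise_sigma):
    """WM/GM contrast-to-noise ratio: tissue intensity separation over noise."""
    image = np.asarray(image, dtype=float)
    return abs(image[wm_mask].mean() - image[gm_mask].mean()) / noise_sigma

def cnr_difference(cnr_scan1, cnr_scan2):
    """Absolute WM/GM CNR difference between two T1w scans of one subject."""
    return abs(cnr_scan1 - cnr_scan2)
```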

Intraclass correlation coefficient
The intra-scanner variability was analyzed by determining the intraclass correlation coefficient (ICC, with 95% CI), using the function "ICC" of R package psych (v. 2.3.0), based on absolute agreement, single measurement, and a two-way mixed model, returning the ICC estimate and its confidence interval (Shrout and Fleiss, 1979; McGraw and Wong, 1996; Revelle, 2012). The ICC is a measure of reproducibility between repeated measurements of the same item, carried out by different observers, and can be calculated as ICC = S²_A / (S²_A + S²_W), with S²_A the variance amongst groups and S²_W the variance within groups (Wolak et al., 2012). With an index going from 0 (no agreement) to 1 (absolute agreement), the ICC value can be interpreted as poor (<0.50), moderate (0.50-0.75), good (0.75-0.90), or excellent (>0.90), based on the 95% confidence interval of the ICC estimate, as suggested by Koo and Li (2016). For the intra-scanner variability, the ICC expresses the fraction of the variance in outcome between individuals, divided by the total variance (Trevethan, 2017). This calculation was carried out separately for the three MRI systems for each of the brain structures mentioned in section "Post processing technique." In addition, the mean ICC value and confidence intervals were calculated over all brain structures for each MRI system. For the inter-scanner variability, four ICCs were calculated based on absolute agreement, single measurement, and a two-way mixed model, using the mean value of the test and retest scan per MRI system. First, data from all scanners were included, by considering all possible pairwise comparisons. The second, third, and fourth ICCs represented pairwise comparisons between scanners (Ingenia - Achieva, Achieva - GE, and Ingenia - GE). As was done for the intra-scanner measurements, the mean ICC value and confidence intervals were additionally calculated over all brain structures for each MRI system.
Taken together, we used the ICC to express the correlation between replicated measurements for the same subject within the same scanner (intra-scanner variability) and in between scanners (inter-scanner variability).
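The variance-components form of the ICC described above can be sketched as follows. This is a minimal Python illustration of the simplest (one-way) estimate of S²_A / (S²_A + S²_W), not the exact two-way mixed model computed by R's psych::ICC in the study.

```python
import numpy as np

def icc_oneway(measurements):
    """One-way ICC following the S2_A / (S2_A + S2_W) form in the text.

    `measurements` is an (n_subjects, k_repeats) array. Variance components
    are estimated from the between- and within-subject mean squares.
    """
    x = np.asarray(measurements, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Between-subject mean square (each subject contributes k repeats).
    ms_between = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)
    # Within-subject mean square (residual around each subject's mean).
    ms_within = np.sum((x - x.mean(axis=1, keepdims=True)) ** 2) / (n * (k - 1))
    s2_a = max((ms_between - ms_within) / k, 0.0)  # among-subject variance
    return s2_a / (s2_a + ms_within)
```

With perfectly repeatable measurements on subjects with distinct volumes, the ICC is 1; as within-subject noise grows relative to between-subject spread, it approaches 0.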

Coefficient of variation
Another complementary measure of precision is the CV (%). The CV expresses within-person variability as the ratio of the SD (σ) of repeated measurements to their mean (µ): CV = σ / µ × 100%. For the intra-scanner variability, we calculated the CV between the two technical replicates (scan 1 and scan 2) within one person within one scanner [R package: "matrixStats" (rowMeans and rowSds)] (Bengtsson et al., 2021). For the inter-scanner variability, the mean value of the two repeated measurements from each person was taken for each scanner [R package: "tidyverse" (gather, group_by, summarize)]. Subsequently, four CVs were calculated. For the first CV, the three mean values of the two repeated measurements from all scanners were considered, and the ratio of their SD to their mean was taken. For the second, third, and fourth CV, the computations were done in a pairwise manner for each scanner combination (Ingenia - Achieva, Achieva - GE, and Ingenia - GE).
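The CV formula above is straightforward to apply to either a test-retest pair or the three per-scanner means; a minimal Python sketch (the study used R's matrixStats):

```python
import numpy as np

def cv_percent(values):
    """Coefficient of variation (%): sample SD of repeated measurements
    divided by their mean, times 100."""
    values = np.asarray(values, dtype=float)
    return values.std(ddof=1) / values.mean() * 100.0
```

For a test-retest pair of 98 mL and 102 mL the CV is about 2.83%; for three per-scanner means of 100, 101, and 99 mL the inter-scanner CV is 1.0%.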

Absolute volume differences
Absolute volume differences (mL) between two measurements were also calculated for both intra-and inter-scanner comparisons. For intra-scanner variability, the AVD was calculated as the absolute difference between test (scan one) and retest (scan two) scans within each person within each scanner. For the inter-scanner variability, pairwise differences between scanners were calculated, starting from taking the mean value of the two repeated measurements from each person. The AVD was calculated in a similar way as described previously for the CV, except for the fact that no "all scanner" AVD calculation was carried out, since AVD calculations only allow pairwise comparisons.

Dice similarity coefficient
The DSC was calculated to measure the voxel-wise overlap between test and retest scan segmentations within each person within each scanner (intra-scanner variability) and for pairwise comparisons between scanners (inter-scanner variability). To this end, one randomly chosen T1w image in a test-retest pair was affinely transformed to the other T1w image in the pair, so that the corresponding brain structure segmentations could be resampled to the same geometric space prior to computing the DSC overlap as DSC = 2 |X ∩ Y| / (|X| + |Y|), where X is the brain structure segmentation from one scan and Y is the brain structure segmentation from the other scan after the corresponding spatial transformation. Each of these measures of agreement was computed separately for all brain structure volumes mentioned in subsection "Post processing technique."
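Once both segmentations are resampled to the same space, the DSC reduces to counting voxels; a minimal Python sketch for binary masks:

```python
import numpy as np

def dice_coefficient(seg_x, seg_y):
    """DSC = 2|X ∩ Y| / (|X| + |Y|) for two binary segmentation masks
    that have already been resampled to the same geometric space."""
    x = np.asarray(seg_x, dtype=bool)
    y = np.asarray(seg_y, dtype=bool)
    return 2.0 * np.logical_and(x, y).sum() / (x.sum() + y.sum())
```

Identical masks yield 1.0; two masks of equal size that share half their voxels yield 0.5.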

Percentual difference
Despite a direct mathematical relationship [a factor of sqrt(2)] between the percentual difference of two measurements and the CV of the same two measurements, percentual differences were reported to allow interpreting reproducibility in the context of yearly atrophy. Lastly, actual volumes were reported to determine the presence of bias between the scanner types. Significant differences within and between scanners were evaluated for the actual volumes, CV, AVD, and DSC values using a mixed model approach correcting for repeated measurements, with a Bonferroni-corrected alpha level of <0.005 [0.05/number of brain structures (n = 11)]. A patient pseudonym (anonymous patient identifier) was included as a random effect to control for variation between patients, while the scanner pairs (within- or between-scanner pairs) were included as a fixed effect. Significant differences in actual volumes between the scanner types were evaluated to assess systematic bias between scanner types, while significant differences in measures of agreement (CV, AVD, and DSC) assessed reproducibility.
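The sqrt(2) relationship follows because the sample SD of two values a and b is |a − b| / sqrt(2), so the percentual difference |a − b| / mean equals sqrt(2) times the CV of the pair. A small Python check of this identity:

```python
import math

def percentual_difference(a, b):
    """Percentual difference (%) between two measurements of one structure."""
    return abs(a - b) / ((a + b) / 2.0) * 100.0

def cv_two_measurements(a, b):
    """CV (%) for the same pair: sample SD of two values is |a - b| / sqrt(2)."""
    return (abs(a - b) / math.sqrt(2.0)) / ((a + b) / 2.0) * 100.0
```

For volumes of 98 mL and 102 mL, the percentual difference is 4% and the CV is about 2.83%, i.e., exactly a factor of sqrt(2) apart.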

Intra-and inter-scanner variability on patient level
Quantitative measurements and the limits of agreement (LOA) were visualized through Bland-Altman plots using the R package "blandr" to graphically explore individual subject within-scanner measurements and to check for possible heteroscedasticity and outliers. Here, the difference between a test and retest scan (y-axis) was plotted against the average of the two scans (x-axis), including a central horizontal line on the scatter plot depicting the mean difference or "mean bias." In addition, the SD of the mean bias was used to construct the upper and lower LOA (mean bias ± 1.96 SD). The pre-defined maximum allowed difference was based on a priori clinically defined criteria: it should not exceed the annual pathological whole brain, GM, and hippocampal atrophy seen in AD, as suggested by Barnes et al. (2009), Sluimer et al. (2008), and Anderson et al. (2012), which is around 2% for larger brain structures and not more than 4.66% for hippocampal volumes. However, it has to be taken into account that atrophy rates are neither spatially nor temporally uniform in MCI and AD patients. If the limits do not exceed the maximum acceptable difference between test and retest scans, and the measurement is neither higher than the upper limit of the 95% confidence interval of the upper LOA nor lower than the lower limit of the 95% CI of the lower LOA, the measurements are considered to be in agreement (Stöckl et al., 2004; Chhapola et al., 2015; Giavarina, 2015).
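The bias and limits of agreement described above reduce to a few lines; a minimal Python sketch (the study used the R package blandr):

```python
import numpy as np

def bland_altman_limits(test, retest):
    """Mean bias and 95% limits of agreement (bias ± 1.96 SD) for
    paired test-retest volume measurements."""
    d = np.asarray(test, dtype=float) - np.asarray(retest, dtype=float)
    bias = d.mean()
    sd = d.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

The returned lower and upper LOA are symmetric around the bias; a measurement pair is flagged when its difference falls outside these limits.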

Measurements of Agreement Between Image Pairs at Image Level
Measurements of agreement between image pairs were reported within (intra) and between (inter) scanners to assess agreement at image level (Table 3). There is a very high similarity between scan-rescan T1w images acquired on each scanner, as demonstrated by a reliable affine similarity index, a low WM/GM contrast difference, and a maximum scaling factor of 1 for all comparisons between images of the same scanner. Achieva and Ingenia also showed a very reliable affine similarity index. When comparing T1w images of GE and Achieva, the WM/GM contrast showed a higher difference, which could indicate a less consistent segmentation. In fact, when looking at the individual image quality through the absolute CNR values (the CNR value per T1w image, Supplementary Table 1), Achieva shows a higher contrast than GE.
The T1w images of each scanner are visualized for two randomly selected subjects in Figure 1. Here, the T1w images of a healthy control (Figure 1A) and a patient with MCI due to AD (Figure 1B) are depicted without segmentation (top), with icobrain dm's segmentation of the LVENT (middle), and with icobrain dm's segmentation of the cortical brain structures including the hippocampus (bottom). These scans and results are shown for both scans on the three different MRI systems.

TABLE 3 | The scanner models of the considered image pairs, irrespective of order, i.e., irrespective of which image is considered as reference. Affine similarity: mean ± SD of the affine similarity index, where >0.2 corresponds to a reliable affine similarity index. WM/GM contrast difference: the absolute difference in WM/GM contrast-to-noise ratio (mean ± SD), with a threshold of acceptability between 0.1 and 0.2. Max scale factor: the mean ± SD of the maximum scaling factor over the three spatial directions, where a value of 1.00 indicates that no scaling is needed, and 1.01 indicates that 1% scaling is required. Note that the standard deviation is approximately 0, showing that the scaling needed in pairwise comparisons is subject-independent.

Measures of Agreement for Intra-and Inter-Scanner Brain Volumes and Segmentations
Automated volumetric measurements computed by the icobrain dm segmentation software were determined for each MRI scanner, for each of the following brain structures: whole brain, GM, CGM, WM, frontal, parietal and temporal cortices, hippocampal volumes, and LVENT.

Intra-Scanner Variability
To examine the reproducibility of measurements within each of the scanners, the CV, AVD, DSC, and ICC were determined for all previously mentioned brain structures, calculated with icobrain dm (Table 4). Here, the CV expresses the difference between measurements within the same individual, within the same scanner, while the ICC expresses the between-person variance with respect to the total variance. The individual CV values (mean ± SD) were between 0.16 ± 0.12% and 3.14 ± 2.15%. The intra-scanner CVs over all volumes were similar on the three scanners, with 1.05 ± 0.87% for Achieva, 1.15 ± 0.81% for Ingenia, and 0.95 ± 0.46% for GE. The AVDs and ICCs showed the same trend as the CV values. The ICC showed no scores below (mean [CI]) 0.941 [0.823, 0.981] (HIP-L, Ingenia). The ICC scores tended to decrease slightly for smaller regional brain volumes such as the hippocampal volumes, except for the right hippocampus. The DSC values (mean ± SD) ranged from 0.87 ± 0.02 (PC, Ingenia) to 0.98 ± 0.00 (WB, GE). The intra-scanner DSC (mean ± SD) over all volumes was 0.91 ± 0.01 for Achieva, 0.92 ± 0.01 for Ingenia, and 0.93 ± 0.04 for GE. Significant differences in DSC values were reported for the hippocampal volumes (p < 0.001). The estimated effect, standard error, z-values, and adjusted p-values (Bonferroni) per pairwise difference for each brain structure showing a significant overall difference in DSC values are reported in Supplementary Table 2.

Inter-Scanner Variability
To examine the inter-scanner variability, the AVD, CV, DSC, and ICC were determined for Achieva - Ingenia, Achieva - GE, Ingenia - GE, and an all-scanner comparison (Table 5). Here, the CV expresses the differences between the three MRI systems, while the ICC was used to express how similar the observations are across the three MRI systems.

Individual Quantitative Intra-and Inter-Scanner Variability
The quantitative measurements computed by icobrain dm for the test (scan 1) and retest (scan 2) scans per subject (color-coded differentiation) and per scanner (symbol-coded differentiation) were visually (Figure 2) and statistically (Supplementary Table 3) presented using a Bland-Altman plot to detect possible deficiencies in individual reliability, heteroscedasticity, and outliers.
The Y-coordinate of a point shows the difference in mL between scan one and scan two, while the X-coordinate indicates the mean between the two volumes. By showing the results of the three different scanners for each person separately (with the three different plot characters according to the scanner), we depicted the inter-scanner variability per subject between the different scanners.
The within-subject inter-scanner variability is visible in the X-axis direction when looking at the differences between the inter-scanner means in mL [(scan one + scan two)/2]. Furthermore, between-subject variation is visible in the X-axis direction when looking at the difference in means between individuals. Here, the LOA represents the 95% prediction interval [±1.96 SD], where a smaller range indicates a better agreement.
According to the Bland-Altman plot, the GE result falls outside of the lower LOA for subject 10 for WB and CGM, and outside of the upper LOA for LVENT. To identify a possible underlying reason for this larger intra-scanner variability, the native MRI sequences of subject 10 were double-checked. Evaluation by a neuroradiologist (GA) revealed no significant MRI artifacts. Both acquisitions had a similar gray and white matter contrast. However, evaluation of icobrain dm's segmentation revealed a slight oversegmentation of the cortex in the superior sagittal sinus, which might partially explain the increased difference between the two scans. Furthermore, no heteroscedasticity nor any specific pattern regarding intra- or inter-scanner variability was found for any of the regions of interest.

Percentual Differences
Percentual differences are reported in Table 6. For the intra-scanner variability results, the largest percentual volume difference was seen for the left hippocampus (mean ± SD, 4.47 ± 3.13%, Ingenia), while pairwise comparisons showed the largest difference for WM (mean ± SD, 8.74 ± 3.47%, Ingenia - GE). These findings were in line with the intra-CV and intra-ICC values. The smallest percentual volume difference was found for gray matter for intra-scanner (mean ± SD, 0.22 ± 0.17%, Achieva) and whole brain for inter-scanner variability results (mean ± SD, 0.52 ± 0.32%, Achieva - GE). (Table 6 notes: p-values < 0.005 for "all scanners" differences are highlighted in bold; ∧ p < 0.005 between Achieva and GE; · p < 0.005 between Ingenia and GE.)

Actual Brain Structure Volumes
Actual volumes for all brain structures were reported as mean ± SD (Table 7). To assess systematic bias between the three MRI systems, a mixed model approach correcting for repeated measures with post-hoc Bonferroni correction was employed. Intra-scanner variability results showed no significant within-scanner differences for any of the brain structure volumes. For Achieva - Ingenia, whole brain and LVENT were significantly different (p < 0.001). For Achieva - GE, GM, CGM, WM, the frontal, parietal, and temporal cortices, and the right hippocampus showed significant differences (p < 0.001). Significant differences for all brain volumes (p < 0.001), except the TC and the total and left hippocampus, were found for Ingenia - GE. The estimated effect, standard error, z-values, and adjusted p-values (Bonferroni) per pairwise difference for each brain structure showing a significant overall difference in actual volumes are reported in Supplementary Table 4.

DISCUSSION
As the potential added diagnostic value of AI-based automated volumetry on brain MRI scans might at least in part be neutralized by intra- and inter-scanner variability, a thorough evaluation of the measurement error and variability under clinical routine circumstances is crucial. In the current study, the intra- and inter-scanner variability of global, cortical, and subcortical brain volumes was evaluated using the CE-marked and FDA-cleared icobrain dm software on three different MRI systems. It is known that intra-scanner variability exists and depends on several uncontrollable (short-term physiological fluctuations), semi-controllable (head motion, subject positioning, noise, and measurement error), and controllable (day-to-day impact, time of day, and medication) factors. Previous studies have reported a time-of-day dependence of MRI-based global brain volume calculations (Trefler et al., 2016; Dieleman et al., 2017). In this study, all patients were scanned in the morning, eliminating both day-to-day and morning/evening differences as additional variables. Subject-positioning variation was minimized by having the same operator place each subject in the MRI scanner through a standardized procedure. Nevertheless, when comparing brain structure segmentations of repeated (intra- or inter-scanner) scans using Dice overlap, affine image alignment and resampling were still required. This post-processing step typically introduces additional variability, which is an unavoidable limitation of this type of agreement measurement between repeated scans.
Overall, a low intra-scanner variability of the MRI measures was found. The ICC showed no scores below (mean [CI]) 0.941 [0.823, 0.981] (HIP-L, Ingenia), indicating a good agreement for all intra-scanner comparisons. The largest intra-scanner variability was observed in the left hippocampus, which can be explained by the smaller size of this brain structure (Struyfs et al., 2020) and the complexity of its delineation. In addition, a higher variability compared to other, larger brain regions was demonstrated for the LVENT, related to the presence of CSF and short-term physiological fluctuations (Dieleman et al., 2017). Similar to the intra-scanner variability, a low inter-scanner variability of the icobrain dm measures was observed. The ICC showed no mean scores below 0.961 [0.901, 0.991] (WM, Ingenia - GE). In addition, neither the significant differences between the actual volumes nor the visual assessment of the Bland-Altman plots revealed any systematic pattern regarding intra- or inter-scanner bias. The mixed modeling approach showed that the significance came from the DSC measures and actual volumes, while CV and AVD differences were not statistically significant. This might indicate that the DSC is more sensitive than volumetric criteria, since it assesses the overlap between two segmentations, while an imperfect overlap might be compensated for when calculating volumes.
However, statistical significance and clinical relevance should not be conflated, since one does not necessarily imply the other. The observed measurement variability should instead be interpreted against the brain volume changes reported in aging and disease (Good et al., 2001; Scahill et al., 2003; Biberacher et al., 2016; Schippling et al., 2017; Vinke et al., 2018). Annual atrophy rates of around 2% have been observed in Alzheimer's patients for whole brain (Sluimer et al., 2008) and GM volumes (Anderson et al., 2012). In addition, a meta-analysis on hippocampal atrophy rates in AD patients and controls reported annualized hippocampal atrophy rates of 4.66% (3.92-5.40) for AD patients, while an atrophy rate of 1.41% (0.52-2.30) was reported for healthy individuals (Barnes et al., 2009).

TABLE 5 | Brain volumes used to calculate the measurements of precision were computed by the icobrain dm segmentation software. The coefficient of variation (CV, %), intraclass correlation coefficients (ICC, [95% CI]), absolute volume differences (AVD, mL), and Dice similarity coefficients (DSC, mean ± SD) are reported for all pairwise comparisons and the three-scanner comparison ("All scanners"). The highest CV and lowest ICC and DSC values amongst all structures are highlighted in bold. Achieva: Philips Medical Systems Achieva dStream 1.5T. Ingenia: Philips Medical Systems Ingenia 3T. GE: GE Discovery MR750w 3T.

FIGURE 2 | Bland-Altman plots for individual brain structures. Variability results (intra- and inter-scanner, as well as intra- and inter-subject variability) presented in a Bland-Altman plot per brain structure for three different MRI systems (Philips Medical Systems Achieva dStream 1.5T, Philips Medical Systems Ingenia 3T, and GE Discovery MR750w 3T), computed by icobrain dm. The y-axis represents the difference in mL [test (scan 1) - retest (scan 2)]. The x-axis represents the mean in mL of scan 1 and scan 2 [(scan 1 + scan 2)/2]. The quantitative measurements are presented per subject (color-coded differentiation) and per scanner (symbol-coded differentiation). Here, the limit of agreement (LOA, upper LOA in yellow and lower LOA in green) represents the 95% prediction interval [±1.96 SD].

TABLE 6 | Percentual volume difference [percentage (%), mean ± SD] for each individual brain volume. All intra-scanner values represent the mean percentual volume difference of all three scanner types combined, since percentual volume differences only allow pairwise comparisons. The largest percentual volume differences (intra- and inter-scanner variability) are highlighted in bold.
According to our study, the within-scanner percentual difference for whole-brain volumes, measured within a time span of 3 h, is similar to the previously reported annual volume decline in healthy individuals, whereas a pathologic whole-brain volume change, as seen in AD, would exceed the observed intra- and inter-scanner measurement error. In light of these findings, caution is needed when comparing MRI scans obtained with different protocols, since even with the same vendor, harmonized protocols, and elimination of the previously mentioned controllable influencing factors, a volumetric bias remains. On the other hand, our volumetric analysis was performed in a "cross-sectional" way, where each individual scan was segmented independently. "Longitudinal" methods, which simultaneously analyze two or more brain scans, are known to have a significantly lower measurement error and should be preferred over cross-sectional measurements for computing atrophy of brain structures.
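The comparison made above rests on the percentual volume difference, expressed relative to the pair mean; a minimal sketch with hypothetical volumes (not the study's pipeline):

```python
def percentual_volume_difference(v1, v2):
    """Percentual volume difference between a scan-rescan pair,
    relative to the pair mean: 100 * (v1 - v2) / ((v1 + v2) / 2)."""
    return 100.0 * (v1 - v2) / ((v1 + v2) / 2.0)

# Usage: a hypothetical whole-brain scan-rescan pair (mL).
# A sub-percent scan-rescan difference like this stays well below
# the ~2% annualized whole-brain atrophy rate reported in AD
# (Sluimer et al., 2008).
diff_pct = percentual_volume_difference(1105.0, 1100.0)
```

Because this measure is defined on a pair of scans, the intra-scanner values in the table are averaged over the three scanner types, as noted in its caption.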

Harmonization
The three MRI systems used in this study differed in head coils and number of channels per coil (16 vs. 32 channels), resulting in differences in the signal-to-noise ratio (SNR). In this study design we used coils that were purchased directly from the manufacturer. In addition, the difference in FOV between the three MRI systems can be explained by the fact that no phase oversampling was available on the GE system. To compensate for this, a larger FOV was employed, enabling acquisition of the same resolution and number of phase-encoded lines as for the other vendors, so that SNR was not affected. Furthermore, the reconstruction resolution for GE was slightly lower compared to the other MRI systems, as it could not be chosen freely and interpolation factors > 2 were avoided. Nevertheless, the in-plane image resolution of 1 mm × 1 mm is well suited for brain segmentation. The difference in image resolution could have a slight effect on the segmentation of regions with many complicated borders. The total scan duration was also longer for GE than for the other systems, since GE does not offer the TFE sequence but uses its own BRAVO sequence, a TFE variant optimized for brain imaging. The disadvantage, in our case, is that the adjustable parameters are limited, including control over the scan duration. Another potential cause of variability that was not investigated in this study is the scanner-specific differences in post-processing of the raw data. For example, the GE scanner used in this study ended up with 280 instead of 288 slices (as with Achieva and Ingenia). This was the consequence of an implicit oversampling and the discarding of the outer slices during reconstruction, which is GE-specific. Additional efforts are needed to deepen our understanding of the effects of these scanner-specific post-processing differences on inter-scanner variability.
An additional limitation of this study was the small sample size (n = 10), which was not sufficient to draw realistic conclusions regarding disease-related variability, but which nevertheless produced a total of 60 MRI scans and thereby allowed for the analysis of within-subject differences. In addition, since only one automated volumetric software tool was utilized in this study, it would be beneficial to investigate the effect of different automated volumetric software tools on the intra- and inter-scanner variability across different MRI systems.
Follow-up of brain MRI scans can aid in tracking disease progression, which may be relevant for research purposes. In addition, MRI can display the presence of typical brain atrophy patterns correlated with specific neurodegenerative diseases. Analyzing and subsequently improving intra- and inter-scanner variability can bring us closer to comparing MRI scans from the same individual acquired at different centers. Being able to compare multi-center MRI scans is also useful in clinical trials, where MRI scans from different scanners can then be pooled for data analysis. Harmonizing inter-center MRI scans might aid multi-center research, but its application in a clinical setting remains challenging. Therefore, techniques that allow for the harmonization of MRI data, e.g., based on AI, would be very valuable to overcome these obstacles. Such an approach might allow comparison of recent MRI scans with older MRI scans (acquired with different techniques) over a longer period.
In conclusion, harmonized acquisition sequences were able to produce good-quality brain scans on different MRI scanners and were suitable for automated brain segmentation. In addition, the observed intra- and inter-scanner measurement error was smaller than the annual pathologic whole-brain volume change seen in AD. Harmonized scans obtained with different scanners of the same manufacturer had a measurement error closer to the intra-scanner performance. The gap between intra- and inter-scanner comparisons grew when comparing scans from different manufacturers. This was observed at image level in terms of image contrast, image similarity, and geometry, and translated into a higher variability of automated brain volumetry. However, on average, intra- and inter-scanner variability results showed a good overlap of brain structure segmentation (mean DSC > 0.88), and good reproducibility within (mean CV < 2%) and between scanners (mean CV < 5%) was obtained over global, cortical, and subcortical brain structures.

DATA AVAILABILITY STATEMENT
All data are available from the corresponding author on reasonable request.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Ethical Committee of UZ Brussel in Brussels, Belgium (Reference nr: 2020-079), as part of this randomized prospective study. Written informed consent to participate in this study was obtained from all participants and/or their legal representatives (in case of dementia).

AUTHOR CONTRIBUTIONS
G-JA: conceptualization, investigation, resources, data curation, and writing - original draft. MW: conceptualization, investigation, resources, data curation, formal analysis, validation, visualization, and writing - original draft. DMS: methodology, software, validation, formal analysis, and writing - review and editing. MN: data curation and writing - review and editing. TV: writing - review and editing. A-MV, YD, and GN: conceptualization. NB: conceptualization and writing - review and editing. HR: methodology, resources, supervision, and validation. EF: formal analysis and writing - review and editing. DS: conceptualization and software. WH: conceptualization, validation, and writing - review and editing. MB: supervision and writing - review and editing. JM: conceptualization, resources, supervision, funding acquisition, and writing - review and editing. SE: conceptualization, resources, supervision, project administration, funding acquisition, and writing - review and editing. All authors critically revised and approved the content of the final manuscript before submission.

FUNDING
This research was supported in part by the agency of Flanders Innovation and Entrepreneurship (VLAIO) and the Interreg V programme Flanders-Netherlands of the European Regional Development Fund (ERDF; Herinneringen/Memories project). Icobrain dm is proprietary software, developed by icometrix for the automated quantification of brain volumes and white matter hyperintensities.