Quantification of Cervical Cord Cross-Sectional Area: Which Acquisition, Vertebra Level, and Analysis Software? A Multicenter Repeatability Study on a Traveling Healthy Volunteer

Background: Considerable spinal cord (SC) atrophy occurs in multiple sclerosis (MS). While MRI-based techniques for SC cross-sectional area (CSA) quantification have improved over time, there is no common agreement on whether to measure at single vertebral levels or across larger regions and whether upper SC CSA can be reliably measured from brain images. Aim: To compare in a multicenter setting three CSA measurement methods in terms of repeatability at different anatomical levels. To analyze the agreement between measurements performed on the cervical cord and on brain MRI. Method: One healthy volunteer was scanned three times on the same day in six sites (three scanner vendors) using a 3T MRI protocol including sagittal 3D T1-weighted imaging of the brain (covering the upper cervical cord) and of the SC. Images were analyzed using two semiautomated methods [NeuroQLab (NQL) and the Active Surface Model (ASM)] and the fully automated Spinal Cord Toolbox (SCT) on different vertebral levels (C1–C2; C2/3) on SC and brain images and the entire cervical cord (C1–C7) on SC images only. Results: CSA estimates were significantly smaller using SCT compared to NQL and ASM (p < 0.001), regardless of the cord level. Inter-scanner repeatability was best in C1–C7: coefficients of variation for NQL, ASM, and SCT: 0.4, 0.6, and 1.0%, respectively. CSAs estimated in brain MRI were slightly lower than in SC MRI (all p ≤ 0.006 at the C1–C2 level). Despite protocol harmonization between the centers with regard to image resolution and use of high-contrast 3D T1-weighted sequences, the variability of CSA was partly scanner dependent probably due to differences in scanner geometry, coil design, and details of the MRI parameter settings. Conclusion: For CSA quantification, dedicated isotropic SC MRI should be acquired, which yielded best repeatability in the entire cervical cord. 
In the upper part of the cervical cord, use of brain MRI scans entailed only a minor loss of CSA repeatability compared to SC MRI. Due to systematic differences between scanners and the CSA quantification software, both should be kept constant within a study. The MRI dataset of this study is available publicly to test new analysis approaches.


INTRODUCTION
Spinal cord (SC) atrophy assessment in neurological diseases such as multiple sclerosis (MS) has gained considerable attention over the past years (1–3). Techniques for measuring SC volume or cross-sectional area (CSA) on the basis of magnetic resonance imaging (MRI) have improved in reliability by the use of semiautomated and fully automated techniques largely replacing time-consuming manual outlining of the cord (4,5). Some of these techniques have already been established in the field of MS, analyzing large cohorts of patients (6–12).
Most of the studies concentrated on CSA measurements in the upper cervical regions including individual levels at the C2/C3 vertebra (6) or larger volumetric regions (9,13), while a few studies used quantification of the entire cervical cord (5,10). Furthermore, it has been shown that cervical SC atrophy can also be quantified using 3D T1-weighted (3D T1w) brain MRI scans that include the upper cervical cord region as well as by using T2-weighted and other MRI sequences and combinations of image contrast (8,13–15). For a current meta-analysis of MRI-based SC quantification studies in MS, see Casserly et al. (2). Still, the choice of the specific cord level can influence results due to physiological variability in the cord area or disease-related patterns of atrophy along the cervical cord (10,16). Moreover, recent comparative studies have shown that the CSA estimates can differ systematically between the various software methods (11,17). Hence, harmonization of analyses remains challenging. Multicentric studies performed under harmonized conditions have been increasingly used to explore systematic differences caused by MRI scanners, image acquisition methods (pulse sequences and parameter settings), and volumetric analyses (11,18–21). But so far, no common agreement on methods for SC atrophy quantification has been established.
In this multicenter single-subject study, we aim at identifying a common and reliable regional level of CSA measurement by comparing different software methods with respect to variability and systematic differences at different cervical cord levels using brain and cord MRI. We intend to widen the scope of previous comparable studies by including other software tools, more SC levels, and a larger number of different scanners (15,22). The study design reflects a real-world multicenter scenario with harmonization of the MRI protocol for sagittal isotropic 3D T1w brain and cervical cord imaging with respect to image resolution and contrast, but without the exact specification of the sequence design and timing. We quantitatively compare three established software methods for CSA assessment [NeuroQLab (NQL), Active Surface Model (ASM), and Spinal Cord Toolbox (SCT)] in terms of reliability at three different cervical cord levels: at the upper SC at the vertebral levels C1-C2, at the level of the vertebral disc C2/3, and at the entire cervical SC (C1-C7). In this process, we compare high-resolution 3D T1w MRI of the brain and cervical SC of six European centers, one set for covering the whole brain and the upper part of the cervical SC and one specifically optimized for the SC.
In addition, the current work provides a publicly available reference MRI dataset to the research community, containing single-subject, multicenter, repeated volumetric acquisitions of the brain and cervical SC in a healthy volunteer.

MATERIALS AND METHODS MRI
A single healthy volunteer (male, age 45 years at first scan) underwent multiple repeated MRI scans at six different European centers with high expertise in MRI for MS. The scans at a given site were acquired during a single visit. At all scanners, the time of examination was in the afternoon between 3:00 and 7:00 p.m., thus minimizing the effects of potential daytime-dependent fluctuations of the SC volume. The acquisitions at five of the centers took place between March 2015 and February 2016 and at center 6 in January 2017. Given the limited time period and the age of the healthy volunteer, we assumed that the true CSA could be considered stable between the scans given the slow cervical cord atrophy rate in healthy controls [e.g., (10)].
The participant gave written informed consent to take part in the repeated MRI acquisitions at different centers both for the use of the anonymized MRI data for scientific purposes within the scope of the present study and for sharing the data with the research community. At each site, the MRI acquisitions reported in the present study were conducted after signed informed consent of the volunteer and with the approval of ethics boards of the involved institutions.

MRI Acquisition
Scanning included a sagittal isotropic 3D T1w sequence of the brain covering the upper cervical cord including at least the C1-C3 vertebral levels, which was followed by acquisition of a sagittal isotropic 3D T1w sequence of the entire SC using combined neurovascular head and neck matrix coils. Both sequences were acquired three times with repositioning of the subject in the scanner between the second and third examination in order to incorporate potential effects of different positioning within the magnet. Image acquisition followed a consent protocol within the MAGNIMS consortium, which was standardized for geometry requiring isotropic image resolution of 1 × 1 × 1 mm³, and accepted magnetization-prepared, high-contrast 3D T1w gradient-echo imaging according to the local expertise and the specific software/hardware available in each of the participating centers. Thus, the study design was intended to reflect a real-world multicenter scenario without the exact specification of the sequence design and timing. All scans were performed using 3-Tesla scanners (one General Electric, three Philips, and two Siemens). All sites used the vendor-specific 3D distortion correction procedures to correct for non-linear gradient distortion effects (23). Detailed imaging parameters for each sequence at each site are provided in Table 1.
The anonymized MRI datasets of brain and cervical cord MRI used in this study are freely available for scientific research upon request to MAGNIMS (https://www.magnims.eu/magnims-cord-dataset/).

Cross-Sectional Area Assessment
For CSA measurements, three different software methods were used that had previously been found to be reliable in large MS cohorts (7–9). The first approach consisted of the fully automated deformable model method PropSeg, freely available with the SCT (5) (version 4.1; https://sourceforge.net/projects/spinalcordtoolbox/). In addition to this fully automated approach, two semiautomated methods requiring manual interaction were chosen: the ASM, available with costs with the Jim Software package (JIM, v. 7.0, Xinapse Systems, Colchester, UK; www.xinapse.com) (4,7), and the watershed-segmentation method available with NQL (Fraunhofer-Mevis, Bremen, Germany; license freely available for research purposes upon request from Fraunhofer-Mevis) (24,25).

Spinal Cord Toolbox, "PropSeg"
SCT features specific segmentation tools for the SC. The segmentation algorithm PropSeg is based on an iterative propagation of a deformable model with an adaptive contrast mechanism (5,26). Automated detection of the center of the SC is done by ellipse detection and information from the body symmetry, followed by propagating a tubular surface along the SC edge using deformable models. SCT has been applied in studies on MS patients and has been shown to be highly reproducible (17). We used SCT version 4.01 with default settings.

Active Surface Model
The ASM, which is implemented as the cord finder tool in the JIM software package, requires interactive marking of the center of the cord on a regular distance along several vertebral levels to be included in the analysis (4). Cord center line and cord outlines at each slice are then calculated using a segmentation algorithm with a steadily increasing refinement of the ASM. This allows a rapid semiautomated segmentation by measuring the cord CSA along the length of the extracted surface parameter. We used the cord finder tool included in JIM version 7.0, with the following settings: nominal cord diameter setting, 10 mm; number of shape coefficients, 18; order of longitudinal variation, 5. ASM has been shown to be highly reproducible, and the method has already been used in cross-sectional and in longitudinal MS studies (7,10,27,28).

NeuroQLab
NQL requires the user first to interactively define the section of the cord to be analyzed by placing an oblique plane through the dataset, which runs through the upper and lower end of the section (24). This is aided by two perpendicular lines, which allow the section to be aligned precisely with the specific vertebral bodies. This step is followed by a semiautomatic pre-segmentation using a watershed transformation of the pixel intensities. Subsequently, a fully automated model-based volume measurement is performed by fitting the intensity distribution of the pre-segmented input region using a Gaussian mixture model. The SC volume is modeled using a Gaussian mixture of two tissue classes [spinal cord tissue and cerebrospinal fluid (CSF)] and a separate class representing partial volume voxels. The volume is calculated by summation of the SC tissue class volume and half of the volume of the partial volume class. The center line of the SC is calculated and used to determine the mean CSA by normalizing the measured volume to the section length. The operator can correct the final results interactively. NQL has been shown to be highly reproducible (24,25), and the method has already been used in cross-sectional and in longitudinal studies of different neurological diseases including MS (8,15,22,29–31).
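The volume-to-area step described above can be sketched in a few lines. This is only an illustration of the arithmetic, not NQL's implementation: the function name, the voxel counts, and the section length are hypothetical, and the tissue classification itself (watershed pre-segmentation and Gaussian mixture fit) is assumed to have been done already.

```python
def mean_csa_mm2(n_cord_voxels, n_partial_voxels, voxel_volume_mm3, section_length_mm):
    """Mean CSA of a cord section from tissue-class voxel counts.

    Volume = pure cord class + half of the partial-volume class,
    normalized to the length of the analyzed section (center line).
    """
    if section_length_mm <= 0:
        raise ValueError("section length must be positive")
    volume_mm3 = (n_cord_voxels + 0.5 * n_partial_voxels) * voxel_volume_mm3
    return volume_mm3 / section_length_mm


# Hypothetical counts for a 30-mm section imaged at 1 x 1 x 1 mm^3.
example_csa = mean_csa_mm2(2100, 600, 1.0, 30.0)
print(f"mean CSA: {example_csa:.1f} mm^2")
```

Counting partial-volume voxels at half weight means that the result depends on how many edge voxels the classifier assigns to that class, which is one reason why very short sections with few voxels may be less stable.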

Similarities and Differences Between the Software Methods Regarding Partial Volume Effects and Handling of Cord Curvature
The definition of the cord contour and rules for inclusion or exclusion of voxels located on the edge between cord and CSF may account for a substantial proportion of the segmented volume, given the limited image resolution and the small diameter of the SC. This aspect is handled differently between the software methods and may contribute to systematic differences between the CSA estimates generated with different algorithms (11,17). While the SCT algorithm includes only voxels that are classified as "pure" cord tissue into the segmentation, without referring to partial volume effects at the margin of the cord contour, NQL takes voxels at the edge of the cord into account that are affected by partial volume effects by including pure cord tissue voxels and 50% of the partial-volume tissue class in the CSA calculation (17,24). Similarly to NQL, the cord segmentation of ASM includes a fraction of those voxels subject to partial volume effects between the cord and the surrounding CSF because the cord surface definition in ASM is partly controlled by seeking high-intensity gradients (4). Additionally, effects of the cervical cord curvature are treated differently between NQL and ASM or SCT. While ASM and SCT are optimized with regard to variations of the cord curvature (4,5), NQL quantifies the cord volume between two parallel oblique planes and uses the center line merely for calculation of the mean cord area from the segmented volume (24). Thus, the CSA estimations by NQL might differ from the corresponding ASM or SCT results depending on the degree of curvature of the SC or the exact choice of the cord section to be analyzed.

Cross-Sectional Area Measurements at Different Cord Levels
CSA measurements were performed in different sections of the cervical cord, which were defined by anatomical markers. In all measurement setups, CSA represented the mean cross-sectional cord area within the chosen cord segment. Sections were chosen according to previously published procedures (7,17,25) and to match the procedural performance of the included software methods. For both head and SC acquisitions, CSA was measured at three different cord sections: at the level C1-C2, across the entire cervical cord from C1 to C7, and at the level of the intervertebral disc between C2 and C3 within a single slab of 3-mm thickness. Figure 1 illustrates which levels were investigated.
In the semiautomated methods (ASM and NQL), the C1-C2 and the C1-C7 sections were manually defined using the top of the dens and the endplate of the corresponding caudal vertebra (C3 or C7) as anatomical references for the upper and lower boundaries, respectively (7,15,25). The C2/C3 sections in ASM and NQL were manually defined by marking the level of the bottom of the C2 vertebral body as the upper boundary and including a 3-mm section caudally.
In SCT, the C1-C2 and C1-C7 sections were defined according to the automated vertebral labeling of the cord, which is part of the SCT algorithms (17). Since calculation of CSA in an arbitrary cord section was not provided in SCT, we determined CSA at the C2/C3 level based on the quantitative output reports of the software: the most caudal slice that was assigned to the C2 vertebra was manually determined from the SCT output, and the averaged CSA of three caudally adjacent slices (slice thickness 1 mm) was calculated.
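The C2/C3 workaround for SCT amounts to a small averaging step over per-slice values. The sketch below assumes a hypothetical export of per-slice CSA and vertebral labels ordered from cranial to caudal with 1-mm slices; the function name and data layout are illustrative and do not correspond to an actual SCT output format.

```python
def csa_c2c3(slice_csa_mm2, slice_level, n_slices=3):
    """Average CSA of the n_slices directly caudal to the last C2 slice.

    slice_csa_mm2: per-slice CSA values, ordered cranial to caudal.
    slice_level:   vertebral label per slice (1 = C1, 2 = C2, ...).
    """
    c2_indices = [i for i, level in enumerate(slice_level) if level == 2]
    if not c2_indices:
        raise ValueError("no slice is labeled as C2")
    start = max(c2_indices) + 1  # first slice below the most caudal C2 slice
    window = slice_csa_mm2[start:start + n_slices]
    if len(window) < n_slices:
        raise ValueError("not enough slices caudal to C2")
    return sum(window) / n_slices


# Hypothetical per-slice values (mm^2) and labels.
levels = [1, 1, 2, 2, 3, 3, 3, 3]
areas = [70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0]
c2c3_csa = csa_c2c3(areas, levels)
print(f"C2/C3 CSA: {c2c3_csa:.1f} mm^2")
```

With 1-mm slices, three slices correspond to the 3-mm slab used for the semiautomated methods.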
The results of all methods were visually inspected for segmentation errors, false vertebra labeling (in case of SCT), or image artifacts. Two SCT measurements were excluded from the analyses due to erroneous segmentations. The C1-C7 CSA measurements of one center (site 6) were corrupted by infolding artifacts of the shoulders. We excluded these C1-C7 CSA estimates for all software methods. Only for this center, SCT segmentations in the C1-C2 and C2/3 sections were initiated by choosing the C2/C3 vertebral disc as a starting point for the segmentation to avoid PropSeg starting in the (lower) part of the image that contains image artifacts.

Image Contrast Assessments
To compare contrast-to-noise ratios of cord tissue to CSF between the centers, the mean and standard deviations of signal intensities within regions of interest (ROI) were determined in the T1w MRI images. We placed ROI in the cervical cord at the C1-C2 level, the C2/3 level, the C5-C7 vertebral level, and in regions of the adjacent CSF using standard diagnostic image viewer tools. For this purpose, we used ovoid size-adapted contours placed well within the cord or CSF, each at the height of the level in question. The size of the ROIs was adjusted to cover as large an area as possible excluding the interface between cord tissue and CSF. For the C1-C2 cord level, the corresponding CSF ROI was placed within the cerebellomedullaris cistern, while for the C2/C3 and C5-C7 cord levels, the corresponding CSF ROIs were positioned in adjacent areas between the cord and the vertebral body (details are shown in Supplementary Figure 1 in the electronic supplement). For the ROIs at the C1-C2 and C2/C3 levels (brain and cord MRI) and the C5-C7 level (cord MRI only), we calculated the contrast-to-noise ratio per unit of time (CNR_UT) between cord and CSF, controlling for differences between sites regarding the acquisition duration of the T1w sequences (32). In this ratio, signal is the mean image intensity within the ROI, SD is the standard deviation, and t is the acquisition duration of the MRI sequence (min).
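A time-normalized contrast-to-noise ratio of this kind could be computed as in the following minimal sketch. The exact expression used in the study is not given here, so the form below is an assumption: the signal difference between cord and CSF is divided by the combined ROI standard deviations and by the square root of the acquisition duration. The function name and all input values are hypothetical.

```python
import math


def cnr_per_unit_time(mean_cord, sd_cord, mean_csf, sd_csf, duration_min):
    """Assumed form: (S_cord - S_csf) / (sqrt(SD_cord^2 + SD_csf^2) * sqrt(t))."""
    noise = math.sqrt(sd_cord ** 2 + sd_csf ** 2)
    if noise == 0 or duration_min <= 0:
        raise ValueError("noise and duration must be positive")
    return (mean_cord - mean_csf) / (noise * math.sqrt(duration_min))


# Hypothetical ROI statistics (arbitrary intensity units) and a 4-min sequence.
example_cnr = cnr_per_unit_time(400.0, 30.0, 100.0, 40.0, 4.0)
print(f"CNR per unit time: {example_cnr:.2f}")
```

Dividing by the square root of the scan time reflects the general rule that noise averages down with the square root of acquisition time, so a longer sequence is not credited for contrast it gains from duration alone.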

Statistical Analysis
Statistical analyses were performed using the software package SPSS (IBM, SPSS V. 25). Results were considered statistically significant when associated with p < 0.05.

Mean Maximum Observed Difference and Repeatability of CSA
To describe the smallest difference between two CSA measurements that could be detected in this multicenter setting with different scanners [named mean maximum observed difference (MMOD)], we determined the maximum of the absolute differences between the CSA results of the three repeated scans and averaged it across all centers (avg. max. abs. scans). The MMOD was calculated separately for each software method and cord level in brain and SC MRI.
Additionally, the repeatability of the CSA measurements using the different software methods, cord levels, and brain or SC MRI was assessed by calculating the coefficient of variation (CV) of the three repeated scans using the formula CV = SD_repetitions / mean and averaging the results over all centers.
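Both repeatability measures are simple to compute from the three repeated CSA values per site. The sketch below illustrates the two definitions with hypothetical numbers; the site names and values are invented, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not specify it.

```python
from itertools import combinations
from statistics import mean, stdev


def max_abs_difference(scans):
    """Largest absolute difference between any two repeated scans."""
    return max(abs(a - b) for a, b in combinations(scans, 2))


def coefficient_of_variation(scans):
    """CV in percent: sample SD of the repetitions divided by their mean."""
    return 100.0 * stdev(scans) / mean(scans)


def summarize(csa_by_site):
    """Return (MMOD, mean CV) averaged across sites."""
    mmod = mean(max_abs_difference(s) for s in csa_by_site.values())
    mean_cv = mean(coefficient_of_variation(s) for s in csa_by_site.values())
    return mmod, mean_cv


# Hypothetical CSA values (mm^2) for three repeated scans at two sites.
csa_by_site = {"site_a": [70.0, 71.0, 72.0], "site_b": [80.0, 80.0, 83.0]}
mmod, mean_cv = summarize(csa_by_site)
print(f"MMOD = {mmod:.2f} mm^2, mean CV = {mean_cv:.2f}%")
```

In the study, such a summary would be produced separately for each software method, cord level, and acquisition type.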
Since the requirements of normal distribution and equality of variances of CSA within the group variables software methods, cord levels, and acquisition type (brain or SC MRI) were not met, we used non-parametric tests throughout the statistical analyses.
We checked for within-subject differences between the scan repetitions by using Wilcoxon signed rank tests for paired samples for the comparison of scan no. 1 with scan no. 2 (simple repetition) and scan no. 2 to scan no. 3 (repetition after repositioning of the healthy volunteer). By testing separately for each software method and cord level in brain and SC MRI, we detected no significant differences between the scan repetitions. Therefore, we did not include the scan repetition number as a factor in further statistical analyses. Given our atypical study design with only one subject, we rather handled the scans as independent measurements.

CSA Differences Between Methods
We assessed group differences of CSA between the software methods separately for MRI acquisition type (brain or cord) and cervical cord level (C1-C2, C2/3, C1-C7) by using Kruskal-Wallis tests. Post-hoc pairwise comparisons included Dunn's posthoc tests with Bonferroni adjustment for multiple comparisons. In these analyses, the CSA estimates were aggregated across centers and scan repetitions.
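For readers unfamiliar with the rank-based comparison, the following sketch computes the Kruskal-Wallis H statistic and a Bonferroni adjustment in plain Python. It is illustrative only: the analyses in this study were run in SPSS, the sketch omits the tie correction, the p-value lookup, and Dunn's test, and the example groups are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H statistic without tie correction."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical CSA-like values for three methods.
h_stat = kruskal_wallis_h([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
adjusted = bonferroni([0.01, 0.02, 0.5])
print(h_stat, adjusted)
```

Under the null hypothesis, H is compared against a chi-square distribution with (number of groups − 1) degrees of freedom.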

Differences Between Brain and Cord MRI
Group differences between brain and cord MRI of CSA and group differences of the CV of CSA were assessed separately for the software methods and the C1-C2 or C2/3 level using Wilcoxon signed rank tests for paired samples. Therein, the CSA estimates were aggregated across centers and scan repetitions.

Between-Center Agreement
We investigated differences of CSA between sites at the C1-C2 or C2/3 cord levels separately for each software method while aggregating measurements of brain and cord MRI (N = 6 measurements each). We applied Kruskal-Wallis tests with post-hoc comparisons between pairs of sites adjusted for multiple comparisons with Dunn's post-hoc tests and Bonferroni adjustment.

Intensity and Contrast Assessment
We assessed group differences of the contrast-to-noise ratio of cord and CSF between the centers with Kruskal-Wallis rank tests. Post-hoc pairwise comparisons used Dunn's post-hoc tests with Bonferroni adjustment for multiple comparisons.

Image Quality and Contrast in Different MRI Scanners
In total, 36 datasets, i.e., 18 pairs of brain (covering the upper cervical SC) and dedicated SC scans, were acquired in six centers using 3-Tesla MRI scanners of different vendors (Table 1). The SC MRI acquisition of site 6 included infolding artifacts in the caudal part of the images, so intensity measurements in these areas were omitted for site 6. Apart from that, visual inspection showed similar image quality and typical contrast settings of the sagittal, isotropic T1w sequences of brain and cervical cord across sites (Figure 2); however, intensity measurements (for cord imaging and brain imaging at the C1-C2 and C2/3 levels and for cord imaging at the lower vertebral levels C5-C7) revealed in all centers decreasing signal intensity toward the caudal parts of the images when comparing the different levels (Table 2). Moreover, the achieved contrast-to-noise ratio per unit of time between cord and CSF (CNR_UT) differed significantly between the centers. When comparing the cord levels within each scanner, CNR_UT was similar between the C1-C2 and C2/3 cord levels in brain and cord MRI, while there was lower CNR_UT in the caudal parts of the images (C5-C7 cord level). In particular, for the dedicated cord acquisitions, the CNR_UT was higher in the Philips scanners (nr. 1, 2, 4) than that in the Siemens and GE scanners. Details are shown in Table 2.
FIGURE 2 | Image quality of 3D T1-weighted brain and spinal cord sequences at six 3-Tesla MRI scanners, acquired in the same healthy volunteer at all sites (age 45 years at first examination; male).

Cross-Sectional Area Quantification
As a first step, we compared the CSA results aggregated across all scanners at different measurement settings (software, cord level, cord or brain MRI) according to a multicenter scenario with different scanner types. Two CSA estimates using SCT at the C2/3 level and three measurements at the C1-C7 level were excluded from the analyses due to erroneous segmentations and infolding artifacts, respectively (see section Cross-Sectional Area Measurements at Different Cord Levels). Table 3 shows mean and standard deviations of CSA (pooled across all centers) differentiated according to the software method at the different cord levels and to brain or cord MRI. Kruskal-Wallis tests between the software methods showed significantly lower CSA estimates with SCT than with either semiautomated technique at the C1-C2 cord level (brain MRI) and at the C1-C2 and C1-C7 levels (cord MRI) (all p < 0.001) (Figure 3). CSA results obtained with NQL and ASM were similar at the C1-C2 level (differences not significant), while at the C1-C7 level, CSA estimates obtained with ASM were lower than those obtained with NQL (p = 0.001), and CSA estimates at the C2/3 vertebral level were lower with NQL than with ASM (p < 0.001).

Inter-scanner Repeatability Depends on Cord Level, MRI Type, and Software
The repeatability of CSA across centers (inter-scanner) was best for NQL and ASM measured over the entire cervical cord (C1-C7), reflected by low CV of 0.4 and 0.6% and low MMOD of 0.6 and 0.9 mm² (Table 3). Regarding the cord MRI acquisitions, the MMOD and CV across all software methods were only slightly higher at the C1-C2 vertebral levels compared to those at the C1-C7 level, while the CV and MMOD for measurements at the C2/3 level were considerably worse than those at the C1-C7 level (Table 3, Figure 4).
The repeatability was similar for the semiautomatic methods (ASM and NQL) using brain or cord MRI at all levels, except for the C2/3 level in cord MRI, where ASM had a higher MMOD and CV than those for NQL. For the SCT method, the MMOD and CV were higher than those for NQL and ASM at the C1-C2 level and in the entire cervical cord. At the C2/3 level, the repeatability of SCT was superior to that of ASM using cord acquisitions and superior to that of both semiautomatic quantification techniques (NQL, ASM) using brain acquisitions (Table 3, Figure 4).

Comparing Cross-Sectional Area Results in Brain and Cervical Cord MRI
CSA results based on brain acquisitions (aggregated across all centers) were significantly smaller than those from dedicated cord acquisitions at the same cord levels, except for the SCT method at the C2/3 level (Figure 5; CSA group differences assessed by Wilcoxon signed rank tests for paired samples). Still, the variability of CSA (expressed as CV) was not clearly higher when using the brain MRI scans than in dedicated cord scans (Table 3, Figure 4): the differences between brain and cord MRI of the CV of CSA were not statistically significant in any of the software methods at the C1-C2 or at the C2/3 level (all p > 0.050 using Wilcoxon signed rank tests for paired samples).

Scanner Dependencies
In a further analysis, we investigated the differences between CSA results of brain and cord MRI separately for each center. Table 4 shows CSA for brain and cord MRI at the C1-C2 and C2/C3 levels, estimated with the different software methods, differentiated with respect to scanner type. We observed different grades of deviation between brain and cord MRI results within and between centers (ranges: −4.4 to 0.4 mm² at the C1-C2 level and −4.7 to 0.2 mm² at the C2/3 level), with best agreement at both cord levels and for all software methods in the GE scanner (site nr. 5) and the Siemens scanner (site nr. 6), which both have a long magnet design (198 cm). In the combined results from all scanners, the mean differences between brain and cord scans ranged between −2.0 and −0.9 mm² at the C1-C2 level and between −1.4 and −0.8 mm² at the C2/C3 level, depending on the software method used. Specifically, the brain-cord acquisition differences at the C1-C2 level seemed larger when using SCT compared to those when using NQL or ASM, while they were smaller for SCT compared to those for ASM and NQL at the C2/3 level.

Consistency of Cross-Sectional Area Results Between the Centers
Between-center agreement of CSA at the C1-C2 or C2/3 cord levels was assessed separately for each software method using Kruskal-Wallis test with post-hoc comparisons between pairs of sites adjusted for multiple comparisons by aggregating measurements of brain and cord MRI (N = 6 measurements at each center). The results are depicted in Figure 6. At the C1-C2 cord level, we observed significantly higher CSA in site 6 compared to those in site 3 and site 4 when using NQL and also compared to those in center 1 and center 2 for the ASM method. There were no other center differences (all p > 0.050) for ASM and NQL at the C1-C2 level. Overall, there were no significant inter-center differences for SCT at both cord levels and for NQL and ASM at the C2/3 cord level.

DISCUSSION
Quantification of CSA has gained increasing attention over the past years, and techniques assessing CSA have improved in terms of robustness and reproducibility. However, to be successfully established in multicentric MS studies, CSA quantification still lacks harmonized procedures, e.g., agreement on a common vertebra level region to be measured.
In the present traveling volunteer study, we studied three popular fully automated and semiautomated techniques for CSA assessment (SCT, NQL, ASM) at different cervical cord regions derived from brain and SC scans of a single healthy volunteer scanned at six different European MS centers in order to propose a common cord level for reliable CSA assessment. Our results were in agreement with the findings of recent studies that showed good concordance between results of brain and SC MRI using the NQL method in the upper cervical cord and extended these assessments by including other software tools and SC levels and a larger number of different scanners (15,22).
The CSA results were dependent on the software used and, as expected, on the cord region included in the evaluation. Repeatability, especially across centers, was best when scanning the entire cervical cord. Agreement was better between techniques of the same type (e.g., semiautomated) than between techniques of different types (e.g., semiautomated and fully automated), as we observed lower CSA values when using a fully automated approach regardless of the vertebra level used.

Absolute Cross-Sectional Area Results Depend on the Evaluation Software and Cord Level
In concordance with a recent study (11), absolute results of the automatic segmentation with SCT (PropSeg) at the upper portion of the cervical cord were systematically lower than CSA results assessed by the semiautomated methods ASM or NQL. In the present study, we show as a new finding that the mean CSA results obtained from the entire cervical cord were also significantly lower when using SCT than when using ASM or NQL. Those systematic differences between SCT and the semiautomated methods, independent of the specific cervical region, are probably due to the different ways in which the algorithms define the contour of the cord and assign voxels at the edges of the cord as belonging or not belonging to the cord. Since ASM and NQL take partial volume effects into account, while SCT does not, this shared handling of edge voxels could be the reason for the overall consistency between CSA derived from NQL and ASM: the CSA estimates of NQL and ASM were in good agreement when assessing the C1-C2 vertebral level, while mean CSA results in the entire cervical cord assessed using ASM were lower than the NQL results.
FIGURE 5 | Cross-sectional area (CSA) across all centers based on brain or cord MRI using three different software methods at the C1-C2 or the C2/C3 vertebral level. Significance of group differences assessed using univariate ANOVAs with CSA as dependent variable and MRI acquisition type and scan repetition number as fixed factors.
Since CSA varies in the cranio-caudal direction across the cervical SC, the exact choice of the cord level has a major influence on the absolute CSA results (Figure 4C) (16,33). Obviously, calculation of mean CSA over different segments involves averaging some of that variation, so subtle differences between cord levels may be lost, although noise is reduced as measurement takes place across a larger volume. Additionally, the methods might differ in how sensitive their segmentation quality is to image degradation in the caudal part of the cervical cord due to signal drop off (Table 2), potential geometrical image distortions, or the presence of emerging nerve roots. Given the high reproducibility of the NQL and ASM results at the C1-C2 and C1-C7 levels, the differences in the absolute results when quantifying the entire cervical cord might be due to different susceptibilities of the methods to the pitfalls mentioned above when segmenting a long cord section. Additionally, effects of the cervical cord curvature, which is more pronounced for the entire cervical cord than in the smaller upper cervical cord section, are accounted for in the ASM method, but not in NQL. Thus, the CSA results in NQL might be slightly overestimated when measuring the entire cervical cord due to varying volume contributions at the upper and lower boundary of the cord section. Furthermore, the NQL method seemed less suited than ASM or SCT to quantify the single 3-mm cord section at the C2/3 level, where NQL measured considerably lower CSA values than the ASM method using both brain and cord MRI (Figure 3, Table 3). NQL uses an intensity-based Gaussian mixture model for automatic tissue classification of the cord and the surrounding CSF, which typically requires a large number of voxels and hence might lose precision when applied to very small volumes like the 3-mm section at the C2/3 level that was investigated in the present study.
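The trade-off between noise reduction and cord section length can be illustrated with a minimal simulation sketch. It is not part of the study's analysis: the true CSA of 70 mm², the per-slice noise of 2 mm², and the slice counts (3 slices for a short section, 100 for a long one) are hypothetical values chosen for illustration, and the per-slice errors are assumed to be independent, which real adjacent slices are not. The sketch also ignores the true cranio-caudal variation of CSA.

```python
import random
import statistics


def simulate_mean_csa(n_slices, true_csa=70.0, noise_sd=2.0, rng=None):
    """Mean CSA over n_slices, each slice carrying independent Gaussian noise.

    All default values are hypothetical and only serve the illustration.
    """
    rng = rng or random.Random()
    return statistics.mean(
        true_csa + rng.gauss(0.0, noise_sd) for _ in range(n_slices)
    )


def spread_of_means(n_slices, n_repeats=2000, seed=0):
    """Standard deviation of the mean CSA over many simulated repeat scans."""
    rng = random.Random(seed)
    means = [simulate_mean_csa(n_slices, rng=rng) for _ in range(n_repeats)]
    return statistics.stdev(means)


# Hypothetical section lengths: a short section vs. a long cord section.
short_section = spread_of_means(n_slices=3)
long_section = spread_of_means(n_slices=100)

print(f"SD of mean CSA, short section: {short_section:.2f} mm^2")
print(f"SD of mean CSA, long section:  {long_section:.2f} mm^2")
```

Under these assumptions the spread of the mean shrinks roughly with the square root of the number of slices, which is in line with the lower variability observed for longer cord sections.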

Comparing Repeatability Between the Software Methods and Cord Levels
In the multicentric analysis across the six centers, the repeatability was best when assessing the entire cervical cord (CV ≤ 1.0% for all software methods) and slightly worse in the upper portion of the cervical cord (Table 3). Furthermore, in these cord sections, the variability of the results was higher with the SCT method (CV ≤ 1.6%) than with NQL (CV ≤ 0.9%) or ASM (CV ≤ 0.7%), reflecting a reduced repeatability of the automated software method compared to the semiautomated segmentations in these measurement settings.
Segmentation at the C2/3 cord level, which covers only a small 3-mm cord section [comparable to the classical method proposed by Losseff et al. (6)], led to marked increases of the variability in all methods compared to measurements at the C1-C2 or C1-C7 levels (Table 3, Figures 3A,B). While segmentation of small cord sections might theoretically be advantageous when assessing local changes, its sensitivity to image inhomogeneity and partial volume effects probably leads to marked variability of the results, thus making this method less feasible for use in larger studies. Nevertheless, the SCT software method seemed to be more robust when looking at a very small cord section (CV ≤ 2.0%) than NQL or ASM (CV ≤ 2.6 and ≤2.5%, respectively).
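For readers who wish to reproduce this kind of comparison on their own data, the following minimal sketch shows how a coefficient of variation can be computed from repeated CSA measurements. It assumes the common definition (sample standard deviation divided by the mean, in percent); the exact computation used in this study is described in the Methods and may differ in detail. The site names and CSA values are hypothetical and are not taken from Table 3.

```python
import statistics


def coefficient_of_variation(values):
    """Return the CV in percent: sample SD divided by the mean, times 100."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined for a mean of zero")
    return 100.0 * statistics.stdev(values) / mean


# Hypothetical CSA values (mm^2) from three repeated scans at each site.
csa_by_site = {
    "site_A": [70.1, 70.4, 69.9],
    "site_B": [71.0, 70.6, 70.8],
    "site_C": [69.5, 69.8, 69.6],
}

# Intra-site CV: variability across the repeated scans within each site.
intra_site_cv = {
    site: coefficient_of_variation(values)
    for site, values in csa_by_site.items()
}

# Inter-site CV: variability of the per-site mean CSA across sites.
site_means = [statistics.mean(values) for values in csa_by_site.values()]
inter_site_cv = coefficient_of_variation(site_means)

for site, cv in intra_site_cv.items():
    print(f"{site}: intra-site CV = {cv:.2f}%")
print(f"inter-site CV = {inter_site_cv:.2f}%")
```

Separating intra-site and inter-site CV in this way is one possible design; it keeps scan-rescan noise apart from scanner-dependent offsets.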
In addition to these cord section-dependent effects, the variability of CSA values in a multicentric analysis can partly result from different image qualities between the centers, as reflected by differences of the cord-to-CSF contrast (Table 2). Although the consensus on MRI protocols in the network of participating centers aimed at homogeneous spatial resolution and contrast features for brain and SC MRI, subtle differences in sequence details related to scanners and vendors remained. Enhanced cord-to-CSF CNR in cord MRI achieved by certain vendors compared to others can be due to differences in echo time (TE), inversion time (TI), and repetition time (TR) protocol settings, but also in scanner and coil design (Table 2, Figure 2). Such scanner-dependent contrast differences between cord and surrounding CSF could lead to changes in the partial volume effects at the boundary of the SC, with different effects on the segmentation results of the software methods. Accordingly, we found varying CSA differences between pairs of scanners, which differed between the three software methods (Figure 6). These findings underline the need for careful protocol and contrast harmonization between centers in multicentric studies, since inter-center differences based on different scanners and protocols seem to limit the detectability of disease-related CSA changes.

Comparing Brain and Spinal Cord MRI
Comparison between brain and cord CSA estimates at the same levels showed slightly lower CSA for brain MRI with all software methods. Different factors can contribute to differences in the CSA results of brain and SC MRI. One is gradient non-linearity distortion, which may become relevant at the edges of the magnet, especially in modern short-bore scanners. This may particularly influence the CSA quantification based on brain MRI, since the upper cervical cord is located off-center in the sagittal images, at the periphery of the field of view. A thorough analysis of these effects and possible ways of compensating or avoiding their impact on CSA measurements has recently been published (23), which showed that non-linear gradient distortions lead to lower CSA results when quantifying on brain MRI. Our finding of lower CSA on brain MRI compared to cord MRI confirmed these results (Figure 5, Table 4). Scanners with long-bore magnets should be less susceptible to these effects. Differing degrees of deviation between brain and SC results across the sites might partly be due to the different magnet types and gradient systems provided by the different vendors.

Limitations
This study considered only one healthy traveling volunteer. A limitation of the study results is given by reduced SC MRI image quality due to infolding artifacts in one of the centers, leading to a slightly reduced statistical power of the SC MRI evaluations. Furthermore, it would have been preferable to have more than one healthy participant in this reproducibility study. The comparability of the results and the reproducibility in the different measurement constellations could have been further improved if procedures were used to ensure that the evaluation takes place in exactly the same regions (for example, by transferring binary segmentation masks from one method to the other). In this study, we have limited the evaluations to using the individual methods "as is" and to aligning the regions only on the basis of anatomical markers. We think that this corresponds more to the typical situation in larger studies where a decision has to be made for a certain method.
Furthermore, stricter standardization of MRI protocols than applied in this study would probably lead to a reduction in variability of CSA between scanners in multicentric studies. Recently, a fully harmonized examination protocol for different scanner vendors, including sagittal 3D T1w imaging and other sequences for quantitative examination of the SC, was made freely available to the research community (the spine generic protocol, https://spinalcordmri.org/protocols). This generic SC protocol has successfully been implemented in 42 MRI centers worldwide in order to generate a harmonized multi-subject dataset (34). Future multicentric studies on CSA quantification should adopt this approach.

Conclusions and Recommendations for the "Optimal" Cervical Cord Section and Software Method for Cross-Sectional Area Assessment
Aiming at optimal reproducibility of the compared methods, dedicated isotropic SC MRI using 3D acquisition should be acquired whenever possible, since repeatability was best when scanning the entire cervical cord (Table 5). Nevertheless, CSA quantification of only the upper part of the cervical cord, even based on brain MRI, seemed to entail only a minor loss of repeatability and comes with major advantages in terms of acquisition time and patient comfort. Thus, if lengthy brain imaging protocols are used and the additional acquisition of dedicated SC MRI is not feasible, CSA quantification at the C1-C2 cord level on sagittal 3D T1w brain MRI (using a combined head and neck coil) can be used to achieve reliable CSA results.
Looking at disease-related changes in the cervical cord, which was not part of the present study, may lead to a different point of view. Quantification involving the entire cervical cord means averaging over processes that may be focused to certain cord regions, and good reproducibility may thus be traded off for less sensitivity to subtle changes. Recent studies have shown that cord atrophy in MS especially involves the upper cervical cord level (7,15,25,26,35), so CSA quantification in the upper portion of the cervical cord, involving only a smaller cord interval, may be advantageous in clinical studies of MS patients. Recently, differences in the local patterns of cervical cord atrophy between the relapsing-remitting MS types and progressive forms have been shown, pointing to increasing involvement of the caudal cervical cord levels in the secondary progressive and primary progressive types of MS (10). As a consequence, the cord level for CSA evaluation in MS studies should be optimized with regard to the subtypes of MS patients.
Comparing the different software methods, the semiautomated methods NQL and ASM seemed to be similarly suitable and robust, and superior to SCT with regard to reproducibility. All three software methods performed similarly on brain and SC acquisitions. When analyzing large patient studies, the automated SCT method can be clearly advantageous with regard to analysis time (the semiautomated methods both take about 5 to 7 min for processing and handling a single dataset). The choice of the optimal method may depend on the number of patients included in the study. Furthermore, the use of SCT may be beneficial because it is freely available for scientific purposes, while NQL can be used upon request to Mevis, and ASM is distributed at a cost. Since the absolute CSA results of the different software methods and the different cord levels deviate considerably from each other, it is important to keep the acquisition and post-processing methodology identical within a study and to report these study details in publications. When comparing CSA results of different publications, absolute results should be regarded carefully, considering different evaluation methods. Longitudinal rates of change of CSA might be less sensitive to these methodological effects. On the other hand, longitudinal analyses can also entail specific problems that may affect the accuracy of CSA measurements. For example, the quality of patient repositioning or possible hardware or software changes between follow-ups have to be taken into account.
Center-dependent effects that have been detected in this traveling-volunteer reproducibility study have to be considered when pooling data from different centers. Differences in scanner geometry, coil design, and variability of image contrast have effects on CSA estimates and thereby limit the sensitivity to detect small disease-related cervical cord changes in multicentric studies.
Further longitudinal studies on MS patients and healthy controls to be acquired at different centers are warranted to further optimize cord levels and software tools for CSA quantification in multicentric studies.

DATA AVAILABILITY STATEMENT
The datasets analyzed for this study are freely available for scientific research upon request to MAGNIMS (https://www.magnims.eu/magnims-cord-dataset).

ETHICS STATEMENT
The study involving human participants was conducted in line with the International Conference on Harmonization Good Clinical Practice (ICH GCP) and was reviewed and approved by the local ethics boards of the involved institutions. At each site, the MRI acquisitions reported in the present study were conducted after the participant had given written informed consent to take part in the repeated MRI acquisitions at different centers, to the use of the anonymized MRI data for scientific purposes within the scope of the present study, and to sharing the data with the research community. The institutional review boards were 1) Ospedale San Raffaele; Basel; Switzerland: Ethics Committee Northwest and Central Switzerland (Ethikkommission Nordwest-und Zentralschweiz (EKNZ), Basel, Switzerland).

FUNDING
Parts of this work were funded by the German Federal Ministry for Education and Research, BMBF, German Competence Network Multiple Sclerosis KKNMS (Grant Nos. 01GI1601I and 01GI0914) and by grants from the UK MS Society. FP, CG, and MY were supported by the National Institute for Health Research (NIHR) University College London Hospitals Biomedical Research Center. The funding institutions did not interfere with the study design, the collection, analysis and interpretation of data, the writing of the report, or the decision to submit the article for publication.