A comparison of FreeSurfer-generated data with and without manual intervention

This paper examined whether FreeSurfer-generated data differed between a fully automated, unedited pipeline and an edited pipeline that included the application of control points to correct errors in white matter segmentation. In a sample of 30 individuals, we compared the summary statistics of surface area, white matter volume, and cortical thickness derived from edited and unedited datasets for the 34 regions of interest (ROIs) that FreeSurfer (FS) generates. To determine whether applying control points would alter the detection of significant differences between patient and typical groups, effect sizes for differences between individuals with the genetic disorder 22q11.2 deletion syndrome (22q11DS) and neurotypical controls were compared across edited and unedited conditions. Analyses were conducted with data generated from both a 1.5 tesla and a 3 tesla scanner. For 1.5 tesla data, mean area, volume, and thickness measures did not differ significantly between edited and unedited regions, with the exception of rostral anterior cingulate thickness, lateral orbitofrontal white matter, superior parietal white matter, and precentral gyral thickness. Results were similar for surface area and white matter volumes generated from the 3 tesla scanner. For cortical thickness measures, however, seven edited ROI measures, primarily in frontal and temporal regions, differed significantly from their unedited counterparts, and three additional ROI measures approached significance. Mean effect sizes for edited ROIs did not differ from those of most unedited ROIs for either 1.5 or 3 tesla data. Taken together, these results suggest that although the application of control points may increase the validity of intensity normalization and, ultimately, segmentation, it may not affect the final, extracted metrics that FS generates. Potential exceptions to and limitations of these conclusions are discussed.


INTRODUCTION
FreeSurfer (FS) is a freely available, fully automated brain image morphometric software package that allows for the measurement of neuroanatomic volume, cortical thickness, surface area, and cortical gyrification of regions of interest (ROIs) throughout the brain. FS was designed around an automated workflow that encompasses several standard image processing steps necessary to achieve a final brain parcellation within the subject's space; however, manual image editing is allowed after each stage to ensure quality control. The first stage performs skull stripping and motion artifact correction, the second performs gray-white matter segmentation (Fischl et al., 2002), and the third segments 34 ROIs based on anatomic landmarks (Desikan et al., 2006). Another critical function that FS provides is the ability to construct surface-based representations of the cortex, from which cortical thickness, neuroanatomic volume, and surface area can be derived. Manual measurement of the volumes of specific ROIs is an arduous, labor-intensive task, and is subject to inter-rater variability. FS offers consistency in its fully automated processing, which is ideal for either single- or multi-site studies with large sample sizes. In general, validation studies have demonstrated that FS can produce measurements that are comparable to those derived from manual tracing of brain regions (Fischl et al., 2002; Tae et al., 2008; Bhojraj et al., 2011). FS has also been shown to be a highly reliable method for automated cortical thickness measurements across scanner strength and pulse sequence in all regions of the brain, with minor variability being attributed to cytoarchitectural differences of certain ROIs and difficulties with surface reconstructions in temporal lobe regions (Han et al., 2006; Fjell et al., 2009).
However, strictly implementing the automated procedures in FS can result in variability in the accuracy of segmentation for some ROIs. For example, Cherbuin et al. (2009) showed that absolute hippocampal volumes measured with FS were significantly larger than those of manual tracings, with reported 23 and 29% overestimations of left and right hippocampal volumes, respectively. Closer inspection revealed that this was due to inclusion of surrounding high-intensity voxel structures as well as misidentification of pockets of cerebrospinal fluid as hippocampal tissue (Cherbuin et al., 2009). Other studies suggest that the temporal lobe and nearby regions are troublesome areas of the brain for FS to measure accurately (Desikan et al., 2006; Oguz et al., 2008). The presence of excess dura mater, or of closely adjacent temporal bone or cerebellum, can lead to inclusions that may affect volume measurement and ROI segmentation (Desikan et al., 2010). Moreover, some neuropathological conditions that lead to enlarged ventricles, such as normal pressure hydrocephalus or Alzheimer's disease, may affect the white matter segmentation steps and thus may increase the need to edit the FS images of patients with such conditions (Moore et al., 2012). Magnetic resonance (MR) imaging acquisition artifacts can also lead to over-inclusion of white matter.
Given the propensity of FS to include areas of the brain extraneous to the ROI, investigators have the option of interrupting the automated process and correcting its output. This can be done by editing the skull strip, by adding control points to correct the intensity normalization, by directly editing the white matter boundaries, or by a combination of these manual editing methods. These manual edits alter the white matter surface so that it more fully includes white matter structures and does not mistakenly segment gray matter or non-brain tissue as white matter. Manually editing the skull strip can ensure that it is more precise than the automatically completed procedure implemented by FS, and not affected by altered local anatomy in pathological states (Fennema-Notestine et al., 2006). This may improve the segmentation of white matter and reduce the number of control points needed at the next stage of quality-control intervention.
We reviewed 82 previous studies published primarily between 2006 and 2013 (see Table 1) that utilized FS, discovering a great deal of variability in the extent to which investigators utilized the skull stripping, control point, or white matter editing options (see Table 1 for review criteria). Two of the studies obtained their samples from previously established databases. Of those 82 studies, 26 utilized 3 tesla (T) or higher MRI scanners, with 8 of those electing the fully automated procedure (31%). The remaining 18 chose to manually edit their 3T data using different combinations of the skull stripping, control point, and white matter editing options (69%). The remaining studies utilized 1.5T MRI scanners, with 26 choosing the fully automated procedure (46%) and 31 implementing some combination of manual intervention (54%). Scanner strength did not robustly affect whether or not a study decided to edit its data. Fujimoto et al. (2014) compared 3T and 7T data, and reported only editing the 7T data for residual hyperintensities in the temporal lobe while leaving the 3T data unedited. Pfefferbaum et al. (2012) compared 3T data to 1.5T data, and chose to edit the 3T images more extensively. The heterogeneity in the papers we reviewed underlines the lack of a standard protocol for deciding whether to interrupt the FS segmentation process and manually edit.
Given that there is no standard protocol for the decision to interrupt the fully automated FS pipeline to manually edit the images, this paper seeks to establish the extent to which editing affects the final measurements that FS provides. Conceivably, time-consuming manual interventions may alter the final data only marginally, suggesting that editing may be necessary only for specific ROIs. To that end, our study is constructed around the following question: to what extent do the FreeSurfer-generated data for each region of interest differ significantly between the edited and unedited (i.e., fully automated) methods of measurement? Accordingly, we compare the means and variances of surface area, white matter volumes, and cortical thickness derived from edited and unedited datasets for each of the 34 ROIs. Note that surface area was chosen instead of gray matter volume, since surface area has been shown to be genetically and phenotypically independent of cortical thickness (Panizzon et al., 2009; Winkler et al., 2010) and is, therefore, more informative than gray matter volume. Moreover, we compare effect sizes between edited and unedited conditions in a small sample of individuals with 22q11.2 deletion syndrome (22q11DS) and neurotypical controls, in order to determine whether or not editing the FS output would alter the sample size necessary to detect significant differences in surface area, white matter volume, or cortical thickness. We hypothesize that the values generated by the edited method will differ from those of the unedited method, and that the edited method will produce larger effect sizes.

MATERIALS AND METHODS

Participants
Data used in this study were selected from an ongoing longitudinal study focusing on biomarkers for psychosis in 22q11.2 deletion syndrome (Kates et al., 2011a). The procedures of the longitudinal study were approved by the Institutional Review Board at SUNY Upstate Medical University. Participants were recruited through the SUNY Upstate International Center for the Evaluation, Treatment and Study of Velo-Cardio-Facial Syndrome and from the community, and all participants provided informed consent. Imaging data and neuropsychiatric testing data were acquired at four visits, about 3 years apart. For the first three time points, images were acquired on a 1.5T scanner; for the fourth time point, images were acquired on a 3T scanner.
The subsample with imaging data from the 1.5T MR scanner was drawn from a larger sample of 116 participants who returned for the third time point of the longitudinal study. The subsample consisted of the first 30 participants (stratified by study group) whose Time 3 imaging data were processed, roughly corresponding to the order in which the participants returned for Time 3. They consisted of 20 individuals with 22q11.2 deletion syndrome (22q11DS) (8 male; mean age 17.54, SD 1.9) and 10 community controls (4 male; mean age 17.18, SD 1.21).
The subsample of participants whose imaging data were from the 3T MR scanner consisted of 21 subjects who returned for the fourth time point and had been included in the 1.5T MR subsample. Nine additional subjects were matched by age, gender, and diagnosis to the remaining participants from the 1.5T MR subsample. The mean age of the 22q11DS group was 20.74 (SD 2.1), and the mean age of the control group was 20.42 (SD 1.06).
This study was approved by the Institutional Review Board of SUNY Upstate Medical University, and all participants provided signed, informed consent in accordance with the Declaration of Helsinki.
The individuals who implemented the FS processing pipeline were blind to the diagnostic status of study participants.
The 3T imaging data were acquired in the sagittal plane on a 3T Siemens Magnetom Trio Tim scanner (syngo MR B17, Siemens Medical Solutions, Erlangen, Germany) utilizing an ultrafast gradient echo 3D sequence (MPRAGE) with the k-space-based parallel acquisition algorithm GRAPPA and the following parameters: echo time = 3.31 ms; repetition time = 2530 ms; matrix size = 256 × 256; field of view = 256 mm; slice thickness = 1 mm.

Imaging Data Preprocessing
Preprocessing of 1.5T imaging data consisted of generating an isotropic brain image with non-brain tissue removed, and aligning that image along the anterior-posterior commissure. This was accomplished by importing the raw 1.5T MRI images into the imaging software program BrainImage (available from the Center for Interdisciplinary Brain Sciences Research, Stanford University), where we performed an initial intensity correction and an automatic brain mask creation, followed by a manual editing step of the brain mask (Subramaniam et al., 1997). After the final manual editing, the skull was removed from the image and the brain image was saved in Analyze file format for import into the imaging software package 3DSlicer (www.slicer.org; Fedorov et al., 2012). In 3DSlicer, the skull-stripped brains were aligned along the anterior-posterior commissure axis, and then re-sampled into isotropic voxels (0.9375 mm³) using a cubic spline interpolation transformation.
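The resampling step above can be sketched in a few lines. This is a minimal illustration only (the study used BrainImage and 3DSlicer, not this code), and the array dimensions and voxel spacings below are hypothetical:

```python
import numpy as np
from scipy.ndimage import zoom

def resample_isotropic(volume, spacing, target=0.9375):
    """Resample a 3D volume to isotropic voxels of `target` mm
    using cubic spline interpolation (order=3)."""
    factors = [s / target for s in spacing]  # per-axis scale factors
    return zoom(volume, zoom=factors, order=3)

# Hypothetical acquisition: 1.0 mm in-plane, 2.0 mm slices,
# resampled to 0.9375 mm isotropic as in the 1.5T protocol.
vol = np.random.default_rng(0).random((64, 64, 32))
iso = resample_isotropic(vol, spacing=(1.0, 1.0, 2.0))
```

Cubic spline interpolation (order 3) is a common compromise between the blurring of linear interpolation and the ringing of higher-order kernels.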
Preprocessing of 3T images also consisted of generating an isotropic brain image with non-brain tissue removed. However, instead of using BrainImage to remove non-brain tissue, we used the initial preprocessing step in the FS pipeline. The resulting brain mask was imported into 3DSlicer and manually edited using the same steps included in the protocol cited above. Afterwards, the skull was removed from the image and the brain image was aligned along the anterior-posterior commissure axis using a cubic spline transformation, and kept at the same resolution as the initial data, with isotropic voxels (1 mm³).
At that point, both 1.5T and 3T edited and aligned brain masks were subject to the FreeSurfer segmentation process, described below.

FS Segmentation Process
The preprocessed images were imported into the automated brain segmentation software FreeSurfer (FS), installed on a Dell Optiplex machine running the Ubuntu 12.04 operating system. In addition to the resampling of the image into 0.9375 mm³ voxels using a cubic spline transformation during preprocessing, as described above, the FS segmentation process resampled the images into 1 mm³ voxels as part of its motion correction step. Cortical reconstruction and volumetric segmentation were performed with the FreeSurfer image analysis suite, which is documented and freely available for download online (http://surfer.nmr.mgh.harvard.edu/). The technical details of these procedures are described in prior publications (Dale and Sereno, 1993; Dale et al., 1999; Fischl et al., 1999a,b, 2001, 2002, 2004a; Fischl and Dale, 2000; Ségonne et al., 2004; Han et al., 2006; Jovicich et al., 2006).
Briefly, the FS segmentation process included: segmentation of the subcortical white matter and deep gray matter volumetric structures (including hippocampus, amygdala, caudate, putamen, and ventricles) (Fischl et al., 2002, 2004a); intensity normalization (Sled et al., 1998); tessellation of the gray matter/white matter boundary; automated topology correction (Fischl et al., 2001; Ségonne et al., 2007); and surface deformation following intensity gradients to optimally place the gray/white and gray/cerebrospinal fluid borders at the location where the greatest shift in intensity defines the transition to the other tissue class (Dale and Sereno, 1993; Dale et al., 1999; Fischl and Dale, 2000). Once the cortical models were complete, a number of deformable procedures were performed, including surface inflation (Fischl et al., 1999a), registration to a spherical atlas that utilizes individual cortical folding patterns to match cortical geometry across subjects (Fischl et al., 1999b), parcellation of the cerebral cortex into units based on gyral and sulcal structure (Fischl et al., 2004b; Desikan et al., 2006), and creation of a variety of surface-based data, including maps of curvature and sulcal depth. Details of the methods involved have been described extensively elsewhere (Fischl and Dale, 2000; Salat et al., 2004).
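In practice, the full reconstruction stream described above is launched through FreeSurfer's `recon-all` driver. The sketch below only assembles the command line rather than executing it; the subject identifier and file paths are hypothetical:

```python
from pathlib import Path

def build_recon_all_cmd(subject_id, t1_path, subjects_dir):
    """Assemble the recon-all invocation for a full (unedited) FS run.

    `recon-all -all` runs every stage described above: motion correction,
    intensity normalization, skull stripping, subcortical segmentation,
    surface tessellation, topology correction, and cortical parcellation.
    """
    return [
        "recon-all",
        "-sd", str(subjects_dir),  # override SUBJECTS_DIR
        "-s", subject_id,          # subject identifier (hypothetical)
        "-i", str(t1_path),        # input T1-weighted volume (hypothetical)
        "-all",                    # run the complete reconstruction stream
    ]

cmd = build_recon_all_cmd("sub-001", Path("sub-001_T1w.nii.gz"),
                          Path("./subjects"))
```

The list could then be passed to `subprocess.run(cmd)` on a machine with FreeSurfer installed and its environment sourced.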

Final Steps of Fully Automated (Unedited) Pipeline
Following the successful completion of the FS reconstruction process, the FS directories were duplicated, and one copy immediately underwent the final reconstruction stream without manual intervention. Cortical thickness, surface area, and white matter volume measurements were extracted for selected regions of interest (ROIs), and the directories were backed up to a remote and secure location. Cortical thickness measurements were computed as the average distance, calculated using a spatial lookup table, between the white matter and pial surfaces generated by FS (Fischl and Dale, 2000). This group of FS data without any manual intervention will be referred to as "unedited."
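Conceptually, the thickness measure pairs each point on one surface with the closest point on the other and averages the two directed distances (Fischl and Dale, 2000). The toy sketch below uses a brute-force nearest-vertex search on synthetic planar "surfaces"; it illustrates the idea only and is not FS's actual implementation, which operates on matched surface meshes with a spatial lookup table:

```python
import numpy as np

def mean_thickness(white_vertices, pial_vertices):
    """Symmetric mean closest-point distance between two vertex sets,
    an illustrative stand-in for the white/pial thickness measure."""
    def closest(a, b):
        # distance from each vertex in a to its nearest vertex in b
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
        return d.min(axis=1)
    return 0.5 * (closest(white_vertices, pial_vertices).mean()
                  + closest(pial_vertices, white_vertices).mean())

# Toy surfaces: two parallel 5x5 vertex grids 2.5 mm apart,
# so the recovered "thickness" should be 2.5 mm.
g = np.array([[x, y, 0.0] for x in range(5) for y in range(5)])
white, pial = g, g + [0.0, 0.0, 2.5]
```

On real meshes the search is restricted to geometrically corresponding patches, which is why misplaced surfaces (the errors control points correct) directly bias the thickness estimate.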

Final Steps of Manual Intervention (Edited) Method
The second copy of the data was manually inspected for defects that could affect the accuracy of the final cortical measurements. The full protocols for processing and editing both 1.5T and 3T data are provided in the Supplementary Material; however, a brief description of the process follows. In the coronal view, starting posteriorly, with the opposite hemisphere of the brain obstructed in order to minimize human error, each slice was inspected for errors in the surfaces created by FS. An error can be described as an instance where one of the surfaces drawn by FS includes or excludes voxels incorrectly. These errors are most often caused by motion artifacts in the more posterior sections of the brain, and by hyperintensities around the temporal and orbitofrontal lobes. Control points, manually inserted targets that adjust a voxel's intensity value to 110, were inserted within adjacent white matter regions in order to correct surface errors, as described on the FS website. Where appropriate, hyperintensities and extraneous tissue were removed from the brain volume as well, as described in the White Matter Edits tutorial on the FS website. Once completed, the process was repeated for the opposite hemisphere. After all errors were corrected, the brain was re-run through the second reconstruction stream, beginning at the module where control-point-adjusted voxels are taken into account. This process was repeated up to four times to ensure that all errors in the FS surfaces were corrected.
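Conceptually, a control point asserts that a voxel belongs to white matter, which FS's intensity normalization maps to a value of 110. The sketch below simply forces marked voxels to that target to illustrate the idea; FreeSurfer's normalization actually fits a smooth bias field through the control points rather than editing single voxels, and the volume and coordinates here are hypothetical:

```python
import numpy as np

WM_TARGET = 110  # FreeSurfer normalizes white matter to intensity 110

def apply_control_points(volume, points):
    """Illustrative sketch only: force control-point voxels to the white
    matter normalization target. The real pipeline uses these points to
    re-estimate a smooth bias field over the whole volume."""
    out = volume.astype(float).copy()
    for x, y, z in points:
        out[x, y, z] = WM_TARGET
    return out

# Hypothetical volume of under-normalized white matter (intensity 95)
vol = np.full((10, 10, 10), 95.0)
edited = apply_control_points(vol, [(5, 5, 5), (6, 5, 5)])
```

After placing control points, the FS stream is typically resumed with `recon-all -autorecon2-cp -autorecon3 -s <subject>`, which re-runs normalization and all downstream surface steps, matching the re-run described above.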
Following successful correction of the FS surfaces, the final reconstruction step was run, and cortical thickness and volume measurements were extracted for all ROIs. Manually corrected data, hereafter referred to as "edited," were then compared with the unedited data.

Statistical Analyses
Analyses comparing the unedited and edited volumes and cortical thickness values for each ROI were run separately in SPSS (v22) for the 1.5T and 3T data. Accordingly, for both the 1.5T and the 3T data, the variance was calculated for each ROI, based on the total sample of 30 individuals, and Levene's test was used to compare the variance of each edited ROI to that of its unedited counterpart. Intraclass correlation coefficients between edited and unedited ROIs were calculated based on the total sample as well, and paired t-tests were conducted in order to determine whether the means differed significantly between edited and unedited ROIs. The Bonferroni correction was applied to the 34 paired t-tests that we performed for each set of measures (i.e., surface area, white matter volume, thickness) at each field strength.
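The ROI-wise comparisons described above, along with a pooled-SD Cohen's d of the kind used for the 22q11DS vs. control effect sizes, can be sketched as follows. This is a minimal sketch in Python/SciPy rather than SPSS, the data are synthetic stand-ins for a single ROI, and the ICC variant shown (two-way, consistency, single measures) is an assumption, since the paper does not specify which form was used:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def icc_consistency(x, y):
    """Single-measure consistency ICC for two 'raters' (edited vs.
    unedited), from the standard two-way ANOVA mean squares."""
    data = np.column_stack([x, y])
    n, k = data.shape
    grand = data.mean()
    ms_between = k * ((data.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ss_within = ((data - data.mean(axis=1, keepdims=True)) ** 2).sum()
    ss_cols = n * ((data.mean(axis=0) - grand) ** 2).sum()
    ms_error = (ss_within - ss_cols) / ((n - 1) * (k - 1))
    return (ms_between - ms_error) / (ms_between + (k - 1) * ms_error)

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * np.var(a, ddof=1)
                      + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled

# Synthetic stand-in for one ROI's thickness (mm) across 30 subjects
unedited = rng.normal(2.5, 0.2, 30)
edited = unedited + rng.normal(0.0, 0.02, 30)   # small, noisy editing effect

_, p_levene = stats.levene(edited, unedited)    # equality of variances
_, p_paired = stats.ttest_rel(edited, unedited) # paired t-test on the means
alpha = 0.05 / 34                               # Bonferroni over 34 ROIs
icc = icc_consistency(edited, unedited)
d = cohens_d(edited[:20], edited[20:])          # illustrative 20 vs. 10 split
```

Note that 0.05/34 ≈ 0.0015, which is consistent with the p < 0.001 criterion applied to significant ROIs in the Results.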
As noted above, we also generated effect sizes for the mean surface area/white matter volume/cortical thickness values between the 20 individuals with 22q11DS and the 10 controls, in order to determine the differences in effect sizes that the edited vs. unedited methods yielded. This would allow one to determine the sample sizes for the edited vs. unedited methods that would be necessary to detect significant differences in volume/cortical thickness between individuals with 22q11DS and controls. To determine whether effect sizes for the edited method differed significantly from effect sizes for the unedited method, we calculated paired t-tests across all ROIs. Bonferroni corrections were applied to the paired t-tests as described above. In addition, we calculated the arithmetic difference in effect size for each edited vs. unedited ROI (by subtracting the unedited value from the edited value).

RESULTS

Figure 1 compares MR images with and without manual intervention with control points. Means and standard deviations for surface area, white matter volume, and cortical thickness for each ROI, separated by scanner field strength, are provided in Table 2. The differences between edited and unedited measures are represented by Bland-Altman plots in Figure 2. Variances and intraclass correlation coefficients for all ROIs, separated by scanner field strength, are provided in Table 3. Effect sizes are provided in Table 4, and box plots representing effect sizes are provided in Figure 3.

Surface Area Measures
Levene's test indicated that the variance of each edited region of interest did not differ significantly from its unedited counterpart. Intraclass correlation analyses between unedited and edited surface areas yielded coefficients ranging from 0.82 to 0.99 for 32 out of the 34 ROIs. The only exceptions were entorhinal cortex areas (0.52) and parahippocampal gyrus areas (0.21). After Bonferroni correction, paired t-tests indicated that mean areas did not differ significantly between any unedited and edited ROIs.
Paired t-tests indicated that the mean effect size for edited surface areas did not differ significantly from the mean effect size for unedited surface areas. Moreover, the mean arithmetic difference in effect size between all edited and unedited surface area ROIs was −0.011 (SD 0.12). The regions for which the difference in effect size between the edited and unedited methods exceeded 0.20 in magnitude (indicating small effect sizes) were the entorhinal cortex (−0.26), lingual area (0.22), pars orbitalis (−0.27), and pars triangularis (−0.21).

White Matter Volumes
No significant differences were observed in the variances of white matter volumes between edited and unedited ROIs. Intraclass correlation analyses between unedited and edited white matter volumes yielded coefficients ranging from 0.85 to 0.99 for 32 out of 34 ROIs. Similar to the surface areas, the exceptions were entorhinal cortex (0.60) and parahippocampal gyrus (0.34) volumes. Mean volumes did not differ significantly between 32 of the 34 pairs of unedited and edited regions. The exceptions were the lateral orbitofrontal cortex (p < 0.001) and the superior parietal lobule (p < 0.001).

FIGURE 1 | Comparison of MR images before and after manual intervention. (A) In comparison with the unedited 1.5T image (left), the manually edited brain image (right) shows a more accurate portrayal of the parahippocampal gyrus, the hippocampus, and the white matter boundary. (B) However, in the 3T brain images, there is little difference between the unedited (left) and the manually edited (right) images. The manual intervention implemented in the 3T brain was intended to include white matter and gray matter incorrectly excluded from the lateral orbitofrontal gyrus area. Control points on this slice, in addition to edits on anterior and posterior brain slices, had no significant effect on the exclusion. This shows that although control points can affect the white matter and pial surfaces, as well as the cortical parcellation, the effect is inconsistent.
The mean effect size for edited measures of white matter volumes did not differ significantly from the mean effect size for unedited measures. The mean arithmetic difference in effect size between all edited and unedited white matter ROIs was −0.018 (SD 0.11). The regions with the largest differences in effect sizes between edited and unedited methods for measuring white matter volumes were the entorhinal cortex (0.27), the pars triangularis (0.24), the frontal pole (−0.21) and the temporal pole (0.22).

Cortical Thickness
No significant differences were observed in variances of cortical thickness between edited and unedited ROIs. Intraclass correlation analyses between unedited and edited measures of cortical thickness yielded coefficients ranging from 0.84 to 0.985 for 31 out of 34 ROIs. Exceptions included entorhinal cortex (0.81), inferior temporal gyrus (0.76) and the temporal pole (0.79). Mean cortical thickness did not differ significantly between 32 of the 34 pairs of unedited and edited regions. Exceptions were the precentral gyrus (p < 0.001) and the rostral anterior cingulate (p < 0.001).
The mean effect size for edited measures of cortical thickness did not differ significantly from the mean effect size for unedited measures. The mean arithmetic difference in effect size between all edited and unedited measures of cortical thickness was −0.03 (SD 0.16). The regions with the largest differences in effect size between the edited and unedited methods were the caudal anterior cingulate (0.43), fusiform gyrus (−0.23), inferior parietal lobule (0.39), rostral anterior cingulate (0.21), superior frontal gyrus (0.20), supramarginal gyrus (0.30), and temporal pole (0.24). Note that the majority of these values were positive, indicating that the effect sizes for the edited method tended to be larger than those for the unedited method of measuring cortical thickness.

Surface Area Measures
For the 3T data, Levene's test similarly indicated that the variance of each edited region of interest did not differ significantly from its unedited counterpart. Intraclass correlation analyses between unedited and edited surface areas yielded coefficients ranging from 0.86 to 0.99 for 33 out of 34 ROIs. The only exception was the insula (0.799). Paired t-tests indicated that mean surface areas did not differ significantly between any pairs of unedited and edited regions. However, several regions approached significance, including the fusiform gyrus (p = 0.002), the lateral orbitofrontal area (p = 0.003), and the inferior temporal lobe (p = 0.004). For the 3T data, the mean effect sizes for edited and unedited measures of surface area did not differ. The mean arithmetic difference in effect size between edited and unedited surface area ROIs was −0.028 (SD 0.12). The regions with the largest differences in effect size between the edited and unedited methods were the entorhinal cortex (0.21), pericalcarine cortex (−0.29), rostral anterior cingulate (0.26), and temporal pole (0.287).

White Matter Volumes
No significant differences were observed in the variances of white matter volumes between edited and unedited ROIs. Intraclass correlation analyses between unedited and edited white matter volumes yielded coefficients ranging from 0.90 to 1.00 for all ROIs. After Bonferroni correction, the mean white matter volumes did not differ significantly between any pairs of unedited and edited regions; however, the fusiform gyrus (p < 0.005) and the pars orbitalis (p < 0.005) approached significance.
The mean effect size for edited measures of white matter volume did not differ significantly from the mean effect size for unedited measures. The mean arithmetic difference in effect size between edited and unedited white matter ROIs was −0.013 (SD 0.11). The regions with the largest differences in effect size between the unedited and edited methods were the frontal pole (0.369), temporal pole (0.22), transverse temporal cortex (0.21) and insula (0.25).

Cortical Thickness
No significant differences in the 3T data were observed in the variances of cortical thickness between edited and unedited ROIs. Intraclass correlation analyses between unedited and edited measures of cortical thickness yielded coefficients ranging from 0.86 to 0.986 for 32 out of 34 ROIs. Exceptions included the medial orbitofrontal cortex (0.65) and the insula (0.81). In contrast to the 1.5T data, mean cortical thickness differed significantly between 7 of the 34 pairs of unedited and edited regions, including the banks of the superior temporal sulcus, entorhinal cortex, fusiform gyrus, inferior temporal gyrus, lateral orbitofrontal cortex, medial orbitofrontal cortex, and rostral middle frontal cortex (all p < 0.001). Moreover, an additional three ROIs approached significance, including the superior frontal gyrus (p < 0.003), precentral gyrus (p < 0.004), and caudal middle frontal gyrus (p < 0.004).
The mean effect size for edited measures of cortical thickness did not differ significantly from the mean effect size for unedited measures. The mean arithmetic difference in effect size between edited and unedited measures of cortical thickness was 0.07 (SD 0.15). The regions with the largest differences in effect sizes were the lateral orbitofrontal cortex (0.226), the lingual gyrus (−0.439), the rostral anterior cingulate (0.244) and the insula (−0.47).

DISCUSSION
In the last 5 years, FreeSurfer (FS) has become the standard for obtaining cortical metrics from MRI images due to its ease of configuration, accurate results, and high reproducibility (Fischl et al., 2002; Tae et al., 2008; Bhojraj et al., 2011). However, there has been a lack of consensus around whether or not additional manual editing is required in order to increase the ability to detect effects between groups. This is the first study, to the best of our knowledge, to directly compare FS's fully automated method with FS's semi-automated manual intervention method, which utilizes control points to alter gray-white matter boundaries. Overall, we found very few differences between methodological approaches, although we note specific exceptions below.

FIGURE 2 | Bland-Altman plots representing the differences between edited and unedited measures of surface area, white matter volume, and cortical thickness for each field strength. The difference between the edited and unedited measure of each region of interest is plotted against the average of the two measures. The mean and 95% limits of agreement are provided in each plot. These plots indicate that, for the most part, the two methods produce similar results, although all plots show a fairly wide range of values. Outliers beyond the 95% agreement limit, indicating poor agreement, include, for surface area, [...].

1.5T Data
We found few differences between methodological approaches when using the FS segmentation process to obtain surface areas from 1.5T images. The absence of differences in variance, and the high intraclass correlation coefficients between the regions in edited and unedited brains, support previous studies that have established the consistency and reproducibility of the fully automated FS segmentation process (Fischl et al., 2002). As found in previous studies, the regions where differences were observed, i.e., the entorhinal cortex and parahippocampal gyrus, are common locations for imaging artifacts (Oguz et al., 2008; Desikan et al., 2010). These results support previous research into FS's difficulty obtaining measurements in similar scenarios, rather than suggesting a difference between the two methods (Desikan et al., 2010). This is supported by an absence of significant differences in the mean volumes and mean effect sizes between the two methods for measuring surface areas. Although some differences were observed in mean white matter volumes, the absence of consistently larger effect sizes for either method further indicates that these differences should not be viewed as a higher level of accuracy in volume segmentation for either method. One exception may be the lateral orbitofrontal cortex, for which we observed significant differences in mean volume. Due to motion, which causes commonly occurring imaging artifacts, the lateral orbitofrontal cortex is a region where raters make numerous corrections (i.e., using control points) during the FS pipeline. Although in our data the difference in effect size between our patient and control samples was negligible for this region, that may not be the case for other populations; therefore, automated white matter volumes derived for this region, when using a 1.5T scanner, should in general be viewed with caution.
As described in the methods section, cortical thickness is derived from the distance between the white matter surface, which follows the border between white and gray matter, and the pial surface, which follows the border between gray matter and cerebrospinal fluid. Because manually inserting control points affects where those surfaces are positioned, differences between the methods should be most pronounced in cortical thickness measurements. Although there was an absence of differences in the variance, ICCs, and mean cortical thickness for most regions, the pattern of effect sizes was surprising. The caudal anterior cingulate, superior frontal gyrus, supramarginal gyrus, and temporal pole all had effect sizes that favored the edited method, yet these regions do not typically require many control points. Conversely, the region that favored the unedited method, the fusiform gyrus, usually needs heavier manual correction to exclude hyperintensities. Although further exploration is needed to determine what caused these unexpected results, it is possible that errors in the automated segmentation are more pronounced in 22q11DS because of enlarged ventricles, and that fusiform gyrus tissue was incorrectly excluded in the unedited brains, giving the appearance of a larger effect than was actually present. Nonetheless, the lack of consistently significant differences in variances and mean cortical thickness between the edited and unedited methods further supports the notion that, for 1.5T images, manual intervention in FS's automated process does not provide an increase in the ability to detect an effect between groups commensurate with the human hours required.

FIGURE 3 | Box plots representing means and standard deviations of effect sizes for each measurement type/field strength. Note that the only outliers were in the cortical thickness plots for the 3T data. The outlying regions of interest were pericalcarine thickness (1.49) and medial orbitofrontal thickness (1.60).
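The thickness definition above, distance between the white and pial surfaces, can be sketched as follows. This is a deliberate simplification: FreeSurfer's actual measure averages closest-point distances between the two surfaces rather than assuming the one-to-one vertex correspondence used here, and all coordinates below are hypothetical.

```python
import numpy as np

def vertex_thickness(white_vertices, pial_vertices):
    """Per-vertex thickness as the Euclidean distance between
    corresponding white-surface and pial-surface vertices.

    Simplification: FreeSurfer averages closest-point distances
    between the two surfaces; here we assume each white-surface
    vertex pairs with one pial-surface vertex."""
    white = np.asarray(white_vertices, float)
    pial = np.asarray(pial_vertices, float)
    return np.linalg.norm(pial - white, axis=1)

# Three hypothetical vertices (x, y, z in mm); the pial surface
# lies 2.5 mm outward from the white surface along the x-axis.
white = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
pial = white + np.array([2.5, 0.0, 0.0])

print(vertex_thickness(white, pial).mean())  # mean ROI thickness in mm
```

Because control points shift the intensity normalization and hence the fitted surfaces, any surface displacement propagates directly into these per-vertex distances, which is why thickness was expected to be the measure most sensitive to editing.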

3T Data
The results for surface area and white matter volume in the 3T data are similar to those observed for the 1.5T data, and suggest that consistency in method is most likely more important than the choice between the fully automated and manually edited procedures. This is corroborated by the similar effect sizes observed for both processes, with the exception of the temporal and occipital lobe structures affected by the issues described above.
Although no significant differences were observed in cortical thickness variance between the two groups, a notable difference between the 1.5T and 3T results was the seven regions with differences in mean cortical thickness. The relatively large number of 3T regions for which we observed differences, and the fact that the same differences were not present in the 1.5T data, warrant further explanation. In particular, the superior temporal sulcus and the lateral and medial orbitofrontal cortices typically require manual editing in both the 1.5T and 3T data.
It is possible that, because of the higher contrast in 3T scans, the control points were more successful in correcting misplaced surfaces than in the 1.5T scans, potentially resulting in more accurate surfaces and cortical thickness measurements. Such an effect would be supported by larger effect sizes in those regions for the edited brains. However, this was observed only for the lateral orbitofrontal cortex, and overall the differences in effect sizes across regions were evenly split between the edited and unedited methods. Therefore, although there were differences between the two methods, editing the brain images did not translate into a greater ability to detect group differences with either method.

LIMITATIONS
Artifacts due to intensity inhomogeneity, head motion, reduced signal-to-noise ratio, and partial volume effects can all lead to reduced image quality, alterations in intensity values and, ultimately, errors in image segmentation. These issues may be magnified in higher field-strength data secondary to increases in B1 field inhomogeneity (Marques et al., 2010), potentially necessitating more manual editing of higher field-strength images. Acquiring and averaging multiple acquisitions, which improves signal-to-noise and contrast-to-noise ratios, and reduces motion artifacts, can address these issues (Kochunov et al., 2006;Winkler et al., 2010). The present analyses were based on a single sequence acquisition, which constitutes a limitation of our study. Multiple sequence acquisition carries trade-offs in both scanning cost and time, which can deter researchers. In the present study, the sample consisted, in part, of school-aged children with intellectual disability and, in many cases, attention deficit hyperactivity disorder. Accordingly, we had to strike a balance between optimizing the quality of our images and maintaining a timeframe that our sample would tolerate. This may have necessitated more manual intervention to correct errors in segmentation.
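The benefit of averaging multiple acquisitions follows from a standard statistical fact: averaging N independent noisy measurements reduces the noise standard deviation by a factor of sqrt(N). A minimal simulation, with entirely hypothetical signal and noise values, illustrates this:

```python
import numpy as np

# Averaging N independent acquisitions of the same voxel shrinks the
# noise standard deviation by sqrt(N), so SNR grows as sqrt(N).
rng = np.random.default_rng(42)
true_signal = 100.0   # hypothetical voxel intensity
noise_sd = 10.0       # hypothetical per-acquisition noise SD
n_acquisitions = 4

# 10,000 simulated voxels, each acquired n_acquisitions times.
acquisitions = true_signal + rng.normal(0.0, noise_sd,
                                        size=(10_000, n_acquisitions))
averaged = acquisitions.mean(axis=1)

print(f"single-acquisition noise SD ~ {acquisitions[:, 0].std():.2f}")
print(f"averaged noise SD ~ {averaged.std():.2f} "
      f"(theory: {noise_sd / np.sqrt(n_acquisitions):.2f})")
```

With four acquisitions the noise SD drops to roughly half, which is the improvement that the added scan time buys; for the pediatric sample described here, that time was not available.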
Although we observed similarities in the metrics we extracted from the different regions of the brain, we did not conduct an overlap analysis to determine whether the ROIs had a high level of spatial overlap. It is possible that the regions appear similar numerically but have different boundaries, with one methodological approach more accurately denoting the region it represents. Another limitation is that both the 1.5T and 3T data were manually skull stripped prior to implementing the FS pipeline: had the brains been processed fully automatically, they would have been subject to the automated skull-stripping module included within FS. However, we do not believe this had a significant effect on our results, and previous research supports this notion (Fennema-Notestine et al., 2006). Our processing pipeline may also have been limited by the fact that we did not assess the quality of the images (e.g., signal-to-noise ratio) prior to processing the data, which may have affected the extent to which manual interventions were needed.
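The overlap analysis mentioned above would typically be computed with the Dice coefficient between the two pipelines' binary ROI masks. A minimal sketch, using tiny hypothetical masks rather than real parcellations:

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice overlap: 2|A ∩ B| / (|A| + |B|) for binary ROI masks.
    1.0 means identical masks, 0.0 means no shared voxels."""
    a = np.asarray(mask_a, bool)
    b = np.asarray(mask_b, bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom

# Hypothetical edited vs. unedited masks of one ROI on a tiny grid.
edited   = np.array([[0, 1, 1], [0, 1, 1], [0, 0, 0]])
unedited = np.array([[0, 1, 1], [0, 1, 0], [0, 1, 0]])
print(f"Dice = {dice(edited, unedited):.2f}")  # → Dice = 0.75
```

Two masks of equal size (here, four voxels each) can thus produce identical summary volumes while overlapping imperfectly, which is precisely the scenario that numerical agreement alone cannot rule out.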

CONCLUSIONS
This study is significant in that it shows that the additional time and cost necessary to manually correct the FS segmentation process does not necessarily increase one's ability to detect differences in cortical measurements between groups. Future studies should be conducted with larger and more diverse samples in order to provide additional insight into the differences between methods. In addition, since the temporal and frontal lobes contain numerous regions affected by disorders such as Alzheimer's disease and schizophrenia, and many of the differences we observed were within those lobes, additional research should focus on methods that can increase segmentation accuracy specifically in those regions.

AUTHOR CONTRIBUTIONS
WK, IC, and CM designed the study. CM, CT, and JB completed all image processing for the study. IC and WK completed all statistical analyses of the imaging data. AR, CM, and WK wrote the manuscript. All authors revised the manuscript for accuracy and intellectual content, and all authors approved the final manuscript.

ACKNOWLEDGMENTS
This research was supported by the National Institutes of Health, MH064824, to WK. The authors thank Margaret Mariano for her editorial assistance.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnins.2015.00379