A Contrast Augmentation Approach to Improve Multi-Scanner Generalization in MRI

Most data-driven methods are very susceptible to data variability. This problem is particularly apparent when applying Deep Learning (DL) to brain Magnetic Resonance Imaging (MRI), where intensities and contrasts vary due to acquisition protocol, scanner- and center-specific factors. Most publicly available brain MRI datasets originate from the same center and are homogeneous in terms of scanner and used protocol. As such, devising robust methods that generalize to multi-scanner and multi-center data is crucial for transferring these techniques into clinical practice. We propose a novel data augmentation approach based on Gaussian Mixture Models (GMM-DA) with the goal of increasing the variability of a given dataset in terms of intensities and contrasts. The approach allows to augment the training dataset such that the variability in the training set compares to what is seen in real world clinical data, while preserving anatomical information. We compare the performance of a state-of-the-art U-Net model trained for segmenting brain structures with and without the addition of GMM-DA. The models are trained and evaluated on single- and multi-scanner datasets. Additionally, we verify the consistency of test-retest results on same-patient images (same and different scanners). Finally, we investigate how the presence of bias field influences the performance of a model trained with GMM-DA. We found that the addition of the GMM-DA improves the generalization capability of the DL model to other scanners not present in the training data, even when the train set is already multi-scanner. Besides, the consistency between same-patient segmentation predictions is improved, both for same-scanner and different-scanner repetitions. We conclude that GMM-DA could increase the transferability of DL models into clinical scenarios.


INTRODUCTION
The segmentation of different brain structures from Magnetic Resonance Imaging (MRI) is an important problem in the field of neuroimaging. Obtaining precise and consistent delineations is crucial in the diagnosis, follow-up and treatment of neurological disorders. Important examples are the monitoring of the progression of Multiple Sclerosis (MS) or dementia, both connected to accentuated neurodegeneration (Giorgio and De Stefano, 2013). In recent years, convolutional neural networks (CNN) have become an increasingly popular segmentation approach, but the fact that these models are extremely sensitive to data variability has hindered their large scale adoption in clinical and research settings. Specifically, these algorithms remain sensitive to factors such as hardware and acquisition settings, which can be especially problematic when integrating data from different cohorts (Mårtensson et al., 2020). For these models to generalize to data collected using new or unseen scanners, large multicenter and multi-scanner datasets are necessary at the training stage. Nevertheless, collecting such data is not trivial and most available datasets are homogeneous in terms of scanner types and acquisition protocols.

Related Work
The above mentioned problem is often termed as the scanner bias problem. A popular way to deal with it in large clinical trials is through approaches based on statistical harmonization. In most cases the focus is on removing the scanner bias from the volumetric measurements based on scanner-or centerinformation (Fortin et al., 2018;Garcia-Dias et al., 2020). At the image level, it is common to use the standardization of the MRI intensity scale to reduce scanner sensitivity (Wang et al., 1998;Nyúl and Udupa, 1999;Shinohara et al., 2014), which has been previously shown to improve the outcome of computer vision tasks like segmentation (Zhuge and Udupa, 2009) and registration (Bagci et al., 2010). Recently, some works have attempted to use Deep Learning (DL) methods to modify the analyzed images such that they appear to have been acquired under similar settings (Dewey et al., 2019;Zhao et al., 2019b). However, harmonization methods have the undesirable property that the results will always be bound by the least informative scanner in the dataset, as shown in Moyer and Golland (2021), while standardization methods are not able to remove residual across-subject variability (Shinohara et al., 2014;Fortin et al., 2016;Wrobel et al., 2020). Additionally, many of these approaches require retraining and updating of the models when including new data from unseen scanners or centers.
In order to avoid these unwanted effects, it is interesting to tackle the problem from a generalization perspective, by improving the performance and reproducibility of the methods of interest (often segmentation of brain tissues or lesions). When considering DL methods in particular, a common approach is to increase the variability in the data by applying well designed data augmentation (DA). The idea behind DA is simple: by applying transformations to the labeled data it is possible to artificially increase the training set, which implicitly regularizes the trained CNN. The most common DA strategies explore transformations of the original data, mostly based on the application of operations such as elastic distortions (Simard et al., 2003), linear geometric transformations such as translations and rotations, color transformations (mostly by altering the intensities of the RGB channels in 2D images) (Krizhevsky et al., 2012) or noise injection (Sietsma and Dow, 1991).
In the medical imaging field, DA is especially important since annotated datasets are typically small. Although simple transformations such as the ones described above can alleviate overfitting and improve performance on the test sets in medical applications (Milletari et al., 2016), they do not take into account the high variability in terms of contrast found in MRI. Some works have attempted to overcome this limitation by generating completely synthetic images using generative adversarial networks, as is the case in Shin et al. (2018). Nonetheless, there is still a long way to go until these images can be used effectively. Other more promising approaches start from existing images and alter them in such a way that new sequences or contrasts are simulated. One relevant example is described in Jog et al. (2019), where a CNN-based algorithm resilient to variations in the input acquisition is presented. To achieve this, approximate forward models of different MRI pulse sequences are built. This way, synthetic versions of the training images are generated such that they appear to have been acquired using different sequences. The method has the disadvantage that it is complex, slow and it requires nuclear magnetic resonance parameter maps of the training images, which are often unavailable. Zhao et al. (2019a) proposed to learn a model of transformations from an atlas to images in a dataset and to use this model along with a single labeled example to synthesize additional labeled examples with variable appearance and spacial deformations. More recently, Billot et al. (2020) presented a contrast-agnostic brain segmentation method, again based on generating synthetic images. The method uses only a segmentation map to generate new images with varying, sometimes even unrealistic, contrasts. The generated images have random appearance, deformation, noise, and bias field. With this type of extreme augmentation, it is possible to obtain a final model that is not biased toward any specific MRI contrast and that achieves good performance on unpreprocessed brain scans of any contrast. Although this method is very promising, by design it is limited to segmentation applications and nuanced variations in the individual images are lost.

Our Contribution
In the present work we propose a novel intensity-based DA strategy with the main goal of reducing the scanner bias of models trained on data with low protocol-, scanner-or centervariability. Although scanner factors cause variations to other image characteristics (e.g., noise, artifacts, geometric distortions), we have previously found a clear relationship between tissue contrast and volume measurements (Meyer et al., 2019). As such, we hypothesize that augmenting the tissue intensity variability will have a positive effect in the model generalization to new, unseen scanners or center-specific acquisition configurations. The method is based on the Gaussian Mixture Model (GMM) framework: we estimate the individual tissue components of an MRI image and randomly modify them, while preserving structural information. As a result the contrast between different tissues varies, in a similar way to what happens when different scanners or sequences are used during acquisition. We validate the approach in the task of brain structure segmentation. Unlike currently existing methods, the proposed approach does not depend on any existent segmentations or parameter maps; it is simple and fast; it can be used on-the-fly during training; and it is not necessarily limited to segmentation applications.
This work extends the preliminary research presented in Meyer et al. (2021). We previously compared the performance of the same CNN-based model trained under three different settings: (i) single-scanner data, (ii) single-scanner data with the addition of our DA method, and (iii) multi-scanner data. We now additionally investigate the effect of adding the DA method to multi-scanner data and evaluate the reproducibility and consistency of the models on a test-retest dataset containing same-patient repetitions in the same and different scanners. Finally, we investigate the effect of the presence of bias field on the training images. Overall we observe a clear improvement in generalization to unseen scanner types when adding the proposed method to the training pipeline, not only when the original training dataset is homogeneous, but also in the case when a large, heterogeneous dataset is used as training set.

GAUSSIAN MIXTURE MODEL-BASED INTENSITY TRANSFORMATION
The idea behind the proposed approach is to increase the intensity and contrast variability of images in datasets with low scanner and center acquisition diversity, such that it becomes representative of what is found in large multi-scanner and multicenter cohorts. This DA method is applied during the training phase of a DL network of choice, and is not necessary at inference. Figure 1 shows a depiction of the method. An implementation is available at https://github.com/icometrix/gmm-augmentation.

The Gaussian Mixture Model Framework
It is well documented that in a skull-stripped T1w brain MRI without contrast injection, characteristic peaks in the histogram correspond to different tissues, i.e., CSF has the lowest intensity, followed by GM and WM. This has been explored by several segmentation methods based on Gaussian Mixture Models (GMM) (Van Leemput et al., 1999;Ashburner and Friston, 2005). GMM is a type of probabilistic model that assumes that data can be modeled as a superposition of K Gaussians. Within this framework, if we have a set of observations {v 1 , . . . v N }, corresponding to the intensities v of each voxel n ∈ N in an image I, we can model each observation in the data using a mixture of Gaussians, such that: Each N (µ k , σ 2 k ) is a component of the mixture, with its own mean µ k and variance σ 2 k , and π k are the mixing coefficients. For simplicity we hide the subscript n when referring to the intensity of a given voxel: v n is represented as v from here on.
We start by selecting K = 3 Gaussian components for the GMM, where each component roughly corresponds to the CSF, GM, and WM classes. The parameters are initialized and updated iteratively using the Expectation Maximization (EM) (Dempster et al., 1977) algorithm implemented in the scikit-learn package for Python (Pedregosa et al., 2011) with default parameters. FIGURE 2 | Variation of the three considered GMM components in terms of mean µ k (left) and variance σ 2 k (right). The components vary much more in a multi-scanner dataset than in a single-scanner setting.
Once we estimate the parameters for each component k, we can use Bayes' rule to compute the probability of each class label C, such that: . (2)

Altering the Components of the GMM
If we modify the individual components of a 3-component GMM we can modify images in the training data by changing their GMM probability distributions while preserving the inherent image characteristics. We can create a new intensity distribution for each of the tissues by generating new parameters µ k → µ ′ k and σ 2 k → σ 2′ k for each of the components in an individual skull stripped image. To do this we: a) sample individual variation terms q µ k and q σ 2 k for each component from a uniform distribution, b) add these values to the original parameters, such that µ ′ k = µ k + q µ k and σ 2′ To define the range of the uniform distributions we use to sample the variation terms q µ k and q σ 2 k , we start by estimating the range of typical variation for each component from a large multiscanner collection of patient data (dataset C in section 3). To do this, all images are first skull stripped, intensities are clipped at percentiles 1 and 99 to remove extreme values, and normalized to the range [0, 1]. Then we fit a 3-component GMM to each image in the dataset using the same procedure as described above. We extract the mean µ k and variance σ 2 k values of each component. We then use the standard deviation (s(·)) of the estimated parameters to define the range of variability we allow. q µ k and q σ 2 k are sampled for each component from the uniform distributions U(−s(µ k ), s(µ k )) and U(−s(σ 2 k ), s(σ 2 k )), respectively. The distribution of the estimated parameters µ k and σ 2 k is depicted in Figure 2. In this figure we illustrate how the variability of the estimated parameters in a multi-scanner and multi-center setting is larger than that of a homogeneous dataset (same center, same scanner, same acquisition protocol) (dataset A in section 3). Besides differences in hardware, acquisitions in different centers tend to not be perfectly harmonized, which causes variations in contrast of the images. This is one of the many factors that contribute to the increased variability of the estimated parameters, and can be addressed by the proposed approach. For the multi-center data, mean and variance values for the 3 components have approximate standard deviations of s(µ) = {3, 6, 8}×10 −2 and s(σ 2 ) = {1, 1, 3}×10 −3 , respectively.
The choice of a uniform distribution for sampling the new variation terms implies that any random combination of tissue intensities can be generated. We could restrict this to more probable distributions by selecting a normal distribution. However, since exposing networks to extreme but anatomically plausible augmentation can be beneficial for learning (Billot et al., 2020), we decided to allow the possibility for some unrealistic combinations to arise.

Reconstruction
Once the new parameters have been defined, we could think that a logical next step would be to generate a new histogram of intensities by mixing the new Gaussian distributions and using histogram matching (Wang et al., 1998) techniques to generate a new image I ′ . However, doing this would not guarantee that structural information is preserved (e.g., two components could overlap or even shift order, and voxels from one tissue would be wrongly assigned to another class). To avoid this we describe the intensity v of some voxel n ∈ N in terms of the distance from the mean of the component measured with the Mahalanobis distance d vk = (v − µ k )/σ k . This implies that if we know the values of µ k and σ 2 k we can find the updated value of v → v ′ for each component k by preserving the distance d vk : Finally, we can compute the new intensity v ′ for a voxel n by leveraging each component by the initial probability Frontiers in Neuroscience | www.frontiersin.org that this voxel belonged to a certain class p(C = k|v), This guarantees that the voxels that have a high probability of belonging to a certain class will represent the same class, while allowing for nuanced variations at the borders between different tissues.

DATASETS AND EXPERIMENTAL SETUP
From here onwards, the method will be referred to as GMM-DA, for simplicity. In order to validate the GMM-DA method, we investigate the added value of the described method on the task of brain structure segmentation using a well described type of CNN architecture. We train the same network on two different datasets: a collection of single-scanner data from healthy subjects, and a multi-scanner and multi-center collection of patient data. We compare the performance of the models trained with and without the addition of the GMM-DA strategy. The different models are evaluated on manual segmentations and on test-retest data. The available datasets and the different experiments are described in the following sections.

A) OASIS
Contains T1w MRI scans from 416 subjects (age: [18,96] years) (the OASIS-1 cohort) (Marcus et al., 2007). Only 280 of the 316 healthy subjects were considered (see dataset B). The data was randomly split into train/validation/test sets [n = 179(64%)/45(16%)/56(20%)]. Although the data is heterogeneous from a population point of view, it is extremely homogeneous in terms of protocol and scanner. All images were acquired on a 1.5T Siemens Vision scanner, using the MP-RAGE sequence with constant repetition time (TR) and echo time (TE) (TR: 9.7 ms; TE: 4.0 ms). Slice thickness is also constant (1.25 mm).

B) MICCAI 2012
Contains 35 T1w scans from healthy subjects. The original MRI scans are from OASIS, but this dataset contains manual labels of brain structures. These data were provided for use in the MICCAI 2012 Grand Challenge and Workshop on Multi-Atlas Labeling (Landman and Warfield, 2012). All the images in this dataset were removed from OASIS prior to splitting the data into the different training and test sets, to avoid overlap. We exclude 5 scans from repeated subjects and use the remaining 30 for evaluating the methods on the manual labels.

C) MS Dataset
This is a collection of multi-center Additionally, we used this data to generate an independent test set, containing 92 images from 10 different scanner models. For an additional experiment we pooled a train/validation set of 251/44 images, ensuring that any scanner models present in the pre-selected test set or in OASIS were not included.

D) Test-Retest Dataset
Contains T1w MRI scans from 10 MS patients. Each patient was scanned twice (with re-positioning) on three different 3T scanner types with different acquisition sequences: (i) Philips Achieva: 3D T1-weighted FSPGR sequence (TR 4.93 ms); (ii) Siemens Skyra: 3D T1-weighted MP-RAGE sequence (TR 2300 ms, TE 2.29 ms); (iii) GE Discovery MR450w: 3D T1-weighted FSPGR sequence (TR 7.32 ms, TE 3.14 ms). Further details regarding this data can be found in Jain et al. (2015). This dataset allows the models to be tested for consistency, both in an intra-scanner setting as well as in an inter-scanner setting.

Data Pre-processing
All images were normalized using a modified z-score function robust against outliers, where the median of the distribution was preferred over of the mean, and the standard deviation of the distribution was computed within percentiles 10 and 90. Additionally, images were bias-field corrected using the N4 inhomogeneity correction algorithm as implemented in the Advanced Normalization Tools (ANTs) toolkit (Tustison et al., 2010) and linearly registered to MNI space using the tools implemented in NiftyReg (Ourselin et al., 2001).

Model Architecture
For the segmentation task we use a 3D UNet architecture (Çiçek et al., 2016) with a few adaptations, namely: • Weight normalization layers (Salimans and Kingma, 2016) are added after each convolutional operation instead of batch normalization; • LeakyReLU (Maas et al., 2013) is used as the main activation function; • Strided convolutions are used instead of max pooling.
The models are trained using a combination of the soft-dice loss (L Dice ) and the weighted categorical cross-entropy loss (L wCE ), as suggested in Isensee et al. (2021): Considering that y n ∈ {0, 1} is the one-hot-encoded label of the n th voxel in the model's input andŷ n ∈ [0, 1] is the prediction output of the model for the same voxel, the soft-Dice loss is an extension to K classes of the popular Dice loss presented, as presented in Sudre et al. (2017): To deal with the accentuated class imbalance of this problem we use the weighted categorical cross-entropy loss similarly to what was described in Ronneberger et al. (2015). This loss function can be expressed as: where w nk is the weighting factor for the n-th voxel belonging to the k-class in the training set. These weights allow to compensate the scarcity of voxels from some of the classes. The network takes as input patches of size 128 × 128 × 128 and outputs probability maps of size 88 × 88 × 88. Kernel size is 3 × 3 × 3 and initial number of filters 16 (raised to the power of 2 at each layer in the encoder path). The model is implemented using Tensorflow 2.0 and trained until convergence using minibatch stochastic gradient descent (Adam optimizer) with initial learning rate λ = 0.001 on a machine equipped with a Tesla K80 Nvidia GPU (12 GB dedicated).

Experiments
To validate the approach we compare the performance of models trained with and without the addition of the GMM-DA strategy. First, we evaluate how a model trained on single scanner data generalizes to an unseen multi-scanner dataset (train on the OASIS training set, and evaluate on the OASIS test set and the MS dataset test set). This is the key experiment in the results, since we are particularly interested in evaluating the increase in generalizability of the CNN to multi-scanner and multicenter data after adding the augmentation step. Although we acknowledge the presence of white matter lesions in the images from the MS dataset, we decide not to deal with them explicitly in this context. Secondly, in order to evaluate how the same network performs on unseen scanners and centers when trained on heterogeneous data, we train the same models on the MS dataset described in section 3. We additionally investigate if the addition of GMM-DA in this setting is still beneficial. We proceed to compare these four approaches on manual labels and on the test-retest dataset. Finally, we evaluate how the presence of bias field (BF) on the training images impacts the performance of the GMM-DA. To this end, we train the same models on the MS dataset images, this time without the bias field correction step.
We train and evaluate a total of six models. The models are named according to the architecture (CNN), training data (OASIS or MS), presence of bias field (BF) on the training images and addition of the data augmentation (DA) step. As such, a model trained on the MS dataset, on data with bias field and to which GMM-DA was applied is termed CNN MS-BF-DA . The investigated models and a description of the data where they were trained (T) or evaluated (E) are summarized in Table 1.

Performance Metrics
Dice scores (DC), sensitivity (Se) and precision (Pr) are reported (complete Se and Pr results are given in the Supplementary Material). DC values are compared using Wilcoxon paired rank-sum and Levene tests to evaluate the null hypotheses H 0 that the results from the different models have equal median and variance values, respectively. These tests were selected given the presence of outliers and deviations from normality in the distributions (see Figure 3). When evaluating the reproducibility of the methods, absolute volume differences between acquisitions from the same patient are reported. Results are summarized in terms of median (P50) and percentile 10 (P10) or 90 (P90), where relevant.

GMM Augmentation of a Homogeneous Dataset
To evaluate the influence of the addition of GMM augmentation when training on a homogeneous dataset (OASIS), we test CNN OASIS and CNN OASIS-DA on the two cross sectional datasets with automated delineations (test sets of OASIS and MS dataset).  This will allow us to determine: (i) if applying GMM-DA decreases the performance on data similar to the training set in comparison to the base model, and (ii) how the models perform in a multi-scanner setting. The results in terms of Dice scores are summarized in Figure 3 and  Supplementary Table 1 for details). Additionally, we can observe in the right hand side panel of Figure 3 that the variability and incidence of outliers is reduced for CNN OASIS-DA . All these observations imply that the addition of GMM-DA greatly improves the performance of  the model to new data containing unseen scanner types from different centers.

GMM Augmentation of a Heterogeneous Dataset
Now that we have established that the addition of GMM-DA is beneficial for the generalization of a model trained on a homogeneous dataset to multi-scanner settings, we evaluate the performance of a model trained on the MS dataset, which is very heterogeneous. We additionally investigate the effect of adding GMM-DA when training on a dataset with these characteristics. The CNN MS and CNN MS-DA models are evaluated in the same way as the above, and results are summarized in Figure 4 and  Figure 4).

MS Dataset
As mentioned in section 3, the MS test set contains scanner types which were not present in the training set. CNN MS-DA outperforms CNN MS for all structures (Wilcoxon: p ≪ 0.05) except LV (Wilcoxon: p > 0.05) in terms of DC (see Table 3 and the right hand side panel of Figure 4). Se and Pr values are also generally lower in the CNN MS model, with local differences for specific tissues (see Supplementary Table 2 for details). This indicates that adding GMM-DA to an already heterogeneous dataset can further increase the generalizability of the network.

Comparison Between the Different Models
Given that large multi-scanner and multi-center datasets are not commonly available to researchers, we are particularly interested in the comparison between the model trained on OASIS with augmentation (CNN OASIS-DA ) against the model trained on the MS dataset without augmentation (CNN MS ). To facilitate the comparison, the performance of both models is displayed in Figure 5.

OASIS
For most of the evaluated structures, CNN MS shows a significant decrease in performance in comparison to CNN OASIS-DA . It is expected that the models trained on the MS dataset have generally lower performance than the models trained on OASIS, since the images in the MS dataset training set do not share the same characteristics as the ones in OASIS (as previously illustrated in Figure 2). The addition of GMM-DA to CNN MS can help reduce this performance gap, as seen in the previous section.

MS Dataset
Analyzing the right hand side panel of Figure 5, it is interesting to verify that CNN OASIS-DA approximates the variability of the CNN MS for all the structures. In terms of median DC values it sometimes equals or even surpasses its performance (GM, GP and CdN). It is important to keep in mind that the MS dataset contains pathological images which are not present in OASIS. CNN MS has been exposed to many more types of images, with some patients possibly presenting a small number of lesions. However, the contrary is not true, given that OASIS only contains images from healthy subjects. At best, the networks trained on this data were exposed to a few lesions present in the older subjects' scans. It is thus not possible to guarantee that the differences in performance between CNN MS and CNN OASIS-DA on a pathological dataset are caused only by scanner or acquisition variability. Nevertheless, these results show that with a simple data augmentation strategy it is possible to achieve competitive results on unseen data from various scanners and centers. In order to visualize the different results, Figure 6 illustrates the results obtained on three different images from the MS dataset using the four different models described so far. For simplicity, WM and GM are not shown. Looking at this figure it is very clear that when the image contrast is not good, the CNN OASIS model can produce segmentation results which infiltrate WM and CGM regions in unexpected ways. The addition of GMM-DA brings the results much closer to the ground truth results.

Evaluation on Manual Labels
To validate the performance of the models on manual segmentations we evaluate them on the MICCAI 2012 dataset. It is interesting to compare their performance against the performance of the method used to get the automated labels the models were trained on (icobrain). The results are summarized in Figure 7, where results which are statistically different to icobrain are indicated (Wilcoxon: p < 0.05).
For most structures the models reach comparable performance. CNN MS−DA is the model with overall best performance, but still does not surpass icobrain. For GM, CNN MS and CNN MS−DA achieve much lower performance than the other models. This is in line with the results observed for the OASIS dataset. Recalling that this dataset is derived from a subset of OASIS, CNN OASIS and CNN OASIS-DA were exposed to images with these characteristics during training, while CNN MS and CNN MS−DA were not. Variances are not statistically different for any tissue type. Se and Pr values are also comparable for all models, with mean Se ≈ 0.84, Pr ≈ 0.85.

Consistency on Test-Retest Data
By evaluating the models on the test-retest dataset described in section 3 it is possible to evaluate how each model deals with differences in scanner type. As previously mentioned, the dataset contains two repetitions per scanner in two or three different scanners. We compute the difference in predicted volume for each of the evaluated structures between same scanner repetitions (intra-scanner differences) and between the repetitions in different scanners (inter-scanner differences). We consider all possible scanner combinations, which means that we end up with 26 intra-scanner and 88 inter-scanner repetitions. We compare the performance of our methods against icobrain. As already mentioned, this method is clinically available. However, when performing longitudinal evaluations, this method has a key limitation: the results are considered reliable only if the two images being analyzed were acquired in the same, or compatible, scanner. As such, we are interested in achieving better inter-scanner volume estimation differences, and we consider inter-scanner results to be consistent if the volume differences are in a comparable range to the intra-scanner differences obtained by icobrain.
For a simplified overview of the results, we plot the distribution of volume differences for all the considered brain structures in Figure 8. Additionally, in Table 4 we showcase the results in terms of median and P90, which translates the variability in the distributions. We exclude the CNN OASIS model from the table, since it is clear from Figure 8 and Table 2 that the performance of this method is low for multi-scanner datasets.
Globally we observe that intra-scanner differences are much lower than inter-scanner differences for all the models. In the intra-scanner case, CNN OASIS produces a higher error than the other models for all structures. Interestingly, CNN OASIS-DA produces very stable results, comparable to or even better than icobrain for several structure types (Tha, HC, CdN). CNN MS-DA produces the most consistent results for most of the structures, especially when considering P90.
Regarding inter-scanner differences, we observe that the CNN OASIS model produces extremely large variability. The other models either compare to icobrain or produce more consistent results. The exception is WM and GM, where icobrain still outperforms the other methods in terms of consistency. This is in line with the previous observations that performance (in terms of Dice) was lower in these two tissues. The most important observation is that CNN MS-DA produces the most consistent results for all the substructures. The results for this model are  sometimes comparable to the values obtained by icobrain in the intra-scanner case (noticeably for HC and GP). Overall, the addition of GMM-DA results in a very significant improvement, both in comparison to icobrain and to the CNN MS method. Additionally, a very interesting observation is that CNN OASIS-DA achieves a performance which is comparable to that of CNN MS , sometimes even surpassing it (Tha, HC, CdN).

Influence of Bias Field
A bias field is an undesirable spatially smoothly varying low frequency signal that often corrupts MRI images (Juntu et al., 2005). A number of methods have been proposed to remove this signal from the images, and bias-field correction is often used as a pre-processing step. Given that this is a slow procedure which can sometimes produce underlying errors, it has become Best results are highlighted.
FIGURE 9 | Influence of bias field on the GMM. Images and corresponding histograms with GMM prediction overlaid in red. Left: original image with bias field; right: bias field corrected image. popular to skip bias field correction when using deep learning approaches, and instead allow the networks to learn the bias-field mechanisms, with good results (Kamnitsas et al., 2017). However, bias field correction is extremely important for GMM-based methods, since it changes the intensity profiles of the different tissues. This effect is illustrated in Figure 9, where the histogram of a bias field corrected image is compared to that of an image with bias field. It is very likely that when applying GMM-DA some of the voxels corresponding to WM will be treated as GM, or vice-versa. This implies that the structural information can be lost, which will very likely result in drop in the performance of a model trained on images with bias field and the addition of GMM-DA.
To test our hypothesis we trained two models, following the same scheme as in the previous experiments, on images with bias field. More specifically, we repeated the experiment from the previous section training on the MS dataset without the bias field correction at pre-processing. The results of this experiment are detailed in Table 5. As expected, applying GMM-DA on this type of data either decreases the performance of the method (WM, GM and LV), or has no effect on the segmentation performance. This is in line with our hypothesis and indicates that the GMM-DA should be applied on bias-field corrected images.

CONCLUSIONS AND FUTURE WORK
In this work we present a novel intensity-based data augmentation strategy. The main goal of this approach is to aid models trained on scanner-and center-homogeneous datasets generalizing to multi-scanner, multi-center data. The proposed method is fast, simple and can be added to any MRI training pipeline to generate images on-the-fly. We observed that applying the augmentation step while training on homogeneous data leads to a pronounced improvement in performance when the trained model is tested in multi-scanner data from difference centers. This is the case in terms of segmentation quality (as measured by Dice score), but also in the consistency of the produced prediction (as measured in terms of volume differences). When applied to the test-retest dataset there is a remarkable improvement, especially for repetitions in different scanners. The baseline model trained on homogeneous data produces extremely inconsistent results, while the same model with addition of GMM-DA compares to a model trained on multi-scanner, multi-center data. We additionally verify that applying GMM-DA when training a model on multi-center data results in an increase in performance, again both in terms of accuracy and consistency of the predictions. These observations are particularly interesting because large multi-scanner, multicenter datasets are not commonly available to researchers in the field. Nevertheless, even when such a dataset is available, it is possible to obtain even more generalization by adding a simple augmentation strategy. It should be noted that the heterogeneous dataset contains several sources of variability, including acquisition sequence parameters. The resulting contrast variability is also addressed by the GMM-DA. Therefore, we can attribute the improvement in the generalization capabilities of the CNN not only to scanner, but also to generalization to unseen acquisition parameters, or other center-specific factors.
It is possible that combining this method with other DA procedures would result in an even more robust model. Nevertheless, we opted to restrict the augmentation procedures such that we could observe the added value of our method alone. Additionally, since the images were registered to MNI space adding geometric transformations such as rotations and flips is not necessary. Nonetheless, it is expected that the DA algorithm still works well if the images are in native space. Registration was performed as a way to simplify the learning of the network, since we were interested in comparing the effect of the augmentation step in a simplified setting.
There are a few limitations to the present work. Namely, the images need to be bias-field corrected as a pre-processing step to successfully apply the GMM-DA. We don't see this as a disadvantage, since GMM-DA is only needed at training time. We argue that it would be possible to add back the bias-field to the augmented image, which would allow the model to be effectively trained with bias field. This step would allow the final trained model to generalize to images with bias-field, thus eliminating the need for bias-field correction at inference time. Experimental validation of this claim remains out of the scope of the present work, given that it is related to improving the overall model performance, and is not connected to the effectiveness of the proposed approach.
Additionally, the presence of pathology in the MS dataset introduces an extra source of variability. In images with WM lesions, as is the case for MS, it is tempting to assume that a fourth component to the GMM would be a good way to capture the lesion class. However, lesions in T1w images overlap with the GM class in terms of intensity, for which reason it would be impossible to perfectly disentangle the two classes with the current framework. A more sophisticated approach would be necessary for this, likely at the cost of the possibility to generate images on-the-fly, unless lesion masks are available.
Finally, due to scarcity of manual delineations, the models were trained on automated segmentations. This is not ideal, because our model is likely to inherit any bias or known problems that might exist in the ground truth. However, given that we are especially interested in the effect of the augmentation we can still make a fair comparison between the approaches.
Although we focused on the task of brain structure segmentation in T1w MRI images, we believe this simple method has the potential to be used for other tasks in medical imaging that make use of MR images. As long as there are discernible, anatomically-related peaks in the intensity histograms, the method is transferable to other MR protocols and sequences. It is an open question whether the method is helpful for different tasks without further adaptations. For tasks such as lesion segmentation we hypothesize that if lesion masks are available it would be simple to adapt the method such that contrasts and intensities are locally modified within the abnormal area. We further see potential in this method to be adapted such that it offers a fast way to replace missing modalities in tasks requiring two or more MRI modalities (e.g., as often performed for brain tumor segmentation). This would expectably come at the expense of some performance power, but could allow existing pipelines to be used on incomplete data.
Given these considerations, an immediate next step would be to apply the current method to different applications (e.g., brain age or disability scores prediction from MR images) and verify our claim. A second step would be to extend the method to different types of brain lesions when such masks are available, to model the intensity of the tissues of interest individually, and test the added value of the extended method to applications such as detection, classification and segmentation of MS lesions, stroke, or brain tumors. Additional future directions include extending the augmentation method by introducing changes to the different components of the mixture such that they are not necessarily represented by Gaussian distributions. Moreover, it would be interesting to investigate how the addition of (preferably Rician) noise to the images would impact performance on unseen scanner types. Typical geometric distortions and bias fields can also be modeled and included in a more complex data augmentation scheme.

DATA AVAILABILITY STATEMENT
This research study was conducted using human subject data partly made available in open access by OASIS (https://www. oasis-brains.org/) (Marcus et al., 2007) and manual labelings by Neuromorphometrics, Inc. (http://Neuromorphometrics.com/) under academic subscription (Landman and Warfield, 2012). The data is released under the Creative Commons Attribution NonCommercial license (CC BY-NC) with no end date. The MS dataset is a subset of data processed with icobrain ms in clinical practice, for which subjects had agreed to allow icometrix to use an anonymized version of the already analysed MR images for research purposes.

AUTHOR CONTRIBUTIONS
MM: conceptualization, investigation, methodology, software, and writing-original draft preparation. ER and RP: methodology and writing-reviewing and editing. NP: conceptualization and writing-reviewing and editing. KV: supervision, writing-reviewing and editing, and funding acquisition. DS: supervision, conceptualization, writingreviewing and editing, and funding acquisition. All authors contributed to the article and approved the submitted version.

ACKNOWLEDGMENTS
This work builds on the preliminary research paper An augmentation strategy to mimic multi-scanner variability in MRI, accepted for presentation at the International Symposium on Biomedical Imaging (ISBI) 2021 and to be published in conference proceedings (Meyer et al., 2021).