Data Augmentation for Brain-Tumor Segmentation: A Review

Data augmentation is a popular technique which helps improve generalization capabilities of deep neural networks, and can be perceived as implicit regularization. It plays a pivotal role in scenarios in which the amount of high-quality ground-truth data is limited, and acquiring new examples is costly and time-consuming. This is a very common problem in medical image analysis, especially tumor delineation. In this paper, we review the current advances in data-augmentation techniques applied to magnetic resonance images of brain tumors. To better understand the practical aspects of such algorithms, we investigate the papers submitted to the Multimodal Brain Tumor Segmentation Challenge (BraTS 2018 edition), as the BraTS dataset became a standard benchmark for validating existent and emerging brain-tumor detection and segmentation techniques. We verify which data augmentation approaches were exploited and what was their impact on the abilities of underlying supervised learners. Finally, we highlight the most promising research directions to follow in order to synthesize high-quality artificial brain-tumor examples which can boost the generalization abilities of deep models.


INTRODUCTION
Deep learning has established the state of the art in many sub-areas of computer vision and pattern recognition (Krizhevsky et al., 2017), including medical imaging and medical image analysis (Litjens et al., 2017). Such techniques automatically discover the underlying data representation to build high-quality models. Although it is possible to utilize generic priors and exploit domain-specific knowledge to help improve representations, deep features can capture very discriminative characteristics and explanatory factors of the data which could have been omitted and/or unknown to human practitioners during the process of manual feature engineering (Bengio et al., 2013).
In order to successfully build well-generalizing deep models, we need huge amounts of ground-truth data to avoid overfitting of such large-capacity learners, and "memorizing" training sets (LeCun et al., 2016). It has become a significant obstacle which makes deep neural networks quite challenging to apply in the medical image analysis field, where acquiring high-quality ground-truth data is time-consuming, expensive, and very human-dependent, especially in the context of brain-tumor delineation from magnetic resonance imaging (MRI) (Isin et al., 2016; Angulakshmi and Lakshmi Priya, 2017; Marcinkiewicz et al., 2018; Zhao et al., 2019). Additionally, the majority of manually-annotated image sets are imbalanced: examples belonging to some specific classes are often under-represented. To combat the problem of limited medical training sets, data augmentation techniques, which generate synthetic training examples, are being actively developed in the literature (Hussain et al., 2017; Gibson et al., 2018; Park et al., 2019).
In this review paper, we analyze the brain-tumor segmentation approaches available in the literature, and thoroughly investigate which techniques have been utilized by the participants of the Multimodal Brain Tumor Segmentation Challenge (BraTS 2018). To the best of our knowledge, the dataset used for the BraTS challenge is currently the largest and the most comprehensive brain-tumor dataset utilized for validating existent and emerging algorithms for detecting and segmenting brain tumors. Also, it is heterogeneous in the sense that it includes both low- and high-grade lesions, and the included MRI scans have been acquired at different institutions (using different MR scanners). We discuss the brain-tumor data augmentation techniques already available in the literature, and divide them into several groups depending on their underlying concepts (section 2). Such MRI data augmentation approaches have been applied to augment other datasets as well, also acquired for different organs (Amit et al., 2017; Nguyen et al., 2019; Oksuz et al., 2019).
In the BraTS challenge, the participants are given multimodal MRI data of brain-tumor patients (as already mentioned, both low- and high-grade gliomas), alongside the corresponding ground-truth multi-class segmentations (section 3). In this dataset, different sequences are co-registered to the same anatomical template and interpolated to the same resolution of 1 mm³. The task is to build a supervised learner which is able to generalize well over the unseen data which is released during the testing phase. In section 4, we summarize the augmentation methods reported in 20 papers published in the BraTS 2018 proceedings. Here, we focused on those papers which explicitly mentioned that data augmentation had been utilized, and clearly stated what kind of data augmentation had been applied. Although such augmentations are single-modal, meaning that they operate over the MRI from a single sequence, they can be easily applied to co-registered series, hence to augment multi-modal tumor examples. Finally, the paper is concluded in section 5, where we summarize the advantages and disadvantages of the reviewed augmentation techniques, and highlight the promising research directions which emerge from (not only) BraTS.

DATA AUGMENTATION FOR BRAIN-TUMOR SEGMENTATION
Data augmentation algorithms for brain-tumor segmentation from MRI can be divided into the following main categories (which we render in a taxonomy presented in Figure 1): the algorithms exploiting various transformations of the original data, including affine image transformations (section 2.1), elastic transformations (section 2.2), pixel-level transformations (section 2.3), and various approaches for generating artificial data (section 2.4). In the following subsections, we review the approaches belonging to all groups of such augmentation methods in more detail.
Traditionally, data augmentation approaches have been applied to increase the size of training sets, in order to allow large-capacity learners to benefit from more representative training data (Wong et al., 2016). There is, however, a new trend in the deep learning literature, in which examples are augmented on the fly (i.e., during the inference), in the test-time¹ augmentation process. In Figure 2, we present a flowchart in which both training- and test-time data augmentation is shown. Test-time data augmentation can help increase the robustness of a trained model by simulating the creation of a homogeneous ensemble, where (n + 1) models (of the same type, and trained over the same training data) vote for the final class label of an incoming test example, and n denotes the number of artificially-generated samples, elaborated for the test example which is being classified. The robustness of a deep model is often defined as its ability to correctly classify previously unseen examples; such incoming examples are commonly "noisy" or slightly "perturbed" when confronted with the original data, therefore they are more challenging to classify and/or segment (Rozsa et al., 2016). Test-time data augmentation can be exploited for estimating the level of uncertainty of deep networks during the inference; it brings new exciting possibilities in the context of medical image analysis, where quantifying the robustness and deep-network reliability are crucial practical issues (Wang et al., 2019). This type of data augmentation can utilize those methods which modify an incoming example, e.g., by applying affine, pixel-level, or elastic transformations in the case of brain-tumor segmentation from MRI.
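The ensemble-like voting described above can be sketched in a few lines. This is a minimal illustration only; the function name `predict_with_tta` and the choice of flips as the test-time augmentations are our assumptions, not the setup of any particular BraTS entry. The per-pixel probability maps predicted for flipped copies are un-flipped and averaged with the original prediction:

```python
import numpy as np

def predict_with_tta(model_predict, image, n_aug=3):
    """Test-time augmentation: average a model's predictions over the
    original image and its flipped copies (a hypothetical sketch;
    `model_predict` maps an image to a per-pixel probability map)."""
    # Prediction for the original image.
    probs = model_predict(image)
    # Augment with flips along each axis, predict, and undo the flip
    # so that all probability maps are aligned with the original image.
    n_flips = min(n_aug, image.ndim)
    for axis in range(n_flips):
        flipped = np.flip(image, axis=axis)
        probs = probs + np.flip(model_predict(flipped), axis=axis)
    return probs / (n_flips + 1)
```

For segmentation, the averaged map can then be thresholded (or arg-maxed across classes); for plain classification, the same scheme averages class-probability vectors and the un-flipping step is unnecessary.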

Data Augmentation Using Affine Image Transformations
In the affine approaches, existent image data undergo different operations (rotation, zooming, cropping, flipping, or translations) to increase the number of training examples (Pereira et al., 2016; Liu et al., 2017). Shin et al. pointed out that such traditional data augmentation techniques fundamentally produce very correlated images (Shin et al., 2018), therefore they can offer very little improvement to the deep-network training process and future generalization over the unseen test data (such examples do not regularize the problem sufficiently). Additionally, they can also generate anatomically incorrect examples, e.g., using rotation. Nevertheless, affine image transformations are trivial to implement (in both 2D and 3D), they are fairly flexible (due to their hyper-parameters), and are widely applied in the literature. In an example presented in Figure 3, we can see that applying simple data augmentation techniques can lead to a significant increase in the number of training samples.

Flip and Rotation
Random flipping creates a mirror reflection of an original image along one (or more) selected axis. Usually, natural images can be flipped along the horizontal axis, which is not the case for the vertical one because the up and down parts of an image are not always "interchangeable." A similar property holds for MRI brain images: in the axial plane a brain has two hemispheres, and the brain (in most cases) can be considered anatomically symmetrical. Flipping along the horizontal axis swaps the left hemisphere with the right one, and vice versa. This operation can help various deep classifiers, especially those benefitting from the contextual tumor information, become invariant with respect to the tumor position within the brain, which would otherwise be difficult for non-representative training sets (e.g., containing brain tumors located only in the left or right hemisphere).
FIGURE 1 | Data augmentation for brain-tumor segmentation: a taxonomy.
FIGURE 2 | Flowchart presenting training- and test-time data augmentation. In the training-time data augmentation approach, we generate synthetic data to increase the representativeness of a training set (and ultimately build better models), whereas in test-time augmentation, we benefit from the ensemble-like technique, in which multiple homogeneous classifiers vote for the final class label for an incoming example by classifying this sample and a number of its augmented versions.
FIGURE 3 | Applying affine and pixel-level (discussed in more detail in section 2.3) transformations can help significantly increase the size (and potentially representativeness) of training sets. In this example, we generate seven new images based on the original MRI (coupled with its ground truth in the bottom row).
Similarly, rotating an image by an angle α around the center pixel can be exploited in this context. This operation is followed by appropriate interpolation to fit the original image size. The rotation operation, denoted as R, is often coupled with zero-padding applied to the missing pixels:
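The displayed equation appears to have been lost during text extraction. A plausible reconstruction, consistent with the description of R as a rotation by an angle α around the center pixel (with pixel coordinates taken relative to the image center, and zero-padding for pixels mapped from outside the image), is the standard 2D rotation matrix:

```latex
R(\alpha) =
\begin{bmatrix}
\cos\alpha & -\sin\alpha \\
\sin\alpha & \cos\alpha
\end{bmatrix}
```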

Translation
The translation operation shifts the entire image by a given number of pixels in a chosen direction, while applying padding accordingly. It allows the network to not become focused on features present mainly in one particular spatial region, but forces the model to learn spatially-invariant features instead. As in the case of rotation, and since the MRI scans of different patients available in training sets are often not co-registered, translation of an image by a given number of pixels along a selected axis (or axes) can create useful and viable images. However, this procedure may not be "useful" for all deep architectures: convolutional neural networks exploit convolutions and pooling operations, which are intrinsically spatially-invariant (Asif et al., 2018).
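A minimal 2D sketch of the rotation and translation augmentations discussed above, using SciPy; the function name `augment_affine` and its default values are illustrative, not taken from any of the reviewed papers:

```python
import numpy as np
from scipy.ndimage import rotate, shift

def augment_affine(image, angle_deg=0.0, offset=(0, 0)):
    """Rotate an image by `angle_deg` around its center and translate it
    by `offset` pixels, zero-padding the regions that move out of view.
    A minimal 2D sketch of the affine augmentations described above."""
    # `reshape=False` keeps the original image size; pixels introduced
    # by the rotation are filled with zeros (cval=0.0).
    rotated = rotate(image, angle_deg, reshape=False, order=1, cval=0.0)
    # Shift by a whole number of pixels along each axis, again zero-padding.
    return shift(rotated, offset, order=0, cval=0.0)
```

In practice, the angle and offset would be drawn at random (within anatomically sensible ranges) every time a training example is sampled.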

Scaling
The scaling operation (S) resizes an image, where the scaling factors are given as s_x and s_y for the x and y directions, respectively. As tumors vary in size, scaling can indeed bring viable augmented images into a training set. Since various deep architectures require images of a constant size, scaling is commonly paired with cropping to maintain the original image dimensions. Such augmented brain-tumor examples may manifest tumoral features at different scales. Also, cropping can limit the field of view only to those parts of the image which are important (Menze et al., 2015).
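The displayed equation defining the scaling operation appears to be missing from the extracted text; in two dimensions, the standard form matching the surrounding description (with s_x and s_y as above) is:

```latex
S =
\begin{bmatrix}
s_x & 0 \\
0 & s_y
\end{bmatrix}
```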

Shearing
The shear transformation (H) displaces each point in an image in a selected direction. This displacement is proportional to the point's distance from the line which goes through the origin and is parallel to this direction; h_x and h_y denote the shear coefficients in the x and y directions, respectively (as previously, we consider two dimensions for readability). Although this operation can deform shapes, it is rarely used to augment medical image data because we often want to preserve the original shape characteristics (Frid-Adar et al., 2018).
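The shear matrix itself seems to have been dropped during extraction; a standard 2D form consistent with the notation above is:

```latex
H =
\begin{bmatrix}
1 & h_x \\
h_y & 1
\end{bmatrix}
```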

Data Augmentation Using Elastic Image Transformations
Data augmentation algorithms based on unconstrained elastic transformations of training examples can introduce shape variations. They can bring lots of noise and damage into the training set if the deformation field varies too strongly; see an example by Mok and Chung (2018) in which a widely-used elastic transform produced a totally unrealistic synthetic MRI scan of a human brain. If the simulated tumors were placed in "unrealistic" positions, it would likely force the segmentation engine to become invariant to contextual information and rather focus on the lesion's appearance features (Dvornik et al., 2018). Although there are works which indicate that such aggressive augmentation may deteriorate the performance of the models in brain-tumor delineation, it is still an open issue. Chaitanya et al. (2019) showed that visually non-realistic synthetic examples can improve the segmentation of cardiac MRI, and noted that it is slightly counter-intuitive; it may have occurred due to the inherent structural and deformation-related characteristics of the cardiovascular system. Finally, elastic transformations often benefit from B-splines (Huang and Cohen, 1996; Gu et al., 2014) or random deformations (Castro et al., 2018). Diffeomorphic mappings play an important role in brain imaging, as they are able to preserve topology and generate biologically plausible deformations. In such transformations, the diffeomorphism φ (also referred to as a diffeomorphic mapping) is given in the spatial domain Ω of a source image I, and transforms I to the target image J: J(x) = I(φ⁻¹(x, 1)). The mapping is the solution of the differential equation dφ(x, t)/dt = v(φ(x, t), t), where φ(x, t): Ω × [0, 1] → Ω is a geodesic path, v is a time-dependent velocity field, and Ω ⊂ ℝᵈ (d denotes the dimensionality of the spatial domain).
In Nalepa et al. (2019a), we exploited the directly manipulated free-form deformation, in which the velocity vector fields are regularized using B-splines (Tustison et al., 2009); in this formulation, B(·) are the B-spline basis functions, N denotes the number of pixels in the domain of the reference image, r is the spline order (in all dimensions), and ∂ξ/∂x is the gradient of the spatial similarity metric at a pixel c. The B-spline functions act as regularizers of the solution for each parametric dimension (Tustison and Avants, 2013).
Examples of brain-tumor images generated using diffeomorphic registration are given in Figure 4; such artificially-generated data significantly improved the abilities of deep learners, especially when combined with affine transformations, as we showed in Nalepa et al. (2019a). The generated (I′) images preserve the topological information of the original image data (I) with subtle changes to the tissue. Diffeomorphic registration is not limited to images exposing anatomical structures (Tward and Miller, 2017). In Figure 5, we present examples of simple shapes which underwent this transformation; the topological information is clearly maintained in the generated images as well.
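For contrast with the diffeomorphic approach, a widely-used unconstrained elastic transform (in the spirit of Simard et al., 2003) can be sketched as below. Note that, unlike diffeomorphic mappings, nothing here guarantees topology preservation, which is exactly the failure mode discussed above; the parameter defaults are illustrative, not values from the reviewed papers:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_transform(image, alpha=34.0, sigma=4.0, seed=None):
    """Random elastic deformation of a 2D image: smooth a random
    displacement field with a Gaussian of width `sigma`, scale it by
    `alpha`, and resample the image along the displaced grid."""
    rng = np.random.default_rng(seed)
    shape = image.shape
    # Per-pixel random displacements in [-1, 1], smoothed so that the
    # deformation field is spatially coherent.
    dy = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    dx = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    y, x = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]), indexing="ij")
    coords = np.stack([y + dy, x + dx])
    # Bilinear resampling; out-of-bounds pixels are filled with zeros.
    return map_coordinates(image, coords, order=1, mode="constant", cval=0.0)
```

Large `alpha` relative to `sigma` produces exactly the aggressive, anatomically implausible deformations the text warns about; the same field should be applied to the ground-truth mask (with nearest-neighbor resampling) to keep image and labels aligned.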

Data Augmentation Using Pixel-Level Image Transformations
There exist augmentation techniques which do not alter the geometrical shape of an image (therefore, all geometrical features remain unchanged during the augmentation process), but affect the pixel intensity values (either locally, or across the entire image). Such operations can be especially useful in medical image analysis, where different training images are acquired in different locations and using different scanners, hence can be intrinsically heterogeneous in the pixel intensities, intensity gradients, or "saturation"². During the pixel-level augmentation, the pixel intensities are commonly perturbed using either random or zero-mean Gaussian noise (with the standard deviation corresponding to the appropriate data dimension), with a given probability (the former operation is referred to as the random intensity variation). Other pixel-level operations include shifting and scaling of pixel-intensity values (and modifying the image brightness), applying gamma correction and its multiple variants (Agarwal and Mahajan, 2017; Sahnoun et al., 2018), sharpening, blurring, and more (Galdran et al., 2017). This kind of data augmentation is often exploited for high-dimensional data, as it can be conveniently applied to selected dimensions (Nalepa et al., 2019b).
FIGURE 4 | Diffeomorphic image registration applied to example brain images allowed for obtaining visually-plausible generated images. For source (I), target (J), and artificially generated (I′) images, we also present tumor masks overlayed over the corresponding original images (in yellow; rows with the o subscript), alongside a zoomed part of a tumor (rows with the z superscript).
FIGURE 5 | Diffeomorphic image registration applied to basic shapes which underwent simple affine registration (translation) before diffeomorphic mapping. Source images (I) transformed to match the corresponding targets (J) still clearly expose their spatial characteristics (I′).
FIGURE 6 | Generative adversarial networks are aimed at generating fake data (by a generator; potentially using some available data characteristics) which is indistinguishable from the original data by the discriminator. Therefore, the generator and discriminator compete with one another.
Frontiers in Computational Neuroscience | www.frontiersin.org | December 2019 | Volume 13 | Article 83
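The pixel-level operations above can be combined into a simple sketch; the normalization step, the function name, and the parameter defaults are our assumptions rather than a recipe from any reviewed paper:

```python
import numpy as np

def pixel_level_augment(image, gamma=1.0, noise_std=0.0, shift=0.0, seed=None):
    """Pixel-level augmentation: gamma correction, additive zero-mean
    Gaussian noise, and an intensity (brightness) shift. The image
    geometry is untouched; only intensities change."""
    rng = np.random.default_rng(seed)
    # Normalize to [0, 1] before gamma correction to keep it well-defined.
    lo, hi = image.min(), image.max()
    out = (image - lo) / (hi - lo + 1e-8)
    out = out ** gamma                                    # gamma correction
    out = out + rng.normal(0.0, noise_std, image.shape)   # Gaussian noise
    return out + shift                                    # brightness shift
```

With the default arguments the function reduces to min-max normalization, so each perturbation can be enabled (and randomized) independently during training.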

Data Augmentation by Generating Artificial Data
To alleviate the problems related to the basic data augmentation approaches (including the problem of generating correlated data samples), various approaches toward generating artificial data (GAD) have been proposed. Generative adversarial networks (GANs), originally introduced in Goodfellow et al. (2014), are being exploited to augment medical datasets (Han et al., 2019; Shorten and Khoshgoftaar, 2019). The main objective of a GAN (Figure 6) is to generate a new data example (by a generator) which will be indistinguishable from the real data by the discriminator (the generator competes with the discriminator, and the overall optimization mimics the min-max game). Mok and Chung proposed a new GAN architecture which utilizes a coarse-to-fine generator whose aim is to capture the manifold of the training data and generate augmented examples (Mok and Chung, 2018). Adversarial networks have also been used for semantic segmentation of brain tumors (Rezaei et al., 2017), brain-tumor detection (Varghese et al., 2017), and image synthesis of different modalities. Although GANs allow us to introduce invariance and robustness of deep models with respect to not only affine transforms (e.g., rotation, scaling, or flipping), but also to some shape and appearance variations, the convergence of the adversarial training and the existence of its equilibrium point remain open issues. Finally, there exist scenarios in which the generator renders multiple very similar examples which cannot improve the generalization of the system; this is known as the mode collapse problem (Wang et al., 2017). An interesting approach for generating phantom image data was exploited in Gholami et al. (2018), where the authors utilized a multi-species partial differential equation (PDE) growth model of a tumor to generate synthetic lesions. However, such data does not necessarily follow the correct intensity distribution of a real MRI, hence it should be treated as a separate modality, because using artificial data which is sampled from a very different distribution may adversely affect the overall segmentation performance by "tricking" the underlying deep model (Wei et al., 2018). The tumoral growth model itself captured the time evolution of enhancing and necrotic tumor concentrations together with the edema induced by a tumor. Additionally, the deformation of a lesion was simulated by incorporating the linear elasticity equations into the model.
² These variations can, however, be alleviated by appropriate data standardization.
To deal with the different data distributions, the authors applied CycleGAN (Zhu et al., 2017) for performing domain adaptation (from the generated phantom data to the real BraTS MRI scans). The experimental results showed that the domain adaptation was able to generate images which were practically indistinguishable from the real data, therefore could be safely included in the training set.
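For reference, the min-max game played by the generator G and the discriminator D, as formulated by Goodfellow et al. (2014), reads:

```latex
\min_G \max_D \;
\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

Here, D is trained to assign high probability to real examples x and low probability to generated examples G(z), while G is trained to fool D.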
A promising approach of combining training samples using their linear combinations (referred to as mixup) was proposed by Zhang et al. (2017), and further enhanced for medical image segmentation by Eaton-Rosen et al. in their mixmatch algorithm (Eaton-Rosen et al., 2019), which additionally introduced a technique of selecting training samples that undergo linear combination. Since medical image datasets are often imbalanced (with the tumorous examples constituting the minority class), training patches with the highest "foreground amounts" (i.e., the number of pixels annotated as tumorous) are combined with those with the lowest concentration of foreground. The authors showed that their approach can increase performance in medical-image segmentation tasks, and related its success to the mini-batch training. It is especially relevant in medical-image analysis, because the sizes of input scans are usually large, hence the batches are small to keep the training memory requirements feasible in practice. Such data-driven augmentation techniques can also benefit from growing ground-truth datasets (e.g., BraTS) which manifest large variability of brain tumors, to generate even more synthetic examples. Also, they could be potentially applied at test time to build an ensemble-like model, if a training patch/image which matches the test image being classified was efficiently selected from the training set.
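The mixup operation itself reduces to a few lines. The sketch below (with an illustrative `alpha`) combines two examples and their one-hot or soft labels, as proposed by Zhang et al. (2017); the selection of which samples to combine, as in mixmatch, is a separate step not shown here:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, seed=None):
    """mixup (Zhang et al., 2017): a convex combination of two training
    examples and of their labels, with the mixing weight drawn from a
    Beta(alpha, alpha) distribution."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, lam
```

Small `alpha` values concentrate the Beta distribution near 0 and 1, so most mixed examples stay close to one of the two originals.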

DATA
In this work, we analyzed the approaches which were exploited by the BraTS 2018 participants to segment brain tumors from MRI (45 methods have been published, Crimi et al., 2019), and verified which augmentation scenarios were exploited in these algorithms. All of those techniques have been trained over the BraTS 2018 dataset, consisting of MRI data of 285 patients with diagnosed gliomas: 210 patients with high-grade glioblastomas (HGG), and 75 patients with low-grade gliomas (LGG), and validated using the validation set of 66 previously unseen patients (both LGG and HGG; however, the grade has not been revealed) (Menze et al., 2015; Bakas et al., 2017a,b,c). Each study was manually annotated by one to four expert readers. The data comes in four co-registered modalities: native pre-contrast (T1), post-contrast T1-weighted (T1c), T2-weighted (T2), and T2 Fluid Attenuated Inversion Recovery (FLAIR). All the pixels have one of four labels attached: healthy tissue, Gd-enhancing tumor (ET), peritumoral edema (ED), and the necrotic and non-enhancing tumor core (NCR/NET). The scans were skull-stripped and interpolated to the same shape (155 × 240 × 240, with the voxel size of 1 mm³).
Importantly, this dataset manifests very heterogeneous image quality, as the studies were acquired across different institutions, and using different scanners. On the other hand, the delineation procedure was clearly defined, which allowed for obtaining similar ground-truth annotations across various readers. To this end, the BraTS dataset, as the largest, most heterogeneous, and carefully annotated set, has been established as a standard brain-tumor dataset for quantifying the performance of existent and emerging detection and segmentation approaches. This heterogeneity is pivotal, as it captures a wide range of tumor characteristics, and the models trained over BraTS are easily applicable for segmenting other MRI scans.
To show this desirable feature of the BraTS set experimentally, we trained our U-Net-based ensemble architecture (Marcinkiewicz et al., 2018) using (a) the BraTS 2019 training set (exclusively FLAIR sequences) and (b) our set of 41 LGG (WHO II) brain-tumor patients who underwent MR imaging with a MAGNETOM Prisma 3T system (Siemens, Erlangen, Germany) equipped with a maximum field gradient strength of 80 mT/m, and using a 20-channel quadrature head coil. The MRI sequences were acquired in the axial plane with a field of view of 230 × 190 mm, a matrix size of 256 × 256, and 1 mm slice thickness with no slice gap. In particular, we exploited exclusively FLAIR series with TE = 386 ms, TR = 5,000 ms, and an inversion time of 1,800 ms for segmentation of brain tumors. These scans underwent the same pre-processing as applied in the case of BraTS; however, they were not segmented following the same delineation protocol, hence the characteristics of the manual segmentation likely differ across (a) and (b). The 4-fold cross-validation showed that although the deep models trained over (a) and (b) gave statistically different results at p < 0.001, according to the two-tailed Wilcoxon test³, the ensemble of models trained over (a) correctly detected 71.4% (5/7 cases) of brain tumors in the WHO II test dataset, which included seven patients kept aside while building an ensemble, with the average whole-tumor DICE of 0.80, where DICE is given as 2·TP/(2·TP + FP + FN), and TP, FP, and FN denote the numbers of true-positive, false-positive, and false-negative pixels, respectively. Additionally exploiting the models trained over (b) will end up having the correct detection of 85.7% of tumors (6/7 cases), with the average whole-tumor DICE of 0.76. We can appreciate the fact that we were able to improve the detection, but the segmentation quality slightly dropped, showing that the detected case was challenging to segment. Finally, it is worth mentioning that this experiment sheds only some light on the effectiveness of applying deep models (or other data-driven techniques) trained over BraTS for analyzing different MRI brain images.
The manual delineation protocols were different, and the lack of inter-rater agreement may play a pivotal role in quantifying automated segmentation algorithms over such differently acquired and analyzed image sets: it is unclear if the differences result from the inter-rater disagreement or the incorrect segmentation (Hollingworth et al., 2006; Fyllingen et al., 2016; Visser et al., 2019).
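The DICE coefficient used throughout this comparison can be computed from binary masks as follows; this is a standard implementation, not code from any of the reviewed papers:

```python
import numpy as np

def dice(pred, target):
    """DICE coefficient between two binary masks:
    2*TP / (2*TP + FP + FN); returns 1.0 for two empty masks."""
    pred, target = np.asarray(pred, bool), np.asarray(target, bool)
    tp = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()  # equals 2*TP + FP + FN
    return 2.0 * tp / denom if denom > 0 else 1.0
```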

Example BraTS Images
Example BraTS 2018 images are rendered in Figure 7 (two low-grade and two high-grade glioma patients), alongside the corresponding multi-class ground-truth annotations. We can appreciate that different parts of the tumors are manifested in different modalities; e.g., the necrotic and non-enhancing tumor core is typically hypo-intense in T1-Gd when compared to T1. Therefore, multi-modal analysis appears crucial to fully benefit from the available image information.

BraTS 2018 Challenge
The BraTS challenge is aimed at evaluating the state-of-the-art approaches toward accurate multi-class brain-tumor segmentation from MRI. In this work, we review all published methods which were evaluated within the framework of the BraTS 2018 challenge; although 61 teams participated in the testing phase, only 45 methods were finally described and published in the post-conference proceedings (Crimi et al., 2019). We verify which augmentation techniques were exploited to help boost the generalization abilities of the proposed supervised learners. We exclusively focus on the 20 papers (44% of all manuscripts) in which the authors explicitly stated that augmentation had been used, and report the type of the applied augmentation.
In Table 1, we summarize all investigated brain-tumor segmentation algorithms, and report the deep models utilized in the corresponding works alongside the augmentation techniques. In most of the cases, the authors followed the cross-validation scenario, and divided the training set into multiple non-overlapping folds. Then, separate models were trained over such folds, and the authors finally formed an ensemble of heterogeneous classifiers (trained over different training data) to segment previously unseen test brain-tumor images.
TABLE 1 | For the methods reported by Lachinov et al. (2018) and Wang et al. (2018), we analyzed the best-performing models. *The authors verified the impact of data augmentation over the training set. Note that Gholami et al. (2018) and Lachinov et al. (2018) did not present the Hausdorff distances obtained using their approaches. In the prime versions, we applied elastic deformations. This table comes from our previous paper.
Also, there
are approaches, e.g., by Albiol et al. (2019) and Chandra et al. (2018), in which a variety of deep neural architectures were used. In the majority of investigated brain-tumor segmentation techniques, the authors applied relatively simple training-time data augmentation strategies; the combination of training- and test-time augmentation was used only in two methods (Rezaei et al., 2018; Wang et al., 2018). In 75% of the analyzed approaches, random flipping was executed to increase the training set size and provide anatomically correct brain images⁴. Similarly, rotating and scaling MRI images was applied in 40% and 45% of the techniques, respectively. Since modern deep network architectures are commonly translation-invariant, this type of affine augmentation was used only in two works. Although other augmentation strategies were not as popular as easy-to-implement affine transformations, it is worth noting that the pixel-wise operations were utilized in all of the top-performing techniques (the algorithms by Myronenko (2018), Isensee et al. (2018), and McKinley et al. (2018) achieved the first, second, and third place across all segmentation algorithms⁵, respectively). Additionally, Isensee et al. (2018) exploited elastic transformations in their aggressive data augmentation procedure, which significantly increased the size and representativeness of their training sets, and ultimately allowed for outperforming a number of other learners. Interestingly, the authors showed that the state-of-the-art U-Net architecture can be extremely competitive with other (much deeper and more complex) models if the data is appropriately curated. This, in turn, manifests the importance of data representativeness and quality in the context of robust medical image analysis.
In Figure 8, we visualize the DICE scores obtained using almost all investigated methods (Puybareau et al. (2018) and Rezaei et al. (2018) did not report the results over the unseen BraTS 2018 validation set, therefore these methods are not included in the figure). It is worth mentioning that the trend is fairly coherent for all classes (whole tumor, tumor core, and enhancing tumor), and the best-performing methods by Isensee et al. (2018), McKinley et al. (2018), and Myronenko (2018) consistently outperform the other techniques in all cases. Although the success of these approaches obviously lies not only in the applied augmentation techniques, it is notable that the authors extensively benefit from generating additional synthetic data. Although data augmentation is introduced in order to improve the generalization capabilities of supervised learners, this impact was verified only in four BraTS 2018 papers (Benson et al., 2018; Gholami et al., 2018; Lachinov et al., 2018; Wang et al., 2018). Gholami et al. (2018) showed that their PDE-based augmentation delivers a very significant improvement in the DICE scores obtained for segmenting all parts of the tumors in the multi-class classification. The same performance boost (in the DICE values obtained for each class) was reported by Lachinov et al. (2018). Finally, Wang et al. (2018) showed that the proposed test-time data augmentation led to improving the performance of their convolutional neural networks.
In Table 2, we gather the DICE scores obtained with and without the corresponding data augmentation, alongside the change in DICE (reported in %; the larger the DICE score, the better the segmentation). Interestingly, training-time data augmentation appeared to adversely affect the performance of the algorithm presented by Benson et al. (2018). On the other hand, the authors showed that the Hausdorff distance, being the maximum distance from all points of the segmented lesion to the corresponding nearest point of the ground-truth segmentation (Sauwen et al., 2017), dropped significantly, hence the maximum segmentation error quantified by this metric was notably reduced (the smaller the Hausdorff distance, the better the segmentation; Table 3). The test-time data augmentation exploited by Wang et al. (2018) not only decreased DICE for the whole-tumor segmentation, but also increased the corresponding Hausdorff distance. The results come from our paper (Nalepa et al., 2019a). The best results are boldfaced.
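For reference, both metrics can be computed for binary masks as follows. This is a brute-force sketch assuming non-empty masks; we report the symmetric Hausdorff distance, whereas production evaluation pipelines typically use optimized implementations and often the 95th-percentile variant:

```python
import numpy as np

def dice(pred, target):
    """DICE = 2|A ∩ B| / (|A| + |B|); higher is better."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    return 2.0 * inter / denom if denom else 1.0

def hausdorff(pred, target):
    """Symmetric Hausdorff distance between two non-empty binary masks;
    lower is better (0 means the boundaries coincide)."""
    a = np.argwhere(pred)   # foreground coordinates of the prediction
    b = np.argwhere(target) # foreground coordinates of the ground truth
    # Pairwise Euclidean distances between all foreground voxels.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

The contrast between the two metrics explains the results above: DICE measures overlap (and is dominated by the bulk of the lesion), while the Hausdorff distance captures the worst-case boundary error, so augmentation can improve one while degrading the other.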
Therefore, applying it in the whole-tumor segmentation scenario deteriorated the abilities of the underlying models. Overall, the vast majority of methods neither report nor analyze the real impact of the incorporated augmentation techniques on the classification performance and/or inference time of their deep models. Although we believe the authors did investigate the advantages (and disadvantages) of their data generation strategies (either experimentally or theoretically), data augmentation is often used as a standard tool which is applied to any difficult data (e.g., imbalanced data with highly under-represented classes).

Beyond the BraTS Challenge
Although practically all brain-tumor segmentation algorithms which emerge in the recent literature have been tested over the BraTS datasets, we equipped our U-Nets with a battery of augmentation techniques (summarized in Table 4) and verified their impact over our clinical MRI data. In this experiment, we focused on the whole-tumor segmentation, as it was an intermediate step in the automated dynamic contrast-enhanced MRI analysis, in which perfusion parameters were extracted for the entire tumor volume. Additionally, this dataset was manually delineated by a reader (8 years of experience) who highlighted the whole-tumor areas only. We executed multi-step augmentation by applying both affine and elastic deformations of tumor examples, and increased the cardinality of our training sets up to 16×. In Figure 9, we can observe how executing simple affine transformations leads to new synthetic image patches. Since various augmentation approaches may be utilized at different depths of this augmentation tree, the number of artificial examples can be increased significantly. The multi-fold cross-validation experiments showed that introducing rotated training examples was pivotal in boosting the generalization abilities of the underlying deep models. To verify the statistical significance of the results, we executed the Friedman ranking tests, which revealed that the horizontal flip with additional rotation is crucial to build well-generalizing deep learners in the patch-based segmentation scenario (Table 5).
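The elastic component of such a multi-step augmentation can be sketched in the spirit of Simard-style elastic distortions: a random per-pixel displacement field is smoothed with a Gaussian filter and applied via interpolation. This is a minimal 2-D example; `alpha` (displacement magnitude) and `sigma` (field smoothness) are illustrative defaults, not the values used in our experiments:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, alpha=30.0, sigma=4.0, seed=0):
    """Apply a smooth random displacement field to a 2-D image."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    # Random displacements, smoothed so that neighboring pixels move coherently.
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    # Sample the image at the displaced coordinates (bilinear interpolation).
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.array([y + dy, x + dx])
    return map_coordinates(image, coords, order=1, mode="reflect")
```

In a multi-step augmentation tree, such a deformation would be composed with the affine operations (each applied with several parameter settings), which is how the cardinality of the training set can grow multiplicatively, e.g., up to 16×.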
Similarly, we applied diffeomorphic image registration (DIR) coupled with a recommendation algorithm to select training image pairs for registration in the data augmentation process (Nalepa et al., 2019a). The proposed augmentation was compared with random horizontal flipping, and the experiments indicated that the combined approach leads to statistically significant (Wilcoxon test at p < 0.01) improvements in DICE (Table 6). In Figure 10, we gather example segmentations obtained using our DIR+Flip deep model, alongside the corresponding DICE values. Although the network trained over the original training set could correctly detect and segment large tumors (Figures 10A,B), it failed for relatively small lesions which were under-represented in the training set (Figure 10C). Similarly, synthesizing artificial training examples (by applying rotation and flipping) helped improve the performance of our models for tumors located in brain areas which were not originally represented in the dataset.

CONCLUSION
In this paper, we reviewed the state-of-the-art data augmentation methods applied in the context of segmenting brain tumors from MRI. We carefully investigated all BraTS 2018 papers and analyzed the data augmentation techniques utilized in these methods. Our investigation revealed that affine transformations are still the most widely used in practice, since they are trivial to implement and can elaborate anatomically-correct brain-tumor examples. There are, however, augmentation methods which combine various approaches, also including elastic transformations. A very interesting research direction encompasses algorithms which can generate artificial images (e.g., based on tumoral growth models) that do not necessarily follow the real-life data distribution, but can be post-processed by other techniques to ensure the correctness of such phantoms.
The results showed that data augmentation was pivotal in the best-performing BraTS algorithms, and Isensee et al. (2018) experimentally proved that well-known and widely-used fully-convolutional neural networks can outperform other (perhaps much deeper and more complex) learners, if the training data is appropriately cleansed and curated. It clearly indicates the importance of introducing effective data augmentation methods for medical image data, which benefit from affine transformations (in 2D and 3D), pixel-wise modifications, and elastic transforms to deal with the problem of limited ground-truth data. In Table 7, we gather the advantages and disadvantages of all groups of brain-tumor data augmentation techniques analyzed in this review. Finally, these approaches can be easily applied to both single- and multi-modal scans, usually by synthesizing artificial examples separately for each image modality. Although data augmentation became a pivotal part of virtually all deep learning-powered methods for segmenting brain lesions (due to the lack of very large, sufficiently heterogeneous and representative ground-truth sets, with BraTS being an exception), there are still promising and unexplored research pathways in the literature. We believe that hybridizing techniques from various algorithmic groups, introducing more data-driven augmentations, and applying them at training- and test-time can further boost the performance of large-capacity learners. Also, investigating the impact of including not necessarily anatomically correct brain-tumor scans in training sets remains an open issue (see the examples of anatomically incorrect brain images which still manifest valid tumor characteristics in Figure 11).

FIGURE 11 | Anatomically incorrect brain images may still manifest valid tumor features; the impact of including such examples (which may be easily rendered by various data-generation augmentation techniques) into training sets for brain-tumor detection and segmentation tasks is yet to be revealed.

AUTHOR CONTRIBUTIONS
JN designed the study, performed the experiments, analyzed data, and wrote the manuscript. MM provided selected implementations and experimental results, and contributed to writing of some parts of the initial version of the manuscript. MK provided qualitative segmentation analysis and visualizations.

FUNDING
This work was supported by the Polish National Centre for Research and Development under the Innomed Grant (POIR.01.02.00-00-0030/15). JN was supported by the Silesian University of Technology funds (The Rector's Habilitation Grant No. 02/020/RGH19/0185). The research undertaken in this project led to developing Sens.AI, a tool for automated segmentation of brain lesions from T2-FLAIR sequences (https://sensai.eu). MK was supported by the Silesian University of Technology funds (Grant No. 02/020/BK_18/0128).