Multi-Disease Segmentation of Gliomas and White Matter Hyperintensities in the BraTS Data Using a 3D Convolutional Neural Network

An important challenge in segmenting real-world biomedical imaging data is the presence of multiple disease processes within individual subjects. Most adults above age 60 exhibit a variable degree of small vessel ischemic disease, as well as chronic infarcts, which will manifest as white matter hyperintensities (WMH) on brain MRIs. Subjects diagnosed with gliomas will also typically exhibit some degree of abnormal T2 signal due to WMH, rather than just due to tumor. We sought to develop a fully automated algorithm to distinguish and quantify these distinct disease processes within individual subjects’ brain MRIs. To address this multi-disease problem, we trained a 3D U-Net to distinguish between abnormal signal arising from tumors vs. WMH in the 3D multi-parametric MRI (mpMRI, i.e., native T1-weighted, T1-post-contrast, T2, T2-FLAIR) scans of the International Brain Tumor Segmentation (BraTS) 2018 dataset (ntraining = 285, nvalidation = 66). Our trained neuroradiologist manually annotated WMH on the BraTS training subjects, finding that 69% of subjects had WMH. Our 3D U-Net model had a 4-channel 3D input patch (80 × 80 × 80) from mpMRI, four encoding and decoding layers, and an output of either four [background, active tumor (AT), necrotic core (NCR), peritumoral edematous/infiltrated tissue (ED)] or five classes (adding WMH as the fifth class). For both the four- and five-class output models, the median Dice for whole tumor (WT) extent (i.e., union of AT, ED, NCR) was 0.92 in both training and validation sets. Notably, the five-class model achieved significantly (p = 0.002) lower/better Hausdorff distances for WT extent in the training subjects. There was strong positive correlation between manually segmented and predicted volumes for WT (r = 0.96) and WMH (r = 0.89). Larger lesion volumes were positively correlated with higher/better Dice scores for WT (r = 0.33), WMH (r = 0.34), and across all lesions (r = 0.89) on a log(10) transformed scale. While the median Dice for WMH was 0.42 across training subjects with WMH, the median Dice was 0.62 for those with at least 5 cm3 of WMH. We anticipate the development of computational algorithms that are able to model multiple diseases within a single subject will be a critical step toward translating and integrating artificial intelligence systems into the heterogeneous real-world clinical workflow.

An important challenge in segmenting real-world biomedical imaging data is the presence of multiple disease processes within individual subjects. Most adults above age 60 exhibit a variable degree of small vessel ischemic disease, as well as chronic infarcts, which will manifest as white matter hyperintensities (WMH) on brain MRIs. Subjects diagnosed with gliomas will also typically exhibit some degree of abnormal T2 signal due to WMH, rather than just due to tumor. We sought to develop a fully automated algorithm to distinguish and quantify these distinct disease processes within individual subjects' brain MRIs. To address this multi-disease problem, we trained a 3D U-Net to distinguish between abnormal signal arising from tumors vs. WMH in the 3D multi-parametric MRI (mpMRI, i.e., native T1-weighted, T1-post-contrast, T2, T2-FLAIR) scans of the International Brain Tumor Segmentation (BraTS) 2018 dataset (n training = 285, n validation = 66). Our trained neuroradiologist manually annotated WMH on the BraTS training subjects, finding that 69% of subjects had WMH. Our 3D U-Net model had a 4-channel 3D input patch (80 × 80 × 80) from mpMRI, four encoding and decoding layers, and an output of either four [background, active tumor (AT), necrotic core (NCR), peritumoral edematous/infiltrated tissue (ED)] or five classes (adding WMH as the fifth class). For both the four-and five-class output models, the median Dice for whole tumor (WT) extent (i.e., union of AT, ED, NCR) was 0.92 in both training and validation sets. Notably, the five-class model achieved significantly (p = 0.002) lower/better Hausdorff distances for WT extent in the training subjects. There was strong positive correlation between manually segmented and predicted volumes for WT (r = 0.96) and WMH (r = 0.89). Larger lesion volumes were positively correlated with higher/better Dice scores for WT (r = 0.33), WMH (r = 0.34), and across all lesions (r = 0.89) on a log(10) transformed scale. While the median Dice for WMH was 0.42 across training subjects with WMH, the median Dice was 0.62 for those with INTRODUCTION A significant challenge in the deployment of advanced computational methods into typical clinical workflows is the vast heterogeneity of disease processes, which are present both between individuals (inter-subject heterogeneity) and within individuals (intra-subject heterogeneity). Most adults over the age of 60 have a variable degree of abnormal signal on brain MRIs due to age-related changes manifesting as white matter hyperintensities (WMH), which are typically secondary to small vessel ischemic disease (SVID) and chronic infarcts that can be found in subjects with vascular risk factors and clinical histories of stroke and dementia (Wardlaw et al., 2015). These lesions can confound automated detection and segmentation of other disease processes, including brain tumors, which also result in abnormal signal in T2-weighted (T2) and T2 Fluidattenuated inversion recovery (T2-FLAIR) MRI scans secondary to neoplastic processes and associated edema/inflammation. We sought to address this challenge of intra-individual heterogeneity by leveraging (i) the dataset of the International Multimodal Brain Tumor Segmentation (BraTS) 2018 challenge (Menze et al., 2015;Bakas et al., 2017bBakas et al., , 2019 (ii) expert radiologist expertise, and (iii) three-dimensional (3D) convolutional neural networks (CNNs).
Advances in the field of segmentation and radiomics within neuro-oncology have been supported by data made available through The Cancer Imaging Archive (TCIA; Clark et al., 2013). Since 2012, the BraTS challenge has further curated TCIA glioma multi-parametric MRI (mpMRI) scans, segmentation of tumor sub-regions, and survival data in a public dataset and sponsored a yearly challenge to improve performance of automated segmentation and prognostication methods (Menze et al., 2015;Bakas et al., 2017bBakas et al., , 2019. Similar to BraTS, there have been large efforts for improving automatic segmentation of WMH (Griffanti et al., 2016;Habes et al., 2016), which include the MICCAI 2017 WMH competition (Li et al., 2018;Kuijf et al., 2019), as well as stroke lesions, through the Ischemic Stroke Lesion Segmentation Challenge (ISLES; Winzeck et al., 2018). Deep learning (DL) approaches for biomedical image segmentation are now established as superior to the previous generation of atlas-based and hand-engineered feature approaches (Fletcher-Heath et al., 2001;Gooya et al., 2012), as demonstrated by their performance in recent image segmentation challenges (Chang, 2017;Kamnitsas et al., 2017;Li et al., 2018;Bakas et al., 2019;Myronenko, 2019).
Deep learning relies on hierarchically organized layers to process increasingly complex intermediate feature maps and utilizes the gradient of the error in predictions with regard to the units of each layer to update model weights, known as "backpropagation." In visual tasks, this allows for the identification of lower-and intermediate-level image information (feature maps) to maximize classification performance based on annotated datasets (LeCun et al., 2015;Chartrand et al., 2017;Hassabis et al., 2017). Typically, CNNs, a class of feed-forward neural networks, have been used for image-based problems, achieving super-human performance in the ImageNet challenge (Deng et al., 2009. Krizhevsky et al., 2012. The U-Net architecture (Ronneberger et al., 2015;Cicek et al., 2016;Milletari et al., 2016) describes a CNN with an encoding convolutional arm and corresponding decoding [de]convolutional arm has been shown to be particularly useful for 3D biomedical image segmentation through its semantic-and voxel-wise approach, such as for segmentation of abnormal T2-FLAIR signal across a range of diseases (Duong et al., 2019).
Several prior machine learning approaches have been used to model inter-subject disease heterogeneity, such as distinguishing on an individual subject basis between primary CNS lymphoma and glioblastoma (Wang et al., 2011), or between different types of brain metastases (Kniep et al., 2018). There is evidence that these approaches may be superior to human radiologists (Suh et al., 2018), yet little work has been done to address intra-subject lesion heterogeneity. Notably, one recent study used CNNs to distinguish between WMH due to SVID versus stroke, finding that training a CNN to explicitly distinguish between these diseases allowed for improved correlation between SVID burden and relevant clinical variables (Guerrero et al., 2018). Although a large body of work has detailed methodological approaches to improve segmentation methods for brain tumors, to the best of our knowledge no prior studies have addressed intra-subject disease heterogeneity in the BraTS dataset.
Although the task of distinguishing between different diseases within an individual is typically performed subconsciously by humans, distinguishing between different diseases could be challenging for an automated system if it were not specifically designed and trained to perform such a task. When provided with enough labeled training data, image-based machine learning methods have shown success in identifying patterns that are imperceptible to humans. These include GBM subtypes related to specific genetic mutations (i.e., radiogenomics; Bakas et al., 2017a;Korfiatis et al., 2017;Akbari et al., 2018;Chang et al., 2018;Rathore et al., 2018), or imaging subtypes that are predictive of clinical outcomes . Therefore, we sought to train a 3D U-Net model to distinguish between abnormal radiographic signals arising from brain glioma versus WMH in individual subjects, in the mpMRI data of the BraTS 2018 challenge. We hypothesized that this would (1) allow for automatic differentiation of different disease processes, and (2) improve overall accuracy of segmentation of brain tumor extent of disease, particularly in subjects with a large amount of abnormal signal due to WMH.

Data
We utilized the publicly available data of the BraTS 2018 challenge that describe a multi-institutional collection of preoperative mpMRI brain scans of 351 subjects (n training = 285, and n validation = 66) diagnosed with high-grade (glioblastoma) and lower-grade gliomas. The mpMRI scans comprise native T1-weighted (T1), post-contrast T1-weighted (T1PC), T2, and T2-FLAIR scans. Pre-processing of the provided images included re-orientation to LPS (left-posterior-superior) coordinate system, co-registration to the same T1 anatomic template (Rohlfing et al., 2010), resampling to isotropic 1 mm 3 voxel resolution and skull-stripping as detailed in Bakas et al., 2019. Manual expert segmentation of the BraTS dataset delineated three tumor subregions: (1) Necrotic core (NCR), (2) active tumor (enhancing tissue; AT), and (3) peritumoral edematous/infiltrated tissue (ED). The whole tumor extent (WT) was considered the union of all these three classes.

Manual Annotation of WMH
In order to define the new tissue class of abnormal signal relating to WMH in the BraTS training subjects, a neuroradiologist (JR; neuroradiology fellow with extensive segmentation experience) defined manually segmentation masks of WMH using ITK-SNAP (Yushkevich et al., 2006). WMH were considered to be abnormal signal due to SVID, chronic infarcts, and/or any periventricular abnormal signal contralateral to the tumor. Examples of these new two class segmentations of the BraTS 2018 dataset are shown in Figure 1.
Our encoder-decoder type fully convolutional deep neural network consists of (1) an encoder limb (with successive blocks of convolution and downsampling encoding progressively deeper/higher-order spatial features), (2) a decoder limb (with a set of blocks -symmetric to those of the encoder limb -of upsampling and convolution, eventually mapping this encoded feature set back onto the input space), and (3) an introduced novel so-called skip connections (whereby outputs of encoding layers are concatenated with inputs to corresponding decoding layers) in order to improve spatial localization over previous generations of fully convolutional networks (3D Res-U-Net; Milletari et al., 2016).
The network was trained on an NVIDIA Titan Xp GPU (12GB), using the Xavier initialization scheme, Adam optimization algorithm (Kingma and Lei Ba, 2015; initial learning rate 1e −4 ), and 2nd order polynomial learning rate decay over 600 epochs. Training time was approximately 4.5 h. 10-fold internal cross validation on the training set was used for hyperparameter optimization and intrinsic estimation FIGURE 2 | Multiclass input and multiclass output U-Net schematic. Our U-Net has 4-channel input accepting 3D patches from mpMRI with four encoding and decoding layers, and either a four-class output (background, AT, NCR, and ED) or a five-class output (adding WMH as the fifth class).
of generalization performance during training. For inference on the validation set, the model was retrained 10 times independently on the entire training set (n = 285), and model predictions were averaged.
We trained models using this architecture twice; once with the four tissue classes originally annotated in the BraTS dataset, and again with the manual WMH segmentations added as a fifth class. All of the code has been made publicly available at https://github. com/johncolby/svid_paper.

Performance Metrics
Tissue segmentation performance was evaluated with the Dice metric (2×TP/(2×TP + FP + FN); TP = true positive; FP = false positive; FN = false negative; Dice, 1945) for the tumor segmentation in both models, as well as for WMH in the fiveclass model. In addition, the 95th percentile of the Hausdorff distance (Hausdorff 95 ) was used as a performance evaluation metric, to evaluate the distance between the centers of the predicted and the expert 3D segmentations. The metrics for the four tissue classes and the Hausdorff 95 distance were measured by submitting our segmentations to the online BraTS evaluation portal 1 (Davatzikos et al., 2018).

Further Exploration of U-Net Results
In order to further interrogate the performance of our proposed model, we performed correlations between manually segmented and predicted volumes for WT and WMH, as well as Bland Altman plots to assess agreement between the two measures of tissue volumes for both WT and WMH. For the evaluation of WMH, we performed correlations among the 196 cases that contained at least 100 mm 3 of WMH.
To better understand what could affect performance, we also evaluated correlations between total lesion volumes and Dice scores. 1 https://ipp.cbica.upenn.edu/

Segmentation Performance
The performance metrics for the training (10-fold cross validation) and validation subjects (final model) for each of the tissue classes in the four-and five-class models are shown in Table 1. We achieved a median Dice of 0.92 for WT in both the four-and five-class models, in both the training (p = 0.52; 10-fold cross validation) and validation datasets (p = 0.94). Segmentation performance on AT and tumor core (the union of AT and NCR) were also not significantly different between the four-and five-class models. There were no significant differences between tumor segmentation performance for high-or low-grade gliomas in the training set (p = 0.45).
The median Hausdorff 95 distance in the training data was significantly lower (p = 0.002; two tailed t-test) in the fiveclass model (3.0, interquartile range 2.2-9.0) than the traditional four-class model (3.5, interquartile range 2.2-4.9). Example training cases where the Hausdorff 95 distance were much better in the five-class model are shown in Figure 3 with predicted segmentations for AT, NCR, ED, and WMH for both the four and five-class models. However, the Hausdorff 95 distance was not significantly different in the validation data (p = 0.84). Example validation cases with greater than 5 cm 3 of WMH are shown in Figure 4, with predicted segmentations for AT, NCR, ED, and WMH.
We achieved a median Dice of 0.42 in the 189 subjects with WMH of at least 100 mm 3 . Median Dice for WMH in subjects with at least 1000 mm 3 (1 cm 3 ), 5000 mm 3 (5 cm 3 )

Correlation Between Predicted Lesion Volumes and Manual Segmented Volumes
Within the training dataset there was a strong correlation between manually segmented WT volume and predicted WT volume (Pearson r = 0.96, p < 0.0001; Figure 5A). There was also a strong correlation between manually segmented WMH volume and predicted WMH volume (Pearson r = 0.89, p < 0.0001; Figure 5B). Bland-Altman plots assessing agreement between manual and predicted volume for WT and WMH are shown in Figures 5C,D.

Correlations Between Lesion Volumes and Dice
Within the training dataset there was a significant correlation between manually segmented WT volumes and WT Dice scores (Pearson r = 0.33, p < 0.0001; Figure 6A) and between manually segmented WMH volumes and WMH Dice scores (Pearson r = 0.34; p < 0.0001; Figure 6B). When combining WT and WMH, there was a stronger correlation between lesion volumes and Dice scores (Pearson r = 0.68; p < 0.0001; Figure 6C), which was even stronger when the volumes were transformed to a logarithmic (log(10)) scale (Pearson r = 0.89; p < 0.0001; Figure 6D). There was no significant relationship between WMH volume and WT Dice scores (Pearson r = −0.05, p = 0.42).  intelligence systems into daily clinical practice is disease heterogeneity within subjects. In this study, we utilized the BraTS 2018 dataset and expert-revised WMH segmentations to train a state-of-the-art CNN to successfully distinguish and quantify abnormal signal due to WMH as a distinct tissue class from glioma tissue sub-regions. We used a 3D CNN (U-Net architecture; Cicek et al., 2016;Milletari et al., 2016) for multiclass tissue segmentation with performance at the top 10% of the BraTS 2018 leaderboard (Bakas et al., 2019; noting that we did not participate in the official competition). U-Nets have been particularly adept at medical image segmentation, due to their ability to convert feature maps obtained during convolutions into a vector and from that vector reconstruct a segmentation, which reduces distortion by preserving the structural integrity.

DISCUSSION
To our knowledge this is the first study to distinguish intrasubject lesion heterogeneity in the BraTS dataset, noting that Guerrero et al. (2018) previously used a U-Net architecture to distinguish chronic infarcts from WMH due to SVID. Although we hypothesized that adding WMH as a tissue class could improve tumor segmentation performance, we did not find a significant difference between tumor segmentation overlap (Dice) in the model that incorporated WMH as an additional class. Incorporating WMH as a distinct fifth-class did significantly (p = 0.002; two tailed t-test) improve the Hausdorff (95th percentile) distance metric within the training sample. As the Hausdorff 95 distance reflects the center of the lesion, and WMH are often far from the tumor, poorer Hausdorff 95 distance in the four-class model was likely due to false positive segmentations of WMH as tumor as demonstrated in Figure 3. However, upon reviewing validation cases with larger amounts of predicted WMH (Figure 4), it appeared that the original four-class model, although not explicitly trained to model WMH, mostly learned to implicitly ignore most WMH, likely due to spatial characteristics of the WMH being distant from the primary tumors and in characteristic locations and shapes. It is possible that the addition of WMH as an additional class could degrade segmentation performance of ED that was relatively distal to the center of tumor, thus the benefits of reducing distal WMH false positives in the five-class model may have be counterbalanced by increasing false negatives. As evidenced by the BraTS leaderboard, a Dice of ∼0.90 is considered excellent and has previously been shown to be at a level similar to inter-rater reliability for BraTS (Visser et al., 2019). As demonstrated in Figure 6, we found that lesion volume was an important predictor of Dice scores for both WT ( Figure 6A) and WMH (Figure 6B). When evaluating both WT and WMH (Figures 6C,D), we found that the majority of the variance in Dice scores was explained by lesion volume, particularly when transformed to a logarithmic scale ( Figure 6D). Thus, poorer performance for WMH in our data appear to largely be driven by smaller lesion sizes. This is consistent with prior literature that has also shown positive correlations between lesion volumes and Dice scores (Winzeck et al., 2018;Duong et al., 2019). Although our reported Dice scores for WMH appear relatively low (0.42), it should be noted that average volume of WMH in the MICCAI 2017 dataset was 16.9 cm 3 (Kuijf et al., 2019). When looking at cases with larger volumes of WMH (>10 cm 3 ) the average Dice score (0.67) was more similar to those reported in the 2017 MICCAI WMH dataset (0.70-0.80; Kuijf et al., 2019). A further explanation for reduced segmentation performance of smaller lesions may be lower inter-rater reliability, such as what has been reported in multiple sclerosis (Dice ∼0.60;Egger et al., 2017). A limitation of the current study is that there is only a single expert annotation for both the BraTS dataset and the WMH, thus the contribution of inter-rater reliability could not be assessed. In the future we also plan to improve detection of smaller lesions by using different neural network architectures, such as two-stage detectors (Girshick et al., 2014), or implementing different loss functions, such a focal loss (Lin et al., 2018;Abraham and Khan, 2019).
As artificial intelligence tools start to become integrated with clinical workflows for more precise quantitative assessments of disease burden, it will be necessary to distinguish, quantify and longitudinally assess a variety of disease processes, in order to assist with more accurate and efficient clinical decisionmaking. Explicitly tackling intra-subject disease heterogeneity by training models to perform these tasks should help translate these advanced computational methods into clinical practice.

DATA AVAILABILITY STATEMENT
The datasets generated for this study can be found in the BraTS 2018 dataset https://www.med.upenn.edu/sbia/brats2018/ data.html. The manual WMH segmentations and code used in this manuscript have been made available for public use at https: //github.com/johncolby/svid_paper.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
JR and JC performed the analysis and prepared the initial manuscript. DW, RS, and JW performed some of the analyses. AR, LS, and SB helped to direct the research. All authors helped to revise the manuscript.

FUNDING
This study was partly supported by the institutional T-32 Training Grants from the NIH/NIBIB (Penn T32 EB004311-10 and UCSF T32-EB001631-14) and NIH/NINDS:R01NS042645 and NIH/NCI:U24CA189523 grants. The content of this publication is solely the responsibility of the authors and does not represent the official views of the NIH. We would like to acknowledge the NVIDIA Corporation that donated three Titan Xp GPUs as part of the NVIDIA GPU grant program (awarded to JR, AR, and JC).