Knowing What You Know in Brain Segmentation Using Bayesian Deep Neural Networks

McClure, Patrick; Rho, Nao; Lee, John A.; Kaczmarzyk, Jakub R.; Zheng, Charles Y.; Ghosh, Satrajit S.; Nielson, Dylan M.; Thomas, Adam G.; Bandettini, Peter; Pereira, Francisco

doi:10.3389/fninf.2019.00067

ORIGINAL RESEARCH article

Front. Neuroinform., 17 October 2019

Volume 13 - 2019 | https://doi.org/10.3389/fninf.2019.00067

Knowing What You Know in Brain Segmentation Using Bayesian Deep Neural Networks

Patrick McClure^1,2^*

Nao Rho^1,2

John A. Lee^2,3

Jakub R. Kaczmarzyk⁴

Charles Y. Zheng^1,2

Satrajit S. Ghosh⁴

Dylan M. Nielson^2,3

Adam G. Thomas³

Peter Bandettini²

Francisco Pereira^1,2

¹Machine Learning Team, National Institute of Mental Health, Bethesda, MD, United States
²Section on Functional Imaging Methods, National Institute of Mental Health, Bethesda, MD, United States
³Data Sharing and Science Team, National Institute of Mental Health, Bethesda, MD, United States
⁴McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, United States

In this paper, we describe a Bayesian deep neural network (DNN) for predicting FreeSurfer segmentations of structural MRI volumes, in minutes rather than hours. The network was trained and evaluated on a large dataset (n = 11,480), obtained by combining data from more than a hundred different sites, and also evaluated on another completely held-out dataset (n = 418). The network was trained using a novel spike-and-slab dropout-based variational inference approach. We show that, on these datasets, the proposed Bayesian DNN outperforms previously proposed methods, in terms of the similarity between the segmentation predictions and the FreeSurfer labels, and the usefulness of the estimate uncertainty of these predictions. In particular, we demonstrated that the prediction uncertainty of this network at each voxel is a good indicator of whether the network has made an error and that the uncertainty across the whole brain can predict the manual quality control ratings of a scan. The proposed Bayesian DNN method should be applicable to any new network architecture for addressing the segmentation problem.

1. Introduction

Identifying which voxels in a structural magnetic resonance imaging (sMRI) volume correspond -to different brain structures (i.e., segmentation) is an essential processing step in neuroimaging analyses. These segmentations are often generated using the FreeSurfer package (Fischl, 2012), a process which can take a day or more for each subject (FreeSurfer, 2018). The computational resources for doing this at a scale of hundreds to thousands of subjects are beyond the capabilities of the computational resources available to most researchers. This has led to an interest in the use of deep neural networks as a general approach for learning to predict the outcome of a processing task, given the input data, in a much shorter time period than the processing would normally take. In particular, several deep neural networks have been trained to perform segmentation of brain sMRI volumes (Ronneberger et al., 2015; Fedorov et al., 2017a,b; Li et al., 2017; Dolz et al., 2018; Roy et al., 2018b), taking between a few seconds and a few minutes per volume. These networks predict a manual or an automated segmentation from the structural volumes [Fedorov et al. (2017a,b), Dolz et al. (2018), and Roy et al. (2018b) used FreeSurfer, Petersen et al. (2010) used GIF (Cardoso et al., 2015), and Rajchl et al. (2018) used FSL (Jenkinson et al., 2012), SPM (Friston et al., 1994) and MALP-EM (Ledig et al., 2015)].

These networks, however, have been trained on a limited number (on the order of hundreds or low thousands) of examples from a limited number of sites (i.e., locations and/or scanners), which can lead to poor cross-site generalization for complex segmentation tasks with a large number of classes (McClure et al., 2018). This includes several of the recent DNNs proposed for fine-grain sMRI segmentation. (Note: We focus on DNNs which predict >30 classes.)

Roy et al. (2018b) performed 33 class segmentation using 581 sMRI volumes from the IXI dataset to train an initial model and then fine-tuned on 28 volumes from the MALC dataset (Marcus et al., 2007). They showed an ~9.4% average Dice loss on out-of-site data from the ADNI-29 (Mueller et al., 2005), CANDI (Kennedy et al., 2012), and IBSR (Rohlfing, 2012) datasets. Fedorov et al. (2017a) used 770 sMRI volumes from HCP (Van Essen et al., 2013) to train an initial model and then fine-tuned on 7 volumes from the FBIRN dataset (Keator et al., 2016). Li et al. (2017) performed a 160 class segmentation using 443 sMRI volumes from the ADNI dataset (Petersen et al., 2010) for training. Rajchl et al. (2018) used 5,000 sMRI volumes from the UK Biobank Sudlow et al. (2015) dataset to train a model by using multi-task learning to simultaneously predict 4 SPM labels, 17 FSL labels, and 139 MALP-EM labels. However, Fedorov et al. (2017a), Li et al. (2017), and Rajchl et al. (2018) did not report test results for sites that where not used during training.

These results show that it is possible to train a neural network to carry out segmentation of a sMRI volume. However, they provide a limited indication of whether such a network would work on data from any new site not encountered in training. While fine-tuning on labeled data from new sites can improve performance, even while using small amounts of data (Fedorov et al., 2017a; McClure et al., 2018; Roy et al., 2018b), a robust neural network segmentation tool should generalize to new sites without any further effort.

As part of the process of adding segmentation capabilities to the “Nobrainer” tool¹, we trained a network to predict FreeSurfer segmentations given a training set of ~10,000 sMRI volumes. This paper describes this process, as well as a quantitative and qualitative evaluation of the performance of the resulting model.

Beyond the segmentation performance of the network, a second aspect of interest to us is to understand whether it is feasible for a network to indicate how confident it is about its prediction at each location in the brain. We expect the network to make errors, be it because of noise, unusual positioning of the brain, very different contrast than what it was trained on, etc. Because our model is probabilistic and seeks to learn uncertainties, we expect it to be less confident in its predictions in such cases. It is also possible that, for certain locations, there are very similar brain structures labeled as different regions in different people. In such locations, there would be a limit to how well the network could perform, the Bayes error rate (Hastie et al., 2005). Additionally, the network should be less confident for examples that are very different from those seen in the training set (e.g., contain large artifacts). While prediction uncertainty can be computed for standard neural networks, as done by Dolz et al. (2018), these uncertainty estimates are often overconfident (Guo et al., 2017; McClure and Kriegeskorte, 2017). Bayesian neural networks (BNNs) have been proposed as a solution to this issue. One popular BNN approach is Monte-Carlo (MC) Bernoulli Dropout (Srivastava et al., 2014; Gal and Ghahramani, 2016). Using this method, Li et al. (2017) and Roy et al. (2018a) showed that the segmentation performance of the BNN predictions was better for voxels with low dropout sampling-based uncertainties and that injected input noise can lead to increased uncertainty. Roy et al. (2018a) also found that using MC Bernoulli dropout decreased the drop in segmentation performance from 9.4% to 7.8% on average when testing on data from new sites compared to Roy et al. (2018b). However, MC Bernoulli dropout does not learn dropout probabilities from data, which can lead to not properly modeling the uncertainty of the predicted segmentation. Recent works has shown that these dropout probabilities can be learned using a concrete relaxation (Gal et al., 2017). Additionally, learning individual uncertainties for each weight has been shown to be beneficial for many purposes (e.g., pruning and continual learning) (Blundell et al., 2015; McClure et al., 2018; Nguyen et al., 2018). In this paper, we propose using both learned dropout uncertainties and individual weight uncertainties.

Finally, we test the hypothesis that overall prediction uncertainty across an entire image reflects its “quality,” as measured by human quality control (QC) scores. Given the effort required to produce such scores, there have been multiple attempts to either crowdsource the process (Keshavan et al., 2018) or automate it (Esteban et al., 2017). The latter, in particular, does not rely on segmentation information, so we believe it is worthwhile to test whether uncertainty derived from segmentation is more effective.

2. Methods

2.1. Data

2.1.1. Imaging Datasets

We combined several datasets (Table 1), many which themselves contain data from multiple sites, into a single dataset with 11,480 T1 sMRI volumes. In-site validation and test sets were created from the combined dataset using a 80-10-10 training-validation-test split. This resulted in a training set of 9,184 volumes, a validation set of 1,148 volumes, and a test set of 1,148 volumes. The training set was used for training the networks, the validation set for setting DNN hyperparameters (e.g., Bernoulli dropout probabilities), and the test set was used for evaluating the performance of the DNNs on new data from the same sites that were used for training.

TABLE 1

Table 1. The number of examples used from different datasets.

We additionally used 418 sMRI volumes from the NNDSP dataset (Lee et al., 2018) as a held-out dataset to test generalization of the network to an unseen site. In addition to sMRI volumes, each NNDSP sMRI volume was given a QC score from 1 to 4, higher scores corresponding to worse scan quality, by two raters (3 if values differed by more than 1), as described in Blumenthal et al. (2002). If a volume had a QC score greater than 2, it was labeled as a bad quality scan; otherwise, the scan was labeled as a good quality scan.

2.1.2. Segmentation Target

We computed 50-class FreeSurfer (Fischl, 2012) segmentations, as in Fedorov et al. (2017a), for all subjects in each of the datasets described earlier. These were used as the labels for prediction. Although, FreeSurfer segmentations may not be perfectly correct, as compared to manual, expert segmentations, using them allowed us to create a large training dataset, as one could not feasibly label it by hand. FreeSurfer trained networks can also outperform FreeSurfer segmentations when compared to expert segmentations (Roy et al., 2018b). The trained network could be fine-tuned with expert small amounts of labeled data, which would likely improve the results (McClure et al., 2018; Roy et al., 2018b).

2.1.3. Data Pre-processing

The sMRI volumes were resampled to 1mm isotropic cubic volumes of 256 voxels per side and the voxel intensities were normalized according to Freesurfer's mri_convert with the conform flag. After resampling, input volumes were individually z-scored across voxels. We then split each sMRI volume into 512 non-overlapping 32 × 32 × 32 sub-volumes, similarly to Fedorov et al. (2017b) and Fedorov et al. (2017a), to be used as inputs for the neural network. The prediction target is the corresponding segmentation sub-volume. This resulted in 512 pairs (x, y), of sMRI and label sub-volumes, respectively, for each sMRI volume.

2.2. Convolutional Neural Network

2.2.1. Architecture

Several deep neural network architectures have been proposed for brain segmentation, such as U-net (Ronneberger et al., 2015), QuickNAT (Roy et al., 2018b), HighResNet (Li et al., 2017), and MeshNet (Fedorov et al., 2017a,b). We chose MeshNet because of its relatively simple structure, its lower number of learned parameters, and its competitive performance, since the computational cost of Bayesian neural networks scales based on structural complexity and number of parameters.

MeshNet uses dilated convolutional layers (Yu and Koltun, 2015) due to the 3D structural nature of sMRI data. Applying a discrete volumetric dilated convolutional layer to one input channel for one weight filter can be expressed as:

\begin{array}{l} {(W_{f} *_{l} h)}_{i, j, k} = \sum_{\tilde{i} = - a}^{a} \sum_{\tilde{j} = - b}^{b} \sum_{\tilde{k} = - c}^{c} w_{f, \tilde{i}, \tilde{j}, \tilde{k}} h_{i - l \tilde{i}, j - l \tilde{j}, k - l \tilde{k}} = {(W_{f} *_{l} h)}_{v} \\ = \sum_{t \in W_{a b c}} w_{f, t} h_{v - l t} . & (1) \end{array}

where h is the input to the layer, a, b, and c are the bounds for the i, j, and k axes of the filter with weights w_f (i, j, k) is the voxel, v, where the convolution is computed. The set of indices for the elements of w_f can be defined as $W_{a b c} = {- a, \dots, a} \times {- b, \dots, b} \times {- c, \dots, c}$ . The dilation factor, number of filters, and other details of the MeshNet-like architecture that we used for all experiments is shown in Table 2. Note that we increased the number of filters per layer from 72 to 96, compared to Fedorov et al. (2017a) and McClure et al. (2018), since we greatly increased the number of training volumes.

TABLE 2

Table 2. The MeshNet dilated convolutional neural network architecture used for brain segmentation.

2.2.2. Maximum a Posteriori Estimation

When training a neural network, the weights of the network, w, are often learned using maximum likelihood estimation (MLE). For MLE, $log p (D | w)$ is maximized where $D = {(x_{1}, y_{1}), \dots, (x_{N}, y_{N})}$ is the training dataset and (x_n, y_n) is the nth input-output example. This often overfits, however, so we used a prior on the network weights, p(w), to obtain a maximum a posteriori (MAP) estimate, by optimizing $log p (w | D)$ :

\begin{array}{l} w^{*} = \underset{w}{argmax} \sum_{n = 1}^{N} log p (y_{n} | x_{n}, w) + log p (w) . & (2) \end{array}

We used a fully factorized Gaussian prior (i.e., $p (w_{f, \tilde{i}, \tilde{j}, \tilde{k}}) = N (0, 1))$ . This results in the MAP weights being learned by minimizing the softmax cross-entropy with L2 regularization. At test time, this point estimate approximation, w^*, is used to make a prediction for new examples:

\begin{array}{l} p (y_{t e s t} | x_{t e s t}) \approx p (y_{t e s t} | x_{t e s t}, w^{*}) & (3) \end{array}

2.2.3. Approximate Bayesian Inference

In Bayesian inference for neural networks, a distribution of possible weights is learned instead of just a MAP point estimate. Using Bayes' rule, $p (w | D) = p (D | w) p (w) / p (D)$ , where p(w) is the prior over weights. However, directly computing the posterior, $p (w | D)$ , is often intractable, particularly for DNNs. As a result, an approximate inference method must be used.

One of the most popular approximate inference methods for neural networks is variational inference, since it scales well to large DNNs. In variational inference, the posterior distribution $p (w | D)$ is approximated by a learned variational distribution of weights q_θ(w), with learnable parameters θ. This approximation is enforced by minimizing the Kullback-Leibler divergence (KL) between q_θ(w), and the true posterior, $p (w | D)$ , $KL [q_{θ} (w) | | p (w | D)]$ , which measures how q_θ(w) differs from $p (w | D)$ using relative entropy. This is equivalent to maximizing the variational lower bound (Hinton and Van Camp, 1993; Graves, 2011; Blundell et al., 2015; Kingma et al., 2015; Gal and Ghahramani, 2016; Louizos and Welling, 2017; Molchanov et al., 2017), also known as the evidence lower bound (ELBO),

\begin{array}{l} L_{E L B O} (θ) = L_{D} (θ) - L_{K L} (θ), & (4) \end{array}

where $L_{D} (θ)$ is

\begin{array}{l} L_{D} (θ) = \sum_{n = 1}^{N} 𝔼_{q_{θ} (w)} [log p (y_{n} | x_{n}, w)] & (5) \end{array}

and $L_{K L} (θ)$ is the KL divergence between the variational distribution of weights and the prior,

\begin{array}{l} L_{K L} (θ) = KL [q_{θ} (w) | | p (w)], & (6) \end{array}

which measures how q_θ(w) differs from p(w) using relative entropy.

Maximizing $L_{D}$ seeks to learn a q_θ(w) that explains the training data, while minimizing L_KL (i.e., keeping q_θ(w) close to p(w)) prevents learning a q_θ(w) that overfits to the training data.

The objective function in Equation (4) is usually impractical to compute for deep neural networks, due to both: (1) being a full-batch approach and (2) integrating over q_θ(w). (1) is often dealt with by using stochastic mini-batch optimization (Robbins and Monro, 1951) and (2) is often approximated using Monte Carlo sampling. As discussed in Graves (2011) and Kingma et al. (2015), these methods can be used to perform stochastic gradient variational Bayes (SGVB) in deep neural networks. For each parameter update, an unbiased estimate of $\nabla_{θ} L_{D}$ for a mini-batch, {(x₁, y₁), …, (x_M, y_M)}, is calculated using one weight sample, w_m, from q_θ(w) for each mini-batch example. This results in the following approximation to Equation (4):

\begin{array}{l} L_{E L B O} (θ) \approx L_{D}^{S G V B} (θ) - L_{K L} (θ), & (7) \end{array}

where

\begin{array}{l} L_{D} (θ) \approx L_{D}^{S G V B} (θ) = \frac{N}{M} \sum_{m = 1}^{M} log p (y_{m} | x_{m}, w_{m}) . & (8) \end{array}

At test time, the weights, w would ideally be marginalized out, p(y_test|x_test) = ∫p(y_test|x_test, w)q_θ(w)dw, when making a prediction for a new example. However, this is often impractical to compute for DNNs, so a Monte-Carlo approximation is often used. This results in the prediction of a new example being made by averaging the predictions of multiple weight samples from q_θ(w) (Figure 1):

\begin{array}{l} p (y_{t e s t} | x_{t e s t}) \approx \frac{1}{N_{M C}} \sum_{n}^{N_{M C}} p (y_{t e s t} | x_{t e s t}, w_{n}) & (9) \end{array}

where w_n ~ q_θ(w).

FIGURE 1

Figure 1. Illustration of generating a prediction from a Bayesian neural network using Monte Carlo sampling (modified from Blundell et al., 2015). A standard neural network (A, top left) has one weight for each of its connections (w^*), learned from the training set and used in generating a prediction for a test example. A Bayesian neural network (B, bottom left) has, instead, a posterior distribution for each weight, parameterized by theta (q_θ(w)). The process of training starts with an assigned prior distribution for each weight, and returns an approximate posterior distribution. At test time (C, right), a weight sample w₁ (red) is drawn from the posterior distribution of the weights, and the resulting network is used to generate a prediction p(y|x, w₁) for an example x. The same can be done for samples w₂ (blue) and w₃ (green), yielding predictions p(y|x, w₂) and p(y|x, w₃), respectively. The three networks are treated as an ensemble and their predictions averaged.

2.2.3.1. MC Bernoulli dropout

For MC Bernoulli dropout (BD) (Gal and Ghahramani, 2016), we drew weights from q_θ(w) by drawing a Bernoulli random variable (b_{i, j, k} ~ Bern(p_l)), where i, j, k are the indices of the volume axes, for every element of the layer, l, input, h, and then elementwise multiplying b and h before applying the next dilated convolutional layer. This effectively sets the filter weights to zero when applied to a dropped element. Gal and Ghahramani (2016) approximated the KLD between this Bernoulli variational distribution and a zero-mean Gaussian by replacing the variational distribution with a mixture of Gaussians, resulting in an L2-like penalty. However, this can lead to pathological behavior due to Bernoulli distributions not having support over all real numbers (Hron et al., 2018). In Bernoulli dropout, p_l codes for the uncertainty of the weights and is often set layerwise via hyperparameter search (for our experiments, we found the best value of p to be 0.9 after searching over the values of 0.95, 0.9, 0.75, and 0.5 using the validation set). However, Bayesian models would ideally learn how uncertain to be for each weight.

2.2.3.2. Spike-and-slab dropout with learned model uncertainty

We propose a form of dropout that both learns the dropout probability for each filter using a concrete relaxation of dropout (Gal et al., 2017), and an individual uncertainty for each weight using fully factorized Gaussian (FFG) filters (Graves, 2011; Blundell et al., 2015; Molchanov et al., 2017; McClure et al., 2018; Nguyen et al., 2018). This is in contrast to previous spike-and-slab dropout methods, which did not learn the model (or epistemic) uncertainty (Der Kiureghian and Ditlevsen, 2009; Kendall and Gal, 2017) from data either by learning the dropout probabilities or by learning the variance parameter of the Gaussian components of the weights (McClure and Kriegeskorte, 2017). In our proposed method, we assume each of the F filters are independent (i.e., $p (w) = \prod_{f = 1}^{F} p (w_{f})$ ), as done in previous FFG methods (Graves, 2011; Blundell et al., 2015; Molchanov et al., 2017; McClure et al., 2018; Nguyen et al., 2018). We then decompose each filter into a dropout-based component, b_f, and a Gaussian component, g_f, such that w_f = b_fg_f. Per this decomposition, we perform variational inference on the joint distribution of {b₁, …, b_F, g₁, …g_F}, instead of on p(w) directly (Titsias and Lázaro-Gredilla, 2011; McClure and Kriegeskorte, 2017). We then assume each element of g_f is independent (i.e., $p (g_{f}) = \prod_{t \in W_{a b c}} p (g_{f, t})$ ), and that each weight is Gaussian (i.e., $g_{f, t} ~ N (μ_{f, t}, σ_{f, t}^{2})$ ) with learned parameters μ_{f, t} and σ_{f, t}. Instead of drawing each b_f from Bern(p_l), we draw them from a concrete distribution (Gal et al., 2017) with a learned dropout probability, p_f, for each filter:

\begin{array}{l} b_{f} = s i g m o i d (\frac{1}{t} (\log p_{f} - \log (1 - p_{f}) + \log u - \log (1 - u)) & (10) \end{array}

where u ~ Unif (0, 1). This concrete distribution converges to the Bernoulli distribution as the sigmoid scaling parameter, t, goes to zero. (In this paper, we used t = 0.02). As discussed in Kingma et al. (2015) and Molchanov et al. (2017), randomly sampling each g_{f, t} for each mini-batch example can be computationally expensive, so we used the fact that the sum of independent Gaussian variables is also Gaussian to move the noise from the weights to the convolution operation, as in McClure et al. (2018). For, dilated convolutions and the proposed spike-and-slab variational distribution, this is described by:

\begin{array}{l} {(w_{f} *_{l} h)}_{v} = b_{f} {(g_{f} *_{l} h)}_{v} & (11) \end{array}

where

\begin{array}{l} {(g_{f} *_{l} h)}_{v} ~ N (μ_{f, v}^{*}, {(σ_{f, v}^{*})}^{2}), & (12) \end{array}

\begin{array}{l} μ_{f, v}^{*} = \sum_{t \in W_{a b c}} μ_{f, t} h_{v - l t}, & (13) \end{array}

and

\begin{array}{l} {(σ_{f, v}^{*})}^{2} = \sum_{t \in W_{a b c}} σ_{f, t}^{2} h_{v - l t}^{2} . & (14) \end{array}

For this spike-and-slab dropout (SSD) implementation, we used a spike-and-slab prior, instead of the Gaussian prior used by Gal and Ghahramani (2016) and Gal et al. (2017). Using a spike-and-slab prior with MC Bernoulli dropout was discussed in Gal (2016), but not implemented. As in the variational distribution, each filter is independent in the prior. Per the spike-and-slab decomposition discussed above, the KL-divergence term of the ELBO can be written as

\begin{array}{l} L_{K L} (θ) = \sum_{f = 1}^{F} KL [q_{p_{f}} (b_{f}) q_{μ, σ} (g_{f}) | | p (b_{f}) p (g_{f})], & (15) \end{array}

where $θ = ⋃_{f}^{F} ⋃_{t \in W_{a b c}} {p_{f}, μ_{f, t}, σ_{f, t}}$ are the learned parameters and p(b_f) and p(g_f) are priors. Assuming that each weight in a filter is independent, as commonly done in the literature (Graves, 2011; Blundell et al., 2015; Nguyen et al., 2018), allows the term to be rewritten as

\begin{array}{l} L_{K L} (θ) = \sum_{f = 1}^{F} (KL [q_{p_{f}} (b_{f}) | | p (b_{f})] + \sum_{t \in W_{a b c}} KL [q_{μ, σ} (g_{f, t}) | | p (g_{f, t})]) . & (16) \end{array}

For KL[q_{p_f}||p(b_f)], we used the KL-divergence between two Bernoulli distributions,

\begin{array}{l} K L [q_{p_{f}} (b_{f}) | | p (b_{f})] = p_{f} log \frac{p_{f}}{p_{p r i o r}} + (1 - p_{f}) log \frac{1 - p_{f}}{1 - p_{p r i o r}}, & (17) \end{array}

since we used a relatively small sigmoid scaling parameter. Using $p (g_{f, t}) = N (μ_{p r i o r}, σ_{p r i o r}^{2})$ ,

\begin{array}{l} K L [q_{μ, σ} (g_{f, t}) | | p (g_{f, t})] = log \frac{σ_{p r i o r}}{σ_{f, t}} + \frac{σ_{f, t}^{2} + {(μ_{f, t} - μ_{p r i o r})}^{2}}{2 σ_{p r i o r}^{2}} - \frac{1}{2} . & (18) \end{array}

For this paper, the spike-and-slab prior parameters were set as p_prior = 0.5, μ_prior = 0, and σ_prior = 0.1. p_prior = 0.5 corresponds to a maximum entropy prior (i.e., in the absence of new data be maximally uncertain). Alternatively, a p_prior close to 0 is a sparcity prior (i.e., in the absence of data do not use a filter).

2.3. Implementation Details

The DNNs were implemented using Tensorflow (Abadi et al., 2016). During training, the parameters of each DNN were updated using Adam (Kingma and Ba, 2015) with an initial learning rate of 1e-4. A mini-batch size of 32 subvolumes was used with data parallelization across 4 12GB NVIDIA Titan X Pascal GPUs was used for training and a mini-batch size of 8 subvolumes on 1 12GB NVIDIA Titan X Pascal GPU was used for validation and testing.

2.4. Quantifying Performance

2.4.1. Segmentation Performance Measure

To measure the quality of the produced segmentations, we calculated the Dice coefficient, which is defined by

\begin{array}{l} D i c e_{c} = \frac{2 | {\hat{y}}_{c} \cdot y_{c} |}{| | {\hat{y}}_{c} | |^{2} + | | y_{c} | |^{2}} = \frac{2 T P_{c}}{2 T P_{c} + F N_{c} + F P_{c}}, & (19) \end{array}

where ${\hat{y}}_{c}$ is the binary segmentation for class c produced by a network, y_c is the ground truth produced by FreeSurfer, TP_c is the true positive rate for class c, FN_c is the false negative rate for class c, and FP_c is the false positive rate for class c. We calculate the Dice coefficient separately for each class c = 1, …, 50, and average across classes to compute the overall performance of a network for one sMRI volume.

2.4.2. Uncertainty Measure

We quantify the uncertainty of a prediction, p(y_m|x_m), using the aleatoric uncertainty (Der Kiureghian and Ditlevsen, 2009; Kendall and Gal, 2017) for each voxel, v. This was measured by the entropy of the softmax across the 50 output classes,

\begin{array}{l} H (Y_{m, v} | X_{m} = x_{m}) = - \sum_{c = 1}^{50} p (y_{m, v} = c | x_{m}) log p (y_{m, v} = c | x_{m}), & (20) \end{array}

We calculated the uncertainty for one sMRI volume by averaging across all output voxels not classified as background (i.e., given the unknown label).

3. Results

3.1. Segmentation Performance

We trained MAP, MC Bernoulli Dropout (BD), and Spike-and-Slab Dropout (SSD) Meshnet-like CNNs on the 9,184 sMRI volumes in the training set. We then applied our networks to produce segmentations for both the in-site test set and the out-of-site test data. For the BD and SSD networks, 10 MC samples were used for test predictions. The means and standard deviations across volumes for the average Dice across all 50 classes are shown in Table 3. Dice scores for each label for the in-site and out-of-site test sets are shown in Figures 2, 3, respectively. We found that, compared to MAP and BD, SSD significantly increased the Dice for both the in-site (p < 1e − 6) and out-of-site (p < 1e − 6) test sets, per a paired t-test across test volumes. We found that SSD had a 5.7% drop in performance from the in-site test set to the out-of-site test set, where as the MAP has a drop of 6.2% and BD a drop of 5.4%. This is better than drops of 9.4% and 7.8% on average reported in the literature by Roy et al. (2018b) and Roy et al. (2018a), respectively. In Figures 4, 5, we show selected example segmentations for the SSD network for volumes that have Dice scores similar to the average Dice score across the respective dataset.

TABLE 3

Table 3. The average and standard deviation of the class Dices across test volumes for the maximum a posteriori (MAP), MC Bernoulli dropout (BD), and spike-and-slab dropout (SSD) network on the in-site and out-of-site test sets.

FIGURE 2

Figure 2. Average Dice scores, sorted in decreasing order by average method performance, and standard errors across in-site test volumes for each label for the maximum a posteriori (MAP), MC Bernoulli dropout (BD), and spike-and-slab dropout (SSD) networks.

FIGURE 3

Figure 3. Average Dice scores, sorted in decreasing order by average method performance, and standard errors across out-of-site test volumes for each label for the maximum a posteriori (MAP), MC Bernoulli dropout (BD), and spike-and-slab dropout (SSD) networks.

FIGURE 4

Figure 4. In-site segmentation results for the spike-and-slab dropout (SSD) network for a test subject with average Dice performance. The columns show, respectively, the structural image used as input, the FreeSurfer segmentation used as a prediction target, the prediction made by our network, the voxels where there was a mismatch between prediction and target, and the prediction uncertainty at each voxel.

FIGURE 5

Figure 5. Out-of-site segmentation results for the spike-and-slab dropout (SSD) network for a test subject with average Dice performance. The columns show, respectively, the structural image used as input, the FreeSurfer segmentation used as a prediction target, the prediction made by our network, the voxels where there was a mismatch between prediction and target, and the prediction uncertainty at each voxel.

3.2. Utilizing Uncertainty

3.2.1. Predicting Segmentation Errors From Uncertainty

Ideally, an increase in DNN prediction uncertainty indicates an increase in the likelihood that prediction is incorrect. To evaluate whether this is the case for the trained brain segmentation DNN, we performed a receiver operating characteristic (ROC) analysis. In this analysis, voxels are ranked from most uncertain to least uncertain and one considers, at each rank, what fraction of the voxels were also misclassified by the network. An ROC curve can then be generated by plotting the true positive rate vs. the false negative rate for different uncertainty thresholds used to predict misclassification. The area under this curve (AUC) typically summarizes the results of the ROC analysis. The average ROC and AUCs across volumes for MAP, BD, and SSD for the in-site and out-of-site test sets are shown in Figure 6. Compared to MAP and BD, SSD significantly improved the AUC for both the in-site (p < 1e − 6) and out-of-site (p < 1e − 6) test sets, per a paired t-test across test set volumes.

FIGURE 6

Figure 6. Receiver operating characteristic (ROC) curves for predicting errors for the in-site and out-of-site test sets from the voxel uncertainty of the maximum a posteriori (MAP), MC Bernoulli dropout (BD), and spike-and-slab dropout (SSD) networks.

3.2.2. Predicting Scan Quality From Uncertainty

Ideally, the output uncertainty for inputs not drawn from the training distribution should be relatively high. This could potentially be useful for a variety of applications. One particular application is detection of bad quality sMRI scans, since the segmentation DNN was trained using relatively good quality scans. To test the validity of predicting high vs. low quality scans, we performed an ROC analysis on the held-out NNDSP dataset, where manual quality control ratings are available. We also did the same analysis using MRIQC (v0.10.5) (Esteban et al., 2017), a recently published method that combines a wide range of automated QC algorithms. To statistically test whether any method significantly outperformed the other methods, we performed bootstrap sampling of the AUC for predicting scan quality from average uncertainty by sampling out-of-site test volumes. We performed 10,000 bootstrap samples, each with 418 volumes. The average ROC and AUC for the MAP, BD, SSD, and MRIQC methods are shown in Figure 7. The MAP, BD, and SSD networks all have significantly higher AUCs than MRIQC (p = 1.369e − 4, p = 1.272e − 5, and p = 1.381e − 6, respectively). Additionally, SSD had a significantly higher AUC than both MAP and BD (p = 1.156e − 3 and p = 1.042e − 3, respectively).

FIGURE 7

Figure 7. Receiver operating characteristic (ROC) curves for predicting scan quality for the NNDSP out-of-site test set from the average non-background voxel uncertainty of the maximum a posteriori (MAP), MC Bernoulli dropout (BD), and spike-and-slab dropout (SSD) networks and from MRIQC scores.

4. Discussion

Segmentation of structures in sMRI volumes is a critical pre-processing step in many neuroimaging analyses. However, these segmentations are currently generated using tools that can take a day or more for each subject (FreeSurfer, 2018), such as FreeSurfer. This computational cost can be prohibitive when scaling analyses up from hundreds to thousands of subjects. DNNs have recently been proposed to perform sMRI segmentation is seconds to minutes. In this paper, we developed a Bayesian DNN, using spike-and-slab dropout, with the goals of increasing the similarity of the DNN's predictions to the FreeSurfer segmentations and generating useful uncertainty estimates for these predictions.

In order to evaluate the proposed Bayesian network, we trained a standard deep neural network (DNN), using MAP estimation, to predict FreeSurfer segmentations from structural MRI (sMRI) volumes. We trained on a little under 10,000 sMRIs, obtained by combining approximately 70 different datasets (many of which, in turn, contain images from several sites, e.g., NKI, ABIDE, ADHD200). We used a separate test set of more than 1,000 sMRIs, drawn from the same datasets. The resulting standard DNN performs at the same level of state-of-the-art networks (Fedorov et al., 2017a). This result, however, was obtained by testing over an order of magnitude more test data, and many more sites, than those papers. We also tested performance on a completely separate dataset (NNDSP) from a site not encountered in training, which contained 418 sMRI volumes. Whereas Dice performance dropped slightly, this was less than what was observed in other studies (Roy et al., 2018a,b); this suggests that we may be achieving better generalization by training on our larger and more diverse dataset, and we plan on testing this on more datasets from novel sites in the future. This is particularly important to us, as this network is meant to be used within an off-the-shelf tool¹.

We demonstrated that the estimated uncertainty for the prediction at each voxel is a good indicator of whether the standard network makes an error in it, both in-site and out-of-site. The tool that produces the predicted segmentation volume for an input sMRI will also produce an uncertainty volume. We anticipate this being useful at various levels, e.g., to refine other tools that rely on segmentation images, or to improve prediction models based on sMRI data (e.g., modification of calculation of cortical thickness, surface area, voxel selection or weighting in regression (Roy et al., 2018a) or classification models, etc).

We also demonstrated that the average prediction uncertainty across voxels in the brain is an excellent indicator of manual quality control ratings. Furthermore, it outperforms the best existing automated solution (Esteban et al., 2017). Since automation is already used in large repositories (e.g., OpenMRI), we plan on offering our tool as an additional quality control measure.

Finally, we showed that a new Bayesian DNN using spike-and-slab dropout with learned model uncertainty was significantly better than previous approaches. This spike-and-slab method increased segmentation performance and improved the usefulness of output uncertainties compared both to a MAP DNN method and an MC Bernoulli dropout method, which has previously been used in the brain segmentation literature (Li et al., 2017; Roy et al., 2018a). These results show that Bayesian DNNs are a promising method for building brain segmentation and automated sMRI quality control tools. We have also made a version of “Nobrainer,” that incorporates the networks trained and evaluated in this paper, available for download and use within a Singularity/Docker container².

We believe it may be possible to improve this segmentation processing, in that we did not use registration. One option would be to use various techniques for data augmentation (e.g., variation of image contrast, since that is pretty heterogeneous, rotations/translations of existing examples, addition of realistic noise, etc). Another would be to eliminate the need to divide the brain into sub-volumes, which loses some global information; this will become more feasible in GPUs with more memory. Finally, we plan on using post-processing of results (e.g., ensure some coherence between predictions for adjacent voxels, leverage off-the-shelf brain and tissue masking code).

Data Availability Statement

All datasets generated for this study are included in the manuscript/supplementary files.

Author Contributions

PM and CZ developed the novel Bayesian neural network method. PM, NR, JK, and SG wrote the code for training and testing the neural networks. PM and NR analyzed the neural network predictions. JL, DN, and AT acquired and preprocessed the magnetic resonance imaging (MRI) data. PM and FP wrote the first manuscript draft. All authors contributed to manuscript revision, read the manuscript, and approved the submitted version.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

This research was supported (in part) by the Intramural Research Program of the NIMH (ZICMH002968). This work utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov). JK's and SG's contribution was supported by NIH R01 EB020740.

Footnotes

1. ^https://github.com/neuronets/nobrainer

2. ^https://github.com/neuronets/kwyk

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., et al. (2016). “Tensorflow: a system for large-scale machine learning,” in 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16) (Savannah, GA), 265–283.