Lowering the computational barrier: Partially Bayesian neural networks for transparency in medical imaging AI

Deep Neural Networks (DNNs) can provide clinicians with fast and accurate predictions that are highly valuable for high-stakes medical decision-making, such as in brain tumor segmentation and treatment planning. However, these models largely lack transparency about the uncertainty in their predictions, potentially giving clinicians a false sense of reliability that may lead to grave consequences in patient care. Growing calls for Transparent and Responsible AI have promoted Uncertainty Quantification (UQ) to capture and communicate uncertainty in a systematic and principled manner. However, traditional Bayesian UQ methods remain prohibitively costly for large, million-dimensional tumor segmentation DNNs such as the U-Net. In this work, we discuss a computationally-efficient UQ approach via the partially Bayesian neural networks (pBNN). In pBNN, only a single layer, strategically selected based on gradient-based sensitivity analysis, is targeted for Bayesian inference. We illustrate the effectiveness of pBNN in capturing the full uncertainty for a 7.8-million parameter U-Net. We also demonstrate how practitioners and model developers can use the pBNN's predictions to better understand the model's capabilities and behavior.

We seek to enable Bayesian UQ for million-parameter DNNs, i.e., to create million-parameter Bayesian neural networks (BNNs) (Blundell et al., 2015;Gal, 2016). We achieve this by introducing a low-computation strategy for performing Bayesian inference on a small, strategically chosen portion (layer) of the entire DNN, thereby creating a partially BNN (pBNN) that approximates the uncertainty of a full BNN. We use Flipout (Wen et al., 2018) VI algorithm to solve the resulting Bayesian problem on the target layer. Unlike related strategies (Riquelme et al., 2017;Azizzadenesheli et al., 2018;Valentin Jospin et al., 2020) that attempt Bayesian inference only on the last layer of a DNN which does not necessarily offer the best uncertainty representation (Zeng et al., 2018), our approach allows a justified selection on which part of the DNN to "Bayesianize" guided by sensitivity analysis (SA).
To impart the ability to express uncertainty in tumor segmentation models, we build pBNN for a state-of-the-art 7.8 million-parameter U-Net (Ronneberger et al., 2015) proposed for medical image segmentation. To empirically validate our approach, we train a full BNN as a baseline, and compute the uncertainty approximation discrepancy of the pBNNs to this full BNN. Our experiment demonstrates that the pBNN based on the SA-selected layer is the most efficient in approximating the full Bayesian uncertainty (per Bayesianized parameter) compared to other layer choices. Our results suggest that the SA-based layer-selection offers an effective and inexpensive way to identify the best DNN layer to perform Bayesian inference.
The main contribution of our work is to enable Bayesian UQ for million-dimensional medical DNN models. We features a novel computational method that uses a SA-based DNN layer selection for targeted Bayesian inference, which is especially useful in scenarios where practitioners and model developers want to understand the model's uncertainty but do not have the resources for a full Bayesian inference on the entire DNN. Our work advances future interactions for Transparent and Responsible AI that can support the investigation of the uncertainty in DNN models. We illustrate one such interaction, through the use of uncertainty maps, to show how our work can allow clinicians to interpret uncertainty of the tumor segmentation AI.

. Background and preliminaries
We begin by provide some mathematical background for understanding DNNs and BNNs.

. . Bayesian neural networks
A DNN takes input x and predicts outputŷ: we writeŷ = f (x; w) where w represent all tunable model parameters of the DNN (e.g., DNN weights and bias). Given N T training data points (x T , y T ) = {x n , y n } N T n=1 , the DNN is typically trained by finding w to minimize a loss function: For example, a popular choice is the least squares loss The optimization is often done with stochastic gradient descent (Robbins and Monro, 1951;LeCun et al., 2012). Once Equation 1 is solved, it produces a deterministic DNN that makes single-valued prediction for any new input x:ŷ = f (x; w * ). BNN (MacKay, 1992;Neal, 1996;Graves, 2011;Blundell et al., 2015;Gal, 2016), in contrast, treats w as random variables with an associated probability density function (PDF) that represents the uncertainty on w. When training data become available, these PDFs are updated through Bayes' rule: where p(w) is the prior PDF, p(y T |x T , w) is the likelihood PDF, p(w|x T , y T ) is the posterior PDF, and p(y T |x T ) is the marginal likelihood (a PDF-normalization term). The prior thus represents the Note that p(w|x T ) = p(w), i.e., the prior uncertainty should not change from knowing only the input values of the training data without their output values.

Frontiers in Computer Science
frontiersin.org . /fcomp. . uncertainty on w before seeing any training data, and the posterior describes the updated uncertainty after incorporating the training data. Solving the Bayesian inference problem then entails computing the posterior p(w|x T , y T ). Conventional Bayesian solutions largely rely on MCMC (Hastings, 1970;Andrieu et al., 2003;Robert and Casella, 2004;Brooks et al., 2011) to sample the posterior distribution. While MCMC provably converges to the exact posterior that may be highly non-Gaussian, it converges slowly in high-dimensional settings due to the onset of measure concentration to the so-called typical set (Betancourt, 2017). Hamiltonian Monte Carlo (Neal, 2011;Hoffman and Gelman, 2014;Betancourt, 2017), one of the more scalable MCMC variants, has been used for Bayesian inference for up to hundreds of parameters, but remains orders of magnitude short of the million-parameter BNNs targeted in this paper.
VI (Jordan et al., 1999;Wainwright and Jordan, 2007;Blei et al., 2017) provides much better scalability by finding the best approximation to the true posterior from a parametric family of distributions (e.g., all independent Gaussians), thereby turning the sampling task into an optimization one: where q(w; θ ) is the appromating posterior PDF parameterized by θ , and the Kullback-Leibler (KL) divergence D KL quantifies the farness from the true posterior p(w|x T , y T ) to q(w; θ ). One can further show that θ * is also the maximizer of the well-known evidence lower bound (ELBO): which can be estimated through Monte Carlo that samples w (m) from q(w; θ ): The optimization can be approached leveraging gradient-based algorithms, where gradient with respect to θ can be obtained via back-propagation. We adopt the recent Flipout VI algorithm (Wen et al., 2018) to solve the Bayesian problem. Flipout (Wen et al., 2018) has been demonstrated to provide substantial computational savings for dense, convolutional, and recurrent neural network architectures. The method injects pseudo-independent weight perturbations in order to decorrelate gradients, thereby achieving drastically decreased variance for the ELBO Monte Carlo estimator. It also offers vectorized implementation that allows one to take advantage of GPU computations.
However even with Flipout, the computational and memory requirements for million-parameters is still extremely high, if not outright prohibitive. Krishnan et al. (2020) found that for large DNN architectures with O(10 7 − 10 8 ) parameters, MFVI failed to converge altogether. Therefore, one cannot simply take Flipout offthe-shelf to build million-dimensional BNNs, additional algorithmic developments are still needed.

. Method for building a partially Bayesian neural network
We introduce a novel method to address the scalability challenges of building million-parameter BNNs, by performing targeted Bayesian inference on a strategically selected portion of the overall DNN-we call the result a partially Bayesian neural network (pBNN). We present the overall procedure via three main steps (see Figure 1): (1) access a deterministic DNN model, (2) conduct SA to select a DNN layer for Bayesian inference, and (3) perform targeted Bayesian inference on the selected layer using Flipout VI to arrive at the final pBNN. We describe each step below in detail.

. . Step : Access a deterministic DNN model
The first step is to access a deterministic DNN model (i.e., one that is trained by minimizing the loss in Equation 1). The purpose of this DNN is to provide an inexpensive and meaningful starting point for the upcoming SA and layer selection. One scenario is that such a deterministic DNN is already available from a pre-existing study or another research group, and we can simply inherit. If it does not exist yet, one may train a new model following Equation 1, which is relatively (to any BNN) inexpensive.
Since this deterministic DNN will be used to guide the selection of model layer for Bayesian inference, it must have the same architecture as the DNN that will eventually undergo Bayesian inference. However, the precise training setup (e.g., choice of loss function, learning rate) for creating this deterministic DNN is flexible since the goal in the next step is to seek a general sense of the sensitivity behavior. This flexibility is important in practice, for instance in situations where we are only given a pre-existing deterministic model but without information on how it was trained.
Step : Conduct sensitivity analysis to select a layer for Bayesian inference Next, we perform SA on the deterministic DNN from Step 1 in order to assess the model prediction behavior as a result of model parameter variation. We adopt gradient-based sensitivity for its simplicity, although other types of sensitivity (e.g., variance-based sensitivity) may also be used. Specifically, we first take the partial gradient of model prediction output with respect to the parameters of the candidate layer being considered (e.g., ℓth layer), which can be calculated easily from a DNN using only a single pass of backpropagation. Since this partial gradient is a vector, we then take its L2norm to transform it into a scalar (other norms such as the L1-norm may be used as well). Lastly, to avoid automatic favoring of larger layers due to simply having more parameters (and therefore more entries contributing to the gradient vector), we normalize (divide) by the number of parameters in layer ℓ (N ℓ ) to arrive at the average sensitivity per parameter. Our overall Sensitivity Index for layer ℓ is: The layer with the highest SI, ℓ * = arg max ℓ SI(ℓ), would then induce the largest change in predictionŷ when its parameters w ℓ are Frontiers in Computer Science frontiersin.org . /fcomp. .

FIGURE
Proposed framework to address scalability challenges of Bayesian inference on neural networks by building pBNN: Step ( ) represents a million-parameter deterministic NN.
Step ( ) conduct sensitivity analysis to identify the layer that has greatest impact on model output.
Step ( ) conduct Bayesian inference on the most sensitive layer to build pBNN. The resulting pBNN combines the strengths of both deterministic and Bayesian NNs, allowing scalability.
perturbed. This layer thus presents as the most critical layer, justifying to focus our resources to capture the Bayesian uncertainty for layer ℓ * . Layer ℓ * will then undergo Bayesian inference in the next step. We note that gradient is a local operator and captures sensitivity locally at the optimized parameter values of the deterministic DNN (i.e., from Equation 1). To capture sensitivity more globally across a wider range of possible w ℓ values, one may also consider averaging the gradient from multiple locations of w ℓ , for instance in estimating the prior-expectation of the gradient: E w ℓ 1 N ℓ ∇ w ℓŷ 2 . However, these global measures are more expensive to compute.

. . Step : Perform targeted Bayesian inference to build pBNN
In the last step, we perform targeted Bayesian inference on the identified layer ℓ * to build the pBNN. Formally, we partition w into two subsets: w = {w B , w D } where w B are the DNN parameters corresponding to layer ℓ * that will undergo Bayesian inference, and w D consists of all remainder parameters that are treated deterministically (i.e., not as random variables). This sets up for a pBNN where only a strategically chosen portion (layer) is Bayesianized.
At this point, we may simply freeze w D at their deterministic DNN values w * D (from Step 1), and invoke VI to compute the conditional posterior: However, allowing w D to change may allow additional improvement in approximating the posterior. Therefore, we propose to jointly optimize w D (the deterministic parameters) and θ B (parameters of the variational posterior for the Bayesian DNN parameters) Lastly, we solve Equation 8 numerically using the Flipout VI algorithm introduced at the end of Section 2.1.

. Illustration with a test problem
Before applying the pBNN to the tumor segmentation DNN, we first demonstrate our framework on a test problem using a small densely-connected DNN with synthetic training data in order to bring intuitions and insights.
To set up the problem, we generate 256 synthetic training data points following the example from Blundell et al. (2015) with some modifications: where ǫ i ∼ N(0, 0.02 2 ). We fit the data to a DNN comprised of an input layer, 9 hidden dense layers with swish activation, and a dense output layer with linear activation (see Table 1). The dense layers varied in width to mimic the varying sizes of convolutional layers encountered in tumor segmentation DNNs. The DNN has a total of 13, 385 trainable parameters. In Step 1 of the pBNN procedure, we produced a deterministic DNN by minimize the loss in Equation 1, with the result plotted as the solid orange line in Figure 3. In Step 2, we computed the gradient-based SI in Equation 6 for each of the 10 candidate layers, shown in Figure 2. The bar graph indicates generally lower SI for the middle layers compared to those near the input or output; the lowest SI is at Layer 7, and the highest SI is at Layer 9. Layer 9 is therefore selected for Bayesianization. In Step 3, we construct a pBNN by performing Bayesian inference on Layer 9 using Flipout VI and training all other layers deterministically, following Equation 8. We provide more details of the pBNN implementation and validation below.
Our pBNN is implemented using TensorFlow Probability (TFP) (Dillon et al., 2017), where Layer 9 is replaced with a DenseFlipout layer and all other layers remaining to be Dense. Training is then performed simultaneously on θ B and w D as in Equation 8. For the Bayesianized DNN parameters (w B ), we adopt independent Gaussian priors with a rather wide standard deviation to reflect an initial uninformative distribution: p(w B ) ∼ N(0, 10 2 ).
. /fcomp. . A Gaussian likelihood is also used to depict independent Gaussian observation noise (σ ǫ = 0.02 to be consistent with the datageneration in Equation 9) on the y targets: The variational posteriors are from the family of independent Gaussians q i (w B ; θ B ) ∼ N(µ i , σ 2 i ), and so the variational parameters are θ = {µ i , σ i }. We train this pBNN using the Nadam (Dozat, 2016) optimizer for 10 4 epochs with a learning rate of 10 −3 and batch size of 64.
To assess whether the pBNN constructed with SA-selected Layer 9 provides a good approximation to the uncertainty of the full BNN, we also train a full BNN where all layers are replaced by DenseFlipout layers; the posterior push-forward uncertainty of the full BNN is shown in Figure 3. Furthermore, to investigate the performance of pBNNs if a layer different from Layer 9 were selected, we construct separate pBNNs where a different layer is Bayesianized.
Since a good pBNN is one that faithfully approximates the uncertainty in the full BNN, we measure its performance based on how close the pBNN's posterior push-forward is to the full BNN's posterior push-forward via the KL divergence. Following the same reasoning as the SI from Equation 6, we normalize this quantity by multiplying N ℓ (since lower KL is better), to arrive at our validation metric: where w partial,B ∼ q partial (w partial,B ; θ * B ) are the Bayesianized parameters in the pBNN, and w full ∼ q full (w full , θ * ) are the full set of parameters (all Bayesianized) in the full BNN. The D norm values for each pBNN are plotted against SI in Figure 4, where we observe a Spearman's correlation of r s = −0.83 which is considered to be strongly correlated in literature (Akoglu, 2018). This supports our SA-based layer selection strategy, where the layers with high SI will also lead to low D norm . Lastly, we note that while we perform this brute-force comparison to validate the effectiveness of our approach, in practice one would only train the pBNN on the SA-selected layer as described in Section 3.

. Demonstration: Tumor segmentation with U-Net
Here, we present the main demonstration of our pBNN, applying it to a state-of-the-art DNN model used for medical image segmentation: the U-Net (Ronneberger et al., 2015). Since our goal Posterior push-forward is p(f (x; w B , w D )|x, x T , y T ), which di ers from the posterior predictive p(y|x, x T , y T ).

FIGURE
Normalized KL divergence from the full BNN posterior push-forward to the di erent pBNNs is highly correlated with the SI that is used to select the layer for Bayesianization (Spearman's correlation r s = − . ).
is to demonstrate our pBNN framework presented in Section 3 and not to develop or justify the choice of architecture, we take the U-Net architecture as given.

. . Step : Access a deterministic DNN model
We begin by training a U-Net deterministically using the architecture and code published by Ojika et al. (2020). As shown in Figure 5, the U-Net comprises of an encoder, decoder, and skip connections. The encoder module composes of 2D Convolutional layers followed by MaxPooling layers with window size 2 × 2, and 2D Spatial Dropout layers for regularization. The decoder module is composed of corresponding Conv2DTranspose layers followed by Concatenate layers. Nested in between the encoder and decoder is the 2D Upsampling layer with bilinear interpolation. The U-Net has feature map size set to 32, kernel size 3 × 3, and dropout rate 0.2. For the 2D Convolutional layers, ReLU activation is used along with He uniform variance scaling initializer, and kernel size is set to 2 × 2. Overall, this U-Net has a total of 7.8 million tunable parameters.
We perform the training on dataset from the Medical Segmentation Decathlon Challenge Task 01 (Simpson et al., 2019) consisting of brain MRI scans of patients with glioma. The training set contains of 58,464 2D images (each sized 144×144) and a separate validation set contains 6,624 images. Data augmentation is employed during training to increase data diversity including random choices of horizontal and vertical flips, ±45 • rotation, and ±2.5 • shearing. The ADAM optimizer (Kingma and Ba, 2015) is run for 30 epochs used a learning rate 0.001 and a mini-batch size of 128. The training loss function is a weighted sum of the Dice coefficient (Dice, 1945;Crum et al., 2006) and standard Binary Cross Entropy. The training results are summarized in Table 2.

. . Step : Conduct sensitivity analysis to select a layer for Bayesian inference
We computed the gradient-based SI in Equation 6 for each candidate layers, shown in Figure 6. In particular, the layers near the top portion of the encoder carry larger SI. This response gradually decreases proceeding further into the model, reaching a minimum near the middle, followed by a light rebound at the decoder layers. The lowest SI occurs at Layer decoder_3a, and the highest SI is at Layer encoder_1a. Layer encoder_1a is therefore selected for Bayesianization.

. . Step : Perform targeted Bayesian inference to build pBNN
We construct a pBNN by performing Bayesian inference on encoder_1a using Flipout VI and training all other layers deterministically, following Equation 8. We implement our pBNN using TensorFlow Probability (TFP) (Dillon et al., 2017) by replacing the 2DConvolution layer of encoder_1a with tfp.layers.Convolution2DFlipout. Training is then performed simultaneously on θ B and w D as in Equation 8. For the Bayesianized DNN parameters (w B ), we adopt independent Gaussian priors with a rather wide standard deviation to reflect an initial uninformative distribution: p(w B ) ∼ N(0, 10 2 ). Since our data label for each pixel is binary while our model prediction ranges [0, 1], we adopt a likelihood associated with the binary cross entropy: . /fcomp. .  The variational posteriors are from the family of independent Gaussians q i (w B ; θ B ) ∼ N(µ i , σ 2 i ), and so the variational parameters are θ = {µ i , σ i }. We train this pBNN using the Nadam (Dozat, 2016) optimizer for 400 epochs with a learning rate of 10 −4 and batch size of 196, using 8 GeForce RTX 2080 Ti GPUs in parallel. The ELBO training history of the pBNN is shown in Figure 7 (plot with encoder_1a emphasized in red). While the plot indicates a small rise around 150 epochs due to numerical instabilities with the optimization algorithm, overall the curve steadies out by 350 epochs (note that the y-axis is logarithmic which amplifies the lowervalue fluctuations) and appears noticeably flatter compared to the beginning of the optimization.

. Validation of the pBNN
In this section, we provide validation for the pBNN procedure, by illustrating whether the U-Net pBNN constructed with SA-selected FIGURE Sensitivity Index (SI) for all layers in the U-Net used for tumor segmentation. The layer encoder_ a has the highest SI and is selected for Bayesianization. encoder_1a layer provides a good approximation to the uncertainty of the full BNN. To achieve this, we also need to train a full BNN where Bayesian inference is performed on all layers. Such attempt to obtain the full uncertainty is highly expensive, and is in fact the . /fcomp. .
original motivation for the new pBNN. We clarify that a full BNN is not needed as a part of the regular pBNN procedure described in Section 3. We train the full BNN, where all 19 layers of the U-Net are Bayesianized, using the full extent of our available computational resources: 2000 epochs using the Nadam (Dozat, 2016) optimizer with learning rate 10 −4 and batch size of 196, 94.72 GB of memory, 8 GPUs in parallel, and it required 3 days of continuous computation to arrive at a steadied ELBO. The best-performing model is saved during the training process. Due to the tremendous size of this BNN, expert experience is also relied upon for randomized initialization and algorithm tuning to arrive at a reasonably converged full BNN. Additionally, we also train the 19 pBNNs, each with one of the 19 layers targeted for Bayesian inference, but with shorter 400 epochs. The ELBO training history for all models are summarized in Figure 7, with the pBNN using the SA-selected layer (encoder_1a) emphasized in red. While we train the full BNN for significantly longer than the pBNNs, we plot their ELBO to around 350 epochs for better readability and comparison. All training curves gradually decrease and start flattening around 150 epochs, after which they become noisy. The plot suggests that all pBNN models we built have converged reasonably well.
Since a good pBNN is one that faithfully approximates the uncertainty in the full BNN, we measure the pBNN performance using the normalized KL divergence between the pBNN's posterior push-forward to the full BNN's posterior push-forward, D norm , defined in Equation 11.
With each U-Net output being a 144 × 144 pixel values, directly computing the KL divergence for the joint 144 × 144dimensional posterior push-forward PDF would be very difficult. Therefore, we further simplify the computation of D norm through an independence approximation, breaking the 144 × 144-dimensional KL into 20,736 pixel-wise one-dimensional KL computations. For a given prediction image, each pixel's KL is estimated by first generating 100 samples of the DNN parameters w partial,B from the optimized variational posterior q partial (w partial,B ; θ * B ), combine it with the other deterministic parameters and evaluate the corresponding pixel predictions f (x; w * D , w partial,B ), and for each pixel fit its 100 samples to a Beta distributions to approximate its posterior pushforward PDF. Beta distributions are often used to model the uncertainty of Bernoulli distributions and suitable for our problem to ensure pixel-wise prediction values reside between [0, 1]. Once the Beta PDFs are extracted, KL divergence can be calculated analytically. The pixel-wise KL divergence values are then summed to obtain the estimate to the joint PDF's KL value (since the KL divergence between two joint PDFs that are independent factors into the sum of KL divergences of individual marginals). The final D norm are shown in Figure 8, and they show a correlated, inverse trend compared to the SI values in Figure 6, supporting that SI is a good inexpensive indicator of the KL performance metric. Figure 9 shows a scatter plot for all 19 pBNNs the KL divergence (not yet normalized by N ℓ ) vs. N ℓ the number of parameters of the layer targeted for Bayesian inference; the pBNN with the SAselected layer (encoder_1a) is marked in red. Ideally, we would like to select a layer for Bayesianization that has low KL divergence and small number of layer parameters (i.e., toward the bottom-left corner). However, these two desirable properties generally conflict and a tradeoff needs to be made. This can be seen by the Pareto optimality front by the plot's lower-left convext hull. In this case, we see our SA criterion indeed identifies one of the Pareto points in layer encoder_1a. The output_layer also resides on the Pareto, and would also serve as a reasonable selection for Bayesianization; indeed, the SI values presented in Figure 6 does show the output_layer to be the runner-up. However we point out that the nature of how the two layers are good choices are different: encoder_1a achieves a lower KL than output_layer, but output_layer has a smaller number of parameter.

. Looking toward uncertainty maps and their interpretation
Although the main contribution of our work is to provide a scalable method for Bayesian UQ in large DNNs, here we illustrate one way that our uncertainty information could be used in practice. We introduce uncertainty maps as a tool that communicates how confident (uncertain) a model is in its prediction. Figure 10 illustrates these maps for one such example: top row displays the 4 modalities of MRI; bottom row displays the ground truth, prediction, uncertainty and truth-prediction discrepancy respectively for the SA-guided pBNN (Section 5.3).

. . Construction
We describe how we compute and visualize the uncertainty of the pBNN's prediction for a specific use case ( Figure 10). We chose this image from our test dataset as an example because the ground truth indicated a large percent of the pixels belonging to the tumor class (25%). First, we obtain 100 samples from the posterior pushforward distribution of our trained pBNN for this MRI. To construct the prediction map, we take the mean of these 100 samples. We then map the continuous values [0,1] to their dominant class (tumor or non-tumor) with thresholding (value of 0.5).
We build the uncertainty map with the same 100 samples, by computing within-pixel standard deviation. The range of uncertainty displayed is unique to the image and is not a global uncertainty measure. To visualize the truth-prediction discrepancy map, we subtract the prediction map from the ground truth. The negative extreme value (−1) indicates regions where the model predicts tumor when the tumor is absent in ground truth (false positive, or overprediction). The positive extreme value (+1) indicates regions where the model does not predict tumor when the tumor is present in the ground truth (false negative, or under-prediction).

. . Interpretation
We show a use-case scenario on how to interpret and analyze uncertainty maps. We asked one of the co-authors who is a researcher in radiation oncology, and a collaborator with routine knowledge of MRI image interpretation to conduct a preliminary interpretation and analysis of the MRI and the uncertainty map in Figure 10. Note that this interpretation was not performed by board-certified radiologists, so it only serves as a guiding example.
According to their interpretation, it was clear to them that the pBNN was fairly confident in distinguishing the tumor core from the background, but was highly uncertain at the . /fcomp. .

FIGURE
ELBO training history for all models. boundary regions of the tumor. This is consistent with manual segmentation. For example, even when a single radiologist performs multiple segmentations on the same image, the most differences (intra-rater variability) are often present in the boundary region. In this particular example, the necrosis, enhancing, and non-enhancing parts of the tumor are surrounded by edema.
Since the model predicts the whole tumor and does not perform subclass recognition, it predicts the edematous region of the tumor. However, since edema has a distinct contrast in FLAIR, there is a possibility that this FLAIR modality is contributing largely to the observed uncertainty. This is in line with segmentation by radiologists, which also has the most uncertainty in FLAIR modality.   Further, comparing the uncertainty map with the truthprediction discrepancy revealed that the pBNN is highly uncertain where it under-predicts and had low uncertainty where it overpredicts. A possible reason for this is in the context of the model's ability to correctly call out all of the tumor pixels in the ground truth. For over-prediction, the model covered the tumor region and went beyond it (which does not count as a "miss"), so the uncertainty is low. Whereas for under-prediction, the model failed to correctly call .
out tumor (which counts as a "miss"), thus the uncertainty is high. One such area of under-prediction is in the lateral ventricle of the right hemisphere. This region is darker than the rest of the tumor (in FLAIR modality), but is still a part of the tumor due to the presence of edema. One possible explanation of this model behavior is that the model has learned to distinguish tumors based on pixel intensity and contrast. As the algorithm "sees" the lateral ventricle of the right hemisphere to be significantly darker than the rest of the tumor, it thinks this is not a tumor. While this seems like a reasonable mistake for the model to make, a radiologist would have no problem marking this as a tumor as they have a firm understanding of anatomy. Example regions of over-prediction are near the middle frontal gyrus and near the necrotic region in the right hemisphere. This region is craggy, and it seems like the model smoothed over these intricacies to get a "good enough" prediction.

. Discussion
The results of our empirical validation in Section 6 illustrate the effectiveness of the pBNN: the pBNN where Bayesian inference is performed on the SA-selected layer achieves the lowest normalized KL divergence-that is, compared to if a different layer were chosen, it provides the best capturing of the full BNN uncertainty. Our threestep pBNN methodology in Section 3 is set up with implementation considerations in mind.
Step 1 allows the leveraging of pre-existing trained DNNs, but also describes the procedure if a new DNN needs to be trained from scratch.
Step 2 assesses and identifies the portion of DNN for Bayesian inference on a layer-by-layer basis (i.e., keeping "layer" as the unit), since accessing and replacing an entire layer (e.g., from a Dense layer to DenseFlipout layer) is very convenient in programming infrastructure such as TensorFlow Probability. Such modifications do not need the user to program new layers, or to Bayesianize portions of a layer. The gradient-based sensitivity measure adopted is also inexpensive to compute and can be readily extracted from a deterministic DNN.
Step 3 takes advantage of existing implementations of Bayesian inference algorithms, such as the Flipout VI that can be specified and used together with non-Bayesian layers.
Our method is particularly valuable for high-dimensional problems (e.g., DNNs with millions of parameters), where performing a full Bayesian inference would be extremely expensive. Our work enables principled approximate Bayesian inference to be scaled up to millions of dimensions. While other related partial Bayesian strategies compare test error performance (Zeng et al., 2018), we explicitly compute closeness to the fully Bayesian model, with the aim of faithfully capturing the true uncertainty that we are justified to have. Our overarching goal is provide a transparent quantification of uncertainty of our AI system (not to only optimize accuracy).
We illustrate how our computational framework could be used to visually represent prediction uncertainty in Section 7. While uncertainty communication is not the primary goal of our paper, we demonstrate how our work can advance an understanding of AI in a healthcare context. We conduct preliminary analysis on these maps to demonstrate how a radiologist might interpret these images. The maps allow one to correlate the model's mistakes and how confidently the model fails. With clinical insights and interpretation, one can also explore the model's behavior on specific use cases. For example, model predictions on out-of-distribution data can be potentially flagged based on uncertainty (Ovadia et al., 2019) to caution the radiologist about model's reliability in those regions. This would be especially useful for healthcare contexts where models are highly domain-dependent and localized (Finlayson et al., 2021). Apart from it's potential to enable flagging of unreliable cases, our Bayesian framework can allow for construction of an expert-informed pBNN. It permits a flexible choice for prior and likelihood, potentially allowing practitioners to inject their specialized knowledge. For this, we refer to a branch of Bayesian statistics called expert knowledge elicitation (O'Hagan et al., 2006).
While our work produced promising results, limitations still exist. While we provided a detailed validation by brute-force computation of the full BNN, one would not do so in practice. Assessing the quality of posterior approximation is a challenge in general for all BNNs, where inexpensive and effective diagnostics are needed to monitor the algorithm progress especially for large models. Analyzing ELBO curves is one such approach, but one cannot know when the ELBO is stuck in a local minimum. Furthermore, the magnitude of ELBO value does not directly indicate how far the variational approximation is to the true posterior, since ELBO differs from the KL divergence by an unknown constant (i.e., one should not expect ELBO does not converge toward zero with more epochs, as one would with mean-squared-error). We also have not explored performing Bayesian inference on multiple layers, or portions of layers, which may offer further improvements. Such expansions would require the development of efficient strategies to optimize SI across various combinations of multiple layers, and additional benchmarking and validation to assess the tradeoffs between their computational cost and ability to approximate the full BNN. Despite these limitations, our work demonstrates the effectiveness of SA-based pBNN in Bayesian UQ of large, million-dimensional DNNs.

. Conclusion and future work
In this paper, we proposed an approach to compute the Bayesian uncertainty of million-dimensional DNNs for medical image segmentation. Starting from a deterministic DNN, we used gradient-based sensitivity analysis to identify a layer to perform Bayesian inference, thereby creating a partially Bayesian neural network (pBNN) that is computationally much less expensive to construct than a full BNN. We demonstrated the pBNN method on state-of-the-art 7.8-million parameter U-Net for brain tumor segmentation. Our validation indicated that the pBNN based on SAselected layer provided the best approximation to the uncertainty from a full BNN, compared to other layer choices.
Our methodology enables model developers and practitioners to compute the Bayesian uncertainty for large deep learning models. In a life-critical domain such as healthcare, deep learning models can have far-reaching impact, but also can make mistakes leading to disastrous consequences. Communicating uncertainty in model predictions help encourage clinicians to engage with the AI-based DSS with a healthy dose of skepticism and failure-centric mindset. Additionally, UQ mitigate "super-human" perceptions of AI that can lead to unjustified over-reliance. Thus, quantifying and communicating uncertainty would allow for safer and responsible deployments of AI-based CDSS in clinical workflows. For future work, we plan to conduct a large scale evaluation of the impact of uncertainty communication to radiologists for the task of tumor segmentation. In this paper, we conduct preliminary interpretation of these maps and demonstrate the possibility of such a study. We would further like to understand if these maps can be used as a tool for model explanation, and whether these maps are more useful than conventional metrics. It can also be a useful tool in reducing alarm and click fatigue, as the model can only predict when absolutely certain, and refrain from predicting otherwise.

Data availability statement
Publicly available datasets were analyzed in this study. This data can be found here: https://www.med.upenn.edu/cbica/sbia/ brats2018/tasks.html.

Author contributions
JH: primarily conduct experiments and setting up Bayesian inference with test problem and U-Net. SP: data acquisition and preprocessing. SP and JH: Bayesian and deterministic model training and data analysis for U-Net. AR, NB, and XH: advising and supervision. All authors contributed to the conception of ideas, design of experiments, interpretation of results, and writing the manuscript. All authors contributed to the article and approved the submitted version.
Funding SP, XH, and AR were partially supported by the University of Michigan (UM) Michigan Institute for Computational Discovery and Engineering (MICDE) Catalyst Grant Enabling Tractable Uncertainty Quantification for High-Dimensional Predictive AI Systems in Computational Medicine. AR was supported by CCSG Bioinformatics Shared Resource 5 P30 CA046592, a Research Scholar Grant from the American Cancer Society (RSG-16-005-01), and a UM Precision Health Investigator Award to AR (along with L. Rozek and M. Sartor). SP and AR were also partially supported by the NCI Grant R37-CA214955. JH, XH, and NB were supported in part by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Award Numbers DE-SC0021397 and DE-SC0021398. This paper was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.