Robust and fast Monte Carlo Markov Chain sampling of diffusion MRI microstructure models

In diffusion MRI analysis, biophysical multi-compartment models have gained popularity over conventional Diffusion Tensor Imaging (DTI) because they possess greater specificity in relating the dMRI signal to the underlying cellular microstructure. Biophysical multi-compartment models require parameter estimation, typically performed using either Maximum Likelihood Estimation (MLE) or Monte Carlo Markov Chain (MCMC) sampling. Whereas MLE provides only a point estimate of the fitted model parameters, MCMC recovers the entire posterior distribution of the model parameters given the data, providing additional information such as parameter uncertainty and correlations. MCMC sampling is currently not routinely applied in dMRI microstructure modeling because it requires adjustments and tuning specific to each model, particularly in the choice of proposal distributions, burn-in length, thinning and the number of samples to store. In addition, sampling often takes at least an order of magnitude more time than non-linear optimization. Here we investigate the performance of MCMC algorithm variations over multiple popular diffusion microstructure models to see whether a single well-performing variation could be applied efficiently and robustly to many models. Using an efficient GPU-based implementation, we show that run times can be removed as a prohibitive constraint for sampling of diffusion multi-compartment models. Using this implementation, we investigated the effectiveness of different adaptive MCMC algorithms, burn-in, initialization and thinning. Finally, we apply the theory of Effective Sample Size to diffusion multi-compartment models as a way of determining a relatively general target for the number of samples needed to characterize parameter distributions for different models and datasets.
We conclude that robust and fast sampling is achieved in most diffusion microstructure models with the Adaptive Metropolis-Within-Gibbs (AMWG) algorithm initialized with an MLE point estimate, in which case 100 to 200 samples are sufficient as a burn-in and thinning is mostly unnecessary. As a relatively general target for the number of samples, we recommend a multivariate Effective Sample Size of 2200.

Figure 1: Illustration of parameter uncertainty and correlation for the Ball&Stick model using MCMC sampling, showing the Fraction of Stick (FS) and the non-diffusion-weighted signal intensity (S0). A) Left: a single FS sampling trace and its corresponding histogram for the highlighted voxel, with a Gaussian distribution function fitted to the samples and its mean indicated by a black dot. Right: the mean and standard deviation (std.) maps generated from the independent chains per voxel. B) Left: the scatter plot of two parameters (FS and S0) with the corresponding marginal histograms for the voxel highlighted in the maps. Right: the S0-FS correlation map.
Assuming no further a priori knowledge than logical or biologically plausible ranges, we use uniform priors for each parameter, p_i(x_i) ∼ U(a, b). Additionally, for multi-compartment models with volume fraction weighted compartments (i.e. Ball&Stick in1, NODDI and CHARMED in1) we add a prior on the n − 1 volume fractions w_k to ensure Σ_{k=0}^{n−1} w_k ≤ 1, i.e. proper volume fraction weighting. Note that the last volume fraction is not sampled but is set to one minus the sum of the others, w_n = 1 − Σ_{k=0}^{n−1} w_k. To the Tensor compartment (used in the Tensor and CHARMED in1 models), we add a prior to ensure strictly decreasing diffusivities (d∥ > d⊥0 > d⊥1); this prevents parameter aliasing of the Tensor orientation parameters (see Gelman et al. (2013) on aliasing).
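As a minimal sketch (not the paper's implementation), the two priors above could be evaluated as log-densities; the function names and the array-based interface here are illustrative assumptions:

```python
import numpy as np

def uniform_log_prior(x, lower, upper):
    """Log of independent uniform priors U(a, b) per parameter; -inf outside the range."""
    x, lower, upper = np.asarray(x), np.asarray(lower), np.asarray(upper)
    if np.any(x < lower) or np.any(x > upper):
        return -np.inf
    return -np.sum(np.log(upper - lower))

def volume_fraction_log_prior(w):
    """Boolean prior on the n-1 sampled volume fractions: 0 <= w_k and sum(w) <= 1."""
    w = np.asarray(w)
    return 0.0 if np.all(w >= 0) and w.sum() <= 1.0 else -np.inf

# The last volume fraction is not sampled but derived:
# w_n = 1 - sum(w_1, ..., w_{n-1})
```

Returning log-probabilities (with −inf for a violated boolean prior) lets these terms be added directly to a log-posterior inside a sampler.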

Model (M) | Prior
Ball&Stick in1, NODDI, CHARMED in1 | p_j(x, M) = 1 if Σ_{k=0}^{n−1} w_k ≤ 1, else p_j(x, M) = 0
Tensor, CHARMED in1 (extra-axonal compartment) | p_j(x, M) = 1 if d∥ > d⊥0 > d⊥1, else p_j(x, M) = 0

Table 2: The model priors p_j(x|M) for model M. Each of these priors should be interpreted as a boolean: they return a value of 1 if the condition is fulfilled, else they return 0. These priors are combined with the parameter specific priors in Table 1 to form the complete model priors.

Well-known MCMC algorithms include Metropolis-Hastings (Hastings, 1970) and Gibbs sampling (Turchin, 1971). The general Metropolis-Hastings algorithm works as follows. Given a current position X^(t) at step t of a p-dimensional Markov chain, a new position X^(t+1) is obtained by generating a candidate position Y from the proposal density q(·|X^(t)), which is then either accepted with probability α, or rejected with probability 1 − α. If the candidate position is accepted, the chain moves to X^(t+1) = Y; otherwise the chain remains at X^(t+1) = X^(t).

Markov Chain Monte Carlo
The acceptance criterion α is a function given by (Hastings, 1970):

α = min(1, [π(Y) q(X^(t)|Y)] / [π(X^(t)) q(Y|X^(t))])

where π(·) is our target density, generally given by our posterior distribution. For componentwise (Metropolis-Within-Gibbs) updates we write

Y*_i = (X^(t+1)_1, …, X^(t+1)_{i−1}, Y_i, X^(t)_{i+1}, …, X^(t)_p)

as the candidate position for component i, and

X*_i = (X^(t+1)_1, …, X^(t+1)_{i−1}, X^(t)_i, X^(t)_{i+1}, …, X^(t)_p)

as the temporary position in the chain while component i is being updated.
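A minimal illustration of the acceptance step (not the paper's GPU implementation): with a symmetric Gaussian proposal the q-ratio cancels from α, so only the target densities remain. The one-dimensional `log_target` here is a stand-in standard-normal density, an assumption for demonstration only:

```python
import numpy as np

rng = np.random.default_rng(42)

def log_target(x):
    # Stand-in target density: standard normal log-density (up to a constant).
    return -0.5 * x**2

def metropolis_hastings(x0, n_samples, proposal_std=1.0):
    """Basic Metropolis-Hastings with a symmetric Gaussian random-walk proposal."""
    chain = np.empty(n_samples)
    x = x0
    for t in range(n_samples):
        y = x + rng.normal(0.0, proposal_std)      # candidate Y ~ q(.|X^(t))
        log_alpha = log_target(y) - log_target(x)  # log pi(Y)/pi(X), q cancels
        if np.log(rng.uniform()) < log_alpha:      # accept with probability alpha
            x = y
        chain[t] = x                               # on rejection, repeat X^(t)
    return chain

samples = metropolis_hastings(0.0, 20000)
```

Working in log-space avoids numerical underflow of the density ratio, which matters for the high-dimensional likelihoods of multi-compartment models.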

The proposals Y*_i are generated using the symmetric proposal q_i(· | X*_i).
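Under the assumption of a symmetric per-component proposal, a Metropolis-Within-Gibbs sweep could be sketched as follows; the two-dimensional `log_target` is an illustrative stand-in, not one of the diffusion models:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Stand-in 2D target: independent standard normals.
    return -0.5 * np.sum(x**2)

def metropolis_within_gibbs(x0, n_samples, proposal_stds):
    """Update one component at a time; with a symmetric proposal q_i the
    acceptance probability reduces to min(1, pi(Y*_i) / pi(X*_i))."""
    x = np.array(x0, dtype=float)
    chain = np.empty((n_samples, x.size))
    for t in range(n_samples):
        for i in range(x.size):
            y = x.copy()
            y[i] += rng.normal(0.0, proposal_stds[i])  # candidate Y*_i
            if np.log(rng.uniform()) < log_target(y) - log_target(x):
                x = y                                  # accept component i
        chain[t] = x
    return chain

chain = metropolis_within_gibbs([0.0, 0.0], 5000, [1.0, 1.0])
```

Componentwise updates allow each parameter its own proposal standard deviation, which is what the adaptive strategies below tune per parameter.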

Adaptive Metropolis
where t_s denotes the iteration after which the adaptation starts (we use t_s = 100). A small constant is necessary to prevent the standard deviation from shrinking to zero. This adaptation algorithm has been proven to retain ergodicity, meaning it is guaranteed to converge to the right stationary distribution. The other two methods work by adapting the acceptance rate of the generated proposals. For one-dimensional, componentwise updates (as used in this work), the optimal target acceptance rate is 0.44 (Gelman et al., 1996).
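As a sketch of acceptance-rate adaptation toward the 0.44 target, the following follows the batch-wise Adaptive Metropolis-Within-Gibbs scheme of Roberts and Rosenthal (2009) in one dimension; the batch size of 50 and the stand-in `log_target` are illustrative assumptions:

```python
import math
import numpy as np

rng = np.random.default_rng(7)

def log_target(x):
    # Stand-in 1D target: standard normal log-density (up to a constant).
    return -0.5 * x**2

def amwg(x0, n_samples, batch_size=50, target_rate=0.44):
    """Adaptive Metropolis-Within-Gibbs in 1D: after every batch, nudge the
    log proposal std up or down by delta(b) = min(0.01, b**-0.5), depending
    on whether the batch acceptance rate is above or below the 0.44 target."""
    x, log_std = x0, 0.0
    chain = np.empty(n_samples)
    accepted, batch = 0, 0
    for t in range(n_samples):
        y = x + rng.normal(0.0, math.exp(log_std))
        if np.log(rng.uniform()) < log_target(y) - log_target(x):
            x, accepted = y, accepted + 1
        chain[t] = x
        if (t + 1) % batch_size == 0:
            batch += 1
            delta = min(0.01, batch ** -0.5)
            log_std += delta if accepted / batch_size > target_rate else -delta
            accepted = 0
    return chain, math.exp(log_std)

chain, final_std = amwg(0.0, 20000)
```

Because the adaptation magnitude shrinks over time (diminishing adaptation), schemes of this form retain ergodicity of the chain.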

The first of the two acceptance rate scaling strategies is from the FSL BedpostX software package; we refer to this as the FSL strategy.

Burn-in is the process of discarding the first z samples from the chain and using only the remaining samples in subsequent analysis. The idea is that if the starting point had a low probability, then the limited number of early samples may oversample low probability regions. By discarding the first z samples as a burn-in, the hope is that, by then, the chain has converged to its stationary distribution and that all further samples are drawn directly from the stationary distribution (Robert, 2015). Theoretically, burn-in is unnecessary, since any empirical average μ̂(g) for any function g will converge to μ(g) given a large enough sample size and given that the chain is ergodic (Robert, 2015). Additionally, it cannot be predicted how long it will take for the chain to reach convergence.

Here, ε bounds the relative contribution of the Monte Carlo error to the variability in the target distribution. W(p, α, ε) can be determined a priori and is defined as:

W(p, α, ε) = [2^(2/p) π / (p Γ(p/2))^(2/p)] · χ²_{1−α, p} / ε²

with χ² the chi-square function and Γ(·) the Gamma function (Vats et al., 2015). Figure 2 shows the effect of α and ε on W(p, α, ε). Whether enough samples have been obtained can then be determined using the ratio W(p, α, ε)/s, where s is the number of samples we started out with and W(p, α, ε) the theoretical minimum number of samples.

For this study we used two groups of ten subjects coming from two studies, each with a different acquisition protocol. The first ten subjects are from the freely available fully preprocessed dMRI data from the USC-… We refer to these datasets as HCP MGH - 1.5mm - 552vol - b10k and to the …
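The minimum multivariate Effective Sample Size W(p, α, ε) of Vats et al. (2015) can be computed directly from the formula above. The sketch below hardcodes 95% chi-square critical values to stay dependency-free; the choices α = 0.05, ε = 0.1 and p = 4 free parameters are illustrative assumptions, not values taken from this paper:

```python
import math

# Hardcoded chi-square critical values chi2_{1-alpha, p} for alpha = 0.05.
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592}

def min_multivariate_ess(p, epsilon, chi2_crit):
    """Minimum multivariate ESS W(p, alpha, eps) from Vats et al. (2015):
    W = 2^(2/p) * pi / (p * Gamma(p/2))^(2/p) * chi2_{1-alpha, p} / eps^2."""
    prefactor = (2 ** (2 / p)) * math.pi / (p * math.gamma(p / 2)) ** (2 / p)
    return prefactor * chi2_crit / epsilon ** 2

# Example: a model with p = 4 free parameters at alpha = 0.05, eps = 0.1
# yields a minimum ESS on the order of two thousand samples.
w = min_multivariate_ess(4, 0.1, CHI2_95[4])
```

Note that W depends only on p, α and ε, so it can indeed be fixed a priori, before any sampling is done, and compared against the ESS actually achieved by a chain.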

We begin by comparing the four different proposal strategies for sampling. We then investigated burn-in, initialization and thinning. We finally applied the theory of Effective Sample Size to diffusion multi-compartment models.