Variational Bayesian Parameter Estimation Techniques for the General Linear Model

Variational Bayes (VB), variational maximum likelihood (VML), restricted maximum likelihood (ReML), and maximum likelihood (ML) are cornerstone parametric statistical estimation techniques in the analysis of functional neuroimaging data. However, the theoretical underpinnings of these model parameter estimation techniques are rarely covered in introductory statistical texts. Because of the widespread practical use of VB, VML, ReML, and ML in the neuroimaging community, we reasoned that a theoretical treatment of their relationships and their application in a basic modeling scenario may be helpful for neuroimaging novices and practitioners alike. In this technical study, we thus revisit the conceptual and formal underpinnings of VB, VML, ReML, and ML and provide a detailed account of their mathematical relationships and implementational details. We further apply VB, VML, ReML, and ML to the general linear model (GLM) with non-spherical error covariance as commonly encountered in the first-level analysis of fMRI data. To this end, we explicitly derive the corresponding free energy objective functions and ensuing iterative algorithms. Finally, in the applied part of our study, we evaluate the parameter and model recovery properties of VB, VML, ReML, and ML, first in an exemplary setting and then in the analysis of experimental fMRI data acquired from a single participant under visual stimulation.

In this section, we evaluate the VB, VML, ReML, and ML free energies for the GLM and derive update equations for their maximization. The notation follows that of the main text. We commence with some remarks on additional notation and matrix differentiation.

Expectations
To ease the notation, we will often write the expectation of a function f of a random variable x under the probability distribution p(x) using the expectation operator

$$
\langle f(x) \rangle_{p(x)} := \int f(x)\, p(x)\, dx. \quad (S1.1)
$$

Furthermore, on numerous occasions we require the following property of expectations of multivariate random variables x ∈ R^d under normal distributions: for x, m, µ ∈ R^d, positive-definite Σ ∈ R^{d×d}, and A ∈ R^{d×d} it holds that

$$
\big\langle (x - m)^T A (x - m) \big\rangle_{N(x;\, \mu, \Sigma)} = (\mu - m)^T A (\mu - m) + \mathrm{tr}(A\Sigma) \quad (S1.2)
$$

(see e.g. Petersen and Pedersen (2012), eq. (380)).
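Identities such as (S1.2) can be conveniently checked numerically. The following Matlab sketch (our illustration; all dimensions and parameter values are arbitrary) verifies (S1.2) by Monte Carlo sampling:

```matlab
% Monte Carlo sanity check of the Gaussian expectation identity (S1.2).
d = 3; rng(1);
mu = randn(d,1); m = randn(d,1); A = randn(d,d);
R = randn(d,d); Sigma = R*R' + eye(d);                   % a p.d. covariance matrix
x = bsxfun(@plus, mu, chol(Sigma,'lower')*randn(d,1e6)); % samples from N(mu,Sigma)
xm = bsxfun(@minus, x, m);
lhs = mean(sum(xm.*(A*xm), 1));                          % <(x-m)' A (x-m)>
rhs = (mu - m)'*A*(mu - m) + trace(A*Sigma);             % right-hand side of (S1.2)
disp([lhs rhs])                                          % the two should agree closely
```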

Gradient and Hessian
The gradient and Hessian of a real-valued function f : R^n → R evaluated at a point a ∈ R^n will be denoted by

$$
\nabla f(a) := \Big( \frac{\partial f}{\partial x_1}(a), \ldots, \frac{\partial f}{\partial x_n}(a) \Big)^T
\quad \text{and} \quad
H_f(a) := \Big( \frac{\partial^2 f}{\partial x_i \partial x_j}(a) \Big)_{1 \le i,j \le n}, \quad (S1.5)
$$

respectively. When it eases the notation, we also occasionally denote the partial derivative of f with respect to x_i evaluated at a ∈ R^n by ∂/∂x_i f |_{x=a}.

Matrix differentiation

The following matrix differentiation rules are used in the subsequent derivations (Petersen and Pedersen, 2012). For a matrix A depending on a scalar parameter x, we have

$$
\frac{\partial |A|}{\partial x} = |A|\, \mathrm{tr}\Big(A^{-1} \frac{\partial A}{\partial x}\Big), \quad (S1.6)
$$

$$
\frac{\partial \ln |A|}{\partial x} = \mathrm{tr}\Big(A^{-1} \frac{\partial A}{\partial x}\Big), \quad (S1.7)
$$

$$
\frac{\partial A^{-1}}{\partial x} = -A^{-1} \frac{\partial A}{\partial x} A^{-1}, \quad (S1.8)
$$

$$
\frac{\partial\, \mathrm{tr}(A)}{\partial x} = \mathrm{tr}\Big(\frac{\partial A}{\partial x}\Big). \quad (S1.9)
$$

For a matrix A depending on a two-dimensional vector x = (x_1, x_2)^T, the second-order partial derivatives of its inverse are given by

$$
\frac{\partial^2 A^{-1}}{\partial x_i \partial x_j}
= A^{-1} \frac{\partial A}{\partial x_j} A^{-1} \frac{\partial A}{\partial x_i} A^{-1}
- A^{-1} \frac{\partial^2 A}{\partial x_i \partial x_j} A^{-1}
+ A^{-1} \frac{\partial A}{\partial x_i} A^{-1} \frac{\partial A}{\partial x_j} A^{-1} \quad (S1.10, S1.11)
$$

for i, j ∈ {1, 2}, assuming that A has continuous second derivatives, such that the symmetry of second-order derivatives (Schwarz's theorem) holds. For the update equations of the matrix parameters S_β and S_λ, we also need to compute derivatives with respect to matrices. For symmetric A, we have

$$
\frac{\partial \ln |A|}{\partial A} = A^{-1}, \quad (S1.13)
$$

and for matrices A and B of matching dimensions

$$
\frac{\partial\, \mathrm{tr}(AB)}{\partial A} = B^T. \quad (S1.14)
$$
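The scalar differentiation rules can likewise be validated by finite differences. A minimal Matlab check of (S1.7) for an arbitrarily chosen example matrix function A(x) (our illustration) reads:

```matlab
% Finite-difference check of d ln|A|/dx = tr(A^{-1} dA/dx), cf. (S1.7).
B = [0.3 0.1; 0.2 0.4]; x = 0.7; h = 1e-6;          % arbitrary example values
A = @(x) expm(x*B) + 2*eye(2);                      % example matrix function A(x)
dA = (A(x+h) - A(x-h))/(2*h);                       % numerical dA/dx
num = (log(det(A(x+h))) - log(det(A(x-h))))/(2*h);  % numerical d ln|A|/dx
ana = trace(A(x)\dA);                               % analytical rule (S1.7)
disp([num ana])                                     % the two should agree closely
```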

S1.2 Variational Bayes
Evaluation of the VB free energy  To evaluate the VB free energy, we first rewrite it from its definition in the main text as a sum of expectation terms; the rewritten form is referred to as (S1.15) in the following.
Finally, we consider the first term of (S1.15). Based on the definition of p(y|β, λ), the expectation with respect to q(β) can be evaluated exactly, yielding

$$
\big\langle \ln p(y|\beta, \lambda) \big\rangle_{q(\beta)q(\lambda)}
= -\frac{n}{2}\ln 2\pi
- \frac{1}{2}\big\langle \ln |V_\lambda| \big\rangle_{q(\lambda)}
- \frac{1}{2}\big\langle (y - Xm_\beta)^T V_\lambda^{-1} (y - Xm_\beta) \big\rangle_{q(\lambda)}
- \frac{1}{2}\big\langle \mathrm{tr}\big(V_\lambda^{-1} X S_\beta X^T\big) \big\rangle_{q(\lambda)}. \quad (S1.20)
$$

To make it possible to evaluate the remaining expectations, we use a second-order Taylor approximation. Let g denote any of the λ-dependent functions appearing in (S1.20). A second-order Taylor expansion of g around m_λ yields

$$
g(\lambda) \approx g(m_\lambda) + (\lambda - m_\lambda)^T \nabla g(m_\lambda) + \frac{1}{2}(\lambda - m_\lambda)^T H_g(m_\lambda)(\lambda - m_\lambda)
$$

in the vicinity of m_λ. If q(λ) is sufficiently narrow, that is, if most of its mass is concentrated close to m_λ, we can thus approximate, with (S1.2) and because the linear term vanishes under the expectation,

$$
\big\langle g(\lambda) \big\rangle_{q(\lambda)} \approx g(m_\lambda) + \frac{1}{2}\mathrm{tr}\big(H_g(m_\lambda)\, S_\lambda\big). \quad (S1.23)
$$

This approximation needs to be applied to all expectations in equation (S1.20). Thus, using the linearity of the trace to subsume all Hessian matrices in the single matrix

$$
B_{m_\lambda} := H_{\ln|V_\lambda|}(m_\lambda) + H_{(y - Xm_\beta)^T V_\lambda^{-1}(y - Xm_\beta)}(m_\lambda) + H_{\mathrm{tr}(V_\lambda^{-1} X S_\beta X^T)}(m_\lambda), \quad (S1.24)
$$

thereby pooling the second-order terms, we arrive at terms 1-5 of equation (28) in the main text, and the derivation is complete.
To estimate the VB free energy in practice, the Hessian matrices on the right-hand side of (S1.24) have to be evaluated. For the linear form of the error covariance matrix

$$
V_\lambda := \exp(\lambda_1) I_n + \exp(\lambda_2) Q_2 \quad (S1.25)
$$

the three Hessian matrices of (S1.24) can be evaluated analytically.

• H_{ln|V_λ|}: Using (S1.7) and noting that ∂V_λ/∂λ_1 = exp(λ_1) I_n and ∂V_λ/∂λ_2 = exp(λ_2) Q_2, the first-order partial derivatives are given by

$$
\frac{\partial}{\partial \lambda_1} \ln|V_\lambda| = \exp(\lambda_1)\, \mathrm{tr}\big(V_\lambda^{-1}\big)
\quad \text{and} \quad
\frac{\partial}{\partial \lambda_2} \ln|V_\lambda| = \exp(\lambda_2)\, \mathrm{tr}\big(V_\lambda^{-1} Q_2\big). \quad (S1.27)
$$

Exploiting the linearity of the trace operator (S1.9) and using (S1.8) for the derivative of the inverse yields the second-order partial derivatives

$$
\frac{\partial^2}{\partial \lambda_1^2} \ln|V_\lambda| = \exp(\lambda_1)\, \mathrm{tr}\big(V_\lambda^{-1}\big) - \exp(2\lambda_1)\, \mathrm{tr}\big(V_\lambda^{-2}\big),
$$

$$
\frac{\partial^2}{\partial \lambda_2^2} \ln|V_\lambda| = \exp(\lambda_2)\, \mathrm{tr}\big(V_\lambda^{-1} Q_2\big) - \exp(2\lambda_2)\, \mathrm{tr}\big(V_\lambda^{-1} Q_2 V_\lambda^{-1} Q_2\big),
$$

$$
\frac{\partial^2}{\partial \lambda_1 \partial \lambda_2} \ln|V_\lambda| = -\exp(\lambda_1 + \lambda_2)\, \mathrm{tr}\big(V_\lambda^{-2} Q_2\big),
$$

where in the last equation we used that the trace is invariant under cyclic permutations, e.g. tr(ABC) = tr(CAB) = tr(BCA). The Hessians of the remaining two terms of (S1.24) follow analogously.
Notably, the evaluation of these Hessian matrices will necessitate the inversion of V_λ on every iteration of the optimization algorithm. This inversion can be performed efficiently using the diagonalized form of Q_2. As Q_2 is a real, symmetric matrix by design, there exists a diagonalized form Q_2^D = P^T Q_2 P, where P is an orthogonal transformation matrix (P^T = P^{-1}). The entries l_i, i ∈ {1, ..., n} of Q_2^D are the eigenvalues of Q_2. We thus have

$$
V_\lambda^{-1} = \big(\exp(\lambda_1) I_n + \exp(\lambda_2) Q_2\big)^{-1}
= P \big(\exp(\lambda_1) I_n + \exp(\lambda_2) Q_2^D\big)^{-1} P^T. \quad (S1.36)
$$

As exp(λ_1) I_n + exp(λ_2) Q_2^D is a diagonal matrix with entries exp(λ_1) + exp(λ_2) l_i, its inverse is easily evaluated, and the diagonalizing matrix P only needs to be computed once for any given Q_2.
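A minimal Matlab sketch of this one-time diagonalization scheme (with an arbitrary symmetric example matrix standing in for Q_2) is:

```matlab
% Efficient inversion of V_lambda = exp(l1)*I_n + exp(l2)*Q2 via (S1.36).
n = 50; Q2 = toeplitz(0.5.^(0:n-1));     % example real, symmetric Q2
[P, D] = eig(Q2); l = diag(D);           % one-time diagonalization, Q2 = P*D*P'
lambda = [-1; -2];                       % current covariance parameter values
d = exp(lambda(1)) + exp(lambda(2))*l;   % diagonal entries of P'*V_lambda*P
Vinv = P*diag(1./d)*P';                  % V_lambda^{-1} without a full inversion
```

Only the diagonal vector d needs to be recomputed when the covariance parameters change.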
The VB free energy update equations  In this section, we consider the iterative maximization of the VB free energy function with respect to its vector and matrix parameters m_β, S_β, m_λ, and S_λ.
In each case, we identify the relevant subpart of the VB free energy function depending on the respective parameter, evaluate its gradient with respect to the parameter in question, set the gradient to zero, and, if possible, solve the ensuing equation for a parameter update equation. To emphasize the iterative character of this endeavour, we use the superscript (i) to denote the values of parameters at a given algorithm iteration.
We consider the update with respect to S_λ first. The relevant subpart of the VB free energy that depends on S_λ is given by, omitting iteration superscripts for visual clarity,

$$
-\frac{1}{2}\mathrm{tr}\big(\Sigma_\lambda^{-1} S_\lambda\big) - \frac{1}{4}\mathrm{tr}\big(B_{m_\lambda} S_\lambda\big) + \frac{1}{2}\ln|S_\lambda|,
$$

where Σ_λ denotes the prior covariance of λ. Using the identities (S1.13), (S1.14), and considering that B_{m_λ^{(i)}} is symmetric, the gradient with respect to S_λ evaluates to

$$
\nabla_{S_\lambda} F_{VB} = -\frac{1}{2}\Sigma_\lambda^{-1} - \frac{1}{4} B_{m_\lambda^{(i)}} + \frac{1}{2} S_\lambda^{-1}. \quad (S1.38)
$$

Setting the gradient to zero and solving for the parameter update S_λ^{(i+1)} yields

$$
S_\lambda^{(i+1)} = \Big(\Sigma_\lambda^{-1} + \frac{1}{2} B_{m_\lambda^{(i)}}\Big)^{-1}. \quad (S1.39)
$$

Note that with the linearity properties of the trace operator, this update equation implies that the sum of the two trace terms involving S_λ in the VB free energy (equation (28) of the main text) evaluates to −k/2, such that the term B_{m_λ^{(i)}} does not need to be considered when deriving the update equations for m_β, S_β, and m_λ.
We next consider the updates with respect to m_β and S_β, where we omit iteration superscripts for visual clarity. With (S1.2), the gradient of the relevant subpart of the VB free energy with respect to m_β can be evaluated analytically. Setting the gradient to zero then yields the update equation

$$
m_\beta^{(i+1)} = \Big(X^T V_{m_\lambda^{(i)}}^{-1} X + \Sigma_\beta^{-1}\Big)^{-1} \Big(X^T V_{m_\lambda^{(i)}}^{-1} y + \Sigma_\beta^{-1} \mu_\beta\Big), \quad (S1.42)
$$

and proceeding analogously for S_β yields the resulting update equation

$$
S_\beta^{(i+1)} = \Big(X^T V_{m_\lambda^{(i)}}^{-1} X + \Sigma_\beta^{-1}\Big)^{-1}, \quad (S1.45)
$$

where µ_β and Σ_β denote the expectation and covariance parameters of the prior distribution of β. Note that the update equations (S1.42) and (S1.45) conform to the well-known closed-form expressions for Bayesian inference in the conjugate Gaussian model (cf. eq. (9) of the main text), with the difference of the parametric dependence of the error covariance matrix on m_λ^{(i)}. Finally, the relevant subpart of the VB free energy depending on m_λ is given by (S1.47), again omitting iteration superscripts for visual clarity, and the evaluation of the entries of its gradient for the two-component linear error covariance (S1.25) yields (S1.48) and (S1.49). To determine the roots of these gradient entries for j = 1, 2, we employ the routine fsolve.m provided by Matlab (MATLAB and Optimization Toolbox Release 2014b, The MathWorks, Inc., Natick, Massachusetts, United States). This function implements a trust-region dogleg algorithm for the numerical solution of systems of nonlinear equations in several variables (Coleman and Li, 1996; Nocedal and Wright, 2006).
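For concreteness, the following Matlab sketch shows the resulting root-finding step. The toy gradient function is merely a self-contained stand-in for an actual implementation of the gradient entries (S1.48) and (S1.49):

```matlab
% Sketch of the m_lambda update via fsolve; grad_f is a toy stand-in for the
% free energy gradient entries (S1.48)-(S1.49).
grad_f = @(m_lambda) [exp(m_lambda(1)) - 2; exp(m_lambda(2)) - 1];
opts = optimoptions('fsolve','Algorithm','trust-region-dogleg','Display','off');
m_lambda = fsolve(grad_f, [0; 0], opts)   % root of the gradient; here [ln 2; 0]
```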

S1.3 Variational maximum likelihood
Evaluation of the VML free energy  The VML free energy is defined as

$$
F_{VML}(m_\beta, S_\beta, \lambda) := \big\langle \ln p_\lambda(y|\beta) \big\rangle_{q(\beta)} + \big\langle \ln p(\beta) \big\rangle_{q(\beta)} + H(q(\beta)), \quad q(\beta) = N(\beta; m_\beta, S_\beta). \quad (S1.51)
$$

The latter two terms on the right-hand side of (S1.51) have been evaluated in Section S1.2. The first term can be evaluated using (S1.2), yielding

$$
\big\langle \ln p_\lambda(y|\beta) \big\rangle_{q(\beta)} = -\frac{n}{2}\ln 2\pi - \frac{1}{2}\ln|V_\lambda| - \frac{1}{2}(y - Xm_\beta)^T V_\lambda^{-1}(y - Xm_\beta) - \frac{1}{2}\mathrm{tr}\big(V_\lambda^{-1} X S_\beta X^T\big),
$$

which completes the derivation of the VML free energy as eq. (39) of the main text.
The VML free energy update equations  To identify the update equations for the maximization of the VML free energy, we proceed as in Section S1.2. Because the main difference between the VB and VML frameworks is the parameterization of the error covariance matrix V_λ in terms of λ rather than m_λ, and the vanishing of terms relating to the prior and variational distributions of λ, we can keep the discussion very concise.
The relevant subpart of F_VML(m_β, S_β^{(i)}, λ^{(i)}) that depends on m_β and its ensuing update equation

$$
m_\beta^{(i+1)} = \Big(X^T V_{\lambda^{(i)}}^{-1} X + \Sigma_\beta^{-1}\Big)^{-1} \Big(X^T V_{\lambda^{(i)}}^{-1} y + \Sigma_\beta^{-1} \mu_\beta\Big), \quad (S1.55)
$$

as well as the subpart depending on S_β and the resulting update equation

$$
S_\beta^{(i+1)} = \Big(X^T V_{\lambda^{(i)}}^{-1} X + \Sigma_\beta^{-1}\Big)^{-1}
$$

parallel the VB update equations (S1.42) and (S1.45), with V_{m_λ^{(i)}} replaced by V_{λ^{(i)}}. The gradient of the subpart of F_VML depending on λ is given by, omitting iteration superscripts,

$$
\big(\nabla f_{VML}(\lambda)\big)_j = -\frac{1}{2}\mathrm{tr}\Big(V_\lambda^{-1} \frac{\partial V_\lambda}{\partial \lambda_j}\Big)
+ \frac{1}{2}(y - Xm_\beta)^T V_\lambda^{-1} \frac{\partial V_\lambda}{\partial \lambda_j} V_\lambda^{-1} (y - Xm_\beta)
+ \frac{1}{2}\mathrm{tr}\Big(V_\lambda^{-1} \frac{\partial V_\lambda}{\partial \lambda_j} V_\lambda^{-1} X S_\beta X^T\Big). \quad (S1.59)
$$

Here, in analogy to eqs. (S1.48) and (S1.49), the entries of ∇f_VML(λ) for the case of the two-component error covariance matrix of interest (eq. (S1.25)) evaluate to

$$
\big(\nabla f_{VML}(\lambda)\big)_1 = \exp(\lambda_1)\Big(-\frac{1}{2}\mathrm{tr}\big(V_\lambda^{-1}\big) + \frac{1}{2}(y - Xm_\beta)^T V_\lambda^{-2} (y - Xm_\beta) + \frac{1}{2}\mathrm{tr}\big(V_\lambda^{-2} X S_\beta X^T\big)\Big) \quad (S1.60)
$$

and

$$
\big(\nabla f_{VML}(\lambda)\big)_2 = \exp(\lambda_2)\Big(-\frac{1}{2}\mathrm{tr}\big(V_\lambda^{-1} Q_2\big) + \frac{1}{2}(y - Xm_\beta)^T V_\lambda^{-1} Q_2 V_\lambda^{-1} (y - Xm_\beta) + \frac{1}{2}\mathrm{tr}\big(V_\lambda^{-1} Q_2 V_\lambda^{-1} X S_\beta X^T\big)\Big), \quad (S1.61)
$$

whose roots are again determined numerically using fsolve.m.

S1.4 Restricted maximum likelihood

The ReML objective function as VML free energy  We first show that for the probabilistic model

$$
p_\lambda(y, \beta) = p(\beta)\, p_\lambda(y|\beta) \quad \text{with} \quad p(\beta) := 1 \;\text{ and }\; p_\lambda(y|\beta) := N(y; X\beta, V_\lambda), \quad (S1.62)
$$

it holds that the VML free energy with variational distribution

$$
q(\beta) := p_\lambda(\beta|y) \quad (S1.63)
$$

evaluates to the ReML objective function up to an additive constant that does not depend on λ.
To this end, we first note that for the probabilistic model (S1.62) and with the definition of the GLS estimator

$$
\hat\beta_\lambda := \big(X^T V_\lambda^{-1} X\big)^{-1} X^T V_\lambda^{-1} y,
$$

the conditional distribution of β given y evaluates to

$$
p_\lambda(\beta|y) = N\big(\beta;\, \hat\beta_\lambda,\, \big(X^T V_\lambda^{-1} X\big)^{-1}\big). \quad (S1.68)
$$

In brief, (S1.68) follows as a limiting case of the conditional properties of Gaussian distributions for the case of zero prior precision, i.e. the case of an improper prior p(β) = 1 (see e.g. Murphy (2012) for a more detailed discussion).
Evaluation of the VML free energy in the current scenario then yields

$$
F_{VML}(p_\lambda(\beta|y), \lambda) = \big\langle \ln p_\lambda(y|\beta) \big\rangle_{p_\lambda(\beta|y)} + H(p_\lambda(\beta|y)). \quad (S1.69)
$$

Evaluation of the first term on the right-hand side of (S1.69) yields

$$
\begin{aligned}
\big\langle \ln p_\lambda(y|\beta) \big\rangle_{p_\lambda(\beta|y)}
&= -\frac{n}{2}\ln 2\pi - \frac{1}{2}\ln|V_\lambda| - \frac{1}{2}\big\langle (y - X\beta)^T V_\lambda^{-1}(y - X\beta) \big\rangle_{p_\lambda(\beta|y)} \\
&= -\frac{n}{2}\ln 2\pi - \frac{1}{2}\ln|V_\lambda| - \frac{1}{2}(y - X\hat\beta_\lambda)^T V_\lambda^{-1}(y - X\hat\beta_\lambda) - \frac{1}{2}\mathrm{tr}\Big(V_\lambda^{-1} X \big(X^T V_\lambda^{-1} X\big)^{-1} X^T\Big) \\
&= -\frac{n}{2}\ln 2\pi - \frac{1}{2}\ln|V_\lambda| - \frac{1}{2}(y - X\hat\beta_\lambda)^T V_\lambda^{-1}(y - X\hat\beta_\lambda) - \frac{p}{2},
\end{aligned}
$$

where the second equality follows with (S1.2) and the third equality uses the invariance of the trace under cyclic permutations. The second term on the right-hand side of (S1.69) corresponds to the entropy of the distribution p_λ(β|y) and thus evaluates to

$$
H(p_\lambda(\beta|y)) = \frac{p}{2}\ln 2\pi e - \frac{1}{2}\ln\big|X^T V_\lambda^{-1} X\big|.
$$

We thus have shown that

$$
F_{VML}(p_\lambda(\beta|y), \lambda) = -\frac{1}{2}\ln|V_\lambda| - \frac{1}{2}(y - X\hat\beta_\lambda)^T V_\lambda^{-1}(y - X\hat\beta_\lambda) - \frac{1}{2}\ln\big|X^T V_\lambda^{-1} X\big| + \frac{p - n}{2}\ln 2\pi,
$$

which corresponds to the ReML objective function up to the additive constant ((p − n)/2) ln 2π, and concludes the derivation.

Evaluation of the ReML free energy function
To align the discussion of ReML with the previous discussions of VB and VML, we next define the ReML free energy function as the VML free energy evaluated for the probabilistic model (S1.62) at the exact posterior distribution p_λ(β|y), i.e.,

$$
F_{ReML}(m_\beta, S_\beta, \lambda) := F_{VML}\big(N(\beta; m_\beta, S_\beta), \lambda\big). \quad (S1.73)
$$

By noting that with (S1.68) the variational parameters are given by

$$
m_\beta = \hat\beta_\lambda \quad \text{and} \quad S_\beta = \big(X^T V_\lambda^{-1} X\big)^{-1},
$$

we can then rewrite the ReML free energy as in the main text:

$$
F_{ReML}(m_\beta, S_\beta, \lambda) = -\frac{n}{2}\ln 2\pi - \frac{1}{2}\ln|V_\lambda| - \frac{1}{2}(y - Xm_\beta)^T V_\lambda^{-1}(y - Xm_\beta) - \frac{1}{2}\mathrm{tr}\big(V_\lambda^{-1} X S_\beta X^T\big) + \frac{p}{2}\ln 2\pi e + \frac{1}{2}\ln|S_\beta|. \quad (S1.75)
$$

The ReML free energy update equations  Finally, we derive the update equations for the parameters m_β, S_β, and λ of the ReML free energy. Note that because the ReML objective function is identical to the ReML free energy up to an additive constant which is independent of these parameters, the resulting iterative algorithm also maximizes the ReML objective function.
The relevant subpart of F_ReML(m_β, S_β^{(i)}, λ^{(i)}) that depends on m_β is given by, omitting iteration superscripts for ease of notation,

$$
-\frac{1}{2}(y - Xm_\beta)^T V_\lambda^{-1}(y - Xm_\beta),
$$

with ensuing update equation

$$
m_\beta^{(i+1)} = \big(X^T V_{\lambda^{(i)}}^{-1} X\big)^{-1} X^T V_{\lambda^{(i)}}^{-1} y.
$$

Unsurprisingly, this is the GLS estimator. Further, the relevant subpart of F_ReML(m_β^{(i+1)}, S_β, λ^{(i)}) depending on S_β is given by, again omitting iteration superscripts for ease of notation,

$$
-\frac{1}{2}\mathrm{tr}\big(V_\lambda^{-1} X S_\beta X^T\big) + \frac{1}{2}\ln|S_\beta|,
$$

with ensuing update equation

$$
S_\beta^{(i+1)} = \big(X^T V_{\lambda^{(i)}}^{-1} X\big)^{-1}. \quad (S1.81)
$$

Finally, because the subpart of F_ReML depending on λ is identical to the subpart of F_VML depending on λ, the update procedure for F_ReML with respect to λ is identical to that of F_VML.
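In Matlab, these two updates amount to two lines. In the following sketch, X, y, and V are illustrative stand-ins for the design matrix, data vector, and current error covariance matrix V_{λ^{(i)}}:

```matlab
% Sketch of the ReML m_beta and S_beta updates for a given V_lambda.
n = 50; X = [ones(n,1) randn(n,1)]; y = randn(n,1); V = eye(n); % stand-ins
S_beta = inv(X'/V*X);                 % posterior covariance update, cf. (S1.81)
m_beta = S_beta*(X'/V*y);             % GLS estimator update
```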

S1.5 Maximum likelihood
The ML free energy update equations  For the GLM, we have by definition

$$
F_{ML}(\beta, \lambda) := \ln p_{\beta,\lambda}(y) = -\frac{n}{2}\ln 2\pi - \frac{1}{2}\ln|V_\lambda| - \frac{1}{2}(y - X\beta)^T V_\lambda^{-1}(y - X\beta).
$$

To derive parameter update equations, we consider the dependency of F_ML on β^{(i)} and λ^{(i)} in turn. The relevant subpart of F_ML(β, λ^{(i)}) that depends on β is then given by, omitting iteration superscripts for ease of notation,

$$
-\frac{1}{2}(y - X\beta)^T V_\lambda^{-1}(y - X\beta),
$$

with ensuing update equation

$$
\beta^{(i+1)} = \big(X^T V_{\lambda^{(i)}}^{-1} X\big)^{-1} X^T V_{\lambda^{(i)}}^{-1} y, \quad (S1.85)
$$

corresponding to the GLS estimator as in the case of ReML. The relevant subpart of F_ML(β^{(i+1)}, λ) that depends on λ differs from the VML and ReML scenarios and is given by, again omitting iteration superscripts for ease of notation,

$$
f_{ML}(\lambda) := -\frac{1}{2}\ln|V_\lambda| - \frac{1}{2}(y - X\beta)^T V_\lambda^{-1}(y - X\beta).
$$

Here, in analogy to eqs. (S1.47), (S1.48), and (S1.49), the entries of ∇f_ML(λ) for the case of the two-component error covariance matrix of interest evaluate to

$$
\big(\nabla f_{ML}(\lambda)\big)_1 = \exp(\lambda_1)\Big(-\frac{1}{2}\mathrm{tr}\big(V_\lambda^{-1}\big) + \frac{1}{2}(y - X\beta)^T V_\lambda^{-2} (y - X\beta)\Big) \quad (S1.87)
$$

and

$$
\big(\nabla f_{ML}(\lambda)\big)_2 = \exp(\lambda_2)\Big(-\frac{1}{2}\mathrm{tr}\big(V_\lambda^{-1} Q_2\big) + \frac{1}{2}(y - X\beta)^T V_\lambda^{-1} Q_2 V_\lambda^{-1} (y - X\beta)\Big). \quad (S1.88)
$$

As they correspond to a disregard of prior information and posterior uncertainty about β, equations (S1.85), (S1.87), and (S1.88) can also be attained from the VML update equations (S1.55), (S1.60), and (S1.61) by setting the prior precision matrix Σ_β^{-1} and the posterior covariance matrix S_β to zero.
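To illustrate how the ML update equations interlock, the following self-contained Matlab sketch alternates the GLS update (S1.85) with a root search on (S1.87) and (S1.88) for simulated example data. All settings (design matrix, Q_2, true parameter values, fixed iteration count) are arbitrary illustrative choices, and a convergence criterion is omitted:

```matlab
% Sketch of ML estimation for the GLM with V = exp(l1)*I_n + exp(l2)*Q2.
n = 100; rng(2);
X  = [ones(n,1) (1:n)'/n];                        % example design matrix
Q2 = toeplitz(0.5.^(0:n-1));                      % example covariance basis
y  = X*[1; -2] + chol(exp(-1)*eye(n) + exp(-2)*Q2,'lower')*randn(n,1);
lambda = [0; 0];                                  % initial covariance parameters
opts = optimoptions('fsolve','Display','off');
for i = 1:16
    V    = exp(lambda(1))*eye(n) + exp(lambda(2))*Q2;
    beta = (X'/V*X)\(X'/V*y);                     % GLS update, cf. (S1.85)
    e    = y - X*beta;
    Vi   = @(l) inv(exp(l(1))*eye(n) + exp(l(2))*Q2);
    g    = @(l) [exp(l(1))*(-0.5*trace(Vi(l))    + 0.5*(e'*Vi(l)*Vi(l)*e)); ...
                 exp(l(2))*(-0.5*trace(Vi(l)*Q2) + 0.5*(e'*Vi(l)*Q2*Vi(l)*e))];
    lambda = fsolve(g, lambda, opts);             % root search on (S1.87)-(S1.88)
end
disp([beta' lambda'])                             % parameter estimates
```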

S2 Foundations of variational Bayes
In this section we formulate a probability-theoretic version of the probabilistic model considered in the main text in order to derive the VML and ML scenarios as special cases of VB. By probability-theoretic we mean a measure theory-based approach to probabilistic concepts, as prevalent in contemporary mathematics (e.g. Billingsley, 2012; Shao, 2003; Fristedt and Gray, 1997).
This approach is rather uncommon in the neuroimaging and machine learning literature, where many application-oriented developments of VB have taken place (Blei et al., 2016). In the current context, it is necessitated by the fact that point probability masses cannot be represented by probability density functions. Deriving VML and ML as special cases of VB thus requires a careful differentiation between those random variables whose distributions can and cannot be represented by probability density functions, which is afforded by the measure theory-based approach. We assume that the reader is familiar with the measure-theoretic viewpoint of probability theory, including Lebesgue integration. To establish notation and prepare some aspects of the discussion to follow, we provide a brief summary of key elements in Section S2.1. In Section S2.2 we then review a selection of entropy formulations which will be required for the formulation of VB and VML in probability-theoretic terms. Finally, in Section S2.3 we formulate the VB, VML, and ML scenarios in probability-theoretic terms and discuss their mutual relationships.

S2.1 Preliminaries
Measurable, measure, and probability spaces  Our formulation rests on the concepts of measurable, measure, and probability spaces. A measurable space is a pair (Ω, F), where Ω denotes a set and F denotes a σ-field on Ω. An important measurable space in the following will be (R^d, B^d), where B^d denotes the Borel σ-field on R^d. A measure space is a triple (Ω, F, µ), where (Ω, F) is a measurable space and µ is a measure on F. An important measure space in the following will be (R^d, B^d, µ_l^d), where µ_l^d denotes the Lebesgue measure. Please note that we do not use the more conventional notation λ^d for the Lebesgue measure in order to avoid notational confusion with the covariance component parameter λ. Finally, a probability space is a measure space (Ω, F, P) whose measure satisfies P(Ω) = 1. An important probability space in the following will be (R^d, B^d, δ_x), where δ_x denotes the Dirac measure, defined as

$$
\delta_x(A) := \begin{cases} 1, & x \in A \\ 0, & x \notin A \end{cases} \quad \text{for } A \in B^d. \quad (S2.2)
$$

Of key importance in the derivations to follow is the fact that the Lebesgue integral of a measurable function f : R^d → R with respect to the Dirac measure δ_x is readily evaluated as (e.g. Lieb and Loss, 2001)

$$
\int f\, d\delta_x = f(x).
$$

Probability density functions
For a measure space (Ω, F, µ), any quasi-integrable function f : Ω → R_{≥0} is a density function and defines a measure ν on F by means of its Lebesgue integral,

$$
\nu(A) := \int_A f\, d\mu \quad \text{for } A \in F.
$$

We say that ν is a measure with density function f with respect to the measure µ and write ν = fµ for short. Recall that an F-measurable function g : Ω → R̄ is integrable with respect to ν if the function product g · f is integrable with respect to µ, and that in this case

$$
\int g\, d\nu = \int g f\, d\mu. \quad (S2.9)
$$

Crucially, this implies that one can evaluate Lebesgue integrals using Riemann integration, as done throughout Section S1 (right-hand side of (S2.9); for details see e.g. Schmidt (2011), Chapter 9).
We further require the notion of conditional probability density functions.
For a random variable (X, Y) on a product measure space (Ω, F, µ) := (Ω_x × Ω_y, F_x × F_y, µ_x × µ_y) with joint probability density function f_{X,Y} : Ω → R_{≥0} with respect to µ, the conditional probability density function of X given Y = y is defined as

$$
f_{X|Y}(x|y) := \frac{f_{X,Y}(x, y)}{f_Y(y)} \quad \text{for } f_Y(y) > 0,
\quad \text{where} \quad
f_Y(y) := \int f_{X,Y}(x, y)\, d\mu_x(x)
$$

is the marginal probability density function of Y with respect to µ_y (e.g. Shao, 2003, Chapter 1.4).

Discrete, continuous, and mixed random vectors
Because we are considering multivariate random entities in the application of VB, VML, and ML to the GLM, we also require the notion of random vectors as the multivariate extension of random variables. More specifically, we require the concepts of discrete, continuous, and mixed random vectors, which we introduce in the following.

A d-dimensional discrete random vector is a function X : Ω → R^d which takes values in a countable set X ⊂ R^d, referred to as its alphabet, such that its distribution is fully specified by the probability masses

$$
P_X(\{x\}) \quad \text{for } x \in \mathcal{X}. \quad (S2.15)
$$

A special discrete random vector required in the following is a constant random vector, which for a fixed x* ∈ R^d we write as

$$
X^* : \Omega \to R^d, \quad X^*(\omega) := x^* \;\text{ for all } \omega \in \Omega. \quad (S2.16)
$$

In this case the alphabet X* = {x*} of X* comprises a single element, the associated probability mass of which is given by P_{X*}({x*}) = 1. Analogously to a random variable, a random vector X : Ω → R^d induces a probability measure P_X on the measurable space (R^d, B^d). Notably, the induced probability measure of a constant random vector is the Dirac measure (e.g. Bauer, 1991, p. 25), i.e. with (S2.2) and (S2.16),

$$
P_{X^*} = \delta_{x^*}.
$$

A d-dimensional continuous random vector is a function Y of the form (S2.12) whose induced probability measure P_Y on (R^d, B^d) is absolutely continuous with respect to the Lebesgue measure and can thus be represented by a probability density function f_Y : R^d → R_{≥0}.
Finally, we construct the concept of a mixed random vector as follows.
Set d := d_1 + d_2 (d_1, d_2 ∈ N), let (Ω, F) be a measurable space, and let X : Ω → R^{d_1} be a discrete random vector, which induces a distribution P_X on P(X). Let P_{Y|X} be a Markov kernel from (X, P(X)) to (R^{d_2}, B^{d_2}), i.e. P_{Y|X} is a mapping with the properties

• P_{Y|X}(x, ·) : B^{d_2} → [0, 1] is a probability measure on B^{d_2} for every x ∈ X, and
• P_{Y|X}(·, B) : X → R is P(X)-measurable for every B ∈ B^{d_2}.
Then P_{X,Y} = P_X P_{Y|X} is a probability measure on (Ω × R^{d_2}, F ⊗ B^{d_2}) (e.g. Shao, 2003, 1.4.3). Assume in addition that

$$
P_{Y|X}(x, \cdot) = f_{Y|X=x}\, \mu_l^{d_2}
$$

for probability density functions f_{Y|X=x} with x ∈ X. Let Y denote the identity mapping on R^{d_2} and define Z := (X, Y). Then Z is a d-dimensional random vector on (Ω × R^{d_2}, F ⊗ B^{d_2}) with marginal distributions having the properties

$$
P_Z(A \times R^{d_2}) = P_X(A) \quad \text{for all } A \in P(\mathcal{X})
$$

and

$$
P_Z(R^{d_1} \times B) = \sum_{x \in \mathcal{X}} P_X(\{x\})\, P_{Y|X}(x, B) \quad \text{for all } B \in B^{d_2}.
$$

Note that if we set P_{Y|X}(x, ·) := P_Y for every x ∈ X with a probability measure P_Y on B^{d_2}, then P_{X,Y} = P_X ⊗ P_Y is a product measure and the random vectors X and Y are independent. Vice versa, assuming independent d_1- and d_2-dimensional random vectors X and Y, respectively, we can use the construction above to construct a d-dimensional mixed random vector with discrete and continuous components whose marginal distributions are independent.

S2.2 Entropies of distributions of random vectors
Entropy of the distribution of a discrete random vector  Following (Gray, 2011, Chapter 3), we define the entropy of the distribution P_X of a discrete random vector X with alphabet X as

$$
H(P_X) := -\sum_{x \in \mathcal{X}} P_X(\{x\}) \ln P_X(\{x\}) \quad (S2.24)
$$

with the convention 0 ln 0 := 0. For later reference we note that the entropy of the distribution of a constant random vector of the form (S2.16) is zero, because in this case the defining sum (S2.24) comprises a single term which evaluates to zero:

$$
H(P_{X^*}) = -P_{X^*}(\{x^*\}) \ln P_{X^*}(\{x^*\}) = -1 \cdot \ln 1 = 0.
$$
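As a brief numerical illustration (with arbitrary probability masses):

```matlab
% Entropy (S2.24) of an example discrete distribution and of a point mass.
p = [0.5 0.25 0.25];
H = -sum(p.*log(p))            % 1.5*ln(2), approximately 1.0397
H_const = -1*log(1)            % constant random vector: entropy 0
```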

Entropy of the distribution of a continuous random vector
We define the entropy of the distribution P_Y of a continuous random vector Y as its differential entropy (e.g. Cover and Thomas, 2012, Chapter 8), i.e. we set

$$
H(P_Y) := -\int f_Y(y) \ln f_Y(y)\, dy.
$$

Entropy of the distribution of a mixed random vector  Finally, following (Nair et al., 2006), we define the entropy of the distribution P_Z of a mixed random vector Z = (X, Y) with the property

$$
P_{Y|X}(x, \cdot) = f_{Y|X=x}\, \mu_l^{d_2} \quad \text{for all } x \in \mathcal{X}
$$

by

$$
H(P_Z) := -\sum_{x \in \mathcal{X}} \int P_X(\{x\})\, f_{Y|X=x}(y) \ln\big(P_X(\{x\})\, f_{Y|X=x}(y)\big)\, dy.
$$

Note that we can rewrite this definition as

$$
H(P_Z) = -\sum_{x \in \mathcal{X}} P_X(\{x\}) \ln P_X(\{x\}) + \sum_{x \in \mathcal{X}} P_X(\{x\}) \Big(-\int f_{Y|X=x}(y) \ln f_{Y|X=x}(y)\, dy\Big).
$$

H(P_Z) thus comprises the sum of the entropy of the marginal distribution of the discrete component of Z and a convex combination of the differential entropies of the conditional distributions of the continuous components.
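As an illustrative example of this decomposition (our construction, not part of the main text), consider a mixed random vector Z = (X, Y) whose discrete component takes two values with equal probability and whose conditional distributions are univariate Gaussians,

$$
P_X(\{x_1\}) = P_X(\{x_2\}) = \frac{1}{2}, \qquad f_{Y|X=x_i} = N(y; \mu_i, \sigma_i^2), \quad i = 1, 2,
$$

in which case the definition above evaluates to

$$
H(P_Z) = \ln 2 + \frac{1}{2} \cdot \frac{1}{2}\ln\big(2\pi e \sigma_1^2\big) + \frac{1}{2} \cdot \frac{1}{2}\ln\big(2\pi e \sigma_2^2\big),
$$

i.e., the entropy of a fair binary choice plus the equally weighted differential entropies of the two conditional distributions.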
In particular, for the case of independent discrete and continuous components, the conditional distributions of the continuous component reduce to its marginal distribution, such that H(P_Z) = H(P_X) + H(P_Y). More generally, if the components X and Y are independent, and Y may be either continuous or discrete (rendering Z a discrete random vector), we write

$$
H(P_{X \otimes Y}) = H(P_X) + H(P_Y),
$$

where H(P_Y) denotes the entropy of the distribution of the continuous or discrete component Y.

Variational Bayes
To express the VB scenario of the main text in probability-theoretic terms, we set d_1 := n + p + k (n, p, k ∈ N), consider a probability space (Ω, F, P) and the measurable space (R^{d_1}, B^{d_1}), and define the continuous random vector

$$
(Y, B, L) : \Omega \to R^{d_1},
$$

which induces a probability measure P_{Y,B,L} on B^{d_1}. We thus obtain the probability space (R^{d_1}, B^{d_1}, P_{Y,B,L}). Because (Y, B, L) is a continuous random vector, P_{Y,B,L} can be represented by a probability density function f_{Y,B,L} : R^{d_1} → R_{≥0} with respect to the Lebesgue measure µ_l^{d_1} on B^{d_1}. Note that f_{Y,B,L} is denoted by p(y, β, λ) in the main text.
To define the variational Bayes free energy function F_VB, we first consider a random vector

$$
(\tilde B, \tilde L) : \Omega \to R^{p+k}
$$

with induced distribution $Q_{(\tilde B, \tilde L)} = Q_{\tilde B} \otimes Q_{\tilde L}$, where $Q_{\tilde B}$ and $Q_{\tilde L}$ denote the marginal distributions on B^p and B^k induced by B̃ and L̃, respectively (e.g. Fristedt and Gray, 1997, Chapter 9). We hence write $Q_{\tilde B \otimes \tilde L}$ for $Q_{(\tilde B, \tilde L)}$, and we denote the set of all such distributions by $\mathcal{Q}$. For a fixed y ∈ R^n we then define

$$
F_{VB} : \mathcal{Q} \to R, \quad F_{VB}(Q_{\tilde B \otimes \tilde L}) := \int \ln f_{Y,B,L}(y, \beta, \lambda)\, dQ_{\tilde B \otimes \tilde L}(\beta, \lambda) + H(Q_{\tilde B \otimes \tilde L}).
$$

Here the symbol H denotes the entropy of the distribution $Q_{\tilde B \otimes \tilde L}$, the evaluation of which depends on the type of random vectors B̃ and L̃, as discussed in Section S2.2.

Variational maximum likelihood
In analogy to the above, we set d_2 := n + p (n, p ∈ N), consider a probability space (Ω, F, P) and the measurable space (R^{d_2}, B^{d_2}), and define the continuous random vector

$$
(Y, B) : \Omega \to R^{d_2},
$$

which induces the probability measure P_{Y,B} on B^{d_2}. We thus obtain the probability space (R^{d_2}, B^{d_2}, P_{Y,B}). As in the main text, we assume that P_{Y,B} is represented by a parameter-dependent probability density function f^λ_{Y,B} : R^{d_2} → R_{≥0} with respect to the Lebesgue measure on B^{d_2}. Note that f^λ_{Y,B} is denoted by p_λ(y, β) in the main text.
To define the variational maximum likelihood free energy function F_VML, we first consider a random vector B̃ which induces a probability measure $Q_{\tilde B}$ on the measurable space (R^p, B^p). We denote the set of all such induced probability measures by $\mathcal{Q}_{\tilde B}$. For a fixed y ∈ R^n we then define

$$
F_{VML} : \mathcal{Q}_{\tilde B} \times R^k \to R, \quad F_{VML}(Q_{\tilde B}, \lambda) := \int \ln f^{\lambda}_{Y,B}(y, \beta)\, dQ_{\tilde B}(\beta) + H(Q_{\tilde B}),
$$

where as above the symbol H denotes the entropy of the distribution $Q_{\tilde B}$, the evaluation of which depends on the type of random vector B̃ as discussed in Section S2.2.

Maximum likelihood
Finally, to express the maximum likelihood scenario in probability-theoretic terms, we consider a probability space (Ω, F, P) and the measurable space (R^n, B^n), and define the continuous random vector

$$
Y : \Omega \to R^n,
$$

which induces the probability measure P_Y on B^n and hence the probability space (R^n, B^n, P_Y). As in the main text, we assume that P_Y is represented by a parameter-dependent probability density function f^{β,λ}_Y : R^n → R_{≥0} with respect to the Lebesgue measure on B^n. Note that f^{β,λ}_Y is denoted by p_{β,λ}(y) in the main text. We define the maximum likelihood free energy function F_ML as the standard log-likelihood function of the maximum likelihood scenario, i.e. for fixed y ∈ R^n, we set

$$
F_{ML} : R^p \times R^k \to R, \quad F_{ML}(\beta, \lambda) := \ln f^{\beta,\lambda}_Y(y).
$$

Note that, under the mean-field approximation, evaluating F_VB for variational distributions whose components are constant random vectors recovers the functional forms of F_VML and F_ML: the integral with respect to a Dirac measure evaluates the log joint density at the respective parameter point (Section S2.1), and the corresponding entropy contributions vanish (Section S2.2).

[Figure: covariance component parameter estimates, shown as indicated on the right, without and with a correction for outliers as indicated.]
For implementational details, please see vbg_2.m.

S4 SPM12 ReML estimation
In the parameter recovery assessment of our VB, VML, ReML, and ML implementation, we found that the covariance component parameter estimation fails in a significant number of cases. To investigate whether this behaviour is specific to our implementation, we performed the same analyses using the covariance component parameter estimation functions spm_reml_sc.m (Version 4805) and spm_reml.m (Version 5223) of the SPM12 distribution. These functions perform a Fisher scoring ascent on the ReML objective function to identify maximum-a-posteriori covariance component parameter estimates and are best documented in Friston et al. (2002). The function spm_reml_sc.m uses weakly informative log-normal priors to ensure the positivity of the covariance component parameter estimates, while the spm_reml.m function, which is called by SPM12's central spm_spm.m function, does not.
We visualize the results in Figure S2. The panel columns of this figure refer to the two covariance component parameter estimates and the panel rows to the different SPM12 functions. In the first row, we visualize the cumulative averages and variances of the respective parameter estimates based on the spm_reml_sc.m function without the removal of outliers. The performance for λ_1 is acceptable, but for the estimation of λ_2, outliers from approximately the 10th simulation onwards bias the cumulative average significantly away from the true, but unknown, parameter value and strongly amplify the cumulative variance. This is similar to the behaviour we detected in our implementation, which led us to remove such outliers automatically (Grubbs, 1969). The second row of Figure S2 depicts the parameter recovery performance for spm_reml_sc.m after removal of approximately 15% of outliers.
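The outlier removal step can be implemented along the following lines. This is a minimal Matlab sketch of a two-sided Grubbs test (Grubbs, 1969) using the standard critical value formula, not the exact routine used in our analyses:

```matlab
function idx = grubbs_outlier(x, alpha)
% Minimal sketch of a two-sided Grubbs outlier test (Grubbs, 1969). Returns
% the index of the detected outlier in x, or [] if none is found at
% significance level alpha. Requires the Statistics Toolbox function tinv.
n = numel(x);
[G, idx] = max(abs(x - mean(x))/std(x));       % Grubbs test statistic
t2 = tinv(alpha/(2*n), n-2)^2;                 % squared critical t-value
Gcrit = ((n-1)/sqrt(n))*sqrt(t2/(n-2+t2));     % critical value for G
if G < Gcrit, idx = []; end                    % no outlier detected
end
```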
This results in performance similar to that of our implementation. Finally, the last row of Figure S2 depicts the parameter recovery performance for the spm_reml.m function, which, in contrast to spm_reml_sc.m, can return negative covariance component parameter estimates. In summary, we conclude that the numerical optimization problems that we encountered in the estimation of covariance components based on our implementation of the VB, VML, ReML, and ML estimation techniques are not an uncommon phenomenon in the analysis of neuroimaging data.

S5 Model recovery free energy contributions
To understand the observed pattern of average free energies in Figure 7 in further detail, we tabulated the constituent terms of each free energy function (Table S1) and visualize the average term contributions to the overall average free energy in Figure S3. We omit from visualization the first term, which is identical for all free energy functions and evaluates to T1 = −367.58. Of the remaining terms, the largest contributions are provided by T3 and T2, reflecting the residual sum of squares and the log determinant of the estimated data covariance matrix, respectively (Figure S3A). The remaining terms T4-T17, as far as they exist for each free energy function, make smaller contributions (Figure S3B). Notably, the residual sum of squares is virtually identical over all pairings of data generating and data analysis models. This reflects the fact that the two-regressor model MA2 can readily capture the data pattern of the single-regressor model MG1 by estimating β_2 to be approximately zero. The average free energy differences for the two data analysis models in the case of MG2 thus appear to be primarily accounted for by the different contributions of T2.
It is likely that these differences result from the erroneous allocation of data variance to the covariance components, i.e., from covariance component overestimation, under the more flexible data analysis model. A further contribution is made by the model complexity terms T4-T16 for VB, VML, and ReML, as evident from the more negative sums (Σ) of these terms. In summary, for the current data generating and data analysis model comparison, both covariance component overestimation and the free energy model complexity terms T4-T16 appear to contribute to the identifiability of the true, but unknown, model structure.

Table S1 (recoverable excerpt; the first two of the tabulated terms):

Term   VB                    VML                ReML               ML
T1     −(n/2) ln 2π          −(n/2) ln 2π       −(n/2) ln 2π       −(n/2) ln 2π
T2     −(1/2) ln |V_{m_λ}|   −(1/2) ln |V_λ|    −(1/2) ln |V_λ|    −(1/2) ln |V_λ|

Figure S3. Average free energy term contributions, decomposed according to the terms tabulated in Table S1. Panel A displays the largest contributions, afforded by terms T2 and T3; Panel B displays the remaining term contributions and the sum (Σ) of these remaining contributions. Note the difference in scale between Panels A and B. Panel rows refer to the four estimation techniques.
In each subpanel, the left two bar groups refer to data generated by MG1 analyzed with data analysis models MA1 and MA2, and the right two bar groups refer to data generated by MG2 analyzed with MA1 and MA2. For visual comparison, thin grey lines corresponding to the values obtained under the MG1/MA1 combination are included. For implementational details, please see vbg_3.m.