A Review of R-packages for Random-Intercept Probit Regression in Small Clusters

Josephy, Haeike; Loeys, Tom; Rosseel, Yves

doi:10.3389/fams.2016.00018

ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 13 October 2016

Sec. Quantitative Psychology and Measurement

Volume 2 - 2016 | https://doi.org/10.3389/fams.2016.00018

A Review of R-packages for Random-Intercept Probit Regression in Small Clusters

Updated

A correction has been applied to this article in:

Corrigendum: A Review of R-packages for Random-Intercept Probit Regression in Small Clusters
1. Read correction

Haeike Josephy^*

Tom Loeys

Yves Rosseel

Department of Data Analysis, Faculty of Psychology and Educational Sciences, Ghent University, Ghent, Belgium

Generalized Linear Mixed Models (GLMMs) are widely used to model clustered categorical outcomes. To tackle the intractable integration over the random effects distributions, several approximation approaches have been developed for likelihood-based inference. As these seldom yield satisfactory results when analyzing binary outcomes from small clusters, estimation within the Structural Equation Modeling (SEM) framework is proposed as an alternative. We compare the performance of R-packages for random-intercept probit regression relying on: the Laplace approximation, adaptive Gaussian quadrature (AGQ), Penalized Quasi-Likelihood (PQL), an MCMC-implementation, and integrated nested Laplace approximation within the GLMM-framework, and a robust diagonally weighted least squares estimation within the SEM-framework. In terms of bias for the fixed and random effect estimators, SEM usually performs best for cluster size two, while AGQ prevails in terms of precision (mainly because of SEM's robust standard errors). As the cluster size increases, however, AGQ becomes the best choice for both bias and precision.

1. Introduction

In behavioral and social sciences, researchers are frequently confronted with clustered or correlated data structures. Such hierarchical data sets for example arise from educational studies, in which students are measured within classrooms, or from longitudinal studies, in which measurements are repeatedly taken within individuals. In these examples, two levels can be distinguished within the data: measurements or level-1 units (e.g., students or time points), and clusters or level-2 units (e.g., classes or individuals). These lower level units are correlated, as outcome measures arising from students with the same teacher, or measurements within an individual, will be more alike than data arising from students with different teachers, or measurements from different individuals. As such, an analysis that ignores these dependencies may yield underestimated standard errors, while inappropriate aggregation across levels may result in biased coefficients [1, 2].

Over the course of decades, several frameworks that can deal with such lower-level correlation have been developed. One such framework entails mixed effect models, which model both the ordinary regression parameters common to all clusters (i.e., the fixed effects), as well as any cluster-specific parameters (i.e., the random effects). Using a parametric approach, two different types can be distinguished: Linear Mixed Models (LMMs) when the outcome is normally distributed, and Generalized Linear Mixed Models (GLMMs) when it is not. A second framework that allows the analysis of multilevel outcomes consists of Structural Equation Models (SEM). Structural Equation Models can be split up into two main classes: “classic” SEM, which is restricted to balanced data, and multilevel SEM, which is able to deal with unbalanced data structures by relying on likelihood-based or Bayesian approaches. Generally, SEM supersects its GLMM counterpart, as the former is able to additionally include latent measures (and measurement error) and assess mediation, in one big model. Discounting these two assets, however, recent literature proves that SEM is completely equivalent to its GLMM counterpart when considering balanced data (e.g., when considering equal cluster sizes in a random intercept model) [3–5].

As clustered Gaussian outcomes have already been discussed thoroughly in the LMM and SEM literature [4–10], we will focus on GLMM- and SEM-methods for non-normal outcome data. More specifically, we will target binary data from small clusters, with a particular focus on clusters of size two, as such settings have proven difficult for the available GLMM methodologies [11, 12]. Clusters of size two are frequently encountered in practice, e.g., when studying dyads [13], in ophthalmology data [14], in twin studies [15], or when analyzing measurements from a 2-period - 2-treatment crossover study [16].

Focusing on the two aforementioned frameworks, current literature on the analysis of clustered binary outcomes reveals two major limitations: clusters of size two were either not considered [17–20], or they were, but limited to only one of both frameworks [21–24]. Here, we compare several estimation procedures within both GLMM- and SEM-frameworks for modeling this type of data, by considering the performance of relevant R-packages. By limiting our comparison to implementations from the statistical environment R (version 3.2.3., [25]), we rely on estimation techniques that are easily accessible to all practitioners (this software is freely available, while at the same time enjoying a wide range of open-source packages). Additionally, we choose to only focus on R-packages which stand on themselves and are not dependent on external software. We do, however, check several of the R-based implementations against others such as implementations in SAS^® software (version 9.4 [26])¹, the MPLUS^® program (version 7 [27]) or the JAGS implementation (version 4.1.0. [28]), as to verify the independence of conclusions on the software used.

In the following sections, we first introduce a motivating example. After this we elaborate on the GLMM and SEM frameworks in general, so that the various estimation methods capable of analyzing the example can be enumerated. Next, we illustrate these methods on our example data. To facilitate the practitioner's decision on which method is most appropriate in which setting, we subsequently conduct a simulation study. Based on our findings we provide recommendations, and end with a discussion.

2. An Example

As a motivating example, we consider data from a randomized study executed by Vandeweghe et al., in preparation in two Flemish nursery schools. As healthy eating habits are important to achieve healthy growth and development in young children, Vandeweghe et al., in preparation focus on strategies to improve the liking of vegetables in preschool children: a child given a tangible or non-tangible reward after tasting should be motivated to taste again. To this end, Vandeweghe et al., in preparation incorporated four possible intervention plans: encouragement toward eating chicory, an active reward after consumption, repeated exposure of the vegetable, and a control group. The binary variable “vegetable liking” (like/ok vs. dislike) was measured during three phases: once during a pre-test (to test their inherent liking of chicory), once during a post-test, and once during a follow-up test. When we only consider the pre- and post-test, we end up with two measurements for each child, while additionally including the follow-up measurements will increase this number to three. So irrespective of whether or not the follow-up measurement is included, the authors end up with a small cluster size.

For illustrative purposes, we will only consider the results from a single school, so that the data structure simplifies to a simple two-level setting where a binary outcome is assessed repeatedly within within each child. Additionally, we will only contrast the “encouragement” vs. the “control” group, as to simplify interpretation and results. The sample size of this reduced data set consists of 37 children (only retaining the complete cases), of which 21 were assigned to the control group and 16 to the encouragement group.

To test whether encouragement increases the liking of chicory, we consider the following random-intercept probit regression model:

\begin{matrix} P (y_{i j} = 1 ∣ x_{i j}, b_{j}) = Φ (β_{0} + β_{1} x_{i j} + b_{j}) & (1) \end{matrix}

with index i referring to the measurement moment (i = 0, 1 or 2 for pre-, post- and follow-up test, respectively), index j to the individual (j = 1, …, 37), and with Φ representing the cumulative normal distribution. Additionally, a random intercept b_j, which is assumed to follow a normal distribution, is included in model (1) to capture the correlation between measures taken from the same toddlers. In this model, the outcome variable Y_ij represents Liking (Liking equals zero when child j dislikes the vegetable at time i and one it is liked/tolerated), while the predictor x_ij represents Encouragement (x_ij equals one when child j is encouraged at time i, and zero when it is not). To capture the effect of Encouragement within a single parameter, we have opted to model the intervention as a time-dependent covariate, rather than a between-subject effect interacting with time. This assumption is reasonable here, given the absence of group differences at the pre-test, the non-existence of a time effect in the control group, and a similar effect of Encouragement during the post-test and follow-up (see Figure 1).

FIGURE 1

Figure 1. Percentages of vegetable liking in 37 preschool children, for the tree measurement moments (pre-test, post-test, and follow-up) and two reward systems (control vs. encouragement).

With model (1) defined, the research question of whether or not a reward system will increase the liking of chicory will amount to testing the null hypothesis H₀:β₁ = 0. When this null hypothesis is rejected, we will conclude that the reward system significantly increases (when β > 0) the probability of liking the vegetable. But how do we estimate and test the fixed effects and random intercept variance? Since there are myriad options and recommendations in current literature, and some of these may not yield satisfactory results for binary outcomes in such small clusters, we will introduce and compare several possibilities. As mentioned in the introduction, these estimation methods stem from both the GLMM- and SEM-frameworks; to this end, the next section provides an introduction of both frameworks, a short note on their equivalence, and an explanation of the difficulties that accompany marginalizing the GLMM-likelihood function over the random effects distribution.

3. Methods

3.1. Generalized Linear Mixed Models

Generalized linear mixed models (GLMMs) are basically extensions of Generalized Linear Models (GLMs) [29], which allow for correlated observations through the inclusion of random effects. Such effects can be interpreted as unobserved heterogeneity at the upper level, consequently inducing dependence among lower-level units from the same cluster.

Let x_ij and y_ij denote the ith measurement from cluster j, for the predictor and the binary outcome respectively (where i = 1, .., I and j = 1, …, J). Note that since we primarily focus on clusters of size two, we will set I to 2. Moreover, as I = 2 limits the identification of random effects, we will consider GLMMs with a random intercept only. In a fully parametric framework, this particular GLMM is typically formulated as:

\begin{matrix} E (Y_{i j} | x_{i j}, b_{j}) = g^{- 1} (β_{0} + β_{1} x_{i j} + b_{j}) with b_{j} ~ N (0, τ) & (2) \end{matrix}

where g⁻¹(·) represents a known inverse link function, β₀ represents the intercept, β₁ the effect of the predictor x_ij, and b_j the cluster-specific random intercept. In this paper, we only consider probit regression models, where the standard normal cumulative distribution Φ(·) is defined as the inverse link function g⁻¹(·) [or equivalently the link function g(·) is defined as probit(·)]. Our reasoning behind this is that probit-regression applies to all estimation procedures we investigate, in contrast to the logit link. Converting Equation (2) to a random intercept probit-regression model yields us:

\begin{matrix} P (y_{i j} = 1 ∣ x_{i j}, b_{j}) = Φ (β_{0} + β_{1} x_{i j} + b_{j}) & (3) \end{matrix}

In order to obtain estimates for β₀, β₁ and τ, the marginal likelihood function is typically maximized. For a random-intercept GLMM, this function is obtained by integrating out the cluster-specific random effect, and can be written as:

\begin{matrix} l (β, τ | y_{i j}) = \prod_{j = 1}^{J} \int_{- \infty}^{+ \infty} \prod_{i = 1}^{I} f (y_{i j} | β, b_{j}) ϕ (b_{j} | τ) d b_{j} & (4) \end{matrix}

where f denotes the density function of the outcomes and ϕ the density of the random intercept (which is assumed to be normal here).

Unfortunately, statistical inference based on maximizing Equation (4) is hampered, because integrating out the random effects from the joint density of responses and random effects is, except for a few cases, analytically intractable. To tackle this, several techniques have been proposed, which can be divided into two main classes: likelihood-based methods and Bayesian approaches.

3.1.1. Estimation Through Likelihood-Based Approximation Methods

One way to tackle the intractability of integrating out the random effects of the GLMM likelihood function, is to either approximate the integrand or to approximate the integral itself. We briefly introduce three such methods below, and refer the interested reader to Tuerlinckx et al. [30] for more details.

Technically speaking, the Laplace approximation [31] approximates the integrand by a quadratic Taylor expansion. This results in a closed-form expression of the marginal likelihood, which can be maximized to obtain the maximum likelihood estimates of the fixed effects and random effect variances. In R, the implementation based on this approximation is available within the function glmer, from the package lme4 [32].

The Penalized Quasi-Likelihood method (PQL) [11, 33, 34] also approximates the integrand; more intuitively put, PQL approximates the GLMM with a linear mixed model. This is achieved by considering a Taylor expansion of the response function and by subsequently rewriting this expression in terms of an adjusted dependent variable on which estimation procedures for LMM can be implemented. Consequently, the algorithm cycles between parameter estimation by linear mixed modeling, and updating the adjusted dependent variable until convergence. This approach can be implemented using the function glmmPQL from the R-package MASS [35].

Finally, a tractable marginal likelihood can also be obtained by approximating the integral itself with a finite sum. In regular Gauss-Hermite (GH) Quadrature (e.g., [36]), this summation occurs over a fixed set of nodes, while Adaptive Gaussian Quadrature (AGQ) [37] uses a different set of nodes for each cluster. As such, when applying AGQ, fewer nodes are necessary to achieve equal accuracy as compared to the regular GH quadrature. AGQ estimation in R is also possible within the glmer function from lme4.

The detailed R-code on how to implement these three likelihood-based methods for a binary multilevel probit-model, can be found in the Supplementary Material (see Appendix - Likelihood-based methods). To check the R-implementation of AGQ against other software, we use the NLMIXED procedure within SAS^® [26].

3.1.2. Estimation Through Bayesian Methods

A second strategy that tackles the intractability of the GLMM likelihood function, pursues a Bayesian approach where Markov Chain Monte Carlo (MCMC) methods are used to obtain a posterior distribution of the parameters. MCMC methods simulate the likelihood rather than computing it, by calculating the sample average of independently simulated realizations of the integrand. As such, MCMC is thought to provide a more robust approach to marginalizing the random effects [18, 38].

In R, the MCMCglmm function from the package MCMCglmm [39] is available for such an approach. Technically, latent variables are updated in block by means of the Metropolis-Hastings algorithm [40–42], while the fixed parameters are Gibbs sampled within such a single block [43].

MCMC methods are known to be computationally intensive and sometimes have a hard time in reaching convergence. To this end, hybrid models based on an Integrated Nested Laplace Approximation (INLA) of the posterior marginals for latent Gaussian models [44] were proposed. In short, the INLA approach provides fast Bayesian inference by using accurate Laplace approximations for the marginal posterior density of the hyperparameter τ, and for the full conditional posterior marginal densities of the fixed and random effects. The final posterior marginals of the model parameters can then be computed through numerical integration, where the integration points are defaultly obtained by estimating the curvature of the approximation for the marginal posterior of the hyperparameter density [44]. Not surprisingly, these hybrids have shown a steep decline in the computational burden of MCMC algorithms, while at the same time converging more easily. In R, such an approach is implemented in the function inla from the package R-inla.

The detailed R-code of both implementations, as well as their prior specifications, can be found in the Supplementary Material (see Appendix - Bayesian methods). To check the R-based MCMC-implementation against other software, we rely on the he JAGS program [28] through the use of the R-package rjags [45]. It has been suggested by Betancourt and Girolami [46] that a non-centered parameterization of the hierarchal model works best when data are sparse, while a centered parameterization prevails when the data strongly identifies the parameters. However, we observed quite similar results stemming from the two parameterizations in our settings (results not shown).

3.2. Structural Equation Models

Although at first sight GLMM and SEM may seem like two completely different modeling frameworks, it is now well established that SEM can also be relied on to model balanced multilevel data structures. For an excellent overview of SEM, we refer the interested reader to Skrondal and Rabe-Hesketh [47]. In order to account for clustered observations, SEM lets its latent factors represent the random effects from their respective multilevel models [48, 49]. This results in a “conventional” SEM which is analytically equivalent to the corresponding multilevel model, under a broad set of conditions [4]; we illustrate this for model (2).

SEM consists of two modeling parts: a measurement model and a structural part [47]. The former defines unobserved variables in terms of observed variables measured with error, so that the latent variables can be interpreted as the “true” underlying variables (which might be correlated). The structural model on the other hand, links the different latent variables together. When focusing on random intercept models (read: with only one latent variable) with an explanatory variable in clusters of size two, both modeling-parts can be written as:

\begin{matrix} \begin{array}{l} y_{j} = ν + Λ η_{j} + K x_{j} + ϵ_{j} \\ η_{j} = ζ_{j} \end{array} & (5) \end{matrix}

where y_j represents the responses within cluster j, ν = (ν ν) the vector of intercepts, η_j a latent variable with its matrix of factor loadings Λ = (1 1)^T, x_j = (x_1j x_2j)^T represents the explanatory variable, with K its matrix of regression coefficients, and ϵ_j = (ϵ_1j ϵ_2j)^T the vector of normally distributed measurement errors. In the structural part of the model, ζ_j represents a random disturbance term ~ N(0, τ). Note that in accordance to Equation (2), we assume the effect of x to be fixed within- as well as between clusters. Because of this, K reduces to $(\begin{matrix} k & 0 \\ 0 & k \end{matrix})$ . Alternatively, we can write the above equations in reduced form, resulting in:

\begin{matrix} \begin{array}{l} y_{j} = ν + K x_{j} + Λ ζ_{j} + ϵ_{j} \\ = ν + K x_{j} + ζ_{j} + ϵ_{j} \end{array} & (6) \end{matrix}

where ζ_j = (ζ_j ζ_j)^T.

Traditionally, estimation methods in SEM are based on the assumption that the observed responses are measured on a continuous scale. In order to reconcile SEM with binary outcomes, the Latent Response Variable approach was introduced, where a dichotomous Y is considered a crude approximation of an underlying continuous variable Y*. Y* is not directly observed (hence a latent response variable), and is written in terms of a linear predictor. When we separate the two observations within each cluster to eliminate matrix notations, we obtain:

\begin{matrix} {\begin{array}{l} y_{1 j}^{*} & = ν + k x_{1 j} + ζ_{j} + ϵ_{1 j} \\ y_{2 j}^{*} & = ν + k x_{2 j} + ζ_{j} + ϵ_{2 j} \end{array} & (7) \end{matrix}

where ϵ_1j and ϵ_2j are i.d.d. residuals of the latent response variables ~ N(0, θ). Because Y* exhibits an arbitrary mean and variance, a link between Y and Y* needs to be established through variance constraints. Since the variance of Y* conditional on x_ij is τ + θ, there are two possible ways to constrain this variance [50]. First, Generalized Linear Models standardly fix the residual variance θ to one. In contrast to this theta parameterization, identification can also be achieved by standardizing the latent variable Y* itself: the delta parameterization fixes the the sum of τ and θ to one. This parameterization is traditionally used in the SEM-literature.

The relationship between the binary and latent continuous variable is then: Y = 1⇔Y* > κ. Fixing the threshold κ at 0 (for model identifiability, either the threshold or the intercept in Equation (7) needs to be constrained), and assuming that $Y_{i j}^{*} ~ N (0, 1)$ (i.e., making use of the delta parametrization so that ζ_j ~ N(0, τ_δ) and ϵ_ij ~ N(0, 1 − τ_δ)), it follows that:

\begin{matrix} \begin{array}{l} E [Y_{i j} | x_{i j}, ζ_{j}] = P (ν + k x_{i j} + ζ_{j} + ϵ_{i j} > 0 | x_{i j}, ζ_{j}) \\ = P (ϵ_{i j} < ν + k x_{i j} + ζ_{j} | x_{i j}, ζ_{j}) \\ = P (\frac{ϵ_{i j}}{\sqrt{1 - τ_{δ}}} < \frac{ν + k x_{i j} + ζ_{j}}{\sqrt{1 - τ_{δ}}} | x_{i j}, ζ_{j}) \\ = Φ (\frac{ν}{\sqrt{1 - τ_{δ}}} + \frac{k x_{i j}}{\sqrt{1 - τ_{δ}}} + \frac{ζ_{j}}{\sqrt{1 - τ_{δ}}}) \end{array} & (8) \end{matrix}

which reduces to the random intercept probit-model from Equation (3), where $\frac{ν}{\sqrt{1 - τ_{δ}}}$ , $\frac{k}{\sqrt{1 - τ_{δ}}}$ and $\frac{ζ_{j}}{\sqrt{1 - τ_{δ}}}$ are equivalent to β₀, β₁, and b_j, respectively.

3.2.1. Estimation in SEM

Within the SEM-framework, there are two common estimation approaches for modeling binary outcomes: maximum likelihood (ML) estimation and weighted least squares (WLS) [47]. In contrast to WLS, ML estimation for binary outcomes is not widely available in SEM software. Being a “full information” method, ML is more regularly employed in item response theory [51]. In contrast, as WLS-based methods adopt a multiple-step estimation procedure in which only first- and second-order information from the data is used, they are referred to as a “limited information” approach (see [52] for a review). In SEM, WLS is employed to differentially weigh the residuals resulting from the observed vs. the model-implied sample statistics by their full asymptotic covariance matrix W.

Since WLS requires extremely large samples for accurate estimation of the weight matrix W, more contemporary approaches were developed to improve small sample performance. One such version entails diagonally weighted least squares (DWLS), which utilizes a diagonal weight matrix instead of a full one [53, 54] (note that statistical inference in DWLS still relies on the full weight matrix, even when a diagonal matrix is used during estimation). Following Muthén et al. [54], who have shown DWLS to be statistically and computationally efficient in large samples, more recent studies have proven that DWLS is also more stable than WLS in small samples [51, 55, 56]. Note that WLS and DWLS estimation is limited to probit-regression models and therefore exclude logit-models from our current review study.

SEM relying on DWLS can be implemented through the sem- function from the package lavaan [57]. The detailed R-code of this implementation can be found in the Supplementary Material (see Appendix - SEM methods). To check the lavaan package against other implementations, we will verify our results with DWLS estimation in MPLUS^® software [27] through the use of the R-package MplusAutomation [58].

4. Analysis of the Example

We illustrate the above six approaches by applying them to our example. To assess the impact of cluster size, we consider the fit of model (1) when solely looking at the pre- and post-test (i.e., cluster size two) vs. all three time points together (i.e., cluster size three). The estimated parameters for the fixed effects (and their standard errors), alongside the estimated random intercept variance for each of the estimation approaches are summarized in Table 1.

TABLE 1

Table 1. The estimates (and (robust) standard errors) from the six approaches for the intercept β₀, the slope parameter β₁ and the random intercept variance τ.

We observe that for both cluster sizes all methods perform rather similar in their estimation of β₀, except for a higher estimate produced by MCMC. The estimates for β₁ show more variation, especially within clusters of size two (again with an outlying MCMC-estimate). For the random intercept variance τ, we see that the MCMC estimate is somewhat larger than the others, while the estimates from the Laplace approximation and the hybrid approach are at the lower end of this spectrum. In terms of computing times, most approaches performed equivalently, with the Laplace approximation providing the fastest analysis, closely followed by AGQ, SEM and PQL. The MCMC approach took about ten times as long as the aforementioned approaches, while the hybrid approach only increased the computing time threefold.

Now the question becomes: which of these estimation methods is most reliable here? In order to find out, we conduct an extensive simulation study in the next section.

5. Simulation Study

In our simulation study we compare the performance of the six above-described estimation methods in different settings. For this, random binary outcome variables from small clusters are generated under a random intercept probit-regression model (see Supplemantary Material - Data Generating Mechanism). More specifically, we assume an underlying latent variable $Y_{i j}^{*}$ , such that Y_ij = 1 if $Y_{i j}^{*} > 0$ :

\begin{matrix} P (Y_{i j} = 1 | x_{i j}, b_{j}) = P (Y_{i j}^{*} > 0 | x_{i j}, b_{j}) & (9) \end{matrix}

\begin{matrix} \begin{array}{l} = P (β_{0} + β_{1} x_{i j} + b_{j} + ϵ_{i j} > 0), \\ b_{j} ~ N (0, τ) and ϵ_{i j} ~ N (0, 1) \end{array} & (10) \end{matrix}

First of all, we consider different cluster sizes: we will look at clusters of size two, three and five. Second, we also consider a different numbers of clusters. Since, Loeys et al. [59] reported that sample sizes in studies using the Actor-Partner- Interdependence-Model [60] within dyads typically ranged from 30 to 300 pairs, we consider sample sizes n of 25, 50, 100, and 300. Third, we also examine different intracluster correlations (icc) for the latent response variable. As the latent icc_l is defined as the proportion of between-group vs. total variance in Y* ( $i c c_{l} = \frac{V a r (b_{j})}{V a r (Y_{i j}^{*})} = \frac{τ}{τ + 1}$ ), a latent icc_l of 0.10, 0.30, and 0.50 corresponds to a random intercept variance of 0.11, 0.43, and 1.00, respectively. Fourth, we consider rare as well as more abundant outcomes, with an overall event rate of 10 and 50%, respectively. Since the marginal expected value of the outcome E(Y) equals $Φ (\frac{β_{0}}{\sqrt{1.25 + τ}})$ , an outcome prevalence of 50% implies that β₀ must be set to zero. Equivalently, when fixing β₀ to −1.50, −1.66, and −1.92 for a random intercept variance τ = 0.11, 0.43, and 1, respectively, an outcome prevalence of 10% is obtained². In all simulations, β₁ is fixed to 1. Finally, four types of covariates are compared: we consider a predictor that only varies between clusters, vs. one that varies within clusters; and a Gaussian distributed predictor ~ N(0, 0.25), vs. a zero-centered Bernoulli x with success rate 0.5.

In total, 2000 simulations are generated for the 3 × 4 × 3 × 2 × 4 combinations of clusters size (3), sample size (4), intracluster correlation (3), outcome prevalence (2) and type of predictor (4). The above-introduced methods are compared over these 288 settings in terms of convergence, relative bias, mean squared error (MSE) and coverage. The relative bias is defined as the averaged difference between the estimated (e.g., $\hat{β}$ ) and true parameter values (e.g., β), divided by the latter (so that the relative bias $= \frac{β - \hat{β}}{β}$ ); as such, a relative bias enclosing zero will indicate an accurate estimator. A relative bias measure was chosen over an absolute one, as the accuracy of some procedures tends to depend on the magnitude of the parameter values [63]. The MSE is estimated by summing the empirical variance and the squared bias of the estimates, simultaneously assessing bias and efficiency: the lower the MSE, the more accurate and precise the estimator. The coverage is defined as the proportion of the 95%-confidence intervals that encompass their true parameter value, where coverage rates nearing 95% represent nominal coverages of the intervals. For the likelihood-based and SEM approaches, Wald confidence intervals are used, while the Bayesian approaches rely on the quantile-based 95% posterior credible intervals. Note that coverage rates for τ are not provided, as not all estimation procedures provide this interval. Lastly, in order to conclude model convergence, several criteria must be met: first, whenever fixed effect estimates exceed an absolute value of ten, or the random effect estimate exceeds 25, the fit is classified as “no convergence.” We decided on this as parameters in a probit-regression exceeding an absolute value of five are extremely unlikely for the given covariate distribution and effect sizes. Secondly, convergence has also failed when a model fit does not yield estimators or standard errors. In addition, for MCMCglmm we specified that both chains must reach convergence as assessed by Geweke diagnostics; only when this statistic is smaller than two, convergence is concluded. To ensure a fair comparison between methods, we only present results for simulation runs in which all six methods converged.

6. Results

Below, we discuss the results of the simulation study for clusters of size two with a Gaussian predictor in detail.

6.1. Convergence

Generally, convergence improves as the number of clusters and the outcome prevalence increase, and as the icc_l decreases (see Figure 2). In contrast, convergence is rather unaffected by the level of the predictor, except for PQL which tends to show more convergence difficulties for a within-cluster x. The Laplace approximation also shows a slight decline in convergence for rare outcomes combined with a within-cluster predictor. Note that for 300 clusters most approaches reach 100% convergence, except for MCMC (as in Ten Have and Localio [21]) and at times the Laplace approximation. For rare outcomes in small samples (n = 25), however, the hybrid approach and SEM (see e.g., [51, 64]) often perform worse than MCMC. Overall, AGQ shows least difficulty in reaching convergence.

FIGURE 2

Figure 2. Model convergence of the six approaches, for different measurement levels of X (within- or between clusters), outcome prevalence (0.1 and 0.5), icc_l (0.1, 0.3, and 0.5), and sample size (25, 50, 100, and 300).

6.2. Relative Bias

First, for the fixed effect estimators we typically observe that the relative bias decreases as the number of clusters increases (see Figure 3). The Laplace approximation and PQL contradict this, however: for rare outcomes the relative bias tends to increase with n. Second, we see that an increase in the icc_l tends to shift the relative bias downwards. This implies an improvement in the performance of MCMC (in contrast to Ten Have and Localio [21]), but not of most other methods [65–68]. As such, we observe that MCMC performs worse than most methods, but that this difference attenuates as the icc_l increases. Third, the relative bias is generally smaller for a 0.5 outcome prevalence, compared to rare events; this is most clear for the hybrid approach, but is also visible in AQG (see [69]). For an outcome prevalence of 0.5, the bias in the β₀-estimators even becomes negligible for all methods. For β₁, however, the MCMC method actually performs worse in small samples when E(Y) = 0.5, compared to 0.1. Fourth, different measurement levels of the predictor do not much sway the bias, except for PQL; this method reveals slightly more bias for low event rates when the predictor is measured within- rather than between-clusters. Overall, SEM provides the least biased estimators for the fixed effects, closely followed by AGQ.

FIGURE 3

Figure 3. Relative bias in β₀ (left) and β₁ (right) for the six approaches, for different measurement levels of X (within- or between clusters), outcome prevalence (0.1 and 0.5), icc_l (0.1, 0.3, and 0.5), and sample size (25, 50, 100, and 300). These results stem from simulation runs where all methods converged.

For the variance of the random effect, better estimators are typically found in larger samples (see left part of Figure 4, also see Hox [68]). Similar to the fixed effect estimators, the Laplace approximation and PQL pose an exception to this rule, by inverting this relation for rare outcomes (see [70]). As such, the conclusions of Capanu et al. [20], stating that the hybrid approach outperforms the Laplace approximation by reducing bias in τ, do hold here, but only for large n. We also observe that as the icc_l decreases, bias in the estimates for τ increases in all methods. Finally, a slightly negative bias in the AGQ- and SEM-estimates for τ is observed when the outcome is rare and n small [2]. This negative bias, however, attenuates as the number of clusters is increased [70]. Overall, SEM yields the least biased estimators for the random intercept variance when the outcome prevalence is rare, while AGQ performs best when E(Y) = 0.5.

FIGURE 4

Figure 4. Relative bias (left) and MSE (right) of τ for the six approaches, for different measurement levels of X (within- or between clusters), outcome prevalences (0.1 and 0.5), icc_l (0.1, 0.3, and 0.5), and sample sizes (25, 50, 100, and 300). These results stem from simulation runs where all methods converged.

6.3. MSE

For both β₀ and β₁, the MSE is often higher for rare outcomes, compared to a 0.5 prevalence (see Figure 5). Additionally, the MSE drops as the sample size grows, and as the icc_l decreases. The Laplace estimator for β₀ again contradicts these trends: for rare events, the MSE increases with sample size and icc_l. As before, the measurement level of x does not much alter performance, except in PQL where a within-cluster predictor slightly increases the MSE. For both fixed effects, MCMC often yields the highest MSE when the prevalence equals 0.5, while the hybrid approach regularly performs worst for a prevalence of 0.1. In general, the Laplace approximation yields the lowest MSE when E(Y) = 0.5, but performs much worse when the outcome is rare. Overall, AGQ (closely followed by SEM) performs best in terms of MSE.

FIGURE 5

Figure 5. MSE of β₀ (left) and β₁ (right) for the six approaches, for different measurement levels of X (within- or between clusters), outcome prevalences (0.1 and 0.5), different icc_l (0.1, 0.3, and 0.5), and sample sizes (25, 50, 100, and 300). These results stem from simulation runs where all methods converged.

For the random intercept variance τ, we observe a decrease in MSE as the sample size increases, and as the icc_l decreases (see right part of Figure 4). The latter conclusion does not hold for MCMC as here the MSE tends to decrease with rising icc_l. Again, PQL performs slightly worse for a within-cluster predictor. In general, the Laplace approximation yields the lowest MSE for 0.5 prevalences, but performs worst when the outcome is rare. Overall, AGQ performs best in terms of MSE, better than SEM, especially in smaller samples.

6.4. Coverage

For both fixed effect estimators, coverage of their 95% confidence intervals is typically better when the outcome prevalence is 0.5 (see Figure 6). Also, an increasing icc_l usually worsens coverage, except for MCMC (where coverage improves with increasing icc [21]). The impact of the icc_l on coverage has also been observed by Zhang et al. [63], who found nominal coverages for AGQ and the Laplace approximation for low random intercept variances (i.e., low icc), but more liberal ones as τ increases (i.e., high icc). Generally, SEM and AGQ provide the best coverage rates [70], with SEM taking the upper hand for the coverage of β₀, and AGQ for β₁ with a low to medium icc.

FIGURE 6

Figure 6. Coverage of β0 (left) and β₁ (right) for the six approaches, for different measurement levels of X (within- or between clusters), outcome prevalences (0.1 and 0.5), icc_l (0.1, 0.3, and 0.5), and sample sizes (25, 50, 100, and 300). These results stem from simulation runs where all methods converged.

6.5. Summary of the Other Simulation Settings

Until now, we only discussed the results of the simulation study for clusters of size two with a Gaussian predictor. The results for other settings are available in the Supplementary Material and are briefly discussed in the next paragraphs.

When looking at a binary predictor instead of a Gaussian one, our conclusions remain more or less the same. One exception is that most methods experience a steep decline in convergence for smaller sample sizes, when the predictor is binary compared to continous. This is most apparent in SEM, where lower convergence rates are due to empty cell combinations of outcome and predictor. In SEM, this produces a warning, which we interpreted as an error (as in MPLUS), since such runs yield unreliable results.

As the cluster size increases from two to three or five, we observe a general increase in performance in all methods except SEM. This approach now no longer yields the lowest bias, with AGQ gradually taking over. As such, increasing cluster size favors AGQ in terms of precision, as well as in terms of relative bias.

6.6. MPLUS, JAGS, and SAS

MPLUS and lavaan performed quite similarly throughout our settings, although there were some minor differences (results shown in the Supplementary Material). While MPLUS version 7 slightly dominates in terms of convergence and coverage, lavaan takes the upper hand for the relative bias and the MSE. These differences are trivial, however, and most likely due to lavaan incorporating a slightly higher number of iterations in reaching convergence.

When comparing JAGS to MCMCglmm, we observe some important differences in performance; for most settings, JAGS 4.1.0. outperforms MCMCglmm, except when a small n is combined with a medium to large icc_l (see Supplementary Material). Note that although JAGS performs slightly better in most settings, its computing times are also significantly higher.

In contrast to Zhang et al. [63], who found a superior performance of SAS NLMIXED compared to R's glmer-function, we found that glmer performed equally well or even slightly better in terms of convergence rates, relative bias, and coverage (using SAS version 9.4). When the outcome prevalence is 0.5 and for some rare events settings, glmer also provided a slightly lower MSE.

7. Discussion

In this paper, we provided an overview of several R-packages based on different estimation techniques, as to fit random-intercept probit regression models. More specifically, we focused on techniques capable of modeling binary outcomes in small clusters. Additionally, we presented an extensive simulation study in which we assessed the impact of various data features on a number of performance criteria. In summary, we found that some of our results confirmed findings from previous studies, while others have (to the best of our knowledge) not been observed before:

Interestingly, both SEM and AGQ performed considerably well for paired data. Though both approaches disclosed some sensitivity to sample size, they manifested remarkable robustness when varying the icc, the event rate, and the measurement level of the predictor. As such, these methods can be considered the most stable over all settings in terms of relative bias, for the fixed effect regression coefficients as well as the random intercept variance. While AGQ performs slightly better than SEM in terms of convergence and MSE, SEM performs slightly better when considering the relative bias. As SEM relies on robust standard errors, it yields higher MSE's, but also provides robustness against model misspecification (which was not investigated here). For the coverage, we observed that SEM performs slightly better for β0, while AGQ tentatively gains the upper hand for β₁. As the cluster size increases, however, AGQ takes over and becomes most reliable in terms of bias and precision.

Since the Laplace approximation is known to be precise only for normally distributed data or for non-normal data in large clusters [30], we observed an expected poor performance of this approximation in our settings [23]. PQL also exhibits an inferior performance for low icc's and a low outcome prevalence, while additionally revealing disconcerting performance issues for a within-cluster measured predictor. Finally, the two Bayesian approaches performed below par in terms of most criteria considered.

Let us once again consider our motivating example with a within-cluster measured predictor, a sample size of 37, an outcome prevalence of 0.4, and a medium to large latent icc. When we apply our conclusions to these settings, we can state that SEM will yield the most trustworthy estimates when the cluster size is two, while AGQ will take over as a measurement is added. MCMC will yield the most biased estimates in both cases (as can be clearly seen in Table 1).

Several limitations can still be ascribed to this paper. First, we restricted our comparisons to estimation techniques available in R-packages. As such, several improvements regarding the estimation methods discussed, could not be explored. For example, while the glmmPQL function employed in this paper is based on Breslow and Clayton [11]'s PQL version, a second-order Taylor expansion [71] might provide a more precise linear approximation (this is referred to as PQL-2, in contrast to the first order version PQL-1). Be that as it may, not all the evidence speaks in favor of PQL-2: even though it yields less bias than PQL-1 when analyzing binary outcomes, Rodriguez [72] found that the estimates for both fixed and random effects were still attenuated for PQL-2. Furthermore, PQL-2 was found to be less efficient and somewhat less likely to converge [72]. Second, certain choices were made with respect to several estimation techniques, such as the number of quadrature points used in the AGQ-procedure. However, acting upon the recommendation of 8 nodes for each random effect [17], we argue that surpassing the ten quadrature points considered, would carry but little impact in our random intercept model. Also, the repercussions of our choices on prior specification in the Bayesian framework deserves a more thorough examination, as different priors may lead to somewhat different findings. Third, the performance results presented here may not be intrinsic to their respective estimation techniques, but instead due to decisions made during implementation. As we demonstrated for MCMCglmm when comparing it to JAGS, its disappointing performance is most likely due to a suboptimal implementation, and not an inherent treat of the MCMC estimation procedure. Fourth, some scholars [73] have recommended the evaluation of different estimation methods and their dependence on different data features, by applying ANOVA-models rather than graphical summaries. Treating the different settings (sample size, icc, level of the predictor, the event rate, and their two-way interactions) as factors, did not provide us much insight since almost all variables (as well as their interactions) were found to be highly significant. Fifth, in our simulation study we only considered complete data; in the presence of missing data, however, DWLS estimation in SEM will exclude clusters with one or more missing outcomes, resulting in a complete case analysis. This exclusion stands in contrast to maximum likelihood and Bayesian approaches from the GLMM-framework, as they consider all available outcomes when there is missingness present. Consequently, the GLMM-framework will not introduce any (additional) bias under the missing at random assumption, while DWLS-estimation requires the more stringent assumption of data missing completely at random. Sixth, we do not focus on measurement imprecision in this study and assume that all observed variables are measured without error. Of course, as Westfall and Yarkoni [74] recently pointed out, this rather optimistic view might pose inferential invalidity when this assumption fails. In light of this, it is important to note that SEM can deal with such measurement error, in contrast to GLMM-based approaches.

With the results, as well as the limitations of the current paper in mind, some potential angles for future research might be worth considering. As we explicitly focused on conditional models, we deliberately excluded marginal approaches such as Generalized Estimating Equations (GEE), because such a comparison is impeded by the fact that marginal and conditional effects differ for binary outcomes. Whereas, multilevel models allow for the separation of variability at different levels by modeling the cluster-specific expectation in terms of the explanatory variables, GEE only focuses on the respective marginal expectations. Previous research [75] has revealed excellent small sample performance of GEE in terms of bias, when analyzing binary data in clusters of size two. Also, it might we worth considering a pairwise maximum likelihood (PML) approach, as PML estimators have the desired properties of being normally distributed, asymptotically unbiased and consistent [76]. This estimation method breaks up the likelihood into little pieces and consequently maximizes a composite likelihood of weighted events. PML in R is currently unable to cope with predictors, but this will most likely be possible in the near future. And finally, as pointed out by one of the reviewers, Hamiltonian Monte Carlo (used in Stan software, Carpenter et al. [77]) may be a more efficient sampler compared to a Metropolis-Hastings (i.e., MCMCglmm) or a Gibbs sampler (i.e., JAGS). To this end, exploring the performance of the Stan software might prove worthwhile when further focusing on Bayesian analysis.

Author Contributions

HJ performed the analysis of the example, conducted the simulation studies, and wrote the paper. TL participated in the writing and reviewing of the paper, alongside checking the analyses and simulations. YR critically reviewed and proof-read the entire manuscript.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

This work was carried out using the Stevin Supercomputer Infrastructure at Ghent University provided by the VSC (Flemish Supercomputer Centre), funded by Ghent University, the Hercules Foundation and the Flemish Government - department EWI.

Supplementary Material

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fams.2016.00018

Footnotes

1. ^SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ^® indicates USA registration.

2. ^Note that the observed icc_o is dependent on the intercept β₀, the random intercept variance τ, and the latent icc_l through the following formula [61]: $i c c_{o} = \frac{Φ_{2} (\frac{β_{0}}{\sqrt{(τ + 1) (1 + 2 τ)}}, \frac{β_{0}}{\sqrt{τ + 1}}, i c c_{l}) - Φ_{1} {(\frac{β_{0}}{\sqrt{τ + 1}})}^{2}}{Φ_{1} (\frac{β_{0}}{\sqrt{τ + 1}}) (1 - Φ_{1} (\frac{β_{0}}{\sqrt{τ + 1}}))}$ . In this equation, Φ₁ represents the cumulative standard normal distribution, and Φ₂ the cumulative bivariate standard normal distribution with correlation icc_l. Since the outcome prevalence dictates the value of the intercept, each combination of icc_l and E(Y) provides different icc_o's; for rare outcomes, the observed icc_o are 0.06, 0.25, and 0.51, while for E(Y) = 0.5 they are 0.06, 0.19, and 0.33 (corresponding to latent icc_l's of 0.10, 0.30, and 0.50, respectively). As such, the observed icc_o's range from small to large, according to Hox [62]'s recommendations.

References

1. Snijders T, Bosker R. Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. Thousand Oaks, CA: Sage (1999).

2. Raudenbush S, Bryk A. Hierarchal Linear Models. Applications and Data Analysis Methods. 2nd Edn., Thousand Oaks, CA: Sage (2002).

3. Rovine MJ, Molenaar PC. A structural modeling approach to a multilevel random coefficients model. Multiv Behav Res. (2000) 35:55–88. doi: 10.1207/S15327906MBR3501_3

PubMed Abstract | CrossRef Full Text | Google Scholar

4. Curran PJ. Have multilevel models been structural equation models all along? Multiv Behav Res. (2003) 38:529–69. doi: 10.1207/s15327906mbr3804_5

PubMed Abstract | CrossRef Full Text | Google Scholar

5. Bauer DJ. Estimating multilevel linear models as structural equation models. J Educ Behav Stat. (2003) 28:135–67. doi: 10.3102/10769986028002135