Analyzing average and conditional effects with multigroup multilevel structural equation models

Conventionally, multilevel analysis of covariance (ML-ANCOVA) has been the recommended approach for analyzing treatment effects in quasi-experimental multilevel designs with treatment application at the cluster-level. In this paper, we introduce the generalized ML-ANCOVA with linear effect functions that identifies average and conditional treatment effects in the presence of treatment-covariate interactions. We show how the generalized ML-ANCOVA model can be estimated with multigroup multilevel structural equation models that offer considerable advantages compared to traditional ML-ANCOVA. The proposed model takes into account measurement error in the covariates, sampling error in contextual covariates, treatment-covariate interactions, and stochastic predictors. We illustrate the implementation of ML-ANCOVA with an example from educational effectiveness research where we estimate average and conditional effects of early transition to secondary schooling on reading comprehension.


INTRODUCTION
In the social sciences, analysis of covariance (ANCOVA) is one of the most important statistical techniques. It is used to analyze effects of an independent variable on an outcome variable controlling for a vector of covariates. In this article, we focus on the application of ANCOVA-like techniques in quasi-experimental multilevel designs with treatment implementation at the level of clusters. Multilevel structural equation models Mehta and Neale, 2005;Marsh et al., 2009;Lüdtke et al., 2011) are used to extend traditional ANCOVA to account for the clustered structure in multilevel designs. In order to estimate causal effects in such designs, it is important to control for all relevant covariates, including covariates at the individuallevel (e.g., student characteristics, intelligence, socio-economic status), contextual covariates (e.g., neighborhood-average socioeconomic status, school-average achievement) and true clusterlevel covariates (characteristics of the cluster, e.g., school resources, location). For extended discussions of causal effects in multilevel quasi-experiments, see Gitelman (2005), Hong and Raudenbush (2006), Sobel (2006), Hong and Raudenbush (2008), VanderWeele (2008), and Nagengast (2009).
We propose a multigroup multilevel structural equation modeling approach (MG-ML-SEM approach; see Muthén, 2002) that extends conventional multilevel ANCOVA in several ways. The particular strengths of the MG-ML-SEM approach are: (1) It allows for latent covariates at the individual-level and at the cluster-level to control for measurement error; (2) it takes into account sampling error in the aggregation of covariates to the cluster-level; (3) it naturally includes interactions between the cluster-level treatment variable and (latent) covariates at both levels and (4) all predictor variables are treated as stochastic rather than fixed predictors. The MG-ML-SEM approach is illustrated by an application from educational research in Germany (ELEMENT study, Lehmann and Lenkeit, 2008), where we estimate average and conditional effects of early transition to secondary schooling on reading achievement.
The paper is structured as follows: First, we introduce conventional multilevel ANCOVA for the analyses of quasi-experimental designs with treatment application at the cluster-level. Then, we discuss the shortcomings of this conventional approach. Next, we illustrate how the MG-ML-SEM approach can be used to overcome these shortcomings for estimating average and conditional treatment effects and compare several models of the MG-ML-SEM approach. In the discussion, we point out the assumptions required to estimate causal effects and discuss further research directions.

MULTILEVEL DESIGNS WITH TREATMENT APPLICATION AT THE CLUSTER-LEVEL
Multilevel designs with two or more treatment groups are typically differentiated along the dimensions of (a) the level of treatment application (at the individual-or the clusterlevel) and (b) the treatment assignment process (randomized or non-randomized assignment) (e.g., Plewis and Hurry, 1998;Ukoumunne et al., 1999). In the remainder, we will focus on a class of multilevel designs that are of particular importance for educational research: Designs in which whole clusters are treated and there are systematic differences between clusters. Such quasiexperimental designs are particularly relevant when randomized assignment to conditions is not possible, for example when the effects of differential learning environments or school types are studied. As in all quasi-experimental designs, observed group differences do not reflect unbiased treatment effects in multilevel designs with non-random assignment at the cluster-level, because confounding variables can bias the estimates of treatment effects. Therefore it is crucial to measure and statistically control such confounders (e.g., Rosenbaum, 2002;Shadish et al., 2002). In the remainder, we call a potential confounder a covariate.
Conventionally, multilevel ANCOVA has been the recommended approach for analyzing quasi-experimental multilevel designs with treatment application at the cluster-level (e.g., Plewis and Hurry, 1998;Seltzer, 2004). Multilevel ANCOVA for a design with two conditions includes an indicator for the treatment condition X and controls for k covariates Z 1 , . . . , Z k . A typical implementation of an hierarchical two-level ANCOVA, for example given in Raudenbush and Bryk (2002), p. 26, Equation (2.16) is as follows: Y = γ 00 + γ 10 · X + γ 20 · Z 1 + . . . + γ k0 · Z k + u + r ε .
(1) If this model is appropriately specified, i.e., if all covariates are included and there are no unmodeled non-linear effects or treatment covariate interactions, γ 10 is an estimate of the average treatment effect AVE (e.g., Aiken and West, 1996). In order to obtain correct standard errors, the model should be implemented as a hierarchical linear model to take the clustered structure of the design into account (e.g., Murray, 1998Murray, , 2001Seltzer, 2004), i.e., the residual variable ε can be decomposed into a clusterlevel component u, that reflects the residual variation between classrooms and an individual level-component r, that reflects variability of individual students around the values predicted by the covariates and their cluster. The model can be easily extended by including cluster-level covariates W to control for selection effects that operate at the cluster-level.

LIMITATIONS OF CONVENTIONAL MULTILEVEL ANCOVA
The conventional multilevel ANCOVA presented in Equation (1), however, has several drawbacks that limit its potential for the analyses of quasi-experimental multilevel designs with treatment application at the cluster-level.

MEASUREMENT ERROR
Conventional multilevel ANCOVA implemented in the hierarchical linear model does not take measurement error into account, leading to potentially biased estimates of treatment effects (Cook et al., 2009;Culpepper and Aguinis, 2011;Lockwood and McCaffrey, 2014). It is well known that unreliably measured covariates can yield biased results in regression analysis (Degracie and Fuller, 1972;Carroll et al., 2010). This problem is exacerbated in multilevel designs in educational and psychological research, where many covariates can only be measured unreliably and latent covariates can appear at several levels. The emergence of multilevel structural equation models Mehta and Neale, 2005;Marsh et al., 2009;Lüdtke et al., 2011) allows for controlling measurement error in covariates (and outcomes) by including multiple indicators both at the individual-and cluster-level. Multilevel structural equation models also allow for analyzing the factorial structure of latent variables and tests of measurement invariance across levels (Mehta and Neale, 2005;Jak et al., 2013)-two assumptions of conventional multilevel ANCOVA that are not routinely tested (see also Schweig, 2013).

CONTEXTUAL EFFECTS AND SAMPLING ERROR
In quasi-experimental designs with treatment application at the cluster-level, contextual covariates are especially important. Contextual covariates (sometimes also referred to as compositional variables, e.g., Harker and Tymms, 2004;Hutchison, 2007) reflect the composition of cluster-level units, for example the average achievement level in a classroom or the average socioeconomic status in a neighborhood. Contextual covariates are particularly important covariates because it is very likely that they are associated with selection processes at the cluster-level.
Formally, contextual covariates are conditional expectations of a unit-level covariate Z given the cluster variable C. Every unitlevel covariate Z can be decomposed into a contextual covariate or between-cluster component Z b and a residual or within-cluster component Z w : (Lüdtke et al., 2008;Snijders and Bosker, 2012). This decomposition shows that a contextual covariate Z b is the regression of the individual-level covariate Z on the cluster-variable C. As such, all properties of a regression residual hold for the within-cluster component Z w . Notably, its expected value is equal to zero and it is regressively independent of the cluster variable C as well as of all functions of the cluster-variable such as other covariates at the cluster-level. The notion of "contextual effects"-the differential effects of the within-cluster and between-cluster components on the outcome variable-has long been discussed (Raudenbush and Willms, 1995). It is important to take account of such differential effects when controlling for cluster-level differences in multilevel models and it is critical to the unbiased estimation of treatment effects when selection into different treatments occurs at the cluster level. The conventional multilevel ANCOVA of Equation (1) does not make this distinction. The coefficient γ k0 represents the total effect of the covariate Z k that is a blend of the effects of Z kb and Z kw . The value of γ k0 depends on the intraclasscorrelation-coefficient of the covariate Z k (Raudenbush and Bryk, 2002;Snijders and Bosker, 2012). If only Z k is included as covariate in an ANCOVA, but selection into treatment conditions varies as a function of Z kb , the adjusted effects will be biased (Nagengast, 2009).
In applications another difficulty arises: The values of contextual covariates Z b are often estimated by the manifest sample mean in each cluster, assuming perfect reliability of cluster means. This assumption is unrealistic if there is only a sample of individuals taken from each cluster or if the individual-level ratings measure a cluster-level construct such as classroom climate (Marsh et al., 2012). Recently, new approaches have been developed that take the unreliability of cluster means into account (Croon and van Veldhoven, 2007;Lüdtke et al., 2008;Shin and Raudenbush, 2010;Grilli and Rampichini, 2011). Marsh et al. (2009) and Lüdtke et al. (2011) introduced doubly-latent models that further extend these approaches by allowing for latent aggregation of latent variables measured by multiple indicators. They also demonstrated that the contextual effect, i.e., the effect of the contextual variable Z b after controlling for the effects of the unit-level covariate Z, will be biased when sampling error is not controlled. This bias, that is similar to the bias due to measurement error, will also affect the estimation of treatment effects when a contextual covariate is included in the model.

TREATMENT-COVARIATE INTERACTIONS
The literature on ANCOVA for multilevel designs with treatment application at the cluster-level has been surprisingly sparse on the issue of including interactions between the treatment and various covariates at the unit-and the cluster-level (see, Plewis and Hurry, 1998;Pituch, 2001;Seltzer, 2004, for notable exceptions). Such interactions indicate that the effect of the treatment is not constant across all units and clusters, but depends on the values of individual-and cluster-level covariates. Interactions contain important information about the differential effectiveness of the treatment for subgroups of units or clusters. In the presence of interactions, researchers may consider conditional treatment effects (i.e., treatment effects given particular values of the covariate(s) and/or the treatment), or the average treatment effect (i.e., the treatment effect obtained by averaging the conditional treatment effects over the unconditional distribution of covariates). In models ignoring important interactions, the regression coefficient for the treatment effect is an aggregate of the conditional treatment effects, but is not equal to the average treatment effect (see, Rogosa, 1980, for a detailed explanation in the context of single-level models). Using an example with a single covariate, Rogosa (1980) showed that the treatment effect obtained from traditional ANCOVA (misspecified by ignoring the interaction) is an estimate of the (Z = z ca )-conditional treatment effect, where z ca denotes the center of accuracy. The center of accuracy is the point, where the conditional variance of the effect function is minimal and it is not necessarily equal to the average of Z.

STOCHASTIC REGRESSORS
The conventional hierarchical linear model explicitly assumes that the predictors within a sample are fixed quantities that do not vary from sample to sample (see e.g., Pinheiro and Bates, 2000;Raudenbush and Bryk, 2002;Snijders and Bosker, 2012). Hence, all inferences are conditional on the values of the set of observed covariates in the sample (Senn et al., 2007). While this assumption simplifies the implementation of the statistical estimation procedure, the case has been made that it is not appropriate for the analysis of quasi-experimental designs and observational studies (Crager, 1987;Chen, 2006;Nagengast, 2006;Kröhne, 2009) when unconditional inferences to the true distribution (and not to the sample distribution) of covariates are desired. In these designs, it is unlikely that the distribution of covariates in a sample would be identical in a replication of the study and treating the covariates as fixed predictors is not appropriate. Kröhne (2009) showed analytically and in simulation studies that standard errors of average effects obtained from the conventional general linear model assuming fixed predictors will be biased in the presence of treatment-covariate interactions if the covariate is, in fact, a stochastic predictor (see also Sampson, 1974). While the problem of stochastic predictors and unconditional inference has received some attention in experimental design (Gatsonis and Sampson, 1989), correlation analysis (Shieh, 2006) and power analysis (Steiger and Fouladi, 1992) for the general linear model, the topic has not been studied widely for hierarchical linear models.

GENERALIZED ANCOVA
Before we discuss the MG-ML-SEM framework in more detail, we first introduce generalized ANCOVA (Steyer and Partchev, 2008) that was developed to overcome some of the issues of conventional ANCOVA in single-level designs. For a binary treatment indicator X and a multivariate covariate Z = (Z 1 , . . . , Z k ), the regression of Y on X and Z can always be written as (3) In this representation of the regression, the intercept function g 0 (Z) describes the conditional regressive dependency of the outcome Y and the covariates in the control group (i.e., for X = 0). The values of the effect function g 1 (Z) are the conditional treatment effects given particular values z of the covariate Z.
In order to estimate treatment effects, one has to choose a parameterization for both g 0 (Z) and g 1 (Z). Often, linear parameterizations are chosen for the intercept and the effect functions (e.g., Aiken and West, 1996) although other parameterizations, e.g., non-linear functions are also possible. Using only a single covariate Z and assuming linear functions for g 0 (Z) and g 1 (Z) yields E (Y | X, Z) = (γ 00 + γ 10 · Z) + (γ 01 + γ 11 · Z) · X.
(4) Equation (4) extends conventional ANCOVA by including the interaction term γ 11 . If there is an interaction effect, i.e., if γ 11 = 0, treatment effects are not constant, but vary as a linear function of the covariate Z. Based on Equation (4), one can also obtain the average effect of the treatment (AVE) by taking the expectation of the effect function g 1 (Z): Hence, in the presence of interaction effects, the AVE is no longer represented by a single parameter as in conventional ANCOVA, but is identified by a non-linear function of regression coefficients and the expected value of the covariate. If the expected value of the covariate E(Z) were included as a model parameter, e.g., in a multigroup structural equation model (Kröhne, 2009;Nagengast, 2009), the estimation of the AVE using Equation (5) could take into account the uncertainty associated with the covariate means that is introduced when covariates are stochastic predictors (Sampson, 1974;Gatsonis and Sampson, 1989;Steiger and Fouladi, 1992;Chen, 2006;Shieh, 2006). In contrast, the more common approach of mean centering the covariate(s) to obtain average effects (Aiken and West, 1996) ignores this uncertainty. Based on Equation (4), one can also obtain conditional treatment effects CTE = g 1 (z) given particular values z of Z: which are also identified by non-linear functions of regression coefficients and a value z of Z. While generalized ANCOVA and its implementation into multigroup SEM solves some of the problems of ANCOVA without interaction, further steps are necessary in order to account for contextual covariates and measurement error in quasi-experimental multilevel designs.

MG-ML-SEM IMPLEMENTATION OF GENERALIZED ML-ANCOVA
In the remainder of this paper, we introduce the MG-ML-SEM implementation of generalized ML-ANCOVA as an alternative to the conventional implementation of ML-ANCOVA models in hierarchical linear models. The MG-ML-SEM approach naturally overcomes the limitations of the conventional approach mentioned above: (1) It controls for measurement error by including measurement models for multiple indicators of covariates and outcomes.
(2) It easily allows the inclusion of contextual effects with the appropriate controls for sampling error by the latent aggregation approach.
(3) It includes treatment-covariate interactions as a default as it is based on a multiple-group multilevel SEM model. (4) All predictors are treated as stochastic rather than fixed quantities. In addition, the MG-ML-SEM approach allows for group-specific variances of the dependent variable Y given covariates Z.

MG-ML-SEM
The implementation of generalized ML-ANCOVA is based on the multilevel structural equation model of Muthén (1989Muthén ( , 1994 and its extension to multigroup multilevel structural equation models (Muthén et al., 1997). Note that it is also possible to present and implement the model in the GLLAMM-framework ). The elements of these vectors can be modeled in two ways-either using group-mean centering of unit-level variables and the corresponding group means as additional predictors (Kreft et al., 1995;Raudenbush and Bryk, 2002;Enders and Tofighi, 2007) or using a full-information latent aggregation approach (Lüdtke et al., 2008). Similar to the correction of the between-cluster variance of the outcome variable in conventional multilevel models (e.g., Snijders and Bosker, 2012), the latter approach accounts for the fact that the observed between-cluster variances and covariances of the predictors are biased estimators of the true between-cluster variances and covariances. It corrects the effects of the between-cluster covariance matrix for the effects of the within-cluster variances and covariances. Throughout this article, we will denote the latent aggregation approach with a superscript of an asterisk 1 . This decomposition is given by: where Y is the vector of manifest indicators of latent variables, Z is the vector of manifest covariates measured at the unit-level, and W b is the vector of true cluster-level covariates. The vec- contains the latent between-cluster components of the variables and the vector (Y * w , Z * w , 0) contains the latent within-cluster components of the corresponding variables on the unit-level 2 .
The MG-ML-SEM consists of (1) the group-specific withincluster measurement model, (2) the group-specific betweencluster measurement model, (3) the group-specific within-cluster structural model, and (4) the group-specific between-cluster structural model: Muthén (2004) for details on the implementation of the MG-ML-SEM as a sampling model. Considering only one individual-specific covariate is a between-cluster intercept in β x0 , and β x1 is a between-cluster regression coefficient in B x .

GENERALIZED ML-ANCOVA
The generalized ML-ANCOVA combines aspects of conventional ML-ANCOVA (by considering the nested structure and withinand between-cluster components of variables) and aspects of generalized single-level ANCOVA (by considering interactions between the treatment variable and covariates).
The generalized ML-ANCOVA for a single covariate Z = Z * b + Z * w and a dichotomous treatment variable X with values 0 and 1 is given by: Since we want to analyze average and conditional treatment effects, our main interest lies in the effect function g 1 (Z * b , Z * w ). In the multi-group setting, the effect function can be computed as the difference between the two group-specific regressions Inserting Equation (8) into (10) yields the effect function of generalized ML-ANCOVA based on parameters of the MG-ML-SEM: This effect function can be used to compute average and conditional effects. In order to compute the average treatment effect, we first need to compute the unconditional expectations of covariates. The MG-ML-SEM only contains parameters for the conditional expectations of covariates given treatment group x, but the unconditional expectation can be computed by: where μ 01 , μ 11 are between-cluster "intercepts" of exogenous covariates in β x0 , i.e., μ 01 , μ 11 are the group-specific true means . Generalized ML-ANCOVA can easily be extended to j + 1 treatment groups and to include more covariates at the withinand at the between-cluster level, as well as pure cluster level covariates: where I X = j is an indicator for treatment condition j, and Z * b , Z * w , W are multivariate random variables. Based on Equation (13), we can consider average treatment effects . Next, we illustrate the MG-ML-SEM approach to generalized ML-ANCOVA with a model from educational effectiveness research. Using a single dataset, we present the increasingly complex MG-ML-SEM models that show the features and the versatility of the approach.

PARTICIPANTS AND PROCEDURE
To illustrate the MG-ML-SEM, we use data from the ELEMENT study in Berlin, Germany (Lehmann and Lenkeit, 2008). ELEMENT is a three wave longitudinal study aimed at examining effects of early transition to secondary school (after 4th grade) on students' reading and mathematics proficiency. From 2002From /2003From to 2004From /2005 total of N = 4926 students were measured in 4th grade, 5th grade, and 6th grade with several ability tests. N = 3169 students attended elementary school until the end of 6th grade, whereas N = 1757 decided to make the transition to secondary school after 4th grade.
In this article, we do not present a comprehensive analysis of the ELEMENT study. Instead, the primary goal of our paper is to illustrate the MG-ML-SEM approach as a means to estimate conditional and average effects in educational research. For didactic purposes, we restrict ourselves to reading comprehension as outcome, to a limited set of covariates (prior reading achievement and interest in reading), to two occasions of measurement (4th grade and 6th grade), and to only one of the five data sets with imputed missing values provided by the Research Data Centre at the Institute for Educational Quality Improvement. A complete causal analysis would most likely require the inclusion of more covariates and/or propensity scores (see causal inference section in the discussion for details), and careful consideration of the sampling design and the missing data structure (e.g., Baumert et al., 2009;Lehmann, 2010;Becker et al., 2014).

Reading comprehension
The reading test used in the ELEMENT study was based on the theoretical framework of the IEA and OECD reading assessments. It contained items from PIRLS (Mullis et al., 2006) and from the LAU Study (Lehmann et al., 1997(Lehmann et al., , 1999. Students' reading abilities were obtained as weighted likelihood estimates based on an IRT model. In order to obtain comparable scores across time, an anchor items design was used. In our analyses, we used test scores from 4th grade as covariate Z in order to control for pre-existing differences in reading, and we used test scores from 6th grade as outcome variable Y.

Interest in reading
The scale "interest in reading" consisted of five items. For our analyses including a latent covariate η, we used the following three positively worded items as indicators of a common latent variable: V 1 (Item Asf0902): I like talking to others about books. V 2 (Item Asf0903): I am pleased with a book received as a gift. V 3 (Item Asf0905): I like reading.
The response format was a four-point Likert scale with categories "1 = strongly agree," "2 = agree," "3 = disagree," and "4 = strongly disagree." The three items were recoded so that higher values represent higher interest in reading. For the analyses including only manifest variables, we computed a scale V for "interest in reading" by taking the mean of the three recoded items for each student.

Other variables
Our treatment variable X is type of school at 6th grade with values X = 0 (elementary school) and X = 1 (secondary school). We used class ID in 6th grade as cluster variable C. For some of the analyses ignoring latent aggregation, we created betweencluster components of all covariates by computing the empirical class means. We also created the within-cluster components by computing the difference between the corresponding variable and their empirical class means. This decomposition was done for reading comprehension at 4th grade Z, each of the three indicators of "interest in reading" V 1 , V 2 , V 3 , and the scale for "interest in reading"

MODELS
We analyzed the data with six multigroup multilevel models in order to illustrate the different effect estimates and highlight the strengths of the MG-ML-SEM approach. In the simplest model M1, we did not control for any covariates; in M2, we controlled for manifest covariates without distinguishing within and between components (as in a traditional multilevel ANCOVA model); in M3, we controlled for within and between components of manifest covariates using manifest aggregation; in M4, we controlled for within and between components of manifest covariates using latent aggregation; in M5, we controlled for within and between components of a latent and a manifest covariate using manifest aggregation; and finally in M6, we specified the full doubly latent model controlling for within and between components of a latent and a manifest covariate using latent aggregation. See Table 1 for an overview of models and their characteristics. We chose to present all six models in order to show differences in point estimates and standard errors between models in the context of our illustrative example. These analyses show considerable differences between models and demonstrate the need for a careful examination of treatment effects while meeting the requirements of complex multilevel designs.

M1: MG-ML-SEM WITHOUT COVARIATES
The first MG-ML-SEM is a way of estimating the (unadjusted) means of reading comprehension from 6th grade Y in elementary school (X = 0) and secondary school (X = 1). The parameters in this model are the group-specific within and between variances of Y, and the group-specific means of Y. The within and between structural model for both treatment groups (X = 0 and X = 1) are given by: where Y * w is the within-cluster component of Y and Y * b is the between-cluster component of Y. The asterisk indicates that Y * w and Y * b are latent variables obtained by latent aggregation. In this model without covariates, the structural intercepts β 00 and β 10 are the group-specific means of Y, ζ w is a residual at the within level, and ζ b is a residual at the between level.
Based on the parameters of this model, we can compute the effect function (see Equation 11), which is identical with the average treatment effect in this model, because there are no covariates: Obviously, in the ELEMENT study as in other observational studies, the AVE obtained from M1 does not reflect an unbiased estimate of the average causal effect of school type on reading comprehension from 6th grade. It is only in randomized controlled trials, that we can estimate the causal effect without bias using a model such as M1, i.e., without controlling for covariates.

M2: MG-ML-SEM MANIFEST COVARIATES/WITHOUT CONTEXTUAL COVARIATES
In the second MG-ML-SEM, we implemented the conventional multilevel ANCOVA by adding two covariates to the model, namely reading comprehension from 4th grade Z and the scale score of interest in reading V, and consider the group-specific regressions E X = x (Y | Z, V) in the following MG-ML-SEM: In this model we demonstrate the consequences of disregarding the decomposition of (V, Z) into within and between components. This decomposition was ignored by constraining the corresponding regression coefficients to be equal across levels, i.e., α 01 = β 01 , α 02 = β 02 , α 11 = β 11 , α 12 = β 12 . Based on the parameters of this model, we can compute the effect function g 1 (Z, V) and the average effect AVE as follows (see Equations 11, 12 for details): The average treatment effect of school type on reading comprehension from 6th grade obtained from this model can be causally interpreted only under strong assumptions (see causal inference section in the discussion). If there are other confounders not included in the model or if there are contextual effects, the AVE from M2 must not be causally interpreted.

M3: MG-ML-SEM MANIFEST COVARIATES/MANIFEST AGGREGATION
In the third MG-ML-SEM, we add contextual covariates to the model, i.e., we decompose the manifest covariates (Z, V) into between-cluster components (Z b , V b ) and within-cluster components (Z w , V w ). For this decomposition, we follow the group-mean centering approach assuming perfect reliability of manifest cluster means of covariates. In the terminology of Marsh et al. (2009), M3 is called a "Doubly-Manifest Model." We computed the values of the betweencluster variables (Z b , V b ) and the within-cluster variables (Z w , V w ) before fitting the model. We consider the group- Again, based on the parameters of M3, we can compute the effect function g 1 (Z b , V b , Z w , V w ) and the average treatment effect as follows (see Equations 11, 12): The treatment effects in this model depend on the values of the within-cluster components and the between-cluster components of covariates. Note that the equation for computing the AVE simplifies considerably, because the expectation of the within-cluster components of the covariates are zero, i.e., E(Z w ) = E(V w ) = 0 (see section 3.2). The unconditional expectations of covariates are computed as shown in Equation (12). M3 is less restrictive compared to M2. It requires causal unbiasedness of the regression E(Y | X, Z b , V b , Z w , V w ), which means that there must not be any omitted confounders.

M4: MG-ML-SEM MANIFEST COVARIATES/LATENT AGGREGATION
Our fourth MG-ML-SEM is very similar to M3, with the only exception that contextual covariates are treated as latent variables using the full-information latent aggregation approach (Lüdtke et al., 2008) as indicated by the superscript of an asterisk in the model equations for the fourth MG-ML-SEM. This model was first presented in Nagengast (2009): ⎞ ⎠ between structural model X = 1 www.frontiersin.org April 2014 | Volume 5 | Article 304 | 7 Marsh et al. (2009) termed this model a "Manifest-Latent Model." The computations of the g 1 (Z * b , V * b , Z * w , V * w ) and the average treatment effect mimic the corresponding computations shown for M3 (see Equations 11, 12):

M5: MG-ML-SEM LATENT COVARIATES/MANIFEST AGGREGATION
In our fifth MG-ML-SEM, we take one step back and again apply manifest aggregation for covariates. Unlike previous models, we include a measurement model for the latent covariate "interest in reading," i.e., the three positively worded items V 1 , V 2 , V 3 are indicators of a latent construct η. The three indicators are decomposed into their within-and between-cluster components using manifest aggregation. In the terminology of Marsh et al. (2009), M5 is called a "Latent-Manifest Model." Particular advantages of explicitly including a latent covariate are: (1) one can appropriately account for measurement error in the covariate, (2) the option to test measurement invariance across groups and across levels, and (3) the availability of fit indices to examine model fit.
Adding a latent variable in the MG-ML-SEM approach requires the specification of a within-cluster measurement model and a between-cluster measurement model for both treatment groups. We assume measurement invariance across levels (Mehta and Neale, 2005;Jak et al., 2013) and across groups (Meredith, 1993).
In the model equations, we show the common within-cluster and the common between-cluster measurement models: Average and conditional effects are computed as in the previous models. Note that we use the unconditional expectation of the latent covariate η to compute the average treatment effect. E(η) will depend on the chosen scale: In a τ -congeneric measurement model as used in this example, η is uniquely defined up to positive linear transformations. We fixed the scale of η by fixing the first measurement intercept to ν 1 = 0 and the first loading to λ 1 = 1 in both treatment groups 3 . Given a group-invariant measurement model, E(η) can be computed: The effect function and the AVE from M5 are then (see Equations 11, 12):

M6: MG-ML-SEM LATENT COVARIATES/LATENT AGGREGATION
The sixth and final model is the full doubly latent MG-ML-SEM including latent covariates and latent aggregation. The only difference compared to M5 is that the aggregation of pre-treatment reading comprehension Z and the three indicators of interest in reading is latent as indicated by the superscript of an asterisk. M6 is specified by the following equations: ⎞ ⎠ within measurement model 3 In principle, other ways of fixing the scale of η, such as fixing the latent mean and variance, are possible as well. However, we recommend this approach in the context of multigroup multilevel SEM for analyzing average and conditional effects. Using the latent variance for setting the scale would complicate specification, because the variance needs to be specified appropriately in the multigroup-multilevel setting, which requires constraints across groups and across levels. Also, we do not want to fix the mean of latent variables in each of the groups, because this would lead to biased estimates of treatment effects.
⎞ ⎠ between structural model X = 1 Figure 1 illustrates M6. The upper path diagram of Figure 1 refers to treatment group X = 0 and the lower path diagram refers to X = 1. Each of the group-specific path diagrams is divided into three parts: the gray-shaded middle part shows the observed variables that are decomposed into contextual variables (upper part L2) and unit-level residuals (lower part L1). Based on the parameters of the doubly latent model shown in the model equations above and displayed in Figure 1, we can compute the effect function and its expectation by (see Equations 11, 12): Mplus syntax for M6 is given in the Supplementary Material.

RESULTS
The aim of our illustrative example is twofold: First, we demonstrate the consequences of estimating average effects with different models. Second, we show how to use the MG-ML-SEM approach to estimate conditional effects and average effects of the treated and the untreated. We begin our presentation with the descriptive statistics of the central variables in our analyses. Table 2 shows means, standard deviations, and intra-class correlation coefficients for all manifest variables used in our analyses. The left-hand columns depicts descriptive statistics for those students that stayed in elementary school until the end of 6th grade (the control group X = 0), whereas the right-hand columns depicts descriptive statistics for those students that made the early transition to secondary school after 4th grade (the treatment group, X = 1). Descriptively, the distributions of pre-treatment covariates are quite different between the two groups, which is expected in quasi-experimental designs like the ELEMENT study.

COMPARISON OF MODELS
The first aim of our illustrative example was to examine whether model choice is important for assessing the effects of early transition to secondary school on reading comprehension. To examine the differences between the models, we analyzed the data from the ELEMENT study with the six MG-ML-SEMs presented in the last section. All models were implemented using Mplus 7 (Muthén and Muthén, 2012). Table 3 shows the average treatment effect together with standard errors and effect sizes, the parameters of the conditional effect functions g 1 (Z), and group-specific contextual effects of covariates for all models. Effect sizes (ES) for the average effects were computed by dividing AVE by the standard deviation of Y given X = 0. The substantive conclusions drawn from the models are quite different, as will be discussed in detail in the following paragraphs. There was a strong and significant average treatment effect obtained from M1 (AVE = 13.59, 95% CI [12.10, 15.08], ES = 1.07). In a randomized experiment, an applied researcher might conclude that, on average, early transition to secondary school was beneficial for students' reading comprehension. However, in the present quasi-experimental setting, the observed difference in outcome means between treatment groups may be due to pre-existing differences. In fact, as shown in Table 2, those students who chose to make the early transition to secondary school already have higher values on reading comprehension in 4th grade and higher interest in reading on average. Therefore, we need to control for these pre-existing differences by conditioning on relevant covariates.
In M2, we controlled for the manifest covariates "reading comprehension in 4th grade" Z and "interest in reading" V without distinguishing within and between components of these covariates. The average treatment effect obtained from M2 (AVE = 5.23, 95% CI [3.90, 6.47], ES = 0.43) was considerably lower compared to the AVE from M1 but is still significant. If there were no contextual effects and no other covariates introducing bias, one would conclude that early transition to secondary school positively affects students' reading comprehension on average.
M3 is the first model in our sequence of models that separated the within-cluster and between-cluster components of covariates and estimates contextual effects. In our illustrative example, there was a significant contextual effect for reading comprehension at 4th grade for those students that stayed in primary schools (CE ZY;X = 0 = 0.27, 95% CI [0.19, 0.35]), but no significant contextual effect of reading comprehension at 4th grade for students that moved to secondary schools and no contextual effects for interest in reading in both groups (see Table 3). Substantively, the contextual effect showed that students with equal initial achievement levels and equal interest in reading (i.e., given Z = z and V = v) profited from being schooled together with high achieving students (i.e., high values on Z b and V b ) in primary school, but not if they made an early transition to secondary school. In M3, where we took account of contextual covariates in the computation of the average treatment effect, the AVE was no longer significant (AVE = 1.76, 95% CI [−0.20, 3.72], ES = 0.14).
All subsequent models (M4-M6) revealed a similar pattern of results: The AVE is not significant and there was a contextual effect of initial reading comprehension for students that stayed in primary school. The estimates from models M3-M6 ranged from AVE = 1.76 (95% CI [−0.20, 3.72], ES = 0.14; M3) to 0.14 (95% CI [−2.57, 2.85], ES = 0.01; M4). Descriptively, the models using latent aggregation of covariates (M4 and M6) show the lowest AVE. When the effect estimates were standardized using the standard deviation of the outcome in primary school, the results were very similar (see row ES in Table 3). With regard to standard errors, we find that the models including latent aggregation (M4 and M6) had the highest standard errors, followed by the models using manifest aggregation and contextual covariates (M3 and M5), and the model ignoring contextual covariates (M2) had the lowest standard error. M6 had the highest standard error, because it accounts for uncertainty in the aggregation of covariates and also includes latent covariates.
In summary, the average total effects obtained from models M3-M6 were pretty similar, but they were substantially different from the apparent effects obtained from M1 and M2. The comparison of models clearly shows that it is necessary to control for contextual covariates, if there are contextual effects.

DOUBLY-LATENT MODEL M6
In order to further illustrate the advantages of the MG-ML-SEM approach, we present a more detailed analysis of conditional and average effects based on the doubly-latent model M6. Although there were only minor differences between M3 and M6 in our illustrative example, we generally prefer M6, because it overcomes all of the limitations of conventional ANCOVA mentioned in the introduction. We note, however, that there might be situations in which partial correction models may be more appropriate for substantive or statistical reasons (Lüdtke et al., 2008(Lüdtke et al., , 2011Marsh et al., 2012).
Detailed results for all parameters of the doubly-latent model M6 and model fit information are given in Table 4. The effect function g 1 (Z), the average treatment effect AVE, contextual effects, and all conditional effects considered in subsequent paragraphs, were based on these parameters. Figure 1 shows a path diagram of the full model. Next, we demonstrate how the MG-ML-SEM approach could be used to answer substantive research questions using our preferred model (M6) as case example. In particular, we show how the MG-ML-SEM approach can be used to examine average effects as well as conditional treatment effects given certain values of covariates.
In our illustrative example, the average effect is not the only interesting quantity from a substantive point of view. When The effect function is given by g 1 (Z ) = γ 10 + γ 11 Z b + γ 12 V b + γ 13 Z w + γ 14 V w (in models with latent variables, V is replaced by η; in models using latent aggregation evaluating an intervention it is not only of interest if there is an average effect, but also for whom the intervention is beneficial or even harmful. Early transition to secondary school might not have a significant average effect, but could still affect specific students, e.g., students with high or low values of pretreatment covariates. For example, high achieving students might particularly benefit from an early transition to secondary schooling, whereas the early transition might negatively impact students with low initial achievement. The MG-ML-SEM approach can be used to study such conditional effects. In addition, it is also possible to study average treatment effects for certain sub-populations. From a policy perspective, it might be interesting to study the effect of the early transition on the group of students actually choosing early transition to secondary school (this effect has also be termed the "effect on the treated"). Or one might be interested in whether early transition to secondary school would have a beneficial effect for those who did not take this opportunity ("effect on the untreated"). Prior to estimating M6 for computing average and conditional effects, we tested for measurement invariance of "interest in reading" by comparing a model with invariance across groups and across levels (χ 2 = 81.57, df = 14, p < 0.001; BIC = 28192.23; RMSEA = 0.05; CFI = 0.97; TLI = 0.98) with a model where invariance was not assumed (χ 2 = 1.69, df = 6, p = 0.95; BIC = 28198.01; RMSEA = 0.00; CFI = 1.00; TLI = 1.00) 4 . While the χ 2 -test of model fit was significant for the model with invariance, the other fit measures indicated a good fitting model, and the BIC was lower for the model with invariance. We concluded that the more parsimonious model with measurement invariance is adequate.
Based on M6, we first tested the null hypothesis that all conditional effects were equal using a Wald test, i.e., H 0 : γ 11 = γ 12 = γ 13 = γ 14 = 0, where γ 11 to γ 14 are regression coefficients in the effect function (cf. Table 3). Average effects on the treated and the untreated will differ from the average treatment effect only if the conditional effects are not constant. It is one of the particular strengths of the MG-ML-SEM approach that it allows tests of the treatment-covariate interactions and does not presume that conditional treatment effects are constant. Based on the Wald test, we rejected the null hypothesis (χ 2 = 14.25, df = 4, p = 0.0065) of constant treatment effects.
Next, we considered conditional treatment effects, such as the conditional effect of early transition to secondary school on the treated (CTET). The CTET is the conditional expectation of the effect function given X = 1, i.e., It can be computed based on the parameters of the model (see Table 4): for reading interest and omitted the single-item reading comprehension measures.

www.frontiersin.org
April 2014 | Volume 5 | Article 304 | 11 using the fact that the conditional expectations of within-cluster components Z * w and η * w given X = x is zero 5 . In our illustrative example, the CTET was not significant (CTET = −0.92, 95% CI [−3.00, 1.16]). This implies that the early transition to secondary schooling was not beneficial for the group of students that actually underwent this transition. If this effect was causally interpretable (which would require controlling a larger set of covariates), it would imply that there were no beneficial effects of the supposedly enriched learning environment of the secondary school for the group of students that went there. These students would have obtained similar reading outcomes in the 6th grade after attending primary schools. 5 E(Z * w | X = x) = E(η * w | X = x) = 0, because Z * w and η * w are defined as residuals with respect to the regressor C (see section about contextual effects and sampling error). Therefore, for any function f (C) of C and the treatment variable X is such a function of C in designs with treatment application at the cluster-level considered in this paper.
Similarly, the conditional treatment effect on the untreated (CTEUT) is the effect that early transition to secondary school would have on students who stay in elementary school until the end of sixth grade. The CTEUT can be computed as the conditional expectation of the effect function given X = 0: In our analyses, the CTEUT is also not significant (CTEUT = 0.74, 95% CI [−3.28, 4.76]). Again, under the assumption that all relevant covariates had been controlled, this would imply that students who stayed in primary school longer would not have profited from an earlier transition to secondary schooling with respect to their reading proficiency.

DISCUSSION
In this paper, we presented the MG-ML-SEM implementation of generalized ML-ANCOVA for the analysis of quasi-experimental multilevel designs with non-randomized assignment at the cluster-level. We demonstrated the flexibility of this approach and its potential for the analyses of average and conditional treatment effects using data from a German educational study focusing on early transitions to secondary schooling. The MG-ML-SEM approach overcomes the limitations of conventional multilevel ANCOVA by (1) accounting for measurement error, (2) systematically including contextual covariates with appropriate control for sampling error, (3) naturally including treatmentcovariate interactions as a default and (4) treating predictors as stochastic rather than fixed. We now discuss some limitations of our example and focus on the conditions for interpreting average and conditional effect estimates obtained from generalized multilevel ANCOVA as causal effects.

CAUSAL INFERENCE
Causal effects are crucial in theories in the social sciences. Almost all theories include statements that can be formalized as statements about causal effects of a variable X on a variable Y. Many researchers have the idea that a causal treatment effect is what is estimated by a mean difference (between treatment and control) in a randomized experiment. Although it turns out that this idea is not wrong, it does not help much in situations in which it is not possible or desirable to randomly assign persons/clusters to treatments. How and under which conditions can we estimate a causal effect? The general conclusion is that specific techniques of causal modeling are indispensable, whenever we are beyond total effects in a randomized experiment. Approaches to causality include Rubin's causal model (Rubin, 1974), Pearl's graphical approach (Pearl, 2009), the stochastic theory of causal effects (Steyer et al., 2014), or Dawid's approach (Dawid, 2011). These theories of causality provide definitions of causal effects and point out the assumptions required to estimate average and conditional causal effects from empirically estimable conditional expectations in experimental and quasi-experimental designs.
A requirement of a causal analysis is a temporal structure, i.e., the focused cause has to be prior to the outcome, and the covariates have to be prior or simultaneous to the cause, so that the covariates can not be affected by the cause. This distinguishes covariates from potential mediators. The temporal structure of random variables can either be described by a directed acyclic graph (DAG; Pearl, 2009) or using a filtration (Steyer et al., 2014). In the ELEMENT study, the longitudinal aspect of the study makes such a time order plausible: The covariates "interest in reading" and "pre-test reading comprehension" are measured at 4th grade, prior to the focused cause "early transition to secondary school," and the outcome "post-test reading comprehension" is measured at 6th grade which is posterior to early transition.
One strategy for identifying causal effects is to control for all confounders. In informal terms, a confounder is a covariate that will bias the effects of the focused cause on the outcome, if we do not control for it appropriately (see Steyer et al., 2014, for a definition of unbiasedness). Applied researchers are encouraged to test whether a covariate is a confounder or not: In Pearl's theory, the (conditional) independence statements implied by the DAG can be tested. Based on a correctly specified DAG, the researcher can then read off the covariates that need to be controlled in a causal analysis using the backdoor criterion (Pearl, 1993). In the stochastic theory of causal effects, there are several testable causality conditions that imply unbiasedness (Steyer et al., 2000(Steyer et al., , 2014. For example, the independent cause condition is defined as conditional independence of the cause and all covariates given the selected covariates in the model (see CC1Z in Steyer et al., 2014). The regressively independent outcome condition (see CC2Z in Steyer et al., 2014) is defined as conditional regressive independence of the outcome variable from all covariates given the cause and the selected covariates in the model. Both of these causality conditions imply conditional unbiasedness and are testable in the sense that the corresponding conditional independence statements can be falsified in empirical studies.
Once we have identified all relevant confounders, i.e., once we have selected the covariates such that one of the causality conditions holds, we have to control for these covariates in order to estimate causal effects and not just mere associations. Based on the theories of causality mentioned above and work by others (e.g., Rosenbaum and Rubin, 1983;Robins, 1999;Shadish et al., 2002), several techniques have been developed to control for potential confounders, e.g., ANCOVA adjustment, propensity scores, weighting techniques, matching, subclassification, marginal structural models and many more (see Schafer and Kang, 2008, for an overview). In this article, we suggested an approach to analyze data from the ELEMENT study that extends ANCOVA techniques in several ways. Some of the other techniques could also be applied for analyzing data in quasi-experimental multilevel designs (see Baumert et al., 2009;Becker et al., 2014 for applications of propensity score matching to the ELEMENT data).
As mentioned previously, the primary goal of our paper was to illustrate the MG-ML-SEM approach as a means to estimate average and conditional (total) effects in educational research. The effects obtained from models M1-M6 presented in this article can only be causally interpreted if the corresponding regressions are unbiased. For example, the average effect in M1 requires unbiasedness of the conditional expectations E(Y | X=x), which is very unlikely to hold in the ELEMENT study. It would only be plausible in a randomized experiment. M2 requires unbiasedness of the group-specific regressions E X = x (Y | Z, V), and M3 requires unbiasedness of the group-specific regressions . Similar arguments apply to the other three models presented in this paper. A careful causal analysis would require including more potential confounders and testing the causality conditions.

LIMITATIONS AND DIRECTIONS FOR FURTHER RESEARCH
As we only compared our models on an example dataset, there was no way to pick the correct model from the set of models presented. Obviously M6, the full doubly-latent model, controls for shortcomings of conventional multilevel ANCOVA, but other models could also be defended on substantive and statistical grounds. For example, M5 that does not use latent aggregation could also be appropriate as the data in the ELEMENT study has been obtained from complete samples of the students within classrooms in Grade 4. In this case, there is only a small proportion of sampling error in the aggregation of covariates to the classroom level (Lüdtke et al., 2008;Marsh et al., 2012), e.g., due to students missing the assessment date. Similarly, M4 and M5 might be more efficient in estimating the average treatment effect as the full doubly-latent model M6 has been shown to yield very variable estimates (Lüdtke et al., 2011). However, as illustrated in the empirical examples, controlling for sampling error (and measurement error) only had minor effects on the estimated effects. It was the separation into within-and between-components of the covariates that mattered most. This separation, in turn, was justified by the presence of contextual effects of achievement in the control group. In applications, bigger differences between the models are expected when sampling and measurement error increase or when contextual effects are more pronounced. In these cases, the bias of the contextual effect will be relatively larger in M3 compared to the full doubly-latent model (Lüdtke et al., 2011).
Further research on the MG-ML-SEM implementation of the generalized ML-ANCOVA should study more carefully the required sample size at both the within-and the between-level. Simulation studies suggest that there are circumstances where a sample size of at least 10 students within a class are required for reliably estimating contextual and average effects with M4 (Lüdtke et al., 2008;Nagengast, 2009) and large sample sizes at the class-level might be required for the doubly-latent model to be sufficiently accurate (Lüdtke et al., 2011). Further simulation work is needed to extend these findings to the MG-ML-SEM case and clarify what sample sizes are required for reliable inferences about average and conditional effects in this framework.
Further developments of the MG-ML-SEM framework will likely yield additional options for the application and development of generalized multilevel ANCOVA. In particular, accounting for cross-classified multilevel structures in multilevel structural equation models would be a major step forward. By cross-classified structure, we refer to a multilevel structure, which is not strictly hierarchical, but can be more complex (for details and two examples of a cross-classified structure, see Raudenbush, 1993). Properly accounting for the cross-classified structure by considering different clustering structures for the covariates and the outcomes would be particularly useful for studies such as ELEMENT were students move from one school type to another. In addition, the further development of multilevel structural equation models with latent interactions (e.g., Leite and Zuo, 2011;Nagengast et al., 2013;Schermelleh-Engel et al., 2014) could allow to include covariate-covariate interactions and other nonlinear effects of latent covariates in the effect functions. Similarly, the development of three-level multilevel structural equation models will yield further opportunities for extending the models presented and allow researchers to address the complexities of multilevel quasi-experimental designs more comprehensively.
In this article, we provided applied researchers with a comprehensive toolbox to analyze average and conditional effects in non-randomized multilevel designs. We hope that this presentation encourages researchers to apply these advanced techniques to address issues of measurement error, sampling error, contextual effects, treatment-covariate interactions and ultimately causal effects in the analyses of multilevel quasi-experimental designs.