A general non-linear multilevel structural equation mixture model

In the past 2 decades latent variable modeling has become a standard tool in the social sciences. In the same time period, traditional linear structural equation models have been extended to include non-linear interaction and quadratic effects (e.g., Klein and Moosbrugger, 2000), and multilevel modeling (Rabe-Hesketh et al., 2004). We present a general non-linear multilevel structural equation mixture model (GNM-SEMM) that combines recent semiparametric non-linear structural equation models (Kelava and Nagengast, 2012; Kelava et al., 2014) with multilevel structural equation mixture models (Muthén and Asparouhov, 2009) for clustered and non-normally distributed data. The proposed approach allows for semiparametric relationships at the within and at the between levels. We present examples from the educational science to illustrate different submodels from the general framework.

In the past 2 decades latent variable modeling has become a standard tool in the social sciences. Linear structural equation models have been extended to include non-linear interaction and quadratic effects (for a review see Schumacker and Marcoulides, 1998;Algina and Moulder, 2001;Marsh et al., 2004Marsh et al., , 2006, and for the capability to model multilevel data structures (e.g., Rabe-Hesketh et al., 2004;Muthén and Asparouhov, 2009). However, a systematic combination of both non-linear structural equation modeling and multilevel modeling has not been implemented in a more general framework. In this article, we present a GNM-SEMM that combines recent semiparametric non-linear structural equation models (Kelava and Nagengast, 2012;Kelava et al., 2014) with multilevel structural equation mixture models (Muthén and Asparouhov, 2009) for clustered and non-Gaussian data. The proposed framework is capable of modeling non-linear parametric and semiparametric relationships at the within and at the between levels, and it allows non-normally distributed data to be considered. We will provide an empirical example from educational sciences to illustrate the applicability of the proposed framework. We will begin by providing an overview of current approaches for estimating non-linear structural equation models and current frameworks for multilevel structural equation (mixture) models.

NON-LINEAR STRUCTURAL EQUATION MODELS
Numerous parametric approaches for the estimation of nonlinear effects have been developed (for a review, see Schumacker and Marcoulides, 1998;Algina and Moulder, 2001;Marsh et al., 2004Marsh et al., , 2006, including product indicator approaches (e.g., Kenny and Judd, 1984;Bollen, 1995;Jaccard and Wan, 1995;Ping, 1995;Jöreskog and Yang, 1996;Algina and Moulder, 2001;Marsh et al., 2004Marsh et al., , 2006Little et al., 2006;Kelava and Brandt, 2009), distribution analytic approaches (Klein and Moosbrugger, 2000; Klein and Muthén, 2007), Bayesian approaches (e.g., Arminger and Muthén, 1998;Lee et al., 2007), and method of moments based approaches (Wall and Amemiya, 2003;Mooijaart and Bentler, 2010). Whereas most product indicator approaches have been ad-hoc methods for the specification of non-linear interaction effects and have thus suffered from requiring complicated measurement models, recent distribution analytic and Bayesian approaches have tried to overcome the need for non-linear measurement models. Method-of-moments-based approaches (Wall and Amemiya, 2003;Mooijaart and Bentler, 2010) and some indicator approaches (Bollen, 1995;Jöreskog and Yang, 1996) have been proposed as methods that do not rely as heavily on the normality assumption of the latent variables as other approaches (e.g., the distribution analytic approaches). The relaxation of distributional assumptions may lead to a reduction in the threat of biased estimates for non-linear effects in situations in which data are non-normally distributed, but for most of these approaches, relaxing these assumptions is associated with a low power for detecting the effects (Schermelleh-Engel et al., 1998;Brandt et al., 2014).
A different approach for modeling non-linear relations between latent variables is the use of semiparametric structural equation mixture models (SEMM; Arminger and Stein, 1997;Jedidi et al., 1997a,b;Dolan and van der Maas, 1998;Arminger et al., 1999;Muthén, 2001;Bauer and Curran, 2004;Bauer, 2005;Pek et al., 2009Pek et al., , 2011. Finite mixtures of linear structural equation models are used to approximate the unknown functional form of the non-linear relationship of the latent variables 1 . 1 In SEMM linear models are estimated within several latent classes. Nonlinear relationships between two variables are modeled by the parameter estimates for the linear effects that change in size across the (finite number of) latent classes. Furthermore, by assuming mixtures, the SEMM approach relaxes the assumption of normally distributed latent variables and disturbances necessary in conventional structural equation models. Therefore, the SEMM approach is a flexible tool for predicting latent dependent variables when data are not normal, and when obtaining a strict parametric representation of the functional relation does not have the highest priority (for a discussion see Bauer, 2005). However, one drawback is that the linearity assumption of latent relationships and the normality assumption of the latent variables are relaxed simultaneously. This drawback can be manifested in the problem that observed non-normality in the data cannot be attributed to either non-normality of the latent variables or non-linearity between the latent variables. A way to overcome this problem is the specification of non-linear structural equation mixture models (NSEMM; Kelava et al., 2014) that allow distributional and linearity assumptions to be relaxed separately for the latent variables and their relationships.
Although, the use of mixtures for modeling non-linear latent variable relationships (e.g., Curran et al., 1996;Dolan and van der Maas, 1998;Bauer and Curran, 2004;Bauer, 2005) or the nonnormality of latent variables in the context of non-linear structural equation models (Lubke and Muthén, 2005;Lee et al., 2008;Yang and Dunson, 2010;Kelava and Nagengast, 2012;Brandt et al., 2014;Kelava et al., 2014) have received increased attention in recent years, systematic evaluations have been rare. As an additional limitation, all approaches presented so far have been strictly limited to single-level models and have not accounted for nested data structures.

MULTILEVEL STRUCTURAL EQUATION MODELING
Nested data structures have been addressed with multilevel models for relationships between manifest variables (for an introduction see Snijders and Bosker, 1999;Hox, 2010). In the past 2 decades, researchers have proposed frameworks that are capable of modeling nested data structures in latent variable models (e.g., Muthén, 1994;Rabe-Hesketh et al., 2004;Muthén and Asparouhov, 2009). For example, these frameworks have included models that account for random effects on the within-level, multilevel path analysis (Heck and Thomas, 2000), or multilevel confirmatory factor analysis (Muthén, 1994). Furthermore, mixtures of distributions have been applied in latent growth curve modeling (Muthén and Asparouhov, 2009).
So far, very limited psychometric developments have been proposed in the context of non-linear multilevel structural equation models that incorporate latent interaction effects. Leite and Zuo (2011) presented a product-indicator-based approach that allows for a specification of latent interactions on the between-level (e.g., at the school level). Their approach was a first attempt to extend the product-indicator approach for non-linear interaction effects in latent multilevel models. Products of between-level indicators are used for the specification of a measurement model of the between-level latent product variable.
Focusing more generally on within-person processes in psychology (Molenaar, 2004;Molenaar and Campbell, 2009), Nagengast et al. (2013) adapted the unconstrained product indicator approach to account for latent interactions on the withinlevel. In predicting homework motivation, they found support for the latent interaction between homework expectancy and homework value at the within-student level.
Despite these first successful adaptations, several problems that are associated with single-level non-linear structural equation modeling remain unsolved. First, the hitherto applied constrained and unconstrained product-indicator approaches for multilevel models are vulnerable to violations of distributional assumptions (normal distributions are typically assumed; for a discussion see Kelava et al., 2011). The specification of constrained and unconstrained product-indicator approaches strongly depends on the distributions involved (Kelava and Brandt, 2009), and biased estimates of the parameters and standard errors can be expected when specification errors occur (Kelava et al., 2008) or distributional assumptions are not met (Kelava and Nagengast, 2012). Hence, product-indicator approaches that are extended for multilevel data structures are even more vulnerable because more distributional assumptions on different levels have to be met.
Second, the proposed extensions of single-level non-linear structural equation models specify a parametric non-linearity (by involving products of latent variables). Recently, a strong emphasis has been placed on the relaxation of this simple functional relationship, including mixtures of latent variables that also allow for non-normally distributed variables (e.g., Bauer, 2005;Kelava et al., 2014). Therefore, on the one hand there is a need for an optional specification of a semiparametric relationship of the latent variables (at the within and between levels) to better approximate the non-linear reality. On the other hand, there is a need for an optional specification of mixtures that can account for non-normality or heterogeneity across subpopulations.
Third, the application of single-level non-linear structural equation modeling in substantive research has suffered from too many approaches that use the same distributional assumptions (see paragraphs above) and too few simulation studies that offer clear recommendations for the application of specific approaches (for an overview, see Kelava et al., 2011). Approaches that agree with regard to distributional assumptions may lead to contradictory results; that is, some approaches might suggest significant non-linear effects, whereas others might not. Substantive researchers cannot solve this kind of problem by referring to empirical data. Further information that is based on simulation studies (for single-level non-linear models see e.g., Brandt et al., 2014) is needed here.
In total, there is a need for a framework that incorporates several special cases of multilevel modeling and that offers general as well as specific solutions for both substantive and methodological research in non-linear latent variable modeling. From a substantive standpoint, non-linear hypotheses (e.g., interactions) can be examined in more detail. From a methodological standpoint, the framework will foster the comparison of different kinds of estimators (e.g., MCMC, ML, or moment methods) in the context of different distributions.
As a result of these considerations, in the next section, we will present a general non-linear multilevel structural equation mixture modeling (GNM-SEMM)framework that allows for the separate relaxation of distributional and linearity assumptions of the latent variables and their relationships on different levels of a nested data structure. We will provide several theoretical and practical examples to illustrate what is possible within the framework. In general, within this framework, it is possible to derive specific submodels that include crucial parts of the model as well as a combination of several aspects that have not been combined before.

A GENERAL NON-LINEAR MULTILEVEL STRUCTURAL EQUATION MIXTURE MODEL
In this section, we will present a GNM-SEMM framework that allows for semiparametric latent non-linear effects on the within and the between levels. The framework presented here is similar to the general multilevel mixture model and notation presented by Muthén and Asparouhov (2009). Whereas Muthén and Asparouhov's (2009) model focuses only on linear relationships, the GNM-SEMM framework accounts for non-linear semiparametric relationships of the manifest and latent variables involved. This allows for a more precise modeling of latent variable relationships at different data levels while relaxing the linearity assumptions of standard latent multilevel frameworks (e.g., Rabe-Hesketh et al., 2004).

Definition
Let y jik be the score of the j-th (j = 1, . . . , J) observed (indicator) variable for individual i (i = 1, . . . , N k ) in a cluster k (k = 1, . . . , K). Note that the individual index i is cluster-specific. Its range depends on the cluster size N k (e.g., the number of pupils in a given school k is denoted as N k ). Let z lk be the score of the l-th (l = 1, . . . , L) observed (indicator) variable for cluster k. The observed scores y jik and z lk could be realizations of dichotomous, ordered categorical, continuous normally distributed, or count variables. Categorical (mixture) variables are used for the definition of mixtures on the individual (within) and cluster (between) levels. Let C ik be an within-level latent categorical variable for individual i in cluster k, which takes values 1, . . . , C * d . Let D k be a betweenlevel latent categorical variable for cluster k, which takes values 1, . . . , D * . Note that the number of latent classes on the withinlevel may be different across the latent classes on the betweenlevel.
Analogous to Rabe-Hesketh et al. (2004), Muthén (1984), and Muthén and Asparouhov (2009), for observed dichotomous and ordered categorical variables, the underlying normally distributed latent variables y * jik and z * lk are defined such that for a set of threshold parameters τ jscd and τ ls d , and categories s and s , respectively, the following equations hold for each subject i in cluster k: where the vertical bar ·|· indicates a "conditional on" statement, and ↔ indicates an equivalence. For continuous normally distributed variables, y * jik = y jik and z * lk = z lk are assumed, and for count variables, y * jik = log(λ jik ) and z * lk = log(λ lk ) hold, where λ jik and λ lk are the expectations of the Poisson distribution. Additional assumptions regarding the mean and covariance structure will be made in the following subsections, which will specify the measurement and structural models on the within and between levels.

Example
Suppose that pupils from several schools take part in a math test. For a given pupil i from school k the score on a sub-task j from the math test is given by y jik . In addition, for school k, there is a score z lk that indicates the school's social problems (e.g., the degree of bullying reported by the principal). In Figure 1, two latent categorical variables C ik and D k on the within-level (Level 1) and the between-level (Level 2), respectively, are introduced. These variables may account for heterogeneity that occurs in the scores on both levels. On Level 1, heterogeneity in the distribution of the math test may occur due to additional private lessons in math that some pupils received. On Level 2, heterogeneity may occur in the distribution of the school's social problems, for example, due to the general (unobserved) socioeconomic status of the neighborhood where the school is located. Furthermore, school k might belong to an unobserved group of schools D k = d that explicitly prepared for the math test. This may then influence the distribution of the math scores. Figure 1 shows a diagram with the observed and mixture variables. At this stage, there is no model that can explain the relationship between the scores y jik and z lk and no measurement model that can describe the realizations of the scores. The mixtures are indicated by C ik and D k .

Definition. Let y *
ik be the J-dimensional vector for individual i in cluster k that includes scores for all dependent observed within variables. The measurement model is defined by a mixture distribution model where ν 1kcd is a J-dimensional vector of latent intercepts, 1kcd is a J × m (f 1 ) loading matrix. η 1ikcd = (η 11ikcd , . . . , η 1ikmcd ) is an m-dimensional vector of variables including all latent exogenous and endogenous variables. f 1 ( · ) is a smooth polynomial function mapping the m-dimensional variable vector η 1ikcd to an m (f 1 )dimensional vector f 1 (η 1ikcd ). f 1 (η 1ikcd ) could be a vector that includes product variables [e.g., (η 11ikcd , η 12ikcd , η 11ikcd η 12ikcd ) FIGURE 1 | Observed variable scores y jik (within-level) and z lk (between-level) as well as mixtures C ik (within-level) and D k (between-level).
www.frontiersin.org July 2014 | Volume 5 | Article 748 | 3 or (η 11ikcd , (η 11ikcd ) 2 , η 12ikcd , (η 12ikcd ) 2 ) ] (e.g., Schumacker and Marcoulides, 1998;Kelava et al., 2011) or splines (Freund and Hoppe, 2007). K 1kcd is a J × Q (g 1 ) matrix with regression coefficients. x 1ik is a Q-dimensional vector of all observed unexplained (within) covariates that may have an additional influence on the indicator variables y * ik . g 1 ( · ) is a smooth polynomial function mapping the Q-dimensional vector of covariates to a Q (g 1 )dimensional vector g 1 (x 1ik ), and 1ikcd is a J-dimensional vector of residual variables with a zero mean vector and covariance matrix 1kcd . For observed categorical variables y ik , a normality assumption for 1ikcd is equivalent to a probit regression for y ik on η 1ikcd and x 1ik . Alternatively, for dichotomous variables y ik , 1ikcd can have a logistic distribution, resulting in a logistic regression. For count variables y ik , the residual 1ikcd is assumed to be zero. For normally distributed continuous variables y ik , the residual variable 1ikcd is assumed to be normally distributed.

Example.
Suppose that in the above-mentioned math test example, data for two additional constructs (attitude toward reading and the teaching strategies experienced by the student) were collected with three items for each construct. The measurement model [cp. Equation (3)] is illustrated in Figure 2, and accordingly, it assumes two latent factors η 11ikc (attitude toward reading) and η 12ikc (experienced teaching strategies). For didactical purposes, all schools here belong to one class D = 1, so that the index d can be omitted, and there is no between-level model. Furthermore, heterogeneity is assumed on the withinlevel such that each pupil i belongs to an unobserved class (mixture) C ik = c. The example measurement model derived from the framework above is a confirmatory factor mixture model that is given by y ik | C ik =c = ν 1kc + 1kc η 1ikc + 1ikc . The heterogeneity, which is implied by the mixture c, can be accounted for differently by the (statistical) model depending on the hypothesized population model: First, a non-normal distribution of the latent variables can be modeled as a mixture distribution. For example, attitude toward reading might not be normally distributed. A mixture distribution of η 11ikc (with varying expectations and FIGURE 2 | A measurement model for subject i for two latent variables with a mixture distribution on the within-level (the between-level ith not included in this example). The mixture distribution is symbolized by the frame with dashed lines. It was assumed that all subjects belonged to one latent class D = 1 on the between-level so that the index d could be omitted.
covariance structure for each mixture component c) could represent the non-normality (see Kelava et al., 2014). Second, the measurement model might be completely different for each unobserved subgroup (with varying factor loadings etc.). For example, some pupils might have poor reading skills, and hence, do not understand the items well enough. As a consequence, factor loadings in this subgroup may be lower (or residual variances may be larger) compared with other subgroups. and such differences may lead in turn to an observed heterogeneity.

Structural model
The structural model for the latent variable vector η 1ikcd is given for each subject i in cluster k by (4) where α kcd is an m-dimensional vector of intercepts, B 1kcd is an m × m (F 1 ) loading matrix. F 1 ( · ) is a smooth polynomial function mapping the m-dimensional vector of latent variables η 1ikcd to an m (F 1 ) -dimensional vector F 1 (η 1ikcd ). 1kcd is an m × Q (G 1 ) matrix with regression coefficients. G 1 ( · ) is a smooth polynomial function mapping the Q-dimensional vector of covariates x 1ik to a Q (G 1 ) -dimensional vector G 1 (x 1ik ). Note that for identification purposes, vector G 1 (x 1ik ) has to be completely different from vector g 1 (x 1ik ). ζ 1ikcd is an m-dimensional vector of residual variables with zero mean vector and covariance matrix 1kcd .

Mixture part
The model for the latent categorical variable C ik is a multinomial logit model where a 1kcd and b 1kcd are regression coefficients, and h 1 ( · ) is again a smooth (e.g., polynomial) function.

Example.
In the following illustrative example, the math skills of pupil i from school k (η 13ikc ) are predicted by the attitude toward reading (η 11ikc ) and by experienced teaching abilities (η 12ikc ; see also the example above). All three constructs are modeled as latent variables, which are measured with three indicator variables each. In addition, we assume that math skills can be predicted by gender, which is introduced into the model as an observed covariate (x 11ik ). For simplicity, the model is restricted to the within-level. Furthermore, it is assumed that there is unobserved heterogeneity due to a latent class C ik . Membership in one of the latent classes is predicted by a second observed covariate x 12ik (e.g., additional private math lessons). In contrast to an ordinary linear approximation of the relationship between the latent variables, the unknown and potentially curvilinear relationship is approximated by a latent spline model. Figure 3 illustrates the proposed model; the semiparametric spline model is indicated by the snake-type arrow.

LEVEL 2 -BETWEEN (CLUSTER) LEVEL
The multilevel (between) part of the model is conceptualized as follows. Each of the intercepts (ν 1kcd , α kcd , a 1kcd ) and slopes or loading parameters ( 1kcd , K 1kcd , B 1kcd , 1kcd , b 1kcd ) in Equations (3), (4), and (5) can be either a fixed coefficient or a random effect that varies across the observed clusters k.

Structural model
Let η 2kd be the U-dimensional vector of all such random effect variables and any additional between-level latent exogenous variables that explain these random effects and vary across the clusters. Note that η 2kd is different from η 1ikcd which is the individual-level latent variable vector. For a given cluster k, the between-level structural model for η 2kd is defined as (6) where μ d is a U-dimensional vector of intercepts, and B 2d is a U × U (F 2 ) loading matrix. F 2 ( · ) is a smooth polynomial function mapping the U-dimensional vector of variables η 2kd to a U (F 2 ) -dimensional vector F 2 (η 2kd ). 2d is a U × V (G 2 ) matrix with regression coefficients. x 2k is a V-dimensional vector of all observed unexplained between-level covariates that may have an additional influence on the variables in vector η 2kd . Note that x 2k is different from x 1ik . G 2 ( · ) is a smooth polynomial function mapping the V-dimensional vector of between-level covariates x 2k to a V (G 2 ) -dimensional vector G 2 (x 2k ). ζ 2kd is a U-dimensional vector of residual variables with a zero mean vector and covariance matrix 2d . μ d , B 2d , and 2d are fixed parameters. Figure 3 is extended to allow for multilevel effects on the between-level (Level 2). In Figure 4 depicts a latent random intercept model that implies a school-specific intercept (α 3kd ) for school k when the math skills (η 13ikd ) of a given pupil i are examined. In order to approximate a potentially non-normal distribution of the school-specific intercepts or to reveal a certain heterogeneity in the latent intercepts (i.e., average math skills), a latent mixture model with the latent categorical variable D k is applied. This mixture reflects Level-2 heterogeneity that may stem from (unobserved) sources, for example, certain school characteristics that influence the average math skills in school k.

Measurement model
Let z * k be the L-dimensional vector for cluster k that includes scores on all observed between-level variables that are indicators of the latent variables in vector η 2kd . For a given cluster k, the measurement model is defined by where ν 2d is an L-dimensional vector of intercepts, 2d is an with regression coefficients. x 2k is the V-dimensional vector of all observed unexplained between-level covariates that may have an additional influence on the indicator variables z * k . g 2 ( · ) is a smooth polynomial function mapping the V-dimensional vector of between-level covariates x 2k to a V (g 2 ) -dimensional vector g 2 (x 2k ). Note that for identification purposes g 2 (x 2k ) has to be completely different from G 2 (x 2k ). 2kd is a L-dimensional vector of residual (mixture) variables with a zero mean vector and covariance matrix 2d . ν 2d , 2d , and K 2d are fixed parameters.
FIGURE 4 | Structural model for subject i in cluster k with a nonlinear spline relationship between the latent variables on the within-level (indicated by the snake-type arrow) and a random intercept (α 3kd ) that is modeled as a mixture of normal distributions on the between-level.

Mixture part
The model for the between-level categorical variable D k is also a multinomial logit regression where a 2d and b 2d are regression coefficients, and h 2 ( · ) is again a smooth (e.g., polynomial) function.

Example.
In this last example (see Figure 5, the random intercept model in Figure 4 has been expanded by adding two latent Level-2 predictor variables (η 21kd and η 22kd ) that may influence the average math-skill level, for example, structural problems and social problems in school. Besides the linear effects of the latent predictors, there is an interaction effect that models the hypothesis that high scores on both between-level predictors FIGURE 5 | Structural model for subject i in cluster k with a spline relationship between the latent variables on the within-level (indicated by the snake-type arrow), and a random intercept (α 3kd ) that is predicted by an interaction model on the between-level. The distribution of the between-level's predictors is approximated by a mixture of normal distributions. The latent categorical variable D k is predicted by a between-level covariate x 21k . may lead to a particularly low (or high) average math-skill level. A potential heterogeneity of the latent predictors (e.g., a nonnormal distribution) is taken into account by introducing a latent categorical variable D k . In addition, a manifest predictor variable x 21k , for example, school size or school type (private or public), is included in the model to predict the latent class probability of D k as described more generally in Equation (8).

SUMMARY
In the model described by Equations (3) to (8), the latent variables on Level 1 (η 1ikcd , 1ikcd , and ζ 1ikcd ) and on Level 2 (η 2kd , 2kd , and ζ 2kd ) are conceptualized as variables stemming from mixtures on level 1 and level 2, respectively. The possibility of specifying within-and between-level mixture components is a result of introducing the latent categorical variables C ik and D k on the individual and cluster levels, respectively. On the within-level, unobserved latent classes may refer to different subpopulations (within each cluster), for example, pupils with different socioeconomic backgrounds in a given school. On the between-level, latent mixtures additionally allow for a specification of heterogeneity across/between observed clusters, for example, heterogeneity that is caused by certain characteristics of the schools. Furthermore, due to the conceptualization of mixture variables, a semiparametric modeling of non-normally distributed latent variables is available (e.g., Yang and Dunson, 2010;Kelava and Nagengast, 2012;Kelava et al., 2014), or a simple semiparametric formulation of the latent relationships (e.g., Bauer, 2005) is possible. Finally, the implementation of general polynomial functions F 1 ( · ), f 1 ( · ), G 1 ( · ), and g 1 ( · ) allows for a flexible inclusion of different parametric or semiparametric relationships (e.g., interaction effects or splines; Hastie et al., 2009), which extends the opportunities to model non-linear effects (e.g., Guo et al., 2012;Song et al., 2013).

MODEL IDENTIFICATION
As in any other latent variable framework, within the GNM-SEMM framework, the user must ensure that the specified model is identified. In this section, we will summarize important aspects that need to be considered even though model identification is not straightforward (cf. San Martín et al., 2011;Song et al., 2013). For the identification of the proposed model, four separate aspects need to be taken into account. However, the actual identification of a specific model needs to be examined individually.
First, within each mixture component standard assumptions for non-linear structural equation models need to be met. This mainly implies that restrictions be placed on manifest scaling variables or latent exogenous variables (e.g., a necessary condition for the identification is to set one factor loading for each latent predictor variable or the latent predictors' variance to one). In addition, either the latent intercepts of the indicator variables or the latent intercepts of the latent variables may be estimated in a model. Note that when latent exogenous variables (e.g., η 11ikcd , η 12ikcd ) are identified, their latent product terms (e.g., η 11ikcd η 12ikcd ) do not need product indicators for identification (cf. Klein and Moosbrugger, 2000).
Second, regarding the inclusion of polynomial functions for the observed covariates, it is necessary that the vectors g 1 (x 1ik ) and G 1 (x 1ik ) on Level 1 and, respectively, the vectors g 2 (x 2k ) and G 2 (x 2k ) on Level 2 are completely different from each other. For example, a model including g 1 (x 1ik ) = G 1 (x 1ik ) = (x 11ik , x 2 11ik ) would not be identified because x 11ik would be a predictor in the measurement and structural models [see Equations (3) and (4)]. In this case, two effects of x 11ik would be estimated simultaneously on the right side of one regression equation, which would not be identified. The same holds for the polynomial functions of the latent variables. Again, f 1 (η 1ikcd ) and F 1 (η 1ikcd ) on Level 1 as well as f 2 (η 2kd ) and F 2 (η 2kd ) on Level 2 have to be unequal [see Equations (7) and (6)] 2 . Otherwise, perfect collinearity would be the result, meaning that the covariates and latent variables, respectively, would have the same influence on the measurement and the structural models. Their impacts would not be separable. Furthermore, polynomial (semiparametric) functions should not include constants. Otherwise, latent intercepts in the measurement and structural models would not be identified.
Third, on the between (cluster) level the inclusion of latent exogenous variables, which explain the variability in the random coefficients, requires measurement models (see Figure 5). The exogenous latent variables at Level 2 need to be identified as well according to identification rules, which are the same as in single-level structural equation models.
Fourth, additional assumptions concerning the latent classes of the mixture components are required. For the identification of the discrete latent variables, (a) the unconditional probabilities in Equations (5) and (8) need to sum up to one. and (b), the ambiguity of mixture components due to the so-called label switching problem makes it necessary to impose additional (artificial) constraints or relabeling strategies e.g., restrictions on the mean structure or ordinality of mixture proportions (see Equations 15-19; Redner and Walker, 1984;Stephens, 2000;Kelava and Nagengast, 2012).
Note that the identification of separate parts of a model (e.g., the measurement model and the structural model) does not automatically imply that the whole model is identified. General necessary and sufficient conditions to guarantee the identifiability of a latent variable model are difficult to establish. Hence, we focus primarily on the necessary identification conditions in this article.

MODEL ESTIMATION
Generally speaking, latent variable modeling offers a large variety of methods for the estimation of specified models. The choice of the best estimation method strongly depends on the distributional assumptions of the observed and latent variables, the given sample size, the type of specified model, potential confounders, and many more aspects. Just to mention a few large classes, these methods comprise maximum likelihood estimators (e.g., Jöreskog, 1973;Rabe-Hesketh et al., 2005;Muthén and Asparouhov, 2009), least squares methods (e.g., Joreskog and Goldberger, 1972;Browne, 1974Browne, , 1984, and methods of moments (e.g., Bentler, 1983), among others. For example, when applying a maximum likelihood estimator, in the well-known EM algorithm (Dempster et al., 1977), which treats latent variables as missing data, the likelihood L of the observed indicator vector y is given as: where f 1ikcd ( · ), ψ 1ikcd ( · ), and ψ 2kd ( · ) are probability density functions for the observed variables y, and the latent variables η 1ikcd and η 2kd , respectively (cf. Muthén and Asparouhov, 2009). Because the likelihood function L of the observed indicator vector y ik is not given in closed form in general, numerical integration can be utilized in the evaluation of the likelihood using both adaptive and non-adaptive quadrature. As an alternative, the likelihood could be directly optimized by applying a quasi-Newton algorithm. Both approaches of estimating parameters are very complex due to the non-linearity (for a discussion of latent interaction effects, see Klein and Moosbrugger, 2000). However, in recent years, the Bayesian framework has become very popular in latent variable modeling (e.g., Lee et al., 2004;Lee, 2007;Lee et al., 2007;Song et al., 2009). The main reason is that it provides flexible options for specifying and estimating models. Bayesian estimation methods and algorithms (e.g., Markov Chain Monte Carlo: MCMC) can handle numerous complex parametric, semiparametric, and non-parametric relationships and distributions, for example, latent mixture distributions (e.g., Yang and Dunson, 2010;Kelava and Nagengast, 2012), non-linear models (e.g., Lee et al., 2007;Guo et al., 2012;Song et al., 2013), and multilevel structures (e.g., Fox and Glas, 2001;Song and Lee, 2004). Referring to the proposed GNM-SEMM framework with its semiparametric functional forms and its capability of considering non-normally distributed variables, a Bayesian approach seems to be a viable way to estimate models. In this sense, we will provide general descriptions of the specifications of the variables' distributions and the selection of prior distributions.

Level 1
For the Bayesian analysis, the j = 1, . . . , J indicator variables on Level 1 are specified as a cluster-specific mixture distribution. The single mixture is given as where μ y * (θ M1kcd , θ S1kcd , x 1ik ) is the vector of conditional expectations of y * ik , which are specified in Equation (3) and depend on the parameter vectors θ M1kcd and θ S1kcd , and on the covariate vector x 1ik . −1 1kcd is the precision matrix of the multivariate normal distribution of the measurement error variables (i.e., the inverse of the covariance matrix). The model implies a specific mean vector and covariance matrix for subjects stemming from a certain latent class c on Level 1 that is clustered in a latent class d on Level 2, which in turn is given for an observed cluster k. Within each cluster k, y * ik is a mixture of D * components, which model heterogenity in the observed clusters. Further, within in each mixture component d, y * ik is a mixture of C * d components, which induce heterogenity on the individual level (C * d may change across different latent classes on Level 2).
The latent variables η 1ikcd on Level 1 are specified as with the vector μ η 1 (θ S1kcd , x 1ik ) of conditional expectations for η 1ikcd that depend on the parameter vector θ S1kcd and covariate vector x 1ik as specified in Equation (4) as well as in the precision matrix −1 1kcd .

Level 2
Analogous to the specification of the variables' distributions on Level 1, the indicator vector z * k is modeled as with the vector μ z * (θ M2d , θ S2d , x 2d ) of conditional expectations for z * k as specified in Equation (7) and precision matrix −1 2d . The unconditional indicator vector z * k is composed of D * mixture components. Finally, the distribution of the latent variable vector η 2kd , is given as with the vector of conditional expectations μ η 2 (θ S2d , x 2k ) specified in Equation (6) and precision matrix −1 2d .

SPECIFICATION OF PRIOR DISTRIBUTIONS
For the prior specification, informative or non-informative priors can be selected (Gelman et al., 2004). This selection is primarily based on the availability of prior knowledge. Because the application of non-informative priors may lead to suboptimal solutions (e.g., Lee et al., 2007), it may be necessary to analyze parts of the model (e.g., a confirmatory factor analysis for the Level-2 predictors) to obtain information about the parameters. Here, a very general description of the proposed model is provided. For a detailed description of priors see Gelman et al. (2004). The class probabilities Pr (C ik = c|D k = d, x 1ik ) and Pr(D k = d|x 2k ) depend on the multinomial logit models given in Equations (5) and (8) and thus depend on the parameters in θ m1 and θ m2 . For these parameters, uninformative priors are suggested unless information about heterogeneity is available (see also Kelava and Nagengast, 2012).
For each precision matrix of the mixture distributions defined above, that is for −1 1kcd , −1 2d for the indicator variables, and for −1 1kcd , −1 2d for the latent variables, a multivariate normal distribution is assumed within each component. Conjugate priors are then given for c = 1, . . . , The hyperparameters ρ and the (positive definite) matrices 01kcd , 02d , 01kcd , and 02d of the Wishart distribution include parameter information that may stem from previous studies or knowledge about the parameters. For example, 02d includes information about the variances and covariances of the random coefficients, and about the latent endogenous and exogenous variables on Level 2. This information may refer to estimates of the (co)variances for the latent exogenous variables retrieved from a separately estimated confirmatory factor analysis.
The conjugate priors can be modified, for example, if the residual covariance matrix 2d on Level 2 is assumed to be diagonal, then each diagonal element j 2d (j = 1, . . . , J) can be assumed to be inverse Gamma distributed, that is ( ) (with hyperparameters α, β) (Kelava and Nagengast, 2012). Further information about the selection of priors for count or ordinal data can be found in Song et al. (2013).
For the other parameters in θ M1 , θ S1 , θ M2 , and θ S2 , normally distributed priors are used within each mixture component. Though, the definition of some priors needs to be formulated recursively (cf. Kelava and Nagengast, 2012). For example, let ν j 1kcd be the j-th element of the vector ν 1kcd (which specifies the intercept of the j-th variable in y * ik | C ik = c,D k = d ), and let j 1kcd be the j-th diagonal element in the matrix 1kcd . Then for the latent classes c = 1, d = 1, the conjugate (normal) prior for ν j 1k11 is specified as with hyperparameters H 0 and ν j 01k11 that include information about ν j 1k11 . For all other latent classes, that is c > 1 or d > 1, the following prior is selected: If parameters are constrained to be the same across mixture components (e.g., ν 1kcd = ν 1k and 1kcd = 1k ), Equations (15) to (19) simplify to For the other parameter matrices, that is for 1kcd , K 1kcd , α kcd , B 1kcd , 1kcd and so forth on Level 1 and ν 2d , 2d , K 2d , μ d , B 2d , 2d and so forth on Level 2, a specification corresponding to the formulation above given is straightforward when the appropriate precision matrices are used. In order to avoid the label-switching problem in a mixture distribution, only one of the parameter matrices needs to be formulated recursively.

EMPIRICAL EXAMPLE
In this section, we will provide an extensive illustration of the GNM-SEMM with an example that is based on data from the Program for International Student Assessment 2009 (PISA; Organisation for Economic Co-Operation and Development, 2010), which is publicly available under http://pisa2009.acer.edu. au/downloads.php. The sample was a German subsample of N = 1, 474 pupils from 226 schools who took a math test. Additional covariate information were available on the individual level as well as on the school level. As before, we predicted pupil's math skills (Math) with their general attitude toward reading (Att) and the teaching strategies they experienced (Strat). We further expected that pupil's average math skills (latent intercept of Math) would vary systematically across schools 3 , and that this variation could be (partly) accounted for by Level-2 predictors with measurement errors, here, structural problems in school (Prob) and the schools's social environment (Soc).
We will report the results for a model that accounted for different aspects of the general model. The example is not exhaustive with regard to all potential parameters within the GNM-SEMM framework, but it provides an indication of the flexibility of the proposed framework in accommodating different aspects of the data: A spline model on Level 1 described a semiparametric flexible relationship between Att, Strat, and Math. A random intercept for Math was explained by the Level-2 predictors Prob and Soc, and the interaction effect between them. Furthermore, a mixture model accounted for the non-normality of the latent predictors on Level 2 (heterogeneity).

MODEL FORMULATION
In the following, we will provide the specified measurement and structural equations for the model. For reasons of clarity, we restricted the subscripts (k, c or d) in the model formulation to those model parameters that actually depended on the latent classes or the Level-2 model. Figure 6 presents a diagram of the model and its parameters.

Structural models
The Level-1 structural model [cf. Equation (4)] for the i-th pupil in school k was given by where F 11 and F 22 both defined a latent cubic spline model with two knots at ξ 1 = 2, ξ 2 = 3 that approximated the (curvilinear) FIGURE 6 | Structural models and measurement models on the within-level (Level 1) and between-level (Level 2). On Level 1, the math skill (Math) of a pupil i is predicted by his/her general attitude toward reading (Att) and his/her experienced teaching strategies (Strat). The snake-type arrows indicate a flexible spline approximation of the latent variable relationship. On Level 2, the average math skills of pupils (latent intercept α 3k ) in school k are explained by a nonlinear interaction between structural problems in the school (Prob) and the school's social environment (Soc). The non-normality of the latent predictors is approximated by a mixture distribution.
www.frontiersin.org July 2014 | Volume 5 | Article 748 | 9 relationships between the variables (e.g., Hastie et al., 2009): Only the latent intercept α 3k was assumed to vary across schools. The Level-2 structural model [cf. Equation (6)] for school k was given by with η 2kd = (Prob kd , Soc kd , α 3k ) and F 2 (η 2kd ) = (Prob kd , Soc kd , Prob kd · Soc kd ) . The product term Prob kd · Soc kd implemented the interaction effect of the structural problems in school and the social environment. Because the non-normal distributions of the latent predictors were approximated by a mixture distribution, their expectations μ 1d and μ 2d were assumed to vary across the unobserved mixtures (Kelava and Nagengast, 2012).

Measurement models
For each of the latent variables between nine and 13 items were available; they were aggregated to three indicator variables for each latent variable (item parcels) for didactic purposes. It was assumed that the indicator variables were continuously distributed, resulting in an identity link function in the measurement model (y * ik = y ik and z * k = z k , respectively). On Level 1, the measurement model for pupil i in the k-th school [cf. Equation (3)] was given by where f 1 (η 1ik ) = (Att ik , Strat ik , Math ik ) . On Level 2, the measurement model [cf. Equation (7)] was given by where f 2 (η 2kd ) = (Prob kd , Soc kd ) . The factor loading matrices 1 and 2 were formulated with a simple structure (i.e., each item loaded on only one latent variable). The residual variables 1ik and 2ik were assumed to be mutually uncorrelated and normally distributed with zero mean vectors and (diagonal) covariance matrices 1 and 2 , respectively.

Parameter constraints and identification
Besides employing the standard identification constraints for structural equation models, we restricted the measurement model parameters and the structural model parameters to be the same across schools except for the latent intercept α 3k . Due to the invariance of the measurement models for the latent predictors on Levels 1 and 2, in Equations (24) and (25) the non-linear effects in the polynomial spline model and the interaction effect in Equations (22) and (23) were identified. For the mixture model, we fit two latent classes (D k = 1, 2).

MODEL ESTIMATION
To keep this example as simple as possible, missing data were assumed to be missing at random, and this was accounted for directly in the analysis by applying the Gibbs sampler (Gelman et al., 2004). The analysis of the latent multilevel model was implemented by using the R-project software (R Core Team, 2013) and the OpenBugs package (Lunn et al., 2009). Syntax for the empirical example can be obtained upon request from the authors.

Starting values and prior selection
Starting values for the measurement model parameters were based on the prior analyses conducted in Mplus Muthén (1998-2010) for separate parts of the model. Informative priors were then selected in accordance with recommendations by Gelman et al. (2004) as well as Kelava and Nagengast (2012).

Bayesian analysis
For the analysis, three chains with 100,000 iterations each were generated. The first 75,000 iterations (burn in) were then discarded. As proposed by Gelman (1996), convergence of the estimation procedure was achieved when all (EPSR Estimated Potential Scale Reduction;Gelman, 1996) values were below 1.2, which occurred after about 60,000 iterations (see the Supplementary Material, Figure S1). Trace plots also indicated good convergence (see the Supplementary Material, Figure S2). Means, standard errors, t-values, and percentiles of the posterior distributions of the parameter estimates based on the last 25,000 iterations are reported in the next subsection.

RESULTS
We will summarize the main results in this subsection. Detailed results for the estimated multilevel model are presented in Table 1. In the measurement models, the factor loadings were all significant and positive, thus indicating that the latent constructs were measured reliably.
The results for the semiparametric approximation of the true relationships between the Level-1 latent variables Att, Strat, and Math are illustrated in Figure 7. The relationship between Math and Att resembled a cubic relationship; the subjects' Math scores slowly increased with increasing Att scores, whereby a stronger increase was found for Att scores greater than 3 and a slight decrease for Att scores greater than 4. The relationship between Strat and Math seemed to be slightly quadratic with the highest Math scores for medium Strat scores.  In order to test the hypotheses on the cubic relationship for Att and the quadratic relationship for Strat 4 , we estimated a model that changed Equation (22) to β 1 F(Att ik ) = β 11 Att ik +

FIGURE 7 | Semiparametric Level-1 relationships between pupils' math skills (Math) and their general attitude toward reading (Att; left), and Math and experienced teaching strategies (Strat; right).
The gray crosses indicate the predicted slope with a predicted school-specific random intercept; the black line indicates the predicted Math score for the mean random intercept. β 12 Att 2 ik + β 13 Att 3 ik and β 2 F 12 (Strat ik ) = β 21 Strat ik + β 22 Strat 2 ik . Results for the structural parameters on the within-level can be found in Table 2. The parametric cubic relationship for Att was not significant (β 13 = 0.003, p = 0.745 for the cubic effect andβ 11 = −0.045, p = 0.723 for the linear effect). The attitude toward reading did not significantly predict the math ability. The parametric model for Strat indicated a significant negative quadratic relationship (β 22 = −0.034, p = 0.037). This indicated that pupils' math skills were highest for those subjects who rated the experienced teaching strategies as average.
On Level 2, the random intercept factor α 3k had a significant negative intercept (μ 3 = −0.365, p = 0.024) and an unexplained variance across schools ofψ 233 = 0.051. The linear effects of the predictors were significant withβ 3 = 0.558 (p < 0.001) for school problems (Prob) andβ 4 = 0.442 (p < 0.001) for social problems (Soc). The interaction effect was significant and negative withβ 5 = −0.289 (p < 0.001). Figure 8 illustrates the complex non-linear association between Prob, Soc, and the random intercept α 3k . The expected math level of a school with an average score on school and social problems was about 0.5 (E[α 3 |Prob = Prob, Soc = Soc] = 0.461); the expected math level was higher in schools for which one of the problems was above average and the other was below average; and the math level decreased rapidly when both problems increased together.

FIGURE 9 | Predicted slightly non-normal densities of the Level-2 predictors Prob and Soc obtained from a two-class solution.
Finally, the results of the mixture model for the Level-2 predictors are illustrated in Figure 9. As can be inferred from Figure 9, the distribution of the latent variables was slightly non-normal. In this empirical example, the means of the latent variables in the two classes were marginally different (with means of aboutμ 11 ≈ μ 21 ≈ 1.9 in Class 1 andμ 12 ≈μ 22 ≈ 2.1 in Class 2). Additional analyses may reveal the necessity to increase or decrease the number of latent classes (e.g., using the DIC). Here, the DIC was 14,780 for a model including the mixtures and 14,770 for a model without the mixture distribution. This indicates that a mixture may not have been necessary in this case.

DISCUSSION
In this article, we presented a generalized non-linear multilevel structural equation mixture model (GNM-SEMM) framework.
A key characteristic its ability to specify non-linear functional relationships between outcome variables on one side and latent predictors or manifest covariates on the other side by using semiparametric regression functions (e.g., splines; Freund and Hoppe, 2007;Hastie et al., 2009). This feature is given for both levels, the within and between (cluster) levels of nested data structures. Given that (multilevel) latent variable modeling frameworks are typically linear (Bollen, 1989;van der Linden and Hambleton, 1997;Rabe-Hesketh et al., 2004;Muthén and Asparouhov, 2011), the relaxation of the linearity assumption is a step forward toward a more realistic approximation of a non-linear world. It thus extends the hitherto available multilevel modeling frameworks.
A second key characteristic is the ability to specify latent mixture distributions on both levels. As in recent semiparametric latent variables approaches (e.g., Bauer and Curran, 2004;Bauer, 2005;Kelava et al., 2014), this allows for an approximation of non-normally distributed latent predictor variables for a thorough introduction with regard to manifest variables, see McLachlan and Peel (2000). Again, the relaxation of a typical assumption that can be found in most applications of latent variable modeling allows for a more precise modeling of relationships for heterogeneous populations or distributions.
A third key characteristic of the proposed approach is that it is flexible enough to specify a large number of special cases. For example, it offers the ability to approximate a non-normal distribution using mixture modeling and provides an easy way to interpret the parametric functional form of the latent variable relationship. As another example, it is possible to specify a non-linear latent variable relationship in one subpopulation but not in the other. The same is true for different levels. If functional forms of the relationships are unknown, semiparametric approximations of these relationships are also possible using mixtures.
Taken together, these properties are desirable. Nevertheless, the identification and estimation of the models is a general issue. Additional assumptions have to be introduced as was exemplified in the sections before (see Level-1 section on the measurement model). Fortunately, these assumptions are standard identification assumptions in latent mixture, latent (non)linear, and (semi)parametric modeling, but researchers should be careful when specifying models. For example, multiple intercepts in spline models might lead to identification issues. However, the wide range of specifiable models offers a variety of adaptable estimators that could be applied from a theoretical standpoint. Bayesian MCMC, Newton-type algorithms, and adapted EM-Algorithms are just a few examples.
In this paper, we also used a substantive example from educational science. A complex model was applied to data from the large-scale PISA study (Organisation for Economic Co-Operation and Development, 2010) illustrating several conditions that may occur in empirical data. First, an a priori unknown curvilinear relationship between the latent variables was identified on Level 1 using a semiparametric latent spline model. Second, the proposed mixture part on Level 2 could be used to control for the potential non-normality of the latent Level-2 predictors. In this example, only a slight indication of non-normality was visible. The model may have also been extended to include a mixture model on Level 1. Third, on Level 2 a latent random intercept modeled a school-dependent math skill, which allowed us to account for the clustering of the data. The random intercept was predicted by a latent non-linear interaction model. The model may be extended further, for example, to test the linearity assumption on Level 2 of the relationship between the latent variables apart from the interaction effect. Other random effects could also be included. In any case, the specification of these effects should be theory-driven.
Finally, we want to mention two important considerations. The proposed model should be viewed as a general framework that includes a variety of different possible models. A model that includes all aspects as presented in the model section would be highly parameterized and may overfit the data. In each empirical situation, we recommend that the actual applied model be restricted to a simpler model that allows for an adequate but parsimonious representation of the data. A decision concerning the necessity to include different parts of the model depends on the hypothesized model (e.g., random factor loadings in a confirmatory factor model or a latent spline to predict a latent slope in the structural model) and on model comparisons. In the Bayesian framework, Bayesian indices/information criteria for model selection (e.g., the deviance information criterion, DIC: Spiegelhalter et al., 2002;Celeux et al., 2006; or the Bayes factor, Bernardo and Smith, 1994) are the primary model fit indices, although they only allow only for a model comparison to be made, and they are not absolute fit indices. In general, for (both frequentist and Bayesian) non-linear models there are no absolute fit indices (Klein and Schermelleh-Engel, 2010). Hence, a topdown (or bottom-up) strategy using information criteria may be a viable way to improve the model (i.e., to restrict the model to its necessary parts). An illustration of such a strategy for multilevel models in general can be found, for example, in West et al. (2007).
Furthermore, we did not show how to implement the presented framework with statistical software. In this article, a Bayesian estimator was applied and implemented in OpenBugs, thus allowing us to analyze a complete but specific semiparametric non-linear multilevel model. Future research should improve this implementation so that it will be feasibly available within standard statistical latent variable software (e.g., Mplus) that can be directly applied to different models by the substantive researcher.