The Impact of Partial Measurement Invariance on Testing Moderation for Single and Multi-Level Data

Hsiao, Yu-Yu; Lai, Mark H. C.

doi:10.3389/fpsyg.2018.00740

METHODS article

Front. Psychol., 15 May 2018

Sec. Quantitative Psychology and Measurement

Volume 9 - 2018 | https://doi.org/10.3389/fpsyg.2018.00740

This article is part of the Research Topic: Recent Advancements in Structural Equation Modeling (SEM): From Both Methodological and Application Perspectives

Yu-Yu Hsiao¹*

Mark H. C. Lai²

¹Center on Alcoholism, Substance Abuse, and Addictions, University of New Mexico, Albuquerque, NM, United States
²School of Education, University of Cincinnati, Cincinnati, OH, United States

Moderation is a commonly used concept in the social and behavioral sciences. Several studies have examined the implications of moderation effects; however, little is known about how partial measurement invariance influences the properties of tests for moderation effects when categorical moderators are used. Additionally, whether the impact is the same across single-level and multilevel data is still unknown. Hence, the purpose of the present study is twofold: (a) to investigate the performance of the moderation test in single-level studies when measurement invariance does not hold; and (b) to examine whether unique features of multilevel data, such as the intraclass correlation (ICC) and the number of clusters, influence the effect of measurement non-invariance on the performance of tests for moderation. Simulation results indicated that falsely assuming measurement invariance led to biased estimates, inflated Type I error rates, and either gains or losses in power (depending on the simulation conditions) for the test of moderation effects. Such patterns were more salient as the sample size and the number of non-invariant items increased, for both single-level and multilevel data. With multilevel data, the cluster size seemed to have a larger impact than the number of clusters when measurement invariance was falsely assumed in the moderation estimation. The ICC was only trivially related to the moderation estimates. Overall, when testing moderation effects with categorical moderators, employing a model that accounts for the measurement (non)invariance structure of the predictor and/or the outcome is recommended.

Many theories in education and psychology rely on moderators, which in Baron and Kenny's (1986) words, “[affect] the direction and/or strength of the relation between an independent or predictor variable and a dependent or outcome variable” (p. 1174). For many years, social and behavioral researchers have been interested in understanding whether a specific moderation effect occurs as well as what factors may influence the extent of the moderation effect. Numerous methodological studies on different aspects of moderation effects have been conducted in contexts such as multiple regression (Aiken and West, 1991), multiple-group structural equation modeling (multiple-group SEM; Jaccard and Wan, 1996), latent variable models with observed composites (Bohrnstedt and Marwell, 1978; Busemeyer and Jones, 1983; Hsiao et al., 2018), within-subject designs (Judd et al., 1996, 2001), cross-level interactions (Kreft et al., 1995), and Bayesian estimation (Lüdtke et al., 2013).

Much of the methodological research on moderation effects has focused on continuous moderators, and less research has been done on categorical moderators. As an example of the latter, researchers may be interested in how the effect of social support on happiness differs by gender. Gender, a categorical variable, is treated as the moderator, and social support and happiness are the predictor and outcome variables, respectively. In testing such a moderation effect with conventional methods such as multiple regression and multiple-group SEM, researchers implicitly assume that the predictor and the outcome variables are measurement invariant across the categories of the moderator; that is, the measurement characteristics of social support and happiness are the same across gender categories. However, such an assumption is seldom investigated before testing moderation effects. Additionally, little is known about how measurement non-invariance influences the estimation of moderation effects. Hence, it is worth investigating whether measurement invariance of both the predictor and the outcome variables with respect to the moderator categories is a necessary prerequisite for testing a moderation effect.

Measurement invariance (MI) is an important issue in a variety of social and behavioral research settings, especially when the data are collected from multiple populations (Millsap and Kwok, 2004). Full MI holds when individuals with identical ability but from different groups have the same propensity to get a particular score on that specific ability scale (Yoon and Millsap, 2007). Under the multiple-group confirmatory factor analysis framework, a simplified but commonly used version of MI analysis can be conducted by testing four hierarchically ordered models across groups: equal model structures (configural invariance), equal factor loadings (metric invariance), equal intercepts (scalar invariance), and equal unique factor variances (strict invariance; Vandenberg and Lance, 2000; Millsap and Kwok, 2004; Chen et al., 2005; Brown, 2015). Among the four types of MI, metric invariance has been suggested as a basic requirement for prediction (Vandenberg and Lance, 2000), which is closely related to moderation because a moderation effect is the difference in path coefficients across groups. Hence, in this paper we focus on the impact of metric non-invariance on the estimation of moderation effects. We also focus on testing moderation effects with the multiple-group approach, which is generally used for examining measurement invariance.

Previous Research on the Effect of Metric Non-invariance on Prediction

Millsap (1995, 1997, 1998, 2007) delineated several theorems and corollaries for the relationship between MI and prediction bias. Donahue (2006) conducted a simulation study to examine the change in prediction accuracy when the measure of the exogenous (predictor) variable was non-invariant in some of its factor loadings across groups, that is, in the presence of partial metric invariance. Her study found that, if one correctly assumes a partial invariance model for the structure of the latent predictors, the path coefficient estimates on the outcome variables are unbiased even with a larger degree of metric non-invariance (i.e., more non-invariant items) on the latent predictors. However, the study only examined the effects on tests of simple regression coefficients in each group, and not on moderation, which can be defined as the difference in path coefficients across groups. Additionally, the study did not show the consequences of failing to correctly model the non-invariance structure.

Guenole and Brown (2014) used Monte Carlo studies to investigate the impact of ignoring measurement invariance (including metric invariance) on testing linear and nonlinear effects (including moderation effects). They adopted relative bias of the estimated path coefficients and 95% coverage rate of the estimated confidence intervals from both the reference group and focal group. They found biased estimates of the path coefficients from the two groups when two or more (out of six) ignored non-invariant loadings occurred. The same results were observed when the non-invariance occurred for predictors and outcomes¹.

In the present research, we address two gaps in the work of Donahue (2006) and Guenole and Brown (2014). First, we show the degree to which estimates and tests of moderation are affected when researchers incorrectly assume that (metric) MI holds. Second, we examine whether the location of measurement non-invariance, in the predictor or in the outcome variable, makes a difference. Furthermore, we extend their work by investigating the Type I error rate of falsely detecting a null moderation effect and the statistical power of detecting nonzero moderation effects in the presence of non-invariance.

Additionally, Donahue (2006) and Guenole and Brown (2014) focused on single-level data structures, in which all observations were assumed to be independent. However, educational and psychological data often have nested structures (e.g., students nested within classrooms; Kim et al., 2012). For example, a researcher may be interested in how the association between students' motivation and their academic achievement differs between public and private schools. Since students are nested within schools, school type is a moderator defined at the between level and motivation is a predictor defined at the within level. Therefore, the scenario represents a “cross-level” moderation effect. In this situation, the measurement characteristics of motivation and academic achievement are assumed invariant across school types (i.e., public vs. private). It is still unclear how multilevel metric (non)invariance across groups at the between level influences cross-level moderation effects. Therefore, we also show how unique features of multilevel data affect the MI-moderation relationship².

Study 1

In Study 1, we aimed to show the effect of measurement non-invariance on the power and Type I error rate when testing a moderator with two categories. Both the predictor and the outcome have a measurement structure, and the moderation effects are tested with the multiple-group approach, as shown in Figure 1. Specifically,

$$\begin{array}{l}
X_{g} = λ_{X g} F_{X g} + δ_{g}, \\
Y_{g} = λ_{Y g} F_{Y g} + ε_{g}, \\
F_{Y g} = γ_{g} F_{X g} + ζ_{g},
\end{array}$$

where g = 1, 2 is the group index, $X = {[X_{1}, X_{2}, \dots]}^{'}$ and $Y = {[Y_{1}, Y_{2}, \dots]}^{'}$ are the vectors of observed indicators shown in Figure 1, λ_X and λ_Y are the vectors of factor loadings of the indicators on the latent variables, δ and ε are the vectors of unique factors for X and Y, γ_g is the path coefficient from F_X to F_Y for group g, and ζ is the latent disturbance term for F_Y. The impacts of metric non-invariance on the outcome and on the predictor were both investigated. The simulation study is described below.

Figure 1. Data generating model for Study 1. F_X and F_Y are the latent predictor and outcome variables, each indicated by six observed indicators.

Monte Carlo Simulation

The study had a 3 (p_ni, number of non-metric-invariant indicators) × 4 ( $γ = \{γ_{1}, γ_{2}\}^{'}$ , vector of population regression coefficients of the two groups) × 2 (location of non-invariance) × 2 (N, sample size of each group) design. In each condition there were two groups, and the sample sizes were equal across groups. Both the predictor F_X and the outcome F_Y were latent variables with six indicators.

Number of Non-metric-Invariant Indicators, p_ni

Across the simulation conditions, p_ni was either 0, 2, or 4. For all indicators in Group 1, the factor loadings were set to 0.7, while p_ni of those in Group 2 were set to 0.3 to represent a moderate degree of metric non-invariance. This was similar to the conditions in some previous studies (Kaplan and George, 1995; Donahue, 2006).

Regression Coefficients, γ

There were four levels of γ: two with equal regression coefficients across groups ({0.1, 0.1} and {0.5, 0.5}) and two with unequal coefficients ({0.5, 0.33} and {0.33, 0.5}). In the equal-γ conditions, the grouping variable did not moderate the effect of F_X on F_Y, and Type I error rates were investigated. We were also interested in whether the size of the effect of F_X, large (i.e., 0.5) or small (i.e., 0.1), influences Type I error rates. In the unequal-γ conditions, the effect of F_X on F_Y differed between Group 1 and Group 2, so the grouping variable moderated the effect of F_X on F_Y, and the power to detect the true moderation effect was investigated. The values were chosen based on the benchmarks for small (γ = 0.1), medium (γ = 0.33), and large effects (γ = 0.5; Cohen, 1988).

Location of Non-invariance

The metric non-invariance occurred either on F_X only or on F_Y only. Note that this design factor was not applicable to conditions with p_ni = 0.

Sample Size, N

There were two levels of sample size per group, 200 and 500, consistent with some previous studies (e.g., Yoon and Millsap, 2007).
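To make the crossing of the four design factors concrete, the following is a minimal sketch that enumerates the simulation cells. The function and variable names are illustrative, not the authors'. The total number of cells is not stated in the text; the count printed by the sketch follows only from treating location as not applicable when p_ni = 0.

```python
from itertools import product

P_NI = (0, 2, 4)                      # number of non-metric-invariant indicators
GAMMAS = ((0.1, 0.1), (0.5, 0.5),     # equal coefficients: Type I error conditions
          (0.5, 0.33), (0.33, 0.5))   # unequal coefficients: power conditions
LOCATIONS = ("F_X", "F_Y")            # where the non-invariance occurs
SAMPLE_SIZES = (200, 500)             # per-group sample size


def design_cells():
    """List every simulation condition as a dict (illustrative helper)."""
    cells = []
    for p_ni, gamma, n in product(P_NI, GAMMAS, SAMPLE_SIZES):
        # Location of non-invariance is not applicable when p_ni = 0.
        locations = (None,) if p_ni == 0 else LOCATIONS
        for location in locations:
            cells.append({"p_ni": p_ni, "gamma": gamma,
                          "location": location, "n_per_group": n})
    return cells


cells = design_cells()
print(len(cells))  # number of distinct conditions under this reading
```

Under this reading, the invariant conditions contribute 4 × 2 cells and the non-invariant conditions contribute 2 × 4 × 2 × 2 cells.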

Mplus 7.0 (Muthén and Muthén, 2012) was used to generate 500 data sets for each condition. All variables were assumed multivariate normally distributed. The two factor variances in Group 1 were 1.0 and those in Group 2 were 1.3. For Group 1, the unique factor variances of all indicators were set to 0.51 in the population, so that the invariant indicators had a variance of 1.0. The unique factor variances for Group 2 were set to 0.51 × 1.3 = 0.663 so that the proportion of explained variances for the invariant indicators was constant across groups. Because scalar invariance was not the focus of the present study and might not be required for correctly modeling moderation effects, all intercepts and factor means in the population were set to zero.
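As a rough illustration of these population values, the sketch below generates one group's data in plain Python rather than Mplus. It rests on two assumptions that the text does not confirm: (1) the non-invariant indicators are the last p_ni items, so that the first indicator can serve as the marker, and (2) the stated factor variance of F_Y is its total variance, so the disturbance variance is the factor variance times (1 − γ²). All function names are hypothetical.

```python
import math
import random


def loadings(p_ni, n_items=6, invariant=0.7, noninvariant=0.3):
    """Loading vector; ASSUMPTION: the last p_ni items are non-invariant."""
    return [invariant] * (n_items - p_ni) + [noninvariant] * p_ni


def explained_proportion(loading, factor_var, unique_var):
    """Share of an indicator's variance due to the common factor."""
    common = loading ** 2 * factor_var
    return common / (common + unique_var)


def simulate_group(n, lam_x, lam_y, gamma, factor_var, unique_var, rng):
    """Draw n cases of the six X and six Y indicators for one group.

    ASSUMPTION: F_X and F_Y both have total variance `factor_var`, so the
    disturbance variance of F_Y is factor_var * (1 - gamma**2).
    """
    sd_f = math.sqrt(factor_var)
    sd_zeta = math.sqrt(factor_var * (1.0 - gamma ** 2))
    sd_u = math.sqrt(unique_var)
    rows = []
    for _ in range(n):
        f_x = rng.gauss(0.0, sd_f)
        f_y = gamma * f_x + rng.gauss(0.0, sd_zeta)
        x = [lam * f_x + rng.gauss(0.0, sd_u) for lam in lam_x]
        y = [lam * f_y + rng.gauss(0.0, sd_u) for lam in lam_y]
        rows.append(x + y)
    return rows


rng = random.Random(2018)
# Group 1: all loadings 0.7, factor variance 1.0, unique variance 0.51.
group1 = simulate_group(200, loadings(0), loadings(0), 0.5, 1.0, 0.51, rng)
# Group 2: two non-invariant X loadings, factor variance 1.3, unique variance 0.663.
group2 = simulate_group(200, loadings(2), loadings(0), 0.33, 1.3, 0.663, rng)

# Invariant indicators explain the same share of variance in both groups.
print(explained_proportion(0.7, 1.0, 0.51), explained_proportion(0.7, 1.3, 0.663))
```

The last line checks the arithmetic in the paragraph above: 0.7² × 1.0 + 0.51 = 1.0 in Group 1 and 0.7² × 1.3 + 0.663 = 1.3 in Group 2, so the explained proportion is 0.49 in both.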

The data sets generated were then analyzed in Mplus. The analytic model was identified by fixing the factor loadings of the first indicators for F_X and for F_Y to the population value (i.e., 0.7), while allowing the latent factor variances of F_X and of F_Y to be freely estimated. Hence, both F_X and F_Y were scaled to the same unit as the population model and across replications, so that the γ values from the two groups were comparable. To identify the mean structure, the latent factor mean of F_X and the latent intercept of F_Y were fixed to zero for both groups, while the intercepts and the unique factor variances were allowed to be freely estimated without cross-group equality constraints, as scalar and strict invariance conditions were not assumed.

For conditions with p_ni = 0, the data sets were analyzed by fitting only the model with metric invariance. For the other conditions with p_ni > 0, both the (misspecified) model with metric invariance and the (correct) model with partial metric invariance were fitted. Then for each data set, we obtained the point estimate of the moderation effect, $Δ \hat{γ} = \hat{γ}_{1} - \hat{γ}_{2}$ (using the MODEL CONSTRAINT command in Mplus), and the Wald test statistic (using the MODEL TEST command in Mplus) for the null hypothesis γ₁ = γ₂. Note that we also obtained the results for the likelihood ratio test, which is usually more accurate in finite samples, but we only present the results for the Wald test because the two tests are asymptotically equivalent and produced similar empirical power and Type I error rates across simulation conditions.
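For readers unfamiliar with the Wald test of a single equality constraint, the sketch below shows the generic computation outside Mplus. It is not the Mplus implementation; the estimates and standard errors in the usage line are hypothetical values, and the covariance between the two estimates defaults to zero as a simplifying assumption (with cross-group equality constraints on loadings it need not be exactly zero).

```python
import math


def wald_test_equal_paths(g1_hat, se1, g2_hat, se2, cov12=0.0):
    """Wald test of H0: gamma_1 = gamma_2 (one constraint, 1 df).

    cov12 is the sampling covariance between the two estimates;
    ASSUMPTION: it defaults to 0.0 for illustration.
    Returns (difference, Wald statistic, p value).
    """
    diff = g1_hat - g2_hat
    var_diff = se1 ** 2 + se2 ** 2 - 2.0 * cov12
    wald = diff ** 2 / var_diff
    # Upper tail of a chi-square with 1 df: P(X > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return diff, wald, p_value


# Hypothetical estimates, for illustration only.
diff, wald, p_value = wald_test_equal_paths(0.5, 0.1, 0.3, 0.1)
print(round(diff, 3), round(wald, 3), round(p_value, 4))
```

In a simulation, the proportion of replications with a p value below 0.05 gives the empirical Type I error rate or the empirical power, depending on whether γ₁ = γ₂ in the population.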

The dependent variables of the simulation were the percentage of replications in which the test statistic was statistically significant at the 0.05 level and the standardized bias of $Δ \hat{γ}$ . If γ₁ = γ₂ in the population, then the percentage of replications with a statistically significant Wald test statistic was the empirical Type I error rate (α*). Taking into account the sampling variability in 500 replications, an α* between 3.4% and 7.3% is within the 95% confidence interval when the true Type I error rate is 5%. Empirical Type I error rates outside the range of [3.4%, 7.3%] are defined as biased. We expected the Type I error rates to be biased and the standardized biases to be large when metric invariance was incorrectly assumed.
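The text does not say which interval formula produced the bounds of 3.4% and 7.3%. One formula that reproduces them for a true rate of 5% and 500 replications is the Wilson score interval, so the sketch below uses it. Treating this as the authors' method is an assumption.

```python
import math
from statistics import NormalDist


def wilson_interval(p, n, level=0.95):
    """Wilson score interval for a proportion p observed over n trials."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    denom = 1.0 + z ** 2 / n
    center = (p + z ** 2 / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z ** 2 / (4.0 * n ** 2)) / denom
    return center - half_width, center + half_width


lower, upper = wilson_interval(0.05, 500)
print(f"{100 * lower:.1f}% to {100 * upper:.1f}%")  # prints "3.4% to 7.3%"
```

By comparison, the simpler normal-approximation interval, 0.05 ± 1.96 × sqrt(0.05 × 0.95 / 500), gives roughly 3.1% to 6.9%, which does not match the reported range.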

If γ₁ ≠ γ₂ in the population, the percentage of replications in which the test statistic was statistically significant at the 0.05 level was the empirical power. Given that power is a function of effect size and sample size, the empirical power obtained from fitting the model with metric invariance in the p_ni = 0 conditions was treated as the baseline; the power obtained with p_ni > 0 from the models incorrectly assuming measurement invariance and from the models correctly assuming partial invariance was then compared to the baseline. We expected the power estimates from the models incorrectly assuming measurement invariance to differ more from the baseline than those from the correctly specified partial invariance models.

Denote ${\hat{γ}}_{1}^{(i)}$ and ${\hat{γ}}_{2}^{(i)}$ as the estimated values of γ₁ and γ₂ for the ith replication, and ${\bar{γ}}_{1}$ and ${\bar{γ}}_{2}$ as the corresponding means across replications. The standardized bias (Collins et al., 2001) was computed as

$$\text{standardized bias} = \frac{({\bar{γ}}_{1} - {\bar{γ}}_{2}) - (γ_{1} - γ_{2})}{SD ({\hat{γ}}_{1} - {\hat{γ}}_{2})},$$

where

$$SD ({\hat{γ}}_{1} - {\hat{γ}}_{2}) = \sqrt{\frac{\sum_{i = 1}^{R} {[({\hat{γ}}_{1}^{(i)} - {\hat{γ}}_{2}^{(i)}) - ({\bar{γ}}_{1} - {\bar{γ}}_{2})]}^{2}}{R}},$$

and i = 1, 2, …, R was the index of replications, with R = 500. The standardized bias was the ratio of the average raw bias to the empirical standard error of the estimator (i.e., the standard deviation of the estimates across replications), and a standardized bias with an absolute value < 0.40 was regarded as acceptable (Collins et al., 2001).
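The standardized bias defined above can be computed from the per-replication estimates as in the following minimal sketch. The function name and the four toy replications are illustrative only and are not results from the study. Note that the standard deviation divides by R, as in the formula, not by R − 1.

```python
import math


def standardized_bias(gamma1_hats, gamma2_hats, gamma1, gamma2):
    """Standardized bias of the estimated difference gamma_1 - gamma_2.

    gamma1_hats, gamma2_hats: per-replication estimates for the two groups.
    gamma1, gamma2: population values used to generate the data.
    """
    diffs = [a - b for a, b in zip(gamma1_hats, gamma2_hats)]
    r = len(diffs)
    mean_diff = sum(diffs) / r
    # Empirical SD across replications, with divisor R as in the formula.
    sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / r)
    return (mean_diff - (gamma1 - gamma2)) / sd_diff


# Toy input with four replications (hypothetical numbers).
bias = standardized_bias([0.5, 0.6, 0.4, 0.5], [0.3, 0.3, 0.3, 0.3], 0.5, 0.33)
print(round(bias, 3), abs(bias) < 0.40)
```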

Results

The simulation results for the conditions with null moderation effects are displayed in Table 1. When the measurement invariance assumption held for both the predictor and the outcome in the population model (i.e., p_ni = 0), using the analytic model assuming measurement invariance across groups yielded unbiased moderation effect estimates and unbiased α*.

Table 1. Empirical Type I error rate (in percentage) and standardized bias for Study 1.

When the non-invariance occurred on F_X, the partial metric invariance model was the correctly specified model; with this model, α* was close to the 0.05 nominal significance level and the moderation effect was estimated with absolute values of standardized bias below 0.02 (well under the 0.40 threshold for acceptability). On the other hand, α* was inflated when metric invariance was falsely assumed. The discrepancy between α* and the nominal level increased as one or more of p_ni, N, and the values of γ increased. For example, when N = 200, p_ni = 2, and γ = {0.1, 0.1}, α* = 4.2%; when N = 500, p_ni = 4, and γ = {0.1, 0.1}, α* = 7.6%; and when N = 500, p_ni = 4, and γ = {0.5, 0.5}, α* = 77.8%. An analysis of variance (ANOVA) including N, γ, and p_ni showed that p_ni had the largest impact on α* (η² = 0.34), followed by γ (η² = 0.21) and N (η² = 0.04). The bias of the estimated values of Δγ followed a similar pattern. For instance, with N = 500, p_ni = 4, and γ = {0.5, 0.5}, the standardized bias of the null moderation effect was −2.79, which was a substantial bias.

The patterns of α* and of the absolute values of the standardized bias when non-invariance occurred in F_Y were very similar to those when non-invariance occurred in F_X. However, the sign of the standardized bias was reversed, which means that when non-invariance occurred in the outcome's structure, the moderation effects were overestimated. Considering both locations of the non-invariance, we found that using models that incorrectly assumed measurement invariance resulted in substantially biased moderation effect estimates and inflated Type I error rates.

Table 2 shows the empirical power and the standardized biases for the conditions with nonzero moderation effects. When the non-invariance occurred on F_X, the correctly specified partial metric invariance models performed well: they showed no bias in the moderation effect estimates, with standardized biases from −0.03 to 0.01. In contrast, the metric invariance model yielded biased estimates of the moderation effects, and the influence was more salient as both N and p_ni increased. For example, when γ = {0.5, 0.33}, the standardized bias was −0.43 with N = 200 and p_ni = 2; it increased in magnitude to −1.84 with N = 500 and p_ni = 4. An ANOVA showed that p_ni had the largest impact on the bias of the moderation estimates (η² = 0.79), followed by N (η² = 0.09) and γ (η² = 0.01).

Table 2. Empirical power (in percentage) and standardized bias for Study 1.

In terms of the power to detect the moderation effects, the correctly specified partial invariance model yielded power of around 30% and 60% for N = 200 and N = 500, respectively. These power estimates were close to those obtained when the measurement invariance assumption held in the population (33% for N = 200 and 70% for N = 500). On the other hand, if metric invariance was falsely assumed, there was a substantial decrease in power in the conditions where non-invariance occurred. For example, with γ = {0.5, 0.33}, N = 500, p_ni = 2, and non-invariance on F_X, the empirical power was about half of what would be obtained when metric invariance held in the population (33.8% vs. 70.2%); with γ = {0.33, 0.5}, N = 200, p_ni = 4, and non-invariance on F_Y, the empirical power was only about one eighth of what would be obtained when metric invariance held in the population (4.2% vs. 33.0%).

Note that power loss was observed as both N and p_ni increased when the non-invariance occurred on F_X and γ = {0.5, 0.33}, whereas the conditions in which non-invariance occurred on F_Y and γ = {0.33, 0.5} led to inflated power estimates as both N and p_ni increased. The main reason for the different patterns in the power estimates was that, when the factor loadings of Group 1 (0.7) were larger than those of Group 2 (0.3) in the presence of non-invariance on F_X, the estimated moderation effect was negatively biased, whereas when non-invariance occurred in F_Y, the estimated moderation effect was positively biased. Additionally, the true moderation effect was −0.17 when γ = {0.33, 0.5}; therefore, the negative biases caused by falsely assuming measurement invariance would result in more negative moderation effect estimates and inflated power.

Study 2

In Study 2, we aimed to extend the scope of the MI-moderation relation to multilevel data. We focused on how measurement (non-)invariance across groups at the between level influences the test of a cross-level moderation effect, which is a prevalent issue in social and behavioral research. Specifically, we used the data generating model shown in Figure 2, which is one of the simplest models that includes multilevel measurement (non-)invariance and a within-level predictor, to depict the cross-level moderation effect. As can be seen in Figure 2, the latent predictor was measured by six indicators, and the cross-level interaction effect was represented by the difference across groups in the within-level path coefficient from the predictor to the outcome. It was assumed that the predictor did not have an effect on the outcome at the between level.

Figure 2. Data generating model for Study 2. F_X(w) and F_X(b) are the latent predictor variable at the within level and the between level, respectively. Y_ij and Y_j are the within-level and the between-level components of the outcome variable Y. β_1j = within-level regression coefficient of Y on F_X(w), whose magnitude varies across clusters as indicated by the black dot. Conditioning on the grouping variable, measurement invariance was assumed across clusters such that the within-level and the between-level factor loadings were identical, and that there were no residual variances for the six indicators at the between level.

Because multilevel data usually have larger sample sizes, we expected the impact of multilevel non-invariance on the Type I error rate and power to be larger. In addition, we were interested in whether the impact varies across multilevel-specific design factors such as the intraclass correlation (ICC), the number of clusters, and the cluster size. Because we found in Study 1 that different locations of non-invariance mainly resulted in changes in the sign of the bias of the moderation effects, in Study 2 we focused only on measurement non-invariance on the predictor side. Likewise, we considered only the positive moderation effect condition in Study 2, given that the negative moderation effect led to similar results in terms of bias in Study 1. A second Monte Carlo simulation study was conducted, as described below.