Assessing Construct Validity in Math Achievement: An Application of Multilevel Structural Equation Modeling (MSEM)

The purpose of the present study was to model math achievement at both the person and university levels of the analyses in order to understand the optimal factor structure of math competency. Data involved 2,881 students who took a national mathematics examination as part of their entry at the university public system in Saudi Arabia. Four factors from the National math examination comprised the math achievement measure, namely, numbers and operations, algebra and analysis, geometry and measurement, and, statistics and probabilities. Data were analyzed using the aggregate method and by use of Multilevel Structural Equation Modeling (MSEM). Results indicated that both a unidimensional and a 4-factor correlated model fitted the data equally well using aggregate data, where for reasons of parsimony the unidimensional model was the preferred choice with these data. When modeling data including clustering, results pointed to alternative factor structures at the person and university levels. Thus, a unidimensional model provided the best fit at the University level, whereas a four-factor correlated model was most descriptive for person level data. The optimal simple structure was evaluated using the Ryu and West (2009) methodology for partially saturating the MSEM model and also met criteria for discriminant validation as described in Gorsuch (1983). Furthermore, a university level variable, namely the year of establishment, pointed to the superiority of older institutions with regard to math achievement. It is concluded that ignoring a multilevel structure in the data may result in erroneous conclusions with regard to the optimal factor structure and the tests of structural models following that.


INTRODUCTION
Mathematics achievement is one of the most important criterion to entering college and also on achieving career readiness (Adelman, 2003;Maruyama, 2012), particularly in the fields of STEM (Science, Technology, Engineering, and Mathematics) (Adelman, 1999). Moreover, math achievement (along with verbal skills) is one of the two fundamental sub-components of the widely used conceptualization of academic self-concept (Shavelson et al., 1976;Marsh, 1990;Möller et al., 2009). Researchers have consistently found that math achievement is predicted by both individual and contextual factors. For example, Cvencek et al. (2015) found that students' beliefs about math and their math achievement are linked to their performance, with students with low math efficacy performing lower than students with high math efficacy. Furthermore, gender stereotypes about math achievement (i.e., boys perform better than girls) seem to influence math performance, with girls performing worse than males when their negative gender stereotype is activated (Ambady et al., 2001;Galdi et al., 2014). León et al. (2015), based on Self-Determination Theory (SDT; Deci and Ryan, 1985), reported that autonomous motivation (when students engage in learning from their own choice and preference without external pressure), is positively related to math achievement.
Regarding contextual factors, parental involvement (Kung and Lee, 2016), and family health (Barr, 2015) have been found to be influential factors in predicting students' mathematics achievement. Others have focused on within-school factors such as learning environments and motivational classroom discourse, since they influence the learning process (Vršnik Perše et al., 2010;Herndon and Bembenutty, 2014). The learning environment is the broader context in which the instruction is delivered and is concerned with an institution's policy, curriculum, budget, infrastructure (e.g., libraries, labs, IT facilities, etc.), institutional commitment, quality of academic staff, etc. For example, Gamoran (1992) found that school policies for admitting students to advanced math courses (i.e., standard procedures such as admissions tests vs. nonstandard approaches such as teachers' preferences-perceptions) influences math achievement. Furthermore, previous studies reported that schools located in rural areas compared to schools in urban areas exhibit what is called as the: "rural math achievement gap" (Khattri et al., 1997). Reeves (2015) suggested that one possible explanation for this gap comes from the difficulty of rural schools to attract qualified teaching staff, thereby reducing students' opportunity to practice and master advanced math topics.
In higher education, the foundation year of the university seems to be another important factor which might affect the quality of the offered academic degrees. A recent study by U-Multirank (2014), an organization funded by the European Union to compare university performance across a range of different academic activities, revealed that older universities tend to perform better than newer ones across most measures of research excellence. The term "new university" has been used informally to refer to several different waves of universities created in recent years around the Globe as a result of economic growth in Europe and the US. For example, in the United Kingdom, the term is synonymous with post-1992 universities and sometimes modern universities, referring to any of the former polytechnics, central institutions or colleges of higher education that were given university status from the British government post 1992.
In a study among academic staff at UK universities, it was reported that 75% of respondents in the old universities in UK were located in departments ranked "4" and above (indicating an outstanding performance), whilst no respondents from newer universities found themselves in departments ranked higher than a "3A" (indicating a moderate performance, see Harley (2002). More recently, recent higher institution evaluations in the U.K. showed that older, compared to younger establishments had higher student entry standards, graduate prospects, research quality and intensity, smaller staffto-student ratios, better facilities, larger amounts of time spent on academic services, receipt of honors from students, and increased rates of degree completion (University League Tables and Rankings, 2017). In a recent study, McCormack et al. (2014) examining more than 250 university department across 100+ UK universities in terms of management practices -shown to predict academic excellence-, found that departments in older universities tend to be better managed than departments in newer universities.
Several reasons for the superiority of the old established universities compared to new ones have been proposed, including a more robust organizational structure (due to tradition), a better infrastructure (e.g., libraries, labs, IT facilities, etc.), more qualified staff (higher academic achievements), better networking after graduation (providing better employment prospects), and a better academic environment (e.g., clubs, social events, sports, etc.). The above factors likely contribute to a differential attraction by older institutions of more qualified individuals, that also result in higher graduation rates and better qualified professionals (Aghion et al., 2010). One important hypothesis of the present study was to test math achievement level differences between old and new establishments, after concluding the optimal factor structure at the university level of the analysis.
Up until recently, the measurement of complex constructs and competencies involved data at the individual level. More recently, however, the advancement of Structural Equation Modeling (SEM) expanded our previous use of data at only a single level in the analysis, namely the person level, though accounting for data complexities and higher order relations (Brown, 2015). Several researchers (McDonald and Goldstein, 1989;Longford and Muthén, 1992) have proposed that patterns of relationships between variables may be different when taking into account nesting in that a relationship between two constructs or two measurements maybe saliently different when viewed under the lenses of a person level analysis (e.g., when the unit analysis is individuals) versus a cluster level analysis (e.g., when the unit of analysis is a cluster of individuals,-e.g., an organization). The combination of the two has contributed to what is now known as Multilevel Structural Equation Modeling (MSEM) which essentially combines the methodologies of structural equation modeling and multilevel modeling (Rabe-Hesketh et al., 2004;Heck and Thomas, 2009). At the measurement level in the analysis, this combination may suggest that different simple structures (i.e., factor solutions) may be operative at different levels in the analysis. At the structural level one can predict different outcome variables that emerge from the earlier measurement models and posit structural paths at each level in the analysis. For example, in measuring academic achievement it is possible that a 5domain factor structure (e.g., math, language, biology, chemistry, social studies) best describes individuals (who due to "domain specificity" may have variable performance across subject matters), but it is possible that a one-factor structure is the most parsimonious solution at the university level, with good universities having higher levels of achievement across subject matters. The implications for including nesting in our evaluation of measures is tremendous for construct validation as a given instrument may operationally define differently a construct at one level in the analysis (e.g., student with a 5factor solution) compared to another level (e.g., university level where a general achievement factor best fits the data). Such findings have implications not only for theory development and falsification but also, measurement, which lies in the core of all scientific efforts. That is, if proper levels of measurement error are at the person level, then scores are valuable and interpretable; If not, unreliable and likely detrimental for accuracy and prediction. Furthermore, the constructs under study may have different interpretations at each level in the analysis with implications for both operational definitions and use of scores (Huang et al., 2015). Up until recently, the only available means for evaluating the construct validity of instruments confounded (through ignoring) the presence, and differential effects, that nesting of individuals within clusters may exert on the data. The purpose of the present study was to demonstrate, using a national measure of math achievement, the evaluation of its optimal factor structure through accounting for the correlated structure of the university where students originate from using Multilevel Structural Equation Modeling (MSEM).

Multilevel Structural Equation Modeling (MSEM)
Multilevel Structural Equation Modeling (MSEM) evaluates measurement and structural models at more than one levels in the analysis when nesting is in place (Geldof et al., 2014;Heck and Thomas, 2015). The primary purpose of modeling data at two or more levels is to avoid the violation of the independence of observations assumption which is introduced when ignoring the clustering variable (e.g., the effects a school administration, teacher, school culture, or classroom climate exerts on all students-causing a baseline between person correlation that reflects a systematic source of measurement error) (Julian, 2001). That is, participants within a cluster are expected to have a higher correlation compared to individuals between clusters (e.g., within a class versus between classes). As shown earlier, such factors have proved to influence math achievement in significant ways (Vršnik Perše et al., 2010;Herndon and Bembenutty, 2014), thus it is important to examine their influential role within a multilevel perspective.
In the present study, we employed MSEM as a means of evaluating the math achievement at both the person and university levels of the analysis. The hypothesis of testing a simple factor structure at the person level of analysis makes inherent sense and is linked to assessment and evaluation, using person scores for future decision making, etc. However, the idea of testing math achievement at the university level of analysis needs to be justified (as well as for any other clustering variable for that purpose). At the measurement level in the analysis, universities are evaluated for the quality and standards they provide to their students, and that evaluation is oftentimes a function of their students' performance. Furthermore, within a university different emphasis may exist that are associated with differential levels of performance. Thus, to evaluate the role different universities might play on math achievement performance, one needs to test the most optimal measurement model for the assessment of a particular domain (e.g., math) in order to make informed decisions (such as staff recruitment, proper allocation of funds, and future university planning) based on that measurement model. For example, in the U.S. in 2013 most of the Federal and State funding was directed to community colleges and small universities compared to research institutions which focused on research grant funding (Woodhouse, 2015). Such decisions need to be grounded in empirical evidence in order to evaluate the services and qualities provided by small and newer institutions compared to older and larger research universities. Furthermore, when involving the MSEM methodology, one is also able to predict students' achievement from university and department level variables such as the year an establishment became a higher education institution, ratio of students to staff, facilities, etc.
If person level math achievement and university level math achievement do not match, measurement-wise, then the most optimal factor structure at each level in the analysis needs to be estimated and applied. The equivocal assumption that person level and university level achievement match, is clearly a tentative assumption and needs to be empirically tested. Based on the above, the purpose of the present study was to model math achievement at both the person and university levels of the analyses to understand the optimal factor structure of math achievement using information from the factor model at each level in the analysis and test the invariance of the proposed structure at the person level by gender. A secondary goal was to predict math achievement at different levels in the analysis, after estimating first the most optimal factor structure.

Participants
Participants were 2,881 individuals who took the math teacher test during a national examination at the National Center for Assessment in Higher Education in Saudi Arabia. The participants took on the measure as part of a licensure program to teach mathematics in elementary and higher education. There were 1,672 males and 1,209 females. The mean age was 24.02 years with an S.D. of 2.517 years (Males: Mean = 23.28, SD = 1.837; Females: Mean = 25.35, SD = 3.586). Participants were nested within 22 universities, which were classified as "new" if they were established within the last 10 years, or "old" establishments. Consequently, 511 students were nested within "old" universities and 2,370 within "new" universities.

Math Achievement
The present measure was a standardized math competency examination, which was administered regularly as part of satisfying requirements for licensure by the state in Saudi Arabia, thus, they were part of a National Examination study. The instrument included four subscales, namely: (a) numbers and operations (6 exercises), (b) algebra and analysis (17 exercises), (c) geometry and measurement (13 exercises), and (d) statistics and probabilities (7 exercises). Exercises were administered using standardized instructions using a paper-and-pencil format within a specific time period (30 min per domain) and were scored as either correct or wrong. The instrument We opted for creating item parcels 1 because models based on parceled data: (a) are more parsimonious, (b) present heightened reliability, (c) have distributions that approximate normality, (d) have fewer chances for residuals to be correlated or dual loadings to emerge (both because fewer indicators are used and because unique variances are fewer), and, (e) are associated with enhanced model fit (Bagozzi and Heatherton, 1994;Marsh et al., 1998;Bandalos and Finney, 2001). Furthermore, one of the main weaknesses of item-level factor analysis (i.e., the assumption that the observed variables are continuously measured interval-level data) may be partially overcome using item parcels (Panter et al., 1997). Parcels were created using 3-4 exercises per parcel selected at random from the domain's exercise pool, in order to account for systematic measurement error due to serial dependency, level of difficulty, or content similarity. Consequently, data were analyzed by use of Maximum Likelihood as recommended in the literature when data have 3 or more categories (Dolan, 1994;Beauducel and Herzberg, 2006). A prerequisite assumption to utilizing parcels, however, pertained to observing normality of the parcels' distributions, which was evaluated through inspecting values of skewness and kurtosis. We differed from utilizing the K-S statistic, as using our large sample size, trivial deviations from normality would likely support alternative model hypotheses. For skeweness and kurtosis acceptable values have been reported in the range of ±2 (Field, 2000(Field, , 2009Trochim and Donnelly, 2006;Gravetter and Wallnau, 2014) or ±1.5 (Tabachnick and Fidell, 2013). In the present study, values of skeweness ranged between 0.061 and 1.297 and kurtosis between −0.745 and +0.708, all laying within acceptable limits.

Data Analyses
Data were analyzed using Multilevel Structural Equation (MSEM). Initially, a series of confirmatory factor analysis (CFA) models were tested to verify the proper simple structure using aggregate data, ignoring nesting. We tested a model consisting of four latent variables (Numbers/Operations, Algebra/Analysis, Geometry/Measurement, and Statistics/Probabilities) using item parcels as indicators per latent variable. Structural Equation Modeling (SEM) evaluates discrepancies between data based and hypothesized variance-covariance matrices by use of an omnibus chi-square test using a system of linear equations. Provided that the chi-square test is a test of "exact fit" and thus, any model with measurement error is bound to be rejected as misfitting the data, a number of ancillary descriptive fit indices are oftentimes employed, along with residual values. Specifically, fit indices such as the Comparative Fit Index (CFI) as an absolute fit index 2 , the Tucker-Lewis index (TLI) as an incremental fit index 3 , and unstandardized residual values (Root Mean Square Error of Approximation) need to be greater than 0.900 and less than 0.08, respectively, to suggest a strong resemblance between sample-based and hypothesized variance-covariance matrices. These indices were utilized with both aggregate and multilevel data as there are currently no level specific fit indices (with the exception of the SRMR) that are available in commercial programs. Ryu and West described how to estimate CFI and RMSEA values for their partially saturated approach, but these methods are not currently available in any software as there is no direct estimation of the independence model for each level of the analysis 4 . Last, information criteria in the form of the Akaike index were employed using difference values using conventions described by Raftery (1995).
At a second step in the analysis, the simple factor structures were tested for verification at both levels in the MSEM analysis (person-within and university-between levels in the analyses) assuming there were ample levels of variance at the clustering level (prerequisite assumption). Data were analyzed by means of Maximum Likelihood (ML) estimation, which results in inflated estimates when residual observations are correlated (Pornprasertmanit et al., 2014). Multilevel Structural Equation Modeling (MSEM) involves random variation due to individual differences (individual-within level) and random variation due to groups in which the individuals belong to (group-between level) with the response of person i who belongs to group j on item y being a function of the between-group random component (y Bj ) and the within-group random component (y Wij ) as follows (Ryu and West, 2009): With the individuals who belong to the same group having an enhanced relationship compared to individuals belonging 2 Absolute in the sense that the best fitted model returns a value of zero. 3 Incremental or relative in that the best fitted model returns a value of one and zero reflects the worst fitted model, usually defined as the independence model in which there is no relation between variables but means of the observed variables are estimated at their true value. 4 Use of the standard independence model is inappropriate as it would yield spurious estimates of model fit by use of indices such as the CFI and other because model fit will include the saturated part of the model that provides no misfit to the overall evaluation. to different groups. As the total variance-covariance matrix is decomposed to within and between levels, these components are orthogonal. Since, theoretically speaking the mathematics measure was designed to assess four domains, the estimation of a 4-factor simple structure at the within and between levels was estimated using the Equations (2) through (4) as described by Muthén and Asparouhov (2009), using slightly different notation: With subscripts i and j being the person and clustering units, respectively. In Equation (2), y ij is a vector of measured variables and j a matrix of factor loadings linking the measured variables to corresponding latent variables at both the within and between levels of the analysis. Subsequently, the within part of the model is estimated as follows: The above equation involves estimating the within level model of the latent responses Y ij as being part of a common factor model that includes random intercepts α that vary over clusters j. B j contains a matrix of factor loadings at the within levels and ζ ik , residual values of the unique and common factor model at the within level. The between, structural, part of the model is estimated using the following formula: Which contain all random coefficients of intercepts α and slopes B, that vary over clusters j, the means µ and the factor loadings estimated from the between-cluster variancecovariance matrix β. Last, ζ k contains residual values of unique and common factors at the between level of the analysis. Subsequently, the hypothesized 4-factor correlated model that describes math competency as a 4-dimension correlated structure (see Figure 1) at both levels of the analysis is shown below using expanded matrices 5 (see also Geldof et al., 2014 for an explanation of the notation). Multilevel Structural Equation Model (MSEM) using matrix notation that posits a 4-factor structure at both the within and between levels of the analysis.     Figure 1 shows the hypothesized 4-factor structure. Note that with the above notation of Equations (2-4), all variables are treated as endogenous (i.e., dependent). For alternative conceptualizations see Muthén and Asparouhov (2009). Also, factor loadings were not fixed to unity; instead, identification of the metric of the factor should be done by fixing the factor variance to unity when modeling the data.
The test of simple factor structures involved several stages. At a first stage the optimal factor structure at a specific level in the analysis (e.g., multi-factor correlated) was contrasted to a competing structure (e.g., unidimensional), after saturating the other level of the analysis. The goal of this test was to conclude the optimal structure at each level in the analysis controlling for measurement error introduced by the structure tested at the other level of the analysis. This partially saturated modeling approach was first introduced by Hox (2002) and then expanded by Ryu and West (2009). It provides a test of model fit at each level in the analysis, as currently there are no level-specific fit indices to evaluate model misfit (except for the SRMR index). The standard method for evaluating model fit in SEM involves the use of a likelihood ratio statistic that tests the null hypothesis 6 that databased model fit, as estimated using the ML fitting function, is equivalent to the fit of a saturated model and, thus, there is no difference between the data-based model and a "perfectly fitted model.": Applying the standard approach (Yuan and Bentler, 2007) to nested structures would suggest that overall model fit would be dominated by the within group model potential misfit for which there is a larger sample size and thus, overall model fit woucertainly downplay misfit due to the between level of the analysis. This is due to the fact that the entire model is estimated simultaneously to test for the goodness of fit of a model (i.e., both covariance components at the between level θb and the within level θw). In other words, a conclusion pointing to a misfitted model would fail to describe the location of the misfit. Ryu and West (2009) proposed a partially saturated model fit approach in which the discrepancy between estimated and saturated models would be restricted to one level in the analysis. For example, if one wants to test a within group model with no misfit introduced by the between group model the later has to be saturated. A chisquare test would then test the discrepancy due to the within level as shown in the equation below: With any potential misfit found above attributed to discrepancies between estimated and saturated within level covariance matrices only. Consequently, we adopted this approach to test level specific model fit.
At a second stage, we tested the discriminant validity of the simple structure's components via the Gorsuch (1983) approach, which involves contrasting a model where between factor covariations are freely estimated to a model that these covariations are constrained to be equal to 1. If the later model is not inferior to the freely estimated model, then the hypothesis that the factors assess conceptually distinct components is not supported (and thus, unidimensionality is the likely alternative). Consequently, the proposed simple structure should fit the data well at both levels of analysis and should meet criteria for discriminant validation, when appropriate.
At a third step in the present modeling, the functioning of gender was investigated as a within person variable for which the proposed factorial structure may not hold (measurement non-invariance). Thus, a series of Differential Item Functioning (DIF) analyses were conducted by use of the Multiple Indicators  Multiple Causes (MIMIC) model (Muthén, 1978;Mislevy, 1986) following the Muthén (1989a) approach. The model tests the probability that item u j that belongs to factor η i and receives a direct effect from a dichotomous covariate x i (gender in the present case) has a response probability of 1 as shown below (Gallo et al., 1994): with λ being the factor loading of item u j on factor η with a mean of zero, κ j being the effect of the covariate on item u j at values x i . The probability of correct responding is then estimated as follows: With θ jj being the item residual variance, τ j the item threshold, λ j the factor loading, η i the factor mean (usually specified to be zero), κ j the effect of the covariate on item j, and F the normal distribution function (Muthén, 1989a,b). The approach utilized herein for testing invariance across gender at the person level in the analysis has been described by Muthén (1989a,b) and involves the following steps: (a) test for optimal factor structure, (b) test for effects of covariate(s) through constraining those direct effects to zero and evaluate magnitude of modification indices, verifying that factor structure does not change, (c) add direct effects of covariate on items recommended by modification indices, verifying again that the factor structure does not change as a function of modeling the covariate, (d) conclude on meeting requirements for full or partial invariance due to the covariate. All analyses were run using Mplus and Maximum Likelihood (ML) estimation using raw data as inputs and through analyzing variance-covariance matrices.

Simple Structure of Math Test Overall 7
Figures 2, 3 display the simple structures tested with aggregate data (i.e., ignoring clustering due to university) including a unidimensional, and a multidimensional model. Results indicated that model fit was adequate using unstandardized residuals (RMSEA) and less so the descriptive fit indices across all models (e.g., Unidimensional Model RMSEA = 0.013, CFI = 0.985; Multi-factor model RMSEA = 0.011, CFI = 0.985). When comparing the models using a Chi-square difference test, *p < 0.01; The level of significance was set to 0.01 to adjust for the excessive levels of power associated with an n-size of 5,445 participants. The critical value of a Chi-square statistic with 3 degrees of freedom is 11.345 at p < 0.01. n.s. = Non-significant. The above ICCs may appear on the low side but, although not customary, tests of significance and confidence intervals were constructed based on the parametric bootstrap distribution using routines initially developed for use with categorical data (Preacher and Selig, 2012;Raykov and Marcoulides, 2015 8 ). As shown above only the Algebra and Analysis exercises 4 and 6 did not present themselves with ample variability at the university level of the analysis. Further information was provided by use of the Design Effect (DEFF) index for which values greater than 2.0 suggest the need to employ a multilevel structure. *p < 0.05; ** p < 0.01. † Significance using a one-tailed test at p < 0.05. model fit was not significantly different between the two nested models, the univariate and multi-factor, although the multifactor model showed slightly better fit [ Chi−square(6) = 8.119, p > 0.05] (see Table 1).  (Raftery, 1995) suggested that whenever the difference in AIC values is less than 10 units (3.881 in the present instance), there is not strong support for the superiority of one model over another. Interestingly, this early conclusion was severely challenged in the next section, after modeling the mathematics simple structure at each level in the analysis and after testing misfit at each level of the analysis using the Ryu and West methodology.

Simple Structure of MA at the Between-Person (Within) and Between-University (Between) Level
A series of models were fit to the data and subsequently compared and contrasted in order to determine the best fitted model at each level in the analysis (person or university). However, it was necessary to first test that variability in math achievement scores was present at the university level of analysis (Raudenbush and Bryk, 2002;Maas and Hox, 2005). Consequently, a series of Intraclass Correlation Coefficients (ICCs) were assessed in order to verify that variances of the match exercises at the betweenuniversity level were non-zero (Werts et al., 1974;Raykov, 1997;Hsu et al., 2016). The coefficient is estimated as the ratio of the between-level variance σ 2 u0 to that of the total variance (within σ 2 r and between σ 2 u0 ) and makes use of the null model as follows (Kreft and de Leeuw, 2004): with σ 2 u0 being the cluster-based variance and σ 2 r the betweenperson within cluster variability. Furthermore, we supplemented the ICC analysis using the "design effect" index (Muthén and Satorra, 1995) which targets at correcting the negative bias associated with nested data due to the violation of the independence of standard errors. It contributes a multiplier that intents to correct standard errors. It is computed as follows: with n c being the number of level-1 units that comprise the clustering variable. As shown in the above equation the design effect is a function of both the number of units in the clustering variable but also the magnitude of the ICC. Values that warrant the need to account for the correlated structure due to clustering are in excess of 2.0 units. Table 2 shows those estimates which confirmed the need to model the information at the university level of the analysis. 9 Table 3 provides significance tests based on difference chi-square test statistics for nested models. The models tested were ordered based on the number of modeled parameters (from parsimonious to more parameterized) and were: (a) a one-factor model at both levels, (b) a unidimensional model within and multidimensional between, (c) a multi-factor model within and one-factor model between, and, (d) a multi-factor model at both levels 10 . Of interest was the comparison between unidimensional and multi-factor structures at both levels in the analyses, in light of the fact that there was no significant difference between the univariate and 4-factor correlated model with the aggregate data (i.e., when ignoring the nesting of participants onto clusters). Table 3 initially shows the fit of the four competing models followed by chi-square difference tests in the case when models were nested, along with values from information criteria, for comparisons of non-nested models. The best model fit was associated with a one-factor model structure at the between level and a 4-factor correlated structure at the between level (4W1B) of the analysis (RMSEA < 0.001, CFI = 1.0, TLI = 1.0, SRMR Within = 0.016, SRMR Between = 0.037) with the chisquare test being non-significant [χ 2 (148) = 130.885, p = 0.841] suggesting "exact fit" between the specified model and the data (MacCallum et al., 1996). This 4W1B model was superior to the unidimensional model at both levels in the analysis by use of a chi-square difference test [ Chi−square (6) = 59.472, p ≤ 0.001]. In the comparison between the preferred 4W1B model and the 4-factor model at both levels (4W4B), the chi-square difference test was not significant. In the case of two models that one is not clearly superior we opt for the less complex model based on the principle of parsimony. However, when utilizing information criteria, it appears that the more complex model (i.e., 4W4B) was associated with larger AIC and BIC values, in excess of 10 units (AICDIF = −10.078, BICDIF = −49.694). Based on the work of Raftery, when AIC difference values exceed 10 units, there is strong evidence that one model is superior to the other. Thus, the 4W1B model appears to be the preferred choice with these data (see Figure 4). Further analyses to verify  --*p < 0.01; The level of significance was set to 0.01 in order to adjust for the excessive levels of power associated with an n-size of 2,881 participants. n.s., Non-significant finding using an alpha level of 0.01. † It may sound strange that negative chi-square values are associated with tests of significance. Absolute chi-square values were utilized in those instances as it is possible that modeling additional parameters was associated with decrements in model fit, which was the case moving from the 1W4B to the 4W4B model.
AIC, Akaike Information Criterion; BIC, Bayesian Information Criterion; SABIC, Sample-size adjusted BIC. There is some controversy over estimating fit indices and RMSEA values with small numbers in degrees of freedom (Kenny et al., 2015) but the present models did not present that limitation. Before concluding the optimal factor structure at any level in the analysis it was important to evaluate the level of misfit between competing models as a function of the information provided at that level only. The methodology has been described by Ryu and West (2009) as the partially saturated approach in that the level that is not tested is saturated so that it does not contribute any measurement error toward the overall fit of the model. Thus, when the 4-factor model was fitted to the data at the person level (within), with a saturated model at the between level, model fit was good [χ 2 (71) = 107.784, p = 0.003, RMSEA = 0.013, CFI = 0.992, TLI = 0.979]. At a second step, a unidimensional model was fit to the data and produced the following fit [χ 2 (77) = 166.446, p < 0.001, RMSEA = 0.020, CFI = 0.980, TLI = 0.953]. Because the two models are nested a chi-square difference test was utilized that was equal to the difference in chi-square units between the two models and was evaluated with the respective difference in the number of degrees of freedom. Results indicated that the difference chi-square statistic was equal to 58.662 units, which was significantly different from zero with 6 degrees of freedom (the critical value was 12.592 chi-square units). Thus, the 4-factor correlated model at the within level of the analysis provided superior model fit compared to the unidimensional structure and was associated with low amounts of measurement error.
A similar evaluation took place at the between level in the analysis through saturating the within level model. The fit of the 4-factor correlated model [χ 2 (71) = 18.680, p = 1.00, RMSEA < 0.001, CFI = 1.0, TLI=1.0] was contrasted to that of the unidimensional structure at the university level [χ 2 (77) = 22.174, p = 1.0, RMSEA < 0.001, CFI = 1.0, TLI = 1.0]. Results pointed to accepting the null hypothesis that both models fit the data equally well. Consequently, due to parsimony, the 1-factor model was deemed the most appropriate structure at the university level (between person level).

Testing for the Discriminant Validity of the Optimal Multilevel SEM Model
One important hypothesis related to the discriminant validation of the mathematics measure as the between factor correlations were very high. To this end, we compared the freely estimated correlated factor model at the within level and saturated between  comparing the fit of the 4-factor correlated model with freely estimated between-factor correlations to that of the same simple structure but with fixed correlations to unity, results indicated significant misfit of the later as the difference chi-square value was equal to 58.662 with again a critical value of 12.592. Thus, a conclusion of discriminant validation was supported as the model with fixed correlations was statistically inferior to that of the 4-factor freely correlated model. A MIMIC model was applied at the within (person) level to test the measurement invariance of the instrument across gender although alternative approaches based on multi-group modeling are also available (Kim and Cao, 2015). Based on recommendations by Muthén (1989a) the effects of the covariate and measurement non-invariance should be examined by constraining the effects of the covariate on the item parcels to be zero and through examining the misfit documented in the modification indices. After fitting this constrained model to the data, results indicated that item parcels 1, 3, 4, and 10 were associated with increases in chi-square values between 11.039 and 38.985 units, all significant given a critical threshold value of 10 chi-square units. Furthermore, a direct effect of gender on the first factor (Numbers/Operations) was significant and negative suggesting that females had lower scores than males on that factor. Inspection of the behavior of item parcel 1 that loaded onto the Numbers/Operations factor revealed that its factor loading was positive, thus, the expectation was that females would have higher scores compared to males on that item parcel. The covariate effect, however, was negative and significant suggesting that females actually had lower scores on that item parcel. That was evidence of measurement noninvariance for the first arithmetic item parcel across gender. The same exact effect was also observed with item parcel 3, the first item of the Algebra/Analysis factor pointing again to the presence of non-invariance due to gender. For item parcels 4 and 10, however, there was no significant effect on the factor mean pointing to an expectation that differential responding should not be expected across gender. Nevertheless, the effects of the covariate were significant and positive suggesting that females, had significantly elevated scores on those item parcels, indicating non-invariance or the presence of Differential Item Functioning (DIF). Figure 5 displays the difference across gender on both the logit (y-axis to the right) and the probability scale (yaxis to the left of the figure). As shown in Figure 5, differences that exceeded levels of significance were practically meaningless. For example, at the probability scale, success rates between 2 and 4% were significantly different from each other but likely represent miniscule differences using an effect size metric. The largest difference represented 3 percentage units. Furthermore, the pattern of findings was not consistent in that all four item parcels were favored by males only or females only suggesting a balance across gender that is likely reflective of random variation that exceeded levels of significance due to excessive levels of power of the z-test statistics. Consequently, a conclusion of measurement invariance was drawn, suggesting that the few significant observed discrepancies likely reflect Type-I errors due to the large sample size.

Testing for Level Differences Across Gender Using Multilevel MIMIC Model
Assuming measurement invariance across gender as per the previous section, with the few significant findings reflecting very low effect sizes and were rather an artifact of the large sample size, a latent means analysis was conducted using procedures described by Kim and Cao (2015). Consequently, the latent factor means of the four math constructs were regressed on a dummy gender variable. Results pointed to the presence of null effects across all constructs except the Numbers/Operations factor. The mean of females was −0.214 units lower compared to that of males on the respective construct (z = −2.791, p = 0.005).

Multilevel Structural Equation Modeling (MSEM) for the Prediction of Math Achievement From Type of University (Old-New)
Following evaluation of the measurement models above, a last aim involved a structural model in which the latent factor mean of the unidimensional math structure at the between level of the analysis was regressed on the year the university was established (see Figure 6). Prior to testing this model, it was necessary to verify the measurement invariance of the model across old and new establishments. Consequently, a multi-group MSEM model was tested with age of university comprising the between-level grouping variable. Factor loadings and intercepts were constrained to be equivalent across type of universities and model fit was subsequently evaluated. Results indicated that imposing these constraints was associated with excellent model fit. Specifically, the overall chi-square test was non-significant suggesting "exact fit" [χ 2 (336) = 308.371, p = 0.858] and among fit indices both the CFI and TLI were equal to 1. The RMSEA was less than 0.001 and the misfit introduced by the different levels of the age of institution variable were very similar (168.773 and 139.597 chi-square units for old and new establishments, respectively). Thus, a conclusion of strict measurement invariance was supported, and generalized math competency was subsequently regressed on a dummy variable (year the university was established) coded with zero representing older institutions and with a value of 1, newly established institutions. Results indicated that there were significantly lower math achievement levels in students nested within newer establishments compared to older ones (b Math = −2.174, p < 0.001).
FIGURE 6 | Effects of type of university (old and new establishments) on the math achievement of university students. Only latent variables and level-2 predictor are shown for parsimony. *p < 0.05, two-tailed test.

DISCUSSION
The measurement of academic achievement has been predominantly examined with person level data which essentially fail to disaggregate the person variability from that of between person structures such as the university students belong to. Consequently, when using person-based estimates of achievement any influences due to university are confounded. The purpose of the present study was to model math achievement at both the person and university levels of the analyses in order to understand the optimal factor structure of math achievement using information from the factor model so that all available information regarding the measurement of math achievement is accounted for. Several salient findings emerged, which are presented in order of importance in the sections that follow.
The most important finding related to the fact that the simple structure of math achievement appears to be different when viewed under the lenses of the aggregated model (person level data) and under the differentiation as person level and university level data. Specifically, the aggregate data analysis supported a conclusion of an optimal univariate model for the measurement of math achievement and the multilevel structural equation model a conclusion favoring a unidimensional structure at the university level and a multi-factor model at the person level. This is an important consideration that affects both theory and measurement practice and utility. That is, a simple structure should be evaluated for fit at each level in the analysis and that conclusion should inform theory; also, those findings should inform measurement in that they should lead to simple structures with the least amount of measurement error so that subsequent phenomena (i.e., structural relations) would be modeled properly with the most appropriate measurement models for each level in the analysis. At the between university level, the correlation between constructs was high suggesting that math ability at the university level is driven by the overall capacity of the university, without showing domain specific effects. At the person level, differential performance (low-high achievement) across subspecialties was observed suggesting that individuals may have a preference and different level of skill in some math area (e.g., statistics) but less so in another (e.g., algebra). So, although at the university level, institutions were either good or not so good across math specialties, math achievement at the person level was governed by math subspecialty in that performance in one subject matter was unrelated to the performance in another subject matter (between factor correlations were at times zero). The findings at the person level are expected in that individuals should not necessarily be "equally" good across math specialties. The finding at the university level did not provide support to the hypothesis that there are different emphases within a university (defined by different quality staff and resources so that for example, a department within a university may emphasize statistics but less so, algebra and analysis). This apparently diverse simple structure observed at the university level compared to the aggregate data is surprising provided that the ICCs were not that large to warrant such a saliently different solution (Opdenakker and Van Damme, 2000).
The second most important finding relates to the ability of the MSEM model to understand the variability of university phenomena, after employing the most appropriate simple structure, using university level predictors, after first verifying measurement invariance. In the present study, the age of an institution was factored in, to understand math achievement with strict invariance being justified across old and new institutions. Results favored older institutions in that math achievement was significantly elevated. This finding agrees with previous data from e.g., UK university evaluations in 2016 in that older institutions had significantly higher ranking, research quality and intensity, better student to staff ratio, significantly higher allocation of funds, more facilities, higher honors, and higher completion rates (University League Tables and Rankings, 2017). Thus, the present analysis using the university as the unit of analysis and after evaluating the most optimal simple structure at the university level, allows for a proper evaluation of university departments, their degree of competency and production, which is primarily associated with funding from federal and/or state sources. Interestingly, in the U.S. the major research institutions (that are mostly older) seem to be penalized with regard to funding as the likely newer community colleges and non-research institutions seem to be receiving the largest share of their budgets from federal and state funds (Woodhouse, 2015). In the absence of the MSEM methodology, one could neither test for the most optimal model using the university as the level of analysis, nor would be able to predict how the age of the institution could contribute to achievement in math. Subsequent public policy decisions could then be adjusted for the present findings.
A third important finding related to the evaluation of within level predictors such as person demographic characteristics. In the present study the effects of gender were evaluated after first establishing partial measurement invariance of the 4-factor solution between males and females. Only four out of the 14 item parcels showed significant amounts of DIF, which, when evaluated using practical means appeared to be very small and, in that sense, insignificant. Consequently, measurement invariance was assumed and, in a MIMIC, structural model all latent factors were regressed onto the dummy gender variable. Results pointed to the existence of minimal differences across gender, with one significant effect observed for factor 1 (numbers/operations), with females having a significantly lower mean on that construct. Overall, these findings suggest that males and females are comparable in their levels of math achievement across math domains and contrasts earlier findings pointing to the existence of gender differences with females having lower aptitude in math compared to males across math domains (e.g., Régner et al., 2016).
The present study is also limited for several reasons. First, the most optimal simple structure was not consistently pursued through deleting item parcels or persons as the goal of the present study was not to purify and improve the instrument under study. Furthermore, disaggregation was the preferred method of analysis provided that the current measure was reflective rather than formative. Thus, we deferred from this approach of initially purifying the measure using aggregate data. A third limitation pertains to the fact that several intraclass coefficients were low, particularly since low ICC values (along with other factors) have been implicated with biased estimates of factor loadings (Muthén and Satorra, 1995;Hox and Maas, 2001;Wu and Kwok, 2012), non-convergence (Toland and De Ayala, 2005), and/or inadmissible estimates such as the presence of negative variance estimates (Li and Beretvas, 2013), leading to proposals for involving Bayesian estimation approaches (Depaoli and Clifton, 2015). In the present study, issues of non-convergence were not present suggesting that the large cluster sizes (mean cluster size = 247 participants) acted as a buffer to estimation problems (Preacher et al., 2011) along with creating item parcels as categorical data have been largely implicated with estimation problems and non-convergence (Yang-Wallentin et al., 2010). As Muthén (1991) pointed out, small ICC values are common in educational and psychological research, as well as small cluster sizes (e.g., with 5-20 participants, Mathisen et al., 2006), however, research has shown that ignoring ICCs as low as 0.02 lead to parameter inflation and a large number of Type-I errors (Murray and Hannan, 1990;Siddiqui et al., 1996;Baldwin et al., 2011). Another limitation pertains to the unbalanced samples in old and new establishments, which, may have affected the generalizability of the findings. Last, fit at the within level may suggest an overidentified model, which potentially creates problems with parameter estimation.
Nevertheless, the present study's novelty lies on the fact that proper measurement of many conceptual phenomena likely involves "nesting." The use of the factor model as part of multilevel modeling further disattenuates measurement error and provides improved accuracy of person scores. The simultaneous modeling of the covariance structure at both levels in the analysis allows for a proper disaggregation of variances and covariances at each level. Under those lenses latent variable models are the most appropriate means for assessing construct validity and should be tested separately at each level in the analyses as needed in order to more accurately measure the constructs under study. However, despite the analytical benefits, as Stapleton et al. (2016) have noted, if theoretically speaking the interest is at the person level and the construct being measured also makes only sense to be assessed at that level only, modeling level-2 structures, may not be appropriate.

ETHICS STATEMENT
This study was carried out in accordance with the recommendations of the human experimentation committee of the National Center for Assessment in Higher Education.

AUTHOR CONTRIBUTIONS
GS drafted the manuscript, run statistical analyses, created tables and figures, and monitored all aspects of the written product. IT contributed significantly to the write-up of the study, run some data analyses and created tables. He approved all aspects of the written product. AA-S collected the data, drafted parts of the data analysis section, proofread the entire manuscript and approved all parts of the written product.