Model Fit after Pairwise Maximum Likelihood

Maximum likelihood factor analysis of discrete data within the structural equation modeling framework rests on the assumption that the observed discrete responses are manifestations of underlying continuous scores that are normally distributed. As maximizing the likelihood of multivariate response patterns is computationally very intensive, the sum of the log–likelihoods of the bivariate response patterns is maximized instead. Little is yet known about how to assess model fit when the analysis is based on such a pairwise maximum likelihood (PML) of two–way contingency tables. We propose new fit criteria for the PML method and conduct a simulation study to evaluate their performance in model selection. With large sample sizes (500 or more), PML performs as well the robust weighted least squares analysis of polychoric correlations.


INTRODUCTION
Tests and questionnaires usually consist of items with discrete ordinal response scales. In the factor analysis of discrete item responses, multivariate normally distributed scores are assumed to underly the discrete item responses (e.g., Wirth and Edwards, 2007;Rhemtulla et al., 2012).
Let X = (X 1 , X 2 , . . . , X k ) denote the vector of the k variables with discrete response scales, with realizations x i ∈ {1, 2, . . . , m i }, so that each item i has m i response options. The observed score x i on item i is related to the unobserved score x * i on the underlying continuum through where τ x i −1 and τ x i are the threshold parameters for the category of item i. An item with m i categories really only has m i − 1 thresholds, as τ 0 ⇒ −∞ and τ m i ⇒ ∞. Herinafter, to simplify notation, we assume that the number of response options is equal across items, m i = m for all i.
As the underlying continuous variable X * i is not observed, its mean and variance are not identified without further constraints. One can either fix the mean and variance (e.g., zero mean and unit variance), or one can fix two of the thresholds (e.g., at zero and unity). The latter is not possible with dichotomous items, because they are associated with just a single threshold.
Various estimation methods have been proposed for the factor analysis of (observed) discrete responses with (unobserved) underlying continuous scores. Here we discuss the weighted least squares method, the multivariate maximum likelihood method, and the bivariate maximum likelihood method.
The weighted least squares method was introduced as a two-step method. In the first step, the polychoric correlations between the observed variables are estimated. In the second step, the parameters of the structural equation model are estimated on the basis of the polychoric correlations. The general WLS fit function for discrete data, based on Browne (1984) who described the WLS fit for continuous data, is given by whereq is a vector with the non-redundant elements of the k × k matrix of polychoric correlations and g is a vector with the corresponding elements of the k × k matrix of modelimplied correlations. The weight matrix W is a positive definite matrix of order v × v, with v = k(k + 1)/2. It contains consistent estimates of the asymptotic variances and covariances of the polychoric correlations (e.g., Jöreskog, 1990Jöreskog, , 1994. Other authors also included the observed and model implied threshold values in theq and g vectors, and the associated asymptotic covariances ofq in matrix W, which resulted in two-step (see Lee et al., 1995) and three-step approaches (e.g., Muthén, 1984Muthén, , 1989Lee et al., 1990b). As the weight matrix can only be accurately estimated with large sample sizes (e.g., Rigdon and Ferguson, 1991;Muthén and Kaplan, 1992;Dolan, 1994), it is practically unfeasible to use the WLS function with the full weight matrix. An alternative is to use the WLS function with diagonal matrix W D , containing only the diagonal elements of W to estimate the parameter estimates. However, for inference, one needs the full weight matrix as implemented in the so-called robust WLS. The three-step robust WLS with mean-and-variance corrected chi-square and standard errors (WLSMV; Muthén et al., 1997;Asparouhov and Muthén, 2010), also referred to as RDWLS (see Katsikatsou et al., 2012), has been advocated because of good performance in simulation studies (e.g., Beauducel and Herzberg, 2006;Barendse et al., 2015). In the multivariate maximum likelihood estimation method (Lee et al., 1990a), the maximum likelihood estimator is used to estimate the variances, covariances, means, and thresholds of all X * simultaneously, in a single step. The method is also known as the full information maximum likelihood method, as one maximizes the likelihoods of the complete response patterns. This implies that one uses all information in the data, and does not have to rely on polychoric correlations, like in the WLS related estimation methods. Let ρ denote the vector containing the correlations between all pairs of continuous variables X * i and X * j with i, j = 1 . . . k, and i < j. The expected proportion π of response vector x, given correlations ρ and thresholds τ , is given by where f denotes the k-dimensional normal density. Let index r refer to a complete item response pattern (x 1 , x 2 , . . . , x k ), and let p r denote the observed proportion of respondents with response pattern r in the sample. The log-likelihood of response pattern r is given by which is maximized to obtain the estimates for the parameters ρ and τ . As maximizing this log-likelihood requires numerical evaluation of high-dimensional integration over x * (Equation 3) in order to obtain the probability function of a response vector, Jöreskog and Moustaki (2001) already concluded that FIML is only feasible with a small numbers of variables (e.g., four or less). This seriously limits the application of FIML in practice.
In the bivariate maximum likelihood estimation method, high numerical integration is avoided by considering bivariate information only. In this one-step method, the sum of the log-likelihoods of all possible bivariate response patterns is maximized, rather than that of the full multivariate response patterns.
For two items i and j, the expected proportion of respondents with scores x i , x j is given by for τ i = (τ 1 i , τ 2 i , . . . , τ m−1 ) and τ j = (τ 1 j , τ 2 j , . . . , τ m−1 ). In order to obtain the likelihood estimates of the parameters ρ ij and τ i , τ j , instead of maximizing the multivariate likelihood, we maximize the sum of all bivariate log-likelihoods: where p x i ,x j denotes the sample proportion of responses x i and x j . Jöreskog and Moustaki (2001) denoted this method the underlying bivariate normal method. They originally suggested to use both the univariate and bivariate distributions. Based on results of their simulation study (Katsikatsou et al., 2012) concluded that the univariate distributions have no additional value in the parameter estimation. The estimation method that only relies on bivariate likelihoods is referred to as the pairwise maximum likelihood (PML) method.
The PML estimation method has the advantage over FIML that it is computationally feasible, but it has the disadvantage that it only uses the bivariate distributions of the observed variables, and thus does not utilize all available information.
As an overall measure of fit, Jöreskog and Moustaki (2001) proposed to use the average of all bivariate likelihood ratio test statistics, but this statistic cannot be used as a goodnessof-fit test as its distribution is unknown. Maydeu-Olivares (2006) and Maydeu-Olivares and Joe (2006) introduced a family of fit statistics for testing composite null hypotheses in multidimensional contingency tables. As the PML method has been recognized as a special case of the maximum composite Frontiers in Psychology | www.frontiersin.org 2 April 2016 | Volume 7 | Article 528 likelihood method (Varin, 2008;Varin et al., 2011) and the bivariate maximum likelihood estimation method, it can be used to obtain a residual based fit statistics (Maydeu-Olivares, 2006;Maydeu-Olivares and Joe, 2006) and standard errors (Xi, 2011) for the PML estimation method. In a simulation study, Xi (2011) found the composite likelihood fit statistic and standard errors estimates of the bivariate maximum likelihood estimation method to be appropriate, when compared to a full information expectation maximization algorithm. However, these test statistics are not yet readily available, as the have not yet been implemented in a computer program.
In the present paper, we propose three new fit statistics. We investigate these test statistics in a simulation study and compare them with the overall goodness-of-fit that is associated with robust WLS estimation. The new fit statistics have been made available in the open source SEM software lavaan (see Appendix in Supplementary materials; Rosseel, 2012).

METHODS
To evaluate the three new fit statistics (explained in Section 2.2) for the PML estimation method, we conduct a simulation study in which we vary sample size (200, 500, and 1,000) and the number of response options (2, 3, and 4) in a fully crossed design, yielding nine conditions. With 1,000 replications, we obtain 9,000 datasets that are analyzed using the PML and robust WLS estimation methods.

Data Generation
We partly replicate the simulation study conducted by Katsikatsou et al. (2012). They generated item scores on six items according to a two factor model with factor loadings common factor variances and covariances and residual variances Continuous item scores are drawn from a multivariate normal distribution with variances and covariances and zero means. For each sample size (200, 500, and 1000), we generate 1000 datasets of continuous scores. These scores are categorized into two categories (threshold 0, yielding expected proportions 0.50 and 0.50), three categories (thresholds -0.6 and 0.6, yielding expected proportions of 0.27, 0.45, and 0.27), and four categories (thresholds -1.2, 0, and 1.2, yielding expected proportions 0.11, 0.39, 0.39, and 0.11; in line with Katsikatsou et al., 2012).

Model Fit Statistics
In the PML method, model parameters are estimated by maximizing the sum of the log-likelihoods of all bivariate responses patterns, for all pairs of items. As the distribution of this sum is not known, we propose three measures of fit that are based on likelihood ratios: C F , C M , and C P . The C F and C M fit statistics compare the model-implied proportions of response patterns with, respectively, the observed proportions of full response patterns (signified by subscript F) and the expected proportions under the assumption of multivariate normality (signified by subscript M). The C P fit statistic compares the model-implied proportions of pairs of item responses to the observed proportions of pairs of item responses (signified by subscript P). Specifically, C F compares the log-likelihood of the expected proportions of the multivariate response patterns (Equation 4) with the observed proportions of response patterns. Multiplied by two times the sample size, we obtain that is asymptotically chi-square distributed with degrees of freedom equal to the difference between the number of possible response patterns and the number of model parameters to be estimated minus one (Agresti, 2002, pp. 590-591), where n is the number of parameters to be estimated. As the number of possible response patterns m k is usually much larger than sample size N, most response patterns will not be observed at all, yielding many empty cells in the multivariate m k table, thereby causing bias in the C F statistic. As a possible solution (Jöreskog and Moustaki, 2001) considered the number of response patterns that is actually observed only, and calculated degrees of freedom as where u r denotes the number of observed response patterns.
The fit statistic C M compares the log-likelihood of the the model-implied proportions of response patterns of the modelof-interest with the model-implied proportions of the model that only assumes an underlying multivariate normal distribution (without any further restrictions): where C F1 is C F for Model 1, the model of interest, and C F0 is C F for Model 0, the model that assumes underlying multivariate normality and that has all polychoric correlations ρ and all Frontiers in Psychology | www.frontiersin.org 3 April 2016 | Volume 7 | Article 528 thresholds τ as its parameters. Statistic C M has asymptotically a chi-square distribution with degrees of freedom equal to the difference in the numbers of parameters of Models 0 and 1, where k(k − 1)/2 is the number of polychoric correlations, k(m − 1) is the number of thresholds, and n 1 is the number of parameters of the model of interest. If the bias in C F1 and C F0 caused by empty cells in the m k table cancels out in C M , then C M may outperform C F . The fit statistic C P is based on pairs of responses only, by comparing the observed and model-implied proportions of those pairs. For items i and j (Agresti, 2002), which has an asymptotic chi-square distribution with degrees of freedom equal to the information (which is (m 2 − 1)) minus the number of parameters [i.e., 2(m − 1) thresholds and 1 correlation], To test the overall goodness-of-fit of the model, we consider all C P ij and select C P = maximum (C P ij ). As there are k(k − 1)/2 possible pairs of items, this C P should be applied with a Bonferroni adjusted level of significance α * , with to keep the family-wise error rate at α. The hypothesis of overall goodness-of-fit is tested at α and rejected when C P is significant at α * . Notice that with dichotomous items, m = 2, df P = 0, so that the hypothesis of an underlying bivariate normal distribution cannot be tested. So, statistic C P can only be applied when there are more than two response options. We will compare the performance of these statistics with the chi-square measure of overall goodness-of-fit that is associated with robust WLS estimation, which we will refer to as C W . To account for the violation of distributional assumptions, this C W statistic is subject to a scaling correction (Muthén et al., 1997;Satorra and Bentler, 2001;Asparouhov and Muthén, 2010). Here we will use the mean-and-variance corrected chi-square statistic (Asparouhov and Muthén, 2010).

Analysis
We fit three models to each of the 9,000 datasets: a baseline model, a one-factor model, and a two-factor model. The baseline model includes all polychoric correlations and thresholds. If the baseline model does not fit then we must reject the hypothesis of an underlying multivariate normal distribution. The one-factor model has a free 6 × 1 matrix and the 1 × 1 matrix is fixed at unity. The two-factor model corresponds to the data generation model and has a 6 × 2 matrix with a pattern of free factor loadings that corresponds with above, a 2 × 2 symmetric matrix with diagonal elements fixed at unity and a free offdiagonal element. In both the one-factor and two-factor model is a 6 × 6 diagonal matrix equal to I − diag( ′ ). We use two estimation methods: PML and robust WLS. Model fit will be evaluated with measures C F , C M , and C P after PML estimation and with measure C W after robust WLS estimation. The computer program Mx (Neale et al., 2002) is used for PML estimation, and the computer program Mplus 6.11  for robust WLS estimation. The computer program R is used to calculate the fit measures C F , C M , and C P (using the "mvtnorm" package; R version 2.12.0; R Development Core Team, 2010).
The performance of the four fit measures will be evaluated by calculating the proportions of model rejection in each of the conditions. The baseline model and the two-factor model should fit. When testing at a 5% level of significance, these two models should be rejected in 5% of all cases. The one-factor model should not fit and should always be rejected.

RESULTS
Before presenting the results of the different methods for the evaluation of model fit, we briefly comment on the accuracy and efficiency of parameter estimation through PML. The accuracy is evaluated by calculating the absolute differences between the parameter estimates and the population values. The standard deviations indicate the efficiency of the parameter estimates.
Across all conditions, the average absolute difference of the factor loadings is 0.001 and the average standard deviation is 0.052. The average absolute difference of the correlation between the latent variables across all conditions is 0.002 and the standard deviation across all conditions is 0.069. Noteworthy, PML shows a slightly higher accuracy than robust WLS in terms of the estimates of the factor loadings and the correlation, with average absolute differences of 0.001 for PML and 0.003 for robust WLS. The efficiency is about the same. Katsikatsou et al. (2012) already reported on the accuracy and efficiency of the parameter estimates in the case of four point response options. Our results are consistent with the results of Katsikatsou et al. (2012). Table 1 gives the results of fit evaluation with the C F statistic for the baseline model, the one-factor model, and the two-factor model. For each condition, the means and standard deviations of the fit statistic are calculated across 1,000 replications. Rejection rates and 95% confidence intervals are given two times: Once with degrees of freedom based on the number of possible response patterns (df F , Equation 12) and with degrees of freedom based on the number of observed response patterns (df F * , Equation 13). Means and standard deviations of df F * are given as well.

The C F Fit Statistic
The fit of the baseline model is a test of the assumption of underlying multivariate normality, so we would expect rejection rates that equal the level of significance (5%). The overall rejection rates with the df based on the number of possible response patterns (df F ) in conditions with two-point response  scales are too high (13.5%, 15.4%, 8.8%). With three-point and four-point response scales, df F is very large so that the baseline model never gets rejected. The same is true for the two-factor model that should fit the data, but is rejected too often in the conditions with two-point response scales (12.0%, 16.0%, 9.6%) and never rejected in the other conditions. The one-factor model is not correct and should be rejected, which is the case in conditions with two-point response scales but not in conditions with three-point scales and four-point scales.
We attribute the bad results in conditions with three-point scales and four-point scales to the large numbers of empty cells in the multivariate contingency tables. In the cases of three-point response scales and four-point response scales the numbers of possible response patterns are 729 and 4096, whereas the total numbers of observations are only 200, 500, or 1000, rendering the C F statistic unsuitable. The overall rejection rates of the baseline model with degrees of freedom based on the number of observed response patterns (i.e., df F * ) are consistently much too high, in all conditions, showing that the use of df F * is not justified. Table 2 gives the results of fit evaluation with the C M statistic for the one-factor model and the two-factor model. The onefactor model is almost always rejected, except in the condition with sample size 200 and two-point response scales (with 0.987 rejection rate). The rejection rates for the two-factor model should be about equal to the level of significance (5%), but vary from 6.8% to 9.6%.

The C M Fit Statistic
Overall, we consider the C M results satisfactory. Apparently, the sparseness of data and (almost) empty cells that invalidate Frontiers in Psychology | www.frontiersin.org 5 April 2016 | Volume 7 | Article 528

Means (M) and standard deviations (SD) of the fit statistic, rejection rates (RR) at a 5% level of significance, and 95% confidence intervals (CI) of the rejection rates are calculated across the 1000 simulated datasets.
the use of the C F statistic does not seem to affect the C M statistic much.

The C P Fit Statistic
The C P results are given in Table 3. As explained above, the C P statistic cannot be used with two-point response scales. For all other conditions Table 3 gives the means, standard deviations, and rejection rates of the highest C P among the 15 bivariate tests that are conducted with each dataset. To guard against inflation of the family-wise error rate, the level of significance is adjusted to 5% / 15 = 0.33%. The rejection rates for the baseline model vary between 4.4% and 5.5%, and for the two-factor model between 3.5% and 6.0%, which is reasonably close to the significance level of 5%. The onefactor model is almost always rejected in conditions with sample sizes of 500 and 1,000. However, in the small sample conditions rejection rates are only 67.0% and 53.9%.

The C W Fit Statistic
For the purpose of comparison, Table 4 gives the C W results after analysing all data sets with the robust WLS method of estimation. The one-factor model is almost always rejected. The rejection rates for the two-factor model vary between 3.9% and 6.4%.
The C W results with the two-factor model are somewhat better (closer to 5% rejection rates) than the C M results. The C W results are about similar to the C P results, except for the rejection rates of the one-factor model in small sample size conditions, in which the C W statistic seems to have more power.

DISCUSSION
We proposed three new statistics for goodness of overall fit of models that are fitted through the pairwise maximum likelihood (PML) method. With the C F statistic we test the difference between the model-implied proportions of multivariate response patterns and the observed proportions of multivariate response patterns. With the C M statistic we test the difference between model-implied proportions of multivariate response patterns and the proportions of response patterns that are implied by the assumption of underlying multivariate normally distributed continuous variables. With the C P statistic we test the difference between model-implied proportions of bivariate response patterns and observed proportions of bivariate response patterns. The C F statistic appeared unsuitable for the evaluation of model fit. The performance of the C M statistic was good, although the rejection rates for the two factor model were consistently a little too high (varying between 6.8% and 9.6% instead of 5%). The C P statistic showed the best results with rejection rates close to the expected values (around 5% for models that should fit, and close to 100% for models that should not fit), except for relatively small sample sizes of 200 with which the rejection rates for the Frontiers in Psychology | www.frontiersin.org 6 April 2016 | Volume 7 | Article 528 wrong one-factor model were substantially too low. For all fit statistics, we only reported results of testing at the 5% level of significance, as the results at the 1% level of significance were very similar.
As an aside, we note that in the condition with four response options and sample size 500, we have reported the results of a second drawing of 1,000 datasets. The first drawing produced by chance unexpected low C P rejection rates for the baseline model (i.e., 0.032 with a confidence interval of 0.021-0.043) and one-factor model (i.e., 0.035 with a confidence interval of 0.025-0.046) that did not seem representative. No other statistics were affected.
The performance of the PML fit statistics is only partly dependent on sample size. The C F statistic is not suitable with any sample size, as we observe the negative consequences of very large contingency tables affected by sparseness of data (e.g., Agresti and Yang, 1987;Reiser and VandenBerg, 1994;Reiser and Lin, 1999;Jöreskog and Moustaki, 2001;Bartholomew and Leung, 2002). The alternative way of calculating degrees of freedom of Jöreskog and Moustaki (2001) on the basis of the number of observed response patterns instead of the number of possible response patterns, appeared unsuitable. In practice, one can deal with sparseness by for example combining cells, reducing the number of categories, or eliminating the most offending variables (see Agresti and Yang, 1987;Jöreskog and Moustaki, 2001). However, it was not possible to implement this in this simulation study.
The C M statistic seems not that much affected by sparseness of data. The C P statistic uses bivariate tables only, but its power for rejecting the one-factor model is mediocre when the sample size is small. Still, the C P rejection rates for the correct models are not affected by small sample size. In our simulation study we also varied the number of response options, but this manipulation did not affect the results of the C M and C P fit statistics much.
We compared the results of the PML fit statistics results with results of robust weighted least squares (WLS) with the adjusted chi-square statistic C W . The performance of C W was very similar to the performance of C P , and in small sample conditions C W outperformed C P in rejecting the one-factor model. Still, robust WLS estimation is very different from PML estimation. Robust WLS is a multiple-step method that relies on the estimated polychoric correlations. The model-implied correlations are then fitted to fixed polychoric correlations, so there is no direct relation between the model-implied correlations and the observed discrete responses. That is why we really expected PML to behave better than robust WLS. However, in the present simulation study of six variables measuring two common factors, robust WLS did at least as well as PML.
We still do not know how robust WLS and PML compare in larger datasets, with more variables, and more complex models. As WLS relies on a two-step procedure in which summary statistics are calculated first, we would expect the single-step PML procedure to outperform the WLS procedure. PML may also show advantages over WLS in case of incomplete data. Finally, we think that the PML method is a feasible alternative to FIML in case of larger data sets. Overall, the PML method seems a promising method that can be used to estimate all structural equation models, such as exploratory factor analysis models, multigroup models and longitudinal models (Moustaki, 2003;Vasdekis et al., 2012). We used Mx to apply the PML method, but the PML fit estimates can also be obtained with OpenMX (Boker et al., 2011) and lavaan (Rosseel, 2012). To facilitate their use, the C F , C M , and C P statistics have been implemented in lavaan (see Appendix in Supplementary materials; Rosseel, 2012).

AUTHOR CONTRIBUTIONS
All authors meet the criteria for authorship. All authors contributed substantially to the conception and design of the work, and drafting and finalizing the paper. MB, RL, and FO designed and programmed the simulation study.