A Comparison of Estimation Methods for a Multi-unidimensional Graded Response IRT Model

This study compared several parameter estimation methods for multi-unidimensional graded response models using their corresponding statistical software programs and packages. Specifically, we compared two marginal maximum likelihood (MML) approaches (Bock-Aitkin expectation-maximum algorithm, adaptive quadrature approach), four fully Bayesian algorithms (Gibbs sampling, Metropolis-Hastings, Hastings-within-Gibbs, blocked Metropolis), and the Metropolis-Hastings Robbins-Monro (MHRM) algorithm via the use of IRTPRO, BMIRT, and MATLAB. Simulation results suggested that, when the intertrait correlation was low, these estimation methods provided similar results. However, if the dimensions were moderately or highly correlated, Hastings-within-Gibbs had an overall better parameter recovery of item discrimination and intertrait correlation parameters. The performances of these estimation methods with different sample sizes and test lengths are also discussed.


INTRODUCTION
Polytomous item response theory (IRT; Lord, 1980) models are applicable for tests with items involving more than two response categories. Polytomous responses include nominal and ordinal responses. The former does not have any natural ordering between categories whereas the latter corresponds to a number of ordering categories. Ordinal polytomous responses, such as Likert scale items (Likert, 1932), are broadly used in many fields, including education, psychology, and marketing. Given this, many IRT models have been developed to analyze ordinal polytomous items, such as the graded response model (GRM; Samejima, 1969), the rating scale model (RSM; Andrich, 1978), and the partial credit model (PCM; Masters, 1982), to name a few. This study focuses on the GRM, the most widely used IRT model for polytomous response data (e.g., Rubio et al., 2007;Ferero and Maydeu-Olivares, 2009). The unidimensional GRM is defined as i = 1, . . . , N, j = 1, . . . , K, c = 1, 2, . . . , C j , where F and f denote the logistic or normal CDF and PDF for logistic or normal ogive GRM, θ i denotes the person trait parameter, α j denotes the item discrimination parameter, and δ j,c denotes the item threshold parameter for the cth category response of item j (Samejima, 1969), the latter of which satisfies −∞ = δ j,0 < δ j,1 < . . . < δ j,C j −1 < δ j,C j = ∞. (2) In many circumstances, it suffices to assume that all the test items measure one trait in common and hence use unidimensional IRT models. However, in some situations when multiple traits are being measured or the test dimensionality structure is not clear, multidimensional IRT (MIRT; Reckase, 1997Reckase, , 2009) models have to be considered. MIRT models are adopted when distinct multiple traits are involved in producing the manifest responses for an item. A special case of the MIRT model applies to the situation where the instrument consists of several subscales with each measuring one latent trait, such as the Minnesota Multiphasic Personality Inventory (MMPI; Buchanan, 1994). In the IRT literature, such a model is called the multi-unidimensional (Sheng and Wikle, 2007) or the simple structure MIRT (McDonald, 1999) model and is the major focus of the study. The multi-unidimensional GRM applies to situations where a K-item instrument consists of m subscales or dimensions, each containing k v polytomous response items that measure one latent dimension. The model is defined as: i = 1, . . . , N, j = 1, . . . , K, c = 1, 2, . . . , C j , where F and f denote the logistic or normal CDF and PDF for logistic or normal ogive GRM, α vj and θ vi denote the item discrimination and the person's latent trait in the vth dimension, and δ j,c denotes the item threshold parameter for the cth category response of item j (Samejima, 1969), the latter of which satisfies −∞ = δ j,0 < δ j,1 < . . . < δ j,C j −1 < δ j,C j = ∞.
For decades, IRT models can be estimated using (1) (Patz and Junker, 1999b) with Robins-Monro (Robbins and Monro, 1951) to facilitate the maximum likelihood estimation.
In the literature, studies have been conducted to compare the performances of these methods in estimating unidimensional dichotomous (e.g., Baker, 1998), unidimensional polytomous (e.g., Cowels, 1996) and multidimensional dichotomous (e.g., Han and Paek, 2014) models. However, little has been done to investigate them in estimating multidimensional polytomous models.
In view of the above, this study focuses on comparing different estimation methods of the multi-unidimensional GRM.
Several computer programs have been developed for estimating parameters in the multi-unidimensional GRM. For example, BAEM and MHRM are implemented in IRTPRO (Cai et al., 2011), and MH is implemented in BMIRT (Yao, 2003). This study then uses these software packages for the corresponding estimation method. The performance in parameter recovery of each method is evaluated and compared using Monte Carlo simulations. The results of this study can provide researchers/practitioners with a set of guidelines on the use of MML, fully Bayesian, and their hybrid, MHRM in estimating multi-unidimensional GRMs under different sample size, test length, and intertrait correlation conditions.

ESTIMATION METHODS
This study focuses on three general categories of estimation methods of multi-unidimensional GRMs: (1) MML, (2) fully Bayesian estimation, and (3) MHRM. A brief description of the major techniques in each category is provided below.

Marginal Maximum Likelihood (MML)
The two-step MML was developed by Bock and Aitkin (1981). After obtaining the joint probability (likelihood) of the item response vector given the person parameters, MML treats persons as random effects and derives a marginal probability of observing the item response vector by integrating the person effect out of the joint likelihood in order to separate item parameters from person parameters. Hence, in MML, item parameters are estimated using the expectation-maximum (EM) algorithm, and person parameters can be subsequently obtained using the estimated item parameters.
Bock and Aitkin's original algorithm uses a fixed Gauss-Hermite quadrature, which is limited for models with a lower dimension (Baker and Kim, 2004). However, as the number of dimension increases, the number of quadrature points increases exponentially and must be accommodated for by decreasing the number of quadrature in each dimension. To overcome this problem, Schilling and Bock (2005) suggested the use of an adaptive quadrature for better accuracy when a smaller number of quadratures per dimension is used. This method can be used for models with a moderate number of dimensions. These two MML methods for multi-unidimensional GRMs are directly implemented in IRTPRO (Cai et al., 2011).

Fully Bayesian Estimation
For the past two decades, fully Bayesian estimation via the use of the Markov chain Monte Carlo (MCMC) has gained an increased interest due to improved computational efficiency. In Bayesian analysis, prior information and all available data are integrated into posterior distributions (using Bayes rule) on which one can base on their inferences. Specifically, P(ξ |y) ∝ P(y|ξ )P(ξ ), where ξ and y are the collections of all parameters and observed data, respectively. Accurate approximations of the posterior distribution, P(ξ |y), via the use of MCMC simulation techniques have made fully Bayesian estimation available for complex models, such as MIRT models.
There are two types of fundamental mechanisms among the MCMC algorithm: Gibbs sampling (Geman and Geman, 1984) and Metropolis-Hastings (MH;Hastings, 1970;Metropolis and Ulam, 1949). Gibbs sampling is adopted in situations when the full conditional distribution of each parameter can be derived in closed form. If any of the full conditional distribution does not have an obtainable form, MH can be used via choosing a proposal or candidate distribution by the current value of the parameters. Then a proposal value is generated from the proposal distribution and accepted in the Markov chain with a certain amount of probability.
The two methods can be combined to form the blocked Metropolis (BM) procedure (Patz and Junker, 1999a,b), where each parameter is sampled from its full conditional distribution and an MH step is subsequently used to accept/reject it. Hastingswithin-Gibbs (HwG) is another form of the hybrid between Gibbs sampling and MH. As far as GRMs are concerned, Albert and Chib (1993) proposed a Gibbs sampler for the unidimensional model. Their approach can be easily extended to the multi-unidimensional model. Cowels (1996) proposed a HwG procedure by using a MH step within the Gibbs sampler developed by Albert and Chib (1993) for sampling the threshold parameters to improve mixing and accelerate convergence. Kuo and Sheng (2015) extended Cowels' approach to the more general multi-unidimensional model, where a constrained multivariate normal prior was assumed for θ , θ ∼ N m (0, P), with P being a correlation matrix to resolve the model location and scale indetermination. Moreover, the MH algorithm has been developed and implemented in BMIRT, and the BM procedure is implemented in IRTPRO for the multi-unidimensional model.

Metropolis-Hastings Robbins-Monro (MHRM)
More recently, Cai (2010a,b) developed a MHRM algorithm to combine fully Bayesian estimation with Robins-Monro (Robbins and Monro, 1951) technique to facilitate the maximum likelihood. Specifically, studies have shown that MML would cause the estimation process for MIRT models to be computationally intensive and often intractable when a large number of dimensions are involved (Cai, 2010a,b;Chalmers, 2012). The MHRM algorithm was consequently developed to overcome the problem of MML and provide useful estimates for data with a large number of items, many dimensions, or missing data. The MHRM estimation for the multi-unidimensional GRM can also be implemented in IRTPRO.

Simulated Data
To compare the aforementioned methods in estimating the multi-unidimensional GRM, a Monte Carlo simulation study was carried out where tests with two subscales were considered so that the first half measured one latent trait (θ 1 ) and the second half measured the other (θ 2 ). In the study, three factors were manipulated: sample size (N), test length (K), and intertrait correlation (ρ). The choice of N, K, and ρ was based on previous studies with similar models. For example, when investigating multidimensional GRMs, the simulation studies in Fu et al. (2010) adopted N = 500, 1000; K = 10, 20, 30, ρ = 0.1, 0.3, . . . , 0.9 for dichotomous items and N = 1000; K = 20, ρ = 0.2, 0.4, . . . 0.8, for polytomous items involving three categories. Working with dichotomous multi-unidimensional models, Sheng and Wikle (2008) adopted N = 1000, K = 18, ρ = 0.2, 0.5, 0.8 in the simulation studies, while Sheng and Headrick (2012) adopted N = 1000, K = 10, ρ = 0.2, 0.4, 0.6. Wollack et al. (2002) conducted simulation studies with nominal response models and they observed that parameter recovery was improved by increasing the test length from 10 to 30 items but that increasing the test length from 20 to 30 items did not produce a noticeable difference. Consequently, with our study, N polytomous responses (N = 500, 1000) to K items (K = 20, 40) were generated according to the multi-unidimensional GRM, where the population correlation between the two latent traits (ρ) was set to be 0.2, 0.5, or 0.8. Each item was set to be measured on three-scale Likert scales so that two threshold parameters were estimated for each item. The item discrimination parameters α v were generated randomly from uniform distributions so that α vj ∼ U(0, 2). The threshold parameters δ j1 and δ j2 were sorted values based on those randomly generated from a standard normal distribution, i.e., δ j1 = min(X 1 , X 2 ) and δ j2 = max(X 1 , X 2 ), where X 1 , X 2 ∼ N(0, 1).
For the use of software, (3) and (5) were programmed in MATLAB 1 , (1), (2), (6), and (7) were executed using IRTPRO 2 , and (4) was implemented in BMIRT 3 . However, in BMIRT, the population intertrait correlation, ρ, needs to be specified to implement the model, which may not be applicable in real situations. Yao and colleagues recommended Akaikes information criterion (AIC; Akaike, 1987) as a criterion in determining ρ when the latent structure is not clear (Yao and Schwarz, 2006). They further suggested that ρ can be approximated by finding the observed correlations between sum scores of subtests (Yao and Boughton, 2007). In this study, we considered all these scenarios when implementing MH via BMIRT, and they are: (1) directly using the true ρ (TRUE), (2) via the use of AIC (AIC), and (3) approximating ρ using the correlation between two subscores (COR). Hence, altogether this study compared nine estimation methods/approaches under two sample size, two test length, and three intertrait correlation conditions.
Ten replications were carried out for each scenario, where root-mean-squared differences (RMSDs) and bias were used to evaluate the recovery of each item parameter. The 10% trimmed means of these measures were calculated across items to provide summary statistics.

RESULTS
Tables 1-4 display the RMSD's and biases for estimating the model under the 12 test situations using the nine estimation methods. Inspection of these tables gives rise to the following results: The four methods implemented in IRTPRO (BAEM, ADQ, BM, and MHRM) yielded similar parameter estimates. Gibbs sampling had in general larger averaged RMSDs in recovering model parameters. This may be due to the slower convergence problem of Gibbs sampling as noted by Cowels (1996). It is noted that MH with each of the three criteria implemented in BMIRT (TRUE, AIC, COR) had a similar precision in estimating item discrimination (α) and threshold (δ) parameters. However, it only recovered ρ well under the TRUE criterion (where the population correlation was assumed to be available), which in practice is not realistic. On the other hand, MH with the COR criterion in BMIRT resulted in a slightly larger RMSD than the TRUE criterion. The AIC criterion had a worse performance on estimating ρ when the true intertrait correlation became larger.
These nine procedures performed differently in estimating α and δ under the 12 test situations. Specifically, when N = 500 and K = 20, the results shown in Table 1 indicated that MH with TRUE , AIC, or COR, as implemented in BMIRT had an overall better estimation of α and δ than the other procedures. HwG and the four procedures implemented in IRTPRO had similar results. Even though they did not estimate α and δ as accurately as MH, their RMSDs were not much larger.
When the test length was longer but the sample size remained the same (i.e., N = 500, K = 40). MH with each of the three criteria implemented in BMIRT had a better estimation of α and δ than the other methods with a low intertrait correlation (i.e., ρ = 0.2). However, when ρ = 0.5 and 0.8, MH with TRUE, AIC, or COR had larger RMSDs of δ. This was due to the inaccurate estimates of δ's with some items. For example, MH with TRUE had the estimate of a δ of 1.1176 when the true parameter was −0.7087. The less accurate estimates of δ obtained by BMIRT led to the larger averaged RMSDs. Compared to the three BMIRT methods, HwG and the four IRTPRO procedures had an overall fairly accurate estimation of α and δ regardless of the intertrait correlations.
When N = 1000, K = 20, and ρ = 0.2, the three procedures implemented in BMIRT had a better estimate of α and δ. With an increase of the true intertrait correlation, HwG had an improved estimation of α and δ that is comparable to the three BMIRT methods. The four IRTPRO methods did not have a comparatively well estimation of α, but they had a fairly accurate estimation of δ if the true intertrait correlation was 0.2 or 0.8. Further, when N = 1000 and K = 40, BMIRT had larger RMSDs in estimating δ, indicating that it involved more error in estimating δ when the test length was longer, as described previously. HwG had an overall better estimation of α under the three intertrait correlation conditions. It performed well in estimating δ when ρ = 0.5. However, when ρ = 0.2 and 0.8, the four IRTPRO procedures outperformed HwG in estimating δ.
In terms of estimating ρ, the three procedures implemented in BMIRT performed similarly when the true intertrait correlation (ρ) was 0.2. However, when ρ increased, the AIC criterion had a worse estimation of ρ. This is due to the reason that the AIC criterion usually selected the model with a lower intertrait correlation coefficient (i.e., ρ = 0 or 0.1) to be the best model (i.e., the one with the smallest AIC value). Therefore, it had a worse estimation of ρ when the true intertrait correlation was moderate to strong. The TRUE criterion had an overall better estimation in all test situations. However, it is noted that in practice the actual intertrait correlation is usually unknown. Among the procedures where ρ is estimated, HwG had a relatively better performance in recovering ρ, especially when the true intertrait correlation was 0.5 or 0.8. When the dimensions or latent traits were not much correlated, HwG, COR, and the four IRTPRO procedures performed equally well.
In addition to RMSDs, biases were obtained and shown in the tables to determine the direction of estimation in each procedure. Specifically, a positive (negative) bias indicates that the estimate tends to be larger (smaller) than the true parameter. The results suggested that Gibbs, HwG, and the four procedures implemented in IRTPRO had in general an overestimate of α, while HwG tended to underestimate δ. However, the four procedures in IRTPRO had a positive bias in estimating δ 1 (except for N = 1000, K = 20, ρ = 0.2) but a negative bias in estimating δ 2 . The three procedures in BMIRT tended to overestimate the model parameters when ρ was 0.2. When the intertrait correlation was larger, they could have negative biases for estimating δ. In terms of the bias of ρ, the four IRTPRO procedures and TRUE had in general positive biases (except for N = 500, K = 20, ρ = 0.5). AIC and COR tended to underestimate the intertrait correlation. HwG had positive biases when the true intertrait correlation was 0.2 or 0.8, but it had negative biases with the correlation being 0.5, regardless of sample sizes. Gibbs sampling had positive biases for all the parameters.

DISCUSSION
In general, this study compared the performances in estimating multi-unidimensional GRMs using nine estimation methods. Simulation results indicate that the four methods implemented Frontiers in Psychology | www.frontiersin.org in IRTPRO (BAEM, ADQ, MHRM, BM) performed similarly in recovering α, δ, and ρ well under the simulated conditions. This result is consistent with those of Han and Paek (2014), where IRTPRO was used to estimate multi-unidimensional two-parameter models, of Wollack et al. (2002), where unidimensional nominal response models were evaluated via BAEM and BM procedures, and of Cai (2010a), where BAEM and MHRM methods were implemented to estimate the two-dimensional graded response model. MH with the three criteria (TRUE, AIC, COR) executed using BMIRT also resulted in fairly accurate estimations of α and δ. However, these three criteria performed differently in estimating ρ based on the actual intertrait correlation. Specifically, these three criteria performed equally well when the latent traits had a weak correlation. However, in test situations where the intertrait correlation was moderate or high, only the TRUE criterion was able to obtain an accurate estimation of ρ. It is noted that the TRUE criterion is not applicable in many test situations due to the unknown true intertrait correlation, and hence is not practical with fitting such models. The COR criterion had a worse estimate than TRUE, but overall it had smaller RMSD's than the AIC criterion. Therefore, among the three criteria available from BMIRT, the COR criterion (i.e., using the correlation of subscales as the estimated intertrait If the intertrait correlation is not of interest, HwG, MH with the three BMIRT criteria, and the four IRTPRO procedures can provide a fairly accurate estimation of α and δ with tests that are not very long (e.g., K < 40). This result agrees with the findings in Li et al. (2014) when they compared the TRUE criterion in BMIRT and the four procedures implemented in IRTPRO in estimating the multi-undimensional two-parameter model. However, if more items are adopted, MH with the three BMIRT criteria tends to result in a less accurate estimate of δ with some, if not all, items, and hence not suggested in situations with tests that have 40 or more items. Alternatively, one may consider HwG or any of the four IRTPRO procedures under this condition. If the dimensions are moderately correlated, the HwG algorithm has a relatively better estimate of α and δ than any of the four IRTPRO procedures. However, the four IRTPRO procedures may perform relatively better in estimating δ when the intertrait correlation is low or high. In general, if item discrimination parameters α or the intertrait correlation ρ are more of interest, one may consider implementing the HwG procedure.
In summary, the magnitude of the intertrait correlations may affect the performances of parameter recovery of these nine estimation methods. However, in practice, the actual values of these correlations are rarely known in advance. Hence, one may consider adopting the HwG procedure to estimate the intertrait correlations since it has an overall better estimation of ρ according to the simulation studies. If the correlations are moderate, HwG can perform better in estimating α and δ than the other procedures. If the estimated ρ is either high or low, the four procedures implemented in IRTPRO may be adopted if one is more interested in estimating δ. It is noted that, if the dimensions are lowly correlated, BMIRT can also be considered when the test length is not very long. However, if ρ is moderate or high, or if more items are adopted in a test, BMIRT is not recommended due to its inaccurate estimations of δ with some (if not all) items.
The simulation results in this study are based on test situations where two dimensions are involved. Further research can consider comparing these nine procedures in test situations with more than two dimensions. In addition, this study focuses on items with three-scale Likert scales and therefore two threshold parameters need to be estimated for each item. It is noted that the results may be limited to this specification and thus may not generalize to situations where five or seven categories are adopted. Further study can evaluate the estimation of these procedures using items with more than three scales or with different numbers of scales. Furthermore, Kuo and Sheng (2015) suggested that parameter estimate of the HwG procedure is not sensitive to different prior distributions for α. However, the selection of prior distributions for α may affect the estimation of the other estimation procedures. Therefore, further study can consider comparing these estimation methods using informative priors. Lastly, the simulation study only adopted 10 replications due to the computational expense of the MCMC procedures (i.e., Gibbs and HwG). Harwell et al. (1996) suggested that a minimum of 25 replications for Markov Chain studies in IRT-based research is recommended. Further studies can add more replications to achieve a better accuracy.

AUTHOR CONTRIBUTIONS
TK conducted the simulation studies and summarized the conclusions. YS proofread and provided recommendations.

APPENDIX A
Sample code for fitting the multi-unidimensional GRM using IRTPRO using the BAEN estimation, where N = 1000, K = 20 and ρ = 0.