Expectation-Maximization-Maximization: A Feasible MLE Algorithm for the Three-Parameter Logistic Model Based on a Mixture Modeling Reformulation

Stable maximum likelihood estimation (MLE) of item parameters in 3PLM with a modest sample size remains a challenge. The current study presents a mixture-modeling approach to 3PLM based on which a feasible Expectation-Maximization-Maximization (EMM) MLE algorithm is proposed. The simulation study indicates that EMM is comparable to the Bayesian EM in terms of bias and RMSE. EMM also produces smaller standard errors (SEs) than MMLE/EM. In order to further demonstrate the feasibility, the method has also been applied to two real-world data sets. The point estimates in EMM are close to those from the commercial programs, BILOG-MG and flexMIRT, but the SEs are smaller.


INTRODUCTION
In the field of educational measurement, item response theory (IRT) models are a powerful tool aimed at providing an appropriate representation of students' test-taking behavior, and produce accurate estimates of students' ability. IRT models are expected to capture the underlying response processes such as students' ability and other strategies students might take. One of the most common strategies is guessing behavior when students cannot solve a problem correctly. The guessing strategy is prevalent particularly for multiple-choice questions in a low-stakes test (Lord, 1980;Baker and Kim, 2004;Cao and Stokes, 2008;Woods, 2008). To count for guessing, the threeparameter logistic model (3PLM; Birnbaum, 1968) has been commonly used in many applications of IRT in the measurement industry.
Despite of its popularity, however, several studies have pointed out technical and theoretical issues regarding the c-parameter (the guessing parameter) and its interpretation (Lord, 1974(Lord, , 1980Kolen, 1981;Thissen and Wainer, 1982;Holland, 1990;Hambleton et al., 1991;Yen et al., 1991;San Martín et al., 2006, 2013Woods, 2008;Maris and Bechger, 2009;McDonald, 2013). The current paper focuses on one of long-standing issues for the 3PLM, the item parameter estimation, which has proved to be challenging (Thissen and Wainer, 1982). Mislevy (1986) pointed out that the essential difficulty is the sparse data for the guessing parameter, yielding unstable maximum likelihood estimates (MLEs). The well-established marginal maximum likelihood estimation via expectation-maximization (MMLE/EM) algorithm (Bock and Aitkin, 1981) is not feasible in this case. According to Thissen and Wainer (1982), the sample size required for obtaining MLE for 3PLM with acceptable standard errors using MMLE/EM is about 10,000 and, for some items, as large as 1,000,000 which seems prohibitively expensive or impractical even now. A most recent empirical study testifying to this claim is Tay et al. (2016) in which the researchers recommended that a sample of 20,000 is desirable when fitting a 3PLM with covariate (Tay et al., 2013). In fact, Thissen and Wainer (1982) even concluded that "naked maximum likelihood estimation for the three-parameter model is not a technique that is likely to give happy results." San Martín et al. (2015) showed that the 3PLM is not even identifiable if a fixed-effects approach is adopted and the question of identifiability of the random-effects 3PLM is still open.
By now only some decent Bayesian methods have been developed, such as Bayesian EM (Mislevy, 1986), Bayesian joint estimation (Swaminathan and Gifford, 1986), and MCMC (Patz and Junker, 1999) MLE, however, enjoys favorable statistical properties (Casella and Berger, 2002). It may avoid some issues in Bayesian methods, such as assigning priors, convergence checking in MCMC, etc. and, thus, usually is preferred for parameter estimation. It would be a valuable addition to existing methods if that a practical feasible MLE algorithm could be developed for the 3PLM. The current study intends to make some contribution in this respect. The authors propose a feasible EM algorithm for the 3PLM, namely expectation-maximizationmaximization (EMM). EMM can be considered as a modified version of the MMLE/EM algorithm (Bock and Aitkin, 1981) and the extra maximization step is especially designed for the guessing parameter due to a different setup for the complete data based on a mixture-modeling reformulation of the 3PLM.
This mixture-modeling approach to the 3PLM is not entirely new in the IRT literature. Mixture modeling is a well-established tradition in IRT, especially for Rasch models (von Davier and Rost, 2006). As for the 3PLM in particular, Hutchinson (1991) first discussed the two underlying processes of guessing and answering based on ability which points to the idea of mixture modeling for the 3PLM. San Martín et al. (2006) proposed an ability-based guessing model to describe the interaction between guessing behavior and examinee's ability based on this perspective. von Davier (2009) further presented two different interpretations of the 3PLM from this standpoint. One may even easily notice that the reformulation of the 3PLM developed by the current study can be considered as a special case of the HTBRID model (Yamamoto, 1982) and the general diagnostic model (von Davier, 2008), both of which are of a strong mixture modeling flavor. Motivated by this tradition, especially the work by von Davier (2009) and Hutchinson (1991), the current study presents a new mixture-modeling reformulation of the 3PLM by introducing an extra latent indicator for the guessing behavior and develops a feasible MLE algorithm, although this concept of mixture-modeling approach for the 3PLM has recurred in the IRT literature.
More specifically, a conceptual summary of developing such a new algorithm goes as: (a) introducing a new latent variable to construct a space one dimension higher than the old one, which appears to be unwise because the reformulation makes the original 3PLM estimation problem more difficult by adding one more dimension. (b) invoking the independent assumption of guessing and problem-solving process to approximate the joint distribution of the response and the newly introduced latent variable. It is worth noting that this practice is very common in statistics to approximate a high-dimension space. (c) using the approximation as a surrogate of the original 3PLM likelihood function to obtain item parameter estimate. Since the goal is to calibrate items with the traditional 3PLM, the convergence criterion is still calculated through the likelihood function of the original 3PLM, though the E and M steps are involved with the approximation of the likelihood function of the reformulated model.
The remaining is organized as the follows: Section 1 will present the reformulation of 3PLM based on the mixture modeling (McLachlan and Peel, 2004). In next section, EMM is developed and derivation of the estimation standard errors (SEs) for EMM is provided. In section 3, a comprehensive simulation study is conducted and the EMM is also applied to two empirical examples from BILOG-MG (Zimowski et al., 2003) and flexMIRT (Houts and Cai, 2015). The last section gives a brief discussion and future directions.

A MIXTURE-MODELING APPROACH TO 3PLM
Mixture modeling is a powerful statistical tool for representing the presence of heterogeneous subpopulations within an overall population. The generic density form of mixture modeling for a random vector Y can be written as in which the quantities π 1 , . . . , π g are called the mixing proportions or weights for the g groups and the functions f 1 (y), . . . , f g (y) are called the component densities of the mixture. Obviously, mixture modeling overcomes the limitation of traditional modeling approach using one single density and attempts to approximate data by a linear combination of possibly various different densities. The idea of mixture modeling has been applied in psychometrics. Various Rasch mixture IRT models have been proposed to model different response styles and test taking strategies (Rost, 1997;von Davier and Rost, 2006). This paper takes advantage of this idea to reformulate 3PLM. If guessing was considered as a test-taking strategy, with introduction of a latent indicator variable for guessing, the 3PLM could be reformulated as a mixture model for two heterogeneous subpopulations: those who guess and those who do not, within each item. The detail derivation is presented below. The 3PLM is formulated as the follow: with in which u ij ∈ U is examinee j j = 1, 2, . . . , N 's response to item i (i = 1, 2, . . . , n), a i , b i , and c i are the item parameters; θ j ∈ θ is the ability parameter of the examinee j; D is the scaling constant 1.702. The 3PLM can be considered as a mixture of a degenerate distribution (in which all the probability mass is on a single point) with a probability of c i and a 2PLM with a probability of 1 − c i within each item. Mixture models with a degenerate model is termed as nonstandard mixture models and has been studied intensively in statistics (Cornell University Library, 1989). The observed data in a mixture problem is usually viewed as incomplete. A latent indicator variable z ij ∈ Z for guessing is introduced: Thus, the marginal of z ij follows a Bernoulli (1 − c i ). And the conditional distribution of u ij on z ij is as the follow (let ξ i = a i , b i , c i ∈ ξ represents the item parameter vector for the ith item): (4) To simplify the derivation of the joint distribution of u ij and z ij , z ij and θ j are assumed to be independent which leads to: More specifically, Please note that the case of u ij = 0 and z ij = 0, whose probability is zero, is redundant, so the other three cases consist of the three elements of the joint distribution. The assumption of independence means that, for each item, the examinee decides randomly whether guessing or ability-based responding is chosen first (San Martín et al., 2006;von Davier, 2009). Let u j and z j be the response vector and the latent indicator vector for examinee j, respectively, and then the joint distribution for the new augmented complete data (U, Z, θ ) is where and g θ j |τ is a density function for θ , where τ is the vector containing the parameters of the examinee population ability distribution. Following Bock and Lieberman (1970), the marginal distribution for a single examinee j is: So, the likelihood function of the EMM is: and the log-likelihood ln L = ln L(U, Z|ξ ) .

THE EXPECTATION-MAXIMIZATION-MAXIMIZATION (EMM) ALGORITHM
The reformulation of 3PLM points to the possibility that the guessing parameter can be estimated as a mixing parameter in mixture models, separated from the item discriminatory and difficulty parameters which will be estimated as the unknown parameters in the component density (the 2PLM). Put this in the language of the EM for IRT models and one has the Expectation (with respect to the latent variables , Z)-Maximization (with respect to the guessing parameter (c i )-Maximization (with respect to the item discrimination and difficulty parameter a i and b i ). The first expectation step follows the same idea as the EM for IRT (Bock and Aitkin, 1981) with slight modification due to the introduction of the extra latent variable Z; the two maximization steps consist of the EM algorithm for the mixture models. In order to facilitate understanding, the description and mathematical notations of EMM in the sequel will follow Baker and Kim (2004). It is worthwhile noting that there is a tremendous similarity in the derivation of EMM and MMLE/EM with the only exception of the joint distribution of the complete data.

Expectation Step and Artificial Data
Let ψ i denote any item parameter for item i in ξ i , and very similar to the mathematical derivation for MMEL/EM in Baker and Kim (2004), the first derivative of the log-likelihood function in EMM for each item parameter can be obtained as (see Appendix A in Supplementary Material for details): where P θ j |u j , z j , τ , ξ is the posterior probability of θ j given u j , z j , τ , and ξ , and P θ j |u j , z j , τ , ξ always equals P θ j |u j , τ , ξ because of θ j and z j are assumed to be independent. The conditional expectation of Z is also essential for the E step. From the joint distribution in Equation (6), one can calculate the expectation of z ij conditional on u ij and the marginal distribution of z ij . By using the Bayesian rule, one has the conditional distribution of z ij on u ij as: , Then the conditional expectation of z ij is The expectation step is to obtain the conditional expectation of the complete-data log likelihood with respect to the observation data, namely the response matrix U. In EMM it essentially amounts to plugging the conditional expectation of the indicator variable Z and integrating over the latent ability variable as in the original MMLE/EM. Let X k (k = 1, 2, . . . , q) be nodes on the ability scale with an associated weight A(X k ), and the E step is as the follow: with where P X k |u j , z j , τ , ξ is the posterior probability of θ j at X k given u j , z j , τ , and ξ , and P X k | u j , z j , τ , ξ always equals P X k | u j , τ , ξ because of the independence between X k | and z j . Furthermore, P X k |u j , z j , τ , ξ can be used to compute the "artificial data". For instance, Bock and Aitkin (1981) has provided two fundamental artificial data for the traditional EM algorithm as: in whichf ik is the expected number of examinees with ability X k . Thus, the sum off ik for every ability X k equals the total number of examinees N. The second index,r ik , is the expected number of examinees with ability X k answering item i correctly.
As can be seen from Table 1, the EMM algorithm introduced a new latent variable Z, so there are two new artificial data: in whichf (Z) ik is the expected number of examinees with ability X k without using guessing strategy;r (Z) ik is the expected number of examinees with ability X k answering item i correctly without using the guessing strategy. Thus,r ik −r (Z) ik is the expected number of examinees with ability X k who can answer item i correctly using the guessing strategy. The expected number of examinees with ability X k who can answer item i incorrectly by using the guessing strategy is zero which can be inferred from P(u ij = 0, z ij = 0|θ j , ξ i ) = 0. Putting these facts together, one ik across all is equivalent to the total number of examinees N, namely, After the E-step and calculation of the artificial data, the next steps are computing the first and second derivatives of Equation (15) with respect to each item parameter.

Maximization Step-1 for the c Parameter
Setting the first derivative equals to 0 and solving for c i : and surprisingly, a closed solution for c i can be available as: There is a nice interpretation for the estimate of the guessing parameter. From the description of the artificial data, it is easy to note that q k=1 (r ik −r (Z) ik ) is the expected number of examinees who can answer item i correctly using the guessing strategy and thus, the estimate is exactly the proportion of these examinees in the total sample. This interpretation presents a strong mixture modeling flavor, drastically different from the traditional interpretation, the lower bound for the probability with which a person solves the item correctly.

Maximization Step-2 for a and b Parameters
The second Maximization step is to execute the Newton-Raphson procedure to obtain estimates for a i and b i . The required first derivatives for a i and b i are One may also set the derivatives to zero, but there is no analytical solution to them and thus Newton-Raphson method have to be used. In order to implement Newton-Raphson, the second derivatives are derived as So, the estimates can be obtained using the Newton-Raphson iteratively:

Standard Errors (SEs) of Parameter Estimation in EMM
One of the important indices to assess the estimation quality is the standard errors. One of the major criticisms of EM is, however, that parameter estimate SEs are not a natural product of the algorithm and some other methods have to be devised (McLachlan and Krishnan, 2007). EMM, as a member of EM algorithms, does not provide parameter estimate SEs either. In general, the inverse of the negative expected value of the matrix of second derivatives of a log likelihood is the asymptotic variancecovariance matrix of the estimates (McLachlan and Krishnan, 1968). The square roots of the diagonal elements of the inverse are the asymptotic standard errors of the parameters. This part will present a theoretical argument on why EMM works better than MMLE/EM for 3PLM. In the original MMLE/EM, the expected second derivative matrix can generically be written as   l aa l ab l ac l ab l bb l bc l ac l bc l cc   , in which l cc is the expected second derivatives of the log likelihood with respect to the guessing parameter, l ac and l bc are the elements for two expected second partial derivatives. The convergence issue of the MMLE/EM for 3PLM is caused by this matrix being non-positive-definite or ill-conditioned (Baker and Kim, 2004). To solve this problem, several researchers suggested that the guessing parameter can usually set to be reciprocal of the number of the alternatives in the item (Hutchinson, 1991;Han, 2012;McDonald, 2013), which essentially mounted to set l ac and l bc to be zeros and then the 3PLM could be estimated. This option is available in several IRT programs, such as NOHARM (Fraser and Mcdonald, 1988), but an obvious limitation of this practice is that the guessing parameter is, in fact, not estimated, but a rather rough "guess" and does not seem to be a reasonable assumption (San Martín et al., 2006). In contrast, this matrix is different in EMM because EMM amounts to a divide-and-conquer strategy. The challenge of estimating item parameters is intelligently partitioned into two smaller estimation problems: estimation of the guessing parameter as the mixing proportion parameter in mixing modeling and that of the 2PL model. Interestingly, the two smaller problems happen to have been fully investigated in statistics and IRT. Conceptually, EMM can be perceived as a combination of these two mature techniques in statics and IRT. The prominent advantage of EMM comes from the separation of estimation of the guessing parameter from the other two and, thus, any instability in estimating the guessing parameter will not negatively affect that of the other two parameters, or otherwise. Such a setup statistically implies that the covariance between the guessing estimate and the other two are zeros. One may ensemble the expected second derivative matrix in generic form as EMM provides a scientific alternative to the practice of fixing the guessing parameter estimate as the reciprocal of the number of the alternatives in the item. It can not only eliminate undesirable fluctuation in estimating the guessing parameter, but also make estimation of the guessing parameter possible. For the sake of completeness, the second derivatives of the log likelihood with respect to the guessing parameter is given as As a consequence of the separation setup, the calculation does not involve any terms related to the difficulty and discrimination parameters. The detailed derivation of the SEs for EMM is provided in Appendix B (Supplementary Material).

SIMULATION STUDY
A simulation study was done to demonstrate the feasibility of the new method, compared to the Bayesian EM in BILOG-MG (Zimowski et al., 2003). The parameters for 10 or 20 items were generated from independent normal distribution as in Mislevy (1986): for ln a, the mean and variance were 0 and 0.5; for b, 0.5 and 1.0; for logit c, −1.39 and 0.16.
Three sample sizes of examinees (1,000, 1,500, and 2,000) were generated from standard normal distribution. 50 replications were run for each condition in the fully crossed 2(EMM vs. BILOG-MG) × 3(1, 000 vs. 1,500 vs. 2,000) × 2(10 vs. 20) design. The evaluation criteria are the bias and RMSE. In order to further demonstrate the performance of EMM, the EMM and Bayes solutions respectively, against generating values of the quantity b − 2/a, a heuristic index based on the observation that less information is obtained about c as items become easier or less reliable (Lord, 1975). Results. The detailed results for the simulation study are presented in tables and figures in Appendix C (Supplementary Material). Only the results for the condition of 1,000 examinees and 10 items are summarized and presented here (Figures 1, 2) since others are very similar. This is the most unfavorable condition for MLE since the sample size is moderately small and the test length is relatively short. But even under this condition, the EMM is comparable to or better than Bayesian EM in terms of bias and RMSE. In contrast, the traditional MMLE EM usually fails to converge with such a small sample size. As a MLE algorithm EMM is as flexible as the Bayesian EM implemented in BILOG-MG to deal with 3PL modeling in most practical testing situations.
The plots for the heuristic index comparing the EMM and Bayes solutions respectively, against generating values of the quantity b − 2/a, echo the results in terms of bias and RMSE. The figures for the 1,000-examinee-10-item condition are presented here. In general, both solutions for items with high values in the index are satisfactory, but the Bayes estimates for some items with low values were rather biased while the EMM estimates were very stable.

TWO EMPIRICAL EXAMPLES
In order to further demonstrate the performance of the new algorithm, the authors apply EMM to two real data sets from BILOG-MG (Zimowski et al., 2003) and flexMIRT (Houts and Cai, 2015), and compare the item estimates to those from the two commercial programs. Specifically, the two data sets are the Example 1 in the BILOG-MG and Example 2 in flexMIRT. The BILOG-MG data set consists of 1,000 examinees' responses to 15 items and the one for flexMIRT consists of 2,844 examinees' responses to 12 items. Please refer to the manuals for the details. Both programs adopt different priors for the guessing parameter: BILOG-MG uses the default setting, a beta distribution with parameters of 4 and 16, namely, c ∼ Beta(4, 16), and flexMIRT, c ∼ Beta(1, 4). Please note that due to different parameterization in BILOG-MG technical document, the specification for the prior parameters are 5 and 17 which correspond to 4 and 16 in the standard beta distribution parameterization. The two priors have identical means, but the variance for flexMIRT prior is smaller which indicates that it is less informative. An additional analysis of flexMIRT data using the BILOG-MG default setting for the guessing parameter in flexMIRT [the c ∼ Beta(1, 4) solution] is also run. In order to further facilitate the comparison of SEs, the authors employ supplemented EM (SEM) to obtain SEs for EMM (Meng and Rubin, 1991) in the real-data analysis below and SEM  has been applied for various IRT models (Cai, 2008;Cai and Lee, 2009;Tian et al., 2013).
The specific results are summarized in the tables and figures in Appendix D (Supplementary Material) and only figures (Figure 3 for BILOG-MG data and Figure 4 for flexMIRT data) are presented here. Both Figures 3, 4 indicates that, for both real data sets, the point estimates (the upper part of the figures) for item parameters from EMM are comparable to those from BILOG-MG and flexMIRT, but SEs (the lower part of the figures) are smaller. Smaller SEs means that one may ascertain point estimates with greater confidence.
The advantage of EMM can be further confirmed by the difference between the c ∼ Beta(1, 4) solution and the c ∼ Beta(4, 16) solution to the point estimate for the guessing parameters for flexMIRT data. The EMM point estimates of the guessing parameters coincide with the c ∼ Beta(4, 16) solution, with smaller SEs while those from the c ∼ Beta(4, 16) solution present nontrivial differences. The authors may reasonably conclude that the difference between the solutions from flexMIRT calibration since the only difference in the specification is the prior setting for the guessing parameter. This difference is not a surprise at all because assigning proper priors is a common burden shared by the Bayesian approach. In the case of 3PLM calibration, however, different priors exert an undesirable negative effect of the point estimation and SEs. To exclude the possible difference in software implementation in the two programs, the authors also replicated the same analysis in BILOG-MG and the results corroborate the claim. Furthermore, as expected, the SEs in the c ∼ Beta(1, 4) solution are bigger than those in the c ∼ Beta(4, 16) solution which is another piece of evidence that priors in Bayesian methods lead to different SE estimates.

DISCUSSION AND FUTURE DIRECTIONS
In summary, the EMM essentially is a member of EM family for 3PLM. The fundamental difference between the EMM and the original EM is that the old complete data setup (U, θ ) is expanded  Frontiers in Psychology | www.frontiersin.org into (U, θ , Z). This change enables the algorithm to explore the likelihood function curve more thoroughly. More specifically, the challenge of estimating item parameters is intelligently partitioned into two smaller estimation problems: estimation of the guessing parameter as the mixing proportion parameter in mixing modeling and that of the 2PL model, which happen to be fully investigated in statistics and IRT. The benefit of this strategy is that the estimation error in the two problems does not exert any negative effect on each other and thus both of them are more stable. Moreover, for the guessing parameter, there is an analytical solution for the score function and thus the NR and the second derivatives are not necessary.
The EMM is a feasible algorithm to obtain MLE for 3PLM with modest sample size. It has several theoretical and practical implications. First, the mixture modeling approach to 3PLM is an interesting novel perspective on 3PLM. Second, this new method provides a feasible alternative for practitioners, enabling them to obtain MLE for 3PLM with modest sample size. This paper is not to advocate to eliminate use of Bayesian estimation methods. This can be used to check with the Bayesian solution with other IRT programs in practice. Due to the high complexity in realworld 3PLM data, a combination of EMM and Bayesian methods might lead to a more sophisticated and nuanced understanding of data.
Another important feature of EMM is that the EMM stopping criterion serves as a self-checking/correcting mechanism for the independence assumption. The stopping criterion is calculated through the original 3PLM likelihood function, although the E and M steps are built with the approximation of the likelihood function of the reformulated model. That is to say, the EMM would not converge until the difference between the values of the original 3PLM likelihood function evaluated at the current and last-cycle item estimates becomes negligible. If the independence assumption is severely violated and thus the approximation is not accurate enough, the convergence of the algorithm will suffer. In this sense, the EMM algorithm has provided a data-driven validation method for this assumption.
Several research questions deserve further attention. Firstly, there are two different interpretation of 3PLM (Hutchinson, 1991;San Martín et al., 2006;von Davier, 2009) and this paper chose one of them. The derivation of an EMM for the other interpretation will be very similar to this paper, but it needs to provoke different assumptions, so it would be interesting to compare the performance of two versions in terms of estimate and interpretation for the guessing parameter, numerical stability, etc. Secondly, a Bayesian EMM can be investigated. EMM is a more powerful MMLE method comparable to Bayesian EM, but the simulation studies indicate that there might be some inflation in item estimates in EMM which points to the possibility of improvement. A natural alternative is to combining the Bayesian approach and EMM. Bayesian EMM will solve the issue of estimate inflation while taking advantage of the EMM exploring the likelihood function. Thirdly, the mixture modeling approach and EMM can be naturally extended to 4PLM ( Barton and Lord, 1981), which is a generalization of 3PLM and includes an upper asymptote for the probability of a correct response. There is a renewed interest in 4PLM (Rulison and Loken, 2009;Loken and Rulison, 2010;Liao et al., 2012;Ogasawara, 2012;Feuerstahler and Waller, 2014;Culpepper, 2016) for its usefulness in measuring psychological constructs. But, just as in the case of 3PLM, one consistent discussion point regarding 4PL pertains to the difficulty in estimating item parameters. The method proposed in this paper is a promising way for 4PLM.

AUTHOR CONTRIBUTIONS
CZ: Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Writing -original draft, Writing -review, and editing. XM: Funding acquisition and Software. SG: Software, Writing -review, and editing. ZL: Data curation, Project Administration, and Validation.