A General Three-Parameter Logistic Model With Time Effect

Within the framework of item response theory, a new and flexible general three-parameter logistic model with response time (G3PLT) is proposed. The advantage of this model is that it combines the time effect, ability, and item difficulty to influence the correct-response probability. In contrast to the traditional response time models used in educational psychology, the new model incorporates the influence of the time effect on the correct-response probability directly, rather than linking the two through latent speed parameters in a hierarchical structure as in van der Linden's model. In addition, a Metropolis–Hastings within Gibbs sampling algorithm is employed to estimate the model parameters. Based on the Markov chain Monte Carlo output, two Bayesian model assessment methods are used to assess the relative goodness of fit of competing models. Finally, two simulation studies and a real data analysis are performed to further illustrate the advantages of the new model over the traditional three-parameter logistic model.


INTRODUCTION AND MOTIVATION
Computerized assessment has become a widely accepted method of testing because examinees' results can be evaluated quickly and accurately with the computational power now available. In addition, with the help of computer technology, the response times of examinees are easier to collect than in traditional paper-and-pencil tests. The collected response times provide a valuable source of information on examinees and test items. For example, response times can be used to improve the accuracy of ability estimates (Klein Entink et al., 2009a; van der Linden and Glas, 2010; Wang et al., 2013, 2018a; Wang and Xu, 2015; Fox and Marianti, 2016; Bolsinova and Tijmstra, 2018; De Boeck and Jeon, 2019), to detect rapid guessing and cheating behavior (van der Linden and Guo, 2008; van der Linden, 2009; Wang and Xu, 2015; Pokropek, 2016; Qian et al., 2016; Skorupski and Wainer, 2017; Wang et al., 2018a,b; Lu et al., 2019; Sinharay and Johnson, 2019; Zopluoglu, 2019), to evaluate the speededness of tests (Schnipke and Scrams, 1997; van der Linden et al., 2007), and to design more efficient tests (Bridgeman and Cline, 2004; Chang, 2004; Choe et al., 2018).

Advantages of Our Model Over Traditional Response Time Models in Educational Psychology Research
Although response times have been studied widely and in depth in both educational and psychological research, there are still some deficiencies in the existing literature. Here, we compare existing response time models with our new model and analyze the advantages of our model from multiple aspects. Thissen (1983) proposed a joint model of response time and accuracy to describe the speed-accuracy relationship. In his model, the speed-accuracy trade-off is reflected by letting response accuracy depend on the time devoted to an item: spending more time on an item increases the probability of a correct response. Thissen's joint model can be expressed as

log T_ij = u + η_i + ς_j + ρ(a_j θ_i − b_j) + ε_ij,

where T_ij is the response time of the ith examinee answering the jth item, u is a general intercept parameter, η_i and ς_j can be interpreted, respectively, as the speed of examinee i and the amount of time required by item j, ρ is a regression parameter, a_j and b_j are, respectively, the item discrimination and difficulty parameters, θ_i is the ability parameter for the ith examinee, and ε_ij ∼ N(0, σ²). The speed-accuracy trade-off is represented by the term a_j θ_i − b_j when ρ < 0. When ρ > 0, the speed-accuracy relation is reversed. However, the way in which this model incorporates the person-level and item-level parameters means that it is unable to fully reflect the direct impact of the response time on the correct-response probability. Our new model solves this problem: the response time and the ability and item difficulty parameters are combined in a single item response model that reflects the way in which the interactions among the three factors influence the correct-response probability. To provide an intuitive explanation, we use a three-dimensional diagram (Figure 1) to illustrate the effect of ability and response time on the correct-response probability. A similar modeling method was proposed by Verhelst et al. (1997).
Roskam (1987, 1997) proposed a Rasch response time model integrating response time and correctness. According to this model, the probability of a correct response for the ith examinee answering the jth item can be written as

p(Y_ij = 1 | T_ij, i, j) = exp(ξ_i + τ_ij − κ_j) / [1 + exp(ξ_i + τ_ij − κ_j)],

where Y_ij denotes the response of the ith examinee answering the jth item, θ_i is the ability parameter for the ith examinee, δ_j is the item difficulty parameter for the jth item, and ξ_i, τ_ij, and κ_j are the logarithms of θ_i, T_ij, and δ_j, respectively. We can see that as T_ij goes to infinity, the correct-response probability p(Y_ij = 1 | T_ij, i, j) approaches 1, no matter how difficult the item is. In fact, this type of model can only be applied to speeded tests, because a basic characteristic of such tests is that the items are quite easy, so, with unlimited time available, the answers are almost always correct. However, our new model is designed for a power test. This means that even if the examinees are given enough time, they cannot be sure of answering an item correctly; rather, they answer it correctly with the probability given by a three-parameter logistic (3PL) model. Although there is some similarity between our model and the item response model proposed by Wang and Hanson (2005) with regard to the incorporation of response time into the traditional 3PL model, there are some major differences in concept and construction. Wang and Hanson give the probability of a correct response to item j by examinee i as

p(Y_ij = 1 | T_ij) = c_j + (1 − c_j) / {1 + exp[−1.7 a_j (θ_i − b_j − d_j η_i / T_ij)]},

where a_j, b_j, and c_j are, respectively, the item discrimination, difficulty, and guessing parameters for the jth item, as in the regular 3PL model, θ_i and η_i are, respectively, the ability and slowness parameters for the ith examinee, and d_j is the slowness parameter for the jth item. The item and person slowness parameters determine the rate of increase in the probability of a correct answer as a function of response time. We now analyze the differences between the two models.
From the perspective of model construction, the response time and the item and person parameters are all incorporated into the same exponential function in Wang and Hanson's model, namely, exp[−1.7a_j(θ_i − b_j − d_j η_i / t_ij)], whereas in our model, the parameters and the time effect appear in two different exponential functions (see the following section for a detailed description of the model): exp[−1.7a_j(θ_i − b_j)] + exp(−t*_ij). Our model considers not only the influence of the person and item factors on the correct-response probability, but also that of the time effect. In Wang and Hanson's model, two slowness parameters associated with persons and items are introduced on top of the traditional 3PL model, which increases the complexity of the model: it can be identified only by imposing stronger constraints on the model parameters, and the accuracy of parameter estimation may be reduced owing to the increase in the number of model parameters. In our model, by contrast, no such additional parameters related to items and persons are introduced, and therefore the model is more concise and easier to understand. In terms of model identifiability, our model is similar to the traditional 3PL model in that no additional restrictions need to be imposed. More importantly, parameter estimation becomes more accurate because of the addition of time information. Besides the person ability parameter, a person slowness parameter is included in Wang and Hanson's model; in fact, their model is a multidimensional item response theory model incorporating response time. In their model, it is assumed that these two person parameters are independent, but this assumption may not necessarily hold in practice. For example, the lower a person's ability, the slower their response may be; that is, there may be a negative correlation between the ability parameter and the slowness parameter. More research is needed to verify this.
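To make the structural contrast concrete, the two response functions can be sketched in code. This is a minimal sketch, assuming the two-exponential denominator quoted above for our model and the single-exponent form of Wang and Hanson's model; all parameter values in the usage below are chosen arbitrarily for illustration.

```python
import math

def p_g3plt(theta, t_star, a, b, c, D=1.7):
    # G3PLT sketch: ability/difficulty and the time effect sit in two
    # separate exponentials in the denominator.
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)) + math.exp(-t_star))

def p_wang_hanson(theta, t, a, b, c, d, eta, D=1.7):
    # Wang & Hanson (2005) sketch: the slowness parameters d (item) and
    # eta (person) enter the same exponent as ability and difficulty.
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b - d * eta / t)))
```

As the transformed time grows, the G3PLT probability rises toward the ordinary 3PL value; in Wang and Hanson's model the same limiting behavior is driven by the slowness term d_j η_i / t_ij vanishing.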
Like other models based on the traditional 3PL model (see the next subsection), Wang and Hanson's model cannot distinguish between different abilities under different time intensities when examinees have the same response framework. However, our new model can deal with this problem very well.
In addition, our model introduces the concept of a time weight. Depending on the importance of a test (e.g., whether it is a high-stakes or a low-stakes test), the effect of the time constraint on the whole test is characterized by a time weight. This is something that cannot be dealt with by Wang and Hanson's model.

van der Linden (2007) proposed a hierarchical framework in which responses and response times are modeled separately at the measurement level, while at a higher level, the ability and speed parameters are included in a population model to account for the correlation between them. In his approach, the latent speed parameter directly affects the response time, while the speed and ability parameters are linked through the hierarchical model. It is known that in item response theory models, ability has a direct impact on the correct-response probability. Thus, under this framework, the correct-response probability is related to the response time only via the person parameters (speed and ability). Van der Linden's hierarchical modeling method therefore links response time to the correct-response probability only indirectly, whereas our model includes the response time in the item response model itself and so represents the relationships among response time, ability, and correct-response probability more simply and directly. Several other models have a structure similar to van der Linden's hierarchical model, including those of Fox et al. (2007), Klein Entink et al. (2009a), van der Linden and Glas (2010), Marianti et al. (2014), Wang and Xu (2015), Wang et al. (2018a), Fox and Marianti (2016), and Lu et al. (2019).

Advantages of Our Model Compared With the Traditional 3PL Model
Item response theory (IRT) models have been extensively used in educational testing and psychological measurement (Lord and Novick, 1968; van der Linden and Hambleton, 1997; Embretson and Reise, 2000; Baker and Kim, 2004). The most popular IRT model that includes guessing is the 3PL model (Birnbaum, 1968), which has been discussed in many papers and books (see, e.g., Hambleton et al., 1991; van der Linden and Hambleton, 1997; Baker and Kim, 2004; von Davier, 2009; Han, 2012). However, several studies have revealed that the 3PL model has technical and theoretical limitations (Swaminathan and Gifford, 1979; Zhu et al., 2018). In this paper, we focus on another defect of the traditional 3PL model, namely, that it cannot distinguish between different abilities under different time intensities when the examinees have the same response framework. Here, we give a simulation example to illustrate the shortcomings of the traditional 3PL model and the advantages of our model (the general three-parameter logistic model with response time, G3PLT). We assume that 24 examinees answer three items and that the examinees can be divided into three groups of eight, with the examinees in each group having response frameworks (1, 0, 0), (0, 1, 0), and (1, 1, 0), respectively.
Here, 1 indicates that an item is answered correctly and 0 that it is answered incorrectly. The item parameters of the three items are calibrated in advance and known. The discrimination, difficulty, and guessing parameters are set as in Table 1.
To consider the influence of different time effects on the estimated abilities of the examinees, seven time transformation values are considered: −0.2, 0.2, 0.5, 1, 2, 3, and 8. The specific settings for the time transformation values can be found in section 2. Table 2 shows the estimated ability values from the 3PL model and from our model under the different response frameworks, with the maximum likelihood method being used to estimate the ability parameter.
Several conclusions can be drawn from Table 2. We now give another example to further explain the advantages of the G3PLT model. Under the condition that the correct-response probability is the same, we consider the response times of examinees i and j when they answer the same item, and we find that these are 1 and 2 min, respectively. In general, we think that an examinee with a shorter response time has a higher ability. Thus, here the ability of examinee i should be higher than that of examinee j. However, since the 3PL model does not consider response time, the difference in ability cannot be distinguished. This problem can be solved by using the G3PLT model. Because this model takes into account the information provided by response time, it can estimate the ability of examinees more objectively and accurately. As shown in Figure 2, for the same item, L1 represents the item characteristic curve corresponding to the case where examinees need a long response time (t* = 4.41), and L2 represents the item characteristic curve corresponding to the case where examinees need a short response time (t* = 1.94). When p = 0.86 is given as the correct-response probability, the estimated ability under L1 is 0, while the estimated ability under L2 is 0.88. Therefore, according to the evaluation results from the G3PLT model, the examinees with shorter times should have higher abilities, whereas the 3PL model is unable to distinguish between the two cases. In addition, it can be seen from the figure that when the ability is fixed at 0, the probabilities of a correct response under the two characteristic curves L1 and L2 are 0.86 and 0.52, respectively. This indicates that under the same ability condition, the correct-response probability of the examinees with short response times is lower than that of the examinees with long response times.
The remainder of this paper is organized as follows. Section 2 presents a detailed introduction to the proposed G3PLT model. Section 3 provides a computational strategy based on a Metropolis-Hastings within Gibbs sampling algorithm to meet computational challenges for the proposed model. Two Bayesian model comparison criteria are also discussed in section 3. In section 4, simulation studies are conducted to examine the performance of parameter recovery using the Bayesian algorithm and to assess model fit using the deviance information criterion (DIC) and the logarithm of the pseudomarginal likelihood (LPML). A real data analysis based on the Program for International Student Assessment (PISA) is presented in section 5. We conclude with a brief discussion and suggestions for further research in section 6.

THE MODEL AND ITS IDENTIFICATION
2.1. The General Three-Parameter Logistic Model With Response Time (G3PLT)
Let θ_i denote the latent ability of the ith examinee, and let a_j, b_j, and c_j denote the item effects, which are generally interpreted, respectively, as discrimination power, difficulty, and the success probability in the case of random guessing. If Y_ij denotes the response of the ith examinee answering the jth item, then the corresponding correct-response probability can be expressed as

p(Y_ij = 1 | θ_i, t*_ij) = c_j + (1 − c_j) / {1 + exp[−Da_j(θ_i − b_j)] + exp(−t*_ij)},   (2.1)

where D is a constant equal to 1.7 and t*_ij is the transformed response time defined in the next subsection. The influence of the time effect on the probability is described by the term exp(−t*_ij).

Time Transformation Function
It is obvious that when the response time for an item is very short, the correct-response probability of the item is reduced.
In addition, we know that it is impossible for an examinee to answer an item 100% correctly even if they are given enough time to think about the item, and this can be attributed to limitations of the examinee's ability. When examinees are given enough time to answer each item, our model reduces to the traditional 3PL model, and each item is answered correctly with the corresponding 3PL correct-response probability. To make the model fully represent the requirement that the correct-response probability varies with time, and to eliminate the effects of different average response times for each item in different tests, we consider the following time transformation:

t*_ij = (log t_ij − µ_t)/σ_t + W,

where µ_t is the logarithm of the average time spent by all examinees in answering all items, σ_t is the corresponding standard deviation, and W denotes the time weight, which is equal to zero or a positive integer. From the simulation study and real data analysis, we find that the G3PLT model reduces to the traditional 3PL model when the time weight increases to 8, and we therefore restrict the weight to values in the range 0-8. An increase in the time weight indicates that the time factor of the test has a smaller influence on the correct-response probability of the examinee.
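A minimal sketch of this transformation follows; it assumes the time weight W enters additively after standardization, which is our reading of the description above (the function name `transform_times` is ours, not the authors').

```python
import numpy as np

def transform_times(t, W):
    """Standardize log response times, then shift by the time weight W.

    t: array of raw response times pooled over all examinees and items.
    A larger W pushes t* upward, so exp(-t*) shrinks and the G3PLT
    response probability approaches the plain 3PL value.
    """
    log_t = np.log(t)
    mu_t, sigma_t = log_t.mean(), log_t.std()
    return (log_t - mu_t) / sigma_t + W
```

With W = 0 the transformed times are standardized (mean 0, SD 1); each unit of W simply shifts every transformed time upward by 1.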
Proposition 1. Suppose that the correct-response probability is given by Equation (2.1). Then, we have the following results:
1. As the transformed time t*_ij → +∞, the G3PLT model reduces to the 3PL model. That is,

lim_{t*_ij → +∞} p(Y_ij = 1 | θ_i, t*_ij) = c_j + (1 − c_j) / {1 + exp[−Da_j(θ_i − b_j)]}.

In other words, it is impossible for the examinee to answer the item 100% correctly even if they are given enough time to think about it, which can be attributed to the limitations of the examinee's ability.
2. As the transformed time t*_ij → −∞ (the original time t_ij → 0), the correct-response probability of the G3PLT model tends to the guessing probability. That is,

lim_{t*_ij → −∞} p(Y_ij = 1 | θ_i, t*_ij) = c_j.

When there is not enough time to answer items (e.g., at the end of the examination), any item answered by the examinee must be one that requires only a very short time to finish. As the response time continues to shorten, the correct-response probability is reduced toward the level of random guessing.
3. The G3PLT model can be reduced to a G2PLT model by constraining the lower asymptote parameter c_j to be zero, and a G1PLT model can be obtained by further constraining a_j to be the same across all items.

Asymptotic Properties of the Model
Let p_j be the correct-response rate for the jth item. When the transformed time t*_ij → +∞, the model in Equation (2.1) can be written as

p_j = c_j + (1 − c_j) / {1 + exp[−Da_j(θ_i − b_j)]},   (2.5)

and the ability can be obtained as

θ_i = b_j − (1/(Da_j)) log[(1 − c_j)/(p_j − c_j) − 1].   (2.6)

Next, we use a specific example to explain the meaning of Equations (2.5) and (2.6). Assuming that p_j = 0.5, a_j = 1.5, b_j = 1, and c_j = 0.1, we obtain θ_i = 0.8 from Equation (2.6). This result indicates that even if examinee i has sufficient response time to finish item j, the examinee's ability must be at least 0.8 (the intersection of the vertical asymptote and the x-axis in Figure 3) if the correct-response probability is to reach 0.5; otherwise, no matter how long a response time is allowed, the examinee's correct-response probability cannot reach 0.5. This is like a primary school pupil attempting to solve a college math problem: the pupil's ability is so low that no matter how much time he is given, he cannot get a correct answer to the item other than by guessing. Moreover, when the ability θ_i → +∞, the model in Equation (2.1) can be written as

p_j = c_j + (1 − c_j) / [1 + exp(−t*_ij)],   (2.7)

and the transformed time t*_ij can be obtained as

t*_ij = −log[(1 − c_j)/(p_j − c_j) − 1].   (2.8)

We again assume that p_j = 0.5, a_j = 1.5, b_j = 1, and c_j = 0.1. From Equation (2.8), the transformed time t*_ij is about −0.2. This result indicates that even if examinee i has a very strong ability, the transformed time required to answer item j should not be less than −0.2 (the intersection of the horizontal asymptote and the y-axis in Figure 3) if the correct-response probability is to reach 0.5; otherwise, no matter how strong the ability of the examinee, it is impossible to reach a correct-response probability of 0.5. This is like a college student solving a primary school math problem: although the college student's ability is very strong, she cannot finish the item in a very short time. In addition, the correct-response probability of the examinees is the same for any two points on an equiprobability curve.
For example, for the two examinees F1 and F2 with the same correct-response probability of 0.7 in Figure 3, examinee F1 with low ability (1) takes a long time (2.35), while examinee F2 with high ability (2) needs only a short time (1.67) to obtain the same correct-response probability. Similarly, the equiprobability curve based on item difficulty and time is shown in Figure 4. The correct-response probability is the same for any two points on the equiprobability curve: an item of high difficulty requires a long time, while an item of low difficulty requires only a short time, to give the same correct-response probability.
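Returning to the worked example above, the −0.2 threshold obtained from Equation (2.8) can be verified numerically. This is a small sketch that simply solves the limiting form p_j = c_j + (1 − c_j)/(1 + exp(−t*)) for t* at the example values.

```python
import math

# Worked-example values from the text
p, c = 0.5, 0.1

# Solve p = c + (1 - c) / (1 + exp(-t_star)) for t_star:
# exp(-t_star) = (1 - c) / (p - c) - 1
t_star = -math.log((1 - c) / (p - c) - 1)
```

Here (1 − c)/(p − c) − 1 = 0.9/0.4 − 1 = 1.25, so t* = −log(1.25) ≈ −0.223, matching the "about −0.2" reported in the text.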

Model Identification
To ensure identification of the G3PLT model, either the scale of the latent traits or the scale of the item parameters has to be restricted (Birnbaum, 1968; Lord, 1980; van der Linden and Hambleton, 1997). In this paper, we set the mean and variance of the latent traits to zero and one, respectively (Bock and Aitkin, 1981). The mean of the latent trait is fixed to remove the trade-off between θ_i and b_j in location, and the variance of the latent trait is fixed to remove the trade-off among θ_i, b_j, and a_j in scale.

Prior and Posterior Distributions
In a Bayesian framework, the posterior distribution of the model parameters is obtained from the observed-data likelihood (sample information) and the prior distributions (prior information). In general, both kinds of information have an important influence on the posterior distribution. However, in large-scale educational assessment, the number of examinees is often very large. Therefore, the likelihood information plays a dominant role, and the selection of different priors (informative or non-informative) has no significant influence on the posterior inference (Wang et al., 2018a). Based on previous results (Wang et al., 2018a), we adopt informative prior distributions to analyze the following simulation studies and real data. The specific settings are as follows. For the latent ability, we assume a standardized normal prior, i.e., θ_i ∼ N(0, 1) for i = 1, . . . , N. The prior distribution for the discrimination parameter a_j is a lognormal distribution, i.e., a_j ∼ log N(0, 1) for j = 1, . . . , J. The prior distribution for the difficulty parameter b_j is a standardized normal distribution, i.e., b_j ∼ N(0, 1) for j = 1, . . . , J. For the guessing parameter, we assume a Beta distribution, i.e., c_j ∼ Beta(2, 10) for j = 1, . . . , J. Then, the joint posterior distribution of the parameters given the data is

π(θ, a, b, c | Y, T) ∝ [∏_{i=1}^N ∏_{j=1}^J p_ij^{y_ij} (1 − p_ij)^{1−y_ij}] × ∏_{i=1}^N π(θ_i) × ∏_{j=1}^J π(a_j) π(b_j) π(c_j),   (3.1)

where p_ij is given by Equation (2.1).
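Up to additive constants, the log joint posterior implied by these priors can be sketched as follows. This is a sketch only: it assumes the two-exponential G3PLT response function, and normalizing constants are dropped, which is all that Metropolis-Hastings ratios require.

```python
import numpy as np

def log_prior(theta, a, b, c):
    # theta_i ~ N(0,1), a_j ~ logN(0,1), b_j ~ N(0,1), c_j ~ Beta(2,10);
    # additive constants are omitted.
    lp = -0.5 * np.sum(theta**2)
    lp += -0.5 * np.sum(np.log(a)**2) - np.sum(np.log(a))  # lognormal kernel
    lp += -0.5 * np.sum(b**2)
    lp += np.sum((2 - 1) * np.log(c) + (10 - 1) * np.log(1 - c))  # Beta(2,10)
    return lp

def log_likelihood(y, t_star, theta, a, b, c, D=1.7):
    # y: N x J binary responses; t_star: N x J transformed times;
    # theta: length-N; a, b, c: length-J. Broadcasting builds the N x J
    # G3PLT probability matrix.
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta[:, None] - b)) + np.exp(-t_star))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```

The log posterior is then `log_likelihood(...) + log_prior(...)`, which is the quantity the sampler in the next section needs only up to a constant.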

Bayesian Estimation
Bayesian methods have been widely applied to estimate parameters in complex IRT models (see, e.g., Albert, 1992; Patz and Junker, 1999a,b; Béguin and Glas, 2001; Rupp et al., 2004). In this study, the Metropolis-Hastings within Gibbs algorithm (Metropolis et al., 1953; Hastings, 1970; Tierney, 1994; Chib and Greenberg, 1995; Chen et al., 2000) is used to draw samples from the full conditional posterior distributions, because the parameters of interest do not have conjugate priors within the framework of the IRT model.

Detailed MCMC Sampling Process
Step 1: Sample the ability parameter θ_i for the ith examinee. We independently draw θ*_i from the normal proposal distribution θ*_i ∼ N(θ_i^(r−1), v²_θ). The prior of θ_i is assumed to follow a normal distribution with mean µ_θ and variance σ²_θ, i.e., θ_i ∼ N(µ_θ, σ²_θ). Therefore, the acceptance probability is given by

α(θ_i^(r−1), θ*_i) = min{1, [∏_{j=1}^J p_ij(θ*_i)^{y_ij} (1 − p_ij(θ*_i))^{1−y_ij} π_prior(θ*_i)] / [∏_{j=1}^J p_ij(θ_i^(r−1))^{y_ij} (1 − p_ij(θ_i^(r−1)))^{1−y_ij} π_prior(θ_i^(r−1))]},

where p_ij is given in Equation (2.1). If the proposal is accepted, we set θ_i^(r) = θ*_i; otherwise, the value of the preceding iteration is retained, i.e., θ_i^(r) = θ_i^(r−1).
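Step 1 can be sketched as a random-walk Metropolis update. This is an illustrative sketch, not the authors' implementation: it assumes the two-exponential G3PLT response function, and the proposal standard deviation `v` is an arbitrary tuning constant.

```python
import numpy as np

rng = np.random.default_rng(1)

def p_g3plt(theta, t_star, a, b, c, D=1.7):
    # G3PLT response probability (vectorized over the J items)
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)) + np.exp(-t_star))

def log_lik_person(theta_i, y_i, t_i, a, b, c):
    # Log-likelihood of examinee i's J responses at ability theta_i
    p = p_g3plt(theta_i, t_i, a, b, c)
    return np.sum(y_i * np.log(p) + (1 - y_i) * np.log(1 - p))

def update_theta_i(theta_cur, y_i, t_i, a, b, c, v=0.5, mu=0.0, sigma=1.0):
    """One random-walk MH update for theta_i (Step 1 of the sampler)."""
    theta_prop = rng.normal(theta_cur, v)
    # Log acceptance ratio: likelihood ratio times N(mu, sigma^2) prior
    # ratio; the normal proposal is symmetric, so it cancels.
    log_ratio = (log_lik_person(theta_prop, y_i, t_i, a, b, c)
                 - log_lik_person(theta_cur, y_i, t_i, a, b, c)
                 - 0.5 * ((theta_prop - mu) ** 2 - (theta_cur - mu) ** 2) / sigma**2)
    if np.log(rng.uniform()) < log_ratio:
        return theta_prop
    return theta_cur
```

The item-parameter updates in the following steps have the same structure, with the appropriate prior (and, for asymmetric proposals, the proposal density) entering the ratio.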
Step 2: Sample the difficulty parameter b_j for the jth item. We independently draw b*_j from the normal proposal distribution b*_j ∼ N(b_j^(r−1), v²_b). The prior of b_j is assumed to follow a normal distribution with mean µ_b and variance σ²_b, i.e., b_j ∼ N(µ_b, σ²_b). The acceptance probability is constructed in the same way as in Step 1, based on the likelihood of Y_j = (Y_1j, Y_2j, . . . , Y_Nj) given T_j = (T_1j, T_2j, . . . , T_Nj) and θ = (θ_1, θ_2, . . . , θ_N). If the proposal is rejected, the value of the preceding iteration is retained, i.e., b_j^(r) = b_j^(r−1).

Step 3: Sample the discrimination parameter a_j for the jth item. We independently draw a*_j from the log-normal proposal distribution a*_j ∼ log N(log a_j^(r−1), v²_a). In addition, p_prior(a_j) is a lognormal prior distribution, i.e., a_j ∼ log N(µ_a, σ²_a). The acceptance probability has the same structure, with the asymmetric log-normal proposal density entering the ratio. If the proposal is rejected, the value of the preceding iteration is retained, i.e., a_j^(r) = a_j^(r−1).

Step 4: Sample the guessing parameter c_j for the jth item. We independently draw c*_j from a uniform proposal distribution around c_j^(r−1). The prior of c_j is assumed to follow a Beta distribution, i.e., c_j ∼ Beta(υ_1, υ_2). The acceptance probability is again constructed as in Step 1. If the proposal is rejected, the value of the preceding iteration is retained, i.e., c_j^(r) = c_j^(r−1).

Bayesian Model Assessment
Spiegelhalter et al. (2002) proposed the deviance information criterion (DIC) for model comparison when the number of parameters is not clearly defined. The DIC is an integrated measure of model fit and complexity: it is the sum of a deviance measure and a penalty term for the effective number of parameters based on a measure of model complexity. We write the joint likelihood of the responses as

f(y | θ, a, b, c, t) = ∏_{i=1}^N ∏_{j=1}^J p_ij^{y_ij} (1 − p_ij)^{1−y_ij},   (3.6)

where p_ij is the response probability of the G3PLT model. The logarithm of the joint likelihood function in Equation (3.6) evaluated at the rth posterior draw (θ^(r), a^(r), b^(r), c^(r)) is given by

log f(y | θ^(r), a^(r), b^(r), c^(r), t) = Σ_{i=1}^N Σ_{j=1}^J log f(y_ij | θ_i^(r), a_j^(r), b_j^(r), c_j^(r), t_ij).   (3.7)

Frontiers in Psychology | www.frontiersin.org

The joint log-likelihoods for the responses, log f(y_ij | θ_i^(r), a_j^(r), b_j^(r), c_j^(r), t_ij), i = 1, . . . , N and j = 1, . . . , J, are readily available from the MCMC sampling outputs, and therefore Equation (3.7) is easy to compute. Writing D(θ, a, b, c) = −2 log f(y | θ, a, b, c, t) for the deviance, its posterior mean is estimated by

D̄ = (1/R) Σ_{r=1}^R D(θ^(r), a^(r), b^(r), c^(r)),   (3.8)

and the effective number of parameters in the model is defined by

p_D = D̄ − D(θ̄, ā, b̄, c̄),   (3.9)

where the bars denote the posterior means of the parameters. The DIC can now be formulated as

DIC = D(θ̄, ā, b̄, c̄) + 2p_D = D̄ + p_D.

A model with a smaller DIC value fits the data better. Another method is to use the logarithm of the pseudomarginal likelihood (LPML) (Geisser and Eddy, 1979; Ibrahim et al., 2001) to compare different models. This is also based on the log-likelihood functions evaluated at the posterior samples of the model parameters. The detailed calculation process is as follows.
We let U_ij,max = max_{1≤r≤R} [−log f(y_ij | θ_i^(r), a_j^(r), b_j^(r), c_j^(r), t_ij)]. A Monte Carlo estimate of the conditional predictive ordinate (CPO) (Gelfand et al., 1992; Chen et al., 2000) is then given by the harmonic mean

ĈPO_ij = {(1/R) Σ_{r=1}^R [f(y_ij | θ_i^(r), a_j^(r), b_j^(r), c_j^(r), t_ij)]^{−1}}^{−1},

whose logarithm can be computed stably as

log ĈPO_ij = −U_ij,max − log{(1/R) Σ_{r=1}^R exp[−log f(y_ij | θ_i^(r), a_j^(r), b_j^(r), c_j^(r), t_ij) − U_ij,max]}.   (3.11)

Note that the maximum-value adjustment used in log ĈPO_ij plays an important role in the numerical stabilization of Equation (3.11). A summary statistic of the CPO_ij is the sum of their logarithms, which is called the LPML and is given by

LPML = Σ_{i=1}^N Σ_{j=1}^J log ĈPO_ij.   (3.12)

A model with a larger LPML has a better fit to the data.
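Given the per-draw log-likelihood values from the MCMC output, the DIC and LPML computations described above can be sketched as follows. This is a sketch: `loglik_at_mean` is assumed to be the joint log-likelihood evaluated at the posterior means of the parameters, which must be computed separately.

```python
import numpy as np

def dic(loglik_draws, loglik_at_mean):
    """DIC from R joint log-likelihood values (one per draw) and the
    log-likelihood at the posterior means of the parameters."""
    d_bar = -2.0 * np.mean(loglik_draws)   # posterior mean deviance
    d_hat = -2.0 * loglik_at_mean          # deviance at the posterior means
    p_d = d_bar - d_hat                    # effective number of parameters
    return d_hat + 2.0 * p_d               # equals d_bar + p_d; smaller is better

def lpml(loglik_draws_ij):
    """LPML from an R x N x J array of pointwise log-likelihoods
    log f(y_ij | draw r), with the max adjustment for stability."""
    neg = -loglik_draws_ij                 # -log f per draw and observation
    u_max = neg.max(axis=0)                # U_ij,max
    # log CPO_ij = -(U_ij,max + log mean_r exp(-log f - U_ij,max))
    log_cpo = -(u_max + np.log(np.mean(np.exp(neg - u_max), axis=0)))
    return log_cpo.sum()                   # larger LPML indicates better fit
```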

Accuracy Evaluation of Parameter Estimation
To implement the MCMC sampling algorithm, chains of length 10,000 with an initial burn-in period of 5,000 are chosen. In the following simulation study, 200 replications are used. Five indices are used to assess the accuracy of the parameter estimates. Let ϑ be the parameter of interest, and assume that M = 200 data sets are generated. Also, let ϑ̂^(m) and SD^(m)(ϑ) denote the posterior mean and the posterior standard deviation of ϑ obtained from the mth simulated data set for m = 1, . . . , M.
The bias for the parameter ϑ is defined as

Bias(ϑ) = (1/M) Σ_{m=1}^M (ϑ̂^(m) − ϑ),

and the mean squared error (MSE) for ϑ is defined as

MSE(ϑ) = (1/M) Σ_{m=1}^M (ϑ̂^(m) − ϑ)².   (3.14)

The simulation SE is the square root of the sample variance of the posterior estimates over the simulated data sets,

SE(ϑ) = {(1/(M − 1)) Σ_{m=1}^M (ϑ̂^(m) − ϑ̄)²}^{1/2}, where ϑ̄ = (1/M) Σ_{m=1}^M ϑ̂^(m),   (3.15)

and the average posterior standard deviation is defined as

SD(ϑ) = (1/M) Σ_{m=1}^M SD^(m)(ϑ).

The coverage probability (CP) based on the 95% highest probability density (HPD) intervals is defined as the proportion of the M replications in which the 95% HPD interval contains the true value of ϑ.
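These five indices can be computed per parameter as follows; a sketch in which each array argument holds the M replication-level summaries (the function name and argument names are ours).

```python
import numpy as np

def recovery_indices(estimates, post_sds, hpd_lows, hpd_highs, truth):
    """Recovery summaries for one parameter over M replications.

    estimates: posterior means, one per replication
    post_sds:  posterior standard deviations, one per replication
    hpd_lows/hpd_highs: bounds of the 95% HPD interval per replication
    truth:     the true parameter value
    """
    bias = np.mean(estimates - truth)
    mse = np.mean((estimates - truth) ** 2)
    se = np.std(estimates, ddof=1)               # simulation SE across replications
    sd = np.mean(post_sds)                       # average posterior SD
    cp = np.mean((hpd_lows <= truth) & (truth <= hpd_highs))  # HPD coverage
    return bias, mse, se, sd, cp
```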

Simulation 1
We conduct a simulation study to evaluate the recovery performance of the combined MCMC sampling algorithm based on different simulation conditions.

Simulation Design
The following manipulated conditions are considered: (a) test length J = 20 or 60 and (b) number of examinees N = 500, 1,000, or 2,000. Fully crossing the levels of these two factors yields six conditions (two test lengths × three sample sizes). Next, the true values of the parameters are given. True item discrimination parameters a_j are generated from a truncated normal distribution, i.e., a_j ∼ N(1, 0.2)I(a_j > 0), j = 1, 2, . . . , J, where the indicator function I(A) takes the value 1 if A is true and 0 if A is false. The item difficulty parameters b_j are generated from a standardized normal distribution. The item guessing parameters c_j are generated from a Beta distribution, i.e., c_j ∼ Beta(2, 10). In addition, the ability parameters of the examinees, θ_i, are also generated from a standardized normal distribution. In each simulation condition, 200 replications are considered. Next, we generate the response time data for each examinee based on the following facts:
1. The difficulty of each item has a direct impact on the response time. That is to say, the time spent on simple items is shorter, and the time spent on difficult items is longer.
2. The ability of each examinee also has a direct impact on the response time. That is to say, examinees with higher ability spend less time on an item.
3. Depending on the importance of the test (high-stakes or low-stakes), the effect of the time constraint on the whole test is characterized by the time weight.
Following Wang and Xu (2015, p. 459), the average logarithms of the response times for each item based on solution behavior follow a normal distribution. That is, log t_j ∼ N(0.5, 0.25), j = 1, 2, . . . , J, where the average time t_j spent on item j is about 1.64872 (= e^0.5) min. We take the standardized transformation t*_j = f(t_j) = (log t_j − 0.5)/0.5, so that t*_j ∼ N(0, 1), where −∞ < t*_j < +∞. Next, we impose the premise that the easier an item, the shorter the response time: the true values of the difficulty parameters and the transformed times t*_j are both arranged in order from small to large, i.e., b_1 < b_2 < · · · < b_J and t*_1 < t*_2 < · · · < t*_J. The response time of each examinee is then generated from a normal distribution, i.e., t*_ij ∼ N(t*_j, 0.5), where j = 1, . . . , J. Moreover, for a given item j, the premise that examinees with higher ability spend less time on the item needs to be satisfied. Therefore, we arrange θ_1j > θ_2j > · · · > θ_N−1,j > θ_N,j and t*_1j < t*_2j < · · · < t*_N−1,j < t*_N,j, and the corresponding ability-time pairs (θ_ij, t*_ij) are obtained by matching the ordered true abilities with the ordered transformed times. The time weights range from 0 to 8: the higher the value of the time weight, the weaker the influence of the time factor of the test on the correct-response probability of the examinee. In this simulation study, we assume that the time factor of the test has an important influence on the correct-response probability of the examinee, and we therefore set the time weight to 1. Based on the true values of the parameters and the response time data, the response data can be simulated using the G3PLT model given by Equation (2.1).
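The generation scheme above can be sketched as follows. This is a sketch of the design, not the authors' code: `abs()` is a simple stand-in for the truncated normal, the additive time weight and the rank-matching of times to abilities are our reading of the ordering described in the text, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2023)
N, J, D, W = 500, 20, 1.7, 1

# True parameters (distributions as in the simulation design)
a = np.abs(rng.normal(1.0, np.sqrt(0.2), J))   # stand-in for N(1, 0.2)I(a > 0)
b = np.sort(rng.normal(0.0, 1.0, J))           # sorted: later items are harder
c = rng.beta(2, 10, J)
theta = rng.normal(0.0, 1.0, N)

# Item-level transformed times, sorted so harder items take longer,
# shifted by the time weight W
t_item = np.sort(rng.normal(0.0, 1.0, J)) + W

# Person-by-item times around the item means, re-ranked within each item
# so that higher ability always corresponds to shorter time
t_star = np.sort(rng.normal(t_item, np.sqrt(0.5), size=(N, J)), axis=0)
rank = np.argsort(np.argsort(-theta))          # 0 = highest ability
t_star = t_star[rank, :]

# Responses from the G3PLT response function (two-exponential denominator)
p = c + (1 - c) / (1 + np.exp(-D * a * (theta[:, None] - b)) + np.exp(-t_star))
y = (rng.uniform(size=(N, J)) < p).astype(int)
```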

Convergence Diagnostics
To evaluate the convergence of the parameter estimates, we consider only the case with the minimum sample sizes; that is, the test length is fixed at 20, and the number of examinees is 500. Two methods are used to check the convergence of our algorithm. One is the "eyeball" method, which monitors convergence by visually inspecting the history plots of the generated sequences (e.g., Hung and Wang, 2012), and the other is the Gelman-Rubin method (Gelman and Rubin, 1992; Brooks and Gelman, 1998) for checking the convergence of the parameters.
The convergence of the Bayesian algorithm is checked by monitoring the trace plots of the parameters for consecutive sequences of 10,000 iterations. The trace plots show that all parameter estimates stabilize after 5,000 iterations and then converge quickly. Thus, we set the first 5,000 iterations as the burn-in period. As an illustration, four chains started at overdispersed starting values are run for each replication. The trace plots of three randomly selected items are shown in Figure 5. In addition, we find that the potential scale reduction factor (PSRF) (Brooks and Gelman, 1998) values for all parameters are less than 1.2, which ensures that all chains converge as expected.
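The PSRF can be computed from parallel chains as follows; this is a sketch of a standard formulation of the Gelman-Rubin/Brooks-Gelman diagnostic, not the authors' code.

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor for one parameter.

    chains: m x n array of m parallel chains, each with n post-burn-in
    draws. Values well above 1 (e.g., > 1.2) indicate non-convergence.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    b = n * chain_means.var(ddof=1)            # between-chain variance
    w = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_plus = (n - 1) / n * w + b / n         # pooled variance estimate
    return np.sqrt(var_plus / w)
```

For well-mixed chains started at overdispersed values, the PSRF settles near 1, which is the behavior behind the "less than 1.2 for all parameters" check reported above.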

Recovery of Item Parameters
The average bias, MSE, SD, SE, and CP for discrimination, difficulty, and guessing parameters based on six different simulation conditions are shown in Table 3. The following conclusions can be drawn.

1. Given the total test length, when the number of examinees increases from 500 to 2,000, the average MSE, SD, and SE for the discrimination, difficulty, and guessing parameters show a decreasing trend. For example, for a total test length of 20 items, when the number of examinees increases from 500 to 2,000, the average MSE of all discrimination parameters decreases from 0.0088 to 0.0072, the average SE from 0.0022 to 0.0014, and the average SD from 0.0085 to 0.0066. The average MSE of all difficulty parameters decreases from 0.0436 to 0.0213, the average SE from 0.0272 to 0.0122, and the average SD from 0.0362 to 0.0143. The average MSE of all guessing parameters decreases from 0.0019 to 0.0013, the average SE from 0.0007 to 0.0006, and the average SD from 0.0013 to 0.0008.
2. The average SDs of the item parameters are larger than their average SEs. This indicates that the fluctuations of the posterior means of the item parameters between replications are somewhat larger than their average fluctuations within each replication.
3. Under the six simulated conditions, the average CPs of the discrimination, difficulty, and guessing parameters are about 0.950.
4. When the number of examinees is held fixed but the number of items increases from 20 to 40, the average MSE, SD, and SE show that the recovery results for the discrimination, difficulty, and guessing parameters do not change much, which indicates that the Bayesian algorithm is stable and that there is no reduction in accuracy as the number of items increases.
In summary, the Bayesian algorithm provides accurate estimates of the item parameters for various numbers of examinees and items and can therefore serve as a guide for practice.
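As a sketch of how the summaries reported in Table 3 can be computed, the following example assumes that, for each parameter, every one of the R replications yields a posterior mean, a posterior standard deviation, and a 95% credible interval. The function name and the exact operationalizations (SD as the standard deviation of the posterior means across replications, SE as the average posterior standard deviation) are our assumptions.

```python
import numpy as np

def recovery_summary(true_val, est, se, lo, hi):
    """Monte Carlo recovery summaries for one parameter.

    true_val : scalar true value of the parameter
    est      : posterior means from R replications, shape (R,)
    se       : posterior standard deviations per replication, shape (R,)
    lo, hi   : bounds of the 95% credible interval per replication, shape (R,)
    """
    bias = est.mean() - true_val                       # average bias
    mse = ((est - true_val) ** 2).mean()               # mean squared error
    sd = est.std(ddof=1)                               # SD of estimates across replications
    avg_se = se.mean()                                 # average posterior SE
    cp = np.mean((lo <= true_val) & (true_val <= hi))  # coverage probability
    return bias, mse, sd, avg_se, cp
```

With well-calibrated intervals, the coverage probability should be close to the nominal 0.950 reported in the paper, and SD and SE should be of comparable magnitude.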

Recovery of Ability Parameters
Next, we evaluate the recovery of the latent abilities from plots of the true values against the estimates in Figure 6. For a fixed number of examinees (500 or 1,000), when the number of items increases from 20 to 60, the ability estimates become more accurate, with the true values and the estimates lying essentially on the diagonal line. Note that the estimated abilities are averages over the 200 replications. As the number of items increases, it becomes less likely that a high-ability examinee answers all items correctly or a low-ability examinee answers all items incorrectly, situations that lead to large deviations in the ability estimators. Therefore, the estimated and true abilities at the ends of the curve are closer to the diagonal line when the number of items is 60. In summary, judging from the plots of the true values against the estimates, the Bayesian sampling algorithm also provides accurate estimates of the ability parameters.

Simulation 2
In this simulation study, we use the DIC and LPML model assessment criteria to evaluate model fitting. Two issues need further study. The first is whether the two criteria can accurately identify, from among numerous fitted models, the true model that generated the data. The second concerns the influence of different time weights in the G3PLT model on model fitting.

Simulation Design
In this simulation, the number of examinees is N = 1,000 and the test length is fixed at 20. Six item response models are considered: the traditional 3PL model and the G3PLT model with time weights W = 0, 2, 4, 6, and 8. We evaluate model fitting in five cases, with the data generated in turn under the G3PLT model with each of the five time weights; the results are summarized in Table 4. Note that the following results for DIC and LPML are based on the average of 200 replications. From Table 4, we find that when the G3PLT model with time weight 0 (G3PLT0) is the true model, the G3PLT0 model is chosen as the better-fitting model according to the results for DIC and LPML, which is what we expect to see. The medians of DIC and LPML are respectively 25,324.43 and −13,231.77, and the medians for the G3PLT0 model and the 3PL model differ appreciably. Similarly, when the G3PLT model with time weight 2 (G3PLT2) is the true model, the G3PLT2 model is also chosen as the better-fitting model according to the results for DIC and LPML. The medians of DIC and LPML are respectively 22,777.38 and −12,221.93, and the differences between the G3PLT2 model and the 3PL model in the medians of DIC and LPML are −74.07 and 21.75, respectively. However, when the time weight increases from 4 to 8, the medians of DIC for the 3PL model and the G3PLT model are basically the same. This shows that the 3PL model fits essentially as well as the G3PLT model with time weights 4, 6, and 8, which is attributable to the fact that the G3PLT model reduces to the traditional 3PL model as the time weight increases from 4 to 8. Based on the results for LPML, we find that LPML has greater power than DIC to distinguish the true G3PLT4, G3PLT6, and G3PLT8 models from the 3PL model, because the LPMLs of the two models differ substantially. For example, the difference between the G3PLT8 model and the 3PL model in the median of LPML is 46.45.
In summary, the two Bayesian model assessment criteria can accurately identify the true model that generated the data. In addition, the gradual reduction of the G3PLT model to the traditional 3PL model is also reflected in the differences in DIC and LPML. Therefore, the two Bayesian model assessment criteria are effective and robust and can guide practice.
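A minimal sketch of how DIC and LPML can be computed from MCMC output is given below. It assumes a matrix of pointwise log-likelihoods evaluated at each posterior draw, uses the classical Spiegelhalter et al. form of DIC, and computes each conditional predictive ordinate (CPO) as the harmonic mean of the pointwise likelihoods, worked in log space for numerical stability; the function name is ours.

```python
import numpy as np

def dic_lpml(loglik, loglik_at_mean):
    """DIC and LPML from MCMC output.

    loglik         : (S, N) pointwise log-likelihoods, one row per posterior draw
    loglik_at_mean : (N,) pointwise log-likelihoods at the posterior means
    """
    dev_bar = -2.0 * loglik.sum(axis=1).mean()   # posterior mean of the deviance
    dev_hat = -2.0 * loglik_at_mean.sum()        # deviance at the posterior means
    p_d = dev_bar - dev_hat                      # effective number of parameters
    dic = dev_hat + 2.0 * p_d                    # equivalently dev_bar + p_d

    # CPO_i = (S^{-1} * sum_s 1/L_i^(s))^{-1}; LPML = sum_i log CPO_i.
    S = loglik.shape[0]
    neg = -loglik
    m = neg.max(axis=0)
    log_cpo = np.log(S) - (m + np.log(np.exp(neg - m).sum(axis=0)))
    return dic, log_cpo.sum()
```

Lower DIC and higher LPML indicate better fit, which is how the comparisons in Table 4 are read.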

Bayesian Model Assessment
To evaluate the impact of different time weights on the PISA data and to analyze the differences between the G3PLT model and the traditional 3PL model in fitting the data, both models are used to fit the data. G3PLT models with time weights W = 0, 1, 2, 3, 4, 5, 6, 7, and 8 are considered. In the estimation procedure, the prior distributions are the same as in Simulation 1. In all of the Bayesian computations, we use 10,000 MCMC samples after a burn-in of 5,000 iterations for each model to compute all posterior estimates. Table 5 shows the results for DIC and LPML under the 3PL model and the G3PLT model with different time weights. According to DIC and LPML, the G3PLT model with time weight 6 is the best-fitting model, with DIC and LPML values of 8389.316 and −4196.672, respectively. The G3PLT model with time weight 0 is the worst-fitting model, with DIC and LPML values of 9708.940 and −4792.301, respectively. That the G3PLT model with time weight 0 fits worst can be attributed to the fact that the influence of the time effect on the correct-response probability is relatively weak for the PISA data. This is consistent with the evaluation purpose of the PISA test, which is a nonselective and low-stakes test: examinees lack the motivation to answer each item carefully, and therefore the time effect is not reflected in the responses. Moreover, when the time weight of the G3PLT model increases from 5 to 8, the DIC and LPML values are basically the same as those for the 3PL model. The model fitting results once again verify that our G3PLT model reduces to the traditional 3PL model when the time weight increases to a certain value. Next, we analyze the PISA data based on the G3PLT model with time weight 6.

Analysis of Item Parameters
The estimated results for the item parameters are shown in Table 6. We can see that the expected a posteriori (EAP) estimates of nine item discrimination parameters are greater than one. This indicates that these items can distinguish well between different abilities. In addition, the EAP estimates of 10 difficulty parameters are less than zero, which indicates that these 10 items are slightly easier than the other six. The three most difficult items are items 8 (DR442Q06C), 7 (DR442Q05C), and 9 (CR442Q07S). The EAP estimates of the difficulty parameters for these three items are, respectively, 1.085, 0.900, and 0.839. The corresponding correct rates for the three items in Figure 7 are 0.231, 0.257, and 0.285. That the three most difficult items have the lowest correct rates is consistent with our intuition. Six of the EAP estimates of the guessing parameters are larger than 0.1. The three items that the examinees are most likely to answer correctly by guessing are items 11 (CR245Q02S), 12 (CR101Q01S), and 10 (CR245Q01S). The EAP estimates of the guessing parameters for these three items are, respectively, 0.132, 0.128, and 0.117. Among the 16 items, item 7 is the best-designed item, owing to its high discrimination and difficulty estimates and the fact that its guessing parameter estimate is the lowest among all items. Next, we use the posterior standard deviation (SD) to evaluate the degree of deviation from the EAP estimate. The average SD of all discrimination parameters is about 0.005, the average SD of all difficulty parameters is about 0.010, and the average SD of all guessing parameters is about 0.001. The average SD values of the three parameter types are very small, indicating that the posterior draws are concentrated near the posterior means.
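The EAP estimates and posterior SDs reported in Table 6 are simple functionals of the post-burn-in MCMC draws; a minimal sketch (variable and function names ours):

```python
import numpy as np

def summarize_item_params(draws):
    """EAP estimates (posterior means) and posterior SDs from MCMC draws.

    draws: (S, J) array of post-burn-in samples, one column per item parameter.
    Returns (eap, sd), each of shape (J,).
    """
    eap = draws.mean(axis=0)        # expected a posteriori estimate
    sd = draws.std(axis=0, ddof=1)  # posterior standard deviation
    return eap, sd
```

In practice this would be applied separately to the chains for the discrimination, difficulty, and guessing parameters after discarding the burn-in iterations.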

Analysis of Personal Parameters
Next, we analyze the differences between the estimated abilities of examinees under the 3PL model and under the G3PLT model given the same response pattern, together with the reasons for these differences. We consider four examinees with the same response pattern on the 16 items, (1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1): examinees 60, 313, 498, and 210, whose total response times for the 16 items are 25.80, 29.36, 35.48, and 41.44 min, respectively. Under the 3PL model, the estimated abilities of the four examinees are identical, 1.45. However, taking the time factor into account, the estimated abilities under the G3PLT model with time weight 6 differ: they are 1.46, 1.42, 1.41, and 1.38, respectively. We find that, for the same response pattern, as the response times of the examinees increase from 25.80 to 41.44 min, the estimated abilities show a decreasing trend. This indicates that examinees with short response times are more proficient in answering these items than examinees with long response times, and so the estimated ability of an examinee who answers 15 items correctly in a short time should be higher than that of an examinee who takes longer. This once again shows that our G3PLT model is reasonable. By incorporating the time effect into the IRT model, the interpretation of the latent construct essentially shifts: before, we were measuring whether students could answer items correctly; now, we are measuring whether they can answer items correctly and quickly.

CONCLUDING REMARKS
In this paper, we have proposed a new and flexible general three-parameter logistic model with response time (G3PLT). It differs from previous response time models, such as the hierarchical framework proposed by van der Linden, in which the response and the response time are treated in separate measurement models while a higher-level model captures the correlation between latent ability and speed through a population distribution. In contrast, our model integrates latent ability, time, and item difficulty into a single item response model to comprehensively account for their impact on the correct-response probability. This approach to modeling is simpler and more intuitive. In addition, time weights are introduced into our model to investigate the influence of the time intensity imposed by different tests on the correct-response probability. When the time weight reaches 8, our model reduces to the traditional 3PL model, which indicates that the time factor has little influence on the correct-response probability; the examinees then answer each item correctly with the response probability given by the 3PL model. However, the computational burden of the Bayesian algorithm becomes excessive when large numbers of examinees or items are considered or a large MCMC sample size is used. Therefore, it is desirable to develop a standalone R package with associated C++ or Fortran code for more extensive large-scale assessment programs.
Other issues should be investigated in the future. The first is whether the G3PLT model can be combined with a multilevel structural model to analyze the influence of covariates on latent ability at different levels, for example, to explore the influence of the time effect, gender, and socioeconomic status on latent ability. Second, although we have found that for different examinees with the same response pattern the ability estimates from the 3PL model are identical, those from the G3PLT model differ. Examinees who take less time should be more proficient in answering items, and their estimated ability should be higher than that of examinees who take longer. "Proficiency" is a latent skill that is not the same as latent ability; whether proficiency and latent ability can be connected through a multidimensional 3PLT model to analyze their relationship is also an important topic for future research. Third, our new model can also be used to detect various abnormal response behaviors, such as rapid guessing and cheating, with the aim of eliminating deviations in ability estimates caused by such behaviors.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. These data can be found at http://www.oecd.org/pisa/data/.