Joint Testlet Cognitive Diagnosis Modeling for Paired Local Item Dependence in Response Times and Response Accuracy

In joint models for item response times (RTs) and response accuracy (RA), local item dependence is composed of local RA dependence and local RT dependence. The two components are usually caused by the same common stimulus and emerge as pairs. Thus, the violation of local item independence in the joint models is called paired local item dependence. To address the issue of paired local item dependence while applying the joint cognitive diagnosis models (CDMs), this study proposed a joint testlet cognitive diagnosis modeling approach. The proposed approach is an extension of Zhan et al. (2017) and it incorporates two types of random testlet effect parameters (one for RA and the other for RTs) to account for paired local item dependence. The model parameters were estimated using the full Bayesian Markov chain Monte Carlo (MCMC) method. The 2015 PISA computer-based mathematics data were analyzed to demonstrate the application of the proposed model. Further, a brief simulation study was conducted to demonstrate the acceptable parameter recovery and the consequence of ignoring paired local item dependence.


INTRODUCTION
Nowadays, it becomes a common practice to collect response time (RT) data as the computerbased tests are applied to large-scale assessments. RT represents the amount of time a respondent spends on an item. It serves as an additional source of information about the working speed of a respondent as well as the time intensity of an item. In the past few decades, a number of studies have been done to model the RTs. Before the year of 2007, the RT modeling studies such as Thissen (1983), Verhelst et al. (1997), and Wang and Hanson (2005) were motivated by the speed-accuracy trade-off (Luce, 1986). However, this trade-off only reflected a within-person relationship between speed and accuracy (van der Linden, 2009) where, given a fixed set of items, a respondent's speed is dependent on his or her accuracy. Therefore, the relationship between speed and accuracy should be modeled at a higher level. To this end, van der Linden (2007) proposed a hierarchical modeling framework to explain the higher-level relationship between speed and accuracy. In this framework, RTs and RA were separately modeled at the first level whereas two correlational structures were modeled at the second level. The correlational structures accounted for either the dependence between person latent speed and latent ability parameters and that between item accuracy-related and item time-related parameters. A comparison study suggested that the hierarchical modeling framework yielded more reasonable outcomes in both real and simulated data than other RT modeling approaches (Suh, 2010). The hierarchical modeling framework was generalized to integrate different measurement models due to its flexible nature (e.g., Klein Entink et al., 2009a,b;Wang et al., 2013;Meng et al., 2015;Molenaar et al., 2015;Wang and Xu, 2015;Fox and Marianti, 2016). However, almost all the previous studies in RT modeling were based on unidimensional item response theory (IRT) models but none used multidimensional measurement models.
Multidimensional tests and cognitive diagnostic assessments become more and more prevalent given the increasing demand for diagnostic test feedback containing refined information. In general, cognitive diagnostic assessments aim at evaluating respondent's mastery status (e.g., mastery or non-mastery) of latent skills or attributes. This information can be provided to teachers or clinicians so that they can determine the remedial instructions or targeted interventions accordingly. Although numerous cognitive diagnosis models (CDMs) have been developed (for review, see Rupp et al., 2010) based on various cognitive and psychological assumptions, almost all of them only utilized information on RA. Recently, Zhan et al. (2017) proposed a joint cognitive diagnosis modeling approach to simultaneously model RTs and RA. In the study of Zhan et al. (2017), the deterministic-inputs, noisy "and" (DINA) model (Macready and Dayton, 1977;Haertel, 1989;Junker and Sijtsma, 2001) and the lognormal RT model (van der Linden, 2006) were used as the measurement models for RA and RTs, respectively. A higher-order latent structure (de la Torre and Douglas, 2004) was introduced to account for the relationship between latent attributes and a continuous higher-order latent ability. Furthermore, a bivariate normal distribution was used to model the relationship between the higher-order latent ability and the latent speed. A similar approach was proposed by Minchen (2017). Unlike Minchen's approach, Zhan et al. (2017)'s approach explicitly modeled the correlation between different item parameters (i.e., within-item characteristic dependency; Fox, 2010; manuscript submitted for publication) by assuming that they followed a multivariate normal distribution.
A key assumption in the joint models of RA and RTs is local item dependence. Specifically, the observed RA responses are conditionally independent of each other given an individual score in latent ability or a specific latent attribute mastery status, which is denoted as local RA independence; in the meanwhile, all the RTs are conditionally independent of each other given the an individual score in latent speed, which is denoted as local RT independence. In other words, in the joint models, local item independence is composed of local RA independence and local RT independence, which is known as paired local item independence. However, the assumption of local item independence is often violated in educational tests, resulting in local item dependence. One of the most common scenarios that lead to local item dependence is the presence of testlet, where several items are based on a common context (Wainer and Kiely, 1987).
A testlet is defined as a cluster of items that share a common stimulus. The local item dependence resulted from a testlet is called testlet effect. Testlet has been widely adopted in educational tests. For example, in a reading comprehension test, a testlet is formed when a bundle of items are based on the same reading passage. The testlet design makes the assessment process more efficient (DeMars, 2012). While responding to the items within the same testlet, the students only need to process the scenario once and the context information can be applied to all the items in the testlet. However, the testlet design makes it more difficult to measure student's reading ability as the student's performance may be affected by their knowledge or interest in the reading passage content besides their reading ability (Yen, 1993). Thus, item responses within the same testlet may be locally dependent on each other.
Testlet response theory modeling (Wang and Wilson, 2005;Wainer et al., 2007) is one of the most popular approaches to handle testlet effect or local item dependency. As a bi-factor multidimensional IRT model (DeMars, 2006;Li et al., 2006), the testlet response theory model assumes that all the item responses are accounted for by a common factor of latent ability, while the responses within a testlet are further explained by a random testlet effect factor. It has been demonstrated that the presence of testlet effect affects model parameter estimates, equating process, and test reliability estimates (e.g., Sireci et al., 1991;Bradlow et al., 1999;Wang and Wilson, 2005;Wainer et al., 2007;Jiao et al., 2012Jiao et al., , 2013Zhan et al., 2014;Jiao and Zhang, 2015;Tao and Cao, 2016). However, all the studies above only addressed the local RA dependence but none accounted for the local RT dependence.
As aforementioned, the paired local item independence is composed of local RA independence and local RT independence. Given that the item clusters which cause local RA dependence would also result in local RT dependence, and local RA dependence and local RT dependence should emerge in pairs. Thus, the violation of paired local item independence is called paired local item dependence. In other words, local RA dependence and its corresponding local RT dependence are caused by the same stimulus but are reflected in different forms (i.e., RA and RTs). To address the paired local item dependence in the IRT framework, Im (2017) proposed a hierarchical testlet model, in which local RA dependence was handled by a testlet response theory model whereas local RT dependence was handled by a lognormal RT testlet model.
In cognitive diagnosis, however, only a few studies focused on accounting for local RA dependence (e.g., Hansen, 2013;Zhan et al., 2015;Hansen et al., 2016), and, to our knowledge, none examined local RT dependence. As aforementioned, the joint CDMs assume paired local item independence. Thus, the purpose of this study is to extend the joint cognitive diagnosis modeling approach  in order to address the potential paired local item dependence in RTs and RA. The rest of the paper starts with a review of the testlet-DINA model (Zhan et al., 2015) and the lognormal RT testlet model (Im, 2017). Then the proposed joint testlet-DINA model is introduced. It is followed by a real data analysis using the Program for International Student Assessment (PISA) 2015 computer-based mathematics data, which serves to demonstrate the application of the proposed model. Finally, a brief simulation study is presented used to demonstrate the model parameter recovery and the consequence of ignoring paired local item dependence.

JOINT TESTLET COGNITIVE DIAGNOSIS MODELING
The Testlet-DINA Model To account for the local RA dependence in cognitive diagnosis, Hansen (2013) and Hansen et al. (2016) proposed a higherorder, hierarchical CDM which can be viewed as a combination of the two-tier item factor model (Cai, 2010) and the log-linear CDM (Henson et al., 2009). Like the two-tier item factor model, Hansen's model could only account for local RA dependence which was resulted from a single source. Zhan et al. (2015) proposed two within-item multidimensional testlet effect CDMs which was able to account for local RA dependence that was resulted from multiple sources simultaneously (Rijmen, 2011;Zhan et al., 2014). The two models included a compensatory model which allowed attributes to compensate each other and a non-compensatory model which assumed that respondents need to master all the required attributes in order to have a high correct response probability. For simplicity, the testlet-DINA model in this study only refers to the non-compensatory model, which is written as where Y ni denotes the dichotomous response of person n to item i; α n = (α n1 , . . . , α nK ) ′ denotes person n's attribute pattern, K is the number of required attributes; β i and δ i are the intercept and interaction parameters for item i, respectively; The Q-matrix (Tatsuoka, 1983) is an I-by-K confirmatory matrix with element q ik indicating whether the attribute k is required to correctly answer the item i (i.e., q ik = 1 if the attribute is required, and 0 otherwise); γ nm ∼ N(0, σ 2 γ m )is the RA testlet effect of the mth testlet, which represents the interaction effect between person n and items within testlet m on RA. Usually, the value of σ 2 γ m indicates the magnitude of testlet effect (Wang and Wilson, 2005;Wainer et al., 2007). A large variance is associated with a large testlet effect. All the γ nm s are assumed to be independent with each others; Let M be the total number of testlets in the test, the U-matrix (Zhan et al., 2014) is an I-by-M confirmatory matrix with element u im indicating whether item i belongs to testlet m (i.e., u im = 1 if item i belongs to testlet m, and 0 otherwise).
Obviously, when all elements in the U-matrix equal to 0 (means no tesltet in the test) or all σ 2 γ m = 0 (means no testlet effect), the testlet-DINA model reduces to the reparameterized DINA model (DeCarlo, 2011;von Davier, 2014).

The Lognormal RT Testlet Model
To account for the local RT dependence, Im (2017) proposed the lognormal RT testlet model. The lognormal RT testlet model is an extension of the regular lognormal RT model (van der Linden, 2006) by introducing a random testlet effect parameter, but it can also be taken as a special case of the multidimensional lognormal RT model (Zhan et al., manuscript submitted for publication).
Let T ni be the observed RT of person n to item i, the lognormal RT testlet model can be expressed as where logt ni be the logarithm of RT, which is used to transform the positively skewed distribution of RT to a more symmetric shape; τ n be the latent speed of person n; ξ i be the time-intensity of item i; ω i be the discriminating power of item i, which can be treated as a time-kurtosis parameter; λ nm ∼ N(0, σ 2 λ m )be the mth RT testlet effect parameter to address local RT dependence, which represents the interaction between person n and items within testlet m in RT. The larger the variance, the larger the testlet effect is. All λ nm s are assumed to be independent of each other.
Equation (2) can be extended to account for potential withinitem multidimensional testlet effect where all the parameters have been defined above. Equation (3) is regarded as the within-item multidimensional testlet effect lognormal RT model, which can be seen as a special case of the multidimensional lognormal RT model (Zhan et al., manuscript submitted for publication). For simplicity, Equation (3) can be equivalently expressed as When there is only one source of local RT dependence, the within-item multidimensional testlet effect lognormal RT model reduces to the lognormal RT testlet model (Im, 2017). Further, when all the elements in the U-matrix equal to 0 or σ 2 λ m = 0for all testlets, the within-item multidimensional testlet effect lognormal RT model reduces to the regular lognormal RT model (van der Linden, 2006).

The Joint Testlet-DINA Model
The joint testlet-DINA model is specified as follows: Y ni and logT ni are separately modeled at the first level following the convention of joint cognitive diagnosis modeling approach and the hierarchical testlet model; a higher-order latent structural model is used to account for the relationship between binary latent attributes and a continuous higher-order latent ability; further, at the higher level, three variance-covariance structures are imposed to model the dependencies among person parameters, item parameters, and testlet effect parameters. A graphical representation of the joint testlet-DINA model is given in Figure 1.
First, the testlet-DINA model (Equation 1) and the withinitem multidimensional testlet effect lognormal RT model Then, the higher-order latent structural model is used to link the correlated attributes, which is given by where P(α nk = 1) is the probability of mastery of attribute k by person n; θ n is a higher-order (general) ability of person n, which is assumed to follow a standard normal distribution for identification purpose; and ν k and κ k are the slope and difficulty parameters for attribute k.
Further, item parameters are assumed to follow a trivariate normal distribution Additionally, since the residual error variance, ω −2 i , is assumed to be independently distributed , it is not included in i .
Likewise, person parameters are assumed to follow a bivariate normal distribution In addition, testlet effect parameters in testlet m are assumed to follow a bivariate normal distribution If there are M testlets, there will be M bivariate normal distributions. In addition, it should be noted that, in the proposed model, the u im in RT model (Equation 3) has the same value as the u im in RA model (Equation 1) because of the paired local item dependence. In summary, Equations (1, 4-8), together, constitute the joint testlet-DINA model. Constraints are set for identification purpose (i.e., µ θ = 0, σ 2 θ = 1; µ τ = 0). The first two constraints are consistent with those set in the higher-order latent trait model while the third removes the tradeoff between ξ i and τ n from a lognormal model. After addressing the paired local item dependence, four conditional independence assumptions are made: the α nk are conditionally independent given θ n ; the Y ni are conditionally independent given α n and γ nm ; the logT ni are conditionally independent given τ n and λ nm ; and Y ni and logT ni for a particular item i are conditionally independent given person parameters and testlet effect.

Bayesian Parameter Estimation
Parameters in the joint testlet-DINA model can be estimated using the full Bayesian approach with the Markov chain Monte Carlo (MCMC) method. In this study, free software JAGS (Version 4.3.0; Plummer, 2015) was used to estimate the parameters. JAGS uses a default option of the Gibbs sampler (Gelfand and Smith, 1990). Sample code were presented in Appendix. A tutorial of using JAGS for Bayesian CDM estimation can be found in Zhan (2017).
To begin with, under the assumption of local independence, Y ni , logT ni and α nk are independently distributed, which is written as The priors of item parameters are assumed to be a trivariate normal distribution, written as Frontiers in Psychology | www.frontiersin.org Further, the hyper priors are specified as where R item is a tridimensional identity matrix. The priors of person parameters are set as As suggested by Zhan et al. (2017), the Cholesky decomposition of the person is used is a low triangular matrix with positive entries on the diagonal and unrestricted entries below the diagonal; ' person is the conjugate transpose of person . The priors of the elements in person are specified as ϕ ∼ N(0, 1),ψ ∼ Gamma(1, 1). Then, the priors of the higher-order structural parameters are specified as In addition, the priors of testlet effect parameters in testlet m are specified as with the hyper priors of testlet,m ∼ InvWishart(R testlet,m , 2), where R testlet,m is a two-dimensional identity matrix for testlet m. Finally, the posterior mean and the posterior mode are used as the estimates for the continuous parameters (e.g., β i , δ i , θ n , and τ n ) and categorical parameters (e.g., α nk ), respectively.

REAL DATA ANALYSIS Data
In this study, the PISA 2015 computer-based mathematics data were used. 17 computer-scored dichotomous items from M1 and M2 testing clusters were selected and used in the analysis. The complete-case method was implemented to handle the missing data. That is, only the respondents without missing values in any of the 17 items were used. As a result, the dataset used for analysis contained the dichotomous response data and continuous RT data for 8,606 respondents from 58 countries/economies. The natural logarithm of RTs (i.e., log RTs) were used for modeling. According to the PISA 2015 mathematics assessment framework (OECD, 2016), 11 attributes were assessed, including change and relationships (α1), space and shape (α2), quantity (α3), uncertainty and data (α4), personal (α5), occupational (α6), societal (α7), scientific (α8), formulating situations mathematically (α9), employing mathematical concepts, facts, procedures and reasoning (α10), and interpreting, and applying and evaluating mathematical outcomes (α11). The first four attributes are associated with the mathematical content knowledge that is targeted for use in the items. The next four attributes are associated with the mathematical context that is needed to place additional demands on the problem-solver (Watson and Callingham, 2003;OECD, 2016). The last three attributes are associated with the mathematical processes that connect the context of the mathematics problem with problem-solving (OECD, 2016). In addition, the 17 items contained four testlets, namely, population pyramids (m1), diving (m2), cash withdrawal (m3), and chair lift (m4). Only one source of local item dependence was considered in this study (i.e., an item only belongs to one testlet). The Q-matrix and the U-matrix are presented in Table 1.

Analysis
In addition to the joint testlet-DINA model, the joint responses and times DINA (denoted as the JRT-DINA) model  was also used to fit the data for comparison purpose. The JRT-DINA model can be seen as a special case of the joint testlet-DINA model where all random testlet effect parameters are set to be zero. For both models, two Markov chains with random starting points were used and 10,000 iterations were run for each chain. The first 5,000 iterations in each chain were discarded as burn-in. In order to save space in memory 1 , the thinning interval was set to be five. As a result, 2,000 iterations were retained for model parameter inferences. The potential scale reduction factor (PSRF; Brooks and Gelman, 1998) was computed to assess the convergence of each parameter. PSRF values lower than 1.1 or 1.2 were used as convergence criteria in previous studies (Brooks and Gelman, 1998;de la Torre and Douglas, 2004). In this study, the PSRFs were generally lower than 1.05, indicating good convergence in the specific setting.
The AIC (Akaike, 1974), BIC (Schwarz, 1978), and DIC (Spiegelhalter et al., 2002) were computed for model comparison. Posterior predictive model checking (PPMC; Gelman et al., 2014) was used to evaluate model-data fit. Posterior predictive probability (PPP) values near 0.5 indicate that there are no systematic differences between the observed and predicted values, suggesting an adequate model-data fit. As the research in the absolute model-fit statistics for joint models was limited, this study followed Zhan et al. (2017) to evaluate the model fit of the RA and RT models separately. The sum of the squared Pearson residuals for person n and item i (Yan et al., 2003) was used as a discrepancy measure to evaluate the overall fit of the RA model,  which is written as where P(Y ni = 1) has the same definition as that in Equation (1). On the other hand, the sum of the standardized error function of logT ni for person n and item i (Marianti et al., 2014;Fox and Marianti, 2017) was used as a discrepancy measure to evaluate the overall fit of the RT model, which is given by

Results
The joint testlet-DINA model was favored based on the AIC, BIC, and DIC, as is shown in Table 2. In addition, the likelihood deviances (i.e., −2 log likelihood or −2LL) of these two models were 387,466 and 414,438, respectively ( −2LL = 26,972, df = 12, p < 0.001). Therefore, the   3 presents the estimated item mean vector and the estimated item variance-covariance matrix. ρ βδ was estimated to be −0.645, which means that higher item intercept parameters were associated with lower item interaction parameters. ρ βξ and ρ δξ were estimated to be −0.700 and 0.450, respectively, indicating that items with higher intercept parameters tended to have lower time-intensity parameters; by contrast, items with higher interaction parameters tended to be have higher timeintensity parameters. Further, Figure 2 presents the estimated item parameters. All the β i estimates were negative except the 1st and the 13th items, which means that the guessing probabilities (i.e., exp(β i ) 1+exp(β i ) ) of these two items were higher than 0.5. Table 4 presents the estimated person variance and covariance matrix. ρ θτ was estimated to be −0.196, which means that a low negative correlation was observed between the higherorder ability and the latent speed parameters. The negative correlation was consistent with the results in Zhan et al. (2017). One reasonable explanation is that low-ability respondents lack motivation in taking the low-stakes test (Wise and Kong, 2005). Thus, the low-ability respondents may have shorter RTs and a greater number of incorrect responses than the high-ability respondents. In addition, the variance of latent speed was quite small (i.e., 0.073), which means the variability in latent speed among all respondents was small. Table 5 presents the four estimated testlet effect variance-covariance matrices. As aforementioned, a larger variance of testlet effect parameters indicates a larger testlet effect. The variances of the four RA testlet effect parameters were estimated to be 0.438, 0.260, 2.800, and 0.414, respectively. Compared to the variance of the latent trait (i.e., 1.00), the RA testlet effects ranged from small to large 2 . By contrast, the variances of the four RT testlet effect parameters were estimated to be 0.110, 0.083, 0.226, and 0.212, respectively. Although the RT testlet effects were small in terms of the absolute values, their ratios to the variance of latent speed (i.e., 0.073) were

Covariance in lower triangular matrix and correlation coefficient in upper triangular matrix, respectively; standard error (standard deviation of the posterior distribution) is in parentheses.
FIGURE 2 | Item parameter estimates for PISA 2015 computer-based mathematics items. β, item intercept; δ, item interaction; ξ, item time-intensity; ω, item time-kurtosis.

Covariance in lower triangular matrix and correlation coefficient in upper triangular matrix, respectively; standard error (standard deviation of the posterior distribution) is in parentheses.
Frontiers in Psychology | www.frontiersin.org around 1.507, 1.137, 3.096, and 2.904, respectively, indicating that the RT testlet effects were large in this dataset. In addition, low correlation was observed between each pair of RA testlet effect and RT testlet effect, indicating that these two types of testlet effects were separable. This is an unexpected result. A moderate or a high correlation was expected since, theoretically speaking, local RA dependence and local RT dependence should be caused by the same stimulus. More practical evidence needs to be accumulated from future studies to explain the results. Figure 3 presents the posterior mixing proportions of the 20 most frequent attribute patterns out of the 2,048 possible attribute patterns. Only 73 patterns were observed in the estimated attribute profiles. Attribute pattern (11111111111) was the most prevalent with a percentage of 40.19%; the second most prevalent pattern was (10100100000) with a percentage of 23.41%.

Design and Data Generation
A brief simulation study was conducted to examine the parameter recovery of the proposed model and the consequence of ignoring the potential paired local item dependence in analysis. The simulated dataset contained 1,000 respondents and 30 items measuring five attributes. The Q-matrix is presented in Figure 4. The last 20 items were evenly divided into 4 testlets. Specifically, testlet 1 consisted of items 11 ∼ 15, testlet 2 consisted of items 16 ∼ 20, testlet 3 consisted of items 21 ∼ 25, and testlet 4 consisted of items 26 ∼ 30. For simplicity, the four pairs of RA and RT testlet effects were generated from a same bivariate normal distribution, where ρ γλ = −0.5. Typically, setting the testlet effect as 0.5 indicates a moderate testlet effect (Wang and Wilson, 2005;Wainer et al., 2007). In addition, each item was assumed to belong to only one testlet. Item parameters were generated from a trivariate normal distribution, where ρ βδ = −0.8, ρ βξ = −0.5, and ρ δξ = 0.3, which were set according to the estimates from the real data analysis ; ω i were generated from N(2, 0.25). Person parameters FIGURE 4 | K-by-I Q' matrix for simulation study. blank means "0," gray means "1".
Frontiers in Psychology | www.frontiersin.org  were generated from a bivariate normal distribution, where ρ θτ = −0.5. For higher-order structural parameters, ν k = 1.5 for all the attributes and κ k = (−1.0, −0.5, 0.0, 0.5, 1.0), indicating moderate correlations among attributes. The mastery status of each person on each attribute was generated from a Bernoulli distribution with the parameter, P(α nk = 1) which was computed based on Equation (5).

Analysis
Thirty replications were implemented. Both the joint testlet-DINA model and the JRT-DINA model were fit to the simulated data. In each replication, the number of chains, burn-in iterations, and post-burn-in iterations were consistent with those in the real data analysis. Convergence was well achieved (see Figure S3 in Appendix). The bias and root mean square error (RMSE) were used to evaluate parameter recovery, which were calculated as bias(υ) = , whereυ and υ are the estimated and true value of model parameters, respectively; R is the number of replications. In addition, the correlation between the true and estimated value of model parameters was computed. In terms of the classification accuracy, the attribute correct classification rate (ACCR) and pattern correct classification rate (PCCR) were computed as ACCR =

Results
In all the 30 replications, the joint tesltet-DINA model was favored by AIC, BIC and DIC, which indicates that the three fit indices can select the best-fitting model correctly. Figures 5, 6 display the recovery of the item parameters for the two models. According to the results of the last 20 items with testlet structure, the performance of the JRT-DINA model was significantly affected by the paired local item dependence. Specifically, ignoring paired local item dependence in analysis would result in overestimation of item intercept parameters, underestimation of item interaction parameters, and underestimation of item time-kurtosis parameters. However, it had little effect on the recovery of item time-intensity parameters. In addition, most of the 10 items without testlet structure had smaller absolute bias in parameter estimates from the joint testlet-DINA model than from the JRT-DINA model; the RMSE of the parameter estimates from the joint testlet-DINA model was equal to or smaller than those from the JRT-DINA model. Table 6 further summarizes the item parameter recovery by presenting the mean absolute bias, the mean RMSE, and the correlation between estimated and true values of all the items. Again, it can be seen that ignoring the paired local item dependence mainly affected the recovery of item time-kurtosis parameters. In addition, the item RT parameters were recovered better than the item RA parameters in joint models. Figures 7, 8 display the recovery of the person parameters for the two models. The two models performed similarly on recovering the higher-order ability parameter. In terms of the latent speed parameters, the bias was similar for the two models, but the RMSE from the JRT-DINA model was significantly larger than that from the joint testlet-DINA model. The results indicate that ignoring the paired local item dependence in analysis would result in large variability in latent speed parameters but had little effect on the recovery of higher-order ability parameters. Table 7 further summarizes the recovery of person parameters. The two models mainly differed in the mean RMSE of latent speed across person. In addition, the recovery of latent speed parameters was better than that of the higher-order ability parameters. Table 8 presents the recovery of individual attributes and attribute patterns. The joint testlet-DINA model was higher than the JRT-DINA model in both ACCR and PCCR, which indicates that ignoring the paired local item dependence would slightly reduce attribute and pattern correct classification rates (PCCRs). Table 9 presents the recovery of item, person and testlet variance-covariance matrices. First, in terms of the item variancecovariance matrix, the bias was similar for the two models, but    the RMSE from the joint testlet-DINA model was larger than that from the JRT-DINA model. Second, the latent speed variance was recovered better in the joint testlet-DINA model than in the JRT-DINA model. Third, all the four testlet variance-covariance matrices were well recovered. The recovery of the RT testlet effect variance parameters was better than that of the RA testlet effect variance parameters. Table 10 presents the recovery of item mean vector components and higher-order structural parameters. The item mean vector component estimates from the joint testlet-DINA model had smaller absolute bias and RMSE than those from the JRT-DINA model. The two models performed similarly on recovering the higher-order structural parameters. The results indicate that ignoring the paired local item dependence in analysis would result in less precise item mean vector component estimates, but had little effect on the higher-order structural parameter recovery.
Overall, the model parameters of the joint testlet-DINA model were well recovered by using the proposed MCMC estimation algorithm. Additionally, ignoring the paired local item dependence in analysis would result in biased model parameter estimates and lower correct classification rates. Specifically, it would result in overestimation of item intercept parameters, underestimation of item interaction parameters, and underestimation of item time-kurtosis parameters. It would lead to less precise estimates of latent speed parameters and item mean vector components. It would also reduce attribute and PCCRs. However, it had little effect on the recovery of item timeintensity parameters, the higher-order ability parameters, or the higher-order structural parameters.

CONCLUSION AND DISCUSSION
To address the paired local item dependence in RT and RA when applying the joint CDMs, this study proposed a joint testlet cognitive diagnosis modeling approach. As an extension of the joint cognitive diagnosis modeling approach , the proposed approach modeled the relationship between each pair of RA testlet effect and RT testlet effect using correlational structure. Specifically, the testlet-DINA model and the withinitem multidimensional testlet effects lognormal RT model were adopted as the RA model and RT model, respectively. The model parameters were estimated using the full Bayesian MCMC method. The 2015 PISA computer-based mathematics data were  analyzed to demonstrate the application of the proposed model. The real data analysis results are summarized as follows: (a) a negative correlation was observed between the higher-order ability and latent speed; (b) a negative correlation was observed between the item intercept parameters and the item timeintensity parameters; (c) a positive correlation was observed between the item interaction parameters and the item timeintensity parameters; (d) the magnitude of RA testlet effects varied from small to large whereas the magnitude of RT testlet effects was large; and (e) low correlation coefficients between the RA and RT testlet effects were found. Overall, most results in this real data analysis were consistent with those in Zhan et al. (2017) that used PISA 2012 computer-based mathematics data. Further, a simulation study was conducted to examine model parameter recovery of the proposed model and the consequence of ignoring testlet effects. The results indicated that the model parameters of the proposed model can be well recovered. Additionally, ignoring the paired local item dependence in analysis would result in biased model parameter estimates and low individual correct classification rates. Despite the promising results, further research is needed. First, only a DINA-based testlet model and a lognormal RTbased testlet model were used for illustration in this study. In the future study, other CDMs (e.g., von Davier, 2008;Henson et al., 2009;de la Torre, 2011) andRT models (e.g., Klein Entink et al., 2009b;Wang et al., 2013) can be used as the measurement models of RA and RTs. Second, in this study, the proposed model was evaluated using a brief simulation where only a limited number of factors were manipulated. More factors (e.g., test length, number of attributes, magnitude of testlet effects, etc.) and replications are recommended in future studies. Third, the model-data fit of RA and RT models was evaluated separately because of the lack of model-data fit indices for the joint models. In the future studies, absolutely model-fit indices designed for joint models can be explored and further be applied to evaluate the current modeling approach. Fourth, in educational and psychological measurements, latent speed can be defined as the ratio of the amount of labor spent on the items with respect to time (van der Linden, 2011). Due to the multidimensional nature of labors, latent speed may also be a multidimensional concept, each dimension of which corresponds to a specific type of labor. The latent speed was treated as a unidimensional latent trait in this study although the RT testlet effect can be regarded as a specific factor that is relevant to the working speed. Recently, Zhan et al., Manuscript submitted for publication proposed a multidimensional lognormal RT model to account for the potential multidimensionality of latent speed. One possible extension of the current joint modeling approach is to account for the multidimensional latent speed. Fifth, as noted by one of the anonymous reviewers, if there are many testlets, there will be many bivariate covariance matrices to be estimated, leading to large computational burden. Further exploration is needed to deal with this challenging issue. Sixth, in this study, respondents were assumed to be from the same population group, but, in reality, they may be from different groups (e.g., male and female). Multiple group joint modeling (e.g., Jiao et al., 2017) and mixture modeling (e.g., von Davier, 2008) can be incorporated into the current modeling approach in the future. Seventh, in practice, students are nested within classrooms, and classrooms are further nested within schools. Thus, multilevel modeling (e.g., Fox and Glas, 2001;Jiao et al., 2012;Jiao and Zhang, 2015) extension can also be a future direction. Finally, the generalizability of the results from this study is limited given that only data from a lowstakes test were analyzed. More empirical studies based on data from other tests, especially high-stakes tests, are needed.

AUTHOR CONTRIBUTIONS
PZ: Design of the study, data analysis, paper writing, and revision; ML: Data cleaning, interpretation of data for the work, and paper revision; YB: Preliminary idea construction, paper revision, and proofreading.

FUNDING
The first author was sponsored by the Graduate Student Self-manage Project from Collaborative Innovation Center of Assessment toward Basic Education Quality, Beijing Normal University (No. BJSP-2016A1-15001).