Cognitive Diagnostic Models for Random Guessing Behaviors

Many test-takers do not carefully answer every test question; instead they sometimes quickly answer without thoughtful consideration (rapid guessing, RG). Researchers have not modeled RG when assessing student learning with cognitive diagnostic models (CDMs) to personalize feedback on a set of fine-grained skills (or attributes). Therefore, this study proposes to enhance cognitive diagnosis by modeling RG via an advanced CDM with item response and response time. This study tests the parameter recovery of this new CDM with a series of simulations via Markov chain Monte Carlo methods in JAGS. Also, this study tests the degree to which the standard and proposed CDMs fit the student response data for the Programme for International Student Assessment (PISA) 2015 computer-based mathematics test. This new CDM outperformed the simpler CDM that ignored RG; the new CDM showed less bias and greater precision for both item and person estimates, and greater classification accuracy of test results. Meanwhile, the empirical study showed different levels of student RG across test items and confirmed the findings in the simulations.


INTRODUCTION
Cognitive diagnostic models (CDMs) assess whether test-takers have the skills needed to answer test questions (attributes), so that their test results can give them diagnostic feedback on their strengths and weaknesses in these attributes (Rupp et al., 2010). Specifically, a CDM analysis determines whether a person shows mastery (vs. non-mastery) of a set of fine-grained attributes (latent class). Teachers, clinicians and other users of test scores can use such specific information on each student or client to adapt and improve their instructions/interventions more effectively, compared to a simple, summative score.
However, some test-taking behaviors can distort current CDM results and thereby jeopardize the validity of their assessments. Recently, researchers have proposed different approaches to account for test-taking behaviors when assessing test-taker performance and item characteristics. In this study, we focus on two frequently observed test-taking behaviors during actual tests: solution attempt and rapid guessing (RG; Wise and Kong, 2005). In a solution attempt, test-takers carefully try to find answers to test questions. By contrast, RG refers to test-takers quickly answering test questions without thoughtful consideration (e.g., Wise and DeMars, 2006). For instance, Meyer (2010) integrated a two-class mixture Rasch model (Rost, 1990) to classify a test-taker as either making a solution attempt or RG, but this model does not allow the same person to show different behaviors during a test. To address this limitation, Wang and Xu (2015) proposed a model with a latent indicator to allow each test-taker to engage in either a solution attempt or RG on each item. Furthermore, the indicator can depend on either a test-taker's RG propensity or on an item-level feature (Wang et al., 2018). As RGs are typically much shorter than solution attempts, CDMs can use a test-taker's reaction time (RT) to each test question to properly model RGs and distinguish them from solution attempts (but not necessarily pre-knowledge answers, e.g., Wang et al., 2018). As no published study has proposed and tested a CDM that models RG, we do so in this study.
This study proposes a new framework of CDMs to recognize different test-taking behaviors by using RT and item responses simultaneously. This new class of CDMs: (1) models two test-taking behaviors (RG vs. solution attempt) for each item-person concurrence, (2) allows multiple switch points between RG and solution attempts among the items for each test-taker, (3) thereby yields person and item estimates with greater accuracy, and (4) generalizes to available CDMs, RT functions and other kinds of dissimilar behaviors. The generalized DINA model (G-DINA; de la Torre, 2011) conceptualizes and shows the utility of this framework. Specifically, two special cases of the G-DINA model, the deterministic input, noisy "and" gate (DINA) model (Junker and Sijtsma, 2001) and its counterpart, the deterministic input, noisy "or" gate (DINO) model (Templin and Henson, 2006), are simple to compute, estimate, and interpret, so they serve as illustrations. Nevertheless, researchers can extend this approach to other CDMs, especially CDMs with G-DINA-like formulations, such as the general diagnostic model (GDM; von Davier, 2005) and the linear logistic model (Maris, 1999).
After we present the functions for describing RT and item response, we specify the new model. Next, our simulation study illustrates the new model's performance, followed by its application to real data. Lastly, we discuss the implications of this study for identifying test-taking behaviors and improving the estimation accuracy of both person and item parameters.

A NEW CDM FRAMEWORK
The new model requires distinct functions to separately specify two fundamentals for an item, RT and item response, while two main facets, person and item, affect the observed RT and item response. This section describes the adopted RT and item response functions, before specifying the new model.

The Lognormal RT Model
As RT data from cognitive tests typically resemble a lognormal distribution more closely than a normal distribution, we use a lognormal function to characterize RT (van der Linden, 2006, 2007). Let $RT_{ij}$ be the observed RT of person $i$ ($i = 1, 2, \ldots, I$) to item $j$ ($j = 1, 2, \ldots, J$). In the lognormal function, the two parameters of person speed and time intensity represent the two facets, person and item, respectively, as follows:
$$\log(RT_{ij}) \sim N(\beta_j - \tau_i, \kappa_j^2), \tag{1}$$
where $\tau_i$ indicates the average speed of test-taker $i$ on a test (person speed); $\beta_j$ indicates the mean time that the population needs to resolve item $j$ (time intensity); and $\kappa_j^2$ indicates the dispersion of the logarithmized RT distribution (time discrimination parameter) of item $j$.
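To make the roles of the three parameters concrete, the following minimal sketch simulates logarithmized RTs under a lognormal model. It assumes the common parameterization in which log RT is normal with mean β_j − τ_i and standard deviation κ_j (so that κ_j² is the variance); the function names and all numeric values are illustrative assumptions, not values from this study.

```python
import math
import random


def simulate_log_rt(tau_i, beta_j, kappa_j, rng):
    """Draw one logarithmized RT from N(beta_j - tau_i, kappa_j**2).

    Assumption: kappa_j is the standard deviation of log RT.
    """
    return rng.gauss(beta_j - tau_i, kappa_j)


def lognormal_rt_density(rt, tau_i, beta_j, kappa_j):
    """Density of a raw RT (original time units) under the same model."""
    z = (math.log(rt) - (beta_j - tau_i)) / kappa_j
    return math.exp(-0.5 * z * z) / (rt * kappa_j * math.sqrt(2.0 * math.pi))


# Hypothetical values: a relatively fast person on a time-intensive item.
rng = random.Random(1)
log_rts = [simulate_log_rt(0.5, 3.0, 0.4, rng) for _ in range(20000)]
mean_log_rt = sum(log_rts) / len(log_rts)
print(round(mean_log_rt, 2))  # close to beta_j - tau_i
```

Under this parameterization a larger person speed lowers the expected log RT, and a larger time intensity raises it.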

The G-DINA Model
The G-DINA model loosens some restrictions of the DINA model, and its saturated form is equivalent to other general CDMs via link functions (de la Torre, 2011). Hence, the G-DINA model can (a) present different CDMs with similar formulations via various constraints and (b) substantially reduce the number of latent classes for an item, especially for models with more than five attributes. The original G-DINA model with identity link can be expressed as
$$P(Y_{ij} = 1 \mid \boldsymbol{\alpha}^*_{ij}) = \delta_{j0} + \sum_{k=1}^{K_j^*} \delta_{jk}\alpha_{ik} + \sum_{k'=k+1}^{K_j^*} \sum_{k=1}^{K_j^*-1} \delta_{jkk'}\alpha_{ik}\alpha_{ik'} + \cdots + \delta_{j12\cdots K_j^*} \prod_{k=1}^{K_j^*} \alpha_{ik}. \tag{2}$$
For test-taker $i$, the reduced attribute vector $\boldsymbol{\alpha}^*_{ij}$ has the required attributes for item $j$. The intercept for item $j$, $\delta_{j0}$, represents the probability of a correct response without the required attributes (baseline probability). The main effect $\delta_{jk}$ reflects the extent to which mastery of a single attribute $\alpha_k$ changes the probability of a correct response. The interaction effect $\delta_{jkk'}$ indicates the extent to which mastery of both attributes $\alpha_k$ and $\alpha_{k'}$ changes the probability of a correct response. The interaction effect $\delta_{j12\cdots K_j^*}$ reflects the extent to which mastery of all the required attributes $\alpha_1, \alpha_2, \cdots$, and $\alpha_{K_j^*}$ changes the probability of a correct response. Like most CDMs, the G-DINA model requires a $J \times K$ Q-matrix (Tatsuoka, 1983), which specifies which of the $K$ knowledge attributes are required to correctly answer each of the $J$ items. $K_j^* = \sum_{k=1}^{K} q_{jk}$ is the number of required attributes for item $j$, where $q_{jk} = 1$ if the correct response to item $j$ requires attribute $k$, and 0 otherwise. As the number of required attributes for item $j$ is smaller than the total number of attributes ($K_j^* < K$), the G-DINA model can reduce the number of required latent classes ($2^{K_j^*} < 2^K$) for an item. To illustrate the G-DINA-like formulations, we use two common cases: the DINA and DINO.
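The identity-link G-DINA probability and the reduction in latent classes can be illustrated with a short sketch. The item, Q-matrix row, and effect sizes below are hypothetical; the only rule encoded is that an effect contributes to the success probability when every attribute it involves is mastered.

```python
def required_attributes(q_row):
    """Indices of the attributes that the item requires (q_jk = 1)."""
    return [k for k, q in enumerate(q_row) if q == 1]


def gdina_probability(alpha_reduced, delta):
    """Identity-link G-DINA success probability for one item.

    alpha_reduced: 0/1 mastery of the item's required attributes only.
    delta: maps a tuple of positions in alpha_reduced to an effect;
           the empty tuple () holds the intercept (baseline probability).
    An effect is added only when every attribute in its tuple is mastered.
    """
    return sum(
        effect
        for subset, effect in delta.items()
        if all(alpha_reduced[k] == 1 for k in subset)
    )


# Hypothetical item requiring attributes 1 and 3 out of K = 5.
q_row = [1, 0, 1, 0, 0]
k_star = len(required_attributes(q_row))
classes_reduced, classes_full = 2 ** k_star, 2 ** len(q_row)

# Illustrative effects: intercept, two main effects, one interaction.
delta = {(): 0.1, (0,): 0.2, (1,): 0.2, (0, 1): 0.4}
p_none = gdina_probability((0, 0), delta)
p_one = gdina_probability((1, 0), delta)
p_both = gdina_probability((1, 1), delta)
```

For this item, only the four reduced latent classes need distinct probabilities, rather than all 32 full attribute profiles.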

The DINA Model
In the non-compensatory DINA, individuals are classified into one of two latent classes for an item: (a) the attribute vectors have all of an item's required attributes (mastery) or (b) the attribute vectors are missing at least one of the item's required attributes (non-mastery). The two latent classes' corresponding probabilities for a correct response entail that (a) mastery individuals do not slip, or (b) non-mastery individuals guess correctly (Junker and Sijtsma, 2001). Thus, the DINA model can be re-formed by setting to zero all G-DINA model parameters except $\delta_{j0}$ and $\delta_{j12\cdots K_j^*}$:
$$P(Y_{ij} = 1 \mid \boldsymbol{\alpha}^*_{ij}) = \delta_{j0} + \delta_{j12\cdots K_j^*} \prod_{k=1}^{K_j^*} \alpha_{ik}. \tag{3}$$
In Eq. 3, $\delta_{j0} = g_j$ is the probability of a correct response to item $j$ for a non-mastery test-taker $i$, where $g_j$ is the guessing parameter for item $j$; $\delta_{j0} + \delta_{j12\cdots K_j^*} = 1 - s_j$ is the probability of a correct response to item $j$ for a mastery test-taker $i$, where $s_j$ is the slipping parameter for item $j$. In the DINA, a mastery test-taker with all the required attributes ($K_j^*$) for item $j$ generally answers it correctly and other test-takers generally answer it incorrectly. Like the DINA, Eq. 3 shows that except for the attribute vector $\boldsymbol{\alpha}^*_j = \mathbf{1}_{K_j^*}$ (in which $\mathbf{1}_{K_j^*}$ is a vector of ones with length $K_j^*$), the other latent classes ($2^{K_j^*} - 1$) have the same probability of correctly answering item $j$. As shown in Eq. 3, this probability increases only after mastering all the required attributes. Under the DINA model assumption (Junker and Sijtsma, 2001), the G-DINA has two parameters per item (see Eq. 3).
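As a small check of this correspondence, the sketch below computes the DINA success probability twice: once from the guessing and slipping parameters directly, and once from the two surviving G-DINA terms (intercept equal to the guessing parameter, highest-order interaction equal to one minus slipping minus guessing). Function names and numeric values are illustrative assumptions.

```python
def dina_probability(alpha, q_row, guess, slip):
    """P(correct) under DINA: 1 - slip with all required attributes, else guess."""
    has_all = all(a == 1 for a, q in zip(alpha, q_row) if q == 1)
    return 1.0 - slip if has_all else guess


def dina_as_gdina(alpha, q_row, guess, slip):
    """Same probability via the intercept and the highest-order interaction."""
    product = 1
    for a, q in zip(alpha, q_row):
        if q == 1:
            product *= a  # becomes 0 as soon as one required attribute is missing
    return guess + (1.0 - slip - guess) * product


# Hypothetical item requiring the first two of three attributes.
q_row = [1, 1, 0]
profiles = [(0, 0, 0), (1, 0, 1), (0, 1, 1), (1, 1, 0), (1, 1, 1)]
gaps = [
    abs(dina_probability(a, q_row, 0.2, 0.1) - dina_as_gdina(a, q_row, 0.2, 0.1))
    for a in profiles
]
```

Only profiles that contain both required attributes reach the higher success probability; mastery of the third, unrequired attribute has no effect.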

The DINO Model
Unlike the non-compensatory DINA, the compensatory DINO requires only at least one of the required attributes to answer an item, so the parameters in G-DINA are set to
$$\delta_{jk} = -\delta_{jk'k''} = \cdots = (-1)^{K_j^*+1}\,\delta_{j12\cdots K_j^*}, \tag{4}$$
where $k = 1, \cdots, K_j^*$, $k' = 1, 2, \cdots, K_j^* - 1$, and $k'' > k', \cdots, K_j^*$. The sign alternates with the order of the interaction, and the main effects and interactions all have the same magnitude.
For a test-taker $i$ with at least one of the required attributes, the probability of answering item $j$ without slipping ($s_j$) is $\delta_{j0} + \delta_{jk} = 1 - s_j$. Likewise, for a test-taker $i$ with none of the required attributes, the probability of correctly answering item $j$ is the guessing parameter, $\delta_{j0} = g_j$. Unlike the DINA, all latent classes except for the attribute vector $\boldsymbol{\alpha}^*_j = \mathbf{0}_{K_j^*}$ (a vector of zeros of length $K_j^*$) have the same probability of correctly answering item $j$. Like the DINA, the DINO only needs two parameters for an item (Eq. 5, Templin and Henson, 2006).
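The alternating-sign constraint can be verified numerically. The sketch below builds G-DINA effects in which every main effect and interaction has magnitude one minus slipping minus guessing, with a positive sign for odd orders and a negative sign for even orders, and then checks that the resulting probabilities match the direct DINO rule for every reduced attribute profile. The helper names and parameter values are assumptions for illustration.

```python
from itertools import combinations, product


def dino_deltas(n_required, guess, slip):
    """G-DINA effects under the DINO constraint (alternating signs, equal size)."""
    size = 1.0 - slip - guess
    delta = {(): guess}  # intercept = guessing parameter
    for order in range(1, n_required + 1):
        sign = (-1) ** (order + 1)  # + for main effects, - for two-way, ...
        for subset in combinations(range(n_required), order):
            delta[subset] = sign * size
    return delta


def gdina_probability(alpha_reduced, delta):
    """Sum the effects whose attributes are all mastered (identity link)."""
    return sum(
        effect
        for subset, effect in delta.items()
        if all(alpha_reduced[k] == 1 for k in subset)
    )


def dino_probability(alpha_reduced, guess, slip):
    """Direct DINO rule: 1 - slip with at least one required attribute."""
    return 1.0 - slip if any(a == 1 for a in alpha_reduced) else guess


delta = dino_deltas(3, 0.2, 0.1)
max_gap = max(
    abs(gdina_probability(a, delta) - dino_probability(a, 0.2, 0.1))
    for a in product([0, 1], repeat=3)
)
```

The match holds because, for a profile with m mastered attributes, the signed binomial sum of the effects collapses to a single effect whenever m is at least one.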
To use information from both RT and item responses, two functions must be specified. Hence, the RT-GDINA, RT-DINA and RT-DINO jointly model RT and item response with the lognormal distribution (Eq. 1) and the G-DINA, DINA and DINO, respectively.

New Class of CDMs
We introduce a new class of G-DINA to account for varying test-taking behaviors. RT ($RT_{ij}$) and item response ($Y_{ij}$) are modeled individually. As test-takers can switch between RG and solution behaviors, like Wang and Xu (2015), we employ a latent indicator ($\xi$), where $\xi_{ij} = 1$ if test-taker $i$ tries to solve item $j$ and $\xi_{ij} = 0$ otherwise (the alternative behavior specified in this study is RG). Incorporating this latent indicator into the lognormal RT model extends Eq. 1 to
$$\log(RT_{ij}) \sim \begin{cases} N(\beta_j - \tau_i, \kappa_j^2) & \text{if } \xi_{ij} = 1, \\ N(\beta_0, \kappa_0^2) & \text{if } \xi_{ij} = 0. \end{cases} \tag{6}$$
Eq. 6 indicates that the logarithmized RT is normally distributed as in Eq. 1 if test-taker $i$ responds to item $j$ with a solution attempt ($\xi_{ij} = 1$), and it is normally distributed with mean time intensity $\beta_0$ and time discrimination $\kappa_0^2$ if test-taker $i$ responds to item $j$ with an RG ($\xi_{ij} = 0$). For an RG, the RT distribution is therefore the same across test-takers and items.
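The two-component structure of the RT model can be sketched as a simulation: a behavior indicator is drawn first, and the log RT is then drawn from either the item-and-person-specific distribution (solution attempt) or a common distribution (RG). The sketch assumes log RT has mean β_j − τ_i and standard deviation κ_j for solution attempts; all names and numeric values are hypothetical.

```python
import random


def simulate_rt_with_rg(tau_i, beta_j, kappa_j, pi_j, beta_0, kappa_0, rng):
    """Return (xi, log_rt) for one person-item encounter.

    xi = 1: solution attempt, log RT ~ N(beta_j - tau_i, kappa_j**2).
    xi = 0: rapid guess,      log RT ~ N(beta_0, kappa_0**2).
    pi_j is the probability of a solution attempt on the item.
    """
    xi = 1 if rng.random() < pi_j else 0
    if xi == 1:
        log_rt = rng.gauss(beta_j - tau_i, kappa_j)
    else:
        log_rt = rng.gauss(beta_0, kappa_0)
    return xi, log_rt


# Hypothetical values: rapid guesses are much faster than solution attempts.
rng = random.Random(7)
draws = [
    simulate_rt_with_rg(0.0, 3.5, 0.3, 0.8, 1.0, 0.5, rng) for _ in range(20000)
]
share_attempts = sum(xi for xi, _ in draws) / len(draws)
rg_log_rts = [t for xi, t in draws if xi == 0]
mean_rg_log_rt = sum(rg_log_rts) / len(rg_log_rts)
```

Because the RG component carries no person or item subscripts, every rapid guess in the simulation shares the same RT distribution.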
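Likewise, adding $\xi_{ij}$ to the G-DINA yields
$$P(Y_{ij} = 1 \mid \boldsymbol{\alpha}^*_{ij}, \xi_{ij}) = \xi_{ij}\left[\delta_{j0} + \sum_{k=1}^{K_j^*} \delta_{jk}\alpha_{ik} + \cdots + \delta_{j12\cdots K_j^*} \prod_{k=1}^{K_j^*} \alpha_{ik}\right] + (1 - \xi_{ij})\,\delta_j^*. \tag{7}$$
The G-DINA model is the underlying model for a solution attempt on item $j$ by test-taker $i$ ($\xi_{ij} = 1$). We assume that an RG on item $j$ by test-taker $i$ ($\xi_{ij} = 0$) yields $\delta_j^*$. For simplicity, like Wise and DeMars (2006), we assume that test-taker $i$ has the same probability of correctly answering item $j$ both by RG and by guessing with none of the required attributes ($\delta_j^* = \delta_{j0}$), that is, guessing randomly among all options. Hence, Eq. 7 can be re-written as
$$P(Y_{ij} = 1 \mid \boldsymbol{\alpha}^*_{ij}, \xi_{ij}) = \delta_{j0} + \xi_{ij}\left[\sum_{k=1}^{K_j^*} \delta_{jk}\alpha_{ik} + \cdots + \delta_{j12\cdots K_j^*} \prod_{k=1}^{K_j^*} \alpha_{ik}\right].$$
The latent indicator $\xi_{ij}$ in Eqs 6-8 is a binary result of test-taker $i$'s behavior on item $j$ (solution attempt vs. RG). It can be modeled by a Bernoulli distribution with $\pi_j$, the marginal probability of the solution attempt. Using the DINA and DINO to identify RGs and solution attempts, Eqs 3 and 5 are re-written, respectively, as Eqs 8 and 9.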
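For the two-parameter DINA and DINO cases, the assumption that an RG succeeds with the guessing probability leads to a compact response function, sketched below. Here `eta` is a hypothetical name for the ideal response (1 when the test-taker has all required attributes under DINA, or at least one under DINO); the numeric values are illustrative only.

```python
def rg_response_probability(eta, guess, slip, xi):
    """P(correct) with a behavior indicator for DINA- or DINO-type items.

    eta: 1 if the attribute profile satisfies the item's rule, else 0.
    xi:  1 for a solution attempt, 0 for a rapid guess.
    Assumption: a rapid guess succeeds with the guessing probability.
    """
    return guess + xi * eta * (1.0 - slip - guess)


def marginal_probability(eta, guess, slip, pi_j):
    """Average the response probability over xi ~ Bernoulli(pi_j)."""
    return (
        pi_j * rg_response_probability(eta, guess, slip, 1)
        + (1.0 - pi_j) * rg_response_probability(eta, guess, slip, 0)
    )


# A master on an item with guess = 0.2, slip = 0.1, and 10% rapid guessing.
p_master = marginal_probability(1, 0.2, 0.1, 0.9)
apparent_slip = 1.0 - p_master
# A non-master is unaffected by rapid guessing under this assumption.
p_non_master = marginal_probability(0, 0.2, 0.1, 0.9)
```

In this example the master's marginal success rate falls below one minus the slipping parameter, so a model that ignores RG would attribute the shortfall to slipping, while the non-master's success rate stays at the guessing parameter.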
To illustrate this approach, we combine the lognormal RT distribution and the easy-to-understand DINA and DINO models, with the latent indicator $\xi$ (RT-DINA-RG, RT-DINO-RG) and without it (RT-DINA, RT-DINO). We estimate their parameters via the Bayesian method with the Markov chain Monte Carlo (MCMC) algorithm in the freeware JAGS (Plummer, 2017). For the JAGS code and the priors for the estimated parameters of the RT-DINA-RG and RT-DINO-RG models, see the Appendix.

SIMULATION STUDY 1: PARAMETER RECOVERY OF RT-DINA-RG

Design
In simulation study 1, we evaluated the parameter recovery of the RT-DINA-RG for a test of 30 dichotomous items measuring five non-compensatory attributes. See the artificial Q-matrix in Table 1. The guessing ($g_j$) and slipping ($s_j$) parameters were randomly generated, respectively, from the uniform distributions U(0.05, 0.3) and U(0.05, 0.2), which reflect a high-quality test. This data-generating procedure for the 30 simulated items yielded item discrimination indices (IDI) that ranged from 0.51 to 0.88, indicating a test with high measurement quality (Lee et al., 2012). We manipulated two conditions. In the RG condition, the marginal probability of RG ($1 - \pi_j$) was set for items at two levels: 0.1 and 0.2 (Wang et al., 2018). To describe the dynamic latent indicator of person $i$ on item $j$ in the RG condition, the $\xi$-parameter was generated from a Bernoulli distribution with a probability of either 0.8 or 0.9. In the RT-DINA-RG, mean time intensity and time discrimination were (a) set, respectively, at 2 and 1.6 for rapid guessers ($\beta_0$ and $\kappa_0$; $\xi_{ij} = 0$) and (b) generated, respectively, from U(2, 4) and U(0.15, 2) for normal test-takers ($\xi_{ij} = 1$). In the non-RG condition, RG never occurs, and the RT-DINA served as the data-generating model, yielding parameters similar to the RT-DINA-RG. Mean time intensity and time discrimination can be generated to accommodate various test situations (e.g., Man et al., 2018), but they do not affect the use of the proposed model. Therefore, we leave this interesting topic for further study.
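The item-level part of this design can be sketched as follows. The sample size, the random seed, and the definition of the IDI as one minus slipping minus guessing are assumptions made for illustration; the Q-matrix, attribute profiles, and person speeds are omitted because their generating details are not restated here.

```python
import random

rng = random.Random(2020)
n_items = 30
n_persons = 500  # hypothetical sample size, not taken from the study

# Item response parameters drawn from the stated uniform ranges.
guess = [rng.uniform(0.05, 0.30) for _ in range(n_items)]
slip = [rng.uniform(0.05, 0.20) for _ in range(n_items)]
# Assumed IDI definition: (1 - slip) - guess.
idi = [1.0 - s - g for g, s in zip(guess, slip)]

# RT parameters for solution attempts, and fixed values for rapid guesses.
beta = [rng.uniform(2.0, 4.0) for _ in range(n_items)]
kappa = [rng.uniform(0.15, 2.0) for _ in range(n_items)]
beta_0, kappa_0 = 2.0, 1.6

# Behavior indicators: xi = 1 (solution attempt) with probability pi_j.
pi_j = 0.9  # corresponds to a marginal RG probability of 0.1
xi = [
    [1 if rng.random() < pi_j else 0 for _ in range(n_items)]
    for _ in range(n_persons)
]
rg_rate = 1.0 - sum(map(sum, xi)) / (n_items * n_persons)
```

Under the assumed IDI definition, the stated uniform ranges bound the index between 0.5 and 0.9, which is in line with the reported range of 0.51 to 0.88.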
Both the RT-DINA and RT-DINA-RG were fit to these data to test three hypotheses: (1) with some RG, the RT-DINA-RG efficiently recovers item and person estimates; (2) ignoring RG via the RT-DINA yields biased item parameter estimates, less accurate classification of attribute mastery, and less reliable person speed estimates; and (3) with no RG, the RT-DINA-RG performs as well as the RT-DINA. To evaluate the recovery of item parameters, the bias and root mean squared error (RMSE) were computed as dependent variables:
$$\mathrm{Bias}(\hat{\nu}) = \frac{1}{R}\sum_{r=1}^{R} (\hat{\nu}_r - \nu), \qquad \mathrm{RMSE}(\hat{\nu}) = \sqrt{\frac{1}{R}\sum_{r=1}^{R} (\hat{\nu}_r - \nu)^2},$$
where $\nu$ and $\hat{\nu}_r$ indicate, respectively, the true value of an item parameter and its estimate in the $r$-th of $R$ replications. We examined test-takers' true and estimated latent classes to evaluate the classification accuracy of each attribute. The reliability of the person speed parameter was also computed.

Frontiers in Psychology | www.frontiersin.org

Results
In the RG condition, the RT-DINA-RG generally yielded unbiased parameter estimates, whereas the RT-DINA overestimated the slipping parameters and underestimated the time intensity, guessing, and time discrimination parameters (see Figure 1). Greater RG increased the severity of slipping overestimation and time intensity underestimation. For test-takers without the required attributes, RG did not influence the success rate, so ignoring RG did not substantially influence estimation of the guessing parameters. Across five attributes and 100 replications, mean classification accuracy was higher for the RT-DINA-RG than the RT-DINA (0.936 > 0.924), suggesting that ignoring RG reduces the accuracy of attribute classification. Also, the RT-DINA-RG outperformed the RT-DINA on reliability of the person speed parameter (M: 0.66 > 0.57).
In the non-RG condition, both RT-DINA and RT-DINA-RG recovered the parameters well (see Figure 2). The bias and RMSE for the π-parameter in the RT-DINA-RG were nearly zero. Also, both models yielded practically identical classification accuracy (M = 96.6%) and reliability of person speed parameter (M = 0.76) across 100 replications. Hence, overfitting the RT-DINA-RG to data without RG showed no significant harm. In brief, the simulation results supported our three hypotheses.
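The recovery criteria used in this study can be sketched in a few lines: bias and RMSE of an item parameter across replications, and attribute-wise classification accuracy for the person profiles. The function names and the toy inputs are illustrative assumptions.

```python
import math


def bias(estimates, true_value):
    """Mean signed error of the estimates across replications."""
    return sum(e - true_value for e in estimates) / len(estimates)


def rmse(estimates, true_value):
    """Root mean squared error of the estimates across replications."""
    return math.sqrt(
        sum((e - true_value) ** 2 for e in estimates) / len(estimates)
    )


def classification_accuracy(true_profiles, estimated_profiles):
    """Proportion of test-takers correctly classified, per attribute."""
    n = len(true_profiles)
    n_attr = len(true_profiles[0])
    return [
        sum(t[k] == e[k] for t, e in zip(true_profiles, estimated_profiles)) / n
        for k in range(n_attr)
    ]


# Toy example: three replications of a slipping estimate with true value 0.25.
slip_estimates = [0.20, 0.30, 0.40]
slip_bias = bias(slip_estimates, 0.25)
slip_rmse = rmse(slip_estimates, 0.25)

# Toy example: four test-takers, two attributes.
truth = [(1, 0), (0, 1), (1, 1), (0, 0)]
estimate = [(1, 0), (0, 0), (1, 1), (1, 0)]
accuracy = classification_accuracy(truth, estimate)
```

A positive bias indicates overestimation, which is the pattern reported for the slipping parameters when RG is ignored.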

SIMULATION STUDY 2: PARAMETER RECOVERY OF RT-DINO-RG

Design
Study 2 simulated compensatory attributes and analyzed parameter recovery by the RT-DINO and RT-DINO-RG. The item responses and RTs were generated for (a) the RG condition with the RT-DINO-RG and (b) the non-RG condition with the RT-DINO. The parameters, data generation and evaluation criteria were the same as those in simulation study 1. Paralleling study 1, we tested three hypotheses: (1) with some RG, the RT-DINO-RG efficiently recovers item and person estimates; (2) ignoring RG via the RT-DINO yields biased item parameter estimates and less accurate classification of attribute mastery; and (3) with no RG, the RT-DINO-RG performs as well as the RT-DINO.

Results
The study 2 results resemble the study 1 results (see Figure 3). In the RG condition, the RT-DINO-RG recovered the parameters well, whereas the RT-DINO overestimated the slipping parameters and underestimated the time intensity, guessing, and time discrimination parameters. Greater RG [...]
In the non-RG condition, both RT-DINO and RT-DINO-RG recovered the item parameters well (see Figure 4). The bias and RMSE for the π-parameter in the RT-DINO-RG model were very small. Also, both models had practically identical classification accuracy (M = 98.4%) and reliability of the person speed parameter (M = 0.71) across replications. In sum, these simulation results supported our three hypotheses.
The results indicate both compensatory attributes and RG. Deviance information criterion (DIC) values showed that the compensatory models outperformed the non-compensatory ones (RT-DINO < RT-DINA: 1,351,697 < 1,433,173; and RT-DINO-RG < RT-DINA-RG: 1,327,068 < 1,360,978), suggesting that the eight attributes' relationships were more compensatory than non-compensatory. Also, the RG models outperformed the simpler models (RT-DINO-RG < RT-DINO: 1,327,068 < 1,351,697; and RT-DINA-RG < RT-DINA: 1,360,978 < 1,433,173), showing substantial RG. As the data indicated both compensatory attributes and RG, the RT-DINO-RG showed the best fit. Hence, we examine the RT-DINO and RT-DINO-RG results in greater detail.
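The model comparison above amounts to simple arithmetic on the reported DIC values, where a smaller value indicates better fit. The sketch below uses only the four values given in the text; the variable names are illustrative.

```python
# DIC values as reported in the text (smaller is better).
dic = {
    "RT-DINA": 1433173,
    "RT-DINA-RG": 1360978,
    "RT-DINO": 1351697,
    "RT-DINO-RG": 1327068,
}

best_model = min(dic, key=dic.get)

# DIC reduction from adding the RG indicator, within each model family.
gain_from_rg = {
    "DINA": dic["RT-DINA"] - dic["RT-DINA-RG"],
    "DINO": dic["RT-DINO"] - dic["RT-DINO-RG"],
}

# DIC reduction from the compensatory (DINO) form, with and without RG.
gain_from_dino = {
    "without RG": dic["RT-DINA"] - dic["RT-DINO"],
    "with RG": dic["RT-DINA-RG"] - dic["RT-DINO-RG"],
}
```

All four differences are positive, which matches the two conclusions drawn from the comparison: the compensatory form fits better, and modeling RG fits better.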
Like study 2, the RT-DINO estimated higher slipping parameters and lower guessing parameters, compared to the RT-DINO-RG [...] overestimated slipping parameters when ignoring RGs. Also, the πs of items 1-11 were generally lower than those of items 12-22. If these items appeared on the test in this sequence (item position information was not publicly available), these π results suggest that test-taker accuracy depended on their completion speed (speededness). The RT-DINO-RG also uses response time to recognize RGs and solution attempts, showing an estimated mean time intensity ($\beta_0$) of 3.21 and time discrimination ($\kappa_0$) of 0.70. The various probability density functions of response time for RGs and solution attempts in the RT-DINO-RG (see Figure 6) suggest that students used varied answering strategies to spend more time on some items and less time on others (including RGs). The RT-DINO and RT-DINO-RG did not consistently classify mastery of the eight attributes [Cohen's κ ranged from 0.48 (quantity) to 0.98 (occupational), see Table 3]. Notably, few students had knowledge of the third attribute (space and shape). The simulation studies suggest that the RT-DINO-RG classifications are more reliable than the RT-DINO ones.
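Agreement between two models' mastery classifications on one attribute can be summarized with Cohen's κ, which corrects the observed agreement for the agreement expected by chance. The sketch below is a generic implementation with made-up classifications; note that this κ is unrelated to the time discrimination parameter that shares the symbol.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two classifications of the same test-takers."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both classifications use a single, identical category
    return (observed - expected) / (1.0 - expected)


# Made-up mastery classifications (1 = mastery) from two models.
model_a = [1, 1, 0, 0, 1, 0, 1, 1]
model_b = [1, 0, 0, 0, 1, 0, 1, 1]
kappa_agreement = cohens_kappa(model_a, model_b)
```

In this toy case the two classifications agree on seven of eight test-takers, while chance agreement is one half, so κ is well below the raw agreement rate.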

DISCUSSION
CDMs assess whether test-takers have the needed skills (attributes) to answer each test question and give suitable diagnostic feedback, but they have not adequately modeled RG vs. solution attempts with reaction times. Hence, this study [...] Unlike Wang and Xu's (2015) person-level manipulation of RG, this study manipulated RG at the item level (Wang et al., 2018). The simulation results and real data analysis showed that the RT-DINA-RG and RT-DINO-RG recovered parameters well and assessed test-takers' diagnostic results more accurately. In contrast, ignoring RGs by fitting simpler models yielded biased parameters, less reliable person speed estimates, and lower classification accuracy of test results.
Hence, this study extends research showing how analyses of RT improve cognitive assessments of test-takers (e.g., van der Linden, 2008; Lee and Chen, 2011; Wang and Xu, 2015). When test-takers rapidly guess, the RT-GDINA-RG yields greater accuracies in person parameters, item parameters, and cognitive results. Therefore, researchers or users should use the RT-GDINA-RG to model data if RGs might occur. The choice of RT-GDINA-RG model (i.e., RT-DINA-RG or RT-DINO-RG) depends on the nature of the test items. If a test item's needed underlying constructs can compensate for one another, then the RT-DINO-RG is suitable. If the underlying constructs cannot compensate for one another, then the RT-DINA-RG is suitable.
Moreover, the person parameters and the item parameters of the RT-GDINA-RG were each assumed to be independent in this study. As attributes and item parameters of a CDM are often related in practice, one can capture the relations among them with correlational structures (e.g., van der Linden, 2007). Note that the commonly used multivariate normal distribution for specifying the relations among person parameters is not feasible for the discrete attributes in CDMs. Following Zhan et al. (2018), one can address this problem by using a higher-order latent trait to link the correlated attributes (de la Torre and Douglas, 2004), and then assuming that the person parameters (i.e., the higher-order latent trait and person speed) follow a bivariate normal distribution.
In addition, for the sake of simplicity, this study assumes that the probability of correctly answering an item by an RG is the same as that of guessing with none of the required attributes. Such a naïve assumption can be further explored as in Wang and Xu (2015). Further, the RT-GDINA-RG distinguishes between solution attempt and RG for cognitive diagnosis via a latent indicator. In addition to RG, the RT-GDINA-RG can be easily extended to accommodate diverse test-taking behaviors and various tests' requirements. For example, we can extend CDMs to include other test-taking behaviors such as prior knowledge/preknowledge (Wang et al., 2018; Man et al., 2019) or nonresponses (Ulitzsch et al., 2019) if and only if the probabilities of a correct response from different latent indicators (or classes) can be clearly defined. In a high-stakes test, individuals often use preknowledge to correctly answer items with extremely short RT (unlike solution attempts with relatively long RT and unlike RGs with often wrong answers and short RT). Furthermore, we can adapt the functions for depicting RT and item response to the testing contexts, such as a linear transformation (Wang et al., 2013) or a gamma distribution to depict RT for mental rotation items (Maris, 1993), among others (De Boeck and Jeon, 2019). Also, the item response function can be replaced by other CDMs, such as the GDM (von Davier, 2005) or the linear logistic model (Maris, 1999). Future studies can investigate these approaches.
In addition, ignoring RGs can harm the development and application of cognitive assessments (for both high- and low-stakes tests), distort test results, or invalidate inferences. For example, greater precision of test parameters via the RT-GDINA-RG ensures the quality of item bank construction and assembly of tests, especially for large-scale assessments. Their greater precision also reduces the number of test items necessary to accurately assess a test-taker's domain knowledge, thereby enabling more subdomains to be assessed. The RT-GDINA-RG results regarding time can also inform designers of timed tests regarding the time needed for different solution approaches to a test question. For example, in a timed test, items might have frequent RG because test-takers perceive that they lack sufficient time to attempt a solution. Thus, such information can help users of test scores set a suitable time limit (e.g., by increasing the response time) for completing the test. In addition, greater accuracy in the estimation of test scores increases users' confidence in the results and their subsequent inferences.
When using the RT-GDINA-RG to estimate more precise person and item parameters in the presence of RG, the Q-matrix is an essential component in CDM contexts. An identifiable Q-matrix ensures the consistency of CDM estimation. In this study, the simulation studies used an identifiable Q-matrix (Xu and Zhang, 2016; Xu, 2017), and the real data analysis adopted a partially identifiable Q-matrix (Gu and Xu, 2020). To enable consistent CDM estimation, checking the identifiability of the Q-matrix in advance is crucial. In addition, for ease of use, a tutorial introducing the RT-GDINA-RG in JAGS can be developed in future work (cf. Curtis, 2010; Zhan et al., 2019).
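As a rough illustration of checking a Q-matrix in advance, the sketch below runs two simple screening checks: whether every attribute is measured alone by at least one item (often called completeness), and how many items require each attribute. These are screening heuristics only and are not the full identifiability conditions of the cited work; the Q-matrices and function names are hypothetical.

```python
def is_complete(q_matrix):
    """True if every attribute is required alone by at least one item."""
    n_attr = len(q_matrix[0])
    single_attribute_items = {
        row.index(1) for row in q_matrix if sum(row) == 1
    }
    return single_attribute_items == set(range(n_attr))


def attribute_counts(q_matrix):
    """Number of items that require each attribute (column sums)."""
    return [sum(column) for column in zip(*q_matrix)]


# Hypothetical Q-matrix: five items, two attributes.
q_good = [[1, 0], [0, 1], [1, 1], [1, 0], [0, 1]]
# Hypothetical Q-matrix in which attribute 2 is never measured alone.
q_poor = [[1, 1], [1, 0], [1, 1]]

good_is_complete = is_complete(q_good)
good_counts = attribute_counts(q_good)
poor_is_complete = is_complete(q_poor)
```

A Q-matrix that fails such screening checks is a signal to consult the formal identifiability results before fitting the model.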

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

AUTHOR CONTRIBUTIONS
All authors contributed to the article and approved the submitted version.