Applying the M2 Statistic to Evaluate the Fit of Diagnostic Classification Models in the Presence of Attribute Hierarchies

The performance of the limited-information statistic M2 for diagnostic classification models (DCMs) is under-investigated in the current literature. In particular, investigations of M2 for specific DCMs, rather than for general modeling frameworks, are needed. This article aims to demonstrate the usefulness of M2 for hierarchical diagnostic classification models (HDCMs). The performance of M2 in evaluating the fit of HDCMs was investigated in the presence of four types of attribute hierarchies. Two simulation studies were conducted to examine the Type I error rates and the statistical power of M2 under different simulation conditions. The findings suggest acceptable Type I error rate control of M2 as well as high statistical power under Q-matrix misspecification and DINA model misspecification. Data from the Examination for the Certificate of Proficiency in English (ECPE) were used to empirically illustrate the suitability of M2 in practice.


INTRODUCTION
Diagnostic classification models (DCMs; Rupp et al., 2010) have demonstrated great potential for evaluating respondents with fine-grained information to support targeted interventions. Previous studies have applied DCMs to address practical issues in education (e.g., Jang, 2009) and psychology (e.g., Templin and Henson, 2006). However, the area of research in DCMs is still relatively new, and more research on model-data fit statistics is needed for evaluating DCMs. Although relative fit statistics (e.g., de la Torre and Douglas, 2008) are available for determining the most suitable of several alternative models, they cannot be used for evaluating test- or item-level goodness of fit. As such, some authors have proposed and developed methods to evaluate the absolute fit of DCMs (e.g., Jurich, 2014; Wang et al., 2015). In particular, limited-information test statistics, such as the M2 statistic, have been recommended by researchers (e.g., Liu et al., 2016) because of their ability to address sparseness in the contingency table. In the study by Liu et al. (2016), the performance of M2 was evaluated under the log-linear cognitive diagnosis model (LCDM; Henson et al., 2009). They found that M2 has reasonable Type I error rate control when models were correctly specified and good statistical power when the model or Q-matrix was misspecified, across different sample sizes, test lengths, and attribute correlations. More importantly, their study identified cutoff values of the root mean square error of approximation (RMSEA) fit index for M2 (RMSEA2). Similarly, the study by Hansen et al. (2016) validated the usefulness of M2 in DCMs through a series of simulation studies.
Specifically, their study showed that M2 performs well across different diagnostic model structures and is sensitive to model misspecifications, Q-matrix misspecifications, misspecifications in the distribution of the higher-order latent dimensions, and violations of local item independence. Generally, these findings were based on general frameworks (e.g., the LCDM) or the most common DCMs (e.g., DINA; de la Torre, 2009), which assume extremely complicated relationships among items but simple relationships among attributes. However, more specific models are often more suitable for practical use (Rojas et al., 2012; Ma et al., 2016), so the results would be more convincing if M2 could be applied under more specific and practical conditions.
In education, students are often required to master certain prerequisite skills before they move on to learn new knowledge and skills. This indicates that hierarchical relationships often exist among cognitive skills. To address the presence of hierarchical relationships among attributes, Templin and Bradshaw (2014) developed the hierarchical diagnostic classification models (HDCMs) to model attribute hierarchies. Hence, applying M2 to evaluate the fit of DCMs in the presence of attribute hierarchies can further testify to the utility of limited-information tests for DCMs. In this study, we examine the Type I error rates and statistical power of M2 in evaluating the overall fit of HDCMs under different simulation conditions. Specifically, five models are considered in our research: the LCDM and HDCMs with linear, convergent, divergent, and unstructured attribute hierarchies. In addition, the performance of M2 is examined with real data.

HIERARCHICAL DIAGNOSTIC CLASSIFICATION MODELS
Over the past several decades, numerous DCMs have been developed and presented in the psychometric literature. Some of these models are general modeling frameworks under which other specific DCMs can be subsumed through statistical constraints on model parameters. Hence, general DCMs can flexibly model the probabilities of differently structured examinee responses, at the sacrifice of model simplicity. In this study, we use as the fitting model the HDCM, which is developed upon the LCDM framework (Henson et al., 2009). The LCDM defines the probability of a correct response of the ith examinee to item j as

$$P(X_{ij} = 1 \mid \alpha_i) = \frac{\exp\left(\lambda_{j,0} + \lambda_j' h(\alpha_i, q_j)\right)}{1 + \exp\left(\lambda_{j,0} + \lambda_j' h(\alpha_i, q_j)\right)},$$

where $\alpha_i = (\alpha_{i1}, \ldots, \alpha_{iK})'$ represents the attribute mastery pattern of examinee i, $\lambda_{j,0}$ represents the intercept parameter for item j, and $\lambda_j$ collects the main and interaction effect parameters for item j. $q_j = (q_{j1}, \ldots, q_{jk}, \ldots, q_{jK})'$ is the vector of Q-matrix entries for item j, where $q_{jk}$ denotes whether attribute k is required by item j. $h$ is a mapping function that forms the linear combination of $\alpha_i$ and $q_j$:

$$\lambda_j' h(\alpha_i, q_j) = \sum_{k=1}^{K} \lambda_{j,1,(k)} \alpha_{ik} q_{jk} + \sum_{k=1}^{K-1} \sum_{k' > k} \lambda_{j,2,(k,k')} \alpha_{ik} \alpha_{ik'} q_{jk} q_{jk'} + \cdots.$$

In the above equation, all main and interaction effects are included for item j. Specifically, $\lambda_{j,1,(k)}$ refers to the main effect of attribute k for item j, and $\lambda_{j,2,(k,k')}$ refers to the two-way interaction effect of attributes k and k′ for item j. Hence, the subscript following the first comma in $\lambda_{j,1,(k)}$ or $\lambda_{j,2,(k,k')}$ indicates the effect level, and the subscript(s) in parentheses refer to the attribute(s) involved in the effect. For example, if $q_j = (0, 1, 1)'$ and $\alpha_i = (1, 1, 0)'$, namely the second and third attributes are required by item j and the examinee has mastered the first two attributes, then $\lambda_j' h(\alpha_i, q_j) = \lambda_{j,1,(2)}$.
For $\alpha_i = (0, 1, 1)'$, namely an examinee who has mastered all attributes required by item j, $\lambda_j' h(\alpha_i, q_j) = \lambda_{j,1,(2)} + \lambda_{j,1,(3)} + \lambda_{j,2,(2,3)}$, indicating that two main effects and the two-way interaction effect of attributes 2 and 3 are included for item j. Further, a statistical constraint is imposed to ensure the monotonicity of the LCDM (see Henson et al., 2009).
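As a concrete illustration of the item response function above, the following sketch evaluates the LCDM probability for a single item. The function name, the dictionary-based parameter layout, and the restriction to two-way interactions are our own simplifications for illustration, not code from the original study.

```python
from itertools import combinations
from math import exp

def lcdm_prob(alpha, q, intercept, main, inter):
    """Probability of a correct response under the LCDM for one item.

    alpha : attribute mastery pattern, e.g. (1, 1, 0)
    q     : Q-matrix row for the item, e.g. (0, 1, 1)
    main  : {k: lambda_{j,1,(k)}} main effects (0-based attribute index)
    inter : {(k, k2): lambda_{j,2,(k,k2)}} two-way interaction effects
    Higher-order interaction terms are omitted in this sketch.
    """
    required = [k for k, qk in enumerate(q) if qk == 1]
    logit = intercept
    for k in required:                       # main effects
        logit += main.get(k, 0.0) * alpha[k]
    for k, k2 in combinations(required, 2):  # two-way interactions
        logit += inter.get((k, k2), 0.0) * alpha[k] * alpha[k2]
    return 1.0 / (1.0 + exp(-logit))
```

With q = (0, 1, 1) and alpha = (1, 1, 0), only the main effect of attribute 2 enters the logit, matching the worked example above; with alpha = (0, 1, 1), both main effects and the two-way interaction enter.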
Attribute hierarchies representing student knowledge structures often exist in education. However, how to model attribute hierarchies remained unresolved until Templin and Bradshaw (2014) developed the HDCMs. As previously mentioned, the LCDM is a general modeling framework under which other specific models can be obtained through statistical constraints, and the parameterization of the HDCM, which is based on the LCDM, is no exception. Moreover, because different types of attribute hierarchies correspond to different parameterizations, the linear hierarchy is used here for illustration.
With respect to the LCDM, K attributes correspond to $2^K$ attribute mastery patterns. However, for a linear hierarchy with K attributes, the number of attribute mastery patterns is sharply reduced from $2^K$ to $K + 1$. This change is reflected by $\alpha_i^*$, which denotes the possible attribute mastery patterns under the constraints of the linear hierarchy. Allowing $q_j = (a, b, c, \ldots)'$ to denote the attributes required by item j, the linear hierarchy defines attribute a as the most fundamental attribute, such that every other attribute is nested within its preceding attribute. Thus, the matrix product in Equation (2) becomes

$$\lambda_j' h(\alpha_i^*, q_j) = \lambda_{j,1,(a)} \alpha_{ia} q_{ja} + \lambda_{j,2,(b(a))} \alpha_{ia} \alpha_{ib} q_{ja} q_{jb} + \lambda_{j,3,(c(b,a))} \alpha_{ia} \alpha_{ib} \alpha_{ic} q_{ja} q_{jb} q_{jc} + \cdots.$$
Under the HDCM, the item parameters then include only one intercept, one main effect for attribute 1, and one interaction effect for attribute 2 nested within attribute 1. Obviously, regardless of the number of attributes, there will be only one main effect in the item response function owing to the constraints of the linear hierarchy; accordingly, the number of item parameters for items measuring several attributes is also greatly reduced. In this study, the four fundamental hierarchical structures (see Figure 1) of Leighton et al. (2004) were modeled by HDCMs. All attribute hierarchies involve five attributes. Specifically, the linear hierarchy defines linear relationships between attributes: the mastery of an attribute depends on the mastery of its preceding attribute. As such, attribute 1 is the most fundamental attribute, given that it is required for the mastery of every other attribute in the hierarchical structure. In other words, it is impossible, for example, for an examinee to master attributes 3 and 4 without mastering attributes 1 and 2. The linear hierarchy thus largely reduces the item parameters of the LCDM. As an example, assume item j measures attributes 1, 2, and 4; the linear HDCM then defines the matrix product of item j as

$$\lambda_{j,1,(1)} \alpha_{i1} q_{j1} + \lambda_{j,2,(2(1))} \alpha_{i1} \alpha_{i2} q_{j1} q_{j2} + \lambda_{j,3,(4(2,1))} \alpha_{i1} \alpha_{i2} \alpha_{i4} q_{j1} q_{j2} q_{j4}.$$
Only the main effect of attribute 1, the two-way interaction effect of attributes 1 and 2, and the three-way interaction effect of attributes 1, 2, and 4 are modeled in the HDCM under the constraints of the linear hierarchy. In the convergent hierarchy, the mastery of attribute 3 or 4 depends on the mastery of attributes 1 and 2, and both attributes 3 and 4 are prerequisite attributes of attribute 5. As such, an item measuring attributes 1, 2, and 4 is modeled in the same way as Equation (5) under the convergent HDCM.
In the unstructured hierarchy, attribute 1 is the prerequisite attribute of all other attributes, which are otherwise independent of each other. Under the unstructured HDCM, the same item would be modeled in the same way as under the divergent HDCM. It should be noted that, for items measuring four or five attributes, different attribute hierarchies model the same item in more distinct ways, since more attributes lead to more main and interaction effects.
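The reduction in permissible mastery patterns under a hierarchy can be made concrete with a small enumeration sketch. The function and the prerequisite-map representation are illustrative assumptions, not part of the paper's implementation.

```python
from itertools import product

def valid_patterns(n_attrs, prereqs):
    """Enumerate attribute mastery patterns consistent with a hierarchy.

    prereqs maps an attribute index (0-based) to the set of attributes
    that must be mastered before it can be mastered.
    """
    patterns = []
    for alpha in product([0, 1], repeat=n_attrs):
        # a pattern is valid if every mastered attribute has all of its
        # prerequisites mastered as well
        if all(alpha[p] for k, ps in prereqs.items() if alpha[k] for p in ps):
            patterns.append(alpha)
    return patterns
```

For five attributes, a linear hierarchy (1 → 2 → 3 → 4 → 5) yields K + 1 = 6 valid patterns, while an unstructured hierarchy (attribute 1 prerequisite to all others) yields 17, both far fewer than the unconstrained 2^5 = 32.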

THE M2 TEST STATISTIC
Regarding fit tests for DCMs, several recent studies have focused on item-level fit statistics (e.g., Kunina-Habenicht et al., 2012; Wang et al., 2015). Despite the feasibility of the proposed item-level fit statistics, a way to assess absolute model-data fit at the test level remains underdeveloped because the traditional full-information statistics, such as χ² and G², are not practically feasible for DCMs. Specifically, the computation of χ² and G² is based on all possible response patterns, namely, the full contingency table, and for these statistics to be effective, the expected frequency in each cell should be large, usually exceeding 5. This implies that a test analyzed with them could contain only a few items yet require a large number of examinees, because even a small number of items produces a large number of response patterns, a situation that easily leads to a sparse contingency table. As such, some approaches have been proposed to address the sparseness issue. For instance, although the Monte Carlo resampling technique can be used to produce empirical p-values (Tollenaar and Mooijaart, 2003), it is too time-consuming in practice. Another approach, the posterior predictive model checking method (Sinharay and Almond, 2007), is conservative and requires intensive computation due to the Markov chain Monte Carlo (MCMC) algorithm.
The limited-information tests are promising for model-data fit testing in DCMs. Unlike full-information statistics, such as χ² and G², which use the entire contingency table, limited-information statistics use only a subset of the lower-order marginal tables. M2 is the most popular member of a family of limited-information test statistics, $M_r$, where r denotes the marginal order (Maydeu-Olivares and Joe, 2006). Even though M2 uses only the univariate and bivariate marginal information, it is adequately powerful and can be computed efficiently (Maydeu-Olivares and Joe, 2005). Similar to traditional full-information statistics, limited-information statistics are constructed from the residuals between observed and expected marginal probabilities. Hence, the residuals of the univariate and bivariate marginal probabilities must be calculated before M2 can be computed. Let $\dot\pi_1 = (\dot\pi_1, \ldots, \dot\pi_j, \ldots, \dot\pi_J)'$ denote the first-order marginal probabilities, i.e., the marginal probabilities of correctly responding to each single test item, where $\dot\pi_j$ denotes the marginal probability of correctly responding to the jth item. Accordingly, $\dot\pi_2$ denotes the second-order marginal probabilities of correctly responding to each item pair. Then, let $\pi_2 = (\dot\pi_1', \dot\pi_2')'$ denote the up-to-order-2 joint marginal probabilities and $\pi$ denote the true response-pattern probabilities, a $2^J \times 1$ vector. The univariate and bivariate marginal probabilities can then be calculated as

$$\pi_2 = L_2 \pi,$$

where $L_2$ is a $d \times 2^J$ operator matrix of 1s and 0s. The row dimension, $d = J + J(J-1)/2$, is the number of first- and second-order residuals. Using a test of 3 items as an example, the first-order marginal for item 1 is $\dot\pi_1 = \pi_{(1,0,0)} + \pi_{(1,1,0)} + \pi_{(1,0,1)} + \pi_{(1,1,1)}$, and the second-order marginal for items 1 and 2 is $\dot\pi_{12} = \pi_{(1,1,0)} + \pi_{(1,1,1)}$, where $\pi_{(\#,\#,\#)}$ refers to the probability of the corresponding item response pattern.
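The operator matrix $L_2$ can be constructed directly from its definition. The sketch below (function and variable names are ours) builds the $d \times 2^J$ matrix for arbitrary J:

```python
import numpy as np
from itertools import combinations, product

def build_L2(J):
    """Build the d x 2^J operator matrix of 0s and 1s that maps
    response-pattern probabilities to the first- and second-order
    marginal probabilities, d = J + J*(J-1)/2."""
    patterns = list(product([0, 1], repeat=J))
    rows = []
    # first-order rows: patterns in which item j is answered correctly
    for j in range(J):
        rows.append([p[j] for p in patterns])
    # second-order rows: patterns in which items j and k are both correct
    for j, k in combinations(range(J), 2):
        rows.append([p[j] * p[k] for p in patterns])
    return np.array(rows, dtype=float)
```

For J = 3 this yields a 6 × 8 matrix; multiplying it by a uniform pattern-probability vector gives first-order marginals of 0.5 and second-order marginals of 0.25, as expected.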
Then, let $p_2 = L_2 p$ and $\hat\pi_2 = L_2 \pi_{\hat\gamma}$ denote the observed and model-predicted marginal probabilities, respectively, where p refers to the observed response-pattern proportions and $\pi_{\hat\gamma}$ refers to the model-predicted response-pattern probabilities evaluated at the maximum likelihood estimate $\hat\gamma$. Thus, the univariate and bivariate residuals are computed as $\hat{r}_2 = p_2 - \hat\pi_2$. Thereafter, the M2 statistic is derived from $\hat{r}_2$ and a weight matrix $\hat{W}_2$ as follows:

$$M_2 = N \hat{r}_2' \hat{W}_2 \hat{r}_2, \qquad \hat{W}_2 = \hat\Xi_2^{-1} - \hat\Xi_2^{-1} \hat\Delta_2 \left(\hat\Delta_2' \hat\Xi_2^{-1} \hat\Delta_2\right)^{-1} \hat\Delta_2' \hat\Xi_2^{-1}.$$

The vector of first- and second-order residuals $\hat{r}_2$ is asymptotically normally distributed with mean zero and covariance matrix (Reiser, 1996; Maydeu-Olivares and Joe, 2005)

$$\Gamma_2 = \Xi_2 - \Delta_2 \mathcal{I}^{-1} \Delta_2',$$

where $\Xi_2 = L_2 \Xi L_2'$ and $\Xi = \mathrm{diag}(\pi_\gamma) - \pi_\gamma \pi_\gamma'$.
Another component in $\hat{W}_2$, $\Delta_2$, is the matrix of first-order partial derivatives of the expected marginal probabilities with respect to the item parameters:

$$\Delta_2 = \frac{\partial \pi_2(\gamma)}{\partial \gamma'}.$$

The statistic is asymptotically chi-square distributed with $d - k$ degrees of freedom (Maydeu-Olivares and Joe, 2005), where k refers to the number of free parameters in the model. For a more elaborate description of M2, please refer to Hansen et al. (2016). Similar to traditional fit statistics, the RMSEA2 index can be calculated from M2 (Maydeu-Olivares and Joe, 2014). RMSEA2 is recommended for assessing approximate goodness of fit when M2 indicates that the model does not fit exactly in the population. RMSEA2 is obtained from the observed M2 and the df:

$$\widehat{\mathrm{RMSEA}}_2 = \sqrt{\frac{\max(M_2 - df,\, 0)}{N \cdot df}}.$$

In this study, the information matrix $\mathcal{I}$ is the expected (Fisher) information matrix, which is described in detail in a recent study (Liu et al., 2018).
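Putting the pieces together, a naive sketch of the M2 and RMSEA2 computations might look as follows. It follows the formulas above literally and is not optimized (in practice, objects of size 2^J are avoided); all function and argument names are our own illustrative choices.

```python
import numpy as np

def m2_statistic(p, pi_model, delta2, L2, N):
    """Naive M2 computation (after Maydeu-Olivares and Joe, 2005).

    p        : observed 2^J response-pattern proportions
    pi_model : model-implied 2^J response-pattern probabilities
    delta2   : Jacobian of the up-to-order-2 marginals w.r.t. the
               free item parameters (d x k)
    L2       : d x 2^J marginal operator matrix
    """
    Xi = np.diag(pi_model) - np.outer(pi_model, pi_model)
    Xi2 = L2 @ Xi @ L2.T
    Xi2_inv = np.linalg.inv(Xi2)
    # W2 = Xi2^{-1} - Xi2^{-1} D (D' Xi2^{-1} D)^{-1} D' Xi2^{-1}
    mid = np.linalg.inv(delta2.T @ Xi2_inv @ delta2)
    W2 = Xi2_inv - Xi2_inv @ delta2 @ mid @ delta2.T @ Xi2_inv
    r2 = L2 @ (p - pi_model)
    return float(N * r2 @ W2 @ r2)

def rmsea2(m2, df, N):
    """RMSEA_2 = sqrt(max(M2 - df, 0) / (N * df))."""
    return float(np.sqrt(max(m2 - df, 0.0) / (N * df)))
```

As a sanity check, when the observed proportions equal the model-implied probabilities, the residuals are zero and M2 is exactly zero.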

SIMULATION STUDY 1: THE TYPE I ERROR RATES OF M2
Simulation 1 was conducted to examine the empirical Type I error rates of M2 when models were correctly specified under different attribute hierarchies and sample sizes. Three sample sizes, N = 1,000, N = 2,000, and N = 4,000, and a fixed test length, J = 20, were considered, with 500 replications per condition. All simulations were performed with R and Mplus.
In this simulation, data were generated from five models: the linear, divergent, convergent, and unstructured HDCMs, and the LCDM (no attribute hierarchy). The Q-matrix (see Table 1) used in this study involved 5 attributes and 20 items, with each item measuring one, two, or three attributes. In the Q-matrix, the number of items measuring each attribute was specified to be equal. In addition, the attributes were specified to follow a multivariate normal distribution, with the mean vector randomly selected from the uniform distribution U(−0.5, 0.5). We used 0 as the cutoff value to dichotomize the attribute vectors. With respect to the attribute correlations, as suggested by Kunina-Habenicht et al. (2012), 0.5 and 0.8 are typical low and high attribute correlation coefficients, respectively; therefore, the correlation coefficients between the attributes were randomly selected from U(0.5, 0.8) for each replication. To avoid effects of the magnitudes of the item parameters on the simulation results, all main effects were fixed at 2, all interaction effects were fixed at 1, and the intercepts were fixed at −0.5 times the sum of all main and interaction effects for each item (Templin and Bradshaw, 2014). Table 2 presents the Type I error rates of M2 under the different attribute hierarchy types at five significance levels. The Type I error rates of M2 matched their expected rates well under the different combinations of simulation conditions. Specifically, there was no substantial discrepancy in the Type I error rate control of M2 across the five attribute hierarchy types. The average empirical Type I error rates at the five significance levels were 0.014, 0.057, 0.111, 0.220, and 0.271, respectively. Hence, the M2 statistic exhibited good Type I error rate control for different attribute hierarchy types.
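The attribute-generation step described above can be sketched as follows. For simplicity, the sketch draws a single common correlation rather than one per attribute pair (pairwise draws would additionally require a positive-definiteness check); all names are our own.

```python
import numpy as np

rng = np.random.default_rng(1)
K, N = 5, 2000  # attributes, examinees

# Means drawn from U(-0.5, 0.5) and a common correlation from U(0.5, 0.8),
# mirroring the simulation design (single-rho simplification is ours).
mu = rng.uniform(-0.5, 0.5, size=K)
rho = rng.uniform(0.5, 0.8)
cov = np.full((K, K), rho)
np.fill_diagonal(cov, 1.0)

# Draw latent attribute values and dichotomize at the cutoff 0.
latent = rng.multivariate_normal(mu, cov, size=N)
alpha = (latent > 0).astype(int)
```

The resulting N × K matrix of 0/1 mastery indicators would then be fed into the item response model to simulate responses.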
However, further examination of the empirical Type I error rates showed that M2 under the LCDM, which assumes no hierarchical relationships among the attributes, had slightly better Type I error rate control than under the HDCMs.

SIMULATION STUDY 2: THE STATISTICAL POWER OF M2
Simulation 2 was conducted to examine the power of M2 under conditions of model and Q-matrix misspecification. The model and Q-matrix misspecifications were regarded as the sources of model-data misfit.

Note to Table 3: K_j is the number of required attributes for the jth item; q_jk is the entry in the jth row and kth column of the Q-matrix; q_jk′ is the entry in the jth row and k′th column, k ≠ k′ [this table is excerpted from and with the permission of Liu et al. (2016)].

In particular, Q-matrix misspecification can be a very influential source of model-data misfit (Kunina-Habenicht et al., 2012). The generating models and the Q-matrix in this simulation were identical to those in Simulation 1, except that the LCDM was no longer considered. As in Simulation 1, three sample sizes, N = 1,000, N = 2,000, and N = 4,000, and a fixed test length, J = 20, were considered, and each condition was run with 500 replications. In addition, because previous findings (e.g., Kunina-Habenicht et al., 2012; Liu et al., 2016) suggest that attribute correlations may affect the power of model-data fit statistics in DCMs, we considered three attribute correlation levels, 0.3, 0.5, and 0.8, to examine their effects on the power of M2.
Regarding the model misspecification, we used the DINA model as the misspecified fitting model. The DINA model classifies examinees into two groups: those who have mastered all measured attributes and those who lack at least one of them. With respect to the Q-matrix misspecification, we used the random balance design (Chen et al., 2013) to misspecify 20% of the Q-matrix elements. This means that some attributes originally measured by the items are no longer required in the misspecified Q-matrix, and vice versa (see Table 3). The other technical settings were identical to those in the previous simulation study. Table 4 presents the empirical rejection rates of M2 when the DINA model was used as the misspecified model. For each attribute hierarchy, the statistical power increased with the sample size, and this trend held across attribute correlations and significance levels. Specifically, when the sample size was 4,000, M2 performed well in detecting the misspecified model; when the sample sizes were 2,000 and 1,000, however, the statistical power of M2 was unsatisfactory. In addition, the statistical power of M2 for the unstructured HDCM was rather poor in this simulation. Table 5 presents the results under the Q-matrix misspecification. According to Table 5, the statistical power of M2 was extremely high for each type of attribute hierarchy. Specifically, the power for the divergent and unstructured HDCMs reached 100%, and the power for the linear and convergent HDCMs fell slightly below 100% only when the sample size was 1,000. Generally, the attribute correlations had no effect on the statistical power of M2 for either type of misfit.
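A balanced Q-matrix misspecification in the spirit of the random balance design can be sketched as below. The exact balancing scheme of Chen et al. (2013) may differ in detail, so treat this as an assumption-laden illustration with names of our own choosing.

```python
import numpy as np

def misspecify_q(Q, prop=0.20, seed=0):
    """Flip a balanced set of Q-matrix entries: roughly equal numbers of
    0->1 and 1->0 flips so that about `prop` of all elements change
    (illustrative reading of the random balance design)."""
    rng = np.random.default_rng(seed)
    Q = Q.copy()
    n_flip = int(round(prop * Q.size / 2))  # half of the flips per direction
    ones = np.argwhere(Q == 1)
    zeros = np.argwhere(Q == 0)
    for r, c in ones[rng.choice(len(ones), n_flip, replace=False)]:
        Q[r, c] = 0  # an attribute originally measured is dropped
    for r, c in zeros[rng.choice(len(zeros), n_flip, replace=False)]:
        Q[r, c] = 1  # an extraneous attribute is added
    return Q
```

Because the flips are balanced, the misspecified Q-matrix keeps the same total number of 1 entries as the true one while altering the intended proportion of its elements.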

EMPIRICAL ILLUSTRATION
We used the Examination for the Certificate of Proficiency in English (ECPE) data (Templin and Bradshaw, 2014) to investigate the usefulness of M2 in real settings. The ECPE data are publicly available in the CDM package in R (Robitzsch et al., 2014). The data comprise three attributes (knowledge of morphosyntactic rules, cohesive rules, and lexical rules), 28 multiple-choice items, and 2,922 examinees (Buck and Tatsuoka, 1998). The Q-matrix of the ECPE data is presented in Table 6.
There exists a linear hierarchy underlying the three attributes: "lexical rules" is the prerequisite attribute of "cohesive rules," which in turn is the prerequisite attribute of "morphosyntactic rules." For illustration purposes, we used the linear HDCM, the LCDM, DINA, and the C-RUM to fit the data and applied M2 and RMSEA2 to evaluate the test-level model-data fit. As mentioned previously, the LCDM is the saturated model, which involves the largest number of item parameters. The DINA model is a parsimonious model that includes two parameters for each item (the guessing and slipping parameters). The C-RUM (Hartz, 2002) can be obtained from the LCDM by retaining all main effects and removing all interaction effects. The LCDM, DINA, and C-RUM can be fitted with the gdina function in the CDM package in R. For the linear HDCM, because there is no available function in the CDM package, Mplus was used for the modeling, following Templin and Hoffman (2013). We provide an abbreviated Mplus syntax for the estimation of the ECPE data and the R function for calculating M2 in the online Supplementary Material. The full Mplus syntax can be accessed at the personal website of the developer of the HDCM (https://jonathantemplin.com/hierarchicalattribute-structures/). Table 7 presents the item parameter estimates for the ECPE data under the LCDM, DINA, and C-RUM models. In this study, we used the same Mplus estimation procedure as Templin and Bradshaw (2014) for the HDCM; the item parameter estimates of the HDCM are therefore available in Templin and Bradshaw (2014). The values of M2 and RMSEA2 for these models are presented in Table 8. The M2 statistic rejected all models for the ECPE data; however, the values of RMSEA2 were small. In addition, the relative model-data fit statistics, AIC and BIC, showed that the HDCM was the best-fitting model for the ECPE data.

DISCUSSION
This article investigates the performance of a widely used limited-information fit statistic, M2, in hierarchical DCMs. We used HDCMs to model the four fundamental attribute hierarchies and conducted two simulation studies and one empirical study to testify to the usefulness of M2. According to Simulation 1, the observed Type I error rates of M2 were reasonably close to the nominal levels for each attribute hierarchy, indicating that the M2 statistic can be safely used with different types of attribute hierarchies in DCMs. Attribute hierarchies are of great importance for practitioners using DCMs because hierarchical structures often exist among different pieces of knowledge, skills, and psychological constructs. However, researchers and practitioners often improperly assume that the attributes involved in cognitive diagnostic tests are independent. Hence, by examining four fundamental types of attribute hierarchies, we took an initial step toward demonstrating the usefulness of M2 for addressing complex attribute relationships in DCMs. Our findings echo previous findings on the Type I error rates of M2 in DCMs (e.g., Hansen et al., 2016; Liu et al., 2016). In the study by Hansen et al. (2016), the Type I error rates of M2 for the higher-order DINA model, the DINA model, and their variations were close to what would be expected. It should be noted that the Type I error rates of M2 were examined for a fixed test length of 20 items, which is close to that of the study by Hansen et al. (2016), 24 items. The test length was chosen to induce sparseness in the contingency table, the situation in which limited-information statistics are needed. Readers interested in how test length affects the Type I error rates of M2 in DCMs should refer to Liu et al. (2016): given both non-sparse (J = 6) and sparse contingency tables (J = 30/50), M2 demonstrated good control of Type I error rates.
Thereafter, we examined the sensitivity of M2 to the specification of an incorrect DINA model and to the misspecification of the Q-matrix. According to Simulation 2, the M2 statistic is extremely sensitive to Q-matrix misspecification regardless of sample size and attribute correlation. This finding is consistent with previous studies (Hansen et al., 2016; Liu et al., 2016) that emphasize the importance of a correctly specified Q-matrix in cognitive diagnostic tests. Our study adopted the same approach to generating the Q-matrix misspecification as previous studies (e.g., Chen et al., 2013; Liu et al., 2016). The percentage of misspecified Q-matrix elements was set to 20%, considering that the true Q-matrix may not be easily identified by domain experts in reality; 20% was thus designed to reflect a substantial Q-matrix misspecification. In addition, because the parameter estimation of a single HDCM in Mplus is extremely time-consuming (20-60 min), we did not examine the power of M2 for other proportions of misspecified Q-matrix elements, considering the infeasible simulation time. However, the study by Hansen et al. (2016) provided evidence that when only two elements of the Q-matrix were misspecified, the empirical rejection rates of M2 reached 100% in most simulation conditions, whereas omitting an existing attribute or adding an extraneous attribute led to lower sensitivity of M2 to the Q-matrix specification. This implies that M2 may be highly sensitive to Q-matrix misspecification. (Note to Table 7: for the LCDM and C-RUM, "0" refers to the intercept, "1(#)" refers to the main effect, and "2(##)" refers to the two-way interaction effect; for the DINA model, "g" refers to the guessing parameter and "s" refers to the slipping parameter.)
With respect to model misspecification, when the DINA model was used as the fitting model, M2 was sensitive to the misspecification for large sample sizes. This expected finding can be explained by the different assumptions about the relationships between items and attributes underlying the DINA model and the HDCMs. Specifically, the DINA model is a non-compensatory model: an examinee must have mastered all required attributes to respond correctly to an item, and the lack of any one required attribute leads to an incorrect answer. In contrast, the relationships between items and attributes in the HDCMs are compensatory, indicating that examinees have a higher probability of correctly responding to an item when they have mastered any additional required attribute of the item. Accordingly, M2 was generally sensitive to the specification of the incorrect DINA model because of the large discrepancy between the natures of the DINA model and the HDCMs. In addition, larger sample sizes generally led to higher statistical power, a finding consistent with previous studies (Kunina-Habenicht et al., 2012; Liu et al., 2016). Furthermore, we found that the attribute correlations had no noticeable influence on the sensitivity of M2 to the model and Q-matrix misspecifications. It is possible that the attribute hierarchies already assume strong relationships among the attributes, so the specified attribute correlation levels do not affect the performance of M2. This echoes previous findings that attribute correlations do not significantly influence the classification accuracy of DCMs (Kunina-Habenicht et al., 2012) or the statistical power of M2 in DCMs (Liu et al., 2016). It should be noted that in the study by Liu et al. (2016), when the sample size was small, a lower attribute correlation led to slightly higher power of M2.
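The non-compensatory nature of the DINA model discussed above is easy to see in its item response function: the ideal response η equals 1 only when every required attribute is mastered. A minimal sketch (names ours):

```python
import numpy as np

def dina_prob(alpha, q, guess, slip):
    """DINA item response probability for one item.

    eta = 1 only if all attributes required by q are mastered in alpha;
    the examinee then answers correctly with probability 1 - slip,
    otherwise with the guessing probability."""
    eta = int(np.all(alpha[q == 1] == 1))
    return 1.0 - slip if eta else guess
```

Missing even one required attribute drops the success probability all the way to the guessing parameter, which is exactly the behavior that clashes with the compensatory HDCM parameterization.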
However, the opposite result was observed for a small sample size in our study. This is possibly because small sample sizes lead to unstable parameter estimation, which in turn affects the classification accuracy and the performance of M2.
Regarding the empirical illustration, the M2 statistic rejected all models. We expected M2 to reject all models except the HDCM because the ECPE data involve hierarchical attribute relationships. One possible explanation is that M2 is, in practice, a strongly sensitive fit statistic. Undoubtedly, the true generating Q-matrix for the ECPE data cannot be known; hence, some mistakes in the Q-matrix, which was specified by domain and measurement experts, are unavoidable. Moreover, the simulation study shows that M2 is extremely sensitive to Q-matrix misspecification, so it is not surprising that M2 rejected all models for the ECPE data. This finding is also supported by the empirical illustrations of numerous studies (e.g., Cai et al., 2006; Maydeu-Olivares et al., 2011; Jurich, 2014). Considering the sensitivity of M2, which indicates only whether the models fit the data exactly, many researchers recommend using RMSEA2 to assess the goodness of approximation of DCMs and to characterize the degree of model error (e.g., Maydeu-Olivares and Joe, 2014). RMSEA2 is an effect-size-like index that allows direct comparisons among different models. According to the criteria of Liu et al. (2016), the values of 0.030 and 0.045 are the thresholds for excellent and good fit, respectively. The values of RMSEA2 for the ECPE data in this study are well below 0.030, indicating good model-data fit for all models. However, we expected the values of RMSEA2 to vary across the four models because they assume different relationships between attributes and items; for example, the LCDM is the general modeling framework, whereas DINA is one of the most parsimonious models, defining only two item parameters.
Although the RMSEA penalizes model complexity and measures the degree of model-data misfit, it has been found to be influenced by sample size and degrees of freedom (df) in other modeling frameworks (e.g., structural equation modeling). For small sample sizes and small df, the RMSEA is often positively biased (Kenny et al., 2015). However, it is also evident that, compared with small sample sizes and small df, the model rejection rates of the RMSEA are much lower for large sample sizes and large df given the same cutoff value (Chen et al., 2008). Therefore, in our study, it is possible that the large sample size and large df led to indistinguishable values and confidence intervals of RMSEA2. More empirical investigations are needed to reveal the performance of RMSEA2 in examining model-data fit for DCMs.
For real-life applications of DCMs, practitioners should carefully examine the relationships between attributes at the test design stage or at the initial stage of data analysis. HDCMs are recommended as the modeling framework if attribute hierarchies are identified. However, the model selection decision should be based on both substantive considerations and technical solutions. Although model-data fit tests in DCMs are still underdeveloped, our findings indicate that the M2 statistic is of great value for examining the absolute model-data fit of HDCMs. Given the strong sensitivity of M2 to model or Q-matrix errors, however, RMSEA2 is recommended for evaluating the degree of model-data misfit.
Some limitations exist in the present study. First, we used a fixed test length of 20 items, which is a reasonable test length in practice; future studies are encouraged to investigate the effects of test length on the performance of M2 in DCMs. Second, we considered only dichotomous data; future applications of M2 should include polytomous models. Finally, although four fundamental attribute hierarchies were considered in our research, it is recommended that the performance of M2 and RMSEA2 be examined in DCMs with more complicated attribute relationships.

AUTHOR CONTRIBUTIONS
FC and TX contributed to the conceptualization and design of the work. FC and YL contributed to the analysis and interpretation of data. FC, YL, YC, and TX were involved in drafting and revising the manuscript. All authors approved the final manuscript submitted.