A Semi-supervised Learning Method for Q-Matrix Specification Under the DINA and DINO Model With Independent Structure

Cognitive diagnosis assessment (CDA) can be regarded as a kind of formative assessment because it is intended to promote assessment for learning and to modify instruction and learning in classrooms by providing formative diagnostic information about students' cognitive strengths and weaknesses. Like statistical pattern recognition, CDA has two phases: feature generation, followed by classification. A Q-matrix, which describes the relationship between items and latent skills, corresponds to the feature generation phase in statistical pattern recognition. Feature generation is of paramount importance in any pattern recognition task. In practice, the Q-matrix is difficult to specify correctly in cognitive diagnosis, and misspecification of the Q-matrix can seriously affect the accuracy of the classification of examinees. Based on the fact that any column of a reduced Q-matrix can be expressed by the columns of a reachability matrix R under the logical OR operation, a semi-supervised learning approach and an optimal design for examinee sampling are proposed for Q-matrix specification under the conjunctive and disjunctive models with independent structure. The method only requires subject matter experts to specify an R matrix corresponding to a small part of the test items; for the independent structure, the R matrix is an identity matrix. Simulation and real data analyses showed that the new method with the optimal design is promising in terms of correct recovery rates of q-entries.


INTRODUCTION
In educational assessment, cognitive diagnostic assessment (CDA), which combines psychometrics and cognitive science, has received increased attention recently (Leighton and Gierl, 2007; Tatsuoka, 2009; Rupp et al., 2010). This approach potentially provides useful diagnostic information regarding students' strengths and weaknesses, and can facilitate individualized learning (Chang, 2015). Cognitive diagnostic models (CDMs) often utilize a Q-matrix (Embretson, 1984; Tatsuoka, 1990, 1995, 2009). Tatsuoka (2009) pointed out that "Tatsuoka (1990) organized the underlying cognitive processing skills and knowledge that are required in answering test items correctly in a Q-matrix, in which the rows represent attributes and the columns represent items." The entries of a Q-matrix are 1 or 0, denoted by $q_{kj}$. If attribute $k$ is involved in correctly answering item $j$, then $q_{kj} = 1$, and $q_{kj} = 0$ otherwise. The definition of the Q-matrix in Tatsuoka (1990) is used in our study. More recently, a common representation of a Q-matrix is one in which the rows represent items and the columns represent attributes (Ma and de la Torre, 2020; Zhan et al., 2020). It should be noted that this representation differs from the traditional one used in the present study.
Like statistical pattern recognition and classification methodology, cognitive diagnostic assessment has two phases: feature generation, followed by classification. The specification of the Q-matrix corresponds to the feature generation phase in statistical pattern recognition and classification problems. Feature generation is of paramount importance in any pattern recognition task. The Q-matrix therefore plays a very important role in establishing the relation between latent attribute patterns and ideal (latent) response patterns.
In practice, the Q-matrix is difficult to specify correctly in cognitive diagnostic assessment (Jang, 2009; DeCarlo, 2011), and misspecification of the Q-matrix can seriously affect the accuracy of both item parameter estimates and the classification of examinees (de la Torre, 2008; Rupp and Templin, 2008). Researchers have proposed several quantitative methods for deriving or refining the Q-matrix. These methods can be classified into two categories (Xu and Desmarais, 2018): (a) unsupervised methods, including but not limited to the q-matrix method (Barnes, 2003, 2011), the non-negative matrix factorization technique (Desmarais, 2011; Desmarais et al., 2012; Desmarais and Naceur, 2013) or the alternate least-square factorization method (Desmarais et al., 2014; Xu and Desmarais, 2016), the data-driven approach (Liu et al., 2012, 2013), and the exploratory factor analysis method (Barnes, 2003; Close, 2012; Wang et al., 2018b, 2020); and (b) supervised methods, including the sequential EM-based δ method (de la Torre, 2008) and its extension, the $\varsigma^2$ method (de la Torre and Chiu, 2016), the Bayesian approach (DeCarlo, 2012), the non-parametric Q-matrix refinement method (Chiu, 2013), the stepwise reduction algorithm (Hartz, 2002), the EM-based methods (Wang et al., 2018a), the residual-based or item fit statistic approach (Chen, 2017; Kang et al., 2018), and so on.
The unsupervised method derives a Q-matrix only from test data or item responses. It is very useful because many existing tests have response data but no specified Q-matrix. However, it can be difficult to identify the number of latent skills, and somewhat more difficult to interpret the results obtained from real data. A study by Beheshti et al. (2012) found that the number of latent skills estimated from real data is not well aligned with the assessment of experts.
The supervised method combines the information in the experts' Q-matrix with test response data to refine or validate the provisional Q-matrix. If the provisional Q-matrix is unknown for an existing test, the supervised methods cannot be used. Furthermore, this method often needs a high-quality provisional Q-matrix for the whole test. If the provisional Q-matrix is specified by subject matter experts but contains a large amount of misspecification, it will be difficult to recover a high-quality Q-matrix through the supervised method, because its performance relies on the precision of the classification of attribute patterns resulting from the provisional Q-matrix (de la Torre, 2008; Rupp and Templin, 2008).
Specifying a Q-matrix for a whole test can be a time-consuming and fatiguing process for experts. The purpose of this study is to propose a semi-supervised method for Q-matrix specification in order to check whether only some of the items need to be specified by experts. The semi-supervised method falls between the unsupervised and supervised methods.

Model
Let $K$ be the number of attributes. Let $X_{ij}$ be a binary random variable denoting the response of examinee $i$ to item $j$, $i = 1, 2, \dots, N$, $j = 1, 2, \dots, J$. Let $\alpha_i$ be a column vector denoting an attribute mastery pattern, or knowledge state, from the universal set of knowledge states. Moreover, the Q-matrix that specifies the item-attribute relationship is a $K \times J$ matrix, in which entry $q_{kj} = 1$ if attribute $k$ is required for answering item $j$ correctly; otherwise, $q_{kj} = 0$.
The item response function for the deterministic inputs, noisy "and" gate (DINA) model (Haertel, 1989; Junker and Sijtsma, 2001; Chiu and Douglas, 2013) is as follows:

$$P(X_{ij} = 1 \mid \alpha_i) = (1 - s_j)^{\eta_{ij}} \, g_j^{1 - \eta_{ij}},$$

where the deterministic latent response $\eta_{ij} = \prod_{k=1}^{K} \alpha_{ki}^{q_{kj}}$ indicates whether or not examinee $i$ possesses all of the attributes required by item $j$. A value of $\eta_{ij} = 1$ means that examinee $i$ has mastered all of the attributes required by item $j$, and $\eta_{ij} = 0$ otherwise. The slip parameter $s_j$ is the probability of an incorrect response to item $j$ when $\eta_{ij} = 1$, and the guessing parameter $g_j$ is the probability of a correct response to item $j$ when $\eta_{ij} = 0$. Let $B = (\eta_{ij})$ be the deterministic latent response matrix for the DINA model.
The item response function for the deterministic inputs, noisy "or" gate (DINO) model (Templin and Henson, 2006; Chiu and Douglas, 2013) is as follows:

$$P(X_{ij} = 1 \mid \alpha_i) = (1 - s_j)^{w_{ij}} \, g_j^{1 - w_{ij}},$$

where $w_{ij} = 1 - \prod_{k=1}^{K} (1 - \alpha_{ki})^{q_{kj}}$ is a deterministic latent response. As in the DINA model, $s_j$ and $g_j$ are the slip and guessing parameters of item $j$. The DINA and DINO models are conjunctive and disjunctive models (Maris, 1999), respectively. Let $W = (w_{ij})$ be the deterministic latent response matrix for the DINO model.
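As an illustration of the two item response functions, the following minimal Python sketch computes the deterministic latent responses and the resulting success probabilities. The function names (`dina_eta`, `dino_omega`, `response_prob`) and the example item are hypothetical and are not part of the original study.

```python
from typing import Sequence


def dina_eta(alpha: Sequence[int], q: Sequence[int]) -> int:
    """Conjunctive latent response: 1 iff all attributes required by q are mastered."""
    return int(all(a == 1 for a, qk in zip(alpha, q) if qk == 1))


def dino_omega(alpha: Sequence[int], q: Sequence[int]) -> int:
    """Disjunctive latent response: 1 iff at least one required attribute is mastered."""
    return int(any(a == 1 and qk == 1 for a, qk in zip(alpha, q)))


def response_prob(latent: int, slip: float, guess: float) -> float:
    """P(X = 1 | alpha) = (1 - s)^latent * g^(1 - latent)."""
    return (1.0 - slip) if latent == 1 else guess


# Hypothetical item requiring attributes 1 and 3 (K = 3), with s = 0.1 and g = 0.2.
q_item = (1, 0, 1)
alpha = (1, 1, 0)  # this examinee lacks attribute 3
print(response_prob(dina_eta(alpha, q_item), slip=0.1, guess=0.2))    # guessing case under DINA
print(response_prob(dino_omega(alpha, q_item), slip=0.1, guess=0.2))  # non-slip case under DINO
```

The same examinee is thus treated as a non-master of the item under the conjunctive rule and as a master under the disjunctive rule, which is the only difference between the two models.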

A Semi-supervised Learning Approach for the Conjunctive Model
In the rule space method (Tatsuoka, 2009) or the attribute hierarchy method (Leighton et al., 2004), the adjacency matrix, denoted by $A$, represents the direct relationships among attributes. We denote the entry in row $k_1$ and column $k_2$ of $A$ by $a_{k_1 k_2}$. If a direct prerequisite relation exists from attribute $k_1$ to attribute $k_2$, then $a_{k_1 k_2} = 1$, and $a_{k_1 k_2} = 0$ otherwise. Let $R$ denote a reachability matrix of order $(K, K)$ that specifies the direct and indirect relationships among attributes. The $R$ matrix is given by $R = (A + I)^K$ with respect to Boolean operations, where $I$ is an identity matrix. The reduced Q-matrix, denoted by $Q_r$, is obtained by removing from the incidence Q-matrix the items (columns) that do not satisfy the specified relationships. The columns of $Q_r$ together with the zero vector form the student matrix, denoted by $Q_s$, whose columns form the universal set of attribute patterns. If the $K$ attributes are independent, $A$ is a zero matrix, $R$ with $K$ columns is an identity matrix, $Q_r$ with $2^K - 1$ columns does not include the zero vector, and $Q_s$ with $2^K$ columns contains all possible combinations of attribute patterns.
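The Boolean computation of the reachability matrix can be sketched as follows. This is a minimal pure-Python illustration; the helper names and the linear hierarchy used as input are hypothetical examples, not taken from the study.

```python
def boolean_matmul(a, b):
    """Boolean matrix product: entry (i, j) is OR over k of (a[i][k] AND b[k][j])."""
    n = len(a)
    return [[int(any(a[i][k] and b[k][j] for k in range(n))) for j in range(n)]
            for i in range(n)]


def reachability(adjacency):
    """R = (A + I)^K under Boolean arithmetic, where K is the number of attributes."""
    k = len(adjacency)
    a_plus_i = [[int(adjacency[i][j] or i == j) for j in range(k)] for i in range(k)]
    result = a_plus_i
    for _ in range(k - 1):
        result = boolean_matmul(result, a_plus_i)
    return result


# Hypothetical linear hierarchy A1 -> A2 -> A3.
linear = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
print(reachability(linear))

# Independent structure: A is a zero matrix, so R is the identity matrix.
independent = [[0] * 3 for _ in range(3)]
print(reachability(independent))
```

For the independent structure considered in this study the computation is trivial, since R reduces to the identity matrix.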
We assume that the cognitive requirement for the multiple skills within an item is conjunctive (Maris, 1999), that is, answering an item correctly requires mastery of all the skills required by that item. For the conjunctive model, Example 1 will show the relationship of latent responses on items with q-vectors corresponding to R and Q r .
Given $Q_s$ and a test Q-matrix $Q_r$, a latent response matrix $B = (\eta_{ij})$ can be calculated, in which the entry in row $i$ and column $j$ is the deterministic latent response $\eta_{ij}$; let $\eta_j$ denote its $j$th column. If 0 corresponds to F (false) and 1 corresponds to T (true), the logical conjunction and disjunction operators, $\wedge$ and $\vee$, can be applied to two binary vectors of equal length by taking the bitwise AND or OR of each pair of bits at corresponding positions. It can be observed that $\eta_3 = \eta_1 \wedge \eta_2$, the conjunction of $\eta_1$ and $\eta_2$. This is because the relationship $q_3 = q_1 \vee q_2$ holds, where $q_1 \vee q_2$ is the disjunction of $q_1$ and $q_2$. Example 1 illustrates the following fact. For the conjunctive model, consider two latent response matrices, denoted by $B_1$ and $B_2$, from two tests corresponding to the two Q-matrices $Q_r$ and $R$, where $R$ denotes a reachability matrix. That is, $B_1$ and $B_2$ are generated from the reduced Q-matrix and the reachability matrix, respectively, based on the universal set of attribute patterns. Then any column of $B_1$ can be expressed by the columns of $B_2$ under the logical AND operation. This is because the augmented algorithm proposed by Ding et al. (2008, 2009) in the generalized Q-matrix theory (Ding et al., 2015) provides the useful fact that any column of the reduced Q-matrix can be expressed by the columns of the reachability matrix under the logical OR operation. The argument in Example 1 can be adapted to prove the following theorem.
Theorem 1. For the conjunctive model, if $K$ attributes are independent, then $q_j = \bigvee_{l \in S_j} r_l$ if and only if $\eta_{ij} = \bigwedge_{l \in S_j} \eta_{il}$, where $\alpha_i$ is any column of $Q_s$ and $S_j$ is a subset of $\{1, 2, \dots, K\}$.
Proof: Suppose $q_j = \bigvee_{l \in S_j} r_l$. We consider two cases, $\eta_{ij} = 1$ and $\eta_{ij} = 0$. If $\eta_{ij} = 1$ for $\alpha_i$ a column of $Q_s$, then $\alpha_{ki} = 1$ for all attributes $k$ with $q_{kj} = 1$ by the definition of the deterministic latent response. That is, examinee $i$ has mastered all the skills required by item $j$. Since $q_j = \bigvee_{l \in S_j} r_l$, by the definition of disjunction we can conclude that $\alpha_{ki} = 1$ for all attributes $k$ with $r_{kl} = 1$, for all $l \in S_j$. By the definition of the deterministic latent response, $\eta_{il} = 1$ for all $l \in S_j$, that is, $\bigwedge_{l \in S_j} \eta_{il} = 1$. This shows that $\eta_{ij} = \bigwedge_{l \in S_j} \eta_{il}$ when $\eta_{ij} = 1$. If $\eta_{ij} = 0$ for $\alpha_i$ a column of $Q_s$, then $\alpha_{ki} = 0$ for at least one attribute $k$ with $q_{kj} = 1$ by the definition of the deterministic latent response. That is, examinee $i$ has not mastered all the skills required by item $j$. Since $q_{kj} = 1$ and $q_j = \bigvee_{l \in S_j} r_l$, there is an item $l$ in $S_j$ such that $r_{kl} = 1$; that is, item $l$ measures attribute $k$. Since $\alpha_{ki} = 0$, by the definition of the deterministic latent response it follows that $\eta_{il} = 0$ for at least one item in $S_j$, that is, $\bigwedge_{l \in S_j} \eta_{il} = 0$. This shows that $\eta_{ij} = \bigwedge_{l \in S_j} \eta_{il}$ when $\eta_{ij} = 0$.

Next, we prove the converse by contradiction. Suppose that $\eta_{ij} = \bigwedge_{l \in S_j} \eta_{il}$ holds for every column $\alpha_i$ of $Q_s$. First suppose that there exists an attribute $k \in \{1, 2, \dots, K\}$ such that $\bigvee_{l \in S_j} r_{kl} = 1$ and $q_{kj} = 0$. Since $\bigvee_{l \in S_j} r_{kl} = 1$, there exists an item $l \in S_j$ with $r_{kl} = 1$. Due to the arbitrariness of $\alpha_i$, let $\alpha_i = \mathbf{1} - e_k$, where $\mathbf{1} = (1, 1, \dots, 1)^T$ and $e_k$ is the vector with a 1 in the $k$th entry and 0's elsewhere. This is a contradiction, because $\eta_{ij} = 1$ while $\bigwedge_{l \in S_j} \eta_{il} = 0$. Similarly, assume that there exists an attribute $k \in \{1, 2, \dots, K\}$ such that $\bigvee_{l \in S_j} r_{kl} = 0$ and $q_{kj} = 1$. One can still take $\alpha_i = \mathbf{1} - e_k$. This is also a contradiction, because $\eta_{ij} = 0$ while $\bigwedge_{l \in S_j} \eta_{il} = 1$. The proof is complete.
The important fact about Theorem 1 is that if a latent response matrix is calculated from a Q-matrix, the relationship between the columns in the Q-matrix can be constructed from the relationship between the corresponding columns in the latent response matrix. It should be noted that an observed item response is a function of an underlying latent response and slip and guessing parameters. In other words, the noise introduced in the process is due to slip and guessing parameters.
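Theorem 1 can be checked numerically for small K. The sketch below first recomputes the latent response matrix of Example 1, under the assumption that Example 1 uses the same K = 2 matrices as those listed in Example 2, and then exhaustively compares both sides of the theorem for an identity R. The function names are hypothetical.

```python
from itertools import combinations, product


def eta(alpha, q):
    """Conjunctive latent response for attribute pattern alpha and q-vector q."""
    return int(all(a for a, qk in zip(alpha, q) if qk))


# Assumed Example 1 setup (K = 2), mirroring the matrices given in Example 2.
Q_s = [(0, 0), (1, 0), (0, 1), (1, 1)]  # attribute patterns alpha_1..alpha_4
Q_r = [(1, 0), (0, 1), (1, 1)]          # q-vectors q_1, q_2, q_3
B = [[eta(alpha, q) for q in Q_r] for alpha in Q_s]
eta_1, eta_2, eta_3 = (list(col) for col in zip(*B))
example_holds = eta_3 == [a & b for a, b in zip(eta_1, eta_2)]


def theorem_1_holds(k):
    """Exhaustively compare both sides of Theorem 1 for an identity R of order k."""
    patterns = list(product((0, 1), repeat=k))
    r = [tuple(int(i == l) for i in range(k)) for l in range(k)]
    for q in patterns:
        for size in range(k + 1):
            for subset in combinations(range(k), size):
                union = tuple(int(any(r[l][a] for l in subset)) for a in range(k))
                q_side = (q == union)
                eta_side = all(
                    eta(alpha, q) == int(all(eta(alpha, r[l]) for l in subset))
                    for alpha in patterns
                )
                if q_side != eta_side:
                    return False
    return True


print(B)
print(example_holds, theorem_1_holds(3))
```

Such a check is not a substitute for the proof, but it makes the equivalence concrete: the two sides agree for every q-vector and every subset of columns of R.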
Next, we introduce a semi-supervised learning method for Q-matrix specification for the conjunctive model, using the result of Theorem 1 and taking the noise in item responses into account. Without loss of generality, we begin by arbitrarily assigning a q-vector $q_j$ to item $j$. Consider a test Q-matrix written as $Q_t = [R_{K \times K} \; q_j] = [r_1 \; r_2 \; \dots \; r_K \; q_j]$, where $R$ is a reachability matrix specified by subject matter experts and the remaining $q_j$ is unknown. Let $U = [X_{N \times K} \; Y_{N \times 1}]$ be an item response matrix on $Q_t$, where $N$ is the sample size. The estimate of $q_j$ can be written as

$$\hat{q}_j = \bigvee_{r_l \in \hat{S}_j} r_l,$$

where logical OR is applied to the corresponding entries of the columns in the set $\hat{S}_j$,

$$\hat{S}_j = \arg\min_{S \in P(\{r_1, r_2, \dots, r_K\})} \; \cdots,$$

where $P(\{r_1, r_2, \dots, r_K\})$ is the power set of the set $\{r_1, r_2, \dots, r_K\}$. The exhaustive method, with time complexity $O(2^K)$, provides a simple way to find a global solution for $\hat{S}_j$.
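The exhaustive search can be sketched as follows for an identity R. Two points in this sketch are assumptions rather than details taken from the study: the criterion being minimized is taken to be the Hamming distance between the observed responses Y and the conjunction of the selected columns of X, and the empty subset (an all-zero q-vector) is skipped. Ties are resolved in favor of the first subset found. The function name is hypothetical.

```python
from itertools import combinations, product


def estimate_q_conjunctive(X, y):
    """Exhaustive search over non-empty subsets of the K items in R (identity R assumed).

    X: list of N rows, each holding the K responses to the items of R.
    y: list of N responses to the item with unknown q-vector.
    Criterion (an assumption): Hamming distance between y and AND of selected columns.
    """
    k = len(X[0])
    best_subset, best_dist = None, None
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            predicted = [int(all(row[l] for l in subset)) for row in X]
            dist = sum(p != obs for p, obs in zip(predicted, y))
            if best_dist is None or dist < best_dist:
                best_subset, best_dist = subset, dist
    # With an identity R, the OR of the selected columns r_l is the subset indicator.
    q_hat = [int(l in best_subset) for l in range(k)]
    return q_hat, best_dist


# Noise-free illustration with K = 3 and a hypothetical true q-vector (1, 0, 1).
X = [list(alpha) for alpha in product((0, 1), repeat=3)]
y = [int(row[0] and row[2]) for row in X]
print(estimate_q_conjunctive(X, y))
```

The search visits every non-empty subset once, which matches the exponential cost in K noted above and is unproblematic for the five attributes used in the simulation.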

A Semi-supervised Learning Approach for the Disjunctive Model
For the disjunctive model, the deterministic latent response on an item is correct if and only if an examinee has mastered at least one of the skills required by the item. This is illustrated in Example 2. As in Example 1, Example 2 shows the relationship of latent responses on items with q-vectors corresponding to $R$ and $Q_r$.

Example 2 for an independent structure. Let $K = 2$,

$$R = [r_1 \; r_2] = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad Q_s = [\alpha_1 \; \alpha_2 \; \alpha_3 \; \alpha_4] = \begin{bmatrix} 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix}, \quad Q_r = [q_1 \; q_2 \; q_3] = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix}.$$

From $Q_s$ and $Q_r$, a latent response matrix

$$W_1 = [w_1 \; w_2 \; w_3] = \begin{bmatrix} 0 & 0 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$

can be calculated, in which the entry in row $i$ and column $j$ is the deterministic latent response $w_{ij}$. It can be observed that $w_3 = w_1 \vee w_2$. This is because the relationship $q_3 = q_1 \vee q_2$ holds. Consider a latent response matrix, denoted by $W_2 = [w_1 \; w_2]$, corresponding to the $R$ matrix. The fact illustrated in Example 2 is that, for the disjunctive model, any column of $W_1$ can be expressed by the columns of $W_2$ under the logical OR operation. This is again because the augmented algorithm proposed by Ding et al. (2008, 2009) in the generalized Q-matrix theory (Ding et al., 2015) provides the useful fact that any column of the reduced Q-matrix can be expressed by the columns of the reachability matrix under the logical OR operation. The following theorem gives the precise statement.
Theorem 2. For the disjunctive model, if $K$ attributes are independent, then $q_j = \bigvee_{l \in S_j} r_l$ if and only if $w_{ij} = \bigvee_{l \in S_j} w_{il}$, where $\alpha_i$ is any column of $Q_s$ and $S_j$ is a subset of $\{1, 2, \dots, K\}$.
Proof: Suppose $q_j = \bigvee_{l \in S_j} r_l$. We consider two cases, $w_{ij} = 1$ and $w_{ij} = 0$. If $w_{ij} = 1$ for $\alpha_i$ a column of $Q_s$, then $\alpha_{ki} = 1$ for at least one attribute $k$ with $q_{kj} = 1$ by the definition of the deterministic latent response. That is, examinee $i$ has mastered at least one of the attributes required by item $j$. Without loss of generality, we assume $\alpha_{ki} = 1$ and $q_{kj} = 1$. Since $q_j = \bigvee_{l \in S_j} r_l$, by the definition of disjunction we can conclude that $r_{kl} = 1$ for at least one $l \in S_j$; that is, item $l$ measures attribute $k$. From the definition of the deterministic latent response, it follows that there is at least one item $l \in S_j$ such that $w_{il} = 1$, that is, $\bigvee_{l \in S_j} w_{il} = 1$. This shows that $w_{ij} = \bigvee_{l \in S_j} w_{il}$ when $w_{ij} = 1$. If $w_{ij} = 0$ for $\alpha_i$ a column of $Q_s$, then $\alpha_{ki} = 0$ for all attributes $k$ with $q_{kj} = 1$ by the definition of the deterministic latent response. That is, examinee $i$ has not mastered any skill required by item $j$. Since $q_j = \bigvee_{l \in S_j} r_l$, examinee $i$ has not mastered any skill required by any item $l \in S_j$: if examinee $i$ had mastered at least one attribute required by some item $l \in S_j$, then $w_{ij} = 1$, which is a contradiction. It follows directly from the definition of the deterministic latent response that $w_{il} = 0$ for all items in $S_j$, that is, $\bigvee_{l \in S_j} w_{il} = 0$. This shows that $w_{ij} = \bigvee_{l \in S_j} w_{il}$ when $w_{ij} = 0$.

Next, we prove the converse by contradiction. Suppose that $w_{ij} = \bigvee_{l \in S_j} w_{il}$ holds for every column $\alpha_i$ of $Q_s$. First assume that there exists an attribute $k \in \{1, 2, \dots, K\}$ such that $\bigvee_{l \in S_j} r_{kl} = 1$ and $q_{kj} = 0$. Since $\bigvee_{l \in S_j} r_{kl} = 1$, there exists an item $l \in S_j$ with $r_{kl} = 1$. Due to the arbitrariness of $\alpha_i$, let $\alpha_i = e_k$, where $e_k$ is the vector with a 1 in the $k$th entry and 0's elsewhere. Then we have $w_{il} = 1$ and $w_{ij} = 0$. Since $w_{ij} = \bigvee_{l \in S_j} w_{il}$, we obtain $w_{ij} = 1$ and arrive at a contradiction. Similarly, assume that there exists an attribute $k \in \{1, 2, \dots, K\}$ such that $\bigvee_{l \in S_j} r_{kl} = 0$ and $q_{kj} = 1$. One can still take $\alpha_i = e_k$. This is also a contradiction, because $w_{ij} = 1$ while $\bigvee_{l \in S_j} w_{il} = 0$. The proof is complete.
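Example 2 and Theorem 2 can be reproduced numerically in the same spirit. The following sketch recomputes the latent response matrix of Example 2 from the stated matrices and then exhaustively compares both sides of Theorem 2 for an identity R; the function names are hypothetical.

```python
from itertools import combinations, product


def omega(alpha, q):
    """Disjunctive latent response for attribute pattern alpha and q-vector q."""
    return int(any(a and qk for a, qk in zip(alpha, q)))


# Matrices of Example 2 (K = 2), written column by column.
Q_s = [(0, 0), (1, 0), (0, 1), (1, 1)]  # attribute patterns alpha_1..alpha_4
Q_r = [(1, 0), (0, 1), (1, 1)]          # q-vectors q_1, q_2, q_3
W_1 = [[omega(alpha, q) for q in Q_r] for alpha in Q_s]
w_1, w_2, w_3 = (list(col) for col in zip(*W_1))
example_holds = w_3 == [a | b for a, b in zip(w_1, w_2)]


def theorem_2_holds(k):
    """Exhaustively compare both sides of Theorem 2 for an identity R of order k."""
    patterns = list(product((0, 1), repeat=k))
    r = [tuple(int(i == l) for i in range(k)) for l in range(k)]
    for q in patterns:
        for size in range(k + 1):
            for subset in combinations(range(k), size):
                union = tuple(int(any(r[l][a] for l in subset)) for a in range(k))
                q_side = (q == union)
                w_side = all(
                    omega(alpha, q) == int(any(omega(alpha, r[l]) for l in subset))
                    for alpha in patterns
                )
                if q_side != w_side:
                    return False
    return True


print(W_1)
print(example_holds, theorem_2_holds(3))
```

In contrast to the conjunctive case, the columns of the latent response matrix here combine under OR, exactly as the corresponding q-vectors do.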
The important fact about Theorem 2 is that one can derive the relationship between the columns of a Q-matrix from the relationship between the columns of the corresponding latent response matrix. To take into account the noise introduced in item responses by slipping and guessing, we introduce a semi-supervised learning method for Q-matrix specification for the disjunctive model by using the result of Theorem 2. Without loss of generality, we begin by arbitrarily assigning a q-vector $q_j$ to item $j$. Consider a test Q-matrix written as $Q_t = [R_{K \times K} \; q_j] = [r_1 \; r_2 \; \dots \; r_K \; q_j]$, where $R$ is a reachability matrix specified by subject matter experts and the remaining $q_j$ is unknown. Let $U = [X_{N \times K} \; Y_{N \times 1}]$ be an item response matrix on $Q_t$. The estimate of $q_j$ can be written as

$$\hat{q}_j = \bigvee_{r_l \in \hat{S}_j} r_l,$$

where logical OR is applied to the corresponding entries of the columns in the set $\hat{S}_j$,

$$\hat{S}_j = \arg\min_{S \in P(\{r_1, r_2, \dots, r_K\})} \; \cdots,$$

where $P(\{r_1, r_2, \dots, r_K\})$ is the power set of the set $\{r_1, r_2, \dots, r_K\}$.
The exhaustive method, with time complexity $O(2^K)$, provides a simple way to find a global solution for $\hat{S}_j$.
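For the disjunctive model, the exhaustive search only changes the rule used to predict the response from the selected columns. As before, the sketch assumes an identity R, takes the Hamming distance between Y and the OR of the selected columns of X as the criterion (an assumption, not a detail confirmed by the text), skips the empty subset, and keeps the first minimizer found. The function name is hypothetical.

```python
from itertools import combinations, product


def estimate_q_disjunctive(X, y):
    """Exhaustive search over non-empty subsets of the K items in R (identity R assumed).

    Criterion (an assumption): Hamming distance between y and OR of selected columns.
    """
    k = len(X[0])
    best_subset, best_dist = None, None
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            predicted = [int(any(row[l] for l in subset)) for row in X]
            dist = sum(p != obs for p, obs in zip(predicted, y))
            if best_dist is None or dist < best_dist:
                best_subset, best_dist = subset, dist
    # With an identity R, the OR of the selected columns r_l is the subset indicator.
    q_hat = [int(l in best_subset) for l in range(k)]
    return q_hat, best_dist


# Noise-free illustration with K = 3 and a hypothetical true q-vector (1, 0, 1).
X = [list(alpha) for alpha in product((0, 1), repeat=3)]
y = [int(row[0] or row[2]) for row in X]
print(estimate_q_disjunctive(X, y))
```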

Study Design
A simulation study was conducted to investigate the performance of the new method under five factors: sample size, item parameters for the items corresponding to a reachability matrix, item parameters for new or raw items with unknown q-vectors, the cognitive diagnostic model (DINA or DINO), and the design (random or optimal). Five attributes were considered in the simulation study. Matlab 2015a and R-3.6.1 were used for estimating the unknown Q-matrix and for the real data analysis reported below. In the simulation study, a test Q-matrix $Q_t = [R \; Q_r]$ consists of an identity (i.e., reachability) matrix and a reduced Q-matrix, where the reduced Q-matrix with 31 items includes all possible non-zero q-vectors. The number of examinees has 10 levels: $N = 30, 60, \dots, 300$. Item parameters for $R$ and for $Q_r$ each have 10 levels: $0, 0.05, \dots, 0.45$. In general, for the DINA or DINO model, a high-quality or "good" item has small slip and guessing parameters (Rupp et al., 2010), which means that the noise is small.
Random and optimal designs were considered in the simulation study. For the random design, attribute patterns for examinees were generated by taking each of the $2^5$ possible patterns with equal probability for each sample size. From the proof of Theorem 1 above, we know that the following set of attribute patterns plays a very important role in discriminating the latent response vectors of different q-vectors under the DINA model:

$$S_{\mathrm{DINA}} = \{\mathbf{1} - e_k : k = 1, 2, \dots, K\},$$

where $\mathbf{1}$ is the all-ones vector and $e_k$ is the vector with a 1 in the $k$th entry and 0's elsewhere. From the proof of Theorem 2 above, another set of attribute patterns plays the same role under the DINO model:

$$S_{\mathrm{DINO}} = \{e_k : k = 1, 2, \dots, K\},$$

where $e_k$ is the vector with a 1 in the $k$th entry and 0's elsewhere. For the optimal design, attribute patterns for examinees under the DINA or DINO model were randomly drawn with replacement from $S_{\mathrm{DINA}}$ or $S_{\mathrm{DINO}}$, respectively. The optimal designs for the two models may correspond to learners at different stages of skill and knowledge acquisition. For example, each attribute pattern in $S_{\mathrm{DINO}}$ contains only one mastered skill. This condition is improbable for summative assessments in real situations, but is expected to be common for novice learners facing new content in formative or classroom assessments.
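The two sampling designs can be sketched as follows. The function names, the string labels for designs and models, and the seed are hypothetical; the random design is implemented by drawing each attribute independently with probability 0.5, which yields each of the 2^K patterns with equal probability.

```python
import random


def s_dina(k):
    """Patterns 1 - e_k: every attribute mastered except attribute k."""
    return [tuple(0 if i == j else 1 for i in range(k)) for j in range(k)]


def s_dino(k):
    """Patterns e_k: only attribute k mastered."""
    return [tuple(1 if i == j else 0 for i in range(k)) for j in range(k)]


def sample_patterns(n, k, design, model, rng):
    """Draw n attribute patterns under the random or the optimal design."""
    if design == "random":
        return [tuple(rng.randint(0, 1) for _ in range(k)) for _ in range(n)]
    pool = s_dina(k) if model == "DINA" else s_dino(k)
    return [rng.choice(pool) for _ in range(n)]  # sampling with replacement


rng = random.Random(2020)  # arbitrary seed for reproducibility
optimal_dina = sample_patterns(30, 5, "optimal", "DINA", rng)
random_dino = sample_patterns(30, 5, "random", "DINO", rng)
print(optimal_dina[:3])
print(random_dino[:3])
```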

Data Simulation
Simulated data were generated using five attributes. Based on the simulated Q-matrix, item parameters, and attribute patterns, item responses were generated in the following way:

$$X_{ij} = \begin{cases} 1, & \text{if } u \le P_j(\alpha_i), \\ 0, & \text{otherwise}, \end{cases}$$

where $u$ is a random value from a Uniform(0, 1) distribution and $P_j(\alpha_i)$ is the item response function of the DINA or DINO model. A total of 4,000 conditions were simulated (10 sample sizes × 10 item parameters × 10 item parameters × 2 models × 2 designs). Thirty replicated data sets were simulated for each condition.
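The response-generation step can be sketched as below. The function names and the small K = 2 inputs are hypothetical; the sketch assumes a separate uniform draw for every examinee-item pair.

```python
import random


def latent_response(alpha, q, model):
    """Deterministic latent response under the DINA (conjunctive) or DINO (disjunctive) model."""
    if model == "DINA":
        return int(all(a for a, qk in zip(alpha, q) if qk))
    return int(any(a and qk for a, qk in zip(alpha, q)))


def simulate_responses(patterns, q_vectors, slips, guesses, model, rng):
    """Return an N x J list of 0/1 responses, with X_ij = 1 when u <= P_j(alpha_i)."""
    data = []
    for alpha in patterns:
        row = []
        for q, s, g in zip(q_vectors, slips, guesses):
            p = (1.0 - s) if latent_response(alpha, q, model) else g
            u = rng.random()  # one Uniform(0, 1) draw per examinee-item pair
            row.append(int(u <= p))
        data.append(row)
    return data


patterns = [(0, 0), (1, 0), (0, 1), (1, 1)]
q_vectors = [(1, 0), (0, 1), (1, 1)]
rng = random.Random(1)  # arbitrary seed
noisy = simulate_responses(patterns, q_vectors, [0.1] * 3, [0.1] * 3, "DINA", rng)
print(noisy)
```

With slip and guessing parameters of zero, the generated responses coincide with the latent response matrix, which corresponds to the noise-free level of the item parameter factor.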

Evaluation Criterion
The performance of the new method is evaluated in terms of the correct recovery rate (CRR) of q-entries. The CRR equals the ratio of the number of correct q-entries in the estimated Q-matrix to the total number of q-entries (Chiu, 2013):

$$\mathrm{CRR} = \frac{\sum_{j=1}^{M} \sum_{k=1}^{K} I(\hat{q}_{kj} = q_{kj})}{K \times M},$$

where $M = 31$ is the number of columns of the unknown Q-matrix $Q_r$, $K$ is the number of attributes, $I(\cdot)$ is the indicator function, $q_{kj}$ is the $(k, j)$th entry of the simulated $Q_r$, and $\hat{q}_{kj}$ is the $(k, j)$th entry of the $\hat{Q}_r$ estimated by the new method. The mean and standard deviation of the CRR values over the 30 replications were reported for each condition.

Table 1 lists descriptive statistics of the correct recovery rate of q-entries for the two models and two designs across the other conditions. The mean correct recovery rate of q-entries tends to increase as sample size increases, whereas sample size only slightly affects the standard deviations of the correct recovery rates. It should be noted that the mean correct recovery rate of the optimal design is larger than that of the random design. The semi-supervised learning method for Q-matrix specification performed similarly under the two cognitive diagnostic models. In addition, since there are 32 possible attribute patterns, not all attribute patterns can be observed in the first sample size condition (N = 30). This might explain the lower correct recovery rate observed for this condition.

Table 2 shows the correct recovery rates of q-entries from the new method with a sample size of 300 for the DINA model under the random design. When the item parameters for items with known q-vectors (i.e., the reachability matrix) and unknown q-vectors are ≤0.2, most of the average correct recovery rates of q-entries for the semi-supervised method are larger than or equal to 0.9. From the trends of the marginal means in the last rows and columns of Table 2, the item parameters of the reachability matrix have a relatively larger impact on the performance of the semi-supervised method than the item parameters of items with unknown q-vectors.
Table 3 presents the correct recovery rates of q-entries from the new method with a sample size of 300 for the DINA model under the optimal design. When the item parameters for items with known and unknown q-vectors are ≤0.25, the average correct recovery rates of q-entries for the semi-supervised method are larger than or equal to 0.9. Again, item parameters for known q-vectors have a slightly larger impact on the performance of the semi-supervised method than those for unknown q-vectors, because the row means decrease more quickly than the column means. Comparing Tables 2 and 3 shows which design is more promising: the number of correct recovery rates above 0.9 in Table 3 is larger than that in Table 2. Tables 4 and 5 show the correct recovery rates of q-entries from the new method with a sample size of 300 for the DINO model under the random and optimal designs, respectively. The results for the DINO model are the same as those for the DINA model described above. In the tables, bold values are larger than 0.9.
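The correct recovery rate reported in these tables is a simple entry-wise agreement rate. A minimal sketch, with Q-matrices stored column by column and a hypothetical function name, is:

```python
def correct_recovery_rate(q_true, q_est):
    """Share of q-entries in the estimated Q-matrix that match the true Q-matrix.

    Both matrices are given as lists of q-vectors (one per item) of equal length K.
    """
    total = sum(len(col) for col in q_true)
    correct = sum(
        t == e
        for col_true, col_est in zip(q_true, q_est)
        for t, e in zip(col_true, col_est)
    )
    return correct / total


# Hypothetical example with K = 2 and M = 3 items; one of six entries is wrong.
q_true = [(1, 0), (0, 1), (1, 1)]
q_est = [(1, 0), (1, 1), (1, 1)]
print(correct_recovery_rate(q_true, q_est))
```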

REAL DATA ANALYSIS
The purpose of the real data analysis is to examine whether the proposed method is promising for a non-independent structure under the conjunctive model, based on an intuitive fact from the following example. In the example, it can be observed that $\eta_4 = \eta_2 \wedge \eta_3$ or $\eta_4 = \eta_1 \wedge \eta_2 \wedge \eta_3$. This is because the relationship $q_4 = q_2 \vee q_3$ or $q_4 = q_1 \vee q_2 \vee q_3$ holds.
A commonly used fraction-subtraction data set contains 20 items and 536 examinees (de la Torre and Douglas, 2004). In our real data analysis, we focused on a subset of test items for which the expert Q-matrix comes from … (2012). The labels given to the five skills are (A1) performing basic fraction-subtraction operation, (A2) simplifying/reducing, (A3) separating whole numbers from fractions, (A4) borrowing one from whole number to fraction, and (A5) converting whole numbers to fractions. We treated the q-vectors of items 3, 8, 9, 12, and 10 as known because the item parameters of these items are relatively small and the q-vectors of the other items are combinations of the q-vectors of these five items. Then, the semi-supervised method was applied to estimate the q-vectors of the other 10 items. Results in Table 6 show that the agreement rate of q-entries between the estimated and expert Q-matrices on the 10 items is 84%. The estimated q-entries suggest that items 4, 7, 13, 14, and 15 do not require attribute A2 (simplifying/reducing). Item 4 (similar to item 14) does not require attribute A2, which is consistent with the results of DeCarlo (2012). Items 7, 13, and 15 can be answered correctly by using the attributes required by item 12.
The estimated q-vector of item 1 has the largest discrepancy with the expert q-vector. The reason might be that solving item 1 correctly requires finding a common denominator and then performing a basic fraction-subtraction operation. The guessing and slip parameters of item 1 are 0.0001 and 0.2769, respectively, under the expert q-vector, and 0.3408 and 0.0716, respectively, under the estimated q-vector. Because item 1 requires an extra attribute (i.e., finding a common denominator), the slip parameter under the expert q-vector is relatively large; because the estimated q-vector contains some unnecessary attributes, the guessing parameter under it is relatively large. In the estimated Q-matrix, attribute A4 has been added to item 11. The guessing probability of item 11 increased noticeably (from 0.10 to 0.48). This indicates that attribute A4 is not necessary for item 11, because this item is different from items 7, 12, and so on. The generalized DINA model (GDINA; de la Torre, 2011), the DINA model, the linear logistic model (LLM; Fischer, 1995), and the reduced reparametrized unified model (R-RUM; Hartz, 2002) were applied to fit the fraction-subtraction data with the expert or estimated Q-matrix. Under the DINA model, the means of the estimates of the guessing and slip parameters are 0.1080 and 0.1381, respectively, for the expert Q-matrix, and 0.1440 and 0.1295, respectively, for the revised Q-matrix. That is, the estimates of the slip parameters become lower, but the guessing parameters tend to be larger. Table 7 presents fit results for the fraction-subtraction data using the expert and estimated Q-matrices. The LLM with the estimated Q-matrix is the best-fitting CDM and the R-RUM with the estimated Q-matrix is slightly worse, whereas the estimated Q-matrix performed worse than the expert Q-matrix only in the DINA model.

CONCLUSION AND DISCUSSION
The supervised methods rely on a provisional Q-matrix for a whole test and on the estimates of examinees' attribute patterns and their accuracy. They are not suitable when the provisional Q-matrix contains a large amount of misspecification. The purpose of this study was to propose a semi-supervised method under the independent structure, based on item responses and a reachability matrix R, corresponding to a small part of the test items, specified by subject matter experts. The new method does not need to estimate examinees' attribute patterns. The main conclusion of this study is that the new method can play an important role in assisting subject matter experts with Q-matrix specification, because it is hard for subject matter experts to correctly specify a Q-matrix for a large number of test items. It may be useful for cognitive diagnostic assessment to facilitate teaching and learning. The generalized Q-matrix theory has shown that each column in the reduced Q-matrix can be expressed as a logical disjunction of some of the columns of the reachability matrix. With the aid of this theory, this study takes a look inside a latent response matrix and reveals an interesting and useful relationship hidden in its columns. If a latent response matrix is calculated from a Q-matrix under the conjunctive model, a column in the latent response matrix is the conjunction of some other columns in this matrix if and only if the corresponding column of the Q-matrix can be written as the disjunction of their corresponding columns. For the disjunctive model, in contrast, the columns of the latent response matrix have exactly the same disjunction relationships as the columns of the Q-matrix.
Because any conjunction or disjunction relationship among the columns of a latent response matrix implies a disjunction relationship among the columns of a Q-matrix, we expect that the relationship between the columns in the Q-matrix can be constructed from the relationship between the corresponding columns in an observed response matrix, which results from the latent response matrix by adding noise or random errors. Another reason for this expectation is that each entry in the observed response matrix is modeled as a noisy observation of the corresponding entry in the latent response matrix through slip and guessing parameters (Junker and Sijtsma, 2001), and the discrepancies between the latent and observed response matrices are considered as random errors (Tatsuoka, 1987). From the key theoretical results above, the semi-supervised method and an optimal design were then proposed for Q-matrix specification based on test response data and a reachability matrix specified by subject matter experts, and the simulation study was conducted to investigate the performance of the new method and the optimal design for examinee sampling in terms of the CRR of q-entries. The CRR results clearly showed that: (a) for the random design, when item parameters for items with known and unknown q-vectors are ≤0.20, the average CRR of q-entries for the semi-supervised method is larger than or equal to 0.9; (b) for the optimal design, when item parameters for items with known and unknown q-vectors are ≤0.25, the average CRR of q-entries for the semi-supervised method is larger than or equal to 0.9; and (c) item parameters of the reachability matrix have a larger impact on the performance of the semi-supervised method than item parameters of items with unknown q-vectors.
Finally, based on the results obtained in this study, several problems worthy of future study are put forward. First, how can one make effective use of the data on additional items for which experts have also specified q-vectors? This question matters because the exhaustive method has exponential time complexity, so its running time grows rapidly as the number of items with specified q-vectors increases. If the number of such items is increased to double or triple the number of attributes corresponding to the reachability matrix, one should investigate whether choosing a small subset of high-quality items will reduce the noise in the responses and improve the estimation of the q-entries of unknown items. Second, in the simulation study, we knew exactly how many attributes all items involve. In real situations, however, some items with unknown q-vectors may involve additional attributes not specified in the reachability matrix, because not all items have been reviewed. Thus, a novel or revised method for detecting possible extra attributes should be explored. Third, if the Q-matrix obtained from the semi-supervised method is taken as the initial or provisional Q-matrix of the existing supervised methods, is it possible to further improve the recovery of the Q-matrix? The results of the study show that item parameters, or random errors in item responses, affect the recovery of the Q-matrix. If there is a method to reduce noise in item responses, the recovery of the Q-matrix may be further improved. We considered only a small set of items with known q-vectors and fixed item parameters. Additional work is needed to examine the impact of error patterns in the known q-vectors as well as of different item parameters for test items. Fourth, the current study focused on the DINA and DINO models only.
In the future, the proposed method should be applied to general families of cognitive diagnostic models such as the generalized DINA model (de la Torre, 2011), the loglinear cognitive diagnostic model (Henson et al., 2009), the general diagnostic model (von Davier, 2008), the testlet cognitive diagnosis model (Zhan et al., 2018), or polytomous cognitive diagnosis models (Chen and de la Torre, 2018; Ma, 2019). Lastly, because only the independent attribute structure was considered in the simulation study and only hierarchical structures under the conjunctive model were considered in the real data analysis, the performance of the proposed method for other attribute hierarchies with different cognitive assumptions is worth studying.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. These data can be found here: https://www.rdocumentation.org/packages/CDM/versions/7.4-19/topics/fraction.subtraction.data.

AUTHOR'S NOTE
Based on the fact that any column of a reduced Q-matrix can be expressed by the columns of a reachability matrix R under the logical OR operation, a semi-supervised learning approach and an optimal design for examinee sampling were proposed for Q-matrix specification under the conjunctive and disjunctive models. This method only requires subject matter experts to specify an R matrix corresponding to a small part of the test items. Simulation and real data analysis showed that the new method with the optimal design is promising in terms of correct recovery rates of q-entries.