The Development of an Instrument for Longitudinal Learning Diagnosis of Rational Number Operations Based on Parallel Tests

A precondition for measuring longitudinal learning is a high-quality instrument for longitudinal learning diagnosis. This study developed such an instrument for rational number operations. To provide a reference for practitioners who wish to develop similar instruments, the development process is presented step by step. The process contains three main phases: Q-matrix construction and item development, a preliminary (pilot) test for item quality monitoring, and a formal test for test quality control. The results indicate that (a) both the overall quality of the tests and the quality of each item are satisfactory and (b) the three tests meet the requirements of parallel tests and can therefore be used as an instrument for longitudinal learning diagnosis to track students' learning.


INTRODUCTION
In recent decades, with the development of psychometrics, learning diagnosis (Zhan, 2020b) or cognitive diagnosis (Leighton and Gierl, 2007), which objectively quantifies students' current learning status, has drawn increasing interest. Learning diagnosis aims to promote students' learning according to diagnostic results, which typically include diagnostic feedback and interventions. However, most existing learning diagnoses are cross-sectional and are not concerned with measuring growth in learning. By contrast, longitudinal learning diagnosis evaluates students' knowledge and skills (collectively known as latent attributes) and identifies their strengths and weaknesses over a period (Zhan, 2020b).
A complete longitudinal learning diagnosis should include at least two parts: an instrument for longitudinal learning diagnosis of specific content and a longitudinal learning diagnosis model (LDM). A precondition for measuring longitudinal learning is a high-quality instrument for longitudinal learning diagnosis. The data collected with such an instrument give researchers opportunities to develop longitudinal LDMs that can be used to track individual growth over time. In recent years, several longitudinal LDMs have been proposed (for a review, see Zhan, 2020a). Although the usefulness of these longitudinal LDMs in analyzing longitudinal learning diagnosis data has been evaluated through some simulation studies and a few applications, the development process of an instrument for longitudinal learning diagnosis is rarely mentioned (cf. Wang et al., 2020). The lack of an operable development process for such instruments hinders the application and promotion of longitudinal learning diagnosis in practice and prevents practitioners in specific fields from applying this approach to track individual growth in specific domains.
Currently, many applications use cross-sectional LDMs to diagnose individuals' learning status in the field of mathematics because the structure of mathematical attributes is relatively easy to identify. Examples include fraction calculations (Tatsuoka, 1983; Wu, 2019), linear algebraic equations (Birenbaum et al., 1993), and spatial rotations (Chen et al., 2018; Wang et al., 2018). Some studies also apply cross-sectional LDMs to analyze data from large-scale mathematical assessments (e.g., George and Robitzsch, 2018; Park et al., 2018; Zhan et al., 2018; Wu et al., 2020). However, most of these application studies use a cross-sectional design and cannot track the individual growth of mathematical ability.
In the field of mathematics, understanding rational numbers is crucial for students' mathematics achievement (Booth et al., 2014). Rational numbers and their operations are among the most basic concepts of numbers and mathematical operations, respectively. Although much effort is put into rational number teaching, many students and teachers still struggle to understand rational numbers (Cramer et al., 2002; Mazzocco and Devlin, 2008). Rational number operation is the first challenge that students encounter in mathematics at the beginning of junior high school. Learning rational number operation is not only the premise of the subsequent learning of mathematics in junior high school but is also an important opportunity to cultivate students' interest and confidence in mathematics learning.
The main purpose of this study is to develop an instrument for longitudinal learning diagnosis, especially for the content of rational number operations. We present the development process step by step to provide a reference for practitioners to develop the instrument for longitudinal learning diagnosis.

DEVELOPMENT OF THE INSTRUMENT FOR LONGITUDINAL LEARNING DIAGNOSIS
As the repeated measures design is not always feasible in longitudinal educational measurement, in this study, the developed instrument is a longitudinal assessment consisting of parallel tests. The whole development process is shown in Figure 1. In the rest of the paper, we describe the development process step by step.

Recognition of Attributes and Attribute Hierarchy
The first step in designing and developing a diagnostic assessment is recognizing the core attributes involved in the field of study (Bradshaw et al., 2014). In previous studies, attributes were mainly identified through literature review (Henson and Douglas, 2005) and expert judgment (Buck et al., 1997; Roduta Roberts et al., 2014; Wu, 2019). This study combined these two methods.
First, relevant content knowledge was extracted according to the analysis of mathematics curriculum standards, mathematics exam outlines, teaching materials and supporting books, existing provincial tests, and chapter exercises. By reviewing the literature, we found that existing research mainly focuses on one or several parts of rational number operation. For example, fraction addition and subtraction is the part most often covered in existing research (e.g., Tatsuoka, 1983; Wu, 2019). In contrast, it is not common for diagnostic tests to cover rational number operation as a whole. Ning et al. (2012) pointed out that rational number operation contains 15 attributes; however, such a large number of attributes is impractical. Second, according to the attribute framework based on the diagnosis of mathematics learning among students in 20 countries in the Third International Math and Science Study-Revised (Tatsuoka et al., 2004), the initial attribute framework and the corresponding attribute hierarchy (Leighton et al., 2004) of this study were determined after a discussion among six experts, including two frontline mathematics teachers who have more than 10 years' experience in mathematics education, two graduate students majoring in mathematics, and two graduate students majoring in psychometrics (see Table 1 and Figure 2).
Third, a reassessment by another group of eight experts (frontline mathematics teachers) and the think-aloud protocol analysis (Roduta Roberts et al., 2014) were used to verify the rationality of the initial attribute framework and that of the corresponding attribute hierarchy. All experts agreed that the attributes and their hierarchical relationships were reasonable. In the think-aloud protocol analysis, six items were initially prepared according to the initial attribute framework and attribute hierarchy (see Table 2). Then, six seventh graders were selected according to above-average performance, gender balance, willingness to participate, and ability to express their thinking process (Gierl et al., 2008). The experimenter individually tested these students and recorded their responses; in the response process, the students were required to say aloud their problem-solving train of thought.
TABLE 2 (excerpt) | Please say aloud your thoughts when you solve the problem. (1) Which one of the following statements about rational numbers is correct? ( ).
Taking the responses of two students to item 6 as an example, Figure 3 and Table 3 present their problem-solving process and thinking process, respectively. Although different students used different problem-solving processes, they all used addition, subtraction, multiplication, and division to solve the items on the mixed operation of rational numbers. Therefore, mastering A4 and A5 is a prerequisite to mastering A6, which supports the rationality of the attribute hierarchy proposed by the experts. Finally, as presented in Table 1, the attributes of rational number operation fell into the following six categories: (A1) rational number, (A2) related concepts of rational numbers, (A3) axis, (A4) addition and subtraction of rational numbers, (A5) multiplication and division of rational numbers, and (A6) mixed operation of rational numbers. The six attributes followed a hierarchical structure (Figure 2), which indicates that A1-A3 are structurally independent and that A4 and A5 are both needed to master A6.

Q-Matrix Construction and Item Development
FIGURE 3 | (A,B) Problem-solving process of two students in the think-aloud protocol analysis.

TABLE 3 | The thinking process of two students in think-aloud protocol analysis.

Student 1:
Step 1: Read the item, and judge that the content knowledge investigated in this item is the mixed operation of rational numbers;
Step 2: Recall the rule for the mixed operation of rational numbers: powers first, then multiplication and division, and finally addition and subtraction; if there are parentheses, calculate inside the parentheses first;
Step 3: Multiply and divide first: (−2/5) × (−3/7) = 6/35, and change division by (−7/3) to multiplication by (−3/7);
Step 4: Use multiplication: (−3/5) × (−3/7) = 9/35;
Step 5: Use addition, and get the answer.

Student 4:
Step 1: Read the item, and judge that the content knowledge investigated in this item is the mixed operation of rational numbers;
Step 2: Recall the rule for the mixed operation of rational numbers: powers first, then multiplication and division, and finally addition and subtraction; if there are parentheses, calculate inside the parentheses first;
Step 3: Observe that dividing by (−7/3) can be changed to multiplying by (−3/7), so the distributive law of multiplication can be used;
Step 4: Use the distributive law of multiplication, put (−3/7) outside of the parentheses, then (−2/5) + (−3/5) = (−1) in the parentheses;
Step 5: Use subtraction, and get the answer.

According to the attribute hierarchy, A4 and A5 are both needed to master A6. Therefore, the attribute patterns that contain A6 but lack either A4 or A5 are unattainable. Theoretically, there are 40 rather than 2^6 = 64 attainable attribute patterns. Correspondingly, the initial Q-matrix (i.e., test blueprint) (Tatsuoka, 1983) was constructed based on these 40 permissible attribute patterns and with the following factors in mind: (a) the Q-matrix contains at least one reachability matrix for completeness (Ding et al., 2010); (b) each attribute is examined at least twice; and (c) the test time is limited to a teaching period of 40 min to ensure that students have a high degree of involvement. Finally, the test length was set to 18 items, including 12 multiple-choice items and 6 calculation items (see Figure 4). Note that all items are dichotomously scored in the current study. To ensure that the initial item bank contained enough items, we prepared 4-5 items for each of the 18 attribute patterns contained in the initial Q-matrix. Finally, an initial item bank containing 80 items was formed.
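The count of 40 permissible attribute patterns can be checked by enumeration. The following is a minimal sketch, assuming only the hierarchy described above (A6 requires both A4 and A5; A1-A3 are unconstrained); the function and variable names are illustrative and do not come from the original study.

```python
from itertools import product

ATTRIBUTES = ["A1", "A2", "A3", "A4", "A5", "A6"]

def is_permissible(pattern):
    """Return False only when A6 is mastered without both A4 and A5."""
    a4, a5, a6 = pattern[3], pattern[4], pattern[5]
    return not (a6 == 1 and (a4 == 0 or a5 == 0))

# All 2^6 binary mastery patterns over the six attributes.
all_patterns = list(product([0, 1], repeat=len(ATTRIBUTES)))
permissible = [p for p in all_patterns if is_permissible(p)]

print(len(all_patterns))  # 64
print(len(permissible))   # 40
```

The result follows from 2^3 = 8 free combinations of A1-A3 multiplied by the 5 admissible combinations of (A4, A5, A6).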

Participants
In the preliminary test, 296 students (145 males and 151 females) were recruited by convenience sampling from six classes in grade seven of junior high school A.

Procedure
To avoid the fatigue effect, 80 items were divided into two tests (preliminary test I and preliminary test II, with 40 items in each test). All participants took part in the two tests. Each test lasted for 90 min, and the two tests were completed within 48 h.

Analysis
Item difficulty and discrimination were computed based on the classical test theory. The differential item functioning (DIF) was checked using the difR package (version 5.0) (Magis et al., 2018) in R software.
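The classical test theory indices can be sketched in a few lines. This is an illustrative sketch rather than the code used in the study: it assumes dichotomous responses stored as one list per examinee, defines difficulty as the pass rate, and defines discrimination as the difference in pass rates between the upper and lower 27% of examinees ranked by total score. The toy data and the helper names (`item_difficulty`, `item_discrimination`, `keep_item`) are hypothetical; the screening thresholds mirror the criteria reported in the Results.

```python
def item_difficulty(responses, item):
    """Pass rate: proportion of examinees who answered the item correctly."""
    return sum(row[item] for row in responses) / len(responses)

def item_discrimination(responses, item, fraction=0.27):
    """Pass rate of the upper group minus that of the lower group."""
    ranked = sorted(responses, key=sum)              # ascending total score
    n_group = max(1, round(fraction * len(ranked)))  # size of each extreme group
    lower, upper = ranked[:n_group], ranked[-n_group:]
    p_upper = sum(row[item] for row in upper) / n_group
    p_lower = sum(row[item] for row in lower) / n_group
    return p_upper - p_lower

def keep_item(responses, item):
    """Screen an item: difficulty within [0.2, 0.8] and discrimination above 0.3."""
    p = item_difficulty(responses, item)
    d = item_discrimination(responses, item)
    return 0.2 <= p <= 0.8 and d > 0.3

# Hypothetical toy data: 10 examinees (rows) by 3 items (columns).
responses = [
    [1, 1, 1], [1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 0],
    [1, 0, 0], [0, 1, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0],
]
print(item_difficulty(responses, 0), item_discrimination(responses, 0))
```

With tied total scores, group membership depends on how ties are ordered, so a production analysis would need an explicit tie-handling rule.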

Results
FIGURE 4 | Q-matrix, where blank means "0" and gray means "1." Note that A1 = rational number; A2 = related concepts of rational numbers; A3 = axis; A4 = addition and subtraction of rational numbers; A5 = multiplication and division of rational numbers; and A6 = mixed operation of rational numbers.
A total of 296 students took the preliminary test. After data cleaning, 270 and 269 valid tests were collected in preliminary test I and preliminary test II, respectively. The effective rates of preliminary test I and preliminary test II were 91.22% and 91.19%, respectively. Table 4 presents the basic sample information and descriptive statistics of the raw scores. The distribution of the raw scores for the two tests was the same. Table 5 presents the average difficulty and the average discrimination of the preliminary test (the difficulty and discrimination of each item are presented in Table 6). In classical test theory, item difficulty (i.e., the pass rate) is the ratio of the number of people who respond correctly to the total number of people, and item discrimination is the difference between the pass rate of the upper 27% of the group and that of the lower 27% of the group. In general, a high-quality test should have the following characteristics: (a) the average difficulty of the test is 0.5, (b) the difficulty of each item is between 0.2 and 0.8, and (c) the discrimination of each item is greater than 0.3. Based on these three criteria, we deleted eight items in preliminary test I and seven items in preliminary test II.
Table 7 presents the results of the DIF testing of the preliminary tests. DIF is an important index for evaluating the quality of an item. If an item exhibits DIF, two observed groups (here, males and females) with similar overall ability will differ significantly in their scores on that item. In the preliminary tests, the Mantel-Haenszel method (Holland and Thayer, 1986) was used to conduct DIF testing. Males were treated as the reference group, and females were treated as the focal group. The results indicated that items 28 and 36 in preliminary test I had DIF, and no item in preliminary test II had DIF. These two items had already been classified as items to be deleted according to item difficulty and discrimination in the above analysis.
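The logic of the Mantel-Haenszel procedure can be illustrated with a short sketch. It is not the difR implementation: it assumes that examinees are matched on total score, builds one 2 × 2 table per score stratum, and returns the common odds ratio together with the continuity-corrected chi-square statistic (1 degree of freedom). The function name, argument names, and toy data are hypothetical.

```python
from collections import defaultdict

def mantel_haenszel(item_scores, total_scores, groups, reference="male"):
    """Return (common odds ratio, MH chi-square) for one dichotomous item."""
    # One 2x2 table per total-score stratum, stored as [A, B, C, D]:
    # reference correct, reference incorrect, focal correct, focal incorrect.
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for x, total, group in zip(item_scores, total_scores, groups):
        offset = 0 if group == reference else 2
        strata[total][offset + (0 if x == 1 else 1)] += 1

    num = den = sum_a = sum_expected = sum_var = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        if n < 2:
            continue  # a stratum with one examinee carries no information
        n_ref, n_foc, m1, m0 = a + b, c + d, a + c, b + d
        num += a * d / n
        den += b * c / n
        sum_a += a
        sum_expected += n_ref * m1 / n
        sum_var += n_ref * n_foc * m1 * m0 / (n * n * (n - 1))

    alpha_mh = num / den if den > 0 else float("inf")
    # Continuity correction of 0.5, floored at zero before squaring.
    chi2 = max(0.0, abs(sum_a - sum_expected) - 0.5) ** 2 / sum_var if sum_var > 0 else 0.0
    return alpha_mh, chi2

# Hypothetical toy data: identical pass rates for both groups in each stratum.
item_scores = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
total_scores = [5] * 10 + [10] * 10
groups = ["male"] * 5 + ["female"] * 5 + ["male"] * 5 + ["female"] * 5
alpha_mh, chi2 = mantel_haenszel(item_scores, total_scores, groups)
print(alpha_mh, chi2)
```

An odds ratio near 1 indicates no DIF; the chi-square statistic is compared with the critical value 3.84 at the 0.05 level.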
By analyzing item difficulty, item discrimination, and DIF, 65 items finally remained (including 32 items in preliminary test I and 33 items in preliminary test II) to form the final item bank. Among them, there are 3-5 candidate items corresponding to each of the 18 attribute patterns in the initial Q-matrix. Furthermore, based on the initial Q-matrix, three learning diagnostic tests with the same Q-matrix were randomly extracted from the final item bank to form the instrument of the formal tests: formal test A, formal test B, and formal test C.
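The random extraction of parallel forms can be sketched as follows. This is a hypothetical illustration, not the procedure actually coded by the authors: it assumes the item bank is organized by Q-matrix row (attribute pattern) and draws, for each pattern, as many distinct items as there are forms, so that all forms share the same Q-matrix and no item appears in more than one form. The item identifiers, the seed, and the three-pattern bank are invented for the example.

```python
import random

def assemble_parallel_forms(item_bank, n_forms=3, seed=0):
    """Draw one distinct item per attribute pattern for each form."""
    rng = random.Random(seed)
    forms = [[] for _ in range(n_forms)]
    for pattern, candidates in item_bank.items():
        if len(candidates) < n_forms:
            raise ValueError(f"pattern {pattern} has too few candidate items")
        chosen = rng.sample(candidates, n_forms)  # sampling without replacement
        for form, item in zip(forms, chosen):
            form.append((pattern, item))
    return forms

# Hypothetical miniature bank: 3 patterns with 3-5 candidate items each.
item_bank = {
    (1, 0, 0, 0, 0, 0): ["i01", "i02", "i03", "i04"],
    (0, 0, 0, 1, 0, 0): ["i05", "i06", "i07"],
    (0, 0, 0, 1, 1, 1): ["i08", "i09", "i10", "i11", "i12"],
}
forms = assemble_parallel_forms(item_bank, n_forms=3, seed=2021)
for name, form in zip("ABC", forms):
    print(name, [item for _, item in form])
```

Because the final item bank holds 3-5 candidates for each pattern, three non-overlapping forms can always be drawn under this scheme.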

Formal Test: Q-Matrix Validation, Reliability and Validity, and Parallel Test Checking
It was possible that the initial Q-matrix was not adequately representative despite the level of care exercised. Thus, empirical validation of the initial Q-matrix was still needed to improve the accuracy of subsequent analysis (de la Torre, 2008). Although item quality was controlled in the preliminary test, it was necessary to ensure that these three tests, as instruments for longitudinal learning diagnosis, met the requirements of parallel tests. Only in this way could the performance of students at different time points be compared.

Participants
In the formal tests, 301 students (146 males and 155 females) were recruited by convenience sampling from six classes in grade seven of junior high school B.

Procedure
All participants were tested simultaneously. The three tests (i.e., formal tests A, B, and C) were administered in turn. Each test lasted 40 min, and the three tests were completed within 48 h.

Analysis
Note: The items to be deleted include items 5, 9, 20, 24, 28, 29, 33, and 36 in preliminary test I, and items 2, 15, 20, 24, 26, 33, and 39 in preliminary test II.
Apart from some descriptive statistics, the data from the formal tests were mainly analyzed with LDMs using the CDM package (version 7.4-19) (Robitzsch et al., 2019) in R. The analyses included model-data fitting, empirical validation of the initial Q-matrix, model parameter estimation, and the testing of reliability and validity. In the parallel test checking, the consistency of the three tests in terms of raw scores, estimated item parameters, and diagnostic classifications was examined. The deterministic-input, noisy "and" (DINA) model (Junker and Sijtsma, 2001), the deterministic-input, noisy "or" (DINO) model (Templin and Henson, 2006), and the general DINA (GDINA) model (de la Torre, 2011) were used to fit the data. In the model-data fitting, as suggested by Chen et al. (2013), the AIC and BIC were used for the relative fit evaluation, and the RMSEA, SRMSR, MADcor, and MADQ3 were used for the absolute fit evaluation. In the model parameter estimation, only the estimates of the best-fitting model were presented. In the empirical validation of the initial Q-matrix, the procedure suggested by de la Torre (2008) was used. In the model-based DIF checking, the Wald test (Hou et al., 2014) was used. In the testing of reliability and validity, the classification accuracy (Pa) and consistency (Pc) indices (Wang et al., 2015) were computed.
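The difference between the DINA and DINO models can be made concrete with their item response functions. The sketch below is illustrative only: in the DINA model an examinee answers correctly with probability 1 − s when all required attributes are mastered and with probability g otherwise, whereas in the DINO model mastering at least one required attribute suffices. The Q-matrix row and the slip (s) and guessing (g) values are hypothetical and are not estimates from this study.

```python
def dina_prob(alpha, q, slip, guess):
    """P(correct) under DINA: all attributes required by q must be mastered."""
    eta = all(a == 1 for a, required in zip(alpha, q) if required == 1)
    return 1 - slip if eta else guess

def dino_prob(alpha, q, slip, guess):
    """P(correct) under DINO: at least one required attribute must be mastered."""
    omega = any(a == 1 for a, required in zip(alpha, q) if required == 1)
    return 1 - slip if omega else guess

# Hypothetical item requiring A4, A5, and A6 (attributes ordered A1..A6).
q_row = (0, 0, 0, 1, 1, 1)
# Hypothetical examinee who has mastered everything except A6.
alpha = (1, 1, 1, 1, 1, 0)

p_dina = dina_prob(alpha, q_row, slip=0.1, guess=0.2)
p_dino = dino_prob(alpha, q_row, slip=0.1, guess=0.2)
print(p_dina, p_dino)
```

For this examinee the DINA model yields the guessing probability, while the DINO model yields 1 − s, which shows why the two models can fit the same data differently.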

Descriptive statistics of raw scores
A total of 301 students took the formal tests. After data cleaning, valid responses from the same 277 students (135 males and 142 females) were retained for each of the three tests; the effective rate of the formal tests was 93.57%. Table 8 presents the descriptive statistics of raw scores in the formal tests. The average, standard deviation, mode, median, minimum, and maximum of raw scores of the three tests were the same.
Note that the standard deviation is indicated in parentheses. The tests have a full mark of 18.
Note that split-half reliability is calculated according to the items of odd and even numbers; r_AB = the parallel-forms reliability of tests A and B; r_AC = the parallel-forms reliability of tests A and C; r_BC = the parallel-forms reliability of tests B and C; **p < 0.01.

Model-data fitting
The parameters in an LDM can be interpreted only when the selected model fits the data. The fit indices presented in Table 9 provide information about the fit of three LDMs (DINA, DINO, and GDINA) to the data and were used to determine the best-fitting model. For absolute fit indices, values near zero indicate good fit (Oliveri and von Davier, 2011; Ravand, 2016). The results indicated that all three models fitted the data well. For relative fit indices, smaller values indicate a better fit. The DINA model was preferred based on the BIC, and the GDINA model was preferred based on the AIC. According to the parsimony principle (Beck, 1943), a simpler model is preferred if its performance is not significantly worse than that of a more complex model. Both AIC and BIC introduce a penalty for model complexity. However, because the penalty in BIC also depends on the sample size, it is larger than that in AIC for a sample of the size used here.
The DINA model was chosen as the best-fitting model for two reasons: the sample size of this study is small and might not support accurate parameter estimation for the GDINA model, and the item parameters in the DINA model have more straightforward interpretations. Therefore, the DINA model was used for the follow-up model-based analyses.
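Why the AIC and BIC can disagree is easy to see from their definitions, AIC = −2 ln L + 2p and BIC = −2 ln L + p ln N, where p is the number of estimated parameters and N is the sample size. The sketch below uses N = 277 from this study, but the log-likelihoods and parameter counts are hypothetical values chosen only to reproduce the qualitative pattern (AIC favoring the more complex model, BIC favoring the simpler one); they are not the values in Table 9.

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion: penalty of 2 per parameter."""
    return -2.0 * log_lik + 2.0 * n_params

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: penalty of ln(N) per parameter."""
    return -2.0 * log_lik + n_params * math.log(n_obs)

N_OBS = 277  # valid examinees in the formal tests

# Hypothetical fit results: (log-likelihood, number of parameters).
simple_model = (-2500.0, 40)
complex_model = (-2450.0, 80)

aic_simple, aic_complex = aic(*simple_model), aic(*complex_model)
bic_simple = bic(*simple_model, N_OBS)
bic_complex = bic(*complex_model, N_OBS)

print(round(math.log(N_OBS), 2))  # per-parameter BIC penalty, about 5.62
print(aic_simple, aic_complex)    # 5080.0 5060.0
print(bic_simple < bic_complex)   # True
```

With N = 277, each additional parameter costs about 5.62 under the BIC but only 2 under the AIC, so the BIC demands a much larger gain in log-likelihood before it prefers the more complex model.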

Q-matrix validation
A misspecified Q-matrix can seriously affect parameter estimation and diagnostic accuracy (de la Torre, 2008; Ma and de la Torre, 2019). Note that Q-matrix validation can also be skipped when the model fits the data well. Table 10 presents the revision suggestion based on the empirical validation of the initial Q-matrix. In all three tests, the revision suggestion was only for item 9. However, after the subjective and empirical judgment of the experts (Ravand, 2016), it was recommended that this revision suggestion not be adopted. Take item 9 ("Which number minus 7 is equal to −10?") in formal test A as an example. Clearly, this item does not involve the attributes named in the suggested changes (A3, A5, and A6). As the expert-defined Q-matrix was consistent with the data-driven Q-matrix, the initial Q-matrix was used as the confirmed Q-matrix in the follow-up analyses.

Reliability and validity
Classification accuracy (Pa) and consistency (Pc) are two important indicators for evaluating the reliability and validity of classification results. According to Ravand and Robitzsch (2018), values of at least 0.8 for the Pa index and 0.7 for the Pc index can be considered acceptable classification rates. As shown in Table 11, both pattern- and attribute-level classification accuracy and consistency were within the acceptable range. Additionally, Cronbach's α, split-half reliability, and parallel-forms reliability were also computed based on the raw scores (see Table 12).
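The raw-score reliability coefficients can be sketched as follows. This is a minimal illustration with hypothetical data: Cronbach's α is computed from item and total-score variances, and split-half reliability is computed from odd- and even-numbered items. Applying the Spearman-Brown correction to the half-test correlation is an assumption of this sketch; the text does not state whether a correction was applied.

```python
from statistics import mean, variance

def cronbach_alpha(responses):
    """Cronbach's alpha from an examinee-by-item score matrix."""
    k = len(responses[0])
    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def split_half(responses):
    """Odd-even split-half reliability with Spearman-Brown correction (assumed)."""
    odd = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r = pearson(odd, even)
    return 2 * r / (1 + r)

# Hypothetical toy data: 5 examinees by 4 dichotomous items.
scores = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
alpha_hat = cronbach_alpha(scores)
split_half_hat = split_half(scores)
print(round(alpha_hat, 2), round(split_half_hat, 2))  # 0.8 0.88
```

Parallel-forms reliability (e.g., r_AB) can be obtained by passing the raw total scores of two forms to `pearson`.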
The attribute framework of this study was reassessed by several experts, and the Q-matrix was confirmed, indicating that the content validity and the structural validity of this study were good. To further verify the external validity, the correlation between the raw score of each formal test and the raw score of a monthly exam (denoted as S; the content of this test is the chapter on "rational numbers") was computed (r_AS = 0.95, p < 0.01; r_BS = 0.95, p < 0.01; r_CS = 0.94, p < 0.01). The results indicated that the reliability and validity of all three tests were good.

Parallel test checking
To determine whether there were significant differences in the performance of the same group of students in the three tests, the raw scores, estimated item parameters (Table 13), and diagnostic classifications were compared. As the three tests examined the same content knowledge, contained the same Q-matrix, had high parallel-forms reliability, and had no significant differences in the raw scores, estimated item parameters, and diagnostic classifications, they could be considered to meet the requirements of parallel tests.

CONCLUSION AND DISCUSSION
This study developed an instrument for longitudinal learning diagnosis of rational number operations. To provide a reference for practitioners who wish to develop such instruments, the development process was presented step by step. The development process contains three main phases: Q-matrix construction and item development, the preliminary test for item quality monitoring, and the formal test for test quality control. The results of this study indicate that (a) both the overall quality of the tests and the quality of each item are satisfactory and that (b) the three tests meet the requirements of parallel tests and can be used as an instrument for longitudinal learning diagnosis to track students' learning.
However, this study still has some limitations. First, to increase operability, only binary attributes were adopted. As a binary attribute can only divide students into two categories (i.e., mastery and non-mastery), it may not meet the need for a multi-level division of practical teaching objectives (Bloom et al., 1956). Polytomous attributes and the corresponding LDMs (Karelitz, 2008; Zhan et al., 2020) can be adopted in future studies. Second, the instrument for longitudinal learning diagnosis was based on parallel tests. However, in practice, perfectly parallel tests do not exist. In further studies, the anchor-item design (e.g., Zhan et al., 2019) can be adopted to develop an instrument for longitudinal learning diagnosis. Third, an appropriate Q-matrix is one of the key factors in learning diagnosis (de la Torre, 2008). However, the Q-matrix used in the instrument may not strictly meet the requirements of identification (Gu and Xu, 2019), which may affect the diagnostic classification accuracy.
Footnote 2. The Bayes factor (BF10) was calculated using the JASP software (Goss-Sampson, 2020) based on Bayesian estimation. BF10 = 0.038 means that the current data are 0.038 times as likely to occur under the alternative hypothesis (H1) as under the null hypothesis (H0). As suggested by Dienes (2014), BF10 values less than 1, 1/3, and 1/10 represent weak, moderate, and strong evidence for H0, respectively. By contrast, BF10 values greater than 1, 3, and 10 represent weak, moderate, and strong evidence for H1, respectively.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent from the participants' legal guardian/next of kin was not required to participate in this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
FT conducted data acquisition and analysis. PZ provided the idea, wrote the first draft, and revised the manuscript. Both authors contributed to the article and approved the submitted version.