Edited by: Peida Zhan, Zhejiang Normal University, China

Reviewed by: Yi Zheng, Arizona State University, United States; Yinhong He, Nanjing University of Information Science and Technology, China

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Computerized adaptive testing (CAT) is an efficient testing mode that allows each examinee to answer items appropriate to his or her latent trait level. Implementing CAT requires a large item pool, and the pool must be replenished frequently with new items to ensure test validity and security. Online calibration is a technique for calibrating the parameters of new items in CAT: new items are seeded among the operational items during testing, and their parameters are estimated from examinees' responses to them. Under dichotomous item response theory models, the most popular estimation methods include the one EM cycle method (OEM) and the multiple EM cycle method (MEM). This paper extends OEM and MEM to the graded response model (GRM), a popular model for polytomous data with ordered categories. Two simulation studies were carried out to explore online calibration under a variety of conditions, including calibration design, initial item parameter calculation method, calibration method, calibration sample size, and number of categories. Results show that the calibration accuracy of the new items was acceptable and was affected by interactions among several of these factors; conclusions are drawn accordingly.

Computerized adaptive testing (CAT), which is considered to be one of the most important applications of item response theory (IRT; Lord,

The implementation of CAT requires a large-scale item pool, and the maintenance and management of the item pool are critical to ensuring the validity and security of CAT. Over time, some operational items may become unsuitable for use because of overexposure, obsolescence, or flaws, so unsuitable items must be replaced with new ones (Wainer and Mislevy,

Wainer and Mislevy (

Online calibration design and online calibration method are two crucial aspects of online calibration (Chen and Xin,

There are many studies on online calibration based on dichotomously scored models (e.g., You et al.,

The structure of this article is as follows. First, the GRM, the IRT model used in this research, is introduced. Second, the online calibration methods (OEM and MEM) based on the GRM are introduced, and two methods for calculating initial item parameters are given in detail. Third, two simulation studies are designed and their results are presented. Fourth, a batch of real data is used to verify the validity of the method. The last part presents conclusions, a supplementary study, discussion, and directions for future research.

The GRM is an IRT model suitable for polytomous data with ordered categories. It is an extension of the two-parameter logistic model (2PLM). In the GRM, an examinee's likelihood of responding in a particular response category is obtained in two steps. First, category boundary response functions (CBRFs) are calculated to determine boundary decision probabilities of

In Equation (1), P*_{jt}(θ_{i}) is the probability that examinee i will respond positively at the boundary of category t of item j, θ_{i} represents the ability of examinee i, a_{j} represents the item discrimination parameter or slope for item j, and b_{jt} represents the item difficulty parameter or category location. Importantly, the values of b_{jt} must be monotonically increasing, that is, b_{j1} < b_{j2} < ⋯ < b_{jt} < ⋯ < b_{j,fj}.

In the second step of GRM, the probability of responding in a particular category is determined by CBRF, which are derived by subtracting

Further, make the following constraints,
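The two-step computation can be illustrated with a minimal sketch. It assumes the conventional GRM boundary constraints (a boundary probability of 1 below the lowest category and 0 above the highest) and a logistic scaling constant `D` that defaults to 1; both are assumptions, since the corresponding equations are not reproduced here. The function names are hypothetical.

```python
import math


def boundary_probs(theta, a, bs, D=1.0):
    """Step 1: boundary probabilities P*_0, ..., P*_{f+1} for one item.

    Assumes the conventional constraints P*_0 = 1 and P*_{f+1} = 0.
    D is a scaling constant (an assumption; 1.0 or 1.7 are both common).
    """
    inner = [1.0 / (1.0 + math.exp(-D * a * (theta - b))) for b in bs]
    return [1.0] + inner + [0.0]


def category_probs(theta, a, bs, D=1.0):
    """Step 2: probability of each score 0..f, by differencing adjacent boundaries."""
    if any(lower >= upper for lower, upper in zip(bs, bs[1:])):
        raise ValueError("boundary locations must be strictly increasing")
    p_star = boundary_probs(theta, a, bs, D)
    return [p_star[t] - p_star[t + 1] for t in range(len(bs) + 1)]


# Hypothetical item with three boundaries, i.e. scores 0, 1, 2, 3.
probs = category_probs(theta=0.0, a=1.2, bs=[-1.0, 0.0, 1.0])
print([round(p, 4) for p in probs])
```

The ordering check mirrors the monotonicity requirement on the difficulty parameters: without it, a difference of adjacent boundary probabilities could be negative.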

Under the dichotomous model, OEM (Wainer and Mislevy,

OEM has only one EM cycle. For each examinee i who takes new item j, his/her responses to the operational items are available, and η_{op} is a vector of the known item parameters of the operational items. The E-step of the OEM method marginalizes the log-likelihood of new item j over the posterior distribution of ability given those responses and η_{op}. Based on the common assumption that examinees are independent of each other, the log-likelihoods of item j from the examinees who took it are summed to form the final marginalized log-likelihood of item j.

These two steps are adapted from those described in Muraki (

Where _{j} denote the _{j} examinees who received new item _{k} is the quadrature point; _{k}) is the corresponding weight, which is approximately the standard normal probability density at the point _{k}, assuming there are _{ijt} is an indicator variable expressed in a binary format; _{ijt} = 1 represents examinee _{ijt} = 0._{i}(_{k}) is the likelihood of examinee _{k}; _{h} is the number of categories of _{ht}(_{k}) is the probability of correct response to the _{k}, _{iht} is an indicator variable too, which denotes the examinee

With the one EM cycle in the OEM method, the revised

The MEM method allows multiple EM cycles. The first cycle is the same as OEM. Beginning with the second cycle, response data from both the operational items and the new items are used to update the posterior ability distribution in the E-step. Specifically, the only change in computation from OEM is that beginning with the second cycle of MEM, _{i}(_{k}) is replaced by:

Where _{ijt} denotes examinee

The E-step and the M-step iterate until a convergence criterion is met, for example, until the maximum absolute change in the item parameters between two consecutive EM cycles is less than a small threshold.
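The difference between the OEM and MEM E-steps, and the stopping rule just described, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the quadrature grid (21 equally spaced points on [-4, 4] with normalized standard normal weights), the logistic GRM without a scaling constant, the tolerance, and all function names are hypothetical choices.

```python
import math


def category_probs(theta, a, bs):
    """GRM category probabilities for scores 0..len(bs) (logistic form, assumed)."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs] + [0.0]
    return [cum[t] - cum[t + 1] for t in range(len(bs) + 1)]


def quadrature(n_points=21, lo=-4.0, hi=4.0):
    """Equally spaced points with normalized standard normal weights (an assumption)."""
    step = (hi - lo) / (n_points - 1)
    points = [lo + k * step for k in range(n_points)]
    dens = [math.exp(-0.5 * x * x) for x in points]
    total = sum(dens)
    return points, [d / total for d in dens]


def posterior_weights(scores, items, points, weights):
    """Posterior ability distribution of one examinee over the quadrature points."""
    post = []
    for x, w in zip(points, weights):
        like = 1.0
        for score, (a, bs) in zip(scores, items):
            like *= category_probs(x, a, bs)[score]
        post.append(like * w)
    total = sum(post)
    return [p / total for p in post]


def converged(old_params, new_params, tol=1e-4):
    """Stop when the largest absolute parameter change falls below tol."""
    return max(abs(o - n) for o, n in zip(old_params, new_params)) < tol


points, weights = quadrature()
operational = [(1.0, [-1.0, 0.5]), (1.5, [0.0, 1.0])]  # known parameters
op_scores = [2, 2]

# OEM (and the first MEM cycle): operational responses only.
oem_post = posterior_weights(op_scores, operational, points, weights)

# Later MEM cycles: add the new-item response, scored with provisional estimates.
new_item = (1.2, [-0.5, 0.5])
mem_post = posterior_weights(op_scores + [0], operational + [new_item], points, weights)
```

In this example the low score on the new item shifts the MEM posterior toward lower ability than the OEM posterior, which is the only change MEM makes to the E-step from the second cycle onward.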

OEM and MEM are both iterative algorithms, so the initial item parameters have a great influence on calibration accuracy. However, there are few reports on how to calculate the initial values. In the dichotomous model, a squeezing average method is used to compute the initial value of the difficulty parameter, and a biserial correlation method is used to compute the initial value of the discrimination parameter (You et al.,

Under the dichotomous model, the shape of the item response curve implies that whether an examinee answers an item correctly depends on his/her ability relative to the difficulty parameter of the item. When ability exceeds difficulty, the probability of a correct response is high; otherwise, it is low. For the one-parameter logistic model (1PLM), when the examinee's ability equals the difficulty of an item, his/her probability of a correct response to the item is 0.5. Therefore, as long as the number of responses to an item is sufficiently large, there must be some examinees whose abilities are close to the difficulty parameter of the item (You et al.,

The steps of the squeezing average method (You et al.,

Under the GRM model, GRM has multiple difficulty parameters, so multiple squeezing processes are required. For example, for the initial difficulty parameter of the _{j} sets for item

Where

In practice, a contestant is often evaluated from a set of scores given by experts: the highest and lowest scores are removed, and the remaining scores are averaged. The squeezing average method, which deletes extreme values before averaging, borrows this idea. The choice of 5% as the extreme-value cutoff in Equation (8) follows the way the initial value of the guessing parameter is obtained under the three-parameter logistic model (3PLM). A pilot study also showed that this value gave better results. The method is easy to implement and helps ensure the accuracy of parameter estimation.
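The "remove the extremes, then average" idea can be illustrated with a small sketch. It shows only the general trimming step with a 5% cutoff on each tail and is not a reproduction of Equation (8); the function name and the example values are hypothetical.

```python
def trimmed_mean(values, proportion=0.05):
    """Average the values after dropping the lowest and highest `proportion` of them."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values dropped at each end
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return sum(kept) / len(kept)


# Hypothetical ability values: eighteen close to 0.1 plus two extreme values.
abilities = [0.1] * 18 + [-3.0, 3.0]
print(trimmed_mean(abilities))
```

With 20 values and a 5% cutoff, one value is dropped from each tail, so the two extreme abilities no longer influence the average.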

The polyserial correlation coefficient method is a common statistical method (Olsson et al.,

_{jt} is the number of examinees whose scores on the new item

_{jt}; then calculate the corresponding normal density function value _{jt}). The specific calculation formula is as follows:

_{j}) of the score on the new item _{j}) between the score of the new item
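A minimal sketch of these quantities, assuming the widely used two-step ("ad hoc") polyserial estimator: thresholds are taken from the cumulative score proportions, the standard normal density is evaluated at each threshold, and the Pearson correlation is rescaled by the score standard deviation divided by the sum of those densities. Whether this matches the authors' formula term by term is an assumption, as are the function name and the use of the population standard deviation. Scores are assumed to be consecutive integers.

```python
import math
import random
from statistics import NormalDist, mean, pstdev


def polyserial_adhoc(scores, thetas):
    """Two-step polyserial correlation between ordinal scores and abilities."""
    n = len(scores)
    std_normal = NormalDist()
    categories = sorted(set(scores))
    if len(categories) < 2:
        raise ValueError("at least two observed score categories are required")

    # Thresholds come from cumulative proportions (highest category excluded).
    density_sum = 0.0
    for t in categories[:-1]:
        p = sum(1 for s in scores if s <= t) / n
        density_sum += std_normal.pdf(std_normal.inv_cdf(p))

    # Pearson correlation between scores and abilities.
    ms, mt = mean(scores), mean(thetas)
    cov = sum((s - ms) * (x - mt) for s, x in zip(scores, thetas)) / n
    sd_s, sd_t = pstdev(scores), pstdev(thetas)
    r = cov / (sd_s * sd_t)

    return r * sd_s / density_sum


# Hypothetical check: the latent response correlates 0.6 with ability.
rng = random.Random(7)
thetas, scores = [], []
for _ in range(5000):
    theta = rng.gauss(0.0, 1.0)
    latent = 0.6 * theta + math.sqrt(1 - 0.6 ** 2) * rng.gauss(0.0, 1.0)
    thetas.append(theta)
    scores.append(0 if latent < -0.5 else (1 if latent < 0.5 else 2))

rho = polyserial_adhoc(scores, thetas)
print(round(rho, 3))  # should land near the generating value of 0.6
```

In operational use the true abilities are unknown, so the ability estimates from the CAT would take the place of `thetas`.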

Two methods of calculating the initial parameters of new items are given. The first method is called polyserial-initial method, abbreviated as Poly-Ini method, with this method, both

Two simulation studies were conducted using programs written in Python 3.7. The programs simulated the entire calibration workflow, including the implementation of CAT and the calibration of the new items, with 100 replications per condition. The main purpose of Study 1 was to explore the calibration results under conditions fully crossed by two online calibration designs (random design, adaptive design), two initial item parameter calculation methods (Poly-Ini method, Poly-Sq-Ini method), and two calibration methods (OEM, MEM). This gives 8 combinations, each examined with 3-category items as an example.

The main purpose of Study 2 was to explore the calibration results under different calibration sample sizes and different numbers of categories. Two factors were manipulated: calibration sample size (300, 400, 500, 600, and 700) and the number of categories of the new items (2, 3, 4, and 5), giving 20 combinations. The random design, the Poly-Sq-Ini method, and MEM were adopted in every combination.

Suppose there are 1,000 operational items with various numbers of categories (2–5) in the CAT item pool. Item parameters were randomly generated under the GRM from the following distributions:

Here f_{j} is the number of categories. In addition, the generated difficulty parameters satisfy b_{j1} < b_{j2} < ⋯ < b_{jt} < ⋯ < b_{j,fj} in this paper.

A total of 20 new items were generated in the same manner as the operational items.

3,000 examinees' ability values (θ) were randomly drawn from the standard normal distribution θ ~
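Drawing abilities and generating graded responses from them can be sketched as follows. The item parameters below are hypothetical placeholders and do not come from the generating distributions used in the study; the seed, the logistic form without a scaling constant, and the function names are likewise assumptions.

```python
import math
import random


def category_probs(theta, a, bs):
    """GRM category probabilities for scores 0..len(bs) (logistic form, assumed)."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs] + [0.0]
    return [cum[t] - cum[t + 1] for t in range(len(bs) + 1)]


def simulate_response(theta, a, bs, rng):
    """Draw one score by comparing a uniform draw with the cumulative probabilities."""
    u = rng.random()
    probs = category_probs(theta, a, bs)
    cumulative = 0.0
    for score, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return score
    return len(probs) - 1  # guards against floating-point round-off


rng = random.Random(2020)  # arbitrary seed, for reproducibility only
thetas = [rng.gauss(0.0, 1.0) for _ in range(3000)]  # standard normal abilities
item = (1.2, [-1.0, 0.0, 1.0])  # hypothetical (a, [b1, b2, b3])
scores = [simulate_response(theta, item[0], item[1], rng) for theta in thetas]
```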

The CAT length is fixed at 25 items, including 20 operational items and 5 new items. During the CAT, the maximum Fisher information method (MFI; Lord,

During operational item selection, provisional θ estimates were used to replace the θ's in the formulae. After each operational item is administered, the examinee ability parameter
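Item selection by maximum Fisher information at the provisional ability estimate can be sketched as follows. The information function is the standard one for graded items, the sum over categories of the squared derivative of the category probability divided by that probability; the logistic form without a scaling constant, the pool values, and the function names are assumptions for illustration.

```python
import math


def item_information(theta, a, bs):
    """Fisher information of one GRM item at ability theta."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs] + [0.0]
    info = 0.0
    for t in range(len(bs) + 1):
        p = cum[t] - cum[t + 1]
        # Derivative of the category probability with respect to theta.
        dp = a * (cum[t] * (1.0 - cum[t]) - cum[t + 1] * (1.0 - cum[t + 1]))
        if p > 0.0:
            info += dp * dp / p
    return info


def select_item(theta_hat, pool, administered):
    """Index of the not-yet-administered item with maximum information at theta_hat."""
    candidates = [j for j in range(len(pool)) if j not in administered]
    return max(candidates, key=lambda j: item_information(theta_hat, *pool[j]))


# Hypothetical pool of (a, [b1, b2]) items.
pool = [(1.0, [-3.0, -2.0]), (1.5, [-0.5, 0.5]), (0.8, [2.0, 3.0])]
next_item = select_item(theta_hat=0.0, pool=pool, administered=set())
```

For an item with a single boundary the function reduces to the familiar a²P(1 − P) of the 2PLM, which is a convenient sanity check.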

The number of examinees who answer each new item must be sufficiently large to provide accurate item parameter estimates without placing an undue burden on examinees (Wainer and Mislevy,

In Study 1, the random design and the adaptive design are considered. Some studies have adopted the random design to assign new items to examinees during CAT because of its convenient implementation and acceptable calibration precision (e.g., Wainer and Mislevy,

The calibration accuracy of the new items was evaluated by root mean square error (RMSE) and bias. They quantify how well the estimated parameter values recover the true values, and the vector-based calculation formulas are as follows (He and Chen,

Where _{fj}-parameters,

In order to evaluate the overall recovery of

Smaller RMSE indicates higher calibration precision. If bias is close to 0, the calibration could be regarded as unbiased.
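As a minimal sketch of the two criteria, the textbook definitions are used below: RMSE is the square root of the mean squared difference between estimates and true values, and bias is the mean signed difference. The study's own formulas aggregate over items and replications in a vector-based form that is not reproduced here, so this is an assumption-laden simplification with hypothetical function names and values.

```python
import math


def rmse(estimates, truths):
    """Root mean square error between estimated and true parameter values."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(truths))


def bias(estimates, truths):
    """Mean signed difference (estimate minus truth)."""
    return sum(e - t for e, t in zip(estimates, truths)) / len(truths)


# Hypothetical discrimination parameters of three new items.
true_a = [1.0, 1.5, 0.8]
est_a = [1.1, 1.4, 0.8]
print(rmse(est_a, true_a), bias(est_a, true_a))
```

In this example the over- and underestimates cancel, so the bias is essentially zero while the RMSE is not; this is why the two indices are reported together.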

The results of Study 1 are shown in

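RMSE under different combinations.

| Calibration design | Initial value method | Calibration method | a | b_{1} | b_{2} | b_{3} |
|---|---|---|---|---|---|---|
| Random | Poly-Sq-Ini | OEM | 0.2047 | 0.2696 | 0.1567 | 0.2377 |
| Random | Poly-Sq-Ini | MEM | 0.2022 | 0.1705 | 0.1522 | 0.2009 |
| Random | Poly-Ini | OEM | 0.2892 | 0.1789 | 0.1705 | 0.2306 |
| Random | Poly-Ini | MEM | 0.2632 | 0.2142 | 0.1847 | 0.2595 |
| Adaptive | Poly-Sq-Ini | OEM | 0.2266 | 0.2651 | 0.2108 | 0.2501 |
| Adaptive | Poly-Sq-Ini | MEM | 0.2259 | 0.2700 | 0.2101 | 0.2433 |
| Adaptive | Poly-Ini | OEM | 0.2324 | 0.3106 | 0.2005 | 0.3179 |
| Adaptive | Poly-Ini | MEM | 0.2324 | 0.3116 | 0.2070 | 0.3231 |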

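Bias under different combinations.

| Calibration design | Initial value method | Calibration method | a | b_{1} | b_{2} | b_{3} |
|---|---|---|---|---|---|---|
| Random | Poly-Sq-Ini | OEM | 0.1258 | −0.1310 | −0.0549 | 0.0261 |
| Random | Poly-Sq-Ini | MEM | 0.0483 | −0.0423 | −0.0422 | −0.0398 |
| Random | Poly-Ini | OEM | 0.2380 | 0.0727 | −0.0367 | −0.1391 |
| Random | Poly-Ini | MEM | 0.2163 | 0.1065 | −0.0292 | −0.1589 |
| Adaptive | Poly-Sq-Ini | OEM | 0.0286 | 0.0777 | 0.0099 | −0.0482 |
| Adaptive | Poly-Sq-Ini | MEM | 0.0296 | 0.0783 | 0.0126 | −0.0472 |
| Adaptive | Poly-Ini | OEM | 0.1744 | 0.2182 | 0.0120 | −0.1887 |
| Adaptive | Poly-Ini | MEM | 0.1751 | 0.2159 | 0.0056 | −0.1875 |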

RMSE and bias of

The results of Study 2 are shown in

RMSE of different calibration sample size under different categories.

| Number of categories | Parameter | 300 | 400 | 500 | 600 | 700 |
|---|---|---|---|---|---|---|
| 2 | a | 0.2730 | 0.2716 | 0.2683 | 0.2656 | 0.2722 |
| 2 | b_{1} | 0.2495 | 0.2259 | 0.2216 | 0.2078 | 0.2060 |
| 2 | b_{2} | 0.2876 | 0.2660 | 0.2602 | 0.2554 | 0.2470 |
| 2 | Mean(b) | 0.2706 | 0.2481 | 0.2427 | 0.2338 | 0.2286 |
| 3 | a | 0.2189 | 0.2141 | 0.2119 | 0.2074 | 0.2033 |
| 3 | b_{1} | 0.2413 | 0.2237 | 0.1954 | 0.1919 | 0.1865 |
| 3 | b_{2} | 0.2127 | 0.1827 | 0.1723 | 0.1673 | 0.1568 |
| 3 | b_{3} | 0.2674 | 0.2395 | 0.2270 | 0.2249 | 0.2156 |
| 3 | Mean(b) | 0.2439 | 0.2187 | 0.2014 | 0.1993 | 0.1899 |
| 4 | a | 0.2166 | 0.2150 | 0.2138 | 0.2149 | 0.2081 |
| 4 | b_{1} | 0.2989 | 0.2866 | 0.2599 | 0.2458 | 0.2262 |
| 4 | b_{2} | 0.2232 | 0.1968 | 0.1760 | 0.1634 | 0.1577 |
| 4 | b_{3} | 0.2357 | 0.2016 | 0.1908 | 0.1610 | 0.1659 |
| 4 | b_{4} | 0.2996 | 0.2611 | 0.2564 | 0.2337 | 0.2294 |
| 4 | Mean(b) | 0.2722 | 0.2432 | 0.2345 | 0.2098 | 0.2007 |
| 5 | a | 0.2340 | 0.2407 | 0.2353 | 0.2301 | 0.2208 |
| 5 | b_{1} | 0.2837 | 0.2616 | 0.2604 | 0.2503 | 0.2491 |
| 5 | b_{2} | 0.1929 | 0.1706 | 0.1662 | 0.1583 | 0.1511 |
| 5 | b_{3} | 0.1693 | 0.1451 | 0.1419 | 0.1346 | 0.1210 |
| 5 | b_{4} | 0.1950 | 0.1743 | 0.1633 | 0.1600 | 0.1462 |
| 5 | b_{5} | 0.2672 | 0.2565 | 0.2368 | 0.2356 | 0.2257 |
| 5 | Mean(b) | 0.2284 | 0.2095 | 0.2044 | 0.1976 | 0.1873 |

Bias of different calibration sample size under different categories.

| Number of categories | Parameter | 300 | 400 | 500 | 600 | 700 |
|---|---|---|---|---|---|---|
| 2 | a | 0.1517 | 0.1561 | 0.1488 | 0.1564 | 0.1611 |
| 2 | b_{1} | −0.0231 | −0.0193 | −0.0289 | −0.0253 | −0.0336 |
| 2 | b_{2} | −0.0976 | −0.0912 | −0.0979 | −0.0945 | −0.1047 |
| 2 | Mean(b) | −0.0603 | −0.0553 | −0.0634 | −0.0599 | −0.0692 |
| 3 | a | 0.0479 | 0.0415 | 0.0546 | 0.0500 | 0.0398 |
| 3 | b_{1} | −0.0451 | −0.0479 | −0.0424 | −0.0457 | −0.0589 |
| 3 | b_{2} | −0.046 | −0.0365 | −0.0395 | −0.0403 | −0.0502 |
| 3 | b_{3} | −0.0398 | −0.0385 | −0.0502 | −0.0435 | −0.0478 |
| 3 | Mean(b) | −0.0365 | −0.0409 | −0.0440 | −0.0432 | −0.0523 |
| 4 | a | −0.0491 | −0.0445 | −0.0602 | −0.0477 | −0.0449 |
| 4 | b_{1} | −0.086 | −0.0829 | −0.089 | −0.1059 | −0.0957 |
| 4 | b_{2} | −0.0491 | −0.0354 | −0.0444 | −0.0544 | −0.0451 |
| 4 | b_{3} | −0.0298 | −0.0186 | −0.0227 | −0.0238 | −0.0115 |
| 4 | b_{4} | 0.0032 | 0.0132 | 0.0239 | 0.0204 | 0.0347 |
| 4 | Mean(b) | −0.0404 | −0.0309 | −0.0330 | −0.0409 | −0.0294 |
| 5 | a | −0.1217 | −0.1305 | −0.1289 | −0.1232 | −0.1199 |
| 5 | b_{1} | −0.1567 | −0.1473 | −0.1699 | −0.1519 | −0.1568 |
| 5 | b_{2} | −0.0752 | −0.0662 | −0.0789 | −0.0665 | −0.0737 |
| 5 | b_{3} | −0.0238 | −0.0155 | −0.0239 | −0.0126 | −0.0211 |
| 5 | b_{4} | 0.0192 | 0.0273 | 0.0294 | 0.0390 | 0.0315 |
| 5 | b_{5} | 0.0817 | 0.0990 | 0.1018 | 0.1168 | 0.1031 |
| 5 | Mean(b) | −0.0309 | −0.0205 | −0.0283 | −0.0150 | −0.0233 |

RMSE of

As can be seen from

RMSE of different calibration sample size under different categories. 2-C, 2-categories; 3-C, 3-categories; 4-C, 4-categories; 5-C, 5-categories.

It can be seen from

Bias of

Bias of different calibration sample size under different categories.

The online calibration method based on the GRM proposed in this paper performed well in the simulation studies. How does it perform on real data? Because constructing a real CAT item pool is expensive and organizing large-scale CAT administrations is difficult, this study used the response data of 500 examinees on 10 polytomous items (3-categories) from HSK4 (a Chinese proficiency test) to conduct an empirical study. The detailed steps are as follows.

Because the real data were limited, this study analyzed only a calibration sample of 200. The results of the analysis were as follows: _{a} = 0.4067, _{b1} = 0.4778, _{b2} = 0.3218, _{b3} = 0.3029.

This research extended OEM and MEM to the GRM for online calibration, and detailed descriptions of the algorithms were given in the article. Online calibration is a complex process, and many factors affect calibration accuracy. To make online calibration efficient and practicable under the GRM, these factors should be explored clearly. Two simulation studies were conducted to investigate the calibration results under various conditions. The results showed that: (1) both OEM and MEM were able to generate reasonably accurate new item parameter estimates with 700 examinees per item, and each has its own merits; (2) the Poly-Sq-Ini method performed better than the Poly-Ini method under most experimental conditions; (3) compared with the random calibration design, the adaptive calibration design did not improve calibration accuracy in most conditions; (4) the calibration sample size affected calibration accuracy, which in most conditions increased with sample size; and (5) the number of categories of the new items also affected the calibration results: the calibration accuracy of 3-category items was higher than that of 2-category items, and so on.

In addition, a supplementary study was conducted to investigate the calibration accuracy of GRM online calibration under different CAT scenarios. Eight CAT scenarios, fully crossing sample size (2,000 and 3,000) and test length (variable-length, and fixed-length with 10, 20, and 30 items), were investigated. The ability estimation results of CAT and the calibration results of new items under the various CAT scenarios are listed in

Several future directions for research can be identified. First, in this paper, the

Second, in this paper, only the match

Third, the number of categories discussed in this paper was at most 5; that is, the new items had 2, 3, 4, or 5 categories. Whether the new online calibration method remains valid for items with more than 5 categories is worthy of further study.

Fourth, there is an interesting phenomenon in the bias of the 5-categories condition. The lower

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation, to any qualified researcher.

SD, ZL, and JX designed experiments. JX and FL carried out experiments. JX analyzed experimental results and wrote the manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Estimation accuracy of ability under different CAT scenarios.

| Sample size | Test length | RMSE | Bias |
|---|---|---|---|
| 2,000 | Variable-length | 0.1904 | 0.0007 |
| 2,000 | 10 | 0.1924 | −0.0004 |
| 2,000 | 20 | 0.1340 | −0.0008 |
| 2,000 | 30 | 0.1105 | −0.0012 |
| 3,000 | Variable-length | 0.1882 | 0.0033 |
| 3,000 | 10 | 0.2012 | −0.0001 |
| 3,000 | 20 | 0.1286 | −0.0024 |
| 3,000 | 30 | 0.1057 | 0.0050 |

RMSE of new item parameters under different CAT scenarios.

| Sample size | Test length | a | b_{1} | b_{2} | b_{3} |
|---|---|---|---|---|---|
| 2,000 | Variable-length | 0.2483 | 0.2109 | 0.1802 | 0.2294 |
| 2,000 | 10 | 0.2345 | 0.2224 | 0.1557 | 0.2182 |
| 2,000 | 20 | 0.2169 | 0.1954 | 0.1545 | 0.2242 |
| 2,000 | 30 | 0.2232 | 0.2060 | 0.1685 | 0.2357 |
| 3,000 | Variable-length | 0.2337 | 0.1921 | 0.1620 | 0.2203 |
| 3,000 | 10 | 0.2302 | 0.2571 | 0.1668 | 0.2143 |
| 3,000 | 20 | 0.2121 | 0.2102 | 0.1640 | 0.2078 |
| 3,000 | 30 | 0.2069 | 0.2012 | 0.1664 | 0.2235 |

Bias of new item parameters under different CAT scenarios.

| Sample size | Test length | a | b_{1} | b_{2} | b_{3} |
|---|---|---|---|---|---|
| 2,000 | Variable-length | 0.0998 | −0.0005 | −0.0330 | −0.0685 |
| 2,000 | 10 | −0.0719 | −0.0889 | −0.0088 | 0.0615 |
| 2,000 | 20 | −0.0001 | −0.0464 | −0.0011 | 0.0272 |
| 2,000 | 30 | 0.0589 | −0.0168 | −0.0211 | −0.0269 |
| 3,000 | Variable-length | 0.0939 | −0.0117 | −0.0239 | −0.0458 |
| 3,000 | 10 | −0.0650 | −0.1364 | −0.0486 | 0.0287 |
| 3,000 | 20 | −0.0228 | −0.0187 | −0.0106 | 0.0011 |
| 3,000 | 30 | 0.0613 | −0.0152 | −0.0232 | −0.0486 |