
ORIGINAL RESEARCH article

Front. Psychol., 12 January 2026

Sec. Quantitative Psychology and Measurement

Volume 16 - 2025 | https://doi.org/10.3389/fpsyg.2025.1652341

The stability of IRT parameters under several test equating conditions

  • Department of Individual Differences & Psychodiagnostics, Saarland University, Saarbrücken, Germany

Introduction: It is crucial for researchers and test developers to compare results from different test sets (e.g., re-testing, parallel test forms). To ensure comparability, test sets are often linked using anchor items as a common denominator alongside distinct items. To date, most studies on test equating have been limited in scope, typically comparing only absolute numbers of anchor items or focusing on a single IRT model or equating method. Furthermore, previous research has primarily evaluated the absolute deviation of estimated parameters from true parameters. However, in diagnostic contexts, the correlation between these values is often more relevant for ensuring validity and test fairness. Therefore, the aim of this simulation study was to examine the impact of a broad range of key factors on test equating.

Methods: We evaluated correlations and recovery indices between predefined true values and values estimated through test equating for three IRT parameters (discrimination, difficulty, and ability). To this end, we varied the equating method (MS, MM, MGM, IRF, TRF), the IRT model (2PL vs. 3PL), guessing probability (0.000–0.250), anchor item proportion (5–25%), test set size (20–80 items), and the discrimination parameters of the anchor items. In addition, we used samples of 25–100 individuals to assess equating quality under challenging conditions as well as samples of 500 and 1,000 individuals to reflect adequate modeling conditions.

Results: Low guessing probabilities and high anchor item discrimination parameters strongly improved test equating quality for all three IRT parameters. Recovery of discrimination and ability parameters increased logarithmically with larger test set sizes and higher anchor item proportions, with each of these two factors partially compensating for reductions in the other. While sample sizes below 100 individuals produced inadequate parameter recovery, samples of 100 or 500 individuals were justifiable under certain conditions. However, samples of only 100 individuals carried a slight risk of non-convergence. The choice of the equating method had rather minor effects and the impact of the IRT model was ambivalent.

Discussion: These findings highlight the importance of using distractor-free response formats without any guessing probability, anchor items with high discrimination parameters, and large samples to ensure valid test equating. For individual research and test application purposes, we provide a comprehensive data set covering multiple factor levels and a step-by-step simulation guide.

1 Introduction

Test equating comprises a class of methods to ensure the comparability of different test sets when administering them to identical samples is not feasible. Test comparability is a prerequisite for validity and test fairness and is therefore crucial for panel studies or group assessments with different test sets (Battauz, 2013; Cook and Eignor, 1991; Kolen and Brennan, 2014). A typical approach to test equating is the use of anchor items, which place the item parameters of different test sets on a common scale. The goal of this simulation study was to investigate, using an IRT parameter recovery approach, which factors determine how well anchor item-based test equating performs.

1.1 Necessity of multiple test sets

Psychological tests must adhere to high psychometric standards to be applied reasonably in diagnostics and research. Test development is therefore an important process, which often proves to be time-consuming. In particular, the development of parallel test sets can require large amounts of effort. However, parallel tests are essential for test repetition and integrity. Especially in performance testing, practice effects in within-subject designs can occur when the same test is applied repeatedly (Hausknecht et al., 2007; Kulik et al., 1984). This can lead to biased results since in the re-test participants may need fewer cognitive resources to solve items which they had not solved at the first measurement (Ackerman, 1987). Given differentially varying increases in the variable of interest, mean test score differences become unrepresentative of real changes in the underlying variable and thus decrease the validity of the test (e.g., Hausknecht et al., 2002). In between-subject designs, when larger samples are studied over a longer timeframe, test material might become public. Under such circumstances, some participants can prepare specifically for the test with the leaked test materials. As a result, it is not clear whether the test assesses true ability or whether the result is biased by factors such as motivation for preparation or economic capability to obtain or purchase materials. These effects make it necessary to create parallel test sets or at least to constantly renew and standardize the items of a test to ensure validity and test fairness.

1.2 Challenges in the application of multiple test sets

Creating parallel test sets is a difficult endeavor. Not only must test developers create twice as many items, they also must ensure that the parallel test sets have equal psychometric properties. Failure to provide this equality can result in test sets of differing difficulty, which produce mean differences between test takers that are unrelated to true differences in the underlying construct. Since this artificially alters the actual ranking order of the test takers in the construct of interest, test fairness and validity might be impaired. It is therefore mandatory to validate the item pool from which various test sets are generated in a broad sample. Only if the difficulties of all items are measured on a common scale is it possible to determine the ability of a tested individual independently of the respective test set. For this purpose, item response theory (IRT; Birnbaum, 1968; Rasch, 1960) can be applied: in IRT models, ability parameters (e.g., intelligence) and item parameters (e.g., item difficulty) are determined by maximum likelihood estimation based on the results of several participants on several items. There are three common IRT models that differ in the number of parameters they contain. The simplest and most restrictive model is the Rasch model (1PL model), which only takes the item difficulty and the participant's ability into account. The probability $p_{xi}$ that a participant x with ability $\theta_x$ solves an item i with difficulty $b_i$ is

$p_{xi}(\theta_x, b_i) = \frac{\exp(\theta_x - b_i)}{1 + \exp(\theta_x - b_i)}$

While the 1PL model assumes that all items have the same capacity to discriminate between participants of differing ability, the 2PL model is less restrictive and accounts for the more common case of item-specific discrimination parameters $a_i$:

$p_{xi}(\theta_x, a_i, b_i) = \frac{\exp[a_i(\theta_x - b_i)]}{1 + \exp[a_i(\theta_x - b_i)]}$

Finally, the 3PL model additionally considers guessing probabilities $c_i$ (e.g., when multiple-choice items are used):

$p_{xi}(\theta_x, a_i, b_i, c_i) = c_i + (1 - c_i)\frac{\exp[a_i(\theta_x - b_i)]}{1 + \exp[a_i(\theta_x - b_i)]}$
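To make these response functions concrete, the following base R sketch (our own illustration; the function name p_irt is hypothetical) evaluates them, with the 1PL and 2PL models as special cases of the 3PL formula.

```r
# Probability of a correct response under the 1PL, 2PL, and 3PL models.
# Setting a = 1 and c = 0 yields the 1PL (Rasch) model; c = 0 alone yields the 2PL model.
p_irt <- function(theta, b, a = 1, c = 0) {
  c + (1 - c) * plogis(a * (theta - b))  # plogis(x) = exp(x) / (1 + exp(x))
}

# Example: an average-ability participant (theta = 0) on an item of medium
# difficulty (b = 0) with discrimination a = 1.2 and guessing probability 0.25:
p_irt(theta = 0, b = 0, a = 1.2, c = 0.25)  # 0.625
```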

However, the resulting parameters depend on the sample that has taken the test: in a highly able sample, items are solved by more participants, resulting in lower estimates of the item difficulties than in a less able sample. In practice, despite randomization, it is hardly possible to collect samples of exactly equal ability, and even if samples could be matched on a broad range of criteria, people might differ in several confounding factors (e.g., how quickly they tire). As a result, the estimated parameters are unequally scaled, and the test sets are not comparable (Battauz, 2015). To obtain equally scaled item difficulties, an item pool would need to be completed by a single sample. However, this is only possible in theory. In fact, the longer a test takes, the fewer participants will work on all test items. Dropout and careless responding (e.g., marking response options at random) might occur more frequently, which would impair parameter estimates. Furthermore, ethical standards such as those set out in the ITC guidelines (International Test Commission, 2001) require that a test must not impose unreasonable strain on the test takers.

1.3 Anchor items as an approach to test equating

To address this issue, various test equating strategies (Cook and Eignor, 1991; Kolen and Brennan, 2014) have been developed in an effort to establish the comparability of different test sets without the need to assess a single sample that completes all items from an item pool. The most common strategy is the application of anchor items. Anchor items are shared across different test sets, which allows the transformation of item parameters from one test set onto the scales of other test sets. There are two types of anchor item-based test equating (Battauz, 2015): direct and indirect equating. In direct equating, all test sets contain the same set of anchor items. Based on the anchor items, equating coefficients are determined that serve to transform the parameters of one test set into the parameters of another test set by linear transformation following the function

$\hat{b}_{2i} = f(b_{1i}) = A_{1,2}\, b_{1i} + B_{1,2}$

where $\hat{b}_{2i}$ is the estimated difficulty of item i in test set 2 with an empirical difficulty $b_{1i}$ from test set 1. There are five common methods to compute the equating coefficient $A_{1,2}$ and the equating constant $B_{1,2}$: the mean-sigma (MS) method, the mean-mean (MM) method, the mean-geometric-mean (MGM) method, the item response function (IRF) method, and the test response function (TRF) method. The equating coefficient $A_{1,2}$ is computed by

$A_{1,2} = \frac{\sigma_{b_2}}{\sigma_{b_1}}$

for the MS method (cf., Kolen and Brennan, 2014; Marco, 1977), by

$A_{1,2} = \frac{\sum_{i=1}^{l_{1,2}} a_{1i}}{\sum_{i=1}^{l_{1,2}} a_{2i}}$

for the MM method (cf. Battauz, 2015), and by

$A_{1,2} = \left(\prod_{i=1}^{l_{1,2}} \frac{a_{1i}}{a_{2i}}\right)^{\frac{1}{l_{1,2}}}$

for the MGM method (cf. Battauz, 2015), where $l_{1,2}$ is the number of anchor items shared by test set 1 and test set 2, $a_{1i}$ and $a_{2i}$ are the empirical discrimination parameters of the anchor items within test set 1 and test set 2, respectively, and $\sigma_{b_1}$ and $\sigma_{b_2}$ are the standard deviations of the anchor item difficulties of test set 1 and test set 2, respectively. The equating constant $B_{1,2}$ is computed by

$B_{1,2} = \frac{\sum_{i=1}^{l_{1,2}} b_{2i} - A_{1,2} \sum_{i=1}^{l_{1,2}} b_{1i}}{l_{1,2}}$

where $b_{1i}$ and $b_{2i}$ are the empirical difficulties within test set 1 and test set 2, respectively.
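As an illustration of these closed-form methods, the following base R sketch (our own code; function and argument names are hypothetical) computes $A_{1,2}$ and $B_{1,2}$ from the anchor item estimates of two test sets and applies the resulting transformation, assuming the direction used above (parameters of test set 1 are placed on the scale of test set 2).

```r
# Equating coefficient A and constant B from the anchor item estimates of two
# test sets, following the mean-sigma (MS), mean-mean (MM), and
# mean-geometric-mean (MGM) formulas given above.
equating_coefs <- function(a1, b1, a2, b2, method = c("MS", "MM", "MGM")) {
  method <- match.arg(method)
  A <- switch(method,
    MS  = sd(b2) / sd(b1),                 # ratio of anchor difficulty SDs
    MM  = sum(a1) / sum(a2),               # ratio of summed anchor discriminations
    MGM = prod(a1 / a2)^(1 / length(a1))   # geometric mean of discrimination ratios
  )
  B <- mean(b2) - A * mean(b1)             # equating constant
  c(A = A, B = B)
}

# Place the parameters of test set 1 on the scale of test set 2:
# difficulties are rescaled linearly, discriminations are divided by A.
transform_params <- function(a, b, coefs) {
  list(a = a / coefs[["A"]], b = coefs[["A"]] * b + coefs[["B"]])
}
```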

The IRF method is based on the Haebara method (Haebara, 1980) and minimizes the following loss function to obtain the equating coefficient $A_{1,2}$ and the equating constant $B_{1,2}$:

$f(A_{1,2}, B_{1,2}) = \frac{1}{2} \int \sum_{i=1}^{l_{1,2}} \left[p_{2i}(\theta, a_{2i}, b_{2i}, c_{2i}) - p_{1i}(\theta, a'_{1i}, b'_{1i}, c_{1i})\right]^2 h(\theta)\, d\theta$

where h(θ) is the density of a standardized variable,

$a'_{1i} = \frac{a_{1i}}{A_{1,2}}$

and

$b'_{1i} = A_{1,2}\, b_{1i} + B_{1,2}$

Finally, the TRF method, based on the Stocking-Lord method (Stocking and Lord, 1983), minimizes the following loss function:

$f(A_{1,2}, B_{1,2}) = \frac{1}{2} \int \left\{\sum_{i=1}^{l_{1,2}} \left[p_{2i}(\theta; a_{2i}, b_{2i}, c_{2i}) - p_{1i}(\theta; a'_{1i}, b'_{1i}, c_{1i})\right]\right\}^2 h(\theta)\, d\theta$
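The IRF and TRF coefficients have no closed form and are obtained by numerical minimization. The sketch below is a simplified base R illustration of the IRF (Haebara) criterion only, not the implementation used in this study: the integral is approximated on a quadrature grid with standard normal weights and minimized with optim(), and the constant 1/2 is omitted because it does not change the minimizer.

```r
# Item response function as defined earlier (3PL; 2PL when c = 0).
p_irt <- function(theta, b, a = 1, c = 0) c + (1 - c) * plogis(a * (theta - b))

# Haebara (IRF) criterion: squared differences between the anchor items'
# response functions once test set 1 is placed on the scale of test set 2,
# integrated over an N(0, 1) density approximated by a grid of points.
irf_loss <- function(par, a1, b1, c1, a2, b2, c2,
                     theta = seq(-4, 4, length.out = 61)) {
  A <- par[1]; B <- par[2]
  w   <- dnorm(theta); w <- w / sum(w)   # normalized quadrature weights
  a1s <- a1 / A                          # rescaled anchor discriminations
  b1s <- A * b1 + B                      # rescaled anchor difficulties
  diff2 <- sapply(seq_along(a1), function(i)
    (p_irt(theta, b2[i], a2[i], c2[i]) - p_irt(theta, b1s[i], a1s[i], c1[i]))^2)
  sum(w * rowSums(diff2))
}

# Minimize over A and B, starting from the identity transformation;
# a1, b1, c1 and a2, b2, c2 hold the anchor item estimates of the two test sets.
# fit <- optim(c(1, 0), irf_loss, a1 = a1, b1 = b1, c1 = c1,
#              a2 = a2, b2 = b2, c2 = c2)
# fit$par  # estimated A and B
```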

While direct equating is easy to administer, it is only suitable in situations when all test sets are administered to the samples simultaneously. In case of a delay between test sessions, item leakage becomes possible, and the utility of the anchor items might deteriorate. This may result in invalid parameter estimates for the rest of the item set. In contrast, indirect equating proves to be less susceptible to such biases since not every test set contains the same anchor items, but the sets are linked via chains of different anchor items. To this end, based on the direct equating coefficients, a linear function of the form

$\hat{b}_{3i} = f(b_{1i}) = A_c\, b_{1i} + B_c$

can be set up to transform the empirical difficulty parameters of test set 1 into the scale of test set 3 with the equating coefficient

$A_c = A_{1,2,\ldots,m} = \prod_{j=2}^{m=3} A_{j-1,j} = A_{1,2} A_{2,3}$

and the equating constant

$B_c = B_{1,2,\ldots,m} = \sum_{j=2}^{m=3} B_{j-1,j}\, A_{j,\ldots,m} = B_{1,2} A_{2,\ldots,m} + B_{2,3} A_{3,\ldots,m}$

where

$A_{j,\ldots,m} = \prod_{h=j+1}^{m=3} A_{h-1,h}$, so that $A_{2,\ldots,m} = A_{2,3}$ and $A_{3,\ldots,m} = 1$ (empty product)

(cf. Battauz, 2015). Based on simulated data, Battauz (2017) found that the choice of the equating method (MM, MGM, IRF, TRF) does not affect the parameter estimates. However, there were some limitations: The IRT modeling was based on the 2PL model only and did not consider the possible effect of the guessing probability. Further, the absolute difference between the estimated and true values was taken as the outcome variable. However, in some cases, the correlation between the estimated and true values (i.e., the stability of the IRT parameters) might be more important in terms of validity and test fairness (e.g., student selection tests where the individual result in comparison to the results of the other participants is crucial). Finally, the anchor item proportion did not vary since the test set size was fixed to 40 items and the number of anchor items to five. It is plausible that the anchor item proportion might affect the accuracy of test equating since more anchor items contain more information.
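To illustrate the indirect equating described above: composing the two direct linear transformations (test set 1 to 2, and test set 2 to 3) yields the chain coefficients directly. The following minimal base R sketch (our own code, with hypothetical coefficient values) makes this explicit.

```r
# Chain (indirect) equating across three test sets: placing set 1 on the scale
# of set 2 and then set 2 on the scale of set 3 composes to one linear map.
chain_coefs <- function(A12, B12, A23, B23) {
  c(A = A12 * A23,          # A_c = A_{1,2} * A_{2,3}
    B = A23 * B12 + B23)    # B_c = A_{2,3} * B_{1,2} + B_{2,3}
}

# Example with hypothetical direct equating coefficients:
chain_coefs(A12 = 1.10, B12 = 0.20, A23 = 0.95, B23 = -0.10)
#     A     B
# 1.045 0.090
```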

1.4 Number of anchor items

Obviously, if only one anchor item were used, the resulting parameter estimates would be prone to bias and lack stability. It is commonly assumed that the more anchor items are applied, the better test equating works. However, apart from some rules of thumb, only a few empirical studies have addressed the question of how many anchor items are required for test equating. Coffman (1971) recommends adopting 20 items of the original test set when creating a new test set. Alternatively, test sets should share at least 20% of the total item set if this results in more than 20 anchor items. Hills et al. (1988) compared the results of test equating based on six anchor item sets of different sizes, while the number of unshared items was fixed at 45. Test equating worked well based on sets of 10, 15, 20, 25, and 30 anchor items, whereas a set of five anchor items failed to produce satisfactory results. Yang and Houang (1996) compared results based on anchor item sets of 12, 20, and 30 items, while the total test set size was 60 items in each case. They concluded that all three anchor item sets led to similar results. Finally, in an extensive simulation study, Sinharay and Holland (2007) compared the equating results of different test set sizes with different numbers of anchor items. They showed that equating performs better for larger test sets with more anchor items. However, they fixed the combinations of test set size and anchor item number and did not recombine them, resulting in only three factor levels with nearly the same relative anchor item proportion (41.67–44.87%). Further, the smallest set consisted of 20 anchor items. For tests with large numbers of items and short item processing times, these recommendations may be feasible. However, if tests are short, containing, for instance, 30 items that require a longer processing time (e.g., figural matrix tests where an item puts high cognitive demands on the participants), then different test sets might consist of more shared than distinct items. Under such circumstances, the goal of creating different test sets to prevent test manipulation is prone to fail.

1.5 Impact of the anchor item discrimination parameter

In addition to the number of anchor items used, there has been research on the impact of the discrimination parameters of the anchor items: in the context of differential item functioning (DIF), Lopez Rivas et al. (2009) varied the discrimination parameters of the anchor items. In a simulation study, they generated sets of one, three, and five anchor items with either low (a = 0.60) or high (a = 1.20) discrimination parameters. They found that the more highly discriminating anchor items yielded greater power to detect predefined items with DIF. Further, they reported diminishing gains in power when adding items, in line with a saturation effect. Similarly, Meade and Wright (2012) recommend employing up to five invariant items with large discrimination parameters. Finally, MacCallum et al. (1999) found that higher communalities (which can be interpreted as a factor analytical analog to IRT discrimination parameters) facilitate accurate recovery of population parameters.

1.6 Impact of the sample size

Previous studies comparing equating methods have largely been conducted under idealized simulation conditions, such as large sample sizes and samples with similar latent means and variances. It is generally recommended that stable IRT parameter estimation requires sample sizes of n = 500 for 2PL models and n = 1,000 for 3PL models (e.g., De Ayala, 2013). However, for samples of 500 (1,000) individuals, the expected sampling error in the sample mean is 0.04 (0.03), and in the sample variance it is 0.06 (0.04). This degree of variation is comparatively small and makes it easy for the equating methods to recover the latent trait. To meaningfully challenge equating methods and determine their performance under realistic instability, reduced sample sizes of fewer than 500 (1,000) individuals with differing latent means and variances would be necessary. Such instability would be expected to increase the importance of the number of anchor items, as they would help to better capture the variation in trait means and variances.
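These expected sampling errors follow from standard large-sample approximations for a standard normal trait; the short base R check below (our own code, not part of the study materials) reproduces the reported values.

```r
# Approximate standard errors of the sample mean and the sample variance
# for a standard normal latent trait (mu = 0, sigma = 1).
se_mean <- function(n) 1 / sqrt(n)        # SE of the sample mean
se_var  <- function(n) sqrt(2 / (n - 1))  # SE of the sample variance (normal case)

round(se_mean(c(500, 1000)), 2)  # 0.04 0.03
round(se_var(c(500, 1000)), 2)   # 0.06 0.04
```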

1.7 The present study

To the best of our knowledge, there is no systematic state-of-the-art analysis that simultaneously determines the effects of varying characteristics of the previously discussed key influences on test equating, including challenging conditions such as small sample sizes. In order to answer this question as validly as possible and to provide recommendations on required test equating characteristics, very large samples are necessary. To address this issue, simulation studies are an appropriate methodological approach. Simulation studies have several advantages (Hallgren, 2013; Morris et al., 2019): they prove to be more economical than empirical studies because no time is needed to recruit participants and collect data. In addition, simulation studies are particularly convenient for comparing statistical methods: if, for example, the true value of a parameter is already known, it is possible to find out which statistical method can be applied to estimate it with the highest accuracy. Since the goal of this study was to ensure the comparability of different test sets, we used simulated data to examine the stability of IRT parameters (i.e., discrimination parameter, item difficulty, and ability parameter) under various test equating conditions. In this study, we defined stability as the correlation and deviation between the true (i.e., predefined) parameters and the parameters estimated through test equating in terms of parameter recovery. We investigated the influence of (1) the equating method, (2) the IRT model and the magnitude of the guessing probability, (3) the anchor item proportion and test set size, and (4) the sample size. To address these research questions, we formulated nine hypotheses (H1 to H9):

(H1) The choice of the equating method does not affect the stability of the IRT parameters (i.e., discrimination parameter, item difficulty and ability parameter). This would be in line with Battauz (2017) who showed similar outcomes of several test equating methods for the discrimination parameter and item difficulty.

(H2) The IRT model (2PL vs. 3PL) has an influence on the stability of the parameters. Since the 3PL model takes the guessing probability into account, we expect it to result in better parameter recovery than the 2PL model.

(H3) The magnitude of the guessing probability c has a negative influence on stability, since responding to items at random does not reflect participants' true ability. We would expect a linear decrease of the guessing probability to result in a logarithmic increase in stabilities in terms of a saturation effect. However, in practice the guessing probability does not decrease linearly when further response options are added to an item but follows the function c = f(d) = d^{-1} with d = number of response options (i.e., c_{d=2} = 0.500, c_{d=3} = 0.333, c_{d=4} = 0.250, etc.). Since combining this function with the logarithmic saturation function yields an approximately linear relationship, we expect a linear decrease of the guessing probability to result in a linear increase of the stabilities.

(H4) Linear increases of the proportion of anchor items within a test set result in logarithmic increases of the stabilities. The anchor items serve as a common denominator for different test sets. Based on their parameters, the parameter estimates of the remaining items are computed. As parameter estimates are more robust the more reference values are used for estimation, we expected the stabilities of the IRT parameters to increase along with the anchor item proportion. We assumed that the estimation gains in stability especially when only a few anchor items are included in the test sets. If the test sets already contain many anchor items, additional anchor items should result in less additional stability, following the law of diminishing marginal utility (Marshall and Guillebaud, 1961). Lopez Rivas et al. (2009) found initial evidence for this diminishing gain when adding anchor items. Therefore, we expected a logarithmic trend for the increase of stability with increasing anchor item proportion.

(H5) Linear increases of the test set size result in logarithmic increases of the stabilities. If only a few anchor items are used, it is possible that some might coincidentally be extremely easy or extremely difficult and thus not representative of the total set of items. If the parameters of the remaining items are estimated based on such anchor items, it can be assumed that they are less stable compared to representative items. If, on the other hand, more anchor items are used, random extreme parameters average out, resulting in higher stability of the parameters. Fixing the anchor item proportion to a constant value (e.g., 10%), the absolute number of anchor items varies (e.g., 2 vs. 10) depending on the test set size (e.g., 20 vs. 100). Sinharay and Holland (2007) found an advantage of larger test sets over smaller test sets with a similar anchor item proportion. Thus, we hypothesized that the stability of the IRT parameters increases with the test set size. Analogously to H3, we expected, in terms of the law of diminishing marginal utility, a decreasing gain in stability with increasing test set size and thus with an increasing number of anchor items. Consequently, we expected a logarithmic trend of the test set size as well.

(H6) There is an interaction effect between the factors test set size and anchor item proportion. Based on H4 and H5, we assumed that the differences between the anchor item proportions decrease with increasing test set size due to a saturation effect based on the high absolute number of anchor items in large test sets. Therefore, we expected the absolute anchor item number to be a substantial predictor of the stability.

(H7) Higher discrimination parameters of the anchor items result in higher correlations and smaller deviations between the true IRT parameters and the parameters estimated through test equating. The discrimination parameter is the capability of an item to differentiate between individuals with different ability parameters. Lopez Rivas et al. (2009) and Meade and Wright (2012) found evidence for the advantage of anchor items with high discrimination parameters in detecting items with DIF. Transferred to test equating, we expected the estimation of the IRT parameters to be more accurate based on anchor items with high discrimination parameters.

(H8) Larger sample sizes result in a logarithmic increase of the stability of IRT parameters. Typically, sample sizes of n = 500 (n = 1,000) individuals are recommended for 2PL (3PL) models. When sample sizes fall below these recommendations, models should fail to converge more frequently and parameter recovery should deteriorate, as increased sampling error and single outliers may bias the estimated latent traits. Consistent with the effects of anchor item proportion and the test set size, we expected a saturation effect when increasing the sample size.

(H9) There is an interaction effect between sample size and anchor item proportion. In small samples, the characteristics of the anchor items are estimated with lower precision. Hence, a larger number of anchor items could help average out this reduced precision. In contrast, in large samples, the characteristics of single anchor items are estimated more precisely, so increasing the anchor item proportion would provide comparatively less additional precision.

2 Methods

2.1 Sample

We simulated samples of five different sizes: to examine test equating under challenging conditions, we generated samples of n = 25, 50, and 100 individuals. To assess performance under adequate IRT modeling conditions, we additionally generated samples of n = 500 and 1,000 individuals. In each iteration, the latent traits of the three samples were drawn from normal distributions with means Mθ = −0.50, 0.00, and +0.50 and standard deviations SDθ = 0.80, 1.00, and 1.20, in order to manipulate heterogeneity in latent means and variances.

2.2 Independent variables

To address H1, we varied the equating method. Test equating was conducted using the MS, MM, MGM, IRF, and TRF methods. To address H2, we applied two IRT models: the 2PL model (varying discrimination parameters) and the 3PL model (additionally accounting for the guessing probability). To address H3, we manipulated the guessing probability: we included the most common cases of no guessing probability (e.g., in a test with distractor-free response formats) and of four (c_{d=4} = 0.250), six (c_{d=6} = 0.167), and eight (c_{d=8} = 0.125) response options. To address H4 to H6, we manipulated both the proportion of anchor items within the test sets and the test set size: proportions of 5, 10, 15, 20, and 25% were implemented within test sets of 20, 40, 60, and 80 items. Indirect equating was used to achieve comparability across three test sets. To ensure equal conditions for all factor combinations, within one iteration the same samples were taken for each of them.
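For orientation, fully crossing the four between-scenario factors (test set size, anchor item proportion, sample size, and guessing probability) yields the 400 scenarios described in the next section; a minimal base R sketch (our own code, with our own labels for the factor levels):

```r
# The 400 equating scenarios arise from fully crossing the four
# between-scenario factors; equating method and IRT model are varied
# within each scenario.
scenarios <- expand.grid(
  tss = c(20, 40, 60, 80),                # test set size
  aip = c(0.05, 0.10, 0.15, 0.20, 0.25),  # anchor item proportion
  n   = c(25, 50, 100, 500, 1000),        # sample size
  c   = c(0, 0.125, 0.167, 0.250)         # guessing probability
)
nrow(scenarios)  # 400
```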

2.3 Simulation procedure and test equating

We simulated data for 400 different equating scenarios, corresponding to all combinations of the factors test set size × anchor item proportion × sample size × guessing probability. Within each scenario, the “true” discrimination parameters ai were drawn from a log-normal distribution with Ma = 1.00 and SDa = 0.20 to ensure positive values, while item difficulties bi were drawn from a normal distribution (Mb = 0.00, SDb = 1.00). For each scenario, we conducted 1,000 iterations, meaning that the true IRT parameters were newly generated 1,000 times. Participants' responses were simulated based on these item parameters and the ability parameters. We used the R (R Core Team, 2021) package catIrt (Nydick, 2014), which produces binary outcomes indicating whether each item was solved (1) or not (0). Item parameters were then estimated on a test set-specific scale, first by means of 2PL and second by means of 3PL modeling, using the mirt package (Chalmers, 2012). Based on the anchor items, the five test equating methods were performed with the package equateMultiple (Battauz, 2021). This resulted in a common item parameter scale across the three linked test sets. Figure 1 provides an overview of the independent variables and their dependencies.

Figure 1. Independent variables and their dependencies within the simulation. The figure shows the six independent variables: TSS, test set size; AIP, anchor item proportion; c, guessing probability; n, sample size; IRT model; and equating method. 400 scenarios were simulated as a combination of test set size × anchor item proportion × guessing probability × sample size. Within a scenario, discrimination and difficulty parameters of the items and ability parameters of the simulees were drawn. Based on these parameters, the simulees' responses to the items were generated. From these responses, in each scenario IRT parameters were estimated with a 2PL as well as with a 3PL model. With the resulting test-specific parameters, five equating methods were applied to obtain parameters on a common scale.
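For illustration, a single iteration of one scenario might look like the following sketch. This is our own simplified code, not the script provided on the OSF: the mapping of Ma and SDa onto the rlnorm() arguments is one possible reading, responses are generated directly from the 2PL formula rather than with catIrt, and the equating step with equateMultiple is omitted.

```r
library(mirt)

set.seed(1)
n_items <- 40; n_persons <- 500

# Draw "true" parameters (one possible reading of Ma = 1.00, SDa = 0.20).
a     <- rlnorm(n_items, meanlog = 0, sdlog = 0.2)  # positive discriminations
b     <- rnorm(n_items, 0, 1)                       # item difficulties
theta <- rnorm(n_persons, 0, 1)                     # person abilities

# Simulate binary responses under the 2PL model (a guessing parameter could be
# added for the 3PL case, following the formula in the Introduction).
p    <- plogis(outer(theta, b, "-") * rep(a, each = n_persons))
resp <- (matrix(runif(n_persons * n_items), n_persons, n_items) < p) * 1
colnames(resp) <- paste0("item", seq_len(n_items))

# Estimate item and person parameters on a test-set-specific scale with mirt.
fit       <- mirt(resp, 1, itemtype = "2PL", verbose = FALSE)
items_hat <- coef(fit, IRTpars = TRUE, simplify = TRUE)$items  # columns a, b, g, u
theta_hat <- fscores(fit)                                      # EAP ability estimates

# The anchor item estimates of the three linked test sets would then be passed
# to the equating step (e.g., equateMultiple) to place all parameters on a
# common scale; that step is omitted here.
```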

2.4 Dependent variables

To quantify the stability of the IRT parameters, we computed common indices of parameter recovery: First, we calculated the correlations between the true and the estimated values for each IRT parameter as an association-based recovery index. Although focusing on the correlation is informative in diagnostic settings in which an individual's test result is interpreted relative to other test takers (e.g., in student selection tests), this criterion can mask distortions arising from inherent identification issues in IRT models (as discussed in Noventa et al., 2024): Because IRT models allow multiple equivalent parameterizations, systematic underestimation (overestimation) of discrimination parameters can lead to an inflation (deflation) of the latent trait variance. Likewise, downward (upward) bias in item difficulties can shift the ability estimates downward (upward). As a result, correlations between true and estimated values can remain high even when the ability parameter can no longer be interpreted as a standardized score. Therefore, in addition to the correlation as an association-based recovery index, we computed bias, Mean Absolute Error (MAE), and Standard Error of Estimate (SEE) as error-based recovery indices. Finally, we calculated Root Mean Square Error (RMSE), which captures bias, absolute error, and the linearity between true and estimated values (Roberts and Laughlin, 1996).
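A minimal helper for these recovery indices might look as follows; this is our own sketch with hypothetical names, and computing the SEE as the standard deviation of the estimation errors is one common operationalization rather than the study's exact definition.

```r
# Association- and error-based recovery indices between true and estimated
# parameter values.
recovery_indices <- function(true, est) {
  err <- est - true
  c(r    = cor(true, est),      # association-based recovery
    bias = mean(err),           # mean signed deviation
    MAE  = mean(abs(err)),      # mean absolute error
    SEE  = sd(err),             # SD of the errors (one way to define the SEE)
    RMSE = sqrt(mean(err^2)))   # root mean square error
}

# Example: recovery_indices(true = b, est = items_hat[, "b"])
```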

2.5 Statistical analysis

Due to the large amount of simulated data, extreme values that are not plausible under real conditions were expected. Therefore, for each iteration, the values in the upper and lower 2.5% quantiles were removed, and the inner 95% of the 1,000 values were retained for analysis. In line with our hypotheses, we estimated a linear mixed model (LMM) for each parameter recovery index with the following fixed effects: equating method (linear), IRT model (linear), guessing probability (linear), anchor item proportion (logarithmic), test set size (logarithmic), interaction effect of the anchor item proportion (logarithmic) and the test set size (logarithmic), discrimination parameter of the anchor items (linear), sample size (logarithmic), and interaction effect of the sample size (logarithmic) and the anchor item proportion (logarithmic). To account for repeated measurements and for an imbalance of the number of converged iterations between models, we additionally specified random intercepts for the scenario (test set size × anchor item proportion × sample size × guessing probability) as well as for the iteration within a scenario (1 to 1,000). This resulted in the following model equation (illustrated for the RMSE of the item difficulty):

$\text{RMSE}_{b,isr} \sim b_0 + \mathbf{b}_1^{T} \cdot \text{method}_i + \mathbf{b}_2^{T} \cdot \text{model}_i + b_3 \cdot c_i + b_4 \cdot \ln(\text{AIP}_i) + b_5 \cdot \ln(\text{TSS}_i) + b_6 \cdot \ln(\text{AIP}_i) \cdot \ln(\text{TSS}_i) + b_7 \cdot \text{MAA}_i + b_8 \cdot \ln(n_i) + b_9 \cdot \ln(n_i) \cdot \ln(\text{AIP}_i) + u_s + v_{sr} + \varepsilon_{isr}$

with c = guessing probability, ln = natural logarithm, AIP = anchor item proportion, TSS = test set size, MAA = mean anchor item discrimination parameter, n = sample size, us = random intercept of the scenario (combination of test set size × anchor item proportion × sample size × guessing probability), vsr = random intercept of an iteration within a scenario, and εisr = residual error term. Data analyses were conducted using the R packages lme4 (Bates et al., 2015) and lmerTest (Kuznetsova et al., 2017). Given the large number of significance tests, we corrected for multiple testing: we applied Benjamini-and-Hochberg correction (Benjamini and Hochberg, 1995) covering all model tests to control the False Discovery Rate (FDR) at q = 0.050, ensuring that the proportion of false positives among significant results remained below 5%. Our simulations, the resulting data set, and the R script for analyzing the data are provided in our OSF repository: https://osf.io/np7k9/?view_only=dcb2b2b1bb02426e95b882f48378aa81.
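In lme4/lmerTest syntax, this model could be specified roughly as follows; the data frame and variable names are hypothetical, and the sketch assumes one row per converged model.

```r
library(lme4)
library(lmerTest)  # adds p-values for the fixed effects

# RMSE of the item difficulty regressed on the fixed effects described above,
# with random intercepts for the scenario and for the iteration within it.
fit_rmse_b <- lmer(
  rmse_b ~ method + model + guess + log(aip) * log(tss) + maa +
           log(n) * log(aip) +
           (1 | scenario) + (1 | scenario:iteration),
  data = simdat
)
summary(fit_rmse_b)

# Benjamini-Hochberg correction across the collected p-values:
# p_adjusted <- p.adjust(p_values, method = "BH")
```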

2.6 Post-hoc analyses

Even when average parameter recovery is satisfactory, results are only meaningful if model convergence occurs with sufficient probability. Hence, as a post-hoc analysis, we examined the conditions under which model convergence is likely. We therefore computed the mean convergence rate (CR ∈ [0, 1]) and the mean number of converged test sets (range 0 to 3) per scenario (test set size × anchor item proportion × sample size × guessing probability) and tested the effects of the IRT model, the guessing probability, the anchor item proportion, the test set size, and the sample size for significance using an LMM. Since each scenario (i.e., the same sample) contributed two convergence statistics (one regarding the 2PL, one regarding the 3PL model), we specified the scenario as a random intercept. This resulted in the following model equation:

$\text{convergence}_{is} \sim b_0 + \mathbf{b}_1^{T} \cdot \text{model}_i + b_2 \cdot c_i + b_3 \cdot \text{AIP}_i + b_4 \cdot \text{TSS}_i + b_5 \cdot n_i + u_s + \varepsilon_{is}$

where c = guessing probability, AIP = anchor item proportion, TSS = test set size, n = sample size, us = random intercept of the scenario, and εis = residual error term. As in the main analysis, we applied Benjamini-and-Hochberg correction to control the FDR at q = 0.050.
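The corresponding lme4 specification might look like this (again a sketch with hypothetical variable names, one row per scenario × IRT model combination):

```r
library(lmerTest)  # loads lme4's lmer with p-values

# Convergence outcome (CR or number of converged test sets) predicted by the
# IRT model and the scenario factors, with a random intercept per scenario.
fit_cr <- lmer(cr ~ model + guess + aip + tss + n + (1 | scenario),
               data = convdat)
summary(fit_cr)
```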

3 Results

3.1 Descriptive statistics

Across all equating scenarios, the mean correlation between true and estimated discrimination parameters was r = 0.55. Mean bias was 0.46, mean RMSE was 1.25, mean MAE was 0.80, and mean SEE was 1.11. For item difficulties, the mean correlation was r = 0.81, mean bias was 0.24, mean RMSE was 8.11, mean MAE was 4.68, and mean SEE was 7.39. For ability parameters, the mean correlation was r = 0.83, mean bias was −0.16, mean RMSE was 8.01, mean MAE was 4.62, and mean SEE was 6.90. Tables 1–3 present the mean parameter recovery indices across predictor levels.

Table 1. Descriptive statistics of the parameter recovery criteria for the discrimination parameter.

Table 2. Descriptive statistics of the parameter recovery criteria for the item difficulty.

Table 3. Descriptive statistics of the parameter recovery criteria for the ability parameter.

3.2 Effects on discrimination parameters

For better readability, in the following we primarily report the effects on the correlation and the RMSE, in particular how they increase or decrease when one of the predictors changes. Tables 4–6 show the results of the LMM analysis for each parameter recovery criterion of each IRT parameter. The equating method significantly predicted the correlation between the true and estimated discrimination parameters, although the effect was comparatively small: switching from the MS to the MM method increased the correlation by Δr = 0.01. Error-based recovery indices were significantly lower when using the MGM method compared with the other methods. Switching from the 2PL to the 3PL model increased the correlation of the discrimination parameters substantially (Δr = 0.07), although error-based recovery indices deteriorated (e.g., ΔRMSE = 1.17). Increasing the guessing probability strongly impaired parameter recovery (Δr = −0.08 and ΔRMSE = 0.73 for an increase of Δc = 0.10). Recovery of discrimination parameters improved with larger anchor item proportions, test set sizes, and sample sizes (Δr = 0.02 | 0.05 | 0.11 and ΔRMSE = −0.39 | −0.84 | −0.85 when doubling anchor item proportion | test set size | sample size). However, there was a significant interaction between anchor item proportion and test set size: the effect of increasing the anchor item proportion diminished as test set size grew, br = −0.02, t(386.71) = −9.93, p < 0.001 and bRMSE = 0.47, t(394.40) = 3.29, p = 0.002. A similar diminishing effect occurred for the interaction between anchor item proportion and sample size with respect to the error-based recovery indices, bRMSE = 0.45, t(399.43) = 3.31, p = 0.002. Finally, higher discrimination parameters of the anchor items improved association-based recovery (Δr = 0.11 for an increase of ΔMa = 0.10) but deteriorated the error-based recovery indices (e.g., ΔRMSE = 0.32).

Table 4. Results of the linear mixed models on the discrimination parameter.

Table 5. Results of the linear mixed models on the item difficulties.

Table 6. Results of the linear mixed models on the ability parameters.

3.3 Effects on item difficulties

The choice of the equating method also significantly predicted the recovery indices of the item difficulties: the MM method led to slightly higher correlations between true and estimated item difficulties (Δr = 0.01), and especially the MGM (ΔRMSE = −10.43) and IRF (ΔRMSE = −10.47) methods produced better error-based recovery indices than the MS method. Also, the 3PL model resulted in marginally higher correlations (Δr = 0.01). Again, a higher guessing probability strongly reduced recovery quality (Δr = −0.06 and ΔRMSE = 3.38 for an increase of Δc = 0.10). Contrary to our hypotheses, increasing the anchor item proportion reduced the correlation of the item difficulties (Δr = −0.02 when doubling the anchor item proportion), although it improved error-based recovery indices (ΔRMSE = −4.24). Furthermore, test set size had no significant effect. Larger sample sizes enhanced the recovery quality of item difficulties (Δr = 0.02 and ΔRMSE = −4.75 when doubling sample size). There was an ambivalent interaction between sample size and anchor item proportion: with larger samples, increasing the anchor item proportion more strongly improved correlations, br = 0.03, t(392.92) = 6.44, p < 0.001, but less strongly improved the error-based indices, bRMSE = 4.20, t(286.45) = 3.29, p = 0.002. Increasing the discrimination parameter of the anchor items enhanced correlations strongly (Δr = 0.10 for an increase of ΔMa = 0.10), but not the error-based recovery indices.

3.4 Effects on ability parameters

Switching from the MS method to the IRF or TRF method improved the correlation between true and estimated ability parameters substantially (Δr = 0.06 | 0.05), whereas switching to the MM or MGM method improved the error-based recovery indices (ΔRMSE = −17.38 | −15.22). Ability parameters were recovered nearly equally well under the 2PL and 3PL models, while lower guessing probabilities again enhanced recovery quality (Δr = −0.04 for Δc = 0.10). Increasing anchor item proportion, test set size, and sample size improved recovery (Δr = 0.01 | 0.07 | 0.01 and ΔRMSE = −10.44 | – | −5.20 when doubling the anchor item proportion | test set size | sample size). There was an interaction between the sample size and anchor item proportion: with larger samples, the negative effect of an increasing anchor item proportion on error-based recovery indices diminished, bRMSE = 11.07, t(207.16) = 4.23, p < 0.001. Higher discrimination parameters of the anchor items improved both association-based and error-based recovery indices (Δr = 0.05 and ΔRMSE = −8.87 for ΔMa = 0.10). Table 7 summarizes the detailed predictor effects for each parameter recovery index.

Table 7. Effects of the predictors on parameter recovery.

3.5 Model convergence

Across all scenarios, 88.48% of the total models converged, and on average 2.87 out of 3 test sets per estimation converged. Convergence rates (CR) decreased with larger test set sizes, b = −0.04, t(45.96) = −6.95, p < 0.001: adding 20 items per test reduced the CR by 3.56 percentage points. In contrast, the CR increased with larger sample sizes, b = 0.09, t(43.81) = 16.28, p < 0.001: adding 100 individuals per sample enhanced the CR by 2.36 percentage points. The choice of IRT model, the guessing probability, and the anchor item proportion did not affect the CR. Figure 2 displays a heatmap of the CR across sample size × test set size combinations: for sample sizes of n = 100, a minimum CR of 0.98 (i.e., on average 98 of 100 iterations converged) was observed. Also, for the number of test sets that converged, test set size (b = −0.06, t(400.00) = −5.38, p < 0.001) and sample size (b = 0.10, t(400.00) = 10.02, p < 0.001) were significant predictors. Additionally, marginally more test sets converged under the 2PL model than under the 3PL model (b = 0.00, t(400.00) = −4.14, p < 0.001). Table 8 shows the detailed results of the linear mixed models predicting the CR and the number of converged test sets.

Figure 2. Model convergence rate depending on sample and test set size. The heatmap shows the model convergence rate depending on its two significant predictors: n, sample size; TSS, test set size. Convergence Rates below 0.80 (i.e., less than 80 out of 100 iterations converged) are colored in violet, above 0.80 in green.

Table 8. Results of the linear mixed model on model convergence.

4 Discussion

4.1 Interpretation and implications of the results

The goal of this study was to provide researchers with reliable information and methods to fully utilize the potential of test equating. For this purpose, we used computer-simulated data and a parameter recovery approach to examine the extent to which key factors of test equating affect the stability of IRT parameters. To this end, we considered the correlations between true and estimated discrimination parameters, item difficulties, and ability parameters as association-based recovery indices, and bias, RMSE, MAE, and SEE as error-based recovery indices. The overall correlation of the discrimination parameters was rather low (r = 0.55). In contrast, the correlations of both the item difficulties (r = 0.81) and the ability parameters (r = 0.83) were considerably higher. We employed five common equating methods (MS, MM, MGM, IRF, and TRF) discussed in the literature. Indeed, the choice of the equating method had an impact on how well the IRT parameters were recovered: for discrimination parameters, the MM and MGM methods led to better recovery quality; for the item difficulties, the MM, MGM, and IRF methods were advantageous; and for ability parameters, the MS method was outperformed by all the other methods. However, the effect of the equating method was relatively small, which is in line with previous research from Battauz (2017).

In line with H2, the choice of the IRT model affected parameter stability: the 3PL model produced notably higher correlations between true and estimated discrimination parameters, which is plausible since, in contrast to the 2PL model, the 3PL model takes the guessing probability into account. However, switching from the 2PL to the 3PL model deteriorated error-based recovery indices. This can be explained by the fact that in the case of the 3PL model, the slope of the item characteristic curve is systematically underestimated, and this underestimation constitutes a linear transformation of the discrimination parameter which does not affect the correlation (e.g., Baker and Kim, 2004; Lord, 1980). To investigate the role of guessing probability, we varied the number of response options to an item. Guessing probability strongly affected the stability of all three IRT parameters: in line with H3, lower guessing probabilities resulted in higher stability, with the largest improvement when the guessing probability was eliminated entirely. This might lead to the recommendation to employ items with distractor-free response formats whenever possible to enhance parameter stability.

Regarding H4 and H5, enlarging the anchor item proportion and the test set size substantially improved the stability of discrimination and ability parameters. These findings are in line with Sinharay and Holland (2007) and extend previous research by examining smaller and more flexible anchor item proportions as well as larger test set sizes, enhancing the generalizability of these effects. Furthermore, we could show that the effects of enlarging anchor item proportion and test set size on IRT parameter stability are not linear but approximately logarithmic: there was a diminishing marginal utility in terms of a saturation effect, which means that small item sets benefit more from additional items than larger item sets. Additionally, in line with H6, our results complement existing research by indicating that the benefit of increasing the anchor item proportion on the stability of the discrimination parameters decreased as test set size increased. This interaction and the diminishing marginal utility effect may guide important practical implications: in settings where resources are limited, test set size and, as a consequence, the duration of the assessment can be reduced, while this reduction can be compensated to some degree by augmenting the anchor item proportion in order to obtain reasonable estimates of discrimination and ability parameters. In contrast to discrimination and ability parameters, test set size had no significant effect on the stability of the item difficulties, and the influence of the anchor item proportion was ambiguous: increasing the anchor item proportion reduced correlations but improved error-based indices. One explanation for this ambiguous effect might be variance shrinkage of anchor item difficulty: in small anchor item sets, the variance of (true) item difficulties is statistically higher than in larger anchor item sets, which can reduce the correlation slightly. However, error-based recovery indices remain unaffected by this shrinkage since they do not depend on rank orders but on absolute deviations. Prior work suggests that homogenizing anchor item difficulties toward the center of the difficulty scale enhances parameter estimation (e.g., Dorans et al., 2007; Sinharay et al., 2012).

In line with H7, higher discrimination parameters of the anchor items strongly increased the stability of all three IRT parameters. These findings correspond with research on DIF detection (e.g., Lopez Rivas et al., 2009; Meade and Wright, 2012) and extend the benefit of high-discrimination items to test equating contexts.

According to H8, enlarging the sample size had a strong effect on the recovery of all three IRT parameters, particularly on the recovery of discrimination parameters and item difficulties. For discrimination parameters, association-based recovery profited especially from enlarging the sample from n = 100 (r = 0.47) to n = 500 (r = 0.73), while error-based recovery indices were already relatively low at n = 100 individuals. For item difficulties and ability parameters, the pattern was reversed: samples of n = 100 provided high association-based recovery indices, but for acceptable error-based recovery indices larger sample sizes with a minimum of n = 500 are necessary. These sample size thresholds might be relevant for item banking, where the aim is to create large banks of items with stable item characteristics while taking personnel and temporal resources into account. Consistent with H9, increasing sample size reduced the number of anchor items required to achieve high stability of discrimination and ability parameters. This effect might hold practical implications for periodically recurring assessments such as PISA (OECD, 2019) or annual student selection tests, where a small number of anchor items is advantageous since it reduces the probability of anchor item leakage, which would bias ability estimation.

Model CR strongly depended on sample size and test set size: small samples of 25 or 50 individuals were particularly prone to convergence failures, especially when employing tests with many items. This is in line with best-practice recommendations on IRT modeling, which suggest samples of at least 500 (2PL) or 1,000 (3PL) individuals. However, the results indicate that under certain conditions these thresholds may be relaxed to sample sizes of 100 individuals: in low-stakes settings, where individual test results are less consequential and the primary interest lies in detecting general trends, a 1–2% risk of non-convergence may be justifiable. In such cases, more constrained models (e.g., a fixed guessing probability or the 1PL model) or the exclusion of items with poor statistical properties could help maintain adequate stability. In contrast, in high-stakes settings (e.g., student selection tests) where test results have crucial consequences for participants, ensuring model convergence is essential. Consequently, when different test forms are administered in such contexts, it should be ensured that each form is completed by a sufficient number of participants (i.e., n ≥ 500).

To help researchers gauge the stability that can be expected from a specific combination of factor levels, we provide our data and the R code on the OSF. For individual test development and experimental design purposes, we further provide instructions for examining parameter stability under arbitrary conditions that are not included in this article. The R simulation code is prepared in a way that allows (1) the ability distributions of the samples, (2) the sample sizes, (3) the number of test sets, (4) the number of items, (5) the anchor item proportion, (6) the number of iterations, (7) the IRT model, (8) the guessing probability, and (9) the equating method to be varied easily. The R simulation code and instructions on how to use the code are available on our OSF: https://osf.io/np7k9/?view_only=dcb2b2b1bb02426e95b882f48378aa81.
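As a purely hypothetical illustration of these nine adjustable settings (the names below are ours and do not correspond to the actual arguments of the OSF script), a run configuration might look like this:

```r
# Hypothetical configuration for one simulation run; adjust values as needed.
config <- list(
  ability_means = c(-0.5, 0, 0.5),  # (1) latent means of the samples
  ability_sds   = c(0.8, 1, 1.2),   #     and their standard deviations
  sample_size   = 500,              # (2) sample size per test set
  n_test_sets   = 3,                # (3) number of test sets
  n_items       = 40,               # (4) items per test set
  anchor_prop   = 0.15,             # (5) anchor item proportion
  n_iterations  = 1000,             # (6) iterations per scenario
  irt_model     = "2PL",            # (7) IRT model ("2PL" or "3PL")
  guessing      = 0,                # (8) guessing probability
  method        = "mean-mean"       # (9) equating method
)
```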

4.2 Conclusion

These considerations lead to the following three recommendations: first, to enhance the stability of all three IRT parameters, the guessing probability should be minimized, anchor items should have high discrimination parameters, and larger sample sizes should be used: while sample sizes of 25 or 50 produce unacceptable estimates, samples of 100 or 500 individuals may already be justifiable in some cases. However, although acceptable convergence rates were observed with samples of 100 individuals, model convergence is not guaranteed when sample sizes fall below 500. Second, to improve the stability of discrimination and ability parameters in particular, large test sets or high anchor item proportions should be applied. Importantly, augmenting one of these two factors can compensate to some extent for reductions in the other. Third, although certain equating methods perform slightly better for specific IRT parameters, the overall advantages are comparatively small.

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: https://osf.io/np7k9/?view_only=dcb2b2b1bb02426e95b882f48378aa81.

Author contributions

DW: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing. NB: Conceptualization, Data curation, Methodology, Project administration, Supervision, Validation, Writing – review & editing. FS: Conceptualization, Data curation, Project administration, Resources, Supervision, Validation, Writing – review & editing. MK: Conceptualization, Data curation, Methodology, Project administration, Supervision, Validation, Writing – review & editing, Formal analysis, Investigation, Software.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Ackerman, P. L. (1987). Individual differences in skill learning: An integration of psychometric and information processing perspectives. Psychol. Bull. 102, 3–27. doi: 10.1037/0033-2909.102.1.3

Baker, F. B., and Kim, S. H. (2004). Item Response Theory: Parameter Estimation Techniques. Boca Raton, FL: CRC Press.

Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015). Fitting linear mixed-effects models using lme4. J. Stat. Software 67, 1–48. doi: 10.18637/jss.v067.i01

Battauz, M. (2013). IRT test equating in complex linkage plans. Psychometrika 78, 464–480. doi: 10.1007/s11336-012-9316-y

Battauz, M. (2015). EquateIRT: an R package for IRT test equating. J. Stat. Software 68, 1–22. doi: 10.18637/jss.v068.i07

Battauz, M. (2017). Multiple equating of separate IRT calibrations. Psychometrika 82, 610–636. doi: 10.1007/s11336-016-9517-x

Battauz, M. (2021). equateMultiple: equating of Multiple Forms (R package version 0.1.0). Available online at: https://CRAN.R-project.org/package=equateMultiple (accessed November 14, 2025).

Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Royal Stat. Soc. Ser. B. 57, 289–300. doi: 10.1111/j.2517-6161.1995.tb02031.x

Birnbaum, A. (1968). “Some latent trait models and their use in inferring an examinee's ability,” in Statistical Theories of Mental Test Score, eds. F. M. Lord, M. R. Novick (Reading, MA: Addison-Wesley).

Chalmers, R. P. (2012). mirt: a multidimensional item response theory package for the R environment. J. Stat. Software 48, 1–29. doi: 10.18637/jss.v048.i06

Coffman, W. E. (1971). “The achievement tests,” in The College Board Admissions Testing Program: A Technical Report on Research and Development Activities Relating to the Scholastic Aptitude Test and Achievement Tests, eds. W. H. Angoff (Princeton, NJ: Educational Testing Service), 49-77.

Cook, L. L., and Eignor, D. R. (1991). IRT equating methods. Educ. Meas. Issues Practice 10, 37–45. doi: 10.1111/j.1745-3992.1991.tb00207.x

De Ayala, R. J. (2013). The Theory and Practice of Item Response Theory. Guilford Publications.

Dorans, N. J., Pommerich, M., and Holland, P. W. (2007). Linking and Aligning Scores and Scales. New York, NY: Springer, 135–159. doi: 10.1007/978-0-387-49771-6

Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Jpn. Psychol. Res. 22, 144–149. doi: 10.4992/psycholres22.144

Hallgren, K. A. (2013). Conducting simulation studies in the R programming environment. Tutorials Quant. Methods Psychol. 9, 43–60. doi: 10.20982/tqmp.09.2.p043

Hausknecht, J. P., Halpert, J. A., Di Paolo, N. T., and Moriarty Gerrard, M. O. (2007). Retesting in selection: A meta-analysis of coaching and practice effects for tests of cognitive ability. J. Appl. Psychol. 92, 373–385. doi: 10.1037/0021-9010.92.2.373

Hausknecht, J. P., Trevor, C. O., and Farr, J. L. (2002). Retaking ability tests in a selection setting: Implications for practice effects, training performance, and turnover. J. Appl. Psychol. 87, 243–254. doi: 10.1037/0021-9010.87.2.243

Hills, J. R., Subhiyah, R. G., and Hirsch, T. M. (1988). Equating minimum-competency tests: comparisons of methods. J. Educ. Meas. 25, 221–231. doi: 10.1111/j.1745-3984.1988.tb00304.x

International Test Commission (2001). International guidelines for test use. Int. J. Testing 1, 93–114. doi: 10.1207/S15327574IJT0102_1

Kolen, M. J., and Brennan, R. L. (2014). Test Equating, Scaling, and Linking. New York, NY: Springer-Verlag. doi: 10.1007/978-1-4939-0317-7

Kulik, J. A., Kulik, C.-L. C., and Bangert, R. L. (1984). Effects of practice on aptitude and achievement test scores. Am. Educ. Res. J. 21, 435–447. doi: 10.3102/00028312021002435

Kuznetsova, A., Brockhoff, P. B., and Christensen, R. H. (2017). lmerTest package: tests in linear mixed effects models. J. Stat. Software 82, 1–26. doi: 10.18637/jss.v082.i13

Lopez Rivas, G. E., Stark, S., and Chernyshenko, O. S. (2009). The effects of referent item parameters on differential item functioning detection using the free baseline likelihood ratio test. Appl. Psychol. Meas. 33, 251–265. doi: 10.1177/0146621608321760

Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Mahwah, NJ: Lawrence Erlbaum Associates.

MacCallum, R. C., Widaman, K. F., Zhang, S., and Hong, S. (1999). Sample size in factor analysis. Psychological Methods, 4, 84–99. doi: 10.1037/1082-989X.4.1.84

Marco, G. L. (1977). Item characteristic curve solutions to three intractable testing problems. J. Educ. Meas. 14, 139–160. doi: 10.1111/j.1745-3984.1977.tb00033.x

Marshall, A., and Guillebaud, C. W. (1961). Principles of Economics: An Introductory Volume. London: Macmillan.

Meade, A. W., and Wright, N. A. (2012). Solving the measurement invariance anchor item problem in item response theory. J. Appl. Psychol. 97, 1016–1031. doi: 10.1037/a0027934

Morris, T. P., White, I. R., and Crowther, M. J. (2019). Using simulation studies to evaluate statistical methods. Stat. Med. 38, 2074–2102. doi: 10.1002/sim.8086

Noventa, S., Ye, S., Kelava, A., and Spoto, A. (2024). On the identifiability of 3- and 4 parameter item response theory models from the perspective of knowledge space theory. Psychometrika 89, 486–516. doi: 10.1007/s11336-024-09950-z

Nydick, S. W. (2014). catIrt: An R Package for Simulating IRT-Based Computerized Adaptive Tests. Available online at: https://CRAN.R-project.org/package=catIrt (accessed November 14, 2025).

OECD (2019). PISA 2018 Assessment and Analytical Framework. Paris: OECD. doi: 10.1787/b25efab8-en

R Core Team (2021). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing.

Rasch, G. (1960). Studies in Mathematical Psychology: I. Probabilistic Models for Some Intelligence and Attainment Tests. Oxford: Nielsen and Lydiche.

Roberts, J. S., and Laughlin, J. E. (1996). A unidimensional item response model for unfolding responses from a graded disagree-agree response scale. Appl. Psychol. Meas. 20, 231–255. doi: 10.1177/014662169602000305

Sinharay, S., Haberman, S., Holland, P., and Lewis, C. (2012). A note on the choice of an anchor test in equating. ETS Res. Rep. Ser. 2012, I−9. doi: 10.1002/j.2333-8504.2012.tb02296.x

Sinharay, S., and Holland, P. W. (2007). Is it necessary to make anchor tests mini-versions of the tests being equated or can some restrictions be relaxed? J. Educ. Meas. 44, 249–275. doi: 10.1111/j.1745-3984.2007.00037.x

Stocking, M. L., and Lord, F. M. (1983). Developing a common metric in item response theory. Appl. Psychol. Meas. 7, 201–210. doi: 10.1177/014662168300700208

Yang, W. L., and Houang, R. T. (1996). The Effect of Anchor Length and Equating Method on the Accuracy of Test Equating: Comparisons of Linear and IRT-Based Equating Using an Anchor-Item Design (ED401308). New York, NY: ERIC.

Keywords: test equating, item linking, test validity, anchor item, item response theory, simulation study

Citation: Weber D, Becker N, Spinath FM and Koch M (2026) The stability of IRT parameters under several test equating conditions. Front. Psychol. 16:1652341. doi: 10.3389/fpsyg.2025.1652341

Received: 23 June 2025; Revised: 21 November 2025; Accepted: 11 December 2025;
Published: 12 January 2026.

Edited by:

Holmes Finch, Ball State University, United States

Reviewed by:

Rodrigo Schames Kreitchmann, National University of Distance Education (UNED), Spain
Hye-Jeong Choi, Human Resources Research Organization, United States

Copyright © 2026 Weber, Becker, Spinath and Koch. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Dominik Weber, dominik.weber@uni-saarland.de
