
Edited by: Hong Jiao, University of Maryland, College Park, United States

Reviewed by: Okan Bulut, University of Alberta, Canada; Yoav Cohen, National Institute for Testing and Evaluation (NITE), Israel

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Item leakage has been a serious issue in continuous, computer-based testing, especially computerized adaptive testing (CAT), as compromised items jeopardize the fairness and validity of the test. Strategies to detect and address the problem of compromised items have been proposed and investigated, but many solutions are computationally intensive and thus difficult to apply in real-time monitoring. Recently, researchers have proposed several sequential methods aimed at fast detection of compromised items, but applications of these methods have not considered various scenarios of item leakage. In this paper, we introduce a model with a leakage parameter to better characterize the item leaking process and develop a more generalized detection method on its basis. The new model achieves a high level of detection accuracy while maintaining the type-I error at the nominal level, for both fast and slow leakage scenarios. The proposed model also estimates the time point at which an item becomes compromised, thus providing additional useful information for testing practitioners.

Due to advances in information technology, continuous testing has been offered for many large-scale testing programs, and test takers can take such exams nearly any time during the year. Although continuous testing provides test takers with considerable flexibility and convenience, it also raises serious security concerns. Individuals who take the test earlier in a testing window could share the items orally or online (e.g., via social media platforms), which would benefit subsequent test takers, jeopardizing the validity and fairness of the test. Studies have shown the severe and negative impact of compromised items (Chang and Zhang,

Alternatively, many methods have been developed to proactively detect item preknowledge (McLeod et al.,

The above-mentioned proactive methods focus on individual-level test statistics, but in recent years, several item-level sequential methods have been proposed to detect compromised items in computerized adaptive testing (CAT) (Zhang, ). In such a sequential framework, an item leaked at day t_1 can be detected as a compromised item at day t_2. There is a lag of t_2 − t_1 before a significant conclusion can be drawn. In this case, t_2 is the detection day, which is known, whereas t_1 is the compromise day, which is not known.

Therefore, there is need for a new, flexible method to account for various item-leaking processes in real life, where compromised items can spread at different rates and item leakage can result from many causes. The new method should be able to detect leakage under different scenarios, and provide an estimate of when an item is leaked.

First, compromised items may spread at different speeds, and the expected probability of a correct response to an item may not jump abruptly to a fixed, high value. For example, a posting on a popular social media website could quickly spread preknowledge of an item, whereas sharing within a small group of acquaintances might result in slower spreading. Therefore, to make the sequential detection approach more robust, it is important to develop a flexible method that takes these underlying dynamics into consideration.

Second, there are many probable causes of item leakage. A common scenario as detailed above could involve a test taker who posts the items received on a website, where future test takers could gain preknowledge on those items. A more severe case is organized item theft, which has been discussed in Yi et al. (

We therefore propose a new method for proactive detection of compromised items that largely addresses the stated limitations of existing approaches. Our method uses generalized linear modeling with the complementary log-log (cloglog) transformation as the link function, and it takes the potential leaking mechanism into consideration. Compared with existing methods, it has the following advantages: (1) it can handle more complicated item leakage mechanisms, both fast and slow; (2) unlike existing sequential approaches, it does not need a moving window to boost detection sensitivity, and thus avoids the problem of determining the best window size; instead, it improves detection accuracy by utilizing the complete testing information; (3) it enables estimation of the "compromise time," i.e., the time point at which the item was compromised; and (4) it is computationally more efficient than item-preknowledge detection methods, since it does not depend on the selection of suspicious items.

The model is validated with both simulated data and real data. For the simulation, the test is performed under different scenarios and parameters, and the simulated datasets are generated to be as diverse as possible. First, the model we use to simulate data is purposefully designed to differ from our model for leakage detection, in order to test the robustness of the detection method when the underlying leaking mechanism is unknown. Second, our simulation covers two distinct leakage scenarios: organized item theft and random item leakage. Third, for each simulation scenario, we investigated leakage-rate values over a wide range, in order to mimic different spreading speeds in practice. In addition to the simulation studies, we also show an application of our proposed method to a real large-scale testing dataset; the method performs well in both the simulation and the real-data studies. We also propose an application based on the estimate of the compromise day, t_1, which links compromised-item detection with person-level preknowledge detection. Simulation results show that t_1 can provide important information for preknowledge detection in CAT and significantly improve the accuracy of the person's ability estimation.

We detect compromised items by monitoring the responses of test takers. When an item is compromised, the expected probability that test takers answer it correctly will increase. Instead of assuming that all responses to a compromised item are always correct (Yi et al.,

In computerized adaptive testing, the probability for a test taker to give a correct response to an uncompromised item can be modeled by a three-parameter logistic (3PL) item response theory (IRT) model (Lord,

P(θ) = c + (1 − c)/{1 + exp[−Da(θ − b)]},

where θ is the latent ability of a test taker, a is the item discrimination parameter, b is the item difficulty parameter, c is the pseudo-guessing parameter, and D is a scaling constant. When the maximum item information method is used to select the next item, the expected probability for test takers to answer the item correctly is (1 + √(1 + 8c))/4.

In this study, the proposed detection algorithm concerns only the time series of responses to a single item, and all items are treated independently. Unless stated otherwise, we will hereafter use a representative item to illustrate the detection model.

Suppose the expected probability for a test taker to answer this item correctly on day t is 1 − π_t; that is, π_t is the expected probability of an incorrect answer. The number of incorrect answers X_t should then approximately follow a binomial distribution, X_t ~ Bin(N_t, π_t), where N_t is the total number of examinees taking this item on day t,

where
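As a minimal sketch of this sampling model, the snippet below draws daily incorrect-answer counts from the binomial distribution. The daily volume of 500 examinees matches the simulation setup described later; the constant value of π_t is an illustrative assumption for an uncompromised item, not a figure from the paper's data.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative daily volumes and incorrect-answer probabilities.
days = np.arange(1, 31)              # a 30-day testing window
N_t = np.full(days.size, 500)        # examinees seeing this item per day
pi_t = np.full(days.size, 0.4)       # constant pi_t: item not compromised

# Daily counts of incorrect answers: X_t ~ Bin(N_t, pi_t)
X_t = rng.binomial(N_t, pi_t)
```

A compromised item would instead show X_t drifting downward over the window, which is the pattern the detection model targets.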

In order to design an effective model that detects the leakage pattern in real data, we collaborated with researchers at the large-scale testing company involved in this study.

Representative Curves for Different Scenarios.

To detect the gradual change of the expected probability, two possible methods could be used to model the probability π_{t} as a function of time: logit

or cloglog

where π_0 is the expected probability before leakage and β is a coefficient that controls the speed of the leakage. Here t_0 is the point at which the item is compromised. Plots of π_t under different combinations of π_0 and β for both the logit and cloglog functions show that, in general, π_t decreases in a sigmoid manner when β is negative, and a larger absolute value of β corresponds to a faster decrease, i.e., a faster leakage of the compromised item. In the beginning, π_t presumably changes faster than later in the test cycle: test takers who are eager to exploit preknowledge of the compromised item are likely to test early, while the compromised item is still available, which induces a faster drop in the probability of an incorrect response when leakage starts. For this reason, the asymmetry of the cloglog function is favored in this study and is selected to model π_t. When t > t_0,

That is,

For a compromised item, a negative β is expected. Therefore, the problem of detecting a compromised item is converted to performing the following hypothesis test:

Note that our test is one-sided, since a positive β corresponds to an increasing π_t, which is not a pattern we want to flag.

Comparison of link functions with logit and cloglog transformations.
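To give a numeric feel for these decay shapes, the sketch below evaluates one plausible cloglog-type parameterization of the leakage curve, π_t = π_0·[1 − exp(−exp(α + βt))]. This exact functional form, and the parameter values, are our assumptions for illustration and may differ from the paper's Equation 7.

```python
import numpy as np

def cloglog_decay(t, pi0, beta, alpha):
    """Illustrative cloglog-type leakage curve: pi_t starts near pi0 and
    decays toward 0 for negative beta. This exact parameterization is an
    assumption, not necessarily the paper's Equation 7."""
    return pi0 * (1.0 - np.exp(-np.exp(alpha + beta * t)))

t = np.arange(0, 31)
slow = cloglog_decay(t, pi0=0.4, beta=-0.1, alpha=2.0)   # slow leakage
fast = cloglog_decay(t, pi0=0.4, beta=-0.5, alpha=10.0)  # fast leakage
```

A larger |β| drives the curve to its floor within a few days, mirroring the fast-leakage scenarios discussed in the simulation section.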

In order to perform the hypothesis test, we need

Let ψ_1 = π_0, ψ_2 = β, and ψ_3 = α. In this way, we can use one general symbol

Initialize the model parameters with starting values. We use β^{(0)} = 0 and α^{(0)} = 0 for all our simulation studies.

Update

First, keep

Then, keep

Third, keep

Each of the above updates is given by

(See the

Repeat Step 2 until convergence. Convergence is checked by calculating the change in the log-likelihood after each iteration; if the change is less than a threshold, e.g., 0.001, the model has converged. Then the elements of the Fisher information matrix (see

where the two subscripts each run over 1, 2, 3. According to the co-factor method for obtaining the inverse matrix of

Given

When the null hypothesis is rejected (one-sided test), the item will be flagged as compromised. The time from when the item starts to leak, i.e., the "compromise time," to when the leaked item is flagged is defined as the "detection lag." This definition of detection lag is the same as that in Zhang. The compromise time, t_c, is unknown in real applications. We propose an estimate of it, defined as the time point at which π_t drops to a certain percentage, say ϵ, of π_0. Based on our model in Equation 7, it is easy to show that

Specifically, we use ϵ = 90%. The bias of this estimate is defined as the "estimation lag."

Further, the variance of

where

The elements of
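Putting the pieces together, the following sketch fits the three parameters by maximum likelihood, checks the sign of the fitted β, and backs out the compromise-time estimate with ϵ = 90%. A generic optimizer stands in for the coordinate-wise Fisher-scoring updates described above, and the assumed curve form π_t = π_0·[1 − exp(−exp(α + βt))] and all numbers are illustrative, not the paper's exact Equation 7; the standard error of β̂ (from the Fisher information) is omitted for brevity.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import binom

# Simulated incorrect-answer counts from the assumed cloglog-type decay.
rng = np.random.default_rng(1)
t = np.arange(1, 31).astype(float)
N = np.full(t.size, 500)
pi_true = 0.4 * (1 - np.exp(-np.exp(4.0 - 0.3 * t)))
X = rng.binomial(N, pi_true)

def negloglik(psi):
    """Negative binomial log-likelihood of (pi0, beta, alpha)."""
    pi0, beta, alpha = psi
    p = np.clip(pi0 * (1 - np.exp(-np.exp(alpha + beta * t))), 1e-9, 1 - 1e-9)
    return -binom.logpmf(X, N, p).sum()

# A generic optimizer replaces the coordinate-wise Fisher-scoring updates;
# convergence is still judged through the (log-)likelihood value.
fit = minimize(negloglik, x0=[0.5, 0.0, 0.0], method="Nelder-Mead",
               options={"xatol": 1e-6, "fatol": 1e-6, "maxiter": 5000})
pi0_hat, beta_hat, alpha_hat = fit.x

# One-sided test of H0: beta = 0 vs. H1: beta < 0 would divide beta_hat by
# its Fisher-information standard error; here we only inspect the sign.
flagged = beta_hat < 0

# Estimated compromise time: where pi_t has dropped to eps of pi0 under the
# assumed parameterization (solving pi0*(1 - exp(-exp(a + b*t))) = eps*pi0).
eps = 0.90
t_c_hat = (np.log(-np.log(1 - eps)) - alpha_hat) / beta_hat
```

With the illustrative generating values (β = −0.3, α = 4), the true compromise point at ϵ = 90% sits near day 10.5, and the fitted estimate lands close to it.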

Our primary goal in introducing a different leakage model is to test the effectiveness of the proposed detection method when the underlying leakage rate is unknown. The leakage simulation model should have two features: (1) after the item is compromised, the expected probability of a correct response increases; (2) the spread rate of the compromised item may differ across items. In this study, the leaking process is simulated using an exponential function as follows,

where λ is the leakage parameter that regulates how fast the item is exposed to the public, t_0 is the time point at which the item is compromised, and t − t_0 is the time interval since the item was first compromised. The probability for any test taker to have item preknowledge is a function of t − t_0.

If the test taker already knows the answer to the item due to item preknowledge, the response process is described by the first component of Equation 16, which gives a correct response with probability 1. After t_0, responses to the compromised item will contain increasingly more 1s (i.e., correct responses) as time goes on.

Note that, in this study, the simulation model is used only to test the detection model, not to detect the leakage. Compared with the detection model, the simulation model is more complex, with extra parameters including a person's ability θ. Although we could also use the leakage model to fit the curve and run the hypothesis test, simultaneously estimating each person's ability would make the fitting less efficient than the detection model. Since we only care about detecting the leakage pattern of the probability curve, the proposed detection model is more straightforward and easier to converge.
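The mixture response process can be sketched as below, in the spirit of Equation 16: with preknowledge probability 1 − exp(−λ(t − t_0)) the answer is correct for sure; otherwise the 3PL model applies. The 3PL scaling constant D = 1.7 and all names are assumptions for illustration.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """Three-parameter logistic model (scaling constant D = 1.7 assumed)."""
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

def p_correct(theta, a, b, c, t, t0, lam):
    """Mixture response probability in the spirit of Equation 16: a test
    taker with preknowledge (probability 1 - exp(-lam*(t - t0)) once t > t0)
    answers correctly for sure; otherwise the 3PL model applies."""
    know = 1 - np.exp(-lam * (t - t0)) if t > t0 else 0.0
    return know * 1.0 + (1 - know) * p_3pl(theta, a, b, c)
```

Before t_0 the mixture collapses to the plain 3PL probability; long after t_0 it approaches 1 for every ability level, which is exactly the drift in π_t that the detection model watches for.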

Simulation studies are conducted to investigate the performance of the proposed detection method. The parameters in our simulation were chosen according to previous publications (du Toit,

The discrimination parameters are generated from a lognormal distribution. An exposure control procedure is implemented to prevent items from being over-exposed and to protect test security. The exposure rate for an item is defined as

In this study, the exposure control parameter is set to 0.2, meaning that only items with an exposure rate lower than 0.2 are eligible for administration. Items in the bank belong to three content areas with percentages of 40, 30, and 30%, respectively. Test length is set at 40. A content control procedure is implemented in the simulation to ensure that 40, 30, and 30% of items are selected from each content area for every test taker (i.e., 16, 12, and 12 items, respectively). The item with the lowest exposure rate in the desired content area is selected as the first item for each incoming test taker. A sample of 500 test takers (θs) is generated each day to take the exam, with abilities following a standard normal distribution. The simulation is replicated 10 times, and all distribution figures presented in the remainder of this paper are generated from results aggregated over replications.
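The exposure and content constraints described above might be sketched as follows. The function and array names are ours, and a real CAT engine would combine these eligibility filters with information-based selection at the current ability estimate; this is a simplified illustration.

```python
import numpy as np

def select_item(info, exposure, content, target_area, max_rate=0.2):
    """Select the most informative item whose exposure rate is below the
    control limit and whose content area matches the target. A sketch of
    the described constraints; names are ours, not the paper's."""
    eligible = (exposure < max_rate) & (content == target_area)
    if not eligible.any():
        raise ValueError("no eligible item in this content area")
    idx = np.flatnonzero(eligible)
    return idx[np.argmax(info[idx])]

# Tiny illustrative bank: item 1 is over-exposed, item 3 is the wrong area.
info     = np.array([0.5, 0.9, 0.7, 0.8])
exposure = np.array([0.10, 0.25, 0.05, 0.01])
content  = np.array([1, 1, 1, 2])
```

Here `select_item(info, exposure, content, target_area=1)` skips the over-exposed item 1 despite its higher information and returns item 2 instead.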

A test item could become compromised for a variety of reasons. The interest of this study is to investigate the effectiveness of the detection algorithm in general. In order to achieve this goal, we studied two common scenarios, which form the core of this paper:

Organized item theft. Organized item theft is one of the most severe item leakage scenarios in computer-based testing (Yi et al.,

Random item leakage. Some test takers simply share the items they have memorized with the public. In this case, the leakage can occur at any time. For the purposes of this study, 20 such item sharers are randomly selected during one testing window. A testing window is defined such that no item pool maintenance, such as rotation or replenishment, occurs within it; in other words, the item bank remains the same throughout the window. In this study, the testing window is set to 30 days (one month). In practice, this number depends heavily on the testing company's operations and might not be fixed even for the same test; for the simulation, we use monthly rotation to demonstrate the methodology. On average, we assume each item sharer can remember 10 of the 40 items at random and shares them with the public. Because the motivation to share items is usually weak near the end of a testing window, this simulation study assumes that such random sharing happens only in the first 25 days.

For each test taker, the first item selected is the one with the lowest exposure rate at that time within the desired content area. The probability that the test taker answers the administered item correctly is calculated from the mixture leakage model (Equation 16). A uniformly distributed random number is then generated within (0, 1); if its value is less than the mixture probability, the response is 1 (i.e., a correct answer), and otherwise 0. The expected a posteriori (EAP) method (Bock and Mislevy,
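The EAP ability update can be sketched with simple grid quadrature over a standard normal prior. This is a textbook-style implementation, not the paper's code; the 3PL scaling constant D = 1.7 and the grid bounds are assumptions.

```python
import numpy as np

def eap_theta(responses, a, b, c, n_points=81):
    """Expected a posteriori (EAP) ability estimate under a standard normal
    prior, computed by simple grid quadrature (textbook-style sketch)."""
    grid = np.linspace(-4.0, 4.0, n_points)
    prior = np.exp(-0.5 * grid**2)          # unnormalized N(0, 1) density
    # 3PL probability for each (item, grid point); D = 1.7 assumed.
    p = c[:, None] + (1 - c[:, None]) / (
        1 + np.exp(-1.7 * a[:, None] * (grid[None, :] - b[:, None])))
    like = np.prod(np.where(responses[:, None] == 1, p, 1 - p), axis=0)
    post = prior * like
    return float((grid * post).sum() / post.sum())
```

An all-correct response string on medium-difficulty items pulls the estimate well above zero, while an all-incorrect string pulls it below, as expected.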

In some extreme cases, the probability of answering an item correctly after it is compromised is 1 for all test takers. In practice, however, item leakage can be a gradual process. In this study, the leakage parameter λ (see Equation 15) is set to 0.05, 0.1, 0.3, 0.5, 0.7, 1, and 1.5 to regulate the speed of item leakage. When λ is large, e.g., λ = 1.5, the simulation represents a severe leakage scenario in which nearly all responses will be correct once an item has been compromised.

As illustrated in the Method section, the proposed leakage detection model intentionally uses Equation 7, which differs from the true underlying model (Equation 16) used to generate the item responses. The parameter λ controls the speed of leakage: because the preknowledge probability grows as 1 − e^{−λ(t − t_0)} (Equation 15), the number of days for the probability to reach its half-drop can be approximated by ln 2/λ.

As mentioned earlier, this study assumes that all item thieves have taken the test in the first 4 days within a testing window.

Detection accuracy and Type-I error for organized item theft (standard error is given in parenthesis).

λ | Detection accuracy (%) | Type-I error (%)
0.05 | 93.70 (0.54) | 4.49 (0.37)
0.10 | 99.86 (0.09) | 6.56 (0.56)
0.30 | 99.93 (0.06) | 7.67 (0.70)
0.50 | 99.43 (0.27) | 4.09 (0.76)
0.70 | 99.61 (0.13) | 4.89 (0.74)
1.00 | 99.04 (0.16) | 4.99 (0.34)
1.50 | 98.85 (0.26) | 4.32 (0.83)

For a desired 95% confidence level, the detection accuracy is about 99% for λ values larger than 0.05. When λ = 0.05, the detection accuracy drops to 93.70%, because λ = 0.05 represents a very slow leakage process that is hard to detect within the 30-day window. On the other hand, the type-I errors for all λs are well controlled at about 5%, consistent with the nominal level.

Distribution of the detection day for organized item theft.

Detection lag and estimation lag for organized item theft (standard error is given in parenthesis).

0.05 | 17.47 (0.16) | 0.61 (0.19) | 0.58
0.10 | 10.61 (0.11) | –0.46 (0.11) | 0.65
0.30 | 4.69 (0.06) | –0.95 (0.04) | 0.76
0.50 | 3.47 (0.05) | –1.07 (0.03) | 0.82
0.70 | 3.03 (0.05) | –1.13 (0.02) | 0.88
1.00 | 3.11 (0.08) | –1.07 (0.02) | 0.96
1.50 | 4.96 (0.14) | –1.00 (0.01) | 1.00

Item distribution of Type-I error items for organized item theft.

Further, given the estimate of an item's compromise point, test practitioners can re-evaluate a test taker's ability by removing responses to suspicious items from the ability estimation. Suspicious items are defined as compromised items administered to test takers who take the test after the item's compromise point. For example, if an item is flagged as compromised on day 3 and was assigned to a test taker on day 4, the item is classified as suspicious for that test taker.
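Operationally, this re-scoring step might look like the following: responses administered after an item's estimated compromise point are masked out before the ability is re-estimated (e.g., by EAP on the remaining responses). The arrays and values below are illustrative, not data from the paper.

```python
import numpy as np

# Day each response was administered, and the estimated compromise day of
# the corresponding item; np.inf marks items that were never flagged.
admin_day = np.array([1.0, 2.0, 4.0, 6.0, 9.0])
t_c_hat   = np.array([np.inf, np.inf, 3.0, np.inf, 7.5])

# Keep a response only if it was given on or before the item's estimated
# compromise point; the rest are "suspicious" and dropped from scoring.
keep = admin_day <= t_c_hat
clean_idx = np.flatnonzero(keep)
```

In this toy case, the third and fifth responses were given after their items' estimated compromise points, so only the remaining three would enter the re-estimated ability.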

Ability Estimation with/without Suspicious Items for Organized Item Theft.

Results for the random item leakage conditions show patterns in common with those for organized item theft. However, unlike organized item theft, random item leakage does not always start at the beginning of the item bank rotation; the leakage can occur any time before the item pool is rotated. Therefore, more data are available before the leakage, and this part of the study examines how the model performs under such a scenario. Note that π_t will not change much from day 25 to day 30 when λ is small; when λ is large, however, a significant change in π_t can still be observed within 5 days.

Detection accuracy and Type-I error for random item leakage (standard error is given in parenthesis).

λ | Detection accuracy (%) | Type-I error (%)
0.05 | 67.63 (2.61) | 2.02 (0.44)
0.10 | 87.00 (1.98) | 4.54 (0.43)
0.30 | 96.64 (1.03) | 5.14 (0.92)
0.50 | 98.80 (0.33) | 4.67 (1.27)
0.70 | 99.13 (0.50) | 4.85 (0.50)
1.00 | 99.79 (0.10) | 5.22 (0.90)
1.50 | 99.74 (0.14) | 4.01 (0.83)

Distribution of the detection day for random item leakage.

Detection lag and estimation lag for random item leakage (standard error is given in parenthesis).

0.05 | 12.42 (0.15) | 0.66 (0.16) | 0.46
0.10 | 7.76 (0.08) | 0.18 (0.08) | 0.54
0.30 | 3.88 (0.05) | –0.90 (0.05) | 0.69
0.50 | 3.00 (0.04) | –1.00 (0.04) | 0.78
0.70 | 2.66 (0.04) | –1.16 (0.04) | 0.84
1.00 | 2.38 (0.04) | –1.14 (0.03) | 0.91
1.50 | 2.64 (0.07) | –1.17 (0.03) | 0.98

Item distribution of Type-I error items for random item leakage.

Ability Estimation with/without Suspicious Items for Random Item Leakage.

In this study, we demonstrate the use of the proposed methods with real data from a large-scale operational CAT program that offers continuous testing. Item response data for about 10 days from two operational item pools are used for the analysis. There are 2905 items in total and only 32 items are flagged as being compromised, with nominal alpha level at 0.05. This result indicates that this operational testing program is rather secure, with only slightly over 1% (32 out of 2905) of potential leakage detected. Although the nominal Type-I error is 0.05, the empirical alpha level may be different due to many factors, e.g., the short testing interval (10 days). For all four typical curves illustrated in

We compare our proposed detection method with the existing method (Zhang,

First, we apply Zhang's detection model to our simulation data with a leakage process taken into consideration.

Application of Zhang's sequential method to random leakage scenario.

0.05 | 99.68 | 89.87 | 94.50 | 48.36 | 73.95 | 8.90 | 37.44 | 1.58
0.1 | 99.87 | 84.29 | 98.08 | 42.37 | 91.78 | 9.75 | 71.31 | 2.95
0.3 | 99.40 | 77.97 | 99.31 | 39.00 | 97.18 | 9.30 | 90.51 | 2.98
0.5 | 99.93 | 77.89 | 99.74 | 37.83 | 98.01 | 8.56 | 92.05 | 2.75
0.7 | 99.94 | 74.96 | 99.68 | 32.80 | 97.78 | 6.79 | 93.06 | 2.46
1.0 | 100.00 | 76.40 | 99.41 | 35.60 | 96.81 | 8.11 | 91.46 | 2.46
1.5 | 99.93 | 74.40 | 99.01 | 32.38 | 95.74 | 7.47 | 91.36 | 1.98

Application of Zhang's sequential method to organized theft scenario.

0.05 | 100.00 | 81.81 | 99.81 | 36.89 | 91.75 | 9.04 | 44.80 | 2.63
0.1 | 99.93 | 72.07 | 98.86 | 36.41 | 95.32 | 9.66 | 56.41 | 3.89
0.3 | 99.93 | 75.37 | 99.93 | 40.17 | 92.22 | 11.56 | 64.05 | 4.51
0.5 | 99.93 | 76.72 | 99.63 | 41.27 | 90.39 | 11.92 | 61.03 | 3.10
0.7 | 99.68 | 79.17 | 99.65 | 42.74 | 83.88 | 10.96 | 60.33 | 2.80
1.0 | 99.54 | 76.60 | 96.63 | 40.86 | 80.73 | 9.96 | 61.54 | 2.56
1.5 | 99.19 | 77.48 | 93.23 | 43.83 | 74.96 | 10.34 | 56.59 | 1.88

We also apply Zhang's sequential method to the real dataset used in the "Real Data Application" section.

Number of items that are flagged as compromised with different α for two models.

In this study, we have proposed a general detection model that considers the practical dynamics of the item leaking process. The method shows, through all our simulation studies, a strong detection power for various leakage rates with well-controlled type-I error. The model also provides a way to estimate the time point at which an item is compromised, which may be helpful for testing practitioners to better secure the testing process.

The goal of our method is to detect item leakage across various leakage rates when the underlying leaking process is unknown. Therefore, the simulation model of the leakage is purposefully designed to differ from the compromised-item detection model. The results show that the proposed detection model performs very well under such scenarios, a strong indicator of the generality and power of our detection method. Estimates of both detection accuracy and detection lag are close to the expected values when the leakage rate is not too small. When the leakage is very slow, we observed a longer detection lag. The impact of this lag, however, can be quite mild in real applications: when λ is small, the change in the probability of a correct answer is not large even with a relatively large detection lag. Further, this lag is inevitable: determining whether an item is compromised when the leakage is slow is intrinsically difficult, no matter what method is used.

The assumption of the detection model, implicit in Equation 7, is that, given an infinitely long testing window, all test takers will eventually become aware of the compromised item and hence be able to respond correctly. In practice, some portion of test takers may never gain preknowledge of the items, no matter how long the testing window is. Therefore, the probability of correctly answering a compromised item may never reach 100%. In that case, we can generalize our method to cover such a scenario as follows:

where one more parameter, π_e, is introduced to represent the expected asymptote of π_t after the item has been compromised; π_e can take any value in [0, π_0]. When π_e = 0, the model reduces to the simplified model in Equation 7. Implementing this more general model is left for future work.
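Under one assumed cloglog-style form of this generalized equation, the asymptote parameter enters as a floor on π_t; when π_e = 0 the function reduces to the simpler decay. The exact parameterization below is our illustration, not necessarily the paper's generalized equation.

```python
import numpy as np

def pi_general(t, pi0, pi_e, beta, alpha):
    """Decay of the incorrect-response probability from pi0 toward a floor
    pi_e in [0, pi0]; reduces to the simpler model when pi_e = 0. The
    cloglog-style form is an assumed illustration."""
    return pi_e + (pi0 - pi_e) * (1 - np.exp(-np.exp(alpha + beta * t)))
```

For negative β the curve starts near π_0 and levels off at π_e, capturing the case where the probability of correctly answering a compromised item plateaus below 100%.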

The validation of the model was performed both by simulation and real data. Through the simulation study we were able to generate different leakage dynamics and test the effectiveness of our proposed method in these scenarios. Note that, although both models control the leakage speed, the parameter β in the detection model is not mathematically related to the λ in the simulation model. Actually, our proposed method essentially focuses on detecting the leakage pattern. As long as the overall pattern of the expected probability curve is similar to what we proposed, the method should work. We also applied the method to real data to demonstrate its utility in practice.

The simulation study shows that our proposed method is powerful and reliable when applied to CAT using the maximum item information method for item selection. But the method is not limited to a particular item selection method. Letting

The expected probability, therefore, does not depend on the distribution of θ. Different item selection algorithms provide different

Although the time unit in this study is set at the day level, this choice is flexible, and the unit can be set at finer levels if necessary. The best time unit depends on the properties of the test of interest and the judgment of experienced testing practitioners. For example, given a large number of scheduled test takers per day, the time unit could be divided into hourly increments, which would provide more time points for model fitting and subsequently higher detection sensitivity. Alternatively, instead of aggregating the data by time, one could aggregate by a fixed number of item responses, e.g., every 20 responses. In addition, the type-I error rate for the hypothesis test is set to 0.05 in this study, following convention; in practice, the cutoff can be chosen according to test practitioners' preferences.

Our study shows that the ability estimation

Compared with the existing sequential method (Zhang,

CL and JL developed the method. CL did the simulation study and analysis on real data. KH provided the real data and did the real data analysis with CL and JL. CL, JL, and KH wrote the manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The authors have relied upon data supplied by the Graduate Management Admission Council (GMAC) to conduct the independent research that forms the basis for the findings and conclusions stated by the authors in this article. These findings and conclusions are the opinion of the authors only, and do not necessarily reflect the opinion of GMAC. The authors would also like to thank the editor and the three reviewers for their comments and suggestions.

According to the chain rule of differentiation, the derivation can be divided into two parts: first, differentiate the log-likelihood with respect to π_t; then calculate the derivative of π_t with respect to the model parameters π_0, β, and α. For convenience, let

where ℓ_t = ℓ(π_t, X_t, N_t) is a function of π_t, X_t, and N_t. Then we have,

where

For convenience, let

Combining Equations A2 and A3, we finally get, for π_0,

and for β,

and for α,

and the derivatives for cross terms are,

Therefore, the Fisher information matrix is,

Substituting Equation A2 into Equation A9, we have,

Since

the Fisher information matrix could be simplified to