Evaluating the Quality of Higher Education Instructor-Constructed Multiple-Choice Tests: Impact on Student Grades

Multiple-choice questions (MCQs) are commonly used in higher education assessment tasks because they can be easily and accurately scored, while giving good coverage of instructional content in a short time. However, studies that have evaluated the quality of MCQs used in higher education assessments have found many flawed items, resulting in misleading insights about student performance and contaminating important decisions. Thus, MCQs need to be evaluated statistically to ensure high quality items are used as the basis of inferences. This study evaluated the quality of 100 instructor-written MCQs used in an undergraduate midterm test (50 items) and final exam (50 items), making up 50% of the course grade, using the responses of 380 students enrolled in one 1st-year undergraduate general education course. Item difficulty, discrimination, and chance properties were determined using Classical Test Theory and Item Response Theory statistical item analysis models. The two-parameter logistic model consistently had the best fit to the data. The impact on overall course grades between the original raw score model and the IRT 2PL model showed 70% of students would receive the same grade (i.e., D to A), but only one-third would get the same mark using the standard augmented grade scale (i.e., A+ to D-). The analyses show that higher education institutions need to ensure MCQs are evaluated before student grading decisions are made.

Multiple-choice questions (MCQs) are one of the most commonly used assessment methods in higher education (DiBattista and Kurzawa, 2011; Bailey et al., 2012). However, their use varies by discipline (e.g., high use is seen in medical education) and jurisdiction (e.g., perhaps less so in the United Kingdom). MCQs are used because, in a short period of time, a broad range of course material can be efficiently assessed and accurately scored (DiBattista and Kurzawa, 2011; Nedeau-Cayo et al., 2013). However, the quality of the instructor-written MCQs used in higher education assessments is questionable and potentially results in misleading evidence of student achievement (Masters et al., 2001; Brady, 2005; Downing, 2005; Stagnaro-Green and Downing, 2006; Tarrant et al., 2006). This is an understandable situation since few academics have had formal education in assessment theory or the principles of MCQ item writing.
Evaluation of MCQ quality can be conducted through professional judgment processes relative to "best practice" conventions and advice (Haladyna, 2004). Four major foci have been identified: (1) content guidelines, (2) style and format, (3) writing the stem, and (4) writing options (Haladyna and Rodriguez, 2013). Implementation of these guidelines can and should be conducted automatically "in-house" by academics within each department or discipline, prior to deployment of the test or examination. The second approach for determining quality is the application of statistical item analysis procedures to determine the characteristics of items and use those statistics to decide if an item can be properly included in the determination of test-taker performance (Downing, 2006; Malau-Aduli and Zimitat, 2012).
Items which are (a) inappropriately difficult or easy, (b) too easy to guess at, or (c) do not discriminate positively between high and low performing learners will lead to inappropriate decisions about student ability and consequent decisions (e.g., pass-fail, graduation, access to scholarship, etc.). Furthermore, such items will also give inappropriate feedback to students and instructors. In both cases, the problem lies in poorly constructed items, rather than necessarily poorly delivered teaching or poor learning habits and strategies.
Statistical tools exist to evaluate item quality and are used extensively in high-stakes testing programs in international K-12 test systems, in national K-12 testing programs, and in high-stakes university admission testing. Unfortunately, the same cannot be said for higher education course assessments, especially those relying on MCQs. This is problematic since grades are awarded, in part, on the basis of performance on MCQ testing and, if little or no quality assurance is carried out, then invalid conclusions about student performance will be drawn. Hence, the lack of quality assurance processes, such as statistical item analysis or item evaluation, raises doubt as to the validity and legitimacy of scores, grades, and ultimately certificates and degrees. Thus, the goal of this study was to examine two operational MCQ-based tests within one course, using multiple statistical models, to determine (1) the quality of items and (2) possible implications for grading decisions. It is worth noting that, while this study involves MCQ items, the same issues and challenges exist for any dichotomously scored test question formats such as True-False, Mix-and-Match, and so on. The statistical problems are very similar, although the item quality indicators would be different.

Assessment in Higher Education
Assessments in higher education serve a wide range of functions, including formative (e.g., how and what to improve on) and summative (e.g., pass-fail decisions, entry to restricted programs, scholarships, graduation, etc.) (Yorke, 2009; Schaughency et al., 2012). Tested performance using MCQs is normally transformed into grades (e.g., A to D or E) which are meant to represent the quality of students' performance and level of achievement (Yorke, 2009). Grades are signals of achievement and show students their areas of strength and weakness and can inform instructors about the success of their teaching (Joughin, 2009; Yorke, 2009; Brown, 2010; Walvoord and Anderson, 2011). Obviously, the quality of assessment matters so that inferences and decisions by students and instructors, as well as external stakeholders (e.g., employers), can be made on a robust basis (Grainger et al., 2008).

Quality of MCQs in Higher Education
Despite the existence of guidelines for writing MCQs (Haladyna, 2004; Brady, 2005; Burton, 2005; Downing and Yudkowsky, 2009), studies have found many bad items and violations of recommended guidelines. Tarrant et al. (2006) evaluated 2,770 MCQs used over a five-year period from 2001 to 2005 and concluded that nearly half (46%) of the items were bad because they violated item-writing guidelines. Similar outcomes in higher education assessments are reported across different disciplines (Ellsworth et al., 1990; Hansen and Dexter, 1997; Masters et al., 2001; Downing, 2005). Poorly written MCQs can negatively impact students' performance and achievement (Downing, 2005; Tarrant et al., 2006; Clifton and Schriner, 2010).
There is also concern that MCQs do not assess higher order thinking and focus too much on recall of knowledge (Downing, 2005; Tarrant et al., 2006; Walsh and Seldomridge, 2006; Popham, 2011; Malau-Aduli and Zimitat, 2012). Additionally, creating good MCQs is time-consuming and it is particularly difficult to create good distractors, especially for higher order thinking objectives (Fellenz, 2004; Clifton and Schriner, 2010). It has been proposed that poor item writing, rather than an inherent characteristic of MCQs, accounts for their tendency to assess lower order cognitive skills (Downing, 2005; Downing and Yudkowsky, 2009; Malau-Aduli and Zimitat, 2012). Fortunately, training in MCQ item writing has produced significantly higher quality MCQs (Jozefowicz et al., 2002).
Since MCQs contribute to course grades, a high score on an easy test may artificially inflate student grades. Likewise, the reverse occurs if the test was overly difficult resulting in artificially depressed grades. The problem of ensuring that item and test difficulty aligns with appropriate standards is complicated, especially if item difficulty is caused by poor writing. Without quality assurance and standard setting processes that take into account the difficulty of the test relative to the grade criteria, raw scores on a test have little meaning. Hence, MCQ tests and examinations need to be evaluated for the quality of the item writing and the statistical properties of the contributing items. Then, standards need to be derived for each test, using one of many methods available (Cizek, 2001), which map the test scores onto grade descriptors. While item quality and standard setting are complex human processes, the analysis of item properties is a more technically demanding statistical process.

Statistical Approaches to MCQ Item Quality
Two major classes of statistical methods can be applied to MCQs. These are known as classical test theory (CTT) and item response theory (IRT). The former examines tests as entities, while the latter evaluates items in and of themselves.

Classical Test Theory
The CTT approach determines item characteristics from the available observed data (Reynolds et al., 2009). CTT assumes that the total number of items answered correctly indicates the examinee's level of ability or knowledge (de Ayala, 2009; Schaughency et al., 2012). In other words, students who get a higher proportion of items correct know more than those with a lower percentage correct. Most commonly, letter grades are associated with ranges of percentage correct (e.g., B = 70 to 79%) or a pass-score can be set at a proportion correct (e.g., 60%).
Classical test theory specifies that the score achieved by an individual examinee is equal to the sum of their theoretical true ability and the unobserved error component in the test. The proportion of candidates getting an item right (p) determines the difficulty of each item; items that are too easy (p > 0.80) or too hard (p < 0.20) are frequently rejected from a test as not providing useful information about candidate ability. Ideally, all items in a test discriminate positively between those who know most and those who know least. This is determined by examining the point-biserial correlation (rpb), which is the correlation of the item to the total after the item has been removed from the total. In many testing situations, items which do not have a significantly positive value (rpb > 0.20) are rejected, though any positive value indicates a tendency for higher scoring candidates to get individual items correct more than the lower scoring candidates (Ebel and Frisbie, 1991).
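The two CTT item statistics described above can be illustrated with a minimal sketch. It assumes a 0/1 response matrix with one row per examinee and one column per item; the function name `ctt_item_stats` and the use of NumPy are illustrative choices, not part of the original analysis (which was run in SPSS).

```python
import numpy as np

def ctt_item_stats(responses):
    """Return (p, rpb) for a 0/1 matrix: rows = examinees, columns = items.

    p   : proportion of examinees answering each item correctly (difficulty)
    rpb : corrected point-biserial, i.e., the correlation between the item
          and the total score after that item has been removed from the total
    """
    x = np.asarray(responses, dtype=float)
    total = x.sum(axis=1)
    p = x.mean(axis=0)
    rpb = np.full(x.shape[1], np.nan)
    for j in range(x.shape[1]):
        rest = total - x[:, j]  # total score without item j
        # The correlation is undefined when either variable has no variance.
        if x[:, j].std() > 0 and rest.std() > 0:
            rpb[j] = np.corrcoef(x[:, j], rest)[0, 1]
    return p, rpb

# Hypothetical toy data: 4 examinees, 4 items.
toy = [[1, 1, 1, 0],
       [1, 1, 0, 0],
       [1, 0, 0, 0],
       [1, 1, 1, 1]]
p, rpb = ctt_item_stats(toy)
```

In this toy example the first item is answered correctly by everyone, so its discrimination is undefined and is returned as NaN; such an item carries no information for ranking candidates.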
Another quality indicator of MCQ is the efficiency of the wrong answer distractors. Distractors that get selected infrequently (e.g., <5% of test-takers choose it) are so implausible that they seem to attract only candidates randomly guessing (Haladyna and Downing, 1993). Options with low selection rates have been found in up to nearly half of all items (Haladyna and Downing, 1993), between 30 and 40% of all items (Tarrant et al., 2009), and as many as 75% (Hingorjo and Jaleel, 2012). Hence, identification of such options and their subsequent replacement or deletion could improve item quality.
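A distractor check of this kind can be sketched as follows, assuming that each examinee's selected option for one item is recorded as a letter and that the 5% threshold mentioned above is applied. The function name, the option labels, and the example counts are hypothetical.

```python
from collections import Counter

def low_selection_distractors(choices, key, options="ABCD", threshold=0.05):
    """List the wrong options chosen by fewer than `threshold` of examinees.

    choices : sequence of selected option letters for one item (non-empty)
    key     : the letter of the correct answer
    """
    counts = Counter(choices)
    n = len(choices)
    return [o for o in options if o != key and counts[o] / n < threshold]

# Hypothetical item: 100 examinees, correct answer "A".
example = ["A"] * 60 + ["B"] * 30 + ["C"] * 8 + ["D"] * 2
flagged = low_selection_distractors(example, key="A")
```

Here only option "D" (2% selection) falls below the threshold and would be a candidate for replacement or deletion.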
Test quality is accepted, generally, if the estimate of reliability (e.g., Cronbach's alpha) is sufficient for the decisions being made. For example, for research purposes α > 0.70 is considered sufficient because the shared covariance of the items accounts for about 50% of the test score. However, in a high-stakes certification examination (e.g., Advanced Placement tests at the end of high school in the USA), very high reliability estimates (α > 0.90) are expected. Given the mean, standard deviation, and reliability estimate of a test, it is possible to calculate a standard error of measurement, which is the range of scores that each candidate would most likely get the next time they sat the exact same test (Harvill, 1991). The SEM indicates the number of marks a score could vary by chance, without any substantive change in the student's ability and should be used in making decisions about quality or change. Thus, the CTT approach provides sufficient statistics to evaluate items for difficulty and discrimination.
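As an illustration of these test-level statistics, the sketch below computes Cronbach's alpha from a 0/1 response matrix and the standard error of measurement using the conventional formula SEM = SD × √(1 − reliability). The function names are illustrative, and the formula is the standard textbook one rather than something stated explicitly in the text above.

```python
import numpy as np

def cronbach_alpha(responses):
    """Cronbach's alpha for a matrix with rows = examinees, columns = items."""
    x = np.asarray(responses, dtype=float)
    k = x.shape[1]
    sum_item_var = x.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = x.sum(axis=1).var(ddof=1)        # variance of total scores
    return k / (k - 1) * (1.0 - sum_item_var / total_var)

def standard_error_of_measurement(sd, reliability):
    """Conventional SEM: test standard deviation times sqrt(1 - reliability)."""
    return sd * np.sqrt(1.0 - reliability)

# Hypothetical values: a test with SD = 10 marks and alpha = 0.75.
sem = standard_error_of_measurement(10.0, 0.75)
```

With these hypothetical values the SEM is 5 marks, so a difference of a few marks between two candidates would sit within chance variation.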
However, in CTT examinee ability is "sample dependent" meaning that if the test is hard, the students will seem to be low achievers and vice versa. Similarly, items have difficulty values totally dependent on the ability of the sampled test takers, and so a change in their ability will change the item difficulty. This means that items will have very different characteristics depending on who attempted them and what other questions were present.

Item Response Theory
Because a test is a sample of a domain of interest, the real focus of interest in assessment is the learner's ability in the domain, independent of the set of items presented in a test. Hence, a modern class of statistics (i.e., IRT) has arisen which permits items to be given different difficulties, discrimination, and guessing characteristics independent of the test in which they are presented (Embretson and Reise, 2000; Borsboom, 2005). IRT predicts the likelihood that an examinee with a specific ability level will correctly answer a specific item by defining the examinee's ability in relation to the item characteristics (Embretson and Reise, 2000). This means that a person's total score or ability can be estimated using a probabilistic formula based on the dual properties of the item and the test-taker's performance (Hambleton and Jones, 1993).
All IRT models propose that the probability of answering an item has an S or ogive shape in which the probability of answering correctly increases probabilistically as the ability of the test-taker increases. The formula uses the natural log of the odds that an item is answered correctly over answered incorrectly. The S-shape of the item plot (i.e., probability of answering correctly on the vertical axis versus item difficulty and person ability on the horizontal axis) creates two asymptotes so that the probability approaches, but never reaches, certainty (p = 1.00 versus p = 0.00) even as ability reaches positive or negative infinity (Giblin, 1972). This shows that there is always the possibility for very low-ability students to answer an item correctly by chance, and vice versa. Since MCQs have multiple wrong answers, it is possible for a high-ability student to be misled and similarly, because the right answer is available, it is possible for a low-ability student to randomly select it.
The difficulty of an item is the point when the probability of answering the item correctly equals 50%. The ability of a person is defined as the difficulty of items for which the person has a probability of answering correctly at the 50% correct rate (Embretson and Reise, 2000). Unlike CTT, answering more questions correctly does not increase the overall ability estimate unless the items are hard. In other words, in IRT the person's score goes up, not by answering a higher proportion of questions correctly, but by answering much harder questions. If the difficulty of items does not align well with the test takers' ability (e.g., too many easy or very hard questions relative to performance) then the accuracy of the estimated score decreases.
Within IRT there are three major models with increasing complexity (Embretson and Reise, 2000; Osterlind and Wang, 2012).

One-Parameter Logistic Model (1PL) or Rasch Model
This approach assumes that all items have statistically equivalent discrimination and only differ in terms of their difficulty (Bond and Fox, 2007). The probability of guessing is assumed to be equal and very close to zero when ability is very low. Items are deemed to fit the Rasch model if the Chi-square (χ²) index is statistically not significant (Bond and Fox, 2007). Only items that are statistically equivalent to each other in terms of their slope and lower asymptotes are retained by the model. This can inadvertently mean that items with very strong positive discrimination could be rejected, simply because they differ too much from the model (Houts et al., 2016). It is important to note here that the Rasch model approach prioritizes the model which requires all items to conform to the assumptions and be statistically equivalent to each other. This stands in contrast to the data-centric approach of IRT, which may use the same 1PL model as Rasch, but allows parameters to be freely estimated without constraining them to fit the a priori model. While some have argued that the imposition of the Rasch model assumptions is necessary to achieve "measurement" (Bond and Fox, 2007), the data analysis in this study is neutral as to the philosophic assumptions associated with Rasch modeling.
The Rasch or 1PL model has been used to analyze the quality of multiple-choice items with mixed results. Some studies showed that the Rasch model fitted most items (Athanasou and Lamprianou, 2004) while others have found that the Rasch model did not fit most multiple-choice items (Divgi, 1986; Leeson and Fletcher, 2003), most likely because of the overly restrictive requirement that all items have zero guessing and equal discrimination (Drasgow and Parsons, 1983; van de Vijver, 1986). Hence, there are doubts as to the sufficiency of the Rasch model for MCQ items where the possibility of guessing exists.

Two-Parameter Logistic Model (2PL)
The 2PL model includes item discrimination and item difficulty as factors that determine the item and test-taker characteristics (Embretson and Reise, 2000; Thissen and Orlando, 2001). Items that have greater discriminatory power have steeper slopes at the 50% probability point. This means that only a small change in ability will produce a large change in probability of answering correctly. Highly discriminating items are useful, as in CTT, in differentiating between examinees of different ability, especially when distinctions relative to a cutoff score or grade boundary are required (Thissen and Orlando, 2001; Osterlind and Wang, 2012). The advantage of 2PL over Rasch is that items do not have to have equal discrimination rates. Nonetheless, negative discrimination values, as in CTT, provide misleading information about the domain of interest, necessitating the removal of such items. Like the 1PL model, the 2PL model does not account for the possibility of correctly answering the item by chance or guessing. The logic here is that if items are well-written, the probability of guessing should be less than the raw chance of randomly picking the right answer from a set of options. For example, a 4-option item could be answered correctly one in four times. If the chance value for such an item is actually 17%, then the effect of chance can be ignored. The 2PL model has been found to fit the data from well-designed MCQ reading comprehension items better than the Rasch model (Leeson and Fletcher, 2003).

Three-Parameter Logistic Model (3PL)
The 3PL model extends IRT by including a parameter that represents the possibility of low ability examinees answering an item correctly due to chance (Crocker and Algina, 1986; Embretson and Reise, 2000; Osterlind and Wang, 2012). MCQ items rarely have a lower asymptote at zero because even very weak students can answer items correctly by random processes. Assuming all items are completely independent of each other, the chance of getting any 4-option MCQ correct should be 25%. Unsurprisingly, if items are poorly written or if test-takers exercise very little effort (Wise and Smith, 2016), the probability of guessing correctly could be much higher. Thus, items which have chance values greater than the random rate of guessing are normally removed before score determination. Studies of MCQs with the 3PL model have found it to fit most of the items in a test well (Leeson and Fletcher, 2003; Bergan, 2010; Adedoyin and Mokobi, 2013).
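The relationship among the three models can be made concrete with a short sketch of the logistic item characteristic curve. The notation is the conventional one and is assumed here rather than taken from the text: θ (`theta`) is person ability, `b` is item difficulty, `a` is discrimination, and `c` is the lower asymptote (chance parameter). Setting `c = 0` gives the 2PL, and additionally fixing `a` to a common value gives the 1PL.

```python
import math

def icc(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the logistic IRT model.

    1PL: a fixed to a common value (1.0 here), c = 0
    2PL: a free, c = 0
    3PL: a and c free
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Ability equal to item difficulty (theta = b = 0).
p_1pl = icc(0.0, b=0.0)                  # 1PL
p_2pl = icc(0.0, b=0.0, a=2.0)           # 2PL, steeper slope
p_3pl = icc(0.0, b=0.0, a=2.0, c=0.25)   # 3PL, 4-option chance floor
```

Under the 1PL and 2PL the probability at θ = b is exactly 0.50, whereas under the 3PL it is (1 + c)/2, because the curve is compressed into the range between the chance floor and 1.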

Sample Size
Given the complexity of the IRT models, it is not surprising that large sample sizes (i.e., N ≥ 500) are considered necessary (Hambleton and Jones, 1993). This is in contrast to the smaller sizes (i.e., N ≥ 200) permitted by the simpler approach of CTT. Claims that Rasch or 1PL models will estimate accurately with N < 100 (Boone et al., 2014) have been found to be unreliable (Houts et al., 2016). Research into real-data, as opposed to simulated data, with smaller sample sizes has suggested that N < 200 is infeasible (Sireci, 1992), but that N ≥ 300 can provide reasonably accurate parameter estimation provided a test has ≥ 30 items (Akour and AL-Omari, 2013).
Thus, the challenge in operational testing of classes with teacher-made tests, where N < 500 is commonplace, remains whether IRT techniques can be legitimately used to evaluate test items. In addition, few operational testing programs in higher education recycle items into future final examinations because institutional regulations require that exams be exposed to future students. This transparency practice means that, once used, there is little opportunity to collect new data with the self-same items. Hence, in this naturalistic study, the accuracy of estimation, given the available sample sizes, has to be taken somewhat cautiously. Ideally, the results would be tested on a second independent sample for corroboration purposes, but this was beyond the scope of the study.

Design
This study is a secondary data analysis of the course assessment data used to evaluate student learning in an introductory education course. The course was assessed with a mixture of two 50-item MCQ tests and essay examinations scored by course instructors and/or tutors. The course followed the university scoring system to convert percentage scores to grades against criterial descriptions for grades. Grades were A = excellent (80-100%); B = good (65-79%), C = satisfactory (50-64%), and D = unsatisfactory (0-49%). The minimum pass (C−) requires a score of at least 50%.

Participants
The participants in this study were students enrolled in one general education undergraduate course at a large research-intensive university in New Zealand. The course was an introductory educational psychology course on learning theory and offered as either a general education course (i.e., a course provided for students studying other disciplines outside the Faculty of Arts) or a normal elective course (i.e., for students from the Faculty of Arts). General education courses in this study are first-year courses taken by students from outside the faculty hosting the course to broaden their education.
Of the 380 students for whom data were available, 276 were enrolled in the general education course, while 104 were enrolled as Arts normal elective students. Only 375 of the 380 students did the midterm test, and 377 took the final examination, resulting in 372 students who received a final grade. This size of sample is sufficient for CTT analysis and is close to the recommended threshold for IRT. Unfortunately, as is the case in real-world testing programs, it is not possible to administer the items again to increase the sample size. Hence, if the IRT results are poor quality, this may be partly attributable to the relatively low sample size.
No specific demographic characteristics of the students (i.e., their gender, age, or ethnic group) were available. These operational data were released for secondary analysis without identifying information for the purpose of evaluating the test item writing.

Midterm Test
The midterm test was 2-h long and was administered in week 6 of the course, covering material related to the course content presented in the first five 2-h lectures. The test had 50 four-option MCQs, drafted by the main course lecturer, and vetted for quality by the course coordinator and the faculty examination manager against accepted best-practice conventions for writing multiple-choice items. The midterm-test items covered content related to seven topics: that is (1) cognitive processing (6 items), (2) forgetting (8 items), (3) general learning theory (5 items), (4) memory (8 items), (5) meta-cognition (10 items), (6) retrieval (7 items), and (7) schema (6 items). The test was administered on paper under invigilation.

Final Exam
The final exam had 50 four-option MCQs, which were constructed by three of the course lecturers and checked against best-practice recommendations by the course co-ordinator and faculty examination administrator. The final exam excluded topics covered in the midterm-test and covered five topics, each with 10 items, taught after the midterm test. These were (1) motivation, (2) approaches to learning, (3) problem solving, (4) social structure, and (5) behaviorism or observational learning. The exam was administered on paper under invigilation. The MCQs were worth half of the total exam score (i.e., 25% of the total course score) with three essay questions making up the balance of marks.

Analyses
Analysis of a test presumes that items are unidimensional, which may not be the case when a test covers multiple topic areas. Dimensionality was checked with confirmatory factor analysis of a single factor with 50 items using the weighted least square estimator with robust standard errors and mean- and variance-adjusted χ² test statistic (Finney and DiStefano, 2006) in Mplus version 7.4 (Muthén and Muthén, 1998-2015) to account for the dichotomous nature of the items. Current standards suggest that models do not need to be rejected if the root mean square error of approximation (RMSEA) is <0.08, the weighted root-mean-square residual (WRMR) is close to 1.00, and the comparative fit index (CFI) is >0.90 (Yu, 2002; Fan and Sivo, 2007). Cronbach's alpha estimates of internal reliability were conducted before and after removal of misfitting items to establish further evidence of total test score coherence. Psychometric analysis of the items was conducted using each statistical model. Once misfitting items were identified, student scores were recalculated, and the impact on students' grades and pass/fail rates was explored.

IRT Analyses
All IRT analyses were conducted with the "ltm" package in R, which also reports model fit (Rizopoulos, 2006). This package produces a variety of comparable fit indices including log likelihood, BIC, and AIC values. Differences of AIC > 10 indicate that the model with the smaller AIC has superior fit to the data; likewise, all models included in the 95% confidence set (indicated by the sum of Akaike weights Σwi ≥ 0.95) are plausible, equally well-fitting models (Burnham and Anderson, 2004).
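The Akaike weights used in this comparison can be sketched as follows, using the standard Burnham and Anderson formulation in which each model's weight is proportional to exp(−Δ/2), where Δ is its AIC minus the smallest AIC in the set. The function name is illustrative; the example AIC values are those reported for the final exam in the Results.

```python
import math

def akaike_weights(aic_by_model):
    """Map each model name to its Akaike weight (weights sum to 1)."""
    best = min(aic_by_model.values())
    rel = {m: math.exp(-0.5 * (aic - best)) for m, aic in aic_by_model.items()}
    total = sum(rel.values())
    return {m: r / total for m, r in rel.items()}

# AIC values reported for the final exam.
final_exam_aic = {"Rasch": 18889.88, "2PL": 18546.93, "3PL": 18578.89}
weights = akaike_weights(final_exam_aic)
```

Because the 2PL AIC is more than 30 units below the next best model, essentially all of the weight falls on the 2PL, which is consistent with the reported Σwi = 1.00.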

Classical Test Theory
Items with p-values equal to either 1 or 0 (i.e., 1 = 100% of the students answered the item correctly, 0 = 0% of the students answered the item correctly) were discarded. Items with point biserial correlation (rpb) values below 0.19 were discarded. All values were determined through SPSS version 21.

IRT 1PL Model
Each item's fit statistic values were found using the chi-square (χ²) index of probability that the item data fit the Rasch model (Raykov and Marcoulides, 2006). Only items with statistically non-significant chi-square values (i.e., p ≥ 0.05) were retained.

IRT 2PL Model
Items with discrimination values lower than 0.19 were rejected.

IRT 3PL Model
Because all MCQs had four options, items with chance >0.25 were rejected.

Grade Effect
After removing misfitting items, each student's score was generated using the revised set of items. The scores were transformed from raw percentage (CTT) or logit value (IRT) to match the original raw score mean and standard deviation for the midterm and examination separately. This was done because a standard setting exercise in which course lecturers set grade cut scores based on the revised set of items was not feasible. After transformation, the scores were added to the essay-based course and exam scores to generate a course total score. Based on this value, the number of students being awarded each grade (A to D) and pass-fail (A to C versus D) was determined.
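The rescaling step described above can be sketched as a linear transformation that standardizes the revised scores and then restores the original mean and standard deviation. This is one plausible reading of the procedure, not a confirmed reproduction of it; the function names are illustrative, and treating the lower bound of each grade band from the Design section as a cut score is an assumption.

```python
import numpy as np

def rescale_to_original(revised, original):
    """Linearly rescale revised scores to the original mean and SD."""
    revised = np.asarray(revised, dtype=float)
    original = np.asarray(original, dtype=float)
    z = (revised - revised.mean()) / revised.std(ddof=1)
    return z * original.std(ddof=1) + original.mean()

def letter_grade(percent):
    """Map a percentage to A-D, assuming band lower bounds act as cut scores."""
    if percent >= 80:
        return "A"
    if percent >= 65:
        return "B"
    if percent >= 50:
        return "C"
    return "D"

# Hypothetical data: original raw percentages and revised IRT logit scores.
original = [40.0, 55.0, 70.0, 85.0]
logits = [-1.2, -0.3, 0.4, 1.5]
rescaled = rescale_to_original(logits, original)
grades = [letter_grade(s) for s in rescaled]
```

A linear rescaling preserves the rank order and relative spacing of the revised scores, so any grade changes arise from the revised item set rather than from the transformation itself.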

Results
Given that the tests covered multiple topics, each was examined for unidimensionality. Model fit for a single factor was mixed.

Item Analysis
Table 1 shows the psychometric properties of the midterm test and final exam items according to the statistical model used.
Highlighted values indicate items which failed to meet analytic standards for each approach.
In the midterm test, the CTT model rejected 28 items, the IRT 1PL model rejected 26 items, the IRT 2PL rejected 24 items, and the IRT 3PL rejected just 20 items. It was noteworthy that in terms of distractor efficiency, 66% of items had all distractors with >5% selection, 15 items (30%) had only one distractor with a low selection rate, and just two (4%) had two bad distractors. In total, just one item (M45) was identified as misfitting by all four methods, indicating that the different approaches lead to quite different decisions about item quality. Model fit statistics indicated that the 2PL model had best fit (AIC = 17653.32, Σwi = 1.00) compared to the 3PL (AIC = 19052.00) and Rasch (AIC = 1962.39) models. The item characteristic curves for the IRT 2PL show that despite its superior fit to the data many items clearly have inverse discrimination slopes or very flat trajectories with very high intercepts at logit −4.00 (Figure 1).
In the final exam, many fewer items were identified as misfitting. The CTT model rejected 14 items for low discrimination, and distractor efficiency indicated that 70% of items had no low selection options, 12 items (24%) had one low selection distractor, and three items (6%) had two low selection options. The IRT 1PL model rejected 19 items, the IRT 2PL just three items, and the IRT 3PL 19 items. Two items (E36 and E44) were rejected by all four methods. This suggests that the quality of items written for the final exam was probably better than that for the midterm test or else that the alignment of the items to the student ability was greater. The IRT 3PL retained the most items in the midterm test, while the 2PL kept the most in the final exam. However, as per the midterm test, the IRT 2PL had the best fit (AIC = 18546.93, Σwi = 1.00) compared to the 3PL (AIC = 18578.89) and Rasch (AIC = 18889.88) models. The item characteristic curves for the IRT 2PL show that, in accordance with its superior fit to the data, few items have inverse discrimination slopes or very high intercepts at logit −4.00 (Figure 2).
Therefore, it seems that using the 2PL model as the basis of analyzing these two MCQ tests is the most robust approach and the Rasch method is the least effective IRT method. Nevertheless, in the context of classroom assessment, there may be a legitimate goal in including very easy or very difficult items (e.g., motivating sense of learning or establishing learning needs). These very easy or difficult items might have poor item discrimination statistics but may still be useful to ensure an adequate sample of the constructs of interest.

Test Analysis
After removing the misfitting items, the test statistics for each analysis were obtained (Table 2). Except for the Rasch analysis, the reliability of both tests, after removing misfitting items, reached reasonably acceptable levels of internal consistency (i.e., α > 0.70) for all models. In both tests, the CTT and IRT 2PL methods produced the highest estimates of internal consistency among items. It is worth noting that the total score for a test that assesses many different topics (e.g., the midterm test), on which the standard error of measurement depends, is likely to have a lower correlation between the total and any single content-focused item. Nonetheless, a heterogeneous item pool in terms of content may still be necessary to sample the intended domain of the test. Hence, it may be unwise to place too many eggs in the basket of high internal estimates of reliability when evaluating a test aiming to cover multiple topic areas. However, this seems not to be a problem in this instance, if either the CTT or IRT 2PL models are used to remove poor fitting items, and less so if the IRT 2PL approach is used because it retained a greater number of items than the CTT approach.

Grade impact
After removal of misfitting items, the grade distribution for the midterm and final examinations and the cumulative effect of the model changes on total course grade were determined for the IRT 2PL model only (Table 3). Fleiss' (1971) generalized kappa (κ) was calculated as a chance-corrected measure of agreement between the two rating systems, each of which independently classified the participants into one of a set of 11 (A to D−) mutually exclusive and exhaustive grade categories. King's (2004) software reported an overall κ = 0.25 (95% CI = 0.23-0.28), well below the minimum standard of κ > 0.40 to indicate that the observed agreement is greater than might occur by chance (Stemler, 2004). Interestingly, the kappa value per grade category was similarly low, except for the "D−" grade. However, when aggregated into the four main grade categories (A to D), the proportion agreement was 70%, giving κ = 0.55, a somewhat more convincing indication that grade similarity was beyond chance. Grade results changed for nearly two-thirds of the 372 participants, equally split between increase (n = 123) and decrease (n = 123). The total number of "A" grades increased trivially from 24 to 26; "B" grades fell from 146 to 133; the number of "C" grades increased from 142 to 149; and the number of fail grades increased from 60 to 64. Hence, it could be argued that using the 2PL IRT approach would not make the course look any worse in terms of grade distribution because there were two more "A"s and only four more fails.
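The agreement statistic used above can be illustrated with a small sketch of Fleiss' generalized kappa for two grading systems. The grade lists below are invented for illustration and are not the study data; the function name `fleiss_kappa` is ours and this is not the King (2004) software.

```python
from collections import Counter

def fleiss_kappa(ratings_per_subject, categories):
    """Fleiss' kappa, given one tuple of assigned categories per subject."""
    n_subjects = len(ratings_per_subject)
    n_raters = len(ratings_per_subject[0])
    totals = Counter()
    agreement_sum = 0.0
    for ratings in ratings_per_subject:
        counts = Counter(ratings)
        totals.update(counts)
        # Proportion of agreeing rater pairs for this subject.
        pairs = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += pairs / (n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_subjects
    # Chance agreement from the pooled category proportions.
    p_e = sum((totals[c] / (n_subjects * n_raters)) ** 2 for c in categories)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical grades for six students under the two scoring systems.
raw_grades = ["A", "B", "B", "C", "C", "D"]
irt_grades = ["A", "B", "C", "C", "C", "D"]
kappa = fleiss_kappa(list(zip(raw_grades, irt_grades)), ["A", "B", "C", "D"])
print(round(kappa, 2))
```

In this toy case five of six grades agree, yet kappa is lower than the raw agreement of 0.83 because some agreement is expected by chance. The same correction explains why 70% agreement on four grade categories corresponds to κ = 0.55 in the analysis above.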

Discussion
This study showed that the instructor-constructed MCQs used in this higher education course were problematic, much more so in the midterm test than the final examination. The inclusion of poor quality items had a small but critical impact on students' overall course grades, especially in terms of pass/fail decisions. Using a statistical model approach to removing items with unacceptable characteristics made a difference to course performance in a way that benefited a small number of students and overall made the course appear equally successful as the official raw score approach.
Given that MCQs have the possibility of guessing, it seems logically appropriate to analyze items with a statistical model capable of detecting the effect of chance performance. This is a feature only of the IRT 3PL statistical model. However, the current study has shown that the IRT 2PL model had superior fit to the overall data, indicating that this analysis can be sufficient to detect items with low or reverse discrimination, leading to appropriate calculation of person ability, and ultimately an appropriate grade score. It may be that the pseudo-chance guessing parameter cannot be accurately estimated with N < 1,000, and so the IRT 2PL may fit better simply because of the relatively low N in this operational test. Further evaluation of model fit in large enrollment classes (e.g., N > 1,000) could be conducted with operational MCQ tests in many large universities. Nonetheless, given the superior fit of the IRT 2PL model to the data, it may be that our logical preference for the IRT 3PL is misplaced empirically and greater emphasis should be put on using the simpler statistical model.
In contrast, the IRT 1PL model did the worst job, especially in the midterm test, when perhaps item writing quality was weaker than the final examination. The strict assumptions of the Rasch model seem to be unrealistic for use with MCQs (and quite possibly all dichotomously scored knowledge questions) and so this analysis reiterates previous findings. It may be that when items are written better or when students make greater effort, both of which are possible explanations for the better properties of the items in the final examination, the IRT 2PL model may be sufficient (Crocker and Algina, 1986).
Given the much simpler statistical manipulations involved in calculating a CTT score relative to any of the IRT models, it may be tempting to conclude that it is sufficiently robust. However, to illustrate the additional benefit of using an IRT approach over the CTT method, consider two students (i.e., AUID153 and AUID332) who both correctly answered 15 of 34 items on the CTT revised final exam for a percentage score of 44%. This is considered an unsatisfactory grade showing a lack of knowledge and understanding of the topic. However, after adjusting student scores based on the relative difficulty of items using the IRT 2PL method, student AUID153 would get four points more, moving their total grade to "C" (i.e., satisfactory), while student AUID332 would get 0.6 points less, resulting in grade "D−" (i.e., extremely poor). This suggests that, insofar as the MCQ items were concerned, student AUID332 answered fewer hard items correctly relative to AUID153. Thus, treating items of different difficulty as if they carried the same information about the quality of performance would, in these two cases, lead to different conclusions than IRT 2PL scoring. Nonetheless, since half the course grade depended on performance on essay-type questions, this small change in test score was not sufficient to change total grade. At the same time, using a test with many fewer items may also be misleading, since the abbreviated test is likely to cover a much smaller part of the intended curriculum. Thus, while the statistical analysis might lead to a more credible score, it may do so at the cost of valid inferences about competence across the full range of examination objectives. Given the power of the IRT 2PL model to adjust scores based on the relative difficulty of items, a case could be made for retaining poor quality items to ensure content coverage.
However, it is our position that having fewer items would lead to more defensible decision making than retaining poor quality items that generate misleading information. Fewer items with trusted information can result in more robust decisions than poorly constructed items. In terms of a purely formative assessment that does not contribute to summative grading, maximizing content coverage may be useful, but when coursework and class quizzes or tests contribute to final total grades, we are of the opinion that making that overall judgment based on high-quality information is more likely to lead to public credibility.
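The comparison of two students with identical raw scores can be sketched in code. The example below is illustrative only: the item parameters are hypothetical, and ability is estimated with an expected a posteriori (EAP) estimate under a standard normal prior, which is our assumption and not necessarily the scoring method used in the study.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct answer (a = discrimination, b = difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap_ability(responses, items, lo=-4.0, hi=4.0, steps=161):
    """EAP ability estimate on an evenly spaced grid with a N(0, 1) prior."""
    num = den = 0.0
    for k in range(steps):
        theta = lo + (hi - lo) * k / (steps - 1)
        prior = math.exp(-0.5 * theta * theta)
        like = 1.0
        for x, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            like *= p if x == 1 else 1.0 - p
        num += theta * prior * like
        den += prior * like
    return num / den

# Hypothetical (discrimination, difficulty) pairs for five items.
items = [(0.6, -1.5), (0.8, -0.5), (1.2, 0.0), (1.6, 0.5), (2.0, 1.0)]
student_x = [1, 1, 0, 0, 0]  # correct on the two easy, weakly discriminating items
student_y = [0, 0, 0, 1, 1]  # correct on the two hard, strongly discriminating items

theta_x = eap_ability(student_x, items)
theta_y = eap_ability(student_y, items)
print(sum(student_x), sum(student_y))   # identical raw scores
print(theta_x < theta_y)                # but different ability estimates
```

Both hypothetical students have a raw score of 2, yet the 2PL assigns them different ability estimates because the response pattern, not only the count of correct answers, enters the likelihood.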
Not using an IRT approach to score calculation potentially has a negative impact on instructors. For example, initial raw scores showed that 217 of 375 students (58%) had failed the midterm test, whereas, if IRT 2PL had been used, only 144 students (38%) would have failed, a decline of nearly 20 percentage points in the fail rate. Since a high proportion of failing grades can be interpreted as poor teaching quality (Brown, 2010), the raw test information in this situation may have led instructors to invest time and resources in changing teaching strategies, an investment that may not have been needed or that could have been spent more productively. Thus, not conducting item analysis and removing poorly performing items could lead to misleading feedback to both students and lecturers.
Alternatively, the poor characteristics of the midterm test items may suggest a different explanation. If a test is too difficult, it is understandable many students would guess. Difficulty for the students may arise from poorly written items, but also from poor instruction. However, this was not the first time the course had been run and the course content and sequences followed those set by previous administrations of the course. The only difference to previous administrations of the course was a different teacher for the first five lectures and, thus, a different item writer for the midterm test. Hence, it seems unlikely that the present study has identified a need to revise the course. Rather it seems more likely that there was a greater need for item analysis of the MCQ midterm test before scores were finalized.
The overwhelming conclusion is that item statistical analysis is a necessary adjunct to judgment-based evaluation of item quality in MCQ testing in higher education. The quality of decisions can be defended when the statistical analysis eliminates misleading items. However, this requires that, before any scores are released to students or record systems, some sort of psychometric analysis of item characteristics is conducted. Since most higher education teachers would have little training in these procedures, it seems that the development of automated analysis systems would be a useful support for academics. An automated system would indicate that certain items do not meet statistical conventions, with an opportunity for the academic to approve deletion. Once poorly fitting items are deleted, the system would recalculate scores for students. By displaying the items from easiest to hardest, the system could then ask the instructor, using the logic of bookmark standard setting (Mitzel et al., 2001), to indicate where boundaries for each grade level should be established. Having done this, the system would then transform IRT logit scores into appropriate institutional values reflecting the grade boundary decisions made by the academic. For example, a score of −0.15 might be judged to be the minimum passing mark of 50%. Then actual grades could be stored in student management systems and disseminated to students. This approach takes advantage of computer technology to calculate scores while placing responsibility for grade boundaries in the hands of the content experts teaching a course (Pitoniak and Morgan, 2012).
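The final transformation step of such an automated system could look like the following minimal sketch. It assumes piecewise-linear interpolation between instructor-chosen anchor points; only the pairing of a logit score of −0.15 with a mark of 50% comes from the example above, and every other anchor value, as well as the function name `logit_to_mark`, is hypothetical.

```python
from bisect import bisect_right

def logit_to_mark(theta, anchors):
    """Map an IRT logit score to a percentage mark by linear interpolation.

    anchors: list of (logit, mark) pairs sorted by logit, set by the instructor.
    Scores outside the anchor range are clamped to the end marks.
    """
    thetas = [t for t, _ in anchors]
    if theta <= thetas[0]:
        return float(anchors[0][1])
    if theta >= thetas[-1]:
        return float(anchors[-1][1])
    i = bisect_right(thetas, theta)
    (t0, m0), (t1, m1) = anchors[i - 1], anchors[i]
    return m0 + (m1 - m0) * (theta - t0) / (t1 - t0)

# Hypothetical grade boundaries; only (-0.15, 50) is taken from the text.
anchors = [(-4.0, 0), (-0.15, 50), (0.90, 65), (2.00, 80), (4.0, 100)]

pass_mark = logit_to_mark(-0.15, anchors)
print(pass_mark)
print(logit_to_mark(1.45, anchors))
```

Keeping the anchors as data rather than code leaves the grade boundary decisions with the instructor, in line with the bookmark logic described above.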

Future research
This last possibility identifies a clear weakness in this study. No expert-based judgment of the revised test items was conducted to determine appropriate cut scores for grades. Instead, the distributions were transformed to match the raw score distribution, which had not been subjected to a standard setting process. Another challenge to conducting IRT analysis is the availability of open-source software suitable for these analyses. While SPSS has developed routines for 1PL and 2PL analysis, the base product is not free. Likewise, Mplus, which only provides 2PL analyses, is not free. The ICL (Hanson, 2002) and PARAM (Rudner, 2012) applications are free and can run IRT 3PL analyses, but are not widely used. The free "ltm" package (Rizopoulos, 2006) in R overcomes these challenges. While very similar conclusions about which items to keep or reject would be reached across the various applications, there were more than trivial differences in the difficulty and discrimination parameters, for example, between ICL and PARAM. Hence, analysts would be greatly aided by studies that establish equivalences between open-source applications and gold-standard applications. Perhaps, with greater acceptance and use, appropriate packages (e.g., "ltm" or "mirt") in the open-source R software will be able to resolve these equivalence and access problems.
Another limitation in this study is the relatively small sample size (i.e., 372 students) and the effect it may have on estimating the pseudo-chance parameter in the IRT 3PL model. However, small sample size is a normal case in higher education and it may be relatively uncommon to have courses with at least 1,000 students. Nonetheless, future studies to further establish the robustness of IRT applications for realistically small sample sizes are needed.
The generalizability of this study is limited because only one course and only two MCQ tests in one year have been analyzed. Nonetheless, the current study is consistent with other studies that have evaluated the quality of multiple-choice items in higher education assessments. Thus, further studies into the quality of MCQ testing, especially evaluating training programs designed to improve instructor item writing skill, are needed. Evidence from publishers is needed about the qualities of items in textbook related item banks which can be used in formal assessments.

Conclusion
This research is necessary within each and every institution that uses MCQs because, while the threat cumulatively may not be large, it seems highly likely that specific exams or tests will not meet normal requirements. The credibility of assessment is necessary, especially if there is a tendency to be litigious about grading or testing (Brookhart, 2010). Any lack of quality assurance process at the course or department level poses a significant reputational risk to the institution.
This study has shown that use of IRT item analysis has a potential beneficial impact on overall course grades and number of students passing. It also suggests that more informative feedback to students and instructors might be generated by giving grades derived from item difficulty. This study provides a warning for the different stakeholders concerned with the quality of higher education assessment practices and suggests that more commitment and effort is needed in quality assurance in order to meet professional obligations.

Ethics statement
This research involves secondary analysis of anonymized datasets. No ethics risks were involved.

Author contributions
GB obtained data, supervised analysis, drafted and revised the manuscript, and took responsibility for reanalysis of data using the ltm package and Mplus. HA drafted the literature review, conducted statistical analyses, and drafted the discussion. This work began as an MA thesis awarded to HA and supervised by GB.