<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Educ.</journal-id>
<journal-title>Frontiers in Education</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Educ.</abbrev-journal-title>
<issn pub-type="epub">2504-284X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">742560</article-id>
<article-id pub-id-type="doi">10.3389/feduc.2021.742560</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Education</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>The Rasch Model Cannot Reveal Systematic Differential Item Functioning in Single Tests: Subset DIF Analysis as an Alternative Methodology</article-title>
<alt-title alt-title-type="left-running-head">Humphry and Montuoro</alt-title>
<alt-title alt-title-type="right-running-head">Subset DIF Analysis</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Humphry</surname>
<given-names>Stephen</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/42873/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Montuoro</surname>
<given-names>Paul</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/1277044/overview"/>
</contrib>
</contrib-group>
<aff>Graduate School of Education (M428), The University of Western Australia, <addr-line>Crawley</addr-line>, <addr-line>WA</addr-line>, <country>Australia</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/345250/overview">Zi Yan</ext-link>, The Education University of Hong Kong, Hong Kong SAR, China</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/490782/overview">Jason Fan</ext-link>, The University of Melbourne, Australia</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1010706/overview">Lokman Akbay</ext-link>, Istanbul University-Cerrahpasa, Turkey</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Stephen Humphry, <email>stephen.humphry@uwa.edu.au</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Assessment, Testing and Applied Measurement, a section of the journal Frontiers in Education</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>23</day>
<month>11</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>6</volume>
<elocation-id>742560</elocation-id>
<history>
<date date-type="received">
<day>16</day>
<month>07</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>01</day>
<month>11</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2021 Humphry and Montuoro.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Humphry and Montuoro</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these&#x20;terms.</p>
</license>
</permissions>
<abstract>
<p>This article demonstrates that the Rasch model cannot reveal systematic differential item functioning (DIF) in single tests. The person total score is the sufficient statistic for the person parameter estimate, eliminating the possibility of residuals at the test level. An alternative approach is to use <italic>subset DIF analysis</italic> to search for DIF in item subsets that form the components of the broader latent trait. In this methodology, person parameter estimates are initially calculated using all test items. Then, in separate analyses, these person estimates are compared to the observed means in each subset, and the residuals assessed. As such, this methodology tests the assumption that the person locations in each factor group are invariant across subsets. The first objective is to demonstrate that in single tests differences in factor groups will appear as differences in the mean person estimates and the distributions of these estimates. The second objective is to demonstrate how subset DIF analysis reveals differences between person estimates and the observed means in subsets. Implications for practitioners are discussed.</p>
</abstract>
<kwd-group>
<kwd>psychometrics</kwd>
<kwd>Rasch model</kwd>
<kwd>invariance</kwd>
<kwd>differential item functioning (DIF)</kwd>
<kwd>systematic DIF</kwd>
<kwd>subset DIF</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>In the Rasch model the total score is the sufficient statistic for calculating person ability and item difficulty parameter estimates (<xref ref-type="bibr" rid="B13">Rasch 1960</xref>). These estimates must function invariantly for valid quantitative measurement to occur. <xref ref-type="bibr" rid="B12">Rasch (1961)</xref> described the invariance requirement as follows:</p>
<p>The comparison between two stimuli should be independent of which particular individuals were instrumental for the comparison; it should also be independent of which other stimuli within the considered class were or might also have been compared. Symmetrically, a comparison between two individuals should be independent of which particular stimuli within the class considered were instrumental for comparison; and it should also be independent of which other individuals were also compared on the same or some other occasion (<xref ref-type="bibr" rid="B12">Rasch 1961</xref>, p.&#x20;322).</p>
<p>The meaningful comparison of persons therefore requires the stimuli in a measurement instrument to function invariantly. This applies not only along the variable of assessment but also between the factor groups being compared, where differential item functioning (DIF) causes items to function differently between groups who otherwise share the same ability estimate on the latent trait (<xref ref-type="bibr" rid="B8">Hagquist &#x26; Andrich, 2017</xref>). In this article, we demonstrate that the Rasch model cannot reveal systematic DIF in single tests. Therefore, we propose an alternative approach named <italic>subset DIF analysis.</italic> Person parameter estimates are initially calculated in a Rasch model analysis that includes all test items. Then, in separate analyses, these person estimates are compared to the observed means in the subsets that form the components of the broader latent trait, and the residuals assessed. As such, this methodology tests the assumption that the person locations in each factor group are invariant across subsets, instead of the systematic DIF analysis assumption that the person estimates in each factor group are invariant across tests. Subset DIF analysis can also be performed by testing persons on additional construct-relevant items. Here the additional items function as a frame of reference in the calculation of the person estimates, against which the observed means in the original subset are compared.</p>
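The core of the subset DIF procedure just described can be sketched in Python. This is an illustrative reconstruction only, with function names of our own choosing; the analyses reported in this article were performed in dedicated Rasch software, and a full implementation would use estimated item locations and class intervals rather than the known values assumed here.

```python
import math

def rasch_prob(beta, delta):
    """Dichotomous Rasch model: probability of a correct response."""
    return math.exp(beta - delta) / (1.0 + math.exp(beta - delta))

def subset_dif_residual(beta, responses, subset_deltas):
    """Standardized residual of a person's observed subset mean against the
    subset mean expected from the whole-test person estimate beta."""
    expected = [rasch_prob(beta, d) for d in subset_deltas]
    n = len(subset_deltas)
    observed_mean = sum(responses) / n
    expected_mean = sum(expected) / n
    # Variance of the mean of n independent Bernoulli responses.
    variance_of_mean = sum(p * (1.0 - p) for p in expected) / n ** 2
    return (observed_mean - expected_mean) / math.sqrt(variance_of_mean)
```

A person whose whole-test estimate predicts a subset mean of 0.5 but who answers every subset item correctly yields a positive residual; aggregating such residuals by factor group is what tests the invariance assumption across subsets.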
<p>The common expression of the Rasch model for dichotomous responses is,<disp-formula id="e1">
<mml:math id="m1">
<mml:mrow>
<mml:mi>Pr</mml:mi>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>exp</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b2;</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:msub>
<mml:mi>&#x3b4;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2b;</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>exp</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b2;</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:msub>
<mml:mi>&#x3b4;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(1)</label>
</disp-formula>where <italic>X</italic>
<sub>
<italic>ni</italic>
</sub> &#x3d; <italic>x</italic>
<sub>
<italic>ni,</italic>
</sub> <italic>x</italic>
<sub>
<italic>ni</italic>
</sub> <inline-formula id="inf1">
<mml:math id="m2">
<mml:mi mathvariant="normal">&#x3f5;</mml:mi>
</mml:math>
</inline-formula> {1, 0}, is a Bernoulli random variable, and <italic>&#x3b2;</italic>
<sub>
<italic>n</italic>
</sub> and &#x3b4;<sub>
<italic>i</italic>
</sub> denote the person <italic>n</italic> and item <italic>i</italic> locations on a latent continuum. The person ability and item difficulty parameter estimates are placed on a common logit scale, on which the locations of persons and items can be compared. This enables the analysis of the functioning characteristics of items along the continuum of the latent trait using expected score curves. These curves use person ability and item difficulty parameters to predict scores on the latent trait. Ideally, the observed means of persons in adjacent class intervals conform to the expected values of the expected score curve. In the case of dichotomous items, the expected value reduces to the probability of a correct response. Misfit between the observed means and the expected score curve represents a general lack of invariance across the variable and can appear as low or high discrimination of observed means compared to the expected score curve (<xref ref-type="bibr" rid="B8">Hagquist &#x26; Andrich, 2017</xref>).</p>
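Equation 1 and the expected score curve it implies can be sketched directly; this is an illustrative sketch of the model equation only, not the software used in the article's analyses.

```python
import math

def rasch_prob(beta, delta):
    """Eq. 1: Pr{X_ni = 1} = exp(beta_n - delta_i) / (1 + exp(beta_n - delta_i))."""
    return math.exp(beta - delta) / (1.0 + math.exp(beta - delta))

# For a dichotomous item the expected score reduces to this probability, so
# the expected score curve is traced by evaluating it along the person
# continuum for a fixed item difficulty (here delta = 0.5 as an example).
curve = [(beta, rasch_prob(beta, 0.5)) for beta in (-2, -1, 0, 1, 2)]
```

When a person's location equals the item's difficulty, the probability of a correct response is exactly 0.5, which is why persons and items can be meaningfully compared on the common logit scale.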
<p>As stated earlier, DIF is a form of misfit that occurs when items do not function in the same way for different factor groups who otherwise share the same ability estimate on the latent trait (<xref ref-type="bibr" rid="B8">Hagquist &#x26; Andrich, 2017</xref>). Therefore, DIF occurs when the probability of answering an item correctly is not the same for persons who share the same ability level but belong to different factor groups. Here items are said to have different relative difficulties for different groups, thus violating invariance and distorting person comparisons (<xref ref-type="bibr" rid="B2">Andrich &#x26; Hagquist, 2004</xref>; <xref ref-type="bibr" rid="B1">Andrich &#x26; Hagquist, 2012</xref>; <xref ref-type="bibr" rid="B8">Hagquist &#x26; Andrich, 2017</xref>; <xref ref-type="bibr" rid="B3">Andrich &#x26; Marais, 2019</xref>).</p>
<p>In practice, DIF is used to examine whether there is bias in an item. Uniform DIF favoring one factor group, such as girls, means that for any given ability, girls obtain higher scores on average than boys. Non-uniform DIF occurs when one factor group obtains higher scores on average, but not across all ability levels. For example, girls may obtain higher scores on average at the lower end of the ability range, whereas boys obtain higher scores on average, at the higher end of the range. One question that arises in practice is whether a test <italic>as a whole</italic> is biased in favor of one factor group. This paper focuses on how to approach this, and related, questions in applied contexts.</p>
<p>Systematic DIF in a set of questions can produce misleading data regarding the performance of factor groups. This is particularly an issue if the DIF affects validity. For example, if a significant number of items in a mathematics test demand a high level of vocabulary, that demand may introduce construct-irrelevant variance. As such, systematic DIF can have a bearing on the validity of an assessment and inferences drawn in comparing factor group results.</p>
<p>The methodology for identifying DIF used in this study was described by <xref ref-type="bibr" rid="B2">Andrich and Hagquist (2004)</xref>. In this approach, a single set of item parameters is estimated and the residuals of each factor group are analysed. DIF can be checked graphically by inspecting the residuals around the expected score curve. The observed means are displayed separately for each factor group. However, in this approach DIF can also be checked statistically via an analysis of the residuals. The standardized residual of each person, <italic>n</italic>, to each item, <italic>i</italic>, is given:<disp-formula id="e2">
<mml:math id="m3">
<mml:mrow>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>E</mml:mi>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mi>v</mml:mi>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(2)</label>
</disp-formula>where <italic>E</italic> [<italic>x</italic>
<sub>
<italic>ni</italic>
</sub>] is the expected value given person <italic>n&#x2019;s</italic> and item <italic>i&#x2019;s</italic> parameter estimates, and <italic>v</italic> [<italic>x</italic>
<sub>
<italic>ni</italic>
</sub>] is the variance. For the purpose of more detailed analysis, each person is further identified by their group membership, <italic>g</italic>, and by the class interval, <italic>c</italic>. This gives:<disp-formula id="e3">
<mml:math id="m4">
<mml:mrow>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>E</mml:mi>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mi>v</mml:mi>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(3)</label>
</disp-formula>
</p>
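The standardized residual of Equation 2 reduces to a simple computation for dichotomous responses; the sketch below is our own illustration of the formula, not the RUMM2030 implementation.

```python
import math

def rasch_prob(beta, delta):
    """Eq. 1: model probability of a correct response."""
    return math.exp(beta - delta) / (1.0 + math.exp(beta - delta))

def standardized_residual(x, beta, delta):
    """Eq. 2: z_ni = (x_ni - E[x_ni]) / sqrt(V[x_ni]); for a Bernoulli
    response, E[x_ni] = p and V[x_ni] = p(1 - p)."""
    p = rasch_prob(beta, delta)
    return (x - p) / math.sqrt(p * (1.0 - p))
```

Equation 3 applies the same formula after further indexing each person by class interval *c* and factor group *g*; the mean residual within each (*c*, *g*) cell is what enters the factorial ANOVA.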
<p>The residuals are then analysed using a factorial ANOVA, which determines whether there is a statistically significant difference among the mean residuals for the factor groups. Discussion of DIF in the literature commonly focuses on two groups: some studies compare a minority group with a majority group, while others investigate groups of equal status, such as gender groups. In this article we focus on boys and girls in secondary school. DIF also appears graphically as a difference between the item characteristic curves (ICCs) plotted separately for two factor groups; examples are shown later in this article. If, for example, DIF uniformly favors boys, the ICC for boys lies above the ICC for girls, and the average standardized residual is more positive for boys than for&#x20;girls.</p>
<p>Systematic DIF refers to the aggregation of DIF in favor of a factor group across a test. It refers to a generalized lack of invariance between factor groups. Decisions are normally based on test scores, so systematic DIF is of practical importance. In the Rasch model, as will be shown, systematic DIF cannot appear in a single test because the total score is the sufficient statistic for person estimates. In other words, grouping persons by their parameter estimates is equivalent to grouping persons by their total scores. Hence, there can be no residuals and therefore no systematic DIF at the test&#x20;level.</p>
<p>In estimating the person parameter for each person, the sum of probabilities across items must equal the total score, i.e.,&#x20;<inline-formula id="inf2">
<mml:math id="m5">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mstyle>
<mml:mo>&#x3d;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:math>
</inline-formula>. &#x201c;Thus the sum of the probabilities, or proportions, of positive responses across items of persons with a total score of <italic>r</italic> must be <italic>r</italic>&#x201d; (<xref ref-type="bibr" rid="B1">Andrich and Hagquist, 2012</xref>, p. 27). If, for a group <italic>g</italic> of persons with a mean ability estimate <inline-formula id="inf3">
<mml:math id="m6">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>&#x3b2;</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>, observed proportions correct are higher than the probabilities across all items in a test, the solution equation will not be satisfied. For example, suppose boys with a total score of <italic>x</italic> on a single test are grouped together. For this group, a consequence of the maximum likelihood estimation (MLE) solution equation is that<inline-formula id="inf4">
<mml:math id="m7">
<mml:mrow>
<mml:mo>&#xa0;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:mstyle>
<mml:mo>&#x3d;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:math>
</inline-formula>. That is, the sum of probabilities is equal to the sum of mean scores, or proportions, for persons in group <italic>g,</italic> <inline-formula id="inf5">
<mml:math id="m8">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mstyle>
<mml:mo>&#x3d;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>x</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:math>
</inline-formula> (see <xref ref-type="bibr" rid="B4">Andrich, 1988</xref>). Once person abilities have been estimated according to the model, therefore, the group of boys cannot have higher proportions correct than expected in the model on all items in the test. If the proportions were higher on all items, the sum of total scores would exceed the sum of probabilities for the&#x20;group.</p>
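The sufficiency argument can be checked numerically: solving the maximum likelihood equation forces the summed probabilities to equal the total score exactly, so a group's residuals cannot all point in the same direction across every item. A minimal Newton-Raphson sketch (our own illustration, with hypothetical item difficulties):

```python
import math

def rasch_prob(beta, delta):
    """Dichotomous Rasch model probability of a correct response."""
    return math.exp(beta - delta) / (1.0 + math.exp(beta - delta))

def solve_beta(r, deltas, iters=50):
    """Newton-Raphson solution of the MLE equation sum_i p_ni(beta) = r,
    for a total score 0 < r < len(deltas)."""
    beta = 0.0
    for _ in range(iters):
        ps = [rasch_prob(beta, d) for d in deltas]
        f = sum(ps) - r                        # solution-equation residual
        fp = sum(p * (1.0 - p) for p in ps)    # derivative of the sum of probabilities
        beta -= f / fp
    return beta
```

With difficulties [-1, 0, 1] and a total score of 2, the returned estimate makes the probabilities sum to exactly 2, so the residuals for a person with that score sum to zero across items by construction.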
<p>In the Rasch model, if one factor group has a higher ability than another factor group, the persons in the former group will receive higher total scores and ability estimates. However, this does not represent systematic DIF. Even though persons in the former group demonstrated higher abilities on the latent trait, this is not the same as the test functioning differently for persons between groups who otherwise share the same ability (for a discussion, see <xref ref-type="bibr" rid="B7">Drasgow, 1987</xref>).</p>
<p>Few studies have investigated systematic DIF. <xref ref-type="bibr" rid="B7">Drasgow (1987)</xref> analysed the ACT Assessment English and Mathematics Usage tests using the three-parameter logistic (3PL) model. Drasgow did not find systematic DIF in either test and concluded that the tests provided equivalent measurement for all factor groups. But it may have been impossible for Drasgow to identify underlying systematic DIF using the 3PL model, for the very same reason that the Rasch model cannot reveal systematic DIF in single tests<sup>1</sup>. It is therefore interesting to note Drasgow&#x2019;s comment about the apparent lack of systematic DIF in his studies:</p>
<disp-quote>
<p>Several readers of an earlier draft of this article &#x2026; suggest[ed] that some methodological artifact in the IRT analysis would always force observed test-characteristic curves of female and minority groups to match the white male curve <italic>even when there really were differences</italic> (p. 27, emphases in original)<italic>.</italic>
</p>
</disp-quote>
<p>Others have searched for systematic DIF using the one-parameter logistic (1PL) model (<xref ref-type="bibr" rid="B14">Takala and Kaftandjieva, 2000</xref>) and the 3PL model (<xref ref-type="bibr" rid="B10">Pae, 2004</xref>). Neither study found systematic DIF. Others such as <xref ref-type="bibr" rid="B6">Chalmers, Counsell, and Flora (2016)</xref> searched for systematic DIF using the differential functioning of items and tests (DFIT) methodology originally proposed by <xref ref-type="bibr" rid="B11">Raju, van der Linden, and Fleer (1995)</xref>. This approach calculates item difficulty estimates using the raw scores from each factor group in separate analyses. For example, in the case of gender groups, item estimates are calculated using boys&#x2019; raw scores, and then recalculated using girls&#x2019; raw scores. The two sets of item estimates are placed on a common scale. From here two sets of person estimates for each group are calculated. For example, the boys&#x2019; person estimates are calculated using the anchored item estimates derived from the boys&#x2019; raw scores, and then recalculated using the anchored item estimates derived from the girls&#x2019; raw scores. According to the methodology, if the two sets of person estimates for a factor group are not equal, this indicates systematic&#x20;DIF.</p>
<p>However, it also appears that the DFIT methodology cannot reveal systematic DIF. When multiple sets of item estimates are calculated using the raw scores from different factor groups and then placed on a common scale, the resulting sets of person estimates for any one factor group must be equal, or nearly equal, on average. This is because, after equating, each set of item estimates shares the same mean and standard deviation. Whichever set of anchored item estimates is used to calculate person estimates, there will never be an appreciable difference in the person estimates, on average.</p>
<p>The current study is based on a general reasoning test designed for Australian senior secondary school students. The test comprises 72 dichotomous items evenly divided into nonverbal and verbal reasoning item subsets. First, we hypothesize that when subsets are treated as single tests, differences in the factor groups will appear as differences in the mean person estimates and the distributions of these estimates. However, we hypothesize that systematic DIF will not appear when subsets are treated as single tests, from which it can be inferred that systematic DIF will not appear if the general reasoning test is treated as a single test. Second, we hypothesize that subset DIF analyses will reveal subset DIF in both the nonverbal and verbal item subsets. As such, this alternative methodology does not assess systematic DIF, but instead tests the assumption that the person locations in each factor group are invariant across subsets.</p>
</sec>
<sec id="s2">
<title>2 Method</title>
<sec id="s2-1">
<title>2.1 Data and Instrument</title>
<p>The Year 10 general reasoning test used in this study was developed by Academic Assessment Services (AAS). It is a pencil-and-paper test comprising 72 dichotomous items. The test is similar to the <italic>Otis-Lennon School Ability Test</italic> (OLSAT 8) (<xref ref-type="bibr" rid="B9">Otis, 2009</xref>), in that it includes 36 nonverbal reasoning items (pictorial reasoning, figural reasoning, and quantitative reasoning), and 36 verbal reasoning items (verbal comprehension, sentence completion and arrangements, and logical, arithmetic, and verbal reasoning). The items in the test are arranged in ascending order of difficulty.</p>
<p>In 2019, the general reasoning test was administered to 12,476 students in over 100 secondary schools. This included schools from a range of socioeconomic locations in every Australian state and territory, except the Northern Territory. The schools included state schools and independent schools from a range of denominations. The data for this study were derived from a random sample of 1,604 students in 12 schools from the 2019 data set (<italic>M</italic>
<sub>
<italic>Age</italic>
</sub> &#x3d; 15.066, <italic>SD</italic>
<sub>
<italic>Age</italic>
</sub> &#x3d; 0.462). The sample comprised 806 boys (<italic>M</italic>
<sub>
<italic>Age</italic>
</sub> &#x3d; 15.067, <italic>SD</italic>
<sub>
<italic>Age</italic>
</sub> &#x3d; 0.475) and 798 girls (<italic>M</italic>
<sub>
<italic>Age</italic>
</sub> &#x3d; 15.063, <italic>SD</italic>
<sub>
<italic>Age</italic>
</sub> &#x3d; 0.449).</p>
</sec>
<sec id="s2-2">
<title>2.2 Procedure</title>
<p>The general reasoning test used in this study is administered to students as part of the AAS Year 10 standardized testing program. The general reasoning test is one of five tests in the program, which also includes numeracy, reading, writing, and spelling tests. The half-day program is normally conducted in a large communal area such as a gym. The program begins between 8:30&#xa0;am and 9:00&#xa0;am and includes two sessions, each running for 112&#xa0;min.</p>
<p>The general reasoning test is the first test administered to students. Each student receives a general reasoning test booklet which they are not permitted to mark, an optical mark recognition (OMR) answer booklet, and scrap paper for workings. Calculators are not permitted in this test. The supervisor briefly introduces the test and demonstrates how to answer items by referring to the five sample items at the beginning of the test booklet. There is no reading time and students have a maximum of 45&#xa0;min to complete the test. Supervisors are not permitted to answer questions that could assist students.</p>
</sec>
<sec id="s2-3">
<title>2.3 Naming Conventions</title>
<p>For clarity, in the general reasoning test analyses the item set comprising nonverbal reasoning items is named the <italic>nonverbal reasoning subset</italic> (i.e.,&#x20;as opposed to the <italic>nonverbal reasoning test</italic>). Likewise, in the general reasoning test analyses the item set comprising verbal reasoning items is named the <italic>verbal reasoning subset</italic>. However, when these subsets are analysed as single tests, they are simply named the <italic>nonverbal reasoning test</italic> and <italic>verbal reasoning test</italic> (see <xref ref-type="fig" rid="F1">Figure&#x20;1</xref>).</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Data formats for tests and subsets.</p>
</caption>
<graphic xlink:href="feduc-06-742560-g001.tif"/>
</fig>
</sec>
<sec id="s2-4">
<title>2.4 Identifying Misfitting Items</title>
<p>The Rasch model analyses in this study were all performed in the software package <italic>RUMM2030 Professional</italic> (<xref ref-type="bibr" rid="B5">Andrich, Sheridan, and Luo, 2018</xref>). Initially, a random sample of 300 persons was derived from the complete data set of 12,476 persons. The general reasoning test was then analysed and misfitting items were identified. As can be seen in <xref ref-type="table" rid="T1">Table&#x20;1</xref>, the summary statistics revealed that the standard deviation of the item locations was close to 1. The mean fit residual was close to zero, but the standard deviation was higher than 1. The item-trait interaction was significant, &#x3c7;<sup>2</sup> (648, 300) &#x3d; 796.368, <italic>p</italic>&#x20;&#x3c; 0.0001. The correlation between item locations and standardized residuals was low. The mean person location was slightly higher than zero and its standard deviation was close to 1, while the person separation index was&#x20;high.</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Summary statistics of general reasoning&#x20;test.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left"/>
<th align="center">SD <italic>&#x3b4;</italic>
</th>
<th align="center">Mean fit residual</th>
<th align="center">SD&#x20;fit residual</th>
<th align="center">Correl. <italic>&#x3b4;/s</italic>td residual</th>
<th align="center">Mean <italic>&#x3b2;</italic>
</th>
<th align="center">SD <italic>&#x3b2;</italic>
</th>
<th align="center">PSI</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Complete Test (72 Items)</td>
<td align="char" char=".">1.004</td>
<td align="char" char=".">0.095</td>
<td align="char" char=".">1.344</td>
<td align="char" char=".">0.119</td>
<td align="char" char=".">0.262</td>
<td align="char" char=".">0.922</td>
<td align="char" char=".">0.863</td>
</tr>
<tr>
<td align="left">Modified Test (64 items)</td>
<td align="char" char=".">0.984</td>
<td align="char" char=".">0.039</td>
<td align="char" char=".">1.064</td>
<td align="char" char=".">0.138</td>
<td align="char" char=".">0.351</td>
<td align="char" char=".">0.993</td>
<td align="char" char=".">0.866</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Subsequent analyses of the fit residuals, chi-square statistics, and expected score curves revealed two low-discriminating items in the nonverbal reasoning subset, and six low-discriminating items in the verbal reasoning subset. These items were removed and the modified data set was reanalyzed. As can be seen in <xref ref-type="table" rid="T1">Table&#x20;1</xref>, the item location standard deviation fell slightly. The mean fit residual fell, and the standard deviation moved closer to 1. The item-trait interaction was no longer significant, &#x3c7;<sup>2</sup> (575, 300) &#x3d; 589.215, <italic>p</italic>&#x20;&#x3d; 0.332. The correlation between the item locations and standardized residuals increased slightly, but still remained low. Finally, the mean and standard deviation of the person locations both increased slightly, but there was almost no change in the person separation&#x20;index.</p>
</sec>
<sec id="s2-5">
<title>2.5 Data Structure</title>
<p>After removing the low-discriminating items, separate analyses were performed on the general reasoning, nonverbal reasoning, and verbal reasoning data sets to calculate the item estimates. Missing data were treated as missing.</p>
<p>Person estimates were then calculated in new analyses. In each analysis, individual item anchoring was used to anchor the item estimates from the previous respective analyses, and this time missing data were treated as incorrect. The assumption here was that, for the purpose of calculating person estimates, the missing data were missing not at random (MNAR) (i.e.,&#x20;missingness was related to person ability) and that treating them as missing would bias the person estimates.</p>
<p>In these latter analyses, the most difficult items (i.e.,&#x20;the items with the highest proportion of missing data) showed very low discrimination: the proportion of correct responses in each class interval was far lower than the expected scores derived from the anchored item estimates. This indicated that the missing data in the initial analyses were probably MNAR, negatively biasing the item estimates for the most difficult items (for a discussion, see <xref ref-type="bibr" rid="B15">Waterbury, 2019</xref>). The initial analyses were therefore disregarded, and missing data were treated as incorrect in all subsequent analyses.</p>
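The two treatments of missing responses described above can be sketched in a few lines of code. This is a hypothetical illustration only; <italic>RUMM2030 Professional</italic> performs the equivalent recoding internally, and the response values here are invented.

```python
# Sketch of the two treatments of missing responses described above.
# A response matrix uses 1 = correct, 0 = incorrect, None = missing.
# For item estimation, missing responses are left out of the raw score;
# for person estimation (with anchored items), they are scored incorrect.

def score_for_person_estimation(responses):
    """Recode missing responses as incorrect (MNAR assumption)."""
    return [[0 if r is None else r for r in person] for person in responses]

def raw_score(person):
    """Total score ignoring missing responses (item-estimation stage)."""
    return sum(r for r in person if r is not None)

# Hypothetical responses for two persons on four items.
responses = [[1, 1, None, 0],
             [1, None, None, 1]]

recoded = score_for_person_estimation(responses)
```

Under the MNAR assumption, the second treatment lowers the raw scores of persons who omitted items, which is the intended effect.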
</sec>
</sec>
<sec id="s3">
<title>3 Results</title>
<sec id="s3-1">
<title>3.1 Person Location Estimates for Factor Groups</title>
<p>This study began with a comparison of person parameter estimates derived from the general, nonverbal, and verbal reasoning test analyses. The person estimates from the nonverbal and verbal reasoning tests were placed on the same scale as the person estimates from the general reasoning test using a two-step mean equating process performed in a spreadsheet. First, the item parameter estimates from the nonverbal and verbal reasoning tests were mean equated to the item estimates from the general reasoning test. New analyses were then performed on the nonverbal and verbal tests, in which the mean-equated item estimates were anchored using individual item anchoring.</p>
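The first step of the mean equating process can be sketched as follows. This is a minimal illustration in code rather than a spreadsheet, and the item difficulty values are hypothetical; the anchoring step itself is performed by the analysis software.

```python
# Step 1: mean equate subtest item estimates to the general-test scale by
# shifting them so their mean matches the mean of the same items'
# estimates from the general reasoning test analysis.

def mean_equate(subtest_deltas, general_deltas):
    shift = (sum(general_deltas) / len(general_deltas)
             - sum(subtest_deltas) / len(subtest_deltas))
    return [d + shift for d in subtest_deltas]

# Hypothetical difficulty estimates for the same items in two analyses.
subtest = [-0.8, 0.1, 0.9]      # from the subtest-only analysis
general = [-0.5, 0.4, 1.2]      # from the general reasoning test analysis

equated = mean_equate(subtest, general)
# Step 2 (software-specific): anchor the equated values in a new analysis
# so the resulting person estimates lie on the general-test scale.
```

Because mean equating is a uniform shift, relative item difficulties within the subtest are preserved; only the origin of the scale changes.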
<p>As can be seen in <xref ref-type="table" rid="T2">Table&#x20;2</xref> and <xref ref-type="fig" rid="F2">Figure&#x20;2</xref>, in the nonverbal reasoning test the mean person estimate for boys was higher than that for girls. The opposite was true in the verbal reasoning test, although the magnitude of the difference was smaller. This imbalance between the tests in which factor group performed better resulted in a slightly higher mean person estimate for boys in the general reasoning&#x20;test.</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Mean person estimates for factor groups.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left"/>
<th align="center">Nonverbal reasoning test analysis</th>
<th align="center">Verbal reasoning test analysis</th>
<th align="center">General reasoning test analysis</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Boys</td>
<td align="char" char=".">0.015</td>
<td align="char" char=".">&#x2212;0.167</td>
<td align="char" char=".">&#x2212;0.065</td>
</tr>
<tr>
<td align="left">Girls</td>
<td align="char" char=".">&#x2212;0.240</td>
<td align="char" char=".">&#x2212;0.072</td>
<td align="char" char=".">&#x2212;0.165</td>
</tr>
<tr>
<td align="left">Difference</td>
<td align="char" char=".">0.255</td>
<td align="char" char=".">&#x2212;0.095</td>
<td align="char" char=".">0.100</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Mean person estimates for boys and girls.</p>
</caption>
<graphic xlink:href="feduc-06-742560-g002.tif"/>
</fig>
<p>The frequency distributions of the person estimates for the two factor groups were also different in the nonverbal and verbal reasoning tests. In the nonverbal reasoning test, the frequency distribution for boys was positioned to the right of the frequency distribution for girls (see <xref ref-type="fig" rid="F3">Figure&#x20;3</xref>). A Mann-Whitney <italic>U</italic> test revealed that the difference in the estimates for boys (<italic>Mdn</italic> &#x3d; 0.066) and girls (<italic>Mdn</italic> &#x3d; &#x2212;0.241) was statistically significant, <italic>U</italic>(<italic>N</italic>
<sub>boys</sub> &#x3d; 806, <italic>N</italic>
<sub>girls</sub> &#x3d; 798), <italic>z</italic>&#x20;&#x3d;&#x20;5.543, <italic>p</italic>&#x20;&#x3c; 0.0001, <italic>r</italic>&#x20;&#x3d;&#x20;0.14.</p>
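The Mann-Whitney <italic>U</italic> tests reported here, with the normal approximation for <italic>z</italic> and the effect size <italic>r</italic> &#x3d; |<italic>z</italic>|/&#x221a;<italic>N</italic>, can be computed as in the following self-contained sketch. The sample data are made up; the published analyses used the full samples of 806 boys and 798 girls.

```python
from math import sqrt

def mann_whitney_u(x, y):
    """U for group x, z via the normal approximation, and r = |z|/sqrt(N).

    Ties receive average ranks; no tie correction is applied to the
    variance, which is adequate for mostly untied data.
    """
    combined = sorted((v, i) for i, v in enumerate(x + y))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg = (i + j) / 2 + 1          # average rank for a tied run
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg
        i = j + 1
    n1, n2 = len(x), len(y)
    r1 = sum(ranks[:n1])               # rank sum of group x
    u = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return u, z, abs(z) / sqrt(n1 + n2)

# Hypothetical usage with toy scores for two groups.
u, z, r = mann_whitney_u([1, 2, 3], [4, 5, 6])
```

Equivalent results can be obtained from standard statistical software; the sketch only makes the rank-sum arithmetic explicit.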
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Frequency distribution of person estimates for boys and girls: Nonverbal reasoning test analysis.</p>
</caption>
<graphic xlink:href="feduc-06-742560-g003.tif"/>
</fig>
<p>The opposite was true in the verbal reasoning test (see <xref ref-type="fig" rid="F4">Figure&#x20;4</xref>). The frequency distributions of the person estimates for boys and girls were more closely positioned because of the imbalance in superior performances between boys and girls on the nonverbal and verbal tests. Nevertheless, a Mann-Whitney <italic>U</italic> test indicated that the difference in the person estimates for girls (<italic>Mdn</italic> &#x3d; &#x2212;0.097) and boys (<italic>Mdn</italic> &#x3d; &#x2212;0.293) was statistically significant, <italic>U</italic>(<italic>N</italic>
<sub>girls</sub> &#x3d; 798, <italic>N</italic>
<sub>boys</sub> &#x3d; 806), <italic>z</italic>&#x20;&#x3d; &#x2212;2.058, <italic>p</italic>&#x20;&#x3d; 0.04, <italic>r</italic>&#x20;&#x3d;&#x20;0.05.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Frequency distribution of person estimates for boys and girls: Verbal reasoning test analysis.</p>
</caption>
<graphic xlink:href="feduc-06-742560-g004.tif"/>
</fig>
<p>In the general reasoning test analysis, the frequency distributions of the person parameter estimates were more closely matched because the estimates were derived from an overall measure of general reasoning. Combining the tests limited the effect of the factor group superior performances in the nonverbal and verbal tests (see <xref ref-type="fig" rid="F5">Figure&#x20;5</xref>). However, the frequency distribution for boys was positioned to the right of that for girls because of the relative imbalance in superior performances. A Mann-Whitney <italic>U</italic> test indicated that the difference in the estimates for boys (<italic>Mdn</italic> &#x3d; &#x2212;0.093) and girls (<italic>Mdn</italic> &#x3d; &#x2212;0.223) was statistically significant, <italic>U</italic>(<italic>N</italic>
<sub>boys</sub> &#x3d; 806, <italic>N</italic>
<sub>girls</sub> &#x3d; 798), <italic>z</italic>&#x20;&#x3d; 2.492, <italic>p</italic>&#x20;&#x3d; 0.01, <italic>r</italic>&#x20;&#x3d;&#x20;0.06.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>Frequency distribution of person estimates for boys and girls: General reasoning test analysis.</p>
</caption>
<graphic xlink:href="feduc-06-742560-g005.tif"/>
</fig>
</sec>
<sec id="s3-2">
<title>3.2 Systematic DIF in Single Tests and Subset DIF in Subsets</title>
<p>In the following analyses, we extracted the person estimates, expected total scores, and observed means from <italic>RUMM2030 Professional</italic>, and plotted test characteristic curves (TCCs) in a spreadsheet. When the nonverbal reasoning test was treated as a single test, the observed means for boys and girls directly conformed to the TCC (see <xref ref-type="fig" rid="F6">Figure&#x20;6</xref>), indicating no systematic DIF. The same was true when the verbal reasoning test was treated as a single test (see <xref ref-type="fig" rid="F7">Figure&#x20;7</xref>).</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Test characteristic curve with observed scores for boys and girls: Nonverbal reasoning test.</p>
</caption>
<graphic xlink:href="feduc-06-742560-g006.tif"/>
</fig>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>Test characteristic curve with observed scores for boys and girls: Verbal reasoning test.</p>
</caption>
<graphic xlink:href="feduc-06-742560-g007.tif"/>
</fig>
<p>Subset DIF analyses were then performed. The nonverbal and verbal reasoning tests were combined, with each subset effectively acting as a frame of reference for the person estimates in the other subset. The TCCs for the nonverbal and verbal subsets were constructed by performing two subtest analyses within the general reasoning test analysis (see <xref ref-type="fig" rid="F8">Figures&#x20;8</xref>, <xref ref-type="fig" rid="F9">9</xref>). For example, the nonverbal reasoning subset analysis began in <italic>RUMM2030 Professional</italic>, where the nonverbal items were aggregated into a higher-order polytomous item, after which the general reasoning test was reanalyzed. The resulting person estimates, expected total scores, and observed means for the subtest were extracted from <italic>RUMM2030 Professional,</italic> and the nonverbal reasoning subset TCC was plotted in a spreadsheet. The same process was followed for the verbal reasoning subset&#x20;TCC.</p>
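Under the dichotomous Rasch model, the expected total score on a subset at a given person location is simply the sum of the item response probabilities, so a subset TCC can be traced from the anchored item estimates alone. A minimal sketch with hypothetical difficulties (<italic>RUMM2030 Professional</italic> produces these curves directly):

```python
from math import exp

def rasch_prob(theta, delta):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + exp(-(theta - delta)))

def expected_subset_score(theta, deltas):
    """Expected total score on a subset of items at person location theta."""
    return sum(rasch_prob(theta, d) for d in deltas)

# Hypothetical anchored difficulties for a three-item subset.
deltas = [-1.0, 0.0, 1.0]

# Trace the subset TCC from -4 to +4 logits in half-logit steps.
curve = [(t / 2, expected_subset_score(t / 2, deltas))
         for t in range(-8, 9)]
```

Plotting the class-interval observed means for each factor group against this curve is what reveals subset DIF: uniform displacement of one group's observed means from the curve indicates DIF in that group's favor.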
<fig id="F8" position="float">
<label>FIGURE 8</label>
<caption>
<p>Test characteristic curve with observed scores for boys and girls: General reasoning test, nonverbal reasoning subset.</p>
</caption>
<graphic xlink:href="feduc-06-742560-g008.tif"/>
</fig>
<fig id="F9" position="float">
<label>FIGURE 9</label>
<caption>
<p>Test characteristic curve with observed scores for boys and girls: General reasoning test, verbal reasoning subset.</p>
</caption>
<graphic xlink:href="feduc-06-742560-g009.tif"/>
</fig>
<p>The nonverbal reasoning subset TCC revealed the observed means for boys were uniformly higher than both the expected score curve and the observed means for girls, thus indicating subset DIF in favor of boys (see <xref ref-type="fig" rid="F8">Figure&#x20;8</xref>). The verbal reasoning subset TCC revealed the observed means for girls were uniformly higher than both the expected score curve and the observed means for boys, thus indicating subset DIF in favor of girls (see <xref ref-type="fig" rid="F9">Figure&#x20;9</xref>).</p>
</sec>
<sec id="s3-3">
<title>3.3 DIF Magnitude</title>
<p>The number of items favoring boys and girls was initially determined using the mean residual of the observed means for boys and girls from the expected scores across 10 class intervals. When the nonverbal and verbal reasoning tests were treated as single tests, the total numbers of items favoring boys and girls were almost equal in both tests (see <xref ref-type="table" rid="T3">Table&#x20;3</xref> and <xref ref-type="fig" rid="F10">Figure&#x20;10</xref>).</p>
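The item counts in Table&#x20;3 follow from comparing, item by item, the mean residual of each group's observed means from the expected scores across the class intervals. A sketch with hypothetical residual values:

```python
def count_favored_items(boys_resid, girls_resid):
    """Count items favoring boys, girls, or neither.

    Each argument maps item -> mean residual of that group's observed
    means from the expected scores across the class intervals.  An item
    favors the group with the larger mean residual.
    """
    counts = {"boys": 0, "girls": 0, "equal": 0}
    for item in boys_resid:
        b, g = boys_resid[item], girls_resid[item]
        if b > g:
            counts["boys"] += 1
        elif g > b:
            counts["girls"] += 1
        else:
            counts["equal"] += 1
    return counts

# Hypothetical mean DIF residuals for four items.
boys = {"i1": 0.04, "i2": -0.02, "i3": 0.01, "i4": 0.00}
girls = {"i1": -0.03, "i2": 0.05, "i3": -0.01, "i4": 0.00}
```

In a single-test analysis these counts tend to balance, because the residuals sum to zero across all items; a marked imbalance within a subset is the signature of subset DIF.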
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Items in favor of factor groups: Mean DIF residuals across 10 class intervals.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left"/>
<th align="center">Nonverbal reasoning test analysis</th>
<th align="center">Verbal reasoning test analysis</th>
<th align="center">Nonverbal reasoning subset analysis</th>
<th align="center">Verbal reasoning subset analysis</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Boys</td>
<td align="char" char=".">16</td>
<td align="char" char=".">13</td>
<td align="char" char=".">22</td>
<td align="char" char=".">6</td>
</tr>
<tr>
<td align="left">Girls</td>
<td align="char" char=".">17</td>
<td align="char" char=".">16</td>
<td align="char" char=".">12</td>
<td align="char" char=".">23</td>
</tr>
<tr>
<td align="left">Equal</td>
<td align="char" char=".">1</td>
<td align="char" char=".">1</td>
<td align="char" char=".">0</td>
<td align="char" char=".">1</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="F10" position="float">
<label>FIGURE 10</label>
<caption>
<p>Number of items in favor of boys and girls: Mean DIF residuals across 10 class intervals.</p>
</caption>
<graphic xlink:href="feduc-06-742560-g010.tif"/>
</fig>
<p>This was not the case in the subset DIF analyses. When the mean residuals from the nonverbal subset were extracted from the general reasoning test analysis, 22 items favored boys and 12 items favored girls. When the mean residuals from the verbal subset were extracted from the general reasoning test analysis, 23 items favored girls and 6 items favored&#x20;boys.</p>
<p>The factorial ANOVA for the nonverbal subset residuals revealed a statistically significant main effect for gender, <italic>F</italic>(91,594) &#x3d; 70.42, <italic>p</italic>&#x20;&#x3c; 0.001, indicating a difference between boys (<italic>M</italic>&#x20;&#x3d; 0.520, <italic>SD</italic> &#x3d; 0.184) and girls (<italic>M</italic>&#x20;&#x3d; 0.475, <italic>SD</italic> &#x3d; 0.166). The interaction effect between class interval and gender was also significant, <italic>F</italic>(91,594) &#x3d; 2.98,&#x20;<italic>p</italic>&#x20;&#x3c; 0.01, indicating that although the gradients of the observed means for boys and girls did not cross, they differed, and the subset DIF was therefore non-uniform (see <xref ref-type="fig" rid="F8">Figure&#x20;8</xref>).</p>
<p>The factorial ANOVA for the verbal subset residuals showed a statistically significant main effect for gender, <italic>F</italic>(91,594) &#x3d; 82.5, <italic>p</italic>&#x20;&#x3c; 0.001, once again indicating a significant difference between boys (<italic>M</italic>&#x20;&#x3d; 0.479, <italic>SD</italic> &#x3d; 0.166) and girls (<italic>M</italic>&#x20;&#x3d; 0.495, <italic>SD</italic> &#x3d; 0.153). However, the interaction effect between class interval and gender was not significant, <italic>F</italic>(91,594) &#x3d; 1.57,&#x20;<italic>p</italic>&#x20;&#x3d; 0.12, indicating that the gradients of the observed means for boys and girls did not differ significantly and that the subset DIF was uniform (see <xref ref-type="fig" rid="F9">Figure&#x20;9</xref>).</p>
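The factorial (gender &#xd7; class interval) ANOVAs on the residuals can be reproduced with the standard balanced two-way decomposition of sums of squares. The following is a self-contained sketch with tiny hypothetical data; the study itself used 10 class intervals and all persons.

```python
def two_way_anova(cells):
    """Balanced two-way ANOVA with replication.

    cells[i][j] holds the observations for level i of factor A (e.g.,
    gender) and level j of factor B (e.g., class interval); every cell
    must contain the same number of observations.  Returns the F
    statistics for A, B, and the A x B interaction.
    """
    a, b, n = len(cells), len(cells[0]), len(cells[0][0])
    grand = sum(v for row in cells for cell in row for v in cell) / (a * b * n)
    cell_m = [[sum(c) / n for c in row] for row in cells]
    a_m = [sum(row) / b for row in cell_m]
    b_m = [sum(cell_m[i][j] for i in range(a)) / a for j in range(b)]
    ss_a = n * b * sum((m - grand) ** 2 for m in a_m)
    ss_b = n * a * sum((m - grand) ** 2 for m in b_m)
    ss_ab = n * sum((cell_m[i][j] - a_m[i] - b_m[j] + grand) ** 2
                    for i in range(a) for j in range(b))
    ss_e = sum((v - cell_m[i][j]) ** 2
               for i in range(a) for j in range(b) for v in cells[i][j])
    mse = ss_e / (a * b * (n - 1))
    return (ss_a / (a - 1) / mse,
            ss_b / (b - 1) / mse,
            ss_ab / ((a - 1) * (b - 1)) / mse)

# Hypothetical residuals: 2 genders x 3 class intervals, 2 per cell.
cells = [[[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]],   # boys
         [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]]   # girls
f_gender, f_interval, f_interaction = two_way_anova(cells)
```

A significant gender main effect corresponds to uniform subset DIF; a significant gender-by-interval interaction indicates that the DIF is non-uniform.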
</sec>
</sec>
<sec id="s4">
<title>4 Discussion and Implications</title>
<p>The first aim was to demonstrate that the Rasch model cannot reveal systematic DIF in single tests. As hypothesized, when the nonverbal and verbal reasoning tests in this study were analyzed as single tests, differences in the performances of boys and girls appeared in the mean person parameter estimates and in the distributions of these estimates. Nevertheless, these differences do not represent systematic DIF, which occurs when a test does not function in the same way for different factor groups that otherwise share the same ability estimate on the latent&#x20;trait.</p>
<p>In the Rasch model, the person total score is the sufficient statistic for the person estimate, eliminating the possibility of residuals at the test level. Therefore, as hypothesized, systematic DIF appeared in neither the nonverbal nor the verbal reasoning single-test analysis. This was demonstrated by the direct conformity of the observed means to the expected score curves for both factor groups in both TCCs. For the same reason, there were only minor differences in the total number of items in favor of either factor group, which is indicative of no systematic&#x20;DIF.</p>
<p>The second aim was to introduce subset DIF analysis as an alternative methodology to systematic DIF analysis. In subset DIF analysis, single tests are divided into item subsets that form the components of the broader latent trait, such as the nonverbal and verbal subsets in the general reasoning test reported here. Person parameter estimates are initially calculated in a Rasch model that includes all test items. Then, in separate analyses, these person estimates are compared to the observed means for each factor group in each subset, and the residuals are assessed. This methodology therefore tests the assumption that the person locations in each factor group are invariant across subsets. Subset DIF analysis is thus not a direct alternative to a systematic DIF analysis, but instead offers insight into large-scale DIF across clusters of&#x20;items.</p>
<p>In applied contexts, practitioners can determine item subsets for DIF analysis by identifying questions that are hypothesized to favor a particular factor group, drawing on previous experience and research where available. For example, it might be hypothesized that a subset of questions in a mathematics test that demand a higher level of vocabulary favors girls. In that case, a practitioner can place these questions into one subset and the remaining questions into a separate subset, and then apply the approach introduced in this article to examine subset&#x20;DIF.</p>
<p>In this study we revealed subset DIF in the nonverbal and verbal subsets. In both subsets we showed that the observed means of the factor groups were uniformly different from each other and from the expected score curves. These results were confirmed by the factorial ANOVAs of the residuals and by the mismatch between the numbers of items favoring each factor group in each subset. We therefore revealed subset DIF by rejecting the assumption that the person estimates in each factor group were the same in each subset. As such, this study demonstrates that subset DIF is concealed when the Rasch model is used to analyze systematic DIF in single tests. Stated differently, if persons in one class interval received higher scores than expected across all items in a subset, their observed means in that subset would deviate from their expected scores; this is not possible in a single test, however, because the MLE solution equation constrains person expected scores to equal their observed&#x20;means.</p>
<p>Thus, it is inherently impossible to detect systematic DIF across a single test using the Rasch model. In practice, there are two main options available when using subset DIF analysis. One is to identify item subsets within a single test that may show group-specific DIF and to use the approach outlined in this article to see whether there is subset DIF for a factor group in those subsets. The other is to broaden the frame of reference by testing persons on additional construct-relevant items and to examine subset DIF in the context of this broadened frame of reference. In both options, underlying subset DIF can only appear if the factor group observed means within a subset are inconsistent with their person estimates, which are partially based on the items in the frame of reference.</p>
<p>This article focuses on DIF as a source of misfit. Inevitably, other forms of model misfit are present in real data and may confound the interpretation of results. Further studies are therefore recommended to investigate whether other sources of misfit affect inferences regarding subset DIF. Relatedly, a second consideration in applying the proposed methodology concerns the selection of the frame-of-reference subset. Broadening the measurement in this way changes the substantive definition of the latent trait, and it therefore changes the measurement itself. This kind of change may introduce effects such as differences in item difficulty that influence targeting, differences in item discrimination between subsets, and increased misfit across the whole test. Practitioners need to be aware of these possible effects when considering the methodology.</p>
</sec>
</body>
<back>
<sec id="s5">
<title>Data Availability Statement</title>
<p>The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.</p>
</sec>
<sec id="s6">
<title>Ethics Statement</title>
<p>Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent to participate in this study was provided by the participants&#x2019; legal guardian/next of&#x20;kin.</p>
</sec>
<sec id="s7">
<title>Author Contributions</title>
<p>All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="COI-statement" id="s8">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x2019;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Andrich</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Hagquist</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Real and Artificial Differential Item Functioning</article-title>. <source>J.&#x20;Educ. Behav. Stat.</source> <volume>37</volume> (<issue>7</issue>), <fpage>387</fpage>&#x2013;<lpage>416</lpage>. <pub-id pub-id-type="doi">10.3102/1076998611411913</pub-id> </citation>
</ref>
<ref id="B2">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Andrich</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Hagquist</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2004</year>). <article-title>Real and Artificial Differential Item Functioning Using Analysis of Variance</article-title>. <comment>Paper presented at the Second</comment> <conf-name>International Conference on Measurement in Health, Education, Psychology, and Marketing: Developments with Rasch Models</conf-name>, <conf-loc>Perth, Australia</conf-loc>, <conf-date>January 20-22, 2004</conf-date>. <publisher-name>Murdoch University</publisher-name>. </citation>
</ref>
<ref id="B3">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Andrich</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Marais</surname>
<given-names>I.</given-names>
</name>
</person-group> (<year>2019</year>). <source>A Course in Rasch Measurement: Measuring in the Educational, Social, and Health Sciences</source>. <publisher-loc>Singapore</publisher-loc>: <publisher-name>Springer</publisher-name>. </citation>
</ref>
<ref id="B4">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Andrich</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>1988</year>). <source>Rasch Models for Measurement</source>. <publisher-loc>Newbury Park, CA</publisher-loc>: <publisher-name>Sage Publications</publisher-name>. </citation>
</ref>
<ref id="B5">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Andrich</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Sheridan</surname>
<given-names>B. E.</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2018</year>). <source>RUMM2030 Professional: Rasch Unidimensional Models for Measurement [Computer Software]</source>. <publisher-loc>Perth, Western Australia</publisher-loc>: <publisher-name>RUMM Laboratory</publisher-name>. </citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chalmers</surname>
<given-names>R. P.</given-names>
</name>
<name>
<surname>Counsell</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Flora</surname>
<given-names>D. B.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>It Might Not Make a Big DIF: Improved Differential Test Functioning Statistics that Account for Sampling Variability</article-title>. <source>Educ. Psychol. Meas.</source> <volume>76</volume> (<issue>1</issue>), <fpage>114</fpage>&#x2013;<lpage>140</lpage>. <pub-id pub-id-type="doi">10.1177/0013164415584576</pub-id> </citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Drasgow</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>1987</year>). <article-title>Study of the Measurement Bias of Two Standardized Psychological Tests</article-title>. <source>J.&#x20;Appl. Psychol.</source> <volume>72</volume> (<issue>1</issue>), <fpage>19</fpage>&#x2013;<lpage>29</lpage>. <pub-id pub-id-type="doi">10.1037/0021-9010.72.1.19</pub-id> </citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hagquist</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Andrich</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Recent Advances in Analysis of Differential Item Functioning in Health Research Using the Rasch Model</article-title>. <source>Health Qual. Life Outcomes</source> <volume>15</volume> (<issue>1</issue>), <fpage>181</fpage>. <pub-id pub-id-type="doi">10.1186/s12955-017-0755-0</pub-id> </citation>
</ref>
<ref id="B9">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Otis</surname>
<given-names>A. S.</given-names>
</name>
</person-group> (<year>2009</year>). <source>Otis-Lennon School Ability Test (OLSAT 8)</source>. <publisher-loc>San Antonio, TX</publisher-loc>: <publisher-name>Pearson Education</publisher-name>. </citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pae</surname>
<given-names>T.-I.</given-names>
</name>
</person-group> (<year>2004</year>). <article-title>DIF for Examinees with Different Academic Backgrounds</article-title>. <source>Lang. Test.</source> <volume>21</volume> (<issue>1</issue>), <fpage>53</fpage>&#x2013;<lpage>73</lpage>. <pub-id pub-id-type="doi">10.1191/0265532204lt274oa</pub-id> </citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Raju</surname>
<given-names>N. S.</given-names>
</name>
<name>
<surname>van der Linden</surname>
<given-names>W. J.</given-names>
</name>
<name>
<surname>Fleer</surname>
<given-names>P. F.</given-names>
</name>
</person-group> (<year>1995</year>). <article-title>IRT-based Internal Measures of Differential Item Functioning of Items and Tests</article-title>. <source>Appl. Psychol. Meas.</source> <volume>19</volume> (<issue>4</issue>), <fpage>353</fpage>&#x2013;<lpage>368</lpage>. <pub-id pub-id-type="doi">10.1177/014662169501900405</pub-id> </citation>
</ref>
<ref id="B12">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Rasch</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>1961</year>). &#x201c;<article-title>On General Laws and the Meaning of Measurement in Psychology</article-title>,&#x201d; in <source>Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability IV</source>. Editor <person-group person-group-type="editor">
<name>
<surname>Neyman</surname>
<given-names>J.</given-names>
</name>
</person-group> (<publisher-loc>Berkeley</publisher-loc>: <publisher-name>University of California Press</publisher-name>), <fpage>321</fpage>&#x2013;<lpage>334</lpage>. </citation>
</ref>
<ref id="B13">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Rasch</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>1980</year>). <source>Probabilistic Models for Some Intelligence and Attainment Tests</source>. <publisher-loc>Copenhagen</publisher-loc>: <publisher-name>Danish Institute for Educational Research</publisher-name>. <comment>Expanded edition (1980) with foreword and afterword by B. D. Wright (1980). Chicago, IL: University of Chicago Press</comment>. </citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Takala</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Kaftandjieva</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2000</year>). <article-title>Test Fairness: A DIF Analysis of an L2 Vocabulary Test</article-title>. <source>Lang. Test.</source> <volume>17</volume> (<issue>3</issue>), <fpage>323</fpage>&#x2013;<lpage>340</lpage>. <pub-id pub-id-type="doi">10.1177/026553220001700303</pub-id> </citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Waterbury</surname>
<given-names>G. T.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Missing Data and the Rasch Model: The Effects of Missing Data Mechanisms on Item Parameter Estimation</article-title>. <source>J.&#x20;Appl. Meas.</source> <volume>20</volume> (<issue>2</issue>), <fpage>154</fpage>&#x2013;<lpage>166</lpage>. </citation>
</ref>
</ref-list>
</back>
</article>