<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Psychol.</journal-id>
<journal-title>Frontiers in Psychology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Psychol.</abbrev-journal-title>
<issn pub-type="epub">1664-1078</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fpsyg.2022.821459</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Psychology</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Examination of Gender-Related Differential Item Functioning Through Poly-BW Indices</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Huang</surname> <given-names>Tsai-Wei</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1213682/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Wu</surname> <given-names>Pei-Chen</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1572326/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Mok</surname> <given-names>Magdalena Mo Ching</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<xref ref-type="aff" rid="aff4"><sup>4</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1418075/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Counseling, National Chiayi University</institution>, <addr-line>Chiayi</addr-line>, <country>Taiwan</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Educational Psychology and Counseling, National Pingtung University</institution>, <addr-line>Pingtung</addr-line>, <country>Taiwan</country></aff>
<aff id="aff3"><sup>3</sup><institution>Graduate Institute of Educational Information and Measurement, National Taichung University of Education</institution>, <addr-line>Taichung</addr-line>, <country>Taiwan</country></aff>
<aff id="aff4"><sup>4</sup><institution>Department of Psychology, Assessment Research Centre, The Education University of Hong Kong</institution>, <addr-line>Tai Po</addr-line>, <country>Hong Kong SAR, China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Laura Girelli, University of Salerno, Italy</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Zhushan Li, Boston College, United States; Kubra Atalay Kabasakal, Hacettepe University, Turkey</p></fn>
<corresp id="c001">&#x002A;Correspondence: Tsai-Wei Huang, <email>twhuang@mail.ncyu.edu.tw</email></corresp>
<fn fn-type="other" id="fn004"><p>This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology</p></fn>
</author-notes>
<pub-date pub-type="epub">
<day>25</day>
<month>02</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>13</volume>
<elocation-id>821459</elocation-id>
<history>
<date date-type="received">
<day>24</day>
<month>11</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>28</day>
<month>01</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2022 Huang, Wu and Mok.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Huang, Wu and Mok</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Existing differential item functioning (DIF) detection approaches that rely on item difficulty or item discrimination offer limited insight into the associates of DIF items; consequently, DIF items have conventionally been either deleted or ignored. Given the importance of minimizing DIF items in test construction, teachers and testing practitioners need more information about the possible associates of DIF items. Using a teacher-made mathematics achievement test as an example, this study examined how the Poly-BW indices (power, defenselessness, disturbance, and hint) contributed to the properties of gender-related DIF items. Data from a 34-item mathematics achievement test taken by 1,439 seventh-grade students in Taiwan (51.01% boys and 48.99% girls) showed that the between-gender differences in the defenselessness (mp) and power (cp) indices were salient predictors of the DIF measures estimated by the Poly Simultaneous Item Bias Test (Poly-SIBTEST) procedure, with satisfactory hit rates. Items with relatively large defenselessness for boys were likely to present male-favoring DIF, whereas items with relatively large power for boys were likely to present female-favoring DIF. The Poly-BW indices thus give teachers practical directions for modifying items.</p>
</abstract>
<kwd-group>
<kwd>differential item functioning (DIF)</kwd>
<kwd>Poly-SIBTEST</kwd>
<kwd>Poly-BW indices</kwd>
<kwd>teacher-made mathematics test</kwd>
<kwd>item-fit statistics</kwd>
</kwd-group>
<counts>
<fig-count count="3"/>
<table-count count="2"/>
<equation-count count="5"/>
<ref-count count="39"/>
<page-count count="11"/>
<word-count count="9040"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1" sec-type="intro">
<title>Introduction</title>
<p>Differential item functioning (DIF) refers to the situation in which participants from different groups (e.g., age or gender groups) at the same level of the latent trait (e.g., math performance) have different probabilities of giving a certain response to a particular item (<xref ref-type="bibr" rid="B11">Holland and Thayer, 1988</xref>). DIF may hamper the interpretation of mean comparisons or even lead to misleading conclusions, because an item with DIF yields either a constant advantage for a particular group across levels of the latent trait (uniform DIF) or an advantage that differs in direction and magnitude across those levels (non-uniform DIF).</p>
<p>Given the distorting consequences of DIF, various DIF techniques have been advanced to assess invariance in both person and item parameters. However, most DIF detection approaches and estimates of DIF effect size provide little information about the possible associates of DIF items (<xref ref-type="bibr" rid="B19">Kim et al., 2007</xref>; <xref ref-type="bibr" rid="B22">Li, 2014a</xref>,<xref ref-type="bibr" rid="B23">b</xref>, <xref ref-type="bibr" rid="B24">2015</xref>). The possible associates of DIF-flagged items are the possible reasons why DIF occurs, and they inform the decision of whether these items should be deleted or revised. When information on the possible associates of DIF items is limited, the DIF items cannot be treated appropriately. Consequently, some researchers chose to remove a DIF-flagged item from the item bank, while others conducted further analyses (e.g., <xref ref-type="bibr" rid="B39">Wang et al., 2012</xref>; <xref ref-type="bibr" rid="B2">Chen and Hwu, 2018</xref>). Furthermore, in a review of 27 studies on the treatment of DIF items, <xref ref-type="bibr" rid="B3">Cho et al. (2016)</xref> found that 30% of the studies removed the DIF items while 26% of the studies ignored them.</p>
<p>The associates of DIF are complex and can be examined from several perspectives. Some studies developed multi-dimensional DIF detection methods to explain DIF resulting from a secondary dimension (e.g., <xref ref-type="bibr" rid="B32">Roussos and Stout, 1996</xref>; <xref ref-type="bibr" rid="B28">Mazor et al., 1998</xref>; <xref ref-type="bibr" rid="B21">Lee et al., 2016</xref>). Some studies explored differential distractor functioning (DDF) to show that DIF could arise from distractors attracting individuals from different groups differently (e.g., <xref ref-type="bibr" rid="B37">Suh and Bolt, 2011</xref>; <xref ref-type="bibr" rid="B38">Suh and Talley, 2015</xref>), and others argued that the presence of DIF is not directly related to the defined groups but is associated with a latent or unknown group (<xref ref-type="bibr" rid="B5">De Ayala et al., 2002</xref>; <xref ref-type="bibr" rid="B4">Cohen and Bolt, 2005</xref>). Item characteristics, person effects, and the interaction between item and person effects could all account for DIF. However, to provide implications for teacher-made classroom tests (e.g., modifications of DIF items based on the properties of items), this study investigates the associates of DIF with an emphasis on item effects.</p>
<p>The properties of items delineate item characteristics (e.g., item difficulty) and can give teachers or testing practitioners directions for modifying or revising items that exhibit DIF. Two common properties are item difficulty and item discrimination, but both are limited in addressing the possible associates of DIF. Aberrant response indices derived from the response pattern of an item offer an alternative way to identify possible associates of DIF. Because they detect aberrant response patterns by assessing the agreement of difficulty within an item response pattern, the caution index (<italic>SCI</italic>, <xref ref-type="bibr" rid="B33">Sato, 1975</xref>) and the modified caution index (<italic>MCI</italic>, <xref ref-type="bibr" rid="B10">Harnisch and Linn, 1981</xref>) developed by early researchers could be used to identify possible associates of DIF items. However, they remain limited by their ineffectiveness in distinguishing extreme levels of item difficulty (<xref ref-type="bibr" rid="B17">Huang and Wu, 2013</xref>). For example, consider a response pattern such as 000&#x007C; 00000111, in which &#x201C;0&#x201D; represents a wrong response and &#x201C;1&#x201D; represents a correct response, ordered by difficulty from easy items on the left to hard items on the right. The within-difficulty response pattern (000 before the symbol &#x201C;&#x007C;&#x201D;) and the beyond-difficulty response pattern (111 after the symbol &#x201C;&#x007C;&#x201D;) are handled identically by the <italic>MCI</italic> or <italic>SCI</italic>, even though the two patterns might be due to different item characteristics. That is, possible intrinsic properties of item response patterns might not be distinguished by the <italic>SCI</italic> or the <italic>MCI</italic> measures alone.</p>
<p>As an extension of the <italic>SCI</italic> measure, the beyond-surprise/within-concern dichotomous <italic>BW</italic> indicators (<xref ref-type="bibr" rid="B14">Huang, 2012</xref>; <xref ref-type="bibr" rid="B17">Huang and Wu, 2013</xref>) have yielded more useful information about the intrinsic properties of items. The dichotomous <italic>BW</italic> indicators describe each item in terms of power (<italic>c</italic>), defenselessness (<italic>m</italic>), disturbance (<italic>b</italic>), and hint (<italic>w</italic>). The four indicators are useful for exploring the intrinsic properties of an item (<xref ref-type="bibr" rid="B17">Huang and Wu, 2013</xref>). Recently, the original dichotomous <italic>BW</italic> indicators have been extended to the <italic>Poly-BW</italic> indices for polytomously scored items (<xref ref-type="bibr" rid="B16">Huang and Lu, 2017</xref>), denoted <italic>cp, mp</italic>, <italic>bp</italic>, and <italic>wp</italic>, respectively. However, no studies have yet examined the association between these four polytomous indices and DIF. Furthermore, because the <italic>Poly-BW</italic> indices are non-parametric and yield diagnostic information for items (e.g., whether the items respond normally or aberrantly), they can support item classification by means of the approximate permutation test (APT, <xref ref-type="bibr" rid="B7">Edgington, 1995</xref>). Through the APT process, many data matrices can be approximately simulated from an original data matrix so that the 95th percentile values of the individual indices for an item can be established and the item classified. This analysis enables us to disclose possible associates of suspected DIF items, so that DIF items can be dealt with more appropriately. Details of the item classification procedures are provided in the &#x201C;Materials and Methods&#x201D; section.</p>
<p>Although several DIF detection methods now exist, each depending on the assumptions of its model, the Poly Simultaneous Item Bias Test (Poly-SIBTEST) approach is a widely used non-parametric DIF detection procedure for polytomously scored items (<xref ref-type="bibr" rid="B1">Chang et al., 1996</xref>). Because it is both non-parametric and suited to polytomous scoring, the <italic>Poly-SIBTEST</italic> procedure was chosen in the current study as the reference method for estimating DIF, so that the associations between the estimated DIF measures and the non-parametric <italic>Poly-BW</italic> indices could be examined.</p>
<p>Given the importance of the possible associates of DIF items, this study illustrates the advantages of the four <italic>Poly-BW</italic> indices by assessing their predictive effects on the gender-related DIF measured by the <italic>Poly-SIBTEST</italic> approach in a teacher-made math test. Gender-related DIF in a math test is a particularly important issue because gender differences in math performance have been debated for decades. Some researchers argued that there were no gender differences in math performance (<xref ref-type="bibr" rid="B18">Hyde and Mertz, 2009</xref>; <xref ref-type="bibr" rid="B26">Lindberg et al., 2010</xref>), others proposed that boys outperformed girls from third grade onward (<xref ref-type="bibr" rid="B8">Fryer and Levitt, 2010</xref>; <xref ref-type="bibr" rid="B31">Robinson and Theule Lubienski, 2011</xref>), and still others showed that girls performed similarly to or better than boys in terms of classroom grades (<xref ref-type="bibr" rid="B30">Pomerantz et al., 2002</xref>; <xref ref-type="bibr" rid="B6">Ding et al., 2006</xref>). However, these comparisons are valid only when the math test is gender invariant. Accordingly, two main research questions guide this study: (1) Do the four <italic>Poly-BW</italic> indices effectively predict a DIF measure obtained from the <italic>Poly-SIBTEST</italic> approach? (2) How accurately do the four <italic>Poly-BW</italic> indices predict a DIF item?</p>
<sec id="S1.SS1">
<title>Dichotomous <italic>BW</italic> Indicators</title>
<p>Aberrant responses, that is, responses that are inconsistent with or unexpected given the overall responses in a test, can provide diagnostic information for persons and items (<xref ref-type="bibr" rid="B20">Kogut, 1986</xref>; <xref ref-type="bibr" rid="B34">Seol, 1998</xref>). Indices developed for detecting a person&#x2019;s aberrant response patterns are usually labeled person-fit indices, while those for detecting an item&#x2019;s aberrant responses are called item-fit indices. Some indices are based on the characteristics of a group, and others are based on item response theory (IRT) models (<xref ref-type="bibr" rid="B10">Harnisch and Linn, 1981</xref>; <xref ref-type="bibr" rid="B20">Kogut, 1986</xref>; <xref ref-type="bibr" rid="B29">Meijer and Sijtsma, 1999</xref>). Based on the group-based principles of <xref ref-type="bibr" rid="B9">Guttman (1944)</xref> (i.e., able persons should correctly answer items whose difficulty levels are lower than their ability levels), the dichotomous <italic>BW</italic> indicators (<xref ref-type="bibr" rid="B12">Huang, 2002</xref>) were designed to detect persons&#x2019; aberrant response patterns, i.e., person-fit oriented, and were subsequently extended to detect aberrant responses of items, i.e., item-fit oriented. The person-fit <italic>BW</italic> indicators originally (<xref ref-type="bibr" rid="B12">Huang, 2002</xref>) measured a person&#x2019;s tendency toward &#x201C;<italic>guess</italic>&#x201D; (<italic>B</italic> indicator) and &#x201C;<italic>carelessness</italic>&#x201D; (<italic>W</italic> indicator) and were later (<xref ref-type="bibr" rid="B17">Huang and Wu, 2013</xref>) extended to measure a person&#x2019;s tendency toward &#x201C;<italic>mastery</italic>&#x201D; (<italic>C</italic> indicator) and &#x201C;<italic>misconception</italic>&#x201D; (<italic>M</italic> indicator), respectively. 
<xref ref-type="bibr" rid="B13">Huang (2011)</xref> investigated the robustness of the <italic>BW</italic> aberrance indices against test length and found that the person-fit <italic>BW</italic> indicators were almost unrelated to test length. Later, <xref ref-type="bibr" rid="B14">Huang (2012)</xref> compared the aberrance detection power of the person-fit BW indicators, four other group-based indicators (<italic>SCI, MCI, NCI</italic>, and <italic>Wc &#x0026; Bs</italic>), and five IRT-based indicators (<italic>OUTFITz, INFITz, ECI2z, ECI4z</italic>, and <italic>lz</italic>) under conditions varying in content category, type of aberrance, severity of aberrance, and ratio of aberrant persons. The person-fit BW indicators and the four group-based indices exhibited higher detection rates (over 90%) than the five IRT-based indicators, and the BW indicators exhibited the best stability across situations.</p>
<p>By symmetry, the properties of the item-fit <italic>BW</italic> indicators are the same as those of the person-fit <italic>BW</italic> indicators. Specifically, the dichotomous <italic>BW</italic> item-fit indicators (<xref ref-type="bibr" rid="B12">Huang, 2002</xref>) were developed to detect the tendency toward &#x201C;<italic>disturbance</italic>&#x201D; (<italic>b</italic> indicator) and &#x201C;<italic>hint</italic>&#x201D; (<italic>w</italic> indicator) in an item beyond and within the item&#x2019;s difficulty level, respectively. In addition to the two aberrant indicators, <xref ref-type="bibr" rid="B17">Huang and Wu (2013)</xref> incorporated another two indicators of normal responses for an item, called &#x201C;<italic>power</italic>&#x201D; (<italic>c</italic> indicator) and &#x201C;<italic>defenselessness</italic>&#x201D; (<italic>m</italic> indicator), within and beyond the item&#x2019;s difficulty level, respectively. Here, normal responses are those that obey Guttman&#x2019;s principles. In their study, a cognitive diagnostic model based on the <italic>WBstar</italic> program was used to examine the quality of a 22-item teacher-made test on fractions and decimals, and the <italic>BW-based</italic> cognitive diagnostic model performed effectively in detecting item misfit in a small-sample scenario of 32 fourth-grade students.</p>
<p>To apply the indicators to a DIF scenario, <xref ref-type="bibr" rid="B15">Huang and Lin (2017)</xref> conducted a Monte Carlo study to investigate the predictive effects of the dichotomous <italic>BW</italic> indicators on DIF. They simulated data under five conditions (sample size, item number, DIF type, DIF ratio, and DIF severity) and found that the power indicator and the defenselessness indicator could significantly explain the variance of DIF. Moreover, the four dichotomous <italic>BW</italic> indicators predicted the DIF-flagged items with an accuracy rate of over 90%. However, Huang and Lin&#x2019;s study was limited to dichotomous scoring, and its results cannot be directly interpreted in polytomous scoring scenarios. In recognition of the limited usage of the dichotomous <italic>BW</italic> indicators, <xref ref-type="bibr" rid="B16">Huang and Lu (2017)</xref> extended these indicators to polytomously scored scenarios as the <italic>Poly-BW</italic> indices. In their study, they developed the <italic>PWBstar</italic>1.0 program<sup><xref ref-type="fn" rid="footnote1">1</xref></sup> for estimating the <italic>Poly-BW</italic> indices, but they did not investigate the associations between the <italic>Poly-BW</italic> indices and DIF measures. Therefore, the effectiveness of the <italic>Poly-BW</italic> indices in predicting DIF measures is unknown.</p>
</sec>
<sec id="S1.SS2">
<title><italic>Poly-BW</italic> Indices</title>
<p>As in the dichotomous case, the main idea of the <italic>Poly-BW</italic> indices is based on the concept of a discrepancy, which reflects the level of aberrant or normal responses. The discrepancy distances between persons&#x2019; abilities and an item&#x2019;s difficulty are calculated according to whether responses are normal or aberrant and whether they fall beyond or within the item&#x2019;s difficulty level. Specifically, suppose that a <italic>K</italic>-item test is taken by <italic>N</italic> examinees. We can first rank ability levels by individual person scores from bottom (low) to top (high) and then rank difficulty levels by individual item scores from left (low) to right (high). Let the <italic>j</italic>th item have a maximum score of <italic>s</italic><sub><italic>j</italic></sub> and the <italic>i</italic>th person have an earned score <italic>x</italic><sub><italic>ij</italic></sub> on the <italic>j</italic>th item, so that <italic>x</italic><sub><italic>ij</italic></sub> &#x2264; <italic>s</italic><sub><italic>j</italic></sub>. 
Since the <italic>N</italic> examinees are ranked by individual&#x2019;s total earned scores on the <italic>K</italic>-item test from bottom to top, we can define <inline-formula><mml:math id="INEQ2"><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:msubsup><mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>K</mml:mi></mml:msubsup><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>/</mml:mo><mml:mrow><mml:msubsup><mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>K</mml:mi></mml:msubsup><mml:msub><mml:mi>s</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula> as the <italic>i</italic>th person&#x2019;s potential ability. Then we correspondingly calculate the accumulated maximum score by defining a specified <italic>I</italic><sub><italic>j</italic></sub> (1 &#x2264; <italic>I</italic><sub><italic>j</italic></sub> &#x2264; <italic>N</italic>) such that <inline-formula><mml:math id="INEQ4"><mml:mrow><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo>&#x2264;</mml:mo><mml:msub><mml:mi>Q</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:msubsup><mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>&lt;</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:math></inline-formula> for all <italic>N</italic> examinees from bottom to top. That is, the item score <italic>Q</italic><sub><italic>j</italic></sub>, defined as the summed difference between the maximum score and the earned score, falls between two sequential accumulated maximum scores and thus reflects the item&#x2019;s robustness against being answered correctly by the <italic>N</italic> examinees. Ideally, the <italic>i</italic>th person who possesses the complete potential ability <italic>t</italic><sub><italic>i</italic></sub> should gain the accumulated maximum score on the <italic>j</italic>th item.</p>
<p>Therefore, the true item difficulty (<inline-formula><mml:math id="INEQ5"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>Q</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mo>&#x002A;</mml:mo></mml:msubsup></mml:math></inline-formula>) for the <italic>j</italic>th item can be expressed in terms of the potential ability levels of two sequential persons (<italic>t</italic><sub><italic>i</italic></sub> and <italic>t</italic><sub><italic>i+1</italic></sub>) by interpolation with a 0.5 error unit, as shown in Equation 1:</p>
<disp-formula id="S1.E1"><label>(1)</label><mml:math id="M1"><mml:mrow><mml:msubsup><mml:mi>t</mml:mi><mml:msub><mml:mi>Q</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x002A;</mml:mo></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>Q</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo>-</mml:mo><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>+</mml:mo><mml:mn>0.5</mml:mn></mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>The 0.5 error unit is added in Equation 1 because the score unit is 1; correcting by half a score unit helps to distinguish difficulty levels when the examinee sample is small.</p>
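<p>As an illustration only, the bracketing condition for <italic>I</italic><sub><italic>j</italic></sub> and the interpolation in Equation 1 can be sketched in Python. The sketch assumes that the index <italic>i</italic> in Equation 1 equals <italic>I</italic><sub><italic>j</italic></sub>, which the text does not state explicitly, and that examinees are already sorted from lowest to highest total score. The function name and the toy data are hypothetical and do not reproduce any existing program.</p>
<preformat><![CDATA[
```python
from typing import List, Tuple


def true_item_difficulty(t: List[float], x: List[int], s_j: int) -> Tuple[int, float]:
    """Return (I_j, t*_Qj) for one item, following Equation 1.

    t   -- potential abilities t_i, sorted from lowest to highest
    x   -- earned scores x_ij on item j, in the same examinee order
    s_j -- maximum score of item j
    """
    n = len(t)
    # Q_j: total score lost on the item across all N examinees.
    q_j = sum(s_j - x_ij for x_ij in x)
    # I_j satisfies I_j * s_j <= Q_j < (I_j + 1) * s_j.
    i_j = q_j // s_j
    if not 1 <= i_j < n:
        # Assumption: both t_{I_j} and t_{I_j + 1} must exist to interpolate.
        raise ValueError("I_j must lie between 1 and N - 1 for interpolation")
    # Assumption: i in Equation 1 is I_j (1-based), so use t_{I_j} and t_{I_j + 1}.
    t_low, t_high = t[i_j - 1], t[i_j]
    t_star = t_low + ((q_j - i_j * s_j + 0.5) / s_j) * (t_high - t_low)
    return i_j, t_star


# Hypothetical toy data: four examinees, one item scored 0 to 2.
abilities = [0.0, 0.25, 0.75, 1.0]
item_scores = [0, 1, 1, 2]
i_j, t_star = true_item_difficulty(abilities, item_scores, s_j=2)
```
]]></preformat>
<p>With these toy values, <italic>Q</italic><sub><italic>j</italic></sub> = 4 and <italic>I</italic><sub><italic>j</italic></sub> = 2, so the interpolated difficulty falls between the abilities of the second and third examinees.</p>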
<p>With the true item difficulty (<inline-formula><mml:math id="INEQ6"><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>Q</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mo>&#x002A;</mml:mo></mml:msubsup></mml:math></inline-formula>) and following the dichotomous logic, the <italic>Poly-BW</italic> indices (<italic>cp, mp, bp</italic>, and <italic>wp</italic>), designed to measure the degree of <italic>power</italic>, <italic>defenselessness</italic>, <italic>disturbance</italic>, and <italic>hint</italic> for an item, respectively, are given in Equations 2 to 5 as follows:</p>
<disp-formula id="S1.E2"><label>(2)</label><mml:math id="M2"><mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:msub><mml:mi>p</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msubsup><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:msubsup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mfrac><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>s</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mi>t</mml:mi><mml:msub><mml:mi>Q</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x002A;</mml:mo></mml:msubsup><mml:mo>-</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>N</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:mn>2</mml:mn></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<disp-formula id="S1.E3"><label>(3)</label><mml:math id="M3"><mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:msub><mml:mi>p</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msubsup><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mrow><mml:mi>N</mml:mi></mml:msubsup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mfrac><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>s</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mfrac><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>-</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:msub><mml:mi>Q</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x002A;</mml:mo></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>N</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:mn>2</mml:mn></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<disp-formula id="S1.E4"><label>(4)</label><mml:math id="M4"><mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:msub><mml:mi>p</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msubsup><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mrow><mml:mi>N</mml:mi></mml:msubsup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mfrac><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>s</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>-</mml:mo><mml:msubsup><mml:mi>t</mml:mi><mml:msub><mml:mi>Q</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x002A;</mml:mo></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>N</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:mn>2</mml:mn></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<disp-formula id="S1.E5"><label>(5)</label><mml:math id="M5"><mml:mrow><mml:mrow><mml:mi>w</mml:mi><mml:msub><mml:mi>p</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msubsup><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:msubsup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mfrac><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>s</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mfrac><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mi>t</mml:mi><mml:msub><mml:mi>Q</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x002A;</mml:mo></mml:msubsup><mml:mo>-</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>N</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:mn>2</mml:mn></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<p>The properties of the four <italic>Poly-BW</italic> indices are delineated as follows. First, the <italic>power index</italic> (<italic>cp</italic>) refers to the power of an item, that is, the extent to which examinees answer it incorrectly simply because their ability is lower than the item difficulty (i.e., within the difficulty level). Second, the <italic>defenselessness index</italic> (<italic>mp</italic>) refers to the possible inefficiencies of an item, that is, the extent to which examinees whose ability levels are higher than the difficulty level of the item (i.e., beyond the difficulty level) easily answer the item correctly. The <italic>cp</italic> and <italic>mp</italic> indices differ from the difficulty index in classical test theory (CTT). In CTT, the computation of the difficulty index does not distinguish between responses &#x201C;within&#x201D; and &#x201C;beyond&#x201D; an item&#x2019;s difficulty level. In this study, the <italic>cp</italic> index measures the part of &#x201C;not correct&#x201D; responses within the item difficulty, and the <italic>mp</italic> index measures the part of &#x201C;correct&#x201D; responses beyond the item difficulty. An item with a greater value of the <italic>power index</italic> is more likely to be answered &#x201C;incorrectly&#x201D; than an item with a lower value. Therefore, when comparing the <italic>power index</italic> between two groups (gender groups), a greater value of the <italic>power index</italic> for a particular item in the focal group (e.g., men) indicates that this item is more likely to be answered incorrectly by male examinees than by female examinees, given that those examinees&#x2019; abilities are within the item difficulty. In contrast, an item with a greater value of the <italic>defenselessness index</italic> is more likely to be answered &#x201C;correctly&#x201D; than an item with a lower value. 
Therefore, when comparing the <italic>defenselessness index</italic> between two groups (gender groups), a greater value of the <italic>defenselessness index</italic> for a particular item in the focal group (e.g., men) means that this item is more likely to be answered correctly by male examinees than by female examinees, given that those examinees&#x2019; abilities are beyond the item difficulty.</p>
<p>On the other hand, the <italic>bp</italic> and <italic>wp</italic> indices indicate possible aberrances in an item, namely <italic>disturbances</italic> or <italic>hints</italic> that examinees might encounter, respectively. The <italic>disturbance index</italic> (<italic>bp</italic>) measures the extent to which an item contains possible pitfalls, that is, examinees answer the item incorrectly even though their ability levels are beyond the difficulty level of the item. An item with a greater value of the <italic>disturbance index</italic> is more likely to be answered &#x201C;incorrectly&#x201D; than an item with a lower value. Therefore, when comparing the <italic>disturbance index</italic> between two groups (gender groups), a greater value of the <italic>disturbance index</italic> for a particular item in the focal group (e.g., men) suggests that this item is more likely to be answered incorrectly by male examinees than by female examinees, even though those examinees&#x2019; abilities are beyond the item difficulty. In contrast, the <italic>hint index</italic> (<italic>wp</italic>) measures the possible prompts in an item, that is, examinees answer the item correctly even though their ability levels are within the difficulty level of the item. An item with a greater value of the <italic>hint index</italic> is more likely to be answered &#x201C;correctly&#x201D; than an item with a lower value. Thus, a greater value of the <italic>hint index</italic> for a particular item in the focal group (e.g., men) indicates that this item is more likely to be answered &#x201C;correctly&#x201D; by male examinees than by female examinees, even though those examinees&#x2019; abilities are within the item difficulty.</p>
<p><xref ref-type="fig" rid="F1">Figure 1</xref> provides a visualization for the idea of the four <italic>Poly-BW</italic> indices in different situations of normal and aberrant responses. Ideally, an item is supposed to be answered incorrectly when examinees&#x2019; abilities are lower than the item difficulty (i.e., <italic>power</italic>) and to be answered correctly when examinees&#x2019; abilities are higher than the item difficulty (i.e., <italic>defenselessness</italic>). On the other hand, when the ideal principles of responses are violated, the levels of discrepancy will reflect the levels of aberrance. The <italic>disturbance</italic> situation occurs when an item is still answered incorrectly even though examinees&#x2019; abilities are higher than the item difficulty. The <italic>hint</italic> situation occurs when an item is still answered correctly even though examinees&#x2019; abilities are lower than the item difficulty.</p>
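<p>As a rough illustration of Equations (3) to (5), the following Python sketch computes the indices for one item in one group. It is not the <italic>PWBstar</italic> program, and several details are assumptions made only for this example: the function name <monospace>poly_bw_indices</monospace> is hypothetical, <italic>s</italic><sub><italic>j</italic></sub> is read as the full score of item <italic>j</italic>, examinees with <italic>t</italic><sub><italic>i</italic></sub> above the difficulty position are treated as those indexed beyond <italic>I</italic><sub><italic>j</italic></sub>, and the <italic>cp</italic> term is written by symmetry with Equation (5) because its formula is not restated here.</p>
<preformat preformat-type="code"><![CDATA[
```python
def poly_bw_indices(x, s_j, t, t_star):
    """Sketch of Eqs. (3)-(5) for one item j in one group.

    x      : item scores x_ij, one per examinee
    s_j    : full score of item j (assumed meaning of s_j)
    t      : examinee ability positions t_i (scale is an assumption)
    t_star : ability position matching the item difficulty (t*_Qj)
    """
    n = len(x)
    if n < 2 or len(t) != n:
        raise ValueError("need at least two examinees and matching lengths")
    denom = (n - 1) / 2  # common denominator [(N - 1) / 2]
    cp = mp = bp = wp = 0.0
    for x_ij, t_i in zip(x, t):
        credit = x_ij / s_j  # proportion of full credit earned
        if t_i > t_star:
            # beyond the item difficulty (i = I_j + 1, ..., N)
            mp += credit * (t_i - t_star)        # Eq. (3): credit earned
            bp += (1 - credit) * (t_i - t_star)  # Eq. (4): credit missed
        else:
            # within the item difficulty (i = 1, ..., I_j)
            wp += credit * (t_star - t_i)        # Eq. (5): credit earned
            cp += (1 - credit) * (t_star - t_i)  # assumed by symmetry
    return {"cp": cp / denom, "mp": mp / denom,
            "bp": bp / denom, "wp": wp / denom}


# Toy data (invented for illustration): four examinees, full score 2
example = poly_bw_indices(x=[0, 1, 2, 2], s_j=2,
                          t=[0.1, 0.3, 0.6, 0.9], t_star=0.5)
```
]]></preformat>
<p>In this toy example, all credit missed lies within the item difficulty, so <italic>bp</italic> is zero, and the partial credit earned by the second examinee contributes to <italic>wp</italic>.</p>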
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>Ideas of the four <italic>Poly-BW</italic> indices beyond and within item difficulty.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-13-821459-g001.tif"/>
</fig>
</sec>
</sec>
<sec id="S2" sec-type="materials|methods">
<title>Materials and Methods</title>
<sec id="S2.SS1">
<title>Participants</title>
<p>A total of 1,439 seventh-grade students from central Taiwan (51.01% men and 48.99% women) participated in the study. They were selected from 40 classes in eight junior high schools. The students completed a math test during a math class in approximately 40 min. In Taiwan, all math textbooks are developed in accordance with the directions governing the <italic>Basic Education Curricula</italic>, and the contents of math textbooks should be approved by the Ministry of Education. Therefore, all junior-high-school students in the same grade are taught similar contents or units of math. Before the math test was administered, all participants were expected to have learned the contents of arithmetic, algebra, and geometry in their math classes. We confirmed that all teachers had taught these contents before giving the test.</p>
</sec>
<sec id="S2.SS2">
<title>Instrument</title>
<p>The math test was developed by our research team, i.e., three junior-high-school teachers and two experts in math and psychological measurement. It was a criterion-referenced and formative assessment. The objective of the test was to evaluate what the students had learned and to assess whether they had obtained the required knowledge. Therefore, all items were developed according to the directions governing the <italic>Basic Education Curricula</italic> in Taiwan. The test comprised 34 items involving the three major components of math: 14 items for <italic>arithmetic</italic> (such as addition/subtraction of integers, multiplication/division of integers, operations on fractions, and the concept of factors/multiples), 16 items for <italic>algebra</italic> (such as the operation/application of integers and fractions), and 4 items for <italic>geometry</italic> (such as the concepts of the number line and absolute numbers). Each item required more than two problem-solving steps to solve and was scored on the basis of the number of steps taken by the students (according to the rubric of the math test). <xref ref-type="supplementary-material" rid="DS1">Supplementary Appendix A</xref> presents the specifications of items by contents and formats. <xref ref-type="supplementary-material" rid="DS1">Supplementary Appendix B</xref> presents the original teacher-made math test.</p>
</sec>
<sec id="S2.SS3">
<title>Rubric of Test</title>
<p>The rubric was established on the basis of the rationale of the partial credit model (PCM) (<xref ref-type="bibr" rid="B27">Masters, 1982</xref>). According to the PCM, a problem-solving step can be credited only if the previous step is correct. For instance, in the case of a three-step item, if a student answers the first step correctly but fails the second step, it is reasonable to assume that he/she cannot answer the third step correctly. Thus, based on the requirement of the PCM, each correct step was given one point provided that all previous steps were correct, and a step was given no point if any previous step was incorrect. The possible score range for each item in this study (0 to 2, 0 to 3, 0 to 4, or 0 to 5) therefore depended on the total number of solution steps and the corresponding previous correct steps for the item. To validate the step scores, two raters were asked to judge the step scores and to add up the total scores for all the examinees on each item. The Pearson product-moment correlations between the two raters&#x2019; total scores for all the examinees on each item ranged from 0.88 to 0.99, indicating good levels of inter-rater consistency for all items. All intracorrelations between items were significant at the 0.01 alpha level and are shown in <xref ref-type="supplementary-material" rid="DS1">Supplementary Appendix C</xref> for the male and female groups, respectively.</p>
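<p>The step-scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated (one point per correct step, counted until the first incorrect step); the function name <monospace>step_score</monospace> and the list-of-booleans input are hypothetical and are not taken from the scoring materials.</p>
<preformat preformat-type="code"><![CDATA[
```python
def step_score(steps_correct):
    """Score one item under the rubric described in the text.

    steps_correct: list of booleans, one per solution step in order,
    True if the rater judged that step correct.
    A step earns a point only if it and every earlier step are correct.
    """
    score = 0
    for is_correct in steps_correct:
        if not is_correct:
            break  # later steps earn no credit once a step fails
        score += 1
    return score


# Three-step item: first step right, second wrong, third right
partial = step_score([True, False, True])   # counts only the first step
full = step_score([True, True, True])       # all three steps credited
```
]]></preformat>
<p>Under this rule, a three-step item can only receive scores from 0 to 3, which matches the score ranges listed above.</p>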
</sec>
<sec id="S2.SS4">
<title><italic>Poly-SIBTEST</italic> Procedure</title>
<p>The <italic>Poly-SIBTEST</italic>, a confirmatory and theory-driven approach to detecting associates of DIF, was designed to address the problem of DIF identification for polytomously scored items (<xref ref-type="bibr" rid="B1">Chang et al., 1996</xref>). <italic>Poly-SIBTEST</italic> is a non-parametric method (i.e., it does not assume a parametric shape for the item response function) and is based on the Shealy-Stout multidimensional model (MDM; <xref ref-type="bibr" rid="B35">Shealy and Stout, 1993</xref>) for DIF detection. The MDM holds that DIF results from a second construct that the test is not intended to measure (<xref ref-type="bibr" rid="B35">Shealy and Stout, 1993</xref>). Put differently, a DIF item measures more than one construct, and the focal group (studied group) and the reference group differ in their scores on the second construct. Therefore, according to the MDM, DIF occurs when the reference and focal groups, matched on the same level of the intended (main) construct, have different scores (distributions) on the second construct.</p>
<p>Based on <italic>Poly-SIBTEST</italic> theory and prior research, when items are hypothesized to share a common secondary construct, they can be bundled together and assessed for differential bundle functioning (DBF). Specifically, consider the inequality <italic>T</italic><sub><italic>jF</italic></sub>(&#x03B8;) &#x003C; <italic>T</italic><sub><italic>jR</italic></sub>(&#x03B8;), where &#x03B8; represents the measured target ability, and <italic>T</italic><sub><italic>jF</italic></sub> and <italic>T</italic><sub><italic>jR</italic></sub> represent the marginal item response functions of item <italic>j</italic> for the focal group (<italic>F</italic>) and the reference group (<italic>R</italic>), respectively. The difference between the subtest response functions gives a preliminary index of DBF, given the examinees&#x2019; ability level. If all ability levels are considered, an index of DBF, i.e., <italic>Bu</italic>, can be estimated. The estimate of <italic>Bu</italic> can be tested with the standardized statistic, which has an approximately normal distribution with a mean of 0 and an SD of 1 for a large sample. Items with values greater than 1.96 or less than &#x2212;1.96 are deemed DIF items. In the <italic>Poly-SIBTEST</italic> analysis, we set the female group as the reference group so that values of <italic>Bu</italic> greater than 0 indicate male-favoring potential and values of <italic>Bu</italic> less than 0 indicate female-favoring potential. However, an item was flagged as a DIF item only when its standardized <italic>Bu</italic> was greater than 1.96 or less than &#x2212;1.96.</p>
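<p>The flagging rule above amounts to a two-sided test at the 1.96 cutoff. A minimal Python sketch follows, using the sign convention stated in the text (female group as reference, positive values male-favoring); the function name <monospace>classify_dif</monospace> and the category labels are hypothetical, and the sketch does not estimate <italic>Bu</italic> itself.</p>
<preformat preformat-type="code"><![CDATA[
```python
def classify_dif(standardized_bu, critical=1.96):
    """Classify an item from its standardized Bu statistic.

    Follows the convention stated in the text: values above the
    critical value are male-favoring, values below its negative are
    female-favoring, and everything in between is not flagged.
    """
    if standardized_bu > critical:
        return "male-favoring"
    if standardized_bu < -critical:
        return "female-favoring"
    return "neutral"


# Illustrative values only, not results from the study
labels = [classify_dif(z) for z in (2.5, -3.1, 0.4, 1.96)]
```
]]></preformat>
<p>A value exactly equal to 1.96 is left unflagged here because the text specifies strict inequalities.</p>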
</sec>
<sec id="S2.SS5">
<title>Valid Matching Subtest</title>
<p>Before DIF estimation, the <italic>Poly-SIBTEST</italic> procedure requires a valid matching subtest. Of the 34 items, 18 items of interest were classified into five bundles of the studied subtest according to features such as mathematics content (arithmetic, algebra, and geometry), number type (fraction and integer), and item format (operation and word problem). The remaining 16 items, after excluding the items in the studied subtest, were used as the matching subtest for the purpose of purification in the first automatic DIF analysis (<xref ref-type="bibr" rid="B36">Stout and Roussos, 1995</xref>). After conducting the automatic DIF analysis three times and removing five items that displayed significant DIF, a valid matching subtest consisting of 11 non-DIF items was obtained (items 1, 6, 8, 10, 12, 13, 15, 20, 22, 27, and 32). These uncontaminated items were used to identify the ability levels of male and female students in the <italic>Poly-SIBTEST</italic> analysis.</p>
</sec>
<sec id="S2.SS6">
<title>Analysis</title>
<p>In this study, we set women as the reference group in both the <italic>Poly-SIBTEST</italic> analysis and the estimation of <italic>Poly-BW</italic> differences. A positive DIF measure in the <italic>Poly-SIBTEST</italic> analysis indicated a male-favoring item and a negative DIF measure indicated a female-favoring item. A positive index difference in the <italic>Poly-BW</italic> analysis implied that a given <italic>Poly-BW</italic> index estimated from the male group was greater than that estimated from the female group, and vice versa. Thus, for research question 1, the <italic>Poly-BW</italic> indices were calculated for the male and female groups separately with the <italic>PWBstar</italic> 1.0 program, and the differences in individual indices between the two groups were used as predictors of the DIF measure (<italic>Bu</italic>) obtained from <italic>Poly-SIBTEST</italic> in a stepwise multiple regression analysis. For research question 2, the accuracy with which the four <italic>Poly-BW</italic> indices from the APT procedure classified DIF-flagged items was assessed by multiple discriminant analysis.</p>
<p>Specifically, the study used a 100-repetition APT procedure with the following steps: (1) classifying <italic>cp</italic> values into three levels (high, middle, and low, using 33% ranks as cutoffs); and (2) classifying each of the <italic>mp</italic>, <italic>bp</italic>, and <italic>wp</italic> indices into two levels (high and low) according to whether the value was above or below the 95th percentile value, separately for the male and female groups. After these APT steps, the <italic>PWBstar</italic> 1.0 program provides all information on the item classifications by labeling each item as (letter + number)&#x2032;, in which the letters represent high (H), middle (M), and low (L) <italic>power</italic> levels of the <italic>cp</italic> index, while the numbers 1, 2, 3, and 4 refer to <italic>normal</italic>, <italic>disturbance</italic>, <italic>hint</italic>, and <italic>hybrid</italic> (disturbance plus hint) types of responses, respectively. Further, the prime symbol (&#x2032;) indicates a significantly high <italic>defenselessness</italic> value of the <italic>mp</italic> index. The item classifications provide insights into the associates of DIF. For example, if a DIF item is labeled as L1&#x2032; and L2&#x2032; for men and women, respectively, it implies that this DIF item displays a low level of <italic>power</italic> (L) and a high level of <italic>defenselessness</italic> (&#x2032;) for both genders, but with <italic>normal</italic> performance (&#x201C;1&#x201D;) for men and aberrant <italic>disturbances</italic> (&#x201C;2&#x201D;) for women. In such a case, disturbance would be suspected as an associate of DIF.</p>
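<p>The labeling scheme described above can be sketched as a small Python function. This only illustrates how a label is composed from the level decisions; it does not reproduce the APT procedure or the <italic>PWBstar</italic> 1.0 program, and the function name <monospace>item_label</monospace> and its boolean arguments are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
PRIME = "\u2032"  # prime symbol marking a significantly high mp value

# (bp high?, wp high?) -> response-type number from the text
RESPONSE_TYPE = {
    (False, False): 1,  # normal
    (True, False): 2,   # disturbance
    (False, True): 3,   # hint
    (True, True): 4,    # hybrid (disturbance plus hint)
}


def item_label(cp_level, bp_high, wp_high, mp_high):
    """Compose a classification label such as L1 with a prime mark.

    cp_level : "H", "M", or "L" power level of the cp index
    bp_high  : True if the disturbance index is flagged as high
    wp_high  : True if the hint index is flagged as high
    mp_high  : True if the defenselessness index is flagged as high
    """
    if cp_level not in ("H", "M", "L"):
        raise ValueError("cp_level must be 'H', 'M', or 'L'")
    number = RESPONSE_TYPE[(bool(bp_high), bool(wp_high))]
    return f"{cp_level}{number}{PRIME if mp_high else ''}"


# The example from the text: low power, high defenselessness,
# normal responses for men and disturbances for women
label_men = item_label("L", bp_high=False, wp_high=False, mp_high=True)
label_women = item_label("L", bp_high=True, wp_high=False, mp_high=True)
```
]]></preformat>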
</sec>
</sec>
<sec id="S3" sec-type="results">
<title>Results</title>
<sec id="S3.SS1">
<title>Descriptive Statistics and Preliminary Analysis of Dimensionality</title>
<p>Descriptive statistics of item-level performance and the parameters estimated by the <italic>Poly-BW</italic> item index formulas for men (<italic>N</italic> = 734) and women (<italic>N</italic> = 705) are displayed in <xref ref-type="supplementary-material" rid="DS1">Supplementary Appendix D</xref>. Because the items required different numbers of problem-solving steps, item scores ranged from 0 to 2, 0 to 3, 0 to 4, or 0 to 5. As can be seen in <xref ref-type="supplementary-material" rid="DS1">Supplementary Appendix D</xref>, the mean item scores of male students ranged from 0.54 (max score 3 in item 29) to 3.48 (max score 5 in item 26), and those of female students ranged from 0.78 (max score 2 in item 30) to 3.85 (max score 5 in item 26). The smallest SDs for both groups occurred on item 19 (0.81 and 0.74 for men and women, respectively) and the largest SDs occurred on item 23 (2.34 and 2.42 for men and women, respectively). Before examining DIFs, the dimensionality of the 34 items was preliminarily checked. Principal components analysis of standardized residuals showed that the unexplained variance explained by the main dimension was only 0.8%, indicating that the math test was unidimensional.</p>
</sec>
<sec id="S3.SS2">
<title>Prediction</title>
<p><xref ref-type="table" rid="T1">Table 1</xref> summarizes the stepwise multiple regression analysis. Two <italic>Poly-BW</italic> indices explain almost 78% of the variance (63.8% from the predictor <italic>mp</italic><sub><italic>M&#x2013;F</italic></sub> and 14.2% from the predictor <italic>cp</italic><sub><italic>M&#x2013;F</italic></sub>) of the DIF measures (<italic>Bu</italic>) obtained from the <italic>Poly-SIBTEST</italic> analysis (<italic>F</italic><sub>4,29</sub> = 28.53, <italic>p</italic> &#x003C; 0.001). Specifically, the <italic>defenselessness (mp) index</italic> has the largest contribution (63.8%), followed by the <italic>power (cp) index</italic> (14.2%). This finding indicates that the <italic>defenselessness</italic> (<italic>mp</italic>) and <italic>power (cp) indices</italic> could be used to predict the gender DIF measures in the math test (&#x03B2; = 0.734, <italic>p</italic> &#x003C; 0.001 for <italic>mp</italic><sub><italic>M&#x2013;F</italic></sub>; &#x03B2; = &#x2212;0.286, <italic>p</italic> &#x003C; 0.05 for <italic>cp</italic><sub><italic>M&#x2013;F</italic></sub>)<sup><xref ref-type="fn" rid="footnote2">2</xref></sup>. Because women were regarded as the reference group in both the <italic>Poly-SIBTEST</italic> analysis and the <italic>Poly-BW</italic> difference estimations, the positive regression coefficient implied that the larger the <italic>defenselessness</italic> differences (<italic>mp</italic><sub><italic>M&#x2013;F</italic></sub>), the larger the DIF measures (i.e., items were more likely to favor men when they had a relatively large defenselessness for men). Likewise, the negative regression coefficient indicated that the larger the <italic>power</italic> differences, the smaller the DIF measures (i.e., items were more likely to favor women when they had a relatively large power for men). 
In summary, the <italic>defenselessness</italic> (<italic>mp</italic>) and <italic>power (cp) indices</italic> were significant predictors of DIF properties in this case. Given the definitions of these two indices, these findings indicate that if an item was more defenseless (easy to answer correctly beyond the item&#x2019;s difficulty) or more powerful (hard to answer correctly within the item&#x2019;s difficulty) for one gender group, then the item was more likely to exhibit DIF. Interestingly, the abnormal indices (<italic>hint wp</italic> and <italic>disturbance bp</italic>) did not contribute significantly to the DIF properties. This finding indicates that these two abnormal indices were less likely to be associates of DIF in this case.</p>
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>Regression model summary for <italic>Poly-BW</italic> indices on differential item functioning (DIF) measures.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"></td>
<td/>
<td valign="top" align="center"><italic>SS</italic></td>
<td valign="top" align="center"><italic>df</italic></td>
<td valign="top" align="center"><italic>MS</italic></td>
<td valign="top" align="center"><italic>F</italic></td>
<td valign="top" align="center"><italic>p</italic></td>
<td valign="top" align="center"><italic>R</italic><sup>2</sup></td>
<td/>
<td/>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><italic>ANOVA</italic></td>
<td valign="top" align="center">Regression</td>
<td valign="top" align="center">0.300</td>
<td valign="top" align="center">4</td>
<td valign="top" align="center">0.075</td>
<td valign="top" align="center">28.53</td>
<td valign="top" align="center">&#x003C;0.001</td>
<td valign="top" align="center">0.797</td>
<td/>
<td/>
</tr>
<tr>
<td/>
<td valign="top" align="center">Residual</td>
<td valign="top" align="center">0.076</td>
<td valign="top" align="center">29</td>
<td valign="top" align="center">0.003</td>
<td/>
<td/>
<td valign="top" colspan="3"/></tr>
<tr>
<td/>
<td valign="top" align="center">Total</td>
<td valign="top" align="center">0.376</td>
<td valign="top" align="center">33</td>
<td/>
<td/>
<td/>
<td valign="top" colspan="3"/>
</tr>
<tr>
<td valign="top" align="center" colspan="10"><hr/></td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td/>
<td valign="top" align="center"><bold>Var.</bold></td>
<td valign="top" align="center"><bold><italic>R</italic></bold></td>
<td valign="top" align="center"><bold>&#x0394;<italic>R<sup>2</sup></italic></bold></td>
<td valign="top" align="center"><bold>&#x03BB;</bold></td>
<td valign="top" align="center"><bold>S.E.</bold></td>
<td valign="top" align="center"><bold><italic>B</italic></bold></td>
<td valign="top" align="center"><bold><italic>t</italic></bold></td>
<td valign="top" align="center"><bold><italic>p</italic></bold></td>
</tr>
<tr>
<td valign="top" align="center" colspan="10"><hr/></td>
</tr>
<tr>
<td valign="top" align="left">Coefficients</td>
<td/>
<td valign="top" align="center">Con.</td>
<td/>
<td/>
<td valign="top" align="center">0.082</td>
<td valign="top" align="center">0.015</td>
<td/>
<td valign="top" align="center">5.358</td>
<td valign="top" align="center">&#x003C;0.001</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center"><italic>mp</italic><sub><italic>M&#x2013;F</italic></sub></td>
<td valign="top" align="center">0.799</td>
<td valign="top" align="center">0.638</td>
<td valign="top" align="center">1.952</td>
<td valign="top" align="center">0.298</td>
<td valign="top" align="center">0.734</td>
<td valign="top" align="center">6.543</td>
<td valign="top" align="center">&#x003C;0.001</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center"><italic>cp</italic><sub><italic>M&#x2013;F</italic></sub></td>
<td valign="top" align="center">0.883</td>
<td valign="top" align="center">0.142</td>
<td valign="top" align="center">&#x2212;0.686</td>
<td valign="top" align="center">0.300</td>
<td valign="top" align="center">&#x2212;0.286</td>
<td valign="top" align="center">&#x2212;2.285</td>
<td valign="top" align="center">0.030</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center"><italic>wp</italic><sub><italic>M&#x2013;F</italic></sub></td>
<td valign="top" align="center">0.892</td>
<td valign="top" align="center">0.015</td>
<td valign="top" align="center">2.970</td>
<td valign="top" align="center">2.437</td>
<td valign="top" align="center">0.138</td>
<td valign="top" align="center">1.219</td>
<td valign="top" align="center">0.233</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center"><italic>bp</italic><sub><italic>M&#x2013;F</italic></sub></td>
<td valign="top" align="center">0.893</td>
<td valign="top" align="center">0.002</td>
<td valign="top" align="center">1.441</td>
<td valign="top" align="center">2.401</td>
<td valign="top" align="center">0.071</td>
<td valign="top" align="center">0.600</td>
<td valign="top" align="center">0.553</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn><p><italic>Dependent Variable: Bu.</italic></p></fn>
<fn><p><italic>Subscripts in selected variables (e.g., mp<sub>M&#x2013;F</sub>) represent the difference in an index value between male and female.</italic></p></fn>
</table-wrap-foot>
</table-wrap>
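<p>The summary statistics in <xref ref-type="table" rid="T1">Table 1</xref> can be cross-checked from the rounded values it reports. The following Python sketch recomputes <italic>R</italic><sup>2</sup>, <italic>F</italic>, and the stepwise increments; because the inputs are rounded to three decimals, small discrepancies from the published figures are to be expected, and the variable names are chosen only for this illustration.</p>
<preformat preformat-type="code"><![CDATA[
```python
# Rounded values copied from the ANOVA block of Table 1
ss_regression, ss_residual, ss_total = 0.300, 0.076, 0.376
df_regression, df_residual = 4, 29

# Proportion of variance explained (about 0.798 from rounded inputs)
r_squared = ss_regression / ss_total

# F ratio of mean squares (about 28.6 from rounded inputs)
f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)

# Cumulative multiple R after each predictor enters (mp, cp, wp, bp)
cumulative_r = [0.799, 0.883, 0.892, 0.893]
r2_steps = [r ** 2 for r in cumulative_r]

# Increment in R squared contributed by each step
delta_r2 = [r2_steps[0]] + [
    later - earlier for earlier, later in zip(r2_steps, r2_steps[1:])
]
```
]]></preformat>
<p>The first two increments sum to roughly 0.78, which agrees with the statement that the <italic>mp</italic> and <italic>cp</italic> differences together explain almost 78% of the variance.</p>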
<p>Furthermore, to demonstrate that the <italic>Poly-BW</italic> indices outperformed the difficulty index of CTT, i.e., <italic>P</italic> = (<italic>P</italic><sub><italic>H</italic></sub> + <italic>P</italic><sub><italic>L</italic></sub>)/2, in terms of DIF variance explained, we compared the DIF variance explained by the <italic>P</italic> index with that explained by the <italic>Poly-BW</italic> indices. Results showed that the CTT difficulty index explained 53.4% of the variance of the DIF measures (<italic>F</italic><sub>1,32</sub> = 36.68, <italic>p</italic> &#x003C; 0.001), which was lower than the variance explained by the <italic>defenselessness</italic> (<italic>mp</italic>) <italic>index</italic> (63.8%). Because the CTT difficulty index is calculated from total scores, it confounds the within-difficulty and beyond-difficulty effects in a person&#x2019;s responses. In contrast, the within-difficulty and beyond-difficulty effects can be distinguished by the <italic>power (cp) index</italic> and the <italic>defenselessness</italic> (<italic>mp</italic>) <italic>index</italic>, respectively, so that the intrinsic properties of DIFs can be disclosed more fully.</p>
</sec>
<sec id="S3.SS3">
<title>Accuracy</title>
<p>For research question 2, we examined how accurately the four <italic>Poly-BW</italic> indices predicted the DIF-flagged items using multiple discriminant analysis. Based on the standardized values of <italic>Bu</italic> obtained from the <italic>Poly-SIBTEST</italic> procedure (i.e., &#x007C;<italic>Bu</italic>&#x007C; &#x003E; 1.96), 12 items were flagged as DIF items: 6 items favoring men (items 2, 5, 7, 11, 14, and 17) and 6 items favoring women (items 4, 18, 19, 21, 29, and 31). The remaining 22 items were neutral to both genders. The average hit rate of the DIFs (see <xref ref-type="fig" rid="F2">Figure 2</xref>) was 82.4%. Without considering the group favored by the DIF items, the DIFs were perfectly (100%) predicted by the four <italic>Poly-BW</italic> indices. However, the neutral category was predicted with an accuracy of 72.7%, leaving six undetermined items: five neutral items (items 22, 23, 26, 27, and 28) were classified as female-favoring items and one neutral item (item 25) was classified as a male-favoring item.</p>
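<p>The reported percentages can be reproduced with simple arithmetic under one reading of the text, namely that all 12 DIF items count as hits and that the 82.4% figure is the overall proportion of correctly classified items. The Python sketch below makes that reading explicit; it is an interpretation for illustration, not output from the discriminant analysis.</p>
<preformat preformat-type="code"><![CDATA[
```python
n_dif_items = 12          # items flagged by Poly-SIBTEST
n_neutral_items = 22      # remaining items
neutral_misclassified = 6 # neutral items predicted as favoring a group

# Assumption: every DIF item was predicted as a DIF item (100%)
dif_hits = n_dif_items
neutral_hits = n_neutral_items - neutral_misclassified

# Accuracy within the neutral category, as a percentage
neutral_rate = 100 * neutral_hits / n_neutral_items

# Overall proportion of correctly classified items, as a percentage
overall_rate = 100 * (dif_hits + neutral_hits) / (n_dif_items + n_neutral_items)
```
]]></preformat>
<p>Rounded to one decimal place, these two quantities match the 72.7% and 82.4% figures given above.</p>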
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>Hit rates of differential item functioning (DIF)-flagged items according to multiple discriminant analysis.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-13-821459-g002.tif"/>
</fig>
<p>When examining the values of the <italic>Poly-BW</italic> indices (see <xref ref-type="table" rid="T2">Table 2</xref>), we found that five of the six male-favoring items (all except item 7) have higher <italic>mp</italic> values in the male group than in the female group, and all six female-favoring items have higher <italic>mp</italic> values in the female group than in the male group. The six undetermined (neutral) items displayed higher <italic>cp</italic> values in the male group than in the female group, but the direction of their <italic>mp</italic> differences varied. The five neutral items classified as female-favoring displayed higher <italic>mp</italic> values in the female group than in the male group, whereas the neutral item classified as male-favoring (item 25) displayed a higher <italic>mp</italic> value in the male group than in the female group (0.094 vs. 0.076). This implies that the <italic>mp</italic> index (<italic>defenselessness</italic>) is the main factor shifting an item&#x2019;s classification from no DIF to favoring one group. The <italic>Poly-BW</italic> index values of the remaining items can be seen in <xref ref-type="supplementary-material" rid="DS1">Supplementary Appendix D</xref>.</p>
<table-wrap position="float" id="T2">
<label>TABLE 2</label>
<caption><p>Values of <italic>Poly-BW</italic> indices for differential item functioning (DIF) items and undetermined items.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"></td>
<td/>
<td valign="top" align="center" colspan="2">Raw Mean<hr/></td>
<td valign="top" align="center" colspan="2">Raw SD<hr/></td>
<td valign="top" align="center" colspan="2"><italic>cp</italic><hr/></td>
<td valign="top" align="center" colspan="2"><italic>mp</italic><hr/></td>
<td valign="top" align="center" colspan="2"><italic>bp</italic><hr/></td>
<td valign="top" align="center" colspan="2"><italic>wp</italic><hr/></td>
</tr>
<tr>
<td valign="top" align="left">Category</td>
<td valign="top" align="center">Item</td>
<td valign="top" align="center">Male</td>
<td valign="top" align="center">Female</td>
<td valign="top" align="center">Male</td>
<td valign="top" align="center">Female</td>
<td valign="top" align="center">Male</td>
<td valign="top" align="center">Female</td>
<td valign="top" align="center">Male</td>
<td valign="top" align="center">Female</td>
<td valign="top" align="center">Male</td>
<td valign="top" align="center">Female</td>
<td valign="top" align="center">Male</td>
<td valign="top" align="center">Female</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Male-favoring</td>
<td valign="top" align="center">v2</td>
<td valign="top" align="center">1.30</td>
<td valign="top" align="center">1.26</td>
<td valign="top" align="center">0.94</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">0.084</td>
<td valign="top" align="center">0.117</td>
<td valign="top" align="center">0.416</td>
<td valign="top" align="center">0.352</td>
<td valign="top" align="center">0.056</td>
<td valign="top" align="center">0.047</td>
<td valign="top" align="center">0.026</td>
<td valign="top" align="center">0.026</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">v5</td>
<td valign="top" align="center">1.27</td>
<td valign="top" align="center">1.29</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">0.94</td>
<td valign="top" align="center">0.099</td>
<td valign="top" align="center">0.113</td>
<td valign="top" align="center">0.420</td>
<td valign="top" align="center">0.385</td>
<td valign="top" align="center">0.032</td>
<td valign="top" align="center">0.029</td>
<td valign="top" align="center">0.022</td>
<td valign="top" align="center">0.021</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">v7</td>
<td valign="top" align="center">1.65</td>
<td valign="top" align="center">1.76</td>
<td valign="top" align="center">1.38</td>
<td valign="top" align="center">1.34</td>
<td valign="top" align="center">0.164</td>
<td valign="top" align="center">0.150</td>
<td valign="top" align="center">0.321</td>
<td valign="top" align="center">0.326</td>
<td valign="top" align="center">0.036</td>
<td valign="top" align="center">0.027</td>
<td valign="top" align="center">0.025</td>
<td valign="top" align="center">0.022</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">v11</td>
<td valign="top" align="center">1.23</td>
<td valign="top" align="center">1.23</td>
<td valign="top" align="center">0.97</td>
<td valign="top" align="center">0.96</td>
<td valign="top" align="center">0.114</td>
<td valign="top" align="center">0.130</td>
<td valign="top" align="center">0.398</td>
<td valign="top" align="center">0.351</td>
<td valign="top" align="center">0.028</td>
<td valign="top" align="center">0.025</td>
<td valign="top" align="center">0.024</td>
<td valign="top" align="center">0.028</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">v14</td>
<td valign="top" align="center">1.84</td>
<td valign="top" align="center">1.86</td>
<td valign="top" align="center">1.87</td>
<td valign="top" align="center">1.85</td>
<td valign="top" align="center">0.263</td>
<td valign="top" align="center">0.256</td>
<td valign="top" align="center">0.215</td>
<td valign="top" align="center">0.197</td>
<td valign="top" align="center">0.034</td>
<td valign="top" align="center">0.029</td>
<td valign="top" align="center">0.029</td>
<td valign="top" align="center">0.030</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">v17</td>
<td valign="top" align="center">1.03</td>
<td valign="top" align="center">1.04</td>
<td valign="top" align="center">1.64</td>
<td valign="top" align="center">1.61</td>
<td valign="top" align="center">0.583</td>
<td valign="top" align="center">0.559</td>
<td valign="top" align="center">0.062</td>
<td valign="top" align="center">0.048</td>
<td valign="top" align="center">0.012</td>
<td valign="top" align="center">0.009</td>
<td valign="top" align="center">0.017</td>
<td valign="top" align="center">0.022</td>
</tr>
<tr>
<td valign="top" align="left">Female-favoring</td>
<td valign="top" align="center">v4</td>
<td valign="top" align="center">1.68</td>
<td valign="top" align="center">1.99</td>
<td valign="top" align="center">1.24</td>
<td valign="top" align="center">1.16</td>
<td valign="top" align="center">0.141</td>
<td valign="top" align="center">0.080</td>
<td valign="top" align="center">0.308</td>
<td valign="top" align="center">0.375</td>
<td valign="top" align="center">0.065</td>
<td valign="top" align="center">0.072</td>
<td valign="top" align="center">0.035</td>
<td valign="top" align="center">0.037</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">v18</td>
<td valign="top" align="center">1.34</td>
<td valign="top" align="center">1.52</td>
<td valign="top" align="center">0.92</td>
<td valign="top" align="center">0.82</td>
<td valign="top" align="center">0.080</td>
<td valign="top" align="center">0.041</td>
<td valign="top" align="center">0.463</td>
<td valign="top" align="center">0.576</td>
<td valign="top" align="center">0.034</td>
<td valign="top" align="center">0.030</td>
<td valign="top" align="center">0.017</td>
<td valign="top" align="center">0.013</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">v19</td>
<td valign="top" align="center">1.32</td>
<td valign="top" align="center">1.50</td>
<td valign="top" align="center">0.81</td>
<td valign="top" align="center">0.74</td>
<td valign="top" align="center">0.081</td>
<td valign="top" align="center">0.048</td>
<td valign="top" align="center">0.429</td>
<td valign="top" align="center">0.520</td>
<td valign="top" align="center">0.056</td>
<td valign="top" align="center">0.045</td>
<td valign="top" align="center">0.022</td>
<td valign="top" align="center">0.020</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">v21</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">1.18</td>
<td valign="top" align="center">1.33</td>
<td valign="top" align="center">1.39</td>
<td valign="top" align="center">0.478</td>
<td valign="top" align="center">0.359</td>
<td valign="top" align="center">0.099</td>
<td valign="top" align="center">0.133</td>
<td valign="top" align="center">0.017</td>
<td valign="top" align="center">0.020</td>
<td valign="top" align="center">0.018</td>
<td valign="top" align="center">0.024</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">v29</td>
<td valign="top" align="center">0.54</td>
<td valign="top" align="center">0.80</td>
<td valign="top" align="center">1.10</td>
<td valign="top" align="center">1.26</td>
<td valign="top" align="center">0.710</td>
<td valign="top" align="center">0.521</td>
<td valign="top" align="center">0.023</td>
<td valign="top" align="center">0.040</td>
<td valign="top" align="center">0.009</td>
<td valign="top" align="center">0.022</td>
<td valign="top" align="center">0.034</td>
<td valign="top" align="center">0.046</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">v31</td>
<td valign="top" align="center">1.40</td>
<td valign="top" align="center">1.68</td>
<td valign="top" align="center">1.43</td>
<td valign="top" align="center">1.39</td>
<td valign="top" align="center">0.267</td>
<td valign="top" align="center">0.174</td>
<td valign="top" align="center">0.217</td>
<td valign="top" align="center">0.291</td>
<td valign="top" align="center">0.032</td>
<td valign="top" align="center">0.036</td>
<td valign="top" align="center">0.024</td>
<td valign="top" align="center">0.018</td>
</tr>
<tr>
<td valign="top" align="left">Undetermined</td>
<td valign="top" align="center">v25</td>
<td valign="top" align="center">0.93</td>
<td valign="top" align="center">0.94</td>
<td valign="top" align="center">1.38</td>
<td valign="top" align="center">1.36</td>
<td valign="top" align="center">0.484</td>
<td valign="top" align="center">0.462</td>
<td valign="top" align="center">0.094</td>
<td valign="top" align="center">0.076</td>
<td valign="top" align="center">0.016</td>
<td valign="top" align="center">0.016</td>
<td valign="top" align="center">0.025</td>
<td valign="top" align="center">0.032</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">v26</td>
<td valign="top" align="center">3.48</td>
<td valign="top" align="center">3.85</td>
<td valign="top" align="center">2.04</td>
<td valign="top" align="center">1.78</td>
<td valign="top" align="center">0.060</td>
<td valign="top" align="center">0.034</td>
<td valign="top" align="center">0.497</td>
<td valign="top" align="center">0.582</td>
<td valign="top" align="center">0.039</td>
<td valign="top" align="center">0.038</td>
<td valign="top" align="center">0.019</td>
<td valign="top" align="center">0.015</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">v27</td>
<td valign="top" align="center">1.22</td>
<td valign="top" align="center">1.45</td>
<td valign="top" align="center">1.40</td>
<td valign="top" align="center">1.43</td>
<td valign="top" align="center">0.356</td>
<td valign="top" align="center">0.242</td>
<td valign="top" align="center">0.151</td>
<td valign="top" align="center">0.217</td>
<td valign="top" align="center">0.026</td>
<td valign="top" align="center">0.027</td>
<td valign="top" align="center">0.029</td>
<td valign="top" align="center">0.025</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">v28</td>
<td valign="top" align="center">1.46</td>
<td valign="top" align="center">1.68</td>
<td valign="top" align="center">1.44</td>
<td valign="top" align="center">1.42</td>
<td valign="top" align="center">0.244</td>
<td valign="top" align="center">0.162</td>
<td valign="top" align="center">0.234</td>
<td valign="top" align="center">0.286</td>
<td valign="top" align="center">0.032</td>
<td valign="top" align="center">0.037</td>
<td valign="top" align="center">0.029</td>
<td valign="top" align="center">0.033</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">v22</td>
<td valign="top" align="center">1.62</td>
<td valign="top" align="center">1.90</td>
<td valign="top" align="center">2.20</td>
<td valign="top" align="center">2.26</td>
<td valign="top" align="center">0.465</td>
<td valign="top" align="center">0.368</td>
<td valign="top" align="center">0.098</td>
<td valign="top" align="center">0.117</td>
<td valign="top" align="center">0.019</td>
<td valign="top" align="center">0.023</td>
<td valign="top" align="center">0.029</td>
<td valign="top" align="center">0.036</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">v23</td>
<td valign="top" align="center">1.94</td>
<td valign="top" align="center">2.28</td>
<td valign="top" align="center">2.34</td>
<td valign="top" align="center">2.42</td>
<td valign="top" align="center">0.375</td>
<td valign="top" align="center">0.276</td>
<td valign="top" align="center">0.151</td>
<td valign="top" align="center">0.199</td>
<td valign="top" align="center">0.017</td>
<td valign="top" align="center">0.018</td>
<td valign="top" align="center">0.023</td>
<td valign="top" align="center">0.020</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Furthermore, after excluding the six undetermined items, the remaining 28 items, predicted by the <italic>defenselessness</italic> (<italic>mp</italic>) and <italic>power</italic> (<italic>cp</italic>) indices as male-favoring, female-favoring, or neutral, are shown in <xref ref-type="fig" rid="F3">Figure 3</xref>. In terms of the <italic>Q</italic> values from CTT, men perceived most items as more difficult than women did, suggesting that CTT item difficulty cannot explain the possible associates of gender DIF. By contrast, the <italic>mp</italic> and <italic>cp</italic> indices provide useful information on the associates of DIF. First, in the female-favoring part (the left part of <xref ref-type="fig" rid="F3">Figure 3</xref>), most items exhibited low values of <italic>defenselessness</italic> (<italic>mp</italic>) but high values of <italic>power</italic> (<italic>cp</italic>). Here, the values of <italic>defenselessness</italic> (<italic>mp</italic>) and <italic>power</italic> (<italic>cp</italic>) are the differences between the gender groups, with women as the reference group (i.e., the value for men minus the value for women). Accordingly, items with lower <italic>defenselessness</italic> and higher <italic>power</italic> for men than for women were likely to be female-favoring items. Second, in the male-favoring part (the right part of <xref ref-type="fig" rid="F3">Figure 3</xref>), most items exhibited high values of <italic>defenselessness</italic> (<italic>mp</italic>) but low values of <italic>power</italic> (<italic>cp</italic>). Thus, items with higher <italic>defenselessness</italic> and lower <italic>power</italic> for men than for women were likely to be male-favoring items.</p>
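<p>As a rough illustration of the sign pattern described for Figure 3, the following Python sketch applies it to three items, using the <italic>cp</italic> and <italic>mp</italic> values reported for them in the <italic>Poly-BW</italic> table above. The function name, the data structure, and the bare sign rule are assumptions made for illustration only; the rule is a simplification, not the prediction model reported in this study, and not every DIF item in the table follows it (e.g., item v7 does not).</p>
<preformat>
```python
from typing import NamedTuple


class ItemIndices(NamedTuple):
    """cp (power) and mp (defenselessness) values for one item, by gender."""
    cp_male: float
    cp_female: float
    mp_male: float
    mp_female: float


def favoring_direction(item: ItemIndices) -> str:
    """Hypothetical sign rule: differences are male minus female,
    i.e., women are treated as the reference group."""
    d_cp = item.cp_male - item.cp_female
    d_mp = item.mp_male - item.mp_female
    # Higher defenselessness and lower power for men
    if d_mp > 0 and 0 > d_cp:
        return "male-favoring"
    # Lower defenselessness and higher power for men
    if d_cp > 0 and 0 > d_mp:
        return "female-favoring"
    return "no clear direction"


# cp and mp values copied from the Poly-BW table (male, female)
ITEMS = {
    "v2": ItemIndices(0.084, 0.117, 0.416, 0.352),
    "v4": ItemIndices(0.141, 0.080, 0.308, 0.375),
    "v18": ItemIndices(0.080, 0.041, 0.463, 0.576),
}

for name, indices in ITEMS.items():
    print(name, favoring_direction(indices))
```
</preformat>
<p>Under these assumptions, the sketch labels v2 as male-favoring and v4 and v18 as female-favoring, which agrees with the categories listed for these items in the table.</p>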
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p>Differential item functioning (DIF) predicted by <italic>mp</italic> and <italic>cp</italic> indices and classifications for favoring items.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-13-821459-g003.tif"/>
</fig>
<p>The item classifications from the APT yielded further information for the treatment of DIF items. In the female-favoring part, three items (items 4, 18, and 19) with high values of <italic>defenselessness</italic> (<italic>mp</italic>) exhibited consistent item types (L2&#x2032;, L1&#x2032;, and L1&#x2032;, respectively) for both men and women. This implies that men and women consistently perceived the three items as defenseless: item 4 as defenseless with disturbance (L2&#x2032;), and items 18 and 19 as normal (L1&#x2032;). Two other DIF items (items 31 and 21) with relatively high values of <italic>power</italic> (<italic>cp</italic>) exhibited consistent item types (M1 and H1, respectively) for both men and women, indicating that the two groups perceived these items as having middle and high levels of power, respectively. By contrast, only item 29 was classified differently for the gender groups: H3 for women and H1 for men. Thus, item 29 was perceived as having a high level of power by both groups but as carrying a <italic>hint</italic> only by the female group. The <italic>hint</italic> was therefore a likely associate of DIF in item 29.</p>
<p>In the male-favoring part, six items exhibited male-favoring properties. Two of them (items 5 and 2) exhibited the same <italic>defenselessness</italic> item type (L1&#x2032;), and three of them (items 7, 17, and 14) exhibited the same <italic>power</italic> item types (M1, H1, and M1, respectively) for both groups. These findings imply that men and women consistently perceived the former two items as defenseless and the latter three items as having middle to high levels of power. Interestingly, only item 11 was labeled differently, as L1&#x2032; for men and M1 for women, suggesting that men perceived this item as more defenseless than women did, which likely resulted in its male-favoring property. In the neutral part, 15 of the 16 items performed consistently across the male and female groups. Only item 13 was labeled differently, as M3 for men and M1 for women, indicating that the item showed middle power for both groups but that men perceived more hints than women did. Interestingly, item 13 did not exhibit DIF. This might be because the item showed only middle power and, according to <xref ref-type="table" rid="T1">Table 1</xref>, the variance of DIF explained by the power (<italic>cp</italic>) index is low (only 14.2%).</p>
<p>In summary, among the 12 DIF items, a 100-repetition APT found only two (items 29 and 11) that received different item classifications for men and women.</p>
</sec>
</sec>
<sec id="S4" sec-type="discussion">
<title>Discussion</title>
<p>Given that the possible associates of DIF items have not attracted much attention, this study employed the <italic>Poly-SIBTEST</italic> approach as a reference method to demonstrate how the <italic>Poly-BW</italic> indices contribute to DIF measures, using a mathematics test as an example. To the best of our knowledge, this is the first study to investigate how the <italic>Poly-BW</italic> indices explain the possible associates of DIF. Several findings are noteworthy. First, two of the <italic>Poly-BW</italic> indices (<italic>defenselessness</italic> and <italic>power</italic>) significantly contributed to the <italic>Poly-SIBTEST</italic>-based DIF measures. This finding is largely consistent with the Monte Carlo study of <xref ref-type="bibr" rid="B15">Huang and Lin (2017)</xref>, who simulated dichotomous data under five conditions (sample size, item number, DIF type, DIF ratio, and DIF severity) for Rasch-based DIF measures. They found that the <italic>power index</italic> and the <italic>defenselessness index</italic> could, on average, predict Rasch-based DIF with significant absolute beta values of 0.41 and 0.26, respectively, and explain 22.4% of the total variance of DIF.</p>
<p>In our study of polytomously scored items, the <italic>defenselessness index</italic> (<italic>mp</italic>) explained most of the variance of the DIF measures, indicating that a significant predictor of DIF is the extent to which examinees whose ability exceeds the difficulty level of an item are more likely to answer it correctly in one group than in the other. More specifically, if an item is perceived as weak and exhibits high <italic>defenselessness</italic>, so that persons with ability levels higher than the difficulty level can answer it correctly very easily, then the item is likely to be identified as a DIF item. In addition, and importantly, the <italic>defenselessness index</italic> (<italic>mp</italic>) explained more variance of the DIF measures than the CTT-based difficulty index did, because the <italic>Poly-BW</italic> index clearly distinguishes responses &#x201C;within&#x201D; the item difficulty level from those &#x201C;beyond&#x201D; it. If we ignore the within-beyond effect on DIF associates and rely solely on the traditional difficulty index as an indicator of DIF, we may lose useful information about the possible associates of a DIF item. This is because the effects of <italic>power</italic> and <italic>defenselessness</italic> may be washed out within a single item.</p>
<p>Second, the <italic>Poly-BW</italic> indices provide clearer clues for understanding the different associates of DIF for men and women in the math test. In this case, if an item has relatively high <italic>defenselessness</italic> and relatively low <italic>power</italic> for men, it is likely to be a male-favoring DIF item. By contrast, if an item has relatively low <italic>defenselessness</italic> and relatively high <italic>power</italic> for men, it is likely to be a female-favoring DIF item. The gaps in <italic>defenselessness</italic> or <italic>power</italic> between genders may be the possible reasons for gender-related DIF in the math test. Given that the gap in <italic>defenselessness</italic> explained a large share of the DIF variance (63.8%), this study revealed that DIF mainly occurs where a person&#x2019;s ability is beyond the item difficulty. More specifically, the larger the difference in <italic>defenselessness</italic> between the genders, the more likely DIF is to occur. In line with this finding, we suggest that the treatment of DIF items should depend on the type of assessment. If the assessment is a norm-referenced test, DIF items with high <italic>defenselessness</italic> for both genders may be modified to make them more difficult. By contrast, if the assessment is a criterion-referenced test, DIF items with high <italic>defenselessness</italic> for both genders may be retained. Because the objective of criterion-referenced tests is simply to inspect whether the students have learned the materials, relatively defenseless items that measure the basic concepts of the materials are commonly or necessarily included in the test.</p>
<p>Third, although we found that the <italic>Poly-BW</italic> indices can correctly classify most of the DIF items identified by the Poly-SIBTEST procedures, six neutral items were misclassified as female-favoring or male-favoring items. On closer inspection, the <italic>mp</italic> index (<italic>defenselessness</italic>) was both the key associate of the DIF items and the main reason an item shifted from no DIF to favoring one group. The main reason for this discrepancy may be that the Poly-SIBTEST procedures assess DIF based on the concept of a &#x201C;whole&#x201D; item score, whereas the <italic>Poly-BW</italic> indices distinguish two response patterns (the within-difficulty response pattern and the beyond-difficulty response pattern).</p>
<p>The findings in this study have some important implications. Teachers or testing practitioners could modify or revise DIF items based on the <italic>Poly-BW</italic> indices. When an item is flagged as DIF in practice, the possible reasons for the DIF, such as <italic>defenselessness</italic> or <italic>power</italic>, should be examined through the <italic>Poly-BW</italic> indices. From the <italic>Poly-BW</italic> indices, researchers and practitioners can understand the possible associates of DIF items on math performance. Further treatment of the DIF items depends on the item classifications and the type of assessment. In addition, although <italic>disturbance</italic> (<italic>bp</italic>) and <italic>hint</italic> (<italic>wp</italic>) were not significant predictors of the DIF measures in this study, this does not mean that <italic>disturbance</italic> and <italic>hint</italic> are not associates of DIF in this case. It is likely that the effects of <italic>disturbance</italic> and <italic>hint</italic> on the DIF measures offset one another across all items and become negligible, but they may be significant for some items. Finally, item classifications involving complicated procedures in APTs can be easily conducted using the <italic>PWBstar</italic>1.0 program; thus, practitioners or teachers could employ this program to examine the intrinsic properties of tests.</p>
<p>This study has the following limitations. First, this is the first study to explore how the <italic>Poly-BW</italic> indices contribute to DIF measures in a teacher-made mathematics achievement test; thus, the findings are exploratory and specific to the context of teacher-made math tests, and additional evidence is required for similar or different fields. For example, additional evidence should be obtained as to whether the <italic>defenselessness</italic> property is the primary associate of DIF items in criterion-referenced tests. Second, the significant predictors (<italic>Poly-BW</italic> indices) of DIF measures in norm-referenced tests still need to be explored. Third, this study used the Poly-SIBTEST index as a reference method, which limits the analysis to uniform DIF. The Crossing SIBTEST (e.g., <xref ref-type="bibr" rid="B25">Li, 2020</xref>), which assesses non-uniform DIF, could be included in further studies. Finally, the <italic>Poly-BW</italic> indices include four indices for both persons and items. This study emphasizes the use of the <italic>Poly-BW</italic> indices for items; thus, future studies could investigate examinees&#x2019; performance using the <italic>Poly-BW</italic> person-fit indices.</p>
</sec>
<sec id="S5" sec-type="data-availability">
<title>Data Availability Statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.</p>
</sec>
<sec id="S6">
<title>Ethics Statement</title>
<p>Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. All participants voluntarily joined this study. The procedures of this study follow ethics principles for human research.</p>
</sec>
<sec id="S7">
<title>Author Contributions</title>
<p>T-WH contributed to idea generation, data collection, data analysis, and writing. P-CW contributed to data analysis, discussion of ideas, and revision of the manuscript. MM provided feedback and critically revised the article for important intellectual content. All authors contributed to the interpretation of the results and read and approved the submitted version.</p>
</sec>
<sec id="conf1" sec-type="COI-statement">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="pudiscl1" sec-type="disclaimer">
<title>Publisher&#x2019;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<sec id="S8" sec-type="supplementary-material">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fpsyg.2022.821459/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fpsyg.2022.821459/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.PDF" id="DS1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chang</surname> <given-names>H. H.</given-names></name> <name><surname>Mazzeo</surname> <given-names>J.</given-names></name> <name><surname>Roussos</surname> <given-names>L. A.</given-names></name></person-group> (<year>1996</year>). <article-title>Detecting DIF for polytomously scored items: an adaptation of the SIBTEST procedure.</article-title> <source><italic>J. Educ. Measure.</italic></source> <volume>33</volume> <fpage>333</fpage>&#x2013;<lpage>353</lpage>. <pub-id pub-id-type="doi">10.1111/j.1745-3984.1996.tb00496.x</pub-id></citation></ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>C. T.</given-names></name> <name><surname>Hwu</surname> <given-names>B.-S.</given-names></name></person-group> (<year>2018</year>). <article-title>Improving the assessment of differential item functioning in large-scale programs with dual-scale purification of Rasch models: the PISA example.</article-title> <source><italic>Appl. Psychol. Measure.</italic></source> <volume>42</volume> <fpage>206</fpage>&#x2013;<lpage>220</lpage>. <pub-id pub-id-type="doi">10.1177/0146621617726786</pub-id> <pub-id pub-id-type="pmid">29881122</pub-id></citation></ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cho</surname> <given-names>S. J.</given-names></name> <name><surname>Suh</surname> <given-names>Y.</given-names></name> <name><surname>Lee</surname> <given-names>W.-Y.</given-names></name></person-group> (<year>2016</year>). <article-title>After differential item functioning is detected: IRT item calibration and scoring in the presence of DIF.</article-title> <source><italic>Appl. Psychol. Measure.</italic></source> <volume>40</volume> <fpage>573</fpage>&#x2013;<lpage>591</lpage>. <pub-id pub-id-type="doi">10.1177/0146621616664304</pub-id> <pub-id pub-id-type="pmid">29881071</pub-id></citation></ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cohen</surname> <given-names>A. S.</given-names></name> <name><surname>Bolt</surname> <given-names>D. M.</given-names></name></person-group> (<year>2005</year>). <article-title>A mixture model analysis of differential item functioning.</article-title> <source><italic>J. Educ. Measure.</italic></source> <volume>42</volume> <fpage>133</fpage>&#x2013;<lpage>148</lpage>. <pub-id pub-id-type="doi">10.1111/j.1745-3984.2005.00007</pub-id></citation></ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>De Ayala</surname> <given-names>R. J.</given-names></name> <name><surname>Kim</surname> <given-names>S. H.</given-names></name> <name><surname>Stapleton</surname> <given-names>L. M.</given-names></name> <name><surname>Dayton</surname> <given-names>C. M.</given-names></name></person-group> (<year>2002</year>). <article-title>Differential item functioning: a mixture distribution conceptualization.</article-title> <source><italic>Int. J. Test.</italic></source> <volume>2</volume> <fpage>243</fpage>&#x2013;<lpage>276</lpage>. <pub-id pub-id-type="doi">10.1080/15305058.2002.9669495</pub-id></citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ding</surname> <given-names>C. S.</given-names></name> <name><surname>Song</surname> <given-names>K.</given-names></name> <name><surname>Richardson</surname> <given-names>L. J.</given-names></name></person-group> (<year>2006</year>). <article-title>Do mathematical gender differences continue? A longitudinal study of gender difference and excellence in mathematics performance in the U.S.</article-title> <source><italic>Educ. Studies</italic></source> <volume>40</volume> <fpage>279</fpage>&#x2013;<lpage>295</lpage>. <pub-id pub-id-type="doi">10.1080/00131940701301952</pub-id></citation></ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Edgington</surname> <given-names>E. S.</given-names></name></person-group> (<year>1995</year>). <source><italic>Randomization tests</italic></source>, <edition>3rd Edn</edition>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Marcel Dekker</publisher-name>.</citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fryer</surname> <given-names>R. G.</given-names></name> <name><surname>Levitt</surname> <given-names>S. D.</given-names></name></person-group> (<year>2010</year>). <article-title>An Empirical Analysis of the Gender Gap in Mathematics.</article-title> <source><italic>Am. Econ. J. Appl. Econ.</italic></source> <volume>2</volume> <fpage>210</fpage>&#x2013;<lpage>240</lpage>. <pub-id pub-id-type="doi">10.1257/app.2.2.210</pub-id></citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guttman</surname> <given-names>L.</given-names></name></person-group> (<year>1944</year>). <article-title>A basis for scaling qualitative data.</article-title> <source><italic>Am. Sociol. Rev.</italic></source> <volume>9</volume> <fpage>139</fpage>&#x2013;<lpage>150</lpage>. <pub-id pub-id-type="doi">10.2307/2086306</pub-id></citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Harnisch</surname> <given-names>D. L.</given-names></name> <name><surname>Linn</surname> <given-names>R. L.</given-names></name></person-group> (<year>1981</year>). <article-title>Analysis of item response patterns: questionable test data and dissimilar curriculum practices.</article-title> <source><italic>J. Educ. Measure.</italic></source> <volume>18</volume> <fpage>133</fpage>&#x2013;<lpage>146</lpage>. <pub-id pub-id-type="doi">10.1111/j.1745-3984.1981.tb00848.x</pub-id></citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Holland</surname> <given-names>P. W.</given-names></name> <name><surname>Thayer</surname> <given-names>D. T.</given-names></name></person-group> (<year>1988</year>). &#x201C;<article-title>Differential item performance and the Mantel-Haenszel procedure</article-title>,&#x201D; in <source><italic>Test validity</italic></source>, <role>eds</role> <person-group person-group-type="editor"><name><surname>Wainer</surname> <given-names>H.</given-names></name> <name><surname>Braun</surname> <given-names>H. I.</given-names></name></person-group> (<publisher-loc>New Jersey</publisher-loc>: <publisher-name>Lawrence Erlbaum Associates, Inc</publisher-name>), <fpage>129</fpage>&#x2013;<lpage>145</lpage>.</citation></ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>T. W.</given-names></name></person-group> (<year>2002</year>). <source><italic>The power of the Beyond-Ability-Surprise Index and the Within-Ability Concern Index for detecting person/item aberrances: A Monte Carlo study. Unpublished doctorate dissertation.</italic></source> <publisher-loc>Ohio</publisher-loc>: <publisher-name>The Ohio State University</publisher-name>.</citation></ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>T. W.</given-names></name></person-group> (<year>2011</year>). <article-title>Robustness of BW aberrance indices against test length.</article-title> <source><italic>Know. Manage. E Learn.</italic></source> <volume>3</volume> <fpage>310</fpage>&#x2013;<lpage>318</lpage>. <pub-id pub-id-type="doi">10.34105/j.kmel.2011.03.023</pub-id></citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>T. W.</given-names></name></person-group> (<year>2012</year>). <article-title>Aberrance detection powers of the BW and Person-Fit Indices.</article-title> <source><italic>J. Educ. Technol. Soc.</italic></source> <volume>15</volume> <fpage>28</fpage>&#x2013;<lpage>37</lpage>.</citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>T. W.</given-names></name> <name><surname>Lin</surname> <given-names>S. Y.</given-names></name></person-group> (<year>2017</year>). &#x201C;<article-title>Predictions of DIFs by the BW item-fit indices: a Monte Carlo study,</article-title>&#x201D; in <source><italic>Paper presented at The Seventh Asian Conference on Psychology &#x0026; the Behavioral Sciences 2017 (ACP2017)</italic></source> (<publisher-loc>Kobe</publisher-loc>).</citation></ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>T. W.</given-names></name> <name><surname>Lu</surname> <given-names>C. M.</given-names></name></person-group> (<year>2017</year>). &#x201C;<article-title>Develop BW cognitive diagnostic models and program to assess students&#x2019; learning statuses,</article-title>&#x201D; in <source><italic>2017 6th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI)</italic></source> (<publisher-loc>Hamamatsu, Japan</publisher-loc>: <publisher-name>IEEE</publisher-name>).</citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>T. W.</given-names></name> <name><surname>Wu</surname> <given-names>P. C.</given-names></name></person-group> (<year>2013</year>). <article-title>A classroom-based cognitive diagnostic model for teacher-made tests: an example with fractions and decimals.</article-title> <source><italic>J. Educ. Technol. Soc.</italic></source> <volume>16</volume> <fpage>347</fpage>&#x2013;<lpage>361</lpage>.</citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hyde</surname> <given-names>J. S.</given-names></name> <name><surname>Mertz</surname> <given-names>J. E.</given-names></name></person-group> (<year>2009</year>). <article-title>Gender, culture, and mathematics performance.</article-title> <source><italic>Proc. Natl. Acad. Sci.</italic></source> <volume>106</volume> <fpage>8801</fpage>&#x2013;<lpage>8807</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.0901265106</pub-id> <pub-id pub-id-type="pmid">19487665</pub-id></citation></ref>
<ref id="B19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>S. H.</given-names></name> <name><surname>Cohen</surname> <given-names>A. S.</given-names></name> <name><surname>Alagoz</surname> <given-names>C.</given-names></name> <name><surname>Kim</surname> <given-names>S.</given-names></name></person-group> (<year>2007</year>). <article-title>DIF detection and effect size measures for polytomously scored items.</article-title> <source><italic>J. Educ. Measure.</italic></source> <volume>44</volume> <fpage>93</fpage>&#x2013;<lpage>116</lpage>.</citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kogut</surname> <given-names>J.</given-names></name></person-group> (<year>1986</year>). <source><italic>Review of IRT-based indices for detecting and diagnosing aberrant response patterns (Research Report No. 86-4).</italic></source> <publisher-loc>Enschede, Netherlands</publisher-loc>: <publisher-name>University of Twente</publisher-name>.</citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>S.</given-names></name> <name><surname>Bulut</surname> <given-names>O.</given-names></name> <name><surname>Suh</surname> <given-names>Y.</given-names></name></person-group> (<year>2016</year>). <article-title>Multidimensional extension of multiple indicators multiple causes models to detect DIF.</article-title> <source><italic>Educ. Psychol. Measure.</italic></source> <volume>77</volume> <fpage>545</fpage>&#x2013;<lpage>569</lpage>. <pub-id pub-id-type="doi">10.1177/0013164416651116</pub-id> <pub-id pub-id-type="pmid">30034019</pub-id></citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>Z.</given-names></name></person-group> (<year>2014a</year>). <article-title>A power formula for the SIBTEST procedure for differential item functioning.</article-title> <source><italic>Appl. Psychol. Measure.</italic></source> <volume>38</volume> <fpage>311</fpage>&#x2013;<lpage>328</lpage>. <pub-id pub-id-type="doi">10.1177/0146621613518095</pub-id></citation></ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>Z.</given-names></name></person-group> (<year>2014b</year>). <article-title>Power and sample size calculations for logistic regression tests for differential item functioning.</article-title> <source><italic>J. Educ. Measure.</italic></source> <volume>51</volume> <fpage>441</fpage>&#x2013;<lpage>462</lpage>. <pub-id pub-id-type="doi">10.1111/jedm.12058</pub-id></citation></ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>Z.</given-names></name></person-group> (<year>2015</year>). <article-title>A power formula for the Mantel&#x2013;Haenszel test for differential item functioning.</article-title> <source><italic>Appl. Psychol. Measure.</italic></source> <volume>39</volume> <fpage>373</fpage>&#x2013;<lpage>388</lpage>. <pub-id pub-id-type="doi">10.1177/0146621614568805</pub-id> <pub-id pub-id-type="pmid">29881014</pub-id></citation></ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>Z.</given-names></name></person-group> (<year>2020</year>). <article-title>The power of crossing SIBTEST.</article-title> <source><italic>Appl. Psychol. Measure.</italic></source> <volume>44</volume> <fpage>393</fpage>&#x2013;<lpage>408</lpage>. <pub-id pub-id-type="doi">10.1177/0146621620909907</pub-id> <pub-id pub-id-type="pmid">32879538</pub-id></citation></ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lindberg</surname> <given-names>S. M.</given-names></name> <name><surname>Hyde</surname> <given-names>J. S.</given-names></name> <name><surname>Petersen</surname> <given-names>J. L.</given-names></name> <name><surname>Linn</surname> <given-names>M. C.</given-names></name></person-group> (<year>2010</year>). <article-title>New trends in gender and mathematics performance: a meta-analysis.</article-title> <source><italic>Psychol. Bull.</italic></source> <volume>136</volume> <fpage>1123</fpage>&#x2013;<lpage>1135</lpage>. <pub-id pub-id-type="doi">10.1037/a0021276</pub-id> <pub-id pub-id-type="pmid">21038941</pub-id></citation></ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Masters</surname> <given-names>G. N.</given-names></name></person-group> (<year>1982</year>). <article-title>A Rasch model for partial credit scoring.</article-title> <source><italic>Psychometrika</italic></source> <volume>47</volume> <fpage>149</fpage>&#x2013;<lpage>174</lpage>. <pub-id pub-id-type="doi">10.1007/BF02296272</pub-id></citation></ref>
<ref id="B28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mazor</surname> <given-names>K. M.</given-names></name> <name><surname>Hambleton</surname> <given-names>R. K.</given-names></name> <name><surname>Clauser</surname> <given-names>B. E.</given-names></name></person-group> (<year>1998</year>). <article-title>Multidimensional DIF analyses: the effects of matching on unidimensional subtest scores.</article-title> <source><italic>Appl. Psychol. Measure.</italic></source> <volume>22</volume> <fpage>357</fpage>&#x2013;<lpage>367</lpage>. <pub-id pub-id-type="doi">10.1177/014662169802200404</pub-id></citation></ref>
<ref id="B29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Meijer</surname> <given-names>R. R.</given-names></name> <name><surname>Sijtsma</surname> <given-names>K.</given-names></name></person-group> (<year>1999</year>). <source><italic>A review of methods for evaluating the fit of item score patterns on a test (Research Report No. 99-01).</italic></source> <publisher-loc>Enschede, Netherlands</publisher-loc>: <publisher-name>University of Twente</publisher-name>.</citation></ref>
<ref id="B30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pomerantz</surname> <given-names>E. M.</given-names></name> <name><surname>Altermatt</surname> <given-names>E. R.</given-names></name> <name><surname>Saxon</surname> <given-names>J. L.</given-names></name></person-group> (<year>2002</year>). <article-title>Making the grade but feeling distressed: gender differences in academic performance and internal distress.</article-title> <source><italic>J. Educ. Psychol.</italic></source> <volume>94</volume> <fpage>396</fpage>&#x2013;<lpage>404</lpage>. <pub-id pub-id-type="doi">10.1037/0022-0663.94.2.396</pub-id></citation></ref>
<ref id="B31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Robinson</surname> <given-names>J. P.</given-names></name> <name><surname>Theule Lubienski</surname> <given-names>S.</given-names></name></person-group> (<year>2011</year>). <article-title>The development of gender achievement gaps in mathematics and reading during elementary and middle school: examining direct cognitive assessments and teacher ratings.</article-title> <source><italic>Am. Educ. Res. J.</italic></source> <volume>48</volume> <fpage>268</fpage>&#x2013;<lpage>302</lpage>. <pub-id pub-id-type="doi">10.3102/0002831210372249</pub-id></citation></ref>
<ref id="B32"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Roussos</surname> <given-names>L. A.</given-names></name> <name><surname>Stout</surname> <given-names>W. F.</given-names></name></person-group> (<year>1996</year>). <article-title>A multidimensionality-based DIF analysis paradigm.</article-title> <source><italic>Appl. Psychol. Measure.</italic></source> <volume>20</volume> <fpage>353</fpage>&#x2013;<lpage>371</lpage>. <pub-id pub-id-type="doi">10.1177/014662169602000404</pub-id></citation></ref>
<ref id="B33"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sato</surname> <given-names>T.</given-names></name></person-group> (<year>1975</year>). <source><italic>The construction and interpretation of S-P tables.</italic></source> <publisher-loc>Tokyo, Japan</publisher-loc>: <publisher-name>Meiji Tosho</publisher-name>.</citation></ref>
<ref id="B34"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Seol</surname> <given-names>H.</given-names></name></person-group> (<year>1998</year>). <source><italic>Sensitivity of five Rasch-model-based fit indices to selected person and item aberrances: A simulation study. Unpublished dissertation.</italic></source> <publisher-loc>Ohio</publisher-loc>: <publisher-name>The Ohio State University</publisher-name>.</citation></ref>
<ref id="B35"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shealy</surname> <given-names>R. T.</given-names></name> <name><surname>Stout</surname> <given-names>W. F.</given-names></name></person-group> (<year>1993</year>). <article-title>A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DIF as well as item bias/DIF.</article-title> <source><italic>Psychometrika</italic></source> <volume>58</volume> <fpage>159</fpage>&#x2013;<lpage>194</lpage>. <pub-id pub-id-type="doi">10.1007/bf02294572</pub-id></citation></ref>
<ref id="B36"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stout</surname> <given-names>W.</given-names></name> <name><surname>Roussos</surname> <given-names>L.</given-names></name></person-group> (<year>1995</year>). <source><italic>SIBTEST User Manual.</italic></source> <publisher-loc>Urbana, IL</publisher-loc>: <publisher-name>University of Illinois</publisher-name>.</citation></ref>
<ref id="B37"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Suh</surname> <given-names>Y.</given-names></name> <name><surname>Bolt</surname> <given-names>D. M.</given-names></name></person-group> (<year>2011</year>). <article-title>A nested logit approach for investigating distractors as causes of differential item functioning.</article-title> <source><italic>J. Educ. Measure.</italic></source> <volume>48</volume> <fpage>188</fpage>&#x2013;<lpage>205</lpage>. <pub-id pub-id-type="doi">10.1111/j.1745-3984.2011.00139.x</pub-id></citation></ref>
<ref id="B38"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Suh</surname> <given-names>Y.</given-names></name> <name><surname>Talley</surname> <given-names>A. E.</given-names></name></person-group> (<year>2015</year>). <article-title>An empirical comparison of DDF detection methods for understanding the causes of DIF in multiple-choice items.</article-title> <source><italic>Appl. Measure. Educ.</italic></source> <volume>28</volume> <fpage>48</fpage>&#x2013;<lpage>67</lpage>. <pub-id pub-id-type="doi">10.1080/08957347.2014.973560</pub-id></citation></ref>
<ref id="B39"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>W. C.</given-names></name> <name><surname>Shih</surname> <given-names>C. L.</given-names></name> <name><surname>Sun</surname> <given-names>G. W.</given-names></name></person-group> (<year>2012</year>). <article-title>The DIF-Free-Then-DIF strategy for the assessment of differential item functioning.</article-title> <source><italic>Educ. Psychol. Measure.</italic></source> <volume>72</volume> <fpage>687</fpage>&#x2013;<lpage>708</lpage>. <pub-id pub-id-type="doi">10.1177/0013164411426157</pub-id></citation></ref>
</ref-list>
<fn-group>
<fn id="footnote1">
<label>1</label>
<p><ext-link ext-link-type="uri" xlink:href="https://www.ncyu.edu.tw/gcweb/content.aspx?site_content_sn=44700">https://www.ncyu.edu.tw/gcweb/content.aspx?site_content_sn=44700</ext-link></p></fn>
<fn id="footnote2">
<label>2</label>
<p>We also used a one-parameter Rasch model to estimate the DIF contrasts (men minus women). The results showed that the three <italic>Poly-BW</italic> indices explained almost 88% of the variance in the DIF contrast obtained from the Rasch DIF analysis (70% from the predictor <italic>mp</italic><sub><italic>M&#x2013;F</italic></sub>, 15% from the predictor <italic>cp</italic><sub><italic>M&#x2013;F</italic></sub>, and 3% from the predictor <italic>w</italic><sub><italic>M&#x2013;F</italic></sub>). These findings were similar to those from the <italic>Poly-SIBTEST</italic> analysis. Because Rasch DIF analysis is a parametric rather than a non-parametric method, no further analyses were conducted with it.</p></fn>
</fn-group>
</back>
</article>
