Examining a Dutch short form of The Balanced Inventory of Desirable Responding Version 6: comparing polytomous and dichotomous scoring methods in a multidimensional framework

Noteborn, Mirthe G. C.; Hildebrand, Martin; Sijtsema, Jelle J.; Bogaerts, Stefan; Denissen, Jaap J. A.

doi:10.3389/fpsyg.2025.1532969

ORIGINAL RESEARCH article

Front. Psychol., 13 June 2025

Sec. Forensic and Legal Psychology

Volume 16 - 2025 | https://doi.org/10.3389/fpsyg.2025.1532969

Mirthe G. C. Noteborn¹*

Martin Hildebrand²,³

Jelle J. Sijtsema¹,⁴,⁵

Stefan Bogaerts¹,⁵

Jaap J. A. Denissen⁶

¹Department of Developmental Psychology, Tilburg University, Tilburg, Netherlands
²Private Practice, De Bilt, Netherlands
³Lelystad Prison, Lelystad, Netherlands
⁴Department of Educational Sciences/GION, University of Groningen, Groningen, Netherlands
⁵Fivoor Science and Treatment Innovation, Rotterdam, Netherlands
⁶Department of Developmental Psychology, Utrecht University, Utrecht, Netherlands

Introduction: The 40-item Balanced Inventory of Desirable Responding Version 6 (BIDR) is a widely used tool to measure two components of social desirability: Self-Deceptive Enhancement (SDE) and Impression Management (IM). In three studies, we aimed to create and validate a short form of the Dutch language version of the 40-item BIDR.

Methods: In Study 1 (general population sample N = 577), item properties were examined using (Multidimensional) Item Response Theory (IRT) for both dichotomous and polytomous scoring methods to create a short form. In Study 2 (general population sample N = 719), IRT analyses of Study 1 were replicated, and the nomological network of the short form was examined by investigating its relation with the Big Five personality traits and deviant traits and thoughts. Study 3 (men from the general population N = 100) investigated whether SDE and IM could detect response bias in self-reported aggression. All samples consisted of individuals volunteering to participate in scientific research (recruited in various ways) in a low-stake condition.

Results: The analyses yielded a short form containing 10 SDE and 10 IM dichotomously scored items (BIDR-D20). While results indicated a loss of information compared to the original version, the overall psychometric qualities were equal to or sometimes better than those of the BIDR (Version 6). Across studies, dichotomous scoring was generally better than polytomous scoring in terms of model fit, estimated IRT parameters, and internal consistency. Both forms correlated with self-reported aggression, but SDE and IM failed to detect response bias in the current sample.

Conclusion: The BIDR-D20 could be a worthy replacement for the 40-item BIDR (Version 6), offering the same properties while being less time-consuming. However, more research is needed to establish the short measure’s predictive validity as a measure of response bias.

Introduction

The 40-item Balanced Inventory of Desirable Responding Version 6 (BIDR-6) (Paulhus, 1994) is one of the most widely used self-report questionnaires to measure socially desirable responding (SDR). Although the BIDR has several good psychometric properties (i.e., satisfactory reliability and convergent and discriminant validity), the instrument has been criticized for its equivocal factor structure and lack of unidimensionality. Also, the model fit has been found to be suboptimal, leading researchers to remove items to increase fit (Lanyon and Carle, 2007; Li and Li, 2008; Li et al., 2015). In addition, scholars have differing views on which scoring method of the BIDR (i.e., dichotomous or polytomous) is preferable. Some researchers favor dichotomous scoring because it singles out abnormal responses (Paulhus, 1994), whereas others found that polytomous scoring is more reliable (higher alpha, test–retest reliability) and valid (convergent validity; Cervellione et al., 2009; Vispoel and Kim, 2014; Vispoel and Tao, 2013). Finally, although this may seem an irrelevant point for a questionnaire that takes 5–10 min to complete, filling out the relatively long BIDR-6 may limit its utility when used in conjunction with other self-reports, which is often the case in clinical (forensic) assessments. Longer questionnaires are more time-consuming and “may increase transient measurement errors, as respondents may become frustrated or respond carelessly due to boredom or fatigue” (Hart et al., 2015, p. 2; also Schmidt et al., 2003; Stanton et al., 2002). This may be especially apparent in clinical forensic settings where concentration, cooperation, and response rates are often low (Moser et al., 2004; Young and Cocallis, 2019). Hence, several short forms have been developed in different languages (e.g., Asgeirsdottir et al., 2016; Hart et al., 2015; Subotić et al., 2016). In line with calls to consider cultural impact on SDR questionnaires, Lalwani et al. (2006) demonstrated that national culture and ethnicity predict distinct patterns of SDR. One could therefore argue that the item selection of the short forms is (partly) driven by these distinct patterns of SDR. The little overlap between the items included in these short forms could be evidence of this cultural impact.

Therefore, the purpose of the current study is to create and validate a Dutch short form of the original 40-item BIDR (Version 6), which currently does not exist. Using Item Response Theory (IRT), we focused on the multidimensionality of the BIDR as well as the appropriateness of scoring methods. Moreover, we investigated whether this short form outperforms the 40-item BIDR-6 in terms of psychometric properties, nomological network, and validity.

Socially desirable responding

SDR can be described as the inclination to provide biased, distorted, or excessively positive self-descriptions to present oneself in a way that creates a favorable impression on others (Paulhus, 2002; see also Furnham, 1986; Nederhof, 1985). SDR has long been recognized as a potential confounder when using self-report measures, especially in contexts with a high possibility of secondary gains or losses, such as personnel selection or forensic and clinical settings, where individuals may have a strong motivation to portray themselves in a positive light (Hanson and Bussière, 1998; Hildebrand et al., 2018; Tan and Grace, 2008; Tatman et al., 2009). Banse et al. (2010), for example, discussed the problems associated with self-report measures of sexual (e.g., pedophilic) interest and concluded that “their validity is jeopardized by impression management and deliberate faking. The general problem of transparency in direct measures is all the more critical if disclosure of personal information is highly embarrassing, socially undesirable, or has legal implications, as it is commonly the case in forensic contexts” (p. 320).

The most common strategy for addressing socially desirable response bias is to use measures designed to assess individuals’ SDR tendencies. SDR scales typically contain descriptions of desirable behaviors or traits (e.g., “I never hesitate to go out of my way to help someone in trouble”; Crowne and Marlowe, 1960). Scores on these instruments are sometimes used to flag possibly invalid responses that may be discarded or to control for a desirability response bias statistically. In addition, SDR measures are used to assess convergent and/or divergent validity (i.e., demonstrating that scores from the SDR scale align with other measures as expected) and serve as dependent variables in controlled experiments to highlight the situations most likely to prompt SDR (Hildebrand et al., 2018; Vispoel and Kim, 2014; Vispoel and Tao, 2013; see also Tan and Grace, 2008). Although several ways to handle SDR have been proposed, discussion on the use of SDR measures is still ongoing. Whereas exclusion of flagged responses and statistically controlling for desirable responses are encouraged by some researchers (Martínez-Catena et al., 2017; van de Mortel, 2008), others stress that this might remove valid variance in personality differences (McCrae and Costa, 1983; Pauls and Stemmler, 2003; Uziel, 2010).

Several instruments have been created and validated to uncover SDR. Most of these measures operationalized SDR as a unidimensional construct (e.g., Edwards, 1957; Crowne and Marlowe, 1960; Crowne and Marlowe, 1964; Eysenck and Eysenck, 1964; Hofstee, 2003). Conversely, due to low correlations between different SDR measures and the results of factor analyses, some researchers have questioned the one-dimensional approach to understanding SDR (Edwards et al., 1962; Messick, 1962; Wiggins, 1964). Recognizing the empirical divergence of SDR, Paulhus (1984, 1991) proposed that measures of social desirability evaluate two distinct components, which he called Impression Management (IM) and Self-Deceptive Enhancement (SDE). IM involves intentionally distorting responses to create a positive impression on others (Paulhus, 1984; Stöber et al., 2002) and is sometimes called lying or faking. Alternatively, SDE involves portraying oneself in a positive light to maintain a positive self-image. SDE is associated with self-deceptive overconfidence and is closely linked to narcissism. SDE is a sincerely believed self-deception (Paulhus, 1984) and is deeply ingrained in one’s belief system to the extent that individuals may not be aware of it. Researchers believe that IM poses a more significant threat to the accuracy of questionnaire results than SDE, as it involves a deliberate distortion of information (Paulhus, 1984, 2002; Paulhus and Vazire, 2007; Vispoel and Tao, 2013).

The Balanced Inventory of Desirable Responding

Paulhus (1984, 1988, 1991, 1994, 1998) developed the 40-item Balanced Inventory of Desirable Responding (BIDR) to measure the two distinct dimensions of SDR, IM (20 items) and SDE (20 items). Both the IM and SDE scales have 10 positively keyed items and 10 negatively keyed items that are reverse-scored before calculating the overall score. The IM scale items represent desirable but implausible statements (e.g., “I never take things that do not belong to me”; “I have [not] done things that I do not tell other people about”). Endorsing a high number of these statements may indicate intentional tailoring of responses. SDE items (e.g., “Once I’ve made up my mind, other people can seldom change my opinion”; “I am fully in control of my own fate”) represent a level of overconfidence that does not match actual ability levels. Individuals scoring high on this scale are thought to report unrealistic yet honestly believed positive self-descriptions. Thus, the distinction between these two dimensions is that IM involves intentional manipulation of one’s image to deceive others, whereas SDE involves unconsciously attempting to maintain a positive self-image.

Respondents rate their agreement level with the items on a 5-point (1 = not true, 5 = very true) or a 7-point (1 = totally disagree, 4 = neutral, 7 = totally agree) Likert-type scale. Paulhus suggested a polytomous (continuous) and a dichotomous (i.e., only scores on the end of the scale are counted) scoring method. However, he recommended the use of dichotomous scoring of the 7-point scale (with scores of 6 and 7 being coded as 1 and all other answers coded as 0). These responses are likely to be of the most interest since only extremely high levels of self-deception and impression management are assumed to be abnormal (Paulhus, 1994). Indeed, most research conducted with the BIDR has used the dichotomous scoring method (e.g., Li and Bagger, 2007). However, while limited in number, studies in which dichotomous and polytomous scoring have been compared generally support a polytomous scoring method (Cervellione et al., 2009; Kam, 2013; Stöber et al., 2002; Vispoel and Kim, 2014; but see Asgeirsdottir et al., 2016; Gignac, 2013; Leite and Beretvas, 2005).

Studies with the BIDR have been conducted in a wide variety of community and clinical samples and settings, and the measure is currently one of the most widely used instruments to assess SDR (Leite and Cooper, 2010; Steenkamp et al., 2010). Generally speaking, the scale is a robust measure in forensic/correctional settings as well as in the general population, showing satisfactory reliability (internal consistency; test–retest stability) and convergent and discriminant validity of both IM and SDE (e.g., Lanyon and Carle, 2007; Li and Bagger, 2006; Littrell et al., 2021; Mathie and Wakeling, 2011; Stöber et al., 2002; Vispoel and Tao, 2013; but see also Li and Bagger, 2007). That is, in line with other studies (e.g., Mathie and Wakeling, 2011; Littrell et al., 2021), Paulhus (1994, 1998) reported that BIDR scores have good reliability, with internal consistency estimates ranging from .72 to .75 for SDE and .81 to .84 for IM. Paulhus considered these values highly satisfactory, but some researchers disagree. For instance, in their meta-analysis, Li and Bagger (2007) showed an average reliability of .74 for IM and .68 for SDE, suggesting weaker reliability than Paulhus claimed.

However, results on the factor structure of the BIDR are equivocal (Gignac, 2013; Lanyon and Carle, 2007; Leite and Beretvas, 2005). Although it is acknowledged that the BIDR captures IM and SDE as separate constructs, studies have indicated a two-factor structure of SDE in which SDE can be further divided into the attribution of positive outcomes (Enhancement) and the denial of negative attributes (Denial) (Kroner and Weekes, 1996; Paulhus and Reid, 1991). Moreover, others indicate that both IM and SDE can be split into Enhancement and Denial (Li and Li, 2008; Li et al., 2015).

Short forms of the Balanced Inventory of Desirable Responding

Due to the non-optimal model fit and the resulting need to remove items, as well as the wish to shorten the time needed to administer the instrument to increase its usefulness, several short forms of the BIDR have been developed in recent years. Whereas some short forms have been created using Classical Test Theory (CTT) (Bobbio and Manganelli, 2011; Hart et al., 2015), other attempts have been made using Item Response Theory (IRT) (Asgeirsdottir et al., 2016; Subotić et al., 2016). IRT allows for examining the possibility of multidimensionality at the item level to consider how each item is related to SDE and IM. IRT also allows for the simultaneous consideration of the number of endorsed items and item properties (e.g., discrimination, difficulty) when estimating each respondent’s score on SDE and IM (for more information on the difference between CTT and IRT, see Hambleton et al., 1991).

Although the studies using IRT provided valuable information about the performance of a BIDR short form, no study simultaneously considered aspects of multidimensionality and various scoring methods (dichotomous versus polytomous). In addition, IRT analyses can be used to graphically depict the expected item score over the full range of the IM and SDE (i.e., Item Characteristic Curves; ICC) and the amount of information that each item (Item Information Curve; IIC) or the full scale (Test Information Curve; TIC) provides across varying ability levels of IM and SDE (Baker and Kim, 2017; Reeve and Fayers, 2005). ICCs give more detailed insight into possible scoring patterns, especially with regard to polytomous scoring. In addition, for IIC and TIC, the higher the information at specific ability levels of SDE or IM, the lower the associated error (Baker and Kim, 2017). Also, low item information can indicate poorly functioning items (see Reeve and Fayers, 2005). By examining these sources of information, a more informed item selection can be made to create a short form.

Additionally, while several short forms have been developed in various countries (e.g., Asgeirsdottir et al., 2016; Hart et al., 2015; Subotić et al., 2016), only four of the BIDR items are included in at least seven of the eight short forms that have been developed (i.e., Item 11, “I never regret my decisions”; Item 15, “I am a completely rational person”; Item 17, “I am very confident of my judgments” (SDE); and Item 37, “I have taken sick-leave from work or school even though I wasn’t really sick” (IM)). Finally, as socially desirable behavior is by definition determined by the norms and values within a society, researchers have advocated considering the cultural impact on social desirability questionnaires when they are used in different cultural contexts. For instance, research has shown that different cultural values such as individualism/collectivism, masculinity/femininity, and uncertainty avoidance have a diverse influence on SDR (e.g., Bernardi, 2006; Keillor et al., 2001; Middleton and Jones, 2000; see also Li and Reb, 2009; Shulruf et al., 2011).

The present study

This study was designed to create a short form of the Dutch language version of the BIDR (Version 6) using IRT analysis, aiming to select items with optimal measurement properties suitable for both dichotomous and polytomous scoring methods. In Study 1, we investigated the BIDR’s item qualities using IRT, taking both multidimensionality and scoring method into account, in a community sample. To examine whether the short form is equal to, or even outperforms, the 40-item BIDR (for both scoring methods), we compared the psychometric properties (i.e., internal consistency, test–retest reliability, test information) of the short form and the 40-item BIDR in Study 1. To examine the extent to which the results from Study 1 could be replicated, a second IRT analysis in another community sample was conducted in Study 2. In Study 2, we also investigated the nomological network of the BIDR using both the short form and the original version by examining the associations of SDE and IM with basic personality features and deviant traits and thoughts. To establish the validity of our short form, we tested the utility of the short form and the 40-item BIDR as validity scales by investigating whether these scales moderate the convergence between self-reported aggression and informant reports of aggressive behavior in Study 3.

Study 1

Methods

Participants and procedure

Five hundred and seventy-seven community participants volunteered to take part in the study between 2016 and 2018. Participants were acquaintances (e.g., neighbors, colleagues, friends) of 30 university psychology students. Participants had to be at least 18 years old and proficient in the Dutch language to be included in the study. After being introduced to the general aim of the study (“Do questionnaires measure what they intend to measure?”), participants signed an informed consent form. Participants completed a battery of questionnaires, including the BIDR and measures of topics such as antisocial behavior, psychopathy, major life events, and narcissism. After completion, participants were instructed to send the questionnaires in a sealed envelope to the first author to guarantee anonymity. Participants were also asked to indicate in writing if they would like to participate for a second time. If participants indicated that they were willing to participate a second time, a questionnaire was sent by regular mail, including a return envelope. Participants were informed that they could withdraw from the study at any time without providing a reason and that their responses would be removed from the database upon request. The study was approved by the School of Social and Behavioral Sciences Ethics Review Board of Tilburg University (ED-2015.70).

Regarding missing values, the total percentage of randomly missing data on the BIDR was 5.7%, with a maximum of two missing values for six participants. Missing values were handled using Full Information Maximum Likelihood. The sample consisted of 577 participants (57.9% male; three participants did not indicate their sex) with an average age of 32.9 years (SD = 14.6; range 18–77 years; 26 participants did not report their age). Almost all participants had Dutch nationality (96.6%; 1.4% missing). Regarding the level of education (0.5% missing), few participants (1.4%) had completed only elementary school, 27.4% held a high school degree, 52.1% had a lower or higher vocational education, and 18.5% held a university degree. In total, 10.3% (n = 59) indicated having been in contact with the law for various crimes (e.g., traffic violations, vandalism, arson, (sexual) assault). Additionally, 155 participants (26.8%) indicated having received treatment for psychiatric complaints/symptoms.

To examine test–retest reliability, 87 participants (57.5% female) with an average age of 39.3 years (SD = 16.4; range 18–77 years) filled out the BIDR for a second time. A sample size of 87 can be considered to be sufficient for test–retest reliability (Kennedy, 2022). Almost all participants reported being of Dutch nationality (97.7%). Regarding the highest education received, 25.3% finished high school, 48.2% had lower or higher vocational education, and 26.4% had a university diploma. The time between the first and second assessments was M = 51 days (SD = 18.6; range 15–169 days).

Measures

For the purpose of this study, the 40-item BIDR Version 6 (Paulhus, 1991, 1994) was translated into Dutch by the first and second author. Two independent bilingual translators conducted a backward translation. The forward, backward, and final translations were discussed among the authors to reach a consensus when necessary. Translation was done in accordance with the International Test Commission Guidelines for translating and adapting tests (International Test Commission, 2017). Each item was rated on a 7-point Likert-type scale (1 = totally disagree, 4 = neutral, 7 = totally agree). When appropriate, items were reverse coded (half of the items). For polytomous scoring, items remained on a 7-point scale, with higher scores indicating higher levels of SDE or IM. For dichotomous scoring, a score of 1 was assigned to each response of 6 or 7, and a score of 0 was assigned to other responses on the 7-point Likert-type scale.
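The two scoring rules can be sketched as follows. This is a minimal illustration, assuming that scale scores are obtained by summing item scores; the helper names, the example responses, and the keying flags are hypothetical and do not reproduce the actual BIDR scoring key.

```python
def reverse_code(response, scale_min=1, scale_max=7):
    """Mirror a Likert response around the scale midpoint (1 <-> 7, 2 <-> 6, ...)."""
    return scale_max + scale_min - response

def score_item(response, reverse_keyed, dichotomous):
    """Return the polytomous (1-7) or dichotomous (0/1) score for one item."""
    value = reverse_code(response) if reverse_keyed else response
    if dichotomous:
        # Only keyed responses of 6 or 7 are counted as 1; all others are 0.
        return 1 if value >= 6 else 0
    return value

def score_scale(responses, reverse_flags, dichotomous):
    """Sum item scores over a scale (summing is an assumption of this sketch)."""
    return sum(score_item(r, k, dichotomous)
               for r, k in zip(responses, reverse_flags))

# Made-up raw answers and keying flags, for illustration only.
responses = [7, 2, 6, 1]
reverse_flags = [False, True, False, True]
print(score_scale(responses, reverse_flags, dichotomous=False))
print(score_scale(responses, reverse_flags, dichotomous=True))
```

Reverse coding is applied before dichotomization, so that a response of 1 or 2 on a negatively keyed item counts as an extreme desirable response.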

Analytic approach

As stated, IRT is a well-validated method to shorten scales, but it requires a fitting model. The model fit was established for both dichotomous and polytomous scoring. Following previous studies (Asgeirsdottir et al., 2016; Vispoel and Kim, 2014) and recommendations (Paek and Cole, 2020), we used a two-parameter logistic model (2-PL) (Birnbaum, 1968; Lord, 1997) for dichotomous scoring and a graded response model (GRM) (Samejima, 1969, 1997) for polytomous scoring. We wanted to take the possible multidimensionality of both IM and SDE into account (i.e., Denial and Enhancement). Hence, in line with previous research (e.g., Asgeirsdottir et al., 2016), we examined whether a one-factor model (IM and SDE separately) or a two-factor model (further division into Denial and Enhancement for IM and SDE) fit the data best. The two-factor model was estimated using an ordinal confirmatory factor model. Additionally, the fit of individual items was evaluated using S-X2 statistics (Kang and Chen, 2008; Orlando and Thissen, 2000, 2003), with resulting p-values adjusted for the false discovery rate (FDR) (Benjamini and Hochberg, 1995). Furthermore, Yen’s (1984) Q3 LD statistic was used to check for local independence. The Q3 statistic measures the correlation between performances on two items, after accounting for performance on the overall assessment of SDE or IM (for more information, see Chen and Thissen, 1997).
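As a rough illustration of the Q3 idea for dichotomous items, the sketch below correlates the residuals of two items after removing the model-implied endorsement probability. It assumes a unidimensional 2PL model and takes trait estimates and item parameters as given; all names and numbers are hypothetical, and this is not the routine used in the study.

```python
import math

def p_2pl(theta, a, b):
    """Model-implied probability of endorsing a 2PL item."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def pearson(x, y):
    """Plain Pearson correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def q3(responses_i, responses_j, thetas, params_i, params_j):
    """Q3: correlation between the residuals (observed minus expected) of two items."""
    res_i = [x - p_2pl(t, *params_i) for x, t in zip(responses_i, thetas)]
    res_j = [x - p_2pl(t, *params_j) for x, t in zip(responses_j, thetas)]
    return pearson(res_i, res_j)

# Made-up trait estimates, 0/1 responses, and (a, b) parameters.
thetas = [-1.5, -0.5, 0.0, 0.5, 1.5, 2.0]
item_i = [0, 0, 1, 0, 1, 1]
item_j = [0, 1, 0, 1, 1, 1]
print(round(q3(item_i, item_j, thetas, (1.2, 0.3), (1.0, -0.2)), 3))
```

Values far from zero suggest that two items share variance beyond the trait, which is the pattern flagged as local dependence.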

IRT analyses were used to identify the strongest items for the SDE and IM subscales. First, factor loadings (i.e., factor analysis parameters) were examined to see how well the items represent the underlying construct. Factor models for ordinal data share a strong connection with IRT models and, in certain cases, can be considered equivalent (Takane and De Leeuw, 1987). This equivalence indicates that the parameters of an IRT model can be transformed into those of a factor model, and vice versa, without loss of information. When the two models are closely aligned, their reparametrized parameters tend to be nearly identical. However, in our analyses the factor loadings and discrimination parameters sometimes fell into different interpretive categories. To maintain consistency with prior research on the BIDR, we incorporated factor loadings as a key informative component (Asgeirsdottir et al., 2016; Subotić et al., 2016). Second, IRT parameters were examined. Item discrimination parameter (a) estimates were obtained to determine how well each item could identify people at various levels of SDE and IM. Highly discriminating items can discriminate between respondents with subtly different levels of SDE or IM, whereas items that do not discriminate well are only able to discriminate between persons with very different levels of SDE and IM (Reise and Henson, 2003). Complementary to a, the difficulty parameter estimates b (or item location) were examined. Item difficulty describes how high a person typically scores on SDE or IM before an item is endorsed (Reise and Henson, 2003). Finally, the Item Characteristic Curve (ICC) and the Item Information Curve (IIC) were visually inspected for each item. For specific selection criteria, see the Results section.
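The transformation between the two parametrizations can be sketched for the simplest case. The example assumes a unidimensional model with a standardized latent trait and uses the conventional constant 1.702 to move between the logistic and normal-ogive metrics; the function names are hypothetical, and the multidimensional case used in this study would require the loadings on all factors.

```python
import math

D = 1.702  # conventional logistic-to-normal-ogive scaling constant

def loading_from_discrimination(a, scaling=D):
    """Standardized factor loading implied by a logistic discrimination parameter."""
    a_std = a / scaling
    return a_std / math.sqrt(1.0 + a_std ** 2)

def discrimination_from_loading(loading, scaling=D):
    """Logistic discrimination parameter implied by a standardized loading (|loading| < 1)."""
    return scaling * loading / math.sqrt(1.0 - loading ** 2)

for a in (0.65, 1.0, 1.35):
    print(a, round(loading_from_discrimination(a), 2))
```

Under these assumptions, a discrimination of 0.65 corresponds to a loading of about .36, so an item can reach the "moderate" discrimination band while its loading still falls below .40. This is one way the two sets of interpretive categories can disagree.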

After the items for the short form were selected based on the IRT results (see the Results section for specific criteria), Test Information Curves (TIC; i.e., the sum of the item information functions) were interpreted and compared to examine whether the short form performs as well as the full 40-item BIDR scale. The TIC graphically depicts the total information that the items in the measure provide at each trait level and therefore indicates how precisely the scale measures across the trait range. A trade-off between the amount of information and the information range is desired: A selection of items that together give a relatively high amount of information over the full range is preferred.

McDonald’s Omega was calculated to measure the internal consistency for each (sub)scale. Test–retest reliability was assessed using Spearman correlations. As the time period between the first and second assessments varied widely (M = 51 days, SD = 18.6; Range = 15–169 days), moderation analyses were performed to investigate whether the time between measurements influenced the association between the time points.
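For orientation, coefficient omega can be written in closed form for a single-factor model. The sketch below assumes standardized loadings, uncorrelated residuals, and one common factor; it is the textbook formula rather than the routine of the R package used in the study, and the loadings are made up.

```python
def mcdonald_omega(loadings):
    """Omega for a one-factor model with standardized loadings.

    omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of unique variances),
    where each unique variance is 1 - loading^2.
    """
    common = sum(loadings) ** 2
    unique = sum(1.0 - l ** 2 for l in loadings)
    return common / (common + unique)

# Made-up loadings: a 10-item and a 20-item scale with identical item quality.
print(round(mcdonald_omega([0.5] * 10), 3))
print(round(mcdonald_omega([0.5] * 20), 3))
```

With item quality held constant, omega rises with the number of items, so a short form generally needs stronger items to match the internal consistency of the full scale.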

All analyses were conducted using R version 3.5.3 (R Core Team, 2017). IRT analyses (including both factor analysis parameters and IRT parameters) were performed using the package Multidimensional Item Response Theory (MIRT; Version 1.3) (Chalmers, 2012). Coefficient Omega was calculated using the package “userfriendlyscience” (Version 0.7.2) (Peters, 2018). Correlational analyses and moderation analyses were conducted using the package Psych (Version 1.8.12) (Revelle, 2018) and Lavaan (Version 6.3) (Rosseel, 2012), respectively.

Results

Measurement properties at item level using item response theory

Model and Item Fit. The overall model fit of SDE and IM was assessed as follows: absolute, overall model-data fit M2 p > .05 for exact fit; bivariate Root Mean Square Error of Approximation (RMSEA2) ≤ .089 for adequate fit, ≤ .050 for close fit, and ≤ .050/(number of categories − 1) for excellent fit; and Standardized Root Mean Squared Residual (SRMSR) ≤ .060 for adequate fit, ≤ .027 for close fit, and ≤ .027/(number of categories − 1) for excellent fit (Hu and Bentler, 1999; Maydeu-Olivares and Joe, 2014). To compare relative model fit across different solutions, the nested log-likelihood test (LR) (Hambleton et al., 1991), the Akaike Information Criterion (AIC) (Akaike, 1974), Bayesian Information Criterion (BIC) (Schwarz, 1978), and the sample size adjusted BIC (saBIC) (Sclove, 1987) were used, with the lowest value on a specific information criterion being indicative of the better model. As an indication of local dependence, a cut-off of .30 minus the average correlation was used as a critical value, as the Q3 is dependent on the sample, the number of items, and the scoring method (see Christensen et al., 2017). Additionally, significant S-X2 statistics (Kang and Chen, 2008; Orlando and Thissen, 2000, 2003) indicated item misfit (S-X2, p < .05).
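The three information criteria can be computed from a model's log-likelihood as in the sketch below. It uses the standard definitions, with the sample-size adjusted BIC replacing n by (n + 2)/24; the function name, log-likelihoods, and parameter counts are invented for illustration and are not results from this study.

```python
import math

def information_criteria(log_likelihood, n_params, n_obs):
    """Return AIC, BIC, and sample-size adjusted BIC for a fitted model."""
    deviance = -2.0 * log_likelihood
    return {
        "AIC": deviance + 2.0 * n_params,
        "BIC": deviance + n_params * math.log(n_obs),
        "saBIC": deviance + n_params * math.log((n_obs + 2.0) / 24.0),
    }

# Hypothetical one-factor versus two-factor comparison (made-up numbers).
one_factor = information_criteria(log_likelihood=-6250.0, n_params=40, n_obs=577)
two_factor = information_criteria(log_likelihood=-6200.0, n_params=41, n_obs=577)
for name in ("AIC", "BIC", "saBIC"):
    better = "two-factor" if two_factor[name] < one_factor[name] else "one-factor"
    print(name, round(one_factor[name], 1), round(two_factor[name], 1), better)
```

All three criteria penalize the extra parameters of the larger model, so the two-factor solution is preferred only when its gain in log-likelihood outweighs that penalty.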

Fit statistics indicated that for both SDE and IM, a two-factor solution (i.e., SDE and IM subdivided into Denial and Enhancement) fit the data best for dichotomous scoring (Table 1), with an adequate fit for RMSEA and SRMSR and information-based fit indices indicating the two-factor solution as the better fit. In addition, no indications of item misfit or local dependence were found. For the polytomous scoring method, a two-factor model fit was also considered adequate for both SDE and IM (Table 2), with adequate fit for RMSEA, close to adequate fit for SRMSR, and information-based fit indices indicating the two-factor solution as the better fit. However, for SDE, item pair 3 (“I do not care to know what other people really think of me”) and 17 (“I am very confident of my judgments”) gave an indication of local dependence for polytomous scoring (Q3 = −.42), suggesting a residual correlation between this item pair beyond the overall construct. The local dependence was possibly the result of item content (for S-X2 statistics and LD Q3 matrix, see Supplementary material S1). In fact, the model fit did improve to an acceptable fit after items 3 and 17 were removed, and no other cases of local dependence were identified. Due to improvement in model fit with all fit indices dropping and information-based fit indices indicating the modified version as best fit for the data, we used the modified two-factor version (excluding items 3 and 17) of the SDE when polytomously scored. Additionally, individual item fit statistics indicated that for IM the model could be improved by removing items 31, 36, and 39 (S-X2, p < .05). Therefore, these items were not included in the short version. However, the model fit of the two-factor model did not improve when these items were removed (fit indices and information-based fit indices increased).

Table 1. Fit statistics for the IRT models for SDE and IM dichotomous scoring.

Table 2. Fit statistics for the IRT Models for SDE and IM polytomous scoring.

Item pool selection based on item properties of the item response methods

To identify the strongest SDE and IM items for inclusion in our short form that worked well for both dichotomous and polytomous scoring, we investigated factor loadings, item discrimination parameter estimates (a), and item difficulty parameter estimates (b). Standardized factor loadings < .40 were considered to be low (Floyd and Widaman, 1995), factor loadings of .40–.55 adequate, and factor loadings > .55 good in terms of linking to the underlying construct (Comrey and Lee, 1992). Baker and Kim’s (2017) guidelines were used to interpret the item discrimination parameter estimates (a), with values close to 0 indicating no discrimination, values ≤ 0.34 very low discrimination, values 0.35–0.64 low discrimination, values 0.65–1.34 moderate discrimination (which is considered the minimum threshold for discriminating between respondents), and values ≥ 1.35 high to very high discrimination. For the item difficulty parameter estimates (b) of dichotomous items, it is desirable to create a social desirability measure with a variety of difficulty levels within the higher SDE/IM range, with people higher on the trait having a higher probability of answering affirmatively (i.e., no extremely negative b values). For polytomous scoring, six b values are given, indicating the thresholds between the seven possible scoring options. These threshold values indicate how high an individual’s SDE or IM trait level needs to be to have a .50 probability of endorsing this category or a higher response category (Baker and Kim, 2017). It is desirable to obtain items with difficulties spread across the full normal range of SDE and IM, with a lower-bound threshold of one and a higher-bound threshold of six. We aimed at selecting items matching at least the minimum required properties for both scoring methods for IM and SDE.
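The interpretive bands above can be expressed as a small lookup, shown here as a sketch. The function names are hypothetical; treating "close to 0" as a value that rounds to 0.00 or below, and combining both bands into a single minimum check, are assumptions made for illustration only.

```python
def classify_loading(loading):
    """Band a standardized factor loading (< .40 low, .40-.55 adequate, > .55 good)."""
    loading = round(loading, 2)
    if loading < 0.40:
        return "low"
    if loading <= 0.55:
        return "adequate"
    return "good"

def classify_discrimination(a):
    """Band a discrimination estimate using the cut-offs quoted in the text."""
    a = round(a, 2)
    if a <= 0.0:
        return "none"  # assumption: 'close to 0' is taken as 0.00 or below
    if a <= 0.34:
        return "very low"
    if a <= 0.64:
        return "low"
    if a <= 1.34:
        return "moderate"
    return "high to very high"

def meets_minimum(loading, a):
    """Illustrative check: at least an adequate loading and moderate discrimination."""
    return (classify_loading(loading) != "low"
            and classify_discrimination(a) in ("moderate", "high to very high"))

print(classify_loading(0.51), classify_discrimination(1.01), meets_minimum(0.51, 1.01))
```

Applying such a check separately to the dichotomous and the polytomous estimates mirrors the requirement that an item meet the minimum properties under both scoring methods.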

The following criteria were used for the visual inspection of the ICC and IIC. For polytomous scoring, it is desirable that the ICCs of the response categories are peaked and well spread out across the trait range (e.g., the probability of giving a score of 5 is on the higher side of the trait, whereas the probability of scoring a 2 is on the lower side of the trait). For dichotomous scoring, the ICC depicts the relationship between the probability of endorsing an item and the level of SDE/IM. A desirable item is less likely to be endorsed by a person with a low SDE or IM level than by a person with a high trait level. For dichotomous scoring, an S-shaped ICC with a steep increase at moderate levels of SDE and IM is preferred. For the IIC, in selecting items for SDE and IM, a trade-off between the amount of information and the information range is desired. A selection of items that together give a relatively high amount of information over the full range is preferred.
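
To make the curve shapes described above concrete, the sketch below computes an ICC and IIC for a dichotomous item and the category curves for a seven-category item. It assumes a unidimensional two-parameter logistic model and a graded response model; the models actually fitted in this study are multidimensional and are specified in the Methods, so this is a simplified illustration, and all parameter values are hypothetical.

```python
import math


def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def icc_2pl(theta: float, a: float, b: float) -> float:
    """Probability of endorsing a dichotomous item at trait level theta."""
    return logistic(a * (theta - b))


def info_2pl(theta: float, a: float, b: float) -> float:
    """Item information for the 2PL model; it peaks at theta == b."""
    p = icc_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def grm_category_probs(theta: float, a: float, thresholds: list) -> list:
    """Category probabilities under a graded response model.

    `thresholds` holds the ordered b values (six for a 7-point scale).
    At theta == b_k the probability of responding in category k+1 or
    higher is .50, matching the interpretation given in the text.
    """
    cumulative = [1.0] + [logistic(a * (theta - b)) for b in thresholds] + [0.0]
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical item: a = 1.5, six evenly spaced thresholds
thresholds = [-1.5, -0.9, -0.3, 0.3, 0.9, 1.5]
for theta in (-2.0, 0.0, 2.0):
    probs = grm_category_probs(theta, 1.5, thresholds)
    print(theta, [round(p, 2) for p in probs])
```

With well-separated thresholds, each category has a trait range in which it is the most likely response. When thresholds bunch together, the middle categories stay flat and low across the whole range, which is the kind of pattern the visual inspection is meant to flag.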

Self-Deceptive Enhancement

The SDE items with the best trade-off between item properties (see explanation above) for both dichotomous and polytomous scoring were items 2, 4, 6, 10, 18, 20 (Denial), and 5 (Enhancement) (see Table 3 for item labels and parameter estimates). These items had factor loadings ranging from .51 to .75, discriminated well between low and high levels of SDE with moderate to high discrimination estimates (a range = 1.01–1.91), and demonstrated difficulty estimates in a broad range without the items being too easy or too difficult (Table 3). All items provided relatively higher levels of information at the middle/higher side of the SDE trait. Overall, response categories were endorsed at the appropriate underlying trait level (see, for example, the ICC and IIC of item 20 in Figure 1A; ICCs and IICs of the other SDE items can be found in Supplementary material S1).

Table 3. Factor loadings, discrimination, threshold, and difficulty parameters for SDE for dichotomous and polytomous scoring.

Figure 1. Item Characteristic Curve (ICC) and Item Information Curve (IIC) for item 20 (A) and item 21 (B) under dichotomous (upper rows) and polytomous (lower rows) scoring of the BIDR 40-item original version.

Impression Management

The IM items with the best trade-off between properties for both scoring options (see explanation above) were items 21, 23, 25, 27 (Denial), and 28 (Enhancement) (Table 4). Factor loadings of these items ranged from .52 to .69 for dichotomous scoring, and from .47 to .64 for polytomous scoring. Items had moderate to high discrimination (a between 0.90 and 1.61), and difficulty levels were in a broad range without being extremely easy or difficult. IICs indicated high(er) levels of information, together covering a relatively broad range of IM (see, for example, the ICC and IIC of item 21 in Figure 1B; ICCs and IICs of the other items can be found in Supplementary material S1). Also, for these items, most response options were most likely to be endorsed at the corresponding IM level. In sum, the IRT analyses resulted in a 12-item short form (seven SDE and five IM items) of the BIDR that is suitable for both dichotomous (D) and polytomous (P) scoring. In the following, we will refer to this short form as the BIDR-DP12.

Table 4. Factor loadings, discrimination, threshold, and difficulty parameters for IM for dichotomous and polytomous scoring.

Dichotomous versus polytomous scoring method

Although the primary goal of the study was to create a short form of the BIDR for both dichotomous and polytomous scoring, which led to the proposed BIDR-DP12, our results indicated that the dichotomous scoring method outperformed polytomous scoring in terms of model fit, factor loadings, and IRT item properties (i.e., a, b, ICC, IIC). Moreover, under polytomous scoring, responses to the BIDR items tended to be skewed (i.e., most of the time, a score of 1–2 or 6–7 was given). Especially for IM, although the items with the best parameter trade-off were selected, ICCs indicated a skewed answering tendency for more than half of the items (see item 29 in Figure 2A for an example). In addition, the probability of selecting some response categories was extremely low, as indicated by flat and overlapping ICCs (i.e., the ICC of some response categories was below all other response category ICCs; see item 28 in Figure 2B for an example). This could indicate that a 7-point scale may not be the most suitable solution. Furthermore, factor loadings and discrimination estimates tended to be lower for the polytomous scoring method, which resulted in the elimination of some IM items with good item parameter properties when dichotomously scored (e.g., item 22, see Table 4). Likewise, some SDE items with good item parameter properties under dichotomous scoring were not selected because of less favorable parameters under polytomous scoring (see, for example, item 11 in Table 3).

Figure 2. Item Information Curve (IIC) for item 29 (A) and item 28 (B) under polytomous scoring of the BIDR 40-item original version.

Thus, requiring the 12-item short form to be suitable for both dichotomous and polytomous scoring resulted in the loss of items that performed well only under dichotomous scoring. Moreover, ICCs indicated that the polytomous scoring method resulted in a strongly skewed distribution, leading to a comparison of participants in one tail of the distribution versus the others. This might indicate a more or less dichotomous answering tendency. Therefore, we also decided to create a short form for dichotomous scoring only. Based on factor loadings, parameter estimates, ICCs, and IICs, we included the SDE items 4, 6, 10, 18, 20 (Denial) and 3, 5, 9, 15, 17 (Enhancement) and the IM items 21, 23, 25, 29, 35 (Denial) and 22, 28, 32, 36, 38 (Enhancement) in this 20-item ‘dichotomous only’ short form, which we named the BIDR-D20.

Test information, internal consistency, and test–retest reliability

Test information indicated that the original BIDR provided the most information for both the subscales Denial and Enhancement (Figure 3). This is not surprising as the test information is generated by aggregating the item information, and therefore the original 40-item BIDR was expected to have the highest test information. The dichotomous scoring method provided the strongest information at medium-to-high levels of SDE and IM, but weak to no information for discriminating among respondents at low levels. Polytomous scoring provided information over a broader range of SDE and IM than dichotomous scoring, but information levels were generally lower than the peak level information of the dichotomous scoring method.

Figure 3. Test information functions for SDE and IM dichotomous and polytomous scoring.

Although analyses indicated that, within the IRT framework, both SDE and IM had the best model fit as a two-factor model (i.e., Denial and Enhancement), in clinical practice and research, IM and SDE are rarely further divided into Denial and Enhancement scales. For this reason, all further analyses were also conducted using IM and SDE as one-factor constructs. Internal consistency was considered adequate for most versions; five of the 22 different versions of the BIDR (sub)scales (original both polytomous and dichotomous, for Denial and Enhancement; the BIDR-DP12 both versions and BIDR-D20 Denial and Enhancement; all for IM and SDE) had a McDonald’s omega between .60 and .70 at both measurement occasions. Omega was lower for polytomous scoring than for dichotomous scoring. The same held for Denial and Enhancement, respectively. Test–retest correlations for the SDE and IM short form(s) and the original BIDR ranged from .61 to .85, which is in line with previous research (e.g., Paulhus, 1991; Hart et al., 2015). After correcting for multiple testing, moderation analyses showed that the time between the Wave 1 and Wave 2 assessment did not affect BIDR scoring.
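
For readers less familiar with McDonald’s omega, the sketch below shows how it follows from standardized loadings in the simplest case. It assumes a one-factor model with uncorrelated residuals and uses hypothetical loadings; it is not the estimation routine used in this study, which is not detailed in this section.

```python
def omega_total(loadings: list) -> float:
    """McDonald's omega for a one-factor model with standardized loadings.

    omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of uniquenesses),
    where each uniqueness is 1 - loading^2 (uncorrelated residuals assumed).
    """
    common = sum(loadings) ** 2
    unique = sum(1.0 - value * value for value in loadings)
    return common / (common + unique)


# Hypothetical loadings for a five-item and a ten-item scale
five_items = [0.5, 0.5, 0.6, 0.6, 0.7]
ten_items = five_items * 2

print(round(omega_total(five_items), 3))
print(round(omega_total(ten_items), 3))
```

With loadings of the same size, omega increases with the number of items, which is one reason short subscales can fall into the .60–.70 range even when individual items load adequately.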

Correlations across the different versions of SDE or IM at Wave 1 and Wave 2 were moderate to high for SDE (r = .62–.88) and for IM (r = .62–.90), except for the correlations across the Denial and Enhancement subscales of SDE (r = .22–.39) and IM (r = .33–.47). Correlations between SDE and IM for the same BIDR version (e.g., BIDR-DP12 SDE and BIDR-DP12 IM) were small to medium (r between .07 and .45). Omega values, test–retest correlations, and correlations between IM and SDE, together with the means and standard deviations of all studies, can be found in Supplementary material S2.

Discussion

The goal of Study 1 was to create a Dutch short form of the BIDR Version 6. Based on the results of IRT analyses, a short form containing seven SDE and five IM items that can be used for both dichotomous and polytomous scoring methods was proposed, the so-called BIDR-DP12. However, the IRT analyses indicated that responses in the polytomous scoring method tended to be skewed, implying that participants often responded in a more dichotomous fashion rather than utilizing the full range of options. While such skewness is not inherently problematic, as it may simply reflect items that are particularly easy or difficult, our objective was to design a questionnaire that made full use of a 7-point Likert scale. Ideally, Item Characteristic Curves (ICCs) would show a balanced distribution across response categories, with higher scores more likely at the upper end of the trait and lower scores at the lower end. In addition, analyses indicated that dichotomous scoring yielded better model fit and higher factor loadings and was more suitable for distinguishing between participants scoring high versus low on SDE and IM. Further, the BIDR-DP12 consisted mostly of items that deny negative attributes (both SDE and IM included only one item of the Enhancement subscale). Consequently, this resulted in a one-factor model for both SDE and IM. To overcome the aforementioned limitations and create a balanced scale, we also created a ‘dichotomous only’ short form, which we named the BIDR-D20, that includes the best SDE and IM items in terms of IRT parameters, with an equal number of items for Denial and Enhancement.

Contrary to previous research (Stöber et al., 2002; Vispoel and Tao, 2013), dichotomous scoring was better or equal in terms of internal consistency for all BIDR forms. In terms of test–retest reliabilities, however, polytomous scoring produced somewhat better results. This is in line with previous outcomes (Schnapp et al., 2017; Stöber et al., 2002). Comparing our short forms with the 40-item BIDR, it can be concluded that the dichotomously scored BIDR and the BIDR-D20 are comparable in terms of internal consistency and test–retest reliability. The BIDR-D20, however, provided less information over the same range as the full BIDR, which is not unexpected. In general, as the test information curve (TIC) is generated by aggregating the item information, the original version (i.e., the longer test) can measure an examinee’s trait level with greater precision than the shortened versions. To confirm the item and test psychometric properties of the short forms, further examination and replication of the results is needed.

Study 2

In Study 2, we further examined the psychometric properties of the BIDR-DP12 and BIDR-D20 short forms. First, we repeated IRT analyses in a new community sample to investigate to what extent the item properties of both short forms could be replicated. Second, we aimed to replicate prior research that outlined the nomological network of SDR by testing the associations between SDE, IM, and basic personality traits and deviant traits and thoughts. Research has indicated that higher levels of SDE are associated with lower levels of self-reported negative emotionality and higher levels of extraversion, open-mindedness, and emotional stability. More specifically, positive correlations have been reported between the SDE Enhancement subscale, self-reported extraversion, and open-mindedness, whereas SDE Denial has been positively associated with self-reported emotional stability, conscientiousness, and agreeableness (for a meta-analysis see Li and Bagger, 2006).

Conversely, higher levels of IM have been associated with higher self-reported agreeableness, conscientiousness, and emotional stability (Holden and Passey, 2010; Li and Bagger, 2006; Meston et al., 1998; Paulhus, 1988, 2002). Research has also suggested that honesty-humility was the most prominent personality factor in explaining SDR (De Vries et al., 2014; Zettler et al., 2015). Regarding the scoring method, Stöber et al. (2002) found in a student population sample that continuous SDE scores demonstrated significantly higher correlations with conscientiousness [r = −.51 vs. r = −.31, z(diff) = −3.54, p < .001], extraversion [r = .17 vs. r = .03, z(diff) = 2.21, p < .05], and negative emotionality [r = .41 vs. r = .27, z(diff) = 2.28, p < .05] than dichotomous scoring.

In addition, a recent meta-analysis found a significant, albeit small, negative association between IM and SDE on the one hand and self-reports measuring antisocial cognitions (e.g., entitlement to sex) and antisocial personality patterns/traits (e.g., psychopathy traits, antisocial behavior) on the other, suggesting that higher scores on IM and SDE are associated with lower scores on self-reports measuring these dynamic risk factors in samples of men who have offended (Hildebrand et al., 2018). Since both IM and SDE seem to be positively correlated with age, and women tend to score higher on IM, whereas men score higher on SDE (Bobbio and Manganelli, 2011; Kroner and Weekes, 1996; Li et al., 2011; but see Hildebrand et al., 2018; Mathie and Wakeling, 2011), both age and sex will be taken into account in the analyses. The expected direction of the correlations (i.e., positive or negative) can be found in Table 5 in the Results section.

Table 5. Correlations for gender, age, personality, risk factors for deviant traits and thoughts and all forms of the BIDR.