Using Item Response Theory for the Development of a New Short Form of the Eysenck Personality Questionnaire-Revised

The present work aims at developing a new version of the short form of the Eysenck Personality Questionnaire-Revised, which includes Psychoticism, Extraversion, Neuroticism, and Lie scales (48 items, 12 per scale). The work consists of two studies. In the first one, an item response theory model was estimated on the responses of 590 individuals to the full-length version of the questionnaire (100 items). The analyses allowed the selection of 48 items that discriminate well, are distributed along the latent continuum of each trait, and show neither misfit nor differential item functioning. In the second study, the functioning of the new form of the questionnaire was evaluated in a different sample of 300 individuals. Results of the two studies show that the reliability of the four scales is better than, or equal to, that of the original forms. The new version outperforms the original one in approximating scores of the full-length questionnaire. Moreover, convergent validity coefficients and relations with clinical constructs were consistent with the literature.


INTRODUCTION
In the view of Eysenck (see Eysenck and Eysenck, 1975, 1991), the structure of personality may be effectively described by three main traits: psychoticism (P), extraversion (E), and neuroticism (N). These dimensions are also known as the "Giant Three" and represent basic, independent, and biologically founded traits. They characterize all subjects, to varying degrees, and allow for effectively describing behavioral, emotional, and individual differences among adults and young people. According to the authors, PEN traits do not represent pathological dimensions in themselves, but could lead to the development of abnormal conditions only in particular situations. In this perspective, neurosis and psychosis should be conceived as pathological exaggerations of the underlying traits of neuroticism and psychoticism (Mor, 2010).
Extraversion and neuroticism were the first two dimensions included in Eysenck's model and were conceptualized as orthogonal continua (Eysenck and Eysenck, 1964, 1991). The neuroticism dimension describes a trait opposed to emotional stability, and defines the degree to which a person is predisposed to experience negative affect (Eysenck and Eysenck, 1964, 1991; Mor, 2010). Individuals with high levels of this trait tend to be worried, apprehensive, moody, fed-up, and irritable (Eysenck and Barrett, 2013). Extraversion is the second dimension included in the model and depicts sociable, carefree, friendly, convivial, easygoing, and impulsive individuals. This trait is opposed to introversion which, in contrast, defines introspective, quiet, serious, and reserved individuals (Eysenck and Eysenck, 1975, 1991; Eysenck and Barrett, 2013). The third dimension included in Eysenck's model was psychoticism, or toughmindedness. The typical toughminded individual is hostile, aggressive, untrusting, cold, unemotional, rude, lacking in human feelings, and unfriendly. At the opposite pole of the continuum are individuals with a well-adjusted personality, who are agreeable, empathic, tolerant, conscientious, openminded, friendly, and warm (Eysenck and Eysenck, 1975, 1991; Eysenck and Barrett, 2013).
The short form of the Eysenck Personality Questionnaire-Revised (EPQ-R; Eysenck et al., 1985; Eysenck and Eysenck, 1991) includes 48 items (out of the 100 of the EPQ-R), 12 for each of the four dimensions. This version of the instrument has been translated into several languages and is widely used, across different countries, for scientific and clinical purposes (Hosokawa and Ohyama, 1993; Aluja et al., 2003; Alexopoulos and Kalaitzidis, 2004; Dazzi et al., 2004; Francis et al., 2006; Tiwari et al., 2009; Sanavio et al., 2013). However, it suffers from the same drawbacks as the full-length version. In particular, the P scale exhibited poor reliability, with a restricted range of scores and a strong positive skewness (Bishop, 1977; Block, 1977; Claridge, 1981; Hosokawa and Ohyama, 1993; Katz and Francis, 2000; Alexopoulos and Kalaitzidis, 2004). In addition, several items showed differential item functioning (DIF) across gender (Eysenck et al., 1985; Eysenck and Eysenck, 1991; Lynn and Martin, 1997; Forrest et al., 2000; Karanci et al., 2006; Escorial and Navas, 2007), which makes the comparison between groups questionable.
A better selection of the items from the full-length version of the instrument could allow for reducing some of the aforementioned drawbacks. The present work aims at developing a new version of the short form of the EPQ-R with improved psychometric properties.
Item response theory (IRT; Bock, 1997; Thissen and Steinberg, 2009) is one of the most promising approaches to this aim. There are several successful applications of IRT for the development and validation of measurement scales (see Da Dalt et al., 2013, 2015; Balsamo et al., 2014; Anselmi et al., 2015; Zanon et al., 2016; Sotgiu et al., 2018). Moreover, compared with classical test theory, IRT was found to provide more diagnostic information useful for the development of brief scales (Spence et al., 2012; Bortolotti et al., 2013; Petrillo et al., 2015). IRT allows for identifying the items that are best at discriminating different levels of the latent trait of interest, while ensuring that the entire trait continuum is covered. Selecting these items can result in a brief version of the scale that produces scores very similar to those obtained with the full-length scale and has the same external validity (i.e., the same correlations with other constructs; Reise and Henson, 2000; Spence et al., 2012). Moreover, IRT allows for detecting items that are unclear, ambiguous, or which exhibit DIF. These items should not be included in the brief scale. Despite the advantages offered by IRT, only a few studies employed this approach for the refinement of Eysenck's instruments (e.g., Ferrando, 2001; Ferrando and Chico, 2001; Escorial and Navas, 2007; Maij-de Meij et al., 2008). Recently, Colledani et al. (2018) used IRT for developing a new version of the abbreviated form of the Junior EPQ-R (6 items per scale). The new version outperformed the original one on several aspects.
This work includes two main studies. In Study 1, a series of analyses were performed on the responses to the full-length version of the EPQ-R in order to select the 48 items (12 for each scale) with the best psychometric properties. In Study 2, the functioning of the new short form was tested in a new data sample. Reliability, validity, and factor structure were examined. Relationships of the new scales with social desirability, the dimensions of the Five Factor Model (FFM), and clinically relevant constructs were verified.

STUDY 1

Participants
A total of 590 participants took part in the study (mean age = 36.69 years, SD = 14.16; from 18 to 75 years; 55.8% females). They were recruited from different Italian regions through convenience sampling. All participants were native Italian speakers and completed the questionnaire anonymously and voluntarily. All standards for research with human subjects were respected. Written informed consent was obtained from the participants. The project has been approved, now as later, by the Ethical Committee for the Psychological Research of the University of Padova since a prospective ethics approval was not required at the time when the research was conducted (Protocol n. 2622).

Instruments
The participants were presented with the Italian version of the EPQ-R (Dazzi et al., 2004; Dazzi, 2011). The instrument consists of 100 dichotomous items (yes/no), 32 for P scale (e.g., "Should people always respect the law?," "Do you enjoy hurting people you love?"), 23 for E scale (e.g., "Do you enjoy meeting new people?," "Can you get a party going?"), 24 for N scale (e.g., "Would you call yourself a nervous person?," "Are you often troubled about feelings of guilt?"), and 21 for L scale (e.g., "Are all your habits good and desirable ones?," "Have you ever cheated at a game?"). Administration of the questionnaire was individual and paper-and-pencil.
The Italian version of the questionnaire has good reliability and the four-factor structure was confirmed (α = 0.67, 0.78, 0.85, and 0.75 for P, E, N, and L scales, respectively; Dazzi et al., 2004; Dazzi, 2011). The reliability found in the current sample (α = 0.60, 0.79, 0.85, and 0.77 for P, E, N, and L scales) is in line with the literature.
Studies in the Italian context also tested the factor structure and the psychometric characteristics of the short version of the instrument (Dazzi et al., 2004). Consistent with cross-cultural findings, results supported the four-factor structure of the instrument and showed reliability coefficients that were satisfactory for E, N, and L scales, but lower for P (α = 0.37, 0.77, 0.83, and 0.70 for P, E, N, and L, respectively; Dazzi et al., 2004). The reliability found in the current sample (α = 0.40, 0.73, 0.83, and 0.73 for P, E, N, and L scales) is in line with the literature.

Analysis Strategy
The two-parameter logistic (2PL) model (see Thissen and Steinberg, 2009) was separately estimated on the responses to each of the four scales of the questionnaire. This model describes the probability that a subject endorses a certain item as a function of the latent trait level of the subject (parameter θ), the "endorsability" level of the item (i.e., the ease of providing a "yes" response to that item; parameter ε), and the capability of the item in differentiating subjects with different trait levels (parameter δ). In the case of the P scale, for instance, the greater the value of parameter θ, the greater the level of psychoticism of the subject; the greater the value of parameter ε, the greater the ease of responding "yes" to the item (i.e., of providing a response that is indicative of the presence of psychoticism); the greater the value of parameter δ, the greater the capability of the item in differentiating between subjects with different levels of psychoticism. All the analyses were run using the packages "difR" (Magis et al., 2016) and "ltm" (Rizopoulos, 2012) for the statistical environment R (R Core Team, 2016).
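As an illustration of how the three parameters interact, the response function of the 2PL can be sketched in a few lines of Python. This is not the code used in the study, which relied on the R packages cited above. The exact parameterization is an assumption made here for illustration: the logit is taken to be δ(θ + ε), so that larger values of ε make a "yes" response easier, and the function name p_endorse is hypothetical.

```python
import math


def p_endorse(theta: float, epsilon: float, delta: float) -> float:
    """Probability of a "yes" response under an assumed 2PL form.

    Assumption (not stated in the text): logit = delta * (theta + epsilon),
    with epsilon read as item easiness and delta as item discrimination.
    """
    return 1.0 / (1.0 + math.exp(-delta * (theta + epsilon)))


# A subject at theta = 0 answering an item with epsilon = 0 has a 50% chance.
print(p_endorse(0.0, 0.0, 1.0))
# Raising epsilon (an easier item) raises the probability for the same subject.
print(p_endorse(0.0, 1.0, 1.0))
```

Under this form, a larger δ makes the probability change more sharply around θ = −ε, which is what allows the item to separate subjects with nearby trait levels.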
The 2PL assumes unidimensionality of the scales. Confirmatory factor analyses were run on the data of each of the four scales (for a reasonable fit, CFI ≥0.90, RMSEA <0.08; see Hu and Bentler, 1999;Marsh et al., 2004;Brown, 2006). These analyses confirmed the unidimensionality of N ]. An exploratory factor analysis on this scale suggests a four-factor solution with 7 items out of 32 exhibiting cross-loadings. In line with literature (e.g., Howarth, 1986;Roger and Morris, 1991;Chico and Ferrando, 1995;Dazzi, 2011), this result confirms that P scale defines a complex and multifaceted construct.

Item Selection for the New Short Scales
DIF and item fit statistics were used to identify the items with the poorest psychometric properties, which were then excluded from the new short scales.
Three item fit statistics were used: infit, outfit (Wright and Masters, 1982), and the index suggested by Bock (1972). Infit and outfit are two χ²-based statistics, the former being effective in detecting unexpected responses to items close to a subject's trait level, the latter being effective in detecting unexpected responses to items far from the subject's trait level. In this work, items with infit and/or outfit higher than 1.4 (Wright and Linacre, 1994) were considered misfitting and not included in the new short scales. The index suggested by Bock involves grouping subjects into n categories on the basis of their latent trait level, and comparing, for each group, the observed and expected proportions of subjects endorsing the item (Bock, 1972; Reise, 1990). In this work, subjects were grouped into four categories, and the items which displayed a medium (0.3 ≤ effect size < 0.5) to large (effect size ≥ 0.5) effect size (Cohen, 1988) were not selected for inclusion in the new questionnaire.
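The infit and outfit statistics can be illustrated with a minimal Python sketch. It assumes the usual mean-square definitions for dichotomous items: outfit is the unweighted mean of squared standardized residuals, and infit is the information-weighted version. The function names and the toy data are hypothetical, and the model-expected probabilities are taken as given rather than estimated.

```python
def infit_outfit(responses, probs):
    """Mean-square infit and outfit for one dichotomous item.

    responses: observed 0/1 answers, one per subject.
    probs: model-expected probabilities of a "yes", one per subject.
    """
    squared_residuals = [(x - p) ** 2 for x, p in zip(responses, probs)]
    variances = [p * (1.0 - p) for p in probs]
    # Outfit: plain average of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    # Infit: squared residuals weighted by the response variance.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


def is_misfitting(infit, outfit, cutoff=1.4):
    """Flag an item when infit and/or outfit exceeds the cutoff used in the text."""
    return infit > cutoff or outfit > cutoff


# Toy example: the last subject says "no" to an item they were expected to endorse.
infit, outfit = infit_outfit([1, 0, 1, 0], [0.8, 0.3, 0.6, 0.95])
print(infit, outfit, is_misfitting(infit, outfit))
```

Because outfit is unweighted, a single very unexpected response (such as the last one above) inflates it more than it inflates infit.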
Items exhibiting gender DIF were also excluded from the new questionnaire. Both uniform and non-uniform DIF were considered. The former is a systematic bias expressing a different probability of endorsing an item for the members of a specific group. The latter is a non-systematic bias which varies with the latent trait level. Females were used as the reference group. Effect sizes of uniform and non-uniform DIF were evaluated by the R² difference test (Nagelkerke, 1991; Gómez-Benito et al., 2009), with values higher than 0.035 denoting moderate DIF and values higher than 0.07 denoting strong DIF (Jodoin and Gierl, 2001; Magis et al., 2016).
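To make the effect-size rule concrete, the sketch below computes Nagelkerke's R² from the log-likelihoods of a fitted and a null logistic model, and classifies a difference in R² between two nested models with the thresholds given above. It is only a sketch: the models themselves are not fitted here, the function names are hypothetical, and the label "negligible" for values at or below 0.035 is an assumption, since the text names only the moderate and strong categories.

```python
import math


def nagelkerke_r2(ll_model: float, ll_null: float, n: int) -> float:
    """Nagelkerke's R-squared from model and null log-likelihoods (n observations)."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_attainable = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_attainable


def dif_size(delta_r2: float) -> str:
    """Classify an R-squared difference between nested models."""
    if delta_r2 > 0.07:
        return "strong"
    if delta_r2 > 0.035:
        return "moderate"
    return "negligible"  # assumed label for values not exceeding 0.035


# Hypothetical log-likelihoods for a model without and with the group term.
r2_without_group = nagelkerke_r2(-350.0, -400.0, 590)
r2_with_group = nagelkerke_r2(-340.0, -400.0, 590)
print(dif_size(r2_with_group - r2_without_group))
```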
Among the remaining items, parameters ε and δ were examined to select those with the greatest discrimination that, taken together, cover the entire trait continuum.

Assessment of the Psychometric Characteristics of the New Short Scales
Reliability and validity of the newly developed PEN-L scales were evaluated and compared with those of the original short scales. Reliability was evaluated through Cronbach's α and test information function (TIF). TIF tells us how well the test measures the latent trait levels over the entire range of interest (Baker, 2001;Petrillo et al., 2015). The larger the value of TIF, the greater the accuracy with which the latent trait levels are measured. TIF depends on the latent trait range under consideration and on the number of items in the test (Baker, 2001). In this work, the old and new short scales had the same length (12 items), and TIF was defined on the same range of latent trait levels (−5 to 5). Validity was evaluated using a bias index and the correlation between scores obtained with full-length and short scales. The bias index was computed as the average difference (in absolute terms) between the parameters θ estimated on the full-length scales and those estimated on the short scales. Low biases suggest that the latent trait estimates obtained with the short scales approximate those of the full-length versions.
In addition, the correlations between scores obtained with the full-length and short scales were computed and corrected for common items using Levy's (1967) method.
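The two quantities described above, test information and the bias index, can be sketched as follows. Several points are assumptions made for illustration: the 2PL logit is written as δ(θ + ε); the reported TIF values are read as the area under the test information curve between −5 and 5; and the function names and item parameters are hypothetical. Levy's correction is not reproduced here.

```python
import math


def item_information(theta, epsilon, delta):
    """Information of one 2PL item at trait level theta (assumed parameterization)."""
    p = 1.0 / (1.0 + math.exp(-delta * (theta + epsilon)))
    return delta ** 2 * p * (1.0 - p)


def tif_area(items, lo=-5.0, hi=5.0, steps=2000):
    """Trapezoidal area under the test information curve between lo and hi.

    items: list of (epsilon, delta) pairs, one per item.
    """
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        theta = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * sum(item_information(theta, e, d) for e, d in items)
    return total * h


def bias_index(theta_full, theta_short):
    """Average absolute difference between full-length and short-scale estimates."""
    return sum(abs(f - s) for f, s in zip(theta_full, theta_short)) / len(theta_full)


# Hypothetical three-item scale and hypothetical trait estimates for three subjects.
print(tif_area([(0.0, 1.0), (1.0, 1.5), (-1.0, 0.8)]))
print(bias_index([0.2, -1.0, 1.3], [0.4, -0.7, 1.1]))
```

Under this reading, each item contributes at most δ to the area, so scales built from more discriminating items yield larger TIF values over the same range.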

Results
Three of the 32 items of P scale exhibited uniform and non-uniform gender DIF of moderate (Items 68 and 91) or strong (Item 12) size. Fit statistics were adequate for all the items. From the remaining 29 items, 12 were selected taking into account their parameters ε and δ. This resulted in a new short scale that differed from the original one in eight items (see Table 1). Specifically, Item 91 was changed because it showed uniform and non-uniform gender DIF of moderate size. These modifications allowed for obtaining a new scale with increased reliability (α increased from 0.40 to 0.62; TIF increased from 8.13 to 12.86) and with scores that better approximate those obtained with the full-length scale (bias decreased from 0.37 to 0.18, corrected correlation increased from 0.47 to 0.52). It is worth noting that Cronbach's α of the new 12-item scale (0.62) closely resembles that of the full 32-item scale (0.60).
Regarding the 23 items of E scale, only Item 55 exhibited uniform gender DIF of moderate size and no item showed misfit. Selecting 12 items on the basis of their parameters ε and δ, we obtained a new E scale that differed from the original one in three items (see Table 2). The differences in reliability and validity between the new and original scales were small in size, but nevertheless in favor of the new version (α increased from 0.73 to 0.75; TIF increased from 16.62 to 16.83; bias decreased from 0.21 to 0.19; corrected correlation increased from 0.74 to 0.77).
Concerning N and L scales, no item exhibited gender DIF or misfit. Therefore, items were selected considering their ε and δ parameters. For both scales, the new versions differed from the original ones in two items (see Tables 3, 4

Discussion
This study aimed at developing a new short version of the EPQ-R with improved psychometric characteristics. IRT-based statistics allowed the identification of 48 items without gender DIF or misfit, well discriminating, and well distributed along the four latent trait continua. The new version of the P scale differs from the original one in eight items (out of 12), E scale in three, and N and L only in two. The largest improvement was reached for P scale, which in the literature was found to perform less well than the other three scales (e.g., Bishop, 1977; Block, 1977; Claridge, 1981). In particular, the new version is not affected by gender DIF and outperforms the original one in reliability and in approximating the scores obtained with the full-length form. The new versions of the other three scales performed as well as, or slightly better than, the original ones. Although small in size, these improvements are valuable considering that they were obtained by substituting a small number of items and reducing content redundancy.

STUDY 2
This study aimed at investigating the functioning of the new version of the short EPQ-R on a new data set. In addition to reliability and factor structure, construct validity was evaluated by taking into account relationships with social desirability, the dimensions of the FFM, and measures of anxiety and depression.

Participants
Participants were 300 native Italian speakers aged between 18 and 65 (mean age = 29.28, SD = 10.38; 60.2% females). They were recruited from different Italian regions using convenience sampling. All participants were presented with the new version of the short EPQ-R, whereas a subsample of 158 participants (mean age = 34.73, SD = 9.88; 68.7% females) also received the other measures. Participation in the study was anonymous and voluntary, and all standards for research with human subjects were respected. Written informed consent was obtained from the participants. The project has been approved, now as later, by the Ethical Committee for the Psychological Research of the University of Padova since a prospective ethics approval was not required at the time when the research was conducted (Protocol n. 2622).

Instruments
The new form of the short EPQ-R devised in Study 1 was administered to all participants. The five traits of the FFM of personality (i.e., extraversion, agreeableness, conscientiousness, emotional stability, and openness) were measured through the Italian version (Ubbiali et al., 2013; Chiorri et al., 2016) of the Big Five Inventory (BFI; John et al., 2008). The questionnaire consists of 44 items answered on a five-point Likert scale (from 1 "Strongly disagree" to 5 "Strongly agree"; e.g., "I see myself as someone who is full of energy" for extraversion; "I see myself as someone who is helpful and unselfish with others" for agreeableness; "I see myself as someone who perseveres until the task is finished" for conscientiousness; "I see myself as someone who worries a lot" for emotional stability; "I see myself as someone who is ingenious, a deep thinker" for openness). Convincing evidence was found concerning construct validity, factor structure, gender invariance, and reliability (α from 0.75 to 0.86; Ubbiali et al., 2013; Chiorri et al., 2016; α from 0.73 to 0.83 in the current sample). The Impression Management (IM) scale of the Italian brief version (Bobbio and Manganelli, 2011) of the Balanced Inventory of Desirable Responding (BIDR; Paulhus, 1991) was also administered. The scale comprises 8 items answered on a six-point Likert scale (from 1 "Strongly disagree" to 6 "Strongly agree") and assesses the conscious tendency of individuals to provide positively inflated self-descriptions (e.g., "I have never dropped litter on the street"). Internal consistency of the scale ranges from 0.73 to 0.81 (Bobbio and Manganelli, 2011; in the current sample, α = 0.75).
The trait scale of the State-Trait Anxiety Inventory (STAI-Y; Spielberger et al., 1983; Pedrabissi and Santinello, 1989) was used to evaluate anxiety. The scale comprises 20 items answered on a four-point Likert scale (from 1 "Not at all" to 4 "Very much"). The instrument evaluates the tendency of people to experience general anxiety and the relatively stable predisposition to view stressful situations as threatening (e.g., "I am regretful"). The Italian version of the questionnaire showed adequate validity and reliability (α from 0.85 to 0.90; Pedrabissi and Santinello, 1989; in the current sample, α = 0.92).
Finally, the Italian version of the Patient Health Questionnaire-9 (PHQ-9; Spitzer et al., 1999;Kroenke et al., 2001) was used to evaluate depressive symptoms. The questionnaire is a self-administered instrument and assesses the nine DSM-IV (American Psychiatric Association, 2000) criteria for depression. Respondents are asked to evaluate the presence of depressive symptoms over the last 2 weeks through nine items scored on a four-point Likert scale (from 0 "Not at all" to 3 "Nearly every day"; e.g., "Feeling tired or having little energy"). This instrument showed adequate reliability (α from 0.86 to 0.89), and good sensitivity and specificity (see Kroenke et al., 2001). In the current sample, α equals 0.81.

Analysis Strategy
Reliability of the new version of the short EPQ-R was tested through Cronbach's α. Construct validity was evaluated by computing convergent validity coefficients and by analyzing the factor structure of the instrument.
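For readers who want to reproduce the reliability coefficient on their own data, Cronbach's α can be computed as in the following Python sketch. The formula is the standard one, α = k/(k − 1) × (1 − Σ item variances / variance of total scores); the function name and the toy response matrix are hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the total (sum) score.
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)


# Toy data: five respondents answering three dichotomous items.
answers = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
print(cronbach_alpha(answers))
```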
Convergent validity was evaluated considering correlations between the four PEN-L traits, the five dimensions of the FFM, social desirability, and indexes of depression and trait anxiety. In line with the literature, L scores are expected to positively correlate with the IM scale of the BIDR (e.g., Gillings and Joseph, 1996), while PEN traits are expected to correlate with BFI scales, depression, and trait anxiety. In particular, positive correlations are expected between E scores of the EPQ-R and the extraversion measure of the BFI, while negative correlations are expected between P scale and agreeableness and conscientiousness. Positive correlations are also expected between N scale of the EPQ-R and the neuroticism measure of the BFI (e.g., McCrae and Costa, 1985; Draycott and Kline, 1995; Saggino, 2000; Barbaranelli et al., 2003; Scholte and De Bruyn, 2004; Heaven et al., 2013). Neuroticism, in addition, is expected to positively correlate with indexes of anxiety and depression (STAI-Y; Spielberger et al., 1983; PHQ-9; Spitzer et al., 1999; Kroenke et al., 2001). In contrast, extraversion is expected to negatively correlate with these two clinical indexes.
An Exploratory Structural Equation Model (ESEM; Asparouhov and Muthén, 2009) was run to evaluate the factor structure. The ESEM framework represents an integration of confirmatory factor analysis (CFA), structural equation modeling (SEM), and exploratory factor analysis (EFA). ESEMs give access to all the common statistics of SEM/CFA but, at the same time, overcome the restrictions associated with the confirmatory approach. CFA fixes non-target loadings to zero and, therefore, it may be inadequate to handle complex and multifaceted constructs where many cross-loadings may be expected (Marsh et al., 2009, 2010, 2011, 2014). When this is the case, fit problems and upward-biased estimates of correlations between factors can be observed (Cole et al., 2007; Marsh and Hau, 2007; Marsh et al., 2010). As in EFA, ESEMs allow for the free estimation of cross-loadings between items and non-target factors. In this work, ESEM was run using Mplus7 (Muthén and Muthén, 2012), with WLSMV (weighted least squares mean and variance-adjusted) as the estimator. This method is recommended for binary or ordinal observed data (e.g., Flora and Curran, 2004; Brown, 2006) such as the dichotomous items of the EPQ-R. In the model, the 48 items were the indicators and four factors were modeled. The GEOMIN oblique rotation was used. To evaluate the goodness of fit of the model, several fit indexes were considered: χ², Comparative Fit Index (CFI; Bentler, 1990), Weighted Root Mean Square Residual (WRMR; Yu, 2002), and Root Mean Square Error of Approximation (RMSEA; Browne and Cudeck, 1993) with its 90% confidence interval (90% CI) and the test of close fit (CFit; Browne and Cudeck, 1993). A solution fits the data well when χ² is nonsignificant (p ≥ 0.05). Since this statistic is sensitive to sample size, the other fit measures were also considered. In particular, a solution fits the data well when CFI is close to 0.95 (0.90 to 0.95 for reasonable fit), WRMR is close to 1.0, and RMSEA is smaller than 0.06 (0.06 to 0.08 for reasonable fit) with CFit non-significant (see Hu and Bentler, 1999; Marsh et al., 2004; Brown, 2006).
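The CFI and RMSEA guidelines above can be expressed as a small helper, sketched here in Python. Turning "close to 0.95" into a hard cutoff and treating the boundary values as shown are simplifying assumptions made for illustration; the function name rate_fit is hypothetical, and WRMR and CFit are left out because the text gives no interval for them.

```python
def rate_fit(cfi: float, rmsea: float):
    """Rate CFI and RMSEA as "good", "reasonable", or "poor".

    Assumed cutoffs: CFI >= 0.95 good, 0.90 to 0.95 reasonable;
    RMSEA < 0.06 good, 0.06 to 0.08 reasonable.
    """
    if cfi >= 0.95:
        cfi_level = "good"
    elif cfi >= 0.90:
        cfi_level = "reasonable"
    else:
        cfi_level = "poor"

    if rmsea < 0.06:
        rmsea_level = "good"
    elif rmsea <= 0.08:
        rmsea_level = "reasonable"
    else:
        rmsea_level = "poor"

    return cfi_level, rmsea_level


# Hypothetical fit values, not results from the study.
print(rate_fit(0.93, 0.05))
```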

Results
Cronbach's α coefficients were 0.55, 0.80, 0.81, and 0.70 for P, E, N, and L scales, respectively. These values were consistent with those of Study 1. Compared with the original version, the largest improvement was reached for P scale, as observed in Study 1.
Convergent validity coefficients are reported in Table 5. All the four PEN-L traits correlated in the expected direction with the considered constructs. E scale showed a strong positive relation with the extraversion measure of the BFI (0.727). P scale was negatively related to agreeableness (−0.323) and conscientiousness (−0.321). N scale was strongly correlated with neuroticism (0.709). Relations with anxiety and depression were also in the expected directions. N scale showed positive relations with scores of PHQ-9 (0.619) and STAI-Y (0.697), while moderate negative relations were found between these two indexes and E scale (r = −0.409, −0.405 for PHQ-9 and STAI-Y, respectively). Finally, L scale showed a strong positive correlation with the IM scale of the BIDR.

Discussion
The analyses performed in this study provide further evidence concerning the adequate psychometric properties of the new short form of the EPQ-R. Concerning reliability, results are in line with those of Study 1 and confirm that, compared with the original version, the largest improvement was observed for P scale. Concerning validity, both the factor structure of the instrument and its convergent validity are supported.

FINAL REMARKS
This work aimed at developing a new and improved version of the short form of the EPQ-R. This instrument is well-known and widely used in different settings. However, some weaknesses have been pointed out, especially for P scale (e.g., Bishop, 1977; Block, 1977; Claridge, 1981). An IRT approach was used to develop the new instrument. This approach allowed for removing items with misfit or gender DIF, and for identifying items that were best at discriminating different levels of traits, while ensuring that the respective continua were covered. As suggested in the literature, following these criteria for item selection should lead to a short scale with the same psychometric properties as the full-length instrument (Reise and Henson, 2000; Spence et al., 2012). In fact, results of this work show that the new short form of the EPQ-R approximated the scores obtained with the full-length form better than the original short version. In addition, convergent validity of the new scale was consistent with the literature (e.g., Saklofske et al., 1995; Gillings and Joseph, 1996; del Barrio et al., 1997; Dazzi et al., 2004; Jylhä and Isometsä, 2006; Mor, 2010). The moderate to strong relationships between Eysenck's traits and clinical constructs provide further evidence of the usefulness of assessing these traits in clinical settings.
A strength of the present work is that it provides a solution to some well-known drawbacks of the full-length EPQ-R and of its short form existing in the literature (Eysenck et al., 1985; Eysenck and Eysenck, 1991). The largest improvement was obtained for P scale. The new version is not affected by gender DIF and outperforms the original one in reliability and in approximating the full-length form. The new versions of the other three scales performed as well as the original ones, or slightly better. These improvements are small in size, yet notable considering that they were obtained by substituting a small number of items and reducing content redundancy.
In the present work, separate analyses have been performed on each of the four scales by using a unidimensional IRT model. An alternative could have been examining the four scales at once through a multidimensional IRT (MIRT) model (see Haberman et al., 2008; Reckase, 2009). MIRT models offer some advantages over unidimensional IRT models. They could allow for a better understanding of the traits measured by an instrument and of how well individual items measure each of them (Ackerman, 1994). Moreover, MIRT models could provide a more precise estimation of scale reliability (Cheng et al., 2009) and item parameters (Finch, 2010). In the present work, some of these advantages are not very relevant. On the one hand, the factor structure of the EPQ-R has been widely tested and validated in the literature (e.g., Hosokawa and Ohyama, 1993; Maltby and Talley, 1998; Forrest et al., 2000; Qian et al., 2000; Scholte and De Bruyn, 2001; Aluja et al., 2003; Alexopoulos and Kalaitzidis, 2004; Dazzi et al., 2004; Francis et al., 2006; Karanci et al., 2006; Tiwari et al., 2009; Picconi et al., 2018). On the other hand, for scales whose length is analogous to that of the four EPQ-R scales (i.e., from 21 to 32 items), the unidimensional IRT models have been found to provide item parameter estimates whose precision exceeds or equals that of the estimates produced by the MIRT models (Finch, 2010). Finch (2010) investigated the precision of MIRT estimates on tests measuring a number of traits as small as two. For larger numbers of traits (e.g., the four traits of the EPQ-R), the number of parameters of a MIRT model increases considerably. Thus, the sample size of Study 1 (590 individuals) might not have been adequate for performing a multidimensional analysis.
Concerning P scale, despite notable improvements, reliability remains rather low. This result, however, was expected. P scale, in fact, perhaps because of its complex and clinical nature, is the most problematic and controversial scale of the instrument (e.g., Eysenck et al., 1985). Future research, therefore, should try to develop a new pool of items effective in capturing the multifaceted aspects of this trait.
In the present work, a new short version of the EPQ-R has been devised, which consists of 12 items for each of the four scales. An abbreviated form that consists of only 6 items per scale also exists in the literature (Francis et al., 1992). This abbreviated form suffers from the same weaknesses that have been pointed out for Eysenck's other questionnaires. Future research should try to devise a new version of the abbreviated form by using the IRT approach.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this manuscript will be made available by the authors, without undue reservation, to any qualified researcher.

AUTHOR CONTRIBUTIONS
DC contributed to the conception and design of the study, conducted the research, performed the statistical analyses, and wrote the first draft of the manuscript. DC and PA wrote sections of the manuscript. All authors contributed to manuscript revision, read and approved the submitted version.