Validity of personality measurement in adults with anxiety disorders: psychometric properties of the Spanish NEO-FFI-R using Rasch analyses

The aim of this study was to analyse the psychometric properties of the Spanish NEO Five Factor Inventory–Revised (NEO-FFI-R) using Rasch analyses, in order to test its rating scale functioning, the reliability of scores, internal structure, and differential item functioning (DIF) by gender in a psychiatric sample. The NEO-FFI-R responses of 433 Spanish adults (154 males) with an anxiety disorder as primary diagnosis were analysed using the Rasch model for rating scales. Two intermediate categories of response (‘neutral’ and ‘agree’) malfunctioned in the Neuroticism and Conscientiousness scales. In addition, model reliabilities were lower than expected in Agreeableness and Neuroticism, and the item fit values indicated each scale had items that did not achieve moderate to high discrimination on its dimension, particularly in the Agreeableness scale. Concerning unidimensionality, the five NEO-FFI-R scales showed large first components of unexplained variance. Finally, DIF by gender was detected in many items. The results suggest that the scores of the Spanish NEO-FFI-R are unreliable in psychiatric samples and cannot be generalized between males and females, especially in the Openness, Conscientiousness, and Agreeableness scales. Future directions for testing and refinement should be developed before the NEO-FFI-R can be used reliably in clinical samples.


Introduction
The Five Factor Model (FFM) of personality has become the reference taxonomy for the study of both general and clinical personality traits (Clark, 2007;Gore and Widiger, 2013). The terms commonly used to describe the personality traits that underpin the FFM are Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism, known by the acronym OCEAN (Costa and McCrae, 1985). The FFM has several characteristics that make it a very helpful model: (1) it integrates terminology from diverse theoretical frameworks which facilitates communication between researchers, (2) it makes it easier to explore the relationship between personality and other phenomena, and (3) it is an efficient model that provides a simple outline of general personality structure (De Raad and Perugini, 2002). In recent years, numerous studies have associated the FFM domains with psychiatric disorders, particularly personality disorders (Widiger and Costa, 2002;Mullins-Sweatt and Widiger, 2006), and have provided empirical support for the value of understanding the DSM-IV (e.g., Saulsman and Page, 2004) and DSM-5 (e.g., Trull and Widiger, 2013) personality disorders in terms of the FFM traits.
Other studies have found that the FFM traits may be a key mediator in the utilization, time course, and effectiveness of various treatments of mental disorders (Hopwood et al., 2008). Furthermore, some research suggests an interaction between FFM traits and the modalities of treatment for mental disorders (Few et al., 2010). For example, Miller et al. (2006) examined the relationship between FFM traits and treatment utilization in depression, anxiety, and personality disorders. Their results showed openness to experience and conscientiousness significantly predicted the number of therapy sessions needed, and treatment satisfaction, and compliance. Moreover, they found medication use was significantly associated with low scores on extraversion and high scores on agreeableness. Although these preliminary results are limited, they support the potential utility of the FFM in treatment planning (Hopwood et al., 2008). However, the application of the model requires appropriate instruments to measure the FFM traits in psychiatric populations.
The NEO Personality Inventory-Revised (NEO-PI-R; Costa and McCrae, 1992) is one of the most frequently used questionnaires in the literature to assess the FFM domains. The results obtained with this measurement instrument have been consistent with the FFM in samples of different ages (De Fruyt et al., 2009;Spence et al., 2012) and from different cultures and countries (Rolland et al., 1998). There are two abbreviated versions of this self-report, the NEO Five Factor Inventory (NEO-FFI; Costa and McCrae, 1992) and, the more recent, NEO Five Factor Inventory-Revised (NEO-FFI-R; McCrae and Costa, 2004). Both consist of 60 items selected from the 240-item NEO-PI-R that assess the FFM traits of personality at the domain level. These brief versions are widely used in the literature because they measure the personality traits in less time and use fewer items than the NEO-PI-R (Hosie et al., 2014). Aluja et al. (2005Aluja et al. ( , 2009 found that the psychometric properties of the Spanish NEO-FFI and NEO-FFI-R adaptations are equivalent to the English original in non-clinical Spanish samples but it is not yet known how well they function, at psychometric level, with Spanish psychiatric samples. Some research suggests that instruments assessing FFM traits in non-clinical samples are valuable for assessing personality in psychiatric samples but results are inconclusive (Markon et al., 2005;Samuel et al., 2010).
In summary, extensive research highlights the need for brief and psychometrically reliable and valid instruments to assess the FFM personality traits in both non-clinical and clinical populations. Psychometric models based on item response theory (IRT), such as the Rasch Model (RM; Rasch, 1960), can provide more efficient personality measures and can improve existing measurement instruments (Inchausti et al., 2014). The RM is an alternative approach to Classical Test Theory (CTT), which solves some of CTT's methodological drawbacks (Wright and Stone, 1979). For example, it can be used to examine the coherence of items with regard to the latent trait in question, allowing the construct validity of the questionnaire to be assessed. A further advantage is that, because participants and items are measured along the same continuum, it is easy to identify which items have been endorsed by which participants (Wilson, 2005). The aim of this study was to analyse the psychometric properties of the Spanish NEO-FFI-R using Rasch analyses in order to test its rating scale functioning, the reliability of scores, internal structure, and differential item functioning (DIF) by gender in a large psychiatric sample.

Participants
Participants were 433 Spanish adults (154 males) with an anxiety disorder as primary diagnosis according to DSM-IV-TR criteria (American Psychiatric Association, 2000). The mean age was 36.45 (SD = 14.14) and the primary diagnosis distribution was as follows: 122 patients (29%) had a diagnosis of panic disorder, 97 (22%) had a diagnosis of social phobia, 68 (16%) had a diagnosis of obsessive-compulsive disorder, 57 (13%) had a diagnosis of generalized anxiety disorder, 46 (11%) had a diagnosis of specific phobia, and 43 (9%) had a diagnosis of posttraumatic stress disorder. These diagnoses were established using the Structured Clinical Interview for DSM-IV-TR Axis I Disorders (First et al., 2002). Participants with psychotic symptoms, substance abuse, personality disorders, and suspected intellectual disability associated with the anxiety disorder were excluded. Most participants (67%) had one or more additional diagnosis, including other anxiety, and mood disorders; 82.3% of participants had completed secondary school, 51.1% had completed high school, and 13.6% had completed university studies. The mean number of years of education was 13.96 (SD = 5.76).

Measures
The NEO-FFI-R and a brief sociodemographic questionnaire were administered to all participants in the context of a general clinical assessment. The Spanish version of the NEO-FFI-R (McCrae and Costa, 2004) contains 60 items selected from the NEO-PI-R (Costa and McCrae, 1999) which are summed to measure personality at the superordinate level only. Each of the five personality traits is measured using a 12-item scale and each item is rated on a 5-point Likert scale, ranging from strongly disagree (SD) to strongly agree (SA). Psychometric properties of the NEO-FFI-R scores have been previously analysed using CTT in a non-clinical Spanish sample obtaining good reliability indexes and factor structures in line with the results reported using the English original (Aluja et al., 2005(Aluja et al., , 2009).

Procedure
Participants completed the measures during general clinical assessments in the Mental Health Services of Badajoz (MHSB, Spain). Participants were informed about the research and, after signing the consent form, were asked to complete anonymous questionnaires. They received no type of incentive for taking part in the study. The measurement instruments were always completed under the supervision of a researcher. This study was approved by the Research and Ethics Committees at MHSB.

Data Analyses
Rasch analysis is a specific approach to construct modeling within the IRT framework. The RM provides a way of relating item difficulty and respondent characteristics. When applied to the measurement of psychological constructs, terms such as an 'item scale value' and an 'individual's attitude toward something' can be represented in item and respondent locations. In the measurement of personality using the NEO-FFI-R, item scale value relates to the likelihood of a particular trait being endorsed, and an individual's attitude refers to the amount that a person endorses a personality trait; the RM placed both of these on the same latent continuum.
The assumptions of the RM are different to those of many psychometric methods. For example, the model assumes that the way in which people respond is probabilistic. Thus, more questions on the neuroticism dimension of the NEO-FFI-R will be endorsed by a person with a higher level of neuroticism than a person with a lower level neuroticism, and indicators commonly reported are more likely to be endorsed than items that are rarely reported. Furthermore, it is assumed that the metric of the underlying construct is reflected by the fact that people respond to categories in an ordered manner. For each item on the NEO-FFI-R, indicator severity influences the probability of a person endorsing a response category that is high on the scale. Individual differences within the sample, such as the gender of the responder, should not affect the probability of endorsing a question (DIF). If the responses to a questionnaire meet the assumptions of the RM, it can be determined that the questionnaire has good construct validity and functions as a true interval-level measure of a latent variable.
The Rasch analyses were performed for the five NEO-FFI-R scales, using the software Winsteps (Linacre and Wright, 2000;Linacre, 2013). First, the quality of the response categories were tested with the Rating Scale Model (RSM; Andrich, 2013), an extension of the RM for polytomous items. This model gives an interpretation of category ordering in rating formats, by inferring a space of experimentally independent Bernoulli variables, characterized by Rasch's simple logistic model, from a complex of mathematical relationships among response spaces (Andrich, 2013). Linacre (2002) has proposed several criteria for diagnosing a malfunctioning empirical rating scale. A scale can be considered to be performing at an optimal level when: (a) all the categories are used frequently to estimate step calibrations, or when there are unimodal or bimodal distributions with the highest frequencies in the extreme categories. (b) The average person measures by category move up the rating scale monotonically. (c) The fit of persons, items, and categories can be assessed using averaged residuals. The degree of fit is indicated by the statistics Outfit, the averaged standard squared residuals, and Infit, the averaged standard squared residuals, weighted by the information function. For both statistics, the expected value is 1. Values higher than this point to patterns that are abnormal with respect to the model and values lower than this indicate overfit, i.e., response patterns that are deterministic. When empirical data are not predictable from the model, values will be substantially higher than 1. Linacre (2002) also states that category misfit is indicated by Outfit values of more than 2. (d) Within the variable there should be an instance in which the probability of responding in a category is higher than the probability of choosing any other category, i.e., within the adjacent categories the step calibrations must advance monotonically. It is also recommended that step difficulties should advance by at least 1.4 logits and by no more than 5.0 logits.
After testing the rating scale, the fit of the data to the RM was analysed in the five NEO-FFI-R scales. The assumption of unidimensionality was examined using Principals Components Analysis of Rasch measures and residuals. It can be affirmed that the data are essentially one-dimensional if the Rasch measurement shows a moderately high percentage of explained variance (at least 20%) and the first residuals components of the unexplained variance are less than 2. Finally, DIF by gender analyses were conducted in order to probe the generalized validity. DIF was considered to be present if there were significant differences of more than a half logit between the difficulty parameters in males and females. An item presents DIF when the probability of a score in individuals with the same level in the latent trait varies according to the group to which they belong (e.g., gender). The standardized localization parameter differences by gender were calculated after possible sample-related differences in the distribution of the NEO-FFI-R scales were adjusted for and a Bonferroni multiple-comparison correction of the chosen significance level was made (Linacre, 2013).

Rating Scale Functioning
Testing of the quality of the response categories with the RSM indicated that the category thresholds were disordered in the Neuroticism and Conscientiousness scales (see Table 1). In addition, the average person measures by category advanced monotonically with the rating scale but the step calibrations did not advance monotonically with the categories in the Conscientiousness scale. The average person measures and the step calibrations were both disordered in the Neuroticism scale. Thus, in both NEO-FFI-R scales, there was no interval wherein the probability of being observed (or responding) in some of the categories was higher than the probability of choosing any other one.    Table 2. The item fit values indicated each scale had items that did not achieve moderate to high discrimination on its dimension (n = 7), particularly in the Agreeableness scale (n = 3). It should be highlighted, these items with RM misfit also had minor correlations with the total score for their scales in logits, especially in two items from Agreeableness (r id < 0.10). Furthermore, the item relating to 'manipulation' showed a negative correlation with its dimension (r id = −0.08), which indicates that it was not in the same polarity as the scale (Linacre and Wright, 2000).
Concerning unidimensionality, the proportion of empirical variance explained by the Rash measures was lower than  and their variance component scree plots showed large components of unexplained variance with values for the first components higher than 2 (2.6, 2.8, 2.7, 3.5, and 2.0, respectively). These results reveal that the unidimensionality assumption was not met in any of the NEO-FFI-R scales and suggest that several latent variables were present. Wright maps, showing the conjoint representation of participants and items along the personality variables, can be seen in Figure 2. These maps display person ability and item difficulty estimates along interval level scales, so that the units between items, between participants and between participants and items can be read in terms of the represented variable (i.e., the five traits of personality in this case). Person and item mean (M), 1 SD (S) and 2 SDs (T) are located at the left and right sides of the axis, respectively, in order to facilitate the interpretation of the maps. As can be seen in Figure 2, participants' levels of Neuroticism and Agreeableness were much higher than the difficulty of the items. By contrast, levels of Extroversion within the sample were below the value of the items. These results were expected, due to the characteristics of the patients and indicate the need to include items more suited to clinical populations.

Analysis of DIF by Gender
After testing the fit of the data to the RM, DIF analyses were conducted to investigate the external validity of the NEO-FFI-R for measuring participants of different gender. DIF by gender was detected in many items (n = 17) from the five scales, violating the criteria proposed by Linacre (2013). A data summary of these analyses is presented in Table 3. As can be seen in Table 3, the scales which reflected less DIF by gender were Extraversion (n = 1) and Neuroticism (n = 3). Positive DIF contrast values indicate that the item was more difficult for females (n = 11), and negative DIF contrast values imply that the item was more difficult for males (n = 6).

Discussion
The testing of the NEO-FFI-R scales with the RM showed firstly that its response categories did not work appropriately on the Conscientiousness and Neuroticism scales; 'neutral' and 'agree' categories did not work well in both scales. These results suggest  Rasch-Welch's t contrast; p, probability. it would be advisable to recalibrate the rating scale for use with clinical populations. Similarly, Spence et al. (2012) found that extreme categories (i.e., 'strongly agree' and 'strongly disagree') are generally the most commonly used NEO-FFI responses in adolescents. This suggests that intermediate categories may have only a limited utility. Thus, the rating scale is not used in the same way by all participants and this may also have played a role in the lack of unidimensionality obtained in this study, since response styles may act as a second latent variable which, in addition to trait, influence item responses (Austin et al., 2006). It is also important to note that Wetzel and Carstensen (2014) have recently shown that disordered thresholds do not impair trait measurement using the Partial Credit Model. In this regard, whether thresholds and categories are ordered with respect to the average trait estimates, are two different aspects. As can be seen for Conscientiousness in Table 1, the average trait estimates per category can still be ordered despite disordered thresholds. By contrast, for the Neuroticism scale the average trait estimates per category did not increase monotonically, indicating a major violation of model assumptions. A possible solution to recalibrate the rating scale in the Neuroticism and Agreeableness scales could be to combine both malfunctioned adjacent response categories into one. However, even though disordered thresholds may be caused by several different factors, including response styles, they are likely to be caused by a failure of the hypothesis behind the items (Linacre, 2002). Secondly, the scores on the Agreeableness and Neuroticism scales were not sufficiently reliable. An explanation for these results could be the lack of items tailored to the characteristics of the sample. For example, Samuel et al. (2010) found that the items of the Neuroticism scale provided more psychometric information at the lower levels of the latent trait, and were poorer discriminators of people with high levels of Neuroticism. The current study suggests that the inclusion of more "difficult" items in the Neuroticism scale and "easier" items in the Agreeableness scale may improve their reliability indexes.
Furthermore, the fact that no NEO-FFI-R scale satisfied the unidimensionality assumption indicates that there were unknown variables that had not been taken into account, and that these variables interfered with the measurement. Once again, it was the Neuroticism and Agreeableness scales that performed particularly poorly, obtaining the worst dimensionality values. It may be that many of the trait indicators failed to discriminate the latent traits because the items were not referencing thoughts or behaviors that are relevant to people with anxiety disorders (Karsten et al., 2012). Another explanation could be found in the difficulties with item comprehension (e.g., McCrae et al., 2005).
Thirdly, the analysis of internal structure detected misfit according to the model in several items. These items also correlated weakly with the personality constructs that they were designed to asses. In the case of the Agreeableness scale, one item negatively correlated with its dimension. It is quite possible that these results were a consequence of social desirability, which could explain the poor psychometric values for the item 'manipulation.' Indeed, convergence between the Agreeableness scale and social desirability measures have been described in adults (Stöber, 2001).
Lastly, the DIF analysis revealed that gender influenced the difficulty of many items. The analysis suggests the NEO-FFI-R scores cannot be generalized between males and females, especially in the Openness, Conscientiousness, and Agreeableness scales. That is, the items of these scales were not measurement invariant for men and women. In addition, the direction of DIF was not balanced (11 items favored men and only six items favored women). Since DIF for gender may be explained by multidimensionality, other variables such as response style (Reise et al., 2001), educational level or psychopathological symptoms (e.g., Samuelsen, 2008) should be considered in future studies exploring the DIF by gender. On the other hand, crosscultural research has revealed significant gender differences in the responses to the NEO-PI-R, i.e., females often report higher Neuroticism, Agreeableness, warmth (Extraversion) and openness to feelings, and males frequently report higher assertiveness (Extraversion) and openness to ideas (Costa et al., 2001). Hence, these gender differences should be taken into consideration when the measuring instruments are constructed, and the items with measurement bias by gender should be removed.
The results found in the present study should be interpreted in light of the following limitations. All participants in the study had a primary diagnosis of an anxiety disorder. As generalizability from one psychiatric population to another is often limited, the degree to which our sample is representative of other psychiatric populations is unknown. The sample size of this study did not permit the analysis of DIF according to diagnosis. A further limitation is that only one measurement instrument was used and it was not possible to control for clinical variables such as levels of anxiety, mood, or treatment effects. Although previous evidence has shown that instruments assessing FFM personality traits in non-clinical samples could be valuable for assessing personality in psychiatric samples (e.g., Markon et al., 2005), the results of the current study suggest that their scores are unreliable in a sample of adults with anxiety disorders. It should also be noted that other studies have reported problematic psychometric properties for NEO family tests in samples from the general population (Reise et al., 2001), indicating that the results of this study may not be specific to clinical samples but may indicate more general problems with the measuring instrument. In conclusion, a complete review of the NEO-FFI-R should be performed before it can be used reliably in clinical contexts, especially with the Neuroticism and Agreeableness scales, which may need more development and testing.