Confirmatory Factor Analysis of the Inventory of Personality Organization-Reality Testing Subscale

The reality testing dimension of the Inventory of Personality Organization, the IPO-RT, has emerged as an important index of proneness to reality testing deficits. However, to date few studies have examined the factorial structure of the IPO-RT in isolation. This is an important and necessary development because studies use the IPO-RT as a discrete measure. Additionally, psychometric evaluation of the IPO suggests alternative factorial solutions. Specifically, recent work supports multidimensionality, whereas initial IPO assessment evinced a unidimensional structure. Accordingly, this study, using a heterogeneous sample (N = 652), tested the fit of several factorial models (one-factor, four-factor oblique, second-order, and bifactor) via maximum likelihood with bootstrapping due to multivariate non-normality. Analysis revealed superior fit for the bifactor solution (correlated errors) (CFI = 0.965, SRMR = 0.036, RMSEA = 0.042). This model comprised a general reality testing dimension alongside four subfactors (auditory and visual hallucinations, delusional thinking, social deficits, and confusion). Inter-factor correlations were in the moderate range. Item loadings and omega reliability supported the notion that the IPO-RT emphasizes a single latent construct. The model demonstrated invariance across gender and partial age invariance. Overall, from a psychometric perspective, the IPO-RT functioned effectively at both global and, to an extent, factorial levels. Findings recommend that the IPO-RT should be scored as a total scale, and rather than treat subscales independently, future studies should consider examining factor variance alongside overall scale scores.


Inventory of Personality Organization (IPO)
Researchers in discrete but related areas (i.e., psychopathology, personality, individual differences and parapsychology) commonly use the Inventory of Personality Organization (IPO) (Lenzenweger et al., 2001) to assess personality functioning (e.g., Prunas and Bernorio, 2016; Espinosa and Rudenstine, 2018). The IPO is a self-report measure that identifies and classifies personality pathology within clinical and non-clinical samples (Smits et al., 2009; Preti et al., 2015).
Conceptually, the IPO derives from Kernberg's (1984, 1996) psychodynamic model, which has a solid theoretical and diagnostic base. Kernberg (1984, 1996) proposed that general personality disorder originates from a convergence of neurobiologically mediated (e.g., temperament and aggression) and environmentally moderated (e.g., trauma and neglect) factors (Kernberg and Caligor, 2005). Within this conceptualization, personality organization determines pathology. Specifically, Kernberg (1984, 1996) postulated that position on the neurotic, borderline and psychotic realms determined level of personality dysfunction (Smits et al., 2009). Accordingly, the IPO defines level of personality disorder organization via three dimensions: reality testing, predominance of primitive psychological defenses and identity diffusion (Lenzenweger et al., 2012). Personality disorder manifests as a combination of these dimensions plus level of severity (Kernberg, 1996).
Each IPO dimension possesses its own unique attributes and the importance of these to psychological functioning and behavior is well-documented (Kernberg, 1975, 1984). Reality testing denotes, "the capacity to differentiate self from nonself, intrapsychic from external stimuli, and to maintain empathy with ordinary social criteria of reality" (Kernberg, 1996, p. 120). Extreme reality testing failure manifests as psychotic disorganization of thought and behavior. A fuller discussion of reality testing appears later.
Primitive psychological defenses represent protective propensities that distort and interfere with interpersonal interactions (e.g., splitting) (Wolfe and Mash, 2006). Splitting occurs in situations where individuals respond to emotional conflict or stressors by compartmentalizing opposite affect states (Koenigsberg et al., 2001). This process produces ambivalence because it is impossible to integrate contrasting affective states into a cohesive image of self or others. Hence, image perception alternates between polar opposites and the individual excludes balanced views from emotional awareness. Primitive psychological defenses are protective inclinations associated with more severe psychopathology (i.e., projection, denial, dissociation or splitting), which are distinct from healthier defenses (i.e., reaction formation, isolation, undoing, suppression, and repression) (Lenzenweger et al., 2001).
Finally, identity diffusion refers to the failure to develop a distinct identity. Explicitly, it denotes a lack of differentiated and integrated representations of the self and others (Sollberger et al., 2012). Typically, characteristics such as lack of cohesion in the subjective experience of self, boundary confusion and fragmentation are important features of identity diffusion. These manifest as difficulties with internalized value systems (norms, interests, ethics, and ideals) (Sollberger et al., 2012).
Within the three-factor solution, issues arise from the fact that the IPO-RT (the final dimension to emerge) explains insufficient independent variance. Consequently, literature advocates an alternative two-factor solution, where primitive psychological defenses and identity diffusion appear within a clustered dimension and reality testing forms a second factor (Berghuis et al., 2009). Alternatively, Ellison and Levy (2012) recommend a four-factor model, where dimensions represent instability across a range of domains: sense of self and other, goals, behaviors and psychosis. This structure acknowledges that key elements of personality organization, particularly those pertaining to representations of self and others, do not adequately fit a three-factor model (Ellison and Levy, 2012).
Discrepancies in IPO structure may arise from the use of different statistical procedures across studies. In the key psychometric evaluation of the measure, Lenzenweger et al. (2001) confirmed the superior fit of the three-factor solution (vs. alternative two-factor, one-factor, and null models) by conducting a series of confirmatory factor analyses. In contrast, Ellison and Levy (2012) scrutinized factor structure and criterion relations via exploratory structural equation modeling and multiple regression. Item adaptation within translation papers and modifications to item number as part of IPO evolution further complicate structural interpretation. For instance, Berghuis et al. (2009) investigated the dimensionality of the IPO-NL by means of principal component analysis with varimax rotation. Regarding shortened forms of the IPO, Verreault et al. (2013) tested the factorial structure of a 20-item abbreviated version using confirmatory factor analyses. Correspondingly, Smits et al. (2009) used confirmatory factor analyses to develop the IPO-R, a shortened version of the IPO.

Inventory of Personality Organization-Reality Testing (IPO-RT) Subscale
The current paper examined the content and factorial structure of the reality testing dimension of the IPO (IPO-RT). This was necessary because the IPO-RT has developed into a standalone measure of proneness to reality testing deficits (see Drinkwater et al., 2012; Dagnall et al., 2015). Use of the IPO-RT as a standalone measure dates back to Irwin's investigation of the relationship between reality testing and belief in the paranormal (Irwin, 2003, 2004). In his formative study, Irwin (2003) employed the Bell Object Relations and Reality Testing Inventory (BORRTI) (Bell et al., 1985; Bell, 1995). Using BORRTI, Irwin (2003) reported that paranormal beliefs predicted the tendency to distort internal and external reality. Subsequent consideration of BORRTI revealed that the measure explicitly indexed paranormal content (Irwin, 2003, 2004). This was also true of the other established measure of reality testing, the Borderline Personality Inventory (Leichsenring, 1999). Noting conceptual overlap between BORRTI and the Revised Paranormal Belief Scale (RPBS), Irwin (2003, 2004) cautioned that shared variance might have inflated the relationship between reality testing and belief in the paranormal. Consequently, Irwin (2004) adopted the IPO-RT because it was free of explicit paranormal content.
In the context of belief in the paranormal, several researchers view the IPO-RT as a measure of information processing style rather than psychotic phenomena. This judgment derives from the notion that the IPO-RT provides a representative assessment of evaluative processes as defined by Langdon and Coltheart's (2000) explanation of belief generation (Irwin, 2004).
Initial psychometric evaluation of the IPO-RT suggested the subscale was unidimensional. Acknowledging this, Irwin (2004) stated that although the subscale indexes a range of reality testing aspects, the IPO-RT probably provides an oversimplification of domain content. Recent findings have challenged the notion that the IPO-RT is unidimensional by identifying potential underlying dimensions. Dagnall et al. (2017), in their study examining the cognitive-perceptual basis of belief in urban legends and the paranormal, performed an exploratory factor analysis with oblique (promax) rotation on the IPO-RT. Exploratory factor analysis indicated a multidimensional four-factor solution accounting for 55% of variance, which was supported via confirmatory factor analysis (CFA). The identified factors were consistent with the theoretical underpinnings of reality testing deficits (Bell et al., 1985; Caligor and Clarkin, 2010). Emergent factors were factor 1, 'hallucinations' (auditory and visual); factor 2, 'delusional thinking' (beliefs contrary to reality); factor 3, 'social deficits' (difficulties reading social cues); and factor 4, sensory/perceptual 'confusion' (inability to understand feelings and sensations).
The suggestion that reality testing is multidimensional rather than unitary is not new. For example, Ellison and Levy (2012) using exploratory structural equation modeling found IPO-RT items split into two clusters. Their 'psychosis' factor most closely corresponded to the IPO-RT and contained items restricted largely to the pathological pole of reality testing (i.e., hallucinations and delusions) (Kernberg, 1975). Questions related to milder forms of reality testing difficulties, specifically maintaining a grasp on reality testing (e.g., "I can't tell whether I simply want something to be true, or whether it really is true"), loaded on the 'instability of self and others' factor.

The Present Study
The current paper assessed the psychometric structure of the IPO-RT in isolation. This is important for two main reasons. Firstly, several recent papers have used the IPO-RT as a standalone measure of proneness to reality testing deficits (e.g., Dagnall et al., 2015). Hence, it is important to examine how the IPO-RT functions in this specific context. Secondly, the scale structure from a reality testing perspective will be unaffected by other IPO subscales. Hence, the analysis permitted a cleaner, uncontaminated evaluation of IPO-RT content. Clearly, shared variance within the IPO structure is likely to influence subscale loadings. This was evident within the Ellison and Levy (2012) paper.
Accordingly, an assessment of IPO-RT model fit was undertaken. This compared unidimensional (Lenzenweger et al., 2001) vs. multidimensional structure via consideration of a progressive hierarchy of competing models. Specifically, the models were a one-factor model for a strict unidimensional assessment, a correlated multidimensional solution testing Dagnall et al.'s (2017) model, a second-order solution examining whether a latent general reality testing factor existed in addition to multiple dimensions, and a bifactor model examining the unidimensional vs. multidimensional argument in a single analysis (Reise et al., 2010). Bifactor models depict factors as orthogonal. Additionally, bifactor models assess the relative strength of a general underlying factor in comparison to multiple factors (Reise et al., 2010). Subsequent analysis evaluated IPO-RT structure stability using invariance testing in relation to age and gender. Related studies have tended to focus on the IPO as a composite scale and failed to consider IPO-RT invariance (Verreault et al., 2013). Establishing invariance across groups indicates that observed mean differences are unlikely to be an artifact of measurement bias, and instead reflect true mean differences.
Invariance testing is an important means of assessing IPO-RT performance across sub-groups. Consistent with previous research on related measures (e.g., Bell et al., 1985;Preti et al., 2015), invariance tests assessed age and gender. Specifically, Bell et al. (1985) tested for age and gender bias in relation to the Bell Reality Testing Inventory, and Preti et al. (2015) assessed gender invariance for the IPO. Gender, in particular, is an important factor to consider when scrutinizing the psychometric properties of the IPO-RT because gender differences should technically not exist if the measure is an accurate index of personality pathology according to Kernberg's object-relations model (Kernberg, 1984). In addition to established gender invariance for related measures (Preti et al., 2015), research typically reports non-significant mean gender differences on the IPO-RT (Lenzenweger et al., 2001).

Participants
Merging independent IPO-RT data sets from previously published studies (Dagnall et al., 2014) and articles in production created a heterogeneous sample of 652 respondents. Mean (M) sample age was 28.63 years (SD = 12.41, range = 18-87 years). Sample disaggregation by gender specified that 245 (38%) respondents were male (M age = 30.51, SD = 13.59, range = 18-87 years) and 407 (62%) were female (M age = 27.50, SD = 11.52, range = 18-77 years). Recruitment was via emails to staff and students (undergraduate and postgraduate) enrolled on healthcare programs (Nursing, Physiotherapy, Psychology, Speech and Language Therapy, etc.) at a United Kingdom university, and local businesses. Participation occurred between January 2014 and September 2016 (see "Ethics" section). Instructions prevented multiple responses by informing participants not to complete the study if they had participated in similar research.
Several researchers have previously evaluated scale structure using this approach. Examples include Lange et al.'s (2000) top-down purification of the RPBS; Roets and Van Hiel's (2011) Need for Closure Scale validation; and Drinkwater et al.'s (2017) assessment of RPBS dimensionality.

Materials
The only study measure was the IPO-RT (Lenzenweger et al., 2001). This is the reality testing subscale of the IPO (Lenzenweger et al., 2001), which is used frequently as a standalone scale to assess proneness to reality testing deficits (Irwin, 2004; Dagnall et al., 2017). Specifically, the IPO-RT indexes "the capacity to differentiate self from non-self, intrapsychic from external stimuli, and to maintain empathy with ordinary social criteria of reality" (Kernberg, 1996, p. 120). This conceptualization is congruent with Langdon and Coltheart's (2000) account of belief generation, which emphasizes information-processing style rather than psychotic symptomology (Langdon and Coltheart, 2000; Irwin, 2004). The IPO-RT comprises 20 items presented as statements (e.g., "I can't tell whether certain physical sensations I'm having are real, or whether I am imagining them"). Respondents indicate agreement with each statement via a five-point Likert scale (1 = never true to 5 = always true); hence, total IPO-RT scores range from 20 to 100, and higher scores indicate proneness to report experiences of reality testing deficits. The IPO-RT possesses construct validity, good internal consistency and test-retest reliability, indicating it is a largely psychometrically sound measure (Lenzenweger et al., 2001). However, the fact that studies have failed to establish factor invariance across countries limits generalizability across national samples. Particularly, it suggests cultural differences in the interpretation and comprehension of IPO items (Tucker et al., 2006).
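The scoring rule described above can be sketched in a few lines. This is a minimal illustration only, assuming simple summation of the 20 responses with no reverse-keyed items (the text does not mention any); the function and constant names are hypothetical and not part of any published scoring program.

```python
from typing import Sequence

N_ITEMS = 20                 # number of IPO-RT items, as stated in the text
MIN_RESP, MAX_RESP = 1, 5    # 1 = never true ... 5 = always true


def ipo_rt_total(responses: Sequence[int]) -> int:
    """Return the IPO-RT total score (possible range 20 to 100).

    Assumes plain summation; raises ValueError on incomplete or
    out-of-range responses rather than imputing anything.
    """
    if len(responses) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} responses, got {len(responses)}")
    for r in responses:
        if not MIN_RESP <= r <= MAX_RESP:
            raise ValueError(f"response {r} outside {MIN_RESP}-{MAX_RESP}")
    return sum(responses)
```

A respondent answering "never true" throughout scores 20, and one answering "always true" throughout scores 100, matching the stated range.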

Procedure
Respondents completed the IPO-RT alongside measures assessing anomalous beliefs, cognitive-perceptual personality factors and decision-making. The basic procedure across studies was standardized. Before taking part, the researchers presented prospective respondents with detailed background information. The brief outlined the nature of the study and delineated ethical procedures. If respondents agreed to participate, they registered informed consent and received the materials. Procedural instructions then directed respondents to consider questions carefully; work through the items systematically, at their own pace; respond to all questions; and answer in an honest and open manner. Questionnaire section order rotated in order to prevent order effects. Alongside item endorsement respondents forwarded basic demographic information (preferred gender, age, etc.).

Ethics
As preparation for grant bids (September 2012, 2014, and 2016) the researchers obtained ethical authorization for a series of studies investigating anomalous beliefs, cognitive-perceptual personality factors and decision-making. Each submission was "routine" and accordingly ratified (methodological and ethical) by the Director of the Research Institute for Health and Social Change (Faculty of Health, Psychology and Social Care) within Manchester Metropolitan University. This was the required level of ethical clearance. Additionally, prior to submission, research proposals are peer-reviewed by members of the Professoriate (or equivalent). This process includes ethical and methodological scrutiny. Finally, the Head of the Psychology Department sanctioned the projects. Formal submission to a university ethics panel is not an institutional requirement for routine studies.
The correlated four-factor model assumed that reality testing was multidimensional and explained by obliquely related dimensions. Contrastingly, the second-order model derived from the notion that factors were uncorrelated and representative of a general reality testing construct. Finally, the bifactor model reconciled the unidimensional and multidimensional alternatives by advocating that IPO-RT items loaded on four subfactor dimensions and a general factor.
Model parameter appraisal used maximum likelihood (ML) estimation. Multiple indices including chi-square test, Comparative Fit Index (CFI), Standardized Root-Mean-Square Residual (SRMR), and Root-Mean-Square Error of Approximation (RMSEA) evaluated model fit. Using a range of indices ensures robust assessment of model fit.
Generally, a non-significant chi-square signifies good data-model fit. However, the statistic is sensitive to sample size and thus insufficient as a standalone CFA index (Byrne, 1994). Accordingly, model evaluation also referenced CFI, SRMR, and RMSEA. Good fit thresholds for these indices are CFI > 0.90, SRMR < 0.08 and RMSEA < 0.08 (Browne and Cudeck, 1993). A CFI above 0.87 and SRMR and RMSEA values below 0.10 indicate marginal fit (Bong et al., 2013). For RMSEA the 90% confidence interval (CI) was included. Additionally, for model comparison, analysis considered Akaike's Information Criterion (AIC), with lower values indicative of superior fit.
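To make these criteria concrete, the sketch below computes RMSEA and CFI from chi-square values and classifies a model against the thresholds quoted above. It is an illustration under stated assumptions, not the software used in the study: the RMSEA formula uses the common (N - 1) formulation, the CFI function requires a baseline (independence) model chi-square that is not reported in this text, and all function names are hypothetical.

```python
import math


def rmsea(chi_sq: float, df: int, n: int) -> float:
    """Point estimate of RMSEA using the (N - 1) formulation."""
    return math.sqrt(max(chi_sq - df, 0.0) / (df * (n - 1)))


def cfi(chi_sq: float, df: int, chi_sq_base: float, df_base: int) -> float:
    """Comparative Fit Index relative to a baseline (independence) model."""
    d_model = max(chi_sq - df, 0.0)
    d_base = max(chi_sq_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base


def classify_fit(cfi_value: float, srmr_value: float, rmsea_value: float) -> str:
    """Apply the good / marginal thresholds cited in the text."""
    if cfi_value > 0.90 and srmr_value < 0.08 and rmsea_value < 0.08:
        return "good"
    if cfi_value > 0.87 and srmr_value < 0.10 and rmsea_value < 0.10:
        return "marginal"
    return "poor"
```

Applied to the chi-square reported for the bifactor solution in this paper (390.864 on 150 degrees of freedom, N = 652), `rmsea` returns approximately 0.050, in line with the reported value.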
For each model, Modification Indices (MI) indicated the extent to which chi-square would improve if constrained parameters covaried. MI values higher than 20 related to subfactor items were inspected (Rossier et al., 2012). Although statisticians typically recommend against covarying subfactor item errors, covariance in the present study was justified because some subfactor items possessed similarities in item content (Byrne, 2010).
Following model specification and testing, Cronbach's alpha examined internal consistency of the IPO-RT. In addition to alpha, coefficient omega (ω) and omega hierarchical (ωh), estimated with the Omega program (Watkins, 2013), assessed reliability; these coefficients can more accurately capture the reliability of bifactor solutions (Brunner et al., 2012). Coefficient omega calculates the reliability of a latent factor combining specific and general factor variance. Omega hierarchical calculates the reliability of a latent factor without factoring in the variance from other specific and general factors.
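The distinction between the two coefficients can be illustrated with the standard formulas for a bifactor model with standardized loadings. The sketch below is not the Omega program itself; it assumes orthogonal factors, no correlated errors, and item uniqueness equal to one minus the sum of squared loadings. The function name and the toy loadings are hypothetical.

```python
from typing import Dict, List, Tuple


def bifactor_omegas(
    general: List[float],
    specific: Dict[str, Dict[int, float]],
) -> Tuple[float, float]:
    """Return (omega, omega_hierarchical) for the total scale.

    general  -- standardized general-factor loading for each item
    specific -- factor name -> {item index: standardized specific loading}
    """
    general_part = sum(general) ** 2
    specific_part = sum(sum(loads.values()) ** 2 for loads in specific.values())

    # Unique variance of each item: 1 minus its squared loadings.
    explained = [g ** 2 for g in general]
    for loads in specific.values():
        for item, loading in loads.items():
            explained[item] += loading ** 2
    unique_part = sum(1.0 - e for e in explained)

    total_variance = general_part + specific_part + unique_part
    omega = (general_part + specific_part) / total_variance
    omega_h = general_part / total_variance
    return omega, omega_h


# Hypothetical loadings for a four-item, two-subfactor toy scale.
toy_general = [0.6, 0.6, 0.6, 0.6]
toy_specific = {"A": {0: 0.4, 1: 0.4}, "B": {2: 0.3, 3: 0.3}}
omega, omega_h = bifactor_omegas(toy_general, toy_specific)
```

Because omega hierarchical keeps only general-factor variance in the numerator, it can never exceed omega; the size of the gap indicates how much reliable variance the subfactors contribute beyond the general factor.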
To assess invariance of the superior factor solution, multigroup CFA examined an increasingly restrictive set of models in relation to gender (male vs. female) and age (below 24 vs. above 24). A median split analysis informed the decision to separate the sample at 24 years of age, a method utilized in previous research (Allan et al., 2015). Analysis tested configural, metric and scalar invariance models. Configural invariance assesses the degree to which the same factor structure holds across the groups of interest. Metric invariance examines whether the factor structure and factor loadings are invariant across groups. Scalar invariance examines factor structure, factor loadings and item intercepts. If a measure possesses invariance at the scalar level, mean differences are valid across tested groups and are not a result of measurement bias. When testing invariance in addition to demonstrating satisfactory model fit, CFI values should not change by more than 0.02 (Cheung and Rensvold, 2002). In large samples, due to its sensitivity, use of chi-square as an index for invariance is not advisable (Brown, 2006). Following invariance tests, MANOVA examined mean comparisons among the groups utilized for invariance testing: gender (men vs. women) and age (below 24 years vs. above 24 years).
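The stepwise decision rule can be expressed compactly. The following sketch assumes the 0.02 CFI criterion stated above and treats the configural model as the starting point; the function name and the example CFI values are illustrative only and do not reproduce the study's full set of results.

```python
from typing import Dict

LEVELS = ("configural", "metric", "scalar")


def highest_invariance_level(cfi_by_level: Dict[str, float],
                             max_drop: float = 0.02) -> str:
    """Return the most restrictive level whose CFI drop stays within max_drop.

    Each model is compared with the preceding, less constrained model.
    Adequate fit of the configural model is assumed, not checked here.
    """
    supported = LEVELS[0]
    for previous, current in zip(LEVELS, LEVELS[1:]):
        drop = cfi_by_level[previous] - cfi_by_level[current]
        if drop > max_drop:
            break
        supported = current
    return supported


# Illustrative values only.
example = {"configural": 0.940, "metric": 0.934, "scalar": 0.917}
level = highest_invariance_level(example)
```

When the scalar step fails, a partial invariance analysis would free individual intercepts and re-run the comparison, which this sketch does not attempt.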

Preliminary Analyses
Data screening prior to analysis identified 16 extreme scores. Excluding these scores left a total sample of 652. The average IPO-RT score was 36.79 (SD = 11.58). Kurtosis and skewness scores for the subfactors and total IPO-RT all fell within −2 and +2 (Byrne, 2010) (Table 1). An assessment of multivariate normality revealed Mardia's (1970) kurtosis coefficient to be 147.212 with a critical ratio of 63.357. The magnitude of this indicates that the data were multivariate non-normal, which can result in standard error biases (Bentler and Wu, 2005). Accordingly, analysis used ML estimation with bootstrapping (600 resamples) to generate accurate estimations of standard errors with accompanying confidence intervals (bias-corrected at the 95% confidence level) and p-values (Byrne, 2010). Research demonstrates that naïve bootstrapping is a robust alternative to other ML robust methods (e.g., the Satorra-Bentler chi-square), and performs effectively even under conditions of extreme non-normality (Nevitt and Hancock, 2001). The Bollen-Stine bootstrap p assessed fit in addition to indices of χ², CFI, SRMR, and RMSEA. Bollen-Stine gauges fit without normal theory limitations (Bollen and Stine, 1992), and p > 0.05 suggests excellent global fit.
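The relationship between the kurtosis coefficient and its critical ratio can be checked with a short calculation. The sketch assumes that the reported coefficient is the centered multivariate kurtosis (Mardia's statistic minus its expected value p(p + 2)), that its standard error is sqrt(8p(p + 2)/N), and that p = 20 items entered the analysis; these are assumptions about how the value was produced, and the function name is hypothetical.

```python
import math


def mardia_kurtosis_critical_ratio(centered_kurtosis: float,
                                   n_variables: int,
                                   n_cases: int) -> float:
    """Critical ratio (z) for centered multivariate kurtosis.

    Uses the asymptotic standard error sqrt(8 p (p + 2) / N).
    """
    standard_error = math.sqrt(8.0 * n_variables * (n_variables + 2) / n_cases)
    return centered_kurtosis / standard_error


# Values taken from the text: coefficient 147.212, 20 items, N = 652.
critical_ratio = mardia_kurtosis_critical_ratio(147.212, 20, 652)
```

Under these assumptions the function returns roughly 63.36, which agrees with the reported critical ratio and far exceeds conventional cut-offs for multivariate normality.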
Inter-correlations among all scale items were significant (Table 2), and the four subscales identified by Dagnall et al. (2017) possessed moderate to strong relationships (Table 1). All intercorrelations were below 0.90 suggesting no multicollinearity (Tabachnick and Fidell, 2001).
The bifactor solution possessed good data-model fit across all indices, χ²(150, N = 652) = 390.864, p < 0.001, CFI = 0.949, SRMR = 0.039, RMSEA = 0.050 (CI of 0.044 to 0.056). Bollen-Stine p = 0.002, suggesting poor fit; however, the majority of residual covariances were below two. For this model, items 9 and 16, 2 and 5, and 15 and 14 possessed high within-item error. Correlating error terms resulted in a significant improvement in overall fit, χ² difference (3, N = 652) = 78.740, p < 0.001. Bollen-Stine p = 0.003, suggesting poor fit; however, the majority of residual covariances were below two. A comparison of AIC statistics among the tested models revealed that the bifactor solution with correlated errors demonstrated superior fit (Table 3). The parameter estimates for the bifactor model demonstrated moderate to high factor loadings (i.e., >0.4) for all items relative to either a general factor or a subfactor (Figure 1). Specifically, loadings on the general factor were all greater than 0.4 and significant (p < 0.05), with lower 95% Confidence Intervals greater than 0.5. However, loadings on the subfactors did not all meet this threshold, specifically items 17 ( . This suggests that these items more directly predict general reality testing rather than delusional thinking.
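The chi-square difference test reported above can be reproduced with a small helper that returns the upper-tail probability of a chi-square statistic. The sketch uses closed-form expressions for integer degrees of freedom so that only the standard library is needed; the function name is hypothetical, and the calculation treats the difference as an ordinary chi-square variate, which is an approximation under multivariate non-normality.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_k (x/2)^(k - 1/2) / Gamma(k + 1/2)
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        if k > 1:
            term *= half / (k - 0.5)
        total += math.exp(-half) * term
    return total


# Chi-square difference reported in the text: 78.740 on 3 degrees of freedom.
p_difference = chi2_sf(78.740, 3)
```

The resulting probability is far below 0.001, consistent with the reported improvement in fit after correlating the three pairs of error terms.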
Accordingly, a test for partial scalar invariance was undertaken. This examined intercepts on an item-by-item basis and then excluded items with differing intercepts from the invariance testing. This process resulted in identification of the intercepts responsible for the non-invariance. Excluding the constraint for item 2 resulted in a scalar model with a CFI difference less than 0.02 relative to the metric model (0.917 vs. 0.934), supporting partial invariance at the intercept level across the two age groups. Bollen-Stine, however, inferred poor fit for configural (p = 0.003), metric (p = 0.002), and partial scalar (p = 0.002) models. The majority of residual covariances were below two, supporting good fit between the model and data.
For age, participants differed in relation to IPO-RT total and all subfactors, with the under 24 age group consistently scoring higher than the over 24 age group. Bootstrap estimates, using 'above 24 years' as a reference category, supported these differences. For IPO-RT total, BCa 95% CI of 3.612 to 8.270, p = 0.002; auditory and visual hallucinations, BCa 95% CI of 1.396 to 2.896, p = 0.002; delusional thinking, BCa 95% CI of 0.608 to 2.549, p = 0.003; social deficits, BCa 95% CI of 0.075 to 1.138, p = 0.022; and confusion, BCa 95% CI of 0.928 to 2.038, p = 0.002.
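For readers unfamiliar with bootstrap intervals for group differences, the sketch below shows the general idea using a plain percentile interval. It is a deliberate simplification: the intervals reported above are bias-corrected and accelerated (BCa), which adjusts the percentile cut-points, and the data here are invented. The function name and default settings (600 resamples, echoing the resample count used elsewhere in the analysis) are assumptions.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_mean_diff_ci(group_a: Sequence[float],
                           group_b: Sequence[float],
                           n_resamples: int = 600,
                           alpha: float = 0.05,
                           seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group independently, with replacement.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    lower_idx = max(int((alpha / 2) * n_resamples), 0)
    upper_idx = min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)
    return diffs[lower_idx], diffs[upper_idx]


# Invented scores for two hypothetical age groups (true mean difference = 6).
younger = [40 + (i % 7) for i in range(50)]
older = [34 + (i % 7) for i in range(50)]
ci_lower, ci_upper = bootstrap_mean_diff_ci(younger, older)
```

An interval that excludes zero, as in this toy example, corresponds to the pattern reported for the age comparisons, where the reference category was the older group.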

Analysis Summary
The CFA analyses indicate that the bifactor solution (with correlated errors) explains the data best in terms of model fit statistics. Moreover, the results support invariance for this model across gender and partial invariance across age. The correlated four-factor model and second-order model proved almost as good as one another in relation to data-model fit. The one-factor model did not explain the data well, clearly suggesting that a general reality testing factor on its own does not sufficiently represent these data.

DISCUSSION
Evaluation of IPO-RT models (one-factor, four-factor, second-order, and bifactor) found superior fit for the bifactor solution (with correlated errors). This model comprised a single general dimension alongside four distinct subfactors (auditory and visual hallucinations, delusional thinking, social deficits and confusion). Item loadings for the bifactor solution were acceptable at general and to an extent subfactor levels. Alpha and omega coefficients suggested satisfactory to excellent reliability for the general and specific factors. However, omega hierarchical supported the superiority of a general reality testing factor. Examination of subfactor content revealed conceptual coherence; items possessed commonality and related clearly to factor labels. Correlations between subfactors were in the moderate range; the strongest association (r = 0.67) was between auditory and visual hallucinations and delusional thinking. These factors represent responses at the pathological pole of the reality testing dimension (Kernberg, 1975). In addition, multi-group CFA suggested that despite the existence of mean differences in reality testing across gender (confusion subfactor only) and age, the superior bifactor model was invariant across gender in terms of factor structure, factor loadings and item intercepts. For age, results supported partial invariance. This indicates that differences in IPO-RT scores are (with the exception of item two pertaining to age) likely to reflect true mean differences as opposed to bias in measurement. Furthermore, gender mean differences were not apparent for IPO-RT total and the majority of subfactors, and the difference for confusion did not exist following bootstrapping. This is consistent with Kernberg's object-relations model, in which personality pathology indicates no gender differences (Kernberg, 1984).
Adoption of a bifactor IPO-RT model resolves previously reported structural differences and reconciles dimensionality debates (unidimensional vs. multiple factors). Indeed, the range of solutions identified in preceding articles provides support for the bifactor structure. In situations where data index both unidimensional (single common factor) and multidimensional latent (similar domain content) structures, psychometric analysis often produces ambiguity and structural variations (Reise et al., 2010).
With reference to the IPO-RT, this explains why researchers report different factorial solutions. Specifically, Lenzenweger et al. (2001) delineated the IPO-RT as unidimensional, whereas Ellison and Levy (2012) observed that IPO-RT items split between two factors corresponding to severity of reality testing deficit. Explicitly, milder reality testing difficulties loaded on a general 'instability of self and others' dimension, whereas items related to psychopathology converged into a separate 'psychosis' dimension. Furthermore, when analyzed as a standalone measure, Dagnall et al. (2017) observed the four-factor solution assessed in the current paper. Research with other measures of reality testing has also identified subfactors. For example, Bell et al. (1985) performed a factor analysis on the Bell Reality Testing Inventory and identified three dimensions of reality testing ego function (reality distortion, uncertainty of perception, and hallucinations and delusions).
The proposal of alternative models reflects the fact that measurement of complex psychological/pathological constructs necessitates the inclusion of a broad range of items (Reise et al., 2010). This requirement creates the conceptual paradox where items concurrently assess both a general factor and separate subfactors. In such circumstances second-order and bifactor models best explain data (Chen et al., 2006). This is certainly true of the IPO-RT when researchers use the measure as a standalone index of proneness to reality testing deficits. With reference to the IPO, different solutions may emerge due to shared variance between subfactors and the breadth of the reality testing construct. This issue of shared variance existed in the present study for the IPO-RT, evident in the sense that specific items (i.e., 17, 11, 5, and 4) loaded generally well on their designated factors across solutions but loaded poorly on these factors once they were examined in a bifactor context. A bifactor analysis helped to disentangle whether general vs. specific factors best explained items, revealing that a general factor accounted for the majority of variance. In practice, therefore, the use of unidimensional subscales is not recommended and the validity of such scales is debatable, given that the majority of variance shared between items pertaining to subscales is attributable to an underlying general factor.
Contrastingly, a one-factor model did not represent these data well, indicating that a general IPO-RT factor is not sufficient to account for all the variance in the measure. In addition, although this study highlights the significance of a general reality testing factor, the proposed subscales by Dagnall et al. (2017) are not completely invalid; a general factor explained the majority of variance, yet the four subfactors accounted for a non-redundant degree of variance. Correspondingly, though the validity of the subscales in isolation is questionable, they could be utilized in combination with total scores when administering the measure. This suggestion is in line with other studies observing greater data-model fit of bifactor solutions that emphasize the importance of a general factor relative to subfactors (e.g., Denovan et al., 2017;McElroy et al., 2018).
In addition, although bifactor modeling is increasingly used in the psychological and social sciences and provides an intuitive method of assessing unidimensionality vs. multidimensionality, it has received criticism. Specifically, bifactor models, by virtue of incorporating a general factor on which all items load and a greater number of free parameters (Murray and Johnson, 2013), can be subject to bias in favor of data-model fit and explained variance compared with traditional CFA solutions. Brouwer et al. (2013) revealed that an element of bias exists (i.e., cross-loadings favored a general factor vs. subfactors), but this effect was marginal. Nonetheless, further research into bias linked with bifactor modeling is apposite (McElroy et al., 2018).
Theoretically, however, a bifactor IPO-RT model is advantageous because it allows researchers to investigate the degree to which general and specific factors predict external variables. This is important since relationships with potentially related factors, such as schizotypy, may vary as a function of the degree to which items index pathology. Indeed, Bell et al. (1985) found only low correlations between reality distortion and uncertainty of perception subscales and most Brief Psychiatric Rating Scale (BPRS) symptom scales. Additionally, schizophrenics, schizoaffectives, and borderlines scored higher on the reality distortion and hallucinations and delusions dimensions. In this context, a bifactor model provides a framework for explaining the degree to which both item commonality and heterogeneity contribute to specific constructs (Gustafsson and Aberg-Bengtsson, 2010). Accordingly, the identification of distinct factors within the reality testing dimension facilitates the development and testing of more sophisticated models.
Generally, the present study demonstrated that the IPO-RT is a psychometrically robust scale that functions as a concise measure of propensity to report reality testing deficits. This provides further validation for studies utilizing the measure previously and subsequently (Lenzenweger et al., 2001;Irwin, 2004;Dagnall et al., 2017). Although the IPO-RT possesses psychometric integrity, the degree to which the measure actually corresponds to real-world situations has yet to be fully established. Additionally, because self-report responses index events retrospectively, they are prone to forgetting and distortion (Afflerbach and Johnston, 1984). These concerns are not particular to the IPO-RT but apply to self-report measures generally. Until research validates the IPO-RT against objective measures of proneness to reality testing deficits, such as the Rorschach inkblot method, it is safer to conclude that the IPO-RT indexes subjective evaluation of the perceived likelihood of reality testing errors. The Rorschach inkblot method is a reliable index of perception of reality accuracy (Hilsenroth et al., 1998). This approach is consistent with the notion that psychopathological construct validation requires repeated assessment over time via a range of methods (Mason, 2015). This is an important development because work in related psychopathological and cognitive domains has demonstrated that self-report measures designed to assess metacognitive processes often lack validity. For example, Searleman and Herrmann (1994) observed that self-report measures used to assess participants' awareness of memory processes were reliable but failed to accurately predict memory abilities and use of metacognitive strategies. This is because self-report measures, such as the IPO-RT, indirectly assess metacognitive processes. These are internal executive processes, which control, monitor and supervise cognitive processes (Sternberg, 1986).
Metacognitive processes are vital to all stages of cognitive performance: planning, monitoring, execution, and evaluation (Sternberg, 1986). In the current paper, self-report measures assessed metacognitive strategies. A potential limitation of this approach arises from the fact that strategies employed by individuals may not be fully accessible to conscious awareness and therefore not reportable (e.g., Nisbet and Ross, 1980;Kentridge and Heywood, 2000;Koriat and Levy-Sadot, 2000;Dijksterhuis et al., 2006). Consequently, self-report methods provide only a partial view of the potential operation of metacognitive processes. Ideally, the validity of self-report measures needs substantiating with other performance measures. This would be possible in situations where the operation of particular metacognitive strategies leads to definitive predictions on tests of cognitive performance (for example, the use of metacognitive monitoring on tests of memory, Dodson and Schacter, 2001). In situations similar to these, self-report measures are not required. Rather, inferences derive from actual performance measures. Experiments designed to exploit or promote the usage of particular metacognitive strategies would allow appropriate predictions regarding their deployment. In relation to the current work, and the use of self-report measures, a more complete understanding is achievable by assessing (i) the concordance between objective (performance) and self-report methods and (ii) the conditions under which these measures converge or diverge. Findings derived from self-report measures do not necessarily deviate from performance measures; rather, their degree of congruence may be subject to a number of situational (experimental) constraints. In this context, important future research needs to examine the degree to which IPO-RT scores correspond to other performance-based reality testing measures.
An additional limitation relates to the CFA estimation method used (i.e., ML). Beauducel and Herzberg (2006) compared ML estimation using Pearson correlations with weighted least squares mean and variance adjusted (WLSMV) estimation using polychoric correlations. WLSMV led to more accurate results for Likert scale-type data. Therefore, although bootstrapping helped to protect against standard error biases in this study, future research should consider using WLSMV estimation in factorial analyses of the IPO-RT.
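One reason polychoric-based estimation can suit Likert-type data is that Pearson correlations computed on coarsely categorized, skewed responses tend to underestimate the association between the underlying continuous variables. The following simulation is a minimal sketch of that attenuation only; it does not implement WLSMV or polychoric estimation. The latent correlation, sample size, seed, and cut points are hypothetical values chosen for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def to_likert(value, thresholds):
    """Map a continuous value to a category from 1 to len(thresholds) + 1."""
    return 1 + sum(value > t for t in thresholds)


def simulate(rho=0.6, n=20000, seed=1):
    """Return (Pearson r on latent scores, Pearson r on 5-point responses)."""
    rng = random.Random(seed)
    # Hypothetical, positively skewed cut points: most responses fall in category 1.
    thresholds = (0.5, 1.0, 1.5, 2.0)
    latent_x, latent_y = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        latent_x.append(z1)
        # Build a second standard normal variable correlated rho with the first.
        latent_y.append(rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    likert_x = [to_likert(v, thresholds) for v in latent_x]
    likert_y = [to_likert(v, thresholds) for v in latent_y]
    return pearson(latent_x, latent_y), pearson(likert_x, likert_y)


if __name__ == "__main__":
    r_latent, r_likert = simulate()
    print(f"latent r = {r_latent:.3f}, Likert r = {r_likert:.3f}")
```

In this sketch the correlation computed on the categorized responses falls below the latent correlation, which is the kind of distortion that polychoric correlations are designed to correct.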
Finally, further work needs to examine the temporal stability of the IPO-RT. Temporal stability is an important factor to consider when assessing the efficacy of self-report measures indexing personality disorders (Samuel et al., 2011). Evidence suggests that there are age-related differences in susceptibility to borderline personality disorder (BPD) (Zanarini et al., 2003). Specifically, clinical studies report relatively high remission rates (e.g., Zanarini et al., 2003). Extrapolating this finding to non-clinical samples, there is reason to believe that the tendencies both to experience and to report reality testing deficits are likely to change over time. Knowing the degree of alteration would indicate whether propensity to reality testing errors is dispositional or more transitory in nature. Dimensional scores showing consistency over time would indicate that reality testing, similarly to personality traits, represents a relatively stable individual difference in thinking/information processing style. Replicating the present results via test-retest reliability over lengthy periods would establish IPO-RT reliability and indicate whether the proposed bifactor solution is enduring. In this context, tests of scale stability and change are essential features of subsequent work.

AUTHOR CONTRIBUTIONS
AD and ND: theoretical focus, data analysis, and article development. AP: contributed to the writing process. KD: collected data and contributed to the writing process. RW: provided additional conceptual guidance and commentary.