The Measurement of Individual Differences in Cognitive Biases: A Review and Improvement

Berthet, Vincent

doi:10.3389/fpsyg.2021.630177

ORIGINAL RESEARCH article

Front. Psychol., 18 February 2021

Sec. Personality and Social Psychology

Volume 12 - 2021 | https://doi.org/10.3389/fpsyg.2021.630177

Vincent Berthet¹,²*

¹Université de Lorraine, 2LPN, F-54000 Nancy, France
²Centre d'Économie de la Sorbonne, CNRS UMR 8174, Paris, France

Individual differences have been neglected in decision-making research on heuristics and cognitive biases. Addressing that issue requires having reliable measures. We first reviewed the research on the measurement of individual differences in cognitive biases. While reliable measures of a dozen biases are currently available, our review revealed that some measures require improvement and that measures of other key biases are still lacking (e.g., confirmation bias). We then conducted empirical work showing that adjustments produced a significant improvement of some measures and that confirmation bias can be reliably measured. Overall, our review and findings highlight that the measurement of individual differences in cognitive biases is still in its infancy. In particular, we suggest that contextualized (in addition to generic) measures need to be improved or developed.

Introduction

Since the seminal work of Kahneman and Tversky on judgment and decision-making in the 1970s, there has been a growing interest in how human judgment violates normative standards (e.g., Tversky and Kahneman, 1974; Kahneman et al., 1982; Gilovich et al., 2002). When making judgments or decisions, people often rely on simplified information processing strategies called heuristics, which may lead to systematic, and therefore predictable, errors called cognitive biases (hereafter CB). For instance, people tend to overestimate the accuracy of their judgments (overconfidence bias), to perceive events as being more predictable once they have occurred (hindsight bias), or to carry on fruitless endeavors in which they already have invested money, time or effort (sunk cost fallacy). Although the “heuristics and biases” program has raised criticism regarding its pessimistic view of human rationality and its lack of theoretical ground, it has been remarkably fruitful. To date, behavioral scientists have identified dozens of CB and heuristics that affect judgment and decision-making significantly (e.g., Baron, 2008, listed 53 such biases) and have proposed several taxonomies (e.g., Carter et al., 2007; Stanovich, 2009; Pohl, 2017).

The “heuristics and biases” program led to a large body of research investigating how these mental shortcuts may impede decision-making in areas such as management (e.g., Maule and Hodgkinson, 2002), medicine (e.g., Blumenthal-Barby and Krieger, 2015), law (e.g., Rachlinski, 2018), or finance (e.g., Baker and Nofsinger, 2002). However, individual differences have been largely neglected in this endeavor (Stanovich et al., 2011; Mohammed and Schwall, 2012). In fact, most of the current knowledge about the impact of CB on decision-making relies upon experimental research and group comparisons (Gilovich et al., 2002), which might lead to the false inference that every single individual is susceptible to CB, and to the same extent.

Still, there has been a growing interest in going beyond aggregate-level results by examining individual differences (e.g., Stanovich and West, 1998, 2000). This line of research has led to two noteworthy findings. The first one is that performance on CB tasks is only moderately correlated with cognitive ability, which suggests that a major part of the reliable variance of scores on CB tasks is unique (e.g., Stanovich and West, 2008; Stanovich, 2012; Teovanović et al., 2015; Bruine de Bruin et al., 2020). The second finding is that correlations between CB measures are low, suggesting the absence of any general factor of susceptibility to CB. Indeed, exploratory factor analysis reveals that at least two latent factors can be extracted from the intercorrelations between the scores on various CB tasks (Parker and Fischhoff, 2005; Bruine de Bruin et al., 2007; Aczel et al., 2015; Teovanović et al., 2015).

The Measurement of Cognitive Biases: A Review

It is worth noting that research on individual differences in CB has been conducted despite a lack of psychometrically sound measures¹. Here, we review this research topic in order to inventory which reliable measures are currently available. Note that self-report measures have been developed to assess the propensity to exhibit biases such as the bias blind spot (Scopelliti et al., 2015), correspondence bias (Scopelliti et al., 2018), and confirmation bias (Rassin, 2008). In this review, we considered only objective measures of individual differences in CB (i.e., based on performance on experimental tasks).

The development of reliable measures of CB faces several challenges. As a preliminary point, one should distinguish between two types of CB tasks. Some CB are measured by a single or a few equivalent items. For example, the gambler's fallacy can be assessed with a single problem such as “When playing slot machines, people win something about 1 in every 10 times. Julie, however, has just won on her first three plays. What are her chances of winning the next time she plays?” (Toplak et al., 2011). Likewise, base rate neglect, sunk cost fallacy, and belief bias are usually measured by a single or several equivalent items. For those biases, bias susceptibility is measured with respect to accuracy and the measurement of individual differences raises no particular methodological issue.

Other CB are evidenced by the effect of a normatively irrelevant factor on judgments or decisions, which is typically manipulated between subjects. For example, the framing effect is usually obtained by presenting a gain and a loss version of the same decision problem to two different groups (e.g., Tversky and Kahneman, 1981). Between-subjects designs are also used for anchoring bias, hindsight bias, and outcome bias. Therefore, a first challenge in the measurement of CB is to adapt between-subjects designs to within-subjects ones. In the latter case, bias susceptibility is measured by comparing each subject's responses to the different conditions. For example, the framing effect is also found using a within-subjects design (Frisch, 1993) where the two versions of the problem are separated in the questionnaire to avoid any memory effects (e.g., Parker and Fischhoff, 2005). Although there may be some limitations, the framing effect, anchoring bias, hindsight bias, and outcome bias can all be successfully assessed using within-subjects designs (Stanovich and West, 1998; Lambdin and Shaffer, 2009; Aczel et al., 2015).

A second challenge in the measurement of CB is to build reliable scores. Most studies that investigated individual differences in CB relied on composite scores derived from a large set of CB tasks (e.g., Toplak et al., 2011; Aczel et al., 2015). It turns out that such composite scores are unreliable (West et al., 2008; Toplak et al., 2011; Aczel et al., 2015). For instance, Toplak et al. (2011) reported that the internal consistency of composite scores consisting of the average performance on 15 classic heuristics and biases tasks was 0.484 (Cronbach's alpha). Likewise, Aczel et al. (2015) showed that the reliability of composite scores calculated as the sum of the scores (1 or 0) on 13 CB tasks was 0.37 for one form of the test and 0.23 for another parallel form (Cronbach's alpha). Even composite scores derived from various tasks measuring the same CB turned out to be unreliable (e.g., Rassin, 2008, in the case of confirmation bias). These studies, however, used a single item for each task, which is detrimental to score reliability. Moreover, such a practice affects the comparability of parallel versions of the same task (Aczel et al., 2015). On the other hand, using multiple items for each task allows for assessing the reliability of test scores, so that reliable scores can be aggregated irrespective of the format of the tasks from which they are derived (in the same way as IQ scores result from aggregating scores on different subtests).
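
As a reminder of what the internal consistency values quoted above summarize, the following minimal sketch computes Cronbach's alpha from a participants-by-items score matrix using the standard formula. The function name and the small 0/1 data set are hypothetical and serve only as an illustration; they are not taken from any of the cited studies.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-participant item scores.

    scores: list of rows, one row per participant, one value per item.
    """
    k = len(scores[0])                                  # number of items
    columns = list(zip(*scores))                        # per-item columns
    item_var = sum(variance(col) for col in columns)    # sum of item variances
    total_var = variance([sum(row) for row in scores])  # variance of total scores
    return k / (k - 1) * (1 - item_var / total_var)

# Hypothetical 0/1 scores of five participants on four single-item tasks
data = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(data)
print(round(alpha, 2))  # 0.79
```

Alpha rises when items covary strongly relative to their individual variances, which is why a composite of heterogeneous single-item tasks tends to yield low values.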

Two noteworthy studies sought to adjust CB tasks to improve scale reliability. Bruine de Bruin et al. (2007) evaluated the reliability and validity of a set of seven behavioral tasks (forming the Adult Decision-Making Competence; A-DMC) measuring different aspects of decision-making (resistance to framing, recognizing social norms, under/overconfidence, applying decision rules, consistency in risk perception, resistance to sunk costs, path independence). These authors adapted tasks from the Youth Decision-Making Competence (Y-DMC; Parker and Fischhoff, 2005) to achieve increased reliability with adults. For example, Parker and Fischhoff (2005) found relatively low internal consistency for the task measuring susceptibility to framing. To address that issue, Bruine de Bruin et al. increased the number of items and replaced the dichotomous choice by a 6-point rating scale (each endpoint reflecting a strong preference for one of the two original choice options). Bruine de Bruin et al. reported values of Cronbach's alpha ranging from 0.54 to 0.77 over the seven scales, and test-retest values around 0.50. Moreover, A-DMC scores showed evidence of criterion validity as they predicted the likelihood of reporting negative life events indicative of poor decision making.

In a similar vein, Teovanović et al. (2015) used multi-item tasks for the measurement of individual differences in seven CB (anchoring effect, belief bias, overconfidence bias, hindsight bias, base rate neglect, sunk cost effect, outcome bias). Here too, Teovanović et al. introduced adjustments to increase score reliability (especially for anchoring bias, see below). With the exception of hindsight bias, scores on all CB tasks reached satisfactory levels of reliability (Cronbach's alphas >0.70). This work represents a significant step forward in the measurement of individual differences in CB.

Finally, the unpublished work of Gertner et al. (2016) should be highlighted as a valuable attempt to develop a standardized assessment of CB in judgment and decision-making. These authors relied on a sound psychometric approach that started with identifying the facets of each bias in order to cover as much of each bias's construct as possible. Accordingly, Gertner et al. used various tasks to measure each CB (e.g., the measurement of confirmation bias involves the Wason task, a task related to information search, and a task related to evaluation/weighting of evidence). While reporting acceptably high values of internal consistency for the different scales (with the exception of the confirmation bias scales), the test of Gertner et al. remained at an exploratory stage, calling for further development. As outlined by the authors, “the study of bias within an individual difference framework is still largely in its infancy” (p. 3).

Aim of the Study

Taken together, the studies of Bruine de Bruin et al. (2007) and Teovanović et al. (2015) provide evidence that a set of eight CB can be reliably measured: framing, anchoring, belief bias, overconfidence, hindsight bias, base rate neglect, sunk cost fallacy, and outcome bias². As the correlations between CB measures have been found to be low, this set may be viewed as an inventory of independent measures that could each be used separately. Such an inventory opens up a promising avenue for research on CB based on an individual differences approach. For example, the A-DMC (Bruine de Bruin et al., 2007) has been used to investigate executive functions in decision-making (Del Missier et al., 2010), age-related changes in decision-making competence (Del Missier et al., 2020b), and decision-making in schizophrenia (Del Missier et al., 2020a). However, this inventory should be both improved and extended. On the one hand, some measures are still inconvenient and therefore need to be improved. For instance, the measurement of outcome bias as reported by Teovanović et al. (2015) involves a 1-week delay between the two outcome conditions. On the other hand, reliable multi-item measures of key CB such as confirmation bias and availability bias are still lacking. The general aim of the study is to address those two issues by (1) replicating and improving the eight measures of CB identified, and (2) testing a measure of confirmation bias. Open Science Practices: All data files are available at: https://osf.io/wfums/.

Study 1

Method

The aim of Study 1 was primarily to replicate the findings relative to the eight measures of CB identified using fewer items for each task. In fact, the combined use of these eight measures with their current number of items would result in long completion times. We investigated to what extent this item reduction would impact the reliability of the measures. In addition, we made several adjustments: the measurement procedure of the outcome bias was changed as compared to Teovanović et al. (2015) so as to obtain the measure in one setting (without a 1-week delay), and the scoring method for some measures (framing bias and hindsight bias) was fine-tuned. Items were drawn from three sources: the original measure, the existing literature, or they were new. The only criterion for including or excluding items from the original measure or the existing literature was whether they were suited for French participants. When the number of suitable items was not sufficient, new items adapted to that population were created. All items can be found in the Supplementary Material.

Participants

The participants were 163 unpaid undergraduate students (26 males, 137 females) who attended a first-year introductory course in differential psychology at the University of Lorraine (France). Their mean age was 18.52 (SD = 1.89). Participants gave their informed consent before taking part in the study.

Measures

Framing Bias. Framing is the tendency of people to be affected by how information is presented (Kahneman and Tversky, 1984). Based on the procedure reported by Bruine de Bruin et al. (2007), we measured a risky-choice framing effect (note that these authors also measured an attribute framing effect, using seven items for each framing task). Decision problems were presented to the subjects, who chose between a sure-thing option (A) and a risky-choice option (B). Participants responded on a 6-point scale ranging from 1 (“I would definitely choose option A”) to 6 (“I would definitely choose option B”). Each decision problem had two versions, a gain version and a loss version. The two versions were identical except for the framing (e.g., Tversky and Kahneman, 1981; Fischhoff, 1983). Four decision problems (eight frames) were used, referring to various cases: an unusual disease (Tversky and Kahneman, 1981), a raise of income tax (Highhouse and Paese, 1996), selling an apartment (Fagley and Miller, 1997), and food poisoning in an African village (Svenson and Benson, 1993). Two of these decision problems were used in Bruine de Bruin et al. (2007). In Bruine de Bruin et al. (2007, p. 942), the framing bias is measured as “the mean absolute difference between ratings for the loss and the gain versions of each item” (accordingly, the scores range potentially from 0 to 5). However, prospect theory predicts a particular direction of risky-choice framing effects, subjects being more prone to choose the risky option in loss frames and the sure option in gain frames (Kahneman and Tversky, 1979). Therefore, we argue that framing scores should be calculated as the difference (rather than the absolute difference) between the mean ratings of the loss frames and the mean ratings of the gain frames. The gain and loss items appeared in separate blocks, with different item orders in each block (LeBoeuf and Shafir, 2003).
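
The contrast between the two scoring rules can be made concrete with a short sketch. The function name and the ratings below are hypothetical; the code simply applies, for one participant, the signed difference of means argued for here and the mean absolute difference quoted from Bruine de Bruin et al. (2007).

```python
from statistics import mean

def framing_scores(gain, loss):
    """Return (signed, absolute) framing scores for one participant.

    gain, loss: ratings (1-6) for the same decision problems, same order.
    """
    # Directional score: mean loss-frame rating minus mean gain-frame rating
    signed = mean(loss) - mean(gain)
    # Mean absolute difference between the two versions of each problem
    absolute = mean(abs(l - g) for g, l in zip(gain, loss))
    return signed, absolute

# Hypothetical ratings on four decision problems
gain = [2, 3, 5, 2]
loss = [4, 3, 3, 5]
signed, absolute = framing_scores(gain, loss)
print(signed, absolute)  # 0.75 1.75
```

In this example the third problem shows a shift opposite to the predicted direction: it lowers the signed score but raises the absolute one, which is the substantive difference between the two rules.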

Hindsight Bias. Hindsight bias is the tendency to overestimate ex post the likelihood of an outcome (Fischhoff, 1975). We used the procedure reported by Teovanović et al. (2015), which is based on a memory/recall design. In a first phase, participants performed a task in which they were asked to find the exception in a set of four words (e.g., “November,” “August,” “December,” and “January”) and then indicate the confidence in their response using a 5-point scale (the sets of words used were new). Later in the test, participants received feedback on the accuracy of each response and were asked to recall their initial confidence judgment. Teovanović et al. (2015) calculated the hindsight score as the proportion of hindsighted responses (a response was coded as hindsighted if the participant lowered her confidence after being informed that her response was incorrect, or raised her confidence after being informed that her response was correct). However, such a scoring procedure does not consider the magnitude of the hindsight bias. Therefore, the difference between the confidence rating recalled and the initial one should be considered. Moreover, there is a hypothesized direction for this difference: it should be positive when correct feedback is provided, and negative when incorrect feedback is provided. Accordingly, the hindsight score was calculated as (recalled confidence rating − initial confidence rating) × accuracy, with accuracy being coded 1 (correct feedback) or −1 (incorrect feedback) (we thank an anonymous reviewer for suggesting this scoring method). We used fewer items than Teovanović et al. (10 vs. 14). As subjects rated their confidence on a 5-point scale, the potential range of scores was −40 to 40.
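
The two scoring rules for hindsight bias can be sketched as follows. The function names and data are hypothetical, and the sketch assumes that per-item values are summed across items, which is what a maximum of 40 for ten items rated on a 5-point scale suggests.

```python
def hindsight_magnitude(initial, recalled, correct):
    """Sum over items of (recalled - initial) * accuracy.

    accuracy is +1 when the feedback was 'correct', -1 when 'incorrect'.
    Summing over items is an assumption of this sketch.
    """
    return sum(
        (r - i) * (1 if c else -1)
        for i, r, c in zip(initial, recalled, correct)
    )

def hindsight_proportion(initial, recalled, correct):
    """Proportion of items on which confidence shifted toward the feedback."""
    flags = [(r > i) if c else (r < i)
             for i, r, c in zip(initial, recalled, correct)]
    return sum(flags) / len(flags)

# Hypothetical confidence ratings (1-5) on four items
initial = [3, 4, 2, 5]
recalled = [4, 4, 1, 3]
correct = [True, True, False, False]

magnitude = hindsight_magnitude(initial, recalled, correct)
proportion = hindsight_proportion(initial, recalled, correct)
print(magnitude, proportion)  # 4 0.75
```

The fourth item contributes 2 points to the magnitude score but counts only once in the proportion score, which illustrates the information that the proportion-based rule discards.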

Overconfidence Bias. Overconfidence has several aspects (Moore and Schatz, 2017) but it commonly refers to the tendency to overestimate one's own abilities. We used the standard measurement procedure in which participants respond to a performance task and then indicate the confidence in their response (e.g., Lichtenstein and Fischhoff, 1977). Like Bruine de Bruin et al. (2007), we used dichotomous general knowledge items for the performance task. We used new items which were drawn from various tests used for the purpose of admission to competitions organized within the French civil service. Overconfidence was assessed through a calibration measure, defined as the difference between the mean confidence ratings and the mean accuracy (percentage of correct answers). Participants rated their confidence on a 6-point scale ranging from 50% (“I am just guessing”) to 100% (“I am absolutely sure”). Therefore, scores ranged from −50 (maximum underconfidence) to 100 (maximum overconfidence). We used fewer items than Bruine de Bruin et al. (25 vs. 34).
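
The calibration measure and its bounds can be checked with a minimal sketch. The function name and the responses are hypothetical; confidence is expressed in percent, on the 50 to 100 range described above.

```python
from statistics import mean

def overconfidence(confidence_pct, correct):
    """Mean confidence (in %) minus percentage of correct answers."""
    accuracy_pct = 100 * mean(1 if c else 0 for c in correct)
    return mean(confidence_pct) - accuracy_pct

# Hypothetical responses to four dichotomous general knowledge items
score = overconfidence([50, 70, 90, 100], [True, False, True, False])
print(score)  # 27.5

# Bounds of the scale: full confidence with no correct answer,
# and pure guessing with every answer correct
upper = overconfidence([100, 100], [False, False])
lower = overconfidence([50, 50], [True, True])
print(upper, lower)  # 100 -50.0
```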

Anchoring Bias. Anchoring bias is the tendency of people to adjust their numerical judgments toward the first piece of information (Tversky and Kahneman, 1974). We used the procedure reported by Teovanović et al. (2015), who proposed to measure the anchoring bias as the difference between a numerical estimate following an anchor value and an initial, anchor-free, estimate made before the anchor presentation. Participants were first required to make numerical estimates relative to general knowledge (E1) (e.g., the average number of babies born per day in France). In a second phase, they were presented with the same set of items and performed a comparative task and a final estimation task. In the former, participants indicated whether the number to estimate was higher or lower than a given value (anchor, A). Anchor values were set automatically by multiplying anchor-free estimates (E1) by predetermined values (ranging from 0.2 to 1.8 between items). Then, participants provided their final estimate (E2). In each item, the anchoring bias was calculated as (E2 – E1)/(A – E1), so that a value of 0 corresponds to an unchanged estimate and a value of 1 to a final estimate equal to the anchor. Anchoring values lower than 0 (lack of anchoring) or higher than 1 (total anchoring) were removed (12.57% of all observations). The anchoring score is the average anchoring bias across items. Items were selected from the existing literature on anchoring (e.g., Jacowitz and Kahneman, 1995) and were very similar to those used by Teovanović et al. (2015). We used fewer items than these authors (12 vs. 24).
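
A minimal sketch of this scoring is given below. The per-item index is written as (E2 − E1)/(A − E1), so that 0 means the final estimate did not move and 1 means it equals the anchor, in line with the interpretation of the cut-offs given above. The function name, the guard against A = E1, and the data are assumptions of the sketch.

```python
from statistics import mean

def anchoring_score(items):
    """Mean anchoring index for one participant.

    items: list of (e1, anchor, e2) tuples, one per item.
    Returns None if no item yields a usable index.
    """
    # Per-item index; items where the anchor equals E1 are skipped
    # (assumption: the index is undefined there)
    indices = [(e2 - e1) / (a - e1) for e1, a, e2 in items if a != e1]
    # Discard values below 0 or above 1, as in the exclusion rule
    kept = [x for x in indices if 0 <= x <= 1]
    return mean(kept) if kept else None

# Hypothetical item data: (initial estimate, anchor, final estimate)
items = [
    (2000, 3600, 2800),  # index 0.5 (anchor = 1.8 x E1)
    (100, 20, 60),       # index 0.5 (anchor = 0.2 x E1)
    (50, 90, 40),        # index -0.25, moved away from the anchor: dropped
    (10, 14, 14),        # index 1.0, final estimate equals the anchor
]
result = anchoring_score(items)
print(round(result, 3))  # 0.667
```

Because the index is a ratio of differences, it takes the same value whether the anchor lies above or below the initial estimate, which allows items with high and low anchors to be averaged together.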

Outcome Bias. Outcome bias is the tendency to evaluate the quality of a decision based on its outcome. This bias is typically evidenced in experiments where subjects are presented with a scenario describing a decision made by an individual (e.g., a physician who decided to go ahead with an operation). In one condition, subjects are informed that the decision led to a positive outcome (e.g., “The operation succeeded”) and in another condition, subjects are informed that the decision led to a negative outcome (e.g., “The patient died”). Participants are asked to evaluate the quality of the decision itself. At the cost of reducing the effect size, outcome bias can be obtained in within-subjects designs (Baron and Hershey, 1988). Teovanović et al. (2015) reported a reliable measurement of the outcome bias using a within-subjects design in which subjects evaluate 10 decisions a first time and then a second time a week later but with different outcomes. Bias susceptibility amounts to the inconsistency between the responses to the two outcome conditions. However, the 1-week delay makes this procedure quite inconvenient. To address that issue, we used different items for the two outcome conditions at the cost of potentially increasing measurement error. To avoid confounding the effects of quality and outcome of the decision (a threat to construct validity), we chose a conservative approach by which decisions with a positive outcome were quite bad with respect to decision quality (e.g., “Celine was due to take an important college exam. Two days before, she was invited to a party. She decided to go. She had a great time and stayed with her friends until the early hours of the morning. The next day, she revised most of the day. She passed her exam”) while decisions with a negative outcome were quite good (e.g., “Paul was late for a college exam. Being stuck in traffic, he decided to walk to college as quickly as possible. But he arrived late and was not allowed to take the exam”). The bias score is defined as the difference between the mean ratings of decisions with positive outcomes and the mean ratings of decisions with negative outcomes. Five items were used per condition and participants were asked to rate the decision quality on a 6-point scale ranging from 1 (“It was a poor decision”) to 6 (“It was an excellent decision”). Seven items were selected from the existing literature (Baron and Hershey, 1988; Gino et al., 2009; Aczel et al., 2015; Teovanović et al., 2015) and three new items were created.

Base Rate Neglect. Base rate neglect is a bias in which the information regarding a specific case outweighs the information relative to prior probabilities (Bar-Hillel, 1980). On each item, participants were presented with two kinds of information: base-rates (e.g., “1,000 people participated in a study, including 4 men and 996 women”) and information concerning a specific case (e.g., “Dominique is a randomly chosen participant of this study. Dominique is 23 years old and is finishing a degree in engineering. On Friday nights, Dominique likes to go out cruising with friends while listening to loud music and drinking beer”). Participants were required to estimate a probability related to the specific case (“What is the probability that Dominique is a man?”) (free estimate). The bias score was defined as the proportion of responses that differed from the base rate information in the direction implied by the specific case (e.g., higher than 0.4% in the above example). In typical base-rate problems, the description of the specific case fits common stereotypes of the smaller population group, so that the description of the person and of the base rate are incongruent (De Neys and Glumicic, 2008). We used four such items, two of which were selected from De Neys and Glumicic (2008) and two were created.
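
The scoring rule for base rate neglect can be sketched as follows. The function name, the tuple layout, and the responses are hypothetical; the direction flag encodes whether the description of the specific case points above (+1) or below (−1) the base rate.

```python
def base_rate_neglect(responses):
    """Proportion of estimates that depart from the base rate in the
    direction implied by the case description.

    responses: list of (estimate_pct, base_rate_pct, direction) tuples,
    with direction = +1 (description points above the base rate) or -1.
    """
    biased = [(est - base) * d > 0 for est, base, d in responses]
    return sum(biased) / len(biased)

# Hypothetical answers of one participant to four items
responses = [
    (80.0, 0.4, 1),  # far above the 0.4% base rate: biased
    (0.4, 0.4, 1),   # equal to the base rate: not biased
    (5.0, 0.5, 1),   # above the base rate: biased
    (0.3, 0.5, 1),   # below the base rate: not biased
]
proportion_biased = base_rate_neglect(responses)
print(proportion_biased)  # 0.5
```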

Sunk Cost Fallacy. Sunk cost fallacy is the tendency to carry on a fruitless endeavor because of the money, time or effort already invested (Arkes and Blumer, 1985). We used the same measurement procedure as Bruine de Bruin et al. (2007) and Teovanović et al. (2015). Participants were presented with hypothetical scenarios and chose between the sunk-cost option and the normatively correct option using a 6-point scale ranging from 1 (the normatively correct option) to 6 (the sunk-cost option). The bias score was defined as the mean rating score. We used fewer items than the two above studies (5 vs. 10 and 8, respectively). Three items were drawn from the existing literature (Arkes and Blumer, 1985; Bornstein and Chapman, 1995; Teovanović et al., 2015) and two were created.

Belief Bias. Belief bias is the tendency to evaluate deductive arguments based on the believability of the conclusion rather than its logical validity (Evans et al., 1983). We used the measurement procedure reported by Teovanović et al. (2015). Subjects were instructed to evaluate syllogisms by indicating whether the conclusion necessarily followed from the premises or not, assuming that all premises were true. The rationale is to assess the effect of the believability of the conclusion (believable vs. unbelievable) for a given level of validity of the argument (valid vs. invalid). Four pairs of syllogisms were used, each pair involving a consistent item and an inconsistent one. On four inconsistent items, the logical validity of the argument was incongruent with the believability of the conclusion (two of them were valid but unbelievable, and two were invalid but believable). On four consistent items, the logical validity of the argument was congruent with the believability of the conclusion (two of them were both valid and believable, and two were both invalid and unbelievable). The bias score was the number of biased responses. A response was coded as biased if the subject provided an incorrect answer to an inconsistent item and a correct answer to the corresponding consistent item. We used four pairs of syllogisms drawn from Teovanović et al. (2015).
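
The pairwise coding rule for belief bias can be expressed in a few lines. The function name and data are hypothetical; each pair holds the correctness of the answer to the consistent item and to the matching inconsistent item.

```python
def belief_bias_score(pairs):
    """Number of biased responses across syllogism pairs.

    pairs: list of (consistent_correct, inconsistent_correct) booleans.
    A pair is biased when the inconsistent item is answered incorrectly
    while the corresponding consistent item is answered correctly.
    """
    return sum(1 for cons, incons in pairs if cons and not incons)

# Hypothetical answers of one participant to four pairs
pairs = [
    (True, False),   # biased
    (True, True),    # both correct: not biased
    (False, False),  # both incorrect: not coded as biased
    (True, False),   # biased
]
n_biased = belief_bias_score(pairs)
print(n_biased)  # 2
```

Requiring a correct answer on the consistent item separates errors driven by believability from errors that reflect a general difficulty with the syllogism.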

Procedure

After providing consent, participants completed the eight tasks in the following order: (1) gain version items of the framing task, (2) the first phase of the hindsight task, (3) overconfidence bias, (4) anchoring bias, (5) outcome bias, (6) base rate neglect, (7) sunk cost fallacy, (8) belief bias, (9) the second phase of the hindsight task (recall), (10) loss version items of the framing task. After completing the test, participants were given feedback on the study.

Results

The mean testing time was 49.26 min (SD = 15.57). Six participants (3.68%) were excluded from the analysis because of abnormally long times to complete the test (more than 2 SD above the mean), which resulted in a final sample of 157 participants. We reviewed the discriminative ability and reliability of the CB measures (see Table 1 for a summary of the results). With the exception of framing bias and outcome bias, medium to large effect sizes were found, with Cohen's d values ranging from 0.68 (overconfidence bias) to 2.76 (sunk cost fallacy).

Table 1. Descriptive statistics, discriminative properties and internal consistency of CB measures.

Regarding score reliability, Study 1 revealed three main findings. First, we failed to replicate four measures in particular. The measure of framing bias produced a small effect size (M = 0.22, Cohen's d = 0.23) and was unreliable (Cronbach's alpha = 0.15). Note that the scoring method of Bruine de Bruin et al. (2007) produced a greater internal consistency (0.50). The results regarding overconfidence bias were also below those reported by Bruine de Bruin et al. The mean overconfidence score was 7.22% and the internal consistency was 0.36. This low reliability was due to the fact that accuracy scores were themselves unreliable (split-half = 0.25), whereas confidence scores were reliable (split-half = 0.78). In fact, it is not surprising that scores on a general knowledge test show poor reliability given the diversity of the items. Sunk cost fallacy scores were unreliable (Cronbach's alpha = 0.35) despite mean and effect size values similar to those reported by Bruine de Bruin et al. (2007) and Teovanović et al. (2015) (M = 3.73, Cohen's d = 2.76). Belief bias scores were also unreliable (Cronbach's alpha = −0.15) despite an effect size similar to that reported by Teovanović et al. (Cohen's d = 2.10).

Second, the internal consistency of the hindsight bias and outcome bias measures was quite poor too (0.45 and 0.57, respectively) but such values could be attributed to the low number of items used for each (10). When using the scoring method of Teovanović et al. (2015), the internal consistency of the hindsight bias measure was 0.48, a value below that reported by these authors (0.66). Third and finally, two measures reached quite acceptable levels of reliability. The internal consistency of the anchoring bias measure was below that reported by Teovanović et al. (2015) (0.68 vs. 0.77) but that difference could be attributed to the difference in the number of items used (12 vs. 24). Our value suggests, however, that a reliable measure of this bias can be achieved with fewer than 24 items. The internal consistency of the base rate neglect measure was acceptable (Cronbach's alpha = 0.70) despite a reduced number of items (4 vs. 10 in Teovanović et al.).
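
The argument that a lower reliability can follow from a shorter test can be illustrated with the Spearman-Brown prophecy formula. This is not an analysis reported in the study: the sketch assumes parallel items and treats Cronbach's alpha as the reliability estimate, so the projected values are rough approximations only.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability
            / (1 + (length_factor - 1) * reliability))

# Anchoring: 0.68 observed with 12 items, projected to 24 items
projected_24 = spearman_brown(0.68, 24 / 12)
print(round(projected_24, 2))  # 0.81

# Conversely: 0.77 reported with 24 items, projected down to 12 items
projected_12 = spearman_brown(0.77, 12 / 24)
print(round(projected_12, 2))  # 0.63
```

Under these assumptions, the value of 0.68 obtained with 12 items is at least as high as what halving a 24-item test with a reliability of 0.77 would be expected to yield.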

Table 2 shows the bivariate correlations between CB measures. Correlations were low (all r < 0.22), and only six were statistically significant. This finding confirms what has been found in previous studies. Hindsight bias had the highest number (3) of significant correlations (with anchoring, outcome bias and belief bias).

Table 2. Correlations between CB measures.

We performed factor analysis to investigate the factorial structure of the eight CB measures. While Bartlett's test of sphericity was significant [χ²₍₂₈₎ = 53.5, p < 0.01], the Kaiser-Meyer-Olkin measure of sampling adequacy (KMO = 0.49) suggested that the data were not suited for factor analysis. However, as the KMO value was just below the recommended minimum value of 0.5 (Kaiser, 1974), we still performed the analysis. A two-factor model with oblimin rotation was retained on the basis of previous findings (Bruine de Bruin et al., 2007; Teovanović et al., 2015). The two factors accounted for only 22% of the total variance. This finding (which is not surprising given the low correlations between CB measures) is very similar to that reported by Teovanović et al. (2015). Only framing bias loaded on the first factor while hindsight, anchoring, outcome and belief bias had loadings of at least 0.30 on the second factor (Table 3). The two factors were barely correlated (r = −0.16). Note that these findings should be taken cautiously given the low KMO value.

Table 3. Factor analysis of the CB measures (Study 1).

To sum up, we found lower internal consistency values than those reported by Bruine de Bruin et al. (2007) and Teovanović et al. (2015) except for anchoring bias (to some extent) and base rate neglect. The findings of Study 1 suggested that six of the eight measures of CB needed further investigation. While the item reduction might have impacted the reliability of the hindsight bias measure, other measures (in particular framing bias and overconfidence) required significant changes. In the case of framing, one could highlight that the scoring method of Bruine de Bruin et al. (2007) produced a more reliable measure (Cronbach's alpha = 0.50) and that the difference with the value reported by these authors (0.62) could be attributed to the reduced number of items used (4 vs. 14). As described in the Measures section, we argue however that the framing score should be calculated as the difference (rather than the absolute difference) between the mean ratings of the loss frames and the mean ratings of the gain frames, in accordance with the direction of risky-choice framing effects predicted by prospect theory (Kahneman and Tversky, 1979). Even though framing effects are smaller in within-subjects designs (Lambdin and Shaffer, 2009), the framing effect was particularly small here. We found a d-value of 0.23, which is lower than the value for risky-choice framing (Cohen's d = 0.437) reported by Piñon and Gambara (2005) in their meta-analytic review of framing effects. In fact, using exactly the same decision problems in the loss and gain versions might raise the likelihood that participants detect that feature (despite the two conditions being distanced from one another), leading them to be consistent in their responses, thereby reducing the effect size.

Since it is less prevalent in the literature on judgment and decision-making than the other biases, we did not further investigate the measurement of belief bias. In order to keep the overall testing time below 1 h, we split the set of CB to be improved into two subsequent separate studies.