Probabilistic Inference: Task Dependency and Individual Differences of Probability Weighting Revealed by Hierarchical Bayesian Modeling

Cognitive determinants of probabilistic inference were examined using hierarchical Bayesian modeling techniques. A classic urn-ball paradigm served as experimental strategy, involving a factorial two (prior probabilities) by two (likelihoods) design. Five computational models of cognitive processes were compared with the observed behavior. Parameter-free Bayesian posterior probabilities and parameter-free base rate neglect provided inadequate models of probabilistic inference. The introduction of distorted subjective probabilities yielded more robust and generalizable results. A general class of (inverted) S-shaped probability weighting functions had been proposed; however, the possibility of large differences in probability distortions not only across experimental conditions, but also across individuals, seems critical for the model's success. It also seems advantageous to consider individual differences in parameters of probability weighting as being sampled from weakly informative prior distributions of individual parameter values. Thus, the results from hierarchical Bayesian modeling converge with previous results in revealing that probability weighting parameters show considerable task dependency and individual differences. Methodologically, this work exemplifies the usefulness of hierarchical Bayesian modeling techniques for cognitive psychology. Theoretically, human probabilistic inference might be best described as the application of individualized strategic policies for Bayesian belief revision.


INTRODUCTION
distinguished between risky worlds, referring to situations where perfect knowledge about probabilities is present and uncertain worlds, referring to situations where probabilities remain unknown. Savage (1954) made a similar distinction when he introduced the term small worlds for situations where all alternatives and their probabilities are known. In contrast, relevant information is unknown and/or must be estimated in large worlds (see also Johnson and Busemeyer, 2010;Volz and Gigerenzer, 2012).
Small worlds provide opportunities for analysing Bayesian inference. In Bayesian decision theory (Jaynes, 2003;MacKay, 2003;Robert, 2007), degrees of belief in states of the world are specified. Bayesian inference updates prior beliefs using new evidence to derive posterior beliefs. Figure 1A shows small worlds, vested as an urn-ball task (Phillips and Edwards, 1966), consisting of two binary random variables, one representing unobservable states of the world (i.e., urns, H ∈ {H 1 , H 2 }), the other representing observable events (i.e., balls, e ∈ {0, 1}). Participants were asked for inferences about the current hidden state of that world, given small samples of events which could have been generated from either state ( Figure 1B). To introduce experimental variance, we manipulated the task's probabilistic contingencies at two levels. First, we introduced uncertainty about the sort of urn containing balls (by sampling urns from two probability distributions). Second, we manipulated the proportion of ball colors within each urn. We will refer to these probabilistic contingencies as prior probabilities and likelihoods, respectively.
Thus, participants are informed about the world's stochastic structure, and they have access to evidence generated from one of these states. Participants represent knowledge-based and evidence-based degrees of belief in world states as probabilities. Specifically, participants hold informed prior beliefs [i.e., P(H 1 ), P(H 2 ) = 1 − P(H 1 )] and likelihoods [i.e., P(E|H 1 ), P(E|H 2 )], i.e., conditional probabilities for some evidence, given each of the states. According to Bayes' theorem, prior probabilities are combined with likelihoods to provide posterior probabilities (Gold and Shadlen, 2007).
A long history of studies demonstrates that human judgment deviates from Bayesian decision theory (Kahneman et al., 1982). Initially, Edwards coined the term "conservatism" to describe probabilistic inference in which persons overweigh prior beliefs (base rates) and under-weigh new sample evidence when compared to Bayesian decision theory (Edwards, 1982). Shanteau's work (1972Shanteau's work ( , 1975 was centered around delineating conditions for the cognitive (sub-)additivity of new sample evidence. Later research identified the base rate neglect (Kahneman and Tversky, 1973;Bar-Hillel, 1980), i.e., a cognitive bias which indicates that the posterior probability of hypothesis H, given evidence e, is assessed without taking into account the prior probability (base rate) of H. Base rate neglect represents a particularly pertinent class of deviations in probabilistic judgment from Bayesian decision theory (Koehler, 1996).
Prospect Theory (Kahneman and Tversky, 1979;Tversky and Kahneman, 1992) successfully describes economic decision behavior. Its probabilistic part proposed subjective probabilities used in decision-making to be non-linear functions of objective probabilities, with their relationships being best described by Sshaped probability weighting functions, leading to a tendency to overestimate low probabilities and to underestimate high probabilities (see Figure 2). One of our reviewers brought our attention to the fact that the notion of weighting probabilities was around 15 years prior to the introduction of Prospect Theory (Edwards, 1954), and that Edwards' early discussion of probability weighting is in the context of probabilistic inference rather than risky choice, which is the focus of Prospect Theory.

FIGURE 1 | (A)
Participants were graphically informed about the four probability conditions of the urn-ball task [certain priors (P = 0.9, P = 0.1), uncertain priors (P = 0.7, P = 0.3), certain likelihoods (P = 0.9, P = 0.1), uncertain likelihoods (P = 0.7, P = 0.3)]. (B) Each of the four probability conditions (here, the uncertain priors and uncertain likelihoods condition serves as an example) comprised 50 consecutive episodes of sampling that consisted of the invisible drawing of one urn, the visible drawing of a sample of four balls from that urn (sequential drawing with replacement). Participants indicated which urn they considered more likely, given the number of blue balls drawn (zero to four), and based on the condition (prior probability and likelihood function).
Such S-shaped probability distortions are ubiquitous in research on probabilistic inference (Gonzalez and Wu, 1999;Luce, 2000;Zhang and Maloney, 2012;Cavagnaro et al., 2013). Stott (2006) provides a review of various probability weighting functions and a summary of parameter values reported in the literature for each of the weighting functions. We refer the interested reader to this publication for descriptions of multiple weighting functions. Earlier research on probability weighting functions revealed considerable inter-and intra-individual variance of probability weighting parameters (Tversky and Kahneman, 1992;Gonzalez and Wu, 1999;Stott, 2006;Wu et al., 2009), but no former study addressed the question whether Bayesian inference in a risky/small world involves probability weighting through Bayesian modeling approaches.
The present study aimed at contributing to the literature by applying Bayesian statistics (Jaynes, 2003;MacKay, 2003;Robert, 2007) to model Bayesian inference. Bayesian methods have become increasingly accepted for data analysis in cognitive science (Edwards et al., 1963;Wagenmakers, 2007;Gallistel, 2009;Kruschke, 2010;Lee, 2011;Hoijtink, 2012). Specifically, hierarchical Bayesian modeling provides flexible and interpretable ways of analysing formal models of cognitive processes (Lee, 2011). Hierarchical Bayesian models are models in which parameters are sampled from distributions determined by other parameters (so-called hyper-parameters; Shiffrin et al., 2008). By making the individual parameters dependent on their group mean, a trade-off between fitting the group as a whole and fitting each individual separately is introduced (Shiffrin et al., 2008), allowing for improved predictive robustness of the model (Gelman et al., 2014a). Here, we applied hierarchical Bayesian modeling to Bayesian inference in order to examine whether human subjects (a) follow normative Bayesian specifications, and (b) are influenced by non-normative tendencies, such as base rate neglect or (inverse) S-shaped probability weighting. Notice that this is the first study that applied hierarchical Bayesian modeling to Bayesian inference in a risky/small world.

Participants
Sixteen psychology students (I = 16) participated for course credit (15 female, 1 male). Age ranged from 19 to 50 years (M = 24.7; SD = 9.3). All participants indicated having normal or corrected-to-normal sight. The study was reviewed and approved by the Ethics Committee of TU Braunschweig (Department of Life Sciences). Informed consent was obtained from all subjects.

Inference Task
The inference task is a modification of tasks used in Phillips and Edwards (1966) and Grether (1980Grether ( , 1992see also Stern et al., 2010;Achtziger et al., 2014). Factorial combination of two levels of prior probabilities and two levels of the likelihoods yielded four experimental conditions which were administered to each participant. Their order was counterbalanced across participants under the following restriction: Prior probability was the slowly varying factor (one level of prior probabilities was repeated across two successive blocks of trials), whereas the likelihoods changed from block to block, yielding four different orders.
At the beginning of each condition, two types of urns (labeled H 1 and H 2 here, respectively) were presented on a computer screen. In uncertain likelihood conditions (L u ), urn type H 1 contained seven blue (e = 1) and three red (e = 0) balls [i.e., P(e = 1|H 1 ) = 0.7, P(e = 0|H 1 ) = 0.3], while urn type H 2 contained three blue and seven red balls [i.e., P(e = 1|H 2 ) = 0.3, P(e = 0|H 2 ) = 0.7]. In certain likelihood conditions (L c ), urn type H 1 contained nine blue balls and one red ball [i.e., P (e = 1 | H 1 ) = 0.9, P (e = 0 | H 1 ) = 0.1], while urn type H 2 contained one blue ball and nine red balls [i.e., P(e = 1|H 2 ) = 0.1, P(e = 0|H 1 ) = 0.9]. Prior probabilities were manipulated by presenting 10 urns, composed of different numbers of type H 1 and type H 2 urns. In uncertain prior probability conditions (P u ), three type H 1 and seven type H 2 urns [i.e., P(H 1 ) = 0.3, P(H 2 ) = 0.7] were present. In the certain prior probability condition (P c ), one type H 1 urn and nine type H 2 urns [i.e., P(H 1 ) = 0.1, P(H 2 ) = 0.9] were present. In the absence of a previous study that examined human probabilistic inference via hierarchical Bayesian modeling (see below), these quite unbalanced prior probabilities and likelihoods were chosen to minimize a potential role of pure "guessing." It should be noted that the results may have been biased by the selection of these parameters. Colors were counterbalanced across participants, but we will ignore this in our description to avoid confusion.
Each experimental trial ( Figure 1B) consisted of the following sequence of events: First, one urn was selected randomly, with the outcome of this selection remaining unknown to the participant. Subsequently, a random sample of four balls (K = 4) was drawn sequentially with replacement from that urn, shown one by one. Ignoring the order of drawn balls, this procedure generated five distinct possible situations (i.e., zero to four blue balls) in each condition. Ball stimuli were presented in the center of a computer screen (Eizo FlexScan T766 19 ′′ ; Hakusan, Ishikawa, Japan) against gray background (size = one degree, duration = 100 ms, stimulus onset asynchrony = 2500 ms). Trial duration amounted to around 10 s. Participants were asked to indicate the color of each stimulus by pressing the left or right Ctrl key on a standard computer keyboard. Participants indicated which urn they considered more likely, given the number of blue balls drawn (zero to four), and based on the condition (prior probability and likelihood function; Data Sheet 1 in Supplementary Material). They indicated their inference by pressing the left or right Ctrl key for urn type H 1 and urn type H 2 , respectively. Feedback was not provided in order to avoid confounding of probabilistic inference proper and evaluative processes, pending two-factor models of decision-making such as, for example, Prospect Theory (Kahneman and Tversky, 1979), although Grether and Plott (1979) have demonstrated that incentives may have little influence over performance in probabilistic inference tasks. Stimulus-response mapping was counterbalanced across participants. Neither feedback nor reward was provided during the experiment.
Each participant completed four practice trials under supervision of the experimenter, each with one urn type H 1 and one urn type H 2 exemplar, to become accustomed to the task. Successful completion of the practice trials demonstrated that participants understood the procedure and their task. There were N = 50 trials per condition, with short breaks between the conditions. We chose 50 trials per condition because the study was also designed to measure event-related potentials (Seer et al., 2016;Kopp et al., under review), and event-related potentials require large numbers of trials for averaging. Given five possible outcomes per condition, participants responded to each possible outcome an average of 10 times per condition. Thus, it should be mentioned that the responding had a quite repetitive character, and that participants may have at times simply recalled their responses from earlier trials in later trials. The experiment was run using Presentation R (Neurobehavioral Systems, Albany, CA).

Bayesian Inference
Binary inferences were requested (Figure 1). Given prior probabilities P (H 1 ) = 1 − P(H 2 ) for two hypothetical states H 1 and H 2 , and a set of binary data E = {e 1 , . . . , e K } with K ∈ N, we can compute the posterior probabilities as and where P (H 1 |E) = 1 − P(H 2 |E). Formulating these posterior probabilities in log-odds form (Jaynes, 2003), using the notation Lo(P(H 1 |E)) for the posterior log-odds favoring H 1 we obtain, For each single binary datum e k , with k ∈ [1, . . . , K], we obtain, Hence, the Bayesian updating in log-odds form for a binary hypothesis equals adding the logarithm of the likelihood ratio ln P(e k |H 1 )/P(e k |H 2 ) to the logarithm of the prior odds ln P(H 1 )/P(H 2 ) . We can therefore represent the accumulation of evidence by adding a new likelihood ratio for each new datum e k .

Probability Weighting
Zhang and Maloney (2012) proposed a linear probability weighting function in the log-odds space, capable of modeling probability weighting as in Prospect Theory (Kahneman and Tversky, 1979;Tversky and Kahneman, 1992). We chose this probability weighting function for its compatibility with the idea of Bayesian updating (see Section Bayesian Inference with Weighted Probabilities). Zhang and Maloney (2012) used the formula with p being a probability and the two weighting-parameters γ ∈ [0, ∞) and p 0 ∈ [0, 1] , logistic (x) = 1 1 + e −x being the logistic function, π being the corresponding weighted probability, and Lo p given by Note that, logistic Lo p = p. The unknown parameters are the slope of the weighting function γ and the point p 0 , where the weighted probability equals the unweighted probability, The slope parameter γ determines the shape of the weighting function. If γ = 1, the probabilities remain untransformed. If γ < 1, the weighting function has an inverse-S shape or a concave shape if p 0 approaches one. If γ > 1, the weighting function shows an S-shape or a convex shape if p 0 approaches one (Figure 2). Using a comparable log-odds approach, Zhang and Maloney (2012) successfully modeled choice and confidence data from a range of studies.

Bayesian Inference with Weighted Probabilities
Since Zhang and Maloney (2012) provide a linear weighting function in log-odds space, we can easily represent Bayesian inference with transformed log-odds. Applying Equation (5) to the log-odds evidence Lo(P(H 1 |E)) as defined in Equation (3) we obtain Therefore, we can also represent the weighted accumulation of evidence as the addition of for each single datum e k .

Unweighted Bayesian Posterior Probabilities
The simplest model incorporating Bayesian inference relies on unweighted posterior probabilities. This model is Bayes optimal, no other classifier of the urn state can do better on average. We use the posterior to predict the inference of H 1 , on any trial, using the sampling statement Here, y n represents the n-th inference (y n ∈ {0, 1} with n ∈ {1, .., N}). A value of zero denotes the inference of the urn with higher prior probability and a value of one the inference of the urn with lower prior probability. P(H 1 |E) is the n-th posterior probability for H 1 , given a sample of evidence E, as defined in Equation (1). bern P(H 1 |E) denotes the Bernoulli distribution: y n takes on value one with probability P (H 1 | E), and y n equals zero with probability 1 − P(H 1 |E). There are five different possibilities to draw combinations of blue and red balls (disregarding sequential order); therefore there are five different values of P(H 1 |E) and Lo(P(H 1 |E)) (Equation 3) per experimental condition which are shown in Table 1.

Base Rate Neglect
Base rate neglect leads to the assumption that π (P (H 1 )) = π (P (H 2 )) = 0.5, irrespective of the veridical prior probabilities P (H 1 ) and P(H 2 ). The model of Bayesian inference thus reduces to the likelihood of the data, i.e., to

Weighted Bayesian Posterior Probabilities
In the following, we describe three models of weighted Bayesian posterior probabilities (Equation 8).

Model without individual differences
This probability weighting model assumes the absence of interindividual differences with regard to the weighting parameters γ and p 0 . The parameters γ und p 0 were provided with weakly informative prior distributions-i.e., prior distributions restricting the posterior distribution very little-to ensure computational tractability (Gelman et al., 2014a). The sampling statement for γ ∼ N (1, 1) [0, ∞), with N being the normal distribution truncated at zero. Figure 3A depicts the corresponding probability density. The prior distribution of γ is a normal distribution centered on 1, to represent the weak prior assumption of no probability weighting taking place. We chose a flat prior p 0 ∼ beta(1, 1), where beta is the beta function. Figure 3B shows the corresponding probability density.
Thus, all urn inferences are generated by the underlying Bernoulli distribution with p = P(H 1 |E), given a particular set of evidence E, y n ∈ {0, 1} with n ∈ [1, . . . , N], γ ∈ [0, ∞) and p 0 ∈ [0, 1]. bern p b denotes the Bernoulli distribution: y n takes on value onedenoting an inference of the urn with lower prior probabilitywith probability p b , and y n equals zero-an inference of the urn with higher prior probability-with probability 1 − p b . Figure 4 depicts the graphical model without individual differences. Nodes represent variables and arrows represent conditional probabilities. Discrete variables are given square nodes, continuous variables circular nodes and observed variables are shaded, while unobserved are not. Variables fully determined by their parents are given double borders, while stochastic variables are given single borders. Plate notation is used to ensure sparse representation, by grouping repeating variables in a subgraph enclosed by a rectangle, and indicating the number of repetitions. The plate corresponds to the N trials. An arrow between a variable outside of a plate to a variable inside indicates the dependence of each repeated variable on their parent outside the plate. The variables in the graphical model are defined using the subscript n to indicate definitions specific to the variables in each trial.

Model with unrestricted individual differences
This model considers inter-individual variability in parameters γ i and p 0 i . The sampling statements of the prior distributions of each individual's parameters γ i and p 0 i are given by γ i ∼ N (1, 1) [0, ∞) and p 0 i ∼ beta(1, 1).
The assumed data-generating process equals with p = P(H 1 |E), given the corresponding set of evidence E, y n i ∈ {0, 1}, γ i ∈ [0, ∞) , and p 0 i ∈ [0, 1] for each individual i and datum n. bern p b denotes the Bernoulli distribution. in the graphical model are defined using the subscript i or n to indicate definitions for variables specific to each individual and trial, respectively.

Model with hierarchical individual differences
In this modeling attempt, we assume that the latent γ and p 0 parameters for individual participants are generated by more abstract latent parameters (hyper-parameters) describing group distributions across individuals. This model considers individual differences by specifying distributions of parameters γ and p 0 , out of which individual γ i and p 0i have to be sampled. For the parameter γ , these distributions were assumed to be normal distributions, characterized by mean µ γ and standard deviation σ γ , with weakly informative prior distributions. The beta distribution of parameter p 0 was specified by two hyperparameters, one being a mean parameter ϕ = α/(α + β) and one being a total count parameter λ = α + β, instead of the usual α and β, following Gelman et al. (2014b). The sampling statement for the parameter γ i for each individual i was γ i ∼ N µ γ , σ γ [0, ∞) with hyper-priors µ γ ∼ N(1, 1) and σ γ ∼ Uniform (0, ∞) . The sampling statement for the parameter p 0 i for each individual i was p 0 i ∼ beta(ϕλ, (1 − ϕ) λ) with ϕ ∼ beta(1, 1) and λ having a pareto prior, p(λ) ∝ λ −2.5 . The hierarchical model and the model with unrestricted individual differences share the same sampling state for the data (Equation 14), they differ only in the dependence (for the hierarchical model) or independence (for the model with unrestricted differences) of the γ i and p 0i parameters. Figure 6 displays the graphical model for the hierarchical individual differences.

Model Evaluation and Selection
The quality of the out-of-sample prediction of the models was evaluated by comparing their (a) widely applicable information criterion (WAIC; Watanabe, 2010) and (b) log-likelihood in a leave-one out cross-validation. For both measures, the log pointwise predictive density (lppd) was computed as, Gelman et al. (2014b). This quantity measures the likelihood P y n θ s of the datum y n , using s = 1, . . . , S sampling FIGURE 5 | The graph of the Bayesian network with unrestricted inter-individual differences, with the urn inference y n , the posterior probability p, and the individual parameters p 0i and γ i .
FIGURE 6 | The graph of the hierarchical Bayesian network, with the urn inference y n , the posterior probability p, the individual parameters p 0i and γ i , and the hyper-parameters µ γ , σ γ and ϕ, λ.
results from the posterior distribution of all model parameters, labeled θ s .

The Widely Applicable Information Criterion (WAIC)
The WAIC uses the log point-wise predictive density and subtracts a correction term, corresponding to the effective number of parameters (Gelman et al., 2014b), here given by N n = 1 V S s = 1 (ln P(y n | θ s )), with the likelihood P y n θ s of the datum y n , using s = 1, . . . , S sampling results from the posterior distribution of all model parameters, labeled θ s , and the variance of the posterior, Note that a s is represented by ln P(y n |θ s ) in (17). Subtracting (15) from (14) gives V S s = 1 (ln p y n θ s )), (17) which is the formula for calculating the WAIC (Gelman et al., 2014b). Multiplying the result from Equation (17) with −2 gives us a measure on the scale of other deviance and information measures like deviation information criterion (DIC; Spiegelhalter et al., 2002) and an information criterion (AIC; Akaike, 1973).

Bayesian Leave-One-Out Cross-Validation
In Bayesian leave-one-out cross-validation (Vehtari and Lampinen, 2002), each data point is predicted using the estimates of the posterior distributions obtained by fitting the model to all remaining data points. We fit the model with 15 individuals to predict the urn inferences of the 16-th person, since we are interested in predicting the urn inferences of additional individuals, rather than additional urn inferences of each individual. By doing this for every individual, we can calculate the sum of the individual log point-wise predictive densities. To make the interpretation of this score similar to other deviance measures, where smaller values denote a better fit, we use −2 · lppd. The model assuming unrestricted individual differences makes no prediction about the individual parameters γ and p 0 for additional individuals. It is therefore impossible to cross-validate this model, and this model fails a basic test of generalizability because it does not make sensible predictions for the behavior of additional individuals. For the model without individual differences, using the posterior distributions of parameters γ and p 0 is straightforward to compute the log posterior predictive density given in Equation (14). For the model with hierarchical individual differences, the unknown individual parameters γ and p 0 were assumed to be sampled out of the hierarchical distributions characterized by the hyper-parameters µ γ , σ γ , ϕ, and λ. The data were analyzed in Python 2.7.9 (Oliphant, 2007), and the models were fitted using the Stan Hamiltonian-Monte-Carlo sampler that provides approximate inference of the posterior distributions of the unknown parameters (Stan Development Team, 2013;Hoffman and Gelman, 2014). Each model in each condition was sampled with 10 chains and 20,000 iterations, with random initialization of the parameters and a warmup phase of 10,000 iterations (Data Sheet 2 in Supplementary Material).

Data Description
In Figure 7, the proportions of H 1 -urn inferences are plotted separately for each condition (P c L c , P u L c , P c L u , P u L u ) as a function of the log-odds posterior probabilities, Lo(P (H 1 | E)). Points represent the proportion from one participant; small random noise (±0.25) was added to the points to spread them out on the x-axis. Diamond symbols indicate mean proportions of inferences of H 1 across participants, red dots indicate the proportion corresponding to the unweighted posterior probabilities. There is a considerable difference between the ideal proportion, indicated by the cross symbol, and the actual proportions of urn inferences of the participants, especially in the P c L u condition, and there is also considerable variation between the individual proportions. Not represented are the different numbers of urn inferences which constituted the individual proportions.

Unweighted Bayesian Probabilities and Base Rate Neglect
Predictions of the unweighted Bayesian model depend solely on posterior probabilities, P(H 1 |E), whereas predictions of the base rate neglect model depend solely on the likelihood of the data, P(E|H 1 ). The predictions of these two models therefore do not depend on additional parameters. Further information about the quality of these predictions is given below under Model Evaluation and Selection.

Weighted Bayesian Posterior Probabilities
For the model without individual differences, sampling from the model described in Equation (12), yields the posterior distributions for parameters γ and p 0 for each condition. The means of these posterior distributions were used as the estimates for the parameters in the weighting function described in Equation (5). The resulting functions vary across conditions, because the means of the posterior distributions for parameters γ and p 0 differ between conditions. For the model with unrestricted individual differences, sampling from the model described in Equation (13), yields the posterior distributions of the parameters specific for each individual in each condition. In each condition, the means of the posterior distributions of the individual parameters γ and p 0 were used as the estimates of the parameters of the individual weighting functions described in Equation (5). Figure 9 shows the probability weighting functions for each individual and condition.
There is considerable variation in the shape of the probability weighting functions between individuals in one condition, and between conditions in general. Descriptive statistics of the fitted model with unrestricted inter-individual differences are attached (see Appendix B in Supplementary Material).
For the model with hierarchical individual differences, the posterior distributions of each individual parameter γ and p 0 , and their hyper-parameters ϕ, λ, µ γ , and σ γ for each condition were approximated by sampling. The means of the posterior distributions of µ γ and σ γ , for each condition, were used to plot the expected hierarchical normal distribution of the individual parameters γ ∼ N(µ γ , σ γ ) in Figure 10. The posterior distribution of the γ parameter varies considerably between conditions. For conditions with a certain likelihood the posterior is centered around one, and is quite narrow. This is in contrast to the conditions with an uncertain likelihood, where the posterior is centered around higher values of γ and more spread out.
The means of the posterior distributions of ϕ and λ, for each condition, were used to plot the expected hierarchical beta distribution of the individual parameters p 0 ∼ beta (ϕλ, (1 − ϕ) λ) (Figure 11). We can distinguish two patterns in Figure 11: The posterior mode is (slightly) more extreme for certain than for uncertain prior conditions, and the posterior mode is closer to zero in conditions with an uncertain likelihood and closer to one in conditions with a certain likelihood.
For the individual parameters γ and p 0 for each individual and condition, the means of the posterior distributions were used as the parameters of the weighting function described in Equation (5), and the resulting functions were plotted in Figure 12.
There is still variation in the shape of the probability weighting functions between conditions, but the variation between individuals in one condition is reduced in contrast to Figure 9 for conditions with a more certain likelihood. Interestingly, in conditions with a more uncertain likelihood the variation between participants is either similar (P c L u ) or larger (P u L u ). Descriptive statistics of the fitted model with hierarchical inter-individual differences are attached (see Appendix C in Supplementary Material).

Model Evaluation and Selection
Model Comparison Using the WAIC Table 2 displays the WAICs for different models for each condition (P c L c , P u L c , P c L u , P u L u ). Smaller values indicate a better out-of-sample prediction than larger values. A comparison between models reveals superior performance of weighted Bayesian models compared to unweighted models. In the latter category, both models have similar WAIC values on uncertain prior conditions P u , while the base rate neglect model performs better on the P c L c condition, and the unweighted Bayesian model outperforms the base rate neglect model on the P c L u condition. Among the weighted Bayesian models, those models with unrestricted individual differences and hierarchical individual differences perform better than the model without individual differences on all conditions. However, the hierarchical model outperforms the model with unrestricted FIGURE 9 | Individual probability weighting functions using the posterior mean for γ and p 0 with unrestricted inter-individual differences for the different conditions (P c L c , P u L c , P c L u , and P u L u ). Each plot represents one condition, and each line represents the best estimate obtained for the weighted probability function for one individual in that condition.
inter-individual differences on P c L u and P u L u conditions, while both models perform similarly on the remaining conditions. In sum, the unweighted models yield larger WAIC values than the weighted models, with the hierarchical model providing the best overall performance.

Model Comparison Using Cross-Validation
The results of the cross-validation, averaged over the 16 participants, are given in Table 3 for each condition (P c L c , P u L c , P c L u , P u L u ). Further, cross-validation was not applicable FIGURE 10 | Hierarchical distributions for parameter γ in different conditions (P c L c , P u L c , P c L u and P u L u ). The hierarchical distributions for γ ∼ N(µ γ , σ γ ) use the posterior means of the hyper-parameters µ γ and σ γ .
to the model with unrestricted individual differences. The weighted Bayesian models predict new data better than the unweighted models. However, in the P u L u condition, the model without inter-individual differences yields the worst average prediction of all models, while the model with hierarchical interindividual differences still yields the best prediction. Among the weighted Bayesian models, the model with hierarchical individual differences outperforms the model without differences on conditions P u L c , P c L u , and P u L u , with especially large differences in the quality of prediction in uncertain likelihood conditions (L u ), and similar performance of both models in the P c L c condition.

Effects of Prior and Likelihood Manipulations
Since we observe differences in the probability weighting functions between conditions, we can test for the effect of the experimental prior and likelihood manipulations on the model parameters parameter γ and p 0 . We use the parameters of the model with full inter-individual differences, and estimate a linear mixed-effects model, with a random intercept per participant, categorical variables representing the manipulation of prior and likelihood and the interaction between these manipulations (see Appendix D in Supplementary Material for a more detailed description and the full model statistics). We estimate such a model separately for γ and p 0 and find a significant effect for the uncertain likelihood conditions in γ (β = 0.35, SE = 0.17, z = 2.01, p = 0.036). Uncertain likelihood conditions increase the γ coefficient and change the probability weighting function to an S-shape (or a convex shape if p 0 is close to one). For the p 0 parameter, we find a significant effect also only for uncertain FIGURE 11 | Hierarchical distributions for parameter p 0 in different conditions (P c L c , P u L c , P c L u and P u L u ). The hierarchical distributions p 0 ∼ beta(ϕλ, (1 − ϕ) λ) use the posterior means of the hyper-parameters ϕ and λ for the different conditions. likelihood conditions (β = −0.15, SE = 0.05, z = −3.00, p = 0.003). In uncertain likelihood conditions the cutting point of the probability weighting function is lower than in certain likelihood conditions. Of course, this effect can also be seen in the posterior of the hierarchical model.

DISCUSSION
In this article, we explored various possibilities for modeling cognitive processes for probabilistic inference. To that end, probabilistic inference was observed in a small world (Savage, 1954), vested as a classic urn-ball paradigm (Phillips and Edwards, 1966;Grether, 1980Grether, , 1992Stern et al., 2010;Achtziger et al., 2014) involving a factorial two (prior probabilities) by two (likelihoods) design. Probabilistic inference was modeled as originating from different variants of Bayesian inference. Five computational models of cognitive processes for probabilistic inference were compared by Bayesian model evaluation (Vehtari and Lampinen, 2002;Shiffrin et al., 2008). We found considerable task dependency such that more certain likelihoods were associated with mean group probability weighting parameters that satisfy γ < 0.6 (cf. Tables A.1, A.2), whereas less certain likelihoods were associated with mean group probability weighting parameters that satisfy γ < 0.9 (cf. Table A.3) or γ > 1 (cf. Table A.4). We are not aware of a psychological theory that could predict this strong task dependency of probability weighting.
The least complex models (i.e., parameter-free Bayesian posterior probabilities, parameter-free base rate neglect) provide inadequate models of probabilistic inference in terms of the FIGURE 12 | Individual probability weighting functions using the posterior mean for γ and p 0 in the hierarchical model for the different conditions (P c L c , P u L c , P c L u and P u L u ). Each plot represents one condition, and each line represents the best estimate obtained for the weighted probability function for one individual in that condition.
robustness of the fit between models and data as well as in terms of their generalizability. The introduction of probability weighting functions (Kahneman and Tversky, 1979;Tversky and Kahneman, 1992;Prelec, 1998;Gonzalez and Wu, 1999;Luce, 2000;Zhang and Maloney, 2012;Cavagnaro et al., 2013) yielded more robust and generalizable fits between models and data than the two parameter-free models. Probability weighting models share the assumption that subjective probabilities deviate from true probabilities due to (inverted) S-shaped distortions of probabilities.
The least complex of these models conceptualized different slope, γ , and crossover point, p 0 , parameters for probability distortion across conditions, but not across individuals (model without individual differences). This model includes eight free parameters (four conditions by two free parameters, i.e., four γ and four p 0 parameters).
There were two variants of individual difference models which envisaged different slope and crossover point parameters for probability distortion across conditions and individuals. The unrestricted individual differences model considered individual differences, by specifying individual γ i and p 0 i parameters, which are fully determined by the data. This model includes four (conditions) by 32 free parameters (16 individual γ i parameters and 16 individual p 0 i parameters, for a total of 128 free parameters).
In contrast, the hierarchical individual differences model assumed individual parameters themselves to be generated by more abstract latent parameters (hyper-parameters) describing group distributions across individuals. This model thus considered individual differences by specifying distributions of parameters γ and p 0 , out of which individual γ i and p 0 i had to be sampled. This model includes four (conditions) by 32 free parameters (16 individual γ i parameters and 16 individual p 0 i parameters), plus sixteen hyper-parameters (four conditions by four hyper-parameters, µ γ , σ γ , µ p 0 , σ p 0 , for a total of 144 parameters). Since the effective number of parameters in a hierarchical model depends on the variance of the group level parameters (Gelman et al., 2014b), the actual number of free parameters is <144. We show that the hierarchical individual differences model outperformed the unrestricted individual differences model, which in turn outperformed the model without individual differences. This is the result of a process of model building, in which each increase in model complexity is justified by an increase in the model's fit and predictive power. To conclude, the assumption of large differences in probability distortions across tasks and individuals (i.e., in the values of the slope and crossover point parameters) seem critical for understanding Bayesian inference (see also Glöckner and Pachur, 2012;Zhang and Maloney, 2012). Further, it seems advantageous to consider individual differences as being sampled from weakly informative prior distributions of individual parameter values. The hierarchical model of weighted Bayesian posterior probabilities is thus the preferable model of probabilistic inference, despite the non-negligible association between model performance and model complexity. However, parameter counts and estimations of model complexity should not be considered as equivalents. While the hierarchical model is more complex in terms of the number of parameters, it also restricts the individual parameters more than the model with unrestricted differences. Further, parameters, which are fully determined by the data (or by parts of the data, as in the unrestricted model), have a higher chance to overfit in comparison to those parameters that are imposed by some theoretical structure (e.g., the similarity of inter-individual differences, as in the hierarchical model).
Similarly, to further prevent overfitting, the hierarchical model could be extended to explicitly model the different experimental conditions through an additional set of hyper-parameters. More concretely, an implementation in the framework of hierarchical Bayesian models would model the influence of subject, as well as experimental manipulation on the parameters of a probability weighting function. Further potential confounds, like task-order, and sequential effects in probability weighting might also be included. A final expansion of the hierarchical model could be its formulation as a Bayesian hierarchical mixture model (Bartlema et al., 2014), allowing groups of participants to have discrete, in addition to continuous, inter-individual differences. Hierarchical Bayesian modeling offers specific advantages. Since knowledge about parameters propagates in hierarchical Bayesian models, the flow of probabilistic influence (Koller and Friedman, 2009) allows individual parameters to influence each other with regard to the certainty of their estimation. This renders the sharing of statistical strength (Gelman et al., 2014a) possible, with consequential improvements in predictive accuracy under high levels of uncertainty. This ubiquitous characteristic of hierarchical Bayesian modeling can also be recognized in our study, where the hierarchical model of weighted Bayesian posterior probabilities proved especially superior on those conditions that involved particularly high uncertainty about individual parameter values (i.e., the L u conditions).
Our results are highly dependent on task conditions and individuals. However, the results from hierarchical Bayesian modeling converge with previous results in revealing that probability weighting parameters show considerable task dependency and individual differences (Tversky and Kahneman, 1992;Gonzalez and Wu, 1999;Stott, 2006;Wu et al., 2009). We have to admit that our work does not offer an a-priori explanation nor a post-hoc speculation as to why this should be the case. The hierarchical model is hence not a theory of cognitive processes for probabilistic inference, despite its hereby documented success in providing a reasonably good estimate of the observed data. Such a theory would encompass an answer to the question why subjective probabilities may deviate from the true probabilities in the ways that they apparently do, and specify the factors that affect this probability distortion. A full explanation of the phenomena just described would require not only that we account for the S-shaped form of the probability distortion, but also for the large differences in the values of the slope and crossover point parameters across tasks and individuals. Although such a theory still needs to be developed, we introduced hierarchical Bayesian modeling as a valuable method for developing a model, for testing it against data, for checking if there is systematic error in the predictions of the model, for increasing the complexity of the model as long as there are intolerable amounts of error, and for checking whether the more complex model provides a superior explanation for the observed data. Furthermore, hierarchical Bayesian modeling of cognitive processes for probabilistic inference may serve an instrumental role in neuroscientific studies of Bayesian inference (Kolossa et al., 2013(Kolossa et al., , 2015.

CONCLUSION
Our findings also bear on the question whether inductive inference (Anderson, 2015) can be described as being Bayesian or not (Gigerenzer and Murray, 1987;Koehler and Harvey, 2004). According to Clark (2013), the answer to this question is empirically underdetermined; thus, new methods for tackling the problem need to be developed and evaluated. Historically, two almost certainly irreconcilable intellectual camps evolved, although serious attempts exist to bridge the conflictive positions in the form of dual process models (Evans, 2008;Evans and Stanovich, 2013). The Bayesian camp claimed that there is sufficient evidence for postulating that Bayes' theorem can serve as a descriptive theory of human inductive inference (Chater et al., 2010;Tenenbaum et al., 2011;Griffiths et al., 2012). However, the non-Bayesian camp insisted that the evidence showed that humans critically deviate from prescriptive Bayesian solutions (e.g., base rate neglect; Kahneman and Tversky, 1973;Bar-Hillel, 1980). In fact, some researchers believe that humans should be better described as homo heuristicus rather than as rational probabilists (Kahneman et al., 1982;Gigerenzer and Brighton, 2009). Our data offer another possibility: Humans might be able to integrate information for inductive inference according to Bayesian prescriptions (cf. Equations 1-4), yet in terms of distorted probabilities (cf. Equations 5-9), as originally conjectured by Edwards (1954) and by Prospect Theory (Kahneman and Tversky, 1979;Tversky and Kahneman, 1992).

AUTHOR CONTRIBUTIONS
BK developed the study design and collected the data. MB performed the data analysis under the supervision of BK. MB and BK drafted the manuscript, and CS and FL provided critical revisions. All authors approved the final version of the manuscript for submission.