Context Effects in the Judgment of Visual Relative-Frequency: Trial-by-Trial Adaptation and Non-linear Sequential Effect

Humans' judgment of relative-frequency, similar to their use of probability in decision-making, is often distorted as an inverted-S-shape curve—small relative-frequency overestimated and large relative-frequency underestimated. Here we investigated how the judgment of relative-frequency, despite its natural reference points (0 and 1) and stereotyped distortion, may adapt to the environmental statistics. The task was to report the relative-frequency of black (or white) dots in a visual array of black and white dots. We found that participants' judgment was distorted in the typical inverted-S-shape, but the distortion curve was influenced by both the central tendency and spread of the distribution of objective relative-frequencies: the lower the central tendency, the higher the overall judgment (contrast effect); the higher the spread, the more curved the inverted-S-shape (curvature effect). These context effects are in the spirit of efficient coding but opposite to what would be predicted by Bayesian inference. We further modeled the context effects on the level of individual trials, through which we found not only a trial-by-trial adaptation, but also the non-linear sequential effects that were recently reported mainly in circularly distributed visual stimuli.


INTRODUCTION
The human perceptual system adapts to the environmental statistics from time to time (Helson, 1947;Gilchrist et al., 1999;Dean et al., 2005;Chopin and Mamassian, 2012;Gepshtein et al., 2013). For example, a lighted outdoor sign that dazzles at night may look dim in the daylight. Adaptation like this allows the human brain to use neurons of limited dynamic range to represent the immense dynamic range of physical stimuli (10 9 for luminance, from star light illumination to intense daylight conditions). But it comes at a cost: in order to be sensitive to differences in the current environment, the mapping from physical stimuli to perception must be non-stationary. That is, a stimulus that is physically 5 times as large as a second stimulus may be perceived 10 times as large as the latter in one context and only 2 times as large in a different context. This non-stationarity can be harmless in many situations (e.g., in the perception of lightness), where only ordering information (e.g., which is brighter and which is darker) is required.
Here we investigated how the judgment of relative-frequency may adapt to the environment. For the perception of relativefrequency, adaptation can be both helpful and harmful. On one hand, relative-frequency in real life, like luminance, has a vast dynamic range that may challenge the neural system. For example, the relative-frequencies of different causes of death span six orders of magnitude (Lichtenstein et al., 1978). On the other hand, as a source of probability information, relative-frequency needs an accurate representation. Any nonstationary transformations accompanying adaptation would hurt one's ability to maximize expected gain in decision-making.
Relative-frequency differs from many sensory stimuli in its abstractness and in its finite range-from 0 to 1. What is special about relative-frequency is also its stereotyped distortion: Humans' judgment of relative-frequency, similar to their use of probability in decision-making (Tversky and Kahneman, 1992;Gonzalez and Wu, 1999), is often distorted in an inverted-Sshape-small relative-frequency overestimated and large relativefrequency underestimated. For example, people overestimate the relative-frequency of rare causes of death such as flood and hurricane and underestimate that of common causes such as heart disease (Lichtenstein et al., 1978; see Zhang and Maloney, 2012 for more examples). The opposite pattern, Sshaped distortion, was also reported (Shuford, 1961;Pitz, 1966;Brooke and MacRae, 1977;Wu et al., 2009). Zhang and Maloney (2012) found that the inverted-S-or S-shaped distortion in a variety of tasks could be well-captured by a Linear-in-Log-Odds (LLO) transformation: where p and π p respectively denote objective and subjective probability or relative-frequency, λ [·] denotes the log-odds transformation, λ p = log p 1−p , and γ and p 0 are free parameters that are readily interpretable. The parameter γ indicates the slope of the distortion curve, with γ < 1 for inverted-S-shaped distortion, γ = 1 for no distortion, and γ > 1 for S-shaped distortion. The parameter p 0 indicates the crossover point where π p = p. In other words, the γ and p 0 are measures respectively for the curvature and elevation of probability distortion (Gonzalez and Wu, 1999).
In our experiment, participants judged the relative-frequency of black (or white) dots among an array of black and white dots ( Figure 1A). There were four conditions for the distribution of the objective relative-frequency p ( Figure 1B). In the baseline Uniform condition, p was uniformly distributed between 0.01 and 0.99. The Small (Large) condition differed from the Uniform condition mainly in the central tendency of the distribution by having a disproportionally great number of very small (large) values of p. The Extreme condition was U-shaped (i.e., most values were extreme) and differed from the Uniform condition in the spread of the distribution.
We asked two questions. The first question is whether and how participants' judgment of relative-frequency, π p , may vary with the distribution of p. There has been increasing evidence that adaption functions not only for sensory modalities, but also for abstract quantities such as utility (Tobler et al., 2005;Kobayashi et al., 2010;Louie et al., 2013;Khaw et al., 2017;Rustichini et al., 2017), numerosity (Burr and Ross, 2008;Cicchini et al., 2014), rate (Levitan et al., 2015), and variance (Payzan-LeNestour et al., 2016), in the form of contrast effects: the same quantity tends to be perceived larger in a context of small quantities and smaller in contrast with large quantities. Such contrast effect was also found for relative-frequency in a task similar to ours (Varey et al., 1990).
What concerned us are not individual values but how the whole curve of π p may change with the context and what principles the changes may follow. We considered two lines of theories that provide opposite predictions for the possible context effects (see Figure 2 for the simulated predictions). One line of theories is represented by the adaption-level theory (Helson, 1947;Parducci, 1965), which assumes that the perception of a specific stimulus reflects the difference between the stimulus and an internal reference point. The value of the reference point, called "adaptation level", is determined by the average value of the stimuli in the context. The adaptation-level theory predicts a contrast effect (Figure 2A): The Small condition, which had a lower central tendency than the Uniform condition, would lead to a higher elevation for the π p curve, while the Large condition would lead to a lower elevation than the Uniform condition. Because the adaptation level is not influenced by the spread of the distribution, the adaptation-level theory predicts no difference between the Extreme condition and the Uniform condition.
The second line of theories treats perceptual judgment as a Bayesian inference problem [see (Maloney and Zhang, 2010;Petzschner et al., 2015) for reviews]-inferring the true value of a physical stimulus (in the current experiment, relative-frequency) based on its noisy percept. To compensate for the uncertainty in the percept, the final judgment would combine the percept and prior information about the stimulus. If the prior participants used follows the distribution of p's they had experienced in the experiment, their judgment would be biased toward the highdensity regions of the distribution. Thus, the Bayesian inference theory predicts an assimilation effect ( Figure 2B): The Small condition, which had high densities on the small end, would have a lower elevation than the Uniform condition, while the Large condition would have a larger elevation than the Uniform condition. Similarly, for the U-shaped Extreme condition, the concentration of p's on the two ends would attract π p toward the two ends, that is, a steeper slope than the Uniform condition.
Our second question is how the context effects, if any, may arise from trial to trial. In our experiment, participants were never explicitly informed about the distribution of p and could only learn the distribution via individual trials. We modeled two processes on the level of individual trials. The first process is a trial-by-trial updating of reference point, which is a natural extension of the adaptation-level theory with the additional assumption that the adaptation level (reference point) is updated by the delta rule (Rescorla and Wagner, 1972). As the result, the reference point assigns higher weights to more recent trials and would be able to track the changes in the context.
The second process we investigated is the sequential effect, that how the stimulus or response of a precedent trial may bias the current response. A common practice to quantify the sequential effect (Pegors et al., 2015) is to regress the current response (R n ) against the stimulus (S n−i ) or response (R n−i ) itrial back, i = 1, 2, ..., m. In this way, it is assumed that R n changes linearly with S n−i or R n−i , or, in other words, the sequential effect is linear. Most sequential effects documented in the literature are linear sequential effects (Fründ et al., 2014). Fischer and Whitney (2014), however, reported a non-linear sequential effect in the perception of orientation: the sequential effect first increases then decreases as the distance between the previous stimulus and current stimulus increases. Non-linear sequential effects were also found in the perception of motion (Alais et al., 2017), facial identity (Liberman et al., 2014) and numerosity , and in visual working memory of colors (Makovski and Jiang, 2008). We were interested in whether there were similar non-linear sequential effects in the judgment of relative frequency and how these inter-trial effects might contribute to the global context effect.
Here is a brief summary of our experimental results: we found that the judgment of relative-frequency is distorted, as typical, but the distortion curve changes with the distribution of relative-frequencies in both curvature and elevation. The observed context effect in elevation agrees with the prediction of the adaptation-level theory but opposite to that of Bayesian inference, while the effect in curvature can be accounted by neither theory but conforms to the principle of efficient coding (Attneave, 1954;Barlow, 1961;Simoncelli and Olshausen, 2001;Stocker, 2012, 2015;Burr and Cicchini, 2014;Summerfield and Tsetsos, 2015). On the level of individual trials, we found evidence for a trial-by-trial adaptation and a non-linear sequential effect, which could partly account for the observed context effects.

Ethics Statement
The experiment had been approved by the Institutional Review Board of School of Psychological and Cognitive Sciences at Peking University. All participants gave written informed consent in accordance with the Declaration of Helsinki.

Participants
Sixty-four students of Peking University participated (28 male, aged 18-29) and were randomly assigned into four experimental conditions, with 16 participants for each condition. All participants had normal or corrected-to-normal vision. The experiment took ∼70 min and participants received 50 RMB (≈ 8 USD) for their time.

Apparatus and Stimuli
Participants were seated ∼86 cm (i.e., 1.5 cm ≈ 1 • of visual angle) in front of a 21.5 ′′ iMac monitor (47.3 × 26.6 cm, 1,920 × 1,080 FIGURE 2 | Opposing effects predicted by two influential lines of theories. The predicted deviation of the subjective from objective relative-frequency, π (p) − p, is plotted as a function of the objective relative-frequency, p, and compared across the four distribution conditions (U, E, S, L, color coded). (A) Adaptation-level theory (Helson, 1947). The π (p) is assumed to reflect the difference between the p and a reference point known as "adaptation-level", which shifts with the average value of the distribution. Thus, π (p) is repelled away from the concentrated areas of p: compared to the U condition, there is an overall overestimation for the S condition, and an overall underestimation for the L condition; there is no difference between the U and E conditions (the interleaved dashed lines), since the two have the same adaptation-level. (B) Judgment as Bayesian inference (Jazayeri and Shadlen, 2010). The subjective relative-frequency, π (p), is assumed to be a posterior estimate that integrates the percept with the prior distribution of the objective relative-frequency. Thus, π (p) is attracted toward the concentrated areas of p: compared to the U condition, there is less overestimation of small p's and less underestimation of large p's for the E condition, an overall underestimation for the S condition, and an overall overestimation for the L condition.
pixels, 60-Hz refresh rate). The display of stimuli and recording of responses were controlled by the iMac computer using Matlab and PsychToolbox-3 (Brainard, 1997;Pelli, 1997). Stimuli on each trial were an array of black and white dots on a gray background ( Figure 1A). Dots were located randomly but non-overlapped within a centered invisible circle that subtended a visual angle of 10 • . Each dot subtended ∼0.1 • .

Procedure and Design
The task was to judge the relative-frequency of the black (or white) dots in the array of black and white dots. Half of the participants judged for the black dots and the other half for the white dots. Figure 1A shows the time course of a trial: Shortly after the onset and offset of a fixation cross, an array of black and white dots was presented for 1.5 s, followed by a horizontal bar with tick marks from 0 to 100%. Participants were asked to click on the bar to report their estimate of relative-frequency. In particular, when they moved the mouse left and right, the indicator on the bar (i.e., the boundary between the black and white regions) moved accordingly. Participants confirmed their estimates by left clicking the mouse, which terminated the trial. There was no time limit for response.
The total number of dots in a display could be 200, 300, . . . , 800, and the relative-frequency (denoted p) could be 0.01, 0.02, . . . , 0.99. There were four conditions as below, concerning the distribution of p's across trials ( Figure 1B). The Uniform condition refers to a uniform distribution of p on the range of [0.01, 0.99]. In the Small condition, small p's had a higher density−50/99 of p's were in the range of [0.01, 0.1]. In the Large condition, large p's had a higher density−50/99 of p's were in the range of [0.9, 0.99]. In the Extreme condition, there were 30/99 of p's on each end ([0.01, 0.1] or [0.9, 0.99]). Participants' judgment is denoted π p . What concerned us is how π p , as a function of p, may differ between different distribution conditions.
Each participant only completed one distribution condition. There were 693 trials, divided into 7 blocks of 99 trials. Each participant also completed 35 practice trials prior to the formal experiment. No feedback was given during the experiment. Participants were encouraged to respond as accurate as they could.

Judgment as Bayesian Inference
One may apply the framework of Bayesian inference to model the judgment of relative-frequency. The original percept, y, is assumed to be disturbed by a Gaussian noise on the log-odds scale: where Pr denotes probability, p denotes the objective relativefrequency, λ [·] denotes the log-odds transformation, and σ noise is a free parameter. Transforming back into the probability scale, we have: Suppose participants' prior for a specific distribution condition is the same as the true distribution θ p , which is defined separately for the Uniform, Extreme, Small, and Large conditions as below: According to Bayes' theorem, for any q ∈ {0.01, 0.02, ..., 0.99}, given the percept y and the prior θ q , the posterior probability of the stimulus being q is: Pr q y ∝ Pr y q θ q .
Following Jazayeri and Shadlen (2010) Bayes Least-Square model, we assumed that participants, in order to minimize the mean square error, would use the expectation of the posterior distribution as their estimate for p. The estimate conditional on a specific percept y is Given that the percept y itself is a random variable that cannot be directly observed, we need to marginalize off y to obtain a mapping from p to the final estimate π p : π p = π y Pr y p dy.

Adaptation-Level Theory
The adaptation-level theory does not predict the inverted-Sshaped distortion itself but predict how the distortion may change with the context. In our simulation for the adaptationlevel theory, the π p is determined by the same equation as LLO except for the inclusion of the adaptation level L: where, as in LLO, γ and p 0 are free parameters, and λ [·] denotes the log-odds transformation. The value of L shifts with the distribution of p: where η is a free parameter and θ p is defined for each distribution condition as in Equation (4). For Figure 2A, the parameters γ = 0.8, p 0 = 0.5, η = 0.2 .

Slope and Crossover Point Estimated From LLO
For each participant, we used LLO (Equation 1) to fit the reported relative-frequency, π p and estimated the slope parameter γ and the crossover point parameter p 0 .

Smoothed Distortion Curve and Non-parametric Measures
The γ and p 0 provide a model-based summary for a distortion curve. Still, critical details of the curve may be lost due to the limitation of the model. As a complementary analysis, we smoothed the distortion curve for each participant and elicited non-parametric measures of distortion from the smoothed curve.
In particular, we smoothed π p − p using a kernel regression method with the commonly-used Nadaraya-Watson kernel estimator (Nadaraya, 1964;Watson, 1964;Aljuhani and Al turk, 2014) where x i and y i (i = 1, 2, ..., N) denote observed pairs of stimuli and responses,M h (x) denotes the smoothed response at the stimulus value x, and h is a parameter that controls the degree of smoothing. The K (·) denotes the Gaussian kernel function According to the optimal bandwidth selection algorithm by Bowman and Azzalini (1997) the optimal values of h for different conditions and participants ranged from 0.02 to 0.07. To avoid possible artifacts for using different values of h, we set h to be 0.03. We computed the smoothed value of π p − p for p = 0.01, 0.02, ..., 0.99 based on the observations of all trials. The curvature measure for the smoothed distortion curve was defined as the area between the curve and the zero line, which is inversely related to the γ in LLO. The elevation measure was defined as the area of the curve above the zero line minus that below the zero line, which is related to the p 0 in LLO.

Estimating Sequential Effects
For each participant, we performed a linear regression to estimate the possible dependence of the current response on the responses of previous trials: where R n denotes the current response, S n denotes the current stimulus, R n−i denotes the responses i-trial back, β 0 , β −i , β C are free parameters, and ε ∼ N 0, σ 2 noise is a Gaussian random noise term. Following LLO, we used responses and stimuli that are both in the form of log-odds, that is, S n = λ p n = log p n 1−p n , R n = λ π p n = log π (pn) 1−π (pn) .
It is possible that R n is influenced by the previous responses R n−i as well as by the previous stimuli S n−i . But because R n−i and S n−i were highly correlated (Pearson's r > 0.746, p < 0.001 for all participants) and it was not our major interest to distinguish between the influences of R n−i and S n−i , we did not include both of them in the linear regression.
We did not assert that the sequential effects were really linear. That is, the β −i in Equation (12) was not necessarily constant across different stimuli. In a further analysis, we estimated the value of β −1 as a function of p n and p n−1 using weighted leastsquare regressions (WLS), denotedβ WLS −1 . For the regression centered at a specific pair of p n , p n−1 , where p n , p n−1 ∈ {0.01, 0.02, ..., 0.99}, the weight of trial j was determined by a two-dimensional Gaussian kernel function: where σ k denotes the span of the Gaussian kernel and was set to be 0.1, p j and p j−1 respectively denote the objective relativefrequency of trial j and trial j−1, j = 2, 3, ..., N. If we define the coefficients of the weighted least-square regression at p n , p n−1 could be estimated as:

Modeling
We considered six models of π p , which are all based on LLO but differ in two dimensions: whether to include trial-by-trial adaptation and the type of sequential effects assumed. In the equations for all models, R n (= λ π p n = log π (pn) 1−π (pn) ) denotes the current response, S n (= λ p n = log p n 1−p n ) denotes the current stimulus, R n−i denotes the responses i-trial back, β 0 , β −i , β C are free parameters, and ε ∼ N 0, σ 2 noise is a Gaussian random noise term.
The baseline model LLO, as defined earlier in Equation (1), assumes no adaptation or sequential effects. To make its notations consistent with the other five models, we formulize it as: The AL model is the same as the LLO model except for the inclusion of an adaptation-level term: where L n denotes the adaptation level on Trial n, which varies from trial to trial following a delta-rule learning: where κ is a free parameter for learning rate. We did not formulize any model with a fixed adaptation-level term, because a fixed additional term to Equation (19) would be assimilated into β C and thus the model would reduce to LLO (Equation 18).
The LLO-L model is the LLO model with linear sequential effects (see Equation 12). Similarly, the AL-L model is the AL model with linear sequential effects: In models with linear sequential effects, the influence of R n−i is the same β −i for any R n−i . The LLO-NL model is the LLO model with non-linear sequential effects whose strength decreases with the distance between R n−i and S n : where ω is a free parameter. With R n−i multiplied by , the ω determines how fast the influence of R n−i decreases with the distance between R n−i and S n .
Similarly, the AL-NL model is the AL model with non-linear sequential effects: See Table 1 for a summary of models. The parameters of the models were estimated using maximum likelihood estimates. The MATLAB function fmincon (with the interior-point algorithm) was used for searching for the parameters that minimized negative log likelihood. To verify that we had found the global minimum, we repeated the searching process for 300 times with different starting points. Only R n 's with trial number n ≥ 6 were fitted so that the i in R n−i could take values up to 5 (n − i must be positive).

Efficient Coding Analysis
In any case the p-to-π p mappings were different between the Uniform condition and the other three conditions, we would be interested in whether the change of mapping across conditions agrees with efficient coding. Given that the responses for the judgment of relative-frequency are limited between 0 and 1, the distribution of responses that would maximize information transfer is a uniform distribution over the range of [0, 1] (Simoncelli and Olshausen, 2001). For a specific condition (Extreme, Small, or Large), if its p-to-π p mapping deviates from that of the Uniform condition in the direction of efficient coding, the distribution of its observed responses should be more similar to the uniform distribution than the response distribution predicted by the mapping of the Uniform condition is. We quantified the dissimilarity between a specific response distribution and the uniform distribution over [0, 1] using the Kullback-Leibler (KL) divergence:

VARIABLES OR FUNCTIONS
p A generic value on the probability scale; objective probability or relative-frequency λ (·) Log-odds function of probability or relative-frequency . λ (p) = log (p/ (1 − p)) π (p) Subjective probability or relative-frequency where f r (i) and f u (i) respectively denote the probability of the response distribution and the probability of the uniform distribution in the i-th bin, i = 1, 2, ..., 10, with bins evenly divided between 0 and 1. We collapsed across trials from all 16 participants of a specific condition to obtain the observed response distribution of the condition. For each condition, we simulated the response distribution predicted by the mapping of the Uniform condition (Umapping) as follows. For the stimulus of each trial, a virtual response was generated by randomly choosing one response from the observed responses of the Uniform condition that were associated with an identical stimulus. The virtual responses for all trials formed the U-mapping predicted distribution. We repeated the simulation to generate 1,000,000 U-mapping predicted response distributions. For conditions other than the Uniform condition, efficient coding implies that D KL observed uniform − D KL U-mapping uniform < 0.

RESULTS
We chose to use non-parametric statistical tests whenever possible, because most of the variables tested were parameters estimated from models (e.g., the γ and p 0 estimated from LLO) that were not necessarily normally distributed. Unless otherwise stated, the significance level of 0.05 was used. The capital letter P was used to denote the value of significance, in order to be distinguished from the notation of relative-frequency (p).

Context Effects
We performed two lines of analyses to quantify how the distortion of relative-frequency may adapt to the environmental statistics. First, for each participant, we fitted the LLO model to the reported relative frequency, π p , and estimated the slope γ and the crossover point p 0 for the distortion. The predictions of the LLO model agreed well with the data (Figure 3A). The mean γ and p 0 for each distribution condition are shown in Figure 3B. Participants in all conditions had mean γ < 1, indicating the typical inverted-S-shaped distortion. Based on the predictions of the adaption-level theory and Bayesian inference, we were interested in whether the Uniform and Extreme conditions differed in γ and whether the four conditions differed in p 0 . For γ , a two-tailed Wilcoxon rank sum test showed that the difference between the Uniform condition (median 0.82) and the Extreme condition (median 0.76) failed to reach significance, Z = 1.26, P = 0.21. According to a Kruskal-Wallis test (a non-parametric equivalent of one-way ANOVA) on p 0 , different distribution conditions differed significantly in p 0 , χ 2 (3) = 12.44, P = 0.006, with post-hoc multiple comparisons (Tukey-Kramer corrected) showing the Small condition (median 0.59) had a significantly larger p 0 than the Large condition (median 0.39).
In a second line of analysis, we obtained a smoothed curve of π p − p ( Figure 3C). We can see the Small condition had the largest crossover point (i.e., the point the curve crosses the zero line) among the distribution conditions, which agrees with the findings about p 0 above. Meanwhile, the Uniform condition was Curvature and elevation measures of probability distortion derived from the π (p) − p curve of (C) The curvature measure is defined as the area between the curve and the zero line, which is inversely related to the γ in LLO. The elevation measure is defined as the area of the curve above the zero line minus that below the zero line, which is related to the p 0 in LLO. The variability of the estimated mean curvature and elevation is visualized by their bootstrap resamples, in clouds of dots. The bootstrap procedure was as follows: for each condition and each pair of parametric or non-parametric measures, we randomly sampled with replacement for 16 times from the 16 participants and computed the means of the estimated measures for the resampled participants. This procedure was repeated for 500 times to generate the 500 resamples visualized in each cloud.
less curved (less deviated from the zero line) than the Extreme condition.
To characterize the differences visible in the smoothed distortion curve, we defined the curvature metric (the area between the curve and the zero line) and the elevation metric (the area of the curve above the zero line minus that below the zero line) for each participant (Figure 3D). By definition, the curvature is inversely related to γ for γ ≤ 1 and the elevation is related to p 0 . We performed similar tests on the curvature and elevation as we did for γ and p 0 . For the curvature, a two-tailed Wilcoxon rank sum test showed that the Uniform condition (median 0.050) had a significantly smaller curvature than the Extreme condition (median 0.073), Z = −2.28, P = 0.023. This difference was in the same direction as the insignificant trend in γ . According to a Kruskal-Wallis test, different distribution conditions differed significantly in the elevation, χ 2 (3) = 10.18, P = 0.017. Post-hoc multiple comparisons (Tukey-Kramer corrected) showed that the Small condition had a significantly larger elevation (median 0.025) than the Large condition (median −0.011), which was consistent with the finding above that the Small condition had a significantly larger p 0 than the Large condition.
In sum, we found that both the curvature and elevation of the distortion were influenced by the statistical environment. That the Small condition had a higher elevation than the Large condition is in accordance with the prediction of the adaptationlevel theory (Figure 2A) but against that of Bayesian inference ( Figure 2B). In contrast, our finding that the Extreme condition had a higher curvature than the Uniform condition could not be explained by either theory. However, as we discuss later, it agrees with the principle of efficient coding.

Sequential Effects
We used linear regressions to estimate the possible contribution of the previous response (R n−1 ) to the current response (R n ). For each participant and distribution condition, we first regressed R n against S n and R n−1 (Equation 12, with m = 1) for all trials and denoted the coefficient for R n−1 as β −1 . The median β −1 for the Uniform, Extreme, Small, and Large conditions were respectively 0.033, 0.019, 0.033, and 0.017, all significantly greater than 0 (Two-tailed Wilcoxon rank sum tests, P < 0.03), indicating an attraction effect of previous response. When we extended the regressors to the responses up to five trials back, only β −1 was significantly different from 0 ( Figure 4A). Therefore, we only considered the sequential effect up to one trial back in the subsequent analysis.
But was the sequential effect really linear? To test this, we estimated the β −1 as a function of p n and p n−1 using weighted least-square regressions (see section Methods). A linear sequential effect would imply a homogeneous regression coefficient map, that is, the β −1 would not change with the values of p n−1 or p n . Instead, the estimated β −1 (i.e.,β WLS −1 ) showed a clear pattern of stimulus-dependence ( Figure 4B): its value decreased as the distance between p n and p n−1 increased. The highest values ofβ WLS −1 occurred on the diagonal line when p n equaled p n−1 .
This non-linear sequential effect could be well predicted by the AL-NL model (Figure 4C, see Equation 23 for the model), which assumes that the weight for the previous trial decreases with the inter-trial distance in the form of a Gaussian function.
To quantify the similarity between data and model predictions in the pattern of sequential effects, for each model we computed the correlation (Pearson's r) between the matrix of the mean β WLS −1 observed and that predicted by the model. There were high correlations for models assuming non-linear sequential effects (LLO-NL and AL-NL), but much lower correlations or even negative correlations for models assuming linear (LLO-L and AL-L) or none (LLO and AL) sequential effect ( Table 2). In sum, the modeling analysis provided converging evidence that across participants is plotted separately for the four conditions. Note that the closer the p n−1 was to the p n , the larger theβ WLS −1 . (C) AL-NL model's predictions for sequential effects.
the sequential effect was non-linear and showed that the nonlinear form we assumed in the -NL models could well capture the pattern of sequential effects in the data.

Modeling the Processes Underlying the Context Effects
What trial-by-trial processes might underlie the context effects of π p , given that participants were never explicitly informed about the distribution of relative-frequencies? We modeled two processes-trial-by-trial learning of adaptation-level and nonlinear sequential effect-and tested whether they contributed to a better explanation of the observed π p . In particular, we constructed six alternative models (see section Methods and Table 1) to compare the assumption of dynamic adaptation-level ("AL" models) with that of constant adaptation-level ("LLO" models), and to compare the assumption of non-linear sequential effect ("-NL" models) with that of linear ("-L" models) or none (null-postfix models) sequential effect. All models were fitted to each participant's π p using maximum likelihood estimates.

Model Comparison
To compensate for the difference in number of parameters between models, we computed the Akaike information criterion corrected for small sample-size (AICc; Akaike, 1974;Hurvich and Tsai, 1989), for each participant and each model as the metric for goodnessof-fit, where ln L denotes the log likelihood maximized, k denotes the number of parameters, and N denotes the number of trials. The lower the AICc, the better the model fit.
The best model among the six models was the AL-NL model for all distribution conditions except for the Extreme condition (where the best was LLO-NL and the second best was AL-NL), according to the AICc summed across participants (Figure 5). A group-level Bayesian model selection (Stephan et al., 2009;Rigoux et al., 2014) based on AICc suggested the same (see the red dot in Figure 5 for the protected exceedance probability, that is, the probability a specific model is better than all the other models). We can also see that the models with nonlinear sequential effects outperformed models with linear or none sequential effects, other things being the same. Except for the Extreme condition, AL models fit better than LLO models. The advantage of AL over LLO models was small in the Uniform condition and even negative in the Extreme condition, probably because the distribution of relative-frequency was centered at p = 0.5 in these conditions, where the final adaptation-level differed little from its initial value.

Estimated Parameters
The estimated parameters for the AL-NL model are shown in Figure 6. The six parameters can be divided into four categories (see section Methods and Table 1 for more details): the slope and intercept parameters that belong to the original LLO model (β 0 , β C ), the learning rate of adaptation-level (κ ), the parameters that control the non-linear sequential effect (β −1 , ω ), and the standard deviation of the noise term (σ ).
According to Kruskal-Wallis tests separately for each parameter, different distribution conditions differed significantly only in ω , χ 2 (3) = 14.65, P = 0.0021. The value of ω controls how fast the sequential effect decreases with the distance between the previous response and the current stimulus. The larger the ω , the slower the sequential effect decreases with distance. In the limiting case of ω → ∞ , the sequential effect would be stimulus-independent, equivalent to a linear sequential effect.
The median value of ω for the Uniform condition (1.296) was the smallest among the four conditions, significantly smaller than those of the Extreme condition (2.209) and Small condition (1.989), and marginally significantly smaller than that of the Large condition (1.890, P = 0.062), according to post-hoc multiple comparisons (Tukey-Kramer corrected) following the Kruskal-Wallis test. Why should ω -the parameter that controls how fast the sequential effect decreases with the difference between adjacent trials-differ between conditions? We noticed that the average distance between two adjacent trials in the Uniform condition (1.936) was significantly smaller than those of the other three conditions (2.945, 2.316, 2.317, respectively for Extreme, Small, Large), according to a Kruskal-Wallis test (χ 2 (3) = 53.22, P < 0.001) and post-hoc multiple comparisons (P < 0.001). It seems the choice of ω adapted to the average between-trial distance of the environment or statistics alike.

DISCUSSION
Relative-frequency, similar to probability, is an abstract quantity that does not rely on the physical energy of stimuli and requires the involvement of higher cognition. It differs from many kinds of abstract quantity such as numerosity and utility in that it is naturally bounded between 0 and 1. Here we investigated how the perception of visual relative-frequency may change with the environmental statistics. As typical, the judgment π p was distorted as an inverted-S-shaped curve of the objective relativefrequency p. We found two context effects concerning the π p curve. The first one was about the elevation of the curve: The lower the central tendency of the distribution of p, the greater π p − p. This is consistent with the contrast effect widely reported in the adaptation literature including that specially for relative-frequency (Varey et al., 1990), as well as with the prediction of the adaptation-level theory (Helson, 1947).
We also found a second context effect concerning the spread of the stimuli: the more dispersed the distribution of p, the more curved the inverted-S-shape of π p . Had π p not changed across contexts, when there were more p's on the two ends as in the Extreme condition, there would be more π p 's on the two ends as well. An increase in curvature in the Extreme condition implies a change of p-to-π p mapping so that π p could be more evenly distributed between 0 and 1. Such effect cannot be explained by the adaption-level theory, for which adaptation equals to the adjustment of a single reference point. What may be relevant is Parducci (1965) range-frequency theory for categorical responses, where observers are supposed to adjust their responses to balance the number of responses in each category. However, the range-frequency theory is not directly applicable, because the responses in our task were not categorical but continuous.
The two context effects together suggest adaptations of relative-frequency that go beyond the adjustment of a single reference point. It echoes neurophysiological studies where neurons adjust to both the central tendency and spread of the (1) without adaptation (LLO) or with adaptation (AL), and (2) without sequential effects ("None"), with linear sequential effects ("L"), or with non-linear sequential effects ("NL"). The summed AICc (bars, left axis, the lower the better) and protected exceedance probability (red dots, right axis, the probability that a specific model excels the other models considered, the higher the better) are plotted for each model and each distribution condition. For all except the E condition, the AL-NL model had an overwhelming advantage over the other models.
stimulus distribution in utility (Kobayashi et al., 2010) as well as in sensory responses (Dean et al., 2005). Such adaptations of pto-π p mappings are in the spirit of efficient coding (Attneave, 1954;Barlow, 1961;Simoncelli and Olshausen, 2001), to the goal of maximizing the discrepancy between different stimuli. In particular, we tested whether the response distribution observed in a specific condition (Extreme, Small, or Large), compared with that predicted by the mapping of the Uniform condition, was closer to the optimal response distribution. Given the limited range of responses in the task, the response distribution that maximizes information transmission is a uniform distribution over [0, 1] (Simoncelli and Olshausen, 2001). Figure 7 shows that the response distributions of the Extreme, Small, and Large conditions were significantly closer to the uniform distribution than those predicted by the mapping of the Uniform condition were (P < 0.0001). That is, if participants had used the same p-to-π p in these conditions as in the Uniform condition, the distribution of their responses would have been less optimal than the observed, in the perspective of efficient coding. According to Petzschner et al. (2015), Bayesian inference provides a parsimonious explanation for the biases in magnitude estimation for length (Laming, 1997), time (Jazayeri and Shadlen, 2010), distance, and angle (Petzschner and Glasauer, 2011). One of the phenomena explained was the regression effect (also known as regression-to-mean) that small magnitudes are overestimated and large magnitudes underestimated. This account may even apply to the inverted-S-shaped distortion of probability and relative-frequency. Unfortunately, Bayesian inference fails to predict either of the two context effects we found when we manipulated the distribution of relativefrequency systematically. That being said, it is still possible that Bayesian inference may play a role, though non-dominant, in the judgment of relative-frequency. As we discuss later, the nonlinear sequential effects probably reflect a compensation for uncertainty that resembles Bayesian inference.
We went further to model how the context effects may arise trial by trial and identified two trial-by-trial processes. First, compared with a constant adaptation-level, a trial-bytrial adjusted adaptation-level could better explain the observed contrast effect in elevation. When the adaptation-level is updated after each trial as a weighted average of the previous adaptationlevel and the current stimulus, the value of the adaptation-level  Table 1 for the references of the parameters (β 0 , β −1 , β C , κ , ω , σ noise ). *0.01 ≤ P < 0.05. **0.001 ≤ P < 0.01.
would gradually approach the central tendency of the stimulus distribution.
One mystery is the lack of adaptation in the Extreme condition. We conjecture that it is partly due to the larger distance between two adjacent trials in the condition (2.945) than those of the other conditions (1.936, 2.316, 2.317). There is evidence that adaptation may stop to work when the discrepancy between trials are too large (Levitan et al., 2015). A second possibility for the lack of adaptation concerns the bimodal distribution of objective relative-frequencies used in the Extreme condition. While adaptation for a unimodal distribution leads to an adaptation level close to the mode of the distribution, similar adaptation for a bimodal distribution may result in an adaptation level falling between the two modes and thus representative of neither mode. For this reason, the perceptual system may adopt a different strategy for bimodal or multimodal distributions. These possibilities need to be tested in the future.
Second, we found a non-linear sequential effect: the current response was biased toward the response on the previous trial, with the size of the bias well captured by a Derivative-of-Gaussian (DoG) function of the inter-trial distance. Sequential effects had been widely reported in perceptual and cognitive tasks (Fründ et al., 2014), which can be rationalized in the framework of Bayesian inference as ways of compensating for sensorimotor uncertainty (Körding and Wolpert, 2004;Jazayeri and Shadlen, 2010;Petzschner and Glasauer, 2011;Cicchini et al., 2012;Raviv et al., 2012). If the percept for the current trial were independent of those of the precedent trials in random noises, combining the current percept with previous responses appropriately would allow one to achieve a less varied response than using the current percept alone. In practice, the weight for a precedent trial was often modeled as a constant so that the response would be a linear combination of the current percept and previous responses. However, it is recently found in the perception of orientation ) that the weight received by a precedent trial is not a constant; instead, it decreases with the distance between the current and precedent trial in the feature space. In discovering so, Fischer and Whitney (2014) plotted the judgment error of the current trial as a function of the orientation difference between the previous and current stimuli and obtained a telling DoG-shaped curve that implies decreasing weight for the precedent trial as the inter-trial distance increases. Similar DoG-shaped curves and thus non-linear sequential effects have been identified in the perception of motion (Alais et al., 2017) and facial identity (Liberman et al., 2014) and in visual working memory of colors (Makovski and Jiang, 2008).
Except for numerosity , all the nonsequential effects found so far were for circularly distributed stimuli via plotting judgment error as a function of inter-trial distance. The same visualization can hardly qualify as a test for the linearity of sequential effects in non-circularly distributed stimuli (though see Figure 4 for clues of non-linearity), because denotes the KL divergence from the uniform distribution to the response distribution predicted by the p-to-π (p) mapping of the Uniform condition. D KL ( observed uniform) − D KL ( U-mapping uniform) is plotted for each condition, with the Uniform condition serving as a sanity check (i.e., the difference should be 0). A negative difference implies that the p-to-π (p) mapping adopted by a specific condition is closer to efficient coding than applying the mapping of the Uniform condition to the condition. In the box plot, the middle line denotes the median across 1,000,000 simulations, the bottom and top lines denote the lower and upper quartiles, and the error bars denote the 99% confidence interval. ****P < 0.0001. different stimuli would be associated with different distributions of inter-trial distances. In the case of relative-frequency, the difficulty of visualization was increased by the systematic biases inherent in the judgment. We used modeling methods to overcome the difficulty: We constructed models assuming different forms of sequential effects, among which models with non-linear sequential effect fit best to the observed π p . The non-linear sequential effects were conjectured to be a mechanism that helps to keep visual stability across space and time . Our finding of similar non-linear sequential effects in the abstract quantity relativefrequency, along with that of numerosity , suggests a more general mechanism than the previously theorized. In fact, it implies a bifurcation: when the previous stimulus is close to the current stimulus, the current response merges the two; when the previous stimulus is far from the current stimulus, the current response simply dismisses the previous one. Such bifurcations have been widely observed in the group decision of animals (Couzin, 2009), in the competition between neurons (Nichols and Newsome, 2002), and in the integration of information from multiple sensory modalities (Wozny et al., 2010).
We found that the non-linear sequential effect could adapt to the distribution of p. In the winning AL-NL (or LLO-NL) model, the strength of the bias toward the previous trial is controlled by two parameters: a scaling factor β −1 and the scope-of-influence ω . The value of ω but not β −1 had significant differences between different distribution conditions. It would be interesting to see what factors may influence the non-linear sequential effects. Fischer and Whitney (2014) found that the size of the non-linear sequential effects in the perception of orientation would decrease with the spatial or temporal proximity between trials. Whether the non-linear sequential effects found in the judgment of relative-frequency follow similar principles is unknown. Whether non-linear sequential effects may give way to linear sequential effects under certain circumstances is also an empirical question.
To conclude, human judgment of relative-frequency adapts to the environmental statistics trial by trial toward the direction of maximizing the discrepancy between different stimuli. Between trials there are also non-linear sequential effects that may help to reduce the variability of response.

AUTHOR CONTRIBUTIONS
XR and HZ designed the experiment. XR performed the experiment. XR, MW, and HZ analyzed the data and wrote the manuscript.