A Model-Based Approach to Trial-By-Trial P300 Amplitude Fluctuations

A Model-Based Approach to Trial-By-Trial P300 Amplitude Fluctuations

It has long been recognized that the amplitude of the P300 component of event-related brain potentials is sensitive to the degree to which eliciting stimuli are surprising to the observer (Donchin, 1981). While Squires et al. (1976) showed and modeled dependencies of P300 amplitudes on observed stimuli over various time scales, Mars et al. (2008) proposed a computational model keeping track of stimulus probabilities on a long-term time scale. We suggest here a computational model which integrates prior information with short-term, long-term, and alternation-based experiential influences on P300 amplitude fluctuations. To evaluate the new model, we measured trial-by-trial P300 amplitude fluctuations in a simple two-choice response time task, and tested the computational models of trial-by-trial P300 amplitudes using Bayesian model evaluation. The results reveal that the new digital filtering (DIF) model provides a superior account of trial-by-trial P300 amplitudes when compared to both Squires et al.'s (1976) model and Mars et al.'s (2008) model. We show that the P300-generating system can be described as two parallel first-order infinite impulse response (IIR) low-pass filters and an additional fourth-order finite impulse response (FIR) high-pass filter. Implications of the acquired data are discussed with regard to the neurobiological distinction between short-term, long-term, and working memory, as well as from the point of view of predictive coding models and Bayesian learning theories of cortical function.


INTRODUCTION
The notion of a Bayesian brain is increasingly recognized as providing a distinctive framework for investigating cognitive brain functions (Kersten et al., 2004; Knill and Pouget, 2004; Friston, 2005; Doya et al., 2007). Predictive coding theories of cortical function provide a possible route to the Bayesian brain (Friston, 2002). According to the predictive coding approach, constraints from higher levels of a cortical hierarchy provide contextual guidance to lower levels of processing, providing a theory of how bottom-up evidence is combined with top-down priors to compute the most likely interpretation of sensory data. Specifically, predictive coding theory proposes that an internal representation of the world generates predictions that are compared with stimulus-driven activity to calculate the residual error between the predicted and the actual information. The residual error is then used to update the internal representation so as to minimize the residual error imposed by future stimuli (Friston, 2002, 2005; Spratling, 2010).
The general scheme of predictive coding as a ubiquitous mode of cortical processing offers an instrumental framework for analyzing functional correlates of the P300 event-related brain potential (Sutton et al., 1965; Kopp, 2008). It has long been recognized that fluctuations in P300 amplitude reflect the degree of surprise related to the processing of attended, but unforeseeable sensory events. In particular, Donchin (1981) argued that P300 amplitude is not crucially determined by the inherent attributes of eliciting events. Instead, he asserted that "surprising events elicit a large P300 component" (p. 498). Squires et al. (1976) had presented a model of P300 amplitude fluctuations, based on the concept of expectancy, which was thought to be determined by three factors: "(i) the memory for event frequency within the prior stimulus sequence, (ii) the specific structure of the prior sequence, and (iii) the global probability of the event" (p. 1144).
More recently, Mars et al. (2008) proposed a computational model of processes underlying the generation of the P300 in which trial-by-trial fluctuations in P300 amplitudes were explained in terms of a Bayesian observer keeping track of the global probabilities of sensory events. The subjective estimates of statistical regularities in the environment were thought to depend crucially on the integration of sensory data over long periods of time. However, the adequacy of this Bayesian observer model is limited, because it cannot account appropriately for the well-documented effects of the recent stimulus sequence on P300 amplitudes (e.g., Squires et al., 1976; Leuthold and Sommer, 1993).
Here we tested these two state-of-the-art models against a newly developed computational model of trial-by-trial P300 amplitude fluctuations by Bayesian model selection (Kass and Raftery, 1995; Raftery, 1995). The new model assumes three additive digital filtering processes, thereby integrating aspects of both state-of-the-art models. Specifically, subjective estimates of statistical regularities in sensory data are maintained with both short-term and long-term decay time constants. Further, it implements an alternation term (as in Squires et al., 1976) as well as uniform initial prior probabilities (as in Mars et al., 2008). Our findings show that this new approach provides a superior account of parietally distributed trial-by-trial P300 amplitudes compared to these two state-of-the-art models.

PARTICIPANTS, EXPERIMENTAL DESIGN, AND DATA ACQUISITION
Sixteen healthy participants [fourteen women; mean age: 20 years; age range: 18-23 years; mean handedness (Oldfield, 1971): 74; handedness range: −76 to 100], all with normal or corrected-to-normal visual acuity, participated in the experiment. All were recruited from introductory courses at the Department of Psychology at the Technische Universität Braunschweig in return for course credit. Experimental procedures were approved by the local ethics committee and were in accordance with the Declaration of Helsinki.
Participants performed a simple two-choice response time [RT] task without feedback about response accuracy in which all stimuli had equal behavioral relevance. This feature of the experimental design constitutes an important difference between this and the classical oddball paradigm (Ritter and Vaughan, 1969) in which participants usually discriminate between task-relevant (target) and irrelevant (standard) stimuli.
The experiment was realized using the Presentation® software (Neurobehavioral Systems, Albany, CA, USA). Visual stimuli were presented one at a time for 100 ms each, with a stimulus presentation rate of f_s = 2/3 Hz, i.e., one stimulus per 1.5 s. Stimuli were displayed at the center of a 19″ CRT monitor (FlexScan T766; Eizo, Hakusan, Ishikawa, Japan) with a refresh rate of 100 Hz at a resolution of 1280 × 1024 pixels against a light gray background. Viewing distance amounted to 1.25 m. Two types of visual stimuli were presented: the stimulus event was either a red or a blue rectangle, each of which subtended approximately 2.75° × 2.25°.
Participants were required to respond to each stimulus with the previously associated button as quickly as possible, but not at the expense of accuracy. They used the index fingers of both hands (e.g., left button in response to the red rectangle, right button in response to the blue rectangle). Stimulus-response mapping (i.e., [red-left, blue-right] or [red-right, blue-left], respectively) was counterbalanced over participants.
Participants performed twelve blocks of N = 192 trials of the two-choice RT task. The probability of the occurrence of each stimulus event was manipulated between blocks such that the relative probabilities of events were either 0.5 for each event, across six consecutive blocks (1152 trials overall), or [0.3, 0.7], across the remaining six consecutive blocks (1152 trials overall). Stimulus-probability mapping was counterbalanced over participants (i.e., a stimulus color identified the rare (0.3) stimulus in fifty percent of the participants but the frequent (0.7) stimulus in the remaining participants).
The order of the probability manipulation was counterbalanced over participants (probability category [0.5, 0.5] prior to [0.3, 0.7] or vice versa), who were not informed about these probabilities. Participants were informed that the two different stimuli were randomly distributed across blocks. Between the blocks a break was scheduled; participants were free to initiate the subsequent block at their own pace.
For each participant, the actual stimulus sequence of each probability category [0.5, 0.5], and [0.3, 0.7], respectively, was randomized only once in order to enhance the reliability of the sequential trial-by-trial P300 estimates (see below). Thus, each participant received solely one truly random arrangement of trials in each probability category. This arrangement was repeatedly presented across all six blocks of each probability category, unbeknownst to participants. In consequence, sequential P300 estimates could be averaged over the six sequence repetitions per probability category, thereby improving the notoriously low signal-to-noise ratio of single-trial EEG data. Task-related brain activity of a single trial is much more obscured by task-unrelated brain activity than is task-related activity averaged across trials (Blankertz et al., 2002).
Participants were informed about the problem of non-cerebral artifacts, and they were encouraged to reduce the occurrence of movement artifacts (Picton et al., 2000). Ocular artifacts were monitored by means of bipolar pairs of electrodes positioned at the sub and supraorbital ridges (vertical electrooculogram, vEOG) and at the external ocular canthi (horizontal electrooculogram, hEOG). The EEG and EOG channels were subject to a bandpass of 0.01-30 Hz and digitized at 250 Hz sampling rate.
Off-line analysis of the EEG data was performed by means of the BrainVision Analyzer® Version 2.0.1 software (Brain Products, Gilching, Germany). Careful manual artifact rejection was performed before averaging to discard trials during which eye movements, or any other non-cerebral artifact except blinks, had occurred. Deflections in the averaged EOG waveforms were small, indicating that fixation was well maintained in those trials that survived the manual artifact rejection process. Semi-automatic blink detection and the application of an established method for blink artifact removal were employed for blink correction (Gratton et al., 1983). A digital high-pass filter was applied to the data (0.75 Hz cutoff frequency, 48 dB/oct) in order to eliminate low-frequency variations in the EEG signal which were associated with the occasional occurrence of electro-dermal artifacts.
The EEG was then divided into epochs of 1000 ms duration, starting 100 ms before stimulus onset. Epochs were corrected using the interval [−100, 0 ms] before stimulus presentation as the baseline. As a start, event-related potential (ERP) waveforms were created (Luck, 2005). ERP waveforms were calculated as trial averages for each participant and for each event probability [i.e., 0.5, 0.3, 0.7], with the exception that those trials in which the participant selected the wrong behavioral response were excluded from averaging. Thereafter, trial-by-trial P300s were estimated from the EEG data at electrode Pz, where this ERP component is traditionally reported to be maximal (Duncan-Johnson and Donchin, 1977). To estimate trial-by-trial P300 amplitudes, for each participant, the time point at which the averaged P300 waveforms at Pz were modulated maximally by relative stimulus frequency in the [0.3, 0.7] probability category was determined (M = 344 ms, SD = 48 ms; range 280-464 ms). Identifying the P300 in single trials is a notoriously difficult problem, due to the low signal-to-noise ratio of single-trial EEG data (Blankertz et al., 2002). In our study, for each event probability, trial-by-trial P300 estimates were extracted over a temporal window of ±60 ms around the individual time point of maximal modulation (Barceló et al., 2008), thereby completely ignoring latency variability across single trials (Luck, 2005). Despite this drawback, the method was nevertheless chosen in order (1) to keep the testing environment as similar as possible to the procedures employed by Mars et al. (2008), and (2) to improve the reliability of trial-by-trial amplitude measures, in comparison to peak detection measures, akin to previous studies (Debener et al., 2005).
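The windowed-amplitude measure just described can be sketched in a few lines. The function below is our own illustration (function and argument names are assumptions, not code from the study): it takes a single-trial epoch sampled at 250 Hz that starts 100 ms before stimulus onset and averages the samples within ±60 ms of the individually determined latency.

```python
import numpy as np

def p300_estimate(epoch, peak_ms, fs=250.0, epoch_start_ms=-100.0,
                  half_window_ms=60.0):
    """Mean amplitude (µV) over a +/- half_window_ms window around an
    individually determined peak latency peak_ms (a sketch; names and
    defaults are our own assumptions)."""
    # convert window boundaries from milliseconds to sample indices
    lo = int(round((peak_ms - half_window_ms - epoch_start_ms) * fs / 1000.0))
    hi = int(round((peak_ms + half_window_ms - epoch_start_ms) * fs / 1000.0))
    return float(np.mean(epoch[lo:hi + 1]))
```

For the mean individual latency of 344 ms, the window covers roughly samples 96-126 of a 250-sample epoch.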

CONVENTIONAL DATA ANALYSIS
Trial-by-trial P300 estimates, RTs, and error rates were averaged according to the three event probabilities [i.e., 0.3, 0.5, 0.7]. In the [0.5, 0.5] probability category, trial-by-trial P300 estimates were additionally averaged according to eight third-order stimulus sequences (denoted as aaaa, baaa, abaa, aaba, bbaa, abba, baba, bbba), four second-order stimulus sequences (aaa, baa, aba, bba), and two first-order sequences (aa, ba), with up to four consecutive trials (xxxx) = (trial n − 3, trial n − 2, trial n − 1, trial n = eliciting event). Please note that the symbol a simply denotes one of the two possible stimulus events while symbol b signifies the other one in this notation. For example, if a signifies the red rectangle, then b signifies the blue rectangle (and vice versa). In the [0.5, 0.5] probability category, sequential analysis could be collapsed across the two possible stimulus events since both stimuli were equally probable and task-relevant. The same kind of sequential analysis was performed in the [0.3, 0.7] probability category. However, in this experimental condition, a consistently denoted the rare stimulus in the [0.3, 0.7] probability category, whereas b signified the frequent stimulus in the [0.3, 0.7] probability category. We refrained from analyzing the eight third-order stimulus sequences in the [0.3, 0.7] probability category in order to obtain a sufficient number of trials entering the sequential P300 estimates. Thus, trialby-trial P300 estimates were solely averaged according to eight second-order stimulus sequences (aaa, baa, aba, bba, bbb, abb, bab, aab) and four first-order sequences (aa, ba, bb, ab) in the [0.3, 0.7] probability category, separately for rare and frequent stimuli.
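For concreteness, the a/b relabeling used in the [0.5, 0.5] probability category can be sketched as follows (a hypothetical helper of our own, not code from the study): the eliciting event on trial n is always written as a, and each preceding trial is written as a or b according to whether it matches the eliciting event, so the rightmost symbol of the label corresponds to trial n.

```python
def sequence_label(stimuli, n, order=3):
    """Label trial n (0-based index into `stimuli`) by its preceding
    `order` trials in the a/b notation of the text, collapsing over the
    two stimulus identities (a sketch; the helper name is our own)."""
    current = stimuli[n]  # eliciting event, written as 'a'
    window = stimuli[n - order:n + 1]
    return ''.join('a' if s == current else 'b' for s in window)
```

For example, a red rectangle following three blue rectangles yields the label bbba.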
Individual medians of trial-by-trial P300 estimates, RTs, and error rates over the three event probabilities [i.e., 0.3, 0.5, 0.7], as well as the sequential P300 estimates, generated as described above, were submitted to repeated measures analyses of variance (ANOVAs), using the Greenhouse-Geisser correction. The results of the univariate tests are provided in a format which gives the uncorrected degrees of freedom together with the Greenhouse-Geisser ε, in order to compensate for violations of sphericity, i.e., of equal covariance among all pairs of levels of the repeated measures (Picton et al., 2000). A measure of effect size, η²_p (partial eta squared), is also provided.

STATE OF THE ART MODELS
Let us call P_k(n) the estimated subjective probability (henceforth simply called subjective probability) that event k ∈ {1, . . ., K} will be observed on trial n ∈ {1, . . ., N}, given a sequence s_1^{n−1} = (s(1), s(2), . . ., s(n − 1)) of n − 1 former stimulus observations:

P_k(n) = P(s(n) = k | s_1^{n−1}).   (1)

While n is the discrete time index of the consecutive trials, the value N denotes the total number of trials in a block within an experimental probability category for one subject. Note that in (1) stimulus s(n) has not yet been observed; therefore, a subjective probability distribution P_k(n) for all possible stimuli k ∈ {1, . . ., K} on trial n is of interest. However, once the stimulus k = s(n) on trial n has been observed (which is only a single value k out of the set {1, . . ., K}), the respective subjective probability P_k(n) can be used to calculate the degree of surprise (Shannon and Weaver, 1948; Strange et al., 2005)

I(n) = −log_2 P_{k=s(n)}(n).   (2)

Following Mars et al. (2008), we assume the trial-by-trial P300 estimate Y(n) [µV] to be proportional to the surprise I(n) [bit]:

Y(n) ∝ I(n).   (3)

Note that Squires et al. (1976) assumed direct proportionality between the so-called expectancy E_k(n) and the trial-by-trial P300 estimate:

Y(n) ∝ E_k(n).   (4)

In the following we briefly recapitulate these two well-known state-of-the-art approaches to compute the subjective probability P_k(n) or expectancy E_k(n), which play the role of a dynamically updated prior probability for learning statistical parameters of the stimulus sequence.

Approach by Mars et al. (MAR)
Mars et al. (2008) proposed a Bayesian observer model (henceforth called MAR) without forgetting according to

P_k(n) = (c̃_{L,k}(n) + 1) / (n − 1 + K),   (5)

where the count function

c̃_{L,k}(n) = Σ_{ν=1}^{n−1} d_k(ν)   (6)

counts the number of occurrences of event k until trial n − 1. The time sequence d_k(ν) holds d_k(ν) = 1 if s(ν) = k, ν = 1, 2, . . ., and d_k(ν) = 0 otherwise. Note that Σ_{k=1}^{K} c̃_{L,k}(n) = n − 1. As can easily be seen in (5), the subjective probability for event k on trial n = 1 equals a uniform initial prior P_k(1) = 1/K. After many trials (n ≫ K, and c̃_{L,k}(n) ≫ 1), the subjective probability approximates P_k(n) ≈ c̃_{L,k}(n)/(n − 1), i.e., the relative frequency of event k until trial n − 1. Note that the index "L" of the count function c̃_{L,k}(n) expresses the long-term memory character of Mars' model.
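A minimal sketch of this observer in Python (our own code with hypothetical names, not from the study): the count of (6) is accumulated online, the prior-smoothed probability of (5) is evaluated before each observation, and the surprise follows (2).

```python
import math

def mar_surprise(stimuli, K=2):
    """Bayesian observer of Mars et al. (2008) without forgetting, as we
    read (5)-(6): P_k(n) = (count of k before trial n + 1) / (n - 1 + K),
    and surprise I(n) = -log2 P_{k=s(n)}(n). A sketch; names are ours."""
    counts = {}
    surprise = []
    for n, s in enumerate(stimuli, start=1):
        # subjective probability of the event actually observed on trial n;
        # on trial n = 1 this is the uniform prior 1/K
        p = (counts.get(s, 0) + 1) / (n - 1 + K)
        surprise.append(-math.log2(p))
        counts[s] = counts.get(s, 0) + 1
    return surprise
```

With K = 2 the first trial always yields 1 bit of surprise; thereafter rare events yield more surprise than frequent ones.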

Approach by Squires et al. (SQU)
Unlike Mars et al. (2008), Squires et al. (1976) did not formulate a strict computational model to compute the subjective probability P_k(n). Moreover, having investigated solely the K = 2 case, they used the notion of an expectancy¹ E_k(n) for stimulus k on trial n. While Squires et al. (1976) described their model, henceforth called SQU, partly in math and partly in words, in the following we present a complete analytical formulation of their approach, which is straightforward to implement in software. Their empirical formulation of the expectancy that event k ∈ {1, 2} will be observed on trial n ∈ {1, . . ., N} is given as

E_k(n) = 0.505 · P_k + 0.235 · c̃_{S,k}(n) + 0.033 · c̃_{A,k}(n) − 0.027   (7)

with three expectancy contributions, namely the assumed-to-be-known global probability P_k, a count function for the short-term memory "S", and a count function for the alternation expectancy "A" (and an additive constant).

Short-term memory.
The short-term count function is defined as

c̃_{S,k}(n) = Σ_{ν=n−N_depth}^{n−1} γ_S^{n−ν} · d_k(ν),   (8)

which is different to (6), since only a limited memory span of N_depth = 5 is covered, and an exponential forgetting factor γ_S = e^{−1/β_S} with 0 ≤ γ_S ≤ 1 and time constant 0 ≤ β_S < ∞ is introduced, with γ_S = 0.6 for all probability categories (i.e., β_S = 1.96). Note that the count function (8) depends only on stimulus observations in the recent past. The term P_k in (7) denotes the true global probability of the stimulus being k. It is nothing else but the relative frequency of the stimulus in the current experimental probability category, which must be made known to this model.
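Under our reading of (8), with γ_S = 0.6 and a memory span of five trials, the short-term count can be sketched as follows (our own code; the exact weighting convention is our reconstruction, not a verbatim transcription of the original model):

```python
def squ_short_term(d, n, gamma_s=0.6, depth=5):
    """Exponentially decaying count over the last `depth` trials before n:
    sum of gamma_s**(n - nu) * d_k(nu) for nu = n-depth .. n-1
    (nu is 1-based, `d` is stored 0-based). A sketch; names are ours."""
    lo = max(1, n - depth)
    return sum(gamma_s ** (n - nu) * d[nu - 1] for nu in range(lo, n))
```

The most recent trial thus contributes with weight 0.6, the trial before with 0.36, and so on, so observations more than five trials back are ignored entirely.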

Alternation expectancy.
In contrast, the term c̃_{A,k}(n) ∈ {−3, −2, 0, 2, 3} denotes the expectancy w.r.t. alternating stimuli, and how this expectancy is met by the present stimulus s(n). The sign of c̃_{A,k}(n) is negative if the stimulus s(n) violates the alternation expectation [i.e., s(n) and s(n − 1) are identical] and positive if the alternation expectation is met [i.e., s(n) and s(n − 1) differ from each other]. The amplitude of c̃_{A,k}(n) depends on the number of previous alternations in a row. The formulas for calculating c̃_{A,k}(n) are detailed in the Appendix.

¹ Squires et al. (1976) denote expectancy also as subjective probability.
FIGURE 1 | Block diagram of the new digital filter (DIF) model with input g_k(n) in (10) and output P_k(n) in (9), digital filter transfer functions H(f) with 0 ≤ f ≤ f_s/2, a stimulus presentation rate of f_s (= 2/3 Hz), and probability-normalizing constant 1/C.

Explanatory notes
In summary, (7) provides a model for expectancy that is linearly composed of three contributions: firstly, the relative frequency P_k of event k, which equals the correct global probability throughout probability categories. Note that, in contrast to Mars et al. (2008), the relative frequency P_k is not learned sequentially by experience but assumed to be known by participants. Secondly, a purely predictive, limited-length (N_depth = 5), exponentially decaying short-term memory [cf. count function c̃_{S,k}(n) in (8)]. Thirdly, an expectancy contribution in the range −3 ≤ c̃_{A,k}(n) ≤ +3, depending on the extent to which a first-order alternation (aa or ba) expectation has been built up and then met/violated within the latest observed N_depth = 5 trials.

PROPOSED DIGITAL FILTER MODEL
In this section we present our newly proposed model, inspired by both Mars et al. (2008) and Squires et al. (1976). Our aim is to unify the learned relative frequency estimation of Mars et al. (2008; long-term) with the exponentially decaying short-term memory and alternation expectation capabilities of Squires et al. (1976), and to express the result in terms of a simple new digital filter (DIF) model. Besides an additive probability-normalizing constant 1/C (see Appendix for details), it consists of three additive contributions to subjective probability: a long-term contribution ("L"), a short-term one ("S"), and one term capturing alternations ("A"), as depicted in Figure 1:

P_k(n) = α_L · c_L,k(n) + α_S · c_S,k(n) + α_A · c_A,k(n) + 1/C.   (9)

There are three different count functions used, each represented by a digital filter transfer function H(f) applied to the common input signal g_k(n), which is given as

g_k(ν) = 1/K for ν ≤ 0, and g_k(ν) = d_k(ν) for ν ≥ 1,   (10)

FIGURE 2 | Block diagram of a first-order infinite impulse response (IIR) filter with transfer function H_S(f) equivalent to (11) or (12).
Elements T denote a delay of one trial. At the adder element ⊕, updating of the weighted output γ_S · c_S,k(n − 1) of the last trial n − 1 with the weighted input (1 − γ_S) · g_k(n − 1) of the last trial results in the current output c_S,k(n). Note that via γ_S · c_S,k(n − 1), all preceding inputs (and outputs) influence the current output, though for the short-term memory, the influence of trials not in the recent past is negligible.
which implicitly contains an initial prior of 1/K at the start of a block of trials², and a "1" wherever a past stimulus s(ν) equals the current stimulus s(n) = k, otherwise a "0." Note that, in contrast to the sequence d_k(ν) as used by Mars et al. (2008) in (6) and Squires et al. (1976) in (8), the input signal g_k(ν) thus also encodes the uniform initial prior. The digital filter model yields an output signal P_k(n) as given in (9). The weighting parameters α_L, α_S, α_A hold α_L + α_S + α_A = 1 and 0 ≤ α_i ≤ 1, i ∈ {L, S, A}.

Short-term memory
The block diagram of the infinite impulse response (IIR) digital filter is shown in Figure 2. The respective short-term memory count function can be expressed as

c_S,k(n) = (1/C_S) Σ_{ν=−∞}^{n−1} γ_S^{n−ν} · g_k(ν),   (11)

with normalizing constant C_S and an exponential forgetting factor γ_S as in (8). The transfer function of the short-term digital filtering process as described by (11) is depicted in Figure 1 as H_S(f), and is plotted in Figure 4 as dashed curve, revealing a smooth (i.e., weak) low-pass characteristic. Note that the short-term memory count function (11) can be expressed in a mathematically equivalent recursive form according to (see the Appendix)

c_S,k(n) = γ_S · c_S,k(n − 1) + (1 − γ_S) · g_k(n − 1),   (12)

initialized with the uniform initial prior c_S,k(0) = 1/K, which in (11) was contained in the values g_k(ν) = 1/K for ν ≤ 0 of (10). The recursive character of (12) becomes apparent by substituting the right-hand side of (12) into itself for calculating c_S,k(n − 1). Figure 2 further illustrates the updating process inherent in (12). The input signal defined in (10) and the weights (1 − γ_S) and γ_S guarantee 0 ≤ c_S,k(n) ≤ 1. The equivalence of (11) and (12) and the derivation of C_S = γ_S/(1 − γ_S) are shown in the Appendix.

FIGURE 3 | Block diagram of a first-order infinite impulse response (IIR) filter with transfer function H_L(f,n) equivalent to (13) or (14). Elements T denote a delay of one trial. At the adder element ⊕, updating of the weighted output γ_{L,n−1} · c_L,k(n − 1) of the last trial n − 1 with the weighted input (1 − γ_{L,n−1}) · g_k(n − 1) of the last trial results in the current output c_L,k(n). Note that via γ_{L,n−1} · c_L,k(n − 1), all preceding inputs (and outputs) influence the current output.
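The recursion (12) is a one-line update per trial. The sketch below (our own code, not from the study) applies it to an input sequence g_k(1), g_k(2), . . . after prepending the prior value g_k(0) = 1/K from (10), and returns c_S,k(1), c_S,k(2), . . .:

```python
def dif_short_term(g_obs, K=2, gamma_s=0.6):
    """First-order IIR low-pass of the DIF model, the recursion (12):
    c_S,k(n) = gamma_s * c_S,k(n-1) + (1 - gamma_s) * g_k(n-1),
    initialized with c_S,k(0) = 1/K (a sketch; names are ours)."""
    c = 1.0 / K                        # c_S,k(0), uniform prior
    inputs = [1.0 / K] + list(g_obs)   # prepend g_k(0) = 1/K, cf. (10)
    out = []
    for g_prev in inputs:              # g_prev plays the role of g_k(n-1)
        c = gamma_s * c + (1.0 - gamma_s) * g_prev
        out.append(c)
    return out
```

Before any observation the output stays at the prior 1/K; each observed "1" pulls it up by a factor (1 − γ_S), and it decays back toward the input otherwise, which is the weak low-pass behavior shown in Figure 4.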

Long-term memory
The long-term memory count function can be expressed as

c_L,k(n) = (1/C_{L,n}) Σ_{ν=−∞}^{n−1} γ_{L,n}(ν) · g_k(ν),   (13)

with the time-dependent (i.e., dynamic) exponential forgetting factor γ_{L,n}(ν), the dynamic normalizing value C_{L,n}, and the same model-exciting signal g_k(ν) as before (10). The formulas for calculating γ_{L,n}(ν) and C_{L,n} are derived in the Appendix. The transfer function of the long-term digital filtering process as described by (13) is depicted in Figure 1 as H_L(f,n). Analogous to (12), a recursive function with the same behavior as (13) can be defined as

c_L,k(n) = γ_{L,n−1} · c_L,k(n − 1) + (1 − γ_{L,n−1}) · g_k(n − 1),   (14)

with the same initial value c_L,k(0) = 1/K. The dynamics of the forgetting factor of the long-term memory are detailed in the Appendix. Figure 3 illustrates the updating process inherent in (14). The long-term transfer function H_L(f,n) is plotted in Figure 4 as dash-dotted curve (for n = 1) and as solid curve (for n = N = 192), respectively. Inspection of Figure 4 reveals an initially moderate low-pass characteristic which becomes much sharper as the number of trials increases.
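Structurally, (14) is the same recursion as (12) with a trial-dependent forgetting factor. The sketch below is our own illustration: since the actual dynamics of γ_{L,n} are derived in the paper's Appendix, we substitute the hypothetical choice γ_L(m) = m/(m + 1), which turns the recursion into a running mean of its input (illustration only, not the authors' parameterization).

```python
def dif_long_term(g_obs, K=2):
    """Dynamic first-order IIR filter, the recursion (14):
    c_L,k(n) = gamma_L(n-1) * c_L,k(n-1) + (1 - gamma_L(n-1)) * g_k(n-1),
    with c_L,k(0) = 1/K. The forgetting factor gamma_L(m) = m/(m+1) used
    here is a HYPOTHETICAL stand-in for the Appendix formulas; it makes
    the output the running mean of the inputs seen so far."""
    c = 1.0 / K
    inputs = [1.0 / K] + list(g_obs)   # prepend g_k(0) = 1/K, cf. (10)
    out = []
    for m, g_prev in enumerate(inputs):    # m = n - 1
        gamma = m / (m + 1.0)              # hypothetical dynamic factor
        c = gamma * c + (1.0 - gamma) * g_prev
        out.append(c)
    return out
```

With this stand-in, the filter approaches the relative frequency of the event as trials accumulate, mirroring the increasingly sharp low-pass characteristic seen in Figure 4.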

Alternation expectation
Finally, our model comprises a count function capturing alternations:

c_A,k(n) = (1/C_A) Σ_{ν=n−4}^{n−1} γ_{A,n−ν} · g_k(ν),   (15)

with some normalizing constant C_A and the same model-exciting signal g_k(ν) as before (10). In contrast to the short- and long-term

FIGURE 4 | Amplitude responses of the long-term (H_L(f,n), dash-dotted and solid low-pass curves), short-term (H_S(f), dashed low-pass curve), and alternation (H_A(f), dashed high-pass curve) filters of the DIF model as a function of the input signal frequency f (logarithmic scale). The dynamic long-term filter (dash-dotted and solid curves) is shown for n = 1 and n = N = 192.
IIR filters, which both have a low-pass characteristic, this finite impulse response (FIR) filter reveals a high-pass characteristic. Its transfer function is plotted in Figure 4 as ascending dashed curve. The coefficients γ_{A,n−ν} are specified in more detail in the Appendix.

Explanatory notes
There are two important differences to Squires et al. (1976): firstly, we propose to use two terms with different time parameters β_S and β_{L,n}, one accounting for the short-term memory [as in (8)], the other accounting for a dynamically adapted long-term memory. Secondly, we allow for N_depth → ∞. Moreover, the role of negative trial indices in our model is to define the initial subjective probability distribution P_k(1) = 1/K in (9), reflecting that participants were not informed about the actual relative frequencies of events over a block of trials.
In summary, the DIF model expresses both a long-term memory contribution and a short-term memory contribution by exponential decay processes, with uniform initial subjective prior probabilities, and a contribution of alternation expectation. Though there are similarities between (7) and (9), the model of Squires et al. (1976) uses information about the experimental design (P_k) which was actually unknown to the participants. In contrast, the DIF model uses only the information contained in the stimulus sequence as observed by participants, and it always starts with a uniform initial subjective probability distribution P_k(1) = 1/K, k ∈ {1, . . ., K}, regardless of the actual relative frequencies of events over a block of trials. Finally, it should be noted that our new model yields a conceptually well-defined subjective probability, as opposed to an expectancy as in Squires et al. (1976).

Following Mars et al. (2008), we compared the DIF to the MAR and SQU models using the log-Bayes factor based on the model evidences. The evidences were approximated using the variational free energy, which consists of an accuracy and a complexity term, thus enabling the comparison and selection of competing models (Penny et al., 2004; Friston et al., 2007; Penny, 2012). We employed the same three-level hierarchical general linear model (GLM) as Mars et al. (2008). For model fitting and calculation of the model evidences we used parametric empirical Bayes (PEB) as implemented in the spm_PEB.m function of the Statistical Parametric Mapping (SPM8) software (Friston et al., 2007).

Evaluation methods
The different models (DIF, MAR, and SQU) generate the model-specific surprise I_ℓ(n) or expectancy E_{k,ℓ}(n) values as regressors, with the subscript ℓ ∈ {1, . . ., L = 16} denoting the individual participants and n ∈ {1, . . ., N = 192} being the discrete time index of the consecutive trials within one block. The first level of the GLM models the measured trial-by-trial P300 estimates Y_ℓ(n) [µV] as a linear function of the surprise I_ℓ(n) [bit] with the intercept θ_ℓ^(1) [µV], the slope ϑ_ℓ^(1) [µV/bit], and an error ε_ℓ^(1)(n):

Y_ℓ(n) = θ_ℓ^(1) + ϑ_ℓ^(1) · I_ℓ(n) + ε_ℓ^(1)(n).   (16)

Note that the fitted model-based P300 estimates then follow

Ŷ_ℓ(n) = θ_ℓ^(1) + ϑ_ℓ^(1) · I_ℓ(n).   (17)

The second level models the participant-specific parameters θ_ℓ^(1) and ϑ_ℓ^(1) as deviations from the corresponding group parameters θ^(2) [µV] and ϑ^(2) [µV/bit]:

θ_ℓ^(1) = θ^(2) + ε_{θ,ℓ}^(2),  ϑ_ℓ^(1) = ϑ^(2) + ε_{ϑ,ℓ}^(2).   (18)

The third level functions as a shrinkage prior on the group parameters. In matrix notation the GLM structure is

Y = X^(1) Θ^(1) + E^(1),   (19)

Θ^(1) = X^(2) Θ^(2) + E^(2),  Θ^(2) = X^(3) Θ^(3) + E^(3),   (20)

where Y ∈ R^{2NL × 1} is a vector concatenating the trial-by-trial P300 estimates for all participants, with [·]^T being the transpose, making Y a column vector. The participant-specific vector Y_ℓ = [Y_ℓ(n = 1), . . ., Y_ℓ(n = 2N)] ∈ R^{1 × 2N} contains the trial-by-trial P300 estimates for one participant, averaged over the six sequence repetitions, for both probability categories. The first level design matrix X^(1) ∈ R^{2NL × 2L} is block-diagonal with L partitions X_ℓ as explanatory variables. The second level design matrix X^(2) = 1_L ⊗ I_{2×2} = [I_{2×2,ℓ=1}, . . ., I_{2×2,ℓ=L}]^T ∈ R^{2L × 2} is the Kronecker product of an all-one column vector 1_L of length L and an identity matrix I_{2×2}. The third level design matrix X^(3) shall have all-zero elements. The unknown level-one parameters θ_ℓ^(1) and ϑ_ℓ^(1) are assembled in the parameter vector Θ^(1) = [θ_1^(1), ϑ_1^(1), . . ., θ_L^(1), ϑ_L^(1)]^T ∈ R^{2L × 1}. Likewise, the second level parameters θ^(2) and ϑ^(2) are assembled in the vector Θ^(2) ∈ R^{2 × 1}. All errors are assumed to be normally distributed, E^(j) ∼ N(0, Σ^(j)).
The covariance is parameterized following Σ^(j) = λ^(j) I^(j), with I^(j) as an identity matrix with the same dimension as the number of rows of the design matrix of the corresponding level, X^(j). The hyperparameters λ^(j) are the free parameters of the hierarchical linear model and are estimated using an EM algorithm for maximum likelihood estimation.
The conditional means of the first level parameters are provided in Table 4. The log-evidences or marginal log-likelihoods of the models, F = ln(p(Y | M)), with p(Y | M) being the likelihood of the data Y given the model M, were used for model comparison via the log-Bayes factor ln(BF), which is the natural logarithm of the quotient of the model likelihoods, or the difference in log-evidence (Kass and Raftery, 1995; Penny et al., 2004; Friston et al., 2007):

ln(BF) = F_DIF − F_XXX.   (21)

If the log-evidence is calculated this way, positive values reflect evidence in favor of the DIF model and negative values in favor of the XXX model (being SQU or MAR), respectively. Values larger than five are considered "very strong" evidence (Kass and Raftery, 1995; Penny et al., 2004). To summarize, in our evaluation method we closely followed Mars et al. (2008).
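The comparison rule of (21) amounts to a difference of log-evidences plus a fixed decision threshold; a minimal sketch (our own helper names, with illustrative numbers rather than the study's actual evidences):

```python
def log_bayes_factor(F_dif, F_other):
    """Log-Bayes factor as in (21): the difference in log-evidence.
    Positive values favor the DIF model (a sketch; names are ours)."""
    return F_dif - F_other

def interpret(lnBF, threshold=5.0):
    """Kass and Raftery (1995) style reading of the log-Bayes factor."""
    if lnBF > threshold:
        return "very strong evidence for DIF"
    if lnBF < -threshold:
        return "very strong evidence against DIF"
    return "weaker or inconclusive evidence"
```

For instance, log-evidences of −100 for DIF and −110 for a competitor give ln(BF) = 10, i.e., very strong evidence in favor of DIF.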

DIF model parameter identification
The values for the free model parameters, namely α_L in (9), τ_1 and τ_2 [both in (A12)], β_S [for (11)], α_A in (9), and γ_{A,2} in (15), have to be trained on the measured data, i.e., on the trial-by-trial P300 estimates. The model-based P300 estimates are calculated with the same model parameters for all participants and then used for maximization of the DIF model evidence, which is our optimization criterion, as described in Section 2.4.5.
The calculation of the model evidence for the whole range of possible combinations of the parameters with a reasonable resolution is computationally too expensive. For this reason, only subsets of parameter combinations were optimized simultaneously, with a resolution of 100 values per parameter, which results in 30,000 possible parameter combinations for one iteration. In the first iteration, while optimizing one set of parameters, the not yet optimized parameters were fixed to the centers of their respective intervals. In the following iteration, parameters not currently optimized were fixed to the optimal values from the last iteration. Note that α_S is not a free parameter due to the restriction α_S = 1 − α_L − α_A and is thus not optimized independently.
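The blockwise search just described can be sketched as follows (our own illustration: the function and parameter names, the subset split, and the toy evidence function in the usage example are assumptions, not the study's code or parameter grids):

```python
import itertools

def coordinate_grid_search(evidence, grids, n_iters=2):
    """Blockwise grid search: parameters are split into subsets that are
    optimized jointly on their grids while the remaining parameters are
    held fixed (grid centers in iteration 1, the previous optimum
    afterwards). Finds a local optimum only (a sketch; names are ours).
    `evidence` maps a full parameter dict to a scalar model evidence;
    `grids` maps parameter names to candidate value lists."""
    names = list(grids)
    # iteration 1 starting point: centers of the parameter intervals
    current = {p: grids[p][len(grids[p]) // 2] for p in names}
    half = max(1, len(names) // 2)
    subsets = [names[:half], names[half:]]
    for _ in range(n_iters):
        for subset in subsets:
            if not subset:
                continue
            best = None
            for combo in itertools.product(*(grids[p] for p in subset)):
                trial = dict(current)
                trial.update(zip(subset, combo))
                score = evidence(trial)
                if best is None or score > best[0]:
                    best = (score, trial)
            current = best[1]
    return current
```

With a smooth, separable evidence surface this converges quickly; in general, as noted below, only a locally optimal combination is guaranteed.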
Across these two iterations a total of 60,000 parameter combinations were evaluated, and the set with the highest evidence was considered optimal. Note that only a locally optimal parameter combination can be found using this procedure, as many iterations may be necessary for convergence toward the global optimum, if it can be found at all. Table 1 gives an overview of the searched parameter space.

Error rates similarly showed a clear dependence on stimulus probability (0.3: M = 9.6%, SE = 1.6%; 0.5: M = 4.7%, SE = 0.9%; 0.7: M = 2.1%, SE = 0.4%). The enhanced error proneness in response to less probable stimuli was confirmed by an ANOVA on arcsine-transformed error rates as a function of probability, F(2,30) = 25.19, p < 0.001, η²p = 0.63, ε = 0.65. Polynomial contrasts revealed a linear trend, F(1,15) = 28.94, p < 0.001, η²p = 0.66, as well as a quadratic trend, F(1,15) = 4.86, p < 0.05, η²p = 0.25.

Figure 5 depicts grand-average ERP waveforms (upper panels) and topographic maps (lower panels). Left panels illustrate ERP waveforms at Pz that were obtained in the [0.3, 0.7] probability category. Right panels show third-order sequence effects on ERP waveforms at Pz that were obtained in the [0.5, 0.5] probability category. Note that sequences of four successive stimuli are illustrated in temporal order (trial n − 3, trial n − 2, trial n − 1, trial n = eliciting event); a signifies a particular stimulus, b the other one. For example, aaaa describes stimulus a being repeated across four consecutive trials (shown as a green dashed curve), whereas bbba represents the presentation of stimulus a after stimulus b has been repeated across the three immediately preceding trials (shown as a black dashed curve). Notably, the sequence effects reveal a reversal of P300 amplitudes: for the single-b sequence abaa, the P300 waveform lies amongst those from dual-b (bb) sequences, whereas for the dual-b sequence baba, the P300 waveform appears indistinguishable from the waveforms from single-b sequences. As further detailed in the Discussion, this amplitude reversal is attributed to the disconfirmation of alternation expectation in the abaa sequence, as well as to the confirmation of alternation expectation in the baba sequence. Panel (D) of Figure 5 shows the scalp topography of the bbba − aaaa difference wave in the [0.5, 0.5] probability category at various points in time (292–412 ms, divided into five time windows of 24 ms each).

MODEL-BASED TRIAL-BY-TRIAL ANALYSIS
The maximum log-Bayes factors in favor of the DIF model over the MAR and SQU models are shown in Table 2. In both cases, this is considered very strong evidence in favor of the DIF model (Kass and Raftery, 1995; Penny et al., 2004). Table 3 shows the free parameters of the DIF model which were used for calculating the log-Bayes factors in Table 2. Figure 6 illustrates how the log-Bayes factors vary as a function of the model parameters, with the tip of the "V" marking the parameter combination with the highest evidence. It is important to note the relatively flat tops of the contours, implying good generalization capability of the DIF model. Due to computational complexity, only two parameters were optimized simultaneously, as described in Section 2.4.6. The relatively high value of α_L = 0.83 shows that the subjective probability mainly follows the long-term memory. With the identified values for τ_1 and τ_2 we get β_L,1 = 40.3 and β_L,192 = 11,787, which is further illustrated in Figure A2 in the Appendix. With a short-term memory time constant of β_S = 1.82 and a weight of α_S = 0.12, the influence of recent events on the subjective probability is captured. While the weight of the filter modeling alternation expectancy, α_A = 0.05, appears to be small, Figure 6C clearly shows the importance of this contribution.

Figure 7 shows timeline plots for one exemplary participant. Figures 7A,B show plots of the expectancy E_k=1(n) and the subjective probability P_k=1(n) of seeing stimulus k = 1 on trial n for the [0.5, 0.5] and [0.3, 0.7] probability categories for all competing models. Figures 7C,D show plots of the expectancy E_k=s(n)(n) and the subjective probability P_k=s(n)(n) of seeing the actually occurring stimulus k = s(n) on trial n for both probability categories and all models. The transition from Figures 7A,B to Figures 7C,D illustrates that the subjective probability is traced as a distribution over all possible events k ∈ {1, . . ., K} simultaneously across all trials, and that at the moment of seeing a new stimulus k = s(n), only the corresponding subjective probability of that event k is relevant for the surprise I(n) and consequently for the model-based P300 estimate Ŷ(n). Figures 7E,F show plots of both the measured and the model-based trial-by-trial P300 estimates, Y(n) and Ŷ(n), for both probability categories, where the latter is calculated according to (17). It is visible that the DIF model estimates are smoother over trials than those of the SQU model, but not as static as the estimates of the MAR model, which loses its initial dynamics and becomes almost binary with increasing trial number n. This effect is especially prominent in Figures 7E,F: runs of identical stimuli across consecutive trials n elicit a decline in the measured trial-by-trial P300 estimate. For small n all models show this behavior, but with increasing n the MAR model yields nearly constant estimates. Furthermore, the MAR model does not account appropriately for the well-documented sequence effects (Squires et al., 1976).

Figure 8 shows the tree diagrams of the measured (Y(n)) and model-based (Ŷ(n)) P300 estimates as a function of the preceding stimulus sequences for the different probability categories. The DIF and SQU models are both capable of estimating the envelope and general tree structure quite well, but the SQU model fans out too much for higher-order effects for the frequent stimulus in the [0.3, 0.7] probability category and in general in the [0.5, 0.5] probability category, while for the MAR model higher-order effects are nearly non-existent.
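The mapping from the tracked distribution to the scalar surprise of the observed stimulus can be made concrete in a short sketch. The logarithm base is an assumption here; any base merely rescales I(n):

```python
import math

def predictive_surprise(p_subjective, stimulus):
    """I(n): surprise of the event that actually occurred on trial n.

    `p_subjective` holds the subjective probabilities P_k(n) for all events k;
    only the entry of the observed stimulus s(n) enters I(n).
    Log base 2 is an assumption; the base only rescales the estimates.
    """
    return -math.log2(p_subjective[stimulus])
```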
As an additional measure of the models' goodness-of-fit, Table 4 shows the mean squared error (MSE) and the fraction of variance explained (FVE) of the fitted model predictions Ŷ(n). Although the differences appear somewhat smaller on these measures, the superiority of the DIF model is still evident, supporting the log-Bayes factors presented in Table 2.
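The two goodness-of-fit measures follow standard definitions; the sketch below assumes the biased (divide-by-n) variance, since the paper's exact normalization is not spelled out here:

```python
def mse(measured, predicted):
    """Mean squared error of the model predictions."""
    return sum((y - p) ** 2 for y, p in zip(measured, predicted)) / len(measured)

def fve(measured, predicted):
    """Fraction of variance explained: 1 - MSE / variance of the measurements."""
    mean_y = sum(measured) / len(measured)
    var_y = sum((y - mean_y) ** 2 for y in measured) / len(measured)
    return 1.0 - mse(measured, predicted) / var_y
```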

DISCUSSION
We tested three computational models of trial-by-trial P300 amplitudes using Bayesian model selection (Kass and Raftery, 1995; Raftery, 1995). Trial-by-trial P300 amplitude estimates at Pz were obtained in a two-choice RT task. Behavioral data indicated that, on average, participants responded more slowly and committed more errors when they responded to rarely occurring stimuli, consistent with many earlier reports (Miller, 1998). P300 amplitudes showed the expected relationship with stimulus probability. Further, they were influenced by the immediately preceding stimulus sequence, a finding which is also consistent with earlier reports. Thus, our data replicate two of the most ubiquitous P300 findings, namely probability and sequence effects on P300 amplitudes.

The DIF model (9) possesses important advantages over previous models of P300 amplitude fluctuations. It relies entirely on mathematical notation and definitions, unlike the notions of expectancy (Squires et al., 1976), global vs. local probability (Squires et al., 1976), temporal probability (Gonsalvez and Polich, 2002), or context updating (Donchin and Coles, 1988). It is a formal model, akin to the MAR model (Mars et al., 2008), but it offers a more parsimonious explanation of trial-by-trial P300 fluctuations. The competitive advantage of the DIF model over the MAR model stems, in large part, from the non-negligible contribution of the short-term traces to subjective estimates of event probabilities, as evidenced by the poor sensitivity of the MAR model to the sequential effects on P300 amplitudes (Figures 7 and 8).
The SQU model of P300 amplitude fluctuations (Squires et al., 1976) can be considered a precursor of our DIF model insofar as it comprised memory for event frequencies within the prior stimulus sequence (equivalent to the short-term trace) and event probabilities (loosely related to the long-term trace). Yet the long-term contribution to the DIF model does not incorporate the global event probabilities, P_k, themselves, but rather subjective estimates, c_L,k(n), of these probabilities, these being based on counting observed events in progressively larger samples. Thus, subjective probability estimates are constantly revised while evidence accumulates, and the DIF model is a pure model of subjective statistical parameters, rather than a mixture of subjective and objective parameters such as the SQU model.

Given the hereby documented superiority of the DIF model over its competitors, we will briefly consider some of its cornerstones. To begin with, it is important to view the DIF model in the context of the processing of event frequencies (Sedlmeier and Betsch, 2002). In particular, the reliable encoding of the frequency with which events occur (Underwood, 1969; Hintzman, 1976) led to the claim that event frequency is automatically encoded in memory, placing only minimal demands on attentional resources (Hasher and Zacks, 1984; Zacks and Hasher, 2002). A variety of representation modes for memory for event frequency can be envisaged. According to multiple-trace views (Hintzman, 1976), a record of individual events is stored such that each attended occurrence of an event results in an independent memory trace. In contrast, according to strength views, each attended event occurrence produces an increment in the strength of a single memory trace or a frequency counter (Alba et al., 1980), supporting the event-frequency-counter assumption which is inherent in the DIF model. Both the short-term, c_S,k(n), and the long-term, c_L,k(n), memory traces are frequency counters [see (11) and (13)].
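One plausible reading of the recursions (12) and (14), using the exponential forgetting factor γ = e^(−1/β) that appears in the Appendix, can be sketched as follows; the exact trial indexing is an assumption of this sketch:

```python
import math

def forgetting_factor(beta):
    """Forgetting factor from a time constant beta (in trials): gamma = e^(-1/beta)."""
    return math.exp(-1.0 / beta)

def update_short_term(c_prev, g_prev, gamma_s):
    """One step of the short-term counter with a constant forgetting factor [cf. (12)]."""
    return gamma_s * c_prev + (1.0 - gamma_s) * g_prev

def update_long_term(c_prev, g_prev, gamma_l_prev):
    """One step of the long-term counter with a trial-dependent factor [cf. (14)]."""
    return gamma_l_prev * c_prev + (1.0 - gamma_l_prev) * g_prev
```

Fed a constant input, either counter converges toward that input value, which is the low-pass behavior described below.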
The retention functions describing the short-term and long-term traces in the DIF model are of an exponential-decay nature (Lu et al., 2005), differing mainly with regard to their decay half-lives. The dual decay rate assumption is compatible with the fact that short-term and long-term memory functions depend on dissociable neuronal processes (Jonides et al., 2008). Further, recent functional brain imaging data suggest different distributions of cortical responses for short-term and long-term decay functions (Harrison et al., 2011). The optimal short-term time constant β_S approximated the value of two. Note that the DIF model does not allow for variations in β_S, implying that the short-term contribution to the subjective probability for the appearance of event k on trial n, P_k=s(n)(n), remains stable as n progresses. In contrast, β_L,n varies as a function of n (A12), with γ_L,n gradually approaching the value of one. β_L,n (and hence γ_L,n) are relatively low during early trials when compared to late trials within blocks of trials (Figure A2 in the Appendix). On the one hand, the long-term quality of the long-term contribution to P_k=s(n)(n) gradually increases as a function of n, as revealed by the dynamics of the long-term low-pass filter (Figure 4). In other words, the decay half-life of the long-term trace gradually increases as the observer experiences more and more trials. On the other hand, the recursive formulation of the long-term contribution in the DIF model (14) reveals that the balance between the most recently experienced stimulus [which occurred on trial n − 1, weighted by (1 − γ_L,n−1)] and the counted frequency (weighted by γ_L,n−1) is biased toward recent stimuli during early trials (when γ_L,n−1 is relatively low), but toward the counted frequency during late trials (when γ_L,n−1 is relatively high).
Thus, the DIF model postulates that the decay half-life of the long-term trace evolves dynamically with the amount of experience. The observer is modeled to rely more and more on environmental experience rather than on prior assumptions, possibly reflecting the fact that the exploitation of statistical redundancy becomes gradually more reliable with progressing time (Barlow, 2001).

Visual working memory (vWM, Baddeley, 2003) can maintain representations of only around four objects at any given moment (Cowan, 2005). The surprisingly limited vWM capacity offers a rationale for the assumption of a capacity-limited alternation term in the DIF model (15), c_A,k(n). Its finite impulse response (FIR) characteristic resembles the alternation term in the SQU model (Squires et al., 1976). This FIR high-pass filter searches for alternation patterns over short sequences of trials (such as those in abab and in baba sequences). The discovery of such a pattern leads one to expect the completion of the pattern on the upcoming trial, an expectation which is confirmed in the ababa sequence, but disconfirmed in the babaa sequence. Importantly, alternation expectation appears conditional upon the detection of alternation patterns in vWM, as revealed by the larger P300 amplitudes in response to ba sequences compared to aa sequences (Figure 8). Thus, the detection of alternation patterns in vWM entrains alternation expectation, as evidenced by our data (Figure 5B; see also Jentzsch and Sommer, 2001; Ford et al., 2010). Further, the effects of pattern completion (as in baba sequences) vs. pattern violation (as in abaa sequences) might underlie the first- by second- by third-order sequence interaction which we identified in the [0.5, 0.5] probability category.
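A minimal sketch of such an alternation-sensitive FIR high-pass filter, using the coefficient constraints given in the Appendix, is shown below; the tap ordering along the delay line and the sign convention are assumptions of this sketch:

```python
def alternation_trace(g, gamma_a2, gamma_a_max=1.0):
    """Fourth-order FIR high-pass sketch for the alternation trace c_A,k(n).

    Tap values follow the Appendix constraints (gamma_A,1 = -gamma_A,2,
    gamma_A,3 = -gamma_A,4, gamma_A,4 = gamma_A,max - gamma_A,2); their
    ordering along the delay line is an assumption of this sketch.
    """
    gamma_a4 = gamma_a_max - gamma_a2
    taps = [gamma_a2, -gamma_a2, gamma_a4, -gamma_a4]  # delays of 1..4 trials
    c_a = []
    for n in range(len(g)):
        acc = 0.0
        for i, tap in enumerate(taps):
            idx = n - 1 - i
            if idx >= 0:
                acc += tap * g[idx]
        c_a.append(acc / (2.0 * gamma_a_max))  # C_A = 2 * gamma_a_max
    return c_a
```

Because the taps sum to zero, a constant (non-alternating) input produces no output, while a sustained alternation drives the trace toward its extreme values.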
While the effects of alternation expectation are measurably present at Pz, visual inspection of our data at more anterior electrode sites suggested that these alternation expectancy effects might possess a more anterior, i.e., P3a-like, scalp topography than the proper event frequency effects, which showed the typical P3b-like scalp topography (Polich, 2007; Duncan et al., 2009). While these initial observations call for multi-channel data analyses, the present modeling work should mainly be considered a model of P3b generation, since the task procedures (all stimuli required a button press) and the ERP waveforms [i.e., their scalp distribution (cf. Figure 5) and peak latencies] favor an interpretation in terms of predominant P3b potentials.
The DIF model offers a digital filtering account of multiple memory systems in the brain (Figures 1–4). Specifically, the DIF model characterizes frequency memory as two digital first-order infinite impulse response (IIR) low-pass filters: one filter with an experience-invariant short-term exponential-decay function (Figure 2), and another filter with an experience-dependent long-term exponential-decay function, such that the low-pass characteristic becomes progressively more apparent as the amount of experience increases (Figure 3). Moreover, vWM is conceptualized as an additional fourth-order finite impulse response (FIR) high-pass filter (Figure 9). The input signal g_k(n) in (10) to all three filters is a binary representation of the stimulus sequence, with all samples prior to the first trial filled with the uniform initial prior.
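The three filter outputs are combined into a subjective probability. The exact form of (9) is not reproduced here; the sketch below assumes the convex weighting implied by the constraint α_S = 1 − α_L − α_A, with each trace already normalized to [0, 1]:

```python
def subjective_probability(c_long, c_short, c_alt, alpha_l, alpha_a):
    """Convex combination of the three memory traces (one plausible reading of (9)).

    Assumes each trace has already been normalized to [0, 1]; alpha_s
    follows from the constraint alpha_s = 1 - alpha_l - alpha_a.
    """
    alpha_s = 1.0 - alpha_l - alpha_a
    return alpha_l * c_long + alpha_s * c_short + alpha_a * c_alt
```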
Our theory of variation in trial-by-trial P300 amplitudes bears implications for the nature of cortical processing. It is in agreement with the predictive coding approach (Friston, 2002, 2005; Spratling, 2010). Viewed from this perspective, predictive surprise, and hence trial-by-trial P300 amplitude, is proportional to the residual error between top-down priors and bottom-up sensory evidence. Predictive coding theory is a successful guiding model for functional neuroimaging and electrophysiological studies of sensory cortical processing (Summerfield et al., 2006; Garrido et al., 2009; Summerfield and Egner, 2009; Egner et al., 2010; Rauss et al., 2011; Winkler and Czigler, 2011). Further, the DIF model is a Bayesian model of cortical processing (Knill and Pouget, 2004; Friston, 2005). It conceives of performance on our two-choice RT task as sequential Bayesian learning (MacKay, 2003), with initial prior knowledge being conceptualized as a uniform prior probability distribution (10), consistent with Laplace's Principle of Indifference.

Figure 9 | Block diagram of the fourth-order finite impulse response (FIR) filter H_A(f). Elements T represent a delay of one trial, the multipliers γ_A,i constitute the filter coefficients, and C_A is a normalizing constant.
It is important to note that we do not claim that the observed P300 modulations were exclusively related to predictive surprise over sensory input, since in the present task design the probabilities of sensory events were mirrored by the probabilities of motor responses, as each stimulus was mapped onto a distinct motor response. Thus, particular stimuli also called for particular motor programs, and it could be that the observed P300 modulations are related to predictive surprise over motor responses. We deliberately leave open whether the observed P300 modulations were due to surprise conveyed by the visual stimuli, or whether they were related to surprise associated with the selection of a motor response, given the visual stimulus on each trial (Barceló et al., 2008; O'Connell et al., 2012).
The DIF model specifies how predictive surprise determines trial-by-trial P300 amplitudes, seemingly representing little more than a re-iteration of Donchin's (1981) surprise hypothesis of P300 amplitude fluctuations. However, one should not confuse predictive surprise, as defined by the DIF model, with Bayesian surprise (Ostwald et al., 2012). Bayesian surprise quantifies the divergence between P_1(n), . . ., P_K(n) and P_1(n + 1), . . ., P_K(n + 1), i.e., the divergence between the probability distributions on successive trials, which can be computed using the Kullback-Leibler divergence (Baldi and Itti, 2010). Bayesian surprise thus quantifies the revision of the internal model of the world, given stimulus s(n), whereas predictive surprise I(n) in (2) refers to the unpredictability of s(n), given the internal model immediately before observing s(n). To conclude, we propose a formal computational model of predictive surprise, along with a strategy for testing the model's ability to predict trial-by-trial P300 amplitudes.
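The distinction between the two surprise notions can be sketched side by side. The direction of the Kullback-Leibler divergence and the logarithm base below are conventions assumed for this sketch, not taken from (2):

```python
import math

def predictive_surprise(p_before, observed):
    """I(n): unpredictability of s(n) under the model held just before observing it."""
    return -math.log2(p_before[observed])

def bayesian_surprise(p_before, p_after):
    """Kullback-Leibler divergence between the distributions on successive trials.

    Computed here as D(p_after || p_before); the direction of the divergence
    and the logarithm base are conventions assumed for this sketch.
    """
    return sum(p_after[k] * math.log2(p_after[k] / p_before[k])
               for k in p_before if p_after[k] > 0.0)
```

If observing s(n) leaves the distribution unchanged, Bayesian surprise is zero even though predictive surprise may be large, which is exactly the dissociation described above.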

ACKNOWLEDGMENTS
Thanks are due to Sandra Mann for her help with data collection and analysis. We also thank Herbert Hoijtink, University of Utrecht, NL, for his valuable suggestions and advice concerning Bayesian model selection.

Frontiers in Human Neuroscience
www.frontiersin.org

APPENDIX ALTERNATION EXPECTATION OF THE SQU MODEL
This Appendix gives computational details on the expectation with respect to alternating stimuli, c_A,k(n), as required in (7) and described in words in Squires et al. (1976). The expectation with respect to alternating stimuli, and how this expectancy is met by the present stimulus s(n), is determined by a sign u_k(n) and an amplitude, with N_depth = 5. The use of the minimum function ensures that an expectation for alternation requires at least N_alt = 2 consecutive previous stimulus alternations. Following Squires et al. (1976), the sign is negative (u_k(n) = −1) if stimulus s(n) violates the alternation expectation [i.e., s(n) and s(n − 1) are identical]. On the other hand, if the alternation expectation is met [i.e., s(n) and s(n − 1) differ from each other], the sign is positive: u_k(n) = +1. The number of previous alternations in a row constitutes the amplitude of the expectation, restricted to the range 1_N = {1, 2, . . ., (N_depth − 2)}.

ALTERNATION EXPECTATION OF THE DIF MODEL
This Appendix gives details on the coefficients γ_A,n−ν and the probability-normalizing constants C_A and C′_A. Figure 9 shows the block diagram of this fourth-order finite impulse response filter. In order to reduce the complexity of the model by keeping the number of independent parameters small, only the coefficient γ_A,2 is chosen freely, within the range 0 ≤ γ_A,2 ≤ γ_A,max. The effect of the multiplicative constant C_A and the additive constant C′_A (as shown in Figure 1) is merely to ensure 0 ≤ c_A,k(n) + 1/C′_A ≤ 1. Thus, γ_A,max can be set to an arbitrary value if some constraints are met: the filter coefficients are set following γ_A,1 = −γ_A,2, γ_A,3 = −γ_A,4, and γ_A,4 = γ_A,max − γ_A,2. The normalizing constants have to be set according to C_A = γ_A,max + γ_A,2 + γ_A,4 = 2γ_A,max and C′_A = (γ_A,max + γ_A,2 + γ_A,4)/γ_A,max = 2. We chose γ_A,max = 1.
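These constraints are easy to verify numerically; the sketch below simply instantiates them (the variable names are ours):

```python
def fir_constants(gamma_a2, gamma_a_max=1.0):
    """Coefficients and normalizing constants of the alternation FIR filter."""
    gamma_a1 = -gamma_a2
    gamma_a4 = gamma_a_max - gamma_a2
    gamma_a3 = -gamma_a4
    c_mult = gamma_a_max + gamma_a2 + gamma_a4  # = 2 * gamma_a_max
    c_add = c_mult / gamma_a_max                # = 2, i.e., an additive shift of 1/2
    return (gamma_a1, gamma_a2, gamma_a3, gamma_a4), c_mult, c_add
```

Note that the four coefficients sum to zero for any admissible γ_A,2, which gives the filter its high-pass (DC-blocking) character.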

ON COUNT FUNCTIONS AND DIGITAL FILTERS (DIF MODEL)
In this Appendix we present some mathematical details of the DIF model, its constants, and its relation to the count functions.
We will show that the count functions (11) for c_S,k(n) and (13) for c_L,k(n) are equivalent to their recursive formulations (12) and (14), respectively. To this end we transform the block diagram of Figure 3 and obtain the count functions from the resulting block diagrams. Figure A1A shows the block diagram of Figure 3 according to (14), with some notation omitted: a delay of one trial is represented by T, the input signal is defined as g_k(n), the output signal as c_L,k(n), and (1 − γ_L,n−1) and γ_L,n−1 are time-dependent multipliers. The multiplier (1 − γ_L,n−1) can be moved to the right, yielding the block diagram shown in Figure A1B. Figure A1C shows the block diagram after moving (1 − γ_L,n−1) even further to the right. Finally, when moving the multiplier γ_L,n−1/(1 − γ_L,n−1) to the right of the delay unit in the lower branch, the time dependency has to be accounted for, as shown in Figure A1D.

For the long-term filter we obtain an effective filter coefficient γ̃_L,n = γ_L,n (1 − γ_L,n−1)/(1 − γ_L,n). The time-varying discrete-time impulse response of the long-term filter depicted in Figure A1D is given as (A6)

h_L,ν(n) = σ(n − ν − 1) · (1/C_L,n) · Π_{υ=ν+1}^{n} γ̃_L,υ,

with n (> ν) being the currently observed trial, ν being the trial at which the stimulus initiated the impulse response, σ(·) denoting the step function (A7), and the dynamic normalizing value C_L,n = γ_L,n/(1 − γ_L,n). The input-output relation of this time-varying linear system is given by (A10) (cf. Claasen and Mecklenbräuker, 1982; Prater and Loeffler, 1992).

Due to the step function (A7) in the impulse response (A6), only stimuli at trials ν ≤ n − 1 contribute to the output at trial n, and we obtain exactly the long-term count function (13), with γ̃_L,n(ν) = Π_{υ=ν+1}^{n} γ̃_L,υ = Π_{υ=ν+1}^{n} γ_L,υ (1 − γ_L,υ−1)/(1 − γ_L,υ), and g_k(ν) as before. Note that for the short-term filter we obtain Π_{υ=ν+1}^{n} γ_S = γ_S^{n−ν} and C_S = γ_S/(1 − γ_S), which simplifies (A10) to a discrete-time convolution, which is the short-term count function (11). Thus, equivalence has been shown between the count functions (11) and (13) and their simple recursive formulations (12) and (14), respectively.

The behavior of the dynamic long-term time value β_L,n follows (A12), with normalized time constants τ_1 and τ_2 controlling the speed of the transition from reliance on prior assumptions to reliance on experience. The time value 0 ≤ β_L,n < ∞ satisfies β_L,n > β_S. The effect of this dynamic formulation of β_L,n is further illustrated in Figure A2, which shows the values of β_L,n and the corresponding forgetting factor γ_L,n = e^(−1/β_L,n).

Figure A1 | (A) Block diagram of the recursive long-term filter according to (14). (B) Equivalent filter with the multiplier (1 − γ_L,n−1) moved to the right. (C) Equivalent filter with (1 − γ_L,n−1) moved even further to the right. (D) Equivalent filter with the multiplier γ_L,n−1/(1 − γ_L,n−1) moved to the right of the delay unit in the lower branch.
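For the short-term (constant-γ) case, the equivalence between the recursive and the explicit count-function form can be checked numerically; the indexing convention below (input weighted one trial back) is one plausible reading of (11) and (12):

```python
def recursive_counter(g, gamma):
    """Counter via the recursion c(n) = gamma * c(n-1) + (1 - gamma) * g(n-1)."""
    c, out = 0.0, []
    for value in g:
        out.append(c)
        c = gamma * c + (1.0 - gamma) * value
    return out

def convolution_counter(g, gamma):
    """The same counter as an explicit weighted sum over all past inputs."""
    out = []
    for n in range(len(g)):
        out.append((1.0 - gamma) * sum(gamma ** (n - 1 - nu) * g[nu] for nu in range(n)))
    return out
```

Both routines produce identical outputs for any input sequence, mirroring the algebraic equivalence shown above for the general time-varying case.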