Edited by: Sven Bestmann, University College London, UK
Reviewed by: William Milberg, Harvard Medical School, USA; Tamer Demiralp, Istanbul University, Turkey; Laurence T. Maloney, New York University, USA
*Correspondence: Karl J. Friston, Wellcome Trust Centre for Neuroimaging, Institute of Neurology, Queen Square, London WC1N 3BG, UK. e-mail:
This is an open-access article subject to an exclusive license agreement between the authors and the Frontiers Research Foundation, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are credited.
We suggested recently that attention can be understood as inferring the level of uncertainty or precision during hierarchical perception. In this paper, we try to substantiate this claim using neuronal simulations of directed spatial attention and biased competition. These simulations assume that neuronal activity encodes a probabilistic representation of the world that optimizes free-energy in a Bayesian fashion. Because free-energy bounds surprise or the (negative) log-evidence for internal models of the world, this optimization can be regarded as evidence accumulation or (generalized) predictive coding. Crucially, both predictions about the state of the world generating sensory data and the precision of those data have to be optimized. Here, we show that if the precision depends on the states, one can explain many aspects of attention. We illustrate this in the context of the Posner paradigm, using the simulations to generate both psychophysical and electrophysiological responses. These simulated responses are consistent with attentional bias or gating, competition for attentional resources, attentional capture and associated speed-accuracy trade-offs. Furthermore, if we present both attended and non-attended stimuli simultaneously, biased competition for neuronal representation emerges as a principled and straightforward property of Bayes-optimal perception.
Attention is a ubiquitous and important construct in cognitive neuroscience. Many accounts of attention fall back on Jamesian formulations, famously articulated as “the taking possession by the mind, in clear and vivid form, of one out of what seem several simultaneously possible objects or trains of thought” (James,
We have suggested recently that perception is the inference about causes of sensory inputs and attention is the inference about the uncertainty (precision) of those causes (Friston,
The basic idea we pursue is that attention entails estimating uncertainty during hierarchical inference about the causes of sensory input. We develop this idea in the context of perception based on Bayesian principles, under the free-energy principle (Friston,
In predictive coding schemes, sensory data are replaced by prediction error, because this is the only sensory information that has yet to be explained. These (ascending) prediction errors are weighted by their precision, where the weighting is implemented by synaptic gain. We therefore return to the central role of precision-weighted prediction errors in optimal inference. Neurobiologically, this is easy to relate to theories of attentional gain, where the post-synaptic responsiveness of sensory (prediction error) units is modulated by attentional mechanisms (Desimone,
Electrophysiologically, desynchronization with increased gamma activity (between 30 and 100 Hz) is seen during attentional tasks in invasive (Steinmetz et al.,
In terms of neurotransmitters, gamma oscillations are profoundly affected by acetylcholine, which is released into sensory cortex from nuclei in the basal forebrain. It acts through both fast ion channel (nicotinic) receptors and slow metabotropic (muscarinic) receptors (Wonnacott,
In summary, it may be the case that attention is the process of optimizing synaptic gain to represent the precision of sensory information (prediction error) during hierarchical inference. Furthermore, if we allow for state-dependent changes in precision, the neurobiology of attention must involve activity-dependent changes in synaptic gain; assuming that neuronal activity represents the states of the world and synaptic gain represents precision. Given this sort of architecture we can, in principle, simulate attentional processing with established (Bayes-optimal) inversion or recognition schemes, using models with state-dependent noise. What follows is an attempt to do this.
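This gain-control idea can be caricatured in a few lines of code: a prediction error is multiplied by a precision (parameterized by its log), which plays the role of post-synaptic gain on the error unit. This is an illustrative sketch under our own naming, not the scheme used in the simulations below.

```python
import numpy as np

# Illustrative sketch (not the simulation scheme used later): a prediction
# error eps = y - prediction is weighted by its precision pi = exp(log_precision).
# On the present account, attention corresponds to optimizing this precision,
# which acts like a synaptic gain on the (prediction error) unit.

def weighted_error(y, prediction, log_precision):
    """Return the precision-weighted prediction error pi * (y - prediction)."""
    pi = np.exp(log_precision)  # precision encoded by its log, so pi > 0
    return pi * (y - prediction)

# The same sensory deviation drives inference more strongly when the
# gain (precision) is high - cf. attentional boosting of a sensory channel.
low_gain = weighted_error(1.2, 1.0, log_precision=0.0)   # gain exp(0) = 1
high_gain = weighted_error(1.2, 1.0, log_precision=2.0)  # gain exp(2) ~ 7.4
```

Encoding precision by its log keeps the implied gain positive, which is one motivation for the log-precision parameterization used in the scheme below.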
This paper comprises four sections. In the remainder of this section, we provide a brief review of attention in psychological and neurobiological terms. This section focuses on directed spatial attention and, in particular, the Posner (cueing) paradigm that emphasizes the importance of valid cues in establishing attentional set during target detection (Posner,
In this section, we review some of the key paradigms and theories that have dominated attention research over the past decades. This review can be regarded as a primer for readers who do not have a cognitive neuroscience background (and may be skipped by readers who do). Our focus will be the Posner paradigm, which we simulate in later sections, and biased competition, which is one of the most prevalent electrophysiologically grounded theories of attention. We will also cover some key distinctions, such as the difference between early and late selection and exogenous versus endogenous cueing.
Early cognitive models of attention, although inherently limited by lack of knowledge about the underlying neural processes, elucidated the important difference between early and late selection. Broadbent (
Biased competition (Desimone and Duncan,
This premise leads to a key prediction: if two stimuli are presented within a cell's receptive field, the response to both will be smaller than the sum of the responses to the stimuli presented separately (Reynolds et al.,
Large receptive fields thus cause stimuli to compete. The probability with which stimuli are represented by cells is thought to be influenced by a number of top-down and bottom-up biases. Bottom-up biases result from the properties of the stimulus itself, such as visual or emotional salience and novelty. Abrupt-onset stimuli, which have high temporal contrast, and thus salience, can attract attention even if they are task-irrelevant (Yantis and Jonides,
Top-down biases reflect the cognitive requirements of the task rather than the stimuli. Top-down biases have been most studied via spatially-directed attention experiments. Electrophysiologically, if attention is directed toward one of two competing stimuli in a receptive field, the mutually suppressive effect disappears and the response of the cell emulates the response to the attended stimulus alone (Moran and Desimone,
In later sections we will simulate optimal perception under the Posner task, a covert attention task. Attending to an object usually involves looking at it; that is, placing its image at the fovea (the central area of the retina with highest acuity). However, attention can be directed independently of eye movement (Posner et al.,
The two types of cues used in the Posner paradigm – central and peripheral – show the same facilitation effect. However, they may operate by different mechanisms. Peripheral stimuli are labeled as “exogenous,” because the change in attention is triggered by an external event. It is well established that abrupt-onset peripheral stimuli can attract attention via bottom-up mechanisms (Yantis and Jonides,
Exogenous and endogenous cueing fit well with biased competition theory: exogenous cueing can be thought of as a bottom-up bias, based on the prior expectation that salient events recur in the same part of the visual field. On the other hand, the effect of endogenous cues must be mediated by top-down bias. However, these top-down effects do not necessarily call on semantic or explicit processing: for example, Decaix et al. (
This section reviews the theoretical principles used later to explain perception and attention. This treatment is a bit technical but serves as a standalone summary of the mathematical principles behind the simulations of subsequent sections. More mathematical details can be found in (Friston et al.,
Our objective, given a model (brain),
Under ergodic assumptions, this is proportional to the long-term average of surprise, also known as negative log-evidence
Minimizing sensory entropy therefore corresponds to maximizing the accumulated log-evidence for a model of the world. Although sensory entropy cannot be minimized directly, we can create an upper bound
This function is a Kullback–Leibler divergence
The free-energy has been expressed here in terms of 𝓗(
Where
Crucially, this means the free-energy is only a function of the conditional mean or expectation. The expectations that minimize free-energy are the solutions to the following differential equations. For the generalized states
Where 𝒟 is a derivative matrix operator with identity matrices above the leading diagonal, such that
For slowly varying parameters φ(
Here, the solution
Equations
In summary, we have derived recognition dynamics for expected states (in generalized coordinates of motion) and parameters, which cause sensory samples. The solutions to these equations minimize free-energy and therefore minimize a bound on surprise or (negative) log-evidence. Optimization of the expected states and parameters corresponds to perceptual inference and learning respectively. The precise form of the recognition dynamics depends on the energy
We next introduce a very general model based on the hierarchical dynamic model discussed in Friston (
The non-linear functions
Under local linearity assumptions, the generalized motion of the sensory response and hidden states can be expressed compactly as
Equation 10 means that Gaussian assumptions about the random fluctuations specify a generative model in terms of a likelihood and empirical priors on the motion of hidden states
These probability densities are encoded by their covariances
Given this generative model we can now write down the energy as a function of the conditional means, which has a simple quadratic form (ignoring constants)
Here, the auxiliary variables
Again,
Note that the data enter the prediction errors at the lowest level;
In summary, these models are as general as one could imagine; they comprise hidden causes and states, whose dynamics can be coupled with arbitrary (analytic) non-linear functions. Furthermore, these states can be subject to random fluctuations with state-dependent changes in amplitude and arbitrary (analytic) autocorrelation functions. A key aspect is their hierarchical form, which induces empirical priors on the causes. In the next section, we look at the recognition dynamics entailed by this form of generative model, with a particular focus on how recognition might be implemented in the brain. We consider perception first and then attention. For completeness, we also mention learning; but will only pursue this in subsequent papers on learning and related phenomena (e.g., inhibition of return; Posner and Cohen,
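To make the notion of state-dependent noise concrete, the following toy generator (our own construction, with hypothetical names and parameters, not the model used in the simulations) draws data from a two-level model in which a slowly evolving hidden state sets the log-precision of the sensory fluctuations:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(T=64, dt=1.0):
    """Generate data from a toy two-level model with state-dependent noise.

    A hidden state x decays toward a hidden cause v (an empirical prior
    from the level above); its value also sets the log-precision of the
    sensory noise, so where x is high the data are more precise."""
    v = 1.0                       # hidden cause (supplied by the level above)
    x = 0.0                       # hidden state
    y = np.zeros(T)
    for t in range(T):
        x += dt * (v - x) / 4.0            # slow linear dynamics f(x, v)
        log_pi = 2.0 * x                   # state-dependent log-precision
        sigma = np.exp(-log_pi / 2.0)      # noise s.d. = precision**(-1/2)
        y[t] = x + sigma * rng.normal()    # sensory mapping g(x) plus noise
    return y

y = simulate()
```

As the hidden state rises toward its cause, the noise amplitude falls; a recognition scheme inverting such a model must therefore infer the precision of its data along with their causes.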
If we now write down the recognition dynamics (Eq.
Here, we have assumed the amplitude of random fluctuations is parameterized in terms of log-precisions, where
The vector function π(
It is difficult to overstate the generality and importance of Eq.
In neural network terms, Eq.
In the present context, the key thing about this scheme is that the precisions
Perceptual learning corresponds to optimizing the first-order parameters θ ⊂ φ according to Eq.
Here μ(θ) is the connection strength mediating the influence of the
We conclude by considering the equivalent dynamics for the second-order (precision) parameters γ ⊂ φ. These precision parameters govern lateral and top-down state-dependent gain control and are learned according to Eq.
As with perceptual learning, the precision parameters change in proportion to a synaptic tag that decays in proportion to the precision
In this section, we have applied the general form of recognition dynamics prescribed by the free-energy treatment to a generic hierarchical dynamic model with state-dependent noise. When formulated as a neuronal message-passing scheme something quite important emerges; namely, a lateral and top-down modulation of synaptic gain in principal cells that convey sensory information (prediction error) from one cortical level to the next. It is this necessary and integral component of perceptual inference that we associate with attention.
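For readers who prefer code to equations, the gist of this message-passing can be sketched for a static, two-level linear Gaussian model (a deliberate simplification of the generalized scheme above; all names are ours). The conditional mean descends the gradient of a free-energy that is just a sum of precision-weighted squared prediction errors:

```python
# Minimal predictive coding sketch, assuming a static two-level linear
# Gaussian model: y = x + noise, x = v_prior + noise. The conditional mean
# mu follows a gradient descent on F = pi_y*(y-mu)**2/2 + pi_v*(mu-v_prior)**2/2.
# All names and numbers are illustrative.

def recognize(y, v_prior, pi_y, pi_v, steps=200, lr=0.05):
    """Infer the conditional mean mu by gradient descent on free-energy."""
    mu = 0.0
    for _ in range(steps):
        eps_y = y - mu            # sensory (bottom-up) prediction error
        eps_v = mu - v_prior      # prior (top-down) prediction error
        # dF/dmu = -pi_y*eps_y + pi_v*eps_v; descend the gradient:
        mu += lr * (pi_y * eps_y - pi_v * eps_v)
    return mu

# High sensory precision (attention to the data) pulls mu toward y; the
# fixed point is the precision-weighted average of data and prior:
# (pi_y*y + pi_v*v_prior) / (pi_y + pi_v) = 8/5 = 1.6 in this example.
mu = recognize(y=2.0, v_prior=0.0, pi_y=4.0, pi_v=1.0)
```

Raising `pi_y` biases inference toward the data; raising `pi_v` biases it toward prior expectations. This is the sense in which optimizing precision implements attentional gain on ascending prediction errors.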
In this section, we use the hierarchical dynamic model of the previous section as a generative model of stimuli used in the Posner paradigm. Inversion of this model, using generalized predictive coding (Eq.
We deliberately tried to keep the generative model as simple as possible so that its basic behavior can be seen clearly. To this end, we used a model with two levels, the first representing visual input and the second representing the causes of that input. The model has the following form, which we unpack below.
This minimal model has all the ingredients needed to demonstrate some complicated but intuitive phenomena. It helps to bear in mind that this is a generative model of how sensory data are caused that is used by the (synthetic) brain; we actually generated sensory data by simply presenting visual cues in various positions. Because this is a model, the prior assumptions about the causes of visual input are that they are just random fluctuations about a mean of zero; i.e.,
We first describe the model in terms of the way that it explains sensory data; in other words, how it maps from causes to consequences. We then reprise the description in terms of its inversion; namely, mapping from consequences (sensory data) to causes (percepts). As a generative model, Eq.
Note how the log-precision π(1,
The ensuing increase in local precision can be regarded as analogous to exogenous cuing in the Posner paradigm, in the sense that it co-localizes in space and time with its sensory expression. Endogenous effects on precision that do not co-localize correspond to the probabilistic context established by
Some readers may wonder why we have used two hidden states that are placed in (redundant) opposition to each other. The reason for this is that we will use this model for more realistic simulations in the future, where hidden states encode a high precision in their circumscribed part of the visual field: this involves generating data in multiple sensory channels, with a hidden state for each channel or location. The vectors of ones and minus ones in Eq.
From the perspective of model inversion (mapping from sensory signals to causes) the predictive coding scheme of the previous section implies the following sort of behavior. When a cue
Figure
Figure
This asymmetry between valid and invalid cueing means that responses to valid targets have higher amplitude and much tighter confidence tubes than responses to invalid targets. This is shown in the lower right panel of Figure
The difference in the confidence tubes between validly and invalidly cued targets (Figure
Figure
Recall that the time course of the Posner effect depends on the slowly-decaying hidden states encoding precision (with a time constant of 32 in Eq.
The speed-accuracy trade-off is a useful psychophysical function, which can also be interpreted in terms of relative accuracies at a fixed reaction time. In this example, at 360 ms after the cue (about 50 ms after the onset of the target), the posterior confidence about the presence of valid targets is about 98%, whereas it is only about 70% for invalid targets (Figure
In what follows, we attempt to explain the well characterized electrophysiological correlates of the Posner paradigm using simulated event-related activity evoked by target stimuli. Spatial cueing effects are expressed in the modulation of event-related potentials (ERPs) to valid and invalid cues (Mangun and Hillyard,
Figure
In summary, this section has applied the Bayes-optimal scheme established in the previous section to a minimal model of the Posner paradigm. This model provides a mechanistic, if somewhat simplified, explanation for some of the key psychophysical and electrophysiological aspects of the Posner effect; namely, validity effects on reaction times and the time course of these effects as stimulus onset asynchrony increases. Furthermore, the model exhibits an asymmetry in costs and benefits for invalid and valid trials respectively. Electrophysiologically, it suggests that early attentional P1 enhancement can be attributed to a boosting or biasing of sensory signals (prediction errors) evoked by a target, while later P3 invalidity (cf. novelty) effects are mediated by prediction errors about the context in which targets appear.
In this final section, we revisit the simulations above but from the point of view of biased competition. Although the Posner paradigm considers a much greater spatial and temporal scale than the paradigms normally employed in monkey electrophysiology, we can emulate similar phenomena by presenting both cued and non-cued targets simultaneously using the Posner model. We hoped to see a competitive interaction between stimuli that favored the cued target. Furthermore, we hoped to see responses to the unattended (invalid) target changed in the presence of an attended target. This is one of the hallmarks of biased competition and is usually attributed to lateral interactions among competing representations for stimuli, within a cell's receptive field (see
Figure
This context is encoded by the expected hidden states and explains the biased competition for resources: in contrast with the hidden states inferred with the invalid target alone (see the equivalent panel in Figure
The results in Figure
Biased competition emerges naturally in Bayes-optimal schemes as a simple consequence of the fact that only one context can exist at a time. This unique aspect of context is encoded in the way that the representation of hidden states (context) modulates or distributes precision over sensory channels. Optimizing this representation leads to competition among stimuli to make the inferred context more consistent with their existence. This highlights the simplicity and usefulness of appealing to formal (Bayes-optimal) schemes, when trying to understand perception.
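The logic can be illustrated with a toy computation (our own, not the simulation above): if a contextual bias allocates a fixed budget of precision across two sensory channels, boosting the attended channel necessarily suppresses the competing one.

```python
import numpy as np

# Toy illustration (not the paper's simulation): two stimuli compete for
# precision. A context bias b favours channel 0; precisions are normalized,
# so a gain in one channel entails a loss in the other - the signature
# of biased competition. All names and numbers are ours.

def responses(signals, bias):
    """Precision-weighted responses of two competing sensory channels."""
    log_pi = np.array([bias, -bias])       # context-dependent log-precisions
    pi = np.exp(log_pi)
    pi = pi / pi.sum()                     # precision as a limited resource
    return pi * np.asarray(signals)

both_unbiased = responses([1.0, 1.0], bias=0.0)   # equal sharing of gain
both_biased = responses([1.0, 1.0], bias=1.0)     # attention to channel 0
```

With no bias the two (equal) stimuli share the gain; with a bias, the attended channel's response grows and the unattended channel's response shrinks, even though the stimuli themselves are unchanged.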
Our treatment of attention is one of many accounts that emphasize the role of probabilistic inference in sensory processing; including sensorimotor integration (Wolpert et al.,
The free-energy formulation is a generalization of information theoretic treatments that subsumes Bayesian schemes by assuming the brain is trying to optimize the evidence for its model of the world. This optimization involves changing the model to better account for sensory samples or by selectively sampling sensations that can be accounted for by the model (cf, perception and action). Attention can be viewed as a selective sampling of sensory data that have high-precision (signal to noise) in relation to the model's predictions. Crucially, the model is also trying to predict precision. It is this (state-dependent) prediction we associate with attention. In short, perception, attention and action are trying to suppress free-energy, which is an upper bound on (Shannon) surprise (or the negative log-evidence for the brain's model of the world). Under some simplifying assumptions, free-energy is just the amount of prediction error, which means free-energy minimization can be cast as predictive coding. So how does this relate to other formal treatments?
Rao (
It is becoming increasingly clear that estimates of the precision play an important role in sensory inference. Whiteley and Sahani (
The formulation in this paper reaffirms that there is no tension between biased competition and predictive coding: it demonstrates that the characteristic behaviors of biased competition emerge naturally under predictive coding. The key thing that reconciles these two theories is to realize that predictive coding can be generalized to cover both states and precisions and that (state-dependent) precision is itself optimized. This leads to non-linear interactions among states, implicit in the precision-weighting of prediction errors, and provides a simple explanation for attentional gain effects. It will be interesting to relate the ensuing bias or weighting of sensory signals (prediction errors) by precision to the divisive schemes above (e.g., Heeger,
In this paper, we have focussed on reaction time and event-related responses to targets. However, many electrophysiological and neuroimaging studies of attentional paradigms (e.g., Chelazzi et al.,
The neurobiological (resp. computational) mechanisms that might underlie these effects tie several strands of evidence together rather neatly: as noted in the introduction, the most plausible candidates for modulating activity-dependent (resp. state-dependent) synaptic gain (resp. precision) are fast synchronous interactions associated with attention (Börgers et al.,
In closing, we pre-empt a potentially interesting argument about the specificity of gain mechanisms and attention. The idea pursued in this paper is that attention corresponds to inference about uncertainty or precision and that this inference is encoded by dynamic changes in post-synaptic gain. However, non-linear (gain) post-synaptic responses are ubiquitous in the brain; so what is special about the non-linearities associated with attention? We suggest that attention is mediated by gain modulation of prediction-error units (forward or bottom-up information) in contradistinction to gain modulation of prediction units (backward, lateral or top-down information). In other words, non-linearities in the brain's generative model encoding context-sensitive expectations are distinct from non-linearities (gain) entailed by optimal recognition. The distinction may seem subtle but there is a fundamental difference between inferring the context-dependent contingencies and causes of sensations (perception) and their precision (attention). In this sense, there is an implicit distinction between inferring what is relevant for a task (as in classical attention tasks like dichotic listening) and the uncertainty about what is relevant. We have side-stepped this issue with the Posner task, because all cues are task relevant.
There is a final distinction that may be mechanistically important: we have focussed on activity-dependent optimization of gain but have not considered the (slower) learning of how and when this optimization should be deployed. For example, the latency of saccades to a target can be reduced if the target is more likely to appear on one side – and this relationship can be learned in as few as 150 trials (Carpenter and Williams,
In this paper, we have tried to establish the face validity of optimizing the precision of sensory signals as an explanation for attention in perceptual inference. We started with an established scheme for perception based upon optimizing a free-energy bound on surprise or the log-evidence for a model of the world. Minimizing this bound, using gradient descent, furnishes recognition dynamics that are formally equivalent to evidence accumulation schemes. Under some simplifying assumptions, the free-energy reduces to prediction error and the scheme can be regarded as generalized predictive coding. The key thing that we have tried to demonstrate is that all the quantities required for making an inference have to be optimized. This includes the precisions that encode uncertainty or the amplitude of random fluctuations generating sensory information. By casting attention as inferring precision, we can explain several perspectives on attentional processing that fit comfortably with their putative neurobiological mechanisms. Furthermore, by considering how states of the world influence uncertainty, one arrives at a plausible architecture, in which conditional expectations about states modulate their own precision. This leads naturally to competition and other non-linear phenomena during perception. We have tried to illustrate these ideas in the context of a classical paradigm (the Posner paradigm) and relate the ensuing behavior to biased competition evident in electrophysiological responses recorded from awake, behaving monkeys. In future work, we will use the theoretical framework in this paper to model empirical psychophysical and electrophysiological data and pursue this hypothesis using formal model comparison.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generalized filtering (Friston et al.,
This gives a simple form for the (Gibbs) energy that comprises a log-likelihood and prior
This system can be solved (integrated) using a local linearization (Ozaki,
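For the scalar case, a local linearization update can be sketched as follows (function names are ours); for linear flows the step is exact, which is the appeal of the scheme.

```python
import numpy as np

# Sketch of a local linearization step for integrating dx/dt = f(x):
# over a step dt the update is  dx = (exp(J*dt) - 1)/J * f(x),
# where J = df/dx is the (here scalar) Jacobian at the current state.
# Names and numbers are illustrative, not the paper's implementation.

def local_linearization_step(f, jacobian, x, dt):
    J = jacobian(x)
    if abs(J) < 1e-12:                     # J -> 0 limit: plain Euler step
        return x + dt * f(x)
    return x + (np.exp(J * dt) - 1.0) / J * f(x)

# Integrate dx/dt = -x (Jacobian J = -1) from x = 1 for unit time;
# because the flow is linear, the scheme reproduces x = exp(-1) exactly.
x = 1.0
for _ in range(10):
    x = local_linearization_step(lambda z: -z, lambda z: -1.0, x, dt=0.1)
```

In the matrix-valued case the same update uses the matrix exponential of the Jacobian; the scalar version above is just the simplest instance.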
The corresponding curvatures are (neglecting second-order terms involving states and parameters and second-order derivatives of the conditional entropy)
Finally, the conditional precision and its derivatives are given by the curvature of the (Gibbs) energy
Note that we have simplified the numerics here by neglecting conditional dependencies between the precisions and the states or parameters. These equations may look complicated but can be evaluated automatically using numerical derivatives. All the simulations in this paper used just one routine –
Sensory signals are invariably registered as non-negative quantities (e.g., firing rates of photoreceptors). If we assume the sensory signals
This means that as the expected amplitude of the sensory input increases,
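As an aside, the practical effect of a log transform on non-negative signals with multiplicative noise can be checked numerically (an illustrative demonstration with arbitrary numbers, not the transform used in the simulations): the fluctuations become additive, with an amplitude that no longer depends on the signal's expected level.

```python
import numpy as np

rng = np.random.default_rng(1)

# If non-negative signals (e.g., firing rates) carry multiplicative noise,
# s = a * exp(w) with w ~ N(0, sigma**2), then log s = log a + w: the log
# transform renders the fluctuations additive with constant standard
# deviation sigma, whatever the amplitude a. Numbers are illustrative.

def log_noise_std(amplitude, sigma=0.2, n=20000):
    s = amplitude * np.exp(sigma * rng.standard_normal(n))
    return np.log(s).std()

small = log_noise_std(amplitude=1.0)    # weak signal
large = log_noise_std(amplitude=100.0)  # strong signal: same log-noise
```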
The Wellcome Trust funded this work. We would like to thank Marcia Bennett for helping prepare this manuscript. We are very grateful to Jon Driver and Kia Nobre for invaluable help in formulating these ideas.
A measure of salience based on the (Kullback–Leibler) divergence between the recognition and prior densities. It measures the information in the data that can be recognized.
Conditional density or posterior density is the probability distribution of causes or model parameters, given some data; i.e., a probabilistic mapping from observed data (consequences) to causes.
Information divergence, information gain or relative entropy is a non-commutative measure of the difference between two probability distributions.
Priors that are induced by hierarchical models; they provide constraints on the recognition density in the usual way but depend on the data.
The average surprise of outcomes sampled from a probability distribution or density. A density with low entropy means, on average, the outcome is relatively predictable (certain).
An information theory measure that bounds (is greater than) the surprise on sampling some data, given a generative model.
Generalized coordinates of motion cover the value of a variable, its motion, acceleration, jerk and higher orders of motion. A point in generalized coordinates corresponds to a path or trajectory over time.
Generative model or forward model is a probabilistic mapping from causes to observed consequences (data). It is usually specified in terms of the likelihood of getting some data given their causes (parameters of a model) and priors on the parameters.
An optimization scheme that finds a minimum of a function by changing its arguments in proportion to the negative of the gradient of the function at the current value.
Device or scheme that uses a generative model to furnish a recognition density. They learn hidden structure in data by optimizing the parameters of generative models.
(In general statistical usage) means the inverse variance or dispersion of a random variable. The precision matrix of several variables is also called a concentration matrix. It quantifies the degree of certainty about the variables.
The probability distribution or density on the causes of data that encode beliefs about those causes prior to observing the data.
Recognition density or approximating conditional density is a probability distribution over the causes of data. It is the product of (approximate) inference or inverting a generative model. It is sometimes referred to as a proposal or ensemble density in machine learning.
Surprisal or self-information is the negative log-probability of an outcome. An improbable outcome is therefore surprising.
The successive states of stochastic processes are governed by random effects.
A measure of unpredictability or expected surprise (cf. entropy). The uncertainty about a variable is often quantified with its variance (inverse precision).