Expectancies in Decision Making, Reinforcement Learning, and Ventral Striatum

Decisions can arise in different ways, such as from a gut feeling, doing what worked last time, or planful deliberation. Different decision-making systems are dissociable behaviorally, map onto distinct brain systems, and have different computational demands. For instance, “model-free” decision strategies use prediction errors to estimate scalar action values from previous experience, while “model-based” strategies leverage internal forward models to generate and evaluate potentially rich outcome expectancies. Animal learning studies indicate that expectancies may arise from different sources, including not only forward models but also Pavlovian associations, and the flexibility with which such representations impact behavior may depend on how they are generated. In the light of these considerations, we review the results of van der Meer and Redish (2009a), who found that ventral striatal neurons that respond to reward delivery can also be activated at other points, notably at a decision point where hippocampal forward representations were also observed. These data suggest the possibility that ventral striatal reward representations contribute to model-based expectancies used in deliberative decision making.


INTRODUCTION: ANATOMY OF A DECISION
Definitions from different approaches to decision making commonly emphasize that a decision should involve "choice among alternatives" (Glimcher et al., 2008). This rules out the extreme case of a (hypothetical) pure reflex where a given stimulus is always followed by a fixed response, and is more in line with "…the delay, between stimulation and response, that seems so characteristic of thought" (Hebb, 1949). A genuine decision depends on more than external circumstances alone: the chosen response or action can reflect the agent's experience, motivation, goals, and perception of the situation. Thus, theories of decision making, by definition, are concerned with covert processes in the brain: with the representations and computations internal to the decision-maker that give rise to behaviorally observable choice.
A useful simplification in studies of economic decision making has been to focus on "static" decision making (Edwards, 1954), where internal variables are assumed fixed and the decision-maker's response to a variety of different choice menus is observed (for instance, would you rather have one apple or five grapes?). This tradition gave us the concept of value or utility, a common currency that allows comparison of the relative merits of different choices (Bernoulli, 1738; Rangel et al., 2008). In experimental studies of animal learning, the complementary "dynamic" approach has been popular, in which the stimulus or situation is held constant and changes in choice behavior resulting from internal variables, such as learning and motivation, can be studied (Domjan, 1998).
The reinforcement learning (RL) framework integrates both of these traditions to form an explicit computational account of not only how an agent might choose among alternatives based on a set of internal variables, but also how those variables are learned and modified from experience. The RL framework covers a range of models and methods, but most share common elements exemplified by the basic temporal-difference (TD) algorithm (Sutton and Barto, 1998). Briefly, TD-RL algorithms, such as the actor-critic variant, operate on a set of distinct situations or states (such as being in a particular location, or the presentation of a tone stimulus; this set is known as the state space), in which one or more actions are available (such as "go left"). Actions can change the state the agent is in and may lead to rewards, conceptualized as scalars in a common reward currency; the agent has to learn from experience which actions lead to the most reward. It does this by updating the expected value of actions based on how much better or worse than expected those actions turn out to be: that is, it relies on a TD prediction error. A single static decision consists of the actor choosing an action based on the learned or "cached" values of the available actions (perhaps it picks the one with the highest value). From the observed outcome, the critic computes a prediction error by comparing the expected value with the value of the new state plus any rewards received. If the prediction error is non-zero, the critic updates its own state value, and the actor's action value is updated in parallel. Thus, by learning a value function over states, the critic allows the actor to learn action values that maximize reward.
In the dynamic (learning) sense, such TD-RL algorithms are very flexible in that they can learn solutions to a variety of complex tasks. However, a key limitation is their dependence on cached action values to make a decision, which means there is no information available about the consequences of actions. This limitation renders decisions inflexible with respect to changing goals and motivations (Dayan, 2002; Daw et al., 2005; Niv et al., 2006). Furthermore, because such cached action values are based only on actual rewards received in the past, they cannot support latent learning, are not available in novel situations, and are only reliable if the world does not change too rapidly relative to the speed of learning. The first limitation is illustrated, for instance, by experiments that involve a motivational shift (Krieckhaus and Wolf, 1968; Dickinson and Balleine, 1994). In an illustrative setup (Dickinson and Dawson, 1987), there is a training phase where action A (left lever) leads to water reward, and action B (right lever) to food reward, calibrated such that both actions are chosen approximately equally. From experience, the agent learns action values for actions A and B. Next, the agent is made thirsty and returned to the testing chamber where actions A and B are available but do not lead to reward. All the agent has to go on is its previously learned, cached values for A and B, thus expressing no preference between them¹. However, what can be observed experimentally is that animals now prefer the left lever (which previously led to water), indicating that they can adjust their choice depending on motivational state (Dickinson and Dawson, 1987). In contrast, the model is limited by its previously learned values that do not take the motivational shift into account³. Furthermore, there are other experimental results which are difficult to explain if decisions are based on cached values that do not include sensory properties of the outcome, such as the differential outcomes effect (Urcuioli, 2005), "causal reasoning" (Blaisdell et al., 2006), shortcut behavior (Tolman, 1948) and specific Pavlovian-instrumental transfer (discussed in detail below).
Such considerations motivated the notion that animals have knowledge about the consequences of their actions, and that they can use such knowledge, or expectancies, to make informed decisions (Tolman, 1932; Bolles, 1972; Balleine and Dickinson, 1998). An expectancy can be loosely defined as a representation of an outcome before it occurs; as we discuss in the final section, expectancies may be generated in different ways including action-outcome as well as stimulus-stimulus (Pavlovian) associations. In the context of a motivational shift, an expectancy-based decision mechanism is thought to require two components: generation

Reinforcement learning (RL)
A computational framework in which agents learn what actions to take based on reinforcement given by the environment. It provides tools for problems such as reinforcement being delayed with respect to the actions that lead to it (the credit assignment problem) and balancing known good actions against unknown ones that might be better (the exploration-exploitation tradeoff).

Actor-critic architecture
A class of RL algorithm with two distinct but interacting components. The "actor" decides what actions to take, and the "critic" evaluates how well each action turned out by computing a prediction error. Several studies report a mapping of these components onto distinct structures in the brain.
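
As an illustration of the actor-critic idea, the following is a minimal sketch, not the model used in any of the studies discussed here. It assumes a hypothetical one-state choice task in which "left" pays 0 and "right" pays 1; the function names, learning rates, and softmax action selection are all assumptions made for the example. The critic learns a state value from the TD prediction error, and the same error updates the actor's cached action preferences.

```python
import math
import random


def softmax(prefs):
    """Convert action preferences into choice probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def run_actor_critic(n_trials=2000, alpha=0.1, beta=0.1, gamma=1.0, seed=0):
    """Hypothetical task: one choice state; action 0 pays 0, action 1 pays 1."""
    rng = random.Random(seed)
    rewards = [0.0, 1.0]
    v = 0.0             # critic: cached value of the choice state
    prefs = [0.0, 0.0]  # actor: cached preference for each action
    for _ in range(n_trials):
        probs = softmax(prefs)
        action = 0 if rng.random() < probs[0] else 1
        r = rewards[action]
        v_next = 0.0    # the trial ends after the choice (terminal state)
        # TD prediction error: outcome plus next-state value, minus expectation
        delta = r + gamma * v_next - v
        v += alpha * delta             # critic update
        prefs[action] += beta * delta  # actor update, driven by the same error
    return v, prefs


value, preferences = run_actor_critic()
```

Note that the learned quantities are scalars only: nothing in `value` or `preferences` records what the reward actually was, which is the limitation of cached values discussed in the main text.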

State space
In order to learn what action to take in a given situation, an agent must be able to detect what situation or state it is in. In RL, the set of all states is known as the state space, which may include location within an environment or the presence of a discriminative stimulus.

Expectancy
A representation of a particular future event or outcome, such as that of food following a predictive (Pavlovian) stimulus or an outcome generated by a forward model.

¹ An alternate scenario is that the motivational shift causes the agent to be in a new state. However, in this case, it will not have any cached values at all, so again no preference would be predicted.
² For clarity, we have ignored the important but complex issue of under precisely what conditions animals respond immediately, as opposed to only after further experience, to motivational shifts and reinforcer revaluation procedures (see, e.g., Dickinson and Balleine 1994 for details). For now, we merely wish to point out that, under some conditions, they do.
³ One might imagine a variety of subtle modifications that would enable an actor-critic model to choose appropriately following motivational shifts. For instance, an agent who actually experiences both hungry and thirsty states during training could learn separate cached values for each, such that it would be sensitive to motivational shifts by calling up the relevant set of values. While the learning of multiple value functions would work for this specific experimental situation, it seems unlikely to generalize to different implementations of the procedure (such as pairing a specific outcome with illness; Garcia et al. 1970).

of action-outcomes, and evaluation of such outcomes which takes current motivational state and goals into account. Put simply, the rat presses the lever because it predicts a food outcome, and it currently wants the food. This approach is sometimes referred to as "model-based" because it relies on a forward model of the environment to generate outcomes; in principle, this mechanism need not be restricted to simply predicting the outcome of a lever press, but could include mental simulation or planning over extended and varied state spaces, such as spatial maps or Tower of London puzzles (Newell and Simon, 1972; Shallice, 1982; Gilbert and Wilson, 2007). While a model of the environment is a necessary component of this approach, it is only half of the solution⁴ and a dynamic outcome evaluation step is also required. Thus, we will refer to it here as dynamic evaluation lookahead to emphasize the importance of the evaluation step; basic TD-RL, which relies on cached values in the absence of a forward model and dynamic evaluation, we term "model-free" (Daw et al., 2005).

POTENTIAL NEURAL CORRELATES OF DYNAMIC EVALUATION LOOKAHEAD
The fact that humans and animals respond appropriately to motivational shifts and other tasks thought to require outcome representations implies the presence of a controller such as dynamic evaluation lookahead. However, it appears a model-free controller is also used in some conditions. Which one is in control of behavior can depend on factors such as the amount of training and the reinforcement schedule. For instance, with extended training behavior can become "habitual", or resistant to reinforcer devaluation, which tends to be effective during early learning (Adams and Dickinson 1981; Daw et al. 2005, but see Colwill and Rescorla 1985). In devaluation in lever pressing tasks, as well as in other procedures, behavior that in principle requires only action values appears to depend on the dorsolateral striatum (Packard and McGaugh, 1996; Yin et al., 2004). In contrast, as might be expected from the variety of world knowledge required for model-based methods, model-based control appears to be more domain-specific. For instance, the ability to plan a route to a particular place requires the hippocampus (Morris et al., 1982; Redish, 1999), while sensitivity to devaluation relies on a limbic network that includes the basolateral amygdala, orbitofrontal cortex, and possibly ventral striatum (Corbit et al. 2001; Pickens et al. 2005; Johnson et al. 2009b, but see de Borchgrave et al. 2002). We focus here on recent results aimed at elucidating the neural basis of model-based decision making. Recall that dynamic evaluation lookahead requires both the generation and evaluation of potential choice outcomes, implying the existence of neural representations spatio-temporally dissociated from current stimuli (Johnson et al., 2009a).
Johnson and Redish (2007) recently identified a possible neural correlate of the internal generation of potential choice outcomes. Recording from ensembles of hippocampal neurons, they found that while the ensemble usually represented locations close to the animal's actual location (as would be expected from "place cells"), during pauses at the final choice point of the Multiple-T task (Figure 1A), the decoded location could be observed to sweep down one arm of the maze, then the other, before the rat made a decision (Figure 1B,C). Further analyses revealed that on average, the decoded representation was more forward of the animal than backward (implying that it is not a general degeneration of the representation into randomness), tended to represent one choice or the other rather than both simultaneously, and tended to be more forward early during sessions (when rats were still uncertain about the correct choice) compared to late (when performance was stable). While the precise relationship of such hippocampal "sweeps" to individual actions or decisions is presently unknown, the manner in which they occur (during pauses at the choice point, during early but not late learning) suggests an involvement in decision making. Consistent with a role in dynamic evaluation lookahead, the hippocampus is required for behaviors requiring route planning in rats (Redish, 1999), and is implicated in imagination, self-projection, and constructive memory in humans (Buckner and Carroll, 2007; Hassabis et al., 2007). If hippocampal sweeps are the neural correlate of the generation of possibilities in dynamic evaluation lookahead, where is the evaluation?
Following the dynamic evaluation lookahead model, any behavioral impact of sweeps (generation of possibilities) would depend on an assignment of a value signal (evaluation). The hippocampal formation sends a functional projection to the ventral striatum (Groenewegen et al., 1987; Ito et al., 2008) and hippocampal network activity can modulate ventral striatal firing (Lansink et al., 2009). Thus, van der Meer and Redish (2009a)

Forward model
In the RL domain, a model of the world that allows an agent to make predictions about the outcomes of its actions (forward in time or "lookahead"). For instance, knowing that pressing a certain lever leads to a "water" outcome, or being able to plan a detour if the usual route is blocked, requires a forward model.

Dynamic evaluation lookahead
An evaluation of a future outcome that takes the agent's current motivational state into account. A two-step process that requires prediction, then evaluation, of the outcome, mapping the prediction onto a value usable for decision making.

Model-free versus model-based RL
Model-free RL maintains a set of values for available actions indicating how successful each action was in the past. It has no concept of the actual outcome (such as food or water) of that action. In contrast, model-based RL takes advantage of such world knowledge, such that a choice which leads to water might be preferred when thirsty.
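
The contrast can be made concrete with a toy version of the motivational-shift experiment (left lever leads to water, right lever to food). This is a sketch under stated assumptions, not a model from the literature: the dictionaries, the function names, and the numbers standing in for the thirsty animal's needs are all hypothetical.

```python
# Cached (model-free) values after training in which both levers were
# chosen about equally: scalars only, with no record of what each lever produced.
cached_values = {"left_lever": 1.0, "right_lever": 1.0}

# Forward model (assumed to be learned during the same training):
# action -> predicted outcome.
forward_model = {"left_lever": "water", "right_lever": "food"}


def outcome_value(outcome, motivation):
    """Dynamic evaluation: the value of an outcome given the current need state."""
    return motivation.get(outcome, 0.0)


def choose_model_free(cached):
    """Return every action tied for the highest cached value."""
    best = max(cached.values())
    return sorted(a for a, v in cached.items() if v == best)


def choose_lookahead(model, motivation):
    """Predict each action's outcome, evaluate it now, then pick the best."""
    values = {a: outcome_value(o, motivation) for a, o in model.items()}
    best = max(values.values())
    return sorted(a for a, v in values.items() if v == best)


# Hypothetical need state of a thirsty animal tested without reward delivery.
thirsty = {"water": 1.0, "food": 0.2}

model_free_choice = choose_model_free(cached_values)       # both levers tie
lookahead_choice = choose_lookahead(forward_model, thirsty)  # left lever only
```

The cached values cannot change without new reward experience, so the model-free chooser remains indifferent, whereas the lookahead chooser re-evaluates the predicted outcomes under the new motivational state.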

Decoding
Mapping neural activity to what it represents, such as in reconstructing the identity of a stimulus from spike train data or estimating the location of an animal based on the activity of place cells.

hypothesized that ventral striatum might play an evaluative role that connects sweeps (possible actions) to behavioral choice (actions). As a first step toward testing this idea, van der Meer and Redish (2009a) recorded from ventral striatal neurons on the same Multiple-T task on which hippocampal sweeps had been observed (Johnson and Redish, 2007). The approach taken was to first isolate cells apparently involved in encoding reward receipt or value (as defined by a significant response to actual reward receipt: food pellets delivered following arrival at the correct maze arm) and then to ask if these neurons were also active at other points on the track. If so, this would indicate potential participation in covert outcome representations. Indeed, the first observation of van der Meer and Redish (2009a) is that ventral striatal neurons that responded to reward delivery often fired a small number of spikes at other locations on the track (Figure 2A). Based on the Johnson and Redish (2007) finding of sweeps at the choice point, the a priori prediction was that if

Figure 1 | Representation of forward possibilities at the choice point of the Multiple-T maze. (A) The Multiple-T maze. Rats are trained to run laps on an elevated track for food reward. Only one side (right in this example) is rewarded in any given session, but which side is varied between sessions, such that rats start out uncertain about the correct choice. Over the first 10 laps, choice performance increases rapidly, coincident with a tendency to pause at the final choice point (van der Meer and Redish). Over the course of a session, rats continue to refine their path, indicating learning beyond choice (Schmitzer-Torbert and Redish, 2002). (B) Decoding methods schematic. Neurons in the rat hippocampus tend to be active in specific places on the track: five such "place fields" (colored circles) around the choice point [black box in (A)] are shown. By observing which cells are active at any given time, we can infer what location is being represented. If the rat is simply representing its current location, the red cell will be active. In contrast, when the rat pauses at the choice point, activity from the purple, green, and yellow cells might be observed in sequence. This indicates the rat is representing a location distant from its current location (the right maze arm in this example), a key component of planning. (C) Sequence of place representations decoded from actual neurons as the rat (location indicated by the white o) pauses at the final choice point. Red indicates high probability, blue indicates low probability. Note how even though the rat (o) stays stationary, the decoded probability sweeps down the left arm of the maze, then the right (arrows). Data from Johnson and Redish (2007).
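
The decoding logic in panel (B) can be sketched with a simple Bayesian decoder. This is an illustrative sketch only and not necessarily the method used by Johnson and Redish (2007): it assumes independent Poisson spiking, a uniform prior over positions, and three hypothetical place cells whose tuning curves and spike counts are invented for the example.

```python
import math


def decode_position(spike_counts, tuning_curves, dt):
    """Posterior over positions given spike counts in a window of dt seconds.

    Assumes independent Poisson spiking and a uniform prior over positions.
    tuning_curves[i][x] is the mean firing rate (Hz) of cell i at position x.
    """
    n_positions = len(tuning_curves[0])
    log_post = []
    for x in range(n_positions):
        ll = 0.0
        for n, curve in zip(spike_counts, tuning_curves):
            expected = max(curve[x], 1e-6) * dt  # expected spike count
            # Poisson log-likelihood; the log(n!) term is constant across x
            ll += n * math.log(expected) - expected
        log_post.append(ll)
    m = max(log_post)
    unnorm = [math.exp(l - m) for l in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Three hypothetical place cells with fields centered on positions 0, 1, and 2.
tuning = [
    [20.0, 2.0, 0.5],
    [2.0, 20.0, 2.0],
    [0.5, 2.0, 20.0],
]

# Spikes observed in a 250 ms window: mostly from the third cell.
posterior = decode_position([0, 1, 5], tuning, dt=0.25)
decoded = max(range(len(posterior)), key=posterior.__getitem__)
```

If the animal were sitting at position 0 while `decoded` pointed to position 2, the ensemble would be representing a non-local position, which is the signature of a "sweep" in this scheme.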
these non-local reward spikes are related to sweeps, they should occur preferentially at the choice point. Although the effect was subtle, this is what was found: compared to non-reward-responsive cells, reward cells had a higher firing rate specifically at the choice point (Figure 2B). This implies that at the choice point, animals have access to internally generated reward expectancies, which could allow them to modify their actions in the absence of immediate reward.
Next, van der Meer and Redish (2009a) examined the time course of the reward activity at the choice point. Both behavioral evidence and the time course of sweeps suggest a change in strategy on this Multiple-T task (Figure 1A), where, initially, behavior is under deliberative, dynamic evaluation lookahead control, but later it is less so. Consistent with this idea, late during sessions, when rats no longer paused at the final choice point, there was no longer any difference between reward and non-reward cell firing at this choice point. It was also found that when the rat deviated from its normal path in an error, representation of reward was increased before turning around.
Thus, the covert representation of reward effect cannot be easily explained by reward-predictive cue responses, because the effect is specific to choice points, while other places (closer to the reward sites) are more predictive of reward, and because it is present early, but not late, in a constant environment. Instead, this effect suggests ventral striatum may be involved in the evaluation of internally generated possibilities during decision making. We explore this idea in the following section.

VENTRAL STRIATUM AS THE EVALUATOR IN DYNAMIC EVALUATION LOOKAHEAD
Actor-critic models have been especially relevant to neuroscience because of the experimentally observed mapping of their internal variables and processes onto dissociable brain areas. In particular, a common suggestion is that the dorsolateral striatum implements something like the actor, while the ventral tegmental area (VTA) and the ventral striatum work together to implement something like the critic (Houk et al., 1995; O'Doherty et al., 2004). While fMRI studies

Figure 2 | Ventral striatal neurons show covert expectation of reward at a maze choice point. (A) Example of a reward-responsive neuron in ventral striatum that also fires spikes at other points on the maze, notably at the choice point (black arrow). The top panel shows the rat's path on the track (gray dots), with the black dots indicating the location of the rat when a spike was fired. This neuron responded to reward delivery at the two feeder sites on the right side of the track, as indicated by the transient increase in firing rate at the time of reward delivery (peri-event time histograms, bottom panels). Data taken from a variation on the Multiple-T task published by van der Meer and Redish. (B) Averaged over all cells, reward-responsive (blue), but not non-reward-responsive neurons (red) show a slight but significant increase in firing at the final choice point (T4) during early laps (1-10). Replot of the data in Figure 5, van der Meer and Redish, obtained by normalizing each cell's firing rate against the distribution of firing rates over the sequence of turns (from S to past T4) in laps 1-10; the original figure published by van der Meer and Redish normalized against the firing rate distribution over the same segment of the track, but from all laps. The covert representation of reward effect at T4 was robust against this choice of normalization method.

Covert representation
Neural activity that is not directly attributable to external stimulation or resulting behavior, such as the consideration of possibilities during deliberation or the mental rotation of images.
have reliably found value signals in the human ventral striatum (e.g., Preuschoff et al. 2006), the ventral striatum-critic connection has been less frequently made in recording studies (but see Cromwell and Schultz 2003; Takahashi et al. 2008). However, there are reports of ventral striatal firing patterns which are potentially consistent with a critic role. For instance, some ventral striatal neurons respond to actual reward receipt, as well as to cues that predict them (Williams et al., 1993; Setlow et al., 2003; Roitman et al., 2005); this dual encoding of actual and predicted rewards is an important computational requirement of the critic. Also, neurons which ramp up activity at the time or location of reward receipt are commonly found (Schultz et al. 1992; Lavoie and Mizumori 1994; Miyazaki et al. 1998; Khamassi et al. 2008).
In the strict actor-critic formulation, the critic only serves to train the actor; it is not required for a single static decision. This is consistent with a rat lesion study that found performance on a well-trained cued choice task was less affected by ventral striatal inactivation during choice, as compared to inactivation during training (Atallah et al., 2007). However, extensive evidence also suggests that ventral striatum is more directly involved in decision making. In particular, as reviewed in Cardinal et al. (2002), ventral striatum is thought to support the behavioral impact of motivationally relevant cues in effects such as autoshaping, conditioned reinforcement, and Pavlovian-instrumental transfer (PIT; Kruse et al. 1983; Colwill and Rescorla 1988; Corbit and Janak 2007; Talmi et al. 2008). For instance, in specific PIT (Figure 3A), a Pavlovian association is triggered by the presentation of the conditioned stimulus (CS, e.g., a tone) which has previously only been experienced in a different context than that where the choice is made. This association results in an expectancy containing certain properties of the unconditioned stimulus (US, e.g., food reward) which are sufficient to bias the subject's choice toward actions that result in that US. For instance, given a choice between food and water, presentation of a Pavlovian cue that (in a different context) was paired with food will tend to bias the subject toward choosing food rather than water. Because this effect is reinforcer-specific⁵, there must be an expectancy involved that contains outcome-specific properties, as in dynamic evaluation lookahead. However, in specific PIT, this expectancy is not generated by an internal forward model as the outcome of a particular action, but rather by Pavlovian association (Figure 3B).

Figure 3 | Expectancies generated by specific Pavlovian-instrumental transfer (PIT) and an internal forward model. (A) Schematic representation of a canonical specific PIT experiment (after Kruse et al. 1983). The animal is exposed to Pavlovian pairing of a light (conditioned stimulus or CS+) preceding food delivery (the unconditioned stimulus or US). Then, in a different environment, the animal learns to press lever 1 to obtain water and lever 2 to obtain food (the same food as in the pairing phase). Ideally these are calibrated such that the animal presses both equally. A specific PIT effect is obtained if, during the critical testing phase, the effect of presenting the light CS+ is to bias the animal's choice towards lever 2 (which previously resulted in food, as predicted by the light). Because this effect is specific to the food, it requires the animal to have an expectancy of the food when pressing lever 2. (B) Illustration of the distinction between outcome expectancies generated by an internal forward model (top) and presentation of the light CS+ (bottom). In the forward model case, the animal predicts the outcomes of the different available actions [which are then thought to be available for dynamic evaluation (V)]. In the Pavlovian case, the food outcome is activated by the learned association with the light cue, in the absence of a forward model.

As ventral striatum appears to be required for specific PIT (Corbit et al., 2001; Cardinal et al., 2002), this implies not only that ventral striatum can influence individual decisions, but also that it can do so through an outcome-specific expectancy biasing the subject toward a particular action. Note the similarity between this process and dynamic evaluation lookahead, where an internally generated representation of a particular outcome is involved in choice. Given that ventral striatal afferents, such as the hippocampus, can represent potential outcomes, we propose that ventral striatum evaluates such internally generated expectancies. In the actor-critic algorithm, the critic reports the value of cues or states that "actually occur"; the critic would also be well equipped to report values for "internally generated" cues or states, such as those resulting from model-based lookahead or Pavlovian associations. This is reminiscent of the idea that ventral striatum "mediates the motivational impact of reward-predictive cues" (Cardinal et al., 2002; Schoenbaum and Setlow, 2003) and congruent with an action-biasing role "from motivation to action" (Mogenson et al., 1980), but maintains a similar computational role across model-free and dynamic evaluation lookahead control and across experimental paradigms, by including not just the evaluation of actual outcomes but also that of imagined or potential outcomes. Such an extended role can reconcile the suggestion that ventral striatum serves as the critic in an implementation of a model-free RL algorithm with evidence for its more direct involvement in decision making as demonstrated by effects such as PIT.
A specific prediction of this extended role for ventral striatum is that there should be value-related neural activation during expectancy-based decisions, such as dynamic evaluation lookahead and specific PIT. The data of van der Meer and Redish (2009a), as well as those of others (German and Fields, 2007), are consistent with this proposal. German and Fields (2007) found that in a morphine-conditioned place preference task in a three-chamber environment, ventral striatal neurons that were selectively active in one of the chambers tended to be transiently active just before the rats initiated a journey toward that particular chamber. However, it is not known (in either study) whether these representations encode only a scalar value representation (good, bad) or reflect a specific outcome (such as food or water); value manipulations could address this issue. Although the time course of reward cell firing at the choice point reported by van der Meer and Redish (2009a) suggests a possible relationship with the behavioral strategy used (dynamic evaluation lookahead versus model-free cached values), it would be useful to verify this with a behavioral intervention, such as devaluation. Finally, the temporal relationship between this putative ventral striatal evaluation signal and outcome signals elsewhere is not known. For instance, the spatio-temporal distribution of the non-local reward cell activity in ventral striatum matched that of hippocampal "sweeps"; whether these effects coincide on the millisecond time scale of cognition is still an open question. Interestingly, there is evidence that hippocampal activity can selectively impact reward-related neurons in ventral striatum (Lansink et al., 2008). A possible mechanism for organizing relevant inputs to ventral striatum could be provided by gamma oscillations mediated by fast-spiking interneurons (Berke, 2009; van der Meer and Redish, 2009b); consistent with this idea, van der Meer and Redish (2009b) found that ∼80 Hz gamma oscillations, which are prominent in ventral striatal afferents including the hippocampus and frontal cortices, were increased specifically at the final choice point during early learning.
There is, however, an intriguing challenge to the role of ventral striatum as the evaluator in dynamic evaluation lookahead: the way in which expectancies can influence choice behavior may depend on the way in which they are generated. In particular, behavior under the influence of specific PIT effects is not sensitive to devaluation of the US⁶, even though the procedure itself produces choice behavior requiring a representation of that US (Holland, 2004). This result suggests that while specific PIT and dynamic evaluation lookahead both depend on the generation of a specific outcome expectancy, the existence of such an expectancy alone is not sufficient for dynamic evaluation in decision making. It raises the question of how the different impacts of internally generated versus cued outcome expectancies are implemented on the neural level. In experimental settings used to identify outcome representations with recording techniques, different ways of generating expectancies can be difficult to distinguish because of the presence of reward-predictive cues (e.g., Colwill and Rescorla 1988; Schoenbaum et al. 1998). To the extent that the static spatial setting of the Multiple-T maze contains reward-predictive cues, they are not specific or maximally predictive at the choice point, such that the representations of reward at the choice point reported by van der Meer and Redish (2009a) are unlikely to result from Pavlovian associations, but instead are likely to reflect internally generated expectancies. However, little is known about the mechanism by which expectancies become linked to particular actions; two recent reports finding action-specific value representations in ventral striatum (Ito and Doya, 2009; Roesch et al., 2009) can provide a basis for investigating this issue.
In summary, the results obtained by van der Meer and Redish (2009a) show that ventral striatal representations of reward can be activated not just by the delivery of actual reward, but also during decision making. The spatio-temporal specificity of this effect suggests that covert representation of reward in ventral striatum may contribute to internally generated, dynamic evaluation lookahead. A role for ventral striatum as evaluating, or translating to action, the motivational relevance of internally generated expectancies is a natural extension of its commonly proposed role as critic. Future work may address the content of its neural representations during procedures that seem to generate expectancies with different properties, such as reinforcer devaluation and PIT, as well as its relationship to individual choices and other outcome-specific signals in the brain.

ACKNOWLEDGMENTS
We thank Bruce Overmier and Adam Steiner for their comments on an earlier version of the manuscript, and Kenji Doya, Yael Niv, Geoffrey Schoenbaum, and Eric Zilli for discussion.