
Edited by: M. Victoria Puig, Massachusetts Institute of Technology, USA

Reviewed by: David J. Margolis, Rutgers University, USA; Eleftheria Kyriaki Pissadaki, University of Oxford, UK

*Correspondence: Kenji Morita, Physical and Health Education, Graduate School of Education, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan e-mail:

This article was submitted to the journal Frontiers in Neural Circuits.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

It has been suggested that the midbrain dopamine (DA) neurons, receiving inputs from the cortico-basal ganglia (CBG) circuits and the brainstem, compute reward prediction error (RPE), the difference between the reward that is obtained or newly expected and the reward that had previously been expected to be obtained. These reward expectations are suggested to be stored in the CBG synapses and updated according to RPE through synaptic plasticity, which is induced by released DA. These together constitute the “DA=RPE” hypothesis, which describes the mutual interaction between DA and the CBG circuits and serves as the primary working hypothesis in studying reward learning and value-based decision-making. However, recent work has revealed a new type of DA signal that appears not to represent RPE. Specifically, it has been found in a reward-associated maze task that striatal DA concentration primarily shows a gradual increase toward the goal. We explored whether such ramping DA could be explained by extending the “DA=RPE” hypothesis to take into account biological properties of the CBG circuits. In particular, we examined effects of possible time-dependent decay of DA-dependent plastic changes of synaptic strengths by incorporating decay of learned values into the RPE-based reinforcement learning model and simulating reward learning tasks. We then found that incorporation of such a decay dramatically changes the model's behavior, causing gradual ramping of RPE. Moreover, we further incorporated magnitude-dependence of the rate of decay, which could potentially be in accord with some past observations, and found that near-sigmoidal ramping of RPE, resembling the observed DA ramping, could then occur. Given that synaptic decay can be useful for flexibly reversing and updating learned reward associations, especially when the baseline DA is low and encoding of negative RPE by DA is limited, the observed DA ramping would be indicative of the operation of such flexible reward learning.

The midbrain dopamine (DA) neurons receive inputs from many brain regions, among which the basal ganglia (BG) are particularly major sources (Watabe-Uchida et al.,

Recently, however, Howe et al. (

In most existing theories based on the “DA=RPE” hypothesis, it is assumed that neural circuits in the brain implement mathematical reinforcement learning algorithms in a perfect manner. Behind this assumption of perfection, it is usually assumed, often implicitly, that DA-dependent plastic changes of synaptic strength, which presumably implement the update of reward expectations according to RPE, are quite stable and kept constant without any decay. However, in reality, synapses might change much more dynamically, or more specifically, might entail time-dependent decay of plastic changes. Indeed, decay of synaptic potentiation has been observed at least in some experiments examining (presumably) synapses from the hippocampal formation (subiculum) to the ventral striatum (nucleus accumbens) in anesthetized rats (Boeijinga et al.,

We considered a virtual spatial navigation (unbranched “I-maze”) task as illustrated in Figure 1. Subject is assumed to start from S_1 (start), and to move to the neighboring state in each time step until reaching S_n (goal), where reward r is obtained. Before S_1 and after S_n in every trial, the value of the “preceding state” or the “upcoming state” was assumed to be 0, respectively; later, in the simulations of the T-maze task, reward expectation over multiple trials is also considered. RPE δ_i in trial k, denoted δ_i(k), is calculated as

δ_i(k) = r_i(k) + γV_i(k) − V_{i−1}(k),

where r_i(k) is the reward obtained at S_i in trial k (r_i(k) = r if i = n, and 0 otherwise), V_i(k) and V_{i−1}(k) are the values of state S_i and state S_{i−1} in trial k, and γ (0 ≤ γ ≤ 1) is the time discount factor. According to this RPE δ_i(k), the value of S_{i−1} for the next trial is updated as

V_{i−1}(k + 1) = V_{i−1}(k) + αδ_i(k),

where α (0 ≤ α ≤ 1) represents the learning rate. At the goal (S_n), where reward r is obtained, these equations become

δ_n(k) = r − V_{n−1}(k),

V_{n−1}(k + 1) = V_{n−1}(k) + α(r − V_{n−1}(k)),

given that V_n(k) is always 0 (reward is not expected after the goal within a trial). In the limit of k → ∞, where V_{n−1}(k) converges, denoting the converged value of V_{n−1}(k) by V^{∞}_{n−1}, the above second equation becomes

V^{∞}_{n−1} = V^{∞}_{n−1} + α(r − V^{∞}_{n−1}),

and therefore

V^{∞}_{n−1} = r,

so that the RPE at the goal, δ^{∞}_n = r − V^{∞}_{n−1}, converges to 0.

Subject starts at S_1 (start), and moves to the neighboring state at each time step until reaching S_n (goal), where reward r (r_n = r) is obtained. At each time step, (I) RPE δ_i = r_i + γV_i − V_{i−1} is calculated, where r_i is reward obtained at S_i (r_i = 0 unless i = n), V_i and V_{i−1} are the values of state S_i and S_{i−1}, respectively, and γ (0 ≤ γ ≤ 1) is the time discount factor, and (II) the calculated RPE is used to update the value of S_{i−1} (V_{i−1} → V_{i−1} + αδ_i), where α (0 ≤ α ≤ 1) is the learning rate; this continues up to the goal (S_n). Note that V_n is assumed to be 0, indicating that reward is not expected after the goal in a given trial [reward expectation over multiple trials is not considered here for simplicity; it is considered later in the simulations of the T-maze task]. Trial-by-trial change of V_{n−1} (value of S_{n−1}) in the simulated task: without decay, V_{n−1} (indicated by the brown bars) gradually increases from trial to trial, and eventually converges to the value of reward (r), so that RPE at the goal (δ_n = r − V_{n−1}) converges to 0. With decay, V_{n−1} does not converge to r, because the increment due to learning (indicated by the red dotted/solid rectangles) balances with the decrement due to the decay (indicated by the blue arrows); RPE at the goal (δ_n) thus remains positive even after many trials. Also shown are the eventual values of RPE at the states from the start (S_1) to the goal (S_7) when there are 7 states (n = 7), with the parameter values used in the simulations.

Similarly, δ_{n−j} (j = 1, 2, ···, n − 2) converges to 0 in the limit of k → ∞. The exception is the RPE at the start of the maze (δ_1), at which δ_1(k) converges to γ^{n−1} r, reflecting the temporally discounted value of the upcoming reward.
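The no-decay dynamics above can be checked numerically. Below is a minimal sketch (our illustration, not the authors' code) of TD learning on an I-maze with n = 7 states and illustrative parameters γ = 0.9, α = 0.5, r = 1; after learning converges, the RPE at the goal approaches 0 while the RPE upon entering the maze approaches γ^{n−1} r:

```python
import numpy as np

def run_td_imaze(n=7, r=1.0, alpha=0.5, gamma=0.9, trials=1000):
    """TD learning on an unbranched I-maze with states S_1..S_n (no decay).

    V[i] holds the value of S_{i+1}; the value of the goal S_n is kept at 0
    (reward is not expected after the goal within a trial), and the value
    "before" S_1 is treated as 0. Returns the values and the RPEs of the
    last simulated trial.
    """
    V = np.zeros(n)
    for _ in range(trials):
        deltas = np.zeros(n)
        deltas[0] = gamma * V[0]            # RPE upon entering the maze (delta_1)
        for i in range(1, n):               # move from S_i to S_{i+1}
            reward = r if i == n - 1 else 0.0
            deltas[i] = reward + gamma * V[i] - V[i - 1]
            V[i - 1] += alpha * deltas[i]   # DA-dependent update of the previous state's value
    return V, deltas

V, deltas = run_td_imaze()
print(deltas[-1])   # ~0 after convergence: RPE at the goal vanishes
print(deltas[0])    # ~0.9**6: RPE at the start equals gamma**(n-1) * r
```

The values of the intermediate states converge to the temporally discounted reward (V(S_1) → γ^{n−2} r), so only the initial RPE survives.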

Let us now introduce time-dependent decay of the value of the states into the model, in such a way that the update of the state value is described by the following equation (instead of the one described in the above):

V_{i−1}(k + 1) = φ(V_{i−1}(k) + αδ_i(k)),

where φ (0 ≤ φ ≤ 1) represents the proportion of the learned value that is retained against the decay (φ = 1 corresponds to the original model without decay). At the goal (S_n), this equation is calculated as follows:

V_{n−1}(k + 1) = φ(V_{n−1}(k) + α(r − V_{n−1}(k))).

In the limit of k → ∞, where V_{n−1}(k) converges, denoting the converged value of V_{n−1}(k) by V^{∞}_{n−1}, this equation becomes

V^{∞}_{n−1} = φ(V^{∞}_{n−1} + α(r − V^{∞}_{n−1})),

and therefore

V^{∞}_{n−1} = φαr / (1 − φ(1 − α)),  δ^{∞}_n = r − V^{∞}_{n−1} = (1 − φ)r / (1 − φ(1 − α)),

which is positive if φ < 1. Similarly, denoting the converged value of V_{n−2}(k) by V^{∞}_{n−2},

V^{∞}_{n−2} = φ(V^{∞}_{n−2} + α(γV^{∞}_{n−1} − V^{∞}_{n−2})),

and therefore

V^{∞}_{n−2} = φαγV^{∞}_{n−1} / (1 − φ(1 − α)),  δ^{∞}_{n−1} = γV^{∞}_{n−1} − V^{∞}_{n−2} = (1 − φ)γV^{∞}_{n−1} / (1 − φ(1 − α)),

which is again positive if φ < 1. Similarly, in the limit of k → ∞, the converged values of the earlier states are obtained recursively as

V^{∞}_{n−j} = (φαγ / (1 − φ(1 − α)))^{j−1} V^{∞}_{n−1}  (j = 2, 3, ···, n − 1),

with the corresponding RPEs δ^{∞}_{n−j+1} = (1 − φ)γV^{∞}_{n−j+1} / (1 − φ(1 − α)), all of which are positive if φ < 1.

At the start of the maze (S_1), where the value of the “preceding state” is fixed at 0 and is not updated, the RPE converges to δ^{∞}_1 = γV^{∞}_1.
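These fixed points can be verified numerically. The sketch below (our illustration, not the authors' code) takes the per-trial decay V ← φ(V + αδ) as one concrete formalization of the decay, with illustrative parameters φ = 0.95, γ = 0.9, α = 0.5, and compares the asymptotic RPE at the goal with the closed-form value (1 − φ)r/(1 − φ(1 − α)); the asymptotic RPEs ramp up from S_2 toward the goal:

```python
import numpy as np

def run_td_imaze_decay(n=7, r=1.0, alpha=0.5, gamma=0.9, phi=0.95, trials=2000):
    """TD learning on the I-maze where each learned value decays once per
    trial: V_{i-1} <- phi * (V_{i-1} + alpha * delta_i). The goal value V_n
    stays 0. Returns the values and RPEs of the last simulated trial."""
    V = np.zeros(n)
    for _ in range(trials):
        deltas = np.zeros(n)
        deltas[0] = gamma * V[0]                              # RPE upon entering the maze
        for i in range(1, n):
            reward = r if i == n - 1 else 0.0
            deltas[i] = reward + gamma * V[i] - V[i - 1]
            V[i - 1] = phi * (V[i - 1] + alpha * deltas[i])   # update plus decay
    return V, deltas

V, deltas = run_td_imaze_decay()
predicted_goal_rpe = (1 - 0.95) * 1.0 / (1 - 0.95 * (1 - 0.5))
print(abs(deltas[-1] - predicted_goal_rpe) < 1e-6)           # True: matches the fixed point
print(all(deltas[i] < deltas[i + 1] for i in range(1, 6)))   # True: RPE ramps toward the goal
```

Unlike the no-decay case, every RPE stays positive at the steady state: the learning increment per trial exactly balances the decay.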

The solid lines in the corresponding figure show the eventual values of RPE δ^{∞}_i for all the states from the start (S_1) to the goal (S_7) when there are 7 states (n = 7), with the parameter values used in the simulations; the dashed lines show δ^{∞}_i in the model without incorporating the decay for comparison. As shown in the figures, in the cases with decay, the eventual (asymptotic) values of RPE after the convergence of learning entail gradual ramping toward the goal under a wide range of parameters. Also notably, as is apparent in the formulas above, the overall temporal evolution of RPE is proportionally scaled according to the amount of reward r.

We also considered cases where the rate of decay of learned values depends on the current magnitude of values, so that larger values are more resistant to decay. We constructed a time-step-based model, in which decay with such magnitude-dependent rate was incorporated. Specifically, we again considered a model of the same I-maze task (Figure 1), in which RPE is computed at every time step as

δ_i = r_i + γV_i − V_{i−1},

where r_i is the reward obtained at S_i (r_i = r if i = n, and 0 otherwise), V_i and V_{i−1} are the values of S_i and S_{i−1}, and γ is the time discount factor, and the value of the preceding state is updated as

V_{i−1} → V_{i−1} + αδ_i,

where α is the learning rate. We then considered a function of value v, f(v), with parameters c_1 and c_2, giving the proportion of the value retained against the decay: f(v) increases toward 1 (i.e., less decay) as v becomes larger, c_1 determines the amount of decay for small values, and c_2 = ∞ corresponds to a magnitude-independent (constant) rate of decay. We assumed that the value of every state decays at each time step as follows:

v → f(v)^{1/n} × v,

and that the value updated at that time step decays as

v → f(v)^{1/n} × (v + αδ_i).

The corresponding figure shows f(v) with c_1 fixed (c_1 = 0.6) and various values of c_2 [c_2 = ∞ (lightest gray lines), 1.5 (second-lightest gray lines), 0.9 (dark gray lines), or 0.6 (black lines)]. Simulations of the I-maze task with this magnitude-dependent decay were conducted with the same time discount factor γ as above and α = 0.5, without considering reward expectation over multiple trials, and the eventual values of RPE are presented as the solid lines in the corresponding figure.
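Because the exact form of f(v) is not recoverable from this text, the sketch below uses a hypothetical retention function with the stated qualitative properties, f(v) = c_1^{exp(−v/c_2)}: it equals c_1 at v = 0, approaches 1 (no decay) for large v, and reduces to a constant rate c_1 when c_2 = ∞. It is meant only to illustrate how magnitude-dependent decay reshapes the asymptotic RPE profile:

```python
import math
import numpy as np

def retention(v, c1=0.6, c2=0.6):
    """Hypothetical retention factor: c1 at v = 0, approaching 1 (no decay)
    for large v; c2 = inf gives the magnitude-independent rate c1."""
    if math.isinf(c2):
        return c1
    return c1 ** math.exp(-v / c2)

def run_imaze_magnitude_decay(n=7, r=1.0, alpha=0.5, gamma=0.9,
                              c1=0.6, c2=0.6, trials=3000):
    """Time-step-based I-maze model in which every stored value decays at
    each time step by retention(v) ** (1/n), so larger values persist longer."""
    V = np.zeros(n)
    for _ in range(trials):
        deltas = np.zeros(n)
        deltas[0] = gamma * V[0]
        for i in range(1, n):
            reward = r if i == n - 1 else 0.0
            deltas[i] = reward + gamma * V[i] - V[i - 1]
            V[i - 1] += alpha * deltas[i]
            for j in range(n - 1):        # per-time-step, magnitude-dependent decay
                V[j] *= retention(V[j], c1, c2) ** (1.0 / n)
    return V, deltas

V, deltas = run_imaze_magnitude_decay()
print(deltas.min() > -1e-9)        # RPEs stay non-negative at the steady state
print(deltas[-1] > deltas[1])      # RPE is much larger near the goal than far from it
```

Values near the goal are large and decay slowly, while values far from the goal are small and decay quickly, which steepens the RPE profile toward the goal.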

Eventual values of RPE at the states from the start (S_1) to the goal (S_7) in the simulated I-maze task shown in Figure 1.

As a simplified model of the T-maze free-choice task with rewarded and unrewarded goals used in the experiments (Howe et al.,), we considered a model in which subject chooses one of the two actions at the branch point (S_5), while learning the values of state-action pairs (A_1, A_2, ···; there is assumed to be only a single action, “moving forward,” at the states other than the branch point), according to one of the major reinforcement (TD) learning algorithms called Q-learning (Watkins,

δ = r + γ max{Q(A′)} − Q(A),

where Q(A) is the value of the state-action pair A that has just been taken, the max is taken over the state-action pairs A′ available at the next state (at the branch point, A_5 and A_6; elsewhere, the single “moving forward” pair), and γ is the time discount factor, set per time step (to the per-trial discount raised to the power 1/25). According to this RPE, the value of the previous state-action pair is updated as follows:

Q(A) → Q(A) + αδ,

where α is the learning rate and it was set to 0.5. We then assumed that the value of every state-action pair (denoted as Q) decays at each time step with the magnitude-dependent rate described above, where c_1 and c_2 were set to c_1 = 0.6 and c_2 = 0.6. At the branch point (S_5), one of the two possible actions (A_5 and A_6) is chosen according to the following probability:

Prob(A_5) = exp(βQ(A_5)) / (exp(βQ(A_5)) + exp(βQ(A_6))),

where Prob(A_5) is the probability that action A_5 is chosen, and β is a parameter determining the degree of exploration vs. exploitation upon choice (as β becomes smaller, choice becomes more and more exploratory); β was set to 1.5. In the simulations of this model, we considered reward expectation over multiple trials; specifically, we assumed that at the first time step in every trial, subject moves from the last state in the previous trial to the first state in the current trial, and RPE computation and value update are done in the same manner as in the other time steps.
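The choice rule and value updates above can be put together in a small simulation. The sketch below is our illustration, with a simplified layout (four forward pairs A_1..A_4 along the corridor, then the branch choice) and, for brevity, a constant retention factor per time step instead of the magnitude-dependent one; the action leading to the rewarded arm comes to be chosen more often than chance, while the softmax keeps the choice exploratory:

```python
import math
import random

def softmax_choice(q5, q6, beta, rng):
    """Softmax choice between the branch-point actions A_5 and A_6."""
    p5 = math.exp(beta * q5) / (math.exp(beta * q5) + math.exp(beta * q6))
    return 5 if rng.random() < p5 else 6

def run_tmaze_q(trials=2000, alpha=0.5, gamma=0.9, phi=0.98, beta=1.5, seed=0):
    """Simplified T-maze: forward pairs A_1..A_4 along the corridor, then a
    choice between A_5 (arm to the rewarded goal) and A_6 (unrewarded arm).
    All learned values decay by the constant factor phi at every time step."""
    rng = random.Random(seed)
    Q = {i: 0.0 for i in range(1, 7)}          # Q[i]: value of state-action pair A_i
    n_correct = 0

    def decay():
        for k in Q:
            Q[k] *= phi

    for _ in range(trials):
        for i in range(1, 4):                   # corridor: single forward action
            delta = gamma * Q[i + 1] - Q[i]
            Q[i] += alpha * delta
            decay()
        delta = gamma * max(Q[5], Q[6]) - Q[4]  # Q-learning: max over branch actions
        Q[4] += alpha * delta
        decay()
        a = softmax_choice(Q[5], Q[6], beta, rng)
        n_correct += (a == 5)
        delta = (1.0 if a == 5 else 0.0) - Q[a]  # reward only at the rewarded goal
        Q[a] += alpha * delta
        decay()
    return Q, n_correct / trials

Q, rate = run_tmaze_q()
print(Q[5] > Q[6])        # True: the rewarded arm acquires the higher value
print(0.5 < rate < 1.0)   # True: better than chance, but still exploratory
```

With a stronger decay or a smaller β, the gap between Q(A_5) and Q(A_6) shrinks and the choice becomes less reliable, which is how a moderate correct rate can arise.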

S_1 (start), S_5 (branch point), and S_8 or S_9 (goal) in the diagram, respectively.

In addition to the simulations of the Q-learning model, we also conducted simulations of the model with a different algorithm, called SARSA (Rummery and Niranjan,), in which RPE at the branch point (S_5) is computed as:

δ = r + γQ(A_chosen) − Q(A_4),

where A_chosen is the action that is actually chosen (either A_5 or A_6), instead of the equation for Q-learning described above. In some of the simulations, reward r was set to a positive value when the rewarded goal (S_8) was reached and set to 0 otherwise, whereas in other simulations, r was set to distinct positive values at the two goals (S_8 and S_9, respectively) and set to 0 otherwise. In addition to the modeling and simulations of the free-choice task, we also conducted simulations of a forced-choice task, which could be regarded as a simplified model of the forced-choice task examined in the experiments (Howe et al.,). Specifically, we assumed that one of the two actions (A_5 or A_6) is forced at the branch point (S_5) in each trial, rather than using the choice probability function described above (while RPE of the Q-learning type, taking the max of Q(A_5) and Q(A_6), was still assumed), and reward was given at the goals in the same manner. The proportion of trials in which the action leading to the rewarded goal (S_8) was chosen (i.e., ratio of correct trials) was 65.6, 64.5, and 64.5% in the free-choice simulations of 1000 trials.
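The difference between the two algorithms at the branch point can be seen in a few lines. In the toy computation below (illustrative values, not from the paper), the Q-learning RPE looks ahead to the better action, whereas the SARSA RPE depends on the action actually sampled, so it is never larger than the Q-learning one:

```python
import math
import random

gamma, beta = 0.9, 1.5
Q = {4: 0.5, 5: 0.8, 6: 0.1}   # hypothetical learned values; A_4 precedes the branch

def softmax_choice(q5, q6, beta, rng=random):
    p5 = math.exp(beta * q5) / (math.exp(beta * q5) + math.exp(beta * q6))
    return 5 if rng.random() < p5 else 6

# Q-learning: RPE at the branch point uses the best available action
delta_q = gamma * max(Q[5], Q[6]) - Q[4]

# SARSA: RPE uses the action actually chosen, so exploratory choices of the
# worse action produce a smaller (here negative) RPE
a = softmax_choice(Q[5], Q[6], beta)
delta_sarsa = gamma * Q[a] - Q[4]
print(delta_sarsa <= delta_q)   # True for any choice
```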

We will first show how the standard reinforcement learning algorithm, called TD learning (Sutton and Barto,), operates in a simulated reward-associated spatial navigation task. Subject starts at S_1, and moves to the neighboring state in each time step until reaching the goal (S_n), where reward r is obtained; the states of the subject are thus represented by S_1 (=start), S_2, ···, S_n (=goal, where reward r is obtained). Every time the subject moves to state S_i, (I) RPE is calculated: δ_i = r_i + γV_i − V_{i−1}, where r_i is reward obtained at S_i (r_i = 0 unless i = n), V_i and V_{i−1} are the “values” (meaning reward expectations after leaving the states) of state S_i and S_{i−1}, respectively, and γ (0 ≤ γ ≤ 1) is a parameter defining the degree of temporal discount of future rewards, called the time discount factor, and (II) the RPE is used to update the value of the previous state (i.e., S_{i−1}) through DA-dependent plastic changes of striatal synapses: V_{i−1} → V_{i−1} + αδ_i, where α (0 ≤ α ≤ 1) represents the speed of learning, called the learning rate.

Assume that initially subject does not expect to obtain reward after completion of the maze run in individual trials, and thus the “values” of all the states are 0. When reward is then introduced into the task and subject obtains reward r at the goal (S_n), positive RPE occurs: δ_n = r − V_{n−1} = r. The value of S_{n−1} is then updated: V_{n−1} → 0 + αδ_n = αr. In the next trial, subject again obtains reward at the goal (S_n) and positive RPE occurs; this time, the RPE amounts to δ_n = r − V_{n−1} = (1 − α)r, and the value of S_{n−1} is again updated: V_{n−1} → αr + α(1 − α)r = (2α − α^2)r. In this way, the value of S_{n−1} (V_{n−1}) gradually increases from trial to trial, and accordingly RPE occurring at the goal (δ_n = r − V_{n−1}) gradually decreases. As long as V_{n−1} is smaller than r, RPE at the goal is positive and V_{n−1} should increase in the next trial; eventually, V_{n−1} converges to r and RPE at the goal (δ_n) converges to 0. Likewise, the values of the preceding states (V_{n−1}, V_{n−2}, ···) also converge to the corresponding temporally discounted reward values, and RPE at the preceding timings (δ_{n−1}, δ_{n−2}, ···; except for δ_1) converges to 0. Thus, from the prevailing theories of neural circuit mechanisms for reinforcement learning, it is predicted that DA neuronal response at the timing of reward and the preceding timings except for the initial timing, representing the RPE δ_n, δ_{n−1}, δ_{n−2}, ···, appears only transiently when reward is introduced into the task (or the amount of reward is changed), and after that transient period DA response appears only at the initial timing, as shown by the dashed lines in the corresponding figure.
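The first two updates described above can be checked directly; the snippet below uses illustrative values α = 0.5 and r = 1:

```python
alpha, r = 0.5, 1.0

V = 0.0                      # value of S_{n-1} before reward is introduced
delta_1 = r - V              # first rewarded trial: RPE equals r
V += alpha * delta_1         # V becomes alpha * r
assert V == alpha * r

delta_2 = r - V              # second trial: RPE equals (1 - alpha) * r
V += alpha * delta_2         # V becomes (2*alpha - alpha**2) * r
assert abs(V - (2 * alpha - alpha ** 2) * r) < 1e-12
```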

Let us now assume that DA-dependent plastic changes of synaptic strengths are subject to time-dependent decay, so that learned values stored in them decay with time. Let us consider a situation where V_{n−1} (value of S_{n−1}) is smaller than r, so that positive RPE occurs at the goal. If there is no decay, V_{n−1} should be incremented exactly by the amount of this RPE multiplied by the learning rate (α) in the next trial, as seen above. In contrast, if there is decay, V_{n−1} should be incremented by the amount of α × RPE but simultaneously decremented by the amount of decay. By definition, RPE (δ_n = r − V_{n−1}) decreases as V_{n−1} increases. Therefore, if the rate (or amount) of decay is constant, V_{n−1} could initially increase from its initial value 0, given that the net change of V_{n−1} per trial (i.e., α × RPE − decay) is positive, but then the net change per trial becomes smaller and smaller as V_{n−1} increases, and eventually, as α × RPE becomes asymptotically equal to the amount of decay, the increase of V_{n−1} asymptotically terminates. Notably, RPE at the goal (δ_n) remains positive, because it should be equal to the amount of decay divided by α. Similarly, RPE at the timings preceding reward (δ_{n−1}, δ_{n−2}, ···) also remains positive (see the Methods for mathematical details). The situation is thus quite different from the case without decay, in which RPE at the goal and the preceding timings, except for the initial timing, converges to 0 as seen above. The solid lines in the corresponding figure show the resulting asymptotic values of RPE.

As shown so far, the experimentally observed gradual ramping of DA concentration toward the goal could potentially be explained by incorporating the decay of plastic changes of synapses storing learned values into the prevailing hypothesis that the DA-CBG system implements the reinforcement learning algorithm and DA represents RPE. In the following, we will see whether and how detailed characteristics of the observed DA ramping can be explained by this account. First, the experimentally observed ramping of DA concentration in the VMS entails a nearly sigmoidal shape (Figure

Next, we examined whether the patterns of DA signal observed in the free-choice task (Howe et al.,

Given that the model's parameters are appropriately tuned, the model's choice performance can become comparable to the experimental results (about 65% correct), and the temporal evolution of the RPE averaged across rewarded trials and also the average across unrewarded trials can entail gradual ramping during the trial (Figure

In the study that we modeled (Howe et al.,

Although our model could explain the basic features of the experimentally observed DA ramping to a certain extent, there is also a major drawback, as mentioned above. Specifically, in our simulations of the free-choice task, gradual ramping of the mean RPE was observed in both the average across rewarded trials and the average across unrewarded trials, but there was a prominent difference between these two (Figure

In the simulations shown above, it was assumed that the unrewarded goal is literally not rewarding at all. Specifically, in our model, we assumed a positive term representing obtained reward (

In the study that we modeled (Howe et al.,

Intriguingly, in the study that has shown the representation of RPE for Q-learning in VTA DA neurons (Roesch et al.,

While the hypothesis that DA represents RPE and DA-dependent synaptic plasticity implements update of reward expectations according to RPE has become widely appreciated, recent work has revealed the existence of gradually ramping DA signal that appears not to represent RPE. We explored whether such DA ramping can be explained by extending the “DA=RPE” hypothesis by taking into account possible time-dependent decay of DA-dependent plastic changes of synapses storing learned values. Through simulations of reward learning tasks by the RPE-based reinforcement learning model, we have shown that incorporation of the decay of learned values can indeed cause gradual ramping of RPE and could thus potentially explain the observed DA ramping. In the following, we discuss limitations of the present work, comparisons and relations with other studies, and functional implications.

In the study that has found the ramping DA signal (Howe et al.,), the magnitude of the DA ramping was largely comparable to that of the DA response to unpredicted reward. In contrast, in our model, the maximum value of the simulated ramping RPE is about 0.158, which is smaller than RPE for unpredicted reward of the same size in our model (it is 1.0). This appears to deviate from the results of the experiments. However, there are at least three potential reasons that could explain the discrepancy between the experiments and our modeling results, as we describe below.

First, in the experiments, whereas there was only a small difference between the peak of DA response to free reward and the peak of DA ramping during the maze task when averaged across sessions, the slope of the regression line between these two values (DA ramping / DA to free reward) in individual sessions (Extended Data Figure 5a of Howe et al.,) should also be taken into account. In terms of our model, V_{n−1} (bar height) and δ_n (space above the bar) correspond to RPEs at the timings of the preceding sensory stimuli and the actual reward delivery, respectively (as for the former, except for time discount); they are both smaller than the reward amount (“r”).

Other than the point described above, there are at least six fundamental limitations of our model. First, our model's behavior is sensitive to the magnitude of rewards. As shown in the Results, in our original model assuming decay with a constant rate, overall temporal evolution of RPE is proportionally scaled according to the amount of reward (Figure

Regarding potential relationships between the ramping DA signal in the spatial navigation task and the DA=RPE hypothesis, a recent theoretical study (Gershman,

It has also been shown (Niv et al.,

Given that the observed DA ramping is indicative of decay of learned values as we have proposed, what is the functional advantage of such decay? Decay would naturally lead to forgetting, which is rather disadvantageous in many cases. However, forgetting can instead be useful in certain situations, in particular, where environments are dynamically changing and subjects should continually overwrite old memories with new ones. Indeed, it has recently been proposed that decay of plastic changes of synapses might be used for active forgetting (Hardt et al.,

With such consideration, it is suggestive that DA ramping was observed in the study using the spatial navigation task (Howe et al.,

Apart from the decay, DA ramping can also have more direct functional meanings. Along with its roles in plasticity induction, DA also has significant modulatory effects on the responsiveness of recipient neurons. In particular, DA is known to modulate the activity of the two types of striatal projection neurons to the opposite directions (Gerfen and Surmeier,

Kenji Morita conceived and designed the research. Kenji Morita and Ayaka Kato performed the modeling, calculations, and simulations. Kenji Morita drafted the manuscript. Ayaka Kato commented on the manuscript, and contributed to its revision and elaboration.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

This work was supported by Grant-in-Aid for Scientific Research on Innovative Areas “Mesoscopic Neurocircuitry” (No.25115709) of The Ministry of Education, Science, Sports and Culture of Japan and Strategic Japanese - German Cooperative Programme on “Computational Neuroscience” (project title: neural circuit mechanisms of reinforcement learning) of Japan Science and Technology Agency to Kenji Morita.