Reward-modulated spike timing-dependent plasticity requires a reward-prediction system
Spike-timing-dependent plasticity (STDP) has been shown to perform unsupervised learning tasks such as receptive field development. However, STDP does not take behavioral relevance into account and therefore cannot learn behavioral tasks. Recent publications have suggested extending STDP by making the induction of plasticity contingent on the coincidence of the pre-post pairing of classical STDP with a neuromodulatory "reward signal". We call this reward-modulated STDP model R-STDP (Izhikevich, 2007; Legenstein et al., 2008). In this study, we show that R-STDP (and, more generally, any learning rule that combines an unsupervised learning rule with reward modulation) suffers from a bias problem that in most cases impedes learning. This problem can be solved only if the average of the reward signal is zero for each task. To learn a response to a single stimulus, it suffices to subtract a baseline reward value (the mean). However, when the post-synaptic neurons have to learn different tasks for different input stimuli, this requires an external reward-prediction system (or "critic") that subtracts a stimulus-dependent or task-dependent baseline from the reward. Reward-modulated learning rules derived analytically from the policy-gradient framework of reinforcement learning (Pfister et al., 2006; Florian, 2007; Baras and Meir, 2007), called R-max in the following, do not suffer from this bias problem. We illustrate our findings with two learning paradigms. First, we teach a simulated one-layer feed-forward network to respond to a one-second input pattern of spike trains with a precise pattern of output spike trains. At the end of each trial, a scalar reward signal is broadcast to the network, representing how well the output spike train matched the target; a running average is subtracted so that the average reward is zero. Both R-STDP and R-max can learn the task.
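The bias problem can be illustrated with a toy simulation (our sketch, not the paper's actual network: the statistics of the eligibility trace and reward below are hypothetical). The R-STDP-style update is dw = eta * (R - b) * e, where e is an STDP eligibility trace and b is either zero or a running estimate of the mean reward. When the eligibility trace has a nonzero mean and the reward has a nonzero mean, the product R * e is dominated by the mean-reward term rather than by the reward-eligibility covariance, so the weight drifts in the wrong direction unless the baseline is subtracted:

```python
import numpy as np

def run(subtract_baseline, trials=5000, eta=0.005, seed=1):
    """Toy reward-modulated update dw = eta * (R - b) * e.

    The eligibility trace e has a nonzero mean (unsupervised STDP drift),
    and the reward R is anti-correlated with e, so the *correct* direction
    of learning is to decrease w. All statistics are hypothetical.
    """
    rng = np.random.default_rng(seed)
    w, R_bar = 0.0, 0.0
    for _ in range(trials):
        e = 1.0 + rng.normal(0.0, 1.0)   # eligibility trace, mean 1
        R = 1.0 - 0.1 * (e - 1.0)        # reward, mean 1, cov(R, e) < 0
        b = R_bar if subtract_baseline else 0.0
        w += eta * (R - b) * e           # reward-modulated weight update
        R_bar += 0.05 * (R - R_bar)      # running-average reward ("critic")
    return w

w_biased = run(False)  # dominated by mean reward: w drifts upward (wrong)
w_ok = run(True)       # driven by the covariance: w decreases (correct)
print(w_biased > 0.0, w_ok < 0.0)
```

With a single task, the running average is a sufficient baseline; with several stimuli and task-dependent mean rewards, the baseline itself would have to be stimulus-dependent, which is the reward-prediction requirement argued for in the abstract.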
However, as soon as the average reward is not zero, or two or more spike-train response tasks have to be learned, R-STDP fails spectacularly, whereas R-max suffers only a modest decrease in performance. Second, we learn a hand movement trajectory controlled by spiking neurons, using a population vector code similar to that found in motor areas (Schwartz et al., 1988). We use the same network structure as before, albeit with more units. Each output neuron codes for a particular direction of motion in three dimensions. The output spike trains produced in a single trial are transformed into a motion sequence through population vector coding. The reward signal is calculated by comparing the motion produced by the network with a target motion, and subtracting a baseline reward. Again, both learning rules can learn the task, but R-STDP fails to learn more than one motion without subtracting a task-dependent baseline from the reward signal. In summary, we show that to learn behavioral tasks, reward-modulated STDP needs a reward-prediction system that can infer future expected rewards from current stimuli. This is a strong requirement, but dopaminergic neurons in the primate VTA are plausible candidates (Schultz, 2007).
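The population vector decoding step can be sketched as follows (our illustration with arbitrary sizes and Poisson spike counts; the network dimensions and firing statistics are assumptions, not the paper's): each output neuron is assigned a preferred direction in 3D, the per-bin velocity is the sum of preferred directions weighted by firing, and integrating the velocities yields the movement trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 50 output neurons, 10 time bins per trial.
n_neurons, n_bins = 50, 10

# Random unit preferred direction in 3D for each output neuron.
pref = rng.normal(size=(n_neurons, 3))
pref /= np.linalg.norm(pref, axis=1, keepdims=True)

# Stand-in for the network's output: spike counts per bin per neuron.
spikes = rng.poisson(lam=2.0, size=(n_bins, n_neurons))

# Population vector per bin: preferred directions weighted by firing.
velocity = spikes @ pref                   # shape (n_bins, 3)
trajectory = np.cumsum(velocity, axis=0)   # integrate velocity into a path

print(trajectory.shape)  # (10, 3)
```

A scalar reward for the trial could then be computed from the mismatch between this trajectory and a target trajectory, before the (task-dependent) baseline is subtracted.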
Conference:
Computational and Systems Neuroscience 2010, Salt Lake City, UT, United States, 25 Feb - 2 Mar, 2010.
Presentation Type:
Poster Presentation
Topic:
Poster session III
Citation:
Frémaux N, Sprekeler H and Gerstner W (2010). Reward-modulated spike timing-dependent plasticity requires a reward-prediction system. Front. Neurosci. Conference Abstract: Computational and Systems Neuroscience 2010. doi: 10.3389/conf.fnins.2010.03.00221
Copyright:
The abstracts in this collection have not been subject to any Frontiers peer review or checks, and are not endorsed by Frontiers.
They are made available through the Frontiers publishing platform as a service to conference organizers and presenters.
The copyright in the individual abstracts is owned by the author of each abstract or his/her employer unless otherwise stated.
Each abstract, as well as the collection of abstracts, are published under a Creative Commons CC-BY 4.0 (attribution) licence (https://creativecommons.org/licenses/by/4.0/) and may thus be reproduced, translated, adapted and be the subject of derivative works provided the authors and Frontiers are attributed.
For Frontiers’ terms and conditions please see https://www.frontiersin.org/legal/terms-and-conditions.
Received:
04 Mar 2010;
Published Online:
04 Mar 2010.
* Correspondence:
Nicolas Frémaux, EPFL, LCN, Lausanne, Switzerland, nicofremaux@gmail.com