Embodied Synaptic Plasticity with Online Reinforcement learning

The endeavor to understand the brain involves multiple collaborating research fields. Classically, synaptic plasticity rules derived by theoretical neuroscientists are evaluated in isolation on pattern classification tasks. This contrasts with the biological brain, whose purpose is to control a body in closed-loop. This paper contributes to bringing the fields of computational neuroscience and robotics closer together by integrating open-source software components from these two fields. The resulting framework makes it possible to evaluate the validity of biologically-plausible plasticity models in closed-loop robotics environments. We use this framework to evaluate Synaptic Plasticity with Online REinforcement learning (SPORE), a reward-learning rule based on synaptic sampling, on two visuomotor tasks: reaching and lane following. We show that SPORE is capable of learning policies within the course of simulated hours for both tasks. Provisional parameter explorations indicate that the learning rate and the temperature driving the stochastic processes that govern synaptic learning dynamics need to be regulated for performance improvements to be retained. We conclude by discussing recent deep reinforcement learning techniques that could increase the functionality of SPORE on visuomotor tasks.

of these rules, an embodied evaluation is necessary. This evaluation is technically complicated since spiking neurons are dynamical systems that must be synchronized with the environment. Additionally, as in biological bodies, sensory information and motor commands need to be encoded and decoded respectively.
In this paper, we bring the fields of computational neuroscience and robotics closer together by integrating open-source software components from these two fields. The resulting framework is capable of learning online the control of simulated and real robots with a spiking network in a modular fashion. This framework is demonstrated in the evaluation of the promising neural reward-learning rule Synaptic Plasticity with Online REinforcement learning (SPORE) ([21,19,22,45]) on two closed-loop robotic tasks. SPORE is an instantiation of the synaptic sampling scheme introduced in [21,19]. It incorporates a policy sampling method which models the growth of dendritic spines with respect to dopamine influx. Unlike current state-of-the-art reinforcement learning methods implemented with conventional neural networks ([30,29,28]), SPORE learns online from precise spike times and is entirely implemented with spiking neurons. We evaluate this learning rule in a closed-loop reaching setup and a lane following setup ([4,18]). In both tasks, an end-to-end visuomotor policy is learned, mapping visual input to motor commands. In recent years, important progress has been made on learning control from visual input with deep learning. However, deep learning approaches are computationally expensive and rely on biologically implausible mechanisms such as dense synchronous communication and batch learning. For networks of spiking neurons, learning visuomotor tasks online with synaptic plasticity rules remains challenging. In this paper, visual input is encoded in Address Event Representation with a Dynamic Vision Sensor (DVS) simulation ([27,18]). This representation drastically reduces the redundancy of the visual input as only motion is sensed, allowing more efficient learning. It agrees with the two-pathways hypothesis, which states that motion is processed separately from color and shape in the visual cortex ([25]).
The main contribution of this paper is the embodiment of SPORE and its evaluation on two neurorobotic tasks using a combination of open-source software components. This embodiment allowed us to identify crucial techniques to regulate SPORE learning dynamics, not discussed in previous works where this learning rule was only evaluated on simple proof-of-concept learning problems ([21,19,22,45]). Our results suggest that an external mechanism such as learning rate annealing is beneficial to retain a performing policy on the more advanced lane following task. This paper is structured as follows. We provide a review of the related work in Section 2. In Section 3, we give a brief overview of SPORE and discuss the contributed techniques required for its embodiment. The implementation and evaluation on the two chosen neurorobotic tasks are carried out in Section 4. Finally, we discuss in Section 5 how the method could be improved.

Related Work
The year 2015 marked a significant breakthrough in deep reinforcement learning. Artificial neural networks of analog neurons are now capable of solving a variety of tasks ranging from playing video games ([30]) to controlling multi-joint robots ([39,28]) and lane following ([44]). Most recent methods ([39,38,28,29]) are based on policy gradients. Specifically, policy parameters are updated by performing gradient ascent steps with backpropagation to maximize the probability of taking rewarding actions. While functional, these methods are not based on biologically plausible processes. First, a large part of neural dynamics is ignored. Importantly, unlike SPORE, these methods do not learn online: weight updates are performed with respect to entire trajectories stored in rollout memory. Second, learning is based on backpropagation, which is not a biologically plausible learning mechanism, as stated in [3].
This is the author's version. This paper was published in Frontiers in Neurorobotics: www.frontiersin.org/articles/10.3389/fnbot.2019.00081
Spiking network models inspired by deep reinforcement learning techniques were introduced in [40] and [2]. In both papers, the spiking networks are implemented with deep learning frameworks (PyTorch and TensorFlow, respectively) and rely on automatic differentiation. Their policy-gradient approach is based on Proximal Policy Optimization (PPO) ( [39]). As the learning mechanism consists of backpropagating the PPO loss (through-time in the case of [2]), most biological constraints stated in [3] are still violated. Indeed, the computations are based on spikes (4), but the backpropagation is purely linear (1), the feedback paths require precise knowledge of the derivatives (2) and weights (3) of the corresponding feedforward paths, and the feedforward and feedback phases alternate synchronously (5) (the enumeration refers to [3]).
Only a small body of work has focused on reinforcement learning with spiking neural networks while addressing the previous points. The groundwork of reinforcement learning with spiking networks was presented in [16,10,26]. In these works, a mathematical formalization is introduced characterizing how dopamine-modulated spike-timing-dependent plasticity (DA-STDP) solves the distal reward problem with eligibility traces. Specifically, since the reward is received only after a rewarding action is performed, the brain needs a form of memory to reinforce previously chosen actions. This problem is solved with the introduction of eligibility traces, which assign credit to recently active synapses. This concept has been observed in the brain ([11,34]), and SPORE also relies on eligibility traces. Fewer works have evaluated DA-STDP in an embodiment for reward maximization; a recent survey encompassing this topic is available in [5].
The previous works closest to this paper are [18,4] and [6]. In [18], a neurorobotic lane following task is presented, where a simulated vehicle is controlled end-to-end from event-based vision to motor command. The task is solved with a hard-coded spiking network of 16 neurons implementing a simple Braitenberg vehicle. The performance is evaluated with respect to distance and orientation differences to the middle of the lane. In this paper, these performance metrics are combined into a reward signal which the spiking network maximizes with the SPORE learning rule.
In [4], the authors evaluate DA-STDP (referred to as R-STDP for reward-modulated STDP) in a similar lane following environment. Their approach outperforms the hard-coded Braitenberg vehicle presented in [18]. The two motor neurons controlling the steering receive different (mirrored) reward signals depending on whether the vehicle is on the left or on the right of the lane. This way, the reward provides the information of which motor command should be taken, similar to a supervised learning setup. Conversely, the approach presented in this paper is more generic since a global reward is distributed to all synapses and does not indicate which action the agent should take.
A similar plasticity rule implementing a policy-gradient approach is derived in [6]. Also relying on eligibility traces, this reward-learning rule uses a "slow" noise term to drive the exploration. This rule is demonstrated on a target reaching task comparable to the one discussed in Section 4.1.1 and achieves impressive learning times (on the order of 100 s) with proper tuning of the noise term.
In [31], a spiking version of the free-energy-based reinforcement learning framework proposed in [33]  for learning to take place.
In [13], a supervised synaptic learning rule named Feedback-based Online Local Learning Of Weights (FOLLOW) is introduced. This rule is used to learn the inverse dynamics of a two-link arm: the model predicts control commands (torques) for a given arm trajectory. The loop is closed in [14] by feeding the predicted torques as control commands. In contrast, SPORE learns from a reward signal and can solve a variety of tasks.

Method
In this section, we give a brief overview of the reward-based learning rule SPORE. We then discuss how SPORE was embodied in closed-loop, along with our modifications to increase the robustness of the learned policy.

Synaptic Plasticity with Online Reinforcement Learning (SPORE)
Throughout our experiments we use an implementation of the reward-based online learning rule for spiking neural networks, named synaptic sampling, that was introduced in [21]. The learning rule employs synaptic updates that are modulated by a global reward signal to maximize the expected reward. More precisely, the learning rule does not converge to a local maximum $\theta^*$ of the synaptic parameter vector $\theta$, but continuously samples different solutions $\theta \sim p^*(\theta)$ from a target distribution that peaks at parameter vectors that likely yield high reward. A temperature parameter $T$ makes the distribution $p^*(\theta)$ flatter (high exploration) or more peaked (high exploitation).
SPORE ( [20]) is an implementation of the reward-based synaptic sampling rule [21], that uses the NEST neural simulator ( [12]). SPORE is optimized for closed-loop applications to form an online policy-gradient approach. We briefly review here the main features of the synaptic sampling algorithm.
We consider the goal of reinforcement learning to be the maximization of the expected future discounted reward $V(\theta)$ given by
$$V(\theta) = \left\langle \int_0^\infty e^{-\tau/\tau_e}\, r(\tau)\, d\tau \right\rangle_{p(r|\theta)}, \tag{1}$$
where $r(\tau)$ denotes the reward at time $\tau$ and $\tau_e$ is a time constant that discounts remote rewards. We consider non-negative reward $r(\tau) \geq 0$ at any time such that $V(\theta) \geq 0$ for all $\theta$. The distribution $p(r|\theta)$ denotes the probability of observing the sequence of reward $r$ under a given parameter vector $\theta$. Note that computing this expectation involves averaging over a number of experimental trials and network responses.
As proposed in [21], we replace the standard goal of reinforcement learning to maximize the objective function in Equation (1) by a probabilistic framework that generates samples from the parameter vector $\theta$ according to some target distribution $\theta \sim p^*(\theta)$. We will focus on sampling from the target distribution $p^*(\theta)$ of the form
$$p^*(\theta) \propto p(\theta) \times V(\theta), \tag{2}$$
where $p(\theta)$ is a prior distribution over the network parameters that allows us, for example, to introduce constraints on the sparsity of the network parameters. It has been shown in [21] that the learning goal in Equation (2) is achieved if all synaptic parameters $\theta_i$ obey the stochastic differential equation
$$d\theta_i = \beta \left( \frac{\partial}{\partial \theta_i} \log p(\theta) + \frac{\partial}{\partial \theta_i} \log V(\theta) \right) dt + \sqrt{2 \beta T}\, dW_i, \tag{3}$$
where $\beta$ is a scaling parameter that functions as a learning rate, $dW_i$ are the stochastic increments and decrements of a Wiener process and $T$ is the temperature parameter. $\frac{\partial}{\partial \theta_i}$ denotes the partial derivative with respect to the synaptic parameter $\theta_i$. The stochastic process in Equation (3) generates samples of $\theta$ that are with high probability close to the local optima of the target distribution $p^*(\theta)$.
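The sampling view can be illustrated with a minimal one-dimensional sketch. It assumes the parameter follows Langevin dynamics of the form dθ = β ∂/∂θ log p*(θ) dt + sqrt(2βT) dW, integrated with the Euler-Maruyama scheme. The Gaussian prior, the toy value function V(θ) and every name and constant below are hypothetical choices made for illustration; they are not taken from the SPORE implementation.

```python
import math
import random


def log_target_grad(theta, mu=0.0, c_p=1.0, peak=2.0, width=0.5):
    """Gradient of log p*(theta) = log p(theta) + log V(theta) (toy problem).

    Assumptions (hypothetical): Gaussian prior N(mu, 1/c_p) and a toy
    value function V(theta) = exp(-(theta - peak)^2 / (2 * width^2)).
    """
    grad_log_prior = c_p * (mu - theta)
    grad_log_value = -(theta - peak) / width ** 2
    return grad_log_prior + grad_log_value


def sample_parameters(beta, temperature, steps=20000, dt=1.0, seed=0):
    """Euler-Maruyama integration of the assumed Langevin dynamics."""
    rng = random.Random(seed)
    theta = 0.0
    trace = []
    for _ in range(steps):
        drift = beta * log_target_grad(theta) * dt
        noise = math.sqrt(2.0 * beta * temperature * dt) * rng.gauss(0.0, 1.0)
        theta += drift + noise
        trace.append(theta)
    return trace


if __name__ == "__main__":
    for temperature in (1.0, 0.1):
        samples = sample_parameters(beta=0.01, temperature=temperature)[1000:]
        mean = sum(samples) / len(samples)
        print(f"T={temperature}: mean of sampled theta = {mean:.2f}")
```

In this toy setting the samples concentrate around the mode of p*(θ), and lowering the temperature narrows their spread, which matches the exploration versus exploitation role of T described above.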
It has been further shown in [21] that Equation (3) can be implemented using a synapse model with local update rules. The state of each synapse $i$ consists of the dynamic variables $y_i(t)$, $e_i(t)$, $g_i(t)$, $\theta_i(t)$ and $w_i(t)$. The variable $y_i(t)$ is the pre-synaptic spike train filtered with a postsynaptic-potential kernel. $e_i(t)$ is the eligibility trace that maintains a brief history of pre-/post neural activity. $g_i(t)$ is a variable to estimate the reward gradient, i.e. the gradient of the objective function in Equation (1) with respect to the synaptic parameter $\theta_i(t)$. $w_i(t)$ denotes the weight of synapse $i$ at time $t$. In addition, each synapse has access to the global reward signal $r(t)$. The variables $e_i(t)$, $g_i(t)$ and $\theta_i(t)$ are updated by solving the differential equations:
$$\frac{d e_i(t)}{dt} = -\frac{1}{\tau_e}\, e_i(t) + w_i(t)\, y_i(t) \left( z_{post_i}(t) - \rho_{post_i}(t) \right), \tag{4}$$
$$\frac{d g_i(t)}{dt} = -\frac{1}{\tau_g}\, g_i(t) + r(t)\, e_i(t), \tag{5}$$
$$d\theta_i(t) = \beta \left( c_p \left( \mu - \theta_i(t) \right) + c_g\, g_i(t) \right) dt + \sqrt{2 \beta T}\, dW_i, \tag{6}$$
where $z_{post_i}(t)$ is a sum of Dirac delta pulses placed at the firing times of the post-synaptic neuron, $\tau_e$ and $\tau_g$ are the time constants of the eligibility trace and of the reward gradient estimate, $\mu$ is the prior mean of synaptic parameters ($p(\theta)$ in Eq. (2)) and $\rho_{post_i}(t)$ is the instantaneous firing rate of the post-synaptic neuron at time $t$. The constants $c_p$ and $c_g$ are tuning parameters of the algorithm that scale the influence of the prior distribution $p(\theta)$ against the influence of the reward-modulated term. Setting $c_p = 0$ corresponds to a non-informative (flat) prior. In general, the prior distribution is modeled as a Gaussian centered around $\mu$: $p(\theta) = \mathcal{N}(\mu, \frac{1}{c_p})$. We used $\mu = 0$ in our simulations. The variance of the reward gradient estimation (Equation (5)) could be reduced by subtracting a baseline from the reward as introduced in [43], although this was not investigated in this paper.
Finally, the synaptic weights are given by the projection
$$w_i(t) = w_0 \exp\left( \theta_i(t) - \theta_0 \right), \tag{7}$$
with scaling and offset parameters $w_0$ and $\theta_0$, respectively.
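The local update of a single synapse can be sketched as one Euler-Maruyama step. The sketch assumes leaky dynamics for the eligibility trace e_i and the gradient estimate g_i, a drift of the form β(c_p(µ − θ_i) + c_g g_i) for the parameter θ_i, and an exponential projection from θ_i to the weight. The time constants `tau_e` and `tau_g`, the function names and all default values are hypothetical; this is not the NEST implementation of SPORE.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Synapse:
    theta: float = 0.0  # synaptic parameter theta_i
    e: float = 0.0      # eligibility trace e_i
    g: float = 0.0      # reward-gradient estimate g_i


def weight(theta, w0=1.0, theta0=0.0):
    """Exponential projection from parameter to weight (assumed form)."""
    return w0 * math.exp(theta - theta0)


def euler_step(syn, y_pre, post_spike, rho_post, reward, rng,
               dt=1e-3, beta=1e-2, temperature=0.0,
               c_p=0.0, c_g=1.0, mu=0.0, tau_e=1.0, tau_g=10.0):
    """Advance one synapse by dt seconds.

    y_pre is the filtered pre-synaptic spike train, post_spike is 1 if the
    post-synaptic neuron fired during this step (0 otherwise), and rho_post
    is its instantaneous firing rate in Hz.
    """
    w = weight(syn.theta)
    # Integrating the Dirac pulse over dt gives the spike count, while the
    # firing-rate term contributes rho_post * dt.
    syn.e += -syn.e / tau_e * dt + w * y_pre * (post_spike - rho_post * dt)
    # The gradient estimate low-pass filters reward times eligibility.
    syn.g += (-syn.g / tau_g + reward * syn.e) * dt
    # Parameter update: prior term, reward-modulated term, and diffusion.
    drift = beta * (c_p * (mu - syn.theta) + c_g * syn.g) * dt
    noise = math.sqrt(2.0 * beta * temperature * dt) * rng.gauss(0.0, 1.0)
    syn.theta += drift + noise
    return syn


if __name__ == "__main__":
    rng = random.Random(0)
    syn = Synapse()
    for _ in range(100):
        euler_step(syn, y_pre=1.0, post_spike=1, rho_post=100.0,
                   reward=1.0, rng=rng)
    print(syn, weight(syn.theta))
```

With the temperature set to zero the update is deterministic: a synapse whose pre- and post-synaptic activity coincides with reward sees its parameter, and therefore its weight, increase, while an unrewarded synapse under a flat prior stays unchanged.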
An embodied evaluation is technically more involved and requires a closed-loop environment simulation. A core contribution of this paper is the implementation of a framework for evaluating the validity of bio-plausible plasticity models in closed-loop robotics environments. We rely on this framework to evaluate the synaptic sampling rule SPORE ([20]), as depicted in Figure 1. This framework is tailored for evaluating spiking network learning rules in an embodiment. Visual sensory input is sensed, encoded as spikes, processed by the network, and output spikes are converted to motor commands. The motor commands are executed by the agent, which modifies the environment. This modification of the environment is sensed by the agent.
Additionally, a continuous reward signal is emitted from the environment. SPORE tries to maximize this reward signal online by steering the ongoing synaptic plasticity processes of the network towards configurations which are expected to yield more overall reward. Unlike in classical reinforcement learning setups, the spiking network is treated as a dynamical system continuously receiving input and outputting motor commands. This allows us to report learning progress with respect to (biological) simulated time, unlike in classical reinforcement learning.
The robotic simulator and the neural network run in different processes. We rely on MUSIC ([7,8]) to communicate and transform the spikes and we employ the ROS-MUSIC tool-chain by [42] to bridge between the two communication frameworks. The latter also synchronizes ROS time with spiking network time. Most of these components are also integrated in the Neurorobotics Platform (NRP) [9], except for MUSIC and the ROS-MUSIC tool-chain. Therefore, the NRP does not support streaming a reward signal to all synapses, which is required in our experiments.
As part of this work, we contributed to the Gazebo DVS plugin by integrating it with ROS-MUSIC, and to the SPORE module by integrating it with MUSIC. These contributions enable researchers to design new ROS-MUSIC experiments using event-based vision to evaluate SPORE or their own biologically-plausible learning rules. A clear advantage of this framework is that the robotic simulation can be substituted for a real robot seamlessly. However, the necessary human supervision in real robotics coupled with the many hours needed by SPORE to learn a performing policy is currently prohibitive. The simulation of the whole framework was conducted in real time on a quad-core Intel Core i7-4790K with 16 GB RAM.

Learning Rate Annealing
In the original work presenting SPORE ([21,19,22,45]), the learning rate β and the temperature T were kept constant throughout the learning process. Note that in deep learning, learning rates are often regulated by the optimization processes ([23]). We found that the learning rate β of SPORE plays an important role in learning and benefits from an annealing mechanism. This regulation allows the synaptic weights to converge to a stable configuration and prevents the network from forgetting previous policy improvements. For the lane following experiment presented in this paper, the learning rate β is decreased over time, which also reduces the temperature (random exploration), see Equation (3). Specifically, we decay the learning rate β exponentially with respect to time:
$$\frac{d\beta(t)}{dt} = -\lambda\, \beta(t). \tag{8}$$
The learning rate is updated following this equation every 10 minutes. Independently decaying the temperature term T was not investigated; however, we expect a minor impact on the performance because of the high variance of the reward gradient estimation, which intrinsically leads the agent to explore.
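The annealing schedule can be sketched in a few lines, assuming the exponential decay has the closed form β(t) = β0 exp(−λt) and that the value is refreshed only at each 10-minute update. The decay constant is the one reported for the lane following task (λ = 8.5 × 10^-5, interpreted here as per second); the function and constant names are hypothetical.

```python
import math

LAMBDA = 8.5e-5        # decay constant reported for lane following (assumed 1/s)
UPDATE_PERIOD = 600.0  # the learning rate is refreshed every 10 minutes


def annealed_learning_rate(beta0, t_seconds, lam=LAMBDA, period=UPDATE_PERIOD):
    """Learning rate at time t, held constant between two updates."""
    last_update = math.floor(t_seconds / period) * period
    return beta0 * math.exp(-lam * last_update)


if __name__ == "__main__":
    for hours in (0, 1, 3, 5):
        ratio = annealed_learning_rate(1.0, hours * 3600.0)
        print(f"after {hours} h: {100 * ratio:.0f}% of the initial learning rate")
```

Under these assumptions the learning rate falls to roughly 40% of its initial value after 3 h and to roughly 20% after 5 h, in line with the values quoted for the lane following results.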

Evaluation
We evaluate our approach on two neurorobotic tasks: a reaching task and the lane following task presented in [18,4]. In the following sections, we describe these tasks and the ability of SPORE to solve them. Additionally, we analyze the performance and stability of the learned policies with respect to the prior distribution p (θ) and learning rate β, see Equation (3).

Experimental Setup
The tasks used for our evaluation are depicted in Figure 2. In both tasks, a feed-forward all-to-all two-layer network of spiking neurons is trained with SPORE to maximize a task-specific reward. Previous work has shown that this architecture is sufficient for the task complexity considered [18,4,6]. The network is end-to-end and maps the address events of a simulated DVS to motor commands. The parameters used for the evaluation are presented in Tables 1 to 3. In the next paragraphs, we describe the tasks together with their decoding schemes and reward functions.
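The idea that a DVS only reports change can be sketched with a simple frame-differencing encoder. This is a deliberately simplified stand-in for the simulated sensor: it compares raw brightness values of two consecutive frames instead of log intensities, merges ON and OFF polarities into a single event type, and uses a hypothetical threshold. It is not the Gazebo DVS plugin.

```python
def dvs_events(prev_frame, frame, threshold=0.1):
    """Return the addresses (x, y) of pixels whose brightness changed.

    Frames are lists of rows of floats. A pixel emits an event when its
    brightness changed by more than `threshold` between the two frames.
    """
    events = []
    for y, (prev_row, row) in enumerate(zip(prev_frame, frame)):
        for x, (before, after) in enumerate(zip(prev_row, row)):
            if abs(after - before) > threshold:
                events.append((x, y))
    return events


if __name__ == "__main__":
    empty = [[0.0] * 4 for _ in range(4)]
    ball_left = [row[:] for row in empty]
    ball_left[1][1] = 1.0
    ball_right = [row[:] for row in empty]
    ball_right[1][2] = 1.0
    print(dvs_events(ball_left, ball_left))   # static scene
    print(dvs_events(ball_left, ball_right))  # the ball moved one pixel
```

A static scene produces no events at all, which is why a motionless agent receives no sensory input in such a setup.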

Reaching Task
The reaching task is a natural extension of the open-loop blind reaching task on which SPORE was evaluated in [45]. A similar visual tracking task was presented in [6], with a different visual input encoding. In our setup, the agent controls a ball of 2 m radius which has to move towards the 2 m radius center of a 20 m × 20 m plane enclosed by walls. Sensory input is provided by a simulated DVS with a resolution of 16×16 pixels located above the center, which perceives the ball and the entire plane. There is one visual neuron corresponding to each DVS pixel, and we make no distinction between ON and OFF events. We additionally enhance the input with a k the activity of motor neuron k obtained by applying a low-pass filter on the spikes with time constant τ. This decoding scheme consists of equally distributing N motor neurons on a circle representing their contribution to the displacement vector. For our experiment, we set N = 8 motor neurons. We add an additional exploration neuron to the network which excites the motor neurons and is inhibited by the visual neurons. This neuron prevents long periods of immobility. Indeed, when the agent decides to stay motionless, it does not receive any sensory input as the DVS simulation only senses change. Since the network is feedforward, the absence of sensory input causes the neural activity to drop, leading to more immobility.
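The decoding scheme can be sketched as a population vector. The sketch assumes that motor neuron k votes for the direction at angle 2πk/N on the circle and that the displacement is the activity-weighted sum of these unit vectors; the one-step exponential filter stands in for the low-pass filtering of the output spikes. Function names, the angle convention and the filter form are assumptions, not the exact decoder used in the experiments.

```python
import math


def low_pass(activity, spike_count, dt, tau):
    """One Euler step of an exponential low-pass filter on a spike train."""
    return activity - activity / tau * dt + spike_count


def decode_displacement(activities):
    """Activity-weighted sum of N directions equally spaced on a circle."""
    n = len(activities)
    dx = sum(a * math.cos(2.0 * math.pi * k / n) for k, a in enumerate(activities))
    dy = sum(a * math.sin(2.0 * math.pi * k / n) for k, a in enumerate(activities))
    return dx, dy


if __name__ == "__main__":
    n_motor = 8
    activities = [0.0] * n_motor
    activities[2] = 1.0  # only the neuron voting for the 90 degree direction
    print(decode_displacement(activities))
    print(decode_displacement([1.0] * n_motor))  # uniform activity cancels out
```

One consequence of this scheme is that uniform activity across the motor population produces no net displacement, so the policy is carried by the differences in activity between motor neurons.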
The ball is reset to a random position on the plane if it has reached the center. This reset is not signaled to the network (aside from the abrupt change in visual input) and does not mark the end of an episode. Let $\beta_{err}$ denote the absolute value of the angle between the straight line to the goal and the direction taken by the ball. The agent is rewarded if the ball moves in the direction towards the goal, $\beta_{err} < \beta_{lim}$, at a sufficient velocity $v > v_{lim}$. Specifically, the reward r(t) is computed as:

Lane following Task
The lane following task was already used to demonstrate spiking neural controllers in [18] and [4]. The network controls the angle of the vehicle by steering it, while its linear velocity is constant. The output layer is separated into two neural populations. The steering commands sent to the agent consist of the difference of activity between these two populations. Specifically, steering commands are decoded from output spikes as a ratio between the following linear decoders: respectively. This discretization is similar to the one used in [44]. It yielded better performance than directly using r (multiplied with a scaling constant k) as a continuous-space steering command as in [18].
The reward signal delivered to the vehicle is equivalent to the performance metrics used in [18] to evaluate the policy. As in the reaching task, the reward depends on two terms: the angular error $\beta_{err}$ and the distance error $d_{err}$. The angular error $\beta_{err}$ is the absolute value of the angle between the right lane and the vehicle. The distance error $d_{err}$ is the distance between the vehicle and the center of the right lane. The reward r(t) is computed as:
The constants are chosen so that the score is halved every 0.1 m distance error or 5° angular error. Note that this reward function is bounded in [0, 1] and is less informative than the error used in [4]. In our case, the same reward is delivered to all synapses, and a particular reward value does not indicate whether the vehicle is on the left or on the right of the lane. The decay constant of the learning rate is $\lambda = 8.5 \times 10^{-5}$, see Table 2.
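To make the stated properties concrete, the following sketch implements one reward with them: it lies in [0, 1], it is halved for every 0.1 m of distance error and every 5° of angular error, and it depends only on absolute errors. The functional form is an assumption chosen to satisfy this description literally; it is not necessarily the formula used in the experiments, and the function and parameter names are hypothetical.

```python
def lane_reward(distance_error_m, angular_error_deg,
                distance_half=0.1, angle_half=5.0):
    """Reward in [0, 1], halved every `distance_half` m and `angle_half` deg."""
    d = abs(distance_error_m)
    a = abs(angular_error_deg)
    return 0.5 ** (d / distance_half) * 0.5 ** (a / angle_half)


if __name__ == "__main__":
    print(lane_reward(0.0, 0.0))   # perfectly centered and aligned
    print(lane_reward(0.1, 0.0))   # 0.1 m off the lane center
    print(lane_reward(-0.1, 0.0))  # same error on the other side
    print(lane_reward(0.1, 5.0))   # both errors at their halving value
```

Because only absolute errors enter the reward, an offset to the left and an equal offset to the right yield the same value, so the signal cannot tell the agent which way to steer.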

Results
Our results show that SPORE is capable of learning policies online for moderately difficult embodied tasks within a few simulated hours. We first discuss the results on the reaching task, where we evaluated the impact of the prior distribution. We then present the results on the lane following task, where the impact of the learning rate was evaluated.

Impact of Prior Distribution
For the reaching task, a flat prior $c_p = 0$ yielded the policy with highest performance, see Figure 3. At 7500 s (d), the performance has further increased. The policy, as shown in the second peak, has grown even stronger for many pixels, which also point in the right direction. The pixels pointing in the wrong direction mostly have a low vector strength.
After 9250 s (e), the performance drops to half its previous value. As we can see from the policy, the weights grew even stronger. Some strong pixel vectors pointing towards each other have emerged, which can lead to the ball constantly moving up and down without receiving any reward.
After this valley, the performance rises slowly again and at 20 000 s of simulation time (f) the policy has reached the maximum performance of this trial. Around the whole grid, strong motion vectors push the ball towards the center, and the ball reaches the center around 140 times every 250 s.
Just before the end of the trial, the performance drops again (g). Most vectors still point towards the right direction, however, the overall strength has largely decreased.

Impact of Learning Rate
For the lane following experiment, we show that the learning rate β plays an important role in retaining policy improvements. Specifically, when the learning rate β remains constant over the course of learning, the policy does not improve compared to random, see Figure 5. In the random case, the vehicle remains about 10 seconds on the right lane until triggering a reset. With annealing, after about 3 h of learning, the learning rate β has decreased to 40% of its initial value and the policy starts to improve. After 5 h of learning, the learning rate β approaches 20% of its initial value and the performance improvements are retained. Indeed, while the weights are not frozen, the amplitude of subsequent synaptic updates is drastically reduced. In this case, the policy is significantly better than random and the vehicle remains on the right lane for about 60 s on average.

Conclusion
The endeavor to understand the brain spans multiple research fields. Collaborations allowing synaptic learning rules derived by theoretical neuroscientists to be evaluated in closed-loop embodiment are an important milestone of this endeavor. In this paper, we successfully implemented a framework allowing this evaluation by relying on open-source software components for spiking network simulation [12,20], synchronization and communication [7,8,42,36] and robotic simulation [24,18]. The resulting framework is capable of learning online the control of simulated and real robots with a spiking network in a modular fashion.
This framework is used to evaluate the reward-learning rule SPORE ([21,19,22,45]) on two closed-loop visuomotor tasks. Overall, we have shown that SPORE is capable of learning shallow feedforward policies online for moderately difficult embodied tasks within a few simulated hours. This evaluation allowed us to characterize the influence of the prior distribution on the learned policy. Specifically, constraining priors deteriorate the performance of the learned policy but prevent strong synaptic weights from emerging, see Figure 3. Additionally, for the lane following experiment, we have shown how learning rate regulation enabled policy improvements to be retained. Inspired by simulated annealing, we presented a simple method decreasing the learning rate over time. This method does not model a particular biological mechanism, but seems to work better in practice. On the other hand, novelty is known to modulate plasticity through a number of mechanisms ([37,15]). Therefore, a decrease in learning rate after familiarization with the task is reasonable.
On a functional scale, deep learning methods still outperform biologically plausible learning rules such as SPORE. For future work, the performance gap between SPORE and deep learning methods should be tackled by taking inspiration from deep learning methods. Specifically, the online learning method inherent to SPORE is impacted by the high variance of the policy evaluation. This problem was alleviated in policy-gradient methods by introducing a critic trained to estimate the expected return of a given state. This expected return is used as a baseline which reduces the variance of the policy evaluation. Decreasing the variance could also be achieved by considering an action-space noise as in [6] instead of the parameter-space noise implemented by the Wiener process in Equation (3). Lastly, an automatic mechanism to regulate the learning rate β would be beneficial for more complex tasks. Such a mechanism could be inspired by trust-region methods ([38]), which constrain weight updates to alter the policy little by little. These improvements should increase SPORE