Edited by: Zeng-Guang Hou, Institute of Automation (CAS), China
Reviewed by: Daniel Saunders, University of Massachusetts Amherst, United States; Qi Xu, Dalian University of Technology, China
This article was submitted to Neuromorphic Engineering, a section of the journal Frontiers in Neuroscience
†These authors have contributed equally to this work and share first authorship
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Brain-inspired spiking neural networks (SNNs) have been successfully applied to many pattern recognition domains. SNN-based deep structures have achieved considerable results in perceptual tasks such as image classification and target detection. However, applying deep SNNs to reinforcement learning (RL) tasks remains an open problem. Although there have been previous studies combining SNNs and RL, most focus on robotic control problems with shallow networks or use the ANN-SNN conversion method to implement spiking deep Q networks (SDQN). In this study, we mathematically analyze the problem of spiking signal features vanishing in SDQN and propose a potential-based layer normalization (pbLN) method to train spiking deep Q networks directly. Experiments show that, compared with the state-of-the-art ANN-SNN conversion method and other SDQN works, the proposed pbLN spiking deep Q network (PL-SDQN) achieves better performance on Atari game tasks.
Inspired by biological brain neurons, spiking neural networks use differential dynamical equations and spike-based information encoding to build the computing node models of the network (Maass, 1997).
Spiking neural networks can be applied to many domains of pattern information processing and have achieved competitive performance on many tasks compared with ANNs. For example, a spiking ResNet has been trained for image classification (Fang et al., 2021).
Although there have been previous studies combining SNNs and RL, most focus on robotic control problems with shallow networks and few neurons. Reward-modulated spike-timing-dependent plasticity (R-STDP) has been used to train an SNN that keeps a robot within its lane. Lele et al. proposed SNN central pattern generators (CPGs) learned with stochastic reinforcement-based STDP to control hexapod walking (Lele et al., 2020).
Playing Atari games with spiking neural networks. The raw video image is input to the spiking convolutional neural network to extract image features, and the action selection is implemented by the SNN output layer. The selected action controls the agent's activity in the environment to obtain more reward.
Direct training of SNNs can obtain better performance than the ANN-to-SNN conversion method and improve energy efficiency (Wu et al., 2019).
We analyze how the spiking mechanism influences information feature extraction in deep SNNs and find that the binary nature of spikes strongly dissipates the variance and shifts the mean of the network inputs, so the pattern features of the information quickly vanish in spiking deep Q networks.
We propose the potential-based layer normalization method to preserve the sensitivity of spiking neurons in deep Q networks.
We construct a spiking deep Q network and evaluate it in Gym Atari environments. The spiking deep Q network is directly trained with a surrogate gradient function, and the experiments show that pbLN improves the performance of SNNs in RL tasks.
In this section, we introduce our study in three parts. First, we construct a spiking deep Q network to estimate the state-action value. Second, we analyze the feature vanishing in SNNs and its influence on reinforcement learning. Third, we propose the potential-based layer normalization method and train the spiking deep Q network with a backpropagation algorithm.
In order to better reflect the characteristics of SNNs in reinforcement learning environments, we construct our spiking Q network with the same architecture as DQN, shown in
Spiking deep Q network with potential-based layer normalization. It has the same network structure as DQN, with three convolutional layers and two fully connected layers. The network outputs an estimate of the state-action value, which is used for action selection and TD-based learning.
The neuron model in the spiking Q Network is adapted from the leaky integrate-and-fire (LIF) model.
Let $u^{l}_{t}$ be the membrane potential of layer $l$ at time step $t$ and $o^{l}_{t} \in \{0, 1\}$ its spike output. The discrete LIF dynamics can be written as

$$u^{l}_{t+1} = \tau\, u^{l}_{t}\,\left(1 - o^{l}_{t}\right) + W^{l} o^{l-1}_{t+1}, \qquad o^{l}_{t+1} = H\!\left(u^{l}_{t+1} - V_{th}\right),$$

where $\tau$ is the membrane decay factor, $W^{l}$ is the synaptic weight matrix, $V_{th}$ is the firing threshold, and $H(\cdot)$ is the Heaviside step function; a neuron that fires is reset to zero by the $(1 - o^{l}_{t})$ term.
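For illustration, the update can be written in a few lines of PyTorch. This is a minimal sketch under our own naming and constant choices (`lif_step`, `tau`, `v_th` are not from the original implementation):

```python
import torch

def lif_step(v, spike_in, w, tau=0.8, v_th=0.5):
    """One discrete LIF update: integrate weighted input spikes, leak, fire, reset.

    v        -- membrane potential from the previous time step
    spike_in -- binary spike vector from the presynaptic layer
    w        -- synaptic weight matrix of shape (n_out, n_in)
    """
    psp = spike_in @ w.t()            # postsynaptic potential from input spikes
    v = tau * v + psp                 # leaky integration of the membrane potential
    spike_out = (v >= v_th).float()   # Heaviside firing condition
    v = v * (1.0 - spike_out)         # hard reset of neurons that fired
    return v, spike_out
```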
The state-action value is generated from the time-window mean of the weighted sum of the spikes emitted by the last FC layer, as in Equation (7):

$$Q(s, a) = \frac{1}{T}\sum_{t=1}^{T} W^{o}\, o^{L}_{t},$$

where $T$ is the simulation time window, $o^{L}_{t}$ is the spike output of the last hidden layer at time step $t$, and $W^{o}$ are the readout weights.
Unlike ANN-SNN conversion-based methods or SNN-DNN hybrid training, our proposed model is directly optimized using the TD error between the network output and the target value:

$$L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^{2}\right],$$

where $\theta^{-}$ denotes the parameters of the target network and $\gamma$ is the discount factor.
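As a concrete, hedged sketch of this readout and objective (helper names, tensor shapes, and the averaging window are our assumptions; the full training loop with replay buffer and target-network updates is omitted):

```python
import torch
import torch.nn.functional as F

def q_values(spikes, w_out):
    """Q(s, .): average the weighted output spikes over the time window T.

    spikes -- tensor of shape (T, batch, hidden) from the last FC layer
    w_out  -- readout weights of shape (n_actions, hidden)
    """
    return (spikes @ w_out.t()).mean(dim=0)          # (batch, n_actions)

def td_loss(q, q_target_next, action, reward, done, gamma=0.99):
    """Standard DQN temporal-difference loss applied to the spiking Q estimate."""
    q_a = q.gather(1, action.unsqueeze(1)).squeeze(1)
    target = reward + gamma * q_target_next.max(dim=1).values * (1.0 - done)
    return F.mse_loss(q_a, target.detach())
```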
The proposed deep spiking Q network is directly trained with the spatio-temporal backpropagation (STBP) algorithm (Wu et al., 2018).
The gradient propagates through both the spatial (layer) and temporal dimensions of the network. Writing $\delta^{l}_{t} = \partial L / \partial u^{l}_{t}$ for the error signal of layer $l$ at time step $t$, the derivative temporal chain of the network weights accumulates over the simulation window:

$$\frac{\partial L}{\partial W^{l}} = \sum_{t=1}^{T} \frac{\partial L}{\partial u^{l}_{t}}\, \frac{\partial u^{l}_{t}}{\partial W^{l}} = \sum_{t=1}^{T} \delta^{l}_{t}\, \left(o^{l-1}_{t}\right)^{\top}.$$
At the non-differentiable point where a neuron fires a spike, we use a surrogate function to approximate the derivative of the spike output with respect to the membrane potential, $\partial o^{l}_{t} / \partial u^{l}_{t}$.
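In code, this is typically realized as a custom autograd function. The sketch below uses a rectangular surrogate window, one common choice; the class name, threshold, and window width are our illustrative assumptions, not necessarily those of the paper:

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike in the forward pass, surrogate gradient in the backward pass."""

    @staticmethod
    def forward(ctx, v, v_th=0.5, width=1.0):
        ctx.save_for_backward(v)
        ctx.v_th, ctx.width = v_th, width
        return (v >= v_th).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        # Rectangular surrogate: pass gradient only near the threshold.
        window = (torch.abs(v - ctx.v_th) < ctx.width / 2).float() / ctx.width
        return grad_out * window, None, None

# Usage: spike = SpikeFn.apply(v) emits binary spikes while letting
# gradients flow for membrane potentials close to the threshold.
```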
In the process of deep network training, changes in the network parameters alter the distribution of each layer's outputs, a phenomenon known as internal covariate shift.
Processing procedure of the spiking deep Q network.
Suppose the neurons in layer $l$ receive the spikes of layer $l-1$ as input. Because a spiking neuron emits a binary spike only when its accumulated potential crosses the threshold, the expected spike rate of layer $l$ is an attenuated version of that of layer $l-1$; we denote the attenuation factor by ε, the signal loss ratio transmitted by spiking neural networks. Additionally, if ε < 1, the mean of the neurons' spikes shrinks from layer to layer, so the spike features progressively vanish in the deeper layers.
According to Equation (16), the problem of spike information vanishing in SNNs can be alleviated by initializing the synaptic weights with a larger distributional variance or by giving the spiking neuron model a smaller potential threshold. But increasing the weight variance destabilizes training, and an overly small threshold makes neurons fire indiscriminately. Normalization techniques developed for SNNs, typically built on batch normalization, offer another remedy.
However, these methods are only suitable for supervised learning tasks such as image classification or object detection because, in those tasks, the SNNs are trained on batched data inputs. Compared with supervised learning, the environment information in reinforcement learning is more complex. First, the spike-vanishing problem of deep SNN models is especially serious in RL tasks. For example, we measured the firing activity distribution of each layer of a spiking deep Q network applied to Atari games; the statistics in the Results section show that the SDQN suffers a serious reduction of spiking information in its deep layers. Second, unlike supervised learning, SNN agents in RL have no invariant and accurate learning labels and must interact with the environment to collect data and reward information. This hysteresis of the learning samples means the SDQN model cannot effectively overcome the drawbacks caused by the disappearance of spike signals in the output layers. Third, the inputs in RL tasks are not batched, so the normalization methods used in supervised learning cannot be applied to SDQN.
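To make the vanishing effect concrete, the following small simulation (entirely our own illustrative construction, not code from the original experiments) stacks randomly initialized LIF layers with a unit threshold and reports, per layer, the fraction of neurons that fire at least once during the window:

```python
import torch

torch.manual_seed(0)

def layer_fire_ratios(depth=5, n=512, t_steps=8, tau=0.8, v_th=1.0):
    """Push random input spike trains through a stack of randomly
    initialized LIF layers; report each layer's ever-fired ratio."""
    x = (torch.rand(t_steps, n) < 0.5).float()    # layer-0 spike trains
    ratios = []
    for _ in range(depth):
        w = torch.randn(n, n) / n ** 0.5          # standard 1/sqrt(n) init
        v = torch.zeros(n)
        out = []
        for t in range(t_steps):
            v = tau * v + x[t] @ w.t()            # leaky integration
            s = (v >= v_th).float()               # fire at the threshold
            v = v * (1.0 - s)                     # hard reset
            out.append(s)
        x = torch.stack(out)                      # output spikes feed the next layer
        ratios.append(x.amax(dim=0).mean().item())
    return ratios

print(layer_fire_ratios())  # ratios typically shrink toward zero with depth
```

Because each layer's input rate drops, the variance of the next layer's PSP drops with it, and under these assumed parameters the firing activity collapses within a few layers, mirroring the layer-wise attenuation analyzed above.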
In this study, we propose a potential-based layer normalization method to solve the spike activity vanishing problem in SDQN. We apply the normalization operation to the PSP input $x^{l}_{t}$ of each layer before it is integrated into the membrane potential:

$$\mu^{l}_{t} = \frac{1}{n}\sum_{i=1}^{n} x^{l}_{t,i}, \qquad \left(\sigma^{l}_{t}\right)^{2} = \frac{1}{n}\sum_{i=1}^{n}\left(x^{l}_{t,i} - \mu^{l}_{t}\right)^{2},$$

where in a convolution layer the statistics are computed over the channel and spatial dimensions of the PSP feature map, and in a fully connected layer over all neurons. Normalizing the PSP then gives

$$\hat{x}^{l}_{t} = \lambda^{l}\, \frac{x^{l}_{t} - \mu^{l}_{t}}{\sqrt{\left(\sigma^{l}_{t}\right)^{2} + \epsilon}},$$

where $\lambda^{l}$ is a learnable scale parameter and $\epsilon$ is a small constant for numerical stability. Because the statistics are computed within a single sample rather than across a batch, pbLN can be applied to the unbatched, sequential observations of an RL agent.
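A hedged PyTorch sketch of such a module follows; the class and parameter names are ours, and we assume a single learnable scale per feature channel, matching the text:

```python
import torch
import torch.nn as nn

class PBLayerNorm(nn.Module):
    """Potential-based layer normalization: normalize a layer's PSP input
    across its feature dimensions before membrane integration."""

    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.lam = nn.Parameter(torch.ones(num_features))  # learnable scale
        self.eps = eps

    def forward(self, psp):
        # Statistics over all non-batch dimensions, one sample at a time,
        # so no batched input is required (unlike batch normalization).
        dims = tuple(range(1, psp.dim()))
        mu = psp.mean(dim=dims, keepdim=True)
        var = psp.var(dim=dims, keepdim=True, unbiased=False)
        psp_hat = (psp - mu) / torch.sqrt(var + self.eps)
        return self.lam.view(1, -1, *([1] * (psp.dim() - 2))) * psp_hat
```

The same module handles both convolutional PSP maps of shape (batch, channels, height, width) and fully connected PSPs of shape (batch, neurons).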
Operations of the neuron potential-based layer normalization. The bars in the figure represent the distribution of the PSP potential; the gray bars show the distribution of the original PSP potential before normalization.
The effect of pbLN on membrane potential is shown in
Neuron potential is maintained by the normalization method. The gray dotted line shows the membrane potential of a neuron without normalization as it receives spike inputs; the solid black line shows the potential with normalization. Without normalization, neurons have difficulty firing when the time interval between external stimuli is relatively large. With the normalization operation, the membrane potential is influenced by neighboring neurons and its leakage trend slows, which increases the probability that neurons fire spikes.
PL-SDQN is a spiking neural network model based on LIF neurons with the same network structure as the traditional DQN. It contains three convolutional layers with a “c32k8-c64k4-c64k3” structure, followed by a fully connected hidden layer of 512 neurons; the output is the state-action values, computed as the weighted summation of the hidden layer's outputs. We directly trained PL-SDQN on reinforcement learning tasks. The results show that spiking deep Q networks combined with the potential-based layer normalization method achieve better performance on Atari games than the traditional DQN and ANN-to-SNN conversion methods.
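Read literally, “c32k8-c64k4-c64k3” fixes only the channel counts and kernel sizes. The skeleton below additionally assumes DQN's classic strides and an 84×84, 4-frame input, which the text does not state explicitly:

```python
import torch.nn as nn

class PLSDQNBackbone(nn.Module):
    """Convolutional backbone matching the 'c32k8-c64k4-c64k3' description.

    In the full PL-SDQN, each stage would be wrapped with pbLN and LIF
    dynamics unrolled over the simulation window (omitted here)."""

    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        # Strides follow the classic DQN layout; the paper gives only
        # channel counts (32, 64, 64) and kernel sizes (8, 4, 3).
        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc = nn.Linear(64 * 7 * 7, 512)   # 84x84 input -> 7x7 feature maps
        self.out = nn.Linear(512, n_actions)   # state-action value readout
```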
We counted each layer's firing spikes in SDQN to show the deep-layer spike vanishing phenomenon and the promoting effect of the pbLN method. The SDQN model is initialized with random synaptic weights and then used to play the Atari games. We calculated the ratio of neurons with firing activity to the total number of neurons in each layer.
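A sketch of that per-layer statistic, assuming each layer's spikes are recorded as a binary tensor over the episode (the helper name is ours):

```python
import torch

def firing_ratio(spike_record):
    """Fraction of a layer's neurons that fire at least once in an episode.

    spike_record -- binary tensor of shape (T, n_neurons) collected
                    from one layer while the agent plays.
    """
    ever_fired = spike_record.amax(dim=0)   # 1 if the neuron ever spiked
    return ever_fired.mean().item()
```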
We tested each game ten times and counted the firing rate of each layer. The average and SD across these experiments are shown in
Fire rates in different convolutional layers of SDQN. The blue bars depict the spike fire rate of SDQN without normalization, and the orange bars are for our model, PL-SDQN. The black vertical line on top of each bar is the SD over the 10 runs.
Compared with the vanishing activity in the plain SDQN, the proposed pbLN method improves deep-layer spiking activity. The bottom rows in
We compared our model with the vanilla DQN model and the ANN-SNN conversion-based SDQN model; performance is evaluated on 16 Atari games. All models are trained directly with the same settings and optimized with the Adam method, as
PL-SDQN performance on Atari games. The learning curves show that our PL-SDQN model learns faster and reaches better performance than the original DQN benchmarks. Although the DQN model learns faster at the beginning of the Breakout game, our model quickly catches up and achieves better final performance. Each curve is smoothed with a moving average of 5 for clarity.
Details of Atari game experiments.
Game | DQN mean | DQN STD (STD/mean) | Conversion SDQN mean | STD (STD/mean) | PL-SDQN mean | STD (STD/mean)
Atlantis | 3,049,750.0 | 161,861.4 (5.31%) | 3,046,920.0 | 1,348,868.6 (44.27%) | | 86,339.7 (2.64%)
BeamRider | 10,423.2 | 2,245.1 (21.54%) | 10,449.0 | 2,620.4 (25.08%) | | 3,342.4 (29.11%)
Boxing | 99.3 | 0.9 (0.91%) | 98.6 | 3.2 (3.25%) | | 0.9 (0.90%)
Breakout | 343.1 | 41.3 (12.04%) | 352.2 | 64.7 (18.37%) | | 140.1 (32.76%)
CrazyClimber | 139,420.0 | 11,530.6 (8.27%) | 128,380.0 | 23,239.8 (18.10%) | | 30,632.1 (20.70%)
Gopher | | 28,245.6 (73.06%) | 22,438.0 | 10,076.7 (44.91%) | 24,064.0 | 12,355.3 (51.24%)
Jamesbond | 1,445.0 | 1,572.0 (108.79%) | 1,420.0 | 190.0 (13.38%) | | 2,150.1 (147.27%)
Kangaroo | 12,680.0 | 208.8 (1.65%) | 13,850.0 | 1,132.5 (8.17%) | | 845.0 (5.83%)
Krull | 10,271.0 | 1,365.5 (13.29%) | 10,923.0 | 513.0 (4.70%) | | 568.2 (4.81%)
MsPacman | 2,964.0 | 711.7 (24.01%) | 3,691.0 | 434.8 (11.78%) | | 1,292.5 (31.70%)
NameThisGame | 7,732.0 | 1,289.2 (16.67%) | 8,115.0 | 1,702.1 (20.97%) | | 2,210.7 (18.12%)
RoadRunner | 1,310.0 | 764.8 (58.38%) | 1,702.0 | 329.2 (19.34%) | | 4,714.4 (9.08%)
SpaceInvaders | 1,728.5 | 461.6 (26.71%) | 1,760.0 | 483.5 (27.47%) | | 574.7 (23.62%)
StarGunner | 53,050.0 | 1,342.5 (2.53%) | 55,910.0 | 12,796.9 (22.89%) | | 4,064.7 (6.40%)
Tutankham | 262.0 | 28.9 (11.03%) | 254.5 | 55.4 (21.77%) | | 70.4 (25.93%)
VideoPinball | 507,442.5 | 327,189.1 (64.48%) | 552,917.6 | 200,852.5 (36.33%) | | 10,066.2 (1.49%)
The vanilla DQN, the ANN-SNN conversion-based SDQN, and our proposed PL-SDQN model are compared. We test each model for 10 rounds and record the mean and standard deviation (STD) of the raw scores; the value in parentheses is the ratio STD/mean.
The best scores of each game are highlighted by bold values.
The PL-SDQN model that we propose achieves better performance than the vanilla DQN and the conversion-based SDQN model. The data in
In order to show the improvement our proposed pbLN method brings to the spiking deep Q model, we compared PL-SDQN with other directly trained SDQN models, such as that of Liu et al. (2021).
The results in the table below show that our model achieves better performance than the compared directly trained SDQN models on most of the games.
Comparison of our PL-SDQN model with state-of-the-art spiking deep Q networks.
Game | SDQN (Liu et al.) | PL-SDQN (ours)
Atlantis | **98.79** | 84.24
BeamRider | 97.48 | **99.57**
Boxing | 99.17 | **100.20**
Breakout | 90.86 | **124.66**
CrazyClimber | 102.82 | **106.12**
Gopher | **95.78** | 62.24
Jamesbond | **113.92** | 101.04
Kangaroo | 94.56 | **114.35**
Krull | **106.77** | 28.69
NameThisGame | 98.85 | **152.41**
RoadRunner | 89.72 | **917.26**
SpaceInvaders | 80.50 | **106.85**
StarGunner | 112.96 | **119.81**
Tutankham | **103.90** | 103.63
VideoPinball | 87.01 | **132.73**
Total ≥ 100% | 6/15 | 11/15
To show the SNN advantage, each entry is the percentage SDQN/DQN × 100%. The best score for each game is highlighted by bold values.
We attribute our model's advantage on these test games to the fact that spiking-activity vanishing in the deep layers of an SNN reduces the performance of the SDQN model, and the proposed pbLN method counteracts the impact of these input signal changes, improving the ability of spiking neural networks in reinforcement learning tasks. Unlike PL-SDQN, the SDQN method based on ANN-SNN conversion faces spike-accuracy problems and requires a long simulation process, so its performance struggles to surpass the original ANN model. The other directly trained SDQN models compared in the table above also do not address the deep-layer spike-vanishing problem.
The primary computational cost of potential-based layer normalization is concentrated in the calculation of the feature mean and variance. The computational complexity of the mean and variance operations is linear in the number of neurons in a layer, so pbLN adds only a small overhead to each forward step.
In this study, we directly trained deep spiking neural networks on Atari game reinforcement learning tasks. Because of their discrete outputs and the resulting hard optimization problem, spiking neural networks are difficult to apply to reinforcement learning in complex scenarios. We mathematically analyzed why deep spiking neural networks struggle to generate firing activity and proposed a potential-based layer normalization method to increase spiking activity in the deep layers of an SNN. This method raises the firing rate of the deep spiking neural network so that input information features can be transferred to the output layer. The experimental results show that, compared with the vanilla DQN and ANN-SNN conversion-based SDQN methods, our PL-SDQN model achieves better task performance. Moreover, our model shows better generalization and robustness than other directly trained SDQN methods on Atari game reinforcement learning tasks.
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding authors.
YS wrote the code, performed the experiments, analyzed the data, and wrote the manuscript. YZ proposed and supervised the project and contributed to writing the manuscript. YL participated in helpful discussions and contributed to part of the experiments. All authors contributed to the article and approved the submitted version.
This study was supported by the National Key Research and Development Program (Grant No. 2020AAA0104305), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB32070100), and the National Natural Science Foundation of China (Grant No. 62106261).
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The handling editor Z-GH declared a shared affiliation with the authors at the time of review.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
The Supplementary Material for this article can be found online at: