
Edited by: Bernabe Linares-Barranco, Instituto de Microelectrónica de Sevilla, Spain

Reviewed by: Tara Julia Hamilton, Western Sydney University, Australia; Thomas Nowotny, University of Sussex, UK

*Correspondence: Jun Haeng Lee

This article was submitted to Neuromorphic Engineering, a section of the journal Frontiers in Neuroscience

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Deep spiking neural networks (SNNs) hold the potential for improving the latency and energy efficiency of deep neural networks through data-driven event-based computation. However, training such networks is difficult due to the non-differentiable nature of spike events. In this paper, we introduce a novel technique, which treats the membrane potentials of spiking neurons as differentiable signals, where discontinuities at spike times are considered as noise. This enables an error backpropagation mechanism for deep SNNs that follows the same principles as in conventional deep networks, but works directly on spike signals and membrane potentials. Compared with previous methods relying on indirect training and conversion, our technique has the potential to capture the statistics of spikes more precisely. We evaluate the proposed framework on artificially generated events from the original MNIST handwritten digit benchmark, and also on the N-MNIST benchmark recorded with an event-based dynamic vision sensor, on which the proposed method reduces the error rate by a factor of more than three compared to the best previous SNN, and also achieves a higher accuracy than a conventional convolutional neural network (CNN) trained and tested on the same data. We demonstrate in the context of the MNIST task that thanks to their event-driven operation, deep SNNs (both fully connected and convolutional) trained with our method achieve accuracy equivalent to that of conventional neural networks. In the N-MNIST example, equivalent accuracy is achieved with about five times fewer computational operations.

Deep learning is achieving outstanding results in various machine learning tasks (He et al.,

A recently proposed solution is to use different data representations between training and processing, i.e., training a conventional ANN and developing conversion algorithms that transfer the weights into equivalent deep SNNs (O'Connor et al.,

In this paper we introduce a novel supervised learning method for SNNs, which closely follows the successful backpropagation algorithm for deep ANNs, but here is used to train general forms of deep SNNs directly from spike signals. This framework includes both fully connected and convolutional SNNs, SNNs with leaky membrane potential, and layers implementing spiking winner-takes-all (WTA) circuits. The key idea of our approach is to generate a continuous and differentiable signal on which SGD can work, using low-pass filtered spiking signals added onto the membrane potential and treating abrupt changes of the membrane potential as noise during error backpropagation. Additional techniques are presented that address particular challenges of SNN training: Spiking neurons typically require large thresholds to achieve stability and reasonable firing rates, but large thresholds may result in many “dead” neurons, which do not participate in the optimization during training. Novel regularization and normalization techniques are proposed that contribute to stable and balanced learning. Our techniques lay the foundations for closing the performance gap between SNNs and ANNs, and promote their use for practical applications.

Gradient descent methods for SNNs have not been deeply investigated because both spike trains and the underlying membrane potentials are not differentiable at the time of spikes. The most successful approaches to date have used indirect methods, such as training a network in the continuous rate domain and converting it into a spiking version. O'Connor et al. (

In this article we study two types of networks: fully connected SNNs with multiple hidden layers and convolutional SNNs.

The LIF neuron is one of the simplest models used for describing the dynamics of spiking neurons (Gerstner and Kistler, ). In the event-driven update of Equation (1), V_{mp} is the membrane potential, τ_{mp} is the membrane time constant, and t_{p} and t_{p−1} are the present and previous input spike times. Each input is additionally scaled by a dynamic weight w_{dyn}, which controls the refractory period following an output spike:
Here T_{ref} is the maximum duration of the refractory period, and Δt = t_{p} − t_{out}, where t_{out} is the time of the latest output spike produced by the neuron or an external trigger signal through lateral inhibition as discussed in Section 2.1.2. Thus, the effect of input spikes on V_{mp} is suppressed for a short period of time T_{ref} after an output spike. w_{dyn} recovers quadratically to 1 after the output spike at t_{out}. Since w_{dyn} is a neuron parameter applied to all synapses identically, it is different from short-term plasticity, which is a synapse-specific mechanism. The motivation for using dynamic weights instead of simpler refractory mechanisms, such as simply blocking the generation of output spikes, is that they allow refractory states to be controlled by external mechanisms. One example is the introduction of WTA circuits in Section 2.1.2, where lateral inhibition simultaneously puts all neurons competing in a WTA circuit into the refractory state. This ensures that the winning neuron gets another chance to win the competition, since otherwise another neuron could fire while only the winner has to reset its membrane potential after generating a spike.

When V_{mp} crosses the threshold value V_{th}, the LIF neuron generates an output spike and V_{mp} is decreased by the amount of the threshold:
The lower bound of the membrane potential is set to −V_{th}, and V_{mp} is clipped whenever it falls below this value. This strategy helps balance the participation of neurons during training by preventing neurons from developing highly negative membrane potentials. We will revisit this issue when we introduce threshold regularization in Section 2.3.2.
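The event-driven dynamics described above (exponential decay between input spikes, dynamic-weight scaling during the refractory period, reset by subtraction, and clipping at −V_th) can be sketched for a single neuron as follows; the function name and parameter values are our own illustrative choices, not the paper's code:

```python
import math

def simulate_lif(input_spikes, tau_mp=20e-3, v_th=1.0, t_ref=1e-3):
    """Event-driven LIF update with a dynamic-weight refractory mechanism.

    input_spikes: list of (time, synaptic_weight) pairs, sorted by time.
    Returns the list of output spike times.
    """
    v_mp = 0.0
    t_prev = None      # time of previous input spike
    t_out = -math.inf  # time of latest output spike
    out_spikes = []
    for t_p, w in input_spikes:
        # Exponential decay of the membrane potential since the last input
        if t_prev is not None:
            v_mp *= math.exp(-(t_p - t_prev) / tau_mp)
        # Dynamic weight: quadratic recovery to 1 within the refractory period
        dt = t_p - t_out
        w_dyn = (dt / t_ref) ** 2 if dt < t_ref else 1.0
        v_mp += w * w_dyn
        # Clip the membrane potential at the lower bound -v_th
        v_mp = max(v_mp, -v_th)
        # Spike generation and reset by subtraction
        if v_mp >= v_th:
            out_spikes.append(t_p)
            t_out = t_p
            v_mp -= v_th
        t_prev = t_p
    return out_spikes
```

Two closely spaced sub-threshold inputs can thus accumulate into an output spike, while inputs arriving within T_ref of an output spike are strongly attenuated.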

We found that the accuracy of SNNs could be improved by introducing a competitive recurrent architecture, in the form of WTA circuits added in certain layers. In a WTA circuit, multiple neurons form a group with lateral inhibitory connections. Thus, as soon as any neuron produces an output spike, it inhibits all other neurons in the circuit and prevents them from spiking (Rozell et al., ). The strength of the lateral inhibition is proportional to the threshold of the inhibited neuron, so that it inhibits neurons having small V_{th} weakly and those having large V_{th} strongly. This improves the balance of activities among neurons during training, since neurons with higher activities have larger V_{th} due to the threshold regularization scheme described in Section 2.3.2. Furthermore, as described previously in Section 2.1.1, lateral inhibition is used to put the dynamic weights of all inhibited neurons in a WTA circuit into the refractory state. As shown in
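A minimal sketch of the lateral inhibition step, assuming inhibition proportional to each inhibited neuron's own threshold with a strength parameter κ in [−1, 0]; the function name and interface are our own:

```python
def wta_step(v_mp, v_th, winner, kappa=-0.5):
    """Apply lateral inhibition in a WTA circuit after `winner` spikes.

    v_mp: list of membrane potentials (modified in place).
    v_th: list of per-neuron thresholds.
    Inhibition is proportional to each inhibited neuron's own threshold;
    the winner's reset-by-subtraction is handled elsewhere.
    Returns the indices of neurons put into the refractory state.
    """
    refractory = []
    for j in range(len(v_mp)):
        if j == winner:
            continue
        # Subtract kappa-scaled threshold, clipped at the lower bound -v_th
        v_mp[j] = max(v_mp[j] + kappa * v_th[j], -v_th[j])
        refractory.append(j)  # inhibited neurons enter the refractory state
    return refractory
```

Neurons with larger thresholds (i.e., historically more active ones) are thereby inhibited more strongly, which balances activity inside the circuit.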

In order to derive and apply the backpropagation equations for training SNNs, we first summarize the classical backpropagation method (Rumelhart and Zipser,

Neural networks are typically optimized by SGD, meaning that the vector of network parameters or weights θ is moved in the direction of the negative gradient of some loss function

Propagate inputs in the forward direction to compute the pre-activations (z^{(l)}) and activations (a^{(l)} = f^{(l)}(z^{(l)})) for all the layers up to the output layer n_{l}, where f^{(l)} denotes the activation function of the l-th layer.

Calculate the error at the output layer:

where y is the label vector indicating the desired output activation and · is element-wise multiplication.

Backpropagate the error to lower layers n_{l} − 1, n_{l} − 2, …, 2:

where W^{(l)} is the weight matrix of the l-th layer.

Compute the partial derivatives for the update:

where b^{(l)} is the bias vector of the l-th layer.

Update the parameters:
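The five steps above can be sketched end-to-end as a minimal NumPy implementation of classical backpropagation (a generic sigmoid network with quadratic loss and batch size 1; this is our own illustrative code, not yet the SNN-specific variant derived later):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(weights, biases, x, y, lr=0.1):
    """One SGD step of classical backpropagation (batch size 1)."""
    # Step 1: forward pass, storing activations a^(l) for every layer
    a, acts = x, [x]
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
        acts.append(a)
    # Step 2: error at the output layer (quadratic loss, sigmoid derivative)
    delta = (acts[-1] - y) * acts[-1] * (1 - acts[-1])
    # Steps 3-5: backpropagate, compute partial derivatives, update
    for l in range(len(weights) - 1, -1, -1):
        dW = np.outer(delta, acts[l])
        db = delta
        if l > 0:  # propagate error to the layer below (uses old weights)
            delta = (weights[l].T @ delta) * acts[l] * (1 - acts[l])
        weights[l] -= lr * dW
        biases[l] -= lr * db
    return weights, biases
```

Repeated application of `backprop_step` on a training sample moves the network output toward the label vector y.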

Starting from the event-based update of the membrane potentials in Equation (1), we can define the accumulated effect (normalized by synaptic weight) of the k-th input spike train on the membrane potential as x_{k}, and the accumulated effect of the i-th neuron's own output spike train as a_{i}, which is due to the reset in Equation (3) (normalized by V_{th}). Both x_{k} and a_{i} can be expressed as sums of exponentially decaying terms
over the input spike times t_{p} and output spike times t_{q} preceding the current time. The accumulated effects of lateral inhibitory signals in WTA circuits can be expressed analogously to Equation (4). The activities in Equation (4) are real-valued and continuous except for the time points where spikes occur and the activities jump up. We use these numerically computed low-pass filtered activities for backpropagation instead of directly using spike signals.
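The low-pass filtered activities of Equation (4) are simple sums of exponentially decaying kernels, one per spike, and can be computed numerically as in the following sketch (function name and defaults are our own):

```python
import math

def filtered_activity(spike_times, t, tau_mp=20e-3):
    """Low-pass filtered spike trace at time t: one exponentially
    decaying kernel per spike occurring at or before t. This is the
    form of the x_k and a_i signals used for backpropagation."""
    return sum(math.exp(-(t - t_s) / tau_mp)
               for t_s in spike_times if t_s <= t)
```

The trace jumps up by 1 at each spike time and decays with time constant τ_mp in between, which is exactly the continuous-except-at-spikes signal discussed above.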

Ignoring the effect of refractory periods for now, the membrane potential of the i-th neuron can be expressed in terms of the x_{k} and a_{i} defined in Equation (4) as
where κ_{ij} is the strength of lateral inhibition (−1 ≤ κ_{ij} ≤ 0) from the j-th neuron in the same WTA circuit (normalized by V_{th}). We found a value of σ ≈ 0.5 to work well in practice. Equation (5) reveals the relationship between inputs and outputs of spiking neurons, which is not clearly shown in Equations (1) and (3). Nonlinear activation of neurons is considered in Equation (5) by including only active synapses and neurons. Since the output (a_{i}) of the current layer becomes the input (x_{k}) of the next layer if all the neurons have the same τ_{mp} (see Figure ), Equation (5) provides the basis for deriving the backpropagation algorithm via the chain rule.

Differentiation is not defined in Equation (4) at the moment of each spike because there is a discontinuous step jump. However, we propose here to ignore these fluctuations, and treat Equations (4) and (5) as if they were differentiable continuous signals to derive the necessary error gradients for backpropagation. In previous works (O'Connor et al., ), firing rates were used in place of the x_{k} and a_{i} in Equation (5) for backpropagation. In this work, however, we directly use the contribution of spike signals to the membrane potential as defined in Equation (4). Thus, the real statistics of spike signals, including temporal effects such as synchrony between inputs, can influence the training process. Ignoring the step jumps caused by spikes in the calculation of gradients might of course introduce errors, but as our results show, in practice this seems to have very little influence on SNN training. A potential explanation for this robustness of our training scheme is that by treating the signals in Equation (4) as continuous signals that fluctuate suddenly at times of spikes, we achieve a similar positive effect as the widely used approach of noise injection during training, which can improve the generalization capability of neural networks (Vincent et al.,

For the backpropagation equations it is necessary to obtain the transfer functions of LIF neurons in WTA circuits (which generalize to non-WTA layers by setting κ_{ij} = 0 for all i and j). To this end, we set the V_{mp} term on the left side of Equation (5) to zero (since it is not relevant to the transfer function), resulting in the transfer function

Directly differentiating Equation (6) yields the backpropagation equations
If the lateral inhibition strengths are all identical, i.e., κ_{ij} = μ, ∀ i, j, then the individual derivative ∂a_{i}/∂κ_{ih} is not necessary and Equation (8) can be simplified to

Good initialization of weight parameters in supervised learning is critical to handle the exploding or vanishing gradients problem in deep neural networks (Glorot and Bengio,

The weight and threshold parameters of neurons in the l-th layer are initialized as in Equation (10), where M^{(l)} is the number of synapses of each neuron, and α is a constant. α should be large enough to stabilize spiking neurons, but small enough to make the neurons respond to the inputs through multiple layers. In general, layers with a smaller number of units need a smaller α, so that each neuron generates more spikes and maintains a high enough input activity for the next layer. We used values between 3 and 10 for α and tuned them for each layer to increase the learning speed, although other choices of α lead to similar results. The weights initialized by Equation (10) satisfy the following condition:
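Since the exact form of Equations (10) and (11) is not reproduced here, the following sketch shows one plausible realization of the described scheme: uniform weights with a fan-in-dependent bound and a threshold proportional to α. This is our assumption about the form of the initialization, not the paper's exact formula:

```python
import numpy as np

def init_layer(m_in, n_out, alpha=5.0, rng=None):
    """Hypothetical layer initialization: uniform weights whose bound
    shrinks with fan-in M, and a threshold proportional to alpha.
    Larger alpha stabilizes neurons (fewer spikes); smaller alpha keeps
    neurons responsive through multiple layers."""
    rng = rng or np.random.default_rng()
    bound = np.sqrt(3.0 / m_in)                     # fan-in-dependent scale
    W = rng.uniform(-bound, bound, size=(n_out, m_in))
    v_th = alpha * bound                            # threshold scaled by alpha
    return W, v_th
```

With this choice, layers with a smaller fan-in automatically receive larger weight magnitudes, consistent with the requirement of maintaining enough input activity for the next layer.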

The main idea of backprop error normalization is to balance the magnitude of updates of the weight (and threshold) parameters among layers. In the case of matching layer dimensions (N^{(l)} = N^{(l+1)}, M^{(l)} = M^{(l+1)}), we define the error propagating back through the layers as in Equation (13), where η_{w} and η_{th} are the learning rates for weight and threshold parameters, respectively. We found that the threshold values tend to decrease over the training epochs, because SGD decreases the threshold whenever the target neuron does not fully respond to the corresponding input. Small thresholds, however, could lead to exploding firing rates within the network. Thus, we used smaller learning rates for threshold updates to prevent the threshold parameters from decreasing too much.

As in conventional ANNs, regularization techniques such as weight decay during training are essential to improve the generalization capability of SNNs. Another problem in training SNNs is that because thresholds need to be initialized to large values as described in Equation (10), only a few neurons respond to input stimuli and many of them remain silent. This is a significant problem, especially in WTA circuits. In this section we introduce weight and threshold regularization methods to address these issues.

Weight decay regularization is used to improve the stability of SNNs as well as their generalization capability. Specifically, we want to maintain the condition in Equation (11). Conventional L2-regularization was found to be inadequate for this purpose, because it leads to an initial fast growth, followed by a continued decrease of weights. To address this issue, a new method named exponential regularization is introduced, which is inspired by max-norm regularization (Srivastava et al.,

Threshold regularization is used to balance the activities among neurons: when neurons fire after receiving an input spike, their thresholds are increased by ρ. Thus, highly active neurons become less sensitive to input stimuli due to the increase of their thresholds. On the other hand, rarely active neurons can respond more easily to subsequent stimuli. Because the membrane potentials are restricted to the range [−V_{th}, V_{th}], neurons with smaller thresholds tend to be less influenced by negative inputs, because of their tighter lower bound. Threshold regularization actively prevents dead neurons and encourages all neurons to contribute equally to the optimization. This kind of regularization has been used for competitive learning previously (Rumelhart and Zipser,
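A minimal sketch of this homeostatic scheme (the interface and the floor value are our own illustrative additions):

```python
import numpy as np

def regularize_thresholds(v_th, fired, rho=1e-4, v_th_min=0.1):
    """Threshold regularization sketch: raise the thresholds of neurons
    that just fired by rho, making highly active neurons less sensitive
    while rarely active neurons keep their lower thresholds and respond
    more easily. v_th_min is an illustrative floor, not from the paper.

    v_th: array of per-neuron thresholds.
    fired: list of indices of neurons that fired after the input spike.
    Returns the updated threshold array.
    """
    v_th = v_th.copy()
    v_th[fired] += rho       # penalize the neurons that just fired
    return np.maximum(v_th, v_th_min)
```

Applied after every input spike during training, this keeps the activity distribution across a layer (and especially within WTA circuits) balanced, preventing dead neurons.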

Using the regularization term from Equation (14), the objective function for each training sample (using batch size = 1) is given by
where o_{i} is the number of output spikes generated by the i-th neuron of the output layer.

The training procedure can be summarized as follows: For every training sample, e.g., an image from the MNIST database, a set of events is generated. The events are propagated forward through the network using the event-driven update rule described in Equation (1), with threshold regularization applied. This simulation is purely event-driven and does not use discrete time steps. The auxiliary activity values defined in Equation (4) are also calculated during forward propagation. Threshold regularization and the auxiliary activity values are used for training only; they are not necessary if the trained network is later used for inference. After all the events from the set have finished propagating forward through the network, the events of the output layer are counted to obtain the output vector as described above Equation (16). This is used to calculate the error vector, which is normalized by N_{nze}, the number of nonzero elements in the error vector.
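Under a hypothetical network interface (`reset_state`, `propagate_event`, `output_spike_counts`, and `backprop` are our own placeholder hooks, not the authors' code), the per-sample procedure could be sketched as:

```python
def train_sample(network, events, label, num_classes=10):
    """One training step of the described procedure (sketch):
    event-driven forward pass, spike counting at the output layer,
    error normalization by the number of nonzero elements, then
    backpropagation through the filtered activities."""
    network.reset_state()
    for t, source in events:                  # purely event-driven pass
        network.propagate_event(t, source)    # also accumulates Eq. (4) traces
    counts = network.output_spike_counts()    # output vector from spike counts
    target = [1.0 if i == label else 0.0 for i in range(num_classes)]
    error = [o - y for o, y in zip(counts, target)]
    n_nze = sum(1 for e in error if e != 0.0) or 1
    error = [e / n_nze for e in error]        # normalized error vector
    network.backprop(error)                   # uses the filtered activities
    return error
```

Note that the threshold-regularization and auxiliary-trace bookkeeping live inside the forward hooks; at inference time only the event propagation and spike counting are needed.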

MNIST is a handwritten digit classification dataset consisting of 60,000 training samples and 10,000 test samples (LeCun et al.,

The permutation-invariant (PI) version of MNIST refers to the fact that the input images are randomly permuted, resulting in a loss of spatial structure and effectively random sparse input patterns. By randomly permuting the input stimuli we prohibit the use of techniques that exploit spatial correlations within inputs, such as data augmentation or convolutions to improve performance. Using the PI MNIST thus more directly measures the power of a fully-connected classifier.

Figure

Parameter | Value | Reference
τ_{mp} | 20 ms (MNIST), 200 ms (N-MNIST) | Equations (1) and (4)
T_{ref} | 1 ms | Equation (1)
α | 3−10 | Equation (10)
η_{w} | 0.002−0.004 | Equation (13)
η_{th} | 0.1η_{w} (SGD), η_{w} (ADAM) | Equation (13)
β | 10 | Equation (14)
λ | 0.002−0.04 | Equation (14)
ρ | 0.00004−0.0002 | Section 2.3.2

We trained and evaluated SNNs with differently sized hidden layers (784-N-10). Using the ADAM optimizer (β_{1} = 0.9, β_{2} = 0.999, ϵ = 10^{−8}), we could further improve the best accuracy up to 98.77%, which is close to ANNs trained with Dropout or DropConnect (Wan et al.,

Network | Hidden units | Accuracy (%)
ANN (Srivastava et al., ) | 800 | 98.4
ANN (Srivastava et al., ) | 4096–4096 | 98.99
ANN (Wan et al., ) | 800–800 | 98.8
ANN (Goodfellow et al., ) | 240 × 5–240 × 5 | 99.06
SNN (O'Connor et al., )^{a,b} | 500–500 | 94.09
SNN (Hunsberger and Eliasmith, )^{a} | 500–300 | 98.6
SNN (Diehl et al., ) | 1200–1200 | 98.64
SNN (O'Connor and Welling, ) | 200–200 | 97.8
SNN (SGD, This work) | 800 | [98.56, 98.64, 98.71]^{*}
SNN (SGD, This work) | 500–500 | [98.63, 98.70, 98.76]^{*}
SNN (ADAM, This work) | 300–300 | [98.71, 98.77, 98.88]^{*}

Convolutional neural networks (CNNs) are currently the most popular architecture for visual recognition tasks. Since CNNs can effectively make use of the spatial structure of the visual world, we tested them on the standard MNIST benchmark (LeCun et al.,

Network | Preprocessing | # of networks | Accuracy (%)
CNN (Garbin et al., ) | None | 1 | 98.3
CNN (Diehl et al., ) | None | 1 | 99.1
Sparsely connected network (Esser et al., ) | Affine transformation | 64 | 99.42
CNN (This work) | Elastic distortion | 1 | 99.31

To investigate the potential of the proposed method for training directly on event stream data, we trained a simple fully connected network with one hidden layer on the N-MNIST dataset, a neuromorphic version of MNIST (Orchard et al.,

The previous state-of-the-art result had achieved 95.72% accuracy with a spiking CNN (Neil and Liu,

Model | Hidden units / architecture | Preprocessing | Accuracy (%)
ANN (Neil and Liu, ) | CNN | Yes | 98.3
SNN (Neil and Liu, ) | CNN | Yes | 95.72
SNN (Cohen et al., ) | 10,000 | No | 92.87
SNN (This work) | 800 | No | [98.56, 98.66, 98.74]^{*}

An SNN continuously generates output spikes, thereby improving the accuracy as it integrates input events over time. Each output spike can be interpreted as an instantaneous inference based on a small set of input spikes over a short period preceding the spike. This is true for dynamic spatio-temporal event patterns like the N-MNIST task as shown in Figure

Integration of inference for dynamic pattern recognition can also be achieved in ANNs by iteratively performing inference over multiple consecutive images and using a majority vote as the predicted output. To investigate this, we trained an ANN with the same network architecture as the SNN, but using images of accumulated events over consecutive 30-ms intervals. Since we generated frames from the events over only short periods, preprocessing such as stabilizing the position of digits was not required. No significant blurring caused by saccade motion was observed in the generated frames. The test accuracy for each single snapshot image using the ANN was 95.2%. This can be interpreted as an instantaneous inference in ANNs. To obtain the final prediction, we accumulated the outputs of the softmax layer for 10 frames. When combining the results over 10 image frames (i.e., 300 ms in total), the error rate of the ANN drops to 2.2%. This accumulation of predictions reduced the gap between the ANN and SNN in terms of accuracy practically to zero; however, it increased the computational cost for inference in the ANN far beyond that of the SNN. Figure
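The frame-accumulation scheme used for the ANN comparison can be sketched as follows (summing softmax outputs over consecutive frames and taking the argmax; frame generation itself is omitted, and the function name is our own):

```python
import numpy as np

def accumulate_predictions(frame_logits):
    """Integrate instantaneous inferences over consecutive frames by
    summing the softmax outputs, then report the class with the
    largest accumulated probability mass."""
    probs = []
    for z in frame_logits:
        e = np.exp(np.asarray(z, dtype=float) - np.max(z))  # stable softmax
        probs.append(e / e.sum())
    return int(np.argmax(np.sum(probs, axis=0)))
```

A class that wins weakly in one frame can thus be overruled by consistent evidence accumulated across the remaining frames, mirroring how the SNN integrates input events over time.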

We proposed a variant of the classic backpropagation algorithm, the most widely used supervised learning algorithm for deep neural networks, which can be applied to train deep SNNs. Unlike previously proposed techniques based on ANN-to-SNN conversion methods (Diehl et al.,

Recent advances in deep learning have demonstrated the importance of working with large datasets and extensive computational resources. Under these considerations, the MNIST benchmark must be considered too small for evaluating how architectures and learning methods scale to larger applications. Furthermore, the dataset is not meant as a benchmark for SNNs, because it does not provide spike events generated by real sensors. Nevertheless, it remains important, since new methods and architectures are still frequently evaluated on MNIST; in particular, almost all recently published SNN papers are tested on MNIST, making it the only dataset currently allowing broad comparisons. The N-MNIST benchmark (Orchard et al.,

Just as hardware acceleration through GPUs has been critical to advance the state of the art in conventional deep learning, there is also an increasing need for powerful hardware platforms supporting SNN training and inference. Parallelizing event-based updates of SNNs on current GPU architectures remains challenging (Nageswaran et al.,

Here we have presented only examples where spiking backpropagation was applied to feed-forward networks, but an attractive next goal would be to extend the described methods to recurrent neural networks (RNNs) (Schmidhuber,

JL developed the theory and performed the experiments. JL, TD, and MP wrote the paper.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.