Edited by: André van Schaik, Western Sydney University, Australia
Reviewed by: Mark D. McDonnell, University of South Australia, Australia; Michael Pfeiffer, Robert Bosch (Germany), Germany
*Correspondence: Emre O. Neftci
This article was submitted to Neuromorphic Engineering, a section of the journal Frontiers in Neuroscience
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
An ongoing challenge in neuromorphic computing is to devise general and computationally efficient models of inference and learning which are compatible with the spatial and temporal constraints of the brain. One increasingly popular and successful approach is to take inspiration from inference and learning algorithms used in deep neural networks. However, the workhorse of deep learning, the gradient-descent Back Propagation (BP) rule, often relies on the immediate availability of network-wide information stored in high-precision memory during learning, and on precise operations that are difficult to realize in neuromorphic hardware. Remarkably, recent work showed that exact backpropagated gradients are not essential for learning deep representations. Building on these results, we demonstrate an event-driven random BP (eRBP) rule that uses error-modulated synaptic plasticity to learn deep representations. Using a two-compartment Leaky Integrate & Fire (I&F) neuron, the rule requires only one addition and two comparisons per synaptic weight, making it highly suitable for implementation in digital or mixed-signal neuromorphic hardware. Our results show that eRBP learns deep representations rapidly, achieving classification accuracies on permutation-invariant datasets comparable to those obtained in artificial neural network simulations on GPUs, while remaining robust to neural and synaptic state quantizations during learning.
Biological neurons and synapses can provide the blueprint for inference and learning machines that are potentially 1,000-fold more energy efficient than mainstream computers. However, the breadth of application and scale of present-day neuromorphic hardware remains limited, mainly by a lack of general and efficient inference and learning algorithms compliant with the spatial and temporal constraints of the brain.
Thanks to their general-purpose, modular, and fault-tolerant nature, deep neural networks have become a popular and effective means for executing a broad set of practical vision, audition, and control tasks in neuromorphic hardware (Esser et al.,
The implementation of Gradient Back Propagation (hereafter BP for short) on a neural substrate is even more challenging (Grossberg,
Although previous work (Lee et al.,
eRBP builds on the recent advances in approximate forms of the gradient BP rule (Lee et al.,
The focus of eRBP is to achieve real-time, online learning at higher power efficiency compared to deep learning on standard hardware, rather than achieving the highest accuracy on a given task. The success of eRBP on these measures lays out the foundations of neuromorphic deep learning machines, and paves the way for learning with streaming spike-event data in neuromorphic platforms at proficiencies close to those of artificial neural networks.
This article is organized as follows: key theoretical and simulation results are provided in the results sections, followed by a general discussion and conclusion. Technical details of eRBP and its implementation are provided as the final section.
The central contribution of this article is event-driven RBP (eRBP), a presynaptic spike-driven plasticity rule modulated by top-down errors and gated by the state of the postsynaptic neuron. The idea behind this additional modulation factor is motivated by supervised gradient descent learning in artificial neural networks and biologically plausible models of three-factor plasticity rules (Urbanczik and Senn,
In gradient descent using a squared error cost function, weight updates for a neuron in layer
where
where
where
This choice is motivated by the fact that the activation function of I&F neurons with absolute refractory period can be approximated by a linear threshold unit (also known as rectified linear unit) with saturation whose derivative is exactly the boxcar function. In this case, the eRBP synaptic weight update consists of additions and comparisons only, and can be captured using the following operations for neuron
where
Given the dynamics of the second compartment, no multiplications are necessary for an eRBP update. The second compartment can be disabled after learning without affecting the inference dynamics. This rule is reminiscent of membrane-voltage-based rules, where spike-driven plasticity is induced only when the membrane voltage lies inside an eligibility window (Brader et al.,
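As a concrete illustration, the per-synapse arithmetic described above can be sketched as follows. This is a minimal Python sketch under assumed names (`w` for the weight matrix, `U` for the dendritic error compartment, `V` for the somatic membrane potential, and placeholder boxcar bounds `b_min`, `b_max`); it is not the paper's actual implementation, which runs in a continuous-time spiking simulator.

```python
import numpy as np

def erbp_update(w, pre_spikes, U, V, eta=1e-3, b_min=-1.0, b_max=1.0):
    """One eRBP weight-update step (illustrative variable names).

    w          : (n_pre, n_post) synaptic weight matrix
    pre_spikes : boolean (n_pre,) vector, True where a presynaptic spike occurred
    U          : (n_post,) dendritic compartment storing the integrated error
    V          : (n_post,) somatic membrane potentials
    """
    # Two comparisons per update: boxcar gate on the postsynaptic potential
    gate = (V > b_min) & (V < b_max)
    # One addition per eligible synapse: accumulate the error-modulated update
    w[np.ix_(pre_spikes, gate)] += eta * U[gate]
    return w
```

Note that the update is purely event-driven: it is evaluated only for synapses whose presynaptic neuron spiked, and the sign and magnitude of the change come entirely from the dendritic error variable.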
The realization of eRBP on neuromorphic hardware requires an auxiliary learning variable for integrating and storing top-down error signals during learning, which can be realized as a dendritic compartment. Given this variable, each synaptic weight update incurs only two comparison operations and one addition. Additions and comparisons can be implemented very naturally in neuromorphic VLSI circuits (Liu et al.,
We demonstrate eRBP in networks consisting of one and two hidden layers trained on permutation invariant MNIST and EMNIST (Table
Classification error on the permutation-invariant MNIST task (test set), obtained by averaging the test errors of the last 5 epochs (MNIST) or the last epoch (EMNIST).
PI MNIST 784-100-10 | 3.77 (3.23) | 2.89 (2.81) | 2.74 (2.64) | 3.19 (2.98) | 2.25 (2.19) | 2.44 (2.39) |
PI MNIST 784-200-10 | 3.53 (2.98) | 2.78 (2.53) | 2.13 (2.04) | 2.37 (2.33) | 1.85 (1.78) | 1.94 (1.88) |
PI MNIST 784-500-10 | 2.86 (2.57) | 2.34 (2.23) | 2.00 (1.96) | 2.09 (2.06) | 1.63 (1.60) | 1.88 (1.80) |
PI MNIST 784-200-200-10 | 2.96 (2.85) | 2.29 (2.22) | 2.50 (2.45) | 2.26 (2.25) | 1.80 (1.78) | 1.82 (1.74) |
PI MNIST 784-500-500-10 | 2.36 (2.28) | 2.02 (1.96) | 2.24 (2.0) | 2.34 (2.31) | 1.90 (1.86) | 1.69 (1.56) |
PI EMNIST 784-200-200-10 | 26.76 (25.26) | 21.83 (21.4) | 22.3 (20.18) | 32.37 (26.48) | 18.42 (16.06) | 18.23 (17.72) |
Network Architecture for Event-driven Random Backpropagation (eRBP) and example spiking activity after training a 784-200-200-10 network for 60 epochs. The network consists of feed-forward layers (
MNIST Classification error on fully connected artificial neural networks (BP and RBP) and on spiking neural networks (eRBP). Curves for eRBP were obtained by averaging across 5 simulations with different seeds.
When equipped with stochastic connections (multiplicative noise) that randomly blank out presynaptic spikes, the network performed better overall (labeled
The reasons why eRBP× performs better than the eRBP+ configuration cannot be attributed solely to its regularizing effect: as learning progresses, a significant portion of the neurons tend to fire near their maximum rate and to synchronize their spiking activity across layers as a result of large synaptic weights (and thus large presynaptic inputs). Synchronized spike activity is not well captured by firing rate models, an assumption underlying eRBP (see Section 5). Additive noise has a relatively small effect when the magnitude of the presynaptic input is large. However, multiplicative blank-out noise improves learning by introducing irregularity in the presynaptic spike trains even when presynaptic neurons fire regularly. This type of "always-on" stochasticity was also argued to approximate Bayesian inference with Gaussian processes (Gal and Ghahramani,
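Multiplicative blank-out noise of this kind is simple to sketch: each presynaptic spike is dropped independently with a fixed blank-out probability. The transmission probability below (0.55, i.e., a 0.45 blank-out probability as listed in the simulation parameters) and the function name are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def blank_out(pre_spikes, p_transmit=0.55):
    """Multiplicative blank-out noise: each presynaptic spike is transmitted
    independently with probability p_transmit (= 1 - blank-out probability).

    pre_spikes: boolean array of spike events at the current time step.
    """
    mask = rng.random(pre_spikes.shape) < p_transmit
    return pre_spikes & mask
```

Because the noise multiplies the spike train rather than adding to the input current, it decorrelates presynaptic activity even when the presynaptic neurons fire regularly at high rates.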
Overall, the learned classification accuracy with eRBP× is close to that obtained with offline training of neural networks (e.g., GPUs,
Transitions between two data samples of different class (digit) are marked by bursts of activity in the error neurons (Figure
Firing rate of data layer and error layer upon stimulus onset, averaged across 1,000 trials and all neurons in the layer. The large firing rate at the onset is caused by synchronized neural activity. The vertical line in the bottom figure depicts the 50
In future work involving practical applications on autonomous systems, it will be beneficial to interleave learning and inference stages without explicitly controlling the learning rate. One way to achieve this is to introduce a negative bias in the error neurons by means of a constant negative input, and an equal positive bias in the label neurons, such that the error neurons can only be active when an input label is provided
The presence of these bursts of error activity suggests that eRBP could learn spatiotemporal sequences as well. However, learning useful latent representations of sequences requires solving a temporal credit assignment problem at the hidden layer, a problem commonly solved with gradient BP-through-time in artificial neural networks (Rumelhart et al.,
The response latency of the 784-200-10 network after stimulus onset is about one synaptic time constant. Using the first spike after 2τ
Classification error in the 784-200-10 eRBP+ network as a function of the number of spikes in the prediction layer, and total number of synaptic operations incurred up to each output spike. To obtain this data, the network was first stimulated with random patterns, and the spikes in the output layer were counted after τ
In this example, classification using the first spike incurred about 100
The low latency response with high accuracy may seem at odds with the inherent firing rate code underlying the network computations (see Section 5). However, a code based on the time of the first-spike is consistent with a firing rate code, since a neuron with a high firing rate is expected to fire first (Gerstner and Kistler,
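A first-spike readout of the kind described above can be sketched as follows. The names `spike_trains` and `t_start` are hypothetical, with `t_start` playing the role of the 2τ offset after stimulus onset; this is an illustration of the readout, not the paper's simulation code.

```python
import numpy as np

def first_spike_prediction(spike_trains, t_start=0.0):
    """Classify by the output neuron that fires the earliest spike
    after t_start (e.g., two synaptic time constants after onset).

    spike_trains: list of 1-D arrays of spike times, one per output neuron.
    Returns the index of the winning neuron.
    """
    firsts = [t[t >= t_start].min() if np.any(t >= t_start) else np.inf
              for t in spike_trains]
    return int(np.argmin(firsts))
```

This readout is consistent with a rate code: the neuron driven most strongly has both the highest rate and, in expectation, the earliest spike.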
In the spiking simulations, weights are updated during the presentation of
These results are not entirely surprising since seminal work in stochastic gradient descent established that with suitable conditions on the learning rate, the solution to a learning problem obtained with stochastic gradient descent is asymptotically as good as the solution obtained with batch gradient descent (Le Cun and Bottou,
It is fortunate that synaptic plasticity is inherently “online” in the machine learning sense, given that potential applications of neuromorphic hardware often involve real-time streaming data.
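The "online" property referred to here can be made concrete with a toy sketch: per-sample stochastic gradient descent on a realizable linear regression recovers the same solution a batch method would. All names, sizes, and parameter values below are illustrative and unrelated to the networks in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Realizable linear regression: targets generated by a known weight vector.
X = rng.normal(size=(256, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def online_sgd(epochs=20, eta=0.05):
    """Update the weights after every single sample, as synaptic
    plasticity does, rather than after a full batch."""
    w = np.zeros(3)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            w -= eta * (x_i @ w - y_i) * x_i  # one per-sample gradient step
    return w

w_hat = online_sgd()
```

On this noiseless problem the per-sample iteration converges to `w_true`, illustrating the classical result that online SGD is asymptotically as good as batch descent under suitable learning-rate conditions.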
The online, event-based learning in eRBP combined with the reduced number of required dataset iterations suggests that learning on neuromorphic hardware can be particularly efficient. Furthermore, in neuromorphic hardware, only active connections in the network incur a SynOp. To demonstrate the efficiency of the learning, we report the number of multiply-accumulate (MAC) operations required for reaching a given accuracy compared to the number of synaptic operations (SynOps) in the spiking network for the MNIST learning task (784-200-200-10 network, Figure
Spiking neural networks equipped with eRBP with stochastic synapses (multiplicative noise) achieve SynOp-MAC parity on the MNIST task. The number of multiply-accumulate (MAC) operations required for reaching a given accuracy is compared to the number of synaptic operations (SynOps) in the spiking network for the MNIST learning task (784-200-200-10 network). Both networks require roughly the same number of operations to reach the same accuracy during learning. Only the MACs incurred in the matrix multiplications are taken into account; other necessary operations (e.g., additions, logistic function calls, and weight updates) were not counted, and including them would further favor the spiking network.
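The MAC count entering such a comparison follows directly from the layer sizes. The sketch below counts, per training sample, the MACs of the dense forward pass plus the random feedback projection of the output error to each hidden layer; as in the figure, weight-update products and non-MAC operations are excluded. The function names are our own.

```python
def forward_macs(layers):
    # One multiply-accumulate per weight per sample in a dense forward pass.
    return sum(n_in * n_out for n_in, n_out in zip(layers, layers[1:]))

def rbp_feedback_macs(layers):
    # Direct random feedback: the output error (size layers[-1]) is projected
    # through a fixed random matrix onto each hidden layer.
    return sum(layers[-1] * n_hidden for n_hidden in layers[1:-1])

layers = (784, 200, 200, 10)
per_sample = forward_macs(layers) + rbp_feedback_macs(layers)
```

For the 784-200-200-10 network this gives 198,800 forward MACs and 4,000 feedback MACs per sample, the baseline against which the SynOp count of the spiking network is compared.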
The spiking neural networks learn quickly initially (epoch 1 at 94%), but subsequent improvements become slower compared to the artificial neural network. The likely reasons for this slowdown are (1) the use of random backpropagation/direct feedback alignment, and (2) spikes emanating from error-coding neurons becoming very sparse toward the end of training, which prevents fine adjustments of the weights. We speculate that a scheduled or network-accuracy-based adjustment of the error neuron sensitivity would mitigate the latter cause. Such modifications, along with more sophisticated learning rules involving momentum and learning rate decay, are left for future work.
The effectiveness of stochastic gradient descent degrades when the precision of the synaptic weights using a fixed point representation is smaller than 16 bits (Courbariaux et al.,
Extended simulations suggest that the random BP performance at 10 bits precision is indistinguishable from unquantized weights (Baldi et al.,
The gradient descent BP rule is a powerful algorithm that is ubiquitous in deep learning, but when implemented in a neuromorphic substrate, it relies on the immediate availability of network-wide information stored with high-precision memory. More specifically, (Baldi et al.,
Taken together, our results suggest that general-purpose deep learning using streaming spike-event data in neuromorphic platforms at artificial neural network proficiencies is realizable.
Our experiments target neuromorphic implementations of spiking neural networks with embedded plasticity. Membrane-voltage based learning rules implemented in mixed-signal neuromorphic hardware (Qiao et al.,
Spiking neural networks, especially those based on I&F neuron types, severely restrict the computations available during learning and inference. With the wide availability of graphical processing units and future dedicated machine learning accelerators, the neuromorphic spike-based approach to learning machines is often heavily criticized as misguided. While this may be true for some hardware designs and for metrics based on absolute accuracy on most standardized benchmark tasks, neuromorphic hardware dedicated to embedded learning can have distinctive advantages thanks to: (1) asynchronous, event-based communication, which considerably reduces the communication between distributed processes, (2) natural exploitation of "rate" codes and "spike" codes where single spikes are meaningful, leading to fast, and thus power-efficient, and gradual responses (Figure
Many examples that led to the unprecedented success in machine learning have substantial overlap with equivalent neural mechanisms, such as normalization (Ioffe and Szegedy,
Our learning rule builds on feedback alignment, which demonstrated that random feedback can deliver useful teaching signals by aligning the feed-forward weights with the feedback weights (Lillicrap et al.,
Several approaches successfully realized the mapping of pre-trained artificial neural networks onto spiking neural networks using a firing rate code (O'Connor et al.,
An intermediate approach is to learn online with standard BP using spike-based quantization of network states (O'Connor and Welling,
STDP has been shown to be very powerful in a number of different models and tasks related to machine learning (Thorpe et al.,
Thus, there is considerable benefit in hardware implementations of synaptic plasticity rules that forego causal updates. Such rules, which we refer to as spike-driven plasticity, can be consistent with STDP (Brader et al.,
A common feature among spike-driven learning rules is a modulation or gating with a variable that reflects the average firing rate of the neuron, for example through calcium concentration (Graupner and Brunel,
The two compartment neuron model used in this work is motivated by conductance-based dynamics in Urbanczik and Senn (
This article demonstrates a local, event-based synaptic plasticity rule for deep, feed-forward neural networks achieving classification accuracies on par with those obtained using equivalent machine learning algorithms. The learning rule combines two features: (1) Algorithmic simplicity: one addition and two comparisons per synaptic update provided one auxiliary state per neuron and (2) Locality: all the information for the weight update is available at each neuron and the synapse. The combination of these two features enables synaptic plasticity dynamics for neuromorphic deep learning machines.
Our results lay out a key component for the building blocks of spike-based deep learning using neural and synaptic operations largely demonstrated in existing neuromorphic technology (Chicca et al.,
One limitation of eRBP is related to the "loop duration," i.e., the time from input onset to a stable response in the error neurons. This duration scales with the number of layers, raising the question of whether eRBP can generalize to very deep networks without impractical delays. Future work currently under investigation is to augment eRBP with recently proposed synthetic gradients (Jaderberg et al.,
It can be reasonably expected that the deep learning community will uncover many variants of random BP, including in recurrent neural networks for sequence learning and memory augmented neural networks. In tandem with these developments, we envision that such RBP techniques will enable the embedded learning of pattern recognition, attention, working memory, and action selection mechanisms which promise transformative hardware architectures for embedded computing.
This work has focused on unstructured, feed-forward neural networks and a single benchmark task across multiple implementations for ease of comparison. Limitations in deep learning algorithms are often invisible on “toy” datasets like MNIST (Liao et al.,
In artificial neural networks, the mean-squared cost function for one data sample in a single layer neural network is:
where
and where η is a small learning rate. In deep networks, i.e., networks containing one or more hidden layers, the weights of the hidden layer neurons are modified by backpropagating the errors from the prediction layer using the chain rule:
where the δ for the topmost layer is
In the random BP rule considered here, the BP term δ is replaced with:
where
In the context of models of biological spiking neurons, RBP is appealing because it circumvents the problem of calculating the backpropagated errors and does not require bidirectional synapses or symmetric weights. RBP works remarkably well in a wide variety of classification and regression problems, using supervised and unsupervised learning in feed-forward networks, with a small penalty in accuracy.
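The RBP idea can be illustrated with a minimal, self-contained sketch in a conventional (non-spiking) network: the hidden-layer error term is computed with a fixed random matrix `G` in place of the transpose of the output weights. The toy task, architecture, and all parameter values below are illustrative assumptions, not the networks used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy task: a 2-4-1 network learns to threshold the first input component.
X = rng.normal(size=(100, 2))
Y = (X[:, :1] > 0).astype(float)

W1 = rng.normal(0.0, 0.5, (2, 4))   # input -> hidden
W2 = rng.normal(0.0, 0.5, (4, 1))   # hidden -> output
G = rng.normal(0.0, 0.5, (1, 4))    # fixed random feedback matrix (replaces W2.T)

eta = 0.1
mse_before = np.mean((sigmoid(sigmoid(X @ W1) @ W2) - Y) ** 2)
for _ in range(2000):
    h = sigmoid(X @ W1)
    y = sigmoid(h @ W2)
    d_out = (y - Y) * y * (1 - y)        # exact error term at the output layer
    d_hid = (d_out @ G) * h * (1 - h)    # hidden error via random feedback
    W2 -= eta * h.T @ d_out / len(X)
    W1 -= eta * X.T @ d_hid / len(X)
mse_after = np.mean((sigmoid(sigmoid(X @ W1) @ W2) - Y) ** 2)
```

Only the output layer receives exact gradients; the hidden layer learns through the fixed random projection, which is the property eRBP exploits in its spiking form.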
The above BP rules are commonly used in artificial neural networks, where neuron outputs are represented as single scalar variables. To derive an equivalent spike-based rule, we start by identifying this scalar value with the neuron's instantaneous firing rate. The cost function and its derivative for one data sample are then:
where
Random BP (Equation 6) is straightforward to implement in artificial neural network simulations. However, spiking neurons and synapses, especially with the dynamics that can be afforded in low-power neuromorphic implementations, typically do not have arbitrary mathematical operations at their disposal. For example, evaluating the derivative of ϕ can be difficult depending on the form of ϕ, and multiplications between the multiple factors involved in RBP can become very costly given that they must be performed at every synapse for every presynaptic event.
In the following, we derive an event-driven version of RBP that uses only two comparisons and one addition per presynaptic spike to perform the weight update. The derivation proceeds as follows: (1) derive the firing rate ν, i.e., the equivalent of ϕ in the spiking neural network, (2) compute its derivative
The dynamics of spiking neural circuits driven by Poisson spike trains are often studied in the diffusion approximation (Wang,
where
In this case, the neuron's membrane potential dynamics is an Ornstein-Uhlenbeck (OU) process (Gardiner,
where
The firing rate of neuron
where “erf” stands for the error function. The firing rate of neuron
For gradient descent, we require the derivative of the neuron's activation function with respect to the weight
As in previous work (Neftci et al.,
In the considered spiking neuron dynamics, the Gaussian function is not directly available. Although, a sampling scheme based on the membrane potential to approximate the derivative is possible, here we follow a simpler solution: Backed by extensive simulations, and inspired by previously proposed learning rules based on membrane potential gated learning rules (Brader et al.,
The resulting derivative function is similar in spirit to straight-through estimators used in machine learning (Courbariaux and Bengio,
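The approximation described above amounts to replacing the bell-shaped derivative of the activation function with a boxcar on the membrane potential. A sketch follows; the bounds `b_min` and `b_max` are placeholder values, not the paper's parameters.

```python
import numpy as np

def boxcar_derivative(V, b_min=-1.0, b_max=1.0):
    """Surrogate for the derivative of the I&F activation function:
    1 when the membrane potential lies inside the eligibility window,
    0 outside. Analogous to a straight-through estimator."""
    return ((V > b_min) & (V < b_max)).astype(float)
```

Because the surrogate takes only the values 0 and 1, multiplying by it reduces to a gating decision, which is what allows the eRBP update to be expressed with comparisons and additions alone.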
For simplicity, the error
Each pair of error neurons synapse with a leaky dendritic compartment
The weight update for the hidden layers is similar, except that a random linear combination of the error is used instead of
All weight initializations are scaled with the number of rows and the number of columns as
In the following, we detail the spiking neuron dynamics that can efficiently implement eRBP.
The network used for eRBP consists of one or two feed-forward layers (Figure
(1)
where
(2)
where
where Θ is a boxcar function with boundaries
(3)
The spike trains at the data layer were generated using a stochastic neuron with instantaneous firing rate [exponential hazard function (Gerstner and Kistler,
where
Neural states and synaptic weight of the prediction neuron after 500 training examples.
In practice, we find that neurons tend to strongly synchronize in late stages of the training. The analysis provided above does not accurately describe synchronized dynamics, since one of the assumptions for the diffusion approximation is that spike times are uncorrelated. Multiplicative stochasticity was previously shown to be beneficial for regularization and decorrelation of spike trains, while being easy to implement in neuromorphic hardware (Neftci E. et al.,
We trained fully connected feed-forward networks on two datasets, the standard MNIST hand-written digits (LeCun et al.,
To keep the durations of the spiking simulations tractable, learning was run for 60 epochs (MNIST) or 30 epochs (EMNIST), compared to 1,000 epochs on the GPU. This is not a major limitation, since errors appear to converge earlier in the spiking neural network. During a training epoch, each training digit was presented for 250
All learning rates were kept fixed during the simulation. Other
We tested eRBP training on a spiking neural network based on the Auryn simulator (Zenke and Gerstner,
Parameters used for the continuous-time spiking neural network simulation implementing eRBP.
| Symbol | Description | Model | Value |
| --- | --- | --- | --- |
|  | Number of data neurons | All networks | 784 |
|  | Number of hidden neurons | All networks | 100, 200, 400, 1000 |
|  | Number of label neurons | All networks | 10 |
|  | Number of positive error neurons | All networks | 10 |
|  | Number of negative error neurons | All networks | 10 |
|  | Number of prediction neurons | All networks | 10 |
| σ | Poisson noise weight | eRBP+ | 50·10⁻³ |
|  |  | eRBP× | 0·10⁻³ |
|  | Blank-out probability | eRBP+ | 1.0 |
|  |  | eRBP× | 0.45 |
| τ | Refractory period | Prediction and hidden neurons | 3.9 |
|  |  | Data neurons | 4.0 |
| τ | Synaptic time constant | All synapses | 4 |
|  | Leak conductance (state) | Prediction and hidden neurons | 1 |
|  | Leak conductance (state) | Prediction and hidden neurons | 5 |
|  | Membrane capacitance | All neurons | 1 |
|  | Firing threshold | Prediction and hidden neurons | 100 |
|  |  | Error neurons | 100 |
|  | Number of training samples used | All figures | 50000 |
|  |  | Table | 10000 |
|  |  | Table | 1000 |
|  |  | Table | 10000 |
|  | Training sample duration | All models | 100 |
|  | Testing sample duration | Table | 500 |
|  |  | Table | 250 |
|  | Initial weight matrix | RBP, BP |  |
|  |  | eRBP+ |  |
|  |  | eRBP× |  |
|  |  | eRBP+, eRBP× | 90·10⁻³ nA |
|  |  | eRBP+, eRBP× | 90·10⁻³ nA |
|  |  | eRBP+, eRBP× | −90·10⁻³ nA |
|  |  | eRBP+, eRBP× | −1.15, 1.15 |
|  | 2nd hidden layer | eRBP+, eRBP× | −25, 25 |
|  | Figure | eRBP+, eRBP× | −0.6, 0.6 |
| β | Data neuron input scale | eRBP+, eRBP× | 0.5 |
| γ | Data neuron input threshold | eRBP+, eRBP× | −0.215 |
| η | Learning rate | eRBP+ | 6·10⁻⁴ nS |
|  |  | eRBP× | 10·10⁻⁴ nS |
|  |  | RBP, BP | 0.4/ |
|  | Minibatch size | RBP(100), BP(100) | 100 |
|  |  | RBP(1), BP(1) | 1 |
EN and GD: Designed and conducted the experiments, and wrote the paper. EN, GD, SP, and CA: Contributed software and tools.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
This work was partly supported by the Intel Corporation and by the National Science Foundation under grant 1640081, and the Nanoelectronics Research Corporation (NERC), a wholly-owned subsidiary of the Semiconductor Research Corporation (SRC), through Extremely Energy Efficient Collective Electronics (EXCEL), an SRC-NRI Nanoelectronics Research Initiative under Research Task ID 2698.003. We thank Friedemann Zenke for support with the Auryn simulator, Jun-Haeng Lee and Peter O'Connor for review and comments, and Gert Cauwenberghs, João Sacramento, and Walter Senn for discussions.
1Such logical “and” operation on top of a graded signal was previously used for conditional signal propagation in neuromorphic VLSI spiking neural networks (Neftci et al.,
2or equivalently, for the purpose of the derivative evaluation, the activation function is approximated as a rectified linear with hard saturation at