Synaptic Plasticity Dynamics for Deep Continuous Local Learning (DECOLLE)

A growing body of work underlines striking similarities between biological neural networks and recurrent, binary neural networks. A smaller body of work, however, addresses the similarities between learning dynamics employed in deep artificial neural networks and synaptic plasticity in spiking neural networks. The challenge preventing this is largely caused by the discrepancy between the dynamical properties of synaptic plasticity and the requirements for gradient backpropagation. Learning algorithms that approximate gradient backpropagation using local error functions can overcome this challenge. Here, we introduce Deep Continuous Local Learning (DECOLLE), a spiking neural network equipped with local error functions for online learning with no memory overhead for computing gradients. DECOLLE is capable of learning deep spatio-temporal representations from spikes relying solely on local information, making it compatible with neurobiology and neuromorphic hardware. Synaptic plasticity rules are derived systematically from user-defined cost functions and neural dynamics by leveraging existing autodifferentiation methods of machine learning frameworks. We benchmark our approach on the event-based neuromorphic datasets N-MNIST and DvsGesture, on which DECOLLE performs comparably to the state of the art. DECOLLE networks provide continuously learning machines that are relevant to biology and supportive of event-based, low-power computer vision architectures that match the accuracies of conventional computers on tasks where temporal precision and speed are essential.


Implementation of DECOLLE using Autodifferentiation
The equations of Deep Continuous Local Learning (DECOLLE) are very similar to those of a simple recurrent neural network. However, rather than performing backpropagation through time, the derivatives of U_i are computed by propagating the traces P_i forward in time, where f_loss is the loss function, e.g. the MSE loss. STOPGRAD prevents the flow of gradients by setting them to zero, and the construct STOPGRAD(A-B)+B is a common way of computing gradients through a separate subgraph; here, it is used to implement the surrogate gradient. Note that no time index appears on the variables. This underlines that DECOLLE does not need to store state histories, and that the variables necessary for computing the gradient are available within the same time step. In our implementation, the weights are updated online, at every time step. The cost of making the parameter update is no more than that of accumulating the gradients at each time step, since parameter memory and dynamical state memory are the same. Therefore, there is no overhead in updating online. Note that this may not be the case in dedicated neuromorphic hardware or AI accelerators that require different memory structures for storing parameter memory.
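The following is a minimal sketch of the STOPGRAD(A-B)+B construct in a PyTorch-style autograd framework, where detach() plays the role of STOPGRAD; the names smooth_step and spike_fn are illustrative and not part of the reference implementation.

```python
import torch

def smooth_step(u):
    # Differentiable surrogate of the hard threshold (illustrative choice: sigmoid).
    return torch.sigmoid(u)

def spike_fn(u):
    # Hard, non-differentiable spike function used for the forward value.
    s_hard = (u > 0).to(u.dtype)
    # Differentiable surrogate used only to route gradients.
    s_soft = smooth_step(u)
    # STOPGRAD(A - B) + B: the forward value equals s_hard,
    # the backward gradient is that of s_soft (the surrogate gradient).
    return (s_hard - s_soft).detach() + s_soft

u = torch.randn(4, requires_grad=True)
s = spike_fn(u)
s.sum().backward()
print(s)       # binary spikes
print(u.grad)  # sigmoid'(u), the surrogate gradient
```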

Complexity Overhead for Various Spiking Neuron Gradient-Based Training Approaches
We provide additional detail on the complexity of DECOLLE compared to other learning methods. In the current implementation of DECOLLE, all weight updates are applied immediately, so no additional memory is necessary to accumulate the gradients. In all other methods presented in (Tab. 1), the weight updates are applied in an epoch-wise fashion, which requires an additional variable to store the accumulated weight updates. However, this is an implementation choice that could have been made for methods other than DECOLLE as well. For this reason, the overhead of accumulating gradients in epoch-wise learning is ignored in the following calculations.
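As a minimal illustration of the distinction (the variable names and the stand-in gradient are hypothetical, not taken from the reference code), an epoch-wise scheme keeps a separate accumulator of the same size as the parameters, whereas the online scheme applies each per-step gradient immediately:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 5))   # parameters
lr, T = 1e-3, 100

def per_step_grad():
    # Stand-in for the gradient available at each time step.
    return rng.standard_normal(W.shape)

# Epoch-wise updates: an extra buffer of the same size as W is required.
dW = np.zeros_like(W)
for t in range(T):
    dW += per_step_grad()
W_epoch = W - lr * dW

# Online updates (as in DECOLLE's implementation): each per-step gradient
# is applied immediately, so no accumulator is needed.
W_online = W.copy()
for t in range(T):
    W_online -= lr * per_step_grad()
```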
DECOLLE The states P and Q must be maintained. These states are readily available from the forward pass, and therefore do not need to be stored specifically for learning. The space complexity is therefore O(1). Each weight update requires M N^r multiplications to obtain the M local errors. Each of these is multiplied by the number of inputs pN, resulting in O(pNM + MN^r) time complexity, where p is the fraction of connected neurons. Similarly to [2], the random weights in G^l can be computed using a random number generator, which requires one seed value per layer.
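For instance, the fixed random readout weights can be regenerated on demand from a single per-layer seed rather than stored. The sketch below illustrates this and the local-error computation; the shapes, the MSE-style error, and the seeding scheme are illustrative assumptions, not the exact configuration of the paper.

```python
import numpy as np

def readout_weights(layer_seed, n_readout, n_neurons):
    # G^l is fixed at initialization; a single per-layer seed is enough to
    # regenerate it, so it need not be kept in memory.
    rng = np.random.default_rng(layer_seed)
    return rng.standard_normal((n_readout, n_neurons))

n_neurons, n_readout = 256, 11
G_l = readout_weights(layer_seed=3, n_readout=n_readout, n_neurons=n_neurons)

s_l = (np.random.rand(n_neurons) < 0.05).astype(float)  # spikes of layer l
y_l = G_l @ s_l                                          # local random readout
target = np.zeros(n_readout); target[2] = 1.0            # pseudo-target
err = y_l - target                                       # local error (MSE gradient)
# The error is projected back through G^l before the per-neuron weight update:
per_neuron_err = G_l.T @ err
```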
SuperSpike When using the van Rossum distance (VRD), the SuperSpike learning rule requires one trace per connection, resulting in a space complexity of O(pNM). The additional complexity compared to DECOLLE is caused by the additional filter in the van Rossum distance. Note that if learning is applied directly to the membrane potentials, the space and time complexities are similar to those of DECOLLE.
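A rough Python sketch of why the per-connection storage appears (all symbols, filters, and constants here are illustrative, not the notation of the SuperSpike paper): the coincidence term between each presynaptic trace and each postsynaptic surrogate derivative must itself be low-pass filtered by the van Rossum kernel, giving one filtered trace per synapse.

```python
import numpy as np

N, M = 100, 10                 # presynaptic and postsynaptic neurons
pre_trace = np.zeros(N)        # per-neuron presynaptic trace, O(N)
elig = np.zeros((M, N))        # per-connection filtered trace, O(N*M)
tau_pre, tau_e, dt = 10.0, 20.0, 1.0

for t in range(100):
    s_pre = (np.random.rand(N) < 0.02).astype(float)   # input spikes
    u_post = np.random.randn(M)                         # membrane potentials (stand-in)
    surr = 1.0 / (1.0 + np.abs(u_post)) ** 2            # fast-sigmoid surrogate derivative
    pre_trace += dt / tau_pre * (-pre_trace) + s_pre
    # Filtering the outer product requires O(N*M) state per layer:
    elig += dt / tau_e * (-elig + np.outer(surr, pre_trace))
```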
e-prop In the case where no future errors are used, the complexity of e-prop [1] is similar to that of SuperSpike.

RTRL and BPTT
The complexity of these techniques is discussed in detail in [3].

C3D Network
We used a standard 3D convolutional network (C3D) for comparison with DECOLLE. In C3D, the temporal dimension is taken into account as the third dimension of the 3D convolution.
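As an illustration of this kernel layout (a generic PyTorch sketch, not necessarily the exact configuration used for the comparison), a 3D convolution receives input of shape (batch, channels, time, height, width), so the temporal dimension is convolved alongside the two spatial ones:

```python
import torch
import torch.nn as nn

# Event frames binned over time -> (batch, channels, time, height, width).
x = torch.randn(1, 2, 16, 32, 32)

# A 3x3x3 kernel convolves over time as well as the two spatial dimensions.
conv = nn.Conv3d(in_channels=2, out_channels=32, kernel_size=3, padding=1)
y = conv(x)
print(y.shape)  # torch.Size([1, 32, 16, 32, 32])
```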