A scalable implementation of the recursive least-squares algorithm for training spiking neural networks

Training spiking recurrent neural networks on neuronal recordings or behavioral tasks has become a popular way to study computations performed by the nervous system. As the size and complexity of neural recordings increase, there is a need for efficient algorithms that can train models in a short period of time using minimal resources. We present optimized CPU and GPU implementations of the recursive least-squares algorithm in spiking neural networks. The GPU implementation can train networks of one million neurons, with 100 million plastic synapses and a billion static synapses, about 1,000 times faster than an unoptimized reference CPU implementation. We demonstrate the code's utility by training a network, in less than an hour, to reproduce the activity of more than 66,000 recorded neurons of a mouse performing a decision-making task. The fast implementation enables a more interactive in-silico study of the dynamics and connectivity underlying multi-area computations. It also opens the possibility of training models while in-vivo experiments are being conducted, thus closing the loop between modeling and experiments.


Generating target neural trajectories from PSTHs
For each recorded neuron, we converted its PSTH into target synaptic activities to be used for training the synaptic inputs. Specifically, in Figure 2 we obtained, for each spike rate $r_i^t$ ($i = 1, \dots, N$ and $t = T_{\rm init} + \Delta t, \dots, T_{\rm end}$, where $T_{\rm init} = -2$ s and $T_{\rm end} = 1$ s), the mean synaptic input $f_i^t$ that needs to be applied to the leaky integrate-and-fire neuron to generate the desired spike rate. To this end, we numerically solved the nonlinear rate equation
$$r_i^t = \phi\big(f_i^t, \sigma^2\big), \qquad (1)$$
where $\phi(m, \sigma^2)$ is the transfer function of the leaky integrate-and-fire neuron given the mean input, $m$, and the variance of the input, $\sigma^2$ [38,28]. This conversion yielded a set of target synaptic input vectors $f_1, \dots, f_N \in \mathbb{R}^T$, where $T = (T_{\rm end} - T_{\rm init})/\Delta t$, for the recorded neurons. Note that in Figure 2 we did not simulate the recurrent static connections ($K = 0$). Simulating both static connections and the external noise would require estimating $\sigma^2$ in Eq. (1). This can be done by assuming Poisson firing statistics and calculating $\sigma^2$ from the average connectivity and rates of the populations in the network [28]. Alternatively, it is possible to estimate $\sigma^2$ directly by simulating an initial network that fits the average rates and the level of noise (e.g., the Fano factor) observed in the data [9]. The results presented in this paper were obtained using the latter approach.
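To make the conversion concrete, the sketch below inverts the standard diffusion-approximation (Siegert) transfer function of the leaky integrate-and-fire neuron by root finding. It is a minimal illustration, not the paper's implementation: the membrane parameters, noise level, and root bracket are placeholder values.

```python
import numpy as np
from scipy.special import erf
from scipy.integrate import quad
from scipy.optimize import brentq

def lif_rate(m, sigma, tau_m=0.02, tau_ref=0.002, v_thr=1.0, v_reset=0.0):
    """Stationary firing rate (Hz) of an LIF neuron driven by white noise
    with mean input m and standard deviation sigma (diffusion approximation)."""
    a = (v_reset - m) / sigma
    b = (v_thr - m) / sigma
    integral, _ = quad(lambda u: np.exp(u**2) * (1.0 + erf(u)), a, b)
    return 1.0 / (tau_ref + tau_m * np.sqrt(np.pi) * integral)

def target_input(rate, sigma, lo=-1.0, hi=3.0, **kw):
    """Invert the rate equation r = phi(f, sigma^2) for the mean input f.
    The bracket [lo, hi] must contain the solution."""
    return brentq(lambda m: lif_rate(m, sigma, **kw) - rate, lo, hi)

# Example: mean input that produces 5 Hz at noise level sigma = 0.5.
f5 = target_input(5.0, 0.5)
```

Applying `target_input` to each time bin of a PSTH yields the target vector $f_i \in \mathbb{R}^T$ for that neuron.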

Network connectivity
In Figure 1A, the spiking neural network consisted of randomly connected $N_E$ excitatory and $N_I$ inhibitory neurons. They were recurrently connected as in [9]. In short, the recurrent synapses consisted of static weights $J$ that remained constant throughout training and plastic weights $W$ that were modified by the training algorithm. The static synapses connected neuron $j$ in population $\beta$ to neuron $i$ in population $\alpha$ with probability $p^{\alpha\beta} = K^{\alpha\beta}/N_\beta$ and synaptic weight $J^{\alpha\beta}/\sqrt{K^{\alpha\beta}}$, where $K^{\alpha\beta}$ is the average number of static connections from population $\beta$ to $\alpha$. The strength of the plastic synapses, $W^{\alpha\beta}/\sqrt{K^{\alpha\beta}}$, was of the same order as the static weights. However, the plastic synapses connected neurons with a smaller probability, $p_{\rm plas}^{\alpha\beta} = c\sqrt{K^{\alpha\beta}}/N_\beta$, which made the plastic synapses much sparser than the static synapses [39,9]. Here, $c$ is an order-1 parameter that depends on the training setup.
The static and plastic connections were non-overlapping in that any two neurons in the network could be connected by only one type of synapse.
Keeping them disjoint allowed us to maintain the initial network dynamics generated by the static synapses and, subsequently, introduce trained activity to the initial dynamics by modifying the plastic synapses.
The static recurrent synapses were strong in that the coupling strength between two connected neurons scaled as $1/\sqrt{K^{\alpha\beta}}$, while the average number of synaptic inputs scaled as $K^{\alpha\beta}$. As a result of this strong scaling, the excitatory and inhibitory synaptic inputs to a neuron from static synapses increased as $\sqrt{K^{\alpha\beta}}$, and thus were much larger than the spike threshold for large $K^{\alpha\beta}$. However, the excitatory and inhibitory currents were dynamically canceled, and, together with the external input, the sum was balanced to lie around the spike threshold [20,9].
In contrast to the static synapses, each trained neuron received only about $L = \sqrt{K^{\alpha\beta}}$ plastic synapses. This made the plastic connectivity much sparser than the already sparse static EI connectivity (e.g., with $K = 1000$ static synapses, there are on the order of $\sqrt{K} \approx 30$ plastic synapses per neuron). Consequently, the EI plastic inputs of the initial network were independent of $K^{\alpha\beta}$ and substantially weaker than the EI balanced inputs for large $K^{\alpha\beta}$. After training the plastic synapses, the total synaptic input, i.e., the sum of plastic and balanced inputs, to each trained neuron was able to follow the target patterns. With this scaling of plastic synapses, training was robust to variations in the network size, $N$, and the number of pre-synaptic connections, $K^{\alpha\beta}$ (Fig. 1B). In Figure 2, on the other hand, the network had only plastic synapses and no static synapses, i.e., $K = 0$. Without the static excitatory-inhibitory synaptic connections of Figure 1, the network lacked internally generated noise. Therefore, we injected external noise into each neuron's voltage equation to induce variability in spiking activity (Fig. 2C). The variance of this white noise, $\sigma^2$, was chosen such that the externally injected noise was similar to the internally generated fluctuating synaptic current present when static connections were included ($K = 1000$).
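As an illustration of this wiring scheme, the following sketch draws disjoint static and plastic connectivity for a single population. The function name `build_connectivity` and the parameter values are hypothetical, and the actual networks distinguish excitatory and inhibitory populations with population-specific $J^{\alpha\beta}$ and $K^{\alpha\beta}$.

```python
import numpy as np
from scipy import sparse

def build_connectivity(N, K, J0=1.0, c=1.0, rng=np.random.default_rng(0)):
    """Sketch of disjoint static/plastic wiring for one population.
    Static synapses: probability K/N, weight J0/sqrt(K) (balanced scaling).
    Plastic synapses: probability c*sqrt(K)/N, i.e., about c*sqrt(K) inputs
    per neuron, drawn only where no static synapse exists (disjoint)."""
    static_mask = rng.random((N, N)) < K / N
    plastic_mask = (rng.random((N, N)) < c * np.sqrt(K) / N) & ~static_mask
    J = sparse.csr_matrix(static_mask * (J0 / np.sqrt(K)))    # fixed weights
    W0 = sparse.csr_matrix(plastic_mask * (J0 / np.sqrt(K)))  # initial plastic weights
    return J, W0

J, W0 = build_connectivity(N=1000, K=100)
```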

Network dynamics
In the following mathematical description of network activity, the static connections present in Figure 1B ($K > 0$) produced the balanced inputs ($u^\alpha_{{\rm bal},i}$), which exhibited large fluctuations. For this reason, no additional external noise was injected into the network ($\sigma = 0$). On the other hand, as mentioned above, in Figure 2 (and also in Figs. 1C, E), the static connections were absent ($K = 0$) and external noise was injected into the neurons ($\sigma > 0$).
We used a leaky integrate-and-fire neuron to model the membrane potential dynamics of the $i$th neuron:
$$\tau_m \dot v^\alpha_i = -v^\alpha_i + u^\alpha_i + X^\alpha_i + \sigma\,\xi_i(t), \qquad (5)$$
where $\xi_i(t)$ is unit-variance white noise, and a spike is emitted and the membrane potential is reset to $v_{\rm reset}$ when the membrane potential crosses the spike threshold $v_{\rm thr}$. The membrane equations (Eq. 5), and the following synaptic equations (Eq. 8), were numerically solved using the forward Euler method.
Here, $u^\alpha_i$ is the total synaptic input to neuron $i$ in population $\alpha$, which can be divided into static and plastic inputs incoming through the static and plastic synapses, respectively:
$$u^\alpha_i = u^\alpha_{{\rm bal},i} + u^\alpha_{{\rm plas},i}.$$
$X^\alpha_i$ is the total external input, which can be divided into a constant external input, a plastic external input, and the stimulus:
$$X^\alpha_i = X^\alpha_{{\rm bal},i} + X^\alpha_{{\rm plas},i} + X^\alpha_{{\rm stim},i}.$$
$X^\alpha_{{\rm bal},i}$ is a constant input associated with the initial balanced network. It scales with the number of connections (i.e., it is proportional to $\sqrt{K^{\alpha\beta}}$), determines the firing rate of the initial network, and stays unchanged [20]. $X^\alpha_{{\rm plas},i}$ is the plastic input provided to trained neurons in the recurrent network by external neurons that emit stochastic spikes with pre-determined rate patterns. The synaptic weights from the external neurons to the trained neurons were modified by the training algorithm. $X^\alpha_{{\rm stim},i}$ is a predetermined stimulus, generated independently for each neuron from an Ornstein-Uhlenbeck process and injected into all neurons in the network to trigger the learned responses in the trained neurons.
The synaptic activity was modeled as an instantaneous jump of the synaptic input upon a presynaptic neuron's spike, followed by an exponential decay. Our network did not include time delays in the propagation of the presynaptic neuron's spike to the postsynaptic neuron; a spike therefore caused instantaneous jumps in the postsynaptic neurons. Since the static and plastic synapses did not overlap, we separated the total synaptic input into static and plastic components as mentioned above:
$$\tau_{\rm bal}\,\dot u^\alpha_{{\rm bal},i} = -u^\alpha_{{\rm bal},i} + \sum_{\beta}\sum_{j} J^{\alpha\beta}_{ij} \sum_{k} \delta(t - t^k_j), \qquad \tau_{\rm plas}\,\dot u^\alpha_{{\rm plas},i} = -u^\alpha_{{\rm plas},i} + \sum_{\beta}\sum_{j} W^{\alpha\beta}_{ij} \sum_{k} \delta(t - t^k_j). \qquad (8)$$
Alternatively, the synaptic activity can be expressed as a weighted sum of filtered spike trains, because the synaptic variable equations (Eq. 8) are linear in $J$ and $W$:
$$u^\alpha_{{\rm bal},i} = \sum_{\beta}\sum_{j} J^{\alpha\beta}_{ij}\, r_{{\rm bal},j}, \qquad u^\alpha_{{\rm plas},i} = \sum_{\beta}\sum_{j} W^{\alpha\beta}_{ij}\, r_{{\rm plas},j}, \qquad (9)$$
where
$$\tau_{\rm bal}\,\dot r_{{\rm bal},j} = -r_{{\rm bal},j} + \sum_k \delta(t - t^k_j)$$
(and similarly for $r_{{\rm plas},j}$) describes the dynamics of the synaptically filtered spike trains.
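A minimal forward-Euler step implementing Eqs. (5) and (8) might look as follows. This is a sketch under simplifying assumptions: the function name and all parameter values are placeholders, and the E/I population structure is omitted.

```python
import numpy as np

def euler_step(v, u_bal, u_plas, X, spiked, J, W, rng, dt=1e-4,
               tau_m=0.01, tau_bal=0.005, tau_plas=0.005,
               v_thr=1.0, v_reset=0.0, sigma=0.0):
    """One forward-Euler step: synaptic inputs decay exponentially and jump
    instantaneously on presynaptic spikes (Eq. 8); the membrane integrates
    the total synaptic and external input plus optional noise (Eq. 5)."""
    s = spiked.astype(float)
    u_bal += dt / tau_bal * (-u_bal) + J @ s      # static (balanced) input
    u_plas += dt / tau_plas * (-u_plas) + W @ s   # plastic input
    noise = sigma * np.sqrt(dt) * rng.standard_normal(v.shape)
    v += dt / tau_m * (-v + u_bal + u_plas + X) + noise
    spiked = v >= v_thr                           # detect threshold crossings
    v[spiked] = v_reset                           # reset spiking neurons
    return v, u_bal, u_plas, spiked
```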
Each external neuron emitted spikes stochastically at a pre-defined rate that changed over time. The rate patterns followed by the external neurons were randomly generated from an Ornstein-Uhlenbeck process with a mean rate of 5 Hz. The synaptically filtered external spikes were weighted by the plastic synapses $W^X$ and injected into the trained neurons:
$$X^\alpha_{{\rm plas},i} = \sum_j W^X_{ij}\, r^X_j. \qquad (11)$$
Similarly, the external stimulus $X_{{\rm stim},i}$ applied to each neuron $i$ in the network to trigger the learned response was generated independently from an Ornstein-Uhlenbeck process.
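For concreteness, the following sketch generates an Ornstein-Uhlenbeck rate pattern and draws Bernoulli-approximated Poisson spikes from it. The correlation time `tau` and fluctuation amplitude `amp` are assumed values; the text specifies only the 5 Hz mean rate.

```python
import numpy as np

def ou_rates(n_steps, dt, rng, mean_rate=5.0, tau=0.1, amp=2.0):
    """Ornstein-Uhlenbeck rate pattern (Hz) with a 5 Hz mean, clipped at zero."""
    r = np.full(n_steps, mean_rate)
    for t in range(1, n_steps):
        r[t] = (r[t-1] + dt / tau * (mean_rate - r[t-1])
                + amp * np.sqrt(2.0 * dt / tau) * rng.standard_normal())
    return np.clip(r, 0.0, None)

def poisson_spikes(rates, dt, rng):
    """Bernoulli approximation to an inhomogeneous Poisson spike train."""
    return rng.random(rates.shape) < rates * dt

rng = np.random.default_rng(0)
rates = ou_rates(10_000, 1e-4, rng)       # 1 s of rate trajectory at dt = 0.1 ms
spikes = poisson_spikes(rates, 1e-4, rng)
```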
In the following section, we will use the fact that Eqs. 9 and 11 are linear in $W$ and $W^X$ to derive the training algorithm that modifies the plastic synaptic weights.

Recursive least squares
Here we derive a synaptic update rule that modifies the plastic synapses to learn the target activities. The derivation closely follows previous papers [8,16,9]. For notational simplicity, we drop the neuron index i in w i and other variables (e.g., f i , u i ) but the same synaptic update rule is applied to all trained neurons. In fact, the plastic connections to every trained neuron can be updated simultaneously since each trained neuron has its own private target function and plastic connections. Therefore, the CPU multithreading (Fig. 1D) or GPU implementation (Figs. 1B, C) presented in our study leverages the fact that the synaptic update rule can be applied in parallel to all plastic synapses.
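Anticipating the update rule derived below, this per-neuron independence can be exploited by batching all trained neurons along a leading array axis, which maps directly onto CPU threads or GPU kernels. A minimal NumPy sketch of such a batched update (the helper name `rls_step_all` and all shapes are hypothetical):

```python
import numpy as np

def rls_step_all(w, P, r_pre, f):
    """One batched RLS step for all trained neurons at once.
    w: (N, L) plastic weights; P: (N, L, L) inverse correlation matrices;
    r_pre: (N, L) filtered presynaptic activities; f: (N,) target inputs.
    Each neuron's update is independent of all others."""
    Pr = np.einsum('nij,nj->ni', P, r_pre)          # P r for every neuron
    denom = 1.0 + np.einsum('ni,ni->n', r_pre, Pr)  # 1 + r'Pr
    k = Pr / denom[:, None]                         # per-neuron gain vectors
    e = f - np.einsum('ni,ni->n', w, r_pre)         # per-neuron errors
    return w + e[:, None] * k, P - np.einsum('ni,nj->nij', k, Pr)
```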
The cost function of each trained neuron $i$ is defined by
$$C(w) = \sum_{t}\big(f^t - u^t\big)^2 + \lambda\,\lVert w\rVert^2 + \mu\,R(w),$$
where $R(w)$ is the ROWSUM penalty on the summed excitatory and inhibitory plastic weights. Specifically, $w = (W_{i i_1}, \dots, W_{i i_L})$ is the vector of recurrent plastic presynaptic weights indexed by $i_1, \dots, i_L$; $f^t$ is the target function over time $t$; and $u^t$ is the total synaptic input over time $t$. The hyperparameters $\lambda$ and $\mu$ weight the standard $L_2$ and ROWSUM [16] regularization terms, respectively. The $L_2$ regularization controls the overall learning rate, and the ROWSUM regularization constrains the aggregate strength of excitatory and inhibitory synaptic weights onto the trained neuron.
The gradient of the cost function with respect to the weights $w$ is
$$\nabla_w C = -2\sum_{t}\big(f^t - w\cdot r^t\big)\, r^t + 2\lambda w,$$
where we substitute the expression $u^t = w\cdot r^t$ in the first line to evaluate the gradient with respect to $w$ (for clarity, the quadratic ROWSUM term is omitted in what follows). To derive the synaptic update rule, we set the gradient to zero at two consecutive time points, $n-1$ and $n$, which gives the normal equations
$$\Big(\sum_{m=1}^{n-1} r^m (r^m)' + \lambda I\Big)\, w^{n-1} = \sum_{m=1}^{n-1} f^m r^m, \qquad (15)$$
$$\Big(\sum_{m=1}^{n} r^m (r^m)' + \lambda I\Big)\, w^{n} = \sum_{m=1}^{n} f^m r^m. \qquad (16)$$
Subtracting Eqs. (15) and (16) yields
$$w^n = w^{n-1} + e^n\, P^n r^n, \qquad e^n = f^n - w^{n-1}\cdot r^n,$$
where
$$P^n = \Big(\sum_{m=1}^{n} r^m (r^m)' + \lambda I\Big)^{-1},$$
with the initial value $P^0 = \lambda^{-1} I$. To update $P^n$ iteratively, we use the Woodbury matrix identity
$$(A + UCV)^{-1} = A^{-1} - A^{-1}U\big(C^{-1} + VA^{-1}U\big)^{-1}VA^{-1},$$
where $A$ is invertible and $N \times N$, $U$ is $N \times T$, $C$ is invertible and $T \times T$, and $V$ is $T \times N$. Then $P^n$ can be calculated iteratively:
$$P^n = P^{n-1} - \frac{P^{n-1} r^n (r^n)'\, P^{n-1}}{1 + (r^n)'\, P^{n-1} r^n}.$$
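Written out for a single trained neuron, one RLS step is thus a rank-1 update of $P$ followed by a gain-weighted weight change. A minimal NumPy sketch (names and shapes are illustrative):

```python
import numpy as np

def rls_step(w, P, r, f):
    """One recursive least-squares update for a single trained neuron.
    w: (L,) plastic weights; P: (L, L) inverse correlation matrix;
    r: (L,) filtered presynaptic spike trains; f: scalar target input."""
    Pr = P @ r
    k = Pr / (1.0 + r @ Pr)   # gain vector; equals P^n r^n after the P update
    e = f - w @ r             # error e^n = f^n - w^{n-1} . r^n
    return w + e * k, P - np.outer(k, Pr)

# Initialization: P^0 = I / lambda.
L, lam = 30, 1.0
w, P = np.zeros(L), np.eye(L) / lam
```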
The following pseudocode provides an overview of the simulation and RLS training of spiking neural networks.
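In outline, each training iteration alternates forward-Euler simulation steps with RLS updates applied every few time steps, independently for each trained neuron. A compact, self-contained toy rendering of such a loop (single population, constant drive, placeholder targets; all values illustrative and no static synapses) might read:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes and parameters; the paper's values and E/I structure are omitted.
N, L = 100, 20                                   # neurons; plastic inputs per neuron
dt, tau_m, tau_s = 1e-3, 1e-2, 5e-3
v_thr, v_reset, lam = 1.0, 0.0, 1.0
n_steps, n_trials, update_every = 500, 10, 10

pre = np.stack([rng.choice(N, L, replace=False) for _ in range(N)])  # presynaptic ids
w = np.zeros((N, L))                              # plastic weights
P = np.stack([np.eye(L) / lam for _ in range(N)])  # inverse correlation matrices
f = 0.1 * rng.standard_normal((N, n_steps))       # placeholder target inputs

for trial in range(n_trials):
    v = rng.uniform(v_reset, v_thr, N)            # random initial voltages
    r = np.zeros(N)                               # synaptically filtered spike trains
    for t in range(n_steps):
        u = np.einsum('il,il->i', w, r[pre])      # plastic synaptic input (Eq. 9)
        v += dt / tau_m * (-v + u + 1.1)          # membrane update; 1.1 = constant drive
        spiked = v >= v_thr
        v[spiked] = v_reset
        r += dt / tau_s * (-r) + spiked / tau_s   # filtered spikes: jump + decay
        if t % update_every == 0:                 # RLS update, independent per neuron
            for i in range(N):
                ri = r[pre[i]]
                Pr = P[i] @ ri
                k = Pr / (1.0 + ri @ Pr)
                w[i] += (f[i, t] - w[i] @ ri) * k
                P[i] -= np.outer(k, Pr)
```

The inner per-neuron loop is where the CPU multithreaded and GPU implementations parallelize the computation.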