Training multi-layer spiking neural networks with plastic synaptic weights and delays

Spiking neural networks are usually considered as the third generation of neural networks, which hold the potential of ultra-low power consumption on corresponding hardware platforms and are very suitable for temporal information processing. However, how to efficiently train the spiking neural networks remains an open question, and most existing learning methods only consider the plasticity of synaptic weights. In this paper, we proposed a new supervised learning algorithm for multiple-layer spiking neural networks based on the typical SpikeProp method. In the proposed method, both the synaptic weights and delays are considered as adjustable parameters to improve both the biological plausibility and the learning performance. In addition, the proposed method inherits the advantages of SpikeProp, which can make full use of the temporal information of spikes. Various experiments are conducted to verify the performance of the proposed method, and the results demonstrate that the proposed method achieves a competitive learning performance compared with the existing related works. Finally, the differences between the proposed method and the existing mainstream multi-layer training algorithms are discussed.


Introduction
Deep neural network (DNNs), as a mainstream algorithm of machine learning, has been applied to various fields, such as compute vision (He et al., 2016), speech separation (Subakan et al., 2021), path finding (Arulkumaran et al., 2017), etc.However, the current DNNs suffer from the problem of excessive power consumption, which limit their applications in energycritical environments (Zhang M. et al., 2021).In contrast, the nervous systems in biological brains require very little energy to handle complex tasks.As a combination of both, the spiking neural networks (SNNs) (Maass, 1997) inherit the existing mature structures and algorithms in the DNNs, and further learn from the way of using spike trains to transmit information between biological neurons.As a result, SNNs have more complex spatialtemporal dynamics and are suitable for ultra-low power devices (Lan et al., 2021;Pan et al., 2021;Zhang et al., 2022).However, at present, there is no training algorithm for SNNs that can give full play to their characteristics, so how to efficiently train SNNs is still an open question.
The current training algorithms for spiking neural networks include training the connection weights between neurons in the network and the delay in the transmission of spike trains between neurons.The first type of training algorithm is consistent with the goal of deep neural networks, which is to enable spiking neural networks to effectively complete tasks by training neural network weights.The training method can be divided into heuristic algorithms, conversion-based algorithms and BP-based algorithms.The representative algorithms of heuristic algorithms are STDP learning algorithms (Caporale and Dan, 2008) and its variants (Yha et al., 2020;Wu et al., 2021c).This class of algorithms is largely based on biological neuroscience findings that when neurons fire together, wire together.The second methods are conversion-based methods (Wu et al., 2021a,b) and their basic idea is to first train a DNN, and then convert the parameters in the trained network to the corresponding SNN through a series of methods.Compared with other state-of-the-art SNN implementations, the inference time and total synaptic operations of the network trained by this method are reduced by at least one order of magnitude.When the length of the simulation time is only eight-time steps, the conversion-based network still achieves good performance in large datasets.The third method is the method based on the surrogate gradient learning (Neftci et al., 2019;Zhu et al., 2021).Although this method can achieve competitive results with DNN in short time steps, this method needs to save the state information of the SNN at each moment.Therefore the computing power and storage requirements are large.The fourth is the event-driven training algorithm (Gütig, 2016;Neftci et al., 2016;Zhang M. et al., 2021;Luo et al., 2022), which only adjusts the network parameters according to the spikes, greatly reducing the training costs.
The above-mentioned learning algorithms for network weights are mostly used for processing static or periodic data and are not effective for fast time-varying signals.The reason is that simply adjusting the synaptic weights cannot effectively extract the rich time-dependent relationship between spike trains in SNNs, while in the biological nervous system, different synapses have various delays in transmitting spike trains (Zhang et al., 2020;Han et al., 2021).In order to further enhence the ability of the SNN model to process fast time-varying data, based on the original synaptic weight training, a training algorithm for the transmission delay between synapses is added.
However, there is biological evidence that the brain's biological synaptic latency is not a constant and there is no uniformity in the rules of latency variation (Sun et al., 2023), so this is also an open area of research.The DL-ReSuMe (Taherkhani et al., 2015a) algorithm is proposed to merge the delay shift approach and ReSuMe-based weight adjustment to enhance the learning performance.After that, Multi-DL-ReSuMe is proposed to train multiple neurons to classify spatiotemporal spiking patterns (Taherkhani et al., 2015b).Shrestha and Orchard (2018) proposes a general backpropagation mechanism for learning synaptic weights and axonal delays which overcomes the problem of nondifferentiability of the spike function and uses a temporal credit assignment policy for backpropagating error to preceding layers.Sun et al. (2022) proposes the rectified axonal delay (RAD) as an additional degree of freedom for training that can easily be incorporated into existing SNN frameworks.The new model can perform well on problems where timing matters using very few parameters.DW-ReSuMe (Han et al., 2021) is proposed to achieve a spike train learning task, which is combined with delay learning based on weight training.The RL-Squares-Based Learning Rule (Zhang Y. et al., 2021) is proposed to generate the desired spatiotemporal spike train.The gradient descent-based synaptic delay learning algorithm (Luo et al., 2022) is proposed to improve the sequential learning performance of single-spike neurons.
Although these methods increase the model's ability to process time-series-related data, they need to face the problem of exploding or vanishing gradients in the training process.It is necessary to select the hyper-parameters of the model and algorithm very carefully to make the model converge effectively.
The contribution of this paper includes the following points: 1.This paper introduces synaptic delays between neurons based on the Spike Response Model (SRM) (Gerstner, 1995) model, and combines the SpikeProp (Bohte et al., 2002)

Materials and methods
In this section, the basic theoretical knowledge of the neuron model that we used in this paper will be introduced first.Then, we will further introduce the learning algorithm of the SpikeProp.Finally, the proposed learning algorithm is presented and the processing of algorithm derivation is given in detail.

. Neuron model
Inspired by the biological brains, spiking neural networks (SNNs) (Maass and Bishop, 2001), which are often referred to as the third generation of artificial neural networks, employ a spike function to replicate the information transfer observed in biological neurons.SNNs possess the unique capacity for biological plasticity, allowing them to encode external inputs into spike trains (Izhikevich, 2003).When these spike trains are processed, they are reduced to two fundamental factors (Pfeiffer and Pfeil, 2018): (1) spike time, which involves the relative timing of pre-synaptic and post-synaptic spikes, and (2) synaptic type, encompassing attributes like excitatory or inhibitory properties and the strength of synaptic connections.
In SNNs, every neuron remains silent until it receives a spike.Once receiving incoming information, each neuron experiences changes in membrane voltage.Output spikes are generated only when the total membrane voltage surpasses the neuron's threshold θ , after which they propagate backward (Ghosh-Dastidar and Adeli, 2009).One widely utilized spiking neuron model is the spike response model [SRM (Gerstner, 1995)].The membrane voltage in the SRM is calculated as shown in Equation 1: where V l+1 j represents the membrane potential of the jth neuron in layer l + 1, and the t l i is the firing time of the ith neuron in layer l. w l+1 ij and d l+1 ij are the synaptic efficacy and delay between these two neurons, respectively.The kernel K(•), which determines the shape of postsynaptic potentials (PSPs), is defined as Equation 2: where τ m and τ s represent the membrane time constant of neurons, respectively.V norm is the result of PSPs being normalized, which makes the value of PSPs between 0 and 1 and is calculated by Equation 3: where β = τ m /τ s .
Figure 1 shows the SRM neuron model with synaptic delay.There are three pre-synaptic neurons with the weight w i and the delay time d i (i = 1, 2, 3).The delay time ensures the firing time at the synapse is delayed, which makes the membrane potential V(t) change between t 2 and t 3 .

. Learning algorithm of the SpikeProp
The SpikeProp algorithm, an improvement upon the traditional backpropagation algorithm for artificial neural networks (ANN), is designed to facilitate the learning process of multi-layer feedforward spiking neural networks (SNN) (Bohte et al., 2002).Notably, the algorithm imposes a constraint, allowing each neuron to fire at most once within each layer.The SpikeProp algorithm employs the time minimum mean square error function as its error function, which is illustrated in Equation 4: where t o k represents the time at which the output neuron k emits an actual spike, and t d k denotes its target spike time.SpikeProp utilizes the SRM model, which allows for the derivation of the relationship between firing time and membrane voltage through mathematical analysis.The weight update value is obtained by minimizing the mean square error, as defined in Equation 5: where w l ij stands for the gradient of the synaptic weight between ith presynaptic neuron in layer l − 1 and jth postsynaptic neuron in layer l. δ l j is the intermediate quantity for gradient calculation, which can be expressed as , and ) represents the unweighted postsynaptic potential.η is the learning rate.To simplify notation, we abbreviate V l j as V.However, when the membrane voltage reaches the firing threshold, it promptly resets to the resting potential, resulting in a spike emission at that specific moment, rendering it nondifferentiable.Consequently, the direct calculation of within δ l j becomes unfeasible.To address this challenge, SpikeProp introduces the concept of a linear hypothesis.This entails assuming that within a sufficiently small neighborhood around t = t l j , the membrane potential can be reasonably approximated as a linear function of time, which is defined as: . The proposed learning algorithm SpikeProp primarily focuses on adjusting synaptic weights, which is an essential aspect of biological synaptic plasticity.However, it overlooks another crucial element, synaptic delay plasticity, which imposes limitations on its overall performance and diminishes its biological interpretability.Conversely, in addressing the non-differentiability issue resulting from spike discontinuity, SpikeProp resorts to the approach outlined in Equation 6.Unfortunately, this solution introduces another challenge: the possibility of a gradient explosion due to the rapid membrane voltage change near the firing threshold.In the subsequent sections, we present solutions to these two problems individually.

. . Learning algorithms with synaptic weights and delay plasticity
To maintain generality, let's assume that the network in question is a multi-layer fully connected network, with the final layer designated as the oth layer.Similar to the approach employed in SpikeProp, the network adopts the loss function defined in Equation 4. Subsequently, based on this loss function and employing the error-backpropagation algorithm, the synaptic adjustment rules for layer l are formulated as Equation 7: where d l ij (w l ij ) represents the synaptic delay (weight) of the connection between the ith neuron in layer l − 1 and the jth neuron Since the adjustment of synaptic weights is the same as that of SpikeProp (i.e., Equation 5), next, we only elaborate on the learning of synaptic delay.
• Output layer: For the delay of the output layer d o jk , based on the chain rule, we have Equation 8: Similarly, for the convenience of description, we define the intermediate quantity δ o k as Equation 9: where ∂t o k /∂V(t o k ) can be solved by Equation 6like SpikeProp, but due to its drawbacks, we will give an alternative solution later.And the remaining terms in Equation 8 can be computed using Equation 10: where • Hidden layer: For the delay of hidden layer d l ij , we have Equation 12: where t l j is the spike time of the jth neuron in layer l.
To sum up, there are Equation 14: with While the majority of SNN algorithms traditionally focus solely on updating synaptic weights, we introduce a novel approach by incorporating the adjustment of synaptic delays.This augmentation allows us to achieve joint training of both synaptic weights and delays.This dual-training approach offers two significant advantages: addressing the silent window problem and expanding the parameter space.Silent windows, a common occurrence in spiking neural networks, refer to time periods where no spiking activity takes place, potentially undermining the learning process.Weight updates alone struggle to resolve this issue.However, incorporating delay learning can effectively adjust the distribution of input spikes and mitigate this problem.Moreover, the joint training of both synaptic weights and delays provides a more extensive set of tunable parameters compared to weight-only updates.This expanded parameter space enhances the model's flexibility and can lead to improved overall performance.

. . Gradient replacement strategy
According to the above process, it can be seen that the term ∂t l j /∂V(t l j ) in Equation 15is very important for gradient calculation.If its value is calculated according to Equation 6, it means that the derivative of the membrane voltage at the firing time will be critical.More specifically, the analysis of Figure 2A reveals that during a gradual crossing of the membrane voltage threshold, the derivative approaches zero.Consequently, this leads to a significant increase in the magnitude of |∂t l j /∂V(t l j )|, resulting in a phenomenon known as gradient explosion.To address this issue, we draw upon the insights from the rectangle replacement function (Wu et al., 2018), which has demonstrated enhanced convergence in ablation experiments.Building upon this framework, we propose a novel replacement function that mitigates the problem of gradient explosion.As a result, Equation 6 can be replaced by Equation 16, as shown below.
where m (0 < m < 1) is a constant, and there is currently no accepted theoretical method for finding the optimal m on any dataset.But a good way to get an appropriate m for a specific task is to use parameter search.
Figure 2B shows the shape of replaced weight.The closer the membrane potential is to the firing time than to the threshold, the larger ∂t l j /∂V(t l j ), which is consistent with the characteristics of the spike activity.However, it's important to emphasize that this value cannot become infinitely large since it would cause the value of Equation 6 to approach 0, leading the gradient to disappear.In our approach, we impose an upper limit, capping it at exp −2m 2 /τ .This constraint ensures that the influence on ∂t l j /∂V(t l j ) remains bounded.Actually, the function of the surrogate gradient can be various (Wu et al., 2018).

Results
In this section, we test the performance of the proposed algorithm on several different datasets, including  The larger probability value corresponds to the earlier spiking time and vice versa, and the spikes will not be emitted if the firing threshold is not reached, such as neurons " " and " ." Iris, Breast Cancer, Liver Disorders, Pima Diabetes, and Ionosphere.By comparison with other algorithms: SpikeProp (Shrestha and Song, 2015), SpikeTemp (Wang et al., 2015), SWAT (Wade et al., 2010), ReSuMe (Ponulak and Kasiński, 2010), SRESN (Dora et al., 2016), and MDL (Taherkhani et al., 2018), the proposed method has a good performance.The datasets can be divided into two types: real-value datasets and image-related datasets.Samples from these databases cannot be fed directly into the network and need to be encoded into spike sequences.In real-value datasets, the population encoding (Bohte et al., 2002;Shrestha and Song, 2015;Wang et al., 2015;Taherkhani et al., 2018) is used to convert these values to input spike trains, as shown in Figure 3.In image datasets, the latency encoding (Hopfield, 1995;Hu et al., 2013) is used, as shown in Figure 4.
In the output layer each output neuron corresponds to a category, and when a training sample is fed, the corresponding output neuron is trained to fire at desired output spike time t d generated by the dynamic decoding method (Luo et al., 2019;Zhang Y. et al., 2021), while the other neurons are kept silent.

. Classification of Iris dataset
As one of the most well-known pattern recognition databases, it is divided into three categories.Each category has 50 samples with four attributes: sepal length, sepal width, petal length, and petal width.Among them, 25 samples of each category were used as the training set, and the others were used as the test set.The network structure is 25-20-3, and a sample is considered correctly classified if either its target neuron fire the most spikes or the membrane potential of its target neuron is the maximum when none of the output neurons fire.The network architecture, training epochs, train and test accuracy of our works and the contrasting methods are all depicted in Table 1.The contrast of train and test accuracy of all methods are further illustrated in Figure 5. From the results, the proposed method outperforms these methods: SpikeProp, SWAT, MDL, ReSuMe, and SpikeTemp.SRESN achieved the best test accuracy, and the proposed model is only slightly below it.

. Classification of Breast Cancer dataset
The data were collected from clinical studies conducted between January 2014 and December 2014 and were derived from   method is 55-15-2.Two output neurons correspond to the benign sample and the malignant sample.If the benign output neuron fires the more spikes or has a larger membrane potential than the malignant one, the benign sample was correctly classified; and vice versa.The network architecture, training epochs, train and test accuracy of our works are shown in Table 2, along with those of the other methods.Additionally, the contrast of train and test accuracy of all methods are further illustrated in Figure 6.Experimental results show that this method outperforms most comparison algorithms including SpikeProp, SWAT, ReSuMe, SRESN, and MDL.And our model is only 0.4% lower than SpikeTemp. .

Classification of Liver Disorders dataset
This dataset consisted of 345 samples of seven attributes, amongst which the first five attributes are blood data related to the development of liver disease, and the sixth attribute is the number of alcoholic drinks per day.It is a binary classification, with half of the data in each category comprising the training set and the other half comprising the test set.The input neurons, hidden neurons and output neurons are 37, 15, and 2, respectively.If the target neuron of a sample fires the overwhelming spikes or has the larger membrane potential, this sample is considered to be correctly classified.The performance of the proposed method in Table 3 and Figure 7 comes from the average of 20 trials with 3,000 training epochs, which outperforms other methods in terms of test accuracy.

. Classification of Pima Diabetes dataset
This dataset contains women of at least 21 years of Pima Indian ancestry.It is also a binary classification to predict whether a patient has diabetes or not on the basis of eight attributes, including Pregnancy, Glucose, Glucose, etc.The data set includes 768 samples that are divided into training/test sets in a 1:1 ratio.The network structure is 55-20-2, and the conditions for correct classification are as follows: (1) the target output neuron of the input sample fires the most spikes, (2) the membrane potential of the target neuron overwhelms the other one when firing the same spikes.After 20 training trials of 3,000 epochs, the accuracy of our model is shown in Table 4. From Table 4 and Figure 8, the training and test accuracy of the proposed method are much higher than the all the other contrasting methods.  .

Classification of Ionosphere dataset
This dataset contains 351 samples of radar data collected by Johns Hopkins University.Each sample consists of 35 attributes, the first 34 attributes are contiguous, and the last attribute is the category label.This data set has two categories, and equal numbers of the samples from each category were added to the training set and test set respectively.The network architecture of this experiment is 205-25-2.When a sample is fed into the network, the category is determined if its target neuron emits the most spikes, or if the target neuron has the maximum membrane potential when emitting the same number of spikes as the other neuron.The network is trained 3,000 epochs in each trial, and the average of 20 trials is shown in Table 5.From Table 5 and Figure 9, the test accuracy of the proposed model ranks only below spikeTemp with a slight gap, but obviously higher than the other five comparison algorithms.

FIGURE
The training accuracy and test accuracy for various methods on Pima Diabetes database.The training accuracy and test accuracy of the proposed method is the best. .

Classification of MNIST dataset
To exploit and testify the image information learning capacity of the proposed method, we conducted experiments on the widely used MNIST dataset (LeCun et al., 1998), which is a popular choice in deep learning research (Mostafa, 2017;Tavanaei and Maida, 2019;Comsa et al., 2020).This dataset comprises 60,000 training samples and 10,000 testing samples, each of which has a visual scale of 28 × 28 pixels.The pixels are encoded as spike trains through the latency encoding method (Hopfield, 1995) and then fed in the proposed method with the architecture 784-800-10.We compare the experimental results with some effective ANN and SNN works, as detailed in Table 6.The results show that the proposed model has a comparable performance on image data, which outperforms the contrasting SNN models, and only trivially falls behind artificial neural networks in terms of accuracy.

Discussion
In this paper, we proposed a new supervised learning algorithm for multi-layer spiking neural networks, which considers the plasticity of both synaptic weights and delays.Various experiments are conducted to verify the performance of the proposed learning method, and the experimental results support its superiority.Actually, how to train multilayer spiking neural networks remains an open question.The existing learning rules can be classified as ANN-to-SNN, surrogate gradient method, and spike-driven learning algorithms.In the following, we will compare the proposed method with these methods.

. Compared to ANN-to-SNN methods
The ANN-to-SNN methods are proposed to avoid the difficult training of deep spiking neural networks.However, most of the ANN-to-SNN methods are based on the spike rate information, the time information has not been fully leveraged.In addition, the existing ANN-to-SNN conversion methods can only deal with image datasets.Those datasets with rich temporal information like speech and video can not be addressed by these methods.Our algorithm adopts temporal coding to make full use of the time information of spikes and has more potential to process these temporal datasets.

. Compared to surrogate gradient methods
Due to the non-differentiable spike function, directly training spiking neural networks is very difficult.To resolve this problem, surrogate gradient-based learning algorithms are proposed.By using a surrogate gradient, these methods do not need to calculate the exact gradients.However, these methods need to do backpropagation at every time step and cost a lot of computing sources.In contrast, our algorithm holds the potential of enabling training and inference in low-power devices.

. Compared to spike-driven methods
There are various spike-driven learning methods, such as SpikeProp and STDBP.However, most existing spike-driven learning algorithms only consider the plasticity of synaptic weights and ignore synaptic delay adjustment.In our algorithm, both the synaptic weights and delays are considered adjustable variables to improve both the biological plausibility and the learning performance.Experimental results demonstrate that our algorithm achieves a competitive learning performance compared with the existing related works.In the future, we would like to further extend the application of spike-driven learning algorithms on large-scale datasets and other practical applications (Liu and Li, 2022;Liu et al., 2022a,b).

FIGURE
FIGURESRM neuron model with synaptic delay.
layer l. η d(w) represents the learning rate of delays (weights).
FIGURE (A) Membrane potential reaches the threshold slowly, resulting in the exploding gradient.(B) The shape of the surrogate gradient function.

FIGURE
FIGURESchematic diagram of population encoding.Each neuron has its own Gaussian receiving function (di erent colors).I min and I max are the minimum and maximum values of the input current.When an input current I in is fed, the corresponding value on the Gaussian function is its probability value.Finally, these values are encoded into the spiking times distributed in [ , t].The larger probability value corresponds to the earlier spiking time and vice versa, and the spikes will not be emitted if the firing threshold is not reached, such as neurons " " and " ."

FIGURE
FIGURESchematic diagram of latency encoding.The subfigure on the left is an original image of the digit " " in the MNIST.Its pixel values determine the brightness of each pixel, with higher values corresponding to whiter areas.Depending on the value, we draw five rectangles (in the middle subfigure) of di erent widths to represent di erent pixel values, with wider rectangles representing larger values.The right subfigure is the spiking times of the selected five neurons by latency encoding.The larger the pixel value, the earlier the firing time.
TABLE Classification results of Iris database.andtest accuracy for various methods on Iris database.The best training accuracy of all methods is SpikeTemp, but the best test accuracy is SRESN which is .%higher than the proposed algorithm.TABLE Classification results of Breast Cancer database.
FIGUREThe training accuracy microscopic biopsy images of breast lumps in patients with breast cancer.Each sample has 10 attributes to describe the characteristics of the nucleus of the mass, including radius, texture, perimeter, and so on.It is divided into two types of data: benign and malignant cancers.This dataset has 569 instances, of which 357 are benign and the remaining 212 are malignant.Among them, 179 benign samples and 106 malignant samples make up the training set, and the rest were used as test sets.The network architecture of the proposedFIGUREThe training accuracy and test accuracy for various methods on Breast Cancer database.The training and test accuracy of the proposed method is the second, slightly lower than SpikeTemp.
The training accuracy and test accuracy for various methods on Liver Disorders database.The training accuracy of the SpikeTemp is the best, while the test accuracy of the proposed method is the best. FIGURE TABLE Classification results of Pima Diabetes database.
TABLE Classification results of Ionosphere database.TABLE Comparison with other works on MNIST dataset.
FIGUREThe training accuracy and test accuracy for various methods on Ionosphere database.The proposed method gets the best test accuracy and the smallest gap between training accuracy and test accuracy.