High-accuracy deep ANN-to-SNN conversion using quantization-aware training framework and calcium-gated bipolar leaky integrate and fire neuron

Spiking neural networks (SNNs) have attracted intensive attention due to their efficient event-driven computing paradigm. Among SNN training methods, ANN-to-SNN conversion is generally regarded as achieving state-of-the-art recognition accuracies. However, many existing ANN-to-SNN techniques impose lengthy post-conversion steps, such as threshold balancing and weight renormalization, to compensate for the inherent behavioral discrepancy between artificial and spiking neurons. In addition, they require a long temporal window to encode and process as many spikes as possible to better approximate the real-valued ANN neurons, leading to high inference latency. To overcome these challenges, we propose a calcium-gated bipolar leaky integrate and fire (Ca-LIF) spiking neuron model to better approximate the function of the ReLU neurons widely adopted in ANNs. We also propose a quantization-aware training (QAT)-based framework that leverages an off-the-shelf QAT toolkit for easy ANN-to-SNN conversion, directly exporting the learned ANN weights to SNNs with no post-conversion processing. We benchmarked our method on typical deep network structures with time-step lengths varying from 8 to 128. Compared to other studies, our converted SNNs achieved competitively high accuracy while enjoying relatively short inference time steps.

However, training deep SNNs is highly challenging because it is difficult to directly apply the backpropagation (BP) method to SNNs, owing to the inherent discontinuity of discrete spikes. A common indirect approach to overcome this problem is to train a structurally equivalent ANN model offline and then convert it to an SNN with the learned synaptic weights for inference, where the real-valued inputs and outputs of ANN neurons correspond to the rates of presynaptic (input) and postsynaptic (output) spikes of the SNN neurons (Diehl et al., 2015; Hunsberger and Eliasmith, 2016; Rueckauer et al., 2017; Zhang et al., 2019; Kim et al., 2020; Lee et al., 2020; Yang et al., 2020; Deng and Gu, 2021; Dubhir et al., 2021; Ho and Chang, 2021; Hu et al., 2021; Kundu et al., 2021; Li et al., 2021b; Bu et al., 2022; Liu et al., 2022). Although previous ANN-to-SNN techniques usually obtain state-of-the-art object recognition accuracies, they require complicated post-conversion fixes such as threshold balancing (Diehl et al., 2015; Rueckauer et al., 2017; Liu et al., 2022), weight normalization (Diehl et al., 2015; Rueckauer et al., 2017; Ho and Chang, 2021), spike-norm (Sengupta et al., 2019), and channel-wise normalization (Kim et al., 2020), to compensate for the behavioral discrepancies between artificial and spiking neurons. In addition, a few of those methods require a relatively long time window (e.g., 2,500 algorithmic discrete time steps; Sengupta et al., 2019) to allow sufficient spike emissions to precisely represent the real values of the equivalent ANNs. This incurs high latencies and additional computational overheads, severely compromising the efficiency of SNNs.
To mitigate the aforementioned overheads in ANN-to-SNN conversion, this study proposes a simple and effective deep ANN-to-SNN framework without any post-conversion tuning, whose converted SNN achieves high recognition accuracy within a relatively short temporal window (i.e., 128 down to 8 time steps). This framework adopts our proposed calcium-gated bipolar leaky integrate and fire (Ca-LIF) spiking neuron model to closely approximate the function of the ReLU neuron widely used in deep ANNs. It fully leverages an off-the-shelf quantization-aware training (QAT) toolkit to train ANNs with low-bit precision ReLU activations, which can be captured as the spike rate of the Ca-LIF neuron within a moderately short time window.
The rest of this article is organized as follows: Section 2 explains the background of neural networks, including the ReLU and the basic LIF neurons. Section 3 proposes our Ca-LIF spiking neuron model and the QAT-based ANN-to-SNN framework, which are validated with the experiments presented in Section 4. Section 5 summarizes this study.

2. Preliminaries

2.1. Convolutional neural network
The typical structure of a deep neural network (shown in Figure 1) is composed of alternating convolutional (CONV) layers for feature detection and pooling layers for dimensionality reduction, followed by stacked fully connected (FC) layers acting as a feature classifier. In a CONV layer, each neuron in a channel is connected via a shared weight kernel to a few neurons within a spatial neighborhood, called the receptive field (RF), in the channels of the preceding layer. In a pooling layer, each neuron aggregates the outputs of the neurons in a p × p spatial window from the corresponding channel of its preceding CONV layer, thereby realizing data dimensionality reduction and a small degree of translational invariance. In an FC layer, each neuron is fully connected to all neurons in its preceding layer. The neuron with the most active output in the final layer indicates the recognition result.
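As a concrete illustration of how these layers shrink the spatial dimensions, the short sketch below traces feature-map sizes through two CONV + pooling stages. The kernel sizes, channel count, and input resolution are illustrative only, not those of the networks evaluated later.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output spatial size of a CONV layer on square feature maps."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, p):
    """Output spatial size after non-overlapping p x p pooling."""
    return size // p

# Trace a toy 32x32 input through two CONV + pooling stages.
s = conv_out(32, 3, pad=1)   # 3x3 kernel, padding 1 keeps size 32
s = pool_out(s, 2)           # 16
s = conv_out(s, 3, pad=1)    # 16
s = pool_out(s, 2)           # 8
fc_inputs = s * s * 64       # flattened input size for the FC classifier,
print(s, fc_inputs)          # assuming 64 channels -> prints: 8 4096
```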

2.2. ReLU neuron in ANN
The output of the ReLU neuron widely used in ANNs is formulated as follows:

y = max(0, z),   (1a)

where z is the net summation, calculated as follows:

z = Σ_i w_i · x_i + b,   (1b)

with x_i as the i-th input value to the neuron, w_i the connecting weight, and b a bias term. Figure 2A depicts the ReLU function.

2.3. Basic LIF neuron in SNN
The LIF neuron is the most commonly adopted model in SNNs. It is biologically plausible, with an internal state variable called the membrane potential V_m (initialized to 0 at the beginning of the spike trains of every input image), and exhibits rich temporal dynamics. Once the neuron receives a spike event via any of its synapses, the corresponding synaptic weight w_i is integrated into its V_m. Meanwhile, the neuron leaks linearly all the time. The event-driven LIF model can be described as follows:

V_m(t_k) = V_m(t_{k-1}) + w_{i(k)} − λ · (t_k − t_{k-1}),   (2)

where t_k and i(k) are the algorithmic discrete time step and the index of the synapse when and where the k-th input presynaptic spike arrives, respectively, and λ is a constant leakage at every time step. Whenever V_m crosses a pre-defined threshold V_th > 0, the neuron fires a postsynaptic spike to its downstream neurons and resets V_m by subtracting V_th from it. Suppose an input image has a presentation window of T time steps (i.e., the length of the spike trains encoded from the image pixels); one can estimate the total output spike count of the LIF neuron as follows (Lee et al., 2020):

y_s = max(0, floor(z_s / V_th)),   (3a)

where floor returns the largest integer no larger than its argument, and z_s is the net integration across all the T time steps:

z_s = Σ_i w_i · x_si − λT,   (3b)
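A minimal simulation of this subtract-reset LIF dynamics illustrates the spike-count estimate of Equation (3a). The spike schedule below is illustrative, and the leakage λ is set to 0 for simplicity; with all-positive weights, the simulated count matches the floor estimate exactly.

```python
import math

def lif_spike_count(spike_times, weights, v_th, leak, T):
    """Basic LIF with subtract-reset: integrate weighted input spikes,
    leak a constant amount every step, fire whenever V_m >= V_th."""
    v_m, count = 0.0, 0
    for t in range(T):
        for i, times in enumerate(spike_times):
            if t in times:
                v_m += weights[i]
        v_m -= leak
        while v_m >= v_th:      # fire and subtract-reset
            v_m -= v_th
            count += 1
    return count

T, v_th, leak = 16, 1.0, 0.0                         # leak set to 0 here
weights = [0.4, 0.3]
spike_times = [{0, 2, 4, 6, 8, 10}, {1, 3, 5, 7}]    # x_s = (6, 4)
z_s = weights[0] * 6 + weights[1] * 4 - leak * T     # net integration
print(lif_spike_count(spike_times, weights, v_th, leak, T),
      math.floor(z_s / v_th))                        # prints: 3 3
```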

FIGURE 1
Typical structure of deep neural networks. In convolutional (CONV) layers, the receptive field (RF) is a spatial neighborhood around a neuron in a channel, connected by a shared weight kernel to the next layer's neurons. The pooling layer reduces the size of its preceding CONV layer's feature map through a pooling window. Each neuron in a fully connected (FC) layer is connected to all the neurons in its previous layer. The outputs of the final layer indicate the image object recognition result.

FIGURE 2
Input-output relationships of (A) the ReLU neuron, (B) the basic LIF neuron, and (C) the quantized ReLU approximation based on rounding (Deng and Gu, 2021). In (A), when the input z > 0, the output y = z; otherwise, y = 0. In (B), z_s is the integrated membrane input across the total time steps and V_th is the threshold. In (C), z_Q and y_Q are the input and output of the rounding-based quantized ReLU function.

with x_si being the total count of input spikes via synapse i. Equation (3a) is depicted in Figure 2B.
3. Materials and methods

3.1. Motivation
From the similarities between Equations (1) and (3) and between Figures 2A, B, it appears that the LIF neuron can approximate the ReLU function by treating its pre- and postsynaptic spike rates or counts x_si and y_s as ReLU's input and output values x_i and y. The leakage term −λT in Equation (3b) acts as the bias b in Equation (1b). Thus, we can first train a deep ANN using standard BP and then export the learned weights and biases to a structurally equivalent SNN of LIF neurons for inference.
However, three challenges hinder such a direct ANN-to-SNN conversion: 1) The input and output spike counts of the LIF neuron are discrete integers, while ReLU allows continuous-valued inputs and outputs. In particular, y_s in Figure 2B is a scaled (by a factor of 1/V_th) staircase-like approximation of the ReLU output y in Figure 2A. To reduce this discrepancy, a long time window is often needed to generate sufficient output spikes, resulting in high inference latency.
2) Due to the extra temporal dimension of the LIF neuron, Equation (3a) may sometimes be significantly violated. As illustrated in Figure 3, early input spikes via positive synaptic weights trigger output spikes that cannot be canceled out by later input spikes via negative synaptic weights, because the information accumulated into the negative V_m of the LIF neuron cannot be passed on to other neurons via any output spikes. Therefore, even when the LIF neuron has weights and input values x_si = x_i identical to those of the ReLU neuron, with the leakage constant set to λ = −b/T, the LIF output spike count y_s can still severely deviate from the ReLU output y and largely violate Equation (3a).
3) The floor(z_s/V_th) operation in Equation (3a), a consequence of the discrete firing threshold mechanism, leads to a shift of V_th/2 along the positive z_s axis in Figure 2B compared with the rounding-based quantized ReLU approximation shown in Figure 2C (Deng and Gu, 2021). Indeed, a better approximation to the ReLU neuron calls for a round operation instead of the floor function, to obtain statistically zero-mean quantization errors (Deng and Gu, 2021).
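The order-dependence behind the second challenge can be reproduced in a few lines. In this toy event sequence (our own, not from the article), the same set of weighted inputs is fed to a no-leak LIF neuron in two different orders; the net input z_s is zero in both cases, yet the spike counts differ.

```python
def lif_count(inputs, v_th=1.0):
    """Feed a sequence of weighted input events into a no-leak LIF
    neuron with subtract-reset and return its output spike count."""
    v_m, count = 0.0, 0
    for w in inputs:
        v_m += w
        while v_m >= v_th:
            v_m -= v_th
            count += 1
    return count

pos_first = [0.5] * 4 + [-0.5] * 4   # net input z_s = 0
neg_first = [-0.5] * 4 + [0.5] * 4   # same events, reversed order
print(lif_count(pos_first), lif_count(neg_first))   # prints: 2 0
```

The positive-first ordering emits two spikes that the later negative inputs can no longer retract, whereas the negative-first ordering matches the ReLU value of 0.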
To overcome the first challenge, we leverage QAT ANN training toolkits to produce an ANN with low-precision ReLU outputs while minimizing the accuracy loss compared to a full-precision ANN. The complete QAT-based ANN-to-SNN framework is proposed in Section 3.3. For the other two challenges, we propose the Ca-LIF neuron model. It preserves the spike-based, event-driven nature of a biological neuron while mathematically aligning better with the (quantized) ReLU curve regardless of the input spike arrival order, as introduced later.

FIGURE 3
Comparison of the input-output relationships of the ReLU neuron, the basic LIF neuron, and the proposed Ca-LIF neuron.

3.2. The proposed Ca-LIF spiking neuron model
We propose the Ca-LIF spiking neuron model to correct the output mismatches between the basic LIF model and the quantized ReLU function exhibited in Figure 3. It performs the same leaking and integration operations as in Equation (2) but employs a slightly different firing mechanism. The Ca-LIF neuron holds symmetric thresholds V_th > 0 and −V_th < 0. Once its V_m up-crosses V_th, or down-crosses −V_th with the gating condition y_s(t−1) > 0 satisfied, the neuron fires a positive or negative spike, respectively. Here, y_s in the Ca-LIF neuron represents the signed output spike count, i.e., the positive output spike count minus the negative output spike count. In fact, y_s resembles the calcium ion concentration (Ca2+) in a biological neuron (Brader et al., 2007). Note that if a Ca-LIF neuron receives a negative spike sent by another neuron via its synapse i, −w_i is instead integrated into the V_m in Equation (2). The neuron resets by adding V_th to V_m after it fires a negative output spike.
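A sketch of this bipolar firing rule, simplified to a no-leak, event-driven form with an illustrative input sequence of our own, shows how the gated negative spikes cancel early positive spikes: the signed count returns to zero whenever the net input is zero, regardless of arrival order.

```python
def ca_lif_signed_count(inputs, v_th=1.0):
    """Ca-LIF (no-leak sketch): a positive spike fires on V_m >= V_th
    (subtract-reset); a negative spike fires on V_m <= -V_th only while
    the signed count y_s (the calcium-like variable) is positive."""
    v_m, y_s = 0.0, 0
    for w in inputs:
        v_m += w
        while v_m >= v_th:
            v_m -= v_th
            y_s += 1            # positive output spike
        while v_m <= -v_th and y_s > 0:
            v_m += v_th
            y_s -= 1            # negative spike cancels an earlier one
    return y_s

pos_first = [0.5] * 4 + [-0.5] * 4   # net input z_s = 0
neg_first = [-0.5] * 4 + [0.5] * 4
print(ca_lif_signed_count(pos_first),
      ca_lif_signed_count(neg_first))   # prints: 0 0
```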
Moreover, as mentioned earlier, the spiking neuron should perform a rounding function in place of the floor operation on (z_s/V_th) in Equation (3a) to better align with the quantized ReLU behavior. Mathematically, the Ca-LIF neuron should execute:

y_s = max(0, round(z_s / V_th)).   (3c)

To achieve this, after all the spike events input to the SNN composed of Ca-LIF neurons have been processed, each Ca-LIF neuron in the first SNN layer whose V_m lies between V_th/2 and V_th (or between −V_th and −V_th/2, with y_s > 0) is forced to fire a positive (or negative) spike. These rounding spikes propagate to the Ca-LIF neurons in subsequent layers, triggering their own rounding spikes based on their halved thresholds ±V_th/2. This proceeds until the final layer is completed.

FIGURE 4
The testing accuracies of the SNNs (T = 128) converted from quantized ANNs. The number in brackets below each accuracy is the loss of the converted SNN compared with the corresponding quantized ANN.

3.3. QAT-based ANN-to-SNN conversion framework
Using the aforementioned Ca-LIF neurons, we now detail the simple QAT-based ANN-to-SNN conversion framework. First, utilize any off-the-shelf QAT toolkit to train a deep quantized ANN. Next, export the learned ANN weights to an SNN composed of Ca-LIF neurons organized in the same network structure as the ANN, and analytically determine the neuron thresholds. Typically, a QAT toolkit provides the low-bit precision mantissa w^Q_i and the associated scaling factor S_w of each learned quantized weight in the ANN, as well as the bias b and the input and output scaling factors S_x and S_y of the neurons. A quantized ReLU neuron performs inference with these learned parameters as follows:

y^Q = max(0, round((S_w · S_x · Σ_i w^Q_i · x^Q_i + b) / S_y)),   (4)

where the superscript Q denotes quantized values. By comparing the forms of Equations (3c) and (4), it can be found that if we simply set

V_th = S_y / (S_w · S_x),   λ = −b / (S_w · S_x · T),   (5)

for a Ca-LIF neuron, it can seamlessly replace the quantized ReLU neuron and reproduce its input-output relationship of Figure 2C in the form of spike counts, with exactly the same learned weights.
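This parameter mapping can be checked numerically. The sketch below assumes the scale mapping V_th = S_y/(S_w·S_x) and λ = −b/(S_w·S_x·T) — our reading of the conversion rule of Equation (5), with the quantized weight mantissas used directly as synaptic weights, not a verbatim reproduction — and verifies in closed form that the Ca-LIF spike count reproduces the rounding-based quantized ReLU output on toy numbers.

```python
def quantized_relu(w_q, x_q, b, s_w, s_x, s_y):
    """Rounding-based quantized ReLU output, in integer (mantissa) form."""
    z = s_w * s_x * sum(w * x for w, x in zip(w_q, x_q)) + b
    return max(0, round(z / s_y))

def ca_lif_count(w_q, x_s, b, s_w, s_x, s_y, T):
    """Closed-form Ca-LIF signed spike count under the assumed mapping
    V_th = S_y/(S_w*S_x), lambda = -b/(S_w*S_x*T), with the quantized
    weight mantissas used directly as synaptic weights."""
    v_th = s_y / (s_w * s_x)
    lam = -b / (s_w * s_x * T)
    z_s = sum(w * x for w, x in zip(w_q, x_s)) - lam * T
    return max(0, round(z_s / v_th))

w_q, x_q = [3, -2], [10, 20]                 # toy mantissas / spike counts
s_w, s_x, s_y, b, T = 0.1, 0.05, 0.2, 0.4, 128
print(quantized_relu(w_q, x_q, b, s_w, s_x, s_y),
      ca_lif_count(w_q, x_q, b, s_w, s_x, s_y, T))   # prints: 2 2
```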
In addition, a neuron in an average pooling layer of the quantized ANN performs a quantized linear operation as follows:

y^Q = round(Σ_{j∈PW} x^Q_j / p²),

where PW denotes the set of ReLU neurons in the p × p pooling window connecting to the pooling neuron. Such a pooling neuron can also be approximated by our Ca-LIF neuron, but without the y_s gating constraint on negative firing, with its V_th being p², the leakage constant λ being 0, and all synaptic weights being 1.
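The pooling case can be sketched the same way, on a toy 2 × 2 window of spike counts of our own choosing; the final half-threshold rounding spike follows the mechanism described in Section 3.2.

```python
def spiking_avg_pool(x_s, p):
    """Average pooling via a spiking neuron: all weights 1, no leak,
    V_th = p*p, negative firing ungated. A final rounding spike fires
    if the residual membrane reaches half threshold."""
    v_th = p * p
    v_m, y_s = 0, 0
    for count in x_s:           # integrate each input's spike count
        v_m += count
        while v_m >= v_th:
            v_m -= v_th
            y_s += 1
    if v_m >= v_th / 2:         # rounding spike (Section 3.2 mechanism)
        y_s += 1
    return y_s

window = [3, 1, 2, 2]           # spike counts in a 2x2 pooling window
print(spiking_avg_pool(window, 2), round(sum(window) / 4))  # prints: 2 2
```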

4. Experiments

4.1. Benchmark datasets
We evaluated our method on five image datasets: MNIST, CIFAR-10, CIFAR-100, Caltech-101, and Tiny-ImageNet. Their image resolutions, numbers of object categories, and training/testing subset partitions are listed in Table 1. The MNIST dataset contains 28 × 28 handwritten digit images of 10 classes, i.e., the digits 0-9, divided into 60,000 training samples and 10,000 testing samples. The CIFAR-10 dataset contains 10 object classes, with 50,000 training images and 10,000 testing images of size 32 × 32. CIFAR-100 holds 100 object classes, each with 500 training samples and 100 testing samples. The Caltech-101 dataset consists of 101 object categories, each holding 40-800 image samples of size 300 × 200 pixels. The Tiny-ImageNet benchmark comprises 200 object classes, each with 500 training samples and 50 testing samples of size 64 × 64.

FIGURE 5
The inference accuracy of the converted VGG-16 SNN on the CIFAR-10 and CIFAR-100 datasets using varying numbers of time steps.
We employed the inter-spike interval (ISI) coding method (Guo et al., 2021) to encode pixel values into spikes. The pixel brightness Pix (for color images, the color component in each of the red, green, and blue channels) was converted to a spike train with N spikes in a T-time-step window, where N = floor(α · T · Pix / Pix_max), floor(x) returns the largest integer no larger than x, Pix_max is the maximum value a pixel can reach (for the 8-bit image pixels used in our work, Pix_max = 255), and α ≤ 1 controls the spike rate, set to 1 throughout our experiments unless otherwise stated. The n-th spike occurred at time step t_n = floor(n · t_int), where t_int = T / (α · T · Pix / Pix_max) = Pix_max / (α · Pix) ≥ 1 was the non-rounded temporal interval between two successive spikes. In particular, the brightest pixel value of 255 was converted to a spike train of N = floor(α · T · 255/255) = T spikes with t_int = α = 1; in other words, its spike train reached the maximum rate of one spike per time step.
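The ISI encoder described above can be transcribed directly from these formulas (the function and variable names are ours):

```python
import math

def isi_encode(pix, T, alpha=1.0, pix_max=255):
    """ISI rate coding: N = floor(alpha*T*pix/pix_max) spikes spread
    evenly over a T-step window; returns the spike time steps t_n."""
    n_spikes = math.floor(alpha * T * pix / pix_max)
    if n_spikes == 0:
        return []
    t_int = pix_max / (alpha * pix)     # non-rounded inter-spike interval
    return [math.floor(n * t_int) for n in range(1, n_spikes + 1)]

print(isi_encode(255, 8))   # prints: [1, 2, 3, 4, 5, 6, 7, 8]
print(isi_encode(128, 8))   # prints: [1, 3, 5, 7]
```

As expected, the brightest pixel produces one spike per time step, while a half-bright pixel produces spikes at roughly double the spacing.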

4.2. Network structure configuration
We adopted five typical deep network structures to evaluate our Ca-LIF spiking neuron and ANN-to-SNN framework: (1) LeNet-5 (Lecun et al., 1998); (2) VGG-9 (Lee et al., 2020); (3) ResNet-11H, which keeps only half of the channels in each CONV layer of ResNet-11 (Lee et al., 2020); (4) MobileNet-20, a reduced version of MobileNetV1 (Howard et al., 2017) with the original 16th-23rd CONV layers removed; and (5) VGG-16. We modified all pooling layers in these networks to perform average pooling. Moreover, for each network, the kernel size of its first layer and the number of neurons in its last FC layer were adapted to the image size (i.e., the image resolution and number of color channels) and the number of object categories, respectively, when coping with different image datasets.

4.3. Recognition accuracy
In our experiments, we leveraged the off-the-shelf PyTorch QAT toolkit (PyTorch Foundation, 2022) to train deep ANNs of the aforementioned five neural network structures, and then exported the learned parameters to construct structurally equivalent SNNs for inference. The learned weights were directly translated to the synaptic weights of SNN Ca-LIF neurons, while other parameters like the biases and quantization scaling factors were used to determine the thresholds and leakages of Ca-LIF neurons according to Equation (5).
The PyTorch QAT toolkit quantized the inputs, outputs, and weights of the ANN neurons into a signed 8-bit format during training. Note that we can freely leverage any other available QAT toolkit supporting other ANN activation bit-precisions, including binary and ternary activations. We employed standard stochastic gradient descent with a momentum of 0.9 to train the ANNs. The batch normalization (BN) technique (Ioffe and Szegedy, 2015) was also employed in QAT training to improve the accuracy of some deep networks on complex datasets. The BN layers' parameters were updated together with the other parameters in a unified QAT process and were folded into the convolution layers' biases and quantized 8-bit weights before being exported to SNNs. QAT training starts from scratch rather than relying on transfer learning. For converted SNN inference, we set T = 128 time steps as the baseline spike encoding window length. The testing accuracies of the SNNs (T = 128) under each network structure configuration described in Section 4.2 are shown in Figure 4. These results indicate that our ANN-to-SNN conversion framework, together with the proposed Ca-LIF neuron model, achieves competitively high recognition performance. Indeed, the accuracy gap between the converted SNNs (T = 128) and their pre-conversion quantized ANN counterparts was negligible, below 0.04%. The results on MNIST, CIFAR-10, CIFAR-100, Caltech-101, and Tiny-ImageNet demonstrate the effectiveness and generality of our method.
The bold values indicate the testing performance of our converted SNN on CIFAR-10 with different time steps.
Moreover, to evaluate the accuracy vs. latency (i.e., the number of inference time steps) tradeoff of our converted SNNs, Figure 5 depicts the accuracies of our converted VGG-16 SNN on the CIFAR-10 and CIFAR-100 datasets under varying time window lengths from T = 8 to 512. The accuracies saturate above T = 128, since we used signed 8-bit activations for the pre-conversion quantized ANN. A more elaborate comparison and discussion is provided in Section 4.4. Table 2 compares our work with previous ANN-to-SNN conversion research. Since a quantized ANN itself may have slightly lower accuracy (occasionally slightly higher) than its full-precision version, we also trained and tested full-precision ANNs using the aforementioned network structures for a fair comparison, and further evaluated the accuracy loss between the converted SNNs and the corresponding full-precision ANNs.

4.4. Work comparison and discussion
For the MNIST dataset, the accuracies of our SNNs are slightly higher than those of the full-precision ANNs, owing to the higher accuracies of the QAT-trained ANNs. On CIFAR-10, the accuracy of our VGG-9 (93.63% for T = 128) surpasses those provided by Diehl et al. (2015), Sengupta et al. (2019), and Kundu et al. (2021). Using fewer time steps, our ResNet-11H on CIFAR-10 (93.58% for T = 128) exceeds those using the same structure provided by Diehl et al. (2015) and Sengupta et al. (2019), as well as the deeper ResNet structures provided by Sengupta et al. (2019), Hu et al. (2021), and Deng and Gu (2021). Compared to Bu et al. (2022) (92.35% for T = 64), our ResNet-11H (93.44% for T = 64) also performs better. The reason that the accuracy of our ResNet-11H is lower than that of Deng et al. (2022) will be discussed later in this section. The accuracy of our MobileNet-20 is slightly superior to that of Li et al. (2021a), while our VGG-16 on CIFAR-10 is preferable to that of Sengupta et al. (2019), among others, in terms of both accuracy and latency (i.e., the number of time steps). The accuracy of our VGG-16 is slightly lower than those provided by Bu et al. (2022) and Li et al. (2021a) due to their high-accuracy baseline full-precision ANNs, whereas our method relies on the QAT framework, which produces a less accurate ANN model for conversion. Fortunately, our method requires no complex operations such as modifying the loss function as in Bu et al. (2022) or the post-processing calibrations of Li et al. (2021a). For the CIFAR-100 dataset, our ResNet-11H SNN also surpasses more complex ResNet structures (Sengupta et al., 2019; Hu et al., 2021), while falling behind Deng and Gu (2021). The accuracy and latency metrics of our VGG-16 on CIFAR-100 outperform those using the same network architecture (Sengupta et al., 2019; Deng and Gu, 2021).
Regarding the Tiny-ImageNet dataset, the overall performance (accuracy, latency, and ANN-to-SNN accuracy loss) of all our networks surpasses that of Kundu et al. (2021).
In general, Table 2 indicates that our SNNs converted from QAT-trained ANNs achieve competitively high recognition accuracies across all the network structures on the benchmark image datasets, compared to similar network topologies used in other studies. Our SNN accuracy loss with respect to the corresponding full-precision ANNs also remains as low as that of other studies. Moreover, in our study, the low-precision data quantization in ANNs allows a moderate temporal window of T = 128 time steps for the converted SNNs to complete inference at an acceptable computational overhead on potential neuromorphic hardware platforms. Table 3 further uses the VGG-16 structure and the CIFAR-10 dataset to test the accuracies of our converted SNNs with varying time steps and compares them with recent ANN-to-SNN conversion studies. Our study surpasses Ding et al. (2021), among others, under all time-step configurations. Our SNN accuracy remains competitive with a relatively short time length of T = 32 time steps. However, when the time window is as short as T = 16 or 8, our SNN accuracies start to clearly lag behind those obtained by Deng and Gu (2021), Bu et al. (2022), and Li et al. (2022b). Similar conclusions can be drawn from Table 4, which compares our study with others on the more challenging CIFAR-100 dataset. Our SNN accuracies are comparable to the others when T is 32 time steps or longer, but clearly lower for T = 8 and 16 time steps. We regard this accuracy degradation as the cost of adopting an off-the-shelf QAT ANN training toolkit without the dedicated optimizations toward low-latency inference employed in Deng and Gu (2021), Bu et al. (2022), and Li et al. (2022b). Recently emerged direct SNN training methods can also reach relatively high accuracy while consuming far fewer (<10) time steps (Guo et al., 2021, 2022a; Deng et al., 2022; Kim et al., 2022; Li et al., 2022a).
However, evaluating direct SNN training methods is out of the scope of this article.
The concept of a negative spike has also been proposed by Kim et al. (2020). However, our work differs from theirs in two main aspects. First, the neuron model of Kim et al. (2020) has no membrane potential leakage; rather, it adopts an extra constant input current to represent the bias term of the ANN ReLU. By contrast, our Ca-LIF model naturally incorporates the bias term into the more bio-plausible leakage term. Second, and more importantly, the purposes of firing negative spikes differ. The negative spike of Kim et al. (2020) only models the negative part of the leaky-ReLU unit widely required in object detection, whereas our Ca-LIF neuron uses negative spikes to counterbalance early emitted positive spikes, so that when the net input z_s in Equation (3b), aggregated over the entire time window T, is negative, the final signed spike count can be zero, closely emulating the quantized ReLU function in classification tasks, as explained in Sections 3.1 and 3.2. Some previous ANN-to-SNN works do not adopt such methods but instead employ more complex threshold/weight balancing operations to compensate for early emitted positive spikes (Diehl et al., 2015; Rueckauer et al., 2017; Ho and Chang, 2021; Liu et al., 2022). In this regard, although judging the sign of spikes introduces marginal additional computational overhead, it eliminates tedious post-conversion steps such as threshold/weight balancing.
One limitation of the proposed QAT ANN-to-SNN conversion framework, shared with other ANN-to-SNN conversion methods, is that the input spike coding can only employ a rate-coding paradigm, where the input spike frequency or count is proportional to the pixel intensity to be encoded. This requires several to dozens of spikes for each pixel. These ANN-to-SNN conversion methods cannot accommodate the more computationally efficient temporal coding scheme (Mostafa, 2018), where each pixel is encoded into only one spike whose precise emission time is inversely related to the pixel intensity, and each neuron in the SNN is allowed to fire at most once in response to an input sample. However, as mentioned earlier, since our method can adapt to any available QAT training toolkit, we can resort to those supporting binary or ternary activations, so that the total number of spikes propagated through our converted SNNs would be largely reduced and the required inference time window considerably shortened. Therefore, the gap between the computational overheads of our converted SNNs and those of Mostafa (2018) using temporal coding can be well bridged.

5. Conclusion
This study proposes a ReLU-equivalent Ca-LIF spiking neuron model and a QAT-based ANN-to-SNN conversion framework requiring no post-conversion operations, achieving comparably high SNN accuracy in object recognition tasks within a moderately short temporal window of 32 to 128 time steps. We employed an off-the-shelf PyTorch QAT toolkit to train quantized deep ANNs and directly exported the learned weights to SNNs for inference. Experimental results demonstrated that our converted SNNs with typical deep network structures obtain competitive accuracies on various image datasets compared to previous studies, while requiring a reasonable number of time steps for inference. The proposed approach might also be applied to deploy deeper SNN architectures such as MobileNetV2 and VGG-34. Our future research will also include hardware implementation of SNN inference based on our Ca-LIF neurons.