Revisiting Batch Normalization for Training Low-Latency Deep Spiking Neural Networks From Scratch

Spiking Neural Networks (SNNs) have recently emerged as an alternative to deep learning owing to their sparse, asynchronous, and binary event- (or spike-) driven processing, which can yield large energy-efficiency benefits on neuromorphic hardware. However, SNNs convey temporally varying spike activations through time, which is likely to induce large variations in forward activations and backward gradients, resulting in unstable training. To address this training issue in SNNs, we revisit Batch Normalization (BN) and propose a temporal Batch Normalization Through Time (BNTT) technique. Different from previous BN techniques for SNNs, we find that varying the BN parameters at every time-step allows the model to better learn the time-varying input distribution. Specifically, our proposed BNTT decouples the parameters in a BNTT layer along the time axis to capture the temporal dynamics of spikes. We demonstrate BNTT on the CIFAR-10, CIFAR-100, Tiny-ImageNet, event-driven DVS-CIFAR10, and Sequential MNIST datasets and show near state-of-the-art performance. We conduct a comprehensive analysis of the temporal characteristics of BNTT and showcase interesting benefits toward robustness against random and adversarial noise. Further, by monitoring the learnt parameters of BNTT, we find that we can perform temporal early exit; that is, we can reduce the inference latency by ~5-20 time-steps from the original training latency. The code has been released at https://github.com/Intelligent-Computing-Lab-Yale/BNTT-Batch-Normalization-Through-Time.

A BACKWARD GRADIENT CALCULATION

At the last layer we accumulate the input signals in the membrane potential, rather than generating spikes, in order to avoid information loss; the accumulated potential is then converted into class probabilities with a softmax function, on which the loss L is defined. Following previous work on surrogate-gradient learning (Neftci et al., 2019), the backward gradient of the loss L with respect to a weight w_ij is obtained by unrolling the network over time:

∂L/∂w_ij = Σ_t (∂L/∂o_i^t) (∂o_i^t/∂u_i^t) (∂u_i^t/∂w_ij),   (S1)

where o_i^t and u_i^t denote the output spike and the membrane potential of neuron i at time-step t. The first term on the R.H.S. of Eq. (S1) is the error signal propagated back from the subsequent layer (for the output layer, from the softmax cross-entropy loss on the accumulated membrane potential). The second term is the derivative of the spike function with respect to the membrane potential; since this derivative is not defined at the firing threshold, we approximate it with a surrogate gradient (Neftci et al., 2019). The third term is the derivative of the membrane potential with respect to the weight, which contains the time-specific BNTT parameter γ_i^t and the batch statistics (μ_i^t, σ_i^t) of time-step t. To summarize, for every time-step t, gradients are calculated based on the time-specific statistics of the input signals. This allows the network to take temporal dynamics into account when training the weight connections.
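To make the role of the time-specific parameters concrete, the following is a minimal PyTorch-style sketch of a BNTT layer in which every time-step owns its own batch-normalization statistics and learnable scale; the class and argument names are illustrative and are not taken from the released code.

```python
import torch.nn as nn

class BNTT(nn.Module):
    """Batch Normalization Through Time (illustrative sketch): one BN module
    per time-step, so the scale gamma^t and the batch statistics
    (mu^t, sigma^t) are decoupled along the time axis."""
    def __init__(self, num_features: int, timesteps: int):
        super().__init__()
        self.bn = nn.ModuleList(
            [nn.BatchNorm2d(num_features) for _ in range(timesteps)]
        )

    def forward(self, x, t: int):
        # x: weighted input sum at time-step t; normalize it with the
        # statistics and learnable scale that belong to this time-step only.
        return self.bn[t](x)
```

In use, the normalized input at time-step t is added to the decayed membrane potential before thresholding, e.g., u^t = λ·u^(t-1) + bntt(conv(o^t), t); this is a sketch of the idea rather than the released implementation.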

B RATE CODING
Spiking neural networks process multiple binary spikes over time. Therefore, for both training and inference, a static image needs to be converted into a spike train. There are various spike coding schemes such as rate, temporal, and phase coding (Mostafa, 2017; Kim et al., 2018). Among them, we use rate coding due to its reliable performance across various tasks. Rate coding generates spikes with a frequency proportional to the pixel intensity of the given image. To implement this, following previous work (Roy et al., 2019), we compare each pixel value with a random number drawn uniformly from [I_min, I_max] at every time-step, where I_min and I_max correspond to the minimum and maximum possible pixel intensity. If the pixel intensity is greater than the random number, the Poisson spike generator outputs a spike with amplitude 1; otherwise, it does not yield a spike.
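As a concrete illustration, below is a minimal sketch of this Poisson-like spike generator, assuming pixel intensities have already been normalized to [0, 1]; the function name is ours, not part of the released code.

```python
import torch

def rate_code(image: torch.Tensor, timesteps: int) -> torch.Tensor:
    """Generate a rate-coded spike train from a static image.

    At every time-step, each pixel fires a spike of amplitude 1 with a
    probability proportional to its intensity (sketch; intensities are
    assumed to be normalized to [0, 1])."""
    spikes = []
    for _ in range(timesteps):
        rand = torch.rand_like(image)           # random number in [I_min, I_max]
        spikes.append((image > rand).float())   # spike where intensity exceeds it
    return torch.stack(spikes)                  # shape: (timesteps, *image.shape)
```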

C DVS-CIFAR10 DATASET
On DVS-CIFAR10, following (Wu et al., 2019), we downsample the 128×128 images to 42×42. Also, we divide the total number of time-steps available from the original time-frame data into 20 intervals and accumulate the spikes within each interval. We use a similar architecture to previous work (Wu et al., 2019), which consists of a 5-layer feature extractor and a classifier. The detailed architecture is shown in Fig. S1 in this appendix.

Figure S1: Illustration of the network structure for the DVS dataset. Here, AP denotes average pooling and FC denotes a fully-connected layer.
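For illustration, a minimal sketch of this temporal binning is given below; the (t, x, y, polarity) event layout and the helper name are assumptions made for the example, not the dataset's native format.

```python
import numpy as np

def events_to_frames(events: np.ndarray, num_bins: int = 20,
                     height: int = 42, width: int = 42) -> np.ndarray:
    """Accumulate a DVS event stream into `num_bins` spike frames.

    `events` is assumed to be an integer array of (t, x, y, polarity) rows,
    with polarity in {0, 1} and coordinates already downsampled to
    height x width."""
    t = events[:, 0]
    edges = np.linspace(t.min(), t.max(), num_bins + 1)
    # Index of the time interval each event falls into.
    idx = np.clip(np.searchsorted(edges, t, side="right") - 1, 0, num_bins - 1)
    frames = np.zeros((num_bins, 2, height, width), dtype=np.float32)
    for (_, x, y, p), b in zip(events, idx):
        frames[b, p, y, x] += 1.0   # accumulate spikes within each interval
    return frames
```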

D SPIKE RATE CALCULATION
In this appendix section, we provide the details of the energy calculation discussed in Section 4.3 of the main paper. The total computational cost is proportional to the total number of floating point operations (FLOPS), which is approximately the same as the number of Matrix-Vector Multiplication (MVM) operations. For a convolutional layer l in an ANN, we can calculate the FLOPS as:

FLOPS(l) = k^2 × O^2 × C_in × C_out.

Here, k is the kernel size, O is the output feature map size, and C_in and C_out are the number of input and output channels, respectively. For SNNs, we first define the spiking rate R_s(l) at layer l, which is the average number of spikes per neuron over all time-steps:
R_s(l) = (#spikes of layer l over all time-steps) / (#neurons of layer l).
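A small worked sketch of this bookkeeping is given below. It assumes the common convention that the effective FLOPS of an SNN layer are the corresponding ANN FLOPS scaled by the spike rate R_s(l); the function names are ours and the exact per-layer accounting in the paper may differ.

```python
def ann_conv_flops(k: int, o: int, c_in: int, c_out: int) -> float:
    """FLOPS of a convolutional layer in an ANN: k^2 * O^2 * C_in * C_out."""
    return (k ** 2) * (o ** 2) * c_in * c_out

def spike_rate(num_spikes: float, num_neurons: int) -> float:
    """R_s(l): spikes of layer l over all time-steps, averaged per neuron."""
    return num_spikes / num_neurons

def snn_conv_flops(k: int, o: int, c_in: int, c_out: int, r_s: float) -> float:
    """Effective SNN FLOPS, assuming ANN FLOPS are scaled by the spike rate."""
    return ann_conv_flops(k, o, c_in, c_out) * r_s
```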
Table S1: Normalized energy comparison on a neuromorphic architecture, TrueNorth (Akopyan et al., 2015). We set the ANN-SNN conversion method as the reference for the normalized energy comparison. We conduct experiments on CIFAR-10 with a VGG9 architecture.

E ENERGY COMPARISON IN NEUROMORPHIC ARCHITECTURE
We further show the energy efficiency of BNTT on a neuromorphic architecture, TrueNorth (Akopyan et al., 2015). Following previous work (Park et al., 2020; Moradi and Manohar, 2018), we compute the normalized energy, which can be decomposed into dynamic energy (E_dyn) and static energy (E_sta). The dynamic energy E_dyn corresponds to the computing cores and routers, and the static energy E_sta is for maintaining the state of the CMOS circuit. The total energy consumption can be calculated as #Spikes × E_dyn + #Time-steps × E_sta, where (E_dyn, E_sta) = (0.4, 0.6). In Table S1, we show that BNTT has a significant energy-efficiency advantage on neuromorphic hardware.
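A minimal sketch of this normalized-energy estimate is given below; the spike count and time-step values in the example are placeholders for illustration, not measurements from the paper.

```python
E_DYN, E_STA = 0.4, 0.6  # normalized dynamic / static energy (from the text)

def total_energy(num_spikes: float, num_timesteps: int) -> float:
    """Total normalized energy: #Spikes * E_dyn + #Time-steps * E_sta."""
    return num_spikes * E_DYN + num_timesteps * E_STA

# Placeholder example: fewer spikes and fewer time-steps both lower the energy,
# which is why BNTT's sparser activity and early exit help on this metric.
print(total_energy(num_spikes=1.0e5, num_timesteps=25))
```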