Boosting Throughput and Efficiency of Hardware Spiking Neural Accelerators Using Time Compression Supporting Multiple Spike Codes

Spiking neural networks (SNNs) are the third generation of neural networks and can explore both rate and temporal coding for energy-efficient event-driven computation. However, the decision accuracy of existing SNN designs is contingent upon processing a large number of spikes over a long period, while the switching power of SNN hardware accelerators is proportional to the number of spikes processed and the length of spike trains limits throughput and static power efficiency. This paper presents the first study on developing temporal compression to significantly boost throughput and reduce energy dissipation of digital hardware SNN accelerators while being applicable to multiple spike codes. The proposed compression architectures consist of low-cost input spike compression units, novel input-and-output-weighted spiking neurons, and reconfigurable time constant scaling to support large and flexible time compression ratios. Our compression architectures can be transparently applied to any given pre-designed SNN employing either rate or temporal codes while incurring minimal modification of the neural models, learning algorithms, and hardware design. Using spiking speech and image recognition datasets, we demonstrate the feasibility of supporting large time compression ratios of up to 16×, delivering up to 15.93×, 13.88×, and 86.21× improvements in throughput, energy dissipation, and the tradeoffs between hardware area, runtime, energy, and classification accuracy, respectively, based on different spike codes on a Xilinx Zynq-7000 FPGA. These results are achieved while incurring little extra hardware overhead.


Introduction
Spiking neural networks (SNNs) closely emulate the spiking behaviors of biological brains (Ponulak and others 2011). Moreover, the event-driven nature of SNNs offers potential for great computational/energy efficiency on hardware neuromorphic computing systems (Merolla and others 2014; Furber and others 2014). For instance, processing a single spike may consume only a few pJ of energy on recent neuromorphic chips such as IBM's TrueNorth (Merolla and others 2014) and Intel's Loihi (Davies and others 2018).
SNNs support various rate/temporal spike codes, among which rate coding using Poisson spike trains is popular.
However, in that case, the low-power advantage of SNNs may be offset by long latency, during which many spikes must be processed to ensure decision accuracy. Various temporal codes have been proposed to improve the efficiency of information representation (Thorpe and others 2001; Kayser and others 2009; Kim and others 2018; Thorpe and others 1990; Izhikevich and others 2002). Time-to-first-spike coding encodes information in the arrival time of the first spike (Thorpe and others 2001). Phase coding (Kayser and others 2009) encodes information in a spike by its phase relative to a periodic reference signal (Kim and others 2018). No coding is considered universally optimal thus far, and the achievable latency/spike reduction of a particular code can vary widely with network structure and application. Rather than advocating a particular code, for the first time, we focus on an orthogonal problem: temporal compression applicable to any given SNN (accelerator) and spike code to boost throughput and energy efficiency. We propose a general compression technique that preserves both the spike count and temporal characteristics of the original SNN with low information loss, as shown in Fig. 1. It transparently compresses the duration of the spike trains, and hence classification latency, on top of an existing rate/temporal code. More broadly, this work extends the notion of weight/model pruning/compression of DNN accelerators from the spatial domain to the temporal domain.

The contributions of this paper include: 1) the first general time-compression technique, transparently compressing the spike train duration of a given SNN and achieving large latency reduction on top of the spike codes that come with the SNN; 2) facilitating the proposed time compression with four key ideas: spike train compression using a weighted representation, a new family of input-output-weighted (IOW) spiking neural models for processing time-compressed spike trains for multiple spike codes, scaling of time constants defining neural, synaptic, and learning dynamics, and low-cost support of flexible compression ratios (powers of two or not) using time averaging; 3) low-overhead hardware modifications of a given SNN accelerator to operate it on a compressed time scale while preserving the spike counts and temporal behaviors in inference and training; 4) a time-compressed SNN (TC-SNN) accelerator architecture and its programmable variant (PTC-SNN) operating on a wide range of (programmable) compression ratios and achieving significantly improved latency, energy efficiency, and tradeoffs between latency/energy/classification accuracy.
We demonstrate the proposed TC-SNN and PTC-SNN compression architectures by realizing several liquid-state machine (LSM) spiking neural accelerators with time compression ratios of up to 16:1 on a Xilinx Zynq-7000 FPGA. Using the TI46 Speech Corpus (Liberman and others 1991), the CityScape image recognition dataset (Cordts and others 2016), and the N-TIDIGITS18 dataset (Anumula and others 2018), we demonstrate the feasibility of supporting large time compression ratios of up to 16×, delivering up to 15.93×, 13.88×, and 86.21× improvements in throughput, energy dissipation, and the tradeoffs between hardware area, runtime, energy, and classification accuracy, respectively, based on various spike coding mechanisms including burst coding (Park and others 2019). These results are achieved while incurring little extra hardware overhead.

Proposed Time-Compressed Neural Computation
This work aims to enable time-compressed neural computation that preserves the spike counts and temporal behaviors in inference and training of a given SNN while significantly improving latency, energy efficiency, and the tradeoffs between latency/energy/classification accuracy. We develop four techniques toward this objective: 1) spike train compression using a weighted representation, 2) a new family of input-output-weighted (IOW) spiking neural models processing time-compressed spike trains for multiple spike codes, 3) scaling of the time constants of neural, synaptic, and learning dynamics, and 4) low-cost support of flexible compression ratios (powers of two or not) using time averaging.

Spike Train Compression in Weighted Form
We time-compress a given spiking neural network first by shrinking the duration of the input spike trains. To support large compression ratios, and hence significant latency reductions, we represent the compressed input trains in a weighted form. Typical binary spike trains with temporal sparsity may be time-compressed into another binary spike train of a shorter duration. However, as shown in Fig. 2, the spike count and temporal characteristics of the uncompressed train can only be preserved under a small compression ratio bounded by the minimal interspike interval. More aggressive compression would merge multiple adjacent spikes into a single spike, significantly altering the firing count and temporally coded information. This severely limits the amount of compression possible. Instead, we propose a new weighted form for representing compressed spike trains, where multiple adjacent binary spikes are compressed into a single weighted spike whose weight equals the number of binary spikes combined, allowing preservation of spike information even under very large compression ratios (Fig. 2).
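As a concrete illustration, the following minimal sketch (ours, not the paper's implementation; function and variable names are illustrative, and NumPy is assumed) compresses a binary spike train into the weighted form while preserving the total spike count:

```python
import numpy as np

def compress_spike_train(binary_spikes: np.ndarray, ratio: int) -> np.ndarray:
    """Compress a binary spike train into the weighted form.

    Every window of `ratio` consecutive time steps is folded into a single
    weighted spike whose weight equals the number of binary spikes in the
    window, so the total spike count is preserved exactly.
    """
    # Pad the tail so the length is a multiple of the compression ratio.
    pad = (-len(binary_spikes)) % ratio
    padded = np.pad(binary_spikes, (0, pad))
    # Sum each non-overlapping window: one weighted spike per window.
    return padded.reshape(-1, ratio).sum(axis=1)

# Example: 12 time steps compressed 4:1 into 3 weighted spikes.
train = np.array([0, 1, 1, 0,  1, 1, 1, 1,  0, 0, 0, 1])
print(compress_spike_train(train, 4))  # -> [2 4 1]
```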

Input-Output-Weighted (IOW) Spiking Neurons
As such, each spiking neuron would process the received input spike trains in the weighted form. Furthermore, as shown in Fig. 3, under large compression ratios the membrane potential of a spiking neuron may rise well above the firing threshold voltage within a single time step as a result of receiving input spikes with large weights. In this case, outputting spike trains in the standard binary form can lead to significant loss of input information, translating into large performance loss, as we demonstrate in our experimental results. Instead, we propose a new family of input-output-weighted (IOW) spiking neural models which take input spike trains in the weighted form and produce output spike trains in the same weighted form, where the multi-bit weight value of each output spike reflects the amplitude of the membrane potential as a multiple of the firing threshold. Spiking neural models such as the leaky integrate-and-fire (LIF) model and other models supporting various spike codes can be converted to their IOW counterparts with streamlined low-overhead modifications, as detailed later.

Scaling of Time Constants of SNN Dynamics
The proposed compression is general in the sense that it preserves the spike counts and temporal behaviors of the neural dynamics, synaptic responses, and learning dynamics employed in the given SNN, such that no substantial alterations are introduced by compression other than that the time-compressed SNN effectively operates on a faster time scale. The dynamics of the cell membrane is typically specified by a membrane time constant τ_m, which controls the process of action potential (spike) generation and influences the information processing of each spiking neuron (Gerstner and others 2002). Synaptic models also play an important role in an SNN and may be specified by one or multiple time constants, translating received spike inputs into a continuous synaptic current waveform based on dynamics of a particular order (Gerstner and others 2002). Finally, spike traces or temporal variables filtered with a specific time constant may be used to implement spike-dependent learning rules (Thorpe and others 2001; Zhang and others 2015).
Maintaining the key spiking/temporal characteristics of the neural, synaptic, and learning processes is favorable because: 1) an SNN with time compression attains essentially the same dynamic behavior as before, such that its classification performance is similar to that without compression, i.e., no large performance degradation is expected; 2) the deployed learning rules need no modification, and the same rules can effectively train the time-compressed SNN. Attaining this goal entails proper scaling of the time constants associated with these processes as a function of the time compression ratio, as shown in Fig. 4. Without loss of generality, consider a decaying first-order dynamics ẋ(t) = −x(t)/τ with time constant τ. For digital hardware implementation, forward Euler discretization may be adopted to discretize the dynamics over time:

x(t + ∆t) = x(t)(1 − 1/τ_nom),   (1)

where ∆t is the discretization time stepsize and τ_nom = τ/∆t is the normalized time constant used in digital hardware implementation. Now denote the target time compression ratio by γ (γ ≥ 1). The discretization stepsize with time compression is ∆t_c = γ∆t, i.e., one time step of the time-compressed SNN equals γ time steps of the uncompressed SNN. Based on (1), discretizing the first-order dynamics with time compression for one step gives:

x(t + ∆t_c) = x(t)(1 − 1/τ_nom)^γ = x(t)(1 − 1/τ_nom,c),   (2)

where τ_nom,c is the normalized time constant with compression. Linearly scaling it as τ_nom,c = τ_nom/γ is equivalent to x(t + ∆t_c) ≈ x(t)(1 − 1/(τ_nom/γ)), which produces large errors when γ ≫ 1. Instead, we obtain an accurate τ_nom,c value according to:

τ_nom,c = 1 / (1 − (1 − 1/τ_nom)^γ).
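To make the scaling concrete, the following sketch (ours, in Python; names are illustrative) compares the accurate scaling above with naive linear scaling for a representative τ_nom and γ:

```python
def scale_time_constant(tau_nom: float, gamma: int) -> float:
    """Accurate normalized time constant under compression ratio gamma.

    One compressed step must decay the state as much as gamma uncompressed
    steps: (1 - 1/tau_nom_c) = (1 - 1/tau_nom) ** gamma.
    """
    return 1.0 / (1.0 - (1.0 - 1.0 / tau_nom) ** gamma)

tau_nom, gamma = 64, 16
print(scale_time_constant(tau_nom, gamma))  # ~4.49 (accurate)
print(tau_nom / gamma)                      # 4.0 (naive linear scaling)
```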

Flexible Compression Ratios using Time Averaging
Digital multipliers and dividers are costly in area and power dissipation. Normalized time constants in a digital SNN hardware accelerator are typically set to a power of 2, i.e., τ_nom = 2^K, such that the dynamics can be efficiently implemented by a shifter rather than expensive multipliers and dividers (Zhang and others 2015). However, it may be desirable to choose a compression ratio and/or scale each time constant continuously in a wide integer range, e.g., within {1, 2, 3, ..., 16}. In this case, each scaled normalized time constant τ_nom,c may not be a power of 2. For example, τ_nom,c = 10 is far from its two nearest powers of 2, namely 8 and 16, and setting it to either would lead to large errors.
We propose a novel time-averaging approach to address the above problem (Fig. 5). For a given scaled normalized τ_nom,c, we find its two adjacent powers of 2: 2^K2 ≤ τ_nom,c ≤ 2^K1. We decay the targeted first-order dynamics by toggling its scaled normalized time constant between the two values 2^K2 and 2^K1. Since each is a power of two, the corresponding decaying behavior can be efficiently realized using a shifter. The usage frequencies of 2^K2 and 2^K1 are chosen such that the time-averaged time constant equals the desired τ_nom,c. Fig. 5 shows how a time-averaged (normalized) time constant value of 5 is achieved by averaging between the two values 4 and 8, as in the sketch below.
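A minimal software sketch of this scheduling (ours; it assumes a fixed averaging period, so the realized average is exact only when the integer division below is exact):

```python
def time_constant_schedule(tau_target: int, period: int = 4) -> list:
    """Repeating schedule of power-of-two time constants whose time average
    equals tau_target, so every step still decays via a cheap shift."""
    k_lo = 1 << (tau_target.bit_length() - 1)  # nearest power of 2 below
    if k_lo == tau_target:                     # already a power of two
        return [tau_target] * period
    k_hi = k_lo << 1                           # nearest power of 2 above
    # Number of steps using k_hi so the average hits tau_target
    # (integer division rounds; a longer period gives an exact average).
    n_hi = period * (tau_target - k_lo) // (k_hi - k_lo)
    return [k_lo] * (period - n_hi) + [k_hi] * n_hi

print(time_constant_schedule(5))   # [4, 4, 4, 8]  -> average 5
print(time_constant_schedule(10))  # [8, 8, 8, 16] -> average 10
```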
Proposed Input-and-Output-Weighted (IOW) Spiking Neural Models

Any given spiking neural model can be converted into its input-and-output-weighted (IOW) counterpart via straightforward low-overhead modifications. Without loss of generality, we consider conversion of two models: the standard leaky integrate-and-fire (LIF) neuron model, which has been widely used in many SNNs including ones based on rate coding, and one of its variants supporting burst coding.
IOW Neurons based on Standard LIF Model
The LIF model dynamics is (Gerstner and others 2002):

τ_m du(t)/dt = −u(t) + R·I(t),   (3)

where u(t) is the membrane potential, τ_m = RC is the membrane time constant, and I(t) is the total received postsynaptic current, given by:

I(t) = Σ_i w_i Σ_f α(t − t_i^(f)),   (4)

where w_i is the synaptic weight from the presynaptic neuron i, t_i^(f) are its firing times, α(t) = (q/τ_s) exp(−t/τ_s) H(t) for a first-order synaptic model with time constant τ_s, H(t) is the Heaviside step function, and q is the total charge injected into the postsynaptic neuron through a synapse with a weight of 1. In this work, we adopt a somewhat more complex second-order synaptic model for improved performance.
Once the membrane potential reaches the firing threshold u_th, an output spike is generated and the membrane potential is reset according to:

u(t^(f)+) = u(t^(f)) − u_th,   (5)

where t^(f) is the firing time. IOW LIF neurons shall process weighted input spikes resulting from time compression, with the modified synaptic input:

I(t) = Σ_i w_i Σ_f ω^f_spike,i · α(t − t_i^(f)),   (6)

where a weight ω^f_spike,i is introduced for each input spike. IOW LIF neurons shall also generate weighted output spikes. According to Fig. 3, we introduce a set of firing thresholds {u_th, 2u_th, ..., n·u_th}, each a multiple of the original threshold u_th. At each time step t, an output spike is generated whenever the membrane potential rises above any firing threshold in the set, and the weight of the output spike is determined by the actual threshold crossed: when k·u_th ≤ u(t) < (k + 1)·u_th, the output spike weight is set to k. Upon firing, the membrane potential is reset according to:

u(t^(f)+) = u(t^(f)) − k·u_th.   (7)
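The discrete-time behavior of an IOW LIF neuron can be sketched as follows (a behavioral model we provide for illustration, not the paper's hardware; it folds the synaptic dynamics into a single weighted input per step and assumes the subtractive reset of (7)):

```python
class IOWLIFNeuron:
    """Minimal discrete-time input-output-weighted LIF neuron (a sketch)."""

    def __init__(self, tau_nom_c: float, u_th: float):
        self.tau = tau_nom_c  # compressed, normalized membrane time constant
        self.u_th = u_th      # base firing threshold
        self.u = 0.0          # membrane potential

    def step(self, weighted_input: float) -> int:
        # Leaky integration for one compressed time step, mirroring (1).
        self.u = self.u * (1.0 - 1.0 / self.tau) + weighted_input
        # Weight of the output spike = number of base thresholds crossed.
        k = int(self.u // self.u_th)
        if k >= 1:
            self.u -= k * self.u_th  # subtractive reset keeps the residual
            return k
        return 0

n = IOWLIFNeuron(tau_nom_c=4.49, u_th=1.0)
print([n.step(x) for x in [0.0, 2.6, 0.0, 1.3]])  # -> [0, 2, 0, 1]
```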

IOW Neurons based on Bursting LIF Model
The LIF model for burst coding is also based on (3) (Park and others 2019). A bursting function g_i(t) is introduced to implement the bursting behavior for each presynaptic neuron i (Park and others 2019):

g_i(t) = β·g_i(t − ∆t) if E_i(t − ∆t) = 1, and g_i(t) = 1 otherwise,   (8)

where β is a burst constant, and E_i(t − ∆t) = 1 if the presynaptic neuron i fired at the previous time step and E_i(t − ∆t) = 0 otherwise. We assume a zero-th order synaptic response model. For input spikes from the presynaptic neuron i, the firing threshold voltage is modified from u_th to g_i(t)·u_th, and the corresponding reset characteristic of the membrane potential after firing is:

u(t^(f)+) = u(t^(f)) − g_i(t)·u_th.   (9)

Furthermore, the total postsynaptic current is:

I(t) = Σ_i w_i · g_i(t) · E_i(t).   (10)

To implement the IOW version of the LIF model with burst coding, we modify the burst function so that a weighted input spike of weight ω_spike,i compounds the gain as the corresponding number of consecutive binary spikes would:

g_i(t) = β^(ω_spike,i(t − ∆t_c)) · g_i(t − ∆t_c) if ω_spike,i(t − ∆t_c) ≥ 1, and g_i(t) = 1 otherwise.   (11)

Similar to the case of the IOW LIF model, we use a set of firing thresholds to determine the weight of each output spike and a behavior similar to (7) for reset. The only difference here is that the adopted set of firing thresholds is {g_i(t)·u_th, 2g_i(t)·u_th, ..., n·g_i(t)·u_th}.
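Under the reading of (11) above, the per-input burst gain update reduces to a few lines (an illustrative sketch; the exponent-of-β generalization for weighted spikes is our interpretation, and the function name is ours):

```python
def update_burst_gain(g_prev: float, w_spike_prev: int, beta: float) -> float:
    """Burst gain g_i(t) under time compression: a weighted spike of weight
    w at the previous compressed step compounds the gain as w consecutive
    binary spikes would; no spike resets the gain to 1."""
    return g_prev * beta ** w_spike_prev if w_spike_prev >= 1 else 1.0

g = 1.0
for w in [1, 2, 0, 1]:  # weighted spikes from one presynaptic neuron
    g = update_burst_gain(g, w, beta=2.0)
    print(g)            # 2.0, 8.0, 1.0, 2.0
```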

Time-Compressed SNN Accelerator Architectures
The proposed time compression technique can support either a fixed or a user-programmable time compression ratio, leading to the time-compressed SNN (TC-SNN) and programmable time-compressed SNN (PTC-SNN) architectures, respectively. We describe the more general PTC-SNN architecture, shown in Fig. 6. It can be adopted for any pre-designed SNN hardware accelerator to add programmable time compression. PTC-SNN introduces three streamlined additions and minor modifications to the embedded SNN accelerator to enable application- and coding-independent time compression.
Based on the discussions presented in the preceding sections, firstly, a set of input-spike compression units (ISCUs), one for each input spike channel, is incorporated into the input layer of the SNN. ISCUs convert the raw binary input spike trains into the more compact weighted form with shortened time duration. A user-specified command sets the time compression ratio of all ISCUs through the Global Compression Controller. ISCUs compress the given spike channels without assuming sparsity of the input spike trains and can support large compression ratios. Secondly, with modest added hardware overhead, we replace all original silicon spiking neurons by their input-output-weighted neuron elements (IOW-NEs). Finally, all time constants in the SNN are scaled based on the time compression ratio. While an SNN may employ a large number of time constants, they can all be scaled in the same way, allowing the use of one common simple programmable logic unit, the Global Compression Controller, for scaling all time constants according to a user-specified compression ratio command.
[Input Spike Compression Unit (ISCU)] Each input spike channel is compressed by one low-cost ISCU according to the user-specified compression ratio N_cmp. When each uncompressed spike input channel is fed by a single binary serial input, a demultiplexer in the ISCU performs a reconfigurable serial-in parallel-out (SIPO) operation to convert the serial input into N_cmp parallel outputs, as shown in Fig. 7(a). If the input spike channel is supplied by parallel spike data, the SIPO operation is skipped. During each clock cycle, the N_cmp bits of the parallel outputs are summed by an adder, which effectively combines these spikes into a single weighted spike whose weight is the adder output. No spike count loss results, as the sum of spike weights equals the total number of binary spikes in the raw input spike train. The global temporal spike distribution of the input spike train is preserved up to the temporal resolution of the compressed spike train.
[Input-Output-Weighted (IOW) Neuron Elements] We discuss efficient hardware realization of the IOW spiking neural models described above. The IOW neuron element (IOW-NE), shown in Fig. 7(b), consists of a synaptic unit (SU), a neural unit (NU), and a time constant configuration module, described later. The SU realizes a discretized version of (6). As in many practical implementations of hardware SNNs, each synaptic weight w_i is constrained to be of the form 2^K, so the product ω_spike,i · w_i is efficiently realized by left-shifting ω_spike,i by K bits. The NU updates the membrane potential u(t) based on the discretization of (3) and the reset behavior (7), and generates a weighted output spike when u(t) rises above a threshold in the firing threshold set {u_th, 2u_th, ...}.
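The shift trick can be illustrated in a couple of lines (a sketch; sign handling for negative weights is omitted, and the function name is ours):

```python
def synaptic_contribution(w_spike: int, k: int) -> int:
    """With each synaptic weight constrained to a power of two (w_i = 2**k),
    the product w_spike * w_i reduces to left-shifting the input spike
    weight by k bits -- no multiplier needed."""
    return w_spike << k

# A weighted spike of weight 3 through a synapse of weight 2**4 = 16:
print(synaptic_contribution(3, 4))  # 48 == 3 * 16
```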
The design of IOW LIF neurons with burst coding is almost identical to that of the IOW LIF neurons, except for the following differences. We add a LUT to store the set of firing thresholds {g_i(t)·u_th, 2g_i(t)·u_th, ...}, which are calculated based on (11). Because g_i(t)·u_th might not be of the form 2^K, a multiplier is used to compute the product g_i(t)·u_th.

Experimental Evaluations
The proposed time-compressed SNN (TC-SNN) architecture with a fixed compression ratio and the more general PTC-SNN architecture with a user-programmable compression ratio can be adopted to re-design any given digital SNN accelerator into a time-compressed one with low additional design overhead in a highly streamlined manner. For demonstration purposes, we show how an existing liquid state machine (LSM) SNN accelerator can be re-designed into a TC-SNN and a PTC-SNN on a Xilinx Zynq-7000 FPGA. The LSM is a recurrent spiking neural network model; with its spatio-temporal computing power, it has demonstrated promising performance for various applications (Maass and others 2002).
Three speech/image recognition datasets are adopted for benchmarking. The first is a subset of the TI46 speech corpus (Liberman and others 1991) and consists of 260 isolated spoken English letters recorded from a single speaker. The time-domain speech examples are preprocessed by Lyon's passive ear model (Lyon and others 1982) and transformed into 78-channel spike trains using the BSA spike encoding algorithm (Schrauwen and others 2003). The second is the CityScape dataset (Cordts and others 2016), which contains 18 classes of 1,080 images of semantic urban scenes taken in several European cities. Each image is segmented and remapped to a size of 15 × 15, then converted to 225 Poisson spike trains with mean firing rates proportional to the corresponding pixel intensities. The third is a subset of the N-TIDIGITS18 speech dataset (Anumula and others 2018), obtained by playing the audio files of the TIDIGITS dataset to a CochleaAMS1b sensor. It contains 10 classes of single digits (0 to 9), spoken by 111 male and 114 female speakers, with 2,250 training and 2,250 testing examples. For the first two datasets, we adopt 80% of the examples for training and the rest for testing.

The baseline LSM FPGA accelerator (without compression) built in this paper is based on the standard LIF model and consists of an input layer, a recurrent reservoir, and a readout layer. The number of input neurons is set by the number of input spike trains: 78, 225, and 64 for the TI46, CityScape, and N-TIDIGITS18 datasets, respectively. The reservoir has 135 neurons for the TI46 and CityScape datasets and 300 neurons for the N-TIDIGITS18 dataset. The reservoir neurons are fully connected to the readout neurons. All readout synapses are plastic and trained using the supervised spike-dependent training algorithm in (Zhang and others 2015). The power consumption of the various FPGA accelerators is measured using the Xilinx Power Analyzer (XPA) tool, and their recognition performance is measured on the FPGA board.

Reservoir Responses of the LSMs
To examine the impact of time compression, Fig. 8 shows raster plots of the reservoir IOW-LIF neurons when the input speech example is the letter A from the TI46 Speech Corpus. Notably, when the compression ratio is between 2:1 and 4:1, the reservoir response, in terms of both total spike count and spatio-temporal spike distribution, changes little from the one without compression. Even when the compression ratio increases to the very large values of 8:1 and 16:1, the original spatio-temporal spike distribution is still largely preserved. This is consistent with the decent recognition performance achieved at the 8:1 and 16:1 compression ratios, presented next.
For the TI46 speech dataset (Liberman and others 1991), we measure the runtime and energy dissipation of each accelerator over 350 training epochs on a batch of 208 randomly selected examples. We compare the inference accuracy, hardware overhead measured by FPGA lookup table (LUT) and flip-flop (FF) utilization, power, runtime, and energy of all six accelerators in Table 1. To show the benefit of producing weighted output spikes, we create an input-weighted (IW) LIF model, which differs from the IOW LIF model in that the IW model generates binary output spikes. We redesign the five TC-SNN accelerators using IW LIF neurons and compare them with their IOW counterparts in Table 1. At large compression ratios, the IOW accelerators significantly outperform their IW counterparts in classification accuracy. For example, at a compression ratio of 16:1, the IOW accelerator improves accuracy from 69.23% to 80.77%.
The power/hardware overhead of the TC-SNN accelerators with IOW LIF neurons increases only modestly with the time compression ratio. Over a very wide range of compression ratios, the runtime scales linearly with the compression ratio while the energy scales almost linearly. For example, 2:1 compression speeds up the runtime by 2× and reduces the energy by 1.69× while retaining the same classification accuracy.

Figure 1: Proposed general time compression for SNNs.

Figure 4: Scaling of time constants of SNN dynamics.

Figure 5: Time-averaged time constants: the realized averaged time constant is 5.

Figure 6: Proposed time-compressed SNN architecture with programmable compression ratio (PTC-SNN). ISCU: input spike compression unit; SIPO: serial-in parallel-out; IOW-NE: input-output-weighted spiking neuron element; SP: synapse response; NE: regular binary-input-output neuron element; Vm: membrane potential. The LUT enables programmable scaling of the time constants of the neuron/synaptic models and the learning unit.

Table 1: Comparison of the baseline and TC-SNN accelerators with IW/IOW LIF neurons on the TI46 Speech Corpus.