Deep Spiking Neural Networks for Large Vocabulary Automatic Speech Recognition

Artificial neural networks (ANN) have become the mainstream acoustic modeling technique for large vocabulary automatic speech recognition (ASR). A conventional ANN features a multi-layer architecture that requires massive amounts of computation. The brain-inspired spiking neural networks (SNN) closely mimic the biological neural networks and can operate on low-power neuromorphic hardware with spike-based computation. Motivated by their unprecedented energy-efficiency and rapid information processing capability, we explore the use of SNNs for speech recognition. In this work, we use SNNs for acoustic modeling and evaluate their performance on several large vocabulary recognition scenarios. The experimental results demonstrate ASR accuracies competitive with those of their ANN counterparts, while requiring only 10 algorithmic time steps and as few as 0.68 times the total synaptic operations to classify each audio frame. Integrating the algorithmic power of deep SNNs with energy-efficient neuromorphic hardware therefore offers an attractive solution for ASR applications running locally on mobile and embedded devices.


INTRODUCTION
Automatic speech recognition (ASR) has enabled the voice interface of mobile devices and smart home appliances in our everyday life. The rapid progress in the integration of voice interfaces has been made possible by the remarkable performance of ASR systems that use artificial neural networks (ANN) for acoustic modeling (Lippmann, 1989; Lang et al., 1990; Hinton et al., 2012; Yu and Deng, 2015). Various ANN architectures, either feedforward or recurrent, have been investigated for modeling the acoustic information preserved in speech signals (Dahl et al., 2012; Graves et al., 2013; Abdel-Hamid et al., 2014).
The performance gains come with immense computational requirements, often due to the time-synchronous processing of input audio signals. Several techniques have been proposed to reduce the computational load and memory storage of ANNs by reducing the number of parameters that have to be used for inference (Sainath et al., 2013; Xue et al., 2013; He et al., 2014; Povey et al., 2018). Another common solution for reducing the amount of processed speech uses a wake word or phrase to initialize the embedded ASR engine, which then starts listening to input speech (Zehetner et al., 2014; Sainath and Parada, 2015; Wu et al., 2018). Moreover, most devices with voice control rely on cloud-based ASR engines rather model has also revealed compelling prospects of rapid inference and unprecedented energy efficiency of a neuromorphic approach.
The rest of the paper is organized as follows: In Section 2, we first give an overview of spiking neural networks, large vocabulary ASR systems, and existing SNN-based ASR systems. In Section 3, we introduce the spiking neuron model and the neural coding scheme that converts acoustic features into a spike-based representation. We further present a recently introduced tandem learning framework for SNN training and how it is used to train deep SNN-based acoustic models. In Section 4, we present experimental results on the learning capability and energy efficiency of SNN-based acoustic models across three different types of recognition tasks, including phone recognition, low-resourced and standard large-vocabulary ASR, and compare those to the ANN-based implementations. Finally, a discussion of the experimental findings is given in Section 5.

Spiking Neural Networks
The third-generation spiking neural networks were originally studied as models to describe the information processing in biological neural networks, wherein information is communicated and exchanged via stereotypical action potentials or spikes (Gerstner and Kistler, 2002). Neuroscience studies reveal that the temporal structure and frequency of these spike trains are both important information carriers in biological neural networks. As will be introduced in Section 3.1, the spiking neuron operates asynchronously and integrates the synaptic current from its incoming spike trains. An output spike is generated by the spiking neuron whenever its membrane potential crosses the firing threshold, and this output spike is propagated to the connected neurons via the axon.
Motivated by the same connectionism principle, SNNs share the same network architectures, either feedforward or recurrent, with the conventional ANNs that use analog neurons. As shown in Figure 1, an early classification decision can be made by the SNN as soon as the first output spike is generated. However, the quality of the classification decision typically improves over time as more evidence is accumulated. This differs significantly from the synchronous information processing of conventional ANNs, where the output layer needs to wait until all preceding layers are fully updated. Therefore, although information is transmitted and processed in neural substrates at a speed that is several orders of magnitude slower than signal processing in modern transistors, biological neural systems can perform complex tasks rapidly. For further overviews of SNNs and their applications, we refer readers to (Pfeiffer and Pfeil, 2018; Tavanaei et al., 2018).

Large Vocabulary Automatic Speech Recognition
As shown in Figure 2, conventional ASR systems use acoustic and linguistic information preserved in three distinct components to convert speech signals to the corresponding text: (1) an acoustic model for preserving the statistical representations of different speech units, e.g., phones, from speech features, (2) a language model for assigning probabilities to co-occurring word sequences and (3) a pronunciation lexicon for mapping the phonetic transcriptions to orthography. These resources are jointly used to determine the most likely hypothesis in the decoding stage.
Acoustic modeling has been achieved using various statistical models such as Gaussian Mixture Models (GMM) for assigning frame-level phone posteriors in conjunction with a Hidden Markov Model (HMM) for duration modeling (Yu and Deng, 2015). More recently, ANN-based approaches have become the standard acoustic models, providing state-of-the-art performance across a wide spectrum of ASR tasks (Hinton et al., 2012). Together with the numerous ANN architectures explored for acoustic modeling, several end-to-end ANN architectures have been proposed for directly mapping speech features to text with optional use of the other linguistic components (Graves and Jaitly, 2014a; Chan et al., 2016; Watanabe et al., 2017).
The probabilistic definition of acoustic modeling becomes more evident via the Bayesian formulation of the speech recognition task. Given a target speech signal that is segmented into $T$ overlapping frames, the resulting frame-wise features can be represented as $O = [o_1, o_2, \ldots, o_T]$. An ASR system assigns the probability $P(W|O)$ to all possible word sequences $W = [w_1, w_2, \ldots]$, and the word sequence $\hat{W}$ with the highest probability is the recognized output,

$$\hat{W} = \arg\max_{W} P(W|O). \quad (1)$$

The probability $P(W|O)$ can be decomposed into two parts by applying Bayes' rule as below,

$$P(W|O) = \frac{P(O|W)\,P(W)}{P(O)}. \quad (2)$$

$P(O)$ can be omitted as it does not depend on $W$. This results in

$$\hat{W} = \arg\max_{W} P(O|W)\,P(W), \quad (3)$$

which formally defines the theoretical foundation of conventional ASR systems. $P(W)$ is the prior probability of the word sequence $W$, and this probability is provided by the language model, which is trained on a large written corpus of the target language. $P(O|W)$ is the likelihood of the observed feature sequence $O$ given the word sequence $W$, and this probability is associated with the acoustic model. The acoustic model captures the information about the acoustic component of speech signals, aiming to classify different acoustic units accurately. Traditionally, each phone in the phonetic alphabet is modeled using multiple three-state HMM models for different preceding and following phonetic contexts (triphone) (Lee, 1990). The emission probabilities of these HMM states are shared (tied) among different models to reduce the number of model parameters (Hwang and Huang, 1993). The output layer of the ANN-based acoustic model is designed accordingly and trained to assign these frame-level tied triphone HMM state (senone) probabilities (Dahl et al., 2012). The output layer uses the softmax function to normalize the output into a probability distribution. These values are scaled by the prior probabilities of each class, obtained from the training data, to determine the likelihood values. These likelihood values are later combined with the probabilities assigned by the language model during the decoding stage so as to find the most likely hypothesis.
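The prior scaling described above can be illustrated with a minimal sketch. It assumes the usual hybrid ANN-HMM convention in which the softmax posterior of each senone is divided by its prior (a subtraction in the log domain); the function name, the flooring constant and the numbers are hypothetical and only serve as an illustration.

```python
import math

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert senone posteriors P(s|o) into scaled log-likelihoods.

    By Bayes' rule, p(o|s) is proportional to P(s|o) / P(s); the frame-level
    term p(o) is identical for every senone and does not affect decoding.
    The floor (an assumption of this sketch) avoids taking log(0).
    """
    if len(posteriors) != len(priors):
        raise ValueError("posteriors and priors must have the same length")
    return [
        math.log(max(p, floor)) - math.log(max(q, floor))
        for p, q in zip(posteriors, priors)
    ]

# Hypothetical softmax output for one frame over three senones, and
# hypothetical senone priors estimated from the training alignments.
frame_posteriors = [0.7, 0.2, 0.1]
senone_priors = [0.6, 0.1, 0.3]

scores = scaled_log_likelihoods(frame_posteriors, senone_priors)
best = max(range(len(scores)), key=scores.__getitem__)
```

In this toy frame the second senone obtains the highest scaled likelihood although the first one has the highest posterior, because the first senone is also much more frequent in the (hypothetical) training data.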
Speech features, used as inputs to the acoustic model, describe the spectrotemporal dynamics of the speech signal and discriminate among different phones in the target language. Mel-frequency cepstral coefficient (MFCC) (Davis and Mermelstein, 1980) features are commonly used in conjunction with the GMM-HMM acoustic model. The MFCC features are extracted by (1) performing a short-time Fourier transform, (2) applying triangular Mel-scaled filter banks to calculate the power at each Mel frequency in the log domain (FBANK) and (3) performing a discrete cosine transform to decorrelate the FBANK features. The third step is often skipped and FBANK features are used directly when training ANN-based acoustic models, since these models can handle correlation among features. In this work, we incorporate deep SNNs for acoustic modeling instead of the conventional ANNs and compare their ASR performance in different ASR scenarios, including phone recognition, low-resourced and standard large vocabulary ASR. The ASR performance obtained using popular speech features is reported to explore the impact of the feature representation space and its dimensionality on SNN-based acoustic models.

Speech Recognition with Spiking Neural Network
SNNs are well suited for representing and processing spatial-temporal signals, and hence hold great potential for speech recognition tasks. Tavanaei et al. (Tavanaei and Maida, 2017a,b) proposed SNN-based feature extractors to extract discriminative features from the raw speech signal using the unsupervised spike-timing-dependent plasticity (STDP) rule. When these SNN-based feature extractors were connected with Support Vector Machine (SVM) or Hidden Markov Model (HMM) classifiers, competitive classification accuracies were demonstrated on the isolated spoken digit recognition task. Wu et al. (Wu et al., 2018a,b) introduced a SOM-SNN framework for environmental sound and speech recognition. In this framework, the biologically inspired self-organizing map (SOM) is utilized for feature representation, which maps frame-based acoustic features into a spike-based representation that is both sparse and discriminative. The temporal dynamics of the speech signal are further handled by the SNN classifier. Zhang et al. (Zhang et al., 2019) presented a fully SNN-based speech recognition framework, wherein the spectral information of consecutive frames is encoded with threshold coding and subsequently classified by an SNN that is trained with a novel membrane potential-driven aggregate-labeling learning algorithm.
Recurrent networks of spiking neurons (RSNNs) exhibit greater memory capacity than the aforementioned feedforward frameworks. They can capture long-range temporal information that is useful for speech recognition tasks. In (Zhang et al., 2015), Zhang et al. presented a spiking liquid-state machine (LSM) speech recognition framework, which is attractive for low-power very-large-scale-integration (VLSI) implementation. Bellec et al. recently demonstrated state-of-the-art phone recognition accuracy on the TIMIT dataset by adding a neuronal adaptation mechanism to the vanilla RSNNs (Bellec et al., 2018). This is the first time that RSNNs have approached the performance of LSTM networks (Greff et al., 2016) on the speech recognition task. These preliminary works on SNN-based ASR systems are, however, limited to phone classification or small vocabulary isolated spoken digit recognition tasks. In this work, we apply deep SNNs to LVCSR tasks and demonstrate accuracies that are competitive with those of comparable ANN-based ASR systems.

Spiking Neuron Model
As shown in Figure 4, the frame-based features are extracted and input into the SNN-based acoustic models. Given the short temporal duration of segmented frames and the slow variation of speech signals, these features are typically assumed to be stationary over the short time period of segmented frames. In this work, we use the integrate-and-fire (IF) neuron model with the reset-by-subtraction scheme (Rueckauer et al., 2017), which can effectively process these stationary frame-based features with minimal computational costs. At each time step $t$ of a discrete-time simulation, the incoming spikes to neuron $j$ at layer $l$ are transduced into synaptic current as follows

$$z_j^l(t) = \sum_i w_{ji}^{l-1}\,\theta_i^{l-1}(t) + b_j^l / T_e, \quad (4)$$

where $\theta_i^{l-1}(t)$ indicates the occurrence of an input spike from afferent neuron $i$ at time step $t$. In addition, $w_{ji}^{l-1}$ denotes the synaptic weight that connects presynaptic neuron $i$ from layer $l-1$. Here, $b_j^l / T_e$ can be interpreted as a constant injecting current across the encoding time window of size $T_e$, and $b_j^l$ is determined from the bias term of the coupled analog neurons, which will be explained in the tandem learning section. As shown in Figure 3, neuron $j$ integrates the input current $z_j^l(t)$ into its membrane potential $V_j^l(t)$ as per Eq. 5.

$$V_j^l(t) = V_j^l(t-1) + z_j^l(t) - \vartheta\,\theta_j^l(t-1) \quad (5)$$

$V_j^l(0)$ is reset and initialized to zero for every new frame-based feature input. Without loss of generality, a unitary membrane resistance is assumed here. An output spike is generated whenever $V_j^l(t)$ crosses the firing threshold $\vartheta$ (Eq. 7), which we set to a value of 1 for all the experiments by assuming that all synaptic weights are normalized with respect to $\vartheta$.

$$\theta_j^l(t) = \Theta\big(V_j^l(t) - \vartheta\big) \quad (6)$$

$$\text{with} \quad \Theta(x) = \begin{cases} 1, & \text{if } x \geq 0 \\ 0, & \text{otherwise} \end{cases} \quad (7)$$
According to Eqs. 4 and 5, the free aggregate membrane potential of neuron $j$ (no firing) in layer $l$ can be expressed as

$$V_j^{l,f} = \sum_i w_{ji}^{l-1}\,c_i^{l-1} + b_j^l, \quad (8)$$

where $c_i^{l-1}$ is the input spike count from presynaptic neuron $i$ at layer $l-1$ as per Eq. 9.

$$c_i^{l-1} = \sum_{t=1}^{T_e} \theta_i^{l-1}(t) \quad (9)$$

The $V_j^{l,f}$ summarizes the aggregate membrane potential contributions of the incoming spikes from presynaptic neurons while ignoring their temporal distribution. As will be explained in the tandem learning framework section, this intermediate quantity links the SNN layers to the coupled ANN layers for parameter optimization.
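The neuron model described in this section can be sketched in a few lines. The following is only an illustrative simulation of a single IF neuron with reset by subtraction, not the implementation used in the experiments; the function name and the toy weights, bias and spike trains are hypothetical, and exact rational arithmetic is used purely to keep the toy numbers free of rounding effects.

```python
from fractions import Fraction

def simulate_if_neuron(input_spikes, weights, bias, t_e, threshold=1):
    """Simulate one integrate-and-fire neuron with reset by subtraction.

    input_spikes[i][t] is 1 if afferent neuron i fires at step t, else 0.
    Returns (output spike train, free aggregate membrane potential).
    """
    v = Fraction(0)  # membrane potential, reset to zero for every new frame
    out = []
    for t in range(t_e):
        # Synaptic current: weighted input spikes plus the bias spread over T_e.
        z = sum(Fraction(w) * s[t] for w, s in zip(weights, input_spikes))
        z += Fraction(bias) / t_e
        v += z
        fired = 1 if v >= threshold else 0
        v -= threshold * fired  # reset by subtraction keeps the residual
        out.append(fired)
    # Free aggregate potential ignores spike timing: weights times spike counts.
    counts = [sum(s) for s in input_spikes]
    v_free = sum(Fraction(w) * c for w, c in zip(weights, counts)) + Fraction(bias)
    return out, v_free

# Toy example with two afferent neurons and a window of four time steps.
spikes_in = [[1, 0, 1, 1], [0, 1, 1, 0]]
out_train, v_free = simulate_if_neuron(
    spikes_in,
    weights=[Fraction(1, 2), Fraction(3, 4)],
    bias=Fraction(1, 2),
    t_e=4,
)
```

In this toy case the neuron emits three output spikes while the free aggregate membrane potential equals 7/2, so the output spike count matches the integer part of the potential divided by the threshold. This does not hold in general, for example when the neuron saturates at one spike per time step.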

Neural Coding Scheme
SNNs process information transmitted via spike trains; therefore, special mechanisms are required to encode the continuous-valued feature vectors into spike trains and to decode the classification results from the activity of output neurons. To this end, we adopt the spiking neural encoding scheme that was proposed in the tandem learning framework (Wu et al., 2019). This encoding scheme first transforms the frame-based input feature vector $X^0$ (e.g., MFCC or FBANK features), where $X^0 = [x_1^0, x_2^0, \ldots, x_n^0]^T$, through a weighted layer of rectified linear unit (ReLU) neurons as follows

$$a_j^0 = f\Big(\sum_i w_{ji}^0\,x_i^0 + b_j^0\Big), \quad (10)$$

where $w_{ji}^0$ is the strength of the synaptic connection between the input $x_i^0$ and ReLU neuron $j$. The $b_j^0$ is the corresponding bias term of neuron $j$, and $f(\cdot)$ denotes the ReLU activation function. The free aggregate membrane potential $V_j^{0,f}(0)$ is defined to be equal to the activation value $a_j^0$ of ReLU neuron $j$. We distribute this quantity over the encoding time window $T_e$ and represent it with spike trains as per Eqs. 11 and 12.
Altogether, the spike train $s^0$ and spike count $c^0$ that are output from the neural encoding layer can be represented as follows

$$s^0 = \{\theta^0(1), \ldots, \theta^0(T_e)\} \quad (13)$$

$$c^0 = \sum_{t=1}^{T_e} \theta^0(t) \quad (14)$$

This encoding layer performs a weighted transformation inside an end-to-end learning framework. It transforms the original input representation to match the size of the encoding time window $T_e$ and represents the transformed information via spike trains. This encoding scheme is beneficial for rapid inference since the input information can be effectively encoded within a short encoding window. Starting from this neural encoding layer, as shown in Figure 4, we input the spike count $c^l$ and the spike train $s^l$ to the subsequent ANN and SNN layers, respectively, for tandem learning.
To ensure smooth learning with high-precision error gradients derived at the output layer, we use the free aggregate membrane potential of the output spiking neurons for neural decoding. Considering that the dimensionalities of the input feature vectors and the output classes are much smaller than that of the hidden layers, the computation required by these two layers will be limited when deploying them onto edge devices.
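To give an intuition for how a real-valued activation can be turned into a spike train over a short window, the sketch below shows one possible realization. It assumes that the membrane potential is initialized to the ReLU activation and then discharged with reset by subtraction, one spike per time step at most; this is an assumption made for illustration and is not claimed to reproduce the exact encoding equations of the tandem learning framework. The function name and the example values are hypothetical.

```python
def encode_activation(activation, t_e=10, threshold=1.0):
    """Encode a non-negative activation as a spike train of length t_e.

    Assumption of this sketch: the membrane potential starts at the
    (rectified) activation value and loses one threshold per emitted spike.
    Returns (spike train, spike count).
    """
    v = max(0.0, activation)  # ReLU: negative activations produce no spikes
    train = []
    for _ in range(t_e):
        fired = 1 if v >= threshold else 0
        v -= threshold * fired  # reset by subtraction
        train.append(fired)
    return train, sum(train)

train, count = encode_activation(3.4)
saturated_count = encode_activation(12.5)[1]   # more charge than the window can emit
silent_count = encode_activation(-0.7)[1]      # rectified to zero
```

Under this assumption the spike count is the activation divided by the threshold, rounded down and capped at the window size, which makes explicit how a short window such as 10 steps limits the resolution of the encoded value.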

Tandem Learning for Training Deep SNNs
Although IF neurons do not emulate the rich temporal dynamics of biological neurons, they are ideal for working with the neural representation that is employed in this work, where spike timings play an insignificant role. It is worth noting that connections are commonly drawn between the activation value of ReLU neurons and the steady-state firing rate of IF neurons (Rueckauer et al., 2017). Here, we present a recently proposed SNN learning rule, under the tandem neural network configuration, that exploits such a connection between the activation value of ANN neurons and the spike count of IF neurons.
By neglecting the temporal dynamics of the IF neuron that are due to the temporal distribution of incoming spike trains, we may consider $V_j^{l,f}$ as the main information carrier for SNN layers. The following one-to-one correspondence between the free aggregate membrane potential $V_j^{l,f}$ of spiking neurons and the pseudo 'spike count' $a_j^l$ of artificial neurons can be established.
As shown in Figure 4, during the forward propagation of activations, the SNN layers are used to determine the exact spike representation, and they then propagate the aggregate spike counts and spike trains to the subsequent ANN and SNN layers, respectively. This interlaced layer structure ensures that the information forward propagated to the coupled ANN and SNN layers is synchronized. Taking Eq. 15 as the activation function of the ANN layers and using the straight-through estimator (Bengio et al., 2013) to address the discontinuity of the rounding operation, we can use the error gradients derived from the ANN layers to approximate those of the coupled SNN layers. It is worth noting that the ANN is just an auxiliary structure that facilitates the training of the SNN, while only the SNN is used during inference.
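The role of the straight-through estimator can be illustrated on a single neuron. The sketch below assumes, for illustration only, that the ANN activation is a rectified, rounded-down spike count capped at the window size $T_e$; the exact form of the activation in Eq. 15 may differ. The function names are hypothetical, and the backward pass simply treats the rounding as the identity.

```python
import math

def ann_forward(v_free, t_e=10, threshold=1.0):
    """Assumed quantized activation: a rectified, rounded-down, capped count."""
    return min(t_e, max(0, math.floor(v_free / threshold)))

def ann_backward(v_free, upstream_grad, t_e=10, threshold=1.0):
    """Straight-through estimate of the gradient with respect to v_free.

    The rounding is treated as the identity, so the gradient is 1/threshold
    inside the unsaturated range and zero where the count is clipped.
    """
    if 0.0 < v_free < t_e * threshold:
        return upstream_grad / threshold
    return 0.0

count = ann_forward(2.7)              # rounding discards the fractional part
grad = ann_backward(2.7, 0.5)         # gradient passes straight through
blocked = ann_backward(-1.3, 0.5)     # no gradient for a silent neuron
```

Without such an estimator the derivative of the rounding operation would be zero almost everywhere, and no useful error signal could be propagated to the weights of earlier layers.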

Deep SNNs for Large Vocabulary ASR
Notably, competitive classification accuracies have been demonstrated with this tandem learning rule for image classification on the ImageNet dataset (Wu et al., 2019). By analyzing the relationship between the approximated 'spike count' $a_j^l$ and the actual spike count $c_j^l$ in a high-dimensional space, Wu et al. have argued that the modified learning dynamics of such a decoupled network can approximate those of an intact ANN. The details of this tandem learning rule are provided in Algorithm 1.

SNN-based Acoustic Modeling
To train the deep SNN-based acoustic models, which is the main contribution of this work, several popular speech features are extracted from the training recordings as described in Section 2.2. Before being fed into the SNNs, these input speech features are contextualized by splicing multiple frames so as to exploit more temporal context information. Before training the SNN-based acoustic model, alignments of the speech features with the target senone labels are obtained using a conventional GMM-HMM-based ASR system similar to that described in (Dahl et al., 2012). These frame-level alignments enable the training of the deep SNN acoustic model with the tandem learning approach. During training, the deep SNN learns to map input speech features to posterior probabilities of senones (cf. Section 2.2) by passing the input speech frames through multiple layers of spiking neurons.
During the inference phase, the acoustic scores provided by the trained SNN model are combined with the information stored in the language model and pronunciation lexicon. It is common practice to use weighted finite state transducers (WFST) (Mohri et al., 2002) as a unified representation of different ASR resources for creating the search graph containing possible hypotheses. The main motivations for using WFST-based decoding are: (1) the straightforward composition of different ASR resources for constructing a mapping from HMM states to word sequences and (2) the existence of efficient search algorithms operating on WFSTs that speed up the decoding process. As a result of the search process, the most likely hypotheses are found and stored in the form of a lattice. The ASR output is chosen based on the weighted sum of the acoustic and language model scores belonging to hypotheses in the lattice. For further details of the WFST-based decoding approach used in this work, we refer the reader to (Povey et al., 2012). In the following sections, we describe the ASR experiments conducted to evaluate the recognition performance of the proposed SNN-based acoustic modeling in several recognition scenarios.

Datasets
The performance of the proposed SNN-based acoustic models is investigated in three different ASR tasks: (1) phone recognition using the TIMIT corpus (Garofolo et al., 1993), (2) low-resourced ASR using the FAME code-switching Frisian-Dutch corpus (Yılmaz et al., 2016) and (3) standard large-vocabulary continuous ASR using the Librispeech corpus (Panayotov et al., 2015). All speech data used in the experiments has a sampling frequency of 16 kHz.
The train, development and test sets of the standard TIMIT corpus contain 3,696, 400 and 192 utterances from 462, 50 and 24 speakers, respectively. Each utterance is phonetically transcribed using a phonetic alphabet consisting of 48 phones in total. The training data of the FAME corpus comprises 8.5 hours and 3 hours of broadcast speech from Frisian and Dutch speakers, respectively. The training utterances are spoken by 382 speakers in total. This bilingual dataset contains Frisian-only and Dutch-only utterances as well as mixed utterances with inter-sentential, intra-sentential and intra-word code-switching (Myers-Scotton, 1989). The development and test sets each consist of 1 hour of speech from Frisian speakers and 20 minutes of speech from Dutch speakers. The total number of speakers is 61 in the development set and 54 in the test set.
The Librispeech corpus contains 1,000 hours of read speech in total, collected from audiobooks. This publicly available corpus is considered a popular benchmark for ASR algorithms, with multiple training and testing settings. In the ASR experiments, we train acoustic models using the 100 (train clean 100) and 360 (train clean 360) hours of speech and apply these models to the clean development (dev clean) and test (test clean) sets. Further details about this corpus can be found in (Panayotov et al., 2015).

Implementation Details
All ASR experiments are performed using the PyTorch-Kaldi ASR toolkit (Ravanelli et al., 2019). This recently introduced toolkit inherits the flexibility of the PyTorch toolkit (Paszke et al., 2017) for ANN-based acoustic model development and the efficiency of the Kaldi ASR toolkit (Povey et al., 2011). We implement the SNN tandem learning rule in PyTorch and integrate it into the PyTorch-Kaldi toolkit for training the proposed SNN-based acoustic models (cf. Figure 4). The PyTorch implementation of the described SNN acoustic models will be made available online soon. For the baseline ANN models, the standard multi-layer perceptron recipes are used. The Kaldi toolkit is used for obtaining the initial alignments, feature extraction, graph creation, and decoding.
For all recognition scenarios, ANNs and SNNs are constructed with 5 hidden layers of 2048 hidden units each, using the ReLU activation function. Each fully-connected layer is followed by a batch normalization layer and a dropout layer with a drop probability of 10% to prevent overfitting. We train these models using various popular speech features, including the 13-dimensional Mel-frequency cepstral coefficient (MFCC) feature, the 23-dimensional Mel-filterbank (FBANK) feature, and higher resolution 40-dimensional MFCC and FBANK features. We further extract feature space maximum likelihood linear regression (FMLLR) (Gales, 1998) features to explore the impact of speaker-dependent features. All features include the deltas and delta-deltas; mean and variance normalization is applied before the splicing. The time context size is set to 11 frames by concatenating each frame with the 5 preceding and 5 following frames. All features are encoded within a short time window of 10 time steps for SNN simulations.
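The context splicing described above can be sketched as follows. This is an illustrative stand-in for the feature pipeline rather than the Kaldi implementation; the function name is hypothetical, and repeating the first and last frame at the utterance boundaries is an assumption of this sketch.

```python
def splice_frames(frames, left=5, right=5):
    """Concatenate each frame with its left and right context frames.

    frames is a list of equally sized feature vectors (lists of floats).
    At the utterance edges the boundary frame is repeated (an assumption).
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp at utterance edges
            window.extend(frames[idx])
        spliced.append(window)
    return spliced

# Toy example: three 2-dimensional frames with one frame of context per side.
toy = splice_frames([[1, 2], [3, 4], [5, 6]], left=1, right=1)

# A 40-dimensional feature with deltas and delta-deltas has 120 values per
# frame, so an 11-frame context yields 1320-dimensional network inputs.
wide = splice_frames([[0.0] * 120 for _ in range(20)])
input_dim = len(wide[0])
```

The second example makes the input dimensionality explicit: with 5 frames of context on each side, the network input is 11 times larger than the per-frame feature vector.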
The neural network training is performed by mini-batch Stochastic Gradient Descent (SGD) with an initial learning rate of 0.08 and a mini-batch size of 128. The learning rate is halved if the improvement is less than a preset threshold of 0.001. The final acoustic models of the TIMIT and FAME corpora are obtained after 24 training epochs, while the models of the Librispeech corpus are trained for 12 epochs.
For the TIMIT and Librispeech ASR tasks, we follow the same language model (LM) and pronunciation lexicon preparation pipeline as provided in the corresponding Kaldi recipes. The smallest 3-gram LM (tgsmall) of the Librispeech corpus is used to create the graph for the decoding stage. The details of the LM and lexicon used in the FAME recognition task are given in (Yılmaz et al., 2018).
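The learning rate schedule can be sketched as below. The text does not state on which quantity the improvement is measured; this sketch assumes a relative improvement of a validation error between two consecutive epochs, and the function name and example values are hypothetical.

```python
def next_learning_rate(lr, prev_error, curr_error, threshold=0.001, factor=0.5):
    """Halve the learning rate when the validation error stops improving.

    Assumption of this sketch: the improvement is the relative decrease of
    the validation error with respect to the current epoch.
    """
    if curr_error <= 0.0:
        return lr  # nothing left to improve; avoid dividing by zero
    improvement = (prev_error - curr_error) / curr_error
    return lr * factor if improvement < threshold else lr

kept = next_learning_rate(0.08, prev_error=0.300, curr_error=0.290)
halved = next_learning_rate(0.08, prev_error=0.2900, curr_error=0.2899)
```

In the first call the error drops by more than 3% relative, so the rate of 0.08 is kept; in the second call the relative improvement is below 0.001 and the rate is halved to 0.04.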

ASR Performance
The phone recognition performance on the TIMIT corpus is reported in terms of the phone error rate (PER). The word recognition accuracies on the FAME and Librispeech corpora are reported in terms of the word error rate (WER). Both metrics are calculated as the ratio of all recognition errors (insertions, deletions, and substitutions) to the total number of phones or words in the reference transcriptions.
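Both error rates can be computed with a standard edit-distance alignment between the reference and the hypothesis. The following is a minimal sketch of that computation rather than the scoring tool used in the experiments; the function name and the example sentences are hypothetical.

```python
def error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / len(reference).

    Both arguments are lists of tokens (words for WER, phones for PER).
    """
    if not reference:
        raise ValueError("reference must not be empty")
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(reference)

ref = "the cat sat on the mat".split()
hyp = "the cat sit on mat".split()
wer = error_rate(ref, hyp)
```

Here the hypothesis contains one substitution and one deletion against a six-word reference, giving a WER of 2/6. Note that the error rate can exceed 100% when the hypothesis contains many insertions.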

Energy Efficiency: Counting Synaptic Operations
To compare the energy efficiency of an ANN and its equivalent SNN implementation, we follow the convention of the neuromorphic computing (NC) community and compute the total synaptic operations (SynOps) that are required to perform a certain task (Merolla et al., 2014; Rueckauer et al., 2017; Sengupta et al., 2019). For the ANN, the total synaptic operations (Multiply-and-Accumulate (MAC)) per classification are defined as follows

$$\mathrm{SynOps(ANN)} = \sum_{l=1}^{L} f_{in}^{l}\,N_l, \quad (16)$$

where $f_{in}^{l}$ denotes the number of fan-in connections to each neuron in layer $l$, and $N_l$ refers to the number of neurons in layer $l$. In addition, $L$ denotes the total number of network layers. Hence, given a particular network configuration, the total number of synaptic operations required per classification is a constant that is jointly determined by $f_{in}^{l}$ and $N_l$.
For the SNN, as per Eq. 17, the total synaptic operations (Accumulate (AC)) required per classification are correlated with the spiking neurons' firing rate, the number of fan-out connections $f_{out}$ to neurons in the subsequent layer, and the simulation time window $T$.

$$\mathrm{SynOps(SNN)} = \sum_{t=1}^{T} \sum_{l=1}^{L-1} \sum_{j=1}^{N_l} f_{out,j}^{l}\,s_j^l(t), \quad (17)$$

where $s_j^l(t)$ indicates whether a spike is generated by neuron $j$ of layer $l$ at time instant $t$.
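The two operation counts can be compared with a small sketch. It assumes fully connected layers, so that the fan-in of a layer equals the size of the preceding layer and every spike triggers one accumulate operation per neuron of the following layer; the function names, the toy layer sizes and the spike counts are hypothetical.

```python
def synops_ann(layer_sizes):
    """MAC operations per classification for a fully connected ANN.

    layer_sizes lists the neurons per layer, input first; each neuron in
    layer l has a fan-in equal to the size of layer l - 1.
    """
    return sum(fan_in * n for fan_in, n in zip(layer_sizes, layer_sizes[1:]))

def synops_snn(layer_sizes, spike_counts):
    """AC operations per classification for the matching fully connected SNN.

    spike_counts[l] is the total number of spikes emitted by layer l over the
    whole time window; each spike costs one accumulate per fan-out connection.
    """
    return sum(count * fan_out
               for count, fan_out in zip(spike_counts, layer_sizes[1:]))

sizes = [4, 3, 2]                       # toy network: input, hidden, output
ann_ops = synops_ann(sizes)             # 4*3 + 3*2
snn_ops = synops_snn(sizes, [6, 2])     # 6 input spikes, 2 hidden spikes
ratio = snn_ops / ann_ops
```

The ANN count is fixed by the architecture, whereas the SNN count scales with the observed number of spikes, which is why sparser activity directly lowers the SynOps ratio.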

Phone Recognition on TIMIT Corpus
We report the PER on the development and test sets of the TIMIT corpus in Table 1, with numbers in bold being the best performance given by the speaker-independent features. The ASR performance of other state-of-the-art systems using various ANN and SNN architectures is given in the upper panel for reference purposes. As shown in Table 1, the proposed SNN-based acoustic models are applicable to different speech features and provide comparable or slightly worse ASR performance than the ANNs with the same network structure. In particular, the ANN system trained with the standard 13-dimensional FBANK feature achieves the best PER of 16.9% (18.5%) on the development (test) set. The equivalent SNN system using the same feature achieves a slightly worse PER of 17.3% (18.7%) on the development (test) set. Although the state-of-the-art ASR systems (Ravanelli et al., 2018) give approximately 1% lower PER than the proposed SNN-based phone recognition system, this is largely attributable to the longer time context explored by the recurrent Li-GRU model.
It is worth mentioning that phone recognition is still a challenging task for spiking neural networks. To the best of our knowledge, only one recent work with recurrent spiking neural networks (Bellec et al., 2019) demonstrates some promising test results on this corpus, with a PER of 26.4%. In contrast, our system achieves a significantly lower PER than this preliminary study of SNN-based acoustic modeling.
However, these results are not directly comparable since the proposed system incorporates both an acoustic and a language model during decoding unlike the system described in (Bellec et al., 2019).
The experimental results on the TIMIT phone recognition task can be considered as an initial indicator of the compelling prospects of SNN-based acoustic modeling. Given that the phone recognition task on the TIMIT corpus is simplistic compared to modern LVCSR tasks, we further compare the ANN and SNN performance on newer corpora designed for LVCSR experiments.

Low-resourced ASR on FAME Corpus
In this section, we apply the SNN-based ASR systems to the low-resourced ASR scenario. As summarized in Table 2, the word recognition results on the FAME corpus are reported separately for monolingual Frisian (fy), monolingual Dutch (nl) and code-switched (cs) utterances. The overall performance (all) is also included in the rightmost column. Given that 8.5 hours of Frisian and 3 hours of Dutch speech are used during the training phase, we can compare the ASR performance on different subsets, i.e., fy, nl and cs, to identify the variations in the ASR performance for different levels of low-resourcedness. We omit the results on the development set as they follow a similar pattern to the results on the test set.
In this scenario, the SNN acoustic models consistently provide lower WERs than the ANN models for all speech features. Systems with the FBANK features provide lower WERs than those using MFCC features, which is in line with our observations on the TIMIT corpus. The best performance on the test set is obtained using SNN models trained on 40-dimensional FBANK features, with an overall WER of 36.9%. In contrast, the ANN model provides a WER of 39.0% for the same setting; the SNN model thus yields a relative WER reduction of 5.4%. Moreover, the SNN-based acoustic models achieve relative improvements of 4.7%, 5.2% and 8.2% on the fy, nl and cs subsets of the test set, respectively. These steady improvements in the recognition accuracies highlight the effectiveness of SNN-based acoustic modeling in scenarios with limited training data compared to the conventional ANN models. The improved ASR performance with SNNs in the low-resourced setting may be attributed to the noisy weight updates derived by the coupled ANN layers of the tandem learning framework (Wu et al., 2019). It has been recognized that introducing noise into the training stage improves the generalization capability of ANN-based ASR systems (Yin et al., 2015). As a result, the noisy training of tandem learning is expected to improve the recognition performance in low-resourced scenarios. Further investigation of the impact of this noisy training procedure remains as future work.

LVCSR experiments on Librispeech Corpus
In the final set of ASR experiments, we train acoustic models using the official 100-hour and 360-hour training subsets of the Librispeech corpus to compare the recognition performance of ANN and SNN models in a standard LVCSR scenario.As the results given in the middle panel of Table 3, for 100 hours of training data, the ANN systems perform marginally better than the corresponding SNN systems across all different speech features.The absolute WER differences range from 0.1% to 0.6%.These marginal performance degradations of the SNN models is likely due to the reduced representation power of using discrete spike counts.Nevertheless, these results are promising even when comparing to the state-of-the-art ASR systems using more complex ANN architectures as provided in the upper panel of Table 3.
It is worth noting that both ANN and SNN systems benefit from an increased amount of training data. When the training data is increased from 100 hours to 360 hours, the WERs of the best SNN models decrease from 10.0% (10.3%) to 9.2% (9.4%) on the development (test) set. To the best of our knowledge, this is the first time that SNN-based acoustic models have achieved results comparable to those of ANN models on LVCSR tasks. These results suggest that SNNs are potentially good candidates for acoustic modeling.

Energy Efficiency of SNN-based ASR Systems
In addition to their promising modeling capability, SNN-based ASR systems can achieve unprecedented performance gains when implemented on low-power neuromorphic chips. In this section, we shed light on this prospect by comparing the energy efficiency of ANN- and SNN-based acoustic models. Given that data movements are the most energy-consuming operations for data-driven AI applications, we calculate the average synaptic operations on 5 randomly chosen utterances from the TIMIT corpus and report the ratio of average synaptic operations required per feature classification (SynOps(SNN) / SynOps(ANN)). To investigate the effect of different feature representations, we repeat our analysis on the 40-dimensional MFCC, FBANK and FMLLR features, as summarized in Table 4 and Figure 5.
Taking advantage of the short encoding time window (T_e = 10), sparse neuronal activities are observed for all network layers, as shown in Figure 5. Among the three features explored in this experiment, it is interesting to note that the FMLLR feature achieves the lowest average spike rate. This is likely due to the more discriminative nature of this speaker-dependent feature, although it is worth noting that the FMLLR feature is not available in all ASR scenarios. As provided in Table 4, the SNN implementations taking MFCC, FBANK and FMLLR input features require 1.72, 1.10 and 0.68 times the synaptic operations of their ANN counterparts, respectively. Although the average number of synaptic operations required for SNNs using MFCC and FBANK features is slightly higher than for the ANNs, the AC operations performed on SNNs are much cheaper than the MAC operations required for ANNs. One recent study on the GlobalFoundries 28 nm process has revealed that MAC operations are 14 times more costly than AC operations and require 21 times more chip area (Rueckauer et al., 2017). Therefore, when deploying SNNs onto the emerging neuromorphic chips for inference (Merolla et al., 2014; Davies et al., 2018), we expect to obtain at least an order of magnitude savings in energy and chip area. The actual energy savings for SNN-based acoustic models, however, depend on the chip architectures and materials used, which is outside the scope of this work.
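To make the arithmetic behind this comparison explicit, the SynOps ratios of Table 4 can be combined with the quoted per-operation cost ratio. The sketch below assumes, as a deliberate simplification that is not taken from this work, that energy scales as the number of synaptic operations times the per-operation cost; it ignores memory access, static power and other chip-level effects. The function and variable names are hypothetical.

```python
# SynOps(SNN) / SynOps(ANN) ratios quoted in the text (Table 4).
SYNOPS_RATIO = {"MFCC": 1.72, "FBANK": 1.10, "FMLLR": 0.68}

# Energy cost of one MAC relative to one AC, as quoted from
# Rueckauer et al. (2017) in the text.
MAC_TO_AC_ENERGY = 14.0


def estimated_energy_saving(synops_ratio: float,
                            mac_to_ac: float = MAC_TO_AC_ENERGY) -> float:
    """First-order ANN-to-SNN energy ratio.

    Assumption (not from the paper): energy = operation count x per-operation
    cost, with ANN operations being MACs and SNN operations being ACs.
    """
    if synops_ratio <= 0:
        raise ValueError("SynOps ratio must be positive")
    return mac_to_ac / synops_ratio


savings = {name: estimated_energy_saving(r) for name, r in SYNOPS_RATIO.items()}
for name, value in savings.items():
    print(f"{name}: roughly {value:.1f}x less energy than the ANN (first-order estimate)")
```

Under this simplification the estimated saving is smallest for MFCC and largest for FMLLR, mirroring the ordering of the SynOps ratios; the absolute numbers should be read only as a rough indication.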

DISCUSSION
The remarkable progress in automatic speech recognition systems has revolutionized the human-computer interface. The rapidly growing demand for ASR services has raised concerns about computational efficiency, real-time performance and data security, which motivates novel solutions to address these concerns. Inspired by the event-driven computation observed in biological neural systems, we explore using brain-inspired spiking neural networks for large vocabulary ASR tasks. For this purpose, we propose a novel SNN-based ASR framework, wherein the SNN is used for acoustic modeling and maps the frame-level features into a set of acoustic units. These frame-level outputs are then combined with the word-level information from the corresponding language model to find the most likely word sequence corresponding to the input speech signal.

Superior Speech Recognition Performance with SNNs
The phone and word recognition experiments on the well-known TIMIT and Librispeech benchmarks have demonstrated the promising modeling capacity of SNN acoustic models and their applicability to different input features. These preliminary results show that the recognition performance of SNNs is either comparable to or slightly worse than that of ANNs with the same network architecture on the TIMIT and Librispeech benchmarks. A possible reason for this performance degradation is the reduced representation power of the discrete neural representation (i.e., spike counts) as compared to the continuous floating-point representation of the ANNs (Wu et al., 2019). This performance gap could potentially be closed by extending the encoding window T_e of SNNs. Moreover, the recognition performance of ANN and SNN models in a low-resourced scenario is also investigated. In this scenario, the SNN acoustic models outperform the conventional ANNs, which could be attributed to the noisy training of the tandem learning framework, wherein error gradients of the SNN layers are approximated from the coupled ANN layers.
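The effect of the encoding window on the representation power of spike counts, mentioned above, can be illustrated with a toy example. The sketch below is not the neural encoding scheme of this work: it assumes a single integrate-and-fire neuron driven by a constant input, a firing threshold of 1.0, at most one spike per time step and reset by subtraction. Under these assumptions a spike count can take only T_e + 1 distinct values, so the decoding error is bounded by 1 / T_e.

```python
def if_spike_count(activation: float, t_e: int, threshold: float = 1.0) -> int:
    """Count the spikes of an integrate-and-fire neuron over t_e time steps.

    Assumptions for illustration: the same input `activation` arrives at every
    step, the neuron fires at most once per step, and the membrane potential
    is reset by subtracting the threshold.
    """
    potential = 0.0
    count = 0
    for _ in range(t_e):
        potential += activation
        if potential >= threshold:
            count += 1
            potential -= threshold
    return count


def decoded_value(activation: float, t_e: int) -> float:
    """Map the spike count back to the activation scale (count / t_e)."""
    return if_spike_count(activation, t_e) / t_e


x = 0.37  # hypothetical normalized activation in [0, 1]
for t_e in (10, 20, 40):
    approx = decoded_value(x, t_e)
    print(f"T_e={t_e}: decoded={approx:.3f}, abs. error={abs(x - approx):.3f}")
```

A longer window gives a finer grid of representable values, at the cost of more time steps and more synaptic operations per frame.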
The neural encoding scheme adopted in this work allows input features to be encoded inside a short encoding time window for rapid processing by SNNs. This is attractive for time-synchronous ASR tasks that require real-time performance. The preliminary study of energy efficiency on the TIMIT corpus indicates that at least an order of magnitude savings in energy and chip area, as compared to the equivalent ANNs, can be achieved when deploying the offline-trained SNNs onto neuromorphic chips. A recent study of a keyword spotting task on the Loihi neuromorphic research chip (Blouw et al., 2019) has also demonstrated the compelling energy savings, real-time performance and good scalability of emerging NC architectures over conventional low-power AI chips designed for ANNs.

Development of SNN-based ASR Systems
The active development of open-source software toolkits plays a significant role in the rapid progress of ASR research; examples include Kaldi (Povey et al., 2011) and ESPnet (Watanabe et al., 2018). In this work, we demonstrate that state-of-the-art SNN acoustic models can be easily developed in PyTorch and integrated into the PyTorch-Kaldi Speech Recognition Toolkit (Ravanelli et al., 2019). This software toolkit combines the efficiency of Kaldi with the flexibility of PyTorch and can therefore support the rapid development of SNN-based ASR systems.

Future Directions
Recurrent neural networks have shown great modeling capability for temporal signals by exploiting long temporal context information in the input signals (Graves and Jaitly, 2014b). As future work, we will explore recurrent networks of spiking neurons for large vocabulary ASR tasks to further improve the recognition performance.
Substantial research efforts have been devoted to reducing the computational cost and memory footprint of ANNs during inference; examples include network compression (Han et al., 2015), network quantization (Courbariaux et al., 2016; Zhou et al., 2016) and knowledge distillation (Hinton et al., 2015). However, the computational paradigm underlying the efficient biological neural networks is fundamentally different from that of ANNs and hence offers enormous potential for neuromorphic computing architectures. Furthermore, grounded in the same connectionist principle, the information in both ANNs and SNNs is encoded in the network connectivity and connection strength. Therefore, SNNs can also benefit from these earlier research works on the network compression and quantization of ANNs to further reduce their memory footprint and computational cost (Deng et al., 2019).
The event-driven silicon cochlea audio sensors (Liu et al., 2014) are designed to mimic the functional mechanism of the human cochlea and transform input audio signals into spiking events. Given that temporally sparse information is transmitted in the surrounding environment, these sensors have shown greater coding efficiency than conventional microphone sensors (Liu et al., 2019). Some interesting preliminary ASR studies explore the input spiking events captured by these silicon cochlea sensors (Anumula et al., 2018; Acharya et al., 2018). However, the scale of the ASR tasks explored in these studies is relatively small compared to modern ASR benchmarks due to the limited availability of event-based ASR corpora. Pan et al. (2019) recently proposed an efficient and perceptually motivated auditory neural encoding scheme to encode large-scale ASR corpora collected by microphone sensors into spiking events. With this encoding scheme, approximately 50% of the spiking events can be removed with negligible impact on the perceptual quality of the input audio signals. Building on this earlier research on neuromorphic auditory front-ends, we expect to further improve the energy efficiency of SNN-based ASR systems.
The promising initial results demonstrated by the SNN-based large vocabulary ASR systems in this work are a first step towards a myriad of opportunities for integrating state-of-the-art ASR engines into mobile and embedded devices with power restrictions. In the long run, SNN-based ASR systems are expected to benefit from the ever-growing research on novel neuromorphic auditory front-ends, SNN architectures, neuromorphic computing architectures and ultra-low-power non-volatile memory devices to further improve the computing performance.

Algorithm 1: Pseudocode for the Tandem Learning Rule

Input: input frame-based feature vectors X^0, target label Y, network parameters w, neural encoding window size T_e
Output: updated network parameters w

Forward Pass:
    c^0, s^0 = NeuralEncoding(X^0)
    for layer l = 1 to N-1 do
        // State update of the ANN layer
        a^l = ANN.layer[l].forward(c^{l-1}, w^{l-1}) *
        for t = 1 to T_e do
            // State update of the SNN layer
            s^l[t] = SNN.layer[l].forward(s^{l-1}[t], w^{l-1})
        // Update the spike count
        c^l = \sum_{t=1}^{T_e} s^l[t]
    /* Neural decoding with the aggregate membrane potential */
    output = ANN.layer[N].forward(c^{N-1}, w^{N-1})

Loss:
    E = LossFunction(Y, output)

Backward Pass:
    ∂E/∂a^N = LossGradient(Y, output)
    for layer l = N-1 to 1 do
        // Gradient update through the ANN layer
        ∂E/∂a^{l-1}, ∂E/∂w^{l-1} = ANN.layer[l].backward(∂E/∂a^l, c^{l-1}, w^{l-1})
    Update parameters of the ANN layer based on the calculated gradients.
    Copy the updated parameters to the corresponding SNN layer.

Note: * For inference, state updates are performed on the SNN layers entirely.
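The forward pass of Algorithm 1 for a single hidden layer can be sketched in plain Python as follows. This is an illustrative toy under stated assumptions rather than the implementation used in this work: the ANN branch is assumed to use a ReLU activation on the input spike counts, the SNN branch is assumed to use integrate-and-fire neurons with threshold 1.0 and reset by subtraction, and the names `IFLayer` and `tandem_layer_forward` are hypothetical. The backward pass, which would propagate gradients through the ANN branch only, is omitted.

```python
from typing import List, Tuple


class IFLayer:
    """Fully connected layer of integrate-and-fire neurons (illustrative)."""

    def __init__(self, weights: List[List[float]], threshold: float = 1.0):
        self.weights = weights      # weights[j][i]: input i -> neuron j
        self.threshold = threshold  # assumed firing threshold
        self.potential = [0.0] * len(weights)

    def forward(self, spikes_in: List[int]) -> List[int]:
        """Advance one time step and return the binary output spikes."""
        spikes_out = []
        for j, row in enumerate(self.weights):
            self.potential[j] += sum(w * s for w, s in zip(row, spikes_in))
            if self.potential[j] >= self.threshold:
                spikes_out.append(1)
                self.potential[j] -= self.threshold  # reset by subtraction
            else:
                spikes_out.append(0)
        return spikes_out


def tandem_layer_forward(
    spike_trains_in: List[List[int]], weights: List[List[float]]
) -> Tuple[List[float], List[int]]:
    """Return (ANN activation a^l, SNN spike count c^l) for one layer.

    spike_trains_in[t][i] is the spike of input i at time step t, so the
    encoding window T_e equals len(spike_trains_in).
    """
    n_in = len(spike_trains_in[0])
    # Input spike counts c^{l-1}, obtained by summing spikes over the window.
    counts_in = [sum(step[i] for step in spike_trains_in) for i in range(n_in)]

    # ANN branch: operates on spike counts with shared weights (assumed ReLU).
    ann_act = [max(0.0, sum(w * c for w, c in zip(row, counts_in)))
               for row in weights]

    # SNN branch: processes the spike trains step by step with the same weights.
    snn = IFLayer(weights)
    counts_out = [0] * len(weights)
    for spikes_in in spike_trains_in:
        spikes_out = snn.forward(spikes_in)
        counts_out = [c + s for c, s in zip(counts_out, spikes_out)]
    return ann_act, counts_out


# Two inputs observed over T_e = 4 steps, two neurons in the layer.
spike_trains = [[1, 1], [1, 0], [1, 1], [1, 0]]
weights = [[0.5, 0.25], [-0.5, 0.25]]
ann_act, spike_counts = tandem_layer_forward(spike_trains, weights)
print(ann_act, spike_counts)  # prints [2.5, 0.0] [2, 0]
```

In this toy case the ANN activation (2.5) is close to, but not identical with, the SNN spike count (2), which illustrates why the ANN branch can serve as a differentiable stand-in for the spiking layer while the spike count is what is passed on to the next layer.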

Figure 3. The neuronal dynamics of an integrate-and-fire neuron (red). In this example, three pre-synaptic neurons send asynchronous spike trains to this neuron. Output spikes are generated when the membrane potential V crosses the firing threshold (top right corner).

Figure 4. System flowchart for SNN training within a tandem neural network, wherein SNN layers are used in the forward pass to determine the spike count and spike train. The ANN layers are used for error back-propagation to approximate the gradient of the coupled SNN layers.

Figure 5. Average spike count per neuron of different SNN layers on the TIMIT corpus. The results of different input features are color-coded. Sparse neuronal activities can be observed in this bar chart.

Table 1. PER (%) on the TIMIT development and test sets. The upper panel reports the results of various ANN and SNN architectures from the literature, and the lower panel presents the results achieved by the ANN and SNN models in this work (AM: acoustic model, *: the best result to date).

Table 2. WERs (%) achieved on Yılmaz et al., 2018)mixed segments of the FAME test set. The upper panel summarizes the number of words from each language subset. The middle panel provides the results of state-of-the-art ANN architectures (Yılmaz et al., 2016; Yılmaz et al., 2018) for reference purposes, and the lower panel presents the results achieved by the ANN and SNN models in this work (AM: acoustic model).

Table 3. WER (%) achieved on the Librispeech development and test sets. The upper panel gives the results, with 100 hours of training data, reported at the GitHub repositories of Kaldi and PyTorch-Kaldi. The middle and lower panels present the results achieved by the ANN and SNN models in this work using 100 hours and 360 hours of training data, respectively (AM: acoustic model, †: reported at the GitHub repository).

Table 4. Comparison of the computational costs between SNN and ANN. The ratio of their required total synaptic operations (SynOps(SNN) / SynOps(ANN)) is reported. It is worth mentioning that ANNs use MAC operations, which are more costly than the AC operations used in SNNs.