An Efficient and Perceptually Motivated Auditory Neural Encoding and Decoding Algorithm for Spiking Neural Networks

The auditory front-end is an integral part of a spiking neural network (SNN) when performing auditory cognitive tasks. It encodes temporal dynamic stimuli, such as speech and audio, into an efficient, effective, and reconstructable spike pattern to facilitate subsequent processing. However, most of the auditory front-ends in current studies have not made use of recent findings in psychoacoustics and physiology concerning human listening. In this paper, we propose a neural encoding and decoding scheme that is optimized for audio processing. The neural encoding scheme, which we call Biologically plausible Auditory Encoding (BAE), emulates the functions of the perceptual components of the human auditory system, including the cochlear filter bank, the inner hair cells, auditory masking effects from psychoacoustic models, and the spike neural encoding by the auditory nerve. We evaluate the perceptual quality of the BAE scheme using PESQ, and its performance on sound classification and speech recognition experiments. Finally, we also build and publish two spike versions of speech datasets, Spike-TIDIGITS and Spike-TIMIT, for researchers to use and for the benchmarking of future SNN research.


INTRODUCTION
The temporal or rate-based Spiking Neural Network (SNN), supported by stronger biological evidence than the conventional artificial neural network (ANN), represents a promising research direction. Neurons in an SNN communicate using spike trains, which are temporal signals in nature, making SNNs a natural choice for dealing with dynamic signals such as audio, speech, and music.
In the domain of rate coding, we studied the computational efficiency of SNNs (Pan et al., 2019). Recently, further evidence has supported the theory of temporal coding with spike times. To learn a temporal spike pattern, a number of learning rules have been proposed, which include the single-spike Tempotron (Gütig and Sompolinsky, 2006), the conductance-based Tempotron (Gütig and Sompolinsky, 2009), the multi-spike learning rule ReSuMe (Ponulak and Kasiński, 2010) (Taherkhani et al., 2015), the multi-layer spike learning rule SpikeProp (Bohte et al., 2002), and the Multi-spike Tempotron (Gütig, 2016). The more recent studies are aggregate-label learning (Gütig, 2016) and a probability-based multi-layer SNN learning rule, SLAYER (Shrestha and Orchard, 2018).
In our research, we constantly address the question: what are the advantages of SNNs over ANNs? From the viewpoint of neural encoding, we expect to encode a dynamic stimulus into spike patterns, which was shown to be possible (Maass, 1997) (Ghosh-Dastidar and Adeli, 2009). A typical pattern classification task consists of three stages: encoding, feature representation, and classification. The boundaries between these stages are becoming less clear in end-to-end classification neural networks. Even then, a good encoding scheme can significantly ease the workload of the subsequent stages in a classification task; for instance, Mel-Frequency Cepstral Coefficients (MFCC) (Mermelstein, 1976) are still very much in use for automatic speech recognition (ASR). Hence, the design of a spiking dataset should consider how the encoding scheme could help reduce the workload of the SNN in a classification task. This cannot be misconstrued as giving the SNN an unfair advantage, so long as all SNNs are measured using the same benchmark. The human cochlea performs frequency filtering (Tobias, 2012), while human vision performs orientation discrimination (Appelle, 1972); both involve encoding schemes that help us better understand our environment. In our earlier work (Pan et al., 2019), on the simple TIDIGITS dataset (Leonard and Doddington, 1993) that contains only single spoken digits, we used a population threshold coding scheme to encode the dataset into events, which we refer to as Spike-TIDIGITS. Using such an encoding scheme, we go on to show that the dataset becomes linearly separable, i.e., the input can be classified based on spike counts alone. This demonstrates that when information is encoded in both the temporal (spike timing) and spatial (which neuron spikes) domains, the encoding scheme is able to project the inputs to a higher dimension, taking some of the workload off the subsequent feature extraction and classification stages. In the case of Spike-TIDIGITS, the encoded spikes can be directly counted and then classified using a Support Vector Machine (SVM). In this work, we further enhance this neural encoding scheme and apply it to the TIMIT dataset.
The motivation of this paper is two-fold. Firstly, we believe that we need well-designed spike-encoded datasets that represent the state-of-the-art encoding methodology. With these datasets, one can focus research on SNN feature representation and classification tasks. Secondly, the datasets should present a challenge in pattern classification, so that they can become reference benchmarks in future SNN studies.
As speech is the most common mode of human communication, we look into the neural encoding of speech signals in this work. The first question is how best to convert speech signals into spikes. There have been many related works in speech and audio encoding, each of which is optimized for a specific objective, for example, minimum signal reconstruction error (Xiao et al., 2016) (Dennis et al., 2013) (Loiselle et al., 2005). However, none of them is optimized for neuromorphic implementation that considers psycho-acoustics, computational efficiency, and effectiveness for pattern classification. In SNN applications for speech recognition (Xiao et al., 2016) (Darabkh et al., 2018), Mel-Frequency Cepstral Coefficients (MFCC) (Mermelstein, 1976) are commonly used as the spectral representation. Others have tried to use biologically plausible cochlear filter banks, but these are either analog filters that are prone to changes in the external environment (Liu and Delbruck, 2010), or yet to be studied in a spike-driven SNN system (Loiselle et al., 2005).
Considering spectral representation, an important step in neural encoding is to convert the spectral energy in a perceptual frequency band into a spike train. The most common way is to treat the two-dimensional time-frequency spectrogram as a static image, converting each 'pixel' value into a spike latency time within the framing window (Wu et al., 2018a), or into the phase of the sub-threshold membrane potential oscillation (Nadasdy, 2009). Such methods do not represent the spatio-temporal dynamics of the auditory signals in a way that can be directly learned by an SNN (Wu et al., 2018b). Furthermore, these prior studies mostly encode all the frequency components in the frames, and all of the frames, into spike trains, introducing a lot of redundancy and hence unnecessary computational load for the subsequent SNN processing, such as speech recognition. Finally, little research has studied how to reconstruct a neurally encoded speech signal back into an auditory signal for perceptual evaluation. Speech signal reconstruction is a critical task in speech information processing, such as speech synthesis, singing synthesis, and dialogue technology.
To address the needs of neuromorphic computing for speech information processing, we propose three criteria for a biologically plausible auditory encoding (BAE) front-end: 1. biologically plausible spectral features; 2. a sparse and energy-efficient spike neural coding scheme; 3. friendliness to temporal learning algorithms on cognitive tasks.
The fundamental research problem in neural encoding is how to encode dynamic and continuous speech signals into discrete spike patterns. A spike rate code is thought to be unlikely in the auditory system, since much evidence suggests otherwise; for example, bats rely heavily on the precise spike timing of their auditory system to locate sound sources, detecting time differences as short as 5 µs. Latency codes and phase codes are well supported by neuro-biological observations. However, on their own, they cannot provide an invariant representation of the patterns for a classification task. To facilitate the processing of an SNN in a cognitive task, neural temporal encoding should not only consider how to encode the stimulus into spikes, but also how to represent the invariant features. Just like the auditory and visual sensory representations in the human prefrontal cortex, such representations are required of the proposed BAE front-end in an SNN framework, so that it can be implemented as a low-cost neuromorphic solution that effectively reduces the processing workload in the subsequent SNN pipeline. A large number of observations in neuroscience support the view that our auditory sensory neurons encode the input stimulus using threshold-crossing events in a population of sensory neurons (Ehret, 1997) (Hopfield, 2004). Inspired by these observations, a simple version of threshold coding has been proposed (Gütig and Sompolinsky, 2009), in which a population of encoding neurons with a set of uniformly distributed thresholds encodes the spectral energy of different frequency channels into spikes. Such a cross-and-fire mechanism is reminiscent of quantization from the point of view of information coding. Our proposed BAE front-end also incorporates such a neural coding scheme; further investigation is presented in the experiment section.
Besides an effective neural coding representation, an efficient auditory front-end aims to encode acoustic signals into sparse spike patterns while maintaining sufficient perceptual information. To achieve such a goal, our biological auditory system provides a solution best understood as masking effects (Harris and Dallos, 1979) (Shinn-Cunningham, 2008). Masking is a complex and yet to be fully understood psycho-acoustic phenomenon whereby some components of the acoustic events are not perceptible, in both the frequency and time domains (Ambikairajah et al., 1997). From the viewpoint of perceptual coding, these components are regarded as redundancies since they are inaudible. When the masking effects are implemented, those inaudible components are coded with larger quantization noise or not coded at all. Although the mechanism and function of masking are not yet fully understood, its effects have already been successfully exploited in auditory signal compression and coding (Ambikairajah et al., 2001), for efficient information storage, communication, and retrieval. In this paper, we propose a novel idea to apply the auditory masking effects in both the frequency and time domains, which we refer to as simultaneous masking and temporal masking, respectively, in our auditory neural encoding front-end so as to reduce the number of encoding spikes. This improves the sparsity and efficiency of our encoding scheme. Given how we address the three optimization criteria of neural encoding, we refer to it as the biologically plausible auditory encoding scheme, or BAE. Such an auditory encoding front-end also provides an engineering platform to bridge the study of masking effects between psycho-acoustics and speech processing.
Our main contributions in this paper are: 1) we emphasize the importance of spike-based acoustic datasets for SNN research; 2) we propose an integrated auditory neural encoding front-end to further research into SNN-based learning algorithms. With the proposed BAE encoding front-end, speech datasets can be converted into energy-efficient, information-compact, and well-representative spike patterns for subsequent SNN tasks.
The rest of this paper is organized as follows: in Section 2, we discuss the auditory masking effects, and how simultaneous masking in the frequency domain and temporal masking in the time domain are implemented for the neural encoding of acoustic stimuli; the BAE encoding scheme is applied in conjunction with masking to the TIDIGITS and TIMIT datasets. In Section 3, we describe the details of the resulting spike datasets and evaluate them against their original datasets in a recognition task. In Section 4, we discuss our findings, and we conclude in Section 5.

Auditory masking effects
Most speech processing front-ends employ a fixed feature extraction mechanism, such as MFCC, to encode the input signals, whereas the human auditory sensory system ignores some signals while strongly emphasizing others, commonly referred to as the attention mechanism in psycho-acoustics. The auditory masking effects closely emulate this phenomenon (Shinn-Cunningham, 2008).
Auditory masking is a known perceptual property of the human auditory system that occurs whenever the presence of a strong audio signal makes its neighborhood of weaker signals inaudible, in both the frequency and time domains. One of the most notable applications of auditory masking is the MPEG/audio international standard for audio signal compression (Fogg et al., 2007) (Ambikairajah et al., 2001). It compresses the audio data in large part by removing the acoustically irrelevant parts of the audio signal, or by encoding those parts with fewer bits, owing to their greater tolerance to quantization noise (Ambikairajah et al., 1997). To achieve such a goal, the algorithm derives two kinds of masking from the psycho-acoustic model (Lagerstrom, 2001): 1. In the frequency domain, two masking effects are used. Firstly, by allocating the quantization noise to the least sensitive regions of the spectrum, the perceptual distortion caused by quantization is minimized. Secondly, an absolute hearing threshold is exploited, below which the spectral components are entirely removed. 2. In the time domain, the masking effect is applied such that the local peaks of the temporal signals in each frequency band make their ensuing audio signals inaudible.
Motivated by the above signal compression theory, we propose an auditory masking approach to spike neural encoding, which greatly increases the coding efficiency of the spike patterns by eliminating perceptually insignificant spike events. The approach is conceptually consistent with the MPEG-1 Layer III signal compression standard (Fogg et al., 2007), with modifications according to the characteristics of spiking neurons.

Simultaneous masking
The masking effect in the frequency domain is referred to as simultaneous masking. According to the MPEG-1 standards, there are two masking strategies in the frequency domain: the absolute hearing threshold and the frequency maskers. Simultaneous masking effects are common in our daily life. For instance, the audible sound levels of our auditory system vary across frequencies, making us more sensitive to the sounds in our living environment; this is an evolutionary advantage for survival, in both human beings and animals. Besides the absolute hearing threshold, every acoustic event in the spectrum also influences the perception of the neighboring frequency components; that is, tones at different levels contribute to the masking of tones at other frequencies. For instance, in a symphony concert, the sounds from different musical instruments can be fully or partially masked by each other; as a result, we can enjoy compositions of various frequency components with rich diversity. Such a psycho-acoustic phenomenon is called frequency masking.
Figure 1 illustrates the absolute hearing threshold, $T_a$, as a function of frequency in Hz. The function is derived from psycho-acoustic experiments, in which pure tones continuous in the frequency domain are presented to the test subjects and the minimal audible sound pressure levels (SPL) in dB are recorded. The commonly used function to approximate the threshold is (Ambikairajah et al., 1997):

$$T_a(f) = 3.64\left(\frac{f}{1000}\right)^{-0.8} - 6.5\, e^{-0.6\left(\frac{f}{1000}-3.3\right)^2} + 10^{-3}\left(\frac{f}{1000}\right)^4 \;\; \mathrm{(dB\ SPL)}$$

For the frequency maskers, in the MPEG-1 standard, some sample pulses under the masking thresholds may be partially masked and are thus encoded with fewer bits. However, in the event-based scenario, spike patterns carry no amplitude information; like on-off binary values, partial masking can hardly be realized. As such, we modify the approach so that all components under the frequency maskers are fully masked (discarded). Further reconstruction and pattern recognition experiments are necessary to evaluate this approach. Figure 2 shows the overall masking thresholds with both masking strategies in the frequency domain; it illustrates the simultaneous masking thresholds added to the acoustic events in a spectrogram. Sound signals with different spectral power in different cochlear filter channels are subject to different masking thresholds.
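For implementers, the commonly used absolute-threshold approximation from the psycho-acoustic literature cited above can be sketched in a few lines of Python; the function name and vectorized form are our own:

```python
import numpy as np

def absolute_threshold_db(f_hz):
    """Approximate absolute hearing threshold T_a (dB SPL) at frequency f (Hz)."""
    f = np.asarray(f_hz, dtype=float) / 1000.0  # work in kHz
    return (3.64 * f ** -0.8
            - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)
```

The curve is high at low frequencies, dips below 0 dB SPL near the most sensitive region around 3-4 kHz, and rises again at high frequencies.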
Figure 3 provides a real-world example of simultaneous masking. The spectrogram of a speech utterance of "one" from the TIDIGITS dataset is shown in a 3-D plot. The grey surface illustrates the simultaneous masking threshold acting on the spectrogram (colored surface). By this masking strategy, the acoustic events with spectral energy lower than the threshold surface are removed. Section 2.3 will introduce how to convert the masked spectrogram into a sparse and well-represented spike pattern.

Temporal masking
Another auditory masking effect is temporal masking in the time domain. Conceptually similar to the frequency maskers, a louder sound masks the perception of other acoustic components in the time domain. As illustrated in Figure 4, the vertical bars represent the signal intensity of short-time frames, called acoustic events, along the time axis. A local peak (the first red bar) forms a masker that makes the following events inaudible, until the next local peak (the second red bar) exceeds the masker curve. According to psycho-acoustic studies, the temporal masker threshold is modeled as an exponentially decaying curve (Ambikairajah et al., 2001):

$$y(n) = p_1 e^{-cn}$$

where $y(n)$ denotes the masking threshold level on the $n$-th frame following an acoustic event, $c$ is the exponential decay index, and $p_1$ represents the sound level of the local peak at the beginning of the masker. The decay parameter $c$ is tuned according to the hearing quality.
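A minimal sketch of how such a decaying temporal masker could be applied to a sequence of frame levels in one frequency band; the function name, the default decay constant, and the greedy peak-tracking loop are our own illustrative choices:

```python
import numpy as np

def temporal_mask(frame_db, c=0.2):
    """Mark frames inaudible under an exponentially decaying temporal masker.

    frame_db : 1-D array of frame sound levels (assumed positive, in dB)
               for one frequency band.
    c        : decay constant of the masker curve (a tunable assumption).
    Returns a boolean array: True = audible, False = masked.
    """
    audible = np.ones(len(frame_db), dtype=bool)
    peak, n_since = -np.inf, 0
    for i, level in enumerate(frame_db):
        # masker threshold y(n) = p1 * exp(-c * n) since the last local peak
        threshold = peak * np.exp(-c * n_since) if np.isfinite(peak) else -np.inf
        if level >= threshold:          # a new local peak starts a new masker
            peak, n_since = level, 0
        else:                           # below the decaying curve: masked
            audible[i] = False
        n_since += 1
    return audible
```

A loud frame thus suppresses the quieter frames that follow it until either the curve decays below their level or a new peak takes over.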

Auditory masking effects in both domains
By applying both the simultaneous masking and temporal masking illustrated above, we can remove the imperceptible acoustic events (frames) from the overall spectrogram. Since our goal is to apply the masking effects to the precise-timing neural code, we propose the following strategy:

1. The spike pattern $P_{K\times N}(p_{ij})$ is generated from the raw spectrogram $S_{K\times N}(s_{ij})$ without masking effects, by the temporal neural coding methods that will be discussed in Section 2.2. Here, the indices $i, j$ refer to the time-frequency bin in the spectrogram, with $i$ the frequency bin and $j$ the time frame index, and the entry $p_{ij}$ holds the spike(s) encoded for bin $(i, j)$.
2. The masking threshold matrix $M_{K\times N}(m_{ij})$ is computed by combining the simultaneous and temporal masking thresholds described above.
3. A masker map matrix $\Phi_{K\times N}(\phi_{ij})$ is generated, whose dimensions are the same as those of the spectrogram. The element $\phi_{ij}$ is defined as:

$$\phi_{ij} = \begin{cases} 0, & s_{ij} < m_{ij} \\ 1, & \text{otherwise} \end{cases}$$

that is, the time-frequency bin $(i, j)$ is masked with $\phi_{ij} = 0$ when the frame energy $s_{ij}$ is less than the masking threshold $m_{ij}$.
4. The masker map matrix $\Phi_{K\times N}(\phi_{ij})$ is applied to the encoded pattern $P_{K\times N}(p_{ij})$ to generate a masked spike pattern $P^{mask}(p^{mask}_{ij})$:

$$P^{mask} = P \circ \Phi \quad (5)$$

where $\circ$ denotes the Hadamard product. By doing so, the perceptually insignificant spikes are eliminated, forming a more compact and sparse spike pattern.
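The masking strategy above can be sketched directly with NumPy; the function name is ours, and `P`, `S`, and `M` stand for the spike pattern, spectrogram, and combined masking thresholds:

```python
import numpy as np

def apply_masking(P, S, M):
    """Apply a binary masker map to a spike pattern (Hadamard product).

    P : K x N spike pattern derived from the raw spectrogram
    S : K x N spectrogram of frame energies s_ij
    M : K x N combined masking thresholds m_ij
    """
    phi = (S >= M).astype(P.dtype)   # masker map: 1 = audible, 0 = masked
    return P * phi                   # element-wise (Hadamard) product
```

Spikes whose underlying frame energy falls below the masking threshold are zeroed out, leaving a sparser pattern.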
Figure 5 demonstrates the auditory masking effects acting in both the frequency and time domains, on a speech utterance of "one" from the TIDIGITS dataset. The colored surface represents the original spectrogram, while the grey areas represent the spectral energy values that are masked. For the TIDIGITS dataset, nearly half of the acoustic events (frames) are removed by our auditory masking strategy, which is comparable to the 55% removal of PCM pulses in speech coding (Ambikairajah et al., 2001).

Cochlear filters and spike coding
The human auditory system is primarily a frequency analyzer (Tobias, 2012). Many studies have confirmed the existence of perceptual centre frequencies and equivalent bandwidths. To emulate the working of the human cochlea, several artificial cochlear filter banks have been well studied: the GammaTone filter bank (Patterson et al., 1987) (Hohmann, 2002), the Constant Q Transform-based (CQT) filter bank (Brown, 1991) (Brown and Puckette, 1992), the Bark-scale filter bank (Smith and Abel, 1999), etc. They share the same idea of logarithmically distributed centre frequencies and constant Q factors, but differ slightly in the exact parameters. To build the auditory encoding system, we adopt an event-based CQT filter bank in the time domain, following our previous work (Pan et al., 2018).

Time-domain cochlear filter bank
Adopting an event-based approach to emulate the human auditory system, we propose a neuronal implementation of the event-driven cochlear filter bank, of which the computation can be parallelized as follows:
• As illustrated in Figure 6, a speech waveform (a) is filtered by K neurons (b), where each neuron represents one cochlear filter from a particular frequency bin.
• The weights of each neuron in (b) are set as the time-domain impulse response of the corresponding cochlear filter. The computation of a neuron with its input is inherently a time-domain convolution process.
• The output of the filter bank neurons is a K-length vector (c), where K is the number of filters, for each time step. Since the signal (a) shifts sample by sample, the width of the output matrix is the same as the length of the input signal. As such, the auditory signal is decomposed into multiple channels in parallel, forming a spectrogram.
Suppose a speech signal $x$ with $M$ samples, $x = [x_1, x_2, ..., x_M]$, sampled at 16 kHz. For the $k$-th cochlear filter, the impulse response (wavelet) is an $M_k$-length vector $F_k = [f^k_1, f^k_2, ..., f^k_{M_k}]$. We note that the impulse response $F_k$ in principle has an infinite window size; however, its amplitude numerically decreases to small values outside an effective window, thus having little influence on the convolution results. As investigated in (Pan et al., 2018), we empirically set $M_k$ to an optimal value. The $m$-th output of the $k$-th cochlear filter neuron is modeled as:

$$y_k(m) = \sum_{n=1}^{M_k} f^k_n\, \phi_m(n)$$

where $\phi_m$ is the subset of the input samples within the $m$-th window, whose length is the same as that of the $M_k$-length wavelet, indicated as the samples between the two arrows in Figure 6 (a) and (b). The window $\phi_m$ moves sample by sample, naturally along with the flow of the input signal samples. At each time step, a vector of length $K$, where $K$ is the number of filters, is generated as shown in (c). After $M$ such samples, the final output time-frequency map of the filter bank is a $K \times M$ matrix $Y_{K\times M}$.
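A simple NumPy sketch of the filter-bank convolution described above; the windowed sine wavelets in the usage note and the `same`-mode edge padding are our own implementation assumptions:

```python
import numpy as np

def cochlear_filter_bank(x, wavelets):
    """Event-driven time-domain filter bank as a bank of convolution 'neurons'.

    x        : 1-D speech signal of M samples
    wavelets : list of K impulse responses F_k (truncated to effective windows)
    Returns a K x M time-frequency map (zero-padded at the signal edges).
    """
    M = len(x)
    Y = np.zeros((len(wavelets), M))
    for k, F_k in enumerate(wavelets):
        # 'same' mode yields one output sample per input sample, so the
        # output width matches the input length, as described in the text
        Y[k] = np.convolve(x, F_k, mode="same")
    return Y
```

As a sanity check, a 500 Hz tone should respond much more strongly in a channel whose wavelet is centred at 500 Hz than in a 3 kHz channel.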
After time-domain cochlear filtering, the $K \times M$ time-frequency map $Y_{K\times M}$ should be framed, which emulates the signal processing of the hair cells in the auditory pathway. For the output waveform of each channel, we apply a framing window of length $l$ (samples) with a step size of $l/2$ and calculate the logarithmic frame energy $e$ of one framing window:

$$e = \log \sum_{q=1}^{l} x_q^2 \quad (9)$$

where $x_q$ denotes the samples within the $l$-length window and $e$ is the spectral energy of one frame, hence obtaining the time-frequency spectrum $S_{K\times N}(s_{ij})$ as indicated in Section 2.1, which will be further encoded into spikes.

Neural spike encoding
In the inner ear, the motion of the stereocilia in the inner hair cells is converted into a chemical signal that excites adjacent nerve fibers, generating neural impulses that are then transmitted along the auditory pathway. Similarly, we would like to convert the sub-band framing energy into electrical impulses, or so-called spikes, for the purpose of information encoding and transmission. In prior work, temporal dynamic sequences have been encoded using several different methods: latency coding (Wu et al., 2018a), phase coding (Arnal and Giraud, 2012) (Giraud and Poeppel, 2012), and latency population coding (Dean et al., 2005), each adopted for specific applications. These encoding schemes are not optimized for SNN computation.
We propose a biologically plausible neural encoding scheme that takes into account the three criteria defined in Section 1. In this section, the particular neural temporal coding scheme, which converts perceptual spectral power into precise spike times, is designed to meet the needs of synaptic learning rules in SNNs (Gütig and Sompolinsky, 2006) (Ponulak and Kasiński, 2010). As such, the resulting temporal spike patterns are intended to be friendly towards temporal learning rules.
In our previous work (Pan et al., 2019), the two mainstream families of neural coding schemes, the single-neuron temporal codes (latency coding, phase coding) and the population codes (population latency/phase coding, threshold coding), were compared. It was found that threshold coding outperforms the other coding schemes in SNN-based pattern recognition tasks. Below are some observations made whilst comparing threshold coding and single-neuron temporal coding.
Pan Zihan et al.

An efficient auditory neural encoding
First of all, single-neuron temporal coding schemes, such as latency or phase coding, encode the spectral power using spike delay time or phase-locking time. Suppose a frame of normalized spectral power is $e$; the $n$-th latency spike timing $t^f_n$ is defined as:

$$t^f_n = (1 - e)\, T$$

relative to the onset of the $n$-th encoding window, where $T$ denotes the time duration of the encoding window. For phase coding, $t^f_n$ is phase-locked to the nearest peak of the sub-threshold membrane oscillation. The spectral power $e$, which carries the amplitude information, is thus represented as the relative spike timing $(1 - e)\,T$ within each window, and the windows are indexed by the spike order $n$. Unfortunately, an SNN can hardly decode such a scheme without knowledge of the encoding window boundaries, which is only implicitly provided by the spike order $n$ and window length $T$; the spatio-temporal spike patterns cannot provide such knowledge explicitly to the SNN. On the other hand, in a population code such as threshold coding, the multiple encoding neurons naturally represent the amplitudes of the spectral power frames, and we only need to represent the temporal information in the spike timing. For example, the spike timing of the $n$-th onset encoding neuron of the threshold code is:

$$t^f_n = t_{crossing}(\theta_n)$$

where $t_{crossing}$ records the time when the spectral tuning curve of one sub-band crosses the onset threshold $\theta_n$ of the $n$-th encoding neuron. In this way, both the temporal and the amplitude information is encoded and made known to the SNN, which meets the third criterion mentioned above. Secondly, coding efficiency, which refers to the average encoding spike rate (number of spikes per second), was also studied in (Pan et al., 2019). The threshold code has the lowest average spike rate among all investigated neural codes; as it encodes only threshold-crossing events, it is the most efficient coding method.
Thirdly, the threshold code promises to be more robust against noise, such as spike jitter.As it encodes the trajectory of the dynamics of the sub-band spectral power, the perturbation of precise spike timing will have less impact on the sequence of encoding neurons.
As such, the threshold code is a promising encoding scheme for temporal sequence recognition tasks (Pan et al., 2019); further evaluation is provided later in the experiments. While we note that each neural coding scheme has its own advantages, in this paper we focus on how the encoding scheme may help subsequent SNN learning algorithms in a cognitive task. We therefore adopt the threshold code for all experiments in this paper.
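As a concrete illustration, onset threshold-crossing encoding of one sub-band energy trajectory can be sketched as follows; the function name and the onset-only simplification are our own (the full scheme may also employ offset neurons):

```python
import numpy as np

def threshold_encode(energy, thresholds):
    """Encode a sub-band energy trajectory with onset threshold-crossing neurons.

    energy     : 1-D array, frame energies of one frequency channel over time
    thresholds : ascending thresholds theta_n, one per encoding neuron
    Returns, per neuron, the frame indices of upward crossings (spike times).
    """
    spikes = []
    for theta in thresholds:
        above = energy >= theta
        # an onset spike fires where the trajectory crosses theta from below
        onsets = np.flatnonzero(above[1:] & ~above[:-1]) + 1
        if above[0]:
            onsets = np.insert(onsets, 0, 0)
        spikes.append(onsets.tolist())
    return spikes
```

Each neuron fires only at crossings of its own threshold, so the amplitude is carried by which neurons fire and the dynamics by when they fire, which is why this code stays sparse.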

Biologically plausible auditory encoding (BAE) with masking effects
We propose a BAE front-end with masking effects as illustrated in Figure 8.
Firstly, the auditory stimuli are sensed and amplified by the microphone and peripheral circuits, leading to a digital signal (a). This process corresponds to the pathway of the pinna, external auditory meatus, tympanic membrane, and auditory tube. The physically sensed stimuli are then filtered by the cochlear filter bank (b), which emulates the cochlear function of frequency analysis. The outputs of the cochlear filter bank are parallel streams of time-domain sub-band (or so-called critical band) signals with psycho-acoustic centre frequencies and bandwidths. For the purpose of further neural coding and cognitive tasks, the sub-band signals are framed into logarithm-scale energies as per Equation 9. The output of (c), the raw spectrogram, is then converted into a precise spike pattern. The spectrogram is also used to calculate the simultaneous and temporal masking levels, as in (d) and (e), under which spikes are omitted. Finally, a sparse, perceptually relevant, and learnable temporal spike pattern for a learning SNN is generated, as shown in (g). Figure 9 gives an example of the intermediate results at different stages of Figure 8 for a speech waveform. Figures 9(a) and (b) show the raw waveform and the spectrogram of a speech utterance "three" spoken by a male speaker. The spectrogram is further encoded into a raw spike pattern by threshold neural coding. Figure 9(d) is the mask as formulated in Section 2.1, according to which the raw spike pattern in 9(c) is masked, resulting in the masked spike pattern (e). According to Table 4, 50.48% of all spikes are discarded.

Spike-TIDIGITS and Spike-TIMIT databases
TIDIGITS (Leonard and Doddington, 1993) (LDC Catalog No. LDC93S10) is a speech corpus of spoken digits for speaker-independent speech recognition (Cooke et al., 2001) (Tamazin et al., 2019). The speakers are of different genders (male and female), age ranges (adults and children), and dialect districts (Boston, Richmond, Lubbock, etc.). As such, the corpus provides sufficient speaker diversity and has become one of the common benchmarking datasets. TIDIGITS has a vocabulary of 11 spoken digit words. The original database contains both isolated digits and digit sequences; in this work, we only use the isolated digits, where each utterance contains one individual spoken digit. In this first attempt, we would like to build a spike-version speech dataset that contains sufficient diversity and can be immediately used to train an SNN classifier (Wu et al., 2018a) (Pan et al., 2018). As each digit is repeated 224 and 226 times, Spike-TIDIGITS has 224 × 11 = 2464 and 226 × 11 = 2486 isolated digit utterances for the training and testing sets, respectively.
The BAE encoder proposed in Section 2.3 and Figure 8 is applied as the standard encoding scheme to generate this spike dataset. Table 1 and Table 2 describe the parameters used in the encoding process of Spike-TIDIGITS.
Next, we encode one of the most popular speech datasets, TIMIT (Garofolo, 1993), into a spike version, Spike-TIMIT. The TIMIT dataset contains richer acoustic-phonetic content than TIDIGITS (Messaoud and Hamida, 2011). It consists of continuous speech utterances, which are useful for the evaluation of speech coding schemes (Besacier et al., 2000), speech enhancement (El-Solh et al., 2007), and automatic speech recognition systems (Mohamed et al., 2011) (Graves et al., 2013). Similar to TIDIGITS, the speakers of the TIMIT corpus are from 8 different dialect regions in the United States, 438 males and 192 females. There are 4621 and 1679 speech sequences in the training and testing sets, respectively. This corpus has a vocabulary of 6224 words, which is larger than that of TIDIGITS.
Our proposed BAE scheme is next evaluated in the following sections, using both reconstruction and speech pattern recognition experiments.

Audio reconstruction from masked patterns
According to Equation 5, we adopt the binary auditory mask $\Phi_{K\times N}(\phi_{ij})$, which either fully encodes or ignores an acoustic event. It is suggested in auditory theory (Ambikairajah et al., 1997) that partial masking may exist in the frequency domain, especially in the presence of rich frequency tones. We would like to evaluate the masking effect in the BAE front-end both objectively and subjectively.
We begin by reconstructing the spike trains into speech signals, and then evaluate the speech quality using several objective speech quality measures: Perceptual Evaluation of Speech Quality (PESQ), Root Mean Square Error (RMSE), and Signal-to-Distortion Ratio (SDR). PESQ, defined in (Beerends et al., 2002) (Rix et al., 2002), is standardized as ITU-T Recommendation P.862 for speech quality test methodology (Recommendation, 2001). The core principle of PESQ is the use of a human auditory perception model (Rix et al., 2001) for speech quality assessment. For speech coding, and in particular the perceptual masking proposed in this paper, the PESQ measure can correctly distinguish between audible and inaudible distortions and thus assess the impact of perceptually masked coding noise. PESQ is also used in the assessment of MPEG audio coding, where auditory masking is involved. In this paper, the PESQ scores are further converted to MOS-LQO (Mean Opinion Score-Listening Quality Objective) scores ranging from 1 to 5, which are more intuitive for assessing speech quality. The mapping function is obtained from ITU-T Recommendation P.862.1 (ITU-T, 2003). Table 3 defines the MOS scales and their corresponding subjective speech quality descriptions (Rec, 1996).
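The PESQ-to-MOS-LQO conversion is a fixed logistic mapping. A minimal sketch, assuming the published P.862.1 mapping function (the constants below are from that recommendation, not from this paper):

```python
import math

def pesq_to_mos_lqo(pesq_score: float) -> float:
    """Map a raw PESQ score to MOS-LQO per ITU-T Rec. P.862.1.

    The logistic function compresses raw PESQ (roughly -0.5 to 4.5)
    into the 1-to-5 MOS-LQO range used in Table 3.
    """
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * pesq_score + 4.6607))
```

A raw PESQ of 4.5 maps to a MOS-LQO of about 4.55, i.e., between "Good" and "Excellent" on the MOS scale.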
Besides PESQ, the RMSE (equation 12) and SDR (equation 13) measures are also reported, where x_i and x̂_i denote the i-th time-domain samples of the original and reconstructed speech signals x_{1×M} and x̂_{1×M}, respectively.
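Equations 12 and 13 are not reproduced in this excerpt; the sketch below assumes the standard definitions — RMSE as the root of the mean squared sample error, and SDR as the ratio (in dB) of signal energy to residual energy:

```python
import numpy as np

def rmse(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Root mean square error between original and reconstructed samples."""
    return float(np.sqrt(np.mean((x - x_hat) ** 2)))

def sdr(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Signal-to-distortion ratio in dB: signal energy over residual energy."""
    return float(10.0 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2)))
```

Lower RMSE and higher SDR both indicate a reconstruction closer to the original waveform.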
We compare three groups of reconstructed speech signals: (1) the signal ŝ_mask reconstructed from spike trains with auditory masking; (2) the signal ŝ_raw reconstructed from raw spike trains without auditory masking; and (3) the signal ŝ_random reconstructed from randomly masked spike trains. The speech quality of the three groups of reconstructed signals is measured, as reported in Tables 4 and 5. For a fair comparison, the random masking drops the same number of spikes as the auditory masking. The raw spike patterns without any masking serve as a reference. The perceptual quality scores of ŝ_mask and ŝ_raw are rather close, at a high level of around 4.5, which indicates a satisfying subjective quality between "Excellent" and "Good" according to Table 3. By contrast, the speech signals with random masking are perceived as "Poor" in quality. Besides PESQ, the other two measures lead to the same conclusion: the RMSEs of ŝ_raw and ŝ_mask are approximately two orders of magnitude smaller than that of ŝ_random, and the SDRs also show a large gap.
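The random-masking control drops the same number of spikes as auditory masking, but at uniformly random positions. A minimal sketch of this control condition (the function name and the flat-index representation are illustrative assumptions):

```python
import numpy as np

def random_mask(spike_pattern: np.ndarray, n_drop: int, seed: int = 0) -> np.ndarray:
    """Drop n_drop spikes uniformly at random, matching the spike count
    removed by auditory masking, to produce the s_random control."""
    rng = np.random.default_rng(seed)
    masked = spike_pattern.copy()
    spike_idx = np.flatnonzero(masked)                  # flat positions holding a spike
    drop = rng.choice(spike_idx, size=n_drop, replace=False)
    masked.flat[drop] = 0                               # zero out the chosen spikes
    return masked
```

Because the drop count is matched, any quality difference between ŝ_mask and ŝ_random is attributable to *where* spikes are removed, not *how many*.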

Speech recognition by SNN for TIDIGITS dataset
In this section, we evaluate the BAE scheme in an SNN-based pattern recognition task, which also assesses the coding fidelity of our proposed methodology. The spike patterns encoded from the TIDIGITS speech dataset are fed into an SNN, and the outputs correspond to the labels of the spoken digits from which the patterns are encoded. The synaptic efficacy update rule is MPD-AL, an efficient membrane-potential-driven aggregate-label learning algorithm for leaky integrate-and-fire spiking neurons (Malu et al., 2019). The network structure is given in Table 6.
To evaluate the effectiveness of the BAE front-end, we compare the classification performance of spike patterns with and without auditory masking. Gaussian noise, measured by the Signal-to-Noise Ratio (SNR) in dB, is added to the original speech waveforms before the encoding process. Table 7 shows the classification accuracies under noisy conditions and in the clean condition.
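Adding Gaussian noise at a target SNR amounts to scaling white noise so that the signal-to-noise power ratio matches the desired dB value. A minimal sketch of this pre-encoding corruption step (the function name is an illustrative assumption):

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, snr_db: float, seed: int = 0) -> np.ndarray:
    """Add white Gaussian noise to a waveform at a target SNR in dB,
    as done before encoding in the noisy-condition experiments."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(speech.shape)
    p_signal = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that p_signal / p_scaled_noise = 10^(snr_db / 10)
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```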
The results show that the classification accuracies of the masked patterns are slightly higher than those of the original patterns, under all test conditions. Moreover, referring to Table 4, our proposed BAE scheme removes nearly half of the spikes, a dramatic improvement in coding efficiency.

Large vocabulary speech recognition for TIMIT dataset
In Section 2.3, we presented how the TIMIT dataset is encoded into spike trains, which we henceforth refer to as Spike-TIMIT. We next train a recurrent neural network, the LSTM (Hochreiter and Schmidhuber, 1997), on both the original TIMIT and Spike-TIMIT datasets, with the CTC loss function (Graves et al., 2006). For the validation datasets, the normalized Levenshtein distance (normalized by the label length), or label error rate (LER), is reported (Graves et al., 2006). We obtained LERs of 0.27 and 0.28 for the TIMIT and Spike-TIMIT datasets, respectively. The network architecture of the LSTM used for both datasets is given in Table 8. The LSTM networks are adapted from Tensorpack (Zhou et al., 2016). We notice some improvement in accuracy when dropout is introduced for Spike-TIMIT but not for TIMIT. We further note that the Spike-TIMIT system involves many more input neurons than the TIMIT system (620 vs 39). However, because the TIMIT system employs more LSTM neurons, the Spike-TIMIT system has far fewer parameters than the TIMIT system (4.5M vs 13M). This is also a desired outcome of the BAE front-end: more neurons are used for neural encoding so that far fewer neurons and parameters are needed in the feature representation and classification pipeline, leading to an overall saving in the number of neurons and parameters.
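The LER above is the Levenshtein (edit) distance between the predicted and reference label sequences, normalized by the reference length. A minimal sketch of this standard metric:

```python
def levenshtein(a, b):
    """Edit distance between two label sequences (insert/delete/substitute),
    computed with a single rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def label_error_rate(ref, hyp):
    """LER: Levenshtein distance normalized by the reference length."""
    return levenshtein(ref, hyp) / len(ref)
```

For example, a hypothesis with one substituted label out of four reference labels yields an LER of 0.25.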

DISCUSSION
In this paper, we propose a biologically plausible auditory encoding (BAE) scheme, designed especially for speech signals. The encoding scheme is inspired by the modeling of the human auditory sensory system, and is composed of spectral analysis, neural spike coding, and a psychoacoustic perception model. We adopt three criteria for formulating the auditory encoding scheme.
For the spectral analysis part, a time-domain event-based cochlear filter bank is applied, with perceptually scaled centre frequencies and bandwidths. The key feature of the spectral analysis is the parallel implementation of time-domain convolution. One of the most important properties of SNNs is their asynchronous processing, and the parallel implementation makes the neural encoding scheme a friendly front-end for any SNN processing. The neural encoding scheme, the threshold code in our case, generates sparse and representative spike patterns for efficient computing in the SNN classifier. The threshold code helps in two aspects: firstly, it tracks the trajectories of the spectral power tuning curves, which represent the features of the acoustic dynamics; secondly, the threshold code, as a form of population neural code, projects the dynamics in the time domain onto the spatial domain, which facilitates the parallel processing of spiking neurons on cognitive tasks (Pan et al., 2019). Another key component of the BAE front-end is the implementation of auditory masking, which benefits from findings in human psychoacoustic experiments. The integrated auditory encoding scheme fulfills the three proposed design criteria. We have evaluated our BAE scheme through signal reconstruction and speech recognition experiments, with very promising results. To share our study with the research community, the spike versions of the TIDIGITS and TIMIT speech corpora, namely Spike-TIDIGITS and Spike-TIMIT, will be made available as benchmarking datasets.
Figure 11 illustrates some interesting findings regarding our proposed auditory masking strategy. The upper, middle, and lower panels of Figure 11 represent three speech utterances from the TIDIGITS dataset. The first and second columns illustrate the encoded spike patterns with and without auditory masking effects; it is apparent that a large number of spikes are removed. The graphs in the third column show the membrane potential of the output neuron in the trained SNN classifier after being fed with both patterns during the testing phase. For example, the LIF neuron in (c) responds to the speech utterance "six"; the encoded patterns of spoken "six", as in (a) and (b), trigger the corresponding neuron to fire a spike in the testing phase. Sub-figure (c) shows that although the sub-threshold membrane potentials of the masked/unmasked patterns follow different trajectories, the two membrane potential curves exceed the firing threshold (which is 1 in this example) at close timings. Similar results are observed in (f) and (i). The spike patterns with or without auditory masking lead to similar neuronal responses, both in spiking activities (firing or not) and in membrane potential dynamics, as observed in (c), (f), and (i). It is interesting to observe that auditory masking has little impact on the neuronal dynamics. As a psychoacoustic phenomenon, auditory masking has always been studied using listening tests, and it remains unclear how the human auditory system responds to it. Figure 11 provides an answer to this question from an SNN perspective.
The parameters of the auditory masking effects in this work, such as the exponential decay parameter c in equation 2 or the cross-channel simultaneous masking thresholds in Figure 2, are all derived from the acoustic model of the MPEG-1 Layer III standard (Fogg et al., 2007) and tuned for the particular tasks. However, from a neuroscience point of view, our brain adapts to different environments. This suggests that the parameters could be optimized by machine learning methods, for different tasks and datasets. Also, the threshold neural code, which encodes the dynamics of the spectrum using threshold-crossing events, relies heavily on the choice of thresholds. We use 15 uniformly distributed thresholds for simplicity. We note that the recording of threshold-crossing events is analogous to quantization in digital coding, so the maximal coding efficiency (the maximal information conveyed under a constraint on the number of neurons or spikes) may be derived using an information-theoretic approach. The Efficient Coding Hypothesis (ECH) (Barlow et al., 1961) (Srinivasan et al., 1982), which links neural encoding and information theory, could provide the theoretical framework to determine the optimal threshold distribution in the neural threshold code. Alternatively, the distribution may be learned using machine learning techniques.
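The threshold (level-crossing) code discussed above can be sketched as follows, with 15 uniformly distributed levels per channel as in the paper. The event format and the exact level placement here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def threshold_encode(power: np.ndarray, n_thresholds: int = 15):
    """Threshold code for one frequency channel: one neuron per level
    emits an event whenever the power trajectory crosses its level.

    power: 1-D spectral power trajectory of a single channel.
    Returns a list of (threshold_index, frame_index) crossing events.
    """
    # 15 levels uniformly spaced strictly inside the signal's dynamic range
    levels = np.linspace(power.min(), power.max(), n_thresholds + 2)[1:-1]
    events = []
    for k, level in enumerate(levels):
        above = power >= level
        # A crossing occurs wherever consecutive frames disagree about `above`
        crossings = np.flatnonzero(above[1:] != above[:-1]) + 1
        events.extend((k, int(t)) for t in crossings)
    return events
```

This is the sense in which the code "tracks the trajectory" of the power tuning curve: each event marks the frame at which the curve passes a level, and the population of threshold neurons jointly quantizes the trajectory.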

CONCLUSION
Our proposed BAE scheme, motivated by the human auditory sensory system, encodes temporal speech data into spike patterns that are sparse, efficient, and friendly to SNN learning rules. It is both efficient and effective. We use the BAE scheme to encode the popular speech datasets TIDIGITS and TIMIT into their spike versions, Spike-TIDIGITS and Spike-TIMIT. The two spike datasets will be published as benchmarking datasets, in the hope of advancing SNN-based classifiers.
Figure 3. The overall simultaneous masking effects on a speech utterance of "one", in a 3-D spectrogram. Combining the two kinds of masking effects in the frequency domain (refer to Figure 1 and Figure 2), the grey surface shows the overall masking thresholds on the speech utterance (the colorful surface). All the spectral energy under the thresholds is imperceptible.

Figure 4. An illustration of temporal masking: each bar represents an acoustic event received by the auditory system. In this paper, acoustic events generally refer to the framed spectral power values, which are the elements parsed by the auditory neural encoding scheme. A local peak event (red bar) forms a masking shadow represented by an exponentially decaying curve. Subsequent events that are weaker than the leading local peak are inaudible until another local peak event exceeds the masker curve.
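The masker curve described above can be sketched as a running, exponentially decaying threshold. The decay factor `c` stands in for the parameter of equation 2, whose exact value is tuned per task in the paper, so this is a sketch under that assumption:

```python
import numpy as np

def temporal_masking_threshold(frame_power: np.ndarray, c: float = 0.8) -> np.ndarray:
    """Temporal masking sketch: a local peak casts an exponentially
    decaying masking shadow over subsequent frames (0 < c < 1)."""
    threshold = np.zeros_like(frame_power)
    masker = 0.0
    for n, p in enumerate(frame_power):
        masker = max(p, masker * c)   # a new local peak resets the masker curve
        threshold[n] = masker
    return threshold
```

Frames whose power falls below the returned threshold are temporally masked, like the weak events trailing the red bar in Figure 4.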

Figure 10 depicts the flowchart of the reconstruction process. The left and right panels represent the spike encoding and decoding processes, respectively. The raw speech signals are first decomposed by a series of cochlear analysis filters, generating parallel streams of sub-band signals as in Figure 8(b). The 20 sub-band waveforms are encoded into spike times with masking strategies and then decoded back into sub-band speech signals. The reproduced sub-band waveforms 1 to K (20 in this work) are gain-weighted and summed to form the reconstructed speech signal for perceptual quality evaluation. Since the cochlear filters decompose the input signal with various weighting gains in different frequency bands, the weighting gains in the decoding part represent the inverse processing of the cochlear filters.
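The final gain-weighted summation can be sketched in one line. The gain values themselves depend on the cochlear filter bank design and are not given in this excerpt, so the function below is an illustrative assumption about the combination step only:

```python
import numpy as np

def reconstruct(sub_bands: np.ndarray, gains: np.ndarray) -> np.ndarray:
    """Gain-weighted summation of the K decoded sub-band waveforms
    (K = 20 in this work) into one reconstructed signal.  The gains
    invert the per-band weighting of the analysis cochlear filters."""
    # sub_bands: shape (K, n_samples); gains: shape (K,)
    return (gains[:, None] * sub_bands).sum(axis=0)
```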

Figure 2. The frequency masking thresholds acting on a maskee (the acoustic event being masked), generated by the acoustic events from the neighboring critical bands, shown as a surface in a 3-D plot. The acoustic events refer to the spectral power of the frames in a spectrogram. The spectral energy axis is the sound level of a maskee; the critical band axis indexes the frequency bins of the cochlear filter bank, as introduced in Section 3.1; the masking threshold axis indicates the overall masking levels on maskees of different sound levels from various critical bands. For example, an acoustic event at a 20 dB level on the 10th critical band is masked off by a masking threshold of nearly 23 dB, which is introduced by the other auditory components in its neighboring frequency channels.

Figure 5. Both the simultaneous and temporal masking effects acting on the 3-D spectrogram of a speech utterance of "one". The grey-shaded parts of the spectrogram are masked.

Figure 9. An illustration of the intermediate results of a BAE process. The raw speech signal (a) of a speech utterance "three" is filtered and framed into a spectrogram (b), corresponding to the processes in Figure 8(b) and (c). By applying the neural threshold code, a precise spike pattern (c) is generated from the spectrogram. The masker map described in Equation (4) is illustrated in (d), where yellow and dark blue blocks represent the values 1 and 0, respectively. The masker (d) is applied to the spike pattern (c), and the auditory-masked spike pattern is obtained in (e).

Figure 10. The reconstruction of a speech signal from a spike pattern. Parallel streams of threshold-encoded spike trains that represent the dynamics of multiple frequency channels are first decoded into sub-band digital signals. The sub-band signals are then fed into a series of synthesis filters, which are built inversely from the corresponding analysis cochlear filters as in Figure 6. The synthesis filters compensate for the gains of the analysis filters in each frequency bin. Finally, the outputs of the synthesis filter banks are summed to generate the reconstructed speech signal.

Figure 11. Free membrane potentials of trained leaky integrate-and-fire neurons, fed with patterns with and without masking. The upper, middle, and lower panels are for three different speech utterances, "six", "seven", and "eight". The spike patterns with and without masking are apparently different, but the output neuron follows similar membrane potential trajectories.
$$
p_{ij} =
\begin{cases}
t_f, & \text{if a spike is emitted within the duration of the time-frequency bin } (i, j) \\
0, & \text{otherwise}
\end{cases}
\tag{3}
$$

where $t_f$ is the encoded precise spike timing. As such, the spike pattern $P_{K \times N}(p_{ij})$ is a sparse matrix that records the spike timings.

2. According to the spectrogram $S_{K \times N}(s_{ij})$ and the auditory perceptual model, the simultaneous masking level matrix $M^{\mathrm{simultaneous}}(m^{\mathrm{simultaneous}}_{ij})$ and the temporal masking level matrix $M^{\mathrm{temporal}}(m^{\mathrm{temporal}}_{ij})$ are computed. The overall masking level matrix $M_{K \times N}(m_{ij})$ is defined as follows. It provides a 2-D masking threshold surface that has the same dimensions as the spectrogram.
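The derivation of the binary mask Φ from the two masking matrices can be sketched as follows. The paper's exact rule for combining the simultaneous and temporal masking levels is not reproduced in this excerpt; the element-wise maximum used below is one common choice and is an assumption:

```python
import numpy as np

def binary_mask(spectrogram: np.ndarray,
                m_simultaneous: np.ndarray,
                m_temporal: np.ndarray) -> np.ndarray:
    """Combine the two masking matrices into an overall threshold surface M,
    then derive a binary auditory mask: 1 keeps an acoustic event whose
    spectral power exceeds the threshold, 0 masks it off."""
    m_overall = np.maximum(m_simultaneous, m_temporal)  # assumed combination rule
    return (spectrogram > m_overall).astype(int)
```

Applying Φ element-wise to the spike pattern P then yields the masked pattern, as illustrated in Figure 9(d) and (e).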

Table 1 .
Parameters of neural threshold encoding for TIDIGITS.

Table 2 .
Cochlear filter parameters: we use a total of 20 cochlear filters in the BAE front-end.The centre frequency and bandwidth of each filter are listed.

Table 3 .
MOS scales and their corresponding speech quality subjective assessments

Table 4 .
The objective speech quality measures of the reconstructed speech signals for the spoken digits TIDIGITS dataset. The reduction rates refer to the ratio of masked spikes.

Table 5 .
The objective speech quality measures of the reconstructed speech signals for the continuous, large vocabulary speech dataset TIMIT. The reduction rates refer to the ratio of masked spikes.