Event-driven Spectrotemporal Feature Extraction and Classification using a Silicon Cochlea Model

This paper presents a reconfigurable digital implementation of an event-based binaural cochlear system on a Field Programmable Gate Array (FPGA). It consists of a pair of Cascade of Asymmetric Resonators with Fast Acting Compression (CAR-FAC) cochlea models and leaky integrate-and-fire (LIF) neurons. Additionally, we propose an event-driven SpectroTemporal Receptive Field (STRF) Feature Extraction using Adaptive Selection Thresholds (FEAST). It is tested on the TIDIGITS benchmark and compared with current event-based auditory signal processing approaches and neural networks.


INTRODUCTION
In the human auditory pathway, information is extracted and conveyed through sequences of action potentials, or spikes. The spike streams form robust representations that are important for perception. The human sensory system achieves real-time, low-power, and noise-robust performance while operating in such an asynchronous "event"-based way. To mimic the efficiency of signal processing in the human auditory system, biologically inspired auditory sensors and algorithms have been implemented and investigated. For example, (Liu, van Schaik, Minch, & Delbruck, 2010, 2014) developed a 2×64×4 channel dynamic audio sensor that used an analogue cascade filter bank and pulse-frequency modulated circuits to emulate the peripheral auditory system and auditory nerve to generate spike streams; (Yang, Chien, Delbruck, & Liu, 2016) used a synchronised delta modulator to generate audio events; and a digital multi-rate cochlea model has been developed on FPGA, in which digital leaky integrate-and-fire (LIF) neurons with different thresholds were used to model the auditory neurons of the human auditory system.
Such neuromorphic auditory sensors encode acoustic information into spikes in real time at a low data rate, which makes them an ideal solution for real-world applications. In recent decades, efforts have been made to investigate neuromorphic sensing approaches to extract acoustic features from auditory spikes. For example, it has been argued that statistical features embedded in spike streams could be the mechanism for the precise encoding of auditory cues that are important for recognition (Gerstner & Kistler, 2002). Therefore, rate-code based features (Neil & Liu, 2016), inter-spike interval distributions (Li, Delbruck, & Liu, 2012; Uysal, Sathyendra, & Harris, 2006), inter-spike velocity (Chakrabartty & Liu, 2010), and exponential features (Anumula, Neil, Delbruck, & Liu, 2018) have all been investigated in speaker identification and speech recognition tasks. (Rasetto, Dominguez-Morales, Jimenez-Fernandez, & Benosman, 2021) proposed a feature extraction approach to extract spectrotemporal features from a cochlea model built with "event"-based filters for a command recognition task.
In addition to neuromorphic auditory data processing, event-driven feature extraction algorithms have been more widely investigated in neuromorphic vision systems. With the increase in the adoption of neuromorphic vision sensors, various dense tensor representations for the sparse asynchronous event data have been proposed and investigated to learn the spatiotemporal features (Afshar, Nicholson, van Schaik, & Cohen, 2020; Baldwin, Liu, Almatrafi, Asari, & Hirakawa, 2022; Maqueda, Loquercio, Gallego, Garcia, & Scaramuzza, 2018).
In (Afshar, Nicholson, et al., 2020; Cohen et al., 2019), the event-based time surface representations for event-based vision data have been used in extracting features for a range of tasks, such as object recognition on unmanned aerial vehicles (UAVs) (Zappa et al., 2020) and single photon avalanche diode (SPAD) sensor data processing (Afshar, Hamilton, Davis, van Schaik, & Delic, 2020).
In (Afshar, Ralph, et al., 2020), Feature Extraction using Adaptive Selection Thresholds (FEAST) was proposed for event-based vision data using the time surfaces representation. The FEAST method has been investigated for a range of applications such as object tracking (Ralph et al., 2022), event-based supervised learning (Bethi, Xu, Cohen, van Schaik, & Afshar, 2022) and inspired activity-driven adaptation in spiking neural networks (SNNs) (Haessig et al., 2020).
To investigate spectrotemporal representations for event-based auditory data, in (Xu, 2019), the FEAST method was applied to audio to extract spectrotemporal features for an isolated spoken digit recognition task and showed improved performance. In this work, we extend that work and propose to use FEAST to build a computational auditory cortical model, the Spectrotemporal Receptive Field (STRF) model. The proposed event-driven STRF approach is applied to the binaural cochlear system for a multi-resolution spectrotemporal analysis.

THE EVENT-BASED BINAURAL CAR-FAC SYSTEM ON FPGA
In previous work, we implemented a digital cochlea model, the Cascade of Asymmetric Resonators with Fast Acting Compression (CAR-FAC) cochlea model (Lyon, 2017), on a Field Programmable Gate Array (FPGA) for sound localisation (Xu et al., 2021). This model approximates the physiological elements that make up the human cochlea, including the basilar membrane (BM), the outer hair cells (OHCs) and the inner hair cells (IHCs), as shown in Figure 1, and mimics its qualitative behaviour. The digital cochlea is reconfigurable in filter parameters and channel numbers. This work extends the cochlea model to an event-based binaural cochlear system. It includes a CAR-FAC cochlea pair and LIF neurons to generate auditory spike streams.
The architecture of the event-based binaural cochlear system is shown in Figure 2. Each "ear" in the system implements the components of the CAR, the digital OHC (DOHC), the digital IHC (DIHC), the automatic gain control (AGC), the lateral inhibition (LI), and the LIF neuron. One "ear" can be switched off so that the system operates as a single CAR-FAC model. The FAC part that introduces nonlinearities can also be switched off so that the system operates as a linear CAR model. The details of the CAR-FAC module were described in (Xu, Thakur, Singh, Wang, & van Schaik, 2016), (Xu, Thakur, et al., 2018), and . The LIF neuron here is implemented using: .
The generated spike streams encode the amplitude of each channel response that is used in the following feature extraction.

Unsupervised Feature Extraction Using Adaptive Selection Thresholds (FEAST)
The FEAST method in (Afshar, Ralph, et al., 2020) extracts spatio-temporal features for event-based vision data using real-valued exponentially decaying kernels and 2-D "neurons". The use of exponentially decaying kernels for event-based processing was described in (Tapson, Cohen, & van Schaik, 2015) and called a "time surface" in (Lagorce, Ieng, Clady, Pfeiffer, & Benosman, 2015). The time surface is generated by applying an exponential decay with a time constant τ on a local (typically square) neighbourhood centred on the current event. An event from an event-based vision sensor can be described as:
$$e_i = [x_i, y_i, t_i, p_i]^T$$
where $x_i$, $y_i$ represent the spatial location of the pixel with reference to the event-based sensor, $t_i$ is the time of the event, and $p_i \in \{-1, 1\}$ is the polarity of the event.
The time surface $T_i(\mathbf{u}, p)$ at the location $\mathbf{u} = [x_i, y_i]^T$ of the event $e_i$ at time $t_i$ can be calculated as:
$$T_i(\mathbf{u}, p) = e^{-(t_i - P_i(\mathbf{u}, p))/\tau}$$
where $P_i(\mathbf{u}, p)$ is the timestamp of the latest event that occurred at the location $\mathbf{u}$. The time surface over the spatial neighbourhood of size R around an event location is considered an event context (E_C), of size (2R + 1) × (2R + 1). In event-based vision, the local E_C describes the recent time history of events in the spatial neighbourhood of an event in 2-D.
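As an illustration of the event context described above, the following is a minimal sketch rather than the authors' implementation. It assumes a single polarity, a per-pixel memory of the latest timestamp (`last_ts`, a hypothetical name), timestamps in seconds, and a value of zero for pixels that have not fired or that fall outside the sensor.

```python
import math

def event_context(last_ts, x, y, t, radius, tau):
    """Return the (2*radius+1) x (2*radius+1) time-surface patch centred on (x, y).

    last_ts[row][col] is the timestamp of the latest event at that pixel,
    or None if the pixel has not fired yet (a hypothetical convention).
    """
    height, width = len(last_ts), len(last_ts[0])
    patch = []
    for dy in range(-radius, radius + 1):
        row = []
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            inside = 0 <= yy < height and 0 <= xx < width
            if inside and last_ts[yy][xx] is not None:
                # Exponential decay of the time since the latest event there.
                row.append(math.exp(-(t - last_ts[yy][xx]) / tau))
            else:
                row.append(0.0)  # no event seen yet, or outside the sensor
        patch.append(row)
    return patch

# Three events (x, y, t in seconds) on a 5 x 5 sensor.
last_ts = [[None] * 5 for _ in range(5)]
for x, y, t in [(1, 2, 0.000), (2, 2, 0.010), (2, 2, 0.020)]:
    last_ts[y][x] = t  # store the event first, so its own pixel decays from 1.0
    context = event_context(last_ts, x, y, t, radius=1, tau=0.010)
```

After the last event, the centre of `context` is 1.0 and the left neighbour, which last fired two time constants earlier, has decayed to roughly 0.135.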
FEAST learns spatiotemporal features from the E_Cs through 2-D "neurons". Each neuron has a randomly generated initial threshold and weights. The neurons act as feature extractors with individual adaptive thresholds via a competitive strategy. A similarity measure, cos(θ), between the E_C and the neuron's weights is used as a metric to match the event contexts with the weights of each neuron:
$$\cos(\theta) = \frac{\mathrm{E\_C}}{\|\mathrm{E\_C}\|} \cdot \frac{w_i}{\|w_i\|}$$
where $w_i$ denotes the weights of neuron $i$, and $\mathrm{E\_C}/\|\mathrm{E\_C}\|$ and $w_i/\|w_i\|$ are the normalisations of the E_C and the weights. After normalisation, the similarity is calculated as a dot product of the normalised E_C and weights.
Figure 2 Architecture of the binaural CAR-FAC FPGA system. The system consists of an audio codec and two "ears". Each of the ears includes a CAR-FAC module, a controller module, and an interface module. The FPGA board is hosted by a PC through a USB interface. The inset shows the system timing diagram demonstrating the pipelined CAR-FAC. With time multiplexing and pipeline techniques, a binaural real-time n-channel CAR-FAC system is built using only one CAR-FAC module and one LIF module for each ear.
In the learning phase, each neuron's unique threshold acts as a selection boundary. The neuron with the highest similarity that also crosses its selection threshold is picked as the winner neuron, which then emits a spike. The thresholds of the neurons are dynamic during learning, and are adapted based on two rules:
1. If there is a winner, then increase the threshold of the winner neuron i by a fixed amount ΔI.
2. If there is no winner, then decrease all the neurons' thresholds by a fixed amount ΔE.
The E_C is then used to update the winning neuron's weights with a fixed mixing rate as follows:
$$w_i \leftarrow (1 - \eta)\, w_i + \eta\, \mathrm{E\_C} \quad (7)$$
where $w_i$ denotes the weights of the neuron $i$ to which the E_C is successfully matched, and $\eta$ is the mixing rate used to update the weights of the neuron. The weights of the neurons form features that cover the feature space of the input signals. The use of a dynamic threshold ensures that the rate of firing of all neurons is approximately equal across the dataset, as increasing the threshold on the matching feature serves to specialise each neuron from other neurons. If the weights are coding poorly for the incoming feature, then the global threshold decrease serves to expand the range of input features to which the neurons will respond. This learning process is dynamic and responsive to the statistics of the incoming data.
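The competitive learning rule described above can be sketched as a single update step. This is a hedged illustration, not the reference FEAST code: the function and argument names are hypothetical, the winner's weights are mixed with the normalised E_C, and the weights are re-normalised after each update, which is an assumption made here so that the dot product stays a cosine similarity.

```python
import math

def normalise(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def feast_step(context, weights, thresholds, delta_inc, delta_dec, eta):
    """One learning step; returns the winner index, or None if no neuron matched.

    weights is a list of unit-norm vectors and thresholds a list of floats;
    both are updated in place.
    """
    c = normalise(context)
    winner, best_sim = None, -1.0
    for i, w in enumerate(weights):
        sim = sum(a * b for a, b in zip(c, w))  # cosine similarity
        if sim >= thresholds[i] and sim > best_sim:
            winner, best_sim = i, sim
    if winner is None:
        # No neuron crossed its threshold: lower every threshold.
        for i in range(len(thresholds)):
            thresholds[i] -= delta_dec
        return None
    # Mix the context into the winner's weights, then raise its threshold.
    mixed = [(1 - eta) * w + eta * x for w, x in zip(weights[winner], c)]
    weights[winner] = normalise(mixed)  # re-normalisation is an assumption
    thresholds[winner] += delta_inc
    return winner

# Two neurons with hand-picked (not random) initial values for readability.
weights = [[1.0, 0.0], [0.0, 1.0]]
thresholds = [0.9, 0.9]
first = feast_step([1.0, 1.0], weights, thresholds, 0.1, 0.2, 0.5)   # no match
second = feast_step([3.0, 1.0], weights, thresholds, 0.1, 0.2, 0.5)  # neuron 0 wins
```

In this toy run the first context matches neither neuron, so both thresholds drop; the second context then falls inside neuron 0's enlarged selection boundary, and that neuron's weights move towards it.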
When the FEAST is applied to the event-based audio data, the E_C needs to be formed differently. Figure 3 shows the construction of the E_Cs and the details will be illustrated in the next two sections. In this paper, we use the FEAST to build the event-based multi-resolution spectrotemporal analysis. The computational SpectroTemporal Receptive Field (STRF) model is inspired by psychoacoustical and neurophysiological findings in the early and central stages of the auditory system (Chi, Ru, & Shamma, 2005). The model provided a unified multi-resolution representation of the spectral and temporal features likely critical in the perception of sound. It mimics aspects of the responses of higher central auditory stages, especially the primary auditory cortex. Functionally, it estimates the spectral and temporal modulation content of the auditory spectrogram via a bank of filters that are selective to different spectrotemporal modulation parameters ranging from slow to fast rates temporally, and from narrow to broad scales spectrally (Chi et al., 2005). Here we break the proposed event-based spectrotemporal feature extraction into two steps:

Temporal Feature Extraction - 1-D FEAST
The CAR-FAC model shows highly frequency-dependent gains and the connecting LIF neurons encode the amplitude of the channel responses. A similar amplitude coding is also used by (Liu et al., 2014). Figure 4 shows the CAR-FAC response to a TIDIGITS utterance "o". In the middle frequency channel, 650 Hz, the response shows the highest gain in amplitude, and thus higher spike numbers than the higher (1000 Hz) and lower (180 Hz) frequency channels. Additionally, the inter-spike interval encodes the changes in amplitude. For example, for an increment in amplitude, the spike train shows a gradually decreasing inter-spike interval, whereas, for a decrease in amplitude, the spike train shows a gradually increasing inter-spike interval. In this way, the spike trains of each channel encode syllabic rates of speech. In speech and music, there are three kinds of temporal modulations (Chi et al., 2005) in the cochlear outputs. Slow modulations reflect the syllabic rates of speech. They are superimposed upon intermediate-rate modulations, which are due to inter-harmonic interactions and occur at a rate that reflects the fundamental frequency of the input. These in turn ride upon the fast frequency component that drives each channel best, the characteristic frequency (CF) of that cochlear channel.
The first step of Event-based Spectrotemporal Feature Extraction, 1-D FEAST, is to extract such syllabic rates, or slow amplitude changes, from each cochlear channel temporally. As shown in Figure 4 (C) and (D), we apply an exponential kernel decaying with a time constant τ on each event across the channels:
$\tau = f_s \times 10^{,=}$ (8)
where $f_s$ is the sampling frequency, and τ determines the duration over which a previous event has an impact on the scene. The current event represents the highest energy, 1.0, as shown in Figure 4 (D). We then define a 1-D E_C for each event that includes a fixed number of spikes. Each E_C should include a sufficient number of spikes such that a change in amplitude can be represented. Since the E_C generated for each spike has a different duration in time, or number of samples, we then resample the E_C into a fixed number of samples. After resampling, all the E_Cs have the same number of samples, while preserving the encoded temporal features. For example, in Figure 4 (E), an onset is shown in five consecutive spikes with gradually increased inter-spike intervals in all the channels.
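The construction of a 1-D E_C can be sketched as follows. This is only one possible reading of the description above: the function name and arguments are hypothetical, times and τ are expressed in seconds rather than samples, and the decaying waveform is evaluated directly at a fixed number of evenly spaced points instead of being sampled at the audio rate and then resampled.

```python
import math
from bisect import bisect_right

def event_context_1d(spike_times, i, k, tau, n_samples):
    """Build a fixed-length 1-D context for spike i from its k preceding spikes.

    spike_times is a sorted list of spike times (seconds) for one channel.
    Returns None when fewer than k earlier spikes are available.
    """
    if i < k:
        return None
    window = spike_times[i - k:i + 1]      # k past spikes plus the current one
    start, end = window[0], window[-1]
    context = []
    for j in range(n_samples):
        # Evenly spaced sample times; pin the last one to the current event.
        if j == n_samples - 1:
            s = end
        else:
            s = start + (end - start) * j / (n_samples - 1)
        latest = window[bisect_right(window, s) - 1]  # latest spike at or before s
        context.append(math.exp(-(s - latest) / tau))
    return context

# Shrinking inter-spike intervals, i.e. a rising amplitude.
spikes = [0.0, 0.4, 0.7, 0.9, 1.0]
ctx = event_context_1d(spikes, i=4, k=4, tau=0.5, n_samples=11)
```

Every context has `n_samples` values regardless of how long the k spikes took to arrive, and the final value is 1.0 because it coincides with the current event.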
FEAST is then applied to the 1-D E_Cs to extract 1-D temporal features, as shown in Figure 5 (A), in two phases:
a) Learning: In the learning phase, the number of neurons, m, is pre-set, and the initial threshold and weights for each neuron are randomly generated. For each event, we choose the k past spikes that are closest to it and resample them to form its E_C. All the extracted E_Cs are presented in random order during training.
For an event at time ti and channel n, the dot product between its E_C and each neuron is calculated. The neuron with the largest value that is also above its threshold is the winner. The threshold of the winner neuron is then increased by a fixed amount ΔI, and the weights are updated according to (7). If there is no winner, all the neurons' thresholds are decreased by a fixed amount ΔE. Multiple epochs of learning are performed until the weights have converged.
b) Feature Extraction: Once the system is no longer learning, the m neurons are then used to extract features from spike streams. Each neuron generates a feature map in its feature space: For an event at time ti in channel n, the dot product between its E_C and each neuron is calculated. The only neuron with the largest value is the winner. The winner neuron will emit a spike at time ti in channel n in its feature space to form a feature map.
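The feature extraction phase can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the names are hypothetical, the neuron weights are taken to be unit-norm after learning, and each event is given as a (time, channel, E_C) tuple. Thresholds play no role here; the neuron with the largest dot product always wins.

```python
def extract_feature_maps(events, weights):
    """Assign every event to its best-matching neuron.

    events is a list of (time, channel, context) tuples and weights a list of
    unit-norm weight vectors. Returns one list of (time, channel) spikes per
    neuron, i.e. one feature map per neuron.
    """
    feature_maps = [[] for _ in weights]
    for t, channel, context in events:
        sims = [sum(a * b for a, b in zip(context, w)) for w in weights]
        winner = max(range(len(weights)), key=sims.__getitem__)
        feature_maps[winner].append((t, channel))  # winner spikes at (t, channel)
    return feature_maps

weights = [[1.0, 0.0], [0.0, 1.0]]
events = [(0.01, 3, [0.9, 0.1]), (0.02, 5, [0.2, 0.8]), (0.03, 3, [0.7, 0.3])]
maps = extract_feature_maps(events, weights)
```

Here the first and third events land in the feature map of neuron 0 and the second in that of neuron 1, so the input spike stream is split into m sparser streams, one per learned feature.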
The 1-D FEAST extracts channel-wise temporal features, in particular the slow changes in amplitude encoded in the spike streams. It is comparable to the computational spectrotemporal cortical model (Chi et al., 2005) that uses slow rate filters for the temporal analysis to extract syllabic rates in speech.

Spectro-temporal Feature Extraction - 2-D FEAST
Speech contains spectral modulations created by harmonics and formants, which are also evident in the cochleogram. Harmonics come from the vocal folds and are considered the source of the sound. Formants come from the vocal tract. Formants filter the harmonic sound source, and thus after harmonics go through the vocal tract, some become louder, and some become softer.
The features of the harmonics and formants are associated with the frequency channels, and the next step of the Event-based Spectrotemporal Feature Extraction, 2-D FEAST, is to extract spectral and temporal combined features. As shown in Figure 5 (B), to extend the 1-D E_C, for each event at time ti in channel n, we choose c channels in frequency. Within each selected channel, the k spikes that are closest to the current event are selected and resampled to form a 2-D E_C. Similar to the 1-D FEAST, in the learning phase, for an event at time ti and channel n, the dot product between its E_C and each 2-D neuron is calculated. The neuron with the largest value that is also above its threshold is the winner. The threshold of the winner neuron is then increased by a fixed amount ΔI, and the weights are updated according to (7). If there is no winner, all the neurons' thresholds are decreased by a fixed amount ΔE. Multiple epochs of learning are performed until the weights have converged. In the feature extraction phase, for each event at time ti and channel n, the dot product between its E_C and each 2-D neuron is calculated. The neuron with the largest value is the winner. The winner neuron will emit a spike at time ti in channel n in its feature space to form a feature map.
Figure 6 The 1-D neuron features with different configurations and the corresponding feature maps.
Furthermore, for each event, we generate multiple sets of 2-D E_Cs, as shown in Figure 5 (C). Each set includes a different number of channels so that it covers multiple scales in frequency. For example, as shown in Figure 5 (C), a 5-channel dimension only includes one harmonic, whereas a 13-channel dimension can cover two harmonics, and so on. The choice of channel numbers is based on the Greenwood function used in the CAR-FAC model (Greenwood, 1990). We then apply the 2-D FEAST described previously on each set of E_Cs in parallel.
The same learning and feature extraction phases described above are applied to each event. For each event at time ti and channel n, there is one winner neuron in each dimension. The winner neuron of each dimension will emit a spike at time ti in channel n in its feature space.
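The multi-scale 2-D E_Cs can be illustrated with the sketch below. It assumes that a 1-D context is already available for every channel at the time of the current event, that the c selected channels are centred on the event's channel, and that channels beyond the edge of the cochlea are zero-padded. The centring and the padding are assumptions made for this sketch, and the names are hypothetical; the channel counts are those used in the experiments.

```python
SCALES = (5, 13, 25, 37)  # channel counts used for the 2-D E_Cs

def event_context_2d(rows, n, c):
    """Stack the 1-D contexts of c channels centred on channel n.

    rows[ch] is the 1-D context of channel ch at the current event time.
    Channels outside the cochlea are filled with zeros (an assumption).
    """
    half = c // 2
    n_samples = len(rows[n])
    context = []
    for ch in range(n - half, n + half + 1):
        if 0 <= ch < len(rows):
            context.append(list(rows[ch]))
        else:
            context.append([0.0] * n_samples)
    return context

def multi_scale_contexts(rows, n, scales=SCALES):
    """Return one 2-D context per spectral scale for an event in channel n."""
    return {c: event_context_2d(rows, n, c) for c in scales}

# 64 channels, 4 samples each; every row is filled with its channel index.
rows = [[float(ch)] * 4 for ch in range(64)]
contexts = multi_scale_contexts(rows, n=2)
```

Each scale would then be passed to its own set of 2-D neurons, so one input event yields one winner per scale.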
2-D FEAST is comparable to a spectrotemporal cortical model that uses different "seed functions" as scale filters for spectrotemporal analysis (Chi et al., 2005) to extract harmonics and formants.

CAR-FAC ON FPGA
The CAR-FAC FPGA implementation has been investigated and measured by (Xu et al., 2016) and (Xu, Thakur, et al., 2018).
In this work, we use the proposed CAR-FAC cochlear system on FPGA to generate spike streams from the TIDIGITS database. Here, the time constant τ of the LIF neuron in (2) is set to 10 ms, the reset value is set to 0, and the threshold is set to a single medium value, 0.0004. The device utilisation of the binaural 2×64×9 CAR-FAC system is shown in Table 1.
The 1-D and 2-D FEAST are tested, respectively, on an isolated spoken digit recognition task using the TIDIGITS dataset. Here we use the eleven isolated spoken digits ("oh", "zero", and "one" to "nine") from 225 speakers (female and male) as the training and testing data, giving 4950 samples in total (2464 for training and 2486 for testing). A Support Vector Machine (SVM) with a linear kernel and optimal regularisation is used as the back-end classifier to investigate the performance of the FEAST.

1-D feature for temporal feature extraction
In the 1-D FEAST, we chose k = 4 spikes for an E_C. The E_C is then resampled into 32 samples. In this experiment, the algorithm had converged after ten epochs of training. The parameters were configured as ΔI = 0.001, ΔE = 0.003, and η = 0.001 in (7), which were derived empirically. The optimal number of neurons depends greatly on the nature of the data. In this experiment, 8, 16, 32, and 64 neurons are tested. The generated feature map for each neuron is down-sampled via fixed time binning (Anumula et al., 2018), as shown in Figure 6. By observing the features of the neurons, we can see that a neuron with gradually decreasing intervals often represents an onset, whereas a neuron with gradually increasing intervals represents an offset of an utterance. Evenly distributed intervals represent an unchanged amplitude of the utterance. The generated 1-D features are then used as input for the SVM. Additionally, according to (Acharya et al., 2018) and (Anumula et al., 2018), the time-binned spikes show the highest accuracy compared to other statistical features in isolated spoken digit recognition, so in this experiment, the time-binned spikes generated from the proposed cochlear system are investigated as a baseline. The classification results are shown in Table 2. For all the configurations, the 1-D FEAST shows better accuracy than the time-binned spikes, and the 32-neuron configuration shows the best accuracy, 93.92%.
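The fixed time binning that turns the feature maps into a classifier input can be sketched as follows. The number of bins, the normalisation by utterance duration, and the function names are assumptions chosen for illustration; the text above does not specify them.

```python
def time_bin(feature_map, n_channels, n_bins, duration):
    """Count the spikes of one feature map per channel and per time bin.

    feature_map is a list of (time, channel) spikes and duration the length
    of the utterance in the same time unit.
    """
    counts = [[0] * n_bins for _ in range(n_channels)]
    for t, channel in feature_map:
        # Spikes exactly at the end of the utterance go into the last bin.
        b = min(int(t / duration * n_bins), n_bins - 1)
        counts[channel][b] += 1
    return counts

def flatten_features(feature_maps, n_channels, n_bins, duration):
    """Concatenate the binned maps of all neurons into one feature vector."""
    vector = []
    for feature_map in feature_maps:
        for row in time_bin(feature_map, n_channels, n_bins, duration):
            vector.extend(row)
    return vector

# Two neurons, two channels, four bins over a one-second utterance.
feature_maps = [[(0.05, 0), (0.15, 0), (0.95, 1)], [(1.0, 1)]]
vector = flatten_features(feature_maps, n_channels=2, n_bins=4, duration=1.0)
```

The resulting vector has a fixed length of neurons × channels × bins for every utterance, which is what a linear classifier such as the SVM requires.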

2-D feature for spectrotemporal feature extraction
In the 2-D FEAST, we chose k = 4 spikes and resampled them to 32 samples temporally, and used 5, 13, 25, and 37 channels for the 2-D E_Cs. In the training phase, we train each set of the 2-D E_Cs in parallel, using 16, 32 and 64 neurons, respectively. Figure 7 shows all the features of the 32 neurons and Figure 8 shows the corresponding feature maps. The small-sized neurons tend to show fine spectral features, whereas the large-sized neurons only show coarse intensity information in Figure 8.
We then use the 2-D features as input for the SVM. Firstly, we test each set of neurons separately. As shown in Table 2, the 32-neuron system shows the best accuracy, and we found that, for the same number of neurons, the 25-channel size tends to provide better accuracy. Next, we combine all the sizes together for each neuron configuration, and get an improved accuracy, 97.71%. For comparison, the results of the same experiment by (Anumula et al., 2018) are also shown in Table 2, in which a Gated Recurrent Unit (GRU) Recurrent Neural Network (RNN) is used for a constant time binning of the exponential features. Currently, the highest accuracy of 99.09% on the same task is achieved by (Shrestha & Orchard, 2018) using a 484-500-500-11 neuron spiking neural network with backpropagation, whereas in our approach, we only use one layer of 128 neurons (32 neurons × 4 sizes) and a simple linear classifier.

DISCUSSIONS
This paper presents a reconfigurable digital implementation of an event-based binaural cochlear system and an event-driven spectrotemporal receptive field feature extraction approach. The algorithm is tested on an isolated spoken digit recognition task. The features extracted from FEAST provide better multi-resolution representations of the event-based data than statistical approaches that have been classically used for decoding spike streams.
As with any other data modality, noise in event data poses challenges to effective processing, and FEAST helps in learning noise-robust features. The CAR-FAC model has been shown to provide noise-robust features in audio to perform speaker identification (Islam, Xu, Monk, Afshar, & van Schaik, 2022). Audio features from the CAR-FAC cochlea model have also been used to perform noise-robust binaural sound localisation (Xu, Afshar, et al., 2018; Xu et al., 2019, 2021). Since the FEAST is an unsupervised method, it cannot perform classification and requires a back-end classifier. In follow-up work, we will use a generalised model of the FEAST method that performs feature extraction and classification in a single architecture (Bethi et al., 2022).