Heterogeneous recurrent spiking neural network for spatio-temporal classification

Spiking Neural Networks are often touted as brain-inspired learning models for the third wave of Artificial Intelligence. Although recent SNNs trained with supervised backpropagation show classification accuracy comparable to deep networks, the performance of unsupervised learning-based SNNs remains much lower. This paper presents a heterogeneous recurrent spiking neural network (HRSNN) with unsupervised learning for spatio-temporal classification of video activity recognition tasks on RGB (KTH, UCF11, UCF101) and event-based (DVS128 Gesture) datasets. Using the novel unsupervised HRSNN model, we observed accuracies of 94.32% on the KTH dataset, 79.58% and 77.53% on the UCF11 and UCF101 datasets, respectively, and 96.54% on the event-based DVS Gesture dataset. The key novelty of the HRSNN is that its recurrent layer consists of heterogeneous neurons with varying firing/relaxation dynamics, trained via heterogeneous spike-timing-dependent plasticity (STDP) with varying learning dynamics for each synapse. We show that this novel combination of heterogeneity in architecture and learning method outperforms current homogeneous spiking neural networks. We further show that HRSNN can achieve performance similar to state-of-the-art backpropagation-trained supervised SNNs, but with less computation (fewer neurons and sparse connections) and less training data.


Introduction
Acclaimed as the third generation of neural networks, spiking neural networks (SNNs) have become very popular. In general, SNNs promise lower operating power when mapped to hardware. In addition, recent developments of SNNs with leaky integrate-and-fire (LIF) neurons have shown classification performance similar to deep neural networks (DNNs). However, most of these works use supervised statistical training algorithms such as backpropagation-through-time (BPTT) (Jin et al., 2018; Shrestha and Orchard, 2018; Wu et al., 2018). These backpropagated models are extremely data-dependent and show poor trainability and generalization when training data are limited (Tavanaei et al., 2019; Lobo et al., 2020). In addition, BPTT-trained models need highly complex architectures with a large number of neurons for good performance. Though unsupervised learning methods like STDP have been introduced, their performance lags behind that of their backpropagated counterparts. This is attributed to the high training complexity of the STDP dynamics (Lazar et al., 2006). Therefore, there is a need to explore SNN architectures and algorithms that can improve the performance of unsupervised-learned SNNs.
Related works

Recurrent spiking neural network

Supervised learning

Recurrent networks of spiking neurons can be effectively trained to achieve competitive performance compared to standard recurrent neural networks. Demin and Nekhaev (2018) showed that using recurrence could reduce the number of layers in SNN models and potentially form the various functional network structures. Zhang and Li (2019) proposed a spike-train-level recurrent SNN backpropagation method to train deep RSNNs, which achieves excellent performance in image and speech classification tasks. On the other hand, Wang et al. (2021) used a recurrent LIF neuron model with dynamic presynaptic currents, trained by BP based on a surrogate gradient. Some recent works introduce heterogeneity in the LIF parameters using trainable time constants (Fang et al., 2021). However, these methods are supervised learning models and do not scale well as the number of hyperparameters grows.

Unsupervised learning
Unsupervised learning models like STDP have shown great generalization and trainability properties (Chakraborty and Mukhopadhyay, 2021). Previous works have used STDP for training recurrent spiking networks (Gilson et al., 2010). Nobukawa et al. (2019) used a hybrid of STDP and dopamine-modulated STDP to train a recurrent spiking network and showed its performance in classifying patterns. Several other works have used a reservoir-based computing strategy, as described above. Liquid State Machines equipped with unsupervised learning models like STDP and BCM (Ivanov and Michmizos, 2021) have shown promising results.

Heterogeneity
Despite the previous works on recurrent spiking neural networks, all these models use a uniform parameter distribution for the spiking neuron parameters and their learning dynamics. There has been little research leveraging heterogeneity in the model parameters and studying its effect on performance and generalization. Recently, Perez-Nieves et al. (2021) introduced heterogeneity in the neuron time constants and showed that this improves the model's performance in classification tasks and makes the model robust to hyperparameter tuning. She et al. (2021b) also used a similar heterogeneity in the model parameters of a feedforward spiking network and showed it could classify temporal sequences. Again, modeling heterogeneity in cortical networks of the brain, Zeldenrust et al. (2021) derived a class of RSNNs that tracks a continuously varying input online.

Action detection using SNNs
Since SNNs can operate directly on event data instead of aggregating it, recent works use the concept of time-surfaces (Lagorce et al., 2016; Maro et al., 2020). Escobar et al. (2009) proposed a feed-forward SNN for action recognition using the mean firing rate of every neuron and the synchrony between neuronal firings. Yang et al. (2018) used a two-layer spiking neural network to learn human body movement with a gradient-descent-based learning mechanism, encoding the trajectories of the joints as spike trains. Wang W. et al. (2019) proposed a novel Temporal Spiking Recurrent Neural Network (TSRNN) to perform robust action recognition from video. Using a temporal pooling mechanism, the SNN model provides reliable and sparse frames to the recurrent units, and continuous message passing from the spiking signals to the RNN helps the recurrent unit retain its long-term memory. Another idea explored in the literature is to capture the temporal features of the input with a reservoir network of spiking neurons, whose output is trained to produce certain desired activities based on some learning rule. Recent research learned video activities with limited examples using this idea of reservoir computing (Panda and Srinivasa, 2018; George et al., 2020; Zhou et al., 2020). We observed that driven/autonomous models are good for modeling the temporal dependencies of a single-dimensional, pre-known time series, but they cannot learn the spatio-temporal features needed for action recognition. Soures and Kudithipudi (2019) used a deep architecture of a reservoir connected to an unsupervised Winner-Take-All (WTA) layer, which captures the input in a higher-dimensional space and encodes it into a low-dimensional representation via the WTA layer. All the information from the layers in the deep network is selectively processed using an attention-based neural mechanism. They used ResNet for ANN-based spatial feature extraction, which is compute-intensive.
The membrane potential v of each LIF neuron evolves as

τ_m dv/dt = −(v − a) + R_m · I,

where R_m is the membrane resistance, τ_m = R_m C_m is the time constant, and C_m is the membrane capacitance. a is the resting potential. I is the sum of currents from all input synapses connected to the neuron. A spike is generated when the membrane potential v crosses the threshold, and the neuron enters a refractory period r, during which the neuron holds its membrane potential at v_reset. We construct the HRSNN from the baseline recurrent spiking network (RSNN) consisting of three layers: (1) an input encoding layer (I), (2) a recurrent spiking layer (R), and (3) an output decoding layer (O). The recurrent layer consists of excitatory and inhibitory neurons, distributed in a ratio of N_E : N_I = 4 : 1. The PSPs of post-synaptic neurons produced by the excitatory neurons are positive, while those produced by the inhibitory neurons are negative. We used a biologically plausible LIF neuron model and trained the model using STDP rules. From here on, we refer to connections between I and R neurons as S_IR connections, inter-recurrent-layer connections as S_RR, and R-to-O connections as S_RO. We created S_RR connections using probabilities based on the Euclidean distance D(i, j) between any two neurons i, j,

P(i, j) = C · exp(−(D(i, j)/λ)²),

with closer neurons having a higher connection probability. Parameters C and λ set the amplitude and horizontal shift, respectively, of the probability distribution. I contains excitatory encoding neurons, which convert input data into spike trains. S_IR randomly chooses only 30% of the excitatory and inhibitory neurons in R as post-synaptic neurons. The connection probability between the encoding neurons and the neurons in R is defined by a uniform probability P_IR, which, together with λ, is used to encode the architecture of the HRSNN and is optimized using BO. In this work, each neuron received projections from some randomly selected neurons in R.
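As a concrete illustration, the LIF update and the distance-dependent wiring rule can be sketched in a few lines of Python; all parameter values here, and the exact exponential form of the wiring probability, are illustrative assumptions rather than the paper's tuned settings:

```python
import numpy as np

def lif_step(v, I, dt=1.0, tau_m=20.0, R_m=1.0, a=0.0, v_th=1.0, v_reset=0.0):
    """One Euler step of the LIF equation tau_m dv/dt = -(v - a) + R_m * I.
    Returns the updated membrane potential and a boolean spike flag.
    Parameter names follow the text; the values are illustrative."""
    v = v + dt / tau_m * (-(v - a) + R_m * I)
    spiked = v >= v_th
    if spiked:
        v = v_reset  # membrane held at v_reset during the refractory period
    return v, spiked

def connection_prob(d, C=0.3, lam=2.0):
    """Distance-dependent wiring probability for S_RR connections:
    closer neuron pairs are more likely to be connected. The Gaussian-like
    form P = C * exp(-(d / lam)**2) is a common choice in reservoir models
    and is our assumption for this sketch."""
    return C * np.exp(-(d / lam) ** 2)
```

With a constant supra-threshold input current, this neuron spikes periodically; `C` and `lam` play the amplitude and horizontal-shift roles described in the text.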
We used unsupervised, local learning in the spiking recurrent model by letting STDP change each S_RR and S_IR connection, modeled as

dW/dt = A₊ T_pre Σ_o δ(t − t_post^o) − A₋ T_post Σ_i δ(t − t_pre^i),

where A₊, A₋ are the potentiation/depression learning rates and T_pre/T_post are the pre-/post-synaptic trace variables, modeled as

τ*₊ dT_pre/dt = −T_pre + a₊ Σ_i δ(t − t_pre^i),
τ*₋ dT_post/dt = −T_post + a₋ Σ_o δ(t − t_post^o),

where a₊, a₋ are the discrete contributions of each spike to the trace variable, τ*₊, τ*₋ are the decay time constants, and t_pre^i and t_post^o are the times of the pre-synaptic and post-synaptic spikes, respectively.
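A minimal per-synapse sketch of this trace-based rule follows; the learning rates and time constants are placeholders, and in HRSNN each synapse would draw its own (A₊, A₋, τ₊, τ₋) from a distribution rather than share one fixed set:

```python
def stdp_update(w, pre_spike, post_spike, T_pre, T_post,
                A_plus=0.01, A_minus=0.012, a_plus=1.0, a_minus=1.0,
                tau_plus=20.0, tau_minus=20.0, dt=1.0):
    """One discrete-time step of trace-based STDP for a single synapse.
    T_pre / T_post decay exponentially and jump by a_plus / a_minus on a
    pre-/post-synaptic spike; the weight is potentiated by A_plus * T_pre
    at a post spike and depressed by A_minus * T_post at a pre spike.
    All constants are illustrative, not the paper's optimized values."""
    T_pre += dt * (-T_pre / tau_plus) + (a_plus if pre_spike else 0.0)
    T_post += dt * (-T_post / tau_minus) + (a_minus if post_spike else 0.0)
    if post_spike:
        w += A_plus * T_pre    # pre-before-post pairing -> LTP
    if pre_spike:
        w -= A_minus * T_post  # post-before-pre pairing -> LTD
    return w, T_pre, T_post
```

A pre spike followed a few milliseconds later by a post spike leaves a positive pre-trace at the post-spike time, so the weight is potentiated.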

Heterogeneous LIF neurons
The use of multiple timescales in spiking neural networks has several underlying benefits, like increasing the memory capacity of the network. In this paper, we propose using heterogeneous LIF neurons with different membrane time constants and threshold voltages, thereby giving rise to multiple timescales. Because of the differential effects of excitatory and inhibitory heterogeneity on the gain and asynchronous state of sparse cortical networks (Carvalho and Buonomano, 2009; Hofer et al., 2011), we use different gamma distributions for the excitatory and the inhibitory LIF neurons. This is also inspired by biological observations in the brain, where the time constants of excitatory neurons are larger than those of inhibitory neurons. Thus, we incorporate heterogeneity in our recurrent spiking neural network by using a different membrane time constant τ for each LIF neuron in R, giving rise to a distribution over the time constants of the LIF neurons in R.
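A sketch of how such a heterogeneous population could be initialized; the gamma shape/scale values below are placeholders (the paper optimizes the actual distribution parameters with Bayesian optimization), chosen only so that excitatory time constants are larger on average than inhibitory ones, as the text describes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_heterogeneous_tau(n_exc, n_inh, shape_e=3.0, scale_e=10.0,
                             shape_i=3.0, scale_i=4.0):
    """Draw per-neuron membrane time constants from separate gamma
    distributions for the excitatory and inhibitory LIF neurons.
    Mean of a gamma draw is shape * scale, so the defaults give
    ~30 ms (excitatory) vs. ~12 ms (inhibitory) on average."""
    tau_e = rng.gamma(shape_e, scale_e, size=n_exc)
    tau_i = rng.gamma(shape_i, scale_i, size=n_inh)
    return tau_e, tau_i

# 4:1 excitatory:inhibitory ratio, as used for the recurrent layer R
tau_e, tau_i = sample_heterogeneous_tau(n_exc=1600, n_inh=400)
```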

Heterogeneous STDP
Experiments on different brain regions and diverse neuronal types have revealed a wide variety of STDP forms that vary in plasticity direction, temporal dependence, and the involvement of signaling pathways (Sjostrom et al., 2008; Feldman, 2012; Korte and Schmitz, 2016). As described by Pool and Mato (2011), one of the most striking aspects of this plasticity mechanism in synaptic efficacy is that the STDP windows display a great variety of forms in different parts of the nervous system. However, most STDP models used in spiking neural networks are homogeneous, with a uniform timescale distribution. Thus, we explore the advantages of using heterogeneity in the several hyperparameters discussed above. This paper considers heterogeneity in the scaling function constants (A₊, A₋) and the decay time constants (τ₊, τ₋).

Classification property of HRSNN
We theoretically compare the performance of the heterogeneous spiking recurrent model with its homogeneous counterpart using a binary classification problem. The ability of HRSNN to distinguish between many inputs is studied through the lens of the edge-of-chaos dynamics of the spiking recurrent neural network, similar to the case of spiking reservoirs shown by Legenstein and Maass (2007). Also, R possesses a fading memory due to its short-term synaptic plasticity and recurrent connectivity. For each stimulus, the final state of R,

FIGURE 1
An illustrative example showing the heterogeneous recurrent spiking neural network structure. First, we show the temporal encoding method based on the sensory receptors receiving the difference between two time-adjacent data. Next, the input sequences are encoded by the encoding neurons, which inject the spike train into 30% of the neurons in R. R contains a 4 : 1 ratio of excitatory (green nodes) to inhibitory (orange nodes) neurons, whose neuron parameters are heterogeneous. The synapses are trained using the heterogeneous STDP method.
i.e., the state at the end of each stimulus, carries the most information. Figure 1 shows the heterogeneous recurrent spiking neural network model with heterogeneous LIF neurons and heterogeneous STDP synapses used for the classification of spatio-temporal data sequences. Legenstein and Maass (2007) showed that the rank of the final state matrix F reflects the separation property of a kernel: F = [S(1) S(2) · · · S(N)]ᵀ, where S(i) is the final state vector of R for stimulus i. Each element of F represents one neuron's response to all the N stimuli. A higher rank of F indicates better kernel separation if all N inputs are from N distinct classes. The effective rank is calculated by performing a Singular Value Decomposition (SVD) on F and taking as the rank the number of singular values that contain 99% of the sum in the diagonal matrix, i.e., F = U Σ Vᵀ, where U and V are unitary matrices and Σ is a diagonal matrix diag(λ₁, λ₂, λ₃, . . . , λ_N) containing non-negative singular values such that λ₁ ≥ λ₂ ≥ · · · ≥ λ_N.
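The effective-rank computation described above is straightforward with NumPy:

```python
import numpy as np

def effective_rank(F, energy=0.99):
    """Effective rank of the final-state matrix F: the number of leading
    singular values whose cumulative sum reaches 99% of the total sum,
    as described in the text."""
    s = np.linalg.svd(F, compute_uv=False)  # singular values, descending
    cumulative = np.cumsum(s) / np.sum(s)
    return int(np.searchsorted(cumulative, energy) + 1)
```

For example, an identity-like state matrix (all stimuli perfectly separated) yields full rank, while a rank-one matrix (all states collinear) yields 1.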
Definition: The linear separation property of a neuronal circuit C for m different inputs u_1(t), . . . , u_m(t) is defined as the rank of the n × m matrix M whose columns are the final circuit states x_{u_i}(t_0) obtained at time t_0 for the preceding input stream u_i.
Following the definition introduced by Legenstein and Maass (2007), if the rank of the matrix M equals m, then for the inputs u_i, any given assignment of target outputs y_i ∈ ℝ at time t_0 can be implemented by C.
We use the rank of the matrix as a measure for the linear separation of a circuit C for distinct inputs. This leverages the complexity and diversity of nonlinear operations carried out by C on its input to boost the classification performance of a subsequent linear decision-hyperplane.
Theorem 1: Assuming S_u is finite and contains s inputs, let r_Hom and r_Het be the ranks of the n × s matrices consisting of the s vectors x_u(t_0) for all inputs u in S_u for the homogeneous and heterogeneous RSNNs, respectively. Then r_Hom ≤ r_Het.
Short Proof: Let us fix some inputs u_1, . . . , u_r in S_u so that the resulting r circuit states x_{u_i}(t_0) are linearly independent. Using the Eckart–Young–Mirsky theorem for low-rank approximation, we show that the number of linearly independent vectors for HeNHeS is greater than or equal to the number of linearly independent vectors for HoNHoS. The detailed proof is given in the Supplementary material.

Definition: Given that K_ρ is the modified Bessel function of the second kind, Γ is the Gamma function, and σ², κ, ρ are the variance, length-scale, and smoothness parameters, respectively, we define the modified Matern kernel on the Wasserstein metric space W between two distributions X, X′ as

k(X, X′) = σ² (2^(1−ρ) / Γ(ρ)) (√(2ρ) W(X, X′)/κ)^ρ K_ρ(√(2ρ) W(X, X′)/κ).

Theorem 2: The modified Matern function on the Wasserstein metric space W is a valid kernel function.

Short Proof: To show that the above function is a kernel function, we need to prove that Mercer's theorem holds, i.e., (i) the function is symmetric and (ii) in a finite input space, the Gram matrix of the kernel function is positive semi-definite. The detailed proof is given in the Supplementary material.
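A sketch of this kernel for one-dimensional, sample-based distributions, using SciPy's Gamma function and modified Bessel function K_ρ. Note that `scipy.stats.wasserstein_distance` handles the 1-D case only, whereas the paper works with multivariate parameter distributions and approximates W with the Sinkhorn distance; the hyperparameter values here are placeholders:

```python
import numpy as np
from scipy.special import gamma, kv          # Gamma function, Bessel K_rho
from scipy.stats import wasserstein_distance

def matern_wasserstein(x_samples, y_samples, sigma2=1.0, kappa=1.0, rho=1.5):
    """Matern-type kernel evaluated on the 1-D Wasserstein distance
    between two sample-based distributions: the standard Matern form with
    the Euclidean distance replaced by W(X, X')."""
    w = wasserstein_distance(x_samples, y_samples)
    if w == 0.0:
        return sigma2  # limit of the Matern form as the distance -> 0
    z = np.sqrt(2 * rho) * w / kappa
    return sigma2 * (2 ** (1 - rho) / gamma(rho)) * (z ** rho) * kv(rho, z)
```

As expected of a kernel built on a distance, similarity is maximal for identical distributions and decays as the Wasserstein distance grows.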

Optimal hyperparameter selection using Bayesian Optimization
While BO is used in various settings, successful applications are often limited to low-dimensional problems, with fewer than twenty dimensions (Frazier, 2018). Thus, using BO for high-dimensional problems remains a significant challenge. In our case of optimizing HRSNN model parameters for 2,000 neurons, we need to optimize a huge number of parameters, which is extremely difficult for standard BO. As discussed by Eriksson and Jankowiak (2021), suitable function priors are especially important for good performance. Thus, we used a biologically inspired initialization of the hyperparameters derived from the human brain (see Supplementary material). This paper uses a modified BO to estimate parameter distributions for the LIF neurons and the STDP dynamics. To learn the probability distribution of the data, we modify the surrogate model and the acquisition function of the BO to operate on the parameter distributions instead of on individual variables. This makes our modified BO highly scalable over all the variables (dimensions) used. The loss for the surrogate model's update is calculated using the Wasserstein distance between the parameter distributions.
BO uses a Gaussian process to model the distribution of an objective function and an acquisition function to decide which points to evaluate. For data points in a target dataset x ∈ X with corresponding labels y ∈ Y, an SNN with network structure V and neuron parameters W acts as a function f_{V,W}(x) that maps input data x to a predicted label ỹ. The optimization problem in this work is to maximize the expected classification accuracy,

max_{V, W} E[ 1( f_{V,W}(x) = y ) ],

where V is the set of hyperparameters of the neurons in R (details of the hyperparameters are given in the Supplementary material) and W is the multivariate distribution constituting the distributions of (i) the membrane time constants τ_{m−E}, τ_{m−I} of the LIF neurons, (ii) the scaling function constants (A₊, A₋), and (iii) the decay time constants τ₊, τ₋ for the STDP learning rule in S_RR. Again, BO needs a prior distribution of the objective function f(x) on the given data D_{1:k} = {x_{1:k}, f(x_{1:k})}. In GP-based BO, it is assumed that the prior distribution of f(x_{1:k}) follows a multivariate Gaussian distribution, i.e., a Gaussian process with mean µ_{D_{1:k}} and covariance Σ_{D_{1:k}}. We estimate Σ_{D_{1:k}} using the modified Matern kernel function given in Equation 6. In this paper, we use d(x, x′) as the Wasserstein distance between the multivariate distributions of the different parameters. It is to be noted here that for higher-dimensional metric spaces, we use the Sinkhorn distance, a regularized version of the Wasserstein distance, as its approximation (Feydy et al., 2019).
D_{1:k} are the points that have been evaluated by the objective function, and the GP estimates the mean µ_{D_{k:n}} and variance σ_{D_{k:n}} for the remaining unevaluated data D_{k:n}. The acquisition function used in this work is the expected improvement (EI) of the prediction fitness,

EI(x_{k:n}) = (µ_{D_{k:n}} − f(x_best)) Φ(Z) + σ_{D_{k:n}} φ(Z),

where φ(·) and Φ(·) denote the probability density function and the cumulative distribution function of the standard normal distribution, respectively. f(x_best) = max f(x_{1:k}) is the maximum value that has been evaluated by the original function f over all evaluated data D_{1:k}, and Z = (µ_{D_{k:n}} − f(x_best)) / σ_{D_{k:n}}. BO chooses the data x_j = argmax EI(x_{k:n}), x_j ⊆ x_{k:n}, as the next point to be evaluated using the original objective function.
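This EI acquisition can be sketched as follows (a standard formulation over a batch of candidate points; the paper applies it to distribution-valued inputs through the Wasserstein-Matern GP):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Expected improvement for candidates with GP posterior mean `mu` and
    std `sigma`:  EI = (mu - f_best) * Phi(Z) + sigma * phi(Z),
    with Z = (mu - f_best) / sigma; phi/Phi are the standard-normal
    pdf/cdf. Where sigma == 0, EI degenerates to max(mu - f_best, 0)."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    z = np.where(sigma > 0, (mu - f_best) / np.maximum(sigma, 1e-12), 0.0)
    ei = (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 0, ei, np.maximum(mu - f_best, 0.0))
```

The next point to evaluate is then `argmax` of EI over the unevaluated candidates.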

Experiments

Training and inference
We use a network of leaky integrate-and-fire (LIF) neurons and train the synapses using a Hebbian plasticity rule called spike-timing-dependent plasticity (STDP). The complete network is shown in Figure 5. First, to pre-process the spatio-temporal data and remove the background noise arising from camera movement and jitter, we use the scan-based filtering technique proposed by Panda and Srinivasa (2018): we create a bounding box and the center of gravity of the spiking activity for each frame and scan across five directions, as shown in Figure 2. The output of this scan-based filter is fed into the encoding layer, which encodes this information into an array of spike trains. In this paper, we use a temporal coding method. Following Zhou et al. (2020), we use a square cosine encoding method, which employs several cosine encoding neurons to convert real-valued variables into spike times. The encoding neurons convert each real value into several spike times within a limited encoding period. Each real value is first normalized into [0, π] and then converted into spike times as t_s = T · cos(d + i · π/n), d ∈ [0, π], i = 1, 2, . . . , n, where t_s is the spiking time, T is the maximum encoding time of each spike, d denotes the normalized data, i is the index of the encoding neuron, and n is the number of encoding neurons.
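A sketch of this encoding step; clipping negative cosines to zero is our own choice for keeping spike times non-negative, since the text does not specify how negative values of the cosine are handled:

```python
import numpy as np

def cosine_encode(value, v_min, v_max, n=8, T=100.0):
    """Encode one real value into n candidate spike times using
    t_s = T * cos(d + i*pi/n), after normalizing the value into [0, pi].
    n and T are illustrative; negative times are clipped to 0 (our
    assumption)."""
    d = np.pi * (value - v_min) / (v_max - v_min)  # normalize to [0, pi]
    i = np.arange(1, n + 1)
    t_s = T * np.cos(d + i * np.pi / n)
    return np.clip(t_s, 0.0, None)
```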
The sensory receptors used for the spatio-temporal data are designed to receive the difference between time-adjacent data in a sequence. The data in each sequence are processed as

ΔD_n = D_{n+1} − D_n, D_n ∈ M_S,

where M_S represents a sequence and D_n represents an individual datum in that sequence. If the difference exceeds the threshold, the encoding neuron fires at that moment. We use a max-pooling operation before transferring the spike trains to the post-synaptic neurons, where each pixel in the output max-pooled frame represents an encoding neuron; this reduces the dimensions of the spike trains. The recurrent spiking layer extracts the features of the spatio-temporal data and converts them into linearly separable states in a high-dimensional space. O abstracts the state from R for classification. The state of R is defined as the membrane potential of the output neurons at the end of each spike train converted from the injected spatio-temporal data. After the state is extracted, the membrane potential of the output neuron is reset to its initial value. After injecting all sequences into the network, the states for each data sequence are obtained. A linear classifier is employed in this work to evaluate pattern recognition performance. Further details regarding the training and inference procedures are given in the Supplementary material.
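The receptor/difference encoding and the max-pooling step can be sketched together as follows; the threshold and pool size are illustrative assumptions:

```python
import numpy as np

def encode_sequence(frames, threshold=0.1, pool=2):
    """Receptor-style encoding sketch: each encoding neuron sees the
    difference between time-adjacent frames and fires wherever that
    difference exceeds a threshold; a max-pool then shrinks the spike map.
    `frames` is a (T, H, W) array; returns a boolean (T-1, H//pool, W//pool)
    spike tensor."""
    frames = np.asarray(frames, float)
    diffs = np.abs(np.diff(frames, axis=0))  # adjacent-frame differences
    spikes = diffs > threshold               # boolean spike events
    t, h, w = spikes.shape
    pooled = spikes[:, : h - h % pool, : w - w % pool]
    pooled = pooled.reshape(t, h // pool, pool, w // pool, pool).max(axis=(2, 4))
    return pooled
```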

Baseline ablation models
We use the following baselines for the comparative study:

• Recurrent Spiking Neural Network with STDP

Results

Ablation studies
We compare the performance of the HRSNN model with heterogeneity in the LIF and STDP dynamics (HeNHeS) to the ablation baseline recurrent spiking neural network models described above. We run five iterations for all the baseline cases and report the mean and standard deviation of the prediction accuracy of the network using 2,000 neurons. The results are shown in Table 1. We see that heterogeneity in the LIF neurons and in the LTP/LTD dynamics significantly improves the model's accuracy and reduces its error.

Number of neurons
In deep learning, it is an important task to design models with fewer neurons that do not suffer degradation in performance. We empirically show that heterogeneity plays a critical role in designing spiking neuron models of smaller sizes. We compare the models' performance and convergence rates with fewer neurons in R.

Performance analysis
We analyze the network performance and error when the number of neurons is decreased from 2,000 to just 100. We report the results obtained using the HoNHoS and HeNHeS models for the KTH and DVS-Gesture datasets. The experiments are repeated five times, and the observed mean and standard deviation of the accuracies are shown in Figure 3. The graphs show that as the number of neurons decreases, the difference in accuracy scores between the homogeneous and the heterogeneous networks increases rapidly.

Convergence analysis with fewer neurons
Since the complexity of BO increases exponentially with the size of the search space, optimizing the HRSNN becomes increasingly difficult as the number of neurons increases. Thus, we compare the convergence behavior of the HoNHoS and HeNHeS models with 100 and 2,000 neurons each. The results are plotted in Figures 4A, B. Despite the huge number of additional parameters, the convergence behavior of HeNHeS is similar to that of HoNHoS. It must also be noted that, once converged, the standard deviation of the accuracies for HeNHeS is much smaller than that of HoNHoS, indicating a much more stable model.

Sparse connections
S RR is generated using a probability dependent on the Euclidean distance between the two neurons, as described by Equation (2), where λ controls the density of the connection, and C is a constant depending on the type of the synapses (Zhou et al., 2020).
We performed various simulations using a range of values for the connection parameter λ and the synaptic weight scale W_scale. Increasing λ increases the number of synapses, while W_scale determines the mean synaptic strength; a greater W_scale also produces a larger weight variance. For a single input video, the number of active neurons was calculated and plotted against the values of W_scale and λ. Active neurons are those that fire at least one spike over the entire test data set. The results for HoNHoS and HeNHeS are shown in Figures 5A, B, respectively. Each plot in the figure is obtained by interpolating 81 points, and each point is calculated by averaging the results from five HRSNNs randomly initialized with the parameters specified by that point. The horizontal axis, showing the increase in λ, is plotted on a linear scale, while the vertical axis, showing the variation in W_scale, is on a log scale. The figure shows the neurons that responded to the inputs and reflects the network's activity level. W_scale is a factor that enhances signal transmission within R. As discussed by Markram et al. (1997), the synaptic response generated by any action potential (AP) in a train is given as EPSP_n = W_scale × ρ_n × u_n, where ρ_n is the fraction of the synaptic efficacy for the n-th AP and u_n is its utilization of synaptic efficacy.
Hence, it is expected that when W_scale is large, more neurons will fire. As λ increases, more synaptic connections are created, which opens more communication channels between the different neurons. As the number of synapses increases, the rank of the final state matrix used to calculate the separation property also increases. The rank reaches an optimum at intermediate synapse density: the number of synapses grows steadily as λ increases, but a larger number of connections creates more dependencies between neurons and decreases the effective separation rank when the number of connections becomes too large. The results for the variation of the effective ranks with λ and W_scale are shown in the Supplementary material. We compare the model's change in performance with varying sparsity levels in the connections, plotted in Figures 6A, B for the HoNHoS and HeNHeS models. From the figures, we see that for larger values of λ, the performance of both RSNNs was suboptimal and could not be improved by tuning W_scale. For a small number of synapses, a larger W_scale was required to obtain satisfactory performance for HoNHoS than for HeNHeS; hence, a large variance of the weights leads to better performance. We thus see that the best testing accuracy for HeNHeS is achieved with fewer synapses than for HoNHoS. This also explains why the highest testing accuracy of the heterogeneous network (Figure 6B) is better than that of the homogeneous network (Figure 6A): the red region in Figure 6B corresponds to higher W_scale values, and thus larger weight variance, than in Figure 6A.

Limited training data
In this section, we compare the performance of HeNHeS to HoNHoS- and HeNB-based spiking recurrent neural networks trained with limited training data. The evaluations performed on the KTH dataset are shown in Figure 7 as a stacked bar graph for increasing training data sizes. The figure shows that, using 10% of the training data, HeNHeS models outperform both HoNHoS and HeNB in all cases. The difference between HeNHeS and HeNB increases as the number of neurons in the recurrent layer N_R decreases. We also see that adding heterogeneity improves the performance of the homogeneous models. Even with 2,000 neurons, HeNHeS trained with 10% of the training data exhibits performance similar to HeNB trained with 25% of the training data. Note that for the performance evaluations with 10% training data, training was repeated until each model converged.

Comparison with prior work
In this section, we compare our proposed HRSNN model with other baseline architectures. We divide this comparison into two parts, as discussed below:
• DNN-based models: We compare the performance and model complexities of current state-of-the-art DNN-based models (Carreira and Zisserman, 2017; Wang Q. et al., 2019; Bi et al., 2020; Lee et al., 2021; Wang et al., 2021) with our proposed HRSNN models.
• Backpropagation-based SNN models: We compare the performance of backpropagation-based SNN models with HoNB- and HeNB-based RSNN models. We observe that backpropagated HRSNN models (HeNB) can achieve performance similar to DNN models but with much lower model complexity (measured by the number of parameters).
1. State-of-the-art BP homogeneous SNNs: We compare the performance of current state-of-the-art backpropagation-based SNN models (Panda and Srinivasa, 2018; Zheng et al., 2020; Liu et al., 2021; Shen et al., 2021).
2. State-of-the-art BP heterogeneous SNNs: We compare the performances and model complexities of the current state-of-the-art SNN models that use neuronal heterogeneity (Fang et al., 2021; Perez-Nieves et al., 2021; She et al., 2021a).
3. Proposed heterogeneous backpropagation models: We introduce two new backpropagation-based RSNN models: Homogeneous Neurons with Backpropagation (HoNB) and Heterogeneous Neurons with Backpropagation (HeNB). We use our novel Bayesian Optimization to search for the parameters of both models.
• Unsupervised SNN Models: We also compare the results for some state-of-the-art unsupervised SNN models with our proposed HRSNN models.

FIGURE 6
The variation in performance of the action recognition classification task with network sparsity and weight variance for (A) HoNHoS and (B) HeNHeS. The plot is obtained by interpolating 81 points, and each point is calculated by averaging the results from five randomly initialized HRSNNs.

FIGURE 7
Bar graph showing the difference in performance for the different models with increasing training data for the KTH dataset. A similar trend can be observed for the DVS dataset (shown in Supplementary material).

1. Homogeneous SNN models: We compare the performances of some state-of-the-art unsupervised SNN models that use homogeneous neuronal parameters (Meng et al., 2011; Zhou et al., 2020; Ivanov and Michmizos, 2021).
2. HRSNN models: We compare the above models with our proposed HRSNN models, which use heterogeneity in both neuronal and synaptic parameters, in terms of both performance and model complexity.
We also compare the average neuronal activation of the homogeneous and heterogeneous recurrent SNNs for the same image input, for a recurrent spiking network with 2,000 neurons. If we denote the activation of neuron i at time t by ν_i(t), the average neuronal activation ν̄ over N neurons and T timesteps is defined as

ν̄ = (1/(N·T)) Σ_{i=1}^{N} Σ_{t=1}^{T} ν_i(t).

The results obtained are shown in Table 2. The table shows that the heterogeneous HRSNN model has a much lower average neuronal activation than the homogeneous RSNN and the other unsupervised SNN models. Thus, we conclude that HeNHeS induces sparse activation and sparse coding of information.
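Our reading of this definition (the normalization was truncated in the text; we average over all N neurons and T timesteps) can be written as:

```python
import numpy as np

def average_activation(nu):
    """Average neuronal activation nu_bar = (1/(N*T)) * sum_i sum_t nu_i(t),
    where `nu` is an (N, T) array of per-neuron, per-timestep activations.
    The joint normalization over neurons and timesteps is our assumption."""
    nu = np.asarray(nu, float)
    return nu.sum() / nu.size
```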
Again, comparing state-of-the-art unsupervised learning models for action recognition with our proposed HRSNN models, we see that using heterogeneity in unsupervised learning models can substantially improve performance while yielding much lower model complexity.

Conclusions
We develop a novel method using recurrent SNNs to classify spatio-temporal signals for action recognition tasks using biologically plausible unsupervised STDP learning. We show how heterogeneity in the neuron parameters and in the LTP/LTD dynamics of the STDP learning process can improve performance, and we empirically demonstrate its impact on developing smaller models with sparse connections trained with less training data. It is well established in neuroscience that heterogeneity (De Kloet and Reul, 1987; Shamir and Sompolinsky, 2006; Petitpré et al., 2018) is an intrinsic property of the human brain. Our analysis shows that incorporating such concepts is beneficial for designing high-performance HRSNNs for classifying complex spatio-temporal datasets for action recognition tasks.

Data availability statement
The original contributions presented in the study are included in the article/Supplementary material; further inquiries can be directed to the corresponding author.