STSC-SNN: Spatio-Temporal Synaptic Connection with temporal convolution and attention for spiking neural networks

Spiking neural networks (SNNs), as one of the algorithmic models in neuromorphic computing, have gained a great deal of research attention owing to their temporal information processing capability, low power consumption, and high biological plausibility. Their potential to efficiently extract spatio-temporal features makes them suitable for processing event streams. However, existing synaptic structures in SNNs are almost exclusively full connections or spatial 2D convolutions, neither of which can adequately extract temporal dependencies. In this work, we take inspiration from biological synapses and propose a Spatio-Temporal Synaptic Connection SNN (STSC-SNN) model to enhance the spatio-temporal receptive fields of synaptic connections, thereby establishing temporal dependencies across layers. Specifically, we incorporate temporal convolution and attention mechanisms to implement synaptic filtering and gating functions. We show that endowing synaptic models with temporal dependencies can improve the performance of SNNs on classification tasks. In addition, we investigate how varied spatio-temporal receptive fields affect performance and reevaluate the temporal modules in SNNs. Our approach is tested on neuromorphic datasets, including DVS128 Gesture (gesture recognition), N-MNIST, CIFAR10-DVS (image classification), and SHD (speech digit recognition). The results show that the proposed model outperforms the state-of-the-art accuracy on nearly all datasets.

KEYWORDS: spiking neural network (SNN), Spatio-Temporal Synaptic Connection (STSC), spike response filter (SRF), Feedforward Lateral Inhibition (FLI), attention mechanism, neuromorphic recognition, backpropagation (BP)

Introduction

Spiking neural networks (SNNs) are regarded as the third generation of neural networks (Maass, 1997), with the purpose of addressing the fundamental mysteries of intelligence and the brain by emulating biological neurons and incorporating more biological mechanisms.
The two fundamental components of SNNs are spiking neurons and synapses, which create a hierarchical structure (layers) and subsequently construct a network. SNNs have attracted a great deal of academic interest in recent years due to their promising properties, such as the ability to process temporal information (Petro et al., 2019), low power consumption, and biological interpretability (Gerstner et al., 2014). Currently, SNNs are capable of processing event stream data with low latency and low power (Pei et al., 2019; Gallego et al., 2020). However, there is still a performance gap between SNNs and traditional Artificial Neural Networks (ANNs). Recent SNN training techniques based on surrogate gradients and backpropagation have significantly enhanced the performance of SNNs (Wu et al., 2018; Fang et al., 2021c), while also promoting the further integration of ANNs' modules into SNNs (Yao et al., 2021; Zheng et al., 2021), greatly accelerating the development of SNNs. However, it remains challenging to connect these computational techniques with the biological properties of SNNs.
Due to the time-dependent correlation of neuron dynamics, it is believed that SNNs naturally process information in both temporal and spatial dimensions (Petro et al., 2019; Roy et al., 2019). Further research is necessary to harness the spatio-temporal information processing capabilities of SNNs. In several studies, combining ANNs' modules has significantly increased the performance of SNNs. In terms of spatial information processing, CSNN (Xu et al., 2018) was the first to validate the application of convolution structures to SNNs, followed by the proposal of NeuNorm to improve SNNs' usage of convolution through auxiliary neurons. In the time dimension, Zheng et al. (2021) implement the time-dependent batch normalization (tdBN) module to tackle the issues of gradient vanishing and threshold balancing, and Yao et al. (2021) use the Squeeze-and-Excitation (SE) block (Hu et al., 2018) to realize the attention distribution over the temporal dimension in order to improve temporal feature extraction. Notably, Zhu et al. (2022) propose Temporal-Channel Joint Attention (TCJA) to concurrently process input in both temporal and spatial dimensions, which is a significant effort toward spatio-temporal feature extraction in SNNs. These studies effectively improve the performance of SNNs by transplanting established ANNs' modules and methodologies. However, applying these computational modules to SNNs from the standpoint of deep learning dilutes the fundamental biological interpretability, bringing SNNs closer to a mix of existing concepts in machine learning, such as recurrent neural networks (RNNs), binary neural networks (BNNs), and quantization networks.
From a biological standpoint, some works focus on synapse models, investigating the potential of SNNs with respect to connection modes and information transmission. Shrestha and Orchard (2018), Fang et al. (2020a), and Yu et al. (2022) integrate impulse response models with synaptic dynamics, hence enhancing the temporal information representation of SNNs; Cheng et al. (2020) implement intra-layer lateral inhibitory connections to improve the noise tolerance of SNNs; from the standpoint of synaptic plasticity, Zhang and Li (2019) and Bellec et al. (2020) introduce bio-plausible training algorithms as an alternative to back-propagation (BP), allowing for lower-power training. Experiments have revealed that the synaptic models of SNNs leave considerable room for modification and refinement in order to handle spatio-temporal data better (Fang et al., 2020a). For this reason, we propose a Spatio-Temporal Synaptic Connection (STSC) module.
Based on the notion of spatio-temporal receptive fields, the structural features of dendritic branches (Letellier et al., 2019) and feedforward lateral inhibition (Luo, 2021) motivate this study. By merging the ANNs' computation modules (temporal convolutions and attention mechanisms) with SNNs, we propose the STSC module, consisting of a Temporal Response Filter (TRF) module and a Feedforward Lateral Inhibition (FLI) module. As shown in Figure 1, the STSC can be attached to spatial operations to expand the spatio-temporal receptive fields of synaptic connections, hence facilitating the extraction of spatio-temporal features.
The main contributions of this work are summarized as follows:
• We propose STSC-SNN to implement synaptic connections with extra temporal dependencies and enhance the SNNs' capacity to handle temporal information. To the best of our knowledge, this study is the first to propose the idea of synaptic connections with spatio-temporal receptive fields in SNNs and to investigate the influence of synaptic temporal dependencies in SNNs.
• Inspired by biological synapses, we propose two plug-and-play blocks: Temporal Response Filter (TRF) and Feedforward Lateral Inhibition (FLI), which perform temporal convolution and attention operations and can be easily integrated into deep learning frameworks for performance improvements.
• On the neuromorphic datasets DVS128 Gesture, SHD, N-MNIST, and CIFAR10-DVS, we obtain competitive results. Specifically, we achieve 92.36% accuracy on SHD with a simple fully-connected structure, which is a clear improvement over the 91.08% obtained with a recurrent structure and is comparable to the performance of ANNs.
Methods and materials

The network components and training methods in spiking neural networks

Learning algorithms for SNNs

In recent years, many works have explored the learning algorithms of SNNs, which can be generally categorized as biologically inspired approaches (Diehl and Cook, 2015; Zhang and Li, 2019; Bellec et al., 2020), ANN-to-SNN conversion methods (Orchard et al., 2015; Sengupta et al., 2019; Han et al., 2020), and surrogate-based direct training methods (Wu et al., 2018; Neftci et al., 2019; Fang et al., 2021c). Bio-inspired algorithms are represented by spike-timing-dependent plasticity (STDP; Diehl and Cook, 2015; Vaila et al., 2020), which performs local learning through the learning rules of synaptic plasticity and is often used in unsupervised learning. Some biologically inspired algorithms that integrate local and global learning also provide more efficient training methods, such as ST-RSBP (Zhang and Li, 2019) and e-prop (Bellec et al., 2020). These methods further enhance the low power consumption and biological interpretability of SNNs, although they are not yet capable of producing competitive outcomes in refined tasks like pattern recognition. The ANN-to-SNN conversion strategies aim to convert trained ANNs into SNNs and then exploit the low power consumption of SNNs to improve inference (Orchard et al., 2015; Sengupta et al., 2019; Han et al., 2020; Zhang et al., 2021b). The main advantage of conversion methods is that they can directly handle the larger-scale and higher-performance tasks inherited from ANNs. However, they disregard the temporal dynamics and temporal interdependence of spiking neurons and always need hundreds of time steps to approximate the performance of trained ANNs; consequently, hybrid training methods that further optimize converted SNNs were developed (Rathi et al., 2020; Rathi and Roy, 2021; Zhang et al., 2021a).
Direct training methods utilize surrogate gradients to tackle the issue of non-differentiable spike activity (Wu et al., 2018), allowing error back-propagation (BP) through time to apply gradient descent directly to SNNs for training. Those BP-based methods show strong potential to achieve high accuracy in a few timesteps by making full use of spatio-temporal information (Fang et al., 2021c). However, more research is required to determine how to better extract spatio-temporal features for enhanced processing of spatio-temporal data; this is what we want to contribute.

Attention modules in SNNs
The attention mechanism distributes attention preferentially to the most informative input components, which could be interpreted as the sensitivity to various inputs. The SE block (Hu et al., 2018) offers an efficient attention approach to improve representations in ANNs. Xie et al. (2016) and Kundu et al. (2021) introduced spatial-wise attention in SNNs; then, TA-SNN (Yao et al., 2021) developed a temporal-wise attention mechanism in SNNs by assigning attention factors to each input frame; more recently, TCJA (Zhu et al., 2022) added a channel-wise attention module and proposed temporal-channel joint attention. These studies demonstrate the usefulness of attention mechanisms in SNNs by achieving state-of-the-art results on various datasets. Moreover, based on these investigations, it is desirable to study other correlations between the attention mechanism and the biological nature of SNNs, which is the objective of our research. We employ the attention module as a feedforward lateral inhibitory connection (Luo, 2021), which provides a gating mechanism for the synapse model and enables non-linear computation by the synapse.

Synaptic models in SNNs
As one of the fundamental components of SNNs, the synaptic model has drawn the interest of several researchers. Shrestha and Orchard (2018), Fang et al. (2020a), and Yu et al. (2022) established temporal relationships between response post-synaptic currents and input pre-synaptic spikes, thereby improving temporal expressiveness. Those temporal relationships are an extension of fully-connected synapses, which are based on the assumption that there is only one connection between two neurons. Nevertheless, synaptic connections are often complex, and there are typically many paths connecting the axons and dendrites of neurons (Letellier et al., 2019; Luo, 2021). We apply temporal convolution to describe the more sophisticated impulse response model and generate time-dependent post-synaptic currents, taking into consideration biological features and computational simplicity.

Frame-based representation
An event stream has both a spatial and a temporal dimension, with the spatial dimension expandable to higher dimensions depending on the data type. The spatial dimension of event streams based on sound data is typically one-dimensional, corresponding to different frequency channels, whereas the spatial dimension of event streams based on image data is typically three-dimensional, consisting of coordinates representing spatial positions and polarities representing brightness changes. The binary spike pattern is represented by the tensor $E \in \mathbb{B}^{T' \times S}$, where $T'$ represents the original resolution in the temporal dimension and $S$ represents the resolution in the spatial dimension. For a frame with a time span of $\Delta t$, the events in the time interval $t' \in [(t-1) \times \Delta t,\ t \times \Delta t)$ can be mapped to the network input $X^0$ at time $t$ by

$$X^0_t = q\left(E_{t'}\right), \quad t' \in [(t-1) \times \Delta t,\ t \times \Delta t),$$

where $t \in \{1, 2, ..., T\}$ indexes the timesteps, and the aggregation function $q(\cdot)$ could be chosen as non-polarity aggregation (Massa et al., 2020), accumulate aggregation, AND aggregation (He et al., 2020), etc. Here, we choose to accumulate all events inside a frame.
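As an illustration of the accumulate aggregation described above, the following minimal sketch bins a list of events into frames. The event layout `(time_bin, location)`, the function name `accumulate_frames`, and the 0-based frame index are assumptions made for this example; they are not taken from the original implementation.

```python
from typing import List, Tuple

# Hypothetical event layout: (time bin index t' in [0, T'), flattened spatial index in [0, S)).
Event = Tuple[int, int]


def accumulate_frames(events: List[Event], num_bins: int, num_locations: int,
                      num_frames: int) -> List[List[int]]:
    """Accumulate an event stream with T' = num_bins time bins into T = num_frames frames."""
    span = num_bins // num_frames  # frame time span (Delta t), in units of time bins
    frames = [[0] * num_locations for _ in range(num_frames)]
    for time_bin, location in events:
        # 0-based frame index; leftover bins (if T' is not divisible by T) go to the last frame.
        frame_index = min(time_bin // span, num_frames - 1)
        frames[frame_index][location] += 1  # accumulate aggregation: count events per location
    return frames


if __name__ == "__main__":
    demo_events = [(0, 1), (1, 1), (4, 0), (7, 2)]
    print(accumulate_frames(demo_events, num_bins=8, num_locations=3, num_frames=2))
```

With accumulation, a frame entry can exceed 1 when several events fall on the same location within one frame, so the network input is an event count rather than a binary value.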

Spiking neurons in SNNs
The Leaky-Integrate-and-Fire (LIF) model was introduced as an extremely simplified model of biological neurons (Dayan and Abbott, 2005), which has the essential qualities of potential integration, leakage, and spike firing. The LIF model is used extensively in SNNs and neuromorphic engineering because of its ability to reproduce essential neural functions at a minimal computational cost. The LIF model is defined in differential form as

$$\tau \frac{dv(t)}{dt} = -v(t) + I(t),$$

where $v(t)$ is the membrane potential of the neuron at time $t$, $I(t)$ is the integrated current input from the pre-synaptic neuron at time $t$, and $\tau$ is the time constant that governs the pace of potential change. Solving the differential equation directly incurs additional costs. STBP (Wu et al., 2018) employs a simplified iterative representation and implements the LIF model on the Pytorch framework (Paszke et al., 2019), which supports the integration of SNNs and standard ANNs' modules and significantly speeds up the construction of BP-based SNNs and training techniques. The explicit iterative LIF is expressed as

$$\begin{cases} V^l_t = \left(1 - \frac{1}{\tau}\right) V^l_{t-1} \left(1 - S^l_{t-1}\right) + \frac{1}{\tau} I^l_t, \\ S^l_t = \Theta\left(V^l_t - V_{th}\right), \end{cases}$$

where $l$ and $t$ are indices of layer and time, $\tau$ is the time constant, $V$ is the membrane potential, $V_{th}$ is the threshold constant, $S$ is the binary tensor of spikes, $I$ is the input from the preceding layer, and $\Theta(\cdot)$ is the Heaviside step function that satisfies $\Theta(x) = 1$ when $x \geq 0$, otherwise $\Theta(x) = 0$. Because the firing process $\Theta(\cdot)$ is not differentiable, surrogate methods are often utilized in SNNs' direct training to achieve error propagation by creating various pseudo-derivatives for $\Theta(\cdot)$ (Neftci et al., 2019). Experiments conducted by STBP (Wu et al., 2018) demonstrate that the performance of each surrogate gradient is comparable. This work leverages arc tangent (ATan) as the pseudo-derivative of $\Theta(\cdot)$, which is well-supported in the SpikingJelly framework (Fang et al., 2020b).
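To make the iterative update concrete, here is a minimal single-neuron sketch in plain Python. It assumes an Euler-style discretization of the differential form with a hard reset to zero after a spike, and an ATan surrogate of the form g(x) = arctan((π/2)·α·x)/π + 1/2; the function names, the default values τ = 2 and α = 2, and the reset value are assumptions for illustration only.

```python
import math
from typing import List, Tuple


def atan_surrogate_grad(x: float, alpha: float = 2.0) -> float:
    """Pseudo-derivative used in place of Theta'(x): derivative of
    g(x) = arctan(pi / 2 * alpha * x) / pi + 1 / 2 (alpha is an assumed sharpness)."""
    return (alpha / 2.0) / (1.0 + (math.pi / 2.0 * alpha * x) ** 2)


def lif_forward(inputs: List[float], tau: float = 2.0,
                v_th: float = 1.0) -> Tuple[List[int], List[float]]:
    """Run one LIF neuron over a sequence of input currents I_t.

    Returns the spike train and the membrane potential before reset at each step.
    """
    v = 0.0
    spikes: List[int] = []
    potentials: List[float] = []
    for i_t in inputs:
        v = (1.0 - 1.0 / tau) * v + i_t / tau  # leak the old potential, integrate the input
        s = 1 if v - v_th >= 0.0 else 0        # Heaviside firing
        potentials.append(v)
        spikes.append(s)
        v = v * (1 - s)                         # hard reset to 0 after a spike (assumption)
    return spikes, potentials


if __name__ == "__main__":
    print(lif_forward([1.5, 1.5, 1.5, 0.0]))
    print(atan_surrogate_grad(0.0))
```

In the forward pass the spike is the exact step function; the surrogate is only substituted for its derivative during back-propagation, which is why the two are written as separate functions here.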

Spatio-temporal receptive fields in SNNs
The receptive field is often used to understand convolution operations. In convolution, the receptive field describes the range of nearby input used to determine an output element, i.e., how many neighboring spatial positions it can perceive. For static pictures, the receptive field explains the projection range of the convolution operations and aids in the comprehension of the spatial feature extraction procedure. Similarly, the concept of receptive fields can be applied to event streams (or dynamic images) with an additional temporal dimension. This work leverages the concept of spatio-temporal receptive fields to aid comprehension of SNNs' spatio-temporal feature extraction procedure. As shown in Figure 1A, typical synaptic connections employ 2D convolution, pooling, full connections, and other inter-layer computations to process information in the spatial dimension, which we refer to as spatial operations, and their receptive fields are restricted to the spatial dimension. To strengthen the spatio-temporal information processing capabilities of SNNs, it is essential to expand the receptive fields of these spatial operations into the temporal dimension.

FIGURE
The standard layer inserted with the STSC module and its unfolded formulation. Note that all parameters are shared at all timesteps. STSC modules are set before spatial operations to process the latest temporal information.

Spatio-Temporal Synaptic Connection for SNNs
In general, the processing of temporal information in SNNs is attributed to spiking neurons since their dynamic model has a natural dependence on the temporal dimension; however, the level of this dependence is primarily reliant on the degree of neural complexity, and LIF neurons only support very weak temporal linkages. A great deal of the processing of latent temporal characteristics occurs not just in neurons but also in biological synapses (Letellier et al., 2019; Luo, 2021). This work focuses on using temporal dimension operations in SNNs to broaden the spatio-temporal receptive fields of synapses and thereby enhance the spatio-temporal feature extraction of SNNs. Temporal operations are calculations connected to the time dimension, and they are contained in a pluggable module, referred to as the Spatio-Temporal Synaptic Connection (STSC). The STSC module is designed to be placed before spatial operations in order to aggregate temporal information and enlarge the spatio-temporal receptive fields while maintaining the original spatial operations (see Figure 2). The STSC module consists of two modules: Temporal Response Filter and Feedforward Lateral Inhibition, which carry the filtering and gating mechanisms of the synaptic model, respectively (see Figure 3). The two modules receive $X$ as the input tensor and conduct the operations $C = f(X)$ and $D = g(X)$, followed by an element-wise product to produce the output $Y = C \odot D$ (see Figure 3A). As shown in Figure 4, our starting point is to reconstruct the synaptic connection of the neural circuit in order to get a simplified computing model comprised of TRF and FLI modules while preserving the temporal operations of synaptic computing to the greatest extent. Consequently, our synaptic model provides a more faithful imitation of biological neural networks with biological interpretability; it also enhances the model's capabilities for spatio-temporal feature extraction by incorporating additional temporal computations.
Figure 5 provides an illustration of the model's dataflow.

Temporal response filter
The synapses in biological neural networks are intricate. The complexity of synapses is reflected not only in the non-topological spatial relationship of synapses (how neurons connect with one another) but also in the complicated temporal dependency of spike transmission (how neurons communicate with one another; Letellier et al., 2019). The axon terminals and dendrites of a pair of connected neurons are typically connected by multiple pathways, as shown in Figure 4A. Since the spike responses (including time delays and kernel shapes) of the various pathways differ from one another, more complicated temporal dynamic dependencies can be shaped among them (Luo, 2021). This work assumes that the diverse spike responses of the various pathways can be combined into a single, adaptable response (Figure 4A). We therefore propose the Temporal Response Filter (TRF) to establish the integrated linear response of spikes over time by employing convolution in the time dimension, in order to expand the temporal receptive field in the most direct way. TRF offers a filtering path for STSC with temporal convolutions (see Figure 3). Figure 3B depicts the specific implementation of TRF. In detail, for the 2D spatio-temporal tensor in the fully-connected structure, it performs temporal depth-wise 1D convolution independently on each spatial channel and generates an output tensor of the same size. For the 4D spatio-temporal tensor in the convolutional structure, temporal depth-wise 3D convolution is performed on each channel with a kernel size of $K_F \times 1 \times 1$, which ensures that all spatial elements inside a channel have the same temporal response. Mathematically, the filter operation is denoted as $f(\cdot)$, and it performs $C = f(X)$ with input $X$ and output $C$ having the same size, $\mathbb{R}^{T \times N}$ or $\mathbb{R}^{T \times C \times H \times W}$.
Depending on the spatial dimension (1D or 3D) of the input $X$, the following calculation formulae apply. For the 1D case,

$$C_{t,n} = \sum_{k=0}^{K_F-1} W^{F}_{n,k}\, X_{t+k-\frac{K_F-1}{2},\,n},$$

and for the 3D case,

$$C_{t,c,h,w} = \sum_{k=0}^{K_F-1} W^{F}_{c,k}\, X_{t+k-\frac{K_F-1}{2},\,c,h,w},$$

where $n$, $c$, $h$, and $w$ are spatial location indices, $t$ is a time index, $W^{F}$ denotes the temporal kernel of each channel, and entries of $X$ outside the time range are taken as zero (padding). $K_F$ denotes the kernel size of the temporal convolution, which is equal to the temporal receptive field of TRF. The padding of the convolution is set to $\frac{K_F-1}{2}$ to maintain the same size.

FIGURE
Modeling TRF and FLI modules from biological circuits into computational models. (A) The motivation of TRF: integrating multi-compartment axon-dendrite connections with different spike responses into a single linear response filter. (B) The function of FLI: transmitting the spike activities of pre-synaptic neurons into inhibitory neurons to obtain gating factors.
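The depth-wise temporal convolution performed by TRF in the 1D case can be sketched in plain Python as follows. This is only an illustrative sketch: the function name `trf_1d`, the weight layout `weights[n][k]`, zero padding, and the cross-correlation indexing convention (as used by common deep learning frameworks) are assumptions, and a practical implementation would use a framework's grouped convolution instead of explicit loops.

```python
from typing import List

Matrix = List[List[float]]


def trf_1d(x: Matrix, weights: Matrix) -> Matrix:
    """Depth-wise temporal convolution.

    x:       T x N input (time steps by spatial channels)
    weights: N x K kernel, one temporal kernel of odd size K per channel (assumed layout)
    returns: T x N output, same size as x thanks to padding (K - 1) / 2
    """
    num_steps, num_channels = len(x), len(x[0])
    kernel_size = len(weights[0])
    pad = (kernel_size - 1) // 2
    out = [[0.0] * num_channels for _ in range(num_steps)]
    for t in range(num_steps):
        for n in range(num_channels):
            acc = 0.0
            for k in range(kernel_size):
                src = t + k - pad
                if 0 <= src < num_steps:  # positions outside the sequence act as zero padding
                    acc += weights[n][k] * x[src][n]
            out[t][n] = acc
    return out


if __name__ == "__main__":
    spikes = [[1, 0], [0, 1], [1, 1], [0, 0]]
    kernels = [[0.0, 1.0, 0.0],   # channel 0: identity response
               [1.0, 1.0, 1.0]]   # channel 1: sums a window of three time steps
    print(trf_1d(spikes, kernels))
```

Each channel is filtered only along time and never mixed with other channels, so the temporal receptive field equals the kernel size while the spatial receptive field is left to the subsequent spatial operation.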

Feedforward lateral inhibition
Feedforward lateral inhibition mechanisms exist in biological neural networks (Luo, 2021); they construct a lateral route to suppress feedforward input. To further boost synaptic expression, we emulate the feedforward lateral inhibition mechanism (see Figure 4B). We devised a computational strategy that replicates the function of inhibitory neurons by exploiting the near-time spike patterns of pre-synaptic neurons to determine the inhibition coefficients, also known as gating factors, which are utilized to regulate the transmission intensity of signals. While the TRF module based on temporal convolution is a straightforward linear computation, the FLI module incorporates non-linear components, resulting in a non-linear relationship between temporal dependencies that improves spatio-temporal feature extraction. We notice that the function of this structure closely resembles that of the attention module; therefore, we refer to the attention blocks (Hu et al., 2018; Yao et al., 2021; Zhu et al., 2022) and propose the FLI module to replicate the gating mechanism in synaptic connections. The module details are shown in Figure 3C. For the 2D spatio-temporal tensor in the fully-connected structure, temporal-wise 1D convolution is utilized first to extract temporal features, followed by a linear combination passed through a sigmoid to acquire the gating coefficients (see Figure 3C). For the 4D spatio-temporal tensor in the convolutional structure, spatial-wise average pooling is first conducted to obtain the channel-wise spatial sparsity of spikes; then, the 1D case FLI is performed. Finally, channel-wise gating factors are computed and transmitted to each channel's spatial locations. Mathematically, gating is denoted as $g(\cdot)$, $X$ is the input tensor of size $\mathbb{R}^{T \times N}$ or $\mathbb{R}^{T \times C \times H \times W}$, and $D = g(X)$ is the output gating factors, with values in the range (0, 1) and the same shape as $X$.
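Depending on the spatial dimension (1D or 3D) of the input $X$, the following calculation equations apply. For the 1D case:

$$S_{t,m} = \sum_{n=1}^{N} \sum_{k=0}^{K_G-1} W^{G}_{m,n,k}\, X_{t+k-\frac{K_G-1}{2},\,n},$$

$$D_{t,n} = \sigma\left(\sum_{m=1}^{M} W^{L}_{n,m}\, S_{t,m}\right).$$

For the 3D case:

$$\bar{X}_{t,c} = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} X_{t,c,h,w}, \tag{9}$$

$$S_{t,m} = \sum_{c=1}^{C} \sum_{k=0}^{K_G-1} W^{G}_{m,c,k}\, \bar{X}_{t+k-\frac{K_G-1}{2},\,c}, \tag{10}$$

$$D_{t,c,h,w} = \sigma\left(\sum_{m=1}^{M} W^{L}_{c,m}\, S_{t,m}\right), \tag{11}$$

where $n$, $c$, $h$, and $w$ are spatial location indices, $t$ is a time index, and $m$ is the index of the intermediate feature tensor $S$ with spatial dimension $M$. Here $\sigma(\cdot)$ is the sigmoid function, $W^{G}$ and $W^{L}$ denote the weights of the temporal convolution and the linear combination, respectively, and $\bar{X}$ is the spatially averaged input. $M$ is determined by the spatial size $N$ with reduction ratio $r$, as $M = \frac{N}{r}$. $K_G$ denotes the kernel size of the convolution, which is equivalent to the receptive field of FLI. The padding of the convolution is set to $\frac{K_G-1}{2}$ to maintain the same size. Notably, in order to reduce computational cost in the 3D case, we employ the same gating factor for all $H \times W$ neurons in each channel, feed the spatial sparsity of spikes into the FLI module (Equation 9), and broadcast the output to all neurons inside the channel (Equation 11).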
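The 1D gating path and the final element-wise product of the STSC module can be sketched as below. This is a hedged illustration rather than the reference implementation: the names `fli_1d` and `stsc_output`, the weight layouts `w_conv[m][n][k]` and `w_lin[n][m]`, the absence of bias terms, and zero padding are all assumptions made for the example.

```python
import math
from typing import List

Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fli_1d(x: Matrix, w_conv: List[Matrix], w_lin: Matrix) -> Matrix:
    """Gating factors for a T x N input.

    w_conv: M x N x K temporal convolution that reduces N channels to M features (K odd)
    w_lin:  N x M linear combination that maps the M features back to N channels
    returns: T x N gating factors, each strictly between 0 and 1
    """
    num_steps, num_channels = len(x), len(x[0])
    num_features, kernel_size = len(w_conv), len(w_conv[0][0])
    pad = (kernel_size - 1) // 2
    gates: Matrix = []
    for t in range(num_steps):
        # Temporal convolution over a window around t; out-of-range steps are skipped (zero padding).
        features = [
            sum(w_conv[m][n][k] * x[t + k - pad][n]
                for n in range(num_channels)
                for k in range(kernel_size)
                if 0 <= t + k - pad < num_steps)
            for m in range(num_features)
        ]
        # Linear combination followed by a sigmoid yields one gate per channel.
        gates.append([
            sigmoid(sum(w_lin[n][m] * features[m] for m in range(num_features)))
            for n in range(num_channels)
        ])
    return gates


def stsc_output(filtered: Matrix, gates: Matrix) -> Matrix:
    """Element-wise product Y = C * D of the filtering path and the gating path."""
    return [[c * d for c, d in zip(c_row, d_row)] for c_row, d_row in zip(filtered, gates)]


if __name__ == "__main__":
    x = [[1, 0], [0, 0], [1, 1]]
    w_conv = [[[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]]  # M = 1: counts spikes at the current step
    w_lin = [[1.0], [-1.0]]                         # opposite gating for the two channels
    d = fli_1d(x, w_conv, w_lin)
    print(stsc_output(x, d))
```

In this toy setting the input itself stands in for the filtered tensor C; in the full module C would come from the temporal response filter, and the gates computed from recent pre-synaptic activity scale it before the spatial operation.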

Training framework
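Denoting the number of simulation timesteps as $T$, the size of the output layer as $L_{out}$, and the number of classes as $C$, we utilize the voting strategy to decode the network output $O \in \mathbb{B}^{T \times L_{out}}$ with the constant voting matrix $M \in \mathbb{R}^{C \times L_{out}}$. The loss function is defined by the mean squared error (MSE), as

$$L = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{C} \sum_{i=1}^{C} \left( \sum_{j=1}^{L_{out}} M_{i,j}\, O_{t,j} - y_i \right)^2, \tag{12}$$

where $y$ is the one-hot target, with $y_l = 1$ for target class $l$, and $y_i = 0$ for $i \neq l$. The predicted label $l_p$ is then given by

$$l_p = \arg\max_i \frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{L_{out}} M_{i,j}\, O_{t,j}.$$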
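The voting-based decoding and MSE loss can be illustrated with a small plain-Python sketch. It assumes that the constant voting matrix averages equally sized groups of output neurons per class and that the loss and prediction are averaged over timesteps; the helper names (`average_voting_matrix`, `mse_loss`, `predict`) and the group size are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def average_voting_matrix(num_classes: int, group: int) -> Matrix:
    """Constant voting matrix (C x L_out) averaging `group` consecutive output neurons per class."""
    l_out = num_classes * group
    return [[1.0 / group if j // group == i else 0.0 for j in range(l_out)]
            for i in range(num_classes)]


def vote(o_t: List[int], voting: Matrix) -> List[float]:
    """Class scores at one timestep: voting matrix times the output spike vector."""
    return [sum(m_ij * o_j for m_ij, o_j in zip(row, o_t)) for row in voting]


def mse_loss(outputs: List[List[int]], voting: Matrix, target: List[float]) -> float:
    """Mean squared error between voted scores and the one-hot target, averaged over time."""
    num_classes = len(voting)
    total = 0.0
    for o_t in outputs:
        scores = vote(o_t, voting)
        total += sum((s - y) ** 2 for s, y in zip(scores, target)) / num_classes
    return total / len(outputs)


def predict(outputs: List[List[int]], voting: Matrix) -> int:
    """Predicted label: class with the largest time-averaged voted score."""
    num_classes = len(voting)
    mean_scores = [0.0] * num_classes
    for o_t in outputs:
        for i, s in enumerate(vote(o_t, voting)):
            mean_scores[i] += s / len(outputs)
    return max(range(num_classes), key=lambda i: mean_scores[i])


if __name__ == "__main__":
    voting = average_voting_matrix(num_classes=2, group=2)
    outputs = [[1, 1, 0, 0], [1, 0, 0, 1]]  # T = 2 timesteps, L_out = 4 output neurons
    print(mse_loss(outputs, voting, target=[1.0, 0.0]), predict(outputs, voting))
```

Because several output neurons vote for each class, a single missed or spurious spike changes the class score only fractionally, which is the robustness argument usually given for a voting layer.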

Datasets
We evaluate the classification performance of STSC-SNN on a variety of neuromorphic datasets, including DVS128 Gesture (Amir et al., 2017; gesture recognition), N-MNIST (Orchard et al., 2015), CIFAR10-DVS (Li et al., 2017; image classification), and SHD (Cramer et al., 2020; speech digit recognition), all of which are event datasets but are generated using different methods. DVS128 Gesture is a gesture recognition dataset that uses DVS cameras to record actual human gestures. The event-based image datasets, N-MNIST and CIFAR10-DVS, are converted from static datasets by using DVS cameras to scan each sample. Spiking Heidelberg Digits (SHD) is a spike-based speech dataset consisting of English and German spoken digits transformed from audio recordings using an artificial inner ear model.

Representation: Frames with accumulative aggregation
Learning algorithm: STBP (Wu et al., 2018) and BPTT
Surrogate gradient: ATan (Fang et al., 2021c)
Loss function: Voting and MSE
Frameworks: SpikingJelly and Pytorch

Settings

Table 1 summarizes the experimental details of the SNNs training process. We use the SpikingJelly (Fang et al., 2020b) and Pytorch (Paszke et al., 2019) frameworks to develop and evaluate SNNs. We utilize the Adam optimizer (Kingma and Ba, 2014) to accelerate the training process. Table 2 displays the respective hyper-parameters and Table 3 displays the network architectures for the different datasets. A sufficient number of epochs is used to ensure the models' steady convergence (Table 2). The network architecture settings vary across related works, and the network structures we use (Table 3) have been proven to perform quite well on each dataset [specifically, SHD refers to TA-SNN (Yao et al., 2021), N-MNIST refers to PLIF (Fang et al., 2021c), and CIFAR10-DVS and DVS128 Gesture refer to TCJA-SNN (Zhu et al., 2022)]. All Conv2d layers are set as kernel size = 3, stride = 1, and padding = 1, followed by batch normalization (BN) layers. The voting layers are implemented using average pooling for classification robustness (Fang et al., 2021c).

Table 4 shows the performance comparison of the proposed method (STSC-SNN with TRF and FLI) and other competing methods on the neuromorphic datasets N-MNIST, CIFAR10-DVS, DVS128 Gesture, and SHD. The experiments indicate that STSC-SNN can achieve similar or even better results than SOTA methods under the same conditions. These results demonstrate that our work is comparable to SOTA methods and show that our models inspired by biological synapses are practical and feasible. As shown in Table 4, we achieve the highest accuracy on all datasets except CIFAR10-DVS.
The SOTA results on CIFAR10-DVS are based on the work of TET, which proposes a new loss function to enable the model to converge on a flatter local minimum with generalizability; TCJA (Zhu et al., 2022) also demonstrates its efficacy on CIFAR10-DVS. To preserve the consistency of this work, we continue to utilize MSE (Equation 12) as the loss function, and outperform the comparable result. Notably, the experiments on SHD show that we have enhanced the vanilla SNN from 78.71 to 92.36% using STSC (with 65 epochs), which is the state-of-the-art result compared to the highest available result (91.08% by TA-SNN). Moreover, it is a significant improvement that nearly matches the best result achieved by ANNs on this dataset [92.4% by CNN (Cramer et al., 2020)]. The SHD dataset contains rich temporal information, which challenges the model's capacity to extract temporal features (Cramer et al., 2020); hence, considerable effort has gone into developing SNN models using recurrent structures (Cramer et al., 2020; Yin et al., 2020, 2021; Perez-Nieves et al., 2021). Based on the recurrent structure, TA-SNN employs temporal-wise attention and a particular LIF neuron [LIAF by Wu et al. (2021), which directly transmits membrane potential] to get an excellent result on SHD (91.08%), outperforming LSTM [89% by Cramer et al. (2020)] but falling short of the result (92.4%) produced by CNN processing (directly as 2D image input). In contrast, instead of the recurrent layers, we use a simple fully-connected network with two hidden layers and successfully obtain the SOTA result by adding the proposed STSC module. For the first time, our model obtained CNN-like performance on the SHD dataset, which represents a substantial effort to illustrate the SNNs' potential.

The bold values indicate the best outcomes for each dataset.

Control experiments and ablation study
To analyze the impact of each component on performance, we conduct control experiments on SHD. The SHD experiment is based on the fully-connected (FC) structure (see Table 3), with STSC modules placed in front of the FC layers. There are seven insertion strategies, designated P1, P2, P3, P12, P13, P23, and P123 (see Figure 6A). Figures 6B, C show the effect of varying receptive fields (RFs) when TRF and FLI are used individually. TRF reaches 85.38% at P13 and RF = 7, while FLI reaches 90.72% at P12 and RF = 11. Comparing the two modules reveals that the FLI module plays the major role in the performance improvement; thus, it is crucial to offer a gating mechanism that introduces non-linear expressions to FC layers. Figures 6B, C also demonstrate that when the FLI or TRF modules are positioned in the first layer (P1/P12/P13/P123), they have a greater impact on performance than when they are positioned only in the deeper layers (P2/P3/P23). This suggests that the extraction of temporal features is more advantageous in shallow layers. As shown in Figure 7, we evaluated the impact of varying STSC receptive fields on SHD performance. Notably, moderately enlarging the receptive field increases performance, whereas an overly broad receptive field reduces accuracy. We attribute the performance drop to the model's excessive expressive capacity, which overfits the training data. This phenomenon is analogous to the use of spatial 2D convolutions, in which the kernel size must be carefully determined. Figure 7 also indicates that combining the TRF and FLI modules improves performance, demonstrating their complementarity. Under the P1 strategy, setting TRF's RF = 5 and FLI's RF = 3 yields the best result of 92.36%, with just one STSC module added after the input layer.
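The seven insertion strategies are exactly the non-empty subsets of the three candidate positions in front of the FC layers. The following minimal sketch enumerates them and separates the strategies that include the first layer from those that do not; the function and variable names are illustrative assumptions and are not taken from the authors' code.

```python
from itertools import combinations

# Three candidate positions, one in front of each FC layer (as in the text).
POSITIONS = (1, 2, 3)


def insertion_strategies(positions=POSITIONS):
    """Return every non-empty subset of positions, keyed by a name like 'P13'."""
    strategies = {}
    for size in range(1, len(positions) + 1):
        for subset in combinations(positions, size):
            name = "P" + "".join(str(p) for p in subset)
            strategies[name] = subset
    return strategies


strategies = insertion_strategies()

# Strategies that place a module in front of the first FC layer ("shallow")
# versus strategies that only touch the later layers ("deep").
shallow = [name for name, subset in strategies.items() if 1 in subset]
deep = [name for name, subset in strategies.items() if 1 not in subset]

print(list(strategies))
print("shallow:", shallow)
print("deep:", deep)
```

The two groups produced here correspond to the P1/P12/P13/P123 and P2/P3/P23 groupings compared in the text.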

Analysis of temporal modules in SNNs
In vanilla SNNs, only neurons perform temporal operations; hence, their temporal feature extraction is predicated solely on the temporal dependence inside each neuron. To assess the influence of temporal modules, we conduct an ablation study with LIF neurons and STSC modules on the SHD dataset (see Figure 8), based on the same FC structure (see Table 3). Experiments comparing "FCs(ReLU)" and "SNN" demonstrate that replacing the activation function in the FC structure with LIF neurons clearly increases performance on the SHD classification task, indicating the LIF neuron's capacity to handle temporal information and capture temporal features. Moreover, the "FCs(ReLU) + STSC" and "FCs(Non) + STSC" structures obtained by adding the STSC module achieve higher performance than the vanilla "SNN" model, demonstrating that our STSC module has a stronger temporal feature extraction capacity than LIF; hence, exploiting temporal relationships within synaptic connections is valid and meaningful. Furthermore, integrating the STSC module and LIF together in the "SNN + STSC" model achieves the highest performance, showing that time-dependent interactions in both synapses and neurons can coexist and be coordinated to perform better temporal information processing.

Discussion
FIGURE
Influence of receptive fields on SHD. The different experimental results via RFs under four insertion strategies.

The incorporation of temporal operations inevitably increases the model's complexity, so the resulting trade-off is worth analyzing. Here, we explore the time and space complexity induced by the TRF and FLI modules in convolutional layers for 3D cases. Assuming the STSC modules are inserted prior to a spatial 2D convolution, the input and output tensors lie in $\mathbb{R}^{T \times C \times H \times W}$ and $\mathbb{R}^{T \times C_{out} \times H \times W}$, and the convolution kernel has size $K_c \times K_c$. Temporal convolution (Equation 6) needs just $O(K_F)$ time per element for a TRF module with a $K_F$ receptive field, and the total time complexity is $O(T \times C \times H \times W \times K_F)$. For the FLI module with a $K_G$ receptive field, each time slot requires a computational complexity of $O(C \times C_r \times K_G + C \times C_r) = O(C \times C_r \times (K_G + 1))$, and the overall time complexity is $O(T \times C \times C_r \times (K_G + 1))$. In contrast to the $O(T \times C \times C_{out} \times H \times W \times K_c \times K_c)$ time complexity required for spatial 2D convolution operations, the $O(T \times C \times H \times W \times K_F)$ and $O(T \times C \times C_r \times (K_G + 1))$ time complexities of TRF and FLI are acceptable. In addition, spatial 2D convolution needs $O(C \times C_{out} \times K_c \times K_c)$ parameters, whereas TRF requires merely $O(C \times K_F)$ parameters and FLI requires $O(C \times C_r \times (K_G + 1))$ parameters. In general, the space complexity of TRF is substantially less than that of FLI, and the additional parameters of TRF are negligible when compared to 2D convolution; the time complexity of TRF and FLI is relatively low in comparison to 2D convolution. Notably, both the TRF and FLI modules are based on sliding time windows, and the computations for distinct time frame inputs are identical; thus, there is potential parallelism in the time dimension, and hardware implementation and optimization are possible. Meanwhile, the computation of STSC-SNN depends on floating-point multiplication, which may reduce the energy efficiency of hardware that relies on the binary nature of spike transmission.
Nevertheless, there is good reason to believe that binary signals are not a strict constraint for the development of neuromorphic computing, as the carrier (electrical signal or neurotransmitter) used to transmit the spike signal in the biological synapse is not binary information representing merely the presence or absence of spike activity (Rothman, 2013); in fact, a substantial amount of research has moderately loosened the binary constraint (Shrestha and Orchard, 2018; Fang et al., 2020a; Wu et al., 2021; Yao et al., 2021; Yu et al., 2022; Zhu et al., 2022). We believe that with the development of neuromorphic chips, spiking neural networks based on analog circuits and in-memory computing will be capable of surpassing the binary constraint and reconciling the biological plausibility and computational complexity of synaptic operations (Fang et al., 2021a; Tao et al., 2021).

FIGURE
Ablation study of temporal modules in SNNs. The accuracy comparison of different models via training epochs on SHD. "FCs(Non)" denotes the FC structure without LIFs and activation functions, "FCs(ReLU)" denotes the FC structure with ReLU functions behind the first two FC layers, and "SNN" denotes the FC structure with LIFs behind all three FC layers. STSC modules are then added just behind the input in all three models for comparison.
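To make the complexity comparison in this section concrete, the following minimal sketch evaluates the operation and parameter counts for one layer. It is an illustration under stated assumptions rather than the authors' implementation: the function names and the example sizes are hypothetical, constant factors and biases are ignored, and the 2D-convolution count uses the standard stride-1, same-padding estimate of $T \times C \times C_{out} \times H \times W \times K_c \times K_c$ multiply-accumulates.

```python
def conv2d_cost(T, C, C_out, H, W, K_c):
    """Spatial 2D convolution applied to each of T frames (stride 1, same padding)."""
    macs = T * C * C_out * H * W * K_c * K_c
    params = C * C_out * K_c * K_c
    return macs, params


def trf_cost(T, C, H, W, K_F):
    """TRF: K_F multiply-accumulates per element, one K_F-tap temporal filter per channel."""
    macs = T * C * H * W * K_F
    params = C * K_F
    return macs, params


def fli_cost(T, C, C_r, K_G):
    """FLI: C*C_r*K_G + C*C_r operations per time slot, as stated in the text."""
    per_slot = C * C_r * K_G + C * C_r
    macs = T * per_slot
    params = C * C_r * (K_G + 1)
    return macs, params


# Hypothetical layer sizes, chosen only for illustration.
T, C, C_out, H, W = 16, 64, 128, 32, 32
K_c, K_F, K_G, C_r = 3, 3, 3, 16

conv_macs, conv_params = conv2d_cost(T, C, C_out, H, W, K_c)
trf_macs, trf_params = trf_cost(T, C, H, W, K_F)
fli_macs, fli_params = fli_cost(T, C, C_r, K_G)

for name, macs, params in [
    ("conv2d", conv_macs, conv_params),
    ("TRF", trf_macs, trf_params),
    ("FLI", fli_macs, fli_params),
]:
    print(f"{name:>6}: {macs:>13,} MACs ({macs / conv_macs:.4%} of conv2d), "
          f"{params:>7,} parameters")
```

For these assumed sizes, the TRF adds fewer parameters than the FLI, and both modules cost a small fraction of the operations of the 2D convolution that follows them, which is consistent with the qualitative conclusion above.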

Conclusion
In conclusion, this work proposes to endow synaptic structures with spatio-temporal receptive fields and additional temporal dependencies in an effort to enhance the temporal information processing capabilities of SNNs. Drawing on both computational models and biological realities, we propose the STSC module, which consists of TRF and FLI, implemented with temporal convolution and attention mechanisms. We verified the method's reliability on the neuromorphic datasets SHD, N-MNIST, CIFAR10-DVS, and DVS128 Gesture. Notably, the STSC module enables SNNs to reach the SOTA result (92.36%) on the SHD dataset, which is comparable to ANN methods (89% and 92.4%), validating the potential of SNNs in spatio-temporal data processing.

Data availability statement
The original contributions presented in the study are included in the article/supplementary material. The code is available at the following link: https://github.com/Tab-ct/STSC-SNN. Further inquiries can be directed to the corresponding author.

Author contributions
CY proposed the idea. CY and ZG designed and did the experiments. CY, ZG, and AW wrote the manuscript. DL, GW, AW, and EL revised the manuscript. AW directed the projects and provided overall guidance. All authors contributed to the article and approved the submitted version.