Optical flow estimation from event-based cameras and spiking neural networks

Event-based cameras are raising interest within the computer vision community. These sensors operate with asynchronous pixels, emitting events, or “spikes”, when the luminance change at a given pixel since the last event surpasses a certain threshold. Thanks to their inherent qualities, such as their low power consumption, low latency, and high dynamic range, they seem particularly tailored to applications with challenging temporal constraints and safety requirements. Event-based sensors are an excellent fit for Spiking Neural Networks (SNNs), since the coupling of an asynchronous sensor with neuromorphic hardware can yield real-time systems with minimal power requirements. In this work, we seek to develop one such system, using both event sensor data from the DSEC dataset and spiking neural networks to estimate optical flow for driving scenarios. We propose a U-Net-like SNN which, after supervised training, is able to make dense optical flow estimations. To do so, we encourage both minimal norm for the error vector and minimal angle between ground-truth and predicted flow, training our model with back-propagation using a surrogate gradient. In addition, the use of 3d convolutions allows us to capture the dynamic nature of the data by increasing the temporal receptive fields. Upsampling after each decoding stage ensures that each decoder's output contributes to the final estimation. Thanks to separable convolutions, we have been able to develop a light model (when compared to competitors) that can nonetheless yield reasonably accurate optical flow estimates.


Introduction
Computer vision has become a domain of major interest, both in research and in industry. Indeed, thanks to the development of new technologies, such as autonomous vehicles or self-operating machines, algorithms able to perceive the environment have proven to be key to achieving the desired level of performance. Among the numerous visual features these algorithms can estimate, optical flow (the pattern of apparent motion on the image plane due to relative displacements between an observer and its environment) remains one of paramount importance. Indeed, this quantity is directly linked with depth and egomotion, and its rich, highly temporal information is valuable for advanced computer vision applications, e.g. for obstacle detection and avoidance in autonomous driving systems. Given the severe safety constraints associated with this kind of critical system, accuracy and reliability are key to achieving successful models. However, achieving high levels of performance is not enough: the increasing concern about energy consumption motivates us to seek the most efficient model possible, all while retaining high performance standards.
In the search for an energy-efficient way to estimate optical flow, we decided to focus our interest on event cameras. Unlike their regular, frame-based counterparts, this kind of sensor is composed of independent pixel processors, each firing asynchronous events when the variation of the detected luminance since the previous event reaches a given threshold; the event has positive polarity if the brightness has increased, and negative polarity otherwise. This behavior translates into enormous energy savings: whereas conventional frame-based cameras are forced by design to output a frame at a fixed frequency, event cameras do not trigger any events for static visual scenes. Furthermore, they show a higher dynamic range, which allows them to avoid problems such as image artifacts (e.g. saturation after leaving a tunnel while driving), and a lower latency than regular cameras, which makes them particularly suitable for challenging, highly dynamic tasks (event sensors do not suffer from motion blur, unlike their frame-based counterparts). Nevertheless, they can also be less expressive: event cameras only provide information regarding changes in luminance, and not about the luminance itself. Furthermore, most event cameras discard color information, although some devices exist with independent firing for RGB formation at each pixel, like the Color-DAVIS346 event camera used by [1] to generate their CED Dataset. Finally, event cameras usually have lower spatial resolution than regular cameras, although recent technological developments are bridging this gap (e.g. [2]).
In search of energy efficiency, the choice of the sensor is not enough: the optical flow prediction algorithm itself also has to be as efficient as possible to achieve our goal. That is why we have resorted to Spiking Neural Networks (SNNs) to develop our model. These bio-inspired algorithms consist of independent units (neurons), each of them with an inner membrane potential, which can be excited or inhibited by pre-synaptic connections. When their inner potential reaches a certain, predefined firing threshold, one spike is sent to the post-synaptic neurons, and the membrane potential is reset. Since energy consumption on dedicated hardware is linked to spike activity, which is usually much sparser than the activations of standard analog neural networks, SNNs represent a more energy-efficient alternative. Moreover, in the absence of movement, no input events would be produced and fed to the network, which in turn would not trigger any spikes, and a zero optical flow prediction would be obtained (which is indeed the desired behaviour, since the absence of input events can only result from a lack of relative motion).
Finally, optical flow being a highly temporal task, incorporating temporal context into our vision model is key to achieving acceptable levels of performance. Two alternatives exist: using stateful units within the network (e.g. LSTMs, GRUs or taking advantage of the intrinsic memory capabilities in the case of SNNs), or explicitly handling the temporal dependencies with convolutions over consecutive frames along a temporal axis. Exploiting the inherent temporal dynamics of spiking neurons has proven to be an extremely challenging task, and we have therefore opted for the second alternative.
To sum up, the main contributions of this article are:
• a novel angular loss, which can be used with standard MSE-like functions and which helps the network to learn an intrinsic spatial structure. To the best of our knowledge, we are the first to use such a function for optical flow estimation.
• 3d-encoding of input events over a temporal dimension, leading to increased optical flow estimation accuracy
• a hardware-friendly downsampling technique in the form of maximum pooling, which further improves the model's accuracy
• a spiking neural network which can be implemented on neuromorphic chips, therefore taking advantage of their energy efficiency.

Related Work
Ever since their introduction, event cameras have been gaining ground within the computer vision community, and increasing efforts have been made to develop computer algorithms based on event data. As such, different datasets have emerged in order to solve different kinds of computer vision problems, like the DVS128 Gesture Dataset by [3] for gesture classification, or the EVIMO Dataset by [4] for motion segmentation and egomotion estimation. Despite this interest in event vision, the significant investment that event cameras represent for most research centers and companies has led to the development of event data simulators such as CARLA by [5], as well as algorithms to perform video-to-events conversion, like the model proposed in [6]. While lacking the intrinsic noise that event data usually presents, these artificial data can nonetheless be used to efficiently pre-train computer vision neural networks, e.g. [7] pretraining their model for depth estimation on a synthetic set of event data.
Nonetheless, for real-world applications (e.g. gesture recognition, object detection, clustering, etc.), true event recordings are preferred because simulators still lack realistic event noise models. Concerning depth and/or optical flow regression, two datasets have currently established themselves as the go-to choices: the MVSEC Dataset by [8], and the DSEC Dataset by [9]. While all of these datasets have proven invaluable to develop event-based computer vision algorithms, there is still an enormous gap between event-based and image-based publicly available datasets, and many authors are still forced to develop their own. For example, [10] generated their own classification data from [11] to account for the additional "pedestrian" class.
Most models so far have either been standard Analog Neural Networks (ANNs) like [12], exploiting gated-recurrent units to achieve state-of-the-art accuracy on DSEC, or hybrid analog-spiking neural networks like [13], combining a spiking encoder with an additional analog encoder for grayscale images, followed by a standard ANN. Other models have tried to leverage the temporal context by feeding the network with not only the events themselves, but also information on event timestamps, like the EVFlowNet model presented in [14]. More recently, [15] showed temporal information to be key to accurately estimating both optical flow and depth, achieving top results on the MVSEC and the DSEC datasets thanks to their implementation of non-spiking leaky integrators with learnable per-channel time constants. While all of these models do indeed achieve good levels of performance on their test sets, none of them manage to take advantage of the neuromorphic-friendly nature of event data, since analog blocks or additional non-spiking information prevent deployment on neuromorphic chips.
More relevant to this work are spiking neural networks applied to event vision, be it for depth or for optical flow estimation. As far as optical flow is concerned, it is worth citing the work of [16], which achieves state-of-the-art levels of performance on the MVSEC Dataset with a fully spiking architecture. More recently, [17] showed that spiking neural networks can indeed compete with their analog counterparts in terms of accuracy, showing top results both on the MVSEC and on the DSEC Dataset. Finally, [18] achieves a remarkable accuracy on the MVSEC Dataset with a U-Net-like architecture and a self-supervised learning rule. However, none of these models is implementable on neuromorphic hardware, since they either use upsampling techniques which are incompatible with the spiking nature of these devices (e.g. bilinear upsampling), or re-inject intermediate, lower-scale analog optical flow predictions, thereby violating the spiking constraint by introducing floating point values in an otherwise binary model. In addition, the choice of a self-supervised learning rule, usually linked to a photometric loss function presented in [19], means that optical flow estimations are only provided for pixels where events occurred, therefore creating non-dense flow maps.
Looking at depth prediction though, we do find some interesting strategies for fully deployable neuromorphic models. In particular, the authors of [20] presented in their StereoSpike model a fully-spiking, hardware-friendly network achieving remarkable accuracy on the MVSEC Dataset, thanks to stateless spiking neurons that have greatly inspired our work.
While we have focused on optical flow and depth predictions with event cameras, there have also been preceding works achieving top results on other computer vision tasks using event datasets and spiking neural networks. Such is the case of the work of [21], who performed semantic segmentation via supervised training of an SNN, or the method described in [22], which addressed instance segmentation on event data using a biologically-plausible learning strategy.
Materials and Methods

Training Dataset
Our study focuses on driving scenes, and we chose the DSEC Dataset by [9] to train our model. Unlike previous state-of-the-art datasets, such as the Multi-Vehicle Stereo Event Camera (MVSEC) Dataset by [8], which provided different working scenarios (indoors/outdoors, day/night, and four possible vehicle configurations: pedestrian, motorbike, car and drone), the DSEC dataset only consists of driving scenario sequences. However, it provides higher-quality ground-truth labels, thanks to the finer processing of the LIDAR measurements. In addition, this dataset also includes masks for invalid pixels, i.e., pixels where the optical flow ground-truth is unknown. As such, our metrics have only been evaluated on the valid pixels. Furthermore, this dataset provides an open benchmark to submit the results, which we used to determine our test metrics and compare ourselves to other works.

Input Event Representation
Event cameras produce an asynchronous event $e_i$ when the luminance variation at a given pixel reaches a given threshold:
$e_i = (x_i, y_i, t_i, p_i)$
where $(x_i, y_i)$ are the coordinates of the pixel emitting the event, $t_i$ the event's timestamp, and $p_i$ its polarity (+1 if luminance increases, and -1 otherwise). However, in order to perform our training, we are forced to work with a discrete time model, so this event stream has to be pre-processed. We therefore transform the input event stream into a sequence of frames of a given length in milliseconds, which we call "input histograms". Each frame is a two-channel (C = 2) tensor of size (C, H, W), where H and W represent the camera's resolution, i.e., the number of input pixels and their position in the camera. At each pixel, the first channel represents the number of positive input events that have been triggered in that particular pixel during the frame's duration, and the second channel represents the number of negative events. A representation of a one-channel input frame can be found in Figure 1. While not a binary representation, like the representation paradigm presented in [23], our choice is more expressive, since event counts account for pixel relative importance and therefore provide richer spatio-temporal information.
Figure 1: Top view of the event frame. We can see that events are heavily linked to contours (e.g. a zebra crossing on the bottom part, or the windows on a building on the right side), while regions with constant luminance (e.g. the road or the sky) do not trigger events.
We acknowledge that this frame-based approach increases the model's latency, since event sensors can virtually function in continuous time. However, it is imposed by the nature of our training, and is a widespread technique for event-based learning (see [24] on event representations). Moreover, we can limit this added latency through our choice of frame duration: the input stream being a continuous sequence of events, we are free to accumulate them in windows of the desired duration.
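The accumulation of events into two-channel "input histograms" can be sketched as follows. This is a minimal pure-Python illustration, not the authors' pipeline: the function name `events_to_histograms`, the tuple layout `(x, y, t, p)` with timestamps in milliseconds, and the toy events are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple

# Hypothetical event layout: (x, y, timestamp in ms, polarity in {+1, -1}).
Event = Tuple[int, int, float, int]


def events_to_histograms(
    events: Iterable[Event],
    height: int,
    width: int,
    frame_ms: float,
    num_frames: int,
    t_start: float = 0.0,
) -> List[List[List[List[int]]]]:
    """Return `num_frames` frames, each a (2, H, W) nested list of event counts."""
    frames = [
        [[[0] * width for _ in range(height)] for _ in range(2)]
        for _ in range(num_frames)
    ]
    for x, y, t, polarity in events:
        index = int((t - t_start) // frame_ms)
        if not 0 <= index < num_frames:
            continue  # event falls outside the requested time span
        channel = 0 if polarity > 0 else 1  # channel 0: positive, channel 1: negative
        frames[index][channel][y][x] += 1
    return frames


# Toy stream: two positive events on the same pixel, then two negative events.
events = [(1, 0, 0.5, +1), (1, 0, 3.0, +1), (0, 1, 8.9, -1), (2, 1, 9.0, -1)]
frames = events_to_histograms(events, height=2, width=3, frame_ms=9.0, num_frames=2)
```

With 9 ms frames, the first three events fall into frame 0 and the last one into frame 1; the pixel that fired twice holds a count of 2, which is the information a purely binary representation would discard.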

Spiking Neuron Model
For our network, we chose a simple neuron model that can be easily implemented with open-source Python libraries, in addition to being much less computationally expensive than closer-to-nature mathematical neuron models. This model is the one presented in [25]. It was implemented using the SpikingJelly library, developed and maintained by [26], due to its full integration with the PyTorch library.
Our model is based on a stateless approach: the neuron's potential is reset after each forward pass. Indeed, the mathematical neuron model presented by McCulloch and Pitts consists of stateless neurons with Heaviside activation functions. This is equivalent to stateless integrate-and-fire neurons, i.e., stateless artificial neurons working as perfect integrators, but which are reset at every time step. We therefore do not exploit the intrinsic memory capabilities of spiking neurons, but rather perform a binary encoding of the information. While this approach may seem counterintuitive, it actually further reduces energy consumption, since the reset operation is usually less energy demanding than the neuronal leak, and no resources have to be allocated to long-term memory handling. Consequently, we do not need to model such a phenomenon, and as a result our neuron model is more hardware friendly than its leaky counterpart. Temporal context is handled by 3d convolutions in the encoder stages of the model, as we explain in the following section.
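The difference between a stateless and a stateful integrate-and-fire unit can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions, not the SpikingJelly implementation used in this work: the class name `IFNeuron`, the threshold of 1.0 and the hard reset to zero are choices made for the example.

```python
class IFNeuron:
    """Toy integrate-and-fire neuron with a Heaviside firing rule."""

    def __init__(self, threshold: float = 1.0, stateless: bool = True) -> None:
        self.threshold = threshold  # assumed firing threshold
        self.stateless = stateless
        self.potential = 0.0

    def step(self, input_current: float) -> int:
        self.potential += input_current
        spike = 1 if self.potential >= self.threshold else 0
        # Stateless mode: reset after every forward pass, spike or not.
        # Stateful mode: reset only when a spike is emitted.
        if spike or self.stateless:
            self.potential = 0.0
        return spike


inputs = [0.6, 0.6, 1.2]

stateless_neuron = IFNeuron(stateless=True)
stateless_spikes = [stateless_neuron.step(i) for i in inputs]

stateful_neuron = IFNeuron(stateless=False)
stateful_spikes = [stateful_neuron.step(i) for i in inputs]
```

The stateless unit fires only on the last input, since each input is thresholded on its own, whereas the stateful unit also fires on the second input because it has integrated the first one. The stateless output therefore depends only on the current input, which is what makes it a binary encoding rather than a memory.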

Network Architecture
Our network is based on a U-Net-like architecture [27]. Indeed, U-Net has established itself as a reference model when full-scale image predictions are required, i.e., predictions at roughly the same resolution as the input data. Our architecture is shown in Figure 2. After a first convolution stage which increases the number of channels to 32 without modifying the input tensor size, each encoder stage halves the tensor width and height while doubling the number of channels. Conversely, each decoder stage doubles the tensor width and height, and halves the number of channels.
In order to increase the network expressivity, each decoder stage plays a role in the final prediction. Each decoder output is upsampled into a full-scale, two-channel tensor (x- and y-components of the optical flow estimation). All of the outputs contribute equally to the network's final estimation, which consists of the combination of successive coarse predictions. The loss function is evaluated after each update of the final neuron pool, thus forcing the network's prediction to be close to the ground-truth as early as the first coarse update. This approach was introduced in [20] and proved to be beneficial to the overall accuracy.
The main features of our network are the following:
• Inspired by Temporal-Convolutional Networks, presented in [28] and [29], we use three-dimensional convolutions for our data encoding. Consecutive input frames are combined by the temporal kernel via unpadded convolutions, decreasing the temporal dimension in size so it collapses to 1 when reaching the bottleneck. Acting as delay lines, they allow us to explicitly handle the temporal dimension. The small temporal kernel size is able to capture short-term temporal relationships, while the increasing temporal receptive field due to consecutive convolutions along the temporal dimension accounts for long-term dependencies. Afterwards, the network architecture is fully two-dimensional. By default, the temporal kernel size we use is 5, which leads to a temporal receptive field of 21 × 9 ms = 189 ms from the bottleneck and beyond.
• Skip connections between the encoder and the decoder consist of the last component of the temporal dimension at the corresponding encoding stage, since we believe the most recent event information to be the most relevant for optical flow estimation. We tested both sum and concatenate skip connections and found that concatenations led to the best estimations (these results are presented in subsection 4.2).
• Given the relative importance of the residual blocks in the total number of parameters, and in search of the lightest possible model, we also analyzed the effect of reducing the number of residuals on the network's performance. We found that the best model only required one residual, unlike other conventional U-Net-like architectures (e.g. [16]).
• Downsampling in the encoding stages is performed via maximum pooling, instead of traditional strided convolutions, so as to account for the spike activity within the kernel's region rather than for individual spikes. This approach has proved to increase our model's performance. To the best of our knowledge, it is the first time this technique is used in a U-Net-like spiking neural network for dense regression. In addition, [30] showed that this kind of downsampling strategy is supported by neuromorphic hardware.
• Since our final aim is to develop a model that could be implemented on a neuromorphic chip, the whole upsampling operation is performed via Nearest Neighbor upsampling, which preserves hardware friendliness. Indeed, while other widespread techniques, such as Bilinear Upsampling, interpolate each "pixel", Nearest Neighbor Upsampling simply copies each value into a tensor of an increased size, without modifying it. For further illustration, a graphic representation of both upsampling techniques can be found in the Supplementary Materials (Figure S1).
• To further decrease the model's weight, we used depth- and point-wise separable convolutions (see e.g. [31]) everywhere in the model. These convolutions not only decrease the model's number of parameters, but also reduce the model's overfitting, therefore increasing its performance on unseen data.
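The contrast between the two upsampling families mentioned in the list above can be made concrete on a tiny spike map. The sketch below is illustrative only: `nearest_neighbor_upsample` and `linear_midpoints` are hypothetical helper names, and the second one is a simplified one-dimensional stand-in for bilinear interpolation.

```python
def nearest_neighbor_upsample(grid, factor=2):
    """Copy each value into a factor x factor block, leaving values unchanged."""
    upsampled = []
    for row in grid:
        wide_row = [value for value in row for _ in range(factor)]
        upsampled.extend(list(wide_row) for _ in range(factor))
    return upsampled


def linear_midpoints(row):
    """Simplified 1D linear interpolation: insert the midpoint between neighbours."""
    out = [row[0]]
    for left, right in zip(row, row[1:]):
        out.extend([(left + right) / 2, right])
    return out


spikes = [[1, 0],
          [0, 1]]
nearest = nearest_neighbor_upsample(spikes)   # values stay in {0, 1}
interpolated = linear_midpoints([1, 0, 1])    # midpoints of 0.5 appear
```

Copying keeps every value in {0, 1}, so the upsampled tensor is still a spike map, whereas interpolation introduces fractional values such as 0.5 that a purely spiking pipeline cannot represent.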
It is important to specify that our approach is integer-based rather than binary-based, since some of our skip connections are additions instead of concatenations, and our bottleneck's architecture is based on tensor sums. Nevertheless, our approach remains hardware-friendly, because:
• If the processing were asynchronous and event-driven, then the spikes arriving through the residual connection would typically arrive before the others. Thus, if there were two spikes, one from the residual and one from the normal connection, instead of doing an explicit ADD, both spikes could be fed through the same synapse, and each spike would cause an increment of w (instead of adding the two spikes to get 2 and then multiplying by w to get the increment). Moreover, even if the spikes arrived synchronously, they would be processed sequentially using a FIFO queue.
• Concatenation is equivalent to addition as a skip connection if the weights are duplicated and kept tied. Indeed, if there were two spikes, one from the residual and one from the "normal" connection, instead of doing 2 · w, the algorithm would perform w + w. Since the duplicated weights would be tied, the number of trainable parameters would be the same, and both operations would be equivalent.
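The equivalence argued in the two points above can be checked numerically on a single post-synaptic neuron. This is a toy sketch: the function names, the weights and the spike patterns are made up for the example, and a real layer would apply the same reasoning per convolution weight.

```python
def synaptic_input_sum(skip, main, weights):
    """Sum skip connection: spikes are added first, then weighted (integer input)."""
    return sum(w * (s + m) for w, s, m in zip(weights, skip, main))


def synaptic_input_concat(skip, main, weights_skip, weights_main):
    """Concatenate skip connection: each branch keeps binary spikes and its own weights."""
    from_skip = sum(w * s for w, s in zip(weights_skip, skip))
    from_main = sum(w * m for w, m in zip(weights_main, main))
    return from_skip + from_main


weights = [0.5, -0.25, 2.0]   # hypothetical synaptic weights
skip_spikes = [1, 0, 1]       # spikes arriving through the skip connection
main_spikes = [1, 1, 0]       # spikes arriving through the normal connection

summed = synaptic_input_sum(skip_spikes, main_spikes, weights)
# Tied weights: the same list is used for both branches of the concatenation.
concatenated = synaptic_input_concat(skip_spikes, main_spikes, weights, weights)
```

Both paths deliver the same input current (2.75 here): the first channel sees two spikes and contributes either 2 · w or w + w, which is the identity the text relies on.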

Supervised Learning Method
Our model was trained with supervised learning using surrogate gradient descent, with a sigmoid function as our surrogate gradient model. The ground-truth optical flow values were those provided in the DSEC dataset. While traditional self-supervised methods restrict their optical flow processing to pixels where events occurred (e.g. [16], [14] or [17], to cite a few examples), our approach permits dense estimations (thanks to surrogate gradient learning). We trained our model on the valid pixels given by the dataset masks at each timestep.
Our loss function included two terms:
• A standard MSE-like loss, $L_{mod}$, between the value of the predicted flow and its corresponding ground truth, computed over the $N_{pixels}$ valid pixels to be trained at each timestep.
• In addition to penalizing the discrepancy in modulus between the vectors, we explicitly encourage the optical flow direction to be the same in the ground truth and in the prediction through an angular term, $L_{ang}$. This term has proven to be key to reducing noise in optical flow predictions, since pixels with low optical flow values consistently yield small modulus loss values regardless of their direction. It is computed from $c_\theta$, the cosine of the error angle between the predicted and the ground-truth flow, and includes a small parameter $\epsilon$ ($\epsilon = 10^{-7}$) to ensure that no numerical errors arise during execution. Furthermore, the values of $c_\theta$ are clamped to the interval $(-1 + \epsilon, 1 - \epsilon)$ for the same reason.
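The final loss function used to train the model is:
$L = \lambda_{mod} L_{mod} + \lambda_{ang} L_{ang}$
where $L_{mod}$ and $L_{ang}$ are the modulus and angular terms described above. From preliminary tests, we found that $\lambda_{mod} = \lambda_{ang} = 1$ yields good results, and we therefore decided to use these values.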
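A two-term loss of this kind can be sketched as follows. The exact expressions of the two terms are not reproduced in this text, so the sketch makes explicit assumptions: the modulus term is taken as the mean Euclidean norm of the error vector over valid pixels, and the angular term as the mean arccosine of the clamped cosine. The function name `flow_loss` and the placement of ε in the denominator are also assumptions.

```python
import math

EPS = 1e-7  # small constant, as in the text


def flow_loss(pred, gt, lambda_mod=1.0, lambda_ang=1.0):
    """pred, gt: lists of (u, v) flow vectors for the valid pixels only."""
    n_pixels = len(pred)
    l_mod = 0.0
    l_ang = 0.0
    for (pu, pv), (gu, gv) in zip(pred, gt):
        # Assumed modulus term: norm of the error vector.
        l_mod += math.hypot(pu - gu, pv - gv)
        # Cosine of the angle between prediction and ground truth.
        dot = pu * gu + pv * gv
        norms = math.hypot(pu, pv) * math.hypot(gu, gv)
        cos_theta = dot / (norms + EPS)  # EPS avoids a division by zero
        # Clamping keeps acos away from +/-1, where its derivative diverges.
        cos_theta = min(max(cos_theta, -1.0 + EPS), 1.0 - EPS)
        l_ang += math.acos(cos_theta)  # assumed angular term
    l_mod /= n_pixels
    l_ang /= n_pixels
    return lambda_mod * l_mod + lambda_ang * l_ang, l_mod, l_ang


# A small but misdirected prediction: low modulus error, large angular error.
total, l_mod, l_ang = flow_loss(pred=[(0.1, 0.0)], gt=[(0.0, 0.1)])
# A perfectly aligned prediction for comparison.
aligned_total, aligned_mod, aligned_ang = flow_loss(pred=[(3.0, 4.0)], gt=[(3.0, 4.0)])
```

The first call shows the situation described in the text: for a pixel with a low flow value the modulus term stays small (about 0.14) even though the direction is off by 90 degrees, and only the angular term (about π/2) reports the error.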
As explained in subsection 3.4 (Network Architecture), each decoder's output plays a role in the final optical flow estimation. As such, and in order to encourage accuracy from the first decoder's upsampling onwards, the loss function is evaluated for each consecutive contribution to the final pool. After each upsampling of a decoder's output, the inner potentials of an IF layer are updated, and the loss is evaluated on those potentials, which is equivalent to summing the spikes out of each decoder stage weighted by the corresponding intermediary prediction layer.
Finally, in order to perform the back-propagation in our supervised training method, we resorted to surrogate gradient learning, introduced in [32], and already implemented in the SpikingJelly library [26].
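The idea behind a sigmoid surrogate gradient can be sketched without any deep learning library: the forward pass keeps the non-differentiable Heaviside step, while the backward pass uses the derivative of a sigmoid in its place. The function names and the slope `alpha = 4.0` below are assumptions made for the illustration; this is not the SpikingJelly code itself.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def spike_forward(potential: float, threshold: float = 1.0) -> float:
    """Forward pass: Heaviside step, whose true derivative is zero almost everywhere."""
    return 1.0 if potential >= threshold else 0.0


def spike_surrogate_grad(potential: float, threshold: float = 1.0, alpha: float = 4.0) -> float:
    """Backward pass: derivative of sigmoid(alpha * (potential - threshold))."""
    s = sigmoid(alpha * (potential - threshold))
    return alpha * s * (1.0 - s)


grad_at_threshold = spike_surrogate_grad(1.0)   # maximum of the surrogate
grad_near = spike_surrogate_grad(1.5)
grad_far = spike_surrogate_grad(3.0)
```

The surrogate gradient peaks at the firing threshold and decays smoothly on both sides, so neurons whose potential is close to firing receive the largest weight updates, while the spikes themselves remain binary in the forward pass.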

Training Details
All of our calculations were performed either on NVIDIA A40 GPUs or on Tesla V100-SXM2-16GB GPUs belonging to the French regional public supercomputer CALMIP, owned by the Occitanie region.
Training was performed with a batch size of 1, since it is the optimal value we have found for our task. Although unconventional, this result is in line with the one found in [20], where a batch size of one was found optimal for depth regression from event data using stateless spiking neurons. We used an exponential learning rate scheduler, and implemented random horizontal flips as a data augmentation technique to improve performance. Furthermore, thanks to our stateless approach, we were able to train our network with shuffled samples, instead of being forced to use the input frames sequentially.

Results
We divided our dataset into a train and a validation split, and our performance levels are reported with regard to the validation set. The exact sequences used in each split can be found in the supplementary materials. Nevertheless, we resort to the official DSEC benchmark to compare ourselves to the state-of-the-art, since it represents an objective, third-party test set. We now proceed to present the results we obtained in our studies. Due to the number of tests that we have run, all of the corresponding plots are provided in the Supplementary Materials.

Finding the optimal kernel size
Convolutional neural networks have regained the interest of the deep learning community during the past few years, thanks to their ability to capture spatial relations within their kernel. Recently, increased kernel sizes have been replacing the traditional 3x3 formula, with examples as relevant as [33], which uses 7x7 kernels. [34] presents a method to scale up the kernel size to 31x31, and [35] goes even further and proposes to go up to 51x51 for the spatial kernel size, although both of these methods rely on sparsity and re-parametrization to achieve their goal. Starting from a naive U-Net-like model, we started our research by trying to optimize our spatial kernel size. In the end, our results do match those presented in [34], showing that 7x7 kernels are optimal. Indeed, further increasing the kernel size makes computational time explode, while accuracy plateaus. We therefore decided to adopt a 7x7 kernel in the spatial dimension for our model.
Next, we optimized the temporal kernel size, directly linked with the number of frames that we input to our model. Since we want the temporal dimension to collapse to one in the bottleneck thanks to unpadded convolutions in the temporal dimension, a larger kernel size naturally requires a greater number of frames, and therefore a heavier model. Nonetheless, it also takes into account a longer temporal context, which may be beneficial for the network's accuracy. As such, we tested our simple model for temporal kernel sizes of 3 (11 input frames), 5 (21 input frames) and 7 (31 input frames). Our results show that increasing the kernel size up to 5 does indeed boost the model's accuracy, but going beyond this size does not translate into an accuracy improvement. Thus, a temporal kernel size of 5 was chosen for the 3d convolutions in our model.
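The relation between temporal kernel size and number of input frames can be worked out with a short calculation. An unpadded, unit-stride convolution with temporal kernel size k shortens the temporal axis by k - 1; the frame counts quoted above (11, 21 and 31 for k = 3, 5 and 7) are consistent with five such convolutions before the bottleneck. That count of five is inferred from those numbers, not stated in the text, and the helper names are hypothetical.

```python
def frames_needed(kernel_size: int, num_temporal_convs: int = 5) -> int:
    """Input frames required so the temporal axis collapses exactly to 1."""
    return num_temporal_convs * (kernel_size - 1) + 1


def temporal_sizes(num_frames: int, kernel_size: int) -> list:
    """Temporal length after each unpadded, unit-stride convolution."""
    sizes = [num_frames]
    while sizes[-1] > 1:
        next_size = sizes[-1] - (kernel_size - 1)
        if next_size < 1:
            raise ValueError("temporal axis does not collapse exactly to 1")
        sizes.append(next_size)
    return sizes


required = {k: frames_needed(k) for k in (3, 5, 7)}
collapse = temporal_sizes(num_frames=21, kernel_size=5)
receptive_field_ms = 21 * 9  # 21 frames of 9 ms each
```

For the default configuration the temporal axis shrinks as 21, 17, 13, 9, 5, 1, and the bottleneck therefore sees 21 × 9 ms = 189 ms of events.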

Finding the best network architecture
In order to find the best network architecture, we evaluated two possible options:
• We compared sum vs. concatenate skip connections, since concatenate skip connections are easier to implement in neuromorphic hardware, but slightly increase the number of parameters in the network.
• Seeking to develop a model as light as possible, we also characterized the effect of the number of residuals in the network's bottleneck on the model's performance.
After training each of the models for 35 epochs, we found the best model to be the 1-residual network with concatenate skip connections, which amounts to a total of 1.22 million parameters and leads to an accuracy of 1.1 pixels/second of average end-point error on our validation dataset, using 9 ms frames as an input in all cases. The results regarding the architecture optimization have been summarized in Table 1.

Optimizing the frame duration
Next, we focused our attention on the optimal frame duration to accurately estimate optical flow, i.e., the total temporal context the network processes when making a prediction. This parameter is directly linked with the latency the model can achieve, since optical flow estimations are only produced at the end of each frame (provided that the input tensors are treated as a sliding window, where only the last N=21 frames are considered).
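The sliding-window handling of input frames mentioned above can be sketched with a bounded queue. This is an illustrative helper, not part of the described system: the class name `FrameWindow` is hypothetical, and integers stand in for the (2, H, W) histograms.

```python
from collections import deque


class FrameWindow:
    """Keep only the most recent `num_frames` frames of the event stream."""

    def __init__(self, num_frames: int = 21) -> None:
        self.frames = deque(maxlen=num_frames)

    def push(self, frame):
        """Add the newest frame; return a full window once enough frames exist."""
        self.frames.append(frame)  # the oldest frame is dropped automatically
        if len(self.frames) == self.frames.maxlen:
            return list(self.frames)
        return None


window = FrameWindow(num_frames=3)  # short window to keep the example readable
outputs = [window.push(frame_id) for frame_id in (1, 2, 3, 4)]
```

Once the window is full, every new frame yields a new network input, so under this scheme a prediction can be produced once per frame duration while each prediction still covers the full temporal context.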
We trained the network with frames of 4.5 ms, 9 ms and 18 ms respectively. Our results show that the optimal frame duration was 9 ms, followed by 18 ms, with the worst performance obtained for frames of 4.5 ms. While this result may seem counter-intuitive, since 4.5 ms frames contain a finer representation of the event sequence, we believe this phenomenon is caused by the lack of overall temporal context. Indeed, by using short frames, the network is unable to extract longer-term dependencies, and therefore to accurately predict optical flow. That is also why we believe that 18 ms frames, while coarser, do manage to better capture these long-term dependencies, and therefore provide a more accurate estimation. These results, as well as all of the successive optimization studies we have performed, can be found in Table 2.

Comparison with the state-of-the-art
We trained our best architecture on the whole DSEC dataset for a total of 100 epochs. We evaluated our model on the official test set provided by DSEC. Results are shown in Table 3. In order to provide a fair comparison, we only included results on the official benchmark, and not those reported on custom validation sets. While still far from the best models, we demonstrate the power of spiking neural networks when applied to dense regression in computer vision, achieving good levels of performance with a fraction of the parameters used by other models.
We also provide some of our model's results on the validation set, which can be found in Figure 3. These pictures show that, even if the network was not explicitly trained to distinguish image contours (since it was only trained on a selection of valid pixels at each timestep), it is nonetheless capable of extracting structural information within the scene and generalizing it, as illustrated in the rightmost images (unmasked predictions) for the given examples. These results demonstrate the model's general comprehension of the visual scene and, we believe, represent a solid understanding of the pattern of motion.

Ablation studies
Several ablation studies have been performed on our best model to further demonstrate our claims, and we have gathered our conclusions in the following paragraphs. Plots containing all of these results can be found in the Supplementary Materials.

Pooling vs. Convolutional Downsampling
Our results show that using maximum pooling instead of strided convolutions is an efficient technique to downsample spiking data. We believe that the reason behind this behavior is that pooling is a way of densifying the tensors without changing their spiking nature.
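The densifying effect of maximum pooling on binary data can be seen on a small spike map. The sketch below assumes a 2x2 window with stride 2 and even input dimensions; the helper names and the toy map are made up for the illustration.

```python
def max_pool_2x2(grid):
    """2x2 max pooling with stride 2; on binary inputs this is a logical OR."""
    height, width = len(grid), len(grid[0])
    return [
        [
            max(grid[i][j], grid[i][j + 1], grid[i + 1][j], grid[i + 1][j + 1])
            for j in range(0, width - 1, 2)
        ]
        for i in range(0, height - 1, 2)
    ]


def density(grid):
    """Fraction of active (spiking) positions."""
    values = [v for row in grid for v in row]
    return sum(values) / len(values)


spikes = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
pooled = max_pool_2x2(spikes)
```

The pooled map is still binary, so the spiking nature of the tensor is preserved, while the fraction of active positions rises from 3/16 to 2/4 in this example.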

3d vs. 2d Encoding
We also compared our baseline 3d model with an equivalent 2d model, where the 21 input frames have been fed to the network concatenated along the channel dimension, so that both models have the same temporal context. We found that fully 2-dimensional models lead to decreased performance. We believe this is due to the fact that, by using 2d convolutions, all the temporal information is directly mixed during the first convolution stage, therefore hindering the network from finding long-term dependencies.
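The input of such a 2d baseline can be pictured as a simple reshaping of the frame sequence. The following sketch uses nested lists and a hypothetical helper, `stack_frames_along_channels`; the small spatial size is chosen only to keep the example short.

```python
def stack_frames_along_channels(frames):
    """T frames of shape (C, H, W) -> a single tensor of shape (T * C, H, W)."""
    return [channel for frame in frames for channel in frame]


def shape_of(tensor):
    """(channels, height, width) of a nested-list tensor."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))


num_frames, channels, height, width = 21, 2, 2, 3
frames = [
    [[[0] * width for _ in range(height)] for _ in range(channels)]
    for _ in range(num_frames)
]
stacked = stack_frames_along_channels(frames)
stacked_shape = shape_of(stacked)
```

The 21 two-channel frames become 42 input channels, all of which are combined by the very first 2d convolution. The temporal order then survives only as a channel index, in line with the explanation given above for the drop in performance.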

Loss function
We also analyzed the influence of the loss function on the final results obtained. We compared our proposed loss model to two single-term losses:
• One model with only the norm of the error vector, but without the angular loss term
• One loss function with only a relative loss term. This model penalizes deviations in the prediction relative to the ground truth's norm, and should therefore be able to implicitly impose a restriction on angular accuracy.
Our results show that naively limiting the error's norm is not enough to achieve competitive results, and neither is limiting the relative error.Indeed, by introducing a more aggressive term in the loss function, we managed to force the network into implicitly learning the optical flow's structure, and therefore achieve better accuracy.
It is surprising that the network with the two losses reaches a lower L mod than the network with L mod only.This shows that the second network gets trapped in a local minima and that adding the L ang loss helps to get out of it.
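For reference, a two-term loss of this kind can be sketched in NumPy as shown below. This is a hedged illustration rather than our exact implementation: the definitions used here (mean Euclidean norm of the error for the modulus term, mean angle obtained from the cosine similarity for the angular term), the function names and the weighting parameter `ang_weight` are assumptions made for the example.

```python
import numpy as np


def modulus_loss(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> float:
    """Mean Euclidean norm of the error vector over valid pixels. Flows are (2, H, W)."""
    err = np.linalg.norm(pred - gt, axis=0)
    return float(err[mask].mean())


def angular_loss(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray, eps: float = 1e-8) -> float:
    """Mean angle (radians) between predicted and ground-truth flow over valid pixels."""
    dot = (pred * gt).sum(axis=0)
    norms = np.linalg.norm(pred, axis=0) * np.linalg.norm(gt, axis=0)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)  # eps avoids division by zero
    return float(np.arccos(cos)[mask].mean())


def combined_loss(pred, gt, mask, ang_weight: float = 1.0) -> float:
    """Sum of both terms; ang_weight is a hypothetical balancing factor."""
    return modulus_loss(pred, gt, mask) + ang_weight * angular_loss(pred, gt, mask)


# Toy example: ground truth points along x, the prediction points along y.
gt = np.zeros((2, 4, 4))
gt[0] = 1.0
pred = np.zeros((2, 4, 4))
pred[1] = 1.0
mask = np.ones((4, 4), dtype=bool)
print(modulus_loss(pred, gt, mask), angular_loss(pred, gt, mask))
```

In this toy case the angular term penalizes the 90° direction error explicitly, on top of the error norm.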

Effect of combining polarities on performance
Our next study on input representation consisted in combining polarities into a single channel before feeding them to the model. Since polarities are closely linked to phenomena like color or texture, we wished to study their influence on the final performance levels. Indeed, if we imagine a grey background with a black shape and a white shape following the same track, we would obtain opposite polarity fronts, while the optical flow pattern would be the same. We have therefore analyzed whether polarities can simply be combined into a total per-pixel event count.
However, our results show that keeping a separate channel for each polarity is beneficial for the network's performance. We believe this result is related to the different dynamics of each polarity, since different thresholds lead to different behaviour for luminance increments and decrements.
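The combined representation tested in this ablation can be sketched as follows. The helper `combine_polarities` and the (polarity, frame, height, width) layout are assumptions made for this illustration, which reproduces the black and white shape example in a toy setting.

```python
import numpy as np


def combine_polarities(hist: np.ndarray) -> np.ndarray:
    """Merge a (2, T, H, W) per-polarity event histogram into a (1, T, H, W) total count."""
    return hist.sum(axis=0, keepdims=True)


# Toy scenes: a white shape produces only positive events, a black shape only
# negative events, both at the same pixel and in the same frame.
white_shape = np.zeros((2, 1, 4, 4), dtype=np.int64)
white_shape[0, 0, 1, 1] = 3  # positive-polarity channel
black_shape = np.zeros((2, 1, 4, 4), dtype=np.int64)
black_shape[1, 0, 1, 1] = 3  # negative-polarity channel

same_after_merge = np.array_equal(
    combine_polarities(white_shape), combine_polarities(black_shape)
)
print("identical once combined:", same_after_merge)
```

Both scenes become indistinguishable once combined, which is the intended invariance, but the network also loses access to the per-polarity dynamics.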

Skip connections in the bottleneck
Our final ablation study targeted the very first skip connection, i.e., connecting the last encoder with the first decoder.
Having always kept it as a sum (slightly redundant, given the residual block architecture) because of the high number of channels, we also tested transforming it into a concatenation (cat) skip connection. However, we found that this decreases the network's performance while also increasing the number of parameters. We therefore decided to keep it as a sum for all of the architectures.
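The parameter overhead of the cat variant can be seen in the following sketch. The channel count (512), the 3 × 3 kernel and the helper names are hypothetical values chosen for illustration, not the figures of our architecture.

```python
import numpy as np


def sum_skip(enc: np.ndarray, dec: np.ndarray) -> np.ndarray:
    """Element-wise sum: the number of channels is unchanged."""
    return enc + dec


def cat_skip(enc: np.ndarray, dec: np.ndarray) -> np.ndarray:
    """Concatenation along the channel axis: the number of channels doubles."""
    return np.concatenate([enc, dec], axis=0)


def conv_weights(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weight count of a 2d convolution (bias ignored)."""
    return in_channels * out_channels * kernel * kernel


channels = 512  # hypothetical bottleneck width
enc = np.zeros((channels, 15, 20), dtype=np.float32)
dec = np.zeros((channels, 15, 20), dtype=np.float32)

print("sum:", sum_skip(enc, dec).shape, conv_weights(channels, channels, 3))
print("cat:", cat_skip(enc, dec).shape, conv_weights(2 * channels, channels, 3))
```

The convolution that follows a cat connection needs twice as many weights as the one that follows a sum, which is most costly at the bottleneck, where the channel count is highest.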

Model evaluation on the MVSEC Dataset
In order to analyze the generalization capabilities of our method, we also tested our model on the Multi-Vehicle Stereo Event Camera Dataset (MVSEC), introduced in [8]. We started by analyzing our model's performance on the indoor flying sequences. To do so, we took a model pre-trained on DSEC and optimized its weights on the MVSEC dataset over 35 epochs. We followed a training approach akin to the one adopted for the DSEC dataset, i.e. we considered as invalid all pixels with either zero-valued ground truth (x- and y-components of the optical flow vector below a small threshold thr = 1e-5) or unknown flow values, and trained only on valid pixels. The results we obtained, as well as a comparison with other state-of-the-art models, can be found in Table 4. We achieve state-of-the-art performance levels on these sequences when compared to other existing spiking neural networks, and top accuracy overall, even though our architecture has not been optimized for such a vehicle/scenario configuration.
Next, we also tested our model on the MVSEC outdoor sequences, training on outdoor_day2 and evaluating on outdoor_day1. We present these results in Table 5. Although our model leads to competitive results on all of the MVSEC indoor sequences, it struggles to achieve competitive results on the MVSEC outdoor sequences, both when starting from a pre-trained checkpoint and when training from scratch. We believe that this phenomenon is due to a combination of factors:
• Our network architecture, and more precisely the spatial kernel size, has been optimized for an optical flow prediction of 480 × 640 pixels. However, the MVSEC dataset was recorded with a different event camera, and may therefore demand a different kernel size to achieve top performance levels.
• Our frame duration and overall temporal context have been designed for a specific camera configuration and resolution. Again, the use of a lower-resolution camera leads to different optical flow dynamics, and therefore to a potentially different temporal representation.
• Our training procedure (learning rate, scheduler, etc.) has not been designed for such a low-resolution estimation, and further optimizations are therefore needed to increase accuracy.
• Finally, the outdoor_day2 sequence of the MVSEC dataset, used for training on driving scenarios, consists of only nine minutes of recording in which high-frequency vibrations constantly affect the event camera (see [14]). In addition, the event histograms are greatly impacted by events caused by reflections on the car dashboard. These noisy events may prevent the network from achieving competitive results on these sequences, since they nonetheless feed information into the network. In fact, training on this scenario was only possible after masking that section in both the input event histogram and the associated ground truth: otherwise, the network oscillates without consistently increasing accuracy.
Nevertheless, our model achieves a certain level of learning under this condition, and we are convinced that better results could be obtained by optimizing the training pipeline for this scenario (especially the frame duration and the kernel sizes). Taking this learning into account, in conjunction with our competitive results on indoor flying scenarios, we believe that these results demonstrate the generalization capabilities of our approach, as well as its applicability in a variety of conditions.
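The valid-pixel criterion described above can be sketched as follows. This is a minimal illustration and not our training code: encoding unknown ground-truth values as NaN is an assumption, and the names `valid_mask` and `masked_epe` are introduced only for this example.

```python
import numpy as np

THR = 1e-5  # threshold below which a flow component is treated as zero


def valid_mask(flow_gt: np.ndarray) -> np.ndarray:
    """Boolean (H, W) mask of valid pixels for a (2, H, W) ground-truth flow.

    Assumption: unknown ground-truth values are stored as NaN.
    """
    known = np.isfinite(flow_gt).all(axis=0)
    zero_valued = (np.abs(flow_gt) < THR).all(axis=0)  # both components below THR
    return known & ~zero_valued


def masked_epe(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> float:
    """Average end-point error computed on valid pixels only."""
    err = np.linalg.norm(pred - gt, axis=0)
    return float(err[mask].mean())


# Toy 2x2 ground truth: one moving pixel, one unknown, one static, one moving.
gt = np.zeros((2, 2, 2))
gt[:, 0, 0] = (1.0, 0.0)
gt[:, 0, 1] = np.nan
gt[:, 1, 1] = (0.0, 2.0)
mask = valid_mask(gt)
pred = np.zeros_like(gt)
print(mask)
print(masked_epe(pred, gt, mask))
```

Only the two moving pixels contribute to the error in this example; the unknown and the static pixel are ignored.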

Discussion
In summary, we have presented a hardware-friendly, lightweight spiking model able to accurately estimate optical flow from event-based data collected by neuromorphic vision sensors. We propose an efficient temporal coding in the form of 3d convolutions in the encoder, which increases the temporal receptive field of the deepest stages of the network. We also introduce a novel angular loss function that, in conjunction with a standard MSE-like loss, manages to boost performance by forcing the algorithm to learn the implicit spatial structure. We use maximum pooling as our downsampling strategy, thus densifying the tensors in a neuromorphic-friendly fashion. Moreover, the successive contributions of decoder outputs to the final prediction increase the network's expressivity, and allow us to achieve competitive results without resorting to intermediate prediction re-injections. Consequently, our model can be implemented in neuromorphic hardware, resulting in an extremely energy-efficient model that can still achieve accurate predictions.
We believe our results contribute to promoting spiking neural networks as energy-efficient, real-world alternatives to traditional computer vision systems based on frame-based video processing and/or complex sensor data. However, we acknowledge that work has yet to be done, since much of the intrinsic potential of SNNs, namely their inherent memory handling capabilities, has not been fully exploited in this study. Moreover, the convergence of our experiments to an optimal batch size of 1, while having indeed improved our model's performance, greatly hinders the training speed, since strategies such as data parallelization cannot be employed. We therefore believe that these results can be further improved, e.g. using techniques such as weight averaging or network pre-training.
Future research lines should focus on combining different techniques in order to boost performance even further. For instance, exploiting the intrinsic memory of spiking neurons is a potentially useful approach, but the increased computational cost linked to unrolling a stateful computational graph makes the task challenging. Moreover, sensor fusion can also be explored as an alternative to boost performance, especially since most event cameras also provide black and white images. However, this approach could increase the network's latency, as well as make neuromorphic implementation challenging. Furthermore, while temporal dependencies have been imposed a priori in our model, they could also be natively learnt by the network. The work of [43] presents a way of increasing kernel sizes without an increment in network parameters, capable of achieving state-of-the-art performance. While only applied so far to 1- and 2-dimensional convolutions, their method could easily be adapted to our 3d approach.
Moreover, publicly available datasets usually lack challenging conditions, such as crossing pedestrians or vehicles, which can limit the network's generalization capabilities.While we believe that our proposed model is capable of understanding such situations (see Figure 3c, where traffic signals are easily recognizable), it would be desirable to train on more challenging scenarios.
Finally, we would like to address hardware efficiency and implementation. We acknowledge that our approach does not provide energy savings during training, since it is performed on GPUs using standard ANN learning techniques, and therefore suffers from the same energy consumption constraints as these networks (plus the added memory usage due to the storage of the neurons' membrane potentials). However, energy savings can be achieved when the model is deployed on dedicated hardware, since such hardware is more energy efficient than GPUs thanks to its spiking nature. Nonetheless, even if our model is hardware-friendly, and therefore theoretically implementable on dedicated hardware, more effort can be dedicated towards making it easier to implement. Indeed, hardware mapping would benefit from weight quantization (which would require fewer bits to store each synaptic weight) or sparsity encouragement to fully exploit the advantages of neuromorphic hardware over GPUs. These techniques, presented by [40] and [41], would not only reduce energy consumption, but also facilitate potential future implementations, and should be taken into account for actual on-chip deployment.

SUPPLEMENTARY MATERIAL
1 Supplementary Data

Train and validation split
Here is the sequence division used for our train and validation split:
• Training split:
• Validation split:
- thun_00_a
- zurich_city_02_d
- zurich_city_03_a
- zurich_city_08_a
- zurich_city_11_b
The data split has been performed so that around 75% of the available data is used during training, while the remaining 25% is used for evaluation.

Supplementary Figures -Result Plots
Here are the plots containing the results of our trainings, evaluated on our validation split. Each plot shows the modification of a single parameter or architectural block, for better comparison of their effect on performance.

Architecture optimization
Figure 2: Architecture optimization: 1 vs. 2 residual blocks in the bottleneck, and Sum vs. Cat skip connections. We have found the best architecture to consist of CAT skip connections and a single residual block, which amounts to a total of 1.2 million parameters for our base model.

Frame duration optimization
Figure 3: Frame duration optimization: 4.5 ms, 9 ms and 18 ms. We find intermediate frames to be the optimal choice when it comes to accuracy, since shorter histograms do not manage to capture sufficiently long-term time features, and longer histograms are not sufficiently crisp.

Ablation Studies
The plots corresponding to the different ablation studies we carried out on our best model can be found below.

Code availability
All of the code we developed to carry out this study can be found in the GitHub repository https://github.com/J-Cuadrado/OF_EV_SNN.

Figure 1: Example of an input frame for 1 polarity. (A) Event cumulation at each pixel for a given time interval. (B) Top view of the event frame. We can see that events are heavily linked to contours (e.g. a zebra crossing on the bottom part, or the windows on a building on the right side), while regions with constant luminance (e.g. the road or the sky) do not trigger events.

Figure 2: Our proposed network architecture. 3D encoders ensure the incorporation of a temporal context within the model. Downsampling is performed via max pooling to account for spatial spike activity. Each decoding stage is upsampled to contribute to the final network prediction.
(a) Optical flow discontinuities due to vertical artifacts within the visual scene.
(b) The silhouette of the leftmost tree can be perceived on the unmasked optical flow map.
(c) Traffic signs clearly distinguishable on the right side.
(d) Optical flow encoding.

Figure 3: Example predictions of our best architecture on our validation set. For every picture, the leftmost image is the ground truth, the middle image shows our masked estimation (only on valid pixels), and the rightmost image represents the unmasked estimation. (d) represents the chosen colormap for optical flow representation: optical flow is encoded as a Lab image, where the luminance channel represents the absolute magnitude of the flow, and the a and b channels the different directions.

Figure 1: Upsampling techniques comparison. Unlike Bilinear Upsampling, Nearest Neighbor Upsampling guarantees binary tensors after the upsampling operation, and is therefore implementable on neuromorphic hardware.

Figure 4: Accuracy comparison between strided convolutions and strided maximum pooling as downsampling strategies. We show that our approach, where individual spikes carry less importance (one single spike within the kernel region is enough to forward the information), achieves remarkably better results.

Figure 5: Accuracy comparison between our proposed 3d-encoder architecture and an equivalent fully 2-dimensional architecture. We show that explicitly handling the temporal dimension with consecutive convolutions along a temporal axis yields better quality optical flow estimations.

Figure 6: Influence of the angular term in the loss function on the achieved accuracy. We can observe that enforcing this term in the loss evaluation helps the network acquire a better general understanding of the scene, and therefore achieve better quality estimations.

Figure 7: Influence of the loss function on the achieved accuracy (validation set): best loss model (Mod + Ang), norm of the error vector (only error norm) and relative error (error norm divided by the optical flow ground-truth magnitude).

Figure 8: Effect of combining the polarities in a single channel on the final validation performance. We can observe that keeping split polarities yields better performance, although not very significantly in the case of our best architecture.

Figure 9: Influence of the first skip connection (last encoder with first decoder) on the model's performance: the blue curve represents our base model, whereas the yellow curve represents a CAT skip connection linking the encoder's output and the decoder's input.

Table 1: Performance comparison for the different proposed architectures. All of the models have been trained for 35 epochs, using 21 input frames of 9 ms each.

Table 2: Performance comparison. The two best models have been tested for different slight modifications of the architecture, keeping the number of parameters mostly unchanged.

Table 4: Performance comparison on the MVSEC dataset (indoor sequences), showing per-sequence and total average end-point error in pixels per second. Best result in bold, runner-up underlined. Starting from a model pre-trained on DSEC, we show state-of-the-art performance without modifying our pipeline.

Table 5: Performance comparison on the MVSEC dataset (outdoor sequences), showing average end-point error in pixels per second. Best result in bold, runner-up underlined. While far from the top performing contributions, our base pipeline is able to learn to estimate optical flow from scratch, without any optimization to tailor it to the dataset and camera.