Event-Based Gesture Recognition With Dynamic Background Suppression Using Smartphone Computational Capabilities

In this paper, we introduce a framework for dynamic gesture recognition with background suppression that operates on the output of a moving event-based camera. The system is designed to run in real time using only the computational capabilities of a mobile phone. It introduces a new development around the concept of time-surfaces, together with a novel event-based methodology that exploits the high temporal resolution of event-based cameras to dynamically remove backgrounds. To our knowledge, this is the first Android event-based framework for vision-based recognition of dynamic gestures running on a smartphone without off-board processing. We assess performance in several indoor and outdoor scenarios, for static and dynamic conditions, under uncontrolled lighting. We also introduce a new event-based dataset for gesture recognition with static and dynamic backgrounds (made publicly available). The set of gestures was selected following a clinical trial to support human-machine interaction for the visually impaired and older adults. Finally, we report comparisons with prior work on event-based gesture recognition, obtaining comparable results without advanced classification techniques or power-hungry hardware.


Introduction
This paper focuses on the problem of gesture recognition and dynamic background suppression using the output of a neuromorphic asynchronous camera [8,22]. For the first time, this allows operating on the true dynamics of observed scenes, event per event, using only the computational capabilities of a mobile phone, without requiring a connection to off-board resources (Fig. 1). Event-based cameras provide high temporal resolution, with equivalent frame rates on the order of several kHz at low computational power. They allow for a new level of performance in real-time vision, with a drive towards more efficient algorithms. Event-based cameras rely on a new principle that naturally allows all of the information contained in a standard video stream of several megabytes to be compressed into an event stream of a few kilobytes [15,20].
In this paper we introduce a new method for outdoor vision-based gesture recognition in real time using only the computational power of a mobile phone. It features a scalable machine learning architecture that relies on the concept of time-surfaces introduced in [13], extending it to operate more robustly. It also tackles the difficult problem of dynamic background suppression by introducing a novel approach in the temporal domain. Furthermore, we introduce a new dataset of gestures recorded with an event-based camera, made publicly available, as the neuromorphic field still lacks datasets that take full advantage of the precise timing of event-based cameras. Indeed, in most available datasets, such as N-MNIST and N-Caltech101 [18], the dynamical properties of the data are artificially introduced. Even datasets such as Poker Pips [24] do not contain intrinsic dynamical properties that could be used for classification. Compared to previous approaches, we emphasize the importance of using the information carried by the timing of past events to obtain a robust low-level feature representation. Driven by brain-like asynchronous event-based computation, the methodology opens new perspectives for the Internet of Things (IoT): by operating asynchronously and allocating computational resources only to active parts of the network, it achieves lower power consumption and faster response times. This differs from conventional image-based artificial neural networks, which require tremendous off-chip computational resources both for training and inference.

Related Work
Gesture recognition is a quickly expanding area of research [19,23] that currently follows two main streams. The first uses wearable devices that mainly target specific indoor applications, such as special effects, and are unsuited for outdoor use. The second relies on machine learning techniques coupled with several types of sensors. However, resource-constrained devices such as smartphones preclude the use of certain technologies, such as vision-based depth sensors, due to their high energy consumption. This has led to the use of a wide variety of sensors, such as the proximity sensor [5,12], readily available on most smartphones, or even an off-board chip with radio-frequency capabilities as in [11]. Considering vision-based approaches, several techniques have been developed to handle gesture recognition [16], such as orientation histograms [10], hidden Markov models [27], particle filtering [4], support vector machines (SVM) [7] and, more recently, convolutional neural networks that allow featureless methodologies [29]. A vision-based method using only the built-in RGB camera of a smartphone was introduced in [26], but it is limited to static gestures (hand poses), excluding dynamic gestures. To our knowledge, the first gesture recognition system to take advantage of neuromorphic cameras was the stereo-vision setup proposed by [14]. In their work, Leaky Integrate-and-Fire (LIF) neurons correlate space-time events in order to extract the trajectory of the gesture. Another work proposed a motion-based feature [6] that decays depending on the speed of the optical flow, which accounts for varying speeds. IBM Research [1] proposed an end-to-end neuromorphic system running in real time, using a Dynamic Vision Sensor (DVS) connected to a TrueNorth neuromorphic chip that performs the classification with a CNN-based architecture. The authors released a dataset, DvsGesture, which we use in our experiments, obtaining comparable results while being truly event-driven in both learning and inference. This paper also goes beyond existing background suppression methodologies by using the high temporal resolution of event-based cameras, operating on the activity of scenes rather than on frames. This approach drastically contrasts with existing background removal algorithms and does not rely on code-books [9], probabilistic approaches [28], sample-based methods [3], subspace-based techniques [17] or even deep learning [2].

Event-based Cameras and the Event-based Paradigm
The Address Event Representation (AER) neuromorphic camera used in this work is the Asynchronous Time-based Image Sensor (ATIS) (see Fig. 1B) [21]. Each pixel is fully autonomous, independent and asynchronous, and is only triggered by a change in contrast in its own field of view. A pixel emits a visual event when the luminance change exceeds a certain threshold, typically 15% in contrast. The nature of this change is encoded in the polarity p of the visual event, which can be either ON (p = 1) or OFF (p = 0), depending on the sign of the luminance change (see Fig. 2). The ATIS has a high temporal precision, on the order of the millisecond, which allows for the capture of highly dynamical scenes. Furthermore, static scenes produce no output. This results in a low-redundancy, sparse and activity-driven stream of events at the output of such neuromorphic cameras. The k-th visual event e_k of the output stream of the camera can be written as the triplet

e_k = (x_k, t_k, p_k),    (1)

where x_k is the spatial location of the visual event on the focal plane, t_k its time-stamp, and p_k its polarity.
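As a concrete illustration of this representation, the sketch below models the event stream as a list of (x, t, p) triplets; the `Event` type and the toy values are ours, not part of the camera's API.

```python
from collections import namedtuple

# Hypothetical minimal representation of the ATIS output stream: each visual
# event e_k is the triplet of position, timestamp and polarity from Eq. (1).
Event = namedtuple("Event", ["x", "y", "t", "p"])  # position (px), time (us), polarity

# A toy stream: two ON events and one OFF event.
stream = [
    Event(x=12, y=40, t=1_000, p=1),   # ON  (luminance increase)
    Event(x=13, y=40, t=1_250, p=1),   # ON
    Event(x=12, y=41, t=1_400, p=0),   # OFF (luminance decrease)
]

# Events arrive sorted by timestamp and can be filtered by polarity.
on_events = [e for e in stream if e.p == 1]
```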

Dynamic Background Suppression
The Dynamic Background Suppression (DBS) aims to remove visual events that are not part of the useful signal, such as background objects. We exploit the native high temporal resolution of event-based cameras, which allows gestures, being closer to the camera, to generate a higher density of events than background objects.
The pixel array is divided into a grid of cells. Each cell c contains several pixels and has its own activity counter, noted A_c. At each event e_k^c emitted by a pixel contained in cell c, we update A_c as

A_c ← A_c · exp(−(t_k^c − t^c)/τ_b) + 1,    (2)

where t_k^c is the time-stamp of the current event e_k^c, t^c the last time a pixel spiked in cell c, and τ_b a time constant set in regard to the pixel-array spike rate. The average activity A of all cells is computed, and only the events in cells with A_c ≥ αA (with α a scalar tuning the aggressiveness of the filter) are propagated to the recognition module. This last stage prevents cells with a low spike density, which are considered background, from emitting events. An example of the cascade operating on data from the NavGestures-walk dataset is shown in Fig. 4.
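A minimal sketch of how this cell-based filter could be implemented. The class and parameter names are ours, and the exponential decay of the activity counter is our reading of the update described above; treat it as an illustration rather than the authors' implementation.

```python
import math

class DynamicBackgroundSuppression:
    """Sketch of the cell-based DBS: each grid cell keeps an activity counter
    that decays with time constant tau_b, and an event is propagated only if
    its cell's activity is at least alpha times the mean over all cells."""

    def __init__(self, width, height, grid=3, tau_b=300.0, alpha=2.0):
        self.grid = grid
        self.cw = width / grid           # cell width in pixels
        self.ch = height / grid          # cell height in pixels
        self.tau_b = tau_b               # decay time constant (us)
        self.alpha = alpha               # aggressiveness of the filter
        self.activity = [0.0] * (grid * grid)
        self.last_t = [0.0] * (grid * grid)

    def __call__(self, x, y, t):
        """Update the activity of the cell hit by event (x, y, t);
        return True if the event is kept (foreground)."""
        c = int(y // self.ch) * self.grid + int(x // self.cw)
        # decay since the cell's last event, then increment for this event
        dt = t - self.last_t[c]
        self.activity[c] = self.activity[c] * math.exp(-dt / self.tau_b) + 1.0
        self.last_t[c] = t
        mean_a = sum(self.activity) / len(self.activity)
        return self.activity[c] >= self.alpha * mean_a

# Dense gesture activity concentrated in one cell is kept, while a lone
# background event elsewhere falls below the adaptive threshold.
dbs = DynamicBackgroundSuppression(width=304, height=240)
kept = [dbs(10, 10, float(t)) for t in range(0, 3000, 50)]
rejected = dbs(300, 230, 3000.0)
```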

Time-surfaces as spatio-temporal descriptors
A time-surface [13] is a descriptor of the spatio-temporal neighborhood of an event e_k. We first define the time-context T_k(u, p) of the event e_k as a map of time differences between the time-stamp of the current event and the time-stamps of the most recent events in its spatial neighborhood. This (2R + 1) × (2R + 1) map is centered on e_k, of spatial coordinates x_k. The time-context can be expressed as

T_k(u, p) = t_k − t_last(x_k + u, p),    (3)

where t_last(x, p) is the time-stamp of the most recent event of polarity p at pixel x, and u ∈ [−R, R] × [−R, R]. Finally, we obtain the time-surface S_k(u, p) associated with the event e_k by applying a linear decay kernel of time constant τ to the time-context T_k:

S_k(u, p) = max(0, 1 − T_k(u, p)/τ).    (4)

This gives a low-level representation of a local spatio-temporal neighborhood. However, as a time-surface is computed for each new incoming event, overlapping time-surfaces are computed several times, wasting resources. To limit this effect, time-surfaces are discarded if they do not contain sufficient information, as this information will be part of a later time-surface as soon as a new event is emitted in the spatio-temporal neighborhood. For a time-surface to be considered valid, it must satisfy the constraint

card{u : S_k(u, p) > 0} ≥ 2R,    (5)

where R is the radius (half-width) of the time-surface. As the event-based camera performs native contour extraction, a contour crossing the neighborhood provides enough events to form a valid descriptor.
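The computation above can be sketched as follows. The array layout (one map of last timestamps per polarity) and the helper names are ours; the illustrative values of R and τ are not taken from the paper.

```python
import numpy as np

R, TAU = 2, 20_000.0  # radius and linear-decay time constant (us), illustrative

def time_surface(last_t, x, y, t, p, R=R, tau=TAU):
    """Linear-decay time-surface around event (x, y, t, p).
    `last_t[p]` holds, per polarity, the timestamp of the most recent event
    at every pixel; -inf marks pixels that have never fired."""
    patch = last_t[p, y - R:y + R + 1, x - R:x + R + 1]   # (2R+1) x (2R+1)
    return np.maximum(0.0, 1.0 - (t - patch) / tau)       # values in [0, 1]

def is_valid(surface, R=R):
    """Validity check: enough pixels must carry recent activity."""
    return np.count_nonzero(surface) >= 2 * R

# Toy scene: a short vertical ON contour fired 1 ms before the current event.
last_t = np.full((2, 240, 304), -np.inf)   # (polarity, height, width)
for yy in range(48, 53):
    last_t[1, yy, 100] = 9_000.0
s = time_surface(last_t, x=100, y=50, t=10_000.0, p=1)
```

Pixels that never fired decay to 0, so the surface encodes both the contour's shape and how recently it moved.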

Event-based hierarchical network
The visual events from the event-based camera are fed to a network composed of several layers integrating information over increasing temporal scales. As information flows through the network, only the polarities of events are updated, becoming new feature planes. Polarities in the network correspond to learned patterns, or elementary features, at a given temporal and spatial scale. As events can be discarded, the network output stream usually contains fewer events than the input stream, an important property that builds on the natively low output of the event-based camera to lower the computational cost.

Learning prototypes
An iterative online clustering method is used to learn the patterns (hereinafter called prototypes), as it allows events to be processed as they are received, in an event-based manner. First, a set of N time-surface prototypes C_n, with n ∈ [0, N − 1], is created. The C_n are initialized simply using the first N time-surfaces obtained from the stream of events. Then, for each incoming event e_k, we compute its associated time-surface S_k. Using the Euclidean (L2) distance, we find the closest matching prototype C_i in the bank, which we update with S_k using the rule

C_i ← C_i + α_i (S_k − C_i),    (6)

with α_i the current learning rate of C_i, defined as

α_i = 1 / (1 + A_i),    (7)

where A_i is the number of time-surfaces already assigned to C_i. If a prototype C_i has not been triggered by any of the last time-surfaces, it is re-initialized and forced to learn a new pattern. This prevents badly initialized prototypes from staying unused, and helps them converge to meaningful representations while maintaining always-on learning capabilities. It is important to emphasize that, compared to the original model [13], the linear decay (less computationally expensive than the original exponential), combined with the heuristic that suppresses the systematic computation of time-surfaces, allows for a massive reduction in computational cost.
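A compact sketch of this online clustering step. The decreasing learning rate α_i = 1/(1 + A_i) is our reading of the rule above, and the variable names are ours; prototypes are shown as flattened vectors for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 4, 25                      # number of prototypes, flattened surface size
prototypes = rng.random((N, D))   # in practice: the first N time-surfaces
counts = np.zeros(N)              # A_i: time-surfaces assigned to each prototype

def update(surface):
    """Assign `surface` to the closest prototype (L2 distance), move that
    prototype toward it with a decreasing learning rate, return its index."""
    i = int(np.argmin(np.linalg.norm(prototypes - surface, axis=1)))
    counts[i] += 1
    alpha = 1.0 / (1.0 + counts[i])            # assumed learning-rate schedule
    prototypes[i] += alpha * (surface - prototypes[i])
    return i

# Feeding the same surface repeatedly pulls one prototype onto it, while the
# others stay untouched -- the behavior the online rule is designed to give.
target = np.ones(D)
for _ in range(200):
    winner = update(target)
```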

Building the network
The set of prototypes can be organized in a hierarchical manner (a set is then called a layer) in order to form a network (see Fig. 6). These layers can have different numbers of prototypes N, radii R (which correspond to a neuron's receptive field) and time constants τ.
The stimulus is presented to the event-based camera (Fig. 6A), which outputs a stream of visual events. A given event e_m of the stream must go through all the layers before the next one, e_m+1, is processed. At each layer, the time-surface S_m associated with e_m is computed (see Fig. 6B), using the kernel previously introduced in Eq. (4), with time constant τ and a spatial receptive field of side length (2R + 1). If S_m satisfies Eq. (5), we update the closest prototype C_c using Eq. (6) (see Fig. 6B), and the polarity p_m of e_m is modified so that p_m = c, c being the ID of the matching prototype. The polarity now encodes a pattern, and we speak of pattern events instead of visual events, for which the polarity corresponds to a luminance change. The pattern event is then fed to the second layer and processed in a similar manner. The second layer combines patterns from the first layer, so its prototypes (and the corresponding polarities) encode more sophisticated patterns; it is therefore able to encode changes of direction in the motion. Once the full network has been trained, meaning that its time-surface prototypes have converged, learning is disabled: prototypes are no longer updated using Eq. (6). The network can now serve as a feature extractor: the polarities of events output by the network are used as features for classification.

Figure 6: (A) A stimulus is presented in front of a neuromorphic camera, which encodes it as a stream of events. (B) A time-surface can be extracted from this stream. (C) This time-surface is matched against known patterns, which are also time-surfaces and can be used as features for classification.
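The per-event flow through the layer cascade can be sketched as below. Each layer is abstracted as a callable that either rewrites the event's polarity to a prototype ID or drops the event (the matching itself is the clustering step described earlier); the function and variable names are ours.

```python
def run_network(events, layers):
    """Push each event (x, y, t, p) through the layer cascade.
    Each layer maps (x, y, t, p) to a prototype ID, or to None when the
    time-surface fails the validity check and the event is discarded."""
    out = []
    for (x, y, t, p) in events:
        for match in layers:
            p = match(x, y, t, p)   # polarity now encodes a pattern ID
            if p is None:
                break               # invalid time-surface: drop the event
        if p is not None:
            out.append((x, y, t, p))
    return out

# Toy layers: the first maps every event to pattern 0, the second keeps only
# even timestamps -- illustrating that the output stream shrinks layer by layer.
layer1 = lambda x, y, t, p: 0
layer2 = lambda x, y, t, p: (1 if t % 2 == 0 else None)
evs = [(0, 0, t, 1) for t in range(6)]
out = run_network(evs, [layer1, layer2])
```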

Datasets
We used four datasets, all recorded with a neuromorphic camera: the Faces dataset [13], DvsGesture [1] and two novel datasets, NavGestures-sit and NavGestures-walk, tailored to facilitate the use of a smartphone by the elderly and the visually impaired. The NavGestures datasets are publicly available at [url hidden during reviewing for anonymity purposes].

NavGestures-sit and NavGestures-walk Datasets
The NavGestures-sit dataset was designed for operating a smartphone with mid-air gestures. The gesture dictionary contains only 6 gestures in order to be easily memorable, determined as the most elementary set sufficient to operate a mobile phone. Four of them are "sweeping" gestures: Right, Left, Up, Down, designed to navigate through the items of a menu. The Home gesture, a "hello"-waving hand, can be used to go back to the main menu or to obtain help. Lastly, the Select gesture, executed only with the fingers, closing them as a claw in front of the device and then reopening them, is used to select an item. The dataset features 35 subjects: 12 visually-impaired subjects, with conditions ranging from 1 to 4/5 on the WHO blindness scale, and 23 people from the laboratory. The gestures were recorded in real use conditions, with the subject sitting and holding the phone in one hand while performing the gesture with the other hand. Some of the subjects were shown video clips of the gestures to perform, while others had only an audio description of the gesture. This induced very noticeable differences in the way each subject performed the proposed gestures, in terms of hand shape, trajectory, motion and angle, but also in terms of camera pose. Each subject performed 10 repetitions of the 6 gestures. All the gesture clips were then manually labelled and segmented. We removed clips with a wrong field of view, wrongly executed gestures, or device-related capturing issues. The manually curated dataset contains 1,621 clips. The NavGestures-walk dataset contains the same 6 gestures. The main difference is that the users walked through an urban environment while holding the phone in one hand and performing the gestures with the other. The dataset features 10 people from the laboratory, each performing the 6 gestures several times. It was recorded in uncontrolled lighting conditions, both indoors in the laboratory and outdoors in the nearby streets.

DvsGesture and Faces Dataset
IBM Research released a 10-class dataset [1] (plus a rejection class with random gestures) of hand and arm gestures, performed by 29 subjects under 3 different lighting conditions. The camera is mounted on a stand and the subjects stand still in front of it; the database therefore lacks the dynamic backgrounds at the core of our work, but provides valuable grounds for comparison. The authors split the dataset into a train set of 23 subjects and a test set of the 6 remaining subjects. The Faces dataset [13] contains clips of the faces of 7 subjects recorded with an event-based camera. Each subject made 24 recordings, resulting in 168 clips. The subjects moved their face while following a dot on a computer screen in a square movement. The dynamics are therefore the same for all subjects and do not carry any meaningful information for the classification task. The Faces dataset does not come with a proposed train/test split. This allows us to perform cross-validation (10 random shuffles of train and test subsets) to ensure that the results are robust. As in the original paper, we put 5 examples per subject in the train subset and 19 in the test subset.

Experiments and Results
In the following experiments we did not take the polarity of visual events into account: we considered that only the illuminance change carries information for these classification tasks, not whether the illuminance increased or decreased. This is because the same gesture can generate either ON or OFF events depending on the skin color, the clothing color or the background. For all classification tasks, the output of the end layer is integrated over time to generate a histogram of activity per feature.
This histogram can then be used as a dynamic signature of the observed stimulus and fed to a classifier (here a nearest neighbor). More sophisticated classifiers could be used; we chose a simple methodology partly to save power, but mostly to show that the extracted features strongly capture the essence of the dynamic signature of the recorded gestures.
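This classification stage can be sketched as follows: integrate the end layer's pattern events into a per-feature histogram, then match it against training histograms with a nearest-neighbor rule. The function names, toy gestures and feature count are ours, for illustration only.

```python
import numpy as np

N_FEATURES = 8  # illustrative number of end-layer prototypes

def signature(pattern_events, n=N_FEATURES):
    """Normalized histogram of output polarities (pattern IDs)."""
    h = np.bincount([p for (*_, p) in pattern_events], minlength=n).astype(float)
    return h / max(h.sum(), 1.0)

def classify(sig, train_sigs, train_labels):
    """Nearest-neighbor classification on activity histograms."""
    d = np.linalg.norm(train_sigs - sig, axis=1)
    return train_labels[int(np.argmin(d))]

# Toy example: two "gestures" exciting disjoint subsets of features.
swipe = [(0, 0, t, t % 2) for t in range(100)]       # features 0 and 1
home  = [(0, 0, t, 4 + t % 2) for t in range(100)]   # features 4 and 5
train = np.stack([signature(swipe), signature(home)])
labels = ["swipe", "home"]

query = [(0, 0, t, t % 2) for t in range(40)]        # shorter swipe-like clip
pred = classify(signature(query), train, labels)
```

Normalizing the histogram makes the signature invariant to clip length, which is why the shorter query still matches its class.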

Removing the background on the NavGestures datasets
Although subjects are sitting in NavGestures-sit, they hold the phone in their hand, which results in movements and unwanted jitter that generate background activity.
In the case of NavGestures-walk, the visual background is even more present, as subjects were walking while recording. Figure 4 illustrates the use of the Dynamic Background Suppression (DBS). Table 1 reports the mean percentage of events left for each gesture class after removing the background. Note that we did not use the DBS on the DvsGesture dataset: it was recorded with a static camera and a static background, so there is no background to remove. The following parameters were used for the DBS:
• τ_b = 300 µs
• α = 2
• grid size: 3 × 3

Results on the gesture datasets
In our experiments on the gesture datasets (E4 to E11) we tried both 1-layer and 2-layer networks, and evaluated the benefit of the Dynamic Background Suppression on the recognition rate. Two-layer networks perform better, as they can handle changes in direction. The Dynamic Background Suppression also greatly improves the recognition rate, as demonstrated on NavGestures-walk, where it increases the score from 81.3% to 92.6%.
Regarding the DvsGesture dataset, we use the same 2-layer network architecture. The only difference is that we increased the number of prototypes in the second layer: the gestures are more complex, and more prototypes in the end layer provide more discriminative power. We also took the spatial component of gestures into account. This is possible because the clips of the DvsGesture dataset all have the same framing. We split the pixel array into sub-regions using a 3 × 3 grid; the final feature is hence a histogram of size 3 × 3 × 64 = 576. Classification used a nearest-neighbor classifier on the histograms. One can observe in Table 3 that the system performs in the same range of precision as [1], while being lighter to implement and compute.

Gesture recognition on the smartphone
The whole system, made of the DBS, a 1-layer feature extractor and the recognition module, is implemented on a mobile phone, a Samsung GM-920F, as native C++ code. The event-based camera is plugged directly into the micro-USB port of the phone (see Fig. 1). This prototype was briefly tested by visually-impaired end-users in real use conditions. They were asked to perform certain tasks using the phone, such as sending a pre-written message or playing a song. Results of these pre-tests can be found in Fig. 7. It is important to emphasize that some gestures require a longer execution time, because they generate many more visual events and thus require more computation. This is one of the properties of being scene-driven.

Results on the Faces dataset
Using a single-layer network with receptive field R = 6, N = 32 prototypes and τ = 5 ms, the aim is to push the system to its limit and determine whether a single layer is enough to capture the static properties of this dataset. We obtain a 96.6% recognition score, whereas the original model in [13] performed at 79% using a three-layer architecture whose end layer had the same number N = 32 of prototypes. When increasing the number of prototypes to N = 48 and N = 64, we achieved respectively 97.9% and 98.5% average recognition rate. We also noticed that increasing τ beyond 5 ms was not beneficial and decreased classification accuracy. The data properties in this dataset are static: the dynamics do not carry any meaningful information for classifying the faces. This shows that the numerous modifications we introduced into the model lead to an important improvement in extracting static properties.

Figure 7: Results obtained with visually-impaired end-users in early tests of the prototype. Users were asked to perform different scenarios, such as sending a pre-written message. The average correct classification is 78%; this score is however heavily impacted by the poor score of the "select" gesture, which is performed in very different ways among users, something we had already noticed during the dataset creation.
Additional material provides videos of the Dynamic Background Suppression at work and of live gesture recognition on the smartphone.

Discussion
In this work, we presented a system that recognizes gestures using a smartphone's computational capabilities. We also drastically improved the hierarchical network proposed in [13], both for static and dynamic data. The system and methodology make it possible to truly understand what the network is computing, rather than relying on the conventional black-box approach. We can report that the first layers, operating on shorter time scales, extract oriented contours and directions, while the second layers encode changes of direction of the same feature. Deeper networks could theoretically encode multiple changes in direction, but given the nature of gestures and the task to be performed, there is no need for such networks. We can assess that a 2-layer network is sufficient to handle efficiently any of the considered databases. This is truly the advantage of using time-surfaces, which encode both spatial and temporal information in a compact representation. The system also relies on a very small number of meta-parameters to tune: we did not require long parameter-adjustment processes for any of the considered databases. Once the parameters of the network match those of the observed object, the same set applies regardless of the dataset, and we were able to use the same parameters for all the gesture datasets while obtaining state-of-the-art accuracy on the DvsGesture classification task. We believe this is the first time that time-surfaces have been used to their true potential. Indeed, in previous work such as HOTS [13] or HATS [25], the decay times were set to values thousands of times larger than the duration of the stimulus. This resulted in time-surfaces that acted as binary frames instead of truly encoding the dynamics of the scene. This comes as no surprise: time scales uncorrelated with the dynamics of the observed scene provide little information and therefore poor recognition rates. Finally, the Dynamic Background Suppression plays a very important role in achieving high recognition rates in a walking situation.

Figure 1 :
Figure 1: A neuromorphic camera (an ATIS) (B) is plugged into a smartphone (A) using a USB link (C), allowing mid-air gesture navigation on the smartphone.

Figure 2 :
Figure 2: Principle of operation of the neuromorphic camera used in this work. (A) When the change in illuminance of a given pixel's field of view exceeds a certain threshold, (B) the pixel emits a visual event, which is either "ON" or "OFF" depending on the sign of the change. (C) A given pixel responds asynchronously to the visual stimuli in its own field of view.

Figure 3 :
Figure 3: Operating principle of the Dynamic Background Suppression (DBS). (A) A gesture is performed in front of the camera, whose pixel array is divided into cells. (B) Each cell has its own activity counter, which decays over time. (C) Only cells whose activity is greater than the mean activity of all cells (black dashes) can spike.

Figure 4 :
Figure 4: Denoising example of a gesture clip from the NavGestures-walk dataset. The presented gesture is a "swipe down". The top row is the raw stream of visual events; the bottom row is the denoised stream at the output of the 3rd stage of the cascade presented in this paper. Each snapshot in the top row is made of 10,000 events; the bottom row contains only the kept events among those 10,000. "ON" events are orange, "OFF" events are black. The filtering led to the removal of 83.8% of all events. Even after removing this many events, each gesture is still easily recognizable by the human eye.

Figure 5 :
Figure 5: (A) A moving vertical bar is presented to the event-based camera, which outputs a stream of visual events. The edges of the bar produce ON (white) and OFF (black) events. A ROI is defined around the current event (blue square). (B) The time-stamps of the visual events contained in the ROI are decayed using a linear kernel. (C) The resulting time-surface encodes both the contour orientation and the dynamics of the motion.

Table 1 :
Mean percentage of events left after the Dynamic Background Suppression, for each gesture class.

Table 2 :
Details of the experiments conducted on the different datasets.

Table 3 :
Comparison of the classification accuracy on event-based datasets.