
Edited by: Emmanuel Michael Drakakis, Imperial College London, UK

Reviewed by: Anton Civit, University of Seville, Spain; Joaquin Sitte, Queensland University of Technology, Australia

*Correspondence: Heiko Neumann, Faculty of Engineering and Computer Science, Institute of Neural Information Processing, Ulm University, James-Franck-Ring, Ulm 89081, Germany

This article was submitted to Neuromorphic Engineering, a section of the journal Frontiers in Neuroscience

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Event-based sensing, i.e., the asynchronous detection of luminance changes, promises low energy consumption, high dynamic range, and sparse sensing. This stands in contrast to whole-image frame-wise acquisition by standard cameras. Here, we systematically investigate the implications of event-based sensing in the context of visual motion, or flow, estimation. Starting from a common theoretical foundation, we discuss different principal approaches for optical flow detection, ranging from gradient-based methods, over plane-fitting, to filter-based methods, and identify strengths and weaknesses of each class. Gradient-based methods for local motion integration are shown to suffer from the sparse encoding in address-event representations (AER). Approaches exploiting the local plane-like structure of the event cloud, on the other hand, are shown to be well suited. Within this class, filter-based approaches are shown to define a proper detection scheme which can also deal with the problem of representing multiple motions at a single location (motion transparency). A novel biologically inspired efficient motion detector is proposed, analyzed, and experimentally validated. Furthermore, a stage of surround normalization is incorporated. Together with the filtering, this defines a canonical circuit for motion feature detection. The theoretical analysis shows that such an integrated circuit reduces motion ambiguity in addition to decorrelating the representation of motion-related activations.

The initial stages of visual processing extract a vocabulary of relevant feature items related to a visual scene. Rays of light reach the observer's eye and are transformed to internal representations. This can be formalized as sampling the ambient optic array (Gibson, ). The plenoptic function P_{λ, Vx, Vy, Vz}(θ, φ, t) describes the intensity of a light ray of wavelength λ passing through the center of the pupil of an idealized eye at every possible angle (θ, φ), located at the position (V_x, V_y, V_z), at time t.

We will focus on silicon retinas that generate an AER, namely the dynamic vision sensor (DVS; Delbrück and Liu, ). Events e_k ∈ {−1, 1} are generated at times t_k that emulate spike sequences of on- and off-contrast cells in the retina, respectively (Figure

We describe the stream of events by the function
e(x_k; t_k) = e_k, where (x_k; t_k) define the location and time of an event. The function generates 1 if the log-luminance changed by more than a threshold ϑ, i.e., an ON event, and −1 if it changed by more than ϑ in the negative direction, i.e., an OFF event. This sampling of the lightfield essentially represents the temporal derivative of the luminance function.

To estimate local translatory motion we assume throughout the paper that the gray-level function remains constant within a small neighborhood in space and time. A Taylor expansion yields g(x + Δx) ≈ g(x) + Δx^T ∇_3 g + 1/2 Δx^T H_3 Δx, where ∇_3 g = (g_x, g_y, g_t)^T is the gradient with the 1st-order partial derivatives of the continuous gray-level function, and H_3 denotes the Hessian with the 2nd-order partial derivatives of the continuous gray-level function. To first order, gray-level constancy implies the motion constraint equation u g_x + v g_y + g_t = 0, given that Δx^T = (u, v, 1) Δt holds. Consider, for example, a vertically oriented edge (g_y = 0). Then the motion can be estimated along the horizontal directions (left or right with respect to the tangent orientation of the contrast edge). When the edge contrast polarity is known (light-dark, LD, g_x < 0, or dark-light, DL, g_x > 0) the spatio-temporal movement can be estimated without ambiguity. For a DL edge, if g_t < 0 the edge moves to the right, while for g_t > 0 the edge moves to the left (c.f. Figure

For an LD edge the sign of the temporal derivative g_t changes for the respective movement directions as well, i.e., only the ratio of gray-level derivatives yields a unique direction selector orthogonal to the oriented luminance contrast. This means that sgn(g_x/g_t) = −1 implies rightward motion while sgn(g_x/g_t) = 1 implies leftward motion, irrespective of the contrast polarity. Note, however, that an estimate of g_x is not easily accessible from the stream of events of an asynchronous event sensor. Thus, a key question is to what extent the required spatio-temporal derivative information is available and can be estimated.
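The sign relation above (g_t = −u · g_x for a 1-D translating edge) can be sketched as a tiny decision rule. This is an illustrative sketch, not part of the original model; the function name is our own:

```python
import numpy as np

def edge_motion_direction(g_x_sign, g_t_sign):
    """Direction of a 1-D moving edge from the signs of the spatial and
    temporal derivatives, using the relation g_t = -u * g_x.
    Returns +1 for rightward (u > 0), -1 for leftward (u < 0) motion."""
    # sgn(g_x / g_t) = -1  =>  u > 0 (rightward), irrespective of polarity
    return -int(np.sign(g_x_sign * g_t_sign))

# A dark-light edge (g_x > 0) moving right produces g_t < 0:
assert edge_motion_direction(+1, -1) == +1   # rightward
# A light-dark edge (g_x < 0) moving right produces g_t > 0:
assert edge_motion_direction(-1, +1) == +1   # rightward
```

Note that the rule only needs the ratio of the derivative signs, which is exactly why the contrast polarity cancels out.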

We describe the luminance function as a blurred step edge, c_σ(x) = c_0 + (G_σ * s)(x), with s the step function, c_0 the basic luminance level, and "*" denoting the convolution operator (since we only study the derivatives, we assume c_0 = 0). The parameter σ controls the spatial blur of the luminance edge, with σ → 0 resulting in the step function. Different contrast polarities are defined by c^{DL}_σ(x) = c_σ(x) and c^{LD}_σ(x) = c_σ(−x).

When this gray-level transition moves through the origin at time t = 0, it travels with velocity v = (u, v)^T. The relevant quantity is the normal speed s_⊥ = ‖v_⊥‖, i.e., the speed orthogonal to the oriented edge.

Now, recall that the event-based DVS sensor provides an estimate of g_t at a specific location [c.f. Equation (2)]. For a moving contrast profile this leads to a changing luminance function along the temporal domain, which can be related to the spatial derivative g_x as

In sum, the temporal edge transition can be reconstructed in principle from a (uniform) event sequence at the edge location for a specific motion direction, given that

a reliable speed estimate is available to infer a robust value for θ, and

reliable estimates of temporal changes have been generated as an event cloud over an appropriately scaled temporal integration window Δ_{t}.

Note, that both parameters, θ and Δ_{t}, need to be precisely estimated to accomplish robust estimates of contrast information of the luminance edge. In Sections 2.1.4 and 2.1.5, we will briefly outline the necessary steps in such an estimation process. Alternatively, one can try to directly estimate the partial derivatives used in the motion constraint equation from the stream of events. The construction of this approach and its related problems are described in the following Section 2.1.3.

The local spatio-temporal movement of a gray-level function can be estimated by least-squares optimization from a set of local contrast measurements which define intersecting motion constraint lines in velocity space (Lucas and Kanade, ). A straightforward approach is to evaluate the motion constraint u g_x + v g_y + g_t = 0 from event sequences generated by the DVS. Events, however, only encode information about the temporal derivative g_t [c.f. Equation (2)]. Thus, without additional information it is impossible to reliably estimate g_x or g_y, as outlined in the previous Section 2.1.2. The temporal derivative of a translatory moving gray-level patch, however, generates a unique response in g_t. Thus, we can apply the motion constraint equation to the function Φ, i.e., u Φ_x + v Φ_y + Φ_t = 0, instead. Using two temporal windows W_{−2} and W_{−1}, Φ_t can be approximated, for example, by a backward difference, with ϑ denoting the event-generation threshold. The spatial derivatives Φ_x and Φ_y can be approximated by central difference kernels [−1, 0, 1] and [−1, 0, 1]^T, respectively. These can be applied to the function accumulated over the union of both windows (W_{−2} ∪ W_{−1}).
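A minimal sketch of this least-squares scheme applied to the accumulated event function is given below. The window construction and the omission of the threshold scaling ϑ are our simplifying assumptions; the function names are hypothetical:

```python
import numpy as np

def lk_flow_from_events(Phi_prev, Phi_curr, dt=1.0):
    """Lucas-Kanade style least-squares flow estimate applied to the
    accumulated event function Phi instead of the gray-level image.
    Phi_prev, Phi_curr: 2-D arrays of summed event polarities over the
    two temporal windows W_-2 and W_-1 (illustrative sketch)."""
    Phi = Phi_prev + Phi_curr            # union of both temporal windows
    Phi_x = np.gradient(Phi, axis=1)     # central differences in x
    Phi_y = np.gradient(Phi, axis=0)     # central differences in y
    Phi_t = (Phi_curr - Phi_prev) / dt   # backward difference in time
    # Stack the constraints u*Phi_x + v*Phi_y + Phi_t = 0 over the patch
    A = np.stack([Phi_x.ravel(), Phi_y.ravel()], axis=1)
    b = -Phi_t.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v
```

For a vertical edge of ON events shifting one pixel per time step, the scheme recovers a horizontal velocity of roughly one pixel per step, illustrating the principle; with realistic, sparse event counts the estimate becomes noisy, as discussed next.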

Consequently, the resulting flow computation results in a sparsification of responses, since stationary edges do not generate events and will therefore not be represented.

Note, however, that this approach has multiple issues regarding any real implementation. The most important observation is that when a luminance edge passes a pixel's receptive field of the DVS sensor, the number of events is only on the order of ten (often even less, depending on the contrast, speed and luminance conditions; c.f. zoomed display of the event cloud in Figure ). Such a small number of samples leads to considerable noise in the estimates of Φ_x, Φ_y and especially Φ_t (since the latter now represents the second temporal derivative of the original gray-level function). Moreover, Φ_t can only be estimated accurately if the temporal windows are small enough such that the gray-level edge has not already passed through the receptive field of a target cell at the given position within the temporal windows.

The short temporal window in which events of a briefly passing contrast edge are generated makes it difficult to reliably estimate the derivatives required in the motion constraint equation (c.f. previous section). An alternative approach is to consider the distribution of events (the "event cloud") in a small volume of the (x, y, t)-space. A function Σ_e : ℕ^2 → ℝ is defined that maps the location x of an event to its time stamp, Σ_e(x) = t. Locally, the event cloud forms a tilted plane that is spanned by the velocity vector (u, v, 1)^T (defined in homogeneous coordinates), the orientation of the moving luminance edge (o_x, o_y, 0)^T, and the normal vector n of the plane. These three vectors form an orthogonal system that spans the (x, y, t)-space.

The resulting velocity components (u, v)^T can then be obtained directly from the parameters of the fitted plane.
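A sketch of this plane-fitting readout follows, under the assumption of a single translating edge. Fitting the arrival-time surface t = p·x + q·y + r, the gradient (p, q) points in the motion direction and its magnitude is the inverse speed, so the velocity is (p, q)/(p² + q²). The function name is our own:

```python
import numpy as np

def velocity_from_plane(events_xyt):
    """Fit a plane t = p*x + q*y + r to an event cloud (N x 3 array of
    (x, y, t) rows) and recover the velocity of the moving edge.
    Illustrative sketch assuming a single local translatory motion."""
    x, y, t = events_xyt[:, 0], events_xyt[:, 1], events_xyt[:, 2]
    A = np.stack([x, y, np.ones_like(x)], axis=1)
    (p, q, r), *_ = np.linalg.lstsq(A, t, rcond=None)
    # gradient of the arrival-time surface: |grad| = 1 / speed
    return np.array([p, q]) / (p * p + q * q)
```

For an edge crossing pixel column x at time t = x/2 (i.e., moving rightward at 2 px per time unit), the fit returns approximately (2, 0), and it does so even for very few events, which is the main advantage over the numerical-derivative scheme above.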

As an alternative to considering the LS regression in estimating the velocity tangent plane from the cloud of events, the uncertainty of the event detection might be incorporated directly. At each location, detected events define likelihood distributions p(e_i | v_j) (for arbitrary velocities v_j), from which one can derive an estimate v_est of the movement that caused event e_i.

Thus, we can estimate the velocity from the responses _{i}),

In this section, we define spatio-temporal filters that are fitted to the physiological findings from De Valois et al. (

Our filter design essentially reverses the decomposition of neural responses conducted by De Valois et al. (

This main observation led us to propose a family of spatio-temporally direction-selective filters, as illustrated in Figure , combining odd and even spatial profiles (G_odd and G_even) with mono-/bi-phasic temporal profiles (h_mono and h_bi). The details of the construction process are outlined in the following sections.

To construct the spatial component of the spatio-temporal filters illustrated in Figure , we use a complex Gabor function, i.e., a Gaussian envelope (with a standard deviation σ in local space) modulated at the spatial frequency (f^0_x, f^0_y) (c.f. Figure ). In the Fourier domain, (f^0_x, f^0_y) defines the shift of the Gaussian envelope with respect to the origin. This defines the two components G_odd = ℑ(G_{σ, f0x, f0y}) and G_even = ℜ(G_{σ, f0x, f0y}) used to construct the filters as described in Section 2.2.1 (compare with Daugman,

f_0 = 0.08.
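The even/odd spatial pair can be generated from a complex Gabor as sketched below. The parameter values follow the text (σ = 25, |f_0| = 0.08); the kernel size and the equal split of the frequency over both axes are our assumptions:

```python
import numpy as np

def gabor_components(size=65, sigma=25.0,
                     f0x=0.08 / np.sqrt(2), f0y=0.08 / np.sqrt(2)):
    """Complex Gabor: Gaussian envelope (std sigma) modulated by a complex
    exponential at spatial frequency (f0x, f0y). Returns the even (real)
    and odd (imaginary) components of the filter (illustrative sketch)."""
    half = size // 2
    x, y = np.meshgrid(np.arange(-half, half + 1),
                       np.arange(-half, half + 1))
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    carrier = np.exp(2j * np.pi * (f0x * x + f0y * y))
    g = envelope * carrier
    return np.real(g), np.imag(g)   # G_even, G_odd
```

By construction, G_even is point-symmetric and G_odd is point-antisymmetric about the kernel center, which is the quadrature property exploited in the filter combination.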

The second component required in the spatio-temporal filter generation process illustrated in Figure are the temporal kernels h_mono and h_bi. To fit the experimental data of De Valois et al. (

μ_{bi1} = 0.2, with the remaining parameters fitted to the experimental data (see text for details). The dashed line highlights that the peak of the mono-phasic kernel (green) is located at the zero-crossing of the bi-phasic kernel (blue).

When the experimental findings are incorporated, it is only necessary to choose a value for μ_{bi1}. All other parameters can be inferred according to the experimental data from De Valois et al. (

The bi-phasic scaling factors s_1 and s_2 are adapted to the minimum and maximum values of the experimental data relative to the maximum value of the mono-phasic kernel (which is one), i.e., s_1 = 1/2 and s_2 = 3/4.

A good fit with the experimental data reported in De Valois et al. ( ) is obtained by setting μ_{bi2} = 2μ_{bi1}.

The standard deviations σ_{mono} and σ_{bi1} are chosen such that the Gaussians are almost zero for t ≤ 0, i.e., σ_{mono} = μ_{mono}/3 and σ_{bi1} = μ_{bi1}/3 (3σ-rule; 99.7% of the values lie within three standard deviations of the mean in a normal distribution).

The standard deviation of the second Gaussian of the bi-phasic kernel is about 3/2 of that of the first, i.e., σ_{bi2} = (3/2) σ_{bi1}.

The mean of the mono-phasic kernel μ_{mono} is given by the zero-crossing of the bi-phasic kernel, i.e., h_{bi}(μ_{mono}) = 0.
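The parameter rules above can be collected into a small construction routine. The difference-of-Gaussians form of the bi-phasic kernel and the peak-normalized Gaussians are our assumptions; only μ_{bi1} is free, everything else follows the stated relations:

```python
import numpy as np

def temporal_kernels(mu_bi1=0.2, n=2000, t_max=1.0):
    """Build the mono- and bi-phasic temporal kernels from the single free
    parameter mu_bi1, using the relations from the text: mu_bi2 = 2*mu_bi1,
    sigma = mu/3 (3-sigma rule), sigma_bi2 = 1.5*sigma_bi1, lobe amplitudes
    s1 = 1/2 and s2 = 3/4, and mu_mono at the zero-crossing of h_bi.
    The difference-of-Gaussians form is an illustrative assumption."""
    g = lambda t, mu, sd: np.exp(-(t - mu) ** 2 / (2.0 * sd ** 2))  # peak 1
    mu_bi2 = 2.0 * mu_bi1
    sd_bi1 = mu_bi1 / 3.0
    sd_bi2 = 1.5 * sd_bi1
    t = np.linspace(0.0, t_max, n)
    h_bi = 0.5 * g(t, mu_bi1, sd_bi1) - 0.75 * g(t, mu_bi2, sd_bi2)
    # locate the zero-crossing between the two lobes -> mu_mono
    mask = (t > mu_bi1) & (t < mu_bi2)
    tm, hm = t[mask], h_bi[mask]
    i0 = np.where(np.diff(np.sign(hm)) != 0)[0][0]
    mu_mono = tm[i0]
    h_mono = g(t, mu_mono, mu_mono / 3.0)
    return t, h_mono, h_bi, mu_mono
```

With μ_{bi1} = 0.2 the zero-crossing (and hence the peak of the mono-phasic kernel) falls between the two lobes at roughly t ≈ 0.27, consistent with the alignment highlighted in the figure caption.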

Figure

The full spatio-temporal filter is composed of the odd-spatial component G_odd = ℑ(G_{σ, f0x, f0y}) paired with the mono-phasic temporal kernel h_mono, and the even-spatial component G_even = ℜ(G_{σ, f0x, f0y}) paired with the bi-phasic temporal kernel h_bi (c.f. Figure

f^0_x = f^0_y ≈ 0.057, μ_{bi1} = 0.2 (as in Figure ; the response maximum is located at f^{max}_t ≈ 0.965, f^{max}_x ≈ f^{max}_y ≈ 0.057). See text for details.

The preferred speed of the filter can be determined by an analysis of the Fourier transform F(f_x, f_y, f_t) of the filter function, namely by locating its response maximum (f^{max}_t, f^{max}_x, f^{max}_y), where

The motion constraint equation in the frequency domain reads u f_x + v f_y + f_t = 0, i.e., u f^{max}_x + v f^{max}_y = −f^{max}_t.

(u_⊥, v_⊥) is orthogonal to the luminance edge, i.e., parallel to (f^{max}_x, f^{max}_y). Thus, the scalar product of f^{max} = (f^{max}_x, f^{max}_y) and v_⊥ = (u_⊥, v_⊥) is equal to f^{max} · v_⊥ = ‖f^{max}‖ · ‖v_⊥‖.

Combining both equations, we obtain −f^{max}_t = ‖f^{max}‖ · ‖v_⊥‖, i.e., the speed ‖v_⊥‖ is given as ‖v_⊥‖ = −f^{max}_t/‖f^{max}‖. The velocity can now be obtained by scaling the normalized gradient direction f^{max}/‖f^{max}‖ with this speed. For f^0_x = f^0_y ≈ 0.057 and μ_{bi1} = 0.2, we numerically determined the values f^{max}_t = 0.974, f^{max}_x = f^{max}_y = 0.057 which maximize |F|, resulting in the preferred velocity (u_⊥, v_⊥) = (8.61, 8.61).
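The arithmetic of this readout can be checked directly. We take magnitudes, leaving the sign convention of f^{max}_t (which fixes the direction along the gradient) aside; the small deviation from the reported (8.61, 8.61) stems from the rounding of 0.057 in the text:

```python
import numpy as np

def preferred_velocity(ft_max, fx_max, fy_max):
    """Preferred velocity of the spatio-temporal filter from the location
    of its Fourier-domain response maximum: speed = |f_t| / ||(f_x, f_y)||,
    direction = normalized spatial-frequency vector (illustrative sketch)."""
    f_sp = np.array([fx_max, fy_max])
    speed = abs(ft_max) / np.linalg.norm(f_sp)
    return speed * f_sp / np.linalg.norm(f_sp)

# Values reported in the text (rounded): f_t = 0.974, f_x = f_y = 0.057
v = preferred_velocity(0.974, 0.057, 0.057)
```

This yields a velocity close to (8.61, 8.61), i.e., a diagonal preferred motion at speed ≈ 12.2, matching the tuning listed for (f_0, μ_bi) = (0.08, 0.20) in the speed-selectivity table.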

The spatio-temporal filter mechanism is combined with a stage of down-modulating lateral divisive inhibition. Such response normalization was shown to have a multitude of favorable properties, such as the decrease in response gain and latency observed at high contrasts, the effects of masking by stimuli that fail to elicit responses of the target cell when presented alone, and the capability to process a high dynamic range of response activations (Heeger, ), with s_i denoting the input and w_j denoting the spatio-temporal weighting coefficients of the local neighborhood N_i of neuron i.
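A generic form of such divisive normalization can be sketched as follows. The exact form of Equation (31) differs in detail (it uses two exponents α_p, α_q and additional transfer functions); this is only an illustrative sketch of the principle:

```python
import numpy as np

def divisive_normalization(s, w, beta=1.0, alpha=0.1):
    """Generic divisive normalization: each response s_i is divided by a
    constant plus a weighted sum of the activity in its neighborhood,
    r_i = s_i / (beta + alpha * sum_j w_j * s_j). Illustrative sketch;
    the paper's Equation (31) is more elaborate."""
    pool = np.convolve(s, w, mode="same")   # weighted surround activity
    return s / (beta + alpha * pool)

s = np.array([0.0, 1.0, 4.0, 1.0, 0.0])
w = np.array([0.25, 0.5, 0.25])             # Gaussian-like surround weights
r = divisive_normalization(s, w)
```

The key property illustrated here is gain control: the strong response (4.0) is compressed proportionally more than its weaker neighbors, which underlies the contrast-gain and masking effects listed above.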

Another favorable property of divisive normalization has been the observation that it can approximate a process dubbed radial Gaussianization, where the w_j denote the weighting coefficients for the activations in the surrounding neighborhood in the space-feature domain [as in Equation (29)]. When the coefficients are learned from a test set (Lyu and Simoncelli,

In addition to the main part describing the theoretical investigations outlined in the previous sections, we conducted a series of experiments to validate the modeling approach and its theoretical properties. The parameters of the spatio-temporal filters were chosen such that they fit the experimental data as reported in De Valois et al. ( ), i.e., μ_{bi1} = 0.2 for the temporal filter components, and σ = 25, f_0 = 0.08 for the spatial filter components. The parameters of the normalization mechanism in Equation (31) were set to β = 1, α_p = 0.1, α_q = 0.002, the w_j resemble the coefficients of a Gaussian kernel with σ = 3.6, and Ψ_{I}(_{q}(

First, we probed the model using simple and more complex stimuli with translatory and rotational motion to demonstrate the detection performance and noise characteristics of the initial (linear and non-linear) filtering of the input. Second, we studied the impact of the normalization stage on the initial filter responses. Third, the model was probed by stimuli with transparent overlaid motion patterns to test the segregation into multiple motion directions at a single spatial location (see e.g., Braddick et al.,

At each location the filter bank creates a population code over the preferred directions θ_k. For visualization purposes (Figure ), a direction d_p and a speed s_p are inferred from the initial responses r_{p; k},
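One common way to read out such a population code is a population-vector average; the paper's exact readout for visualization may differ, so the following is a sketch with hypothetical names:

```python
import numpy as np

def decode_population(responses, directions):
    """Population-vector readout: infer a direction and a (normalized)
    speed surrogate from the responses r_k of K direction-selective
    channels with preferred directions theta_k (radians). Sketch only."""
    r = np.asarray(responses, dtype=float)
    vx = np.sum(r * np.cos(directions))
    vy = np.sum(r * np.sin(directions))
    direction = np.arctan2(vy, vx)         # decoded direction
    strength = np.hypot(vx, vy) / np.sum(r)  # vector length in [0, 1]
    return direction, strength
```

A response profile peaked at the channel tuned to 90° decodes to a direction of π/2; the normalized vector length approaches 1 for a sharply peaked (unambiguous) code and shrinks for broad or multimodal codes, which is why motion transparency requires a richer readout than this single vector.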

A well-known problem in motion detection is the estimation of ambiguous motion at, e.g., straight contours (the aperture problem). Locally, only the normal flow direction can be measured, which might not coincide with the true direction because the motion component parallel to a contrast edge is unknown (Figure

In Section 2.2.5, we point out that divisive normalization can effectively approximate radial Gaussianization, i.e., a reduction of the dependency between components within a population code. Here, we empirically validate that the divisive normalization described in Equation (31) indeed reduces the dependency within the population of motion selective cells. We quantify the statistical dependency of the multivariate representation by using multi-information (MI) (Studený and Vejnarová, ), which measures the divergence between a joint distribution p(x_1, x_2, …, x_d) and the product of its marginals, I(x) = Σ_k H(x_k) − H(x_1, …, x_d), where H(x_k) denotes the differential entropy of the k-th component. We obtain I(x_norm) = 0.028 (0.027 for the second example) after the normalization stage. Thus, the divisive normalization employed here does not entirely decorrelate the movement representation (which would imply I(x_norm) = 0) but significantly reduces it.
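For a multivariate Gaussian the multi-information has a closed form, which makes the measure easy to illustrate; the paper estimates MI from response samples rather than from this analytic case:

```python
import numpy as np

def gaussian_multi_information(cov):
    """Multi-information I(x) = sum_k H(x_k) - H(x) of a multivariate
    Gaussian, which reduces to the closed form
    I = 0.5 * (sum_k log var_k - log det(cov)). Illustration only."""
    cov = np.asarray(cov, dtype=float)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - np.log(np.linalg.det(cov)))

# Independent components => I = 0; correlated components => I > 0
I_indep = gaussian_multi_information(np.eye(3))
I_corr = gaussian_multi_information([[1.0, 0.8], [0.8, 1.0]])
```

The identity covariance gives I = 0 (fully factorized), while a correlation of 0.8 gives I ≈ 0.51 nats; a normalization stage that pushes the population covariance toward diagonal thus drives I toward zero, which is exactly the trend reported above.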

Unlike the motion of opaque surfaces, transparent motion is perceived when multiple motions are presented in the same part of visual space. Only few computational model mechanisms have been proposed in the literature that allow the segregation of multiple motions (see e.g., Raudies and Neumann,

Influence of the parameters f_0 and μ_{bi} on the speed selectivity (computed as f^0_t/f^0_x, with f^0_x and f^0_t maximizing |F|).

f_0\μ_bi | 0.05 | 0.10 | 0.15 | 0.20 | 0.25 | 0.30 | 0.35 | 0.40 | 0.45 |
---|---|---|---|---|---|---|---|---|---|

0.04 | 100.65 | 50.33 | 33.55 | 25.16 | 20.13 | 16.78 | 14.38 | 12.58 | 11.18 |

0.05 | 78.76 | 39.38 | 26.26 | 19.69 | 15.75 | 13.13 | 11.25 | 9.85 | 8.75 |

0.06 | 65.10 | 32.55 | 21.70 | 16.28 | 13.02 | 10.85 | 9.30 | 8.14 | 7.23 |

0.07 | 55.67 | 27.84 | 18.56 | 13.92 | 11.13 | 9.28 | 7.95 | 6.96 | 6.18 |

0.08 | 48.69 | 24.34 | 16.23 | 12.17 | 9.74 | 8.11 | 6.96 | 6.09 | 5.41 |

0.09 | 43.28 | 21.64 | 14.42 | 10.82 | 8.66 | 7.21 | 6.18 | 5.41 | 4.81 |

0.10 | 38.95 | 19.48 | 12.98 | 9.74 | 7.79 | 6.49 | 5.56 | 4.87 | 4.33 |

0.11 | 35.41 | 17.70 | 11.80 | 8.85 | 7.08 | 5.90 | 5.06 | 4.43 | 3.93 |

0.12 | 32.46 | 16.23 | 10.82 | 8.11 | 6.49 | 5.41 | 4.64 | 4.06 | 3.61 |

The same speed selectivity can be obtained with (f_0, μ_{bi}) = (0.10, 0.20) or with (0.05, 0.40), for example. Further adaptation of the standard deviations of the spatial and temporal kernels according to our theoretical results allows realizing an optimal sampling of the Fourier domain. For large enough σ (as for this table), the speed selectivity hardly depends on the parameter σ. For small σ, however, we noticed a strong impact which needs to be considered in the creation of a properly tuned filter bank.

To test the encoding of motion transparency, we probed the model using simulated event-based sensor outputs of two superimposed random-dot patterns moving in orthogonal directions at the same speed. The spatio-temporal event cloud generated by the moving dots is rather noisy, and the component motions appear nearly indistinguishable by eye. Figure

This paper investigates mechanisms for motion estimation given event-based input generation and representation. The proposed mechanism has been motivated from the perspective of sampling the plenoptic function such that specific temporal changes in the optic array are registered by the sensory device. The temporal sampling is based on significant changes in the (log) luminance distribution at individual sensory elements (pixels). These operate at a very low latency by generating events whenever the local luminance function has undergone a super-threshold increment or decrement. This is fundamentally different from common frame-based approaches of image acquisition, where a full image is recorded at fixed intervals, leading to largely redundant signal representations. Our focus is on motion computation and the proposed approach is different from previous approaches in several respects. In a nutshell, our paper makes three main contributions:

We first investigate fundamental aspects of the local structure of lightfields for stationary observers and local contrast motion of the spatio-temporal luminance function. In particular, we emphasize the structure of local contrast information in the space-time domain and their encoding by events to build up an address-event representation (AER).

Based on these results we derive several constraints on the kind of information that can be extracted from event-based sensory acquisition using the AER principle. This allows us to challenge several previous approaches and to develop a unified formulation in a common framework of event-based motion detection.

We have shown that response normalization as part of a canonical microcircuit for motion detection is also applicable for event-based flow for which it reduces motion ambiguity and contributes to making the localized measures of filtering statistically more independent.

These different findings will be discussed in more detail in the following sections.

So far, only relatively few investigations have been published that report on how classical approaches developed in computer vision can be adapted to event-based sensory input and how the quality of the results changes depending on the new data representation framework. Examples are Benosman et al. (

We here focus on the detection of flow from spatio-temporal motion on the basis of event-based sensor input. We utilize the dynamic-vision sensor (DVS) that emulates the major processing cascade of the retina from sensors to ganglion cells (Lichtsteiner et al.,

In contrast, methods exploiting the local structure of the cloud of events are more robust in general. Here, we compared different approaches. First, we reviewed methods fitting an oriented plane to the event cloud. We derived equations which demonstrate that the orientation parameters of the plane directly encode the velocity [see Equation (18)]. The benefit of such an approach against the above-mentioned numerical derivative scheme is that it works even in the case of only a few generated events. Of course, the goodness of fit depends on the size of the spatio-temporal neighborhood. However, if we consider a neighborhood that is too small then the plane fit may eventually become arbitrary and thus unstable. If the neighborhood is too large then the chances increase that the event cloud contains structure that is not well approximated by a local plane. This also applies to the case of multiple motions, such as in the case of, e.g., occlusions due to opposite motions, limb motion in articulations, or in case of transparent motion stimuli.

Based on these insights we suggest a novel filter that samples the event-cloud along different spatio-temporal orientations. Its construction "reverses" the singular-value decomposition conducted on V1 receptive fields to construct direction-selective cells with spatio-temporally inseparable receptive fields (De Valois and Cottaris,

Compared to plane-fitting models (as suggested by, e.g., Benosman et al.,

In order to account for non-linearities in the response properties of cortical cells (Carandini et al.,

Based on statistical investigations, a decorrelation of the responses of a group of cells into rather independent components has been suggested in Lyu and Simoncelli (

Motion estimation from the output of an asynchronous event-based vision sensor requires adapted methods. Here, we conducted for the first time a theoretical investigation that systematically categorizes event-based flow estimation models with respect to their underlying methods, namely gradient-based methods and algorithms exploiting the locally approximated plane-like structure of the cloud of events. In addition to analyzing existing gradient-based methods that inconsistently mix first- and second-order derivatives, we proposed a novel consistent gradient-based algorithm. Furthermore, we showed that gradient-based methods in general suffer from strong noise originating from the limited number of events occurring at a single location. Methods exploiting the local plane-like shape of the event cloud, on the other hand, were shown to be suitable for motion originating from a single object. In addition, we derived an explicit formula for computing the velocity from the parameters of the plane. For filter-based approaches, we proposed and analyzed a novel biologically inspired algorithm and demonstrated that it can also deal with motion transparency, i.e., it can represent different motion directions at a single location. Finally, we analyzed the impact of a stage of response normalization. We demonstrated that it is applicable to flow originating from event-based vision sensors, that it reduces motion ambiguity, and that it improves statistical independence of motion responses. All the theoretical findings were underpinned by simulation results which confirm that the model robustly estimates flow from event-based vision sensors.

Designing the models/experiments: TB, ST, and HN. Mathematical and theoretical analysis: TB and HN. Spatio-temporal filter-analysis: TB. Experimental investigations: ST. Manuscript preparation: TB, ST, and HN.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The work has been supported by the Transregional Collaborative Research Center “A Companion Technology for Cognitive Technical Systems” (SFB/TR-62) funded by the German Research Foundation (DFG). We thank the reviewers for their careful reading and constructive criticism that helped to improve the manuscript.