On event-based optical flow detection

Event-based sensing, i.e., the asynchronous detection of luminance changes, promises low-energy, high dynamic range, and sparse sensing. This stands in contrast to whole image frame-wise acquisition by standard cameras. Here, we systematically investigate the implications of event-based sensing in the context of visual motion, or flow, estimation. Starting from a common theoretical foundation, we discuss different principal approaches for optical flow detection ranging from gradient-based methods over plane-fitting to filter based methods and identify strengths and weaknesses of each class. Gradient-based methods for local motion integration are shown to suffer from the sparse encoding in address-event representations (AER). Approaches exploiting the local plane like structure of the event cloud, on the other hand, are shown to be well suited. Within this class, filter based approaches are shown to define a proper detection scheme which can also deal with the problem of representing multiple motions at a single location (motion transparency). A novel biologically inspired efficient motion detector is proposed, analyzed and experimentally validated. Furthermore, a stage of surround normalization is incorporated. Together with the filtering this defines a canonical circuit for motion feature detection. The theoretical analysis shows that such an integrated circuit reduces motion ambiguity in addition to decorrelating the representation of motion related activations.


Introduction
The initial stages of visual processing extract a vocabulary of relevant feature items related to a visual scene. Rays of light reach the observer's eye and are transformed to internal representations. This can be formalized as sampling the ambient optic array (Gibson, 1978(Gibson, , 1986. Formally, the plenoptic function P(θ, φ, λ, t, V x , V y , V z ) describes the intensity of a light ray of wavelength λ passing through the center of the pupil of an idealized eye at every possible angle (θ, φ) located at the position (V x , V y , V z ) at time t (Adelson and Bergen, 1991). As a simplification we assume a single stationary camera sensing a single narrow band of wavelengths in the electromagnetic spectrum on its image plane (x, y), reducing the plenoptic function to P λ,V x ,V y ,V z (x, y, t) = g(x, y, t) (the spatio-temporal gray level function). Elemental measurements are necessary to access the plenoptic structures. Conventional frame-based cameras sample the optic array by reading out measurements of all light-sensitive pixels at a fixed rate. Since the temporal sampling rate is limited through reading all pixel values in a fixed time interval, fast local luminance changes are integrated over time and cannot be differentiated in the further processing. When no changes occur in the intensity function, redundant information is generated that is carried to the subsequent processing steps. Address-event representations (AER), on the other hand, originate from image sensors in which pixel operate at individual rates generating events based on local decisions to generate an output response, like in the mammalian retina (Mead, 1990;Liu and Delbruck, 2010).
We will focus on silicon retinas that generate an AER, namely the dynamic vision sensor (DVS; Delbrück and Liu, 2004). Whenever the change in the log-luminance function exceeds a predefined threshold ϑ, events e k ∈ {−1, 1} are generated at times t k that emulate spike sequences of on-and off-contrast cells in the retina, respectively (Figure 1). We discuss what kind of information is accessible from the initial stages of event-based visual sensing and compare different approaches to estimate optical flow from the stream of on-and off-events visualized in Figure 1. We identify weaknesses, suggest improvements, propose a novel biologically inspired motion detector and conduct experiments to validate the theoretical predictions of flow estimation. The proposed detector is then further extended by incorporating an inhibitory pool of activation over a neighborhood in the space-time-feature domain that leads to contextual modulation and response normalization. Together with the initial filtering stage the scheme defines a canonical circuit model as suggested in Kouh and Poggio (2008); Carandini and Heeger (2012); Brosch and Neumann (2014a). This competitive mechanism is investigated from an information-theoretic point of view, shown to accomplish decorrelation, and linked to radial Gaussianization of the input response distribution (Lyu and Simoncelli, 2009b). Finally, we investigate whether motion transparency encoding (Snowden and Verstraten, 1999), i.e., the percept of two competing motions at a single location, like flocks of birds flying in front of passing clouds, can be supported.
FIGURE 1 | Top from left to right: Image, operating principle and stimulus generation of/for an asynchronous event sensor. Luminance changes exceeding a given threshold evoke ON and OFF events for positive and negative changes, respectively. The very low latency of the dynamic vision sensor (15 µs) requires analog stimulus generation as illustrated on the right. Bottom: Visualization of the spatio-temporal event cloud generated by the rotating stimulus in the upper right. A small volume has been zoomed in. As can be seen, only few events have been generated at a single location.

Nomenclature and Principal Problems
We describe the stream of events by the function which is always zero except for tuples (x k , y k ; t k ) = (p k ; t k ) which define the location and time of an event k generated when the luminance function increases or decreases by a significant amount. In other words, the function that defines the event generation e(p k ; t k ) = e k , generates 1 if the log-luminance changed more than a threshold ϑ, i.e., an ON event, and −1 if it changed more than −ϑ, i.e., an OFF event. This sampling of the lightfield essentially represents the temporal derivative of the luminance function g with ϑ the sensitivity threshold of the event-based sensor.
To estimate local translatory motion we assume throughout the paper that the gray level function remains constant within a small neighborhood in space and time, i.e., g(x, y; t) = g(x + x, y + y; t + t) (gray level constancy; c.f. Horn and Schunck, 1981). Note that due to the low latency of 15 µs of the event-based sensor , this assumption is more accurate than for conventional frame based sensors.
Local expansion up to the second order yields the constraint x T ∇ 3 g + 1/2 x T H 3 x = 0. Here, x = ( x, y, t) T , ∇ 3 g = (g x , g y ; g t ) T is the gradient with the 1st order partial derivatives of the continuous gray-level function, and H 3 denotes the Hessian with the 2nd order partial derivatives of the continuous gray-level function that is defined in the x-y-t-domain. If we further assume that the 2nd order derivative terms are negligible (linear terms dominate) we arrive at the spatio-temporal constraint equation that has been used for least-squares motion estimation. The least-squares formulation is based on a set of local constraint measures over a small neighborhood under the assumption of locally constant translations (Lucas and Kanade, 1981), i.e., g x u + g y v + g t = 0 given that t → 0 and u T = (u, v) = ( x/ t, y/ t). Note that this motion constraint equation can also be represented in the frequency domain in which f x u + f y v + f t = 0 holds with f denoting the frequency with subindices referring to the respective cardinal axes and assuming a non-vanishing energy spectrum for the gray-level luminance signal, i.e., Ĝ = 0. The local image motion u of an extended contrast can only be measured orthogonal to the contrast (normal flow, Wallach, 1935;Barron et al., 1994;Fermüller and Aloimonos, 1995;Wuerger et al., 1996). For simplicity, we assume a vertically oriented gray level edge (g y = 0). Then the motion can be estimated along the horizontal directions (left or right with respect to the tangent orientation of the contrast edge). When the edge contrast polarity is known (light-dark, LD, g x < 0 or dark-light, DL, g x > 0) the spatio-temporal movements can be estimated without ambiguity. For an DL edge if g t < 0 the edge moves to the right, while for g t > 0 the edge moves to the left (c.f. Figure 2).
For an LD edge the sign of the temporal derivatives g t changes for both respective movement directions, i.e., only the ratio of gray-level derivatives yields a unique direction selector orthogonal to the oriented luminance contrast. This means that, sgn(g x /g t ) = −1 implies rightward motion while sgn(g x /g t ) = 1 implies leftward motion, irrespective of the contrast polarity. Note, however, that an estimate of g x is not easily accessible from the stream of events of an asynchronous event sensor. Thus, a key question is to what extend the required spatio-temporal derivative information is available and can be estimated.

Moving Gray-Level Edges and the Spatio-Temporal Contrast Model
We describe the luminance function g for a stationary DL transition by convolving a step edge H(·) with a parameterized Gaussian, with c denoting the luminance step height, g 0 the basic luminance level, and " * " denoting the convolution operator (since we only study the derivatives, we assume g 0 = 0). The parameter σ controls the spatial blur of the luminance edge with σ → 0 resulting in the step-function. Different contrast polarities are defined by g DL σ (x) = c · erf σ (x) and g LD σ (x) = c · (1 − erf σ (x)), respectively (Neumann and Ottenberg, 1992).
When this gray-level transition moves through the origin at time t = 0 it generates a slanted line with normal n in the xt-space (c.f. Figure 3). The speed s of the moving contrast edge is given by s = sin(θ )/ cos(θ ), where θ is the angle between n and the x-axis (this is identical to the angle between the edge tangent and the t-axis). For a stationary gray-level edge (zero FIGURE 3 | Rightward moving 1D edge illustrated in the x-t-domain. The velocity is defined by the direction and the speed of the spatio-temporal change. In the case depicted here, the direction is to the right and the speed is encoded by the angle θ between the x-axis and the normal vector n along the spatio-temporal gradient direction (measured in counter-clockwise rotation). Alternatively, for a contrast edge of known finite transition width x, the speed can be inferred from the time t, it takes the contrast edge to pass a specific location on the x-axis.
FIGURE 2 | Moving DL (dark-light) and LD (light-dark) edge, either to the left or to the right (denoted by blue arrows), have an associated temporal on/off signature. Note that without knowledge about the edge type (DL vs. LD), an on/off event alone is insufficient to determine the motion direction.
Frontiers in Neuroscience | www.frontiersin.org speed) we get θ = 0 (i.e., the edge generated by the DL transition in the x-t-domain is located on the t-axis). Positive angles θ ∈ (0 • , 90 • ) (measured in counterclockwise direction) define leftward motion, while negative angles define rightward motion. For illustrative purposes, we consider an DL contrast that is moving to the right (c.f. Figure 3). The spatio-temporal gradient is maximal along the normal direction n = (cos θ, sin θ ) T . The function g(x; t) describing the resulting space-time picture of the movement in the x-t-space is thus given as with x ⊥ = x · cos θ − t · sin θ . The respective partial temporal and spatial derivatives are given as Now, recall that the event-based DVS sensor provides an estimate of g t at a specific location [c.f. Equation (2)]. For a moving contrast profile this leads to a changing luminance function along the t-axis (side graph g(0, t) in Figure 3). The temporal derivative of this profile is formally denoted in Equation (5). Given a known velocity specified by θ , we can combine equations (5) and (6) to determine g x as In sum, the temporal edge transition can be reconstructed in principle from a (uniform) event sequence at the edge location for a specific motion direction, given that • a reliable speed estimate is available to infer a robust value for θ , and • reliable estimates of temporal changes have been generated as an event cloud over an appropriately scaled temporal integration window w t .
Note, that both parameters, θ and w t , need to be precisely estimated to accomplish robust estimates of contrast information of the luminance edge. In Sections 2.1.4 and 2.1.5, we will briefly outline the necessary steps in such an estimation process. Alternatively, one can try to directly estimate the partial derivatives used in the motion constraint equation from the stream of events. The construction of this approach and its related problems are described in the following Section 2.1.3.

Estimating Spatio-Temporal Continuity using Event-Sequences
The local spatio-temporal movement of a gray-level function can be estimated by least-squares optimization from a set of local contrast measurements which define intersecting motion constraint lines in velocity space (Lucas and Kanade, 1981). Given a dense temporal sampling the spatio-temporal gray-level function can be reasonably well captured by a first-order approximation (as summarized in Section 2.1.1). The key question remains how one could estimate the spatial and temporal derivatives in the constraint equations, g x u + g y v + g t = 0 from event sequences generated by the DVS. Events only encode information about the temporal derivative g t [c.f. Equation (2)]. Thus, without additional information it is impossible to reliably estimate g x or g y , as outlined in the previous Section 2.1.2. The derivative of a translatory moving gray level patch, however, generates a unique response in h: = g t . Thus, we can apply the motion constraint equation to the function h and solve h x u + h y v + h t = 0, instead. Using two temporal windows T −2 = (t − 2 t, t − t] and T −1 = (t − t, t], we can approximate h t , for example, by a backward temporal difference with p = (x, y) T and ϑ denoting the event-generation threshold.
The spatial derivatives h x and h y can be approximated by central difference kernels [−1, 0, 1] and [−1, 0, 1] T , respectively. These can be applied to the function h estimated by integrating over the temporal window T (e.g., Consequently, the resulting flow computation results in a sparsification of responses since stationary edges will not be represented in h. This approach is similar to that of Benosman et al. (2012) but consistently employs the second derivative instead of mixing the first and second derivatives which leads to inconsistencies in general. Note, however, that this approach has multiple issues regarding any real implementation. The most important observation is that when a luminance edge passes a pixel's receptive field of the DVS sensor, the amount of events is in the range of about 10 events (often even less, depending on the contrast, speed and luminance conditions; c.f. zoomed display of the event cloud in Figure 1). Thus, huge approximation errors occur for h x , h y and especially in h t (since this now represents the second derivative of the original gray-level function g). Furthermore, we can only estimate h t accurately, if the temporal windows are small enough such that the gray-level edge has not already passed through the receptive field of a target cell at position p. This limits the number of events to even less and leads to magnifying the outlined problems even further. Alternatively, one could try to directly approximate the temporal derivative for each event by incorporating the time-span since the last event, i.e., with W t representing the time that has passed since the last event generated at p. This assumes a constant intensity change since the last event. This, however, is certainly not true for the first event because first nothing happens for a long period and then occasionally some change occurs that causes the event, i.e., the estimate will be too small, because W t is too big.

Least-Squares Velocity Estimation
The short temporal window in which events of a briefly passing contrast edge are generated makes it difficult to reliably estimate the derivatives required in the motion constraint equation (c.f. previous section). An alternative approach is to consider the distribution of events (the "event cloud") in a small volume of the x-y-t-space. The cloud that results from a moving contrast edge generates a locally plane-like cloud of on-and/or off-events (with on-and off-events in the case of a line, for example, and only onor off-events in the case of a transition from one homogeneous region to another) to which a velocity tangent plane can be fitted . The thickness of the event cloud orthogonal to the velocity tangent plane depends on the sharpness of the contrast edge, the speed with which the gray-level discontinuity moves through the spatial location of a pixel, and its local neighborhood (the receptive field, RF, of a cell at this position).
In Benosman et al. (2014) a function e : N 2 → R is defined that maps the location p of an event e to the time e (p) = t when the event was generated. This mapping may be used to describe the cloud of events. However, care should be taken since the mapping is non-continous in principle: it is either defined for each event in which case the mapping is not differentiable, or it is defined for all events in which case the mapping is not injective (because for a given t, there are multiple events at different locations). In any case, the inverse function theorem of calculus (as employed in Benosman et al., 2014) cannot be applied here to derive a speed estimate. This insight might explain, why in the velocity-vector-field of a rotating bar illustrated in Figure 7b of Benosman et al. (2014) the velocity vectors at the outer parts are shorter (instead of longer) compared to the velocity vectors at the inner ones. We suggest an alternative solution in which the speed is estimated from the regression plane by solving the orthogonal system of the velocity vector v = (u, v, 1) T (defined in homogeneous coordinates), the orientation of the moving luminance edge l = (l x , l y , 0) T , and the normal vector n = (a, b, c) T of the plane. These three vectors form an orthogonal system that spans the x-y-t space: The resulting velocity components u and v are then given as (with with the speed component s = √ u 2 + v 2 = c · (a 2 + b 2 ) −1/2 . Note, that for slow or moderate velocities, a reliable estimate of the velocity tangent plane requires a spatial as well as a temporal neighborhood such that the event cloud is fully covered within the spatio-temporal window (or RF) considered for the LS regression. In particular, the neighborhood support must cover the event cloud illustrated in the bottom right of Figure 1. If this condition is not fulfilled, i.e., if the window is smaller than the extent of the cloud, then the principal axes are arbitrary and cannot be estimated reliably.

Direction-Sensitive Filters
As an alternative to considering the LS regression in estimating the velocity tangent plane from the cloud of events, the uncertainty of the event detection might be incorporated directly. At each location, detected events define likelihood distributions p(e|u) given certain velocities of the visual scene (estimated by a filter bank, for example). Using Bayes' theorem, we derive that for each event p(u|e) ∝ p(e|u) · p(u). If each velocity is equally likely to be observed without a priori knowledge, i.e., p(u i ) = p(u j ) (for arbitrary velocities i, j), it holds p(u|e) ∝ p(e|u) and thus, the velocity u est of the movement that caused event e can be estimated as Thus, we can estimate the velocity from the responses p(e|u i ), i = 1, 2, . . . of a filter bank, for example. In addition, a priori knowledge could be incorporated to reduce noise and to increase coherency. Current knowledge suggests, that such distributions are represented by the filter characteristics of the spatio-temporal receptive fields of cells in area V1 which we use as inspiration for a novel filter mechanisms described in the following Section 2.2.

Event-Based Motion Estimation using Direction-Selective Filters
In this section, we define spatio-temporal filters that are fitted to the physiological findings from De Valois et al. (2000) summarized in the following Section 2.2.1.

Experimental Evidence
Our filter design essentially reverses the decomposition of neural responses conducted by De Valois et al. (2000) (also c.f. Tschechne et al., 2014). Based on physiological findings first described by DeAngelis et al. (1995), De Valois suggested that inseparable filters stem from a combination of various separable components (De Valois et al., 2000). In De Valois et al. (2000) cortical V1 cells were tested and strong evidence for the coexistence of two distinct types of populations of cells emerged: One population showed spatio-temporally separable weight functions of either even or odd spatial symmetry. These have either temporally mono-or bi-phasic response characteristics which were mainly determined by a single principal component in 2D (of a singular value decomposition). The other population of cells was spatio-temporally inseparable showing a receptive field distribution of selectivity that were slanted with respect to the time axis, i.e., motion sensitive (c.f. Figure 3; c.f. also De Valois and Cottaris, 1998). Response characteristics of these cells were determined by two strong principal components in 2D. These two components of the second group were itself spatiotemporally separable with spatially out-of-phase components and always composed of pairs of mono-and bi-phasic distributions. This main observation lead us to propose a family of spatiotemporally direction selective filters as illustrated in Figure 4, that are generated by superposed separable filters with quadrature pairs of spatial weighting profiles (G odd and G even ) and mono-/bi-phasic temporal profiles (T mono and T bi ). The details of the construction process are outlined in the following sections.

Spatial Gabor Filters
To construct the spatial component of the spatio-temporal filters illustrated in Figure 4 we define Gabor filters that are fitted to the experimental results of De Valois et al. (2000). To construct multiple spatio-temporally tuned filters of different spatial orientation selectivity, we employ a filter-bank of kernels as illustrated in Figure 5. More precisely, we employ Gabor filters maximally selective for the spatial frequency ( f 0 x , f 0 y ) (with a standard deviation σ in local space) defined by (c.f. Figure 5) in local space. The spatial frequencies selected by this filter can be seen by visualizing its Fourier transform (Figure 5, bottom left) which is given aŝ whereσ = 1/σ and the filter tuning ( f 0 x , f 0 y ) defines the shift of the Gaussian envelope with respect to the origin in the Fourier domain. This defines the two components G odd = ℑ(G σ,f 0 to construct the filters as described in Section 2.2.1 (compare with Daugman, 1985;Marčelja, 1980).

Mono-and Biphasic Temporal Filters
The second component required in the spatio-temporal filter generation process illustrated in Figure 4 is the definition of mono-and bi-phasic temporal filters, T mono and T bi . To fit the experimental data of De Valois et al. (2000), we define (c.f. Figure 6) with the unnormalized Gaussian function When the experimental findings are incorporated, it is only necessary to choose a value for µ bi1 . All other parameters can be inferred according to the experimental data from De Valois et al.
:   Valois et al. (2000). Parameters in all plots have been set to σ = 25 and f 0 = 0.08. FIGURE 6 | Temporal filters fit experimental data (gray) from De Valois et al. (2000). Kernels consist of one or two Gaussians which define a mono-and bi-phasic temporal filter, respectively. The mean of the mono-phasic kernel has been set to µ bi1 = 0.2 with the remaining parameters fitted to the experimental data (see text for details). Dashed line highlights that the peak of the mono-phasic kernel (green) is located at the zero-crossing of the bi-phasic kernel (blue).
• The bi-phasic scaling factors s 1 and s 2 are adapted to the minimum and maximum values of the experimental data relative to the maximum value of the monophasic kernel (which is one), i.e., s 1 = 1/2 and s 2 = 3/4. • A good fit with the experimental data reported in De Valois et al. (2000) is achieved by setting the relation between the mean values to µ bi2 = 2µ bi1 . • The standard deviations σ mono and σ bi1 are chosen such that the Gaussians are almost zero for t = 0, i.e., σ mono = µ mono /3, σ bi1 = µ bi1 /3 (3σ -rule; 99.7% of the values lie within three standard deviations of the mean in a normal distribution). • The standard deviation of the second Gaussian of the bi-phasic kernel is about 3/2 of that of the first, i.e., σ bi2 = 3 2 · σ bi1 = 1 2 · µ bi1 .

Figure 6
illustrates that these settings result in a good fit of the temporal filters with the experimental data reported in De Valois et al. (2000). We will now construct the full spatio-temporal selective filters as outlined in Figure 4.

Combined Spatio-Temporal Filter
The full spatio-temporal filter F is defined according to the scheme of Figure 4, i.e., by the sum of two products consisting of the odd-spatial G odd = ℑ(G σ,f 0 x ,f 0 y ), the monophasic temporal T mono , the even-spatial G even = ℜ(G σ,f 0 x ,f 0 y ), and the biphasic temporal filter T bi (c.f. Figure 7): The preferred speed of the filter can be determined by an analysis of the Fourier transformF( f x , f y , f t ) of the filter function F(x, y, t). From the location ( f max t , f max x , f max y ) whereF is maximal we can infer the filter's preferred normal velocity, i.e., the velocity parallel to the gradient of the luminance edge (n in Figure 3) with maximal filter response, using the following two relations: • The motion constraint equation in the frequency domain: DeAngelis et al. (1995), Their Figure 3F) and in Fourier domain (right; absolute value) for σ = 25, f 0 .08) and µ bi1 = 0.2 (as in Figure 4).
Combining both equations, we obtain −f max t = f max · u ⊥ , i.e., the speed s = u ⊥ is given as s The velocity can now be obtained by scaling the normalized

Response Normalization
The spatio-temporal filter mechanism is combined with a stage of down-modulating lateral divisive inhibition. Such response normalization was shown to have a multitude of favorable properties such as the decrease in response gain and latency observed at high contrasts, the effects of masking by stimuli that fail to elicit responses of the target cell when presented alone, the capability to process a high dynamic range of response activations (Heeger, 1992;Carandini et al., 1997;Koch, 1999;Sceniak et al., 1999;Frégnac et al., 2003;Tsui et al., 2010), and the ability to resolve ambiguous motion estimates at, for example, straight contours without knowledge about the edges of the contour (aperture problem Wallach, 1935;Nakayama and Silverman, 1988;Wuerger et al., 1996). To account for such nonlinearities we add a stage of divisive normalization to test whether it is also suited to enhance flow estimated from the output of DVSs. Based on our previous modeling (e.g., Raudies et al., 2011;Neumann, 2012, 2014a), we employ a dynamic neuron model of membrane potentials p and a mean firing rate generated by the monotonically increasing function (p). The full dynamic equation readṡ with I i denoting the input and c j denote the spatio-temporal weighting coefficients of the local neighborhood N i of neuron i in the space-time-feature domain (see Brosch and Neumann, 2014a for more details of an even more generalized circuit model). At equilibrium, the following state equations can be derived .
Another favorable property of divisive normalization has been the observation that it can approximate a process dubbed radial Gaussianization which minimizes the statistical dependency of coefficients in image coding Simoncelli, 2008b, 2009a): where b is a scalar scaling coefficient and c j denote the weighting coefficients for the activations in the surrounding neighborhood in the space-feature domain [as in Equation (29)]. When the coefficients are learned from a test set (Lyu and Simoncelli, 2009a), it was shown to approximate optimal minimization of statistical dependency, i.e., radial Gaussianization. Here, we test whether this is also true for Gaussian weights (in accordance with neurophyiological findings Bonin et al., 2005) and a slightly different but biologically inspired normalization scheme as outlined in Equation (31). Therefore, the normalization scheme adopted here can only lead to an approximate decorrelation of input encoding. We will, therefore, demonstrate experimentally the impact of the divisive normalization of the spatio-temporal input filtering.

Results
In addition to the main part describing the theoretical investigations outlined in the previous sections, we conducted a series of experiments to validate the modeling approach and its theoretical properties. The parameters of the spatio-temporal filters were chosen such that they fit the experimental data as reported in De Valois et al. (2000) (up to scaling), namely µ bi1 = 0.2 for the temporal filter components, and σ = 25, f 0 = 0.08 for the spatial filter components. The parameters of the normalization mechanism in Equation (31) were set to β = 1, α p = 0.1, α q = 0.002, c j resemble the coefficients of a Gaussian kernel with σ = 3.6, and I (x) = q (x) = max(0, x) denotes a rectifying transfer function. First, we probed the model using simple and more complex stimuli with translatory and rotational motion to demonstrate the detection performance and noise characteristics of the initial (linear and non-linear filtering of the input). Second, we studied the impact of the normalization stage on the initial filter responses. Third, the model was probed by stimuli with transparent overlaid motion patterns to test the segregation into multiple motion directions at a single spatial location (see e.g., Braddick et al., 2002;Edwards and Nishida, 1999;Treue et al., 2000).

Detection of Translatory and Rotational Movements
At each location the filter creates a population code of length N with each entry corresponding to the response of a spatiotemporal filter with motion direction selectivity θ k . For visualization purposes (Figure 8), the velocity components u p and v p are inferred from the initial responses I p;k , k ∈ {1, . . . , N} at each location p by summing them up according to FIGURE 8 | Responses to input stimuli with translatory and rotational motion. From left to right: Test stimulus and vector field of initial motion estimation using the filter mechanism in Equation (25) and after normalization (red arrows are not part of stimulus; only two representative stimuli are shown due to space constraints). First row: Translatory motion stimulus illustrates that a majority of the responses point into the normal flow-direction, i.e., orthogonal to the stimulus boundaries. Last row: A rotational stimulus has been employed to validate that the filter also works for different speeds (slow motion close to the center and fast motion at the more distant regions). See Section 3.2 for details about the normalization mechanism. A comparison of initial and normalized flow estimation demonstrates that responses within line segments are reduced while responses at corners or noise are enhanced (that could be compensated by feedback from higher stages Brosch and Neumann, 2014b).
effectively implementing a local vector addition of component estimates. The tests utilize stimuli of translatory and rotational motion. The visualized results (Figure 8) demonstrate that the filter based approach robustly computes estimates of contour motion, i.e., locations of apparently moving contrasts and object boundaries .

Response Normalization
A well known problem to motion detection is the estimation of ambiguous motion at e.g., straight contours (aperture problem). Locally only the normal flow direction can be measured which might not coincide with the true direction because the motion component parallel to a contrast edge is unknown (Figure 9, left). As suggested in Tsui et al. (2010), normalization can help to suppress responses at ambiguous parts of a contour (i.e., the inner parts of an extended contrast or line) and to enhance responses at line ends or sharp corners (c.f. Figure 9B). Figure 9 shows motion histograms of the tilted bar in Figure 8 (top) as a result of the initial filtering in the model (left) and with normalization (right). These results indicate that normalization significantly improves the histograms to better represent the true motion direction (Figure 9A; blue lines). In Section 2.2.5, we point out that divisive normalization can effectively approximate radial Gaussianization, i.e., a reduction of the dependency between components within a population code. Here, we empirically validate that the divisive normalization described in Equation (31) indeed reduces the dependency within the population of motion selective cells. We quantify the statistical dependency of the multivariate representation by using multi-information (MI) (Studený and Vejnarová, 1998), which is defined as the Kullback-Leibler divergence (Cover and Thomas, 2006;Lyu and Simoncelli, 2009a) between the joint distribution p(x 1 , x 2 , . . . , x d ) and the product of its marginals Normalization suppresses responses "within" the line and enhances responses at its endings (line ends, corners) that encode the true motion direction (c.f. Guo et al., 2006;Tsui et al., 2010).
where H(I) = p(I) log(p(I)) dI is the differential entropy of the representation I, and H(I k ) denotes the differential entropy of the kth component of I (Lyu and Simoncelli, 2009a). To calculate the required probability estimates, we employ binary variables indicating motion for d = 4 movement directions. As theoretically predicted by the connection to radial Gaussianization, the MI for the stimulus shown in Figure 9 is reduced from MI(I) = 0.042 (0.090 for the second example) before normalization to MI(I norm ) = 0.028 (0.027 for the second example) after the normalization stage. Thus, divisive normalization employed here does not entirely decorrelate the movement representation (which would imply MI(I norm ) = 0) but significantly reduces it.

Spatio-Temporal Filtering and Transparent Motion
Unlike the motion of opaque surfaces transparent motion is perceived when multiple motions are presented in the same part of visual space. Few computational model mechanisms have been proposed in the literature that allow to segregate multiple motions (see e.g., Raudies and Neumann, 2010;Raudies et al., 2011 which include recent overviews). All such model approaches are based on frame-based inputs. For that reason, we investigate how transparent motion induced by random dot patterns moving in different directions is represented in event-clouds originating from DVSs. In general, filter-based mechanisms are able to encode estimated motions for multiple directions at a single location. In contrast, it is not possible to fit a plane at positions where two (or multiple) event clouds generated by, for example, two crossing pedestrians intersect without applying additional knowledge. The filter mechanisms proposed in this work naturally encode motion directions within the uncertainty of the integration fields (c.f. Figures 10A,B). In order to build such a filter bank, the frequency space in Figure 10B needs to be sampled properly in accordance with the theoretical analysis outlined in Section 2 (c.f. Table 1).
To test the encoding of motion transparency, we probed the model by using simulated event-based sensor outputs of two superimposed random-dot patterns moving in orthogonal directions with the same speed. The spatio-temporal event-cloud generated by the moving dots is rather noisy and the component motions appear rather indistinguishable by eye. Figure 10C shows such events for individual dots and integrated over a small temporal window (directions are indicated by the blue and red arrows for illustrative purposes). As can be seen in Figure 10D the filter response clearly encodes both movement directions which could not be achieved by a plane-fitting approach without incorporating knowledge about the number of movement directions.

Discussion
This paper investigates mechanisms for motion estimation given event-based input generation and representation. The proposed mechanism has been motivated from the perspective of sampling the plenoptic function such that specific temporal changes in the optic array are registered by the sensory device. The temporal sampling is based on significant changes in the (log) luminance distribution at individual sensory elements (pixels). These operate at a very low latency by generating events whenever the local luminance function has undergone a super-threshold increment or decrement. This is fundamentally different from common frame-based approaches of image acquisition where a full image is recorded at fixed intervals leading to a largely redundant  signal representations. Our focus is on motion computation and the proposed approach is different from previous approaches in several respects. In a nutshell, our paper makes three main contributions: • We first investigate fundamental aspects of the local structure of lightfields for stationary observers and local contrast motion of the spatio-temporal luminance function. In particular, we emphasize the structure of local contrast information in the space-time domain and their encoding by events to build up an address-event representation (AER). • Based on these results we derive several constraints on the kind of information that can be extracted from event-based sensory acquisition using the AER principle. This allows us to challenge several previous approaches and to develop a unified formulation in a common framework of event-based motion detection. • We have shown that response normalization as part of a canonical microcircuit for motion detection is also applicable for event-based flow for which it reduces motion ambiguity and contributes to making the localized measures of filtering statistically more independent.
These different findings will be discussed in more detail in the following sections.

Previous Related Computational Models
So far, only relatively few investigations have been published that report on how classical approaches developed in computer vision can be adapted to event-based sensory input and how the quality of the results changes depending on the new data representation framework. Examples are Benosman et al. (2012Benosman et al. ( , 2014 for optical flow computation and (Rogister et al., 2012;Piatkowska et al., 2013;Camuñas Mesa et al., 2014) for stereo vision. Furthermore, other authors show future applications of this new sensor technology that have the potential to provide fast, robust and highly efficient sensory processing in various domains and challenging scenarios (e.g., Fu et al., 2008;Drazen et al., 2011). Even further, most recent work has elucidated how fast event-based sensing technology can be utilized to improve the performance of computer vision motion estimation approaches and how framebased imagery may help stabilizing the raw event-based motion processing . We here focus on the detection of flow from spatio-temporal motion on the basis of event-based sensor input. We utilize the dynamic-vision sensor (DVS) that emulates the major processing cascade of the retina from sensors to ganglion cells Liu and Delbruck, 2010). Based on the formulation of a local spatio-temporal surface patch at a significant luminance transition that moves along either direction, we have first categorized event-based flow estimation models. This allows us to provide a more systematic overview and to identify rather principled approaches. Based in these prerequisites, we have shown that gradient-based methods like  are generally not stable in terms of their input feature estimation. The main reason is rooted in the potentially very small number of events generated at a single location (c.f. Figure 1). Based on these investigations we have further shown that the numerical approximation of the gradients, like in Benosman et al. (2012), has methodological deficiencies that may lead to inconclusive motion estimates. On formal grounds, we have demonstrated that a gradient-based motion detection and integration scheme, using the scheme of Lucas and Kanade (1981), can be utilized to numerically estimate second-order spatio-temporal derivatives on a function that represents the temporal derivative of the luminance distribution. This requires to employ proper numerical difference schemes which also demonstrates the disadvantage of increased noise sensitivity (Section 2.1.3).
In contrast, methods exploiting the local structure of the cloud of events are more robust in general. Here, we compared different approaches. First, we reviewed methods fitting an oriented plane to the event cloud. We derived equations which demonstrate that the orientation parameters of the plane directly encode the velocity [see Equation (18)]. The benefit of such an approach against the above-mentioned numerical derivative scheme is that it works even in the case of only a few generated events. Of course, the goodness of fit depends on the size of the spatio-temporal neighborhood. However, if we consider a neighborhood that is too small then the plane fit may eventually become arbitrary and thus instable. If the neighborhood is too large then the chances increase that the event cloud contains structure that is not well approximated by a local plane. This also applies to the case of multiple motions, such as in the case of, e.g., occlusions due to opposite motions, limp motion in articulations, or in case of transparent motion stimuli.
Based on these insights we suggest a novel filter that samples the event-cloud along different spatio-temporal orientations. Its construction "reverses" the singular-value decomposition conducted of V1 receptive fields to construct direction-selective cells with spatio-temporally inseparable receptive fields (De Valois and Cottaris, 1998;De Valois et al., 2000). The conducted theoretical analysis allows to realize a spatio-temporally selective filter bank. Our investigation is similar to Escobar et al. (2009) who seek to specify the spatio-temporal selectivity. In contrast, our mechanism is directly derived from physiological findings. Perhaps the most similar scheme in comparison to our model is the one proposed by Adelson and Bergen (1985) which also suggests to derive spatio-temporally selective kernels by superposing different receptive fields. In their work, a spatial quadrature pair and two bi-phasic temporal kernels (in contrast to the mono-and bi-phasic kernels employed in our work) are combined (Adelson and Bergen, 1985) (compare also the review Emerson et al., 1992 andBorst andEgelhaaf, 1993). This scheme was motivated to resemble the spatio-temporal correlation scheme for motion detection (Hassenstein and Reichardt, 1956;Reichardt, 1957). In contrast to their approach, we rely upon the superposition of space-time separable filters with outof-phase temporal modulation filter-responses. In addition to the main analysis, our test applications of the model implementation successfully demonstrate the functionality of such initial filtering for motion detection from spatio-temporal event clouds.
Compared to plane-fitting models (as suggested by, e.g., Benosman et al., 2014) we have shown that our model has the advantage that it can encode multiple motion directions at a single location, such as, e.g., (semi-) transparent motion (Figure 10; compare, e.g., Snowden et al., 1991;Treue et al., 2000; see e.g., Raudies and Neumann, 2010;Raudies et al., 2011 for a detailed discussion of motion transparency computation).

Non-Linear Response Normalization by Divisive Inhibition
In order to account for non-linearities in the response properties of cortical cells (Carandini et al., 1997) several models have been proposed to arrive at a neural circuit to define canonical computational mechanism (e.g., Kouh and Poggio, 2008;Carandini and Heeger, 2012). These and other models employ a mechanism of divisive inhibition of the surround activity (also used here) that has been suggested to explain findings ranging from gain control (Ayaz and Chance, 2009;Louie et al., 2011) over attention effects (Reynolds and Heeger, 2009;Lee and Maunsell, 2009;Montijn et al., 2012) to normalization in multi-sensory integration (Ohshiro et al., 2011). Tsui et al. (2010) have demonstrated that cells in the motion-sensitive area MT can properly respond to motion directions even for tilted bars although the normal flow directions signaled by component sensitive V1 cells should bias the motion selectivity in a direction orthogonal to the tilt direction. These authors suggest a divisive normalization that operates upon the static filters of oriented contrast filtering before the separate temporal filter. Such a scheme is rather implausible mechanistically. We therefore developed a scheme that employs the pool normalization after the stage of spatiotemporal event-input filtering (c.f. Brosch and Neumann, 2014b). The simulation results using oriented bar stimuli further confirm findings of Guo et al. (2006) in which enhanced responses were shown at the bar ends while the responses along the extended boundary of the bar are significantly reduced [consistent with earlier investigations Bolz and Gilbert, 1986]. While Escobar et al. (2008) showed that such a reduction of uncertainty can be achieved by using subtractive surround inhibition the proposal by Bayerl and Neumann (2004) suggests that feedback can reduce such redundant aperture responses. Taken together, the proposed model not only demonstrates that response normalization of initial motion detection successfully operates for eventbased representations but suggests a reasonably simple account for the recent experimental observations (Tsui et al., 2010) using lateral interactions.
Based on statistical investigations, a decorrelation of the responses of a group of cells into rather independent components has been suggested in Simoncelli (2008a, 2009a), dubbed radial Gaussianization to account for the broadening of the tuning curves. Since we showed certain similarities but also deviations from the model proposed here, we employed an information theoretic measure which confirms that the normalization scheme decorrelates input representations by decreasing the multi-information even without special parameter learning from a test set (Studený and Vejnarová, 1998;Lyu and Simoncelli, 2009a,b). This might be beneficial in light of coding principles (to support a sparse coding mechanism, Olshausen and Field, 2004) and to better deal with the variability of the overall motion stimulus configuration. For example, most model mechanisms have been employed by assuming (implicitly or explicitly) that the motion can be approximated locally by translatory motion. However, for cases of rotations, the intersection-of-constraints mechanism (Adelson and Movshon, 1982) fails as there is no common point of intersection from local estimates (Caplovitz et al., 2006). We suggest that such a stage of normalization in real-world motions reduces the response to ambiguous parts of a stimulus, like the center of an extended contrast. At the same time due to the reduced mutual dependency of individual responses in a population the rotation components can be combined into a more global configuration more easily. This is exemplified by demonstrating the effective pushing of the motion response histogram toward the true motion direction (Figure 9) similar to Tsui et al. (2010) (see Pack and Born, 2001 for a discussion of an account to solve the aperture problem in area MT).

Summary
Motion estimation from the output of an asynchronous event-based vision sensor requires adapted methods. Here, we conducted for the first time a theoretical investigation that systematically categorizes event-based flow estimation models with respect to their underlying methods, namely gradient-based methods and algorithms exploiting the locally approximated plane-like structure of the cloud of events. In addition to analyzing existing gradient-based methods inconsistently mixing first and second order derivatives we proposed a novel consistent gradient-based algorithm. Even further, we showed that gradientbased methods in general suffer from strong noise originating from the limited number of events occurring at a single location. Methods exploiting the local plane-like shape of the event-cloud, on the other hand, were shown to be suitable for motion originating from a single object. In addition, we derived an explicit formula to derive the velocity from the parameters of the plane. For filter-based approaches, we proposed and analyzed a novel biologically inspired algorithm and demonstrated that it can also deal with motion transparency, i.e., it can represent different motion directions at a single location. Finally, we analyzed the impact of a stage of response normalization. We demonstrated that it is applicable to flow originating from event-based vision sensors, that it reduces motion ambiguity, and that it improves statistical independence of motion responses. All the theoretical findings were underpinned by simulation results which confirm that the model robustly estimates flow from event-based vision sensors.

Author Contributions
Designing the models/experiments: TB, ST, and HN. Mathematical and theoretical analysis: TB and HN. Spatio-temporal filter-analysis: TB. Experimental investigations: ST. Manuscript preparation: TB, ST, and HN.