Edited by: Mehdi Khamassi, UMR7222 Institut des Systèmes Intelligents et Robotiques (ISIR), France
Reviewed by: Xavier Clady, UMR7210 Institut de la Vision, France; Franck Ruffier, CNRS/Aix-Marseille Univ, France
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
In order to safely navigate and orient in their local surroundings, autonomous systems need to rapidly extract and persistently track visual features from the environment. While there are many algorithms tackling these tasks for traditional frame-based cameras, they all have to deal with the fact that conventional cameras sample their environment at a fixed frequency: the same features have to be found in consecutive frames, and corresponding features then need to be matched using elaborate techniques, since any information between two frames is lost. We introduce a novel method to detect and track line structures in data streams of event-based silicon retinae [also known as dynamic vision sensors (DVS)]. In contrast to conventional cameras, these biologically inspired sensors generate a quasicontinuous stream of vision information, analogous to the information stream created by the ganglion cells in mammal retinae. All pixels of a DVS operate asynchronously, without a periodic sampling rate, and emit a so-called DVS address event as soon as they perceive a luminance change exceeding an adjustable threshold. We use the high temporal resolution achieved by the DVS to track features continuously through time instead of only at fixed points in time. The focus of this work lies on tracking lines in a mostly static environment which is observed by a moving camera, a typical setting in mobile robotics. Since DVS events are mostly generated at object boundaries and edges, which in man-made environments often form lines, lines were chosen as the feature to track. Our method is based on detecting planes of DVS address events in x-y-t space and tracing these planes through time. It is robust against noise and runs in real time on a standard computer, which makes it suitable for low-latency robotics.
The efficacy and performance are evaluated on real-world data sets which show artificial structures in an office-building using event data for tracking and frame data for ground-truth estimation from a DAVIS240C sensor.
This article introduces an algorithm that is aimed at detecting and tracking visual line features with low latency and without requiring much prior knowledge about the environment. We envision this algorithm to be useful toward enabling high-speed autonomous machines to orient in and interact with their environments, e.g.,
We use dynamic vision sensors (DVS, see Lichtsteiner et al.,
To illustrate the difference between conventional cameras and DVS further, imagine a wall in front of which a box is moved through a camera’s field of view as an example (cf. Figure
Event traces for a box moved from bottom to top through the field of view of a DVS. Visible are dense manifolds of events corresponding to the two edges of the box. Events originating from the movement of the person holding the box are excluded for the sake of clearer visualization. The frames put in the event stream show snapshots of the situation at the time they were triggered. Box edges are indicated by blue bars for better visibility. Note that one axis corresponds to time!
The event stream is however still a raw data source from which no useful higher-level information can be trivially gained. In order to arrive at a meaningful interpretation and make the benefits of the DVS accessible it is necessary to cluster the events and assign them to physical origins. This article proposes a step toward this goal and introduces an algorithm for the fast extraction and persistent tracking of lines using dynamic vision sensors.
We operate on the assumption that a major part of DVS events belonging to a scene originate from object boundaries because that is where sharp transitions in brightness often occur. While not all boundaries are straight, many are (at least approximately), especially in man-made environments, e.g., for robots moving in indoor scenarios. This makes lines a good feature to track. The goal of our method is to parametrize these lines using the parameters line midpoint
The algorithm requires little prior knowledge about the scene. It is, however, designed for environments which contain many straight edges, but which may also contain arbitrary other objects and non-straight distractors. The lines are assumed to move in a translatory fashion; the method is not designed to detect and track fast-spinning lines like those produced by fans, or rotation of the sensor around the optical axis as may occur when airborne drones perform rolling maneuvers.
ON and OFF events are processed separately. By separating polarities, we gain the benefit of more sparsely populated x-y-t spaces (one for each polarity). This simplifies line identification, as it allows for more generous thresholds, since on average half of the noise and distracting objects has been removed, while the lines themselves contain only one polarity and the number of events forming them is therefore not reduced. Such separate processing of ON and OFF contrast is found in nature and has been studied, e.g., in insect and mammal eyes (Franceschini et al.,
The rest of this work is structured as follows: the remainder of Section
There is a variety of algorithms to extract lines from frames, most notably the Hough transform (Duda and Hart,
In recent years, several trackers for different shapes have been developed for event-based systems. An early example of this can be found in Litzenberger et al. (
In a more recent work, Brändli et al. (
There are also increasing efforts to track other basic geometric shapes in event-based systems: corners have been a focus in multiple works as they generate distinct features that do not suffer from the aperture problem, can be tracked fast and find usage in robotic navigation. Clady et al. (
Lagorce et al. (
Dynamic vision sensors are a novel type of optical sensor whose working principle is inspired by mammal retinae. The pixels of a DVS operate asynchronously and independently from each other. A pixel generates a so-called address event as soon as it senses a change in log luminance above a certain threshold, rather than measuring actual intensity. An event contains the position on the retina
The fact that the sensor uses log luminance gives it a very high dynamic range of about 120 dB (Brändli et al.,
DAVIS-recorded scene: gray-scale frame (left), events (right; events have been accumulated for 50 ms; ON events are depicted white, OFF events black; gray areas did not emit any events in the previous 50 ms; the camera was rotated clockwise).
In this work, we use a DAVIS240C (Brändli et al.,
The main idea behind the algorithm is to identify planes of events in x-y-t space. On short time scales, straight physical edges move with near constant velocity through the field of view, i.e., acceleration due to physical acceleration or projective transformation can be neglected if the observed time interval is sufficiently small. Therefore, straight edges leave traces of events in x-y-t space that are approximately planar on short time scales. Figure
Overview of the algorithm. Top: stream part; bottom: batch part running in the background.
First, events are separated by polarity. After separating events, we apply different noise filters. First, we introduce a refractory period per pixel: after a pixel has emitted an event, we suppress further events from this pixel within a certain time interval, because pixels sometimes generate additional spurious events shortly after having been triggered, and may emit multiple events of the same polarity if the change in brightness was very strong. Experimentally, we found 1 ms for opposite-polarity events and 50 ms for same-polarity events to work well; all events received within the period corresponding to their polarity are discarded. Afterward, an additional filtering step checks for every incoming event whether at least 3 same-polarity events have been registered in a 5 × 5 pixel window around it. If not, the event is labeled as noise and not processed further.
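The two pre-filters can be sketched as follows. This is an illustrative Python sketch, not the authors' C++ implementation; the event representation and the class name are our own, and the neighborhood filter here counts all previously seen events rather than only recent ones.

```python
# Sketch of the two pre-filters (illustrative; events are (x, y, t_us, polarity)
# tuples with timestamps in microseconds).

REFRACTORY_SAME_US = 50_000   # 50 ms for same-polarity events
REFRACTORY_OPP_US = 1_000     # 1 ms for opposite-polarity events

class NoiseFilter:
    def __init__(self):
        self.last = {}      # (x, y) -> (t_us, polarity) of last accepted event
        self.recent = {}    # (x, y, polarity) -> timestamp of last event seen

    def accept(self, x, y, t, pol):
        # Refractory period: suppress events arriving too soon after the
        # previous event of this pixel, with a polarity-dependent interval.
        prev = self.last.get((x, y))
        if prev is not None:
            prev_t, prev_pol = prev
            limit = REFRACTORY_SAME_US if pol == prev_pol else REFRACTORY_OPP_US
            if t - prev_t < limit:
                return False
        self.last[(x, y)] = (t, pol)
        # Neighborhood filter: require at least 3 same-polarity events in the
        # surrounding 5x5 window (here: ever seen; a real filter would also
        # require them to be recent).
        support = 0
        for dx in range(-2, 3):
            for dy in range(-2, 3):
                if (dx or dy) and (x + dx, y + dy, pol) in self.recent:
                    support += 1
        self.recent[(x, y, pol)] = t
        return support >= 3
```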
In the following, we explain the algorithm starting from the end of the processing chain, because this makes it easier for the reader to follow the whole process: beginning with the way clusters are initially formed, then how they are promoted to lines, and finally how these lines are tracked through time.
When an event arrives (and could not be assigned to an existing line or a cluster), we use it as seed to search for a chain of adjoining pixels that recently generated events. First, we search for the youngest event in the ring of the 8 adjacent pixels. If we find no event, we search in the next ring of 16 pixels around the adjacent pixels. If we still find no events, we abort the search. Otherwise, we add the youngest event to our chain and repeat the procedure from the pixel position of the found event. This step is iterated until the chain length crosses a threshold or we do not find any new events.
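Under the same caveat (an illustrative sketch rather than the authors' code), the expanding ring search might look like this, with `last_t` mapping each pixel to the timestamp of its most recent same-polarity event:

```python
# Illustrative sketch of the chain search described above.

def ring(cx, cy, r):
    """Pixels on the square ring of radius r around (cx, cy):
    8 pixels for r = 1, 16 pixels for r = 2."""
    return [(cx + dx, cy + dy)
            for dx in range(-r, r + 1)
            for dy in range(-r, r + 1)
            if max(abs(dx), abs(dy)) == r]

def find_chain(last_t, seed, max_len=10):
    """Grow a chain of adjoining recently active pixels starting at `seed`.
    At each step, pick the youngest event in the 8-neighbour ring; if that
    ring is empty, try the next ring of 16 pixels; if both are empty, stop."""
    chain = [seed]
    visited = {seed}
    x, y = seed
    while len(chain) < max_len:
        candidates = []
        for r in (1, 2):
            candidates = [(last_t[p], p) for p in ring(x, y, r)
                          if p in last_t and p not in visited]
            if candidates:
                break
        if not candidates:
            break                       # no events found: abort the search
        _, (x, y) = max(candidates)     # youngest event = largest timestamp
        chain.append((x, y))
        visited.add((x, y))
    return chain
```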
If we are able to create a chain of events, we cluster the events of the chain, add all events that have been generated by adjacent pixels, and store these events as a cluster, thereby creating a candidate for a plane.
Moving one step to the left in the process flow (Figure
When a cluster has collected enough events (in the order of 20–40 events), we check if its events form a plane in x-y-t-space. As stated above, the underlying assumption is that the velocity of lines on the retina can be approximated as constant on short time scales. Then, non-spinning straight edges in the real world generate flat planes of events. To check if the candidate cluster’s events form a line, we compute the principal components of the event coordinates (
If the smallest eigenvalue is below the threshold, we promote the candidate to a real line (or 3D plane, respectively). The position and orientation of the line at time
Having found a vector pointing along the line, the next step is to find a point
We arrived at a line parametrization with midpoint
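The promotion step can be sketched with a principal component analysis of the event coordinates. The eigenvalue threshold, the time scaling, and the crude midpoint estimate below are assumptions of this sketch, not values taken from the paper:

```python
import numpy as np

def promote_cluster(events, eig_threshold=0.5):
    """Plane test sketched from the text (illustrative). `events` is an
    (N, 3) array of (x, y, t) coordinates with t scaled to pixel-like units.
    If the smallest principal component of the cluster is below the
    threshold, the events lie on a plane in x-y-t space and the cluster is
    promoted to a line; otherwise None is returned."""
    pts = np.asarray(events, dtype=float)
    centered = pts - pts.mean(axis=0)
    cov = centered.T @ centered / len(pts)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    if eigvals[0] >= eig_threshold:
        return None                             # not planar: keep clustering
    nx, ny, nt = eigvecs[:, 0]                  # plane normal (least variance)
    # At a fixed time t the plane cuts the image in the line
    # nx*x + ny*y = const, so (-ny, nx) points along the line.
    direction = np.array([-ny, nx])
    direction /= np.linalg.norm(direction)
    midpoint = pts[:, :2].mean(axis=0)          # crude midpoint estimate
    return midpoint, direction
```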
Whenever a cluster is promoted to a line, we check whether the new line's position and slope match those of a line that was previously deleted. We require the polarities to be identical, the angular distance to be less than 5°, and the midpoints' distance to the respective other line to be less than 2 px (although this threshold should be adapted to the sensor resolution). If a deleted line matches, we assume that we lost track of it and have now recovered it; in this case, we assign the new line the ID of the deleted line.
When a new event is received, we check for lines close to the pixel of event generation. We assign the event to the line if it is closer than
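A minimal sketch of this assignment step; the 2 px distance threshold is an assumption made for illustration, since the exact value is not reproduced here:

```python
import math

def point_line_distance(px, py, mx, my, angle):
    """Perpendicular distance from point (px, py) to the infinite line
    through midpoint (mx, my) with orientation `angle` (radians)."""
    nx, ny = -math.sin(angle), math.cos(angle)   # unit normal of the line
    return abs((px - mx) * nx + (py - my) * ny)

def assign_event(event, lines, max_dist=2.0):
    """Return the closest line within `max_dist` pixels of the event,
    else None. `lines` is a list of (mx, my, angle) tuples."""
    best, best_d = None, max_dist
    for line in lines:
        d = point_line_distance(event[0], event[1], *line)
        if d < best_d:
            best, best_d = line, d
    return best
```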
For larger time spans, the assumption of planarity is violated; the principal component analysis breaks down if events that are too old are used. Therefore, we need to update the inferred planes, either periodically or on request as soon as an accurate estimate is required, by removing events that are older than a certain age or, if a line contains many events per length, simply by removing the oldest events. After removing these events, the orientation of the event plane (and thereby also of the inferred line) is re-estimated by re-applying the PCA and going through all the additional steps described above. Note, furthermore, that this is not an expensive update, since we can store the running sums for the PCA and just modify them when adding or removing events from the line. If after an update fewer than 10 events remain, or the smallest eigenvalue exceeds the planarity threshold, the line is deleted. Clusters are also periodically cleaned by removing old events. Furthermore, lines are checked for coherence: if a line displays gaps in its event distribution, it is split into two lines. Gaps are detected by projecting every event position onto the parametrized line and partitioning the line into bins of width 2 px; if two adjacent bins are empty, the line is split at the gap. Finally, lines are merged if they have an angular difference of less than 5° and the same polarity, the midpoints' distances to the respective other line are less than 2 px (the same values as for recovering deleted lines), and the midpoints' distance to each other is less than half the sum of their lengths, i.e., the lines are adjacent to each other; in that case, they are merged to form a single line.
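The running-sum bookkeeping mentioned above can be sketched as follows: by storing the sum of coordinates and the sum of their outer products, adding or removing a single event is O(1), and the covariance for the PCA can be formed on demand (illustrative sketch, not the authors' implementation):

```python
import numpy as np

class RunningCov:
    """Running sums for the x-y-t PCA so that adding or removing an event
    is O(1) and the covariance never has to be recomputed from scratch."""
    def __init__(self):
        self.n = 0
        self.s1 = np.zeros(3)        # sum of coordinates
        self.s2 = np.zeros((3, 3))   # sum of outer products

    def add(self, p):
        p = np.asarray(p, dtype=float)
        self.n += 1
        self.s1 += p
        self.s2 += np.outer(p, p)

    def remove(self, p):
        p = np.asarray(p, dtype=float)
        self.n -= 1
        self.s1 -= p
        self.s2 -= np.outer(p, p)

    def covariance(self):
        # cov = E[p p^T] - E[p] E[p]^T, formed from the stored sums
        mean = self.s1 / self.n
        return self.s2 / self.n - np.outer(mean, mean)
```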
We performed experiments to evaluate the quality of the matching and tracking, to quantify latency and computational costs, and to investigate robustness. For all experiments, we used an Intel Core i7-4770K running at 3.5 GHz. The algorithm was implemented in C++ without parallelization.
To evaluate the quality, we recorded data sets with the DAVIS240C, capturing both events and frames (cf. Supplemental Material, frames captured with a rate between 15 and 20 Hz). To our knowledge, there exists no data set with ground truth values for event-based line tracking. So, we obtained ground truth values for lines by applying the well-known Canny edge detector (Canny,
Our method is able to successfully extract lines from event-based vision streams. Figure
Event stream with current position of detected lines (events accumulated for 50 ms). Camera rotates clockwise, so lines move to the left and older events trailing to the right of lines are still visible.
Comparison between different approaches. Top row left-to-right: frame taken from a DAVIS240C recording, frame 200 ms later, ground truth lines for the first frame, ground truth lines for the second frame. Second row left-to-right: (a) Hough transform, (b) LSD, (c) ELiSeD, and (d) our method. Third row: same algorithms as above applied to the second frame. In the images of our method lines were additionally assigned an ID to demonstrate the tracking capabilities (cf. text).
To measure the quality, we compared the estimated lines with the labeled ground-truth lines, where an estimated line was assumed to match a ground-truth line if their angular difference was less than 5° and the perpendicular distance from the midpoint of the estimated line to the ground-truth line was smaller than 1.5 px. We then obtained angle and length differences for the matched lines. Angles of lines are known to be a robust feature to estimate when detecting lines. This also holds for our method, where the average absolute angular error over all matched lines was approximately 0.6° and the median absolute angular error approximately 0.4° (on the same data set, Hough had a mean/median angular error of 0.7°/0.4°, LSD of 0.9°/0.4°, and ELiSeD of 1.5°/1.1°). In contrast, line length is a rather unstable feature to extract. Using our algorithm, we are able to extract line lengths with high precision. Figure
Distributions of length ratios between estimated lines and matching ground truth lines for
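The matching criterion used for this evaluation can be written down directly (a sketch; lines are given as midpoint plus orientation angle):

```python
import math

def matches_ground_truth(est, gt, max_angle_deg=5.0, max_dist=1.5):
    """Evaluation criterion sketched from the text: an estimated line
    (mx, my, angle) matches a ground-truth line if the angular difference
    is below 5 degrees and the perpendicular distance from the estimated
    midpoint to the ground-truth line is below 1.5 px."""
    d_angle = abs(est[2] - gt[2]) % math.pi
    d_angle = min(d_angle, math.pi - d_angle)        # lines are undirected
    if math.degrees(d_angle) >= max_angle_deg:
        return False
    nx, ny = -math.sin(gt[2]), math.cos(gt[2])       # ground-truth line normal
    dist = abs((est[0] - gt[0]) * nx + (est[1] - gt[1]) * ny)
    return dist < max_dist
```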
The other aspect we aimed for besides line detection was tracking, i.e., we should be able to identify every line with the same ID over the entire time that it is visible. In Figure
Histogram over ratios of lifetimes of estimated lines to lifetimes of ground-truth lines, in percent.
There are certain limitations when working with the DVS. The number of events that lines generate depends on their angle relative to the movement direction of the camera. Lines perfectly aligned with the sensor movement direction are invisible to the DVS because only the leading edge presents a luminance change; therefore, these lines cannot be tracked. If a tracked line becomes aligned with the movement after a change of the sensor's movement direction, it also becomes invisible and the track is lost. To overcome this, one could include an inertial measurement unit (IMU) and estimate the invisible line's position using acceleration information. In practice, however, the problem of invisible lines turns out not to be severe: vibrations of the sensor, which can stem, e.g., from a robot's motors or from natural tremor in the case of a handheld DVS, cause the sensor to move orthogonally to the line, which makes it visible to the sensor. We performed an experiment to examine the dependence of line detection on this angle, for which we printed lines with known inclinations from 0° to 10° in steps of 2° and recorded the scene using a self-built robotic platform with a mounted DVS that drove parallel to the stimulus. Figure
Dependence of line tracking on line orientation. Top row:
In the second experiment, we drove with the robot over the seams of a tiled floor. These irregularities in the surface caused small abrupt movements of the sensor. Figure
Line tracking results for a robot driving over small irregularities caused by a tiled floor. Comparing line IDs shows that lines were tracked even when crossing seams (only ON events).
As a third experiment, we attached a sensor to a radio-controlled model car to evaluate behavior at high velocity (~12 km/h). We recorded two different settings and evaluated the results by visual inspection; the recordings are provided as Supplementary Material.
In the first experiment, the car started on a checkerboard pattern floor and drove through a door toward another door. Because the floor was smooth, we observed no major disturbances (especially no abrupt changes in motion) and detection and tracking yielded good results. Figure
Snapshots from a sensor mounted on an RC car driving over even floor through a door (time increases from left to right, also see
In the second scene, the car drove over uneven floor in a narrow hallway with a comparatively high noise level due to irregular illumination patterns and textures on the walls and floor. In this recording, line detection again yielded good results; tracking, however, was more challenging. While some lines (especially those perpendicular to the car's vibrations) could be tracked well, others were often lost. This can be explained by the same reason for which we lost track of lines while driving over seams: the car experienced abrupt changes of motion, which led to kinks in the event plane, causing our tracking method to fail. Due to the dynamic nature of the recordings, we recommend viewing the video of the experiments provided in the Supplementary Material.
This section presents an experiment to evaluate latency. The independent operation of DVS pixels generates a quasicontinuous stream of events; due to this sensor property, events already arrive with low delay between an illumination change in the scene and the reception of the event in a processing device. Each event can be handled individually and is used to update our belief about the current state of the world immediately after arrival. We measured the time it takes to process an incoming event and to call the update function that re-estimates the line position, using a scene with a swinging pendulum. This gives us a single line traversing the field of view with predictable translatory speed. To measure the error, we obtained ground-truth values for the line position in the following way: first, we discarded the OFF events and binned the ON events into slices of 50 ms. We then found the leading edge by picking, for every pixel row of the sensor, the event that was furthest along the movement direction. We used MATLAB's built-in robust linear fitting to fit a line and reject outliers, and inspected each fitted line visually. Figure
Left: detail from method of ground truth estimation. Right: true line position (red line) and position estimates (blue crosses) at time of availability. Inlay zooms to region between 1.6 and 1.8 s. Position estimate overestimates true position by a small margin.
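The ground-truth procedure can be sketched as follows; for simplicity, this sketch uses an ordinary least-squares fit where the paper uses MATLAB's robust fitting:

```python
import numpy as np

def leading_edge(events):
    """Ground-truth sketch: from ON events (x, y) within one 50 ms slice of
    a line moving in the +x direction, keep per pixel row only the event
    furthest along the movement direction, then fit x = a*y + b by least
    squares (illustrative; no outlier rejection)."""
    pts = np.asarray(events, dtype=float)
    edge = {}
    for x, y in pts:
        if y not in edge or x > edge[y]:
            edge[y] = x                    # furthest event in this pixel row
    ys = np.array(sorted(edge))
    xs = np.array([edge[y] for y in ys])
    a, b = np.polyfit(ys, xs, 1)           # fit x = a*y + b
    return a, b
```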
We compared the position at which our algorithm estimated the line with two different approaches: (1) retrieving the position of the line by interpolating the line movement linearly from the last calculated position of the line until the time of position request and (2) calling the update routine and refitting the line before returning the position estimate. For the case of event processing without line update the average required time was approximately 0.7
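Option (1), the linear interpolation, can be realized by reading the line's image-plane velocity directly off the fitted event plane. The following construction is ours and is shown only to illustrate the geometry:

```python
def line_velocity(normal):
    """Image-plane velocity of a line whose events form the plane
    n . (x, y, t) = c in x-y-t space. Slicing the plane at time t gives the
    line nx*x + ny*y = c - nt*t, which shifts along its 2D normal with
    speed -nt / sqrt(nx^2 + ny^2); as a vector: -nt * (nx, ny) / (nx^2 + ny^2)."""
    nx, ny, nt = normal
    s = nx * nx + ny * ny
    return (-nt * nx / s, -nt * ny / s)

def extrapolate_midpoint(mid, normal, t0, t):
    """Predict the line midpoint at query time t by linear extrapolation
    from the last fitted position `mid` at time t0 (sketch of option (1))."""
    vx, vy = line_velocity(normal)
    return (mid[0] + vx * (t - t0), mid[1] + vy * (t - t0))
```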
In addition to latency, the overall computation load is another important quantity for judging the usefulness of an algorithm. Figure
Left: dependence between computing time and number of events for a number of different recordings. Right: processed events per second and required computing time for line tracking in the staircase scene.
We have introduced an algorithm for the fast detection and persistent tracking of translating lines for a biologically inspired class of optical sensors, dynamic vision sensors (DVS). The nature of DVS data allows both tasks, detection and tracking, to be solved in a combined approach in which we first cluster events and check for linearity and then continuously grow detected lines by adding events. An additional benefit we derive from the use of DVS is, on the one hand, low-latency responses, because DVS pixels emit address events asynchronously as soon as they perceive an illumination change. We made use of this property by processing each event individually and showed in experiment 3.3 that it is possible to determine a line's position within a few microseconds, at arbitrarily chosen points in time, with subpixel accuracy.
On the other hand, our method can potentially be applied in environments with vastly varying lighting conditions, because DVS are insensitive to absolute illumination. This makes it suitable for robots working in environments where lighting conditions are unpredictable or unknown beforehand. The efficacy of our method was demonstrated and the results were compared to other methods for line detection in frames and in address-event streams. Our method performed as well as classical algorithms applied to frames; note, however, that classical algorithms are fundamentally constrained to frames and therefore cannot exploit the neuronal sensor's advantages of low latency and robustness to lighting variations.
The algorithm is resilient against small displacements and vibrations; vibrations are actually helpful, as they make more visual features of a scene accessible and allow for their detection and tracking, as shown in Section
There are a couple of directions in which to continue. By linking lines that move coherently, it may become possible to reconstruct the outlines of objects with straight edges, like doors and boxes (or objects whose outlines can be piecewise linearly approximated), developing the algorithm toward object tracking. Furthermore, matching lines across streams from different DVS could allow for depth estimates. A different direction of advancement would be to extend the cluster promotion mechanism by introducing PCA kernels for different shapes, e.g., circles. This would allow active systems not only to orient themselves on lines but would also provide them with more numerous and more distinct features, enabling more robust position estimates and safer navigation. Objects that are not straight, however, behave in more complicated ways under projective transformations and require more complex parametrizations. There is a variety of options for using our algorithm as a basis for methods applicable in robotics.
LE designed and implemented the algorithm, evaluated performance, and wrote the manuscript. JC guided method development and contributed to the manuscript and figures.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The Supplementary Material for this article can be found online at
Staircase.mp4.
Corridor.mp4.
Office.mp4.
RCcar_even_surface.mp4.
RCcar_uneven_surface.mp4.