MaskUKF: An Instance Segmentation Aided Unscented Kalman Filter for 6D Object Pose and Velocity Tracking

Tracking the 6D pose and velocity of objects represents a fundamental requirement for modern robotic manipulation tasks. This paper proposes a 6D object pose tracking algorithm, called MaskUKF, that combines deep object segmentation networks and depth information with a serial Unscented Kalman Filter to track the pose and the velocity of an object in real-time. MaskUKF achieves, and in most cases surpasses, state-of-the-art performance on the YCB-Video pose estimation benchmark without the need for expensive ground truth pose annotations at training time. Closed loop control experiments on the iCub humanoid platform in simulation show that joint pose and velocity tracking helps achieve higher precision and reliability than one-shot deep pose estimation networks. A video of the experiments is available as Supplementary Material.


INTRODUCTION
Object perception is one of the key challenges of modern robotics, representing a technology enabler to perform manipulation and navigation tasks reliably. Specifically, the estimation and tracking of the 6D pose of objects from camera images is useful to plan grasping actions and obstacle avoidance strategies directly in the 3D world. Object manipulation is another important application, where the evolution of the object state over time is fundamental for closing the control loop.
The problem of 6D object pose estimation and tracking has been extensively addressed both in the computer vision community (Drost et al., 2010; Tjaden et al., 2017; Mitash et al., 2018) and in the robotics community, the latter focusing on solutions based on Kalman and particle filtering techniques (Wüthrich et al., 2013; Issac et al., 2016; Vezzani et al., 2017; Deng et al., 2019). Recently, deep learning techniques have also been employed to solve the 6D object pose estimation problem (Li et al., 2018; Xiang et al., 2018; Wang C. et al., 2019). These methods, although fairly complex and requiring a considerable amount of 3D labelled data, have shown impressive results in approaching the problem end-to-end. However, it is not clear whether their performance always generalizes to conditions not represented in the training set. Additionally, although successfully employed in one-shot grasping tasks (Wang C. et al., 2019), the possibility of closing the loop for tasks that require a continuous estimate of the object pose has not yet been thoroughly assessed.
In this paper we propose to take a step back from end-to-end deep pose estimation networks while still leveraging neural networks. We called our approach MaskUKF, as it uses 2D binary masks and an Unscented Kalman Filter (UKF) (Wan and Merwe, 2000) to track the pose and the velocity of the object given a 3D mesh model and depth observations in the form of point clouds.
Our contributions are the following:
• We combine consolidated techniques for object segmentation using deep neural networks with non-linear state estimation from the Kalman filtering literature.
• We show how to use depth information in the form of point clouds directly in an Unscented Kalman filtering setting and how to leverage recent advances in serial Unscented filtering to reach real-time performance.
• We propose a heuristic 1-parameter procedure to identify and reject outliers from point clouds and increase tracking performance.
• We demonstrate the importance of joint pose and velocity tracking in a robotic setting with closed loop control experiments on a humanoid platform in simulation.
One of the main advantages of our approach consists of employing segmentation networks that are much simpler to train than end-to-end pose estimation architectures. Differently from other works, our approach does not require 6D labels of poses nor a specific training for each object of interest. Indeed, a suitable initial condition for the 6D pose of the object and a deep neural network (DNN) for segmentation, trained once for all the objects of interest, are enough to perform tracking. Finally, our approach allows estimating not only the pose of the object but also its velocity (linear and angular) that is paramount for closed loop control and in-hand manipulation (Viña et al., 2015).
We compare our technique against state-of-the-art pose estimation networks, object pose tracking algorithms, and a baseline consisting of a simpler architecture that uses the Iterative Closest Point (ICP) (Besl and McKay, 1992) registration algorithm from Rusu and Cousins (2011) together with the masks from the segmentation network. Results demonstrate that we achieve, and in most cases surpass, state-of-the-art and baseline performance in terms of pose precision and frame rate. We further evaluate our algorithm in a closed loop scenario to demonstrate the advantages of our approach for robotic applications.
The rest of the paper is organized as follows. After a review of the state of the art on both 6D object pose estimation and tracking, we present our MaskUKF algorithm for 6D object pose tracking and the results of the experiments. Finally, we conclude the paper providing additional remarks and future perspectives.

RELATED WORK
MaskUKF draws from state-of-the-art pose estimation techniques based on deep learning, as well as from classical techniques for tracking.
Recent works addressing the problem of 6D object pose tracking using classical tools (e.g., Kalman and particle filtering) focused on handling occlusions and outliers. Wüthrich et al. (2013) use depth measurements within an articulated particle filtering framework in order to explicitly model the occlusion of the object as a part of the state to be estimated. This information is then used to reject range measurements that do not belong to the surface of the object. Issac et al. (2016) propose a Robust Gaussian Filter for depth-based object tracking that uses a heavy-tailed Gaussian noise model to counteract occlusions and outliers.
End-to-end deep architectures for object pose estimation have also been proposed. PoseCNN (Xiang et al., 2018) estimates the pose of the object in an end-to-end fashion from RGB images using a Convolutional Neural Network and refines it using depth information and a highly customized ICP. DenseFusion (Wang C. et al., 2019) incorporates segmentation, RGB and depth information in a dense pixel-wise fusion network whose output is refined using a dedicated iterative refinement network.
In contrast to the end-to-end approaches, PoseRBPF (Deng et al., 2019) combines Rao-Blackwellized particle filtering with a deep neural network in order to capture the full object orientation distribution, even in the presence of arbitrary symmetries. However, state-of-the-art performance is achieved only with a large number of particles, leading to low frame rates. Moreover, the method requires a separate training for each object. SegICP (Wong et al., 2017) follows a different path, incorporating off-the-shelf segmentation networks, a multi-hypothesis point cloud registration procedure and a Kalman filter to track the object pose and velocity.
In our work, we follow a similar path to SegICP (Wong et al., 2017) and PoseRBPF (Deng et al., 2019), since we combine DNNs with a filtering algorithm. In particular, we combine segmentation masks provided either by PoseCNN (Xiang et al., 2018) or Mask R-CNN (He et al., 2017) with an Unscented Kalman Filter. Differently from SegICP, we do not require an intermediate registration step between the segmentation module and the tracking stage. Instead, we directly design our filtering algorithm around the observed point cloud. In addition, we also provide qualitative results on velocity tracking performance and we show the advantage of using it in a robotic application.
Our work is also similar to Issac et al. (2016) in that we use a single Gaussian filter with depth information only for the actual tracking of the object and we explicitly take outliers into account. However, our outlier rejection procedure requires far fewer modifications of the standard Gaussian filter than that of Issac et al. (2016).
The works we have mentioned in this section support their findings by means of commonly adopted pose estimation metrics (Hodan et al., 2018) evaluated on standard or custom-made datasets. Our work is not different in this sense. Nevertheless, our aim is also to show that acceptable performance according to such metrics does not necessarily translate into good performance when the output of the pose estimation/tracking is used for closed loop control in a robotic setting. To this end, we provide the results of experiments showing that joint pose and velocity tracking helps achieve higher precision and reliability in the execution of the task than one-shot deep pose estimation networks or standard techniques like ICP.

METHODOLOGY
Given a sequence of input RGB-D images {I_t}, 1 ≤ t ≤ N_I, the task of 6D object pose tracking is to estimate the position r_t and the orientation o_t of the object O with respect to the camera frame, for every frame in the sequence. In our approach, we also estimate the relative velocity between the camera and object frames. The complete description of the object pose is then given by the state vector

x_t = (r_t, v_t, o_t, ȯ_t) ∈ R^12,

where r_t ∈ R^3 is the Cartesian position, v_t ∈ R^3 is the Cartesian velocity, o_t ∈ R^3 is the Euler ZYX representation of the orientation and ȯ_t ∈ R^3 is the vector of the angular rates. Our approach relies on the assumption that an instance segmentation algorithm is available in order to identify the object of interest within the image I_t and provide a segmentation mask M_t. The mask is used to extract, from the depth map of the image I_t, a partial point cloud belonging to the visible part of the surface of the object. The point cloud

y_t = (y_{t,1}, …, y_{t,L}) ∈ R^{3L}

is then used as a measurement in order to track the state of the object, with y_{t,j} ∈ R^3 the j-th point and L the total number of points. Our approach also makes the assumption that a 3D mesh model of the object O is available. In order to track the state of the object over time, we decided to adopt an Unscented Kalman Filter (UKF) for the following reasons:
• the ability to handle non-linear relationships between state and measurements;
• the availability of a serial variant of the algorithm that efficiently processes high-dimensional measurements, as in the case of point clouds;
• the possibility of integrating a motion model for the object in a principled manner;
• the recognized superiority (Wan and Merwe, 2000; Julier and Uhlmann, 2004) over alternatives based on linearization, e.g., the Extended Kalman Filter, in terms of unbiasedness and consistency.
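For illustration, the 12-dimensional state above can be packed into a flat vector; this is a minimal sketch where the packing order and helper names are our own choice, not prescribed by the paper:

```python
import numpy as np

# Illustrative packing of the state x_t in R^12: Cartesian position r,
# Cartesian velocity v, Euler ZYX orientation o and Euler angular rates
# o_dot, each a 3-vector. The ordering here is an assumption.

def pack_state(r, v, o, o_dot):
    """Stack the four 3-vectors into a single state vector x in R^12."""
    return np.concatenate([r, v, o, o_dot])

def unpack_state(x):
    """Split the state vector back into (r, v, o, o_dot)."""
    return x[0:3], x[3:6], x[6:9], x[9:12]
```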
In the remainder of this section, we describe MaskUKF in detail.

Segmentation
The first step in the proposed pipeline is to segment the object of interest within the input image I_t, producing a binary segmentation mask M_t. As the focus of our work is to develop a 6D object pose tracking algorithm, we rely on two existing segmentation architectures.
One is the semantic segmentation network taken from PoseCNN (Xiang et al., 2018), a recently proposed 6D pose estimation network. We used the PoseCNN segmentation network because it is also adopted in PoseRBPF (Deng et al., 2019) and DenseFusion (Wang C. et al., 2019), two algorithms we compare to in the experimental section of our work; adopting the same segmentation network makes the comparison fair.
The second architecture is Mask R-CNN (He et al., 2017), a consolidated region-based approach for object detection and segmentation. Differently from the segmentation network of PoseCNN, Mask R-CNN was not proposed within the pose estimation and tracking communities and represents an off-the-shelf, commonly adopted solution for object segmentation. For this reason, we decided to test our method with this segmentation framework as well.

Serial Unscented Kalman Filter
We model the belief about the state of the object x as the posterior distribution p(x_t | y_{1:t}), given all the measurements y_{1:t} up to the instant of time t. We adopt a Gaussian filtering framework and approximate p(x_t | y_{1:t}) using a Gaussian distribution, under the hypothesis that the state x evolves according to a Markovian dynamic model, i.e.,

p(x_t | x_{1:t−1}, y_{1:t−1}) = p(x_t | x_{t−1}),

and that the measurements y_t are conditionally independent given x_t, i.e.,

p(y_t | x_{1:t}, y_{1:t−1}) = p(y_t | x_t).
We assume that the distributions in Eqs. 4 and 5 are Gaussian and have the following form:

p(x_t | x_{t−1}) = N(x_t; f(x_{t−1}), Q_t),
p(y_t | x_t) = N(y_t; h(x_t), R_t),

where f and h are generic non-linear functions, Q_t is referred to as the process noise covariance matrix and R_t as the measurement noise covariance matrix. The probabilistic formulation in Eqs. 6 and 7 can be expressed, in functional form, in terms of the following motion model for the state x at time t,

x_t = f(x_{t−1}) + q_t,   q_t ∼ N(0, Q_t),

and measurement model for the measurement y at time t,

y_t = h(x_t) + r_t,   r_t ∼ N(0, R_t).

At each instant of time t, the previous belief is propagated through the model in Eq. 8 in order to obtain the predicted mean μ⁻_t and covariance P⁻_t of the state. Next, a new measurement y_t is incorporated into the belief taking into account the measurement model in Eq. 9 according to the following correction step:

μ_t = μ⁻_t + K_t (y_t − ŷ_t),
P_t = P⁻_t − K_t P_{y,t} K_tᵀ,
K_t = P_{xy,t} P_{y,t}⁻¹.

Here, K_t is usually called the Kalman gain, P_{xy,t} is the state-measurement covariance matrix, P_{y,t} is the measurement covariance matrix and ŷ_t is the predicted mean of the measurement. The actual estimate x̂_t is extracted as the mean μ_t of the approximate posterior p(x_t | y_{1:t}).

Frontiers in Robotics and AI | www.frontiersin.org March 2021 | Volume 8 | Article 594583
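For concreteness, the generic correction step above can be sketched in NumPy; this is a minimal sketch of the standard Gaussian-filter update, not the serial variant described later, and the function name is ours:

```python
import numpy as np

def gaussian_correction(mu_pred, P_pred, y, y_pred, P_xy, P_y):
    """Generic Gaussian-filter correction step:
    K = P_xy P_y^{-1};  mu = mu^- + K (y - y_hat);  P = P^- - K P_y K^T."""
    K = P_xy @ np.linalg.inv(P_y)       # Kalman gain
    mu = mu_pred + K @ (y - y_pred)     # corrected mean
    P = P_pred - K @ P_y @ K.T          # corrected covariance
    return mu, P
```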

The Unscented Transform Algorithm
The UKF algorithm that we adopted follows the general Gaussian filtering framework presented above. In the UKF, all the predicted and corrected means and covariances are evaluated using the unscented transform (UT) algorithm (Julier and Uhlmann, 2004), which propagates the Gaussian belief through the functions f and h even if they are not provided with an analytical expression (as in our case for the measurement function h, described later). More precisely, the UT algorithm, which we recap here in the case of an additive transform y = g(x) + q, evaluates a Gaussian approximation of the joint distribution of x and y when x ∼ N(μ_x, P_x) and q ∼ N(0, Q) by means of a fixed number of so-called sigma points that capture the mean and covariance of the original distribution of (x, q) exactly. The main steps of the algorithm are as follows. Let n be the size of the state x. 1) Form the sigma points for the random variable x:

X^(0) = μ_x,
X^(i) = μ_x + [√((n + λ) P_x)]_i,      i = 1, …, n,
X^(i+n) = μ_x − [√((n + λ) P_x)]_i,    i = 1, …, n,

where [A]_i is the i-th column of the matrix A and λ is a suitable constant.
2) Propagate the sigma points through the function g:

Y^(i) = g(X^(i)),      i = 0, …, 2n.

3) Compute the propagated mean μ_y, the propagated covariance P_y and the cross-covariance P_xy:

μ_y = Σ_{i=0}^{2n} W^(m)_i Y^(i),
P_y = Σ_{i=0}^{2n} W^(c)_i (Y^(i) − μ_y)(Y^(i) − μ_y)ᵀ + Q,
P_xy = Σ_{i=0}^{2n} W^(c)_i (X^(i) − μ_x)(Y^(i) − μ_y)ᵀ,

with W^(m)_i and W^(c)_i suitable weights.
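The three UT steps can be sketched as a self-contained NumPy function; the scaled-UT weight parameters α, β, κ and their defaults below are a common textbook choice and an assumption here, not values taken from the paper:

```python
import numpy as np

def unscented_transform(mu_x, P_x, g, Q, alpha=0.1, beta=2.0, kappa=0.0):
    """Propagate N(mu_x, P_x) through y = g(x) + q, q ~ N(0, Q),
    using 2n+1 sigma points and the standard scaled-UT weights."""
    n = mu_x.size
    lam = alpha**2 * (n + kappa) - n
    S = np.linalg.cholesky((n + lam) * P_x)          # matrix square root
    # Sigma points: the mean, plus/minus the columns of S
    sigmas = [mu_x] + [mu_x + S[:, i] for i in range(n)] \
                    + [mu_x - S[:, i] for i in range(n)]
    Wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    Wc = Wm.copy()
    Wm[0] = lam / (n + lam)
    Wc[0] = lam / (n + lam) + (1 - alpha**2 + beta)
    Y = np.array([g(s) for s in sigmas])             # propagated sigma points
    mu_y = Wm @ Y                                    # propagated mean
    dY = Y - mu_y
    dX = np.array(sigmas) - mu_x
    P_y = dY.T @ (Wc[:, None] * dY) + Q              # propagated covariance
    P_xy = dX.T @ (Wc[:, None] * dY)                 # cross-covariance
    return mu_y, P_y, P_xy
```

For a linear g the UT is exact, which gives a simple way to sanity-check the implementation.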

Serial Correction Step
One challenge in the application of the correction step in Eq. 11 to our scenario is the necessity to invert the matrix P_{y,t}, having dimension 3L × 3L. Here L, the number of points within the point cloud, might be quite large (in the order of thousands or more), thus making it difficult to perform the inversion of the matrix in real-time. An efficient solution to this problem, under the hypothesis that the noise covariance matrix R_t in Eq. 9 is block diagonal, is given by the serial UKF correction step proposed in McManus and Barfoot (2011). The serial processing approach reformulates the correction step in Eq. 11 in an algebraically equivalent way using the Sherman-Morrison-Woodbury identity (Sherman and Morrison, 1950). The algebraic reformulation is designed around the following two matrices:

X_t = [ √(W^(c)_i) (X^(i),−_t − μ⁻_t) ]_{i=0,…,2n} ∈ R^{n×(2n+1)},
Y_t = [ √(W^(c)_i) (Y^(i)_t − ŷ_t) ]_{i=0,…,2n} ∈ R^{3L×(2n+1)}.

The matrix X_t contains the weighted vector differences between the predicted sigma points X^(i),−_t of the state, obtained after propagation through the model in Eq. 8, and the predicted mean of the state μ⁻_t. Similarly, the matrix Y_t contains the weighted vector differences between the predicted sigma points Y^(i)_t of the measurement, obtained after propagation through the model in Eq. 9, and the predicted mean of the measurement ŷ_t. This matrix can also be re-written in terms of the block matrices Y_{t,j} ∈ R^{3×(2n+1)}, where the j-th block contains the three rows associated with the j-th point in the point cloud y_t.
Given the definition of X_t and Y_t, it can be easily shown that the predicted covariance of the state P⁻_t, the state-measurement covariance matrix P_{xy,t} and the measurement covariance matrix P_{y,t} in Eq. 11 can be expressed as

P⁻_t = X_t X_tᵀ,   P_{xy,t} = X_t Y_tᵀ,   P_{y,t} = Y_t Y_tᵀ + R_t,

where R_{t,j} ∈ R^{3×3} is the measurement covariance matrix associated to the j-th point in the point cloud y_t and R_t = blkdiag(R_{t,1}, …, R_{t,L}). Substituting Eq. 18 in Eq. 11 results in the following re-formulation of the correction step:

μ_t = μ⁻_t + X_t Y_tᵀ (Y_t Y_tᵀ + R_t)⁻¹ (y_t − ŷ_t),
P_t = X_t X_tᵀ − X_t Y_tᵀ (Y_t Y_tᵀ + R_t)⁻¹ Y_t X_tᵀ.

Using the Sherman-Morrison-Woodbury identity (Sherman and Morrison, 1950) and the fact that R_t is block diagonal, the covariance update equation in Eq. 19 becomes

P_t = X_t C_t⁻¹ X_tᵀ,   C_t = I + Σ_{j=1}^{L} Y_{t,j}ᵀ R_{t,j}⁻¹ Y_{t,j},

where C_t ∈ R^{(2n+1)×(2n+1)}. Following a similar reasoning, the state update equation in Eq. 19 becomes

μ_t = μ⁻_t + X_t C_t⁻¹ Σ_{j=1}^{L} Y_{t,j}ᵀ R_{t,j}⁻¹ (y_{t,j} − ŷ_{t,j}),

where y_{t,j} ∈ R^3 is the j-th point extracted from the point cloud y_t and ŷ_{t,j} ∈ R^3 is the associated predicted mean extracted from the predicted measurement vector ŷ_t. In summary, updating the state and the covariance of the state requires inverting the matrix C_t, having size (2n + 1) × (2n + 1), instead of the 3L × 3L matrix P_{y,t}. Since the size n is typically much smaller than L (we recall that in our case n = 12), the serial approach requires inverting a relatively small matrix compared to the standard approach and allows achieving real-time computations.
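To make the reformulation concrete, the serial update can be sketched in NumPy and checked numerically against the standard correction step; this is a sketch under the block diagonal noise assumption, and all function and variable names are ours:

```python
import numpy as np

def serial_ukf_correction(mu_pred, X, Y, y, y_pred, R_blocks):
    """Serial correction step in the spirit of McManus and Barfoot (2011).
    X        : n x (2n+1) weighted state sigma-point differences
    Y        : 3L x (2n+1) weighted measurement sigma-point differences
    R_blocks : list of L per-point (3x3) noise covariances.
    Inverts C of size (2n+1) x (2n+1) instead of P_y of size 3L x 3L."""
    m = X.shape[1]                          # 2n + 1
    C = np.eye(m)
    b = np.zeros(m)
    innov = y - y_pred
    for j, Rj in enumerate(R_blocks):
        Yj = Y[3 * j:3 * j + 3, :]          # rows for the j-th point
        Rinv = np.linalg.inv(Rj)
        C += Yj.T @ Rinv @ Yj               # accumulate C = I + sum_j ...
        b += Yj.T @ Rinv @ innov[3 * j:3 * j + 3]
    Cinv = np.linalg.inv(C)
    mu = mu_pred + X @ Cinv @ b             # state update
    P = X @ Cinv @ X.T                      # covariance update
    return mu, P
```

By the Sherman-Morrison-Woodbury identity this is algebraically equivalent to the standard update with K = X Yᵀ (Y Yᵀ + R)⁻¹, which the test below verifies on random data.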

Measurement Model
The specification of the measurement model amounts to the definition of the function h in Eq. 9. The role of h is to establish the relationship between the state x_t and the measurements that we expect to observe when the object O is in configuration x_t.
In this paper we adopt a likelihood field-like model (Thrun et al., 2008) and we decide to consider the point cloud y_t as an ensemble of independent points, each distributed according to a normal distribution

y_{t,j} ∼ N(π_{t,j}, σ²_j I₃).

Given an object O_{x_t}, the mean π_{t,j} is defined here as the point on the surface of the object ∂O_{x_t} having minimum distance from the real measured point y_{t,j}. More formally,

π_{t,j}(x_t) = argmin_{z ∈ ∂O_{x_t}} ‖z − y_{t,j}‖.

It can be easily shown that the model described in Eqs. 22-24 can be cast into the following measurement equation

y_t = h(x_t) + ν_t,   ν_t ∼ N(0, R_t),

where

h(x_t) = (π_{t,1}(x_t), …, π_{t,L}(x_t)),   R_t = blkdiag(σ²_1 I₃, …, σ²_L I₃).

In summary, given the object in configuration x_t, the point cloud that we expect to observe is obtained by projecting the actual point cloud y_t onto the surface of the object. The resulting measurement equation Eq. 26 is non-linear with additive noise, and the noise covariance matrix R_t has the block diagonal structure required by the hypothesis of the serial UKF correction step. The proposed measurement model also resembles the Nearest Neighbour procedure adopted in classical variants of the ICP registration algorithm. In addition, our measurement model provides a principled way to specify the uncertainty associated to the point cloud using the variances σ²_j.

Implementation of the Measurement Model
In order to implement the measurement equation Eq. 26, we should ideally solve the L × (2n + 1) optimization problems corresponding to the evaluation of the L projections in Eq. 24 for each of the 2n + 1 sigma points in Eq. 14. In order to reduce the computational cost of these evaluations, we approximate the projections π_{t,j} using a Nearest Neighbour approach.
In the following we assume that the 3D mesh of the object is available. First, we sample the mesh using the Poisson Disk Sampling algorithm proposed in White et al. (2007). This produces a cloud PC of uniformly distributed points on the surface of the object that we represent using a 3D k-d tree T. Then, in order to evaluate an approximation of the projections π_{t,j}(X^(i)_t) for a given sigma point X^(i)_t, it is sufficient to express the measurements y_{t,j} in the reference frame of the object in configuration X^(i)_t and to perform an efficient Nearest Neighbour search on the k-d tree T for each transformed point. The propagated sigma points, expressed in the reference frame of the object, are finally converted back to the reference frame of the camera, obtaining the sigma points Y^(i)_t. We stress that our choice to project the measurements onto the surface of the object in a given configuration represents a possible, although not unique, solution to the data association problem between the measurements y_t and the points PC sampled on the mesh of the object. This solution guarantees that the length of each propagated sigma point Y^(i)_t is the same as that of the measurement vector y_t, i.e., 3L, and allows evaluating the mean μ_y and the associated quantities in Eq. 15. Different solutions, e.g., projecting the points of the sampled cloud PC that are visible from the camera viewpoint onto y_t, would possibly produce sigma points of incompatible sizes, thereby making it impossible to execute the UT algorithm.
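A minimal version of this Nearest Neighbour approximation using SciPy's k-d tree could look as follows; the rotation/translation convention p_cam = R p_obj + t and all names are ours, and in practice the tree would be built once per object rather than per call:

```python
import numpy as np
from scipy.spatial import cKDTree

def project_on_model(y_cam, model_points, R, t):
    """Approximate the projections pi_{t,j} for one sigma point.
    y_cam        : (L, 3) measured points in the camera frame
    model_points : (M, 3) points sampled on the object mesh (object frame)
    R, t         : object rotation and translation for this sigma point.
    Returns the projected cloud expressed back in the camera frame."""
    tree = cKDTree(model_points)          # k-d tree over the sampled mesh
    y_obj = (y_cam - t) @ R               # camera frame -> object frame
    _, idx = tree.query(y_obj)            # closest model point per measurement
    return model_points[idx] @ R.T + t    # object frame -> camera frame
```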
We also remark that the implementation of the projections π t,j using a Nearest Neighbour search is made possible by the adoption of the Unscented Kalman filter that, differently from other alternatives such as the Extended Kalman filter, does not require the evaluation of the Jacobian of the projections, i.e., of the Nearest Neighbour search, with respect to the state x which might be intractable.

Outlier Rejection
Point clouds from modern RGB-D sensors are characterized by non-Gaussian noise and outliers that violate the hypotheses of the model in Eq. 25, leading to divergent estimates in Gaussian filters (Issac et al., 2016). To tackle this, we try to identify outliers by picking pairs of points p_i and p_j on the point cloud y_t, together with their projections π_i and π_j on the surface of the object in configuration x̂_{t−1}. Under the assumption that the tracking process has already reached a steady state condition, we expect the point cloud at time t to fit very closely the surface of the object in the configuration estimated at time t − 1. As a result, the distance d_{ij} between the two points and the distance d^π_{ij} between their projections should be preserved if both p_i and p_j are not outliers, i.e., if both belong to the surface of the object. Then, by comparing the absolute difference between d_{ij} and d^π_{ij} to a threshold, we are able to identify possible outliers. In such a case, the point between p_i and p_j that has the higher distance from its projection is marked as an outlier.
Since outliers that violate the additive Gaussian noise hypothesis are usually distributed relatively far from the actual surface of the object, we propose, as a heuristic, to form candidate pairs by combining each point p_i of the point cloud y_t with its furthest point p^f_i on the cloud. Every time an outlier is found, it is removed from the point cloud, and the procedure is repeated until all points have been visited.
An efficient evaluation of the points p^f_i is obtained using the algorithm proposed in Curtin and Gardner (2016).
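A single-pass sketch of the heuristic follows; the paper removes each detected outlier and repeats the procedure, whereas this simplified version visits each pair once, and all function and variable names are ours:

```python
import numpy as np

def reject_outliers(points, projections, furthest_idx, threshold):
    """1-parameter outlier rejection sketch. For each point p_i, pair it
    with its furthest point p_f(i) on the cloud and compare the pair
    distance d_ij to the distance of their projections d_pi_ij on the
    object surface; if they differ by more than `threshold`, mark the
    point of the pair that is further from its own projection as an
    outlier. Returns a boolean inlier mask."""
    outlier = np.zeros(len(points), dtype=bool)
    resid = np.linalg.norm(points - projections, axis=1)   # point-to-surface
    for i, f in enumerate(furthest_idx):
        d_ij = np.linalg.norm(points[i] - points[f])
        d_pi = np.linalg.norm(projections[i] - projections[f])
        if abs(d_ij - d_pi) > threshold:
            outlier[i if resid[i] > resid[f] else f] = True
    return ~outlier
```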

Motion Model
The motion model in Eq. 8 accounts for our prior knowledge about the motion of the object. We use the White Noise Acceleration (WNA) model (Bar-Shalom et al., 2002) from the tracking literature. This model assumes that the Cartesian acceleration r̈(t) is driven by a white noise process w_r(t) with assigned power spectral density Q_r ∈ R^{3×3}, i.e.,

r̈(t) = w_r(t).

The same model has been adopted for the rotational part of the state, with power spectral density Q_o, resulting in

ö(t) = w_o(t).

In order to obtain a discrete time model as in Eq. 8, we exactly discretized the WNA model assuming constant Cartesian and angular Euler rates within each sampling period Δ_T. The final model is

x_t = F x_{t−1} + q_t,   q_t ∼ N(0, Q),

where F is the state transition matrix, which depends on Δ_T, and Q is the process noise covariance matrix, which depends on Q_r, Q_o and Δ_T. The complete expressions of the matrices F and Q can be found in Bar-Shalom et al. (2002).
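For a single scalar coordinate, the exactly discretized WNA blocks have the standard closed form given in Bar-Shalom et al. (2002); a minimal sketch (the full 12-dimensional F and Q are block-diagonal compositions of these 2x2 blocks):

```python
import numpy as np

def wna_matrices(dt, q):
    """Exactly discretized White Noise Acceleration model for one scalar
    coordinate (position, rate), with sampling period dt and power
    spectral density q (Bar-Shalom et al., 2002)."""
    F = np.array([[1.0, dt],
                  [0.0, 1.0]])                     # constant-rate transition
    Q = q * np.array([[dt**3 / 3.0, dt**2 / 2.0],
                      [dt**2 / 2.0, dt]])          # exact discretized noise
    return F, Q
```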

6D Object Pose Tracking Framework
The tracking process is initialized by specifying the initial mean μ 0 and covariance P 0 representing the initial belief on the state of the object. Then, the tracking process starts as soon as the first segmentation mask M 0 is available. At each frame I t we apply the binary mask M t to the depth map and extract the segmented point cloud y t , refining it as described in Section 3.4 to remove outliers. The resulting point cloud is used to correct our belief of the pose and velocity of the object. Our 6D object pose tracking framework is shown in Figure 1.
In case the mask is not available at time t, we use the most recent available mask instead. Differently from pose estimation networks, such as DenseFusion (Wang C. et al., 2019), this choice allows our method to run at full frame rate even when coupled with slower instance segmentation algorithms. If M_t is empty, due to total occlusion, or most of the extracted 3D points are invalid, due to lack of texture, we stop feeding the measurements y_t to the correction step in Eqs. 20 and 21, because there are none or too few to be informative. Usually, such an absence of measurements is counteracted by using the Kalman prediction step only. However, if the object is undergoing a motion characterized by moderate to high velocities, this approach is discouraged as it leads to unbounded predictions. We instead keep performing the correction step using a static virtual point cloud ỹ_t(x̂_{t−1}) sampled on the surface of the object in the configuration corresponding to the last available estimate. As a result, the estimated velocities are driven to zero if the absence of information persists for multiple frames, allowing the tracking process to safely recover when measurements become available again. We remark that the proposed procedure does not require manually forcing the state of the filter, which would impair the correctness of the covariance associated with the state.
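The fallback logic for missing or uninformative measurements can be sketched as follows; MIN_POINTS is an assumed threshold (the paper does not report a specific value) and the function name is ours:

```python
import numpy as np

MIN_POINTS = 10  # assumed threshold for "too few points to be informative"

def select_measurements(segmented_cloud, virtual_cloud):
    """Return the cloud fed to the UKF correction step and a flag telling
    whether real measurements were used. When the segmented cloud is
    missing or too small, a static virtual cloud sampled on the object at
    the last estimate is used instead, so that the estimated velocities
    decay to zero rather than drifting under prediction-only updates."""
    if segmented_cloud is None or len(segmented_cloud) < MIN_POINTS:
        return virtual_cloud, False
    return segmented_cloud, True
```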

EXPERIMENTS
In this section we present the results of two separate sets of experiments that we conducted in order to evaluate the performance of our method.
In the first set of experiments, we evaluate our method against a standard dataset for pose estimation benchmarking. Following prior work (Li et al., 2018; Wang C. et al., 2019; Deng et al., 2019), we adopt the YCB Video dataset (Xiang et al., 2018). The main goals of the first set of experiments are to:
• compare the proposed architecture against state-of-the-art algorithms in tracking, i.e., PoseRBPF (Deng et al., 2019), and pose estimation, i.e., DenseFusion (Wang C. et al., 2019), from RGB-D images using standard metrics from the computer vision and tracking communities;
• assess the advantage of the proposed architecture as opposed to the Iterative Closest Point (ICP) baseline.

FIGURE 1 | Illustration of our MaskUKF framework for 6D object pose tracking. For each RGB-D frame, a generic instance segmentation algorithm provides the segmentation mask of the object of interest. The mask is combined with the depth frame to produce a partial point cloud of the object that is further refined to remove possible outliers. The resulting measurements are fed into an Unscented Kalman filter to update the belief of the object pose and velocity.
Moreover, these experiments aim to:
• determine the performance of the proposed approach in a multi-rate configuration, when segmentation masks are available at low frequency, i.e., at 5 fps, as might happen in a real scenario;
• assess the ability of our method to estimate the 6D velocity of the object (linear and angular);
• demonstrate the effectiveness of the outlier rejection procedure described in Section 3.4.
In the second set of experiments, we compare the performance of the proposed architecture in a robotic setting against a one-shot pose estimation network, i.e., DenseFusion (Wang C. et al., 2019), and against ICP within a dynamic pouring experiment carried out in the Gazebo (Koenig and Howard, 2004) environment using the iCub (Metta et al., 2010) humanoid robot platform. In these experiments, we employ a robotic control system in order to track a bowl container with the end-effector of the robot during a pouring action. We close the loop using either the output of our method or the output of the methods we compare to. The aim of these experiments is to understand how different algorithms for object pose estimation and tracking affect the closed loop tracking performance and the reliability of the simulated system in a dynamical robotic setting. Our results include end-effector tracking errors for several configurations of the control gains as well as the qualitative evolution of the end-effector trajectories over time.
Overall, the aim of our experiments is twofold. On one hand, we want to test our method on a standard computer vision dataset that has been adopted in the pose estimation and tracking communities. On the other hand, we want to test the effectiveness of our method in a robotic scenario involving closed loop vision-based control.

Comparison with the State of the Art
In order to conduct our experiments, we considered several algorithms from the state of the art.
PoseRBPF (Deng et al., 2019) is a 6D object pose tracking algorithm adopting a deep autoencoder for implicit orientation encoding (Martin et al., 2018) and a particle filter to track the position and the orientation of the object over time from RGB-D observations. This algorithm requires a textured 3D mesh model of the object both at training and test time and requires a 2D detector in order to be initialized. The authors adopt the segmentation stage of PoseCNN as source of detections. In order to have a fair comparison with PoseRBPF, we also adopted PoseCNN as source of segmentation masks in our experiments.
Following the authors of PoseRBPF, we also compare our method with a state-of-the-art 6D object pose estimation network, DenseFusion (Wang C. et al., 2019). This network fuses segmentation, RGB and depth information at the pixel level in order to regress the 6D pose of the object. An iterative refinement network is used to refine the pose. We note that this method has similar requirements to ours, since it needs the object segmentation in order to isolate 2D and 3D features on the part of the input image containing the object. Moreover, it requires a 3D mesh model of the object of interest in order to be trained. DenseFusion adopts PoseCNN as the source of segmentation masks, as does our method.
As a baseline, we also compare against a basic ICP point cloud registration algorithm (Rusu and Cousins, 2011). In order to have a fair comparison with our pipeline, we adopted an ICP algorithm that is assisted by segmentation and that is equipped with its own outlier rejection procedure. Given that the ICP algorithm needs to be initialized at each iteration, we adopted the estimate from the previous frame as the initialization. Similarly to the other algorithms that we considered, the ICP algorithm requires a 3D mesh model of the object in order to sample a target point cloud, on the surface of the object, to be registered with the point cloud extracted from the depth image.
In summary, our method and all the methods that we considered use the RGB data in order to segment or detect the object in the image plane at initialization time or at each frame and use the depth information as observation to track or estimate the pose of the object. Moreover, they all require a 3D mesh model at training time or at execution time.
Even though our approach is somewhat similar to the SegICP (Wong et al., 2017) algorithm, we do not compare against it as the authors do not provide their results on the YCB Video Dataset and the implementation of the algorithm is not publicly available.

Implementation Details
In order to perform our experiments, we used either the semantic segmentation network from the PoseCNN framework (Xiang et al., 2018), as trained by the authors on all the 200k+ frames of the YCB Video dataset, or the Mask R-CNN (He et al., 2017) model that we trained using 50k frames from the same dataset. For the experiments in the Gazebo environment, we relied on a simulated segmentation camera.
The initialization of the tracking is not a focus of this paper. For this reason, we reset both the UKF and the ICP to the ground truth perturbed with a displacement of 5 cm along each Cartesian axis and a variation of 10° for each Euler angle. Although we could have initialized the algorithms using a pose estimation network, to the best of our knowledge, the magnitude of our displacement is higher than the error produced by state-of-the-art pose estimation algorithms (this is actually confirmed by our Table 1, e.g., for DenseFusion for the <2 cm metric). Hence, our choice poses an even more challenging condition than initializing using a pose estimation network. We stress that the initialization is done as soon as the object is visible in the scene and the first segmentation mask is available. Differently from other state-of-the-art tracking algorithms, e.g., PoseRBPF (Deng et al., 2019), we never re-initialize our algorithm within a video sequence, e.g., after a heavy occlusion of the object.
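The perturbed initialization above can be sketched as follows; the sign of the perturbation and the Euler convention are our illustrative choices:

```python
import numpy as np

def perturbed_init(p_gt, euler_gt, dp=0.05, dtheta=np.deg2rad(10.0)):
    """Perturb a ground truth pose: 5 cm along each Cartesian axis and
    10 degrees on each Euler angle (perturbation signs are illustrative)."""
    p0 = p_gt + dp * np.ones(3)          # +5 cm on x, y and z
    e0 = euler_gt + dtheta * np.ones(3)  # +10 deg on each Euler angle
    return p0, e0
```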
We remark that all the results presented hereafter are obtained from a single run of our method since both the UKF and the ICP are deterministic.
To enforce repeatability in the evaluation of the projections in Eq. 24, we relied on uniform point clouds sampled on the 3D mesh of the objects as provided within the YCB Video Dataset.
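Uniform surface sampling on a triangle mesh, as used for the point clouds above, can be sketched as follows (area-weighted triangle selection followed by uniform barycentric sampling; a generic routine, not the exact one used in our experiments):

```python
import numpy as np

def sample_mesh_uniform(vertices, faces, n, seed=0):
    """Draw n points uniformly on the surface of a triangle mesh.
    vertices: (V, 3) float array, faces: (F, 3) int array of vertex indices."""
    rng = np.random.default_rng(seed)
    a, b, c = (vertices[faces[:, i]] for i in range(3))
    areas = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)
    idx = rng.choice(len(faces), size=n, p=areas / areas.sum())
    u, v = rng.random(n), rng.random(n)
    flip = u + v > 1.0                  # fold samples back into the triangle
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
    return a[idx] + u[:, None] * (b - a)[idx] + v[:, None] * (c - a)[idx]
```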
The code used for the experiments is publicly available online under an open source license.1

Description of the YCB Video Dataset
The YCB Video dataset features RGB-D video sequences of 21 objects from the YCB Object and Model Set (Calli et al., 2015) under different occlusion conditions. The dataset contains 92 RGB-D video sequences recorded at 30 fps, split in a training set of 80 videos and a testing set of 12 videos from which 2,949 key-frames are extracted in order to evaluate performance metrics. Every frame is annotated with segmentation masks and 6D object poses.

Evaluation Metrics
We use two different metrics. The first metric is the average closest point distance (ADD-S), which handles symmetric and non-symmetric objects in a single overall evaluation. This metric considers the mean distance between each 3D model point transformed by the estimated pose and its closest neighbour on the target model transformed by the ground truth pose. Following prior work (Wang C. et al., 2019), we report, for each object, the percentage of ADD-S smaller than 2 cm and the area under the ADD-S curve (AUC) with the maximum threshold set to 10 cm.
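The two quantities above can be sketched in a few lines of NumPy (a minimal illustration of the definitions, not the exact evaluation code used for the paper):

```python
import numpy as np

def transform(points, R, t):
    return points @ R.T + t

def add_s(model, R_est, t_est, R_gt, t_gt):
    """Average closest point distance: each estimated model point is matched
    to its nearest neighbour on the ground-truth-posed model."""
    est = transform(model, R_est, t_est)
    gt = transform(model, R_gt, t_gt)
    d2 = ((est[:, None, :] - gt[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2.min(axis=1)).mean()

def add_s_auc(errors, max_threshold=0.10):
    """Area under the accuracy-vs-threshold curve, clipped at 10 cm and
    normalized to [0, 1]. Each error e contributes 1/n to the accuracy for
    every threshold >= e, so the area is mean(max(0, max_threshold - e))."""
    errors = np.asarray(errors)
    return np.clip(max_threshold - errors, 0.0, None).mean() / max_threshold
```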
We report results using the ADD-S metric because it allows comparing with other state-of-the-art algorithms, especially when their implementation is not publicly available. Indeed, the authors of PoseRBPF (Deng et al., 2019) and DenseFusion (Wang C. et al., 2019) report the results of their experiments using this metric. Nevertheless, the ADD-S metric does not report the actual error in position and orientation in the algorithm output, which we deem very important in order to evaluate an algorithm for robotic applications. Moreover, the ADD-S metric is evaluated on a subset of the video frames, called key-frames. For these reasons, and for a better evaluation, we also report, as an additional metric, the Root Mean Square Error (RMSE) over the entire trajectory of the Cartesian error e_r,t and the angular error e_o,t, which is the standard metric used in the tracking literature (Bar-Shalom et al., 2002). We evaluate the angular error as the geodesic distance (Huynh, 2009) between the actual and the estimated orientation in [0, π). In order to have a fair comparison, we omitted the part of the angular error due to symmetries for texture-less objects and for objects that have a pure cylindrical shape. In fact, algorithms that only use depth information, such as ours and ICP, cannot infer the rotation along the main axis of rotation of such objects. Table 1 shows the evaluation for all the 21 objects in the YCB Video dataset. We report the ADD-S AUC (<10 cm) and the ADD-S < 2 cm metrics evaluated on the 2,949 key-frames provided.

TABLE 1 | Quantitative evaluation of 6D pose on YCB-Video Dataset key-frames (ADD-S). We do not report ADD-S < 2 cm for PoseRBPF as this information is not available in the original paper. In this experiment, segmentation masks are available at each frame I_t. A bold entry indicates a strictly better result (i.e., if different algorithms have the same best result, the associated entries are not bolded).
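The RMSE metrics can be computed analogously; a minimal sketch with the angular error evaluated as the geodesic distance on SO(3):

```python
import numpy as np

def geodesic_distance(R_a, R_b):
    """Geodesic distance between two rotation matrices, in [0, pi]."""
    cos = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

def rmse(errors):
    """Root Mean Square Error over a whole trajectory of scalar errors."""
    errors = np.asarray(errors)
    return np.sqrt((errors ** 2).mean())
```

The Cartesian RMSE is obtained by applying `rmse` to the per-frame position errors, the angular RMSE by applying it to the per-frame geodesic distances.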
In order to evaluate on the exact key-frames, we run our experiments under the hypothesis that the segmentation results are available for each frame I_t. When considering segmentation from PoseCNN, the percentage of frames with error less than 2 cm is lower for our method than for DenseFusion. However, our method outperforms both the DenseFusion framework and ICP in terms of the area under the ADD-S curve with maximum threshold set to 10 cm. This demonstrates that, in the interval [0 cm, 2 cm), the errors for MaskUKF have a distribution that is more concentrated towards zero.

ADD-S Metric
We remark that even if the increase in performance is moderate, our architecture is much simpler than some pose estimation networks, such as DenseFusion, especially in terms of training requirements (we recall that MaskUKF can track object poses given a suitable initial condition and a trained segmentation network). Our method also outperforms the tracking algorithm PoseRBPF under the same metric.
We detected a performance drop when comparing results obtained with different segmentation networks, namely PoseCNN and Mask R-CNN. We verified that this is due to missing detections or completely wrong segmentations from Mask R-CNN; see, e.g., objects 003, 008 and 035. Another example is that of the two "clamps" (051 and 052), which have the very same shape but a different scale and are hence difficult to segment using a conventional instance segmentation network. For these two objects PoseRBPF outperforms all the other algorithms. More in general, the results suggest that the performance might vary with the specific segmentation algorithm that is adopted. In this work, differently from PoseRBPF and DenseFusion, which rely on PoseCNN for the detection and segmentation, respectively, we have also considered a general purpose instance segmentation algorithm, Mask R-CNN, that has not been trained in the context of 6D pose estimation.

RMSE Metric
In Table 2, we report the RMSE metrics evaluated on all the frames. In order to compare with DenseFusion, we run their pipeline on all the frames of each video sequence. We cannot compare with PoseRBPF since their implementation is not publicly available.
When equipped with the PoseCNN segmentation network, our method outperforms both the DenseFusion framework and ICP. While the increase in performance is minimal for the Cartesian error, the difference is considerable for the angular error which, on average, is almost halved with respect to the competitors.
Using segmentation from Mask R-CNN, our method still outperforms ICP in terms of the angular error, while the Cartesian error is comparable. As already noticed, the RMSE metric also reveals, on average, a drop in performance when using this segmentation with respect to PoseCNN, due to missing detections or completely wrong segmentations.
If we consider both segmentations, we can see that, for some specific objects, the reduction in angular error is indeed substantial. As an example, when using Mask R-CNN, the angular error for object 005 is 6.32° for MaskUKF, while it is considerably higher for the competing algorithms. As shown in Table 3, where we report the maximum percentage of occlusion for each object among all YCB Video frames, these objects are among those involved in moderate to severe occlusions.

Qualitative Evaluation
In order to provide an insight into the motivations behind the performance improvement, we show in Figure 2 an excerpt of the trajectories of the estimated pose for the object 021 in the sequence 0055 according to several algorithms. In that sequence, the object is visible from its shortest and texture-less side (Figure 3), hence producing partial and ambiguous measurements. As can be seen both from the plot and in Figure 3, the estimate from ICP, which uses only depth information, pivots around the measurements, resulting in a totally wrong estimate of the orientation. The output orientation from DenseFusion is often wrong despite the availability of RGB information, because the visible part of the object is almost texture-less, hence ambiguous. Our algorithm, instead, provides a sound estimate of the pose over time. This can be explained in terms of the motion model in Eq. 28, which helps regularize the estimate when only partial 3D measurements are available, using the information encoded in the estimate of the previous step and in the estimated velocities. In the specific case of the sequence 0055, the camera rotates very slowly around the object of interest, hence the estimated velocities help enforcing the correct estimate and reducing the pivoting phenomena seen in the outcome of the ICP algorithm. A similar reasoning justifies the performance of our method in the presence of occlusions.

Figure 3 illustrates some sample 6D estimates on the YCB Video Dataset. As can be seen, both ICP and DenseFusion fail to estimate the correct pose of the cans in the leftmost columns due to challenging occlusions. In the central column, the bleach cleanser is visible from its shortest and texture-less side, resulting in wrong estimates of the orientation when using DenseFusion and ICP. Another challenging case is that of the cracker box, on the right, which is only partially visible in the RGB frame. While DenseFusion struggles to estimate the correct orientation, our algorithm and ICP are able to recover it properly. We do not report qualitative results for PoseRBPF as their implementation is not publicly available, hence it is not possible to reproduce the estimated poses.

Figure 2 shows trajectory samples of the estimated pose for one of the objects from the YCB Video Dataset obtained by different algorithms. Our algorithm is visibly more precise than ICP and smoother than DenseFusion, which makes it more suitable for closing the control loop in robotic applications.

FIGURE 2 | Comparison of the trajectories of several algorithms for the object 021_bleach_cleanser within a subset of the sequence 0055 from the YCB Video Dataset. Our method is more precise than ICP and smoother than DenseFusion, which exhibits spikes and irregularities. In this figure, the orientation of the object is represented compactly using the Euler vector θe, obtained as the product of the axis of rotation e and the angle of rotation θ.

Frontiers in Robotics and AI | www.frontiersin.org March 2021 | Volume 8 | Article 594583

Multi-Rate Experiment
In Table 4, we report the results of the experiment in which we feed the UKF and the ICP algorithm with masks from Mask R-CNN (He et al., 2017) at the maximum frame rate declared by the authors, i.e., 5 fps. As described in Section 3.6, if the mask at time t is not available, our algorithm uses the most recent one instead. In order to make the comparison fair, we adopted the same strategy for the ICP algorithm. We provide the ADD-S (AUC) and RMSE metrics.
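The mask re-use logic described above can be sketched as a small buffer (names are illustrative, not from the released code):

```python
class MaskBuffer:
    """Keep the most recent segmentation mask so that the tracker can run at
    the full camera frame rate while the segmentation network runs at a
    lower rate (e.g., 5 fps for Mask R-CNN)."""

    def __init__(self):
        self._last = None

    def update(self, mask):
        if mask is not None:   # a new mask arrived from the network
            self._last = mask
        return self._last      # otherwise reuse the most recent one
```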
In this scenario, our algorithm outperforms the ICP baseline, especially in terms of the orientation of the objects. A direct comparison with the results in Table 1 reveals a drop in the ADD-S performance (slightly more accentuated for ICP). This is an expected behavior due to the re-use of old masks over several consecutive frames. The same considerations apply to the position and orientation of most objects, except for those, such as 051 and 052, whose masks provided by Mask R-CNN are wrong or missing in several frames. Indeed, if the masks are provided less frequently, there are fewer chances that the tracking algorithm uses erroneous 3D information extracted from wrong masks.

Results on Velocity Estimation
We report a qualitative evaluation of the velocity estimates (linear and angular) in Figure 4, which shows that tracking is performed reliably. Given that the YCB Video Dataset does not provide a ground truth for the velocities, we extracted it from the ground truth poses of consecutive frames by means of finite differences. The angular velocity, indicated as ω, was evaluated starting from our estimate of the Euler angular rates ȯ using the Euler kinematical equation (Stevens et al., 2015). Table 5 shows the effectiveness of the outlier rejection procedure presented in Section 3.4. Our algorithm performs better when the outliers are taken into account, especially in terms of a reduced angular error.
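A minimal sketch of how such finite-difference ground truth velocities can be extracted from consecutive poses; the rotational part uses the SO(3) logarithm (an illustrative routine, not the exact evaluation code):

```python
import numpy as np

def so3_log(R):
    """Axis-angle vector of a rotation matrix (matrix logarithm on SO(3))."""
    cos = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos)
    if theta < 1e-8:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2.0 * np.sin(theta)) * w

def finite_difference_velocity(p0, R0, p1, R1, dt):
    """Linear and angular velocity between two consecutive ground truth poses."""
    v = (p1 - p0) / dt
    omega = so3_log(R1 @ R0.T) / dt   # relative rotation from frame t to t+1
    return v, omega
```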

Frame Rate Comparison
We compare our frame rate against DenseFusion and PoseRBPF in Table 6. For our algorithm, we evaluated the frame rate as the inverse of the mean time, averaged on the total number of frames of the testing sequences, required to perform outlier rejection and the Kalman prediction and correction steps. We did not consider the time required to segment the object in our evaluation since, as described in Section 3.6, our method can run asynchronously with respect to the frame rate of the segmentation algorithm. For ICP, the evaluation was done on the mean time required to perform the registration step between the source and target point cloud. For DenseFusion and PoseRBPF we considered the frame rates reported in the associated publications. We notice that the frame rate reported by the authors of DenseFusion, i.e., 16.7 fps, includes the time required to segment the object. In order to have a fair comparison, we omitted the segmentation time resulting in a frame rate of 30 fps for this algorithm.
We ran our experiments on an Intel i7-9750H multi-core CPU. Our method is approximately one and a half times faster than DenseFusion and ten times faster than PoseRBPF. Given its simplicity, the ICP implementation that we adopted reaches 91.7 fps.

Results on Closed Loop Control
In this section we compare our method with DenseFusion and the baseline ICP within a simulated robotic experiment where the output of each method is fed as a reference signal to a closed loop control system. For our experiments, we adopt the iCub humanoid robotic platform (Metta et al., 2010), which we simulated in the Gazebo (Koenig and Howard, 2004) environment (Figure 5). We do not test against PoseRBPF since the implementation of the algorithm is not publicly available. Our results include the end-effector tracking errors for several configurations of the control gains in order to compare the algorithms on tracking precision and reliability when the amount of feedback given to the control system is varied. We also show the qualitative evolution of the end-effector trajectories over time in order to discuss the feasibility of the commanded trajectories.

FIGURE 4 | Comparison between estimated linear (v) and angular (ω) velocities for several algorithms within a video sequence from the YCB Video dataset. The ground truth velocities are obtained from finite differences.

Description of the Experiment
In our experiment we consider the task of following a moving container with the end-effector of a humanoid robot while the robot is holding a bottle and pouring its content inside the container. The task is depicted in Figure 6. We assume that the 6D pose and the velocity of the container are estimated/tracked using RGB-D information, while the position of the bottle is known from the robot kinematics (i.e., we focus on the task of tracking the container only). We are not directly interested in the grasping task nor in the pouring action per se, but rather in the possibility of exploiting the estimated signal in a closed loop fashion in order to follow the container as closely as possible while avoiding contacts that might compromise the pouring action. We adopted the YCB Object and Model Set (Calli et al., 2015) for the objects of the experiment, choosing the object 006_mustard_bottle as the bottle and the object 024_bowl as the container (Figure 5).

Implementation of the Experiment
We implemented our experiment in the Gazebo environment using a simulated model of the iCub robotic platform (Metta et al., 2010). Even though iCub features 53 DoF, only a subset of these are used in our experiment, i.e., three DoF in the torso, seven DoF in the arm and six DoF in the head, where the stereo vision system is mounted. In our experiment we assume that the vision system of the robot provides segmentation and RGB-D information that we use to estimate and/or track the pose of the container. Additionally, we use the iCub gaze control system (Roncone et al., 2016) to track the container in the image plane by using the estimated Cartesian position of the object as a reference gazing point. In order to carry out the task described in Section 4.4.1, we consider the torso (two DoF out of three) and one of the arms as a nine DoF serial chain whose dynamic behavior is described by the standard equation

M(q) q̈ + h(q, q̇) = τ,    (30)

where q, q̇, q̈ ∈ R^9 are the joint angles, velocities and accelerations, respectively, M(q) ∈ R^{9×9} is the Mass matrix of the chain, h(q, q̇) ∈ R^9 represents the effect of centrifugal, Coriolis and gravity terms and τ ∈ R^9 are the torques applied to the chain joint axes.
In order to command the end-effector of the robot and follow the moving container over time, we adopted a two-layer control architecture. The first layer consists in an Inverse Dynamics controller (Eq. 31) that maps a desired end-effector acceleration into joint torques. Here, J_ee ∈ R^{6×9} is the Jacobian of the end-effector expressed in the robot root frame, which links the joint velocities q̇ with the end-effector linear velocity v_ee ∈ R^3 and angular velocity ω_ee ∈ R^3,

(v_ee, ω_ee) = J_ee(q) q̇,

both expressed in the robot root frame. The matrix Λ(q) in Eq. 31 is also called the Task Space Mass matrix. The term V̇_des ∈ R^6 is a vector containing the desired linear acceleration of the end-effector v̇_des ∈ R^3 and the desired angular acceleration ω̇_des ∈ R^3, both expressed in the robot root frame. If controlled using the torques in Eq. 31, the system in Eq. 30 reduces to the system of equations

(v̇_ee, ω̇_ee) = V̇_des,    (34)

where v̇_ee ∈ R^3 is the linear acceleration of the end-effector and ω̇_ee ∈ R^3 is the angular acceleration. In essence, the first layer reduces the dynamics of the serial chain to a linear system having V̇_des as input. The second control layer consists in a Proportional Derivative controller acting on the end-effector configuration error e_p ∈ R^6,

e_p = (p_ee − p_des, log(R_des R_ee^T)),    (36)

and on the end-effector velocity error e_v ∈ R^6,

e_v = (v_ee − v_des, ω_ee − ω_des).    (37)

Here, p_ee ∈ R^3 is the Cartesian position of the end-effector, R_ee ∈ SO(3) is the orientation of the end-effector, v_ee ∈ R^3 is the linear velocity of the end-effector and ω_ee ∈ R^3 is the angular velocity, all known via the robot forward kinematics map and forward differential kinematics map. Moreover, p_des ∈ R^3 is the desired position of the end-effector, R_des ∈ SO(3) is the desired orientation of the end-effector, v_des ∈ R^3 is the desired linear velocity of the end-effector and ω_des ∈ R^3 is the desired angular velocity. We recall that the expression log(R_des R_ee^T) in Eq. 36 represents a proper representation of the angular error in the Lie algebra so(3) (Bullo and Murray, 1999).
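A minimal, self-consistent sketch of the second (Proportional Derivative) layer follows; the function names and the exact sign conventions are our illustrative choices, not the controller code used in the experiments:

```python
import numpy as np

def so3_log(R):
    """Axis-angle vector of a rotation matrix (matrix logarithm on SO(3))."""
    cos = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos)
    if theta < 1e-8:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2.0 * np.sin(theta)) * w

def pd_task_space(p_ee, R_ee, v_ee, w_ee,
                  p_des, R_des, v_des, w_des, kp, kd):
    """PD law producing the desired 6D end-effector acceleration V_dot_des.
    The orientation error is the rotation from the desired to the current
    frame, so that -kp * e_p drives R_ee toward R_des."""
    e_p = np.concatenate([p_ee - p_des, so3_log(R_ee @ R_des.T)])
    e_v = np.concatenate([v_ee - v_des, w_ee - w_des])
    return -kp * e_p - kd * e_v
```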
In our experiment, we assume that the position of the tip of the bottle with respect to the robot end-effector is known at any time. Given this assumption, we consider the end-effector frame ee in Eqs. 36 and 37 to be a frame attached to the tip of the bottle. Given this choice, we conclude the design of our control system by setting the desired quantities as follows:

p_des = r_t,   R_des = R_des(R_0, o_t),   v_des = v_t,   ω_des = ω_t(o_t, ȯ_t),

where r_t, o_t are the estimated container position and orientation and v_t, ȯ_t are the estimated container Cartesian velocity and Euler angular rates as defined in Eq. 1. The term R_des(R_0, o_t) takes into account the orientation of the container and a default pouring configuration R_0 in order to provide the desired orientation of the end-effector, while the term ω_t(o_t, ȯ_t) represents the conversion from Euler rates to angular velocities.
We remark that the Cartesian velocity and the angular velocity of the container are directly provided by our method as part of the state defined in Eq. 1. In order to execute the experiment with the pose estimation algorithm DenseFusion and with the baseline ICP, we approximated the Cartesian and angular velocity using finite differences.

Evaluation Metrics
In order to compare the different algorithms, we use the Cartesian error between a given coordinate x of the end-effector 3D position and the corresponding coordinate r^gt_{t,x} of the ground truth 3D position of the object, and the geodesic angular error (Huynh, 2009) between the orientation R_ee of the end-effector and the desired orientation R_des(R_0, o^gt_t) evaluated on the ground truth orientation of the object o^gt_t.

Results of the Experiment
We tested our method, DenseFusion and the baseline ICP on the closed loop experiment described above under the following assumptions:
• a sinusoidal trajectory is assigned to the moving container along the y direction of the robot root frame;
• a sinusoidal trajectory is assigned to the orientation of the container along one of its axes;
• before starting the experiment, the end-effector is reset to a rest configuration near the container;
• each experiment lasts 1 min in order to test the reliability of the system;
• our method and ICP are initialized using the ground truth from the simulator.
An excerpt of the trajectory of the moving container can be seen in Figure 6.
In Figure 7 we compare the three algorithms in terms of the Cartesian error e_y and the angular error e_R. Specifically, we consider the Root Mean Square Error (RMSE) along 1 min of experiment for several choices of the proportional gain. For each choice of the gain we assigned the derivative gain as k_d = 2√k_p, as this choice assures the fastest possible closed loop dynamics for the double integrator system in Eq. 34.
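The choice k_d = 2√k_p corresponds to critical damping of the double integrator, i.e., the fastest convergence without overshoot, which can be checked with a minimal simulation (illustrative, not the controller code used in the experiments):

```python
import numpy as np

def simulate(kp, kd, x0=1.0, dt=1e-3, steps=20000):
    """Integrate x_ddot = -kp*x - kd*x_dot with semi-implicit Euler."""
    x, v = x0, 0.0
    xs = []
    for _ in range(steps):
        a = -kp * x - kd * v
        v += a * dt
        x += v * dt
        xs.append(x)
    return np.array(xs)

# critically damped gains: the error decays to zero without oscillating
xs = simulate(kp=50.0, kd=2.0 * np.sqrt(50.0))
```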
As can be seen from Figure 7, both errors decrease for all the algorithms as the gain increases up to k_p ≈ 50. For higher gains, the error increases for both DenseFusion and ICP, resulting in a failure of the experiment for k_p = 160. Conversely, our method allows increasing the gain up to k_p = 320 and reaching a lower tracking error, especially for the orientation part. This behavior can be explained in terms of the smoothness of the pose estimates produced by our method, as discussed in Section 4.3.5. Furthermore, our method naturally provides a smooth estimate of the velocity of the object, as shown in Section 4.3.7, while for the other algorithms we resorted to velocities obtained by finite differences. Finite difference approximations are typically noisy, even in the case of DenseFusion (whose pose estimates are quite precise but noisy, see Figure 2), thus limiting the maximum possible amount of feedback and hence the tracking performance of the closed loop system. We conclude our analysis by showing the actual trajectories of the end-effector for two specific choices of k_p, namely 80 and 160.

FIGURE 7 | Comparison between RMSE Cartesian and angular error for the object following experiment for varying proportional gains k_p. Our method allows increasing the gain up to k_p = 320, reaching better tracking performance.
In Figure 8, we show the desired and achieved trajectory of the y coordinate for the first 20 s. As can be seen, in the case of k_p = 160, with our method the end-effector achieves a steady and smooth behavior. When using ICP, the end-effector fails to track the container after ≈ 17 s and shows a moderately noisy behavior. With DenseFusion, tracking is lost after ≈ 10 s and the motion is characterized by non-negligible spikes which make it unsafe for the robot.
In Figure 9, we show the evolution of the angular error over time for the same choices of the proportional gain. As can be seen, moving from k_p = 80 to k_p = 160 helps reducing the mean angular error when using our method and ICP. However, as already seen for the y coordinate, with ICP tracking is lost at some point and the error diverges. When using DenseFusion with k_p = 80, the error is much higher than with MaskUKF and ICP, and it diverges when moving to k_p = 160.
In summary, our method provides better performance in terms of tracking precision and reliability when the amount of feedback given to the control system is increased. We stress that this result depends in large part on the adoption of a Kalman filtering framework for our method which, leveraging even a rather simple motion model as in Eq. 28, can produce smooth estimates of both object pose and velocity.

CONCLUSION
We proposed MaskUKF, an approach that combines deep instance segmentation with an efficient Unscented Kalman Filter to track the 6D pose and the velocity of an object from RGB-D images. We showed that MaskUKF achieves, and in most cases surpasses, state-of-the-art performance on the YCB video pose estimation benchmark, while providing smooth velocity estimation without the need for expensive pose annotation at training time.
Experiments with the state-of-the-art segmentation algorithm Mask R-CNN at the relatively low frame rate of 5 fps suggest possible future research on the integration with recent architectures such as SiamMask (Wang Q. et al., 2019) or Siamese Mask R-CNN (Michaelis et al., 2018), which have proven to provide comparable accuracy at higher frame rates.
Our results show that, for some objects, simple solutions like ICP, which operate without a motion model, perform very similarly to the state of the art. This seems to suggest that the YCB Video dataset is not challenging enough, despite having become a popular dataset for pose estimation benchmarking. As future work, we propose to develop more challenging pose estimation and tracking datasets that can effectively show the shortcomings of classical approaches such as ICP and motivate the necessity for complex deep learning-based architectures.
Experiments in a simulated dynamical environment highlight the superior performance of our method for closed loop control tasks on robotic platforms. At the same time, they show how state-of-the-art RGB-D deep approaches, while being precise enough for pose estimation purposes, might be inadequate for control tasks due to unregulated noise in the output, which limits the overall performance. As future work, we propose to develop standard benchmarks, specifically tailored to the robotics community, to ascertain the actual usability of pose estimation/tracking algorithms for such tasks.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: https://rse-lab.cs.washington.edu/projects/posecnn/.

AUTHOR CONTRIBUTIONS
NP designed and implemented the core algorithm presented in the paper and carried out the experiments on the YCB Video dataset. He designed and implemented the closed loop experiments with the iCub humanoid robot in the Gazebo environment. FB prepared the training data from the YCB Video Dataset and trained the Mask R-CNN instance segmentation network used as input to the algorithm. NP, CF, and GV contributed to the design of the filtering algorithm and the measurement model. LN, UP, GV, CF, and FB contributed to the presented ideas and to the review of the final manuscript.