6IMPOSE: bridging the reality gap in 6D pose estimation for robotic grasping

6D pose recognition has been a crucial factor in the success of robotic grasping, and recent deep-learning-based approaches have achieved remarkable results on benchmarks. However, their generalization capabilities in real-world applications remain unclear. To overcome this gap, we introduce 6IMPOSE, a novel framework for sim-to-real data generation and 6D pose estimation. 6IMPOSE consists of four modules: First, a data generation pipeline that employs the 3D software suite Blender to create synthetic RGBD image datasets with 6D pose annotations. Second, an annotated RGBD dataset of five household objects generated using the proposed pipeline. Third, a real-time two-stage 6D pose estimation approach that integrates the object detector YOLO-V4 and a streamlined, real-time version of the 6D pose estimation algorithm PVN3D optimized for time-sensitive robotics applications. Fourth, a codebase designed to facilitate the integration of the vision system into a robotic grasping experiment. Our approach demonstrates the efficient generation of large amounts of photo-realistic RGBD images and the successful transfer of the trained inference model to robotic grasping experiments, achieving an overall success rate of 87% in grasping five different household objects from cluttered backgrounds under varying lighting conditions. This is made possible by fine-tuning data generation and domain randomization techniques and optimizing the inference pipeline, overcoming the generalization and performance shortcomings of the original PVN3D algorithm. Finally, we make the code, synthetic dataset, and all the pre-trained models available on GitHub.


I. INTRODUCTION
Reliable robotic grasping remains a challenge in many precision-demanding robotic applications, such as autonomous assembly [1] and palletizing [2]. To overcome this challenge, one approach is to accurately recognize the translation and orientation of objects, known as the 6D pose, to minimize grasping uncertainty [3]. Recent learning-based approaches leverage deep neural networks (DNNs) to predict the 6D object pose from RGB images, achieving promising performance.
Nonetheless, estimating 6D poses from RGB images is challenging. Perspective ambiguities, where the appearances of the objects are similar under different viewpoints, hamper effective learning. This problem is further exacerbated by occlusions in cluttered scenarios [4]. Additionally, as in many computer vision tasks, the performance of the algorithms is vulnerable to environmental factors, such as lighting changes and cluttered backgrounds [5]. Furthermore, the use of learning-based methods requires a substantial amount of annotated training data, which is a limiting factor in practical applications as data labeling is time-consuming and costly.
To address the challenges faced by RGB-based approaches, RGBD-based 6D pose estimation algorithms leverage the additional modality from depth images, which provide geometric information that is independent of lighting and color. One way to leverage depth images is to use the depth for fine pose refinement based on the coarse pose predicted from RGB images [6], [7]. In this case, the initial poses are estimated from the RGB images using DNNs, and the depth information is used to optimize the pose with the Iterative Closest Point (ICP) algorithm to increase the accuracy. Another approach is to convert the depth image into point clouds, from which the 6D pose is predicted [8], [9]. Due to the unstructured nature of the data, working directly on the point cloud is computationally expensive. [8], [9] first employ an instance detection network to segment the target from the RGB images and crop the point cloud correspondingly. After that, point cloud networks work on the cropped point cloud to predict the 6D pose.
arXiv:2208.14288v2 [cs.CV] 9 Mar 2023
As an alternative, the geometric features can be directly extracted from the point cloud using DNNs and merged with the RGB features [10]-[15]. Typically, the extracted features of both modalities are matched geometrically and concatenated before further processing [10]-[13]. This approach is simple to implement and simplifies training, as the feature extraction networks can be pre-trained in isolation on the available image and point cloud datasets. However, the feature extraction on the two modalities cannot benefit from each other to enhance representation learning, as the feature extraction DNNs do not communicate. FFB6D [14], on the other hand, achieves better performance by exploring bidirectional feature fusion at different stages of feature extraction. In this way, the local and global complementary information from both modalities can be used to learn better representations. Moreover, by first localizing the target object and excluding the irrelevant background, the feature extraction can concentrate on the region of interest, which further improves performance [15].
After feature extraction, different approaches exist to derive the object pose. Direct regression uses dense neural networks to regress the object's pose directly [16]. While this approach allows end-to-end learning and does not require decoding the inferred pose, the optimization of the DNNs is usually difficult due to the limitations of the mathematical representation of the orientation [16]. Another common approach is to predict orientation-less keypoints and retrieve the pose from their geometric correspondence. [13]-[15] use DNNs to predict the keypoints in 3D space, and then compute the 6D pose via geometric matching of paired predicted and ground-truth keypoints.
State-of-the-art 6D pose estimation algorithms have achieved excellent performance as evaluated on benchmarks [8], [10], [12]-[15]. However, the training and validation data used in these benchmarks are often correlated, as they are commonly sourced from video frames. Additionally, they may contain environmental features that can bias the learning process and simplify the inference. These factors raise concerns about the generalization of these algorithms and their ability to perform well in real-world scenarios.
Applying the state-of-the-art algorithms to practical robotic applications is non-trivial, as the training of 6D pose estimation algorithms has a high demand for annotated data [17]. 6D pose labeling of images is time- and labor-intensive, which limits the availability of datasets. On the other hand, using modern simulations to generate synthetic data for training DNNs shows great potential with low cost and high efficiency. For RGB-based approaches, [6] and [7] render 3D meshes in OpenGL to generate synthetic RGB images with random backgrounds from commonly used computer vision datasets, for example Pascal VOC [18] or MS COCO [19]. Some RGBD approaches [12]-[15] use image composition in RGB and only render depth for the labeled objects. Recently, modern simulations, such as Unity or Blender, enable realistic rendering of full RGBD images, making these engines popular for generating high-quality training datasets [9].
Unfortunately, the performance of models trained solely on synthetic data often deteriorates when tested on real images due to the so-called reality gap [6], [20]. To mitigate the reality gap, domain randomization techniques are often applied to the synthetic data [21]. Domain randomization can be applied to different aspects of image generation. Before rendering, the scene can be randomized by varying the pose of objects, backgrounds, lighting, and the environment to cover as many scenarios as possible [6], [9], [22]. After rendering, the RGB and depth images can be directly altered, for example by changing image contrast and saturation or by adding Gaussian blur and color distortion [6], [7]. The depth images can be randomized by injecting Gaussian and Perlin noise [23] to approximate the noise present in a real camera.
Many works [6], [7], [9] on 6D pose estimation from synthetic data only evaluate on benchmarks; their performance in the real world remains unclear. [24] deploy 6D pose estimation DNNs to real-world robotic grasping, showing promising performance when tested under normal lighting conditions in a structured environment. When tested in unstructured scenarios, where environmental conditions can be inconsistent, the learned algorithms often need real-world data for fine-tuning to bridge the domain gap and achieve comparable performance [25], [26].

II. METHODS
In this section, we first introduce a data preparation pipeline for synthetic data generation and augmentation. Second, we present a two-stage approach to solve the 6D pose estimation problem in real time for robotic applications.

A. Synthetic data generation
In this work, the synthetic data is generated in Blender [29] by leveraging its state-of-the-art raycasting rendering functionality. To render RGBD images, a textured 3D model of the object is required, which can be derived from CAD data or collected by 3D scanning.
Image Generation: Given a set of objects, we generate a separate dataset for each object of interest, with the other objects and additional unrelated objects acting as distractors.
For each scene to be rendered, we randomly place the objects in the camera's view. In order to avoid overfitting on the color during training, we recolor 25% of the distracting objects with the dominant color of the main object. Moreover, the distractors' optical properties, such as surface roughness and reflectivity, are varied to further increase the variety of generated images.
During simulation, the randomly placed distracting objects can severely occlude the main object, so that the main object is not clearly visible, resulting in invalid training data. To avoid this, we check whether the centroid of the main object is occluded, in which case we move the occluding objects behind the main object.
We sample images from SUN2012 [30] and add smooth 2D Perlin noise as in [31] to each color channel to cover different environments and sensors.
Depth data augmentation: The synthetic depth images rendered from simulations are noiseless and almost perfect, which is not the case for images obtained from a real depth camera, where the depth values are often inconsistent and incomplete [32]. To approximate inconsistent depth values, we introduce Gaussian noise and Perlin noise to augment synthetic depth images. Similar to [23], pixel-level Gaussian noise is added to the synthetic depth images, resembling a blurring effect.
Smooth Perlin noise has been shown to significantly increase performance when learning from synthetic depth data [33]. We create Perlin noise with random frequency and amplitude and add it directly to the depth channel. The introduced Perlin noise shifts each depth point along the perceived Z-axis, resulting in a warped point cloud, similar to the observed point clouds of real depth cameras. In real RGBD images, a misalignment can be observed between depth and RGB images. Similar to [34], we use Perlin noise again to additionally warp the depth image in the image plane. Instead of using a 3D vector field to warp the entire depth image, we restrict warping to the edges of the objects. We apply a Sobel filter to detect the edges and obtain edge masks. We then shift the pixels on the edges using a 2D vector field generated using Perlin noise.
The rendered depth images have no depth information where there is no 3D model, resulting in large empty areas between objects. However, it is also very important to simulate plausible depth values for the background [33].
The background depth is based on a randomly tilted plane, to which we add random Gaussian noise. The noise is sampled on a grid over the image and then interpolated. An additional Gaussian noise is sampled from a second grid and again interpolated. Due to the random and independent choices of grid sizes and interpolation for the two grids, we can achieve a wide variety of depth backgrounds. By adding an appropriate offset, we guarantee that the artificial background is in close proximity to the main object, thus making object segmentation from the background more difficult. The artificial depth background then replaces empty depth pixels in the original synthetic depth image.
In the real depth images, some regions might miss the depth values and are observed as holes due to strong reflections of the object or other limitations of the depth sensor [32]. To simulate the missing depth problem, we first generate a random 2D Perlin noise map, which is converted to a binary masking map based on a threshold. This binary masking map is then used to create missing regions in the synthetic depth image.
While this method is not an accurate simulation, we found this approximation, in combination with the other augmentation strategies, useful to improve the accuracy of the neural network.
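The hole simulation can be illustrated with a minimal sketch. It is not the paper's implementation: a bilinearly interpolated random grid stands in for true Perlin noise, the function names `smooth_noise` and `drop_depth` and the default threshold are hypothetical, and encoding a missing depth value as 0 is an assumption.

```python
import random

def smooth_noise(height, width, cells=4, seed=0):
    """Bilinearly interpolated random grid, a simple stand-in for Perlin noise.

    Returns a height x width list of rows with values in [0, 1).
    """
    rng = random.Random(seed)
    grid = [[rng.random() for _ in range(cells + 1)] for _ in range(cells + 1)]
    noise = []
    for r in range(height):
        gy = r / height * cells
        y0 = int(gy)
        fy = gy - y0
        row = []
        for c in range(width):
            gx = c / width * cells
            x0 = int(gx)
            fx = gx - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x0 + 1] * fx
            bottom = grid[y0 + 1][x0] * (1 - fx) + grid[y0 + 1][x0 + 1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        noise.append(row)
    return noise

def drop_depth(depth, threshold=0.7, seed=0):
    """Set depth to 0.0 (assumed 'missing' value) where the noise exceeds the threshold."""
    noise = smooth_noise(len(depth), len(depth[0]), seed=seed)
    return [[0.0 if n > threshold else d for d, n in zip(drow, nrow)]
            for drow, nrow in zip(depth, noise)]
```

Lowering the threshold enlarges the holes, and changing the seed moves them, so both could be randomized per training image.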

C. A two-stage 6D pose estimation approach
The goal of 6D pose estimation is to estimate the homogeneous matrix Rt ∈ SE(3), which transforms the object from its coordinate system to the camera's coordinate system. This transformation matrix consists of a rotation R ∈ SO(3) and a translation t ∈ R^3 of the target object. In this work, we use PVN3D [13] to infer the homogeneous matrix Rt on the cropped region of interest (ROI) identified by a YOLO-V4-tiny [27] object detector. This two-stage approach is shown in Figure 1.
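For reference, the homogeneous matrix can be written out in the usual block form. The symbols p_obj and p_cam (a point expressed in the object and camera frames) are introduced here for illustration only and do not appear in the original text.

```latex
Rt =
\begin{bmatrix}
R & t \\
\mathbf{0}^{\top} & 1
\end{bmatrix}
\in SE(3),
\qquad
\begin{bmatrix} p_{\mathrm{cam}} \\ 1 \end{bmatrix}
= Rt \begin{bmatrix} p_{\mathrm{obj}} \\ 1 \end{bmatrix}
\;\Longleftrightarrow\;
p_{\mathrm{cam}} = R\,p_{\mathrm{obj}} + t .
```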
The RGB image is processed at the first stage using YOLO-V4-tiny, which provides several candidate bounding boxes and confidence scores. The bounding box with the highest confidence score for a specific object determines the ROI.
Given the ROI, the cropped area is the smallest square that is centered on the ROI, fully contains it, and has a side length that is a multiple of the PVN3D input size (e.g. 80 x 80, 160 x 160, ...). The square cropped images are then resized to 80 x 80 using nearest neighbor interpolation.
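The cropping rule can be sketched as follows. This is an illustrative reading of the description above, not the released code: the names `square_crop` and `resize_nearest` are hypothetical, boxes are assumed to be given as pixel corner coordinates, and handling of crops that extend past the image border is left out because the text does not specify it.

```python
import math

def square_crop(x_min, y_min, x_max, y_max, base=80):
    """Return (left, top, side) of the smallest square centered on the box
    whose side is a positive multiple of `base` and covers the box."""
    width = x_max - x_min
    height = y_max - y_min
    # Round the longer box edge up to the next multiple of the base size.
    side = max(1, math.ceil(max(width, height) / base)) * base
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    left = int(round(cx - side / 2.0))
    top = int(round(cy - side / 2.0))
    return left, top, side

def resize_nearest(image, out_size=80):
    """Nearest-neighbour resize of a square image given as a list of rows."""
    scale = len(image) / out_size
    return [[image[int(r * scale)][int(c * scale)] for c in range(out_size)]
            for r in range(out_size)]
```

For a 120 x 70 pixel box, for example, the side is rounded up to 160, so the subsequent resize to 80 x 80 keeps every second pixel in each direction.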
Following PVN3D [13] and PointNet++ [35], the point cloud is enriched by appending point-wise R, G, B values and surface normals. We estimate the surface normal vectors by calculating the depth image's gradients and the pixel-wise normals geometrically as in [36]. Unlike the original PVN3D [13] implementation, where a nearest neighbor approach is used to compute the normals from unstructured point clouds, calculating normals from a structured depth image is more computationally efficient [37]. This also allows us to use a GPU-based gradient filter in TensorFlow. The resulting point cloud is then randomly subsampled to increase computational efficiency.
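One common gradient-based formulation of normals from a structured depth image is sketched below in plain Python. It is not necessarily the method of [36]: the function name `normals_from_depth` is hypothetical, gradients are taken in pixel units (a real pipeline would scale them with the camera intrinsics), and the normal is taken as the normalized vector (-dz/dx, -dz/dy, 1).

```python
import math

def normals_from_depth(depth):
    """Per-pixel unit normals from a dense depth image given as a list of rows."""
    h, w = len(depth), len(depth[0])
    normals = [[(0.0, 0.0, 1.0)] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Central differences inside the image, one-sided at the border.
            c0, c1 = max(c - 1, 0), min(c + 1, w - 1)
            r0, r1 = max(r - 1, 0), min(r + 1, h - 1)
            dzdx = (depth[r][c1] - depth[r][c0]) / max(c1 - c0, 1)
            dzdy = (depth[r1][c] - depth[r0][c]) / max(r1 - r0, 1)
            norm = math.sqrt(dzdx * dzdx + dzdy * dzdy + 1.0)
            normals[r][c] = (-dzdx / norm, -dzdy / norm, 1.0 / norm)
    return normals
```

Because each pixel only needs its direct neighbours, the same computation maps naturally onto a convolution-style gradient filter on the GPU.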
In the second stage, PVN3D is used for the pose estimation, with PSPNet [38] and PointNet++ [35] as backbones to extract RGB and point cloud features separately. The extracted latent features are then fused by DenseFusionNet [12] at pixel level.
Because of the resizing of the cropped RGB image, we map the resized features back to the nearest point in the point cloud. Shared MLPs are then used to regress the point-wise segmentation and keypoint offsets {of_i} ∈ R^3.
To obtain the final object pose, the point-wise segmentation filters out background points and the keypoint offsets are added to the input point cloud to get keypoint candidates.
In [13], keypoint candidates are clustered using Mean-Shift clustering to obtain the final voted keypoints {\hat{kp}_i} ∈ R^3. The prediction accuracy is improved by cropping the image to the ROI, as only the relevant part of the data is processed.
With the same number of sampling points, the sampled point cloud from the cropped image is denser, providing PointNet++ with richer geometric information for feature processing, which can also be observed in [15]. Given the cropped input, we could build the PVN3D with only about 8 million parameters, which is approximately 15% of the original implementation [39]. In our test on the LineMOD dataset, the reduced PVN3D performs similarly to the original model. We refer to the reduced PVN3D model as PVN3D-tiny.

III. EXPERIMENTS
In this section, we study the effectiveness of the proposed synthetic data preparation pipeline and the two-stage 6D pose estimation algorithm. Specifically, we use 3D models from the cropped RGBD images using the ground truth poses and segmentation masks.

B. Synthetic data inspection
To quantify the reality gap between synthetic and real data, we compare global statistics of sampled synthetic and real RGBD images. The closer match of the augmented depth images to the real ones is an indication that depth augmentation reduces the gap between the synthetic and real data.
The examination of global statistics for RGBD images is efficient, as it does not require real annotations. This examination also enables us to identify the "reality gap" qualitatively and adjust the data generation parameters, such as brightness and depth frequencies, to align the synthetic data with the real data.

C. Implementation
The synthetic data generation pipeline is implemented in Python using Blender's API. The data randomization and preprocessing are implemented using TensorFlow, accelerating the processing with GPUs. As for the two-stage 6D pose estimation approach, we use the original Darknet implementation [27] of YOLO-V4-tiny for the object detection at the first stage and PVN3D-tiny, implemented in TensorFlow, in the second stage.

D. Training and evaluation on LineMod dataset
To address the single-object 6D pose estimation problem on LineMOD, we separately train a binary YOLO-V4-tiny model and a PVN3D-tiny model for each object of the LineMOD dataset. The YOLO-V4-tiny model is trained using the Darknet framework [40], and PVN3D-tiny is trained in TensorFlow [41].
All deep neural networks are trained from scratch using only synthetic data without any pretrained models. After training, we build the two-stage 6D pose estimation pipeline by combining YOLO-V4 and PVN3D. We follow [13] to evaluate the 6D pose estimation performance on the annotated real images provided in LineMOD. The 6D pose estimation performance is measured using the ADD(S) metrics [28]. ADD measures the average distance between the ground truth point cloud and the point cloud transformed with the predicted R, t, which can be defined as follows:
$$\mathrm{ADD} = \frac{1}{m} \sum_{v \in O} \left\| (R v + t) - (R^{*} v + t^{*}) \right\|,$$
where m is the number of the sampled points, R*, t* is the ground truth pose, and v ∈ R^3 denotes a vertex from the object O. Similarly, the ADDS metric measures the average minimum distance between two point clouds as:
$$\mathrm{ADDS} = \frac{1}{m} \sum_{v_1 \in O} \min_{v_2 \in O} \left\| (R v_1 + t) - (R^{*} v_2 + t^{*}) \right\|.$$

E. Robotic Grasping
We train the proposed approach for pose estimation on purely synthetic images to perform robotic grasping experiments. We choose five household objects: a rubber duck, a stapler, a chew toy for dogs, a glue bottle and pliers, as shown in Figure 5. The 3D models of the objects are obtained using a Shining3D Transcan C 3D scanner. We generate synthetic training data and train a multi-class YOLO-V4 detector to localize the target object, and train multiple PVN3D-tiny models to estimate the poses of the different target objects, as described in Section II. As an end-effector, we use an OnRobot RG2 gripper. Attached to the end-effector is an Intel RealSense D415, which is used to obtain the RGBD images. This setup is then used to perform 50 grasp attempts per object in three different lighting conditions, which yields 750 grasps in total. The three lighting conditions are diffused, low and spot lighting, chosen to test the algorithm's robustness to different lighting levels, as shown in Figure 6.

Grasping strategy
The following approach is used to conduct grasping experiments. The robot starts by moving to a predefined home position where the entire bin is visible in the camera's field of view. The object of interest is then identified using YOLO and PVN3D-tiny. To ensure that possible collisions around the object can be observed, the robot moves its end-effector directly above the object. This is important when the object is close to the edge of the camera's view and surrounding obstacles may be out of sight. A safe grasp pose is selected using the pose estimation and grasp selection method.
A smooth and tangential trajectory, using a Bézier curve, is generated to approach the object. The gripper is closed when it reaches the grasp pose and the object is lifted out of the bin.
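A tangential approach along a Bézier curve could look like the sketch below. The curve order and the control point placement are assumptions (the text does not state them), and `approach_path`, `standoff` and `steps` are hypothetical names. Here a cubic curve is used whose last control point is backed off from the grasp position along the approach direction, so that the path arrives parallel to that direction.

```python
def cubic_bezier(p0, p1, p2, p3, s):
    """Evaluate a cubic Bezier curve at parameter s in [0, 1]."""
    u = 1.0 - s
    w = (u ** 3, 3 * u * u * s, 3 * u * s * s, s ** 3)
    return tuple(w[0] * a + w[1] * b + w[2] * c + w[3] * d
                 for a, b, c, d in zip(p0, p1, p2, p3))

def approach_path(start, grasp, approach_dir, standoff=0.1, steps=20):
    """Waypoints from `start` to `grasp` arriving along `approach_dir`.

    `approach_dir` is a unit vector pointing from the gripper toward the object.
    """
    # Pre-grasp control point: backed off from the grasp along the approach axis,
    # which makes the end tangent of the curve parallel to `approach_dir`.
    pre = tuple(g - standoff * a for g, a in zip(grasp, approach_dir))
    # Intermediate control point halfway between the start and the pre-grasp point.
    mid = tuple((a + b) / 2.0 for a, b in zip(start, pre))
    return [cubic_bezier(start, mid, pre, grasp, i / steps)
            for i in range(steps + 1)]
```

Only positions are interpolated in this sketch; orientation interpolation and velocity profiles would have to be added for a real manipulator.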
To conclude one grasping attempt, the object is dropped back into the bin after the robot returns to the initial home position.
If the object can be grasped and lifted without slipping, the grasp is regarded as a success, and as a failure otherwise. We distinguish between two failure cases, a missed grasp and a collision, to identify the cause of failure, which can be the pose estimation or the collision avoidance.
In this paper, we leverage a simple algorithmic approach similar to [43]. Local grasp poses in the object's coordinate frame are generated offline beforehand. With the estimated pose of the object, a local grasp pose can be transformed into the global coordinate frame as a target pose for the robotic manipulator.
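Lifting a local grasp pose into the global frame amounts to chaining homogeneous transforms. The sketch below assumes 4x4 matrices stored as nested lists; the frame names (`T_base_cam` for the camera pose in the robot base frame, `T_cam_obj` for the estimated object pose, `T_obj_grasp` for the precomputed local grasp) are illustrative and not taken from the original code.

```python
def matmul4(A, B):
    """Product of two 4x4 homogeneous matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def lift_grasp(T_base_cam, T_cam_obj, T_obj_grasp):
    """Express a grasp defined in the object frame in the robot base frame."""
    return matmul4(matmul4(T_base_cam, T_cam_obj), T_obj_grasp)
```

Since the local grasps are fixed per object, only the estimated object pose changes between attempts.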
Generally, it is not required to find all grasp poses or the best one, but rather a set of poses that covers most directions from which the robot may approach the object. Therefore, we generate a list of grasp candidates that enable the robotic gripper to securely grasp the object.
We use a sampling-based grasp pose estimation using the available mesh of the objects, where randomly sampled points on the surface of the objects are considered as possible contact points. For each connecting line between a pair of points, we generate 24 grasp poses, rotated around the connecting line, and additionally generate the corresponding antipodal grasps. From these grasp candidates, a grasp pose is considered valid if the following criteria are met:
• the surface curvature on the mesh should not prevent a stable friction grasp. Therefore, no sharp edges or concave surfaces are considered;
• the contact surfaces should be perpendicular to the connecting line. This ensures a stable friction grasp;
• the gripper bounding box should not collide with the object.
The remaining grasps are then downsampled using sparse anchor points in three-dimensional space. Our approach typically yields fewer than 100 grasps for each object, while still providing a high degree of coverage of all possible angles, as can be seen in Figure 4.
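The candidate generation for one pair of contact points can be sketched as follows. Several details are assumptions: a grasp is represented as a (center, closing axis, approach direction) triple, the 24 poses are spaced evenly at 15 degree steps, and the antipodal grasp is interpreted as the same grasp with the closing axis flipped. The validity checks and the downsampling described above are not included.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def grasp_candidates(p1, p2, n_rot=24):
    """Grasps around the line p1-p2 as (center, closing_axis, approach_dir) triples."""
    axis = normalize(tuple(b - a for a, b in zip(p1, p2)))
    center = tuple((a + b) / 2.0 for a, b in zip(p1, p2))
    # Any helper vector that is not parallel to the closing axis.
    helper = (1.0, 0.0, 0.0) if abs(axis[0]) < 0.9 else (0.0, 1.0, 0.0)
    u = normalize(cross(axis, helper))
    v = cross(axis, u)  # u and v span the plane perpendicular to the axis
    grasps = []
    for k in range(n_rot):
        angle = 2.0 * math.pi * k / n_rot
        approach = tuple(math.cos(angle) * ui + math.sin(angle) * vi
                         for ui, vi in zip(u, v))
        grasps.append((center, axis, approach))
        # Assumed antipodal counterpart: same approach, fingers swapped.
        grasps.append((center, tuple(-a for a in axis), approach))
    return grasps
```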

Grasp pose selection
The optimal grasp pose for an object of interest is then selected utilizing the predicted 6D pose and the pointcloud data from an RGBD camera.We first filter out grasp poses that would require the robot to approach the  Finally, we choose the grasp pose that maximizes the distance to the pointcloud for safety.
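The final selection step, choosing the grasp with the largest distance to the point cloud, can be sketched as below. It is a simplification under stated assumptions: each grasp is reduced to a single 3D position, the point cloud is assumed to contain only potential obstacles, and the names `clearance` and `select_grasp` are hypothetical. The preceding filtering step is not reproduced.

```python
import math

def clearance(position, cloud):
    """Distance from a grasp position to the closest point of the cloud."""
    return min(math.dist(position, p) for p in cloud)

def select_grasp(grasp_positions, cloud):
    """Index of the grasp whose position is farthest from the point cloud."""
    return max(range(len(grasp_positions)),
               key=lambda i: clearance(grasp_positions[i], cloud))
```

A brute-force nearest-point search is used for clarity; a spatial index such as a k-d tree would be the usual choice for full-resolution point clouds.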

IV. RESULTS
In this section, we report the ADD(S) accuracy of the proposed two-stage 6D pose estimation algorithm on the LineMOD dataset after training on the synthetic data. We also report the success rate (SR) for grasping different household objects in robotic grasping experiments.

A. 6D pose estimation accuracy
We evaluate the performance of the proposed 6D pose estimation approach on all objects from the LineMOD dataset. We report the results in comparison to state-of-the-art work in Table I, in which the performance of PointFusion is taken from [15] and the performance of SSD-6D [7] is taken from [6]. Compared to other methods trained only on synthetic data, our approach achieves competitive performance with an overall 83.6% pose recognition accuracy without pose refinement. In particular, it performs well on small objects such as "ape" and "duck", on which SSD-6D [7] and AAE [6] are less accurate. On the other hand, our model performs less well on "holepuncher" and "camera". The reason could lie in the low-quality textures of the LineMOD models. Our approach, being trained end-to-end on RGBD data, could be more sensitive to less-detailed textures compared to refinement-based approaches.
Compared to the related work that only uses synthetic data for training, our approach outperforms AAE [6] and SSD-6D [7]. Furthermore, [9] proposes a 6D pose estimation algorithm based on DGCNN [44] and reaches 98% average accuracy.
However, it relies heavily on pose refinement, and it takes approximately one second to detect a single object.
The algorithms trained using real data generally outperform their counterparts trained only on synthetic data, as shown in Table I. Nevertheless, we noticed that our approach can achieve approximately 94% accuracy without refinement when the ground truth bounding box is used to localize the target object, as presented in Table II. This result is comparable to the state-of-the-art methods trained using real data, and suggests that much of the remaining gap stems from the object detection stage. One potential way to improve the object detection performance is to use RGBD images [45], where the object detector can learn more robust features from both the appearance features provided by RGB images and the geometry features provided by depth images.
This performance gap can also be mitigated by fine-tuning the object detector with a handful of annotated real data. We did not observe a degradation of the object detector on the robotic grasping dataset.

B. Run time
The efficiency of the proposed approach was evaluated on a workstation equipped with two Xeon Silver CPUs (2.1 GHz) and an NVIDIA Quadro RTX 8000 graphics card. The results, as reported in Table III, indicate that the inference of PVN3D-tiny consumes the majority of the running time, while the other procedures have similar computational requirements. For an input of 480x640 RGB and depth images, the proposed  I, and suitable for real-time robotic tasks. As demonstrated in the following section, the accuracy of the proposed approach is sufficient for grasping tasks.

C. 6DoF pose estimation in robotic applications
As shown in Table IV, the robotic arm achieved a success rate (SR) of approximately 87%. The three scenarios show similar success rates, demonstrating the algorithm's robustness to different lighting levels. Notably, the proposed algorithm works well in low-lighting conditions. This could be attributed to two factors: first, the training on domain-randomized synthetic data makes the algorithm learn more robust features. Second, the depth information remains consistent under different lighting conditions, as shown in Figure 6, and so the algorithm can extract sufficient features from depth to compensate for the underexposed color camera.
In general, collision avoidance is not the focus of this research, and the results regarding the accuracy of the pose estimation pipeline are more relevant. It is therefore interesting to analyse the grasping success and pose estimation failures excluding all collision events. The collision cases are mainly due to insufficient collision checking. If we neglect the collision cases, the failure cases decrease by 50%, and the overall grasping success rate reaches 93%. This suggests that a more sophisticated collision checking and grasp pose selection strategy is required, which will be the subject of future work.
Rubber Duck: The grasping of the rubber duck is the most robust and successful of all the objects. The irregular shape of the duck, with no rotational symmetries, provides robust features, resulting in an accurate pose estimation. Additionally, the rubber material and soft structure facilitate robotic grasping, as slight inaccuracies still lead to successful grasps.

Glue bottle: The glue bottle achieves a good SR as well, due to its shape and material, which are similarly forgiving to those of the rubber duck. Additionally, the bright color of the glue bottle might aid in low-light environments, making the object easily visible.
Stapler: The grasping of the stapler is highly affected by the lighting conditions, with 86.0% SR under spot light and 76.0% under low light. Possibly, the stapler, due to its dark color, loses more details in low-light conditions, making 6D pose estimation more difficult without properly distinguishable features.
Chew toy: During grasping of the chew toy, collisions have been the primary cause of failure due to its small size, as it easily gets stuck in small cavities between other objects. We observed that this is especially relevant for the chew toy, because its round shape makes the object roll in the bin until it is stopped by other objects. Therefore, the primary reason for failure is not the pose estimation, but the inferior collision avoidance.
Moreover, under spot lighting conditions, the chew toy is underexposed, particularly when stuck in a hole, and this makes 6D pose estimation challenging. Combined with the proximity to other objects, this leads to an increased rate of collisions.
Pliers: In the case of grasping the pliers, we observe a higher number of missed grasps, due to the small size of the grasp handles. The grasp generation places all the grasps on the handles and, while according to the ADD most of the proposed grasps would be successful, in practice, some fail in the robotic experiment.

V. CONCLUSIONS AND FUTURE WORK
In this work, we introduce 6IMPOSE, a novel framework for sim-to-real data generation and 6D pose estimation. The framework consists of a data generation pipeline that leverages the 3D suite Blender to produce synthetic RGBD image datasets, a real-time two-stage 6D pose estimation approach integrating YOLO-V4-tiny [27] and a real-time version of PVN3D [13], and a code base for integration into a robotic grasping experiment. The results of evaluating the 6IMPOSE framework on the LineMOD dataset [28] showed competitive performance with 83.6% pose recognition accuracy, outperforming or matching state-of-the-art methods. Furthermore, the real-world robotic grasping experiment demonstrated the robustness of the 6IMPOSE framework, achieving an 87% success rate for grasping five different household objects from cluttered backgrounds under varying lighting conditions. The contribution of 6IMPOSE lies in its efficient generation of large amounts of photo-realistic RGBD images and successful transfer of the trained inference model to real-world robotic grasping experiments. To the best of our knowledge, this is the first time a sim-to-real 6D pose estimation approach has been systematically and successfully tested in robotic grasping. In future work, there is potential to further improve 6IMPOSE by exploring improvements to the perception pipeline, such as using a more sophisticated pose detection network or multi-frame detection; improving the scalability and quality of the data generation process; and improving the robotic integration, for example, in areas such as collision detection, grasping pose selection, and support for more scenarios like bin picking where there are multiple instances of the same object class.
Fig. 1: Two-stage pose estimation approach: object detection with YOLO-V4-tiny localizes the object of interest in the first stage, followed by 6D object pose estimation with PVN3D-tiny in the second stage.

Fig. 2: Visualization of RGB images (left), depth images (middle) and surface normals (right) for real data (a), rendered synthetic data (b) and augmented synthetic data (c).
We sample 50 RGBD images from the synthetic and real datasets and compare global statistics of these two subsets. For RGB images, we compute the average and the standard deviation of brightness and saturation, as we qualitatively observed that these two factors have a strong influence on the appearance of the generated data. By comparing the brightness statistics of the synthetic and real subsets, we can optimize the average power and randomization of the point lights and the light-emitting background in Blender. Similarly, with the saturation statistics, we can optimize the color management in Blender. To study the statistics of depth images, we use the average power spectral density (PSD) and compare the average distribution over frequencies, as shown in Figure 3. Studying the PSD over frequencies allows us to inspect the structures of the depth images, and we can adjust the frequency of the Perlin noise used for depth augmentation accordingly. It can be seen that the augmented depth images are closer in frequency distribution to the real images than the non-augmented ones.
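The RGB part of this comparison can be sketched with the Python standard library. The sketch makes two assumptions that the text does not confirm: brightness is taken as the HSV value channel and saturation as the HSV saturation channel, and pixels are given as (r, g, b) tuples in [0, 1]. The function name `brightness_saturation_stats` is hypothetical.

```python
import colorsys
import statistics

def brightness_saturation_stats(pixels):
    """Mean and standard deviation of HSV value and saturation over RGB pixels.

    `pixels` is an iterable of (r, g, b) tuples with components in [0, 1].
    """
    hsv = [colorsys.rgb_to_hsv(r, g, b) for r, g, b in pixels]
    saturation = [s for _, s, _ in hsv]
    value = [v for _, _, v in hsv]
    return {
        "brightness_mean": statistics.fmean(value),
        "brightness_std": statistics.pstdev(value),
        "saturation_mean": statistics.fmean(saturation),
        "saturation_std": statistics.pstdev(saturation),
    }
```

Running the function once on the pooled pixels of the synthetic subset and once on those of the real subset gives two sets of numbers that can be compared directly.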

Fig. 3: Qualitative average power spectral density (PSD) of depth images with respect to frequency for the object "cat" from the LineMOD dataset, over 50 randomly sampled images.
Compared to ADD, ADDS measures the distance to the nearest point instead of the corresponding mesh point. For symmetrical objects, ADDS is better suited, because ADD penalizes a predicted pose that differs from the ground truth even if it corresponds to an invariant rotation. The success rates on test images are used to quantify the pose estimation performance. A threshold of 10% of the object's diameter is typically used to classify a prediction as successful or not.
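The difference between the two metrics can be made concrete with a small pure-Python sketch. It assumes rotations as 3x3 nested lists and vertices as 3-tuples; the helper names are hypothetical and the brute-force nearest-neighbour search is for illustration only.

```python
import math

def transform(R, t, v):
    """Apply rotation R (3x3 nested lists) and translation t to vertex v."""
    return tuple(sum(R[i][j] * v[j] for j in range(3)) + t[i] for i in range(3))

def add_metric(vertices, R, t, R_gt, t_gt):
    """ADD: mean distance between corresponding transformed vertices."""
    return sum(math.dist(transform(R, t, v), transform(R_gt, t_gt, v))
               for v in vertices) / len(vertices)

def adds_metric(vertices, R, t, R_gt, t_gt):
    """ADDS: mean distance to the nearest ground-truth-transformed vertex."""
    gt = [transform(R_gt, t_gt, v) for v in vertices]
    return sum(min(math.dist(transform(R, t, v), g) for g in gt)
               for v in vertices) / len(vertices)

def is_correct(score, diameter, ratio=0.1):
    """A prediction counts as successful if the score is below ratio * diameter."""
    return score < ratio * diameter
```

For an object that is symmetric under a 180 degree rotation, a prediction rotated by exactly that amount has a large ADD but an ADDS of zero, which is the behaviour described above.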

Fig. 5: Photographed (left), photo-realistically rendered RGB (middle) and rendered depth images (right) of the five selected household objects.
(a) Experiments under diffused lighting conditions. (b) Experiments under low lighting conditions. (c) Experiments under spot lighting conditions.

Fig. 6: Real-world grasping experiments under diffused (a), low (b) and spot (c) lighting conditions. The experimental setup, RGB view, depth view and the predicted pose of the pliers on RGB are shown from left to right.
The Mean-Shift clustering used in [13] works iteratively, which prevents an efficient GPU implementation with deterministic execution time. To make the keypoint voting temporally deterministic, we first select, for each keypoint, a fixed number of point cloud points with the smallest predicted offset. Compared to random sampling, this selection method already removes those outliers that show a high offset. To eliminate any further outliers, we filter out any keypoint candidate whose distance to the mean prediction µ exceeds the standard deviation σ, i.e. the offset of_i is masked out if |of_i − µ| > σ. After removing outliers, we apply global averaging along the {x, y, z} axes to obtain the voted keypoints {\hat{kp}_i} ∈ R^3. We use singular value decomposition (SVD) to find the SE(3) transformation matrix Rt between the predicted keypoints {\hat{kp}_i} and the reference model keypoints {kp_i}.
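The deterministic voting for a single keypoint can be sketched as follows. Some details are interpretations rather than facts from the text: the outlier test is applied to the keypoint candidates (point plus offset) using Euclidean distance to their mean, σ is taken as the root-mean-square of those distances, and `vote_keypoint` and `n_select` are hypothetical names. The subsequent SVD-based pose fit is not shown.

```python
import math

def vote_keypoint(points, offsets, n_select=4):
    """Vote one keypoint from per-point offsets without iterative clustering."""
    # 1) Keep a fixed number of points with the smallest predicted offset norm.
    order = sorted(range(len(points)),
                   key=lambda i: math.dist(offsets[i], (0.0, 0.0, 0.0)))
    chosen = order[:n_select]
    # 2) Keypoint candidates are the selected points shifted by their offsets.
    cands = [tuple(p + o for p, o in zip(points[i], offsets[i])) for i in chosen]
    # 3) Mask out candidates farther than one standard deviation from the mean.
    mean = tuple(sum(c[a] for c in cands) / len(cands) for a in range(3))
    dists = [math.dist(c, mean) for c in cands]
    sigma = math.sqrt(sum(d * d for d in dists) / len(dists))
    kept = [c for c, d in zip(cands, dists) if d <= sigma] or cands
    # 4) Average the remaining candidates along each axis.
    return tuple(sum(c[a] for c in kept) / len(kept) for a in range(3))
```

Every step has a fixed cost for a fixed `n_select`, which is what makes the execution time deterministic in contrast to an iterative clustering scheme.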

TABLE I: Performance of 6D pose estimation on LineMOD compared to the state of the art using RGBD. The bold objects are symmetric.
* With refinement

TABLE II: 6D pose estimation ADD(S) scores, using predicted or ground truth bounding boxes.

TABLE III: Running time analysis of the proposed two-stage pose estimation approach.

TABLE IV: Single-object grasping experiments under varying lighting conditions.