A Manufacturing-Oriented Intelligent Vision System Based on Deep Neural Network for Object Recognition and 6D Pose Estimation

Nowadays, intelligent robots are widely applied in the manufacturing industry, in various workplaces and assembly lines. In most manufacturing tasks, determining the category and pose of parts is important yet challenging due to complex environments. This paper presents a new two-stage intelligent vision system based on a deep neural network with RGB-D image inputs for object recognition and 6D pose estimation. A densely connected network fusing multi-scale features is first built to segment the objects from the background. The 2D pixels and 3D points in cropped object regions are then fed into a pose estimation network to make object pose predictions based on the fusion of color and geometry features. By introducing the channel and position attention modules, the pose estimation network presents an effective feature extraction method, stressing important features whilst suppressing unnecessary ones. Comparative experiments with several state-of-the-art networks conducted on two well-known benchmark datasets, YCB-Video and LineMOD, verified the effectiveness and superior performance of the proposed method. Moreover, we built a vision-guided robotic grasping system based on the proposed method using a Kinova Jaco2 manipulator with an RGB-D camera installed. Grasping experiments proved that the robot system can effectively implement common operations such as picking up and moving objects, thereby demonstrating its potential for all kinds of real-time manufacturing applications.


INTRODUCTION
The assembly line is one of the greatest inventions in the manufacturing industry. With the rapid development of artificial intelligence and robotics, more intelligent robots have been deployed in traditional assembly lines, replacing human workers. These robots are normally equipped with intelligent vision systems which not only detect parts in the working space but also estimate their poses before taking further actions such as grasping, rotating, moving, and fitting. Generally, object recognition and 6D pose estimation from images form the basis of almost all kinds of robotic applications, such as robot manipulation (Tremblay et al., 2018), robot-human interaction (Svenstrup et al., 2009), and virtual reality (Yang et al., 2018).
Many approaches have been reported in the past decade. However, the problem remains challenging, especially in cluttered scenes, due to chaotic backgrounds, heavy occlusions between objects, and changing lighting conditions. Most classical methods work with RGB inputs (color images) (Kehl et al., 2017; Rad and Lepetit, 2017), and some of them use RGB-D inputs (color and depth images) (Wagner et al., 2008). In general, the basic idea of these methods is to estimate the object pose by establishing correspondences of 2D image features between different viewpoints or by constructing a mapping from 3D models to 2D images. Difficulties often occur when dealing with low-textured objects and unstable lighting conditions. With the advent of affordable depth sensors, RGB-D data-based methods (Xiang et al., 2017; Qi et al., 2018; Xu et al., 2018) have become more popular and have recently undergone significant progress. Compared with the RGB image, the depth image contains abundant geometric information, such as shape, structure, surface, and curvature. Additionally, the depth channel is more stable than the RGB channels, being usually free from perturbations caused by texture and changing light, which makes these approaches more reliable and robust than RGB-only methods. However, due to the involvement of a large amount of 3D data, it is still a big challenge to achieve accurate pose estimation in real time.
With the appearance of powerful computing hardware, deep learning has garnered wide attention in recent years. Tekin et al. (2018) proposed a single-shot deep CNN that takes the 2D image as input, directly detects the 2D projections of the 3D bounding box vertices, and estimates 6D poses by a PnP algorithm (Lepetit et al., 2009). Based on the SSD architecture (Liu et al., 2016), SSD-6D (Kehl et al., 2017) can realize object detection and 6D pose estimation simultaneously, though it does not work well with small objects and occlusions. Most recently, a series of data-driven methods using RGB-D inputs, such as PoseCNN (Xiang et al., 2017), MCN (Li et al., 2018a), and DenseFusion, were presented and have made significant progress in the field of visual recognition. In addition, some methods, such as PointFusion (Xu et al., 2018) and Frustum PointNet (Qi et al., 2018), focus on how to extract better features from color and depth images. Compared with methods based on handcrafted features, the deep neural network demonstrates a powerful ability for automatic feature extraction, a flexible structure, and a strong capacity to resist disturbance.
In this paper, we propose a new two-stage deep network to segment objects from the cluttered scene and to estimate their 6D poses. The overall framework is shown in Figure 1. First, by applying a dense-connected way to aggregate features of different scales, an improved segmentation network inspired by U-Net (Ronneberger et al., 2015) is built. After determining the segmentation mask, the objects are cropped from the scene in both color and depth images. The cropped object images are then fed into a 6D pose prediction network which makes use of two backbone networks to extract color and geometry feature embeddings. Both are then fused together and pass through the channel attention module, position attention module, and global feature extraction module to acquire more effective feature representations. Finally, an iterative network is adopted to refine outputs of the pose predictor.
In summary, the main contributions of our approach are as follows:
• A new segmentation network is proposed, using a densely connected method to aggregate features of different scales and to provide abundant semantic information for pixel-by-pixel classification.
• The channel attention module and position attention module are introduced into the pose estimation network and effectively improve system performance.
• A vision-guided robotic grasping system is built to validate the feasibility of applying the proposed algorithm to manufacturing tasks such as grasping, packaging, and assembling.
The remainder of this paper is organized as follows: section 2 reviews related work. Section 3 describes the details of the proposed method. The analysis of experimental results and performance evaluation are presented in section 4. Section 5 concludes the paper.

RELATED WORK
Our research mainly involves two topics: object recognition and pose estimation. Semantic segmentation is the most popular way to realize object recognition, predicting which object each pixel in the image belongs to; for this task, the Convolutional Neural Network (CNN) has proven to be the most successful method so far. For object pose estimation, prior works can be classified into three categories: feature-based methods, learning-based methods, and CNN-based methods. Theoretically, CNN is a specific framework within the machine learning family, but the boom in deep learning research has led to a large number of CNN-based methods being reported in recent years. We therefore place the CNN-based methods in a separate category.

Semantic Segmentation
Fully Convolutional Network (FCN) (Long et al., 2015) is the first semantic segmentation network. Ronneberger et al. (2015) added skip connections to the network and created an excellent network known as U-Net; many subsequent networks (Drozdzal et al., 2016) adopted this U-shaped structure to develop their own networks. In order to increase the receptive field without extra computing cost, atrous convolutions were proposed in DeepLab (Chen et al., 2017a). In PSPNet (Zhao et al., 2017), the Pyramid Pooling Module (PPM) was proposed to aggregate contextual information from different scales and improve the network's ability to acquire global information.

Feature-Based Method
The common idea for the feature-based methods is recovering 6D poses based on 2D-3D correspondences (Tulsiani and Malik, 2015) by matching the local features extracted in the 2D image with those in the 3D model. However, this kind of approach usually requires sufficient texture in objects. Some improved versions of this algorithm (Wohlhart and Lepetit, 2015) are proposed to deal with textureless objects.

Learning-Based Method
Machine learning has been widely applied to address classical computer vision problems since 2000. Support Vector Machine (SVM) is proposed in Gu and Ren (2010) for object pose estimation. Hinterstoisser presented a model-based method (Hinterstoisser et al., 2012) to classify objects and 3D poses as well. In Lee et al. (2016), an adaptive Bayesian framework is designed to implement object detection and pose estimation in industrial environments. A decision forest is trained in Brachmann et al. (2014) to classify each pixel in the RGB-D image and determine the postures of objects.

CNN-Based Method
Most recently, with the rapid development of deep learning, CNN has become the mainstream approach for most pose estimation tasks. Zeng proposed a multi-stage feature-learning network in Zeng et al. (2016) for object detection and pose estimation with RGB-D inputs. The methods in Rad and Lepetit (2017) and Tekin et al. (2018) predicted the 2D projections of the 3D bounding box of a target by regression before computing poses with a PnP algorithm. Kehl localized the 2D bounding box of an object and searched for the best pose among a pool of candidate 6D poses associated with the bounding box (Kehl et al., 2017). Nigam improved the accuracy of pose estimation in Nigam et al. (2018) through a novel network architecture which combined global features for target segmentation with local features for coordinate regression.
Li adapted a CNN-based framework by adding a channel for 3D feature extraction (Li et al., 2018a). PoseCNN (Xiang et al., 2017) is a multi-stage, multi-branch deep network that treats 6D object pose estimation as a regression problem; with RGB images as inputs, the network can directly estimate the object poses in the whole scene. DenseFusion is a general framework for estimating the 6D poses of known objects from RGB-D image inputs; two networks are utilized to extract color and geometric features, followed by a novel dense network fusing them. In addition, an end-to-end iterative refinement network is applied to further improve the pose estimates.
In sum, DenseFusion is one of the best networks so far, perfectly balancing accuracy and efficiency, thus making it appropriate for many real-time manufacturing applications.

THEORY AND METHOD
As illustrated in Figure 1, the proposed system consists of two stages: object segmentation and object pose estimation. The final objective is to recover object pose parameters from 2D-3D correspondences between the color and depth image. Therefore, an appropriate camera model should be determined before calculation. Figure 2 shows the concept of the pinhole camera model with four coordinate systems.

Pinhole Camera Model
• World coordinate system (X_w, Y_w, Z_w) is the absolute coordinate system of the 3D world, also named the object coordinate system in our application.
• Camera coordinate system (X_c, Y_c, Z_c) is a 3D coordinate system with the camera optical center as its origin, the camera's optical axis as the Z_c-axis, and the X_c-axis and Y_c-axis parallel to the x-axis and y-axis of the image coordinate system, respectively.
• Image coordinate system (x, y) is a 2D coordinate system located in the image plane, with the intersection of the Z_c-axis and the image sensor as its origin and the x-axis and y-axis parallel to the horizontal and vertical edges of the image plane, respectively. The 2D coordinates denote pixel position in physical units, such as millimeters.
• Pixel coordinate system (u, v) is a 2D coordinate system with the bottom-left corner of the image plane as its origin and the u-axis and v-axis parallel to the x-axis and y-axis of the image coordinate system, respectively. The digital image is represented by an M × N array of pixels, and the 2D coordinates denote the pixel position in the image array.
By geometric analysis, the transformation between the pixel coordinate system and the world coordinate system can be expressed in homogeneous form as

Z_c [u, v, 1]^T = M_1 M_2 [X_w, Y_w, Z_w, 1]^T,  (1)

with M_1 = [[f/dx, 0, u_0], [0, f/dy, v_0], [0, 0, 1]] and M_2 = [R | t], where (u_0, v_0) are the pixel coordinates of the origin of the image coordinate system in the pixel coordinate system; dx and dy are the physical dimensions of each pixel along the x-axis and y-axis of the image plane, respectively; f is the focal length; R is a 3 × 3 orthogonal rotation matrix; and t is a 3D translation vector indicating the transformation from the world coordinate system to the camera coordinate system. It can be seen that the parameters of matrix M_1 are determined by the internal structure of the camera sensor; thus, M_1 is called the intrinsic parameter matrix of the camera. The rotation matrix R and translation vector t, however, are determined by the position and orientation of the camera coordinate system relative to the world coordinate system. M_2 is called the extrinsic parameter matrix of the camera and is determined by three rotation and three translation parameters. The parameters in M_1 can be calculated through camera calibration; solving the six pose parameters in M_2 is what is called 6D pose estimation.

FIGURE 2 | Four coordinate systems in the pinhole camera model: world coordinate system, camera coordinate system, image coordinate system, and pixel coordinate system. f is the focal length, i.e., the distance between the origin of the camera coordinate system and the origin of the image coordinate system.
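For illustration, the projection through the intrinsic and extrinsic matrices can be checked numerically with a minimal pinhole-model sketch (all numeric values below are illustrative, not taken from our camera):

```python
import numpy as np

def project_point(X_w, K, R, t):
    """Project a 3D world point to pixel coordinates with a pinhole model.

    K: 3x3 intrinsic matrix (M_1), R: 3x3 rotation, t: 3-vector translation
    (world -> camera, i.e., M_2 = [R | t]). Returns (u, v).
    """
    X_c = R @ X_w + t            # world -> camera coordinates
    x = K @ X_c                  # perspective projection (homogeneous)
    return x[:2] / x[2]          # divide by depth Z_c

# Intrinsics: f/dx = f/dy = 500 pixels, principal point (u_0, v_0) = (320, 240)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                    # identity rotation for illustration
t = np.array([0.0, 0.0, 2.0])    # object 2 m in front of the camera

# A point 0.1 m to the right of the object origin, imaged at depth 2 m,
# lands 500 * 0.1 / 2 = 25 px to the right of the principal point.
uv = project_point(np.array([0.1, 0.0, 0.0]), K, R, t)
```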

Semantic Segmentation
The architecture of the semantic segmentation network is illustrated in Figure 3. The entire network is composed of two parts: the encoder network (Figure 3A, left) and the decoder network (Figure 3A, right). The encoder network, designed to extract features of different scales, consists of five MaxPooling layers and 16 convolutional layers. The first 13 convolutional layers adopt the same parameter settings as the corresponding convolutional layers in VGG16 (Simonyan and Zisserman, 2015).
In the decoder network, the Multi-scale Feature Fusion Module (MFFM) is implemented to fuse multi-scale features and output pixel-by-pixel classifications through the final convolutional and softmax layers. The decoder network consists of three MFFMs, two upsampling layers, and several convolution layers. In convolutional neural networks, feature maps of different sizes not only have diverse receptive fields but also generally contain complementary information; fusing features of different scales is therefore an important technique for improving network performance. Researchers have proposed different solutions for multi-scale feature extraction and fusion. Deeplabv3 (Chen et al., 2017b) utilized Atrous Spatial Pyramid Pooling (ASPP) to fuse global content information with multi-scale features. In Zhao et al. (2017), a global AvgPooling operation is applied to generate outputs of different scales from the feature maps and to extract high-level semantic multi-scale features. However, these methods convolve and pool the same feature layer in parallel at different scales to acquire multi-scale features. In essence, this is not a real fusion of different layer features, since these features are all derived from the same layer. Theoretically, a lower-layer feature contains more geometric detail and less semantic information; conversely, higher-layer feature maps discard some geometric detail and preserve more semantic information. Thus, a new MFFM is designed to effectively integrate lower-layer and higher-layer features in a densely connected way, thereby enhancing the network's capability of understanding the images.
As shown in Figure 3B, each MFFM layer in the decoder takes feature inputs from two sources: (1) the layers in the encoder whose resolution is the same as or lower than that of the current MFFM layer; (2) the previous layer in the decoder. First, all feature inputs are upsampled, if necessary, to the resolution of the current MFFM layer. Each then passes through an individual convolution layer before finally being aggregated together for output. For inputs from encoder layers, the output channel number is set to 64 to reduce computational complexity. For the input from the previous layer, the output channel number remains unchanged in order to preserve as much information from the previous layer as possible. Thus, different MFFM layers may have different numbers of input layers, as depicted in Figure 3A by arrows of different colors.
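A minimal PyTorch sketch of such an MFFM layer follows. The 1 × 1 and 3 × 3 kernel sizes and the bilinear upsampling are our assumptions; the 64-channel reduction for encoder inputs and the unchanged channel count for the decoder input follow the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFFM(nn.Module):
    """Sketch of the Multi-scale Feature Fusion Module (kernel sizes and
    upsampling mode are assumptions, not taken from the paper)."""
    def __init__(self, encoder_channels, decoder_channels):
        super().__init__()
        # one 1x1 conv per encoder input, reducing each to 64 channels
        self.enc_convs = nn.ModuleList(
            [nn.Conv2d(c, 64, kernel_size=1) for c in encoder_channels])
        # conv on the previous decoder layer keeps its channel count
        self.dec_conv = nn.Conv2d(decoder_channels, decoder_channels,
                                  kernel_size=3, padding=1)

    def forward(self, enc_feats, dec_feat):
        h, w = dec_feat.shape[-2:]          # resolution of this MFFM layer
        outs = []
        for conv, f in zip(self.enc_convs, enc_feats):
            if f.shape[-2:] != (h, w):      # upsample lower-resolution inputs
                f = F.interpolate(f, size=(h, w), mode='bilinear',
                                  align_corners=False)
            outs.append(conv(f))
        outs.append(self.dec_conv(dec_feat))
        return torch.cat(outs, dim=1)       # aggregate along channels
```

With two encoder inputs, the output has 64 + 64 channels from the encoder branches plus the unchanged channel count from the decoder branch.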
In the training stage, the loss function of the segmentation network is defined as the cross-entropy

L_seg = −(1/(M·N)) Σ_{i=1}^{M} Σ_{j=1}^{N} log y′_{i,j}(y_{i,j}),  (2)

where y_{i,j} ∈ {1, 2, ..., C} is the true label of pixel (i, j) and y′_{i,j}(c) is the predicted probability that the pixel belongs to class c.
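In practice, this per-pixel cross-entropy corresponds to PyTorch's `nn.CrossEntropyLoss` applied to the segmentation logits; the batch size, class count, and image size below are illustrative:

```python
import torch
import torch.nn as nn

# Per-pixel cross-entropy over C classes, averaged over all pixels.
# 22 classes here is illustrative (e.g., 21 objects plus background).
criterion = nn.CrossEntropyLoss()
logits = torch.randn(2, 22, 120, 160)          # (batch, classes, H, W)
labels = torch.randint(0, 22, (2, 120, 160))   # true label y_ij per pixel
loss = criterion(logits, labels)               # scalar loss for backprop
```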
In sum, the proposed semantic segmentation network provides more effective pixel-by-pixel classification by fusing multi-scale features in a densely connected way. Despite a small increase in computational cost, the more accurate pixel classifications help determine the correct correspondences between 2D pixels and 3D points, which is crucial for the second stage of 6D pose estimation. Furthermore, the segmentation network can also extract correct object contours, showing its ability to deal with occlusions in complex scenes.

6D Object Pose Estimation
As mentioned in section 3.1, object pose estimation determines the transform between the object coordinate system and the camera coordinate system. A 3D translation vector t and a 3D rotation matrix R are included in the transformation matrix. So, there are six independent pose parameters to be calculated. Figure 4 depicts the concept.
The architecture of our 6D pose estimation network is illustrated in Figure 5. The entire network is composed of three stages: the feature extraction stage (Figure 5A), the feature fusion stage (Figure 5B), and the pose regression stage (Figure 5C).

FIGURE 4 | Object 6D pose estimation. The pose transformation from the object coordinate system to the camera coordinate system is determined by the 3D rotation matrix R and the 3D translation vector t.

The color and geometric information of objects is acquired through the RGB image and depth image. Though they have similar storage formats, their physical meaning and distribution space are quite different. Two CNNs are therefore applied to extract color and geometric features, respectively, as shown in Figure 5A.
Common neural networks generally treat all the features equally. However, some features are better at describing characteristics of objects and should receive more attention. To stress important features whilst suppressing unnecessary ones, we implemented three modules including the Position Attention Module (PAM), Channel Attention Module (CAM), and the Global Feature Extraction Module (GFEM). In the feature fusion stage, color features and geometric features are concatenated and fed into these modules, enabling the network to adaptively capture the local features and the global feature dependencies, providing better features for the pose predictor.
Position Attention Module: For a specified input feature, it is updated by weighting all the features according to their similarity to this feature. Thus, more similar features will have a bigger impact on the input feature. Figure 6 displays the process.
The original input feature matrix V_i (V_i ∈ R^{C×N}, where C is the feature dimension and N is the number of features) passes through two convolutional layers separately to obtain two new feature matrices V_1, V_2 ∈ R^{C′×N}; the dimension changes from C to C′ after passing through (in this paper, we set C′ = C/8). Multiplying the transpose of V_1 by V_2 and applying a softmax operation yields the feature similarity matrix M ∈ R^{N×N}. In the meantime, V_i passes through a third convolution layer to obtain V_3 ∈ R^{C×N}, which is then multiplied by M to aggregate global features. Finally, the aggregated features are combined with the original input V_i to form the output.

Channel Attention Module: For any two channel maps, the self-attention mechanism is used to capture the channel dependencies. The weighted sum of all channel maps is calculated to update each channel map. The process of the channel attention module is shown in Figure 7.
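A minimal sketch of the position attention computation is given below (DANet-style). The learnable residual scale `alpha` is our assumption, and the channel attention module follows analogously, with the similarity computed across channels rather than positions:

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Sketch of the Position Attention Module; the residual combination
    with a learnable scale `alpha` is an assumption on our part."""
    def __init__(self, C):
        super().__init__()
        Cp = C // 8                              # reduced dimension C' = C/8
        self.q = nn.Conv1d(C, Cp, 1)             # produces V1
        self.k = nn.Conv1d(C, Cp, 1)             # produces V2
        self.v = nn.Conv1d(C, C, 1)              # produces V3
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):                        # x: (B, C, N)
        V1, V2, V3 = self.q(x), self.k(x), self.v(x)
        # similarity matrix M in R^{N x N}: softmax(V1^T V2)
        M = torch.softmax(V1.transpose(1, 2) @ V2, dim=-1)
        out = V3 @ M.transpose(1, 2)             # weight V3 by similarities
        return self.alpha * out + x              # residual update of input
```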
Global Feature Extraction Module: The global feature of the objects is quite important for the pose estimation task. Here we use a convolutional layer to adjust the features and apply an AvgPooling layer to acquire the global features.
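A possible GFEM sketch, assuming the per-point features are stored as (batch, channels, points); broadcasting the resulting global vector back to every point for later concatenation is our assumption:

```python
import torch
import torch.nn as nn

class GlobalFeatureExtraction(nn.Module):
    """Sketch of the GFEM: a conv to adjust the features, then average
    pooling over all points to obtain one global feature vector (the
    kernel size and the final broadcast are assumptions)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, 1)
        self.pool = nn.AdaptiveAvgPool1d(1)      # average over the N points

    def forward(self, x):                        # x: (B, C, N)
        g = self.pool(self.conv(x))              # (B, out_channels, 1)
        return g.expand(-1, -1, x.shape[-1])     # broadcast to each point
```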
The output features from the PAM, CAM, and GFEM are concatenated and fed into a pose predictor. The pose predictor consists of several 1D convolution layers for regressing the 6D pose parameters and confidence scores. In addition, the same iterative refinement network as in DenseFusion is also utilized to further improve the accuracy of pose estimation.
Due to the lack of a unique shape or texture associated with each pose for objects with symmetric structures, we use two types of loss functions in the training stage. For symmetric objects, the loss for one point in the estimated model is defined as the distance between this point and its closest point after pose transformation. The total loss of an object with n points is computed as

L_sym = (1/n) Σ_{i=1}^{n} min_j || T_est p_i − T_gt p_j ||,  (3)

where T = [R|t], with R and t the rotation matrix and translation vector, respectively; p_i represents the homogeneous form of the i-th point; T_est is the current predicted result and T_gt is the ground truth. For an object with an asymmetric structure, each pose is associated with a unique texture or shape, so the loss function of the asymmetric object is defined as the average distance between each point after the real pose transformation and after the predicted pose transformation, as described in Eq. (4):

L_asym = (1/n) Σ_{i=1}^{n} || T_est p_i − T_gt p_i ||.  (4)

In the training stage, we switch between the two loss functions according to the labels given with the dataset.
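Both losses can be sketched directly from the transformed model points; the brute-force pairwise search below stands in for the closest-point computation:

```python
import numpy as np

def transform(T, pts):
    """Apply a 4x4 homogeneous transform T = [R|t] to an (n, 3) point array."""
    return pts @ T[:3, :3].T + T[:3, 3]

def add_loss(T_est, T_gt, pts):
    """ADD-style loss: mean distance between corresponding points
    (asymmetric objects, Eq. 4)."""
    return np.linalg.norm(transform(T_est, pts) - transform(T_gt, pts),
                          axis=1).mean()

def adds_loss(T_est, T_gt, pts):
    """ADD-S-style loss: mean closest-point distance (symmetric objects)."""
    p_est = transform(T_est, pts)
    p_gt = transform(T_gt, pts)
    # pairwise distances (n, n): row i holds distances of p_est[i] to all p_gt
    d = np.linalg.norm(p_est[:, None, :] - p_gt[None, :, :], axis=2)
    return d.min(axis=1).mean()
```

Because ADD-S takes the closest rather than the corresponding point, it is never larger than ADD for the same pose pair.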

EXPERIMENTS AND RESULTS

The Platform of Hardware and Software
The proposed network is built with PyTorch (Paszke et al., 2019). All experiments are conducted on a PC equipped with an Intel(R) Core(TM) i7-6850K CPU at 3.6 GHz and an NVIDIA GeForce GTX1080Ti graphics card.

Training Setup
For the segmentation network, the initial learning rate is set to 0.0001 and the batch size to 4; the Stochastic Gradient Descent (SGD) optimizer is used in training, and the learning rate is multiplied by 0.8 every 5,000 training steps. For the pose estimation network, the initial learning rate is set to 0.0001 and the batch size to 8; the Adam optimizer (Kingma and Ba, 2015) is used in training. When the test error drops below a threshold of 0.14, the learning rate is decayed by a factor of 0.3. When the test error drops below 0.12, the iterative refinement network is added to the training process.
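This schedule can be sketched in PyTorch as follows; the linear model is a stand-in, and triggering the 0.3 decay from a monitored test error is shown manually:

```python
import torch

# stand-in model so the optimizers have parameters to manage
model = torch.nn.Linear(10, 6)

# Segmentation network: SGD, lr = 1e-4, multiplied by 0.8 every 5,000 steps
seg_opt = torch.optim.SGD(model.parameters(), lr=1e-4)
seg_sched = torch.optim.lr_scheduler.StepLR(seg_opt, step_size=5000, gamma=0.8)

# Pose network: Adam, lr = 1e-4; the rate is decayed by a factor of 0.3
# once the monitored test error falls below 0.14 (done manually here)
pose_opt = torch.optim.Adam(model.parameters(), lr=1e-4)
test_error = 0.13                      # illustrative monitored value
if test_error < 0.14:
    for group in pose_opt.param_groups:
        group['lr'] *= 0.3
```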
Datasets
The YCB-Video dataset (Xiang et al., 2017) was built for object pose estimation. It is composed of 92 RGB-D video sequences taken in different indoor scenes and covers 21 objects with different shapes and textures. We follow the same training and testing settings as PoseCNN (Xiang et al., 2017), where 80 video sequences are used for training and 2,949 key frames extracted from the remaining 12 videos are used to evaluate the trained model. In addition, 80,000 synthetically rendered images provided with the dataset are utilized to augment the training set.
The LineMOD dataset (Hinterstoisser et al., 2011) is another benchmark dataset for object pose estimation, containing 15 video sequences of low-texture objects with approximately 1,100 images each. For a fair comparison, we choose the same 13 video sequences and the same training and testing split as the state-of-the-art methods (Kehl et al., 2017; Rad and Lepetit, 2017; Xiang et al., 2017; Li et al., 2018b; Sundermeyer et al., 2018; Xu et al., 2018). No synthetic data were generated from the LineMOD dataset for training.

Metrics
Metrics for semantic segmentation. Two common metrics, Mean Pixel Accuracy (mPA) and Mean Intersection over Union (mIoU) (Badrinarayanan et al., 2017; Garcia-Garcia et al., 2017), are used to evaluate the segmentation results. The mPA is defined as the mean of the pixel accuracies over all classes, where the pixel accuracy of a class is the ratio of correctly classified pixels on a per-class basis. The mIoU is defined as the mean of the IoUs over all classes in the dataset, where the IoU of each class is the ratio of the intersection to the union of the labeled pixels and the predicted pixels.
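Both metrics can be computed from a per-class confusion matrix, as in this sketch:

```python
import numpy as np

def mpa_miou(conf):
    """Compute mean pixel accuracy and mean IoU from a confusion matrix
    whose entry (i, j) counts pixels of true class i predicted as class j."""
    tp = np.diag(conf)                                   # correct per class
    per_class_acc = tp / conf.sum(axis=1)                # correct / labeled
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)
    return per_class_acc.mean(), iou.mean()

# toy 2-class confusion matrix (10 labeled pixels per class)
conf = np.array([[8, 2],
                 [1, 9]])
mpa, miou = mpa_miou(conf)
```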
Metrics for pose estimation. We adopted three metrics to evaluate system performance on the YCB-Video dataset: ADD-S (Xiang et al., 2017) (the average closest point distance), ADD (the average distance) (Hinterstoisser et al., 2012), and AUC (Xiang et al., 2017) (the area under the accuracy-threshold curve of ADD/ADD-S). Given the ground truth pose P = [R|t] and the estimated pose P̂ = [R̂|t̂], we calculate the distance between each point of the 3D model transformed by P̂ and the nearest point to it in the 3D model transformed by P; the average of these distances over all points is called ADD-S. ADD has a similar definition, the difference being that the distance is calculated between corresponding point pairs in the 3D model transformed by P̂ and by P. In our experiments, accuracy, defined as the percentage of testing samples with ADD/ADD-S values less than a certain threshold, is used to evaluate the performance of all methods; here, the threshold is empirically set to 2 cm. The accuracy values for all possible thresholds are calculated to generate an accuracy-threshold curve of ADD/ADD-S. AUC is then defined as the area under this curve within the threshold region [0, T_m]. For a consistent measurement, the maximum threshold T_m is set to 0.1 m. For both AUC and accuracy, a larger value indicates more accurate pose estimation.
For LineMOD dataset, ADD (Hinterstoisser et al., 2012) is used as the metric following prior works (Kehl et al., 2017;Rad and Lepetit, 2017;Xiang et al., 2017;Li et al., 2018b;Sundermeyer et al., 2018;Xu et al., 2018). Instead of a fixed value, we set the distance threshold as 0.1 multiplied by the diameter of bounding sphere of the 3D model.

Experiments on YCB-Video Dataset
Semantic segmentation: Table 1 shows the comparative semantic segmentation results on the YCB-Video dataset, where bold numbers represent the best results for each metric. Our network is much better than U-Net in both mPA and mIoU. Moreover, our mPA score is slightly lower than Deeplabv3's, whilst our mIoU score outperforms Deeplabv3 by 2.78.

6D pose estimation: Tables 2, 3 show the comparative pose estimation results on the 21 objects of the YCB-Video dataset for our method and some state-of-the-art methods, where bold numbers represent the best results. In Table 2, the AUC and accuracy of ADD-S (<2 cm) are calculated for all objects. Our mean accuracy score is the second-best and our mean AUC score is the best; the overall performance is very close to DenseFusion, the benchmark method thus far. In Table 3, the AUC and accuracy of ADD (<2 cm) are calculated; both the mean accuracy and mean AUC scores of the proposed network outperform all the other methods. Essentially, ADD is a stricter metric than ADD-S because it computes the distances between matched point pairs, which usually requires matches in both shape and texture. A better ADD accuracy score is therefore more convincing evidence when evaluating the performance of our method.

Figure 8A displays the accuracy-threshold curve of the rotation error and Figure 8B that of the translation error. The rotation error is the angle formed by the predicted and true rotation axes; the translation error is the 2-norm of the difference between the predicted and true translation vectors. Accuracy is here defined as the percentage of testing samples with translation/rotation errors smaller than a threshold, and for each threshold the corresponding accuracy is calculated to build the curve. Figures 8A,B exhibit the superior accuracy of our method in both rotation and translation predictions.
Moreover, a steeper curve is observed near zero degrees in the accuracy-threshold curve of the rotation error, showing that the proposed method achieves higher accuracy at small rotation error thresholds, which indicates smaller pose estimation errors. Figure 9 displays some qualitative results on the YCB-Video dataset. Figure 9A shows the original images in the dataset. Figures 9B,D are the segmentation results of DenseFusion and our method, where different colors stand for different object categories. After acquiring the segmentation mask, the pixel-level area of each object in the image is extracted. If the number of effective pixels in the depth map of an object is less than a certain threshold, it is identified as an invalid object and its pose is not estimated. For each valid object, the point cloud is transformed with the predicted pose parameters, and its projection in the 2D image is then superimposed over the object region, as shown in Figures 9C,E. As illustrated in the second column from the left, DenseFusion's prediction for the bowl is far from its real orientation, whereas our method provides a more accurate prediction, showing its advantage in dealing with symmetric objects. For some poorly textured objects, such as the banana in the first and fourth columns, obvious errors are spotted for DenseFusion, with no visually perceptible errors for our method.
Time efficiency: Table 4 shows the time efficiency comparison of our network with PoseCNN and DenseFusion. The time costs of all computation components, including segmentation, pose estimation, and iterative refinement, are reported separately for a more intuitive comparison, except for PoseCNN, as it is not a pipeline-structured network. In total running time, our method is five times faster than PoseCNN. Compared with DenseFusion, our method is slightly slower in segmentation while being slightly faster in pose estimation. The total time consumption is slightly lower than DenseFusion's and meets the requirements of real-time applications, at a processing rate of 18 frames per second with about five objects in each frame. Considering its better pose estimation accuracy, our method proves overall the best among these state-of-the-art methods. What is more, a lightweight network will be applied for feature extraction in the future, which is expected to improve the time efficiency tremendously.

Experiments on LineMOD Dataset
Table 5 shows the comparison of our method with some other methods [BB8 (Rad and Lepetit, 2017), PoseCNN+DeepIM (Xiang et al., 2017; Li et al., 2018b), Implicit (Sundermeyer et al., 2018)+ICP, SSD-6D (Kehl et al., 2017)+ICP, PointFusion (Xu et al., 2018), and DenseFusion] on the LineMOD dataset, with the accuracy of ADD adopted as the metric; bold values represent the best results. For the mean accuracy, our method outperforms DenseFusion by 2.6. Figure 10 visualizes the pose estimation results of our method on the LineMOD dataset. As expected, only small errors are perceived in these images, even under cluttered environments.

Vision-guided Robotic Grasping System
Object recognition and pose estimation methods can be widely used in robot visual servo systems (Wu et al., 2020). In order to explore the feasibility of applying the proposed method in manufacturing industry scenarios, we have built a vision-guided robotic system for the most common manufacturing task: object grasping. The framework of the system is illustrated in Figure 11. A camera is installed on the manipulator of the robot. Three coordinate systems are labeled: the robot coordinate system C_r, the camera coordinate system C_c, and the object coordinate system C_o. T_1 is the 6D transform from C_r to C_c, and T_2 is that from C_o to C_c. T_1 is calculated by the well-known Navy hand-eye calibration algorithm (Park and Martin, 1994), and T_2 is predicted by the proposed algorithm. The object pose relative to the robot is then computed as T = T_1 × T_2. The pose matrix is crucial for manipulator path planning and motion control in the grasping task.
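As a minimal numeric sketch of this composition (illustrative values; we interpret T_1 as carrying camera coordinates into the robot frame so that the product T_1 T_2 carries object coordinates into the robot frame):

```python
import numpy as np

def pose_matrix(R, t):
    """Build a 4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# T1: camera pose expressed in the robot frame (from hand-eye calibration)
T1 = pose_matrix(np.eye(3), [0.3, 0.0, 0.5])
# T2: object pose expressed in the camera frame (from the pose network)
T2 = pose_matrix(np.eye(3), [0.0, 0.1, 0.4])

T = T1 @ T2        # object pose in the robot frame
grasp_point = T @ np.array([0.0, 0.0, 0.0, 1.0])   # object origin, homogeneous
```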
The system is composed of a 6-DOF Kinova Jaco2 manipulator and a Percipio FM830-I RGB-D camera installed on the side of the gripper, as shown in Figure 12B. The Percipio camera combines structured light with binocular vision to build accurate depth maps; the precision of the captured depth data is up to 1 mm. Moreover, we use building blocks as the target objects in the grasping experiments, as Figure 12A shows.
Before the experiment, we first calibrate T1 and then train the semantic segmentation and 6D pose estimation networks. The experimental procedure is as follows: (1) Before grasping, the manipulator moves to a predefined observation position. (2) The RGB-D camera captures images and sends the data to the image processing server. (3) On the server, the RGB-D images are fed into the segmentation and pose estimation networks to predict the 6D pose parameters of the target objects. (4) Based on the predicted transformation matrix, the host computer completes path planning and sends signals to the manipulator, which moves to the planned positions, grasps the objects, and places them in the target area.
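The four steps above can be sketched as a single grasping cycle. Everything here is a hypothetical stand-in for the experiment's components (the class and function names are ours, not the paper's implementation); the stubs only exist to make the control flow concrete and runnable:

```python
import numpy as np

class FakeCamera:
    """Stand-in for the RGB-D camera: returns a blank frame and a flat depth map."""
    def capture(self):
        rgb = np.zeros((480, 640, 3), dtype=np.uint8)
        depth = np.ones((480, 640), dtype=np.float32)
        return rgb, depth

def seg_net(rgb):
    """Stub segmentation network: pretend one object mask was found."""
    mask = np.zeros(rgb.shape[:2], dtype=bool)
    mask[200:280, 300:380] = True
    return [mask]

def pose_net(rgb, depth, mask):
    """Stub pose network: return a fixed object pose in the camera frame."""
    T = np.eye(4)
    T[:3, 3] = [0.0, 0.2, 0.5]
    return T

def grasp_cycle(camera, T_hand_eye):
    """One pass of the loop: capture (2), segment and predict poses (3),
    and return robot-frame poses that would drive path planning (4)."""
    rgb, depth = camera.capture()
    poses = []
    for mask in seg_net(rgb):
        T_obj_cam = pose_net(rgb, depth, mask)   # object pose in the camera frame
        poses.append(T_hand_eye @ T_obj_cam)     # map into the robot frame (T = T1 x T2)
    return poses

T1 = np.eye(4)
T1[:3, 3] = [0.4, 0.0, 0.3]  # illustrative hand-eye calibration result
poses = grasp_cycle(FakeCamera(), T1)
```

In the real system, step (1) (moving to the observation position) and the grasp-and-place actions after step (4) are carried out by the manipulator controller rather than this vision-side loop.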
Some experimental results are illustrated in Figure 13. In this case, the segmentation is perfect; however, for some objects, the pose predictions are not satisfactory. One possible reason is that the poorly textured building blocks mislead the color feature extractor. In general, the grasping operation runs quickly and smoothly, which, to some extent, verifies the feasibility of applying the new network to all kinds of manufacturing applications. Figure 14 shows the complete process of the grasping experiment.

CONCLUSION
This paper presents a new two-stage deep neural network that efficiently implements object recognition and 6D pose estimation on input RGB-D images. First, a segmentation network is applied to segment the objects from the scene, using dense connections to fuse features at different scales, which effectively improves the semantic segmentation results. Second, by introducing the channel and position attention modules, better color and geometric features are extracted for the pose predictor. Third, the output pose parameters are further improved by an iterative refinement network. A large number of experiments conducted on two benchmark datasets demonstrated the effectiveness and accuracy of the proposed method in comparison with several state-of-the-art methods. Moreover, a vision-guided robotic grasping system was built, and the grasping experiments verified the potential of this algorithm to be applied in real-time manufacturing applications. Currently, the proposed method still has difficulty dealing with textureless or poorly textured objects. Finer differential geometric features with clear physical meaning and better shape detail are preferred and will be considered in future work.

DATA AVAILABILITY STATEMENT
The two benchmark datasets LineMOD and YCB-Video analyzed for this study can be found at http://campar.in.tum.de/Main/StefanHinterstoisser and https://rse-lab.cs.washington.edu/projects/posecnn/. The building block dataset used in the robotic grasping experiments in this paper was built by ourselves and will be made available to any qualified researcher. Further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
GL conceived the research project. FC, YL, and YF provided support and discussions. GL, FC, and YL proposed the new algorithm. FC, YL, and YF built the whole program and conducted the experiments. GL, FC, and YL wrote the paper. XW and CW performed the English corrections. All authors reviewed and approved the submitted paper.