Learning Suction Graspability Considering Grasp Quality and Robot Reachability for Bin-Picking

Deep learning has been widely used for inferring robust grasps. Although human-labeled RGB-D datasets were initially used to learn grasp configurations, preparing this kind of large dataset is expensive. To address this problem, images were generated by a physical simulator, and a physically inspired model (e.g., a contact model between a suction vacuum cup and an object) was used as a grasp quality evaluation metric to annotate the synthesized images. However, this kind of contact model is complicated and requires parameter identification by experiments to ensure real-world performance. In addition, previous studies have not considered manipulator reachability, such as when a grasp configuration with high grasp quality cannot be reached due to collisions or the physical limitations of the robot. In this study, we propose an intuitive, geometric analysis-based grasp quality evaluation metric. We further incorporate a reachability evaluation metric. We annotate pixel-wise grasp quality and reachability by the proposed evaluation metrics on images synthesized in a simulator to train an auto-encoder-decoder called suction graspability U-Net++ (SG-U-Net++). Experimental results show that our intuitive grasp quality evaluation metric is competitive with a physically-inspired metric. Learning reachability helps to reduce motion planning computation time by removing obviously unreachable candidates. The system achieves an overall picking speed of 560 PPH (pieces per hour).


Introduction
In recent years, growth in the retail e-commerce business has led to high demand for warehouse automation by robots [Bogue, 2016]. Although the Amazon Picking Challenge [Fujita et al., 2020] has advanced the automation of the pick-and-place task, which is common in warehouses, picking objects from a cluttered scene remains a challenge.
The key to automating pick-and-place is to find a grasp point that the robot can approach via a collision-free path and at which it can stably grasp the target object. Grasp point detection methods can be broadly divided into analytical and data-driven methods. Analytical methods [Miller and Allen, 2004, Pharswan et al., 2019] require modeling the interaction between the object and the hand, and have a high computation cost [Roa and Suárez, 2015]. For these reasons, data-driven methods are preferred for bin picking.
Many previous studies have used supervised deep learning, one of the most widely used data-driven methods, to predict only the grasp configuration (e.g., location, orientation, and open width) without considering the grasp quality. Given an RGB-D image, the grasp configuration for a jaw gripper [Kumra and Kanan, 2017, Chu et al., 2018, Zhang et al., 2019] or a vacuum gripper [Jiang et al., 2020, Araki et al., 2020] can be directly predicted using a deep convolutional neural network (DCNN). Learning was extended from points to regions by Domae et al. [Domae et al., 2014, Mano et al., 2019], who proposed a convolution-based method in which a hand shape mask is convolved with the depth mask to obtain regions of grasp points. Matsumura et al. [Matsumura et al., 2019] later learned the peak among all regions for different hand orientations to detect a grasp point capable of avoiding multiple objects.
However, in addition to the grasp configuration, the grasp quality is also important for a robot to select the optimal grasp point for bin picking. The grasp quality indicates the probability of a successful grasp, considering factors such as surface properties. For example, for suction grasping, although an object with a complicated shape may have multiple grasp points, grasp points located on flat surfaces should be given higher selection priority because they have higher grasp quality (they are easier for a vacuum cup to suck) than those on curved surfaces. Zeng et al. [Zeng et al., 2018a] empirically labeled the grasp quality in RGB-D images of the Amazon Picking Challenge object set. They proposed a multi-modal DCNN for learning grasp quality maps (pixel-wise grasp quality corresponding to an RGB-D image) for jaw and vacuum grippers. However, preparing a dataset by manual labeling is time consuming, so later datasets were synthesized in a simulator to reduce the time cost. Dex-Net [Mahler et al., 2018, 2019] evaluated the grasp quality with a physical model and generated a large dataset by simulation. The synthesized dataset was used to train a grasp quality convolutional neural network (GQ-CNN) to estimate the success probability of a grasp point. However, defining a precise physical model for the contact between gripper and object is difficult. Further, the parameters of the model need to be identified experimentally to reproduce the salient kinematic and dynamic features of a real robot hand (e.g., the deformation and suction force of a vacuum cup).
Unlike Dex-Net, this study proposes an intuitive suction grasp quality metric based on point cloud analysis, without the need for modeling complicated contact dynamics. Further, we incorporate a robot reachability metric to evaluate suction graspability from the viewpoint of the manipulator. Previous studies have evaluated graspability only in terms of grasp quality for the hand. However, it is possible that although a grasp point has high grasp quality, the manipulator cannot move to that point. It is also possible for an object to have multiple grasp points with the same level of graspability but requiring different amounts of time for the manipulator to approach, due to differences in the goal pose and surrounding collision objects. Bin picking efficiency can therefore be improved by incorporating a reachability evaluation metric. We label suction graspability with the proposed grasp quality and reachability metrics and generate a dataset in a physical simulator. An auto-encoder is trained to predict suction graspability from a depth image input, and a graspability clustering and ranking algorithm is designed to propose the optimal grasp point.
Our primary contributions are: 1) an intuitive grasp quality evaluation metric that does not require complicated physical modeling; 2) a reachability evaluation metric for labeling suction graspability in addition to grasp quality; 3) a comparison experiment between the proposed intuitive grasp quality evaluation metric and a physically-inspired one (Dex-Net); and 4) an experiment investigating the effect of learning reachability.

Pixel-wise graspability learning
In early studies, deep neural networks were used to directly predict candidate grasp configurations without considering the grasp quality [Zhou et al., 2018a, Asif et al., 2018, Xu et al., 2021]. However, since there can be multiple grasp candidates for an object with a complicated shape, or for multiple objects in a cluttered scene, learning graspability is required for the planner to find the optimal grasp among the candidates.
Pixel-wise graspability learning uses RGB-D or depth-only images to infer the grasp success probability at each pixel. One representative work is that of Zeng et al. [Zeng et al., 2018a], who used a manually labeled dataset to train fully convolutional networks (FCNs) to predict pixel-wise grasp quality (affordance) maps for four pre-defined grasping primitives. Liu et al. [Liu et al., 2020] performed active exploration by pushing objects to find good grasp affordance maps predicted by Zeng's FCNs. Recently, Utomo et al. [Utomo et al., 2021] modified the architecture of Zeng's FCNs to improve inference precision and speed. Based on Zeng's concept, Hasegawa et al. [Hasegawa et al., 2019] incorporated a primitive template matching module, making the system adaptive to changes in grasping primitives. Zeng et al. also applied the concept of pixel-wise affordance learning to other manipulation tasks, such as picking by synergistic coordination of push and grasp motions [Zeng et al., 2018b] and picking and throwing [Zeng et al., 2020]. However, preparing huge numbers of RGB-D images and manually labeling the grasp quality requires a large amount of effort.
Faced with the dataset generation cost of RGB-D-based graspability learning, researchers started to use depth-image-only learning. The merit of using depth images is that they are easier to synthesize and annotate in a physical simulator than RGB images. Morrison et al. [Morrison et al., 2020] proposed a generative grasping convolutional neural network (GG-CNN) to rapidly predict pixel-wise grasp quality. Based on a similar concept of grasp quality learning, the U-Grasping fully convolutional neural network (UGNet) [Song et al., 2019], the Generative Residual Convolutional Neural Network (GRConvNet) [Kumra et al., 2020], and the Generative Inception Neural Network (GI-NNet) [Shukla et al., 2021] were later proposed and were reported to achieve higher accuracy than GG-CNN. Le et al. [Le et al., 2021] extended GG-CNN to predict the grasp quality of deformable objects by incorporating stiffness information. Morrison et al. [Morrison et al., 2019] also applied GG-CNN in a multi-view picking controller to avoid bad grasp poses caused by occlusion and collision. However, the grasp quality dataset of GG-CNN was generated by creating masks of the center third of each grasping rectangle in the Cornell Grasping dataset [Lenz et al., 2015] and the Jacquard dataset [Depierre et al., 2018]. This annotation method did not deeply analyze the interaction between hand and object, which is expected to lead to an insufficient representation of grasp robustness.
To improve the robustness of grasp quality annotation, a physically-inspired contact force model was designed to label pixel-wise grasp quality. Mahler et al. [Mahler et al., 2018, 2019] designed a quasi-static spring model of the contact force between a vacuum cup and an object. Based on this compliant contact model, they assessed grasp quality in terms of grasp robustness in a physical simulator. They proposed GQ-CNN to learn the grasp quality and used a sampling-based method to propose an optimal grasp in the inference phase, and later extended this work with a fully convolutional GQ-CNN [Satish et al., 2019] that infers pixel-wise grasp quality and achieves faster grasping. Recently, Cao et al. used an auto-encoder-decoder to infer grasp quality, labeled by a contact model similar to that of GQ-CNN, to generate the suction pose. However, the accuracy of a contact model depends on its complexity and parameter tuning. High complexity may lead to a long annotation computation time. Parameter identification by real-world experiments [Bernardin et al., 2019] might also be necessary to ensure the validity of the contact model.
Our approach also labels the grasp quality in synthesized depth images. Unlike GQ-CNN, we propose a more intuitive evaluation metric based on geometric analysis rather than a complicated contact model. Our results show that the intuitive evaluation metric is competitive with GQ-CNN. A reachability heatmap is further incorporated to help filter out pixels that have high grasp quality but are unreachable.

Reachability assessment
Reachability was previously assessed by sampling a large number of grasp poses and then using forward kinematics calculation, inverse kinematics calculation, or manipulability ellipsoid evaluation to investigate whether the sampled poses were reachable [Porges et al., 2015, 2014, Vahrenkamp and Asfour, 2015, Zacharias et al., 2007, Makhal and Goins, 2018]. The reachability map was generated off-line, and the feasibility of candidate grasp poses was queried during grasp planning for picking static [Akinola et al., 2018, Sundaram et al., 2020] or moving [Akinola et al., 2021] objects. However, creating a high-accuracy off-line map for a large space is computationally expensive. In addition, because the off-line map considers only collisions between the manipulator and a constrained environment (e.g., a fixed bin or wall), and the environment for picking in a cluttered scene is dynamic, collision checking between the manipulator and surrounding objects is still needed, which can be time consuming. Hence, recent studies have started to learn the reachability of grasp poses with collision awareness. Kim et al. [Kim and Perez, 2021] designed a density net to learn the reachability density of a given pose, but considered only self-collision. Murali et al. [Murali et al., 2020] used a learned grasp sampler to sample 6D grasp poses and proposed CollisionNet to assess the collision score of sampled poses. Lou et al. [Lou et al., 2020] proposed a 3D CNN and reachability predictor to predict the pose stability and reachability of sampled grasp poses. They later extended this work by incorporating collision awareness for learning approachable grasp poses [Lou et al., 2021]. These sampling-based methods require designing or training a good grasp sampler for inferring reachability. Our approach is one-shot: it directly infers pixel-wise reachability from the depth image without sampling.
Problem statement

Objective
Given a depth image and point cloud as input, the goal is to find a grasp pose with high graspability for a suction robotic hand to pick items from a cluttered scene and place them on a conveyor. The depth image and point cloud are obtained directly from an Intel RealSense SR300 camera.

Picking robot
As shown in Fig. 1 (A), the picking robot is composed of a 6 degree-of-freedom (DoF) manipulator (TVL500, Shibaura Machine Co., Ltd.) and a 1 DoF robotic hand with two vacuum suction cups (Fig. 1 (B)). The camera is mounted at the center of the hand and is activated only when the robot is at its home position (initial pose); it can hence be regarded as a fixed camera installed above the bin. This setup has the merit that the camera can capture the entire bin from above its center without occlusion by the manipulator.

Grasp pose
As shown in Fig. 1 (C), the 6D grasp pose G is defined as (p, n, θ), where p is the target position of the vacuum suction cup center, n is the suction direction, and θ is the rotation angle around n. Given the point cloud of the target item and the position p, the normal at p can be calculated simply by principal component analysis of a covariance matrix generated from the neighbors of p, using a point cloud library; n is the direction of this normal. As n determines only the direction of the center axis of the vacuum suction cup, a further rotational degree of freedom (θ) is required to determine the 6D pose of the hand. Note that the two vacuum suction cups are symmetric with respect to the hand center.
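The PCA-based normal estimation described above can be sketched as follows. This is an illustrative stand-in for what a point cloud library (e.g., PCL) provides; the k-nearest-neighbor search and camera-facing orientation convention are assumptions, not details from the paper.

```python
import numpy as np

def estimate_normal(points, p_idx, k=30):
    """Estimate the surface normal at points[p_idx] by PCA of the
    covariance matrix of its k nearest neighbors (illustrative sketch)."""
    p = points[p_idx]
    # k nearest neighbors by brute-force Euclidean distance
    dists = np.linalg.norm(points - p, axis=1)
    neighbors = points[np.argsort(dists)[:k]]
    # covariance of the centered neighborhood (3x3)
    cov = np.cov((neighbors - neighbors.mean(axis=0)).T)
    # the eigenvector of the smallest eigenvalue is the normal direction
    eigvals, eigvecs = np.linalg.eigh(cov)
    n = eigvecs[:, 0]
    # orient the normal toward the camera (assumed at the origin, looking along +z)
    if n[2] > 0:
        n = -n
    return n / np.linalg.norm(n)
```

A production system would use a k-d tree rather than brute-force search, but the covariance/eigendecomposition step is the same.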

Method
The overall picking system is shown in Fig. 2. Given a depth image captured at the robot home position, the auto-encoder SG-U-Net++ predicts the suction graspability maps, comprising a pixel-wise grasp quality map and a robot reachability map. SG-U-Net++ is trained on a synthesized dataset generated by a physical simulator, without any human-labeled data. Cluster analysis is performed on the two maps to find areas with graspability higher than the thresholds. Local sorting extracts the point with the highest graspability value in each cluster as a grasp candidate. Global sorting then orders the candidates of all clusters by descending graspability value, and the result is sent to the motion planner. The motion planner plans trajectories for the sorted grasp candidates in descending order of graspability value; the path search continues until the first candidate with a successful solution is found.

Learning the suction graspability
SG-U-Net++ was trained on a synthesized dataset to learn suction graspability by supervised deep learning. Figure 3 (A) shows the overall dataset generation flow. A synthesized cluttered scene is first generated using pybullet to obtain a synthesized depth image and object segmentation mask. Region growing is then performed on the point cloud to detect graspable surfaces. A convolution-based method is further used to find the graspable areas of vacuum cup centers, where the vacuum cup can make full contact with the surfaces. The grasp quality and robot reachability are then evaluated pixel-wise by the proposed metrics in the graspable area.

Cluttered scene generation
The object set used to synthesize the scenes contains 3D CAD models from 3DNet [Wohlkinger et al., 2012] and the KIT Object database [Kasper et al., 2012]. These models were used because they had previously been used to generate a dataset on which a trained CNN successfully predicted grasp quality [Mahler et al., 2017]. We empirically removed objects that are obviously difficult for suction, leaving 708 models. To generate cluttered scenes, a random number of objects were randomly selected from the object set and dropped from above the bin in random poses. Once all dropped objects were stable, a depth image and segmentation mask of the cluttered scene were generated, as in Fig. 3 (A).

Graspable surface detection
As shown in Fig. 3 (C), graspable surface detection was performed to find the graspable area of each object. Given the camera intrinsic matrix, the point cloud of each object can easily be created from the depth image and segmentation mask. To detect surfaces that are roughly flat and large enough for suction by the vacuum cup, a region growing algorithm [Rusu and Cousins, 2011] was used to segment the point cloud. To stably suck an object, the vacuum cup needs to be in full contact with the surface. Hence, inspired by [Domae et al., 2014], a convolution-based method was used to calculate the graspable area (the set of vacuum cup center positions at which the cup can make full contact with the surface). Specifically, as shown in the middle of Fig. 3 (C), each segmented point cloud was projected onto its local coordinates to create a binary surface mask, with each pixel of the mask representing 1 mm. The surface mask was then convolved with a vacuum cup mask (of size 18 × 18, where 18 mm is the cup diameter) to obtain the graspable area. At a given pixel, the convolution result equals the area of the cup (π × 0.009² m² for our hand configuration) if the vacuum cup can make full contact with the surface. Refer to [Domae et al., 2014] for more details. The calculated areas were finally remapped to a depth image to generate a graspable area map (right side of Fig. 3 (C)).
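The convolution step above can be sketched as follows. This is a minimal pixel-counting version of the Domae et al. style full-contact test; the circular-mask construction and the "every cup pixel must lie on the surface" criterion are our illustrative interpretation, using pixel counts rather than metric area.

```python
import numpy as np
from scipy.signal import convolve2d

def graspable_area(surface_mask, cup_diameter_px=18):
    """Convolve a binary surface mask (1 px = 1 mm) with a circular
    vacuum-cup mask; pixels where the full cup fits on the surface
    form the graspable area (sketch of the convolution-based test)."""
    r = cup_diameter_px / 2.0
    yy, xx = np.mgrid[:cup_diameter_px, :cup_diameter_px]
    cup = ((xx - r + 0.5) ** 2 + (yy - r + 0.5) ** 2 <= r ** 2).astype(float)
    # full contact <=> every cup pixel lies on the surface,
    # i.e., the convolution reaches the cup's own pixel count
    score = convolve2d(surface_mask.astype(float), cup, mode="same")
    return score >= cup.sum() - 1e-6
```

With the zero-padded boundary of `convolve2d`, cup centers near the mask edge automatically fail the full-contact test, matching the intent of the method.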

Grasp quality evaluation
Although the graspable areas of the surfaces were obtained, each pixel in an area may have a different grasp probability, that is, grasp quality, owing to surface features. Therefore, an intuitive metric J_q (Eq. (1)) was proposed to assess the grasp quality of each pixel in the graspable area. The metric J_q comprises J_c, which evaluates the normalized distance to the center of the graspable area, and J_s, which evaluates the flatness and smoothness of the contact area between the vacuum cup and the surface.
J_q = 0.5 J_c + 0.5 J_s    (1)

J_c (Eqs. (2)-(3)) was derived based on the assumption that the closer grasp points are to the center of the graspable area, the closer they are to the center of mass of the object. Hence, grasp points close to the area center (with higher J_c values) are considered more stable for the robot to suck and hold the object.

J_c = 1 − maxmin(d(p))    (2)
d(p) = ‖p − p_c‖    (3)

where p is a point in the graspable area of a surface, p_c is the center of the graspable area, and maxmin(x) is a max-min normalization function.
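The distance-to-center term can be sketched as below. Since the exact form of Eqs. (2)-(3) is not fully recoverable from the text, the max-min normalization of the Euclidean distance followed by inversion (nearest point scores highest) is an assumption consistent with the surrounding description.

```python
import numpy as np

def j_c(points, center):
    """Distance-to-center term J_c (sketch): max-min-normalize the
    distance of each graspable-area point to the area center, then
    invert so that points nearest the center score highest."""
    d = np.linalg.norm(points - center, axis=1)
    d_norm = (d - d.min()) / (d.max() - d.min() + 1e-12)  # maxmin(.)
    return 1.0 - d_norm
```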
J_s (Eqs. (4)-(6)) was derived based on the assumption that a vacuum cup generates a higher suction force when in contact with a flat, smooth surface than with a curved one. We define p_s as the point set of the contact area between the vacuum cup and the surface when the cup sucks at a certain point in the graspable area. As reported in [Nishina and Hasegawa, 2020], surface flatness can be evaluated by the variance of the normals; the first term of J_s therefore assesses flatness by evaluating the variance of the normals of p_s, as in Eq. (5). However, considering only flatness is not sufficient: although a vicinal (step-like) surface has small normal variance, the vacuum cup cannot form a seal on such a surface. Hence, the second term (Eq. (6)) assesses surface smoothness by evaluating the residual error of fitting p_s to a plane Plane(p_s), computed as the sum of the distances of the points in p_s to the fitted plane. Note that we empirically scaled res(p_s) by 5.0 and weighted the two terms in Eq. (4) by 0.9 and 0.1 to obtain plausible grasp quality values.
J_s = 0.9 var(n_s) + 0.1 e^(−5 res(p_s))    (4)

where p_s is the set of points in the contact area when the vacuum cup sucks at a point in the graspable area, N is the number of points in p_s, n_s are the point normals of p_s, var(n_s) computes the variance of n_s, Plane(p_s) is a plane fitted to p_s by least squares, and res(p_s) computes the residual error of the plane fitting as the sum of the distances from each point in p_s to the fitted plane.
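An illustrative implementation of the flatness/smoothness term is sketched below. Two points are assumptions on our part: the flatness term is written as exp(−var) so that low normal variance yields a high score (the text states flatter surfaces should score higher, and Eq. (5) is not fully recoverable), and the plane fit uses a simple least-squares z = ax + by + c model.

```python
import numpy as np

def j_s(p_s, n_s):
    """Sketch of J_s (Eq. (4)). p_s: contact-area points, shape (N, 3);
    n_s: their unit normals, shape (N, 3). Flatness rewards low normal
    variance (assumed exp(-var) form); smoothness uses the plane-fit
    residual scaled by 5.0 as stated in the text."""
    flatness = np.exp(-np.var(n_s, axis=0).sum())
    # least-squares plane fit z = a*x + b*y + c
    A = np.column_stack([p_s[:, 0], p_s[:, 1], np.ones(len(p_s))])
    coef, *_ = np.linalg.lstsq(A, p_s[:, 2], rcond=None)
    a, b, c = coef
    # res(p_s): sum of point-to-plane distances
    res = np.abs(a * p_s[:, 0] + b * p_s[:, 1] + c - p_s[:, 2]).sum()
    res /= np.sqrt(a * a + b * b + 1.0)
    smoothness = np.exp(-5.0 * res)
    return 0.9 * flatness + 0.1 * smoothness
```

For a perfectly flat, smooth patch both terms reach 1, giving J_s = 1; curvature or step discontinuities pull the respective term toward 0.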
Figure 3 (D) shows an example of the annotated grasp quality. Points closer to the surface center had higher grasp quality values, and points located on flat surfaces had higher grasp quality (e.g., the surfaces of boxes had higher grasp quality values than the lateral surfaces of cylinders).

Robot reachability evaluation
The grasp quality considers only the interaction between the object and the vacuum cup, not the manipulator. Because a collision check and an IK solution search are needed for the manipulator, on-line checking and searching over all grasp candidates is costly. Learning robot reachability helps to rapidly discard grasp points at which the hand or manipulator may collide with the surroundings. It also assesses the ease of finding inverse kinematics (IK) solutions for the manipulator.
As described in the Grasp pose section, p and n of a grasp pose G can be calculated from the point cloud, leaving θ as the only undetermined variable. We sampled θ from 0° to 355° in steps of 5°. IKfast [Diankov, 2010] and the Flexible Collision Library (FCL) [Pan et al., 2012] were used to calculate the inverse kinematics solution and perform collision checking for each sampled θ. The reachability evaluation metric (Eqs. (7)-(8)) is the ratio of the number of IK-valid θ (those with a collision-free IK solution) to the sample size N_θ.
R(p, n) = (1/N_θ) Σ_{i=1}^{N_θ} Solver(p, n, θ_i)    (7)

Solver(p, n, θ_i) = 1 if a collision-free IK solution exists for θ_i, and 0 otherwise    (8)

where N_θ is the number of sampled θ and Solver is the IK solver and collision checker for the robot.
Note that because the two vacuum cups are symmetric with respect to the hand center, we evaluated the reachability score of only one cup. Figure 3 (E) shows an example of the robot reachability evaluation.
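The reachability ratio of Eqs. (7)-(8) can be sketched as follows. The `solver(p, n, theta) -> bool` callable is an assumed interface standing in for IKfast plus FCL collision checking.

```python
import numpy as np

def reachability_score(p, n, solver, n_theta=72):
    """Reachability metric sketch: fraction of sampled hand rotations
    theta (0 to 355 deg in 5-deg steps, N_theta = 72) for which the
    given solver finds a collision-free IK solution."""
    thetas = np.deg2rad(np.arange(0, 360, 360 // n_theta))
    valid = sum(1 for th in thetas if solver(p, n, th))
    return valid / len(thetas)
```

In the dataset generation pipeline this score is computed per pixel of the graspable area to form the reachability map that SG-U-Net++ learns to predict.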

Clustering and ranking
The clustering and ranking block in Fig. 2 outputs the ranked grasp proposals. To validate the role of learning reachability, we propose two policies for generating grasp candidates (Policy 1: use only grasp quality; Policy 2: use both grasp quality and reachability). Policy 1 extracts the area where the grasp quality value exceeds a threshold th_g. Policy 2 extracts the area where the grasp quality value exceeds th_g and the corresponding reachability value exceeds th_r. Filtering by reachability is intended to remove pixels with high grasp quality that the robot cannot reach due to collision or IK failure. The values of th_g and th_r were empirically set to 0.5 and 0.3, respectively. The extracted areas were clustered by scipy.ndimage.label [Virtanen et al., 2020]. Points in each cluster were ranked (local, cluster level) by grasp quality value, and the point with the highest grasp quality was used as the grasp candidate for its cluster (see Ranked grasp candidates in Fig. 2). Finally, the grasp candidates were ranked (global level) and sent to the motion planner.
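The Policy 2 clustering and two-level ranking can be sketched as below, using the same scipy.ndimage.label call named in the text; the function name and return format are illustrative.

```python
import numpy as np
from scipy import ndimage

def propose_grasps(quality, reach, th_g=0.5, th_r=0.3):
    """Policy 2 sketch: threshold the quality and reachability maps,
    label connected regions, keep the best-quality pixel per cluster
    (local ranking), then sort candidates globally by quality."""
    mask = (quality > th_g) & (reach > th_r)
    labels, n_clusters = ndimage.label(mask)
    candidates = []
    for k in range(1, n_clusters + 1):
        ys, xs = np.nonzero(labels == k)
        best = np.argmax(quality[ys, xs])          # local (cluster-level) ranking
        candidates.append((quality[ys[best], xs[best]], (ys[best], xs[best])))
    candidates.sort(key=lambda c: c[0], reverse=True)  # global ranking
    return [pix for _, pix in candidates]
```

Policy 1 is the same with the reachability condition dropped (`mask = quality > th_g`).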

Motion planning
Given the grasp candidates, goal poses were created for MoveIt [Chitta et al., 2012] to plan a trajectory. As described in the Grasp pose section, the values of p and n of a goal pose can be obtained from the point cloud information of the grasp candidates, so only θ is undetermined. Because a Cartesian approach motion is required for the hand to suck the object, p was offset 1 cm away from the object along the n direction. θ was sampled from 0° to 180° in steps of 5°. For each sampled goal pose, a trajectory was planned for the left and right vacuum cups respectively, and the shorter trajectory was selected as the final solution. The planned trajectory was then time-parametrized by Time-Optimal Path Parameterization (toppra) [Pham and Pham, 2018] to realize position control for the robot to approach the goal pose. After reaching the goal pose, the robot hand moved down along n to suck the object. Once the contact force between the vacuum cup and the object, measured by a force sensor, exceeded a threshold, the object was assumed to be held by suction and was then lifted and placed on the conveyor.

Data collection, training, and precision evaluation
We used the proposed suction graspability annotation method in pybullet to generate 15,000 data items, split into 10,000 for training and 5,000 for testing. The synthesized data were then used to train SG-U-Net++, implemented in PyTorch. The Adam optimizer (learning rate = 1.0e-4) was used to update the network parameters during training, with a batch size of 16. Both data collection and training were conducted on an Intel Core i7-8700K 3.70 GHz PC with 64 GB RAM and 4 Nvidia GeForce GTX 1080 GPUs.
To evaluate the learning results, we used an evaluation method similar to that reported by Zeng et al. [Zeng et al., 2018a] on the testing set. For practical utilization, it is important for SG-U-Net++ to find at least one point in the ground truth suction graspable area or manipulator reachable area. We defined the suction graspable area as the pixels whose ground truth grasp quality scores exceed 0.5, and the approachable area as the pixels whose ground truth reachability scores exceed 0.5. The inferred grasp quality and reachability scores were thresholded into the Top 1%, Top 10%, Top 25%, and Top 50% of pixels. Pixels above the threshold that fall inside the ground truth area are counted as true positives; otherwise they are false positives. We report the inference precision at these four thresholds for SG-U-Net++ and compare it with Dex-Net.
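The top-fraction precision described above can be sketched as follows. The exact tie-breaking and per-image aggregation used in the paper are not specified, so this is our illustrative reading of the Zeng et al. style metric.

```python
import numpy as np

def topk_precision(pred, gt_mask, top_frac=0.01):
    """Precision of the top `top_frac` fraction of predicted pixels:
    the fraction of those pixels that fall inside the ground-truth
    area (ground-truth score > 0.5, given here as a boolean mask)."""
    k = max(1, int(pred.size * top_frac))
    idx = np.argsort(pred.ravel())[::-1][:k]   # indices of top-k predicted pixels
    tp = gt_mask.ravel()[idx].sum()            # top-k pixels inside the GT area
    return tp / k
```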

Real world picking experiments
To evaluate and compare the performance of the different policies, a pick-and-place experiment was conducted. Figure 5 shows the experimental object set, which includes primitive solids, commodities, and 3D-printed objects. All of the objects are novel objects that were not used during training. In each trial, the robot was required to pick 13 objects placed in random poses (except for the cup) from a bin and place them on the conveyor. Note that the cup was placed lying down because it could not be grasped in a standing pose. A grasp was treated as a failure if the robot could not grasp the object within three attempts.
We conducted 10 trials each for Policy 1, Policy 2, and Dex-Net 4.0 (suction grasp proposal by its fully convolutional grasping policy). Note that because Dex-Net has its own grasp planning method, we directly sorted its inferred grasp quality values without clustering. To compare our proposed intuitive grasp quality evaluation metric (Eq. (1)) with that of Dex-Net, we evaluated and compared the grasp planning computation time and success rate of Policy 1 and Dex-Net. To evaluate the effect of incorporating the reachability score, we evaluated and compared the grasp planning computation time, motion planning computation time, and success rate of Policy 1 and Policy 2.
Results and discussion

Inference precision evaluation
Table 1 shows the inference precision of grasp quality and reachability. Both SG-U-Net++ and Dex-Net achieved high precision for the Top 1% and Top 10%, but the precision of Dex-Net decreased to below 0.9 for the Top 25% and Top 50%. This result indicates that our proposed intuitive grasp quality evaluation metric (Eq. (1)) performs as well as a physically inspired one; for the suction grasp task, annotating graspability by point cloud analysis is competitive with dynamics-based analysis. The inference precision of reachability for SG-U-Net++ also exceeded 0.9 for the Top 1% and Top 10%, but decreased sharply for the Top 25% and Top 50%. The overall performance of reachability inference was poorer than that of grasp quality, indicating that reachability is more difficult to learn. This is probably because grasp quality can be learned from surface features alone, whereas reachability learning additionally requires features of the surrounding objects.

Overall performance
Table 2 shows the experimental results of Dex-Net and our proposed method. Although all three methods achieved a high grasp success rate (> 90%), our method required less time for grasp planning. Moreover, the motion planning computation time was reduced by learning reachability. SG-U-Net++ with Policy 2 achieved high-speed picking of approximately 560 PPH (pieces per hour) (see supplementary video).

Comparison with physically-inspired grasp quality evaluation metric
As shown in Table 2, our method was competitive with Dex-Net while being faster in grasp planning. This result indicates that, compared with a physically-inspired metric, our geometric analysis-based grasp quality evaluation is sufficient for the picking task: evaluating the contact dynamics between a vacuum cup and an object surface can be simplified to analyzing the geometric features of the cup (e.g., its shape) and the surface (e.g., curvature, smoothness, and distance from the cup center to the surface center). In addition, similar to the report in [Zeng et al., 2018a], the grasp proposals of Dex-Net were farther from the center of mass. Figure 6 shows an example for our method and Dex-Net. Our predicted grasps were closer to the center of mass of the object than those inferred by Dex-Net. This is because we incorporated J_c (Eq. (2)) to evaluate the distance from the vacuum cup center to the surface center, helping SG-U-Net++ to predict grasp positions much closer to the center of mass.

Role of learning reachability
Although learning reachability slightly increased the grasp planning cost (by 0.02 s) due to processes such as clustering and ranking of the reachability heatmap, it reduced the motion planning cost (Policy 2: 0.90 s vs. Policy 1: 1.71 s). As shown in Fig. 7, Policy 2 predicted grasps with lower risk of collision with neighboring objects than Policy 1 and Dex-Net (e.g., Fig. 7, left: Policy 1 and Dex-Net predicted grasps on a wooden cylinder with high collision risk between the hand and 3D-printed objects). Further, an object may have surfaces with the same grasp quality (e.g., Fig. 7, right: a box with two flat surfaces). Whereas Policy 2 selected the surface that was easier to reach, Policy 1 might select one that is difficult to reach (Fig. 7, right), since it does not consider reachability. Policy 2 was therefore superior to Policy 1 and Dex-Net because it removed grasp candidates that were obviously impossible or difficult to approach. For Policy 1 and Dex-Net, which consider only grasp quality, the motion planner may first search for solutions for candidates with high grasp quality that turn out to be unreachable for the manipulator, increasing the motion planning effort.

Limitations and future work
Our study has several limitations. Several grasp failures occurred when picking 3D-printed objects. Because synthesized depth images differ from real ones, which are noisy and incomplete, the neural network prediction error increased for real depth input. This error was tolerable for objects with large surfaces, such as cylinders and boxes, but not for 3D-printed objects with complicated shapes whose graspable areas are quite small. In the future, we intend to apply sim-to-real techniques [Peng et al., 2018] or depth missing value prediction [Sajjan et al., 2020] to improve the performance of our neural network. Another failure mode, though infrequent, was objects falling during holding and placement because the manipulator speed was too high to hold the object stably. We addressed this problem by slowing the manipulator during the placement action, at the cost of overall picking efficiency. In the future, we want to consider a more suitable method for generating holding and placement trajectories, such as model-based control.

Conclusion
We proposed an auto-encoder-decoder to infer pixel-wise grasp quality and reachability. Our method is intuitive yet competitive with a CNN trained on data annotated using physically-inspired models. Reachability learning improved the efficiency of the picking system by reducing the motion planning effort. However, the performance of the auto-encoder-decoder deteriorated because of differences between synthesized and real data. In the future, sim-to-real technology will be adopted to improve performance across various environments.

Figure 6: Example of Dex-Net grasp prediction that is farther from the center of mass of the object.

Figure 7: Example of grasps predicted by Dex-Net and Policy 1 that are unreachable or difficult to reach.