Simulated mental imagery for robotic task planning

Traditional AI-planning methods for task planning in robotics require a symbolically encoded domain description. While powerful in well-defined scenarios, as well as human-interpretable, setting this up requires a substantial effort. Different from this, most everyday planning tasks are solved by humans intuitively, using mental imagery of the different planning steps. Here, we suggest that the same approach can be used for robots too, in cases which require only limited execution accuracy. In the current study, we propose a novel sub-symbolic method called Simulated Mental Imagery for Planning (SiMIP), which consists of perception, simulated action, success checking, and re-planning performed on 'imagined' images. We show that it is possible to implement mental imagery-based planning in an algorithmically sound way by combining regular convolutional neural networks and generative adversarial networks. With this method, the robot acquires the capability to use the initially existing scene to generate action plans without symbolic domain descriptions, while at the same time, plans remain human-interpretable, different from deep reinforcement learning, which is an alternative sub-symbolic approach. We create a data set from real scenes for a packing problem of having to correctly place different objects into different target slots. This way efficiency and success rate of this algorithm could be quantified.


I. INTRODUCTION
Task planning is the process of generating an action sequence to achieve a certain goal. To do this with conventional AI planning, one needs to rigorously define symbolic structuring elements: the planning domain including planning operators, pre- and post-conditions, as well as search/planning algorithms [1], [2], [3]. While this is powerful in different complex scenarios, most everyday planning tasks are solved by humans without explicit structuring elements (even without pen and paper). Modern neural-network-based methods can predict the required action, given a scene, without any of the aforementioned symbolic pre-structuring [4], [5]. However, the reasons for the decisions made by a neural network usually remain opaque and interpretation by a human is impossible. Thus, networks elude explanation, which, however, might be important in human-robot cooperation tasks. Based on this need, we suggest a planning approach based on human-understandable entities: image segments, objects, and affordances, but no explicit domain descriptions.
T. Kulvicius is also with University Medical Center Göttingen, Child and Adolescent Psychiatry and Psychotherapy, 37075 Göttingen, Germany.
M. Tamosiunaite is also with the Faculty of Computer Science, Vytautas Magnus University, Kaunas, Lithuania.
Our new method for task planning, called Simulated Mental Imagery for Planning (SiMIP), consists of the following components: perception, imagination of the action effect, success checking, and (re)planning. This is similar to everyday human plans, which comprise only a few steps, are often ad hoc, and involve frequent success checking and re-planning [6]. Note, however, that we abstract away from agent self-modeling (as, e.g., in [7]) and only produce mental images of successive scenes. If one wants to extract parameters required for robotic execution, such as locations of objects to be grasped or target locations where the objects are to be put, one has to post-process the mental images showing the scenes before and after the action. In addition, we do not include actions of other agents in our mental models (as, e.g., in [8]).
Extending affordance-based approaches, which analyze one scene at a time [9], we add Generative Adversarial Networks (GANs) to our architecture for simulated imagery of scenes following an action. Given the impressive performance of GANs in realistic image generation [10], [11], one could potentially use them to envision outcomes of robot manipulation. However, when handling complex scenes, GANs tend to suffer from instabilities in learning. Also, when processing complex scenes in an end-to-end manner, network behaviour is hard to explain (e.g., see [12]). Instead, we suggest obtaining future scenes by re-combinations on an object-by-object basis, with a GAN-based "imagination" step for the completion of individual objects. This is reminiscent of object-centric approaches that address scenes in an object-by-object manner in latent space (e.g., see [13], [14]). However, we prefer to keep the model explicit to achieve more stable training and performance.
As stated above, we use a simulated mental imagery process, which creates images of the outcome of an imagined action; we then use the imagined outcome as the input for the next imagined action, and so on. This way we can create a planning tree composed of images, for which conventional search algorithms can be used to arrive at an image sequence that leads to the goal. While the tree remains sub-symbolic, due to the object-wise treatment of the imagined scenes it can be readily post-processed into a symbolic representation required for robotic action. Stated in natural language, from the representations employed it is possible to deduce commands such as: "pick an object with label A from the table top with center coordinate (x_1, y_1) and diameter B cm and place it on an empty space with center coordinate (x_2, y_2)". This, together with the obtained image trees, makes the approach explainable to a human both in symbolic and in visual terms.
We demonstrate our approach on a box packing task, for which we created and labeled a small data set. As we keep the neural architectures simple (the aforementioned object-by-object treatment), comparatively small data sets suffice for training. Thus, one could also address a new task by preparing and labeling a new data set at limited cost. A large domain of problems including packing, stacking, and ordering (of table tops, shelves) can be addressed this way.
The paper is structured as follows. In Section II we discuss related work. Subsequently, an overview of our approach is presented in Section III and implementation details are described in Section IV. In Section V, we present experiments and results, and, finally, in Section VI we provide a conclusion and outlook.

II. RELATED WORK
We first briefly discuss classical symbolic planning, then neural-network-based sub-symbolic planning. We also discuss the use of physical simulation in planning, in relation to the mental imagery of future scenes used in our study. Then we provide an overview of affordance recognition, focusing on aspects relevant to our framework. Finally, we briefly review the use of neuro-symbolic representations in visual reasoning, which is also to some degree related to our approach.

A. Symbolic Planning
Classical planning techniques originating from STRIPS [1] are the usual choice for decision-making for robotic execution. They use a symbolic, logic-based notation compatible with human language that permits an intuitive domain specification [15]. Contemporary planning techniques go a step further and handle real-world uncertainties using probabilistic approaches [16], [17], [18]. Despite the recent progress of such planning applied to robotics, these techniques are still subject to the symbolization problem mentioned before: all the relevant aspects for the successful execution of the robotic actions have to be considered in the planning problem definition using scenario-specific domain descriptions.
To reduce hand-crafting, learning methods have been designed for aiding the domain definitions [19], [20], [21], [22]. However, learning is not effort-free, as data sets of pairs of pre- and post-conditions are required. In the case of classical techniques, many constraints and much problem pre-structuring are needed [19]. In the case of deep learning approaches, latent space representations are most often used for obtaining "symbols". Experimentation is required to determine how many symbols (i.e., latent variables) one needs [22], while any human-understandable meaning of these symbols can only be hand-assigned post hoc. Thus, symbolic representation learning, though possible, requires considerable additional design effort. Generalization of the developed representations often requires additional machinery, where objects and effects need to be assigned to classes based on similarities in some feature space [20], [23], and the feature space is used for generalization afterwards. Thus, though promising, learning of planning operators remains relatively complex and is therefore not frequently used in practice.

B. Simulation
Physical simulation is another way of predicting future states, and simulation-based approaches for planning also exist. Fusion of simulated sensing and robot control in virtual environments is an important development leading to the application of such techniques in robotics [24]. Planning of actions based on simulations has been done in the realm of both classical [25], [26] and deep-learning [27] methods. To perform simulations, however, one needs robot and object models, as well as a full specification of the scenario. In industrial tasks, CAD models of parts and setups are usually available. However, this is usually not the case in everyday environments. In this work, we are not concerned with industrial, high-precision robotic actions, but target the everyday domain. There, most actions need only to be "fairly" accurate and, thus, one is not forced to simulate actions and their outcomes with the highest precision. Our method, thus, exploits mental simulation in the form of imagination of future scenes instead of physical simulation.

C. Sub-Symbolic Planning Using Neural Networks
Deep reinforcement learning approaches allow learning action selection strategies in complicated worlds. Here explicit symbolic representations are not required, as actions can be deduced from the learned value function given the current scene (e.g., see [4], and further citations below). Such models are then capable of predicting future states, either at some level of abstraction, e.g., hidden/latent variables [28], [5], or as complete images [29], [30]. Predicting future states helps training the models, as this way hypothetical future developments can be obtained. However, reinforcement learning requires probing very many consecutive states. Thus, such approaches have so far been mainly developed for computer games, where there are easy ways to register state-action sequences. When using imitation learning, which reduces data requirements, 3D simulated environments as well as real scenes can be addressed [31], [32]. Task and motion planning problems can be formulated and learned similarly to reinforcement learning approaches [33]. Stereotyped tasks can be attained in real-world experiments through long self-supervised experimentation by a robot [34], which can be unavailable or too expensive for developing concrete applications. Different from all that, our approach does not require action sequences or pre- and post-condition pairs. Conventional approaches suffice here for learning the entities we need: object detection, object completion, and affordance segmentation in the scene. These allow us to perform planning.

Fig. 1. Task definition. The table top has to be ordered by putting all objects in the given box. The target is to leave no objects outside the box. Note that the "target scene" here is presented only for illustration purposes, as all other configurations in which no object is left outside the box would be considered valid, too.

D. Affordance recognition
The term "affordance" originates from cognitive psychology [35]. The set of affordances can be briefly described as the set of momentarily possible interactions between an agent and its environment. In robotics this term very often takes the meaning of: "Which actions could a robot perform in a given situation (with some given objects)?" The goal of affordance segmentation is to assign probabilities for a set of affordances to every location in an image. A straightforward problem is to estimate affordances of whole objects [9], [36]. However, affordances can also be detected for multiple objects in the scene [37], [38]. Some works predict the affordances that result after an action has been executed, thereby aiding planning [39]. In contrast, here we obtain future affordances through imagination of future scenes; thus, pixel-wise affordance segmentation of scenes is sufficient for us.

E. Neuro-symbolic representations
Related to planning are visual reasoning tasks, like visual question answering (VQA) [40], [41], [42], which work by employing symbolic reasoning on images. These methods, similar to ours, include scene parsing modules; however, in addition, they rely heavily on NLP modules. We do not need NLP modules, as our aim is individual object manipulation, where object specificity beyond its affordances is not considered. Also related to our approach are video de-rendering tasks, where a latent representation is pushed towards an interpretable structure by including a graphics engine in the decoder [43]. Other elaborate mechanisms exist to obtain symbolically meaningful latent representations [13]. We, however, do not go in the direction of interpretable latent representations, but rely on explicitly modeling individual objects in the scene as instances with affordances. Finally, graph neural networks may be applied to planning tasks, where geometric and symbolic information of a scene is supplied to the algorithm [44]. Different from all the algorithms mentioned here, we avoid complicated network structures in order to avoid heavy demands on the amount of data required for training. We also avoid task-specific pre-structured architectures, so that applying the algorithm in a new situation is made easy.

III. OVERVIEW
We are solving the task of ordering a desktop, where the system is presented with an initial scene (see Figure 1) and the goal is to put the objects in the provided box, so that there are no objects remaining outside of the box. Thus, the algorithm is not provided with the target scene as such, but only with the condition that the table top outside the box has to be free. The box can be initially empty, as shown in the figure, or partially filled. The initial filling of the box may be incorrect, with objects that are too small occupying compartments needed for a bigger object. Furthermore, an initial scene with no objects outside of the box is also a valid scene, where we expect the answer from our algorithm to be that nothing needs to be done.
In Fig. 2, the general workflow of our system is visualized, which we will describe next (for more details, see next section). We take as input an initial scene. First, we perform object detection and pixel-level instance (object) segmentation. In addition to this, we also create an affordance map for the initial scene, which assigns pixel-level affordances to the scene. We then perform object completion (de-occlusion) using a Generative Adversarial Network (GAN). This allows us to split the whole scene into background and a set of complete individual objects. This is followed by pose estimation (not shown in the flow diagram).
Following that, we imaginarily execute actions (i.e., generate post-action images), where we can choose from pick & place, rotate, or flip vertically. After a post-action image has been generated, we perform a validity checking process. Quantification: We use a set of initial scenarios and create decision trees based on imagined scenes (see Fig. 4) and check the validity of the scenes. All valid image sequences then represent valid plans, where pre- and post-conditions are implicitly encoded by the images. This way, we can quantify whether or not such a system shows degradation along several planning steps, determining its actual usefulness for planning and execution for different scenarios and manipulation sequences. In Algorithm 1 we show in a formal way how a symbolic plan for a robot can be extracted from the image-based plan shown in Figure 4. For more details, see next section.

IV. IMPLEMENTATION DETAILS

A. Data set
The data to train and evaluate our proposed method is created from a real environment. We used a top-view camera positioned 100 cm above the center of the table and collected images with a resolution of 1024 × 768 pixels. Note that the usage of the top-view camera is not a restriction of this method. At the end of this study, we show that top views can be generated by inverse perspective mapping. Hence, similar to human imagination processes, where we usually also employ some canonical "internal view" onto the imagined scene, here the top view serves as the canonical perspective for our planning method.
The data set includes eleven different objects from seven classes: can, cup, plate, bowl, apple, box, and cuboid, with two cups, two cuboids, and three different boxes. We use the following procedure for data collection: (a) we randomly place the box and some objects on a table; (b) we apply random actions by hand, changing the position or orientation of one of the objects or the box, and take a picture after each action has been accomplished. We repeat (a) and (b) multiple times. This way we collected 1196 scenes. Each scene contains at least one object with a unique pose and position. Afterwards, the scenes were labeled with instance and affordance annotations. For these annotations, the seven aforementioned object categories and four different affordance categories (grasp, place-on, obstruct, and hole; for a description see Table I) were considered and extracted for all visible regions. It is important to note that our data set does not structure the collected images into pairs (image before the action, image after the action) and does not include all possible goal configurations.

B. Network implementation details
Many of the approaches combined here represent standard methods and will, thus, only be described briefly. Note that the neural networks for object detection, instance and affordance segmentation, as well as object completion are first trained separately using our dedicated data set. Afterwards, the results from the different networks are integrated to obtain the imagined planning tree; we provide details of that integration, too.
For object detection, considering the size of the dataset, we used EfficientDet-D0 [45] as the backbone network.
During training, we apply horizontal flipping, scale jittering, and random masking for data augmentation, and then resize the images to 512×512 pixels as input. We modified the output and added a pose classification head to predict whether the object is placed vertically or horizontally. The model is trained for 200 epochs with a total batch size of 4. We used the SGD optimizer with momentum 0.9 and reduced the learning rate, initially 0.001, by a factor of 0.1 when the total loss had stopped improving for 3 epochs. The other parameters are the same as in [45] and the original loss functions are utilized.
Example for pose mapping. We create a dictionary to store the horizontal and vertical pose of the blue cuboid. When we apply a flipping action on this object, we can look up the dictionary and retrieve the corresponding pose.
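The pose lookup described here can be illustrated with a minimal sketch. The key names, the string placeholders standing in for stored image layers, and the function `flip` are assumptions made for illustration and are not taken from the original implementation.

```python
# Hypothetical pose dictionary: object keys and layer names are illustrative
# placeholders (assumptions), standing in for stored object images.
POSE_DICT = {
    "blue_cuboid": {
        "horizontal": "blue_cuboid_horizontal_layer",
        "vertical": "blue_cuboid_vertical_layer",
    },
}


def flip(object_id, current_pose, pose_dict=POSE_DICT):
    """Return (new_pose, layer) for the pose opposite to current_pose."""
    new_pose = "vertical" if current_pose == "horizontal" else "horizontal"
    return new_pose, pose_dict[object_id][new_pose]
```

In such a design the flip action needs no image synthesis at all: it only swaps one stored appearance of the object for the other.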

D. Applying the action
The actions we apply are pick & place, rotate, and flip vertically. For pick & place to be performed, the object has to have the affordance grasp, and the place where the object is put has to have the affordance place-on or hole; however, the obstruct affordance must not be present for those pixels. In the first instance, we do not check whether the object fits correctly on the area of the place-on or hole affordance; this is solved in the next subsection, "Action validation". For the action rotate, the object has to have the affordance grasp. Rotation is performed in 15° steps. For the action flip vertically, the object has to have the affordance grasp, and the imagined action is performed by retrieving entries from the dictionary of horizontal vs. vertical poses, as described above. The result of an imagined action is a post-action image.
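The affordance preconditions stated above can be sketched as simple predicates. This is only an illustration under assumptions: the affordance map is encoded here as a dictionary from pixel coordinates to sets of labels, and the function names are hypothetical.

```python
ROTATION_STEP_DEG = 15  # rotation step stated in the text


def can_grasp(object_affordances):
    """Pick & place, rotate, and flip all require the 'grasp' affordance."""
    return "grasp" in object_affordances


def can_place(target_pixels, affordance_map):
    """Check the placing precondition for the pixels passed in by the caller.

    affordance_map: dict mapping (row, col) -> set of affordance labels
    (an assumed encoding). Every pixel must carry 'place-on' or 'hole'
    and must not carry 'obstruct'.
    """
    for pixel in target_pixels:
        labels = affordance_map.get(pixel, set())
        if "obstruct" in labels:
            return False
        if not labels & {"place-on", "hole"}:
            return False
    return True


def rotation_angles(step=ROTATION_STEP_DEG):
    """Candidate rotation angles in degrees, covering one full turn."""
    return list(range(0, 360, step))
```

Whether the whole object actually fits the target area is deliberately not decided here; in the text that question is left to the validation step.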
To obtain the post-action image we regard each object as a separate layer and then we use traditional image processing methods, such as cut-and-paste and rotation, to perform the movement and horizontal rotation of the object.We take the center of the object's bounding box as the origin when applying the action.For flipping objects, we need to replace the corresponding object layer with the flipped pose according to the dictionary.The object layers are afterwards overlaid on a background layer to get the resulting image showing the result of applying the action.
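The layer-based construction of a post-action image can be sketched as follows. This is a simplified illustration, assuming each object layer is stored as a sparse dictionary of pixel values and the background as a nested list; rotation is omitted, and the names `translate_layer` and `compose` are hypothetical.

```python
def translate_layer(layer, d_row, d_col):
    """Cut-and-paste: shift every pixel of an object layer by (d_row, d_col).

    layer: dict mapping (row, col) -> pixel value (assumed sparse encoding).
    """
    return {(r + d_row, c + d_col): v for (r, c), v in layer.items()}


def compose(background, layers):
    """Overlay object layers on a copy of the background, in the given order.

    Later layers overwrite earlier ones, so objects stacked on top of
    others should come last. Pixels outside the image are dropped.
    """
    image = [row[:] for row in background]
    height, width = len(image), len(image[0])
    for layer in layers:
        for (r, c), value in layer.items():
            if 0 <= r < height and 0 <= c < width:
                image[r][c] = value
    return image
```

Because the background and the completed objects are kept separate, moving an object does not leave a hole in the image at its previous position.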

E. Action validation
In the last step, having obtained images after action imagination, we check whether the action is valid or not.We require that the object is not placed in an area where the affordance is 'obstruct'.Thus, the checking process is based on the affordance map.For this, we define conflict areas as the intersection of the 'obstruct' affordance with the manipulated object.We count the intersection pixels, where we set the threshold to 30 pixels.If the conflict area is less than 30 pixels, then we assume that the action is correct.
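The validity check can be written compactly if masks are represented as sets of pixel coordinates. The sketch below makes that assumption, uses hypothetical function names, and applies the strict "less than 30 pixels" reading of the threshold.

```python
CONFLICT_THRESHOLD_PX = 30  # threshold stated in the text


def conflict_area(object_mask, obstruct_mask):
    """Number of pixels shared by the manipulated object and 'obstruct'.

    Both masks are sets of (row, col) tuples (an assumed encoding).
    """
    return len(object_mask & obstruct_mask)


def is_valid_action(object_mask, obstruct_mask, threshold=CONFLICT_THRESHOLD_PX):
    """Accept the imagined action if the conflict area is below the threshold."""
    return conflict_area(object_mask, obstruct_mask) < threshold
```

A non-zero threshold tolerates small segmentation inaccuracies at object borders, at the price of accepting placements with a slight overlap.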

F. Formation of a planning tree
For the planning tree we use a basic greedy search approach (depth-first search) to generate a valid plan. For the initial scene, we first randomly select an object with a picking affordance from the set of objects standing outside the box. Then we attempt to position the object on a randomly selected placing affordance. For that, we perform a series of imaginary actions, including rotation and flipping, and verify the image after each action until the object passes a validity check, which requires a conflict area of no more than 30 pixels, as described above. If success is not achieved by rotating the object in 15° steps, either flipped or non-flipped, we proceed to choose another object from the ones standing outside the box. If an object has been placed successfully, we advance to the next planning step based on the image generated in the first planning step. Affordance-supported stacking is also allowed here. We terminate the process when there are no more objects outside or no action exists that passes the validity check.
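The search loop described above can be outlined as a short sketch. It is a simplification under assumptions: the scene representation and the four callables (`objects_outside`, `candidate_actions`, `apply_action`, `is_valid`) are hypothetical placeholders for the perception, imagination, and validation components, and no backtracking beyond trying the next object is modeled.

```python
import random


def plan_greedy(scene, objects_outside, candidate_actions, apply_action,
                is_valid, rng=None):
    """Greedy, depth-first style plan search on imagined scenes (sketch).

    Returns (plan, final_scene, success). All callables are assumptions:
      objects_outside(scene)         -> iterable of objects outside the box
      candidate_actions(scene, obj)  -> iterable of imagined actions for obj
      apply_action(scene, action)    -> imagined post-action scene
      is_valid(scene, action)        -> True if the imagined scene is valid
    """
    rng = rng or random.Random(0)
    plan = []
    while True:
        outside = list(objects_outside(scene))
        if not outside:
            return plan, scene, True  # nothing left outside the box
        rng.shuffle(outside)  # random choice of the next object
        placed = False
        for obj in outside:
            for action in candidate_actions(scene, obj):
                imagined = apply_action(scene, action)
                if is_valid(imagined, action):
                    plan.append(action)
                    scene = imagined  # next step starts from the imagined image
                    placed = True
                    break
            if placed:
                break
        if not placed:
            return plan, scene, False  # no action passes the validity check
```

The essential point is that each accepted imagined scene becomes the input of the next planning step, so the plan is built entirely from imagined images.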

G. Parsing of symbolic entities
In Algorithm 1 we show an example of parsing the valid visual plan, represented as an image sequence in Figure 4, into symbolic planning entities with the parameters required for robotic execution. Note that the entities used in the parsing process (class labels, bounding boxes, and the manipulated object sequence) are directly obtained from the imagination process. Hence, for making a symbolic plan, it remains to collect those entities from the images and pass them to the corresponding robot action primitives. We provide the plan in an unwrapped form (instead of an algorithmic loop) to depict the full sequence of steps corresponding to the visual plan given in Figure 4. Note that this plan could be further processed (translated) into human-readable sentences (not shown here) or, as an alternative, one could use automatic, neural-network-based methods for image captioning (see, e.g., [49]) to arrive at a language description of the images. However, the latter is much more demanding than the former, because our system already provides many relevant entities and variables for sentence generation.
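To make the parsing idea concrete, the following sketch turns one imagined pick & place step into a symbolic record. It is not Algorithm 1 itself; the bounding-box format (x_min, y_min, x_max, y_max), the dictionary keys, and the function names are assumptions chosen for illustration.

```python
def bbox_center(bbox):
    """Center of a bounding box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = bbox
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def to_symbolic_step(label, bbox_before, bbox_after):
    """Build a symbolic pick & place step from the pre- and post-action images.

    label: class label of the manipulated object.
    bbox_before / bbox_after: its bounding box before and after the action.
    """
    return {
        "action": "pick_and_place",
        "object": label,
        "pick": bbox_center(bbox_before),
        "place": bbox_center(bbox_after),
    }
```

A full plan would then be the list of such records, one per image pair in the valid image sequence, which a robot controller could map onto its own action primitives.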

V. EXPERIMENTS & RESULTS
As defined above, our task is to organize a table top by packing objects into a box so that the table outside the box is empty. The box has differently sized partitions and, similarly, objects have different sizes and shapes.
First we will evaluate the different system components: object detection, instance segmentation, affordance segmentation, and object completion. Afterwards, we will evaluate the method as a whole, including an ablation analysis.

A. Evaluation of the system's components
We evaluated the deep learning models used in our process on the test data set; the results are shown in Table II. Note that for the object completion task, we need to fill in the occluded parts of objects, where we obtain small average L1 and L2 losses (0.0275 and 0.0028, respectively). As we are addressing the problem using a top view, completion mostly addresses object stacks; however, some small occlusions, which occur when objects stand close together and not directly under the camera, need this type of handling, too. Since our data set is relatively small and the difference in object appearance between the training and test sets is not significant, these deep learning models perform well in our assigned task (see Figure 6). Hence, these models build a solid foundation for the subsequent task planning.

B. Evaluation of the method
We verified our method using 5-fold cross-validation. Scenes in the data sets differ in the number and location of objects. Our target is to place as many objects from outside the box as possible into the appropriate compartments in the box. To save computational resources, a depth-first search is used to find the complete plans, which are then checked for validity. As many valid plans exist, it is costly to construct a ground-truth set for verification of validity (e.g., consider the need to account for all combinations of packing, including stacks of objects). Hence, we evaluated the obtained plans by eye.
Among the 240 test cases in each test set, there exist plans with zero up to seven packing steps. A "0 step" case corresponds to the situation where the box is fully packed and no planning steps are required. This is included to test the system's capability to also recognize such situations. Table III shows how many of these different cases were successful. The grand average success rate across all cases was 90.92%. As expected, the success rate deteriorates by 10-15% for longer planning sequences, as both imagination and planning errors accumulate.
In Table IV we analyse all cases in a step-wise manner, asking whether a plan step n has been successful or not. On average, 623 steps were performed across all 240 test cases in the plan-search process. The overall average success rate of one step is 97.03%. The success rate deteriorates step-wise, however, only by a couple of percent from step 1 to step 7. This demonstrates that the imagination process used in our study degrades the images only minimally.
To identify the reasons for the failed cases, we analyzed the cause of each failure. The failures can be attributed to wrong object detection or inaccurate affordance segmentation results, which account for 45.37% and 54.63% of the failure cases, respectively. Failures due to object completion cannot be evaluated directly; thus, an ablation study is made on that component, as described at the end of the section.
In Fig. 7, we show some successful and some failed plans. Three successful plans were able to complete our box packing task. The action sequences in those plans are described in the figure legend. In the failure cases, the red dashed box marks an invalid step in a plan. In failure case 1, a part of the cup is incorrectly identified as another cup, which is caused by an inaccurate object detection result. The same failure cause also occurs in failure case 2, where a part of the plate is identified as a can, which in turn leads to a wrong action. In failure case 3, there were objects that could be packed, but no action was found in the search. This is because none of the conflict areas calculated between the 'grasped' objects and all 'place-on' and 'hole' areas is smaller than the 30-pixel threshold value, which is caused by an inaccurate affordance segmentation result.

Fig. 7. Columns: initial scene, Step 1, Step 2, Step 3, ... In the failure cases, the red dashed box marks an invalid step in a plan. In failure case 1, a part of the blue cup is incorrectly identified as another cup. The same failure cause also occurs in failure case 2, where a part of the blue plate is identified as a can. In failure case 3, there were objects that could be packed, but no action was found in the search.

C. Comparison to baseline.
We performed a comparison to a baseline method in which we randomly chose objects and placed them on random place-affordance locations for as many steps as there were objects outside the box. For each test set the random placement was repeated four times to obtain more reliable averages. Results are shown in the lines "baseline" in Tables III and IV. Note that the baseline has a small advantage over our method, as it has information on how many objects there are outside the box. This leads to a deterministic 100% performance in case there are no objects outside the box; thus, we consider this case as "not applicable" in Table III. Otherwise, the baseline performs substantially worse than our method, which is especially visible for longer plans. Note that the total number of steps differs between our method and the baseline method, as the baseline method uses a simplified procedure for deciding how many steps are required.

D. Ablation study.
Here we investigate the utility of the different components. For the GAN-based component the results of the study are shown in Tables III and IV, lines "without object completion". In all cases the ablated version performs worse, and the effect becomes especially prominent in the last steps of the plan (see the last columns of Table IV). This is expected, as with more imagination steps the need to reproduce object appearance grows. We did not perform an ablation study for the other components of the method (e.g., object detection or affordance segmentation), as removal of those components disrupts operation of the framework completely.
As we cannot completely exclude object detection, instance segmentation, and affordance segmentation from the algorithm, we evaluated them differently. We evaluate the influence of those components on the final result by calculating the success measures of the components for successful and failed plans separately, see Table V. One can see that the mean average precision (mAP) for object detection is 5% smaller in failed cases, while the mean intersection over union (mIoU) in instance and affordance segmentation is also a couple of percent smaller in failure cases as compared to success cases. This shows that there is a relation between the success of the system components analysed here and the overall system performance.
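For readers less familiar with the segmentation measure used here, intersection over union can be computed as in the following sketch. Masks are assumed to be sets of pixel coordinates, the value returned for two empty masks is a convention chosen for this example, and averaging over mask pairs is only one of several common ways to define mIoU.

```python
def iou(pred_mask, true_mask):
    """Intersection over union of two masks given as sets of (row, col)."""
    union = pred_mask | true_mask
    if not union:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return len(pred_mask & true_mask) / len(union)


def mean_iou(mask_pairs):
    """Average IoU over (predicted, ground-truth) mask pairs."""
    scores = [iou(pred, true) for pred, true in mask_pairs]
    return sum(scores) / len(scores)
```

Computing such a measure separately for the scenes of successful and failed plans gives the kind of comparison reported in Table V.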

VI. DISCUSSION & OUTLOOK
We have presented a method for planning of packing and ordering tasks based on mental imagery, where a tree of imagined future scenes is created and the plan is represented as a sequence of images from such a tree. Unlike methods that predict entire future images in robot manipulation scenes end-to-end [12], our approach involves a scene parsing process, which brings the following advantages:
• Generative processes can be supported by comparatively small data sets;
• The parsed entities can be further used for definition of robotic actions.
While successful operation of generative processes was proven in our ablation analysis, actual robotic action specification based on the developed image sequences and robotic implementation will be addressed in future work.
The approach supports explanation of the obtained plans to a human in a hybrid manner: symbolically, by using the labels of the parsed entities (see Algorithm 1), and at the sub-symbolic level, by showing the human the pictures that were imagined by the system. For example, from these pictures it is easy to see what would go wrong along those planning tree branches that were not included in the valid plan.
The developed system generalizes to different distributions of objects in the initial scenes and can achieve goal states not explicitly provided in the training data. However, the objects need to be learned for instance and affordance segmentation as well as for generative object completion. The advantage of our method is that a relatively small data set suffices and, thus, can be labeled with a concrete application in mind. Furthermore, due to the modularity of this system, each component within the system can be readily replaced with newly emerging state-of-the-art techniques.
The current algorithm uses images obtained from a top view camera. This does not impose a restriction, because one can recreate top view images using inverse perspective mapping methods as long as the ground plane is known. Fig. 8 shows how to generate top views from different camera perspectives. Here we created a simulated scene and placed four cameras at fixed positions around the scene for data collection. We first used inverse perspective mapping (IPM) to remap the images from the four cameras into a preliminary orthographic projection based on the intrinsic and extrinsic camera parameters. Then we used a deep network (U-net) to further correct this distorted scene and to finally obtain a near-optimal top view image. We used 2000 images for training and 200 images for testing. As this is not the focus of this study, we directly used top view cameras instead to generate a canonical view for all our experiments, avoiding shape deformation, which might interfere with the planning process. However, if required, IPM pre-processing can be included into our algorithms without restrictions.
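The geometric core of IPM for a known ground plane can be sketched as follows. This is a minimal single-camera illustration under stated assumptions, not the pipeline used for Fig. 8: the ground plane is taken as Z = 0, the camera model is x_cam = R X_world + t with intrinsics K, the image is grayscale, sampling is nearest-neighbour, and the U-net correction step is omitted. All function names and the example parameters are hypothetical.

```python
import numpy as np


def ground_to_image_homography(K, R, t):
    """Homography mapping ground-plane points (X, Y, 1) with Z = 0 to pixels."""
    return K @ np.column_stack([R[:, 0], R[:, 1], t])


def project_ground_points(H, xy):
    """Project an (N, 2) array of ground coordinates to (N, 2) pixel coordinates."""
    pts = np.column_stack([xy, np.ones(len(xy))])
    uvw = pts @ H.T
    return uvw[:, :2] / uvw[:, 2:3]


def top_view(image, H, x_range, y_range, resolution):
    """Sample a grayscale image on a regular ground-plane grid (nearest neighbour).

    Grid cells that project outside the image are left at zero."""
    xs = np.arange(x_range[0], x_range[1], resolution)
    ys = np.arange(y_range[0], y_range[1], resolution)
    gx, gy = np.meshgrid(xs, ys)
    uv = project_ground_points(H, np.column_stack([gx.ravel(), gy.ravel()]))
    u = np.rint(uv[:, 0]).astype(int)
    v = np.rint(uv[:, 1]).astype(int)
    inside = (u >= 0) & (u < image.shape[1]) & (v >= 0) & (v < image.shape[0])
    out = np.zeros(gx.size, dtype=image.dtype)
    out[inside] = image[v[inside], u[inside]]
    return out.reshape(gx.shape)


# Assumed example: camera 2 m above the origin, looking straight down.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
R = np.diag([1.0, -1.0, -1.0])   # camera z-axis points towards the ground
t = np.array([0.0, 0.0, 2.0])    # t = -R @ camera_centre, centre at (0, 0, 2)
H = ground_to_image_homography(K, R, t)

image = np.tile(np.arange(100), (100, 1))  # synthetic image: pixel value = column index
warped = top_view(image, H, (-1.0, 1.0), (-1.0, 1.0), 0.5)
```

With several cameras, one such warp per camera would yield the preliminary projections that a learned correction step could then fuse and refine.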
Concerning the generative process introduced in our study, we performed it on an object-by-object basis and this way achieved high performance, where future frames do not substantially deteriorate over time. Though generating full images of future scenes is in principle possible, and was addressed by several studies, e.g. [13], [12], the obtained images are blurry (see Figure 7b in [13] and Figure 1 in [12]). In our own preliminary unpublished work, we also attempted full image generation and saw the same deficiencies. Given that one needs individual object information anyhow for making robotic plans, applying object-by-object treatment of scenes, as done in this study, is natural and reduces data requirements, while at the same time leading to satisfactory results.
Clearly, one cannot address very precise 3D fitting tasks for objects with complicated shapes with our approach, and more specialized methods are required for that [32]. For generative approaches, more advanced methods like diffusion models [50] can be used. In general, existing works considering image-based foresight are mostly specialized, e.g. pouring [51], pushing, lifting and dragging [34], addressing only block-worlds, or rope manipulation [52], closing a door and object pushing [12]. Here we show that for packing, stacking, and ordering tasks one can simplify this by performing planning directly by visual imagination without pre-/post-condition pairs for training, provided that only every-day accuracy is required. In addition, from a practical perspective it is important that only deep-learning-based image analysis knowledge is needed to implement our system, while domain description or reinforcement learning knowledge is not required.
Although our current work does not involve direct interaction with a real robot arm, for the implementation of the system on a robot, one can follow a similar approach as we did during data collection.A camera, capable of providing a top-down perspective, is required and it needs to be set up so that the robot does not occlude the scene when in its home position.The camera has to be synchronized with the robot so that it takes an image each time after the robot has accomplished an action and has returned to the home position.
We also believe that incorporating feedback loops on a robot could enhance the success rate of the plans. We can assess the consistency between the actual scene after robot execution and the imagined scene to determine whether plan updates are necessary. If inconsistencies arise, we can choose to regenerate the plan, thereby improving the success rate of the task. For example, errors such as the one in failure case 1 (Figure 7), where part of a cup was left behind in forward imagination, could be corrected this way and would then not influence the final result. Alternatively, we can also choose to regenerate the plan after each step, which allows for continuous updates of the overall plan. This approach essentially involves making predictions for each step individually, and our experimental analysis above suggests that planning only one step yields high success rates. This will be the focus of our future work.
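Such a feedback loop could be organized along the following lines. This is only a sketch of one possible design, since the consistency measure is left open here: the mean absolute pixel difference, the threshold, and the callables `execute_step`, `capture`, and `replan` are all assumptions introduced for illustration.

```python
import numpy as np


def scene_mismatch(actual, imagined):
    """Mean absolute pixel difference of two 8-bit images, scaled to [0, 1]."""
    a = actual.astype(np.float64)
    b = imagined.astype(np.float64)
    return float(np.mean(np.abs(a - b)) / 255.0)


def execute_with_feedback(plan_images, execute_step, capture, replan,
                          threshold=0.05, max_replans=3):
    """Execute a plan given as a list of imagined images.

    plan_images[0] is the current scene; each later image is the imagined
    outcome of one action. After every action, the captured scene is compared
    with the imagined one, and the plan is regenerated on inconsistency.
    Returns True if the plan was completed, False if replanning was exhausted."""
    replans = 0
    step = 1
    while step < len(plan_images):
        execute_step(plan_images[step])  # robot acts, then returns to home position
        actual = capture()               # top view image of the real scene
        if scene_mismatch(actual, plan_images[step]) <= threshold:
            step += 1
            continue
        replans += 1
        if replans > max_replans:
            return False
        plan_images = replan(actual)     # regenerate the plan from the real scene
        step = 1
    return True


# Simulated usage (assumption): the first execution attempt fails silently.
world = {"img": np.zeros((4, 4), dtype=np.uint8)}
goal = np.full((4, 4), 200, dtype=np.uint8)
calls = {"n": 0}


def fake_execute(target):
    calls["n"] += 1
    if calls["n"] > 1:                   # only later attempts change the world
        world["img"] = target.copy()


ok = execute_with_feedback([world["img"], goal], fake_execute,
                           lambda: world["img"], lambda actual: [actual, goal])
```

Setting `max_replans` high and letting `replan` return a one-step plan would correspond to the alternative of regenerating the plan after each step.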

Fig. 2. Flow diagram of our approach. Our system contains two main parts: scene understanding and action planning. For scene understanding we use three deep networks: a) Object detection, b) Affordance & Semantic segmentation, and c) Object completion. The details of the training and inference process can be seen in Fig. 3. Through the scene understanding part we can get the complete shape of the background and of each individual object, together with its affordance class. Then, we can apply actions such as move and rotate to the object and use the information obtained from the affordance map to check whether the action is valid or not. If it is valid, we can perform the next action.

Fig. 3. Training and inference of our model. In training: a) Object detection, b) Instance & Affordance segmentation, c) Object completion (de-occlusion). In the training phase, we train the three models individually and then combine the obtained results in the inference phase. Note that after finishing the object completion (c, above), we need to do affordance segmentation (b, above) again, to get the complete object corresponding to the affordance classes (see red arrows). Bbox = bounding box. Details are explained in subsection IV-B.

Fig. 4. Demonstration of a planning tree. Each column represents an action step, the branches represent possible actions, and each action is based on an imagined scene in which the previous action has been completed. The red dashed boxes mark the scenes indicating the valid planning sequence and are numbered consecutively (these numbers are used in Algorithm 1). Red circles indicate the objects on which the action is applied. The green pointer indicates where the object marked by the circle in the previous image has been placed.
Variables:
  class_label     # determined for each object in each image, including box compartments
  object_name     # of each individual object
  image_ref       # reference to images in the valid plan sequence: image_1 to image_n
  bounding_box    # parameters of the outer edges of the object in 3D
  rotation_angle  # between the same object in two consecutive images
Functions:
  Bbox(object_name, image_ref)          # read out a bounding box, given object name and image reference
  Grasp(class_label, bounding_box)      # for grasping an object, given its class label and bounding box
  Place_at(class_label, bounding_box)   # for placing an object at an object of class label, given its bounding box
  Rotate(object_name, rotation_angle)   # for rotating an object using the angle between two consecutive images
  Flip(object_name)                     # flip an object clockwise by 90°

Fig. 6. Object completion: qualitative examples. First row: image fragments from the test data set. Second row: mask of the obstructing object detected. Third row: completed object re-inserted into the scene.

TABLE II: The results for model components a, b, c (see Fig. 3, first blue box on top). mAP: mean average precision; mIoU: mean intersection over union; L1: the L1 norm. Values are given as mean (SD) in 5-fold cross-validation.

Fig. 7. Examples of three successful and three failed plans. The first column represents the initial scene; each following column represents an individual imagined action step. Circles emphasize the objects to which the action is applied. Failed steps are marked with red dashed boxes. Explanation for individual cases:
Success case 1: 1) Pick up red apple & place into yellow cup. 2) Pick up yellow cup (with red apple inside) & place into compartment 4 of the box. 3) Pick up the yellow cup (with the red apple inside) & place into black can. 4) Pick up black can (with yellow cup and red apple inside) & place into compartment 4 of the box.
Success case 2: 1) Pick up yellow cup and place into compartment 4. 2) Pick up black can, rotate 60 deg. and place into compartment 1 of the box. 3) Pick up blue bowl and place into compartment 2 of the box.
Success case 3: 1) Pick up red apple and place into yellow cup. 2) Pick up black can, rotate 60 deg. and place into compartment 1.
In the failure cases, the red dashed box marks an invalid step in a plan. In failure case 1, a part of the blue cup is incorrectly identified as another cup. The same failure cause also occurs in failure case 2, where a part of the blue plate is identified as a can. In failure case 3, there were objects that could be packed, but no action was found in the search.

Fig. 8. Example result of a simulated scene by applying inverse perspective mapping to create a top view from several side views.

TABLE I: Description of the set of affordances used.

TABLE III: Success rates for planning cases with different plan lengths using 5-fold cross-validation. The number of steps denotes the actual number of steps in the plans achieved by the different methods. The results are reported as mean (SD). The best result in each column is emphasized in bold.

TABLE IV: Success rates for the step-by-step analysis. The planning steps were performed for the testing dataset using 5-fold cross-validation. The results are reported as mean (SD). The best result in each column is emphasized in bold.