From Semantics to Execution: Integrating Action Planning With Reinforcement Learning for Robotic Causal Problem-Solving

Reinforcement learning is generally accepted to be an appropriate and successful method to learn robot control. Symbolic action planning is useful to resolve causal dependencies and to break a causally complex problem down into a sequence of simpler high-level actions. A problem with the integration of both approaches is that action planning is based on discrete high-level action- and state spaces, whereas reinforcement learning is usually driven by a continuous reward function. Recent advances in model-free reinforcement learning, specifically, universal value function approximators and hindsight experience replay, have focused on goal-independent methods based on sparse rewards that are only given at the end of a rollout, and only if the goal has been fully achieved. In this article, we build on these novel methods to facilitate the integration of action planning with model-free reinforcement learning. Specifically, the paper demonstrates how the reward-sparsity can serve as a bridge between the high-level and low-level state- and action spaces. As a result, we demonstrate that the integrated method is able to solve robotic tasks that involve non-trivial causal dependencies under noisy conditions, exploiting both data and knowledge.


INTRODUCTION
How can one realize robots that reason about complex physical object manipulation problems, and how can we integrate this reasoning with the noisy sensorimotor machinery that executes the required actions in a continuous low-level action space? To address these research questions, we consider reinforcement learning (RL) as it is a successful method to facilitate low-level robot control (Deisenroth and Rasmussen, 2011). It is well known that non-hierarchical reinforcement-learning architectures fail in situations involving non-trivial causal dependencies that require reasoning over an extended time horizon (Mnih et al., 2015). For example, the robot in Figure 1 (right) needs to first grasp the rake before it can be used to drag the block to a target location. Such a problem is hard to solve by RL-based low-level motion planning without any high-level method that subdivides the problem into smaller sub-tasks.
To this end, recent research has developed hierarchical and model-based reinforcement learning methods to tackle problems that require reasoning over a long time horizon, as demanded in domains like robotic tool use, block-stacking (Deisenroth and Rasmussen, 2011), and computer games (Aytar et al., 2018; Pohlen Both problems are currently being recognized by the state of the art in combined task and motion planning (e.g., Toussaint et al. (2018)), and, from a broader perspective, also in the field of state representation learning (e.g., Lesort et al. (2018); Doncieux et al. (2018)). However, to the best of our knowledge, there currently exist no satisfying and scalable solutions to these problems that have been demonstrated in the robotic application domain with continuous state and action representations (see Section 2.4). In this research, we address P.1 by providing a simple, yet principled, formalism to map the propositional high-level state space to continuous subgoals (Section 3.1). We address P.2 by integrating this formalism with goal-independent reinforcement learning based on sparse rewards (Section 3.2).
Existing approaches that integrate action planning with reinforcement learning have not been able to map subgoals to low-level motion trajectories for realistic continuous-space robotic applications (Grounds and Kudenko, 2005; Ma and Cameron, 2009) because they rely on a continuous dense reward signal that is proportional to manually defined metrics that estimate how well a problem has been solved (Ng et al., 1999). The manual definition of such metrics, also known as reward shaping, is a non-trivial problem itself because the semantic distance to a continuous goal is often not proportional to the metric distance.
Recently, so-called universal value function approximators (UVFAs) (Schaul et al., 2015) in combination with hindsight experience replay (HER) (Andrychowicz et al., 2017) and neural actor-critic reinforcement learning methods (Lillicrap et al., 2016) have been proposed to alleviate this issue. HER realizes an efficient off-policy algorithm that allows for non-continuous sparse rewards without relying on reward shaping. Specifically, HER treats action trajectories that do not achieve the desired specified goal as successful, by pretending in hindsight that the achieved state was the desired goal state. Our research builds on this method because the sparse subgoal-specific rewards allow us to decouple the reward mechanism from the high-level action planning.

Eppe et al.
From semantics to execution
Figure 1. A robot performing two object manipulation tasks. 1. Block-stacking (left): The gripper must stack three blocks at a random location within the robot's range on the table (indicated by the transparent goal markers behind the gripper). Herein, the robot needs to subdivide the course of actions into separate actions for grasping and placing the individual blocks. 2. Tool use (right): The red block is out of the gripper's range (indicated by the dark brown ellipsoid), so that solving the task of moving the block to a target location requires the robot to break the problem down into a sequence of high-level actions that involve grasping the rake, moving the rake towards the block and pulling the rake.
This approach enables us to address the following central hypotheses:
H.1 We hypothesize that model-free reinforcement learning with universal value function approximators (UVFAs) and hindsight experience replay (HER) is appropriate to learn the grounding of a discrete symbolic action space in continuous action trajectories. We measure the appropriateness by comparing the resulting hybrid discrete/continuous architecture with continuous hierarchical reinforcement learning (HRL). We consider our approach to be appropriate if it is more capable than HRL of learning to solve causal object-manipulation puzzles that involve tool use and causal chains of non-trivial length.
H.2 We hypothesize that the approach is robust enough to handle a realistic amount of perceptual noise. We consider the approach to be robust to noise if there is no significant performance drop when moderate noise, e.g., 1-2% of the observational range, is added to the agent's state representation.
We address these hypotheses by applying our method to three simulated robotic environments that are based on the OpenAI Gym framework. For these environments, we provide manually defined action planning domain descriptions and combine a planner with a model-free reinforcement learner to learn the grounding of high-level action descriptions in low-level trajectories.
Our research contribution is a principled method and proof-of-concept to ground high-level semantic actions in low-level sensorimotor motion trajectories and to integrate model-free reinforcement learning with symbolic action planning. The novelty of this research is to use UVFAs and HER to decouple the reward mechanism from the high-level propositional subgoal representations provided by the action planner: Instead of defining an individual reward function for each predicate, our approach allows for a single simple threshold-based sparse reward function that is the same for all predicates.
Our research goal is to provide a proof-of-concept and a baseline for the integration of action planning with reinforcement learning in continuous domains that involve complex causal dependencies.
The remainder of the article is organized as follows. In Section 2 we investigate the state of the art in task and motion planning, hierarchical learning and the integration of planning with learning. We identify the problem of grounding high-level actions in low-level trajectories as a critical issue for robots to solve causal puzzles. We present our method and the underlying background in Section 3. We describe the realization of our experiments in Section 4 and show the experimental results in Section 5 before we discuss and align our findings with the hypotheses in Section 6. We conclude in Section 7.

STATE OF THE ART
Our work is related to robotic task and motion planning, but it also addresses plan execution. Therefore, it is also related to hierarchical learning algorithms and the integration of learning with planning.

Combined task and motion planning
The field of combined task and motion planning (TAMP) investigates methods to integrate low-level motion planning with high-level task planning. The field aims at solving physical puzzles and problems that are too complex to solve with motion planning alone, often inspired by smart animal behavior (Toussaint et al., 2018). For example, crows are able to perform a sequence of high-level actions, using tools like sticks, hooks or strings, to solve a puzzle that eventually leads to a reward (Taylor et al., 2009). A set of related benchmark problems has recently been proposed by Lagriffoul et al. (2018). However, since TAMP focuses primarily on the planning aspects and not necessarily on the online action execution, the benchmark environments do not consider a physical action execution layer. Toussaint et al. (2018) formulate the TAMP problem as an inverted differentiable physics simulator. The authors consider the local optima of the possible physical interactions by extending mixed-integer programs (MIP) (Deits and Tedrake, 2014) to first-order logic. The authors define physical interactions as action primitives that are grounded in contact switches. The space of possible interactions is restricted to consider only those interactions that are useful for the specific problem to solve. These interactions are formulated based on a fixed set of predicates and action primitives in the domain of robotic tool use and object manipulation. However, the authors provide only a theoretical framework for planning, and they do not consider the physical execution of actions. Therefore, an empirical evaluation to measure the actual performance of their framework, considering also real-world issues like sensorimotor noise, is not possible.
Other TAMP approaches include the work by Alili et al. (2010) and de Silva et al. (2013) who both combine a hierarchical symbolic reasoner with a geometrical reasoner to plan human-robot handovers of objects. Both approaches consider only the planning, not the actual execution of the actions. The authors do not provide an empirical evaluation in a simulated or real environment. Srivastava et al. (2014) also consider action execution and address the problem of grounding high-level tasks in low-level motion trajectories by proposing a planner-independent interface layer for TAMP that builds on symbolic references to continuous values. Specifically, they propose to define symbolic actions and predicates such that they refer to certain objects and their poses. They leave it to the low-level motion planner to resolve the references. Their approach scales well on the planning level in very cluttered scenes, i.e., the authors demonstrate that the planning approach can solve problems with 40 objects. The authors also present a physical demonstrator using a PR2 robot, but they do not provide a principled empirical evaluation to measure the success of the action execution under realistic or simulated physical conditions. Wang et al. (2018) also consider action execution, though only in a simple 2D environment without realistic physics. Their focus is on solving long-horizon task planning problems that involve sequences of 10 or more action primitives. To this end, the authors present a method that learns the conditions and effects of high-level action operators in a kitchen environment.
A realistic model that also considers physical execution has been proposed by Leidner et al. (2018). The authors build on geometric models and a particle distribution model to plan goal-oriented wiping motions. Their architecture involves low-level and high-level inference and a physical robotic demonstrator. However, the authors build on geometric modeling and knowledge only, without providing a learning component. Noisy sensing is also not addressed in their work.

Hierarchical learning-based approaches
Most TAMP approaches consider the planning as an offline process given geometrical, algebraic or logical domain models. This, however, does not necessarily imply the consideration of action execution. The consideration of only the planning problem under idealized conditions is not appropriate in practical robotic applications that often suffer from sensorimotor noise. To this end, researchers have investigated hierarchical learning-based approaches that differ conceptually from our work because they build on data instead of domain knowledge to realize the high-level control framework.
For example, Levy et al. (2019) and Nachum et al. (2018) both consider ant-maze problems in a continuous state and action space. The challenge of these problems lies in coordinating the low-level walking behavior of the ant-like agent with high-level navigation. However, these approaches do not appreciate that different levels of the problem-solving process require different representational abstractions of states and actions. For example, in our approach, the planner operates on propositional state descriptions like "object 1 on top of object 2" and generates high-level conceptual actions like "move gripper to object". In those HRL approaches, the high-level state and action representations are within the same state and action space as the low-level representations. This leads to larger continuous problem spaces.
Other existing hierarchical learning-based approaches are limited to discrete action or state spaces on all hierarchical layers. For example, Kulkarni et al. (2016) present the h-DQN framework to integrate hierarchical action-value functions with goal-driven intrinsically motivated deep RL. Here, the bottom-level reward needs to be hand-crafted using prior knowledge of applications. Vezhnevets et al. (2017) introduce the FeUdal Networks (FuNs), a two-layer hierarchical agent. The authors train subgoal embeddings to achieve a significant performance in the context of Atari games with a discrete action space. Another example of an approach that builds on discrete actions is the option-critic architecture by Bacon et al. (2017). Their method extends gradient computations of intra-option policies and termination functions to enable learning options that maximize the expected return within the options framework proposed by Sutton et al. (1999). The authors apply and evaluate their framework in the Atari gaming domain.

Integrating learning and planning
There exist several robot deliberation approaches that exploit domain knowledge to deliberate robotic behavior and to perform reasoning (e.g., Eppe et al. (2013); Rockel et al. (2013)). The following examples from contemporary research extend the knowledge-based robotic control approach and combine it with reinforcement learning:
The Dyna Architecture (Sutton, 1991) and its derived methods, e.g., Dyna-Q (Sutton, 1990), queue-Dyna (Peng and Williams, 1993), RTP-Q (Zhao et al., 1999), aim to speed up the learning procedure of the agent by unifying reinforcement learning and incremental planning within a single framework. While the RL component aims to construct the action model as well as to improve the value function and policy directly through real experiences from environment interaction, the planning component updates the value function with simulated experiences collected from the action model. The authors show that instead of uniformly selecting experienced state-action pairs during the planning, it is much more efficient to focus on pairs leading to the goal state (or nearby states) because these cause larger changes in the value function. This is the main idea of the prioritized sweeping method (Moore and Atkeson, 1993) and derivatives (Andre et al., 1998). The methods based on Dyna and prioritized sweeping have neither been demonstrated to address sparse rewards nor do they consider mappings between discrete high-level actions and states and their low-level counterparts.
Ma and Cameron (2009) present the policy search planning method, in which they extend the policy search GPOMDP (Baxter and Bartlett, 2001) towards the multi-agent domain of robotic soccer. Herein, they map symbolic plans to policies using an expert knowledge database. The approach does not consider tool use or similar causally complex problems. A similar restriction pertains to the PLANQ-learning framework (Grounds and Kudenko, 2005): The authors combine a Q-learner with a high-level STRIPS planner (Fikes and Nilsson, 1972), where the symbolic planner shapes the reward function to guide the learners to the desired policy. First, the planner generates a sequence of operators to solve the problem from the problem description. Then, each of these operators is learned successively by the corresponding Q-learner. This discrete learning approach, however, has not been demonstrated to be applicable beyond toy problems, such as the grid world domain that the authors utilize for demonstrations in their paper.
Yamamoto et al. (2018) propose a hierarchical architecture that uses a high-level abduction-based planner to generate subgoals for the low-level on-policy reinforcement learning component, which employs the proximal policy optimization (PPO) algorithm (Schulman et al., 2017). This approach requires the introduction of an additional task-specific evaluation function, alongside the basic evaluation function of the abduction model, to allow the planner to provide the learner with intrinsic rewards, similar to Kulkarni et al. (2016). The evaluation is only conducted in a grid-based virtual world where an agent has to pick up materials, craft objects and reach a goal position.
A very interesting approach has been presented by Ugur and Piater (2015). The authors integrate learning with planning, but instead of manually defining the planning domain description they learn it from observations. To this end, they apply clustering mechanisms to categorize object affordances and high-level effects of actions, which are immediately employed in a planning domain description. The authors demonstrate their work on a physical robot. In contrast to our work, however, the authors focus mostly on the high-level inference and not on the robustness that low-level reinforcement-based architectures provide.

Summary of the state of the art
The main weaknesses of the related state of the art that we address in this research are the following: TAMP approaches (Section 2.1) mainly focus on the planning aspect. They either do not consider the physical execution of the planned actions, or they evaluate the plan execution only with a manually defined mapping between high-level (symbolic) and low-level (continuous-valued) representations. These approaches require domain knowledge and models of the robot and the environment to specify and execute the task and motion plans, which may suffer from noisy sensing conditions. In contrast, hierarchical learning-based approaches (Section 2.2) propose to learn both the high-level and the low-level control from data, but they mostly focus on solving problems with discrete action spaces, and they require internal hand-crafted reward functions. Methods with continuous action spaces like (Levy et al., 2019; Nachum et al., 2019) only consider setups without representational abstractions between the different hierarchical layers. Mixed approaches (Section 2.3) that integrate learning and planning have similar disadvantages as the two other groups, in particular the lack of principled approaches to realize the mapping between discrete and continuous spaces, the manual shaping of reward functions, and the lack of approaches that have been demonstrated and applied in a continuous-space realistic environment.

INTEGRATING REINFORCEMENT LEARNING WITH ACTION PLANNING
To tackle the problems P.1 and P.2, and to address the research goal of grounding high-level actions in low-level control trajectories, we propose the architecture depicted in Figure 2. The novelty of the architecture with respect to the state of the art is its ability to learn to achieve subgoals that are provided in an abstract symbolic high-level representation by using a single universal sparse reward function that is appropriate for all discrete high-level goal definitions. This involves i) the grounding of the high-level representations to low-level subgoals (Section 3.1, Algorithm 1), ii) the formalization of the abstraction of the low-level space to the high-level space (Section 3.1, Equation 4), iii) a UVFA- and HER-based reinforcement learner to achieve the low-level subgoals (Section 3.2), and iv) the integration of an action planner with the reinforcement learning using the abstraction and grounding mechanisms (Section 3.3).
The resulting architecture is able to acquire a large repertoire of skills, similar to a multiple-goal exploration process (Forestier et al., 2017). However, instead of sampling and evaluating (sub-)goals through intrinsic rewards, the subgoals within our architecture are generated by the planner.

Abstraction and grounding of states and goals
Our abstraction and grounding mechanism tackles the research problem P.1, i.e., the mapping from high-level actions to low-level subgoals. STRIPS-based action descriptions are defined in terms of state changes based on predicates for pre- and postconditions. Since the state change is determined by the postcondition predicates, and not by the actions themselves, it is more succinct to define subgoals in terms of postcondition predicates because multiple actions may involve the same postconditions. Therefore, we define a grounding function for subgoals $f_{subg}$. The function is based on predicates instead of actions to avoid redundancy and to minimize the hand-engineering of domain models and background knowledge.
To abstract from low-level perception to high-level world states, we define abstraction functions $f^S$, $f^G$. These functions do not require any additional background knowledge because they fully exploit the definition of $f_{subg}$. In our formalization of the abstraction and grounding, we consider the following conventions and assumptions C.1-C.7:
C.1 The low-level environment state space is fully observable, but observations may be noisy. We represent low-level environment states with finite-dimensional vectors $s$. To abstract away from visual preprocessing issues and to focus on the main research questions, we adapt the state representations commonly used in deep reinforcement learning literature (Levy et al., 2019; Andrychowicz et al., 2017), i.e., states are constituted by the locations, velocities, and rotations of objects (including the robot itself) in the environment.
C.2 The low-level action space is determined by continuous finite-dimensional vectors $u$. For example, the robotic manipulation experiments described in this paper consider a four-dimensional action space that consists of the normalized relative three-dimensional movement of the robot's gripper in Cartesian coordinates plus a scalar value to represent the opening angle of the gripper's fingers.
C.3 Each predicate of the planning domain description determines a Boolean property of one object in the environment. The set of all predicates is denoted as $P$. The high-level world state $S$ is defined as the conjunction of all positive or negated predicates.
C.4 The environment configuration is fully determined by a set of objects whose properties can be described by the set of high-level predicates $P$. Each predicate $p \in P$ can be mapped to a continuous finite-dimensional low-level substate vector $s_p$. For example, the location property and the velocity property of an object in Cartesian space are both fully determined by a three-dimensional continuous vector.
C.5 For each predicate $p$, there exists a sequence of indices $p_{idx}$ that determines the indices of the low-level environment state vector $s$ that determine the property described by $p$. For example, given that $p$ refers to the object being at a specific location, and given that the first three values of $s$ determine the Cartesian location of an object, we have that $p_{idx} = [0, 1, 2]$. A requirement is that the indices of the individual predicates must not overlap, i.e., abusing set notation: $p^{1}_{idx} \cap p^{2}_{idx} = \emptyset$ (see Algorithm 1 for details).
C.6 The high-level action space consists of a set of grounded STRIPS operators (Fikes and Nilsson, 1972) $a$ that are determined by a conjunction of precondition literals and a conjunction of effect literals.
C.7 A low-level goal $g$ is the subset of the low-level state $s$, indicated by the indices $g_{idx}$, i.e., $g = s[g_{idx}]$.
For example, consider that the low-level state $s$ is a six-dimensional vector where the first three elements represent the location and the last three elements represent the velocity of an object in Cartesian space. Then, given that $g_{idx} = [0, 1, 2]$, we have that $g = s[g_{idx}]$ refers to the location of the object.
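The index conventions C.5 and C.7 can be illustrated with a short Python sketch. The helper names `select` and `indices_are_disjoint`, the predicate labels, and the numeric state values are assumptions made for this illustration and do not appear in the original formalization.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def select(vector: Sequence[float], indices: Sequence[int]) -> List[float]:
    """Return the sub-vector vector[indices], e.g. g = s[g_idx] (C.7)."""
    return [vector[i] for i in indices]


def indices_are_disjoint(predicate_indices: Dict[str, List[int]]) -> bool:
    """Check the C.5 requirement that no two predicates share a state index."""
    for idx_a, idx_b in combinations(predicate_indices.values(), 2):
        if set(idx_a) & set(idx_b):
            return False
    return True


# Six-dimensional example state: location (first three) and velocity (last three).
s = [0.5, 0.2, 0.4, 0.0, 0.0, 0.1]
g_idx = [0, 1, 2]
g = select(s, g_idx)  # the location part of the state
```

Checking disjointness once, when the domain is defined, is a cheap way to catch index mistakes before any training starts.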

Mapping predicates to low-level subgoals
Abstracting from observations to high-level predicate specifications is achieved by mapping the low-level state space to a high-level conceptual space. This is realized with a set of functions $f^p_{subg}$ that we define manually for each predicate $p$. For a given predicate $p$, the function $f^p_{subg}$ generates the low-level substate $s_p$ that determines $p$, based on the current state $s$ and the goal $g$:
$$f^p_{subg}(s, g) = s_p \qquad (1)$$
To illustrate how $f^p_{subg}$ can be implemented, consider the following two examples from a block-stacking task:
1. Consider a predicate (at_target o1) which indicates whether an object is at a given goal location on the surface of a table. Then the respective function $f^{(\mathrm{at\_target}\ o1)}_{subg}(s, g)$ can be implemented as follows:
$$f^{(\mathrm{at\_target}\ o1)}_{subg}(s, g) = [x, y, z], \text{ where } x, y, z \text{ are the target coordinates of } o1 \text{ contained in } g \qquad (2)$$
In this case, the function extracts the respective target coordinates for the object o1 directly from $g$ and does not require any information from $s$.
2. Consider further a predicate (on o2 o1), which is true if an object o2 is placed on top of another object o1. Given that the Cartesian location of o1 is defined by the first three elements of the state vector $s$, one can define the following subgoal function:
$$f^{(\mathrm{on}\ o2\ o1)}_{subg}(s, g) = [s[0], s[1], s[2] + h_{obj}], \text{ where } h_{obj} \text{ denotes the height of an object} \qquad (3)$$
Here, the target coordinates of the object o2 are computed by considering the current coordinates of o1, i.e., the first three values of $s$, and by adding a constant for the object height to the third (vertical axis) value.
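The two predicate-specific subgoal functions can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes that the target coordinates of o1 occupy the first three entries of g, and the object height `H_OBJ` is a hypothetical value chosen for the example.

```python
from typing import List, Sequence

H_OBJ = 0.05  # hypothetical object height; the source gives no value


def f_subg_at_target_o1(s: Sequence[float], g: Sequence[float]) -> List[float]:
    """Subgoal for (at_target o1): read the target coordinates from g.

    Assumption: the target of o1 is stored in g[0:3]. The state s is unused.
    """
    return [g[0], g[1], g[2]]


def f_subg_on_o2_o1(s: Sequence[float], g: Sequence[float]) -> List[float]:
    """Subgoal for (on o2 o1): place o2 one object height above o1.

    Assumes the location of o1 is stored in s[0:3]. The goal g is unused.
    """
    return [s[0], s[1], s[2] + H_OBJ]


# Example: o1 currently at (1.0, 2.0, 0.4), its target at (1.5, 2.5, 0.4).
state = [1.0, 2.0, 0.4, 0.0, 0.0, 0.0]
goal = [1.5, 2.5, 0.4]
target_o1 = f_subg_at_target_o1(state, goal)
target_o2 = f_subg_on_o2_o1(state, goal)
```

Both functions share the signature `(s, g)` so that a grounding procedure can call them uniformly, even though each one ignores one of its arguments.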

Grounding high-level representations in low-level subgoals
The function $f_{subg}$ that maps the high-level subgoal state $G_{sub}$ in the context of $s$, $g$ to a low-level subgoal $g_{sub}$ builds on Eq. (1), as described in Algorithm 1.
Algorithm 1 (fragment): subgoal_changed ← ($s_{last\_subg} \neq s_{subg}$)
The while loop is necessary to prevent the situation where the application of $f^p_{subg}$ (in line 7, Algorithm 1) changes $s_{subg}$ in a manner that affects a previous predicate subgoal function. For example, consider the two predicates (at_target o1) and (on o2 o1). The predicate (at_target o1) determines the Cartesian location of o1, and (on o2 o1) depends on these coordinates. Therefore, it may happen that $f^{(\mathrm{at\_target}\ o1)}_{subg} = [x, y, z]$ causes the first three elements of $s_{subg}$ to be $x, y, z$. However, $f^{(\mathrm{on}\ o2\ o1)}_{subg} = [x', y', z']$ depends on these $x, y, z$ to determine the $x', y', z'$ that encode the location of o2 in $s_{subg}$. The while loop assures that $f^{(\mathrm{on}\ o2\ o1)}_{subg}$ is applied at least once after $f^{(\mathrm{at\_target}\ o1)}_{subg}$ to consider this dependency. This assures that all dependencies between the elements of $s_{subg}$ are resolved.
This is a provisional file, not the final typeset article
To guarantee the termination of Algorithm 1, i.e., to avoid that the alternating changes of $s_{subg}$ cause an infinite loop, the indices $p_{idx}$ must be constrained in such a way that they do not overlap (see assumption C.5).
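The fixed-point loop described above can be sketched in Python. This is a reconstruction from the prose, not the original Algorithm 1: the function name `ground_subgoal`, the representation of predicates as `(p_idx, f_p_subg)` pairs, the final selection with `g_idx`, and the iteration guard `max_iters` are all assumptions.

```python
from typing import Callable, List, Sequence, Tuple

SubgoalFn = Callable[[Sequence[float], Sequence[float]], List[float]]
H_OBJ = 0.05  # hypothetical object height


def ground_subgoal(
    predicates: Sequence[Tuple[List[int], SubgoalFn]],
    s: Sequence[float],
    g: Sequence[float],
    g_idx: Sequence[int],
    max_iters: int = 100,
) -> List[float]:
    """Apply every predicate's subgoal function until s_subg stops changing."""
    s_subg = list(s)
    subgoal_changed = True
    iterations = 0
    while subgoal_changed:
        iterations += 1
        if iterations > max_iters:  # guard added for this sketch only
            raise RuntimeError("subgoal grounding did not converge")
        s_last_subg = list(s_subg)
        for p_idx, f_p_subg in predicates:
            # Write the predicate's substate into its own slice of s_subg.
            for i, value in zip(p_idx, f_p_subg(s_subg, g)):
                s_subg[i] = value
        subgoal_changed = s_last_subg != s_subg
    return [s_subg[i] for i in g_idx]


# o1 occupies indices 0-2 and o2 indices 3-5 of the state (assumed layout).
on_o2_o1 = ([3, 4, 5], lambda s, g: [s[0], s[1], s[2] + H_OBJ])
at_target_o1 = ([0, 1, 2], lambda s, g: [g[0], g[1], g[2]])

state = [0.0, 0.0, 0.4, 0.5, 0.5, 0.4]
goal = [1.0, 1.0, 0.4]
# (on o2 o1) is listed first on purpose: the loop still resolves the dependency.
g_sub = ground_subgoal([on_o2_o1, at_target_o1], state, goal, list(range(6)))
```

In this example the first pass places o2 above the old location of o1, and the second pass corrects it once o1 has been moved to its target, which is the dependency the while loop exists to resolve.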

Abstracting from low-level state-and goal representations to propositional statements
To realize the abstraction from low-level to high-level representations, we define a set of functions in the form of Equation 4. Specifically, for each predicate $p$, we define the following function $f^p$ that maps the current low-level state and the low-level goal to the predicate's truth value:
$$f^p(s, g) = \left( \lVert s[p_{idx}] - f^p_{subg}(s, g) \rVert_2 < \epsilon \right) \qquad (4)$$
Equation 4 examines whether the subgoal that corresponds to a specific predicate is true, given the current observed state $s$ and a threshold value for the coordinates $\epsilon$. In this article, but without any loss of generality, we assume that each predicate is determined by three coordinates. Equation 4 computes the difference between these coordinates given the current state, and the coordinates determined by $f^p_{subg}$, as their Euclidean distance. For example, $f^{(\mathrm{at\_target}\ o1)}_{subg}$ may define target coordinates for o1, and Equation 4 considers the distance between the target coordinates and the current coordinates.
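The threshold test of Equation 4 can be sketched in Python as follows. The function name `predicate_holds`, the default threshold of 0.05, and the example coordinates are assumptions for illustration.

```python
import math
from typing import Callable, List, Sequence

SubgoalFn = Callable[[Sequence[float], Sequence[float]], List[float]]


def predicate_holds(
    s: Sequence[float],
    g: Sequence[float],
    p_idx: Sequence[int],
    f_p_subg: SubgoalFn,
    epsilon: float = 0.05,  # hypothetical threshold value
) -> bool:
    """Return True if the substate s[p_idx] is within epsilon of its subgoal."""
    current = [s[i] for i in p_idx]
    target = f_p_subg(s, g)
    return math.dist(current, target) < epsilon


def at_target_o1(s: Sequence[float], g: Sequence[float]) -> List[float]:
    # Assumes the target coordinates of o1 are the first three entries of g.
    return [g[0], g[1], g[2]]


goal = [1.0, 1.0, 0.4]
near = predicate_holds([1.0, 1.01, 0.4], goal, [0, 1, 2], at_target_o1)
far = predicate_holds([0.0, 0.0, 0.4], goal, [0, 1, 2], at_target_o1)
```

Because the test reuses the same subgoal function that is used for grounding, no separate background knowledge is needed for the abstraction step.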

Generation of adaptive low-level action trajectories
To address the research problem P.2, i.e., the grounding of actions and subgoals in low-level action trajectories, we consider continuous goal-independent reinforcement learning approaches (Lillicrap et al., 2016; Schaul et al., 2015). Most reinforcement learning approaches build on manually defined reward functions based on a metric that is specific to a single global goal, such as the body height, posture and forward speed of a robot that learns to walk (Schulman et al., 2015). Goal-independent reinforcement learning settings do not require such reward shaping (cf. Ng et al., 1999), as they allow one to parameterize learned policies and value functions with goals. We employ the actor-critic deep deterministic policy gradient (DDPG) (Lillicrap et al., 2016) approach in combination with hindsight experience replay (HER) (Andrychowicz et al., 2017) to realize the goal-independent RL for the continuous control part in our framework. Using the HER technique with off-policy reinforcement learning algorithms (like DDPG) increases the efficiency of sampling for our approach since HER stores not only the experienced episode with the original goal (episode ← $(s_t, u_t, s_{t+1}, g)$) in the replay buffer. It also stores modified versions of an episode where the goal is retrospectively set to a state that has been achieved during the episode, i.e., episode ← $(s_t, u_t, s_{t+1}, g')$ with $g' = s_{t'}[g_{idx}]$ for some $t' > t$.
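The hindsight relabeling step can be sketched in Python. This is a simplified illustration of the idea, not the HER reference implementation: it creates exactly one relabeled copy per transition with a uniformly sampled future time step, and the function name and tuple layout are assumptions.

```python
import random
from typing import List, Sequence, Tuple

# (s_t, u_t, s_{t+1}, g) as described in the text
Transition = Tuple[List[float], List[float], List[float], List[float]]


def relabel_with_future_goals(
    episode: Sequence[Transition],
    g_idx: Sequence[int],
    rng: random.Random,
) -> List[Transition]:
    """Replace g by g' = s_{t'}[g_idx] for a randomly chosen t' > t."""
    # All visited states s_0 ... s_T, including the final successor state.
    states = [tr[0] for tr in episode] + [episode[-1][2]]
    relabeled = []
    for t, (s_t, u_t, s_next, _original_goal) in enumerate(episode):
        t_prime = rng.randint(t + 1, len(states) - 1)
        g_prime = [states[t_prime][i] for i in g_idx]
        relabeled.append((s_t, u_t, s_next, g_prime))
    return relabeled


# Toy episode: a one-dimensional position that advances by one per step.
toy_episode = [([float(t), 0.0], [0.1], [float(t + 1), 0.0], [9.0]) for t in range(4)]
hindsight = relabel_with_future_goals(toy_episode, [0], random.Random(0))
replay_buffer = list(toy_episode) + hindsight
```

Every relabeled transition is guaranteed to have a goal that was actually reached later in the same episode, which is what provides a learning signal under sparse rewards.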
To realize this actor-critic architecture, we provide one fully connected neural network for the actor $\pi$, determined by the parameters $\theta^\pi$, and one fully connected neural network for the critic $Q$, determined by the parameters $\theta^Q$. The input to both networks is the concatenation of the low-level state $s$ and the low-level subgoal $g_{sub}$. The optimization criterion for the actor $\pi$ is to maximize the q value provided by the critic. The optimization criterion for the critic is to minimize the mean squared error between the critic's output q and the discounted expected reward according to the Bellman equation for deterministic policies, as described by Equation 5.
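For orientation, the critic target in standard goal-conditioned DDPG is commonly written as below. This is the textbook form expressed in the notation of this section, stated as an assumption; the exact formulation of Equation 5 in the original may differ, for example in the use of target networks or value clipping.

```latex
% Bellman target for a deterministic policy \pi with goal-conditioned inputs
q_t = r_t + \gamma \, Q\big(s_{t+1},\, \pi(s_{t+1}, g_{sub} \mid \theta^{\pi}),\, g_{sub} \mid \theta^{Q}\big)

% Critic loss: mean squared error between the critic output and the target
L(\theta^{Q}) = \mathbb{E}\Big[\big(Q(s_t, u_t, g_{sub} \mid \theta^{Q}) - q_t\big)^2\Big]
```

With rewards of 0 and -1, all q values are non-positive, so the actor improves by driving the critic's estimate towards zero.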
Given that the action space is continuous $n$-dimensional, the observation space is continuous $m$-dimensional, and the goal space is continuous $k$-dimensional with $k \leq m$, the following holds for our theoretical framework: At each step $t$, the agent executes an action $u_t \in \mathbb{R}^n$ given a state $s_t \in \mathbb{R}^m$ and a goal $g \in \mathbb{R}^k$, according to a behavioral policy, a noisy version of the target policy $\pi$ that deterministically maps the observation and goal to the action. The action generates a reward $r_t = 0$ if the goal is achieved at time $t$. Otherwise, the reward is $r_t = -1$. To decide whether a goal has been achieved, a function $f(s_t)$ is defined that maps the observation space to the goal space, and the goal is considered to be achieved if $|f(s_t) - g| < \epsilon$ for a small threshold $\epsilon$. This sharp distinction of whether or not a goal has been achieved based on a distance threshold causes the reward to be sparse and renders shaping the reward with a hand-coded reward function unnecessary.
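The threshold-based sparse reward can be sketched in a few lines of Python. The sketch assumes that $f(s_t)$ is the index selection $s_t[g_{idx}]$ from convention C.7 and that the distance is Euclidean; the function name and the threshold value are hypothetical.

```python
import math
from typing import Sequence


def sparse_reward(
    s_t: Sequence[float],
    g: Sequence[float],
    g_idx: Sequence[int],
    epsilon: float = 0.05,  # hypothetical threshold value
) -> float:
    """Return 0 if the goal is achieved within epsilon, and -1 otherwise."""
    achieved = [s_t[i] for i in g_idx]  # f(s_t): project the state to goal space
    return 0.0 if math.dist(achieved, g) < epsilon else -1.0


goal = [1.0, 1.0, 0.4]
reward_hit = sparse_reward([1.0, 1.0, 0.41, 0.0, 0.0, 0.0], goal, [0, 1, 2])
reward_miss = sparse_reward([0.2, 1.0, 0.40, 0.0, 0.0, 0.0], goal, [0, 1, 2])
```

The same function serves every subgoal, because only the goal vector changes between predicates, not the reward logic.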

Integration of high-level planning with reinforcement learning
Our architecture integrates the high-level planner and the low-level reinforcement learner as depicted in Figure 2. The input to our framework is a low-level goal g. The sensor data that represents the environment state s is abstracted together with g to a propositional high-level description of the world state S and goal G. An action planner based on the planning domain definition language (PDDL) (McDermott et al., 1998) takes these high-level representations as input and computes a high-level plan based on manually defined action definitions (cf. the Appendix in Section 7 for examples). We have implemented the caching of plans to accelerate the runtime performance. The high-level subgoal state $G_{sub}$ is the expected successor state of the current state given that the first action of the plan is executed. This successor state is used as a basis to compute the next subgoal. To this end, $G_{sub}$ is processed by the subgoal grounding function $f_{subg}$ (Algorithm 1) that generates a subgoal $g_{sub}$ as input to the low-level reinforcement learner.
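The caching of plans can be sketched as a thin wrapper around the planner. The sketch assumes that the high-level state S and goal G are sets of ground predicates and that the planner is a callable returning a list of high-level actions; the names `make_cached_planner` and `toy_planner` and the predicate and action strings are placeholders, not part of our PDDL domain.

```python
def make_cached_planner(planner):
    """Wrap planner(S, G) -> plan so that repeated queries are not re-planned."""
    cache = {}

    def cached(S, G):
        # Sets of predicates are made hashable so they can serve as cache keys.
        key = (frozenset(S), frozenset(G))
        if key not in cache:
            cache[key] = planner(S, G)
        return cache[key]

    return cached


def toy_planner(S, G):
    """Placeholder for a call to a PDDL planner (hypothetical output)."""
    return ["action_a", "action_b"]


plan = make_cached_planner(toy_planner)
first_action = plan({"pred_x"}, {"pred_y"})[0]  # basis for the next subgoal
```

Since the high-level state only changes when a subgoal is achieved, most planner queries during a rollout hit the cache.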

EXPERIMENTS
This section describes three experiments designed for the evaluation of the proposed approach. We refer to the experiments as block-stacking, tool use, and ant navigation. The first two experiments are conducted with a Fetch robot arm, and the third is adapted from research on continuous reinforcement learning for legged locomotion (Levy et al., 2019). All experiments are executed in the MuJoCo simulation environment (Todorov et al., 2012). For all experiments, we use three-layer fully connected neural networks with the rectified linear unit (ReLU) activation function to represent the actor and the critic of the reinforcement learner. We choose a learning rate of 0.01, and the networks' weights are updated using a parallelized version of the Adam optimizer (Kingma and Ba, 2015). We use a reward discount of γ = 1 − 1/T, where T is the number of low-level actions per rollout. For the block-stacking, we use 50, 100 and 150 low-level actions for the case of one, two and three blocks, respectively. For the tool use experiment, we use 100 low-level actions, and for the ant navigation, we use 900 low-level actions.
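As a quick check of the discount rule γ = 1 − 1/T, the following sketch evaluates it for the rollout lengths stated above. The function name `discount` and the dictionary keys are illustrative.

```python
def discount(T):
    """Reward discount for a rollout of T low-level actions."""
    return 1.0 - 1.0 / T


horizons = {
    "block-stacking, one block": 50,
    "block-stacking, two blocks": 100,
    "block-stacking, three blocks": 150,
    "tool use": 100,
    "ant navigation": 900,
}
for name, T in horizons.items():
    # gamma ** T shows how strongly a reward at the end of a rollout is discounted.
    print(f"{name}: T={T}, gamma={discount(T):.4f}, gamma^T={discount(T) ** T:.3f}")
```

With this choice, a reward at the end of a rollout is discounted by roughly 1/e regardless of T, so the effective planning horizon scales with the rollout length.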

Preliminary hyperparameter optimization experiments showed that the optimal number of units for each layer of the actor and critic networks depends on the observation space. Therefore, we implement the network architecture such that the number of units in each layer scales with the size of the observation vector. Specifically, the layers in the actor and critic consist of 12 units per element in the observation vector. For example, for the block-stacking experiment with two blocks, this results in 336 neural units per layer (see Section 4.1). We apply the same training strategy as HER (Andrychowicz et al., 2017) and periodically evaluate the learned policies during training without action noise. We use a fixed maximum number of epochs and early stopping at a success rate between 80% and 95%, depending on the task.
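The scaling rule can be checked with a short sketch. It uses the state vector size of 28 that results for two blocks according to Section 4.1; the function name `units_per_layer` is illustrative.

```python
UNITS_PER_OBSERVATION_ELEMENT = 12


def units_per_layer(observation_size):
    """Layer width that scales linearly with the observation vector size."""
    return UNITS_PER_OBSERVATION_ELEMENT * observation_size


# Two blocks: the state vector has 4 + 2 * 12 = 28 elements.
print(units_per_layer(28))  # 336
```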
In all experiments, we evaluate the robustness of our approach to perceptual noise. That is, in the following we refer to perceptual noise, and not to the action noise applied during the exploration phase of the RL agent, if not explicitly stated otherwise. To evaluate the robustness to perceptual noise, we consider the amount of noise relative to the value range of the state vector. To this end, we approximate the continuous-valued state range, denoted rng, as the difference between the upper and lower quartile of the elements in the history of the last 5000 continuous-valued state vectors that were generated during the rollouts. For each action step in the rollout, we randomly sample noise, denoted $s_\gamma$, according to a normal distribution with rng being the standard deviation. We add this noise to the state vector. To parameterize the amount of noise added to the state, we define a noise-to-signal ratio κ such that the noisy state vector is computed as $s_{noisy} = s + \kappa \cdot s_\gamma$. We refer to the noise level as the percentage corresponding to κ. That is, e.g., κ = 0.01 is equivalent to a noise level of 1%.
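The noise model can be sketched as follows. The sketch assumes that rng is a single scalar computed over all elements of all state vectors in the history and that the noise has zero mean; the names `state_range` and `add_perceptual_noise` are illustrative and the per-element pooling is one possible reading of the description above.

```python
import random
import statistics
from collections import deque

# Sliding window over the last 5000 state vectors generated during rollouts.
history = deque(maxlen=5000)


def state_range(history):
    """Interquartile range over all elements of the stored state vectors."""
    values = [x for state in history for x in state]
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def add_perceptual_noise(state, rng_value, kappa, rand=random):
    """Compute s_noisy = s + kappa * s_gamma with s_gamma ~ N(0, rng_value)."""
    return [x + kappa * rand.gauss(0.0, rng_value) for x in state]
```

A noise level of 1% then corresponds to calling `add_perceptual_noise(s, state_range(history), kappa=0.01)` at every action step.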
For all experiments, i.e., block-stacking, tool use and ant navigation, we trained the agent on multiple CPUs in parallel and averaged the neural network weights of all CPU instances after each epoch, as described by Andrychowicz et al. (2017). Specifically, we used 15 CPUs for the tool use and the block-stacking experiments with one and two blocks; we used 25 CPUs for the block-stacking with three blocks. For the ant navigation, we used 15 CPUs when investigating the robustness to noise (Figure 7) and 1 CPU when comparing our approach to the framework by Levy et al. (2019) (Figure 8). The latter was necessary to enable a fair comparison between the approaches because the implementation of the architecture of Levy et al. (2019) supports only a single CPU. For all experiments, an epoch consists of 100 training rollouts per CPU, followed by training the neural networks for actor and critic with 15 batches, using a batch size of 256. The results in Figure 4 and Figure 6 illustrate the median and the upper and lower quartile over multiple (n ≥ 5) repetitions of each experiment.
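The averaging of network weights across CPU instances can be illustrated with a minimal sketch. It assumes that each worker exposes its parameters as a flat list of floats; the function name `average_weights` is illustrative, and the sketch does not reflect how the parallel communication is actually implemented.

```python
def average_weights(worker_weights):
    """Element-wise mean of the flat parameter lists of all workers."""
    n_workers = len(worker_weights)
    return [sum(values) / n_workers for values in zip(*worker_weights)]


# Two hypothetical workers with three parameters each.
averaged = average_weights([[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]])
print(averaged)  # [1.0, 2.0, 3.0]

# Training rollouts per epoch with 15 CPUs and 100 rollouts per CPU.
print(15 * 100)  # 1500
```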

Block-stacking
Figure 1 (left) presents the simulated environment for this experiment, where a number of blocks (up to three) are placed randomly on the table. The task of the robot is to learn how to reach, grasp and stack those blocks one by one at their corresponding random target location. The task is considered completed when the robot successfully places the last block on top of the others in the right order and moves its gripper to another random target location. The difficulty of this task increases with the number of blocks to stack. The order in which the blocks need to be stacked is randomized for each rollout. The causal dependencies involved here are that a block can only be grasped if the gripper is empty, that a block (e.g., A) can only be placed on top of another block (e.g., B) if there is no other block (e.g., C) already on top of either A or B, etc.

Figure 3. An ant agent performing a navigation and locomotion task in a four-room environment. Herein, the agent needs to learn how to walk and find a way to reach the desired position. In this case, the agent needs to walk from the upper right room to the target location in the lower left room.
The size of the goal space depends on the number of blocks. For this experiment, the goal space is a subset of the state-space that is constituted by the three Cartesian coordinates of the robot's gripper and three coordinates for each block. That is, the dimension of the goal and subgoal space is $k = (1 + n_o) \cdot 3$, where $n_o \in \{1, 2, 3\}$ is the number of objects.
The state-space of the reinforcement learning agent consists of the Cartesian location and velocity of the robot's gripper, the gripper's opening angle, and the Cartesian location, rotation, velocity, and rotational velocity of each object. That is, the size of the state vector is $|s| = 4 + n_o \cdot 12$, where $n_o$ is the number of blocks.
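The two dimension formulas can be tabulated for the three block-stacking variants with a short sketch; the function names `goal_dim` and `state_dim` are illustrative.

```python
def goal_dim(n_objects):
    """k = (1 + n_o) * 3: gripper position plus one position per block."""
    return (1 + n_objects) * 3


def state_dim(n_objects):
    """|s| = 4 + n_o * 12, as stated for the block-stacking state vector."""
    return 4 + n_objects * 12


for n_o in (1, 2, 3):
    print(f"n_o={n_o}: k={goal_dim(n_o)}, |s|={state_dim(n_o)}")
# n_o=1: k=6, |s|=16
# n_o=2: k=9, |s|=28
# n_o=3: k=12, |s|=40
```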
The planning domain descriptions for all environments are implemented with the PDDL actions and predicates provided in the Appendix (in Section A).

Tool use
The environment utilized for this experiment is shown in Figure 1 (right). A single block is placed randomly on the table, such that it is outside the reachable region of the Fetch robot. The robot has to move the block to a target position (which is randomized for every rollout) within the reachable (dark brown) region on the table surface. In order to do so, the robot has to learn how to use the provided rake. The robot can drag the block either with the left or the right edge of the rake. The observation space consists of the Cartesian velocities, rotations, and locations of the robot's gripper, the tip of the rake, and the object. An additional approximation of the end of the rake is added in this task. The goal space only contains the Cartesian coordinates of the robot's gripper, the tip of the rake, and the object. The planning domain description for this tool use environment can be found in the Appendix (in Section B).

Ant navigation
The environment for navigation and locomotion in a four-connected-room scenario is shown in Figure 3, where the ant has to find a way to the randomly allocated goal location inside one of the four rooms. The state-space consists of the Cartesian location and translational velocity of the ant's torso, along with the joint position and velocity of the eight joints of the ant (i.e., each leg has one ankle joint and one hip joint). The goal space contains the Cartesian coordinates of the ant's torso. There are no other objects involved in the task. The planning domain description and the high-level action specifications for this navigation environment can be found in the Appendix (in Section C).

RESULTS
To evaluate our approach, we investigate the success rate of the testing phase over time for all experiments, given varying noise levels κ (see Section 4). The success rate is computed per epoch as the fraction of successful rollouts among all test rollouts, using ten problem instances per CPU.

Block-stacking
For the experiment with one block, the approach converges after around ten epochs, even with κ = 0.06, i.e., if up to 6% noise is added to the observations. The results are significantly worse for 8% and more noise. For the experiment with two blocks, the performance drops already for 6% noise. Interestingly, for both one and two blocks, the convergence is slightly faster if a small amount (1-2%) of noise is added, compared to no noise at all. The same seems to hold for three blocks, although no clear statement can be made because the variance is significantly higher for this task.
For the case of learning to stack three blocks, consider also Figure 5, which shows how many subgoals have been achieved on average during each epoch. For our particular PDDL implementation, at least six high-level actions, and hence six subgoals, are required to solve the task: [move gripper to(o1), move to target(o1), move gripper to(o2), move o on o(o2,o1), move gripper to(o3), move o on o(o3,o2)]. First, the gripper needs to move to the randomly located object o1. Then, since the target location of the stacked tower is also randomly selected, the gripper needs to transport o1 to the target location. Next, the gripper moves to o2 to place it on top of o1, and repeats these steps for o3. The result shows that the agent can consistently learn to achieve the first five subgoals on average, but is not able to proceed further. This demonstrates that the agent robustly learns to stack the first two objects, but fails to stack the third one.

Tool use
The results in Figure 6 reveal that our proposed approach allows the agent to learn and complete the task in under 100 training epochs (corresponding to approximately 8 hours with 15 CPUs), even with a noise level of up to 4% of the state range. We observe that it becomes harder for the agent to learn when the noise level exceeds 4%. In the case of 8% noise, the learning fails to achieve a reasonable performance within the considered first 100 training epochs (i.e., it obtains a success rate of less than 1%). Interestingly, in cases with very low noise levels (1%-2%), the learning performance is better than or at least as good as in the case with no noise added at all.

Ant navigation
Figure 7 presents the performance of trained agents following our proposed approach in the ant navigation scenario. The results show that the agent can learn to achieve the task in less than 30 training epochs under low noise level conditions (up to 1%). The performance decreases slightly in the case of 1.5%, but the agent can still learn the task after around 70 training epochs. With a higher noise level (i.e., 2%), the agent requires a longer training time to cope with the environment.

Comparison with hierarchical reinforcement learning
Figure 8 compares our proposed approach with the HRL approach by Levy et al. (2019). Although the HRL approach quickly learns the task at the beginning, it does not exceed a success rate of 70%. In comparison, our approach learns to solve the task more reliably, eventually reaching 100%, but the success rate grows significantly later, at around 50 epochs.

Figure 8. Comparison of two approaches for the ant navigation experiment: our (PDDL+HER) approach and hierarchical reinforcement learning (HRL) (Levy et al., 2019).

DISCUSSION
The results indicate that our proof-of-concept addresses the hypotheses H.1 and H.2 as follows:

Hypothesis H.1: Ability to ground high-level actions in low-level trajectories
Our experiments indicate that the grounding of high-level actions in low-level RL-based robot control using the HER approach performs well for small to medium-sized subgoal spaces. However, learning is not completely automated, as the approach requires the manual definition of the planning domain and of the functions $f^p_{subg}$ that map planning domain predicates to subgoals. For the tasks of stacking two blocks and the tool use, the subgoal space involved nine values, and both tasks could be learned successfully. The qualitative evaluation and visual inspection of the agent in the rendered simulation revealed that the grasping of the first and second block failed more often for the experiment of stacking three blocks than for the experiment of stacking two blocks. Therefore, we conclude that the subgoal space for stacking three blocks, which involves twelve values, is too large.
However, the performance on the control level was very robust. For example, it happened frequently during the training and exploration phase that the random noise in the actions caused a block to slip out of the robot's grip. In these cases, the agent was able to catch the blocks immediately while they were falling down. During the tool use experiment, the agent was also able to consider the rotation of the rake, to grasp the rake at different positions, and to adapt its grip when the rake was slipping.
The results indicate that the approach is able to solve causal puzzles if the subgoal space is not too large. The architecture depends strongly on the planning domain representation that needs to be implemented manually. In practice, the manual domain engineering that is required for the planning is appropriate for tasks that are executed frequently, such as adaptive robotic co-working at a production line or in a drone delivery domain. Due to the caching of plans (see Section 3.3), we have not encountered issues with the computational complexity or the run time of the planning approach.
Our measure of appropriateness that we state in H.1 is to evaluate whether our method outperforms a state-of-the-art HRL approach. Figure 8 shows that this is the case in terms of the final success rate. Specifically, the figure shows that the HRL approach learns faster initially, but never reaches a success rate of more than 70%, while our approach is slower but reaches 100%. A possible explanation for this behavior is that HRL implements a "curriculum effect" (cf. Eppe et al. (2019)), in the sense that it first learns to solve simple subgoals due to its built-in penalization of difficult subgoals. However, as the success rate increases, there are fewer unsuccessful rollouts to be penalized, which potentially leads to more difficult subgoals and, consequently, a lower overall success rate. This curriculum effect is not present in our approach because the planning mechanism does not select subgoals according to their difficulty. Investigating and exploiting this issue in detail is potentially subject to further research.

Hypothesis H.2: Robustness to noise
For the block-stacking with up to two blocks and the tool use experiments, the approach converged with a reasonable noise-to-signal ratio of four to six percent. For the block-stacking with three blocks and for the ant environment, a smaller amount of noise was required. An interesting observation is that a very low amount of random noise, i.e., κ = 0.01, improves the learning performance for some cases. Adding random noise, e.g., in the form of dropout, is a common technique to improve neural network-based machine learning because it helps neural networks to generalize better from datasets. One possible explanation for the phenomenon is, therefore, that the noise has the effect of generalizing the input data for the neural network training, such that the parameters become more robust.
Noise is an important issue for real physical robots. Our results indicate that the algorithm is potentially appropriate for physical robots, at least for the case of grasping and moving a single block to a target location. For this case, with a realistic level of noise, the algorithm converged after approximately ten epochs (see Figure 4). Per epoch and CPU, the agent conducts 100 training rollouts. A single rollout would take around 20 seconds on a physical robot. Considering that we used 15 CPUs, the equivalent robot training time required is approximately 83 hours. For the real application, the physical training time can potentially be lowered further by applying more neural network training batches per rollout, and by performing pre-training using the simulation along with continual learning deployment techniques, such as the method proposed by Traoré et al. (2019).
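The estimate of approximately 83 hours follows directly from the numbers stated above, as the following sketch shows. The function name `physical_training_hours` is illustrative, and the calculation assumes that all rollouts would be executed sequentially on a single physical robot.

```python
def physical_training_hours(epochs, rollouts_per_epoch_per_cpu, cpus, seconds_per_rollout):
    """Sequential robot time that corresponds to the simulated rollouts."""
    total_rollouts = epochs * rollouts_per_epoch_per_cpu * cpus
    return total_rollouts * seconds_per_rollout / 3600.0


# 10 epochs * 100 rollouts * 15 CPUs = 15000 rollouts of about 20 s each.
hours = physical_training_hours(10, 100, 15, 20)
print(f"{hours:.1f} hours")  # 83.3 hours
```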

CONCLUSION
We have developed a hierarchical architecture for robotic applications in which agents must perform reasoning over a non-trivial causal chain of actions.We have employed a PDDL planner for the high-level planning and we have integrated it with an off-policy reinforcement learner to enable robust low-level control.
The novelty of our approach is the combination of action planning with goal-independent reinforcement learning and sparse rewards (Lillicrap et al., 2016; Andrychowicz et al., 2017). This integration allowed us to address two research problems that involve the grounding of the discrete high-level state and action space in sparse rewards for low-level reinforcement learning. We addressed the problem of grounding of symbolic state spaces in continuous-state subgoals (P.1) by proposing a principled predicate-subgoal mapping, which involves the manual definition of functions $f^p_{subg}$ for each predicate p.
We assume that the manual definition of functions $f^p_{subg}$ generally involves less engineering effort than designing a separate reward function for each predicate. Although this assumption heavily depends on the problem domain and may be subject to further discussion, the manual definition of functions $f^p_{subg}$ is at least a useful scaffold for further research that investigates the automated learning of functions $f^p_{subg}$, possibly building on the research by Ugur and Piater (2015).
The predicate-subgoal mapping is also required to address the problem of mapping subgoals to low-level action trajectories (P.2) by means of reinforcement learning with sparse rewards using hindsight experience replay (Andrychowicz et al., 2017). Our resulting approach has two advantages over other methods that combine action planning and reinforcement learning, e.g., (Grounds and Kudenko, 2005; Yamamoto et al., 2018): the low-level action space for our robotic application is continuous, and it supports a higher dimensionality.
We have realized and evaluated our architecture in simulation, and we addressed two hypotheses (H.1 and H.2): First, we demonstrate that the approach can successfully integrate high-level planning with reinforcement learning and this makes it possible to solve simple causal puzzles (H.1); second, we demonstrate robustness to a realistic level of sensory noise (H.2). The latter demonstrates that our approach is potentially applicable to real-world robotic applications. The synthetic noise used in our experiments does not yet fully guarantee that our approach is capable of bridging the reality gap, but we consider it a first step towards real robotic applications (see also Andrychowicz et al. (2018); Nguyen et al. (2018); Traoré et al. (2019)).
The causal puzzles that we investigate in this paper are also relevant for hierarchical reinforcement learning (HRL) (e.g., (Levy et al., 2019)), but we have not been able to identify an article that presents good results in problem-solving tasks that have a causal complexity comparable to our experiments. An empirical comparison was, therefore, not directly possible. Our approach has the advantage over HRL that it exploits domain knowledge in the form of planning domain representations. The disadvantage compared to HRL is that the domain knowledge must be hand-engineered. In future work, we plan to complement both approaches, e.g., by building on vector-embeddings to learn symbolic planning domain descriptions from scratch by means of the reward signal of the reinforcement learning. A similar approach, based on the clustering of affordances, has been presented by Ugur and Piater (2015), and complementing their method with reinforcement learning suggests significant potential. An overview of this topic and potential approaches is provided by Lesort et al. (2018). We also plan to apply the approach to a physical robot and to reduce the amount of physical training time by pre-training the agent in our simulation and by applying domain-randomization techniques (Andrychowicz et al., 2018).

C Ant navigation
The domain description of the ant navigation task is given in Listing 3.
For this listing, we did not use the built-in PDDL objects and variables (indicated with ?<objname> syntax) to instantiate the predicates and actions. Instead, we implemented a script to generate the predicate and action definitions according to Listing 3 such that the following criteria are met:
1. Rooms (denoted <R>) are labeled 00, 01, 10, and 11, such that the 0 and 1 denote the column and row of the 2x2 grid in the ant navigation environment. E.g., room 00 is the lower left room and room 11 is the upper right room.
2. Doors (denoted <D>, the passages between the rooms) are labeled 0001, 0010, 0111, and 1011. The labels indicate the passages that connect the rooms. For example, door 0001 connects room 00 with room 01.
3. For each door <D> and room <R>, we generate the respective predicate names as listed in the :predicates section of the domain definition.
4. For each door and room combination, we generate the action definitions indicated in the listing below, such that the connections of doors and rooms are appropriate. For example, we generate an action definition move to room center 00 from door 0001 because it is possible to move from door 0001 to the center of room 00. However, we do not generate the action move to room center 00 from door 0111, because door 0111 is not connected to room 00.
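The generation criteria above can be illustrated with a minimal sketch of such a script. It assumes a 2x2 grid with rooms 00, 01, 10 and 11, that a door label is the concatenation of the labels of the two rooms it connects, and that the words of an action name are joined by underscores; the function names and the underscore convention are assumptions for illustration and may differ from the script that we used.

```python
ROOMS = ["00", "01", "10", "11"]
DOORS = ["0001", "0010", "0111", "1011"]


def rooms_of(door):
    """A door label concatenates the labels of the two rooms it connects."""
    return door[:2], door[2:]


def room_center_actions():
    """Names of the 'move to room center' actions for connected pairs only."""
    names = []
    for door in DOORS:
        for room in ROOMS:
            # Skip combinations where the door does not border the room.
            if room in rooms_of(door):
                names.append(f"move_to_room_center_{room}_from_door_{door}")
    return names


for name in room_center_actions():
    print(name)
```

Each of the four doors borders exactly two rooms, so the sketch produces eight action names, including the one for room 00 and door 0001 but not the one for room 00 and door 0111.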

Figure 2. Our proposed integration model. Low-level motion planning elements are indicated in green and high-level elements in orange color. The abstraction functions $f_S$, $f_G$ map the low-level state and goal representations s, g to high-level state and goal representations S, G. These are given as input to the planner to compute a high-level subgoal state $G_{sub}$. The subgoal grounding function $f_{subg}$ maps $G_{sub}$ to a low-level subgoal $g_{sub}$ under consideration of the context provided by the current low-level state s and the low-level goal g. The reinforcement learner learns to produce a low-level motion plan that consists of actions u based on the low-level subgoal $g_{sub}$ and the low-level state s.

Figure 4. Results of the block-stacking experiments for one (top), two (middle) and three (bottom) blocks under different sensory noise levels.

Figure 5. Number of subgoals reached for the case of stacking three blocks.

Figure 7. Results of the ant navigation experiment under different noise levels. The curves are subject to early stopping.

Algorithm 1 Mapping propositional state representations to continuous state representations
1: function $f_{subg}$($G_{subg}$, s, g)
6: for p ∈ P do ▷ For each predicate in high-level goal state, set low-level subgoal indices $p_{idx}$
7: $s_{subg}[p_{idx}] \leftarrow s_p$, where $s_p = f^p_{subg}(s_{subg}, g)$
8:
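The loop in lines 6 and 7 of Algorithm 1 can be sketched in Python as follows. Since the initialization of the subgoal vector is not shown in the listing, the sketch assumes that it starts from the current state projected onto the goal space; this initialization, the example predicates `gripper_at_o1` and `o1_at_target`, and all function and parameter names are illustrative assumptions.

```python
def f_subg(G_sub, s, g, predicate_fns, predicate_idx, goal_idx):
    """Ground an ordered list of high-level predicates in a low-level subgoal.

    predicate_fns[p](s_subg, g) returns the values for the indices
    predicate_idx[p] of the subgoal vector (the role of f^p_subg).
    """
    # Assumed initialization: current state projected onto the goal space.
    s_subg = [s[i] for i in goal_idx]
    for p in G_sub:
        s_p = predicate_fns[p](s_subg, g)
        for i, value in zip(predicate_idx[p], s_p):
            s_subg[i] = value
    return s_subg


# Toy goal space: gripper position (indices 0-2) and block position (3-5).
predicate_idx = {"gripper_at_o1": [0, 1, 2], "o1_at_target": [3, 4, 5]}
predicate_fns = {
    # The gripper should be where the block is in the current subgoal vector.
    "gripper_at_o1": lambda s_subg, g: s_subg[3:6],
    # The block should be at its target position from the low-level goal.
    "o1_at_target": lambda s_subg, g: g[3:6],
}
s = [0.0, 0.0, 0.5, 0.3, 0.2, 0.0]
g = [0.1, 0.1, 0.5, 0.6, 0.6, 0.0]
print(f_subg(["gripper_at_o1"], s, g, predicate_fns, predicate_idx, range(6)))
```

In this toy setting, grounding the predicate for reaching the block yields a subgoal in which the gripper coordinates equal the current block coordinates, while all other entries keep their current values.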