Learning to reason over scene graphs: a case study of finetuning GPT-2 into a robot language model for grounded task planning

Long-horizon task planning is essential for the development of intelligent assistive and service robots. In this work, we investigate the applicability of a smaller class of large language models (LLMs), specifically GPT-2, in robotic task planning by learning to decompose tasks into subgoal specifications for a planner to execute sequentially. Our method grounds the input of the LLM on the domain that is represented as a scene graph, enabling it to translate human requests into executable robot plans, thereby learning to reason over long-horizon tasks, as encountered in the ALFRED benchmark. We compare our approach with classical planning and baseline methods to examine the applicability and generalizability of LLM-based planners. Our findings suggest that the knowledge stored in an LLM can be effectively grounded to perform long-horizon task planning, demonstrating the promising potential for the future application of neuro-symbolic planning methods in robotics.


Introduction
The autonomous execution of long-horizon tasks is of utmost importance for future assistive and service robots.An intelligent robot should reason about its surroundings, e.g., regarding the included objects and their spatial-semantic relations, and abstract an action plan for achieving a goal that will purposefully alter the perceived environment.Such an elaborate course of robot actions requires scene understanding, semantic reasoning, and planning over symbols and geometries.The advent of Deep Learning led many researchers to faithfully follow end-to-end approaches due to the representation power of differentiable deep neural networks [1].
The problem of sequential decision-making has been addressed both with search-based and optimization approaches [2][3][4][5][6], as well as learning-based [7][8][9] and hybrid methods [10][11][12][13].While the first ones enjoy probabilistic completeness, they require full domain specification and have high computational demands.The learning-based methods require broad exploration to learn from experience, but they have shown better generalization capabilities in similar domains to those experienced during training.
In recent years, Large Language Models (LLMs) have exhibited an unprecedented generative ability [14], thanks to the transformer architecture [15] combined with massive datasets distilled from the internet.Naturally, in the quest for general artificial intelligence, researchers try to benchmark such models in reasoning tasks, among others [16,17].Robotic embodied intelligence requires both logical and geometric reasoning; hence, it is a holy grail of AI.Several researchers saw a benefit in LLMs, and it was not long before several works started exploring their application to robotics for endowing robots with the long-wished objective of reasoning in the scope of autonomous task planning and interaction [18,19].However, most of these works have focused on the prompting [20] and the subsequent prompt engineering [21] , in which engineers provide appropriate inputs to LLMs for extracting outputs that can be realizable by a robotic agent, either for human-instruction following [22] or for planning [23,24].This work investigates the applicability of a smaller class of LLMs for robotics, i.e., GPT-2 [25], and we propose and investigate a method that decomposes a long-horizon task into subgoals in the form of a goal-specification for a robotic task planner to execute.The main contribution of our work is the representation of the domain as a scene graph and the subsequent linearization of the graph to keep its structure and relations while forming a suitable input for the finetuning of the GPT-2 model.Our thorough experimental evaluation shows that finetuning GPT-2 by additionally grounding its input on the domain can help translate human requests (tasks) to executable robot plans, to learn to reason over long-horizon tasks, as those encountered in the ALFRED benchmark [26].We compare our proposed approach with classical planning methods to investigate the applicability and generalizability of the Pre-trained Language Model (PLM)-based planners compared to classical task planners operating on a limited computational budget for a fair comparison.We conclude that the knowledge stored in a PLM can be grounded on different domains to perform long-horizon task planning, showing encouraging results for the future application of neuro-symbolic planning methods in robotics.

State of the Art 2.1 Reasoning with Large Language Models
LLMs have attracted much attention for understanding the commonsense and reasoning patterns in their latent space [27][28][29].It has been shown that some abilities in logical and mathematical reasoning seem to emerge when LLMs are prompted appropriately [18,30].However, the engineering effort, as well as the lack of robustness, is a key issue in prompting massive models [31,32].While great effort seems to be consumed on few-shot prompting of huge parametric models, it has also been shown by other lines of work show that efficient finetuning of much smaller models [33], or the use of small adaptation modules (Adapters) [34,35] can lead to methods that perform more robustly than large-scale generalist few-shot prompters.In the same direction, the chatbot versions of those huge models raised several points of criticism recently, showing that much more is needed than just prompting a blind human-preference alignment1 .

Robot behavior planning
Long-horizon robot behavior planning is an NP-hard problem [36].Current advances in ML and perception led researchers to revisit this fundamental problem, i.e., the execution of multi-stage tasks, whose completion requires many sequential goals to be achieved, considering learning-based heuristics [11].Researchers consider such problems as Task And Motion Planning (TAMP) problems [3,12,37], with a symbolic plan over entities and predicate with respective action operators with preconditions and effects in the environment.In contrast, a motion plan tries to find a feasible path to the goal.Nevertheless, most TAMP methods rely on manually specified rules; they do not integrate perception, and the combinatorial explosion when searching over symbolic and continuous parameters prohibits scaling the methods to challenging, realistic problems [4,10].
Transformer models [15] that revolutionized the field of Natural Language Processing (NLP) opened the way for multiple new applications, in particular for robotics, e.g., visual-language instruction following [38], 3D scene understanding and grounding [39,40], language-based navigation [41,42].Due to their training on extensive databases, several works explored the use of LLMs for task planning and long-horizon manipulation [43], mainly employing clever prompting [24,44], using multimodal information [23,45], grounding with value-functions [19,46,47], and deploying advances in code generation to extract executable robot plans [48].[49] propose to use a PLM as a scaffold for decisionmaking policies in interactive environments, demonstrating benefits in the generalization abilities for policy learning even when language is not provided as input or output.Recently, PALM-e [11] has integrated a vision transformer with the PALM language model and has encoded some robotic state data to propose a multimodal embodied model, which showed the potential of integrating geometric information of the robot state but achieved limited performance in robotic tasks.
3 Language models for grounded robot task planning

Problem statement
Let us assume an agent that is able to move and manipulate objects in an environment, e.g., a mobile manipulator robot in a household environment.Let the environment be composed of a combination of rooms, such as 'bathroom,' 'living room,' or 'kitchen.'Each room contains objects and receptacles, i.e., objects that are able to receive other objects, such as 'table,' 'drawer,' or 'sink.'Each object has (household-specific) properties associated with it that define whether it can be picked up, cleaned, heated, cooled, cut, etc.These properties can change, meaning that the objects have a state.The agent can pick up only one object at a time, meaning the agent also has a state, e.g., 'object in hand.'Given the fact that the state is preserved over time and future actions depend on past actions, the environment can be characterized as sequential.Therefore, a series of actions has to be reasoned upon for an agent to be able to execute a series of actions for solving a long-horizon task, i.e., a task that requires the completion of several subtasks and potentially the manipulation of various objects to achieve the end goal.

The ALFRED benchmark
The ALFRED benchmark [26] contains human-annotated training samples and image-based recordings of everyday household tasks; this is "25,743 English language directives describing 8,055 expert demonstrations averaging 50 steps each, resulting in 428,322 image-action pair".In addition to that, the dataset provides a PDDL domain of the overall task and a PDDL problem for each sample [50].ALFRED heavily depends on AI2-THOR [51], which acts as the underlying controller and simulation environment (based on the Unity game engine): trajectories for each sample of the ALFRED dataset were generated with AI2-THOR and the validation of user-generated actions requires the AI2-THOR controller.Figure 1 shows a sample scene loaded into the AI2-THOR simulator.Each sample in the dataset consists of a high-level plan in PDDL and the trajectory of the agent's actions which lead to successful task completion, together with a description of the task goal and each plan step in Natural Language (NL).

Data augmentation
Each sample in the ALFRED dataset can be replayed in the AI2-THOR simulator to collect additional information not contained in the original dataset.ALFRED provides a script that has been modified for that purpose.Data augmentation is necessary for Graph2NL (c.f.§3.8) to generate a graph representation from the environment state.For each replayed sample, the complete list of objects in the scene, with their respective name, position, and rotation, and the agent position is saved to a separate file next to the trajectory data.This file is later loaded and turned into a processable graph.
Task: Put some Peppershaker on Sidetable  [26].The scenes show a room with household objects and the robot executing a task.Note that the robot does not have an arm, and the object automatically floats in front of the camera; it interacts with the environment through discrete actions.The discrete actions are shown underneath each frame in the form of PDDL commands.

State and action space
The state space defines the feedback provided by the environment, while the action space defines the available actions to interact with the environment.
State space.The AI2-THOR environment can be framed either as fully observable when access to the complete simulator state is given, or partially observable when the agent uses its perception (images) to observe the environment without having access to the true state.AI2-THOR eliminates most physics-related aspects (e.g., objects are automatically picked up and placed by a single action), which makes the highly dynamic and stochastic household environment almost static and deterministic -almost because some physics still exists.This simplifies the core TAMP problem, together with the discrete agent actions coming from the ALFRED dataset.Therefore, the ALFRED benchmark represents an appropriate choice for studying the problem of learning for robotic task planning, where motion failures are minimized by the underlying AI2-THOR controllers.Hence, we can focus on the reasoning aspects of the problem, which is the focus of this study.In this paper, the environment is assumed to be fully observable.The state space, therefore, includes the whole domain as provided by the simulation.In the proposed methods, the state is transformed to NL context ( §3.8) and not used directly as a model input.
Action space.ALFRED has an action space of eight discrete high-level actions: GotoLocation, PickupObject, PutObject, CoolObject, HeatObject, CleanObject, SliceObject and ToggleObject.The actions CoolObject, HeatObject, CleanObject themselves are a composition of the remaining high-level actions.The underlying AI2-THOR controller also has a discrete action space for the low-level actions: The agent can move forward, backward, left or right and rotate clockwise or counter-clockwise in fixed steps.

Task categories
The ALFRED dataset encompasses seven categories of household tasks: 'Look at object', 'Pick and place', 'Pick two and place', 'Pick and place with movable receptacle', 'Pick, clean then place', 'Pick, cool then place', 'Pick, heat then place'.Because, in mobile manipulation, objects can be placed in different corners of a room, each of these tasks includes the sub-problem of navigation.
For the 'pick' or 'place' subtasks executing the respective PickupObject or PutObject action is sufficient.But, the subtasks 'clean', 'cool' and 'heat' must be seen as planning problems on their own, because the corresponding actions are a composition of high-level state-dependent actions.In regards to the household environment, the subtask 'cool' requires a fridge, 'heat' requires a microwave (or oven), and 'clean' requires a sink as an receptacle.
Adding to the difficulty of the 'clean,' 'cool,' 'heat,' and 'cut' subtasks is the fact that the ALFRED simulator tracks the state of each object: the subtask is only considered successful when the final object state is correct.For example, if the task category is 'Pick, clean then place', the task goal is only completed when the placed object is marked as 'clean.'The implementation aspects of these task categories are discussed in Section 4.

RobLM: Robot Language Model for Task plan generation
Just like images can be represented by discretizing color space, NL can be expressed as a sequence of tokens x = [x 1 , x 2 , ..., x n ], where each token is mapped to an embedding (lookup table).The Language Model (LM) places a probability distribution p(x) over the output token sequence.p(x) can be decomposed into a conditional probability distribution p(x i+1 |x i ), where the probability of each token depends on all previous tokens.This results in the following joint distribution p(x) for a sequence of tokens x: In regards to Neural Network (NN), p(x) is commonly estimated with the Softmax function [52] 2 where W is the learned weight matrix, b the bias and h T the output vector of the NN.For text generation, the joint probability distribution p(x) (see Equation 1) can be formulated as a maximum-likelihood objective, where the objective is to maximize the likelihood of the next token occurrence for the given data.
Our goal is to fine-tune a LM to get a Robot Language Model (RobLM) that can generate a complete high-level task plan in one shot, given the domain information and a task goal.Because LMs are unsupervised learners, a single training sample contains both given and desired information as NL text.A restriction to the text format (a string of characters) comes with challenges: structural information needs to be condensed into a single linear dimension, and conceptually different aspects of the input need to be annotated in the text.This text format, including the syntax, has to be designed in such a way that information can be fed to and extracted from the LM reliably.
In RobLM, the format definition for a NL task description must comply with the following syntactic rule (spaces added for readability): Goal is the task goal in NL.Context is any additional, yet optional information provided to the LM.The task might have ambiguous solutions, and the inherent assumption is that the LM will better "understand" the task if given a context.Examples of a context are the name of the room, the name of the target object, or a NL description of the environment (see 3.8).
Plan is the sequence of high-level task actions and their respective arguments, such as objective name or location.
Because PLM have been trained on a diverse corpus of NL, including program code, the format for plans follows syntactical rules similar to that of a generic programming language: The sequence between the special tokens < BOS > and < EOS > can be extracted to retrieve the plan from the LM-generated output.

Training
RobLM generate a plan as a text given the goal and the context, which involves causal language modeling for text generation.Decoder-only autoregressive language models 2 are frequently used for the problem of text generation.Hence, we chose GPT2 model as the base model for RobLM.RobLM uses the base version of the GPT-2 PLM ('gpt-2') [25], loaded and initialized with pre-trained weights from the Huggingface [53] Transformer library.Fine-tuning GPT-2 for causal language generation has a self-supervised setup, where the labels are the inputs shifted to the right, which entitles learning to predict the next token in a sequence.
The GPT-2 PLM is fine-tuned to the pre-processed training data of the ALFRED dataset, which has around 20.000 samples, with three sets of NL descriptions for each sample.The ADAM [54] optimizer is used with a learning rate of 5e −5 and the LM is trained for two epochs.Fine-tuning a GPT-2 LM to the ALFRED training data with a single GPU-accelerated computer takes around 30 minutes (27 iterations/s -measurement not representative due to hardware dependence).

Generation pipeline
For inference, RobLM takes only the NL task goal together with an optional context and outputs the complete stepby-step plan for completing the goal.This plan is composed of high-level instructions rather than low-level controller commands.
Example Given the task "Put the soap into the drawer:", RobLM (no context) generates the following plan: Put the soap into the drawer: 0.GotoLocation(countertop) 1.PickupObject(soap) 2.GotoLocation(drawer) 3.PutObject(soap,drawer) The plan is generated by consecutive forward passes through the Transformer model.For a vocabulary size of k and a token sequence of length l (with l ≤ 1024 for GPT-2), the forward pass of the Transformer yields an output vector of size k × l with values in the interval [0, 1].
The Transformer outputs scores, called logits, for each token in the input sequence.These scores are converted to a probability distribution p(x) by using the Softmax function, as described in Equation 2.  Two possible generation strategies for selecting the next token from p(x) are: greedy search and top-k / top-p sampling [55].In the greedy strategy, the token x sel with the highest likelihood is picked with x sel = arg max p(x).In the top-k sampling strategy, as the name suggests, the scores are sorted, and one of the first k candidate tokens is randomly sampled.By extending the top-k sampling with an additional top-p strategy, the sum of the k candidates must be equal to or greater than p ∈ [0, 1].Simply put, top-k widens the choice over the next tokens and top-p filters out low-probability tokens.

LayerNorm
Figure 3 illustrates, on the basis of an example, a forward pass through the Transformer with a greedy selection strategy.These steps are repeated recursively until an end-of-text token is encountered or the defined sequence limit is reached to generate the full plan.
This LM model was fine-tuned to generate a structured output, omitting special tokens, characterized by numbered actions and their arguments in parenthesis.Note that the input is always part of the output, due to the generation function utilized by RobLM.
Note that it is not guaranteed that the 'soap' can be found inside the 'drawer' on the 'countertop'.In fact, it could be in any possible location permitted by the environment.However, given a greedy search strategy, for the given task goal the likelihood for the 'soap' being on the 'countertop' is the highest in this case.

Transforming Scene Graphs to Natural Language: Graph2NL
PLMs are trained on NL.Because of this, NL is a natural modality for fine-tuning a PLM.When a context is provided to the LM, this context must be presented in NL just like the input sequence.If the context should encapsulate the environment state, this means that the state has to be transformed into NL before being supplied to the PLM.The forward passes are repeated in a recursive manner until an end-of-text token is encountered or the defined sequence limit is reached.
Graph2NL is a novel method that "translates" the object-centric scene graph representation of the environment state to NL. Optionally, domain knowledge about the environment3 can be infused into this graph.The following steps describe the core Graph2NL process: 1. Generate an object-scene graph G with a node for the agent and nodes corresponding to objects, node attributes being the position and rotation of the object in Euclidean space and their respective distance and orientation vectors as edge attributes.
2. (Optional) Infuse domain knowledge about the environment by connecting all dependent nodes and all nodes reachable by the agent.
3. Connect the agent (node) to all reachable nodes, if given domain knowledge, or to all nodes, if not given domain knowledge.
4. Given a task and the identified target object, find all paths in the graphs leading from the agent (node) to the target object (node).
5. Use edge attributes in the found paths to describe the task-centric environment state, by mapping geometric relations to NL tokens.

NL mapping
To translate geometric relations attributed by the graph edges into a NL description, a mapping function is designed.In human speech, distances are expressed by a vocabulary of words such as "close" or "far" and orientations are expressed by words such as "in front" or "behind".Graph2NL adapts this vocabulary to describe the (numeric) distance and orientation from one node relative to another in NL.  1 summarizes the mapping used in Graph2NL.The distance between nodes is expressed in Cartesian space and orientation in polar coordinates, where Yaw is the azimuth angle (rotation along the surface normal) and Pitch is the zenith angle (altitude).With this mapping, the geometric relation between two nodes can be explained by three words (one for each: distance, pitch, and yaw).
The vocabulary contains 8 words to express the distance, 4 words to express the vertical, and 2 words to express the horizontal orientation.Combinatorially, this gives 64 possible geometric configurations.By treating each of these configurations as a relation and assigning a special symbol (token) for each relation, the geometric relationship is expressed in a condensed form.A simple approach, referring to the Symbol column in Table 1, is by assigning a symbol to each word.By combining the symbols for distance, pitch, and yaw, the condensed (three-letter) representation of the geometric relationship is created.
These symbolic representations can optionally be added to the LM tokenizer as so-called special tokens.Shorter token sequences, in general, decrease both the training and inference time.
Example Let the task goal be: "Put the soap into the drawer".The input query to Graph2NL would consist of the target object 'soap'.Figure 4 shows the graph constructed by Graph2NL from the augmented data ( §3.2), including domain-specific knowledge.After finding the shortest paths between the root ('agent') and target node ('soapbar'), Graph2NL produces an output in the following form (cut-off at search depth 2):

[Bathroom= -closer below left sink near below back soapbar -closer below left cabinet near above back soapbar -closer above left countertop next above back soapbar -close below back toilet closer below back soapbar -closer below back garbagecan close below back soapbar]
The NL context by Graph2NL starts with the name of the room extracted from the scene graph, followed by the geometric description of each node connected to the target node on the path from the agent.'-' indicates the root note, which is the agent.
For this example, Graph2NL would produce the following condensed form: -fmk countertop gmj soapbar -enj toilet fnj soapbar -fnj garbagecan enj soapbar] This form of state representation is unique for each problem configuration and forms the context that grounds RobLM.

Preliminary analysis for task plan generation with GPT-2 and GPT-3
LLMss can represent knowledge from the data they have been trained on, however, the question remains, if this knowledge can be leveraged to solve planning tasks, i.e., can LMs reason?This is investigated by comparing the text-generation results of GPT-2 (Zero-Shot Learning (ZSL)) and GPT-3 (Few-Shot Learning (FSL)) for a planning task.
Let an example planning task be the following: Given an instruction to a household robot, formulate the steps to complete the instruction.
The instruction is: ''Put a washed slice of apple on the table.'' The results for this task vary greatly between the two LMs.
GPT-2 ('gpt-2' model, 1.5B parameters) completely fails this task and produces an output that resembles a repetition of phrases in the input sentence: GPT-3 ('text-davinci-002' model, 175B parameters), when given the same instruction as input, is able to make assumptions and formulate a whole plan based on these assumptions: Assuming the apple is not cut and not washed.
1. Pick up the apple from the counter.2. Cut the apple into a slice.

Place the apple slice on the table.
The FSL paradigm allows GPT-3 to be very sensitive to context changes and seemingly understand the request at hand.However, smaller GPT-3 PLMs (GPT-3 curie, GPT-3 babbage, GPT3-ada) show a degraded quality in the produced plan.[56] have shown that GPT-3 would not pass the Turing test 4 , as to having "no understanding of the semantics and contexts of the request, but only a syntactic (statistical) capacity to associate words [...]".
These tests have shown that plan generation capabilities of LLMss vary dramatically depending on the underlying learning paradigm, model architecture, and parameter size.GPT-2, out of the box, is completely unsuited for solving planning tasks that require a minimum level of text understanding.However, as later ( §3.5) shown, GPT-2 is able to successfully generate plans when fine-tuned to a training dataset ( §3.2).The question of whether a fine-tuned GPT-2 model can leverage knowledge for planning is addressed in the following section.GPT-3, unfortunately, is only accessible through a paid service by OpenAI as part of the OpenAI Beta program5 (requires registration).Fine-tuning of own GPT-3 models is possible through the provided service.Practical applications, however, are limited because each query has to be sent to and processed by the OpenAI service.Even if a PLM was made available, the hardware requirements for running GPT-3 models are immense, even for today's standards, due to the sheer parameter count.It is for these reasons that GPT-3 is not considered as a basis for fine-tuning to RobLM.

Evaluation of RobLM
This section presents the main experiments conducted for evaluation of RobLM.We first define the appropriate metrics and a baseline method required to make the evaluations measurable and comparable.The grounding problem is explained in accordance with the practical aspects of integrating the available methods into the simulator.For the experimentation part, a set of fine-tuned LLMss is compared with the baseline performance.

Metrics
To validate a fine-tuned LM, only the NL task goal of each validation sample and optionally, the context is fed to the RobLM generation pipeline (see Figure 3).Validation is performed over each task category rather than all the validation data.This enables the analysis of a task-dependent performance: some task categories are more complex than others leading to a longer trajectory of actions and hence an increased difficulty.
Two metrics are defined for validation: LM accuracy and plan success rate.
Definition -Accuracy.Accuracy measures how accurately the LM is able to predict the following parts of the plan: • the correct count and names of all actions in the plan (action accuracy) • the correct count and names of all arguments in the plan (argument accuracy) • the correct count and names of all actions and arguments in the plan ("full plan" accuracy) For a found plan, the accuracy of actions and arguments counts if all actions or arguments are correct.With this metric, it is possible to anchor the cause of plan failure to either the actions or the arguments, or both.
Having an accurate LM does not necessarily mean that the generated plan leads to success -at least, as long the "full plan" accuracy is below 1.0, i.e., the trajectory is not replicated perfectly.A second metric is required that measures the actual success rate of the fine-tuned LM in simulation.There are two possible scenarios that justify this additional metric: First, the plan could fail in simulation, even if it seems accurate.And second, the plan could succeed in simulation, even if the plan is not completely accurate.
Definition -Success rate.The success rate is a measure of the successful completion of individual sub-tasks of a validation task.After loading the trajectory, environment state, and goal from the validation sample into the AI2-THOR simulator, the actions predicted by the LM are translated into low-level controller actions via task and geometric grounding ( §4.2.4), which are then passed to the AI2-THOR controller and executed in the simulator.After every simulator step, a check is performed to determine whether the target conditions for sub-task completion have been met.
If the target conditions are kept unsatisfied after execution of the last low-level action, it counts as a success towards the sub-task, or otherwise, as a failure.

Baseline
A baseline is an oracle, or upper bound, that serves as a measurement reference.Fast Downward (FD) [57] is used as the baseline for evaluation.We consider a classical task planner like FD appropriate since it also has access to the full domain and is a complete algorithm [57].Therefore, the ability of a RobLM to match or outperform FD (for a given time budget) would reveal whether LMs can be helpful towards learning task planning.Every ALFRED validation sample comes with a PDDL problem file, while the PDDL domain is shared by all tasks; this allows the PDDL planner to generate a plan for each sample.
To generate a plan using FD, the PDDL problem files provided by ALFRED have to be pre-processed.FD is able to handle Action Description Language (ADL) instructions, as found in the PDDL problem, but is not able to process optimization-related additional information present in the files.

Hardware setup
For fine-tuning LMs and evaluating each model, we used the Lichtenberg Cluster of TU Darmstadt, which contains stacks of NVIDIA® A100 and V100 GPUs.Internal tests have shown that a single GPU can decrease the training time by a factor of 10 (these tests are not representative because performance depends on every hardware component).To run experiments in the AI2-THOR simulation, we used a PC with an NVIDIA® RTX 3080Ti GPU.

Instruction grounding
Grounding van be defined as mapping a high-level, abstract, or symbolic representation to a low-level (ground) representation.Grounding of an abstract plan to objects is called object or geometric grounding (or "world grounding"), and grounding of NL to robot tasks is called task grounding.In this case, instructions generated by the LM are made up of actions that require a task grounding, and arguments, which require a geometric grounding.

Task grounding
Plans generated by RobLM consist of high-level actions and are not directly executable by the AI2-THOR controller.Each possible action predicted by the LM has to be grounded to a task, which then translates to a sequence of low-level controller actions.
For task grounding, three possible types of tasks are defined: navigation, manipulation and composite.In a navigation task, the agent is required to move from one to another location.In a manipulation task, the agent performs an action affecting the environment state.Composite tasks are a composition of manipulation tasks that need to be completed in a specific order.
Task grounding is performed as follows: • The action GotoLocation is grounded to the navigation task and delegated to a trajectory planner for navigation (see below).
• The actions PickupObject, PutObject, ToggleObject and SliceObject are grounded to the manipulation task, the actions can be directly executed by the low-level controller.
For example, the action HeatObject in a scene containing a microwave would be grounded to the composite task and translate to a series of low-level actions that involve opening the microwave (ToggleObject), putting the target object inside (PutObject), closing the microwave (ToggleObject), waiting (no action), opening the microwave (ToggleObject), picking up the heated target object (PickupObject) and finally closing the microwave (ToggleObject).

Geometric grounding
An argument can be anything from a location to an object name.An argument produced by the LM might be ambiguous or non-existing in the environment.In order to be understood by the controller, these arguments have to be grounded on a geometric level.
For grounding arguments, first, all available objects are retrieved from the simulation.Then, the world coordinates of all objects matching the predicted symbol (target object) are gathered.E.g., if the predicted target object is 'soap', the position of all 'soap'-type objects can be queried and retrieved from the simulator.The low-level control commands are finally generated with the help of the ground-truth navigation graph of the scene.

Navigation
By overlaying the world with a grid, every position in the world is given a discrete coordinate.A navigation graph (not to be confused with a scene graph or Graph2NL graph) creates a node for each coordinate and connects all the nodes that are accessible one from another.
Similar to the procedure of Graph2NL ( §3.8), the navigation graph is traversed after locating the agent and target node by the object name.A search algorithm is used to find the shortest path in the graph from the agent to the target object -in this case, it is the A* algorithm 6 .The search returns a sequence of nodes, which corresponds to a sequence of coordinates (a trajectory).Lastly, a motion planner takes the trajectory as an input and outputs a sequence of low-level controller actions (AI2-THOR conveniently provides a motion planner for navigation).

Experimental results
A set of fine-tuned RobLM models are evaluated against the baseline.The fine-tuned models differ in the amount of context provided during training time:  Given these results, the following conclusions about the examined models can be made: 2. The LM is able to learn the structure of tasks, but not scene-dependent components.
RobLM is able to distinguish between the task categories and provide a correct task action plan.However, where this model fails is in finding all correct action arguments, i.e., locations and object names.This can be explained by the fact that the task goal alone does not reveal the actual location of the target object.Because the target can be in any accessible location in the environment, or in any accessible receptacle, the produced argument is the result of the LM imitating the most-likely cases observed in the training data.
Overall, these results are consistent with the point made on contextual information and prediction accuracy: giving the model information about the environment, i.e., fine-tuning a model to be grounded to the scene, does improve performance.

Results -Success rate
A plan is successful if each of the sub-tasks for a stated task is completed.Table 2 summarizes the success rate of RobLM compared to the baseline for actions of the navigation task (GotoLocation) and manipulation task (PickupObject, PutObject, etc.).Composite tasks have been omitted from this evaluation because of high failure rates caused by their task grounding complexity.
Regarding geometric grounding, arguments predicted by RobLM are grounded to all matching objects in the world, and RobLM is allowed to "try" all possibilities.E.g., when the objective is to "Get soap", multiple 'soap'-type objects could exist in the scene.Each possibility given by the geometric grounding is simulated by storing and restoring the simulator state.While this is a clear advantage for RobLM over the baseline, the evaluation still holds because the LM is required to predict the correct location or object names.
Based on the presented results, the LM-based system performs exceptionally well on sub-tasks requiring the action PickupObject, while the action PutObject does not succeed equally well, since it does not come close to the baseline performance.
Overall, the success rate of the baseline method is not nearly as high as expected, hinting at potential implementationspecific failures in the task grounding and in the low-level controller interaction with objects.In the low-level controller, visual information is not included.This means that the robot is controlled in a "blind flight" mode.The AI2-THOR simulation requires the target object to be in view.If the object is not visible, e.g., because the agent is looking in the wrong direction, the interaction fails and with it, the sub-task.Because of the fact that both systems have been evaluated within the same framework, these results do not dismiss a potential use-case for LM in planning.

Additional Results
We provide additional experiments for a deeper analysis of potential points of failure of RobLM.These experiments entail a different sampling strategy and context refinement.Top-k / top-p sampling So far, every experiment conducted has used a greedy next-token selection strategy.In order to be able to tell with certainty that a found plan is the "best possible" plan, a comparison with another sampling strategy is required.This additional experiment repeats the previous one, but this time with a top-k and top-p sampling strategy.
The comparison is done with the 'No context' RobLM model, for all tasks with k = 10 and p = 0.9, i.e., tokens are sampled from the top-10 predictions and sum up to a probability ≥ 0.9.
Since a similar pattern was observed in the individual task evaluations, Table 3 reports the results for the 'Pick Simple' task only.Each token is sampled three times, giving three possible solutions to be evaluated.Slight -uniform and hence dismissable -variations in the prediction accuracy exist between these three runs.The sampling-based method performs slightly worse than the greedy strategy.Refined context In a deeper analysis of RobLM failure cases, it has been found that the first argument in the generated plan is the hardest to predict correctly by the LM.The LM is not able to draw enough conclusions about the first instruction from the supplied context of any form.This causality becomes obvious after the following experiment: Given the task goal and a NL description of the first instruction as context, how does the overall accuracy of the LM change?The following text is an example of an instruction description in NL, as found in the ALFRED dataset: Turn left and walk across the room towards the shelves on the wall.
The results in Figure 6 show that, given this extra information, RobLM is almost able to reach the performance levels of the baseline measurement across all tasks; it shows very high accuracy on "full plan" actions and arguments.The conclusion of this experiment is that the more precisely the supplied context is tailored towards the key issue of LM generation task, the more accurate the generated plan becomes.For this specific problem, finding the correct first argument is key to a successful plan, and with a NL description of the first instruction, the LM is able to draw the necessary connections from context to plan.
The overall conclusion of this observation is that LM are adaptive.In other words: LM are able to adapt new information into the plan generation, towards a more accurate sequence of instructions.

Run-time analysis
Time is an important factor when it comes to real-life applications.This is especially true for industrial robotics, where cycle times are important.But not every robotic application is time-critical, e.g., a household robot is not expected to respond in a sub-second time.However, if task planning is seen as a programming problem, a fast execution time greatly enhances the operator experience.
Table 4 shows a comparison of the inference speeds of RobLM against the baseline (FD).RobLM, in all cases, is very slow compared to the baseline, which is likely due to the reliance on the full GPT-2 vocabulary size for the LM tokenizer and the usage of a LM-internal, implementation-specific generation function 7 .

Conclusion
We presented a framework for finetuning grounded Large Language Models (LLMs) and investigated the applicability of such models combined with planning in solving ling-horizon robot reasoning tasks.This paper has shown that LLMs can extract commonsense knowledge through precise queries and adjust their behavior based on available information or context.Among our contributions are the development of RobLM, a grounded finetuned LLM that generates plans directly from natural language commands, and Graph2NL, which creates natural language text describing graph-based data, to represent scene graphs as inputs into RobLM.Our extensive experimental results have revealed, nevertheless, the challenges in representing structured and geometric data in natural language.
However, LLMs still need to demonstrate a consistent ability to perform long-horizon planning tasks and, therefore, cannot replace classical planning systems.Despite their limitations, LLMs possess powerful features such as efficient storage and retrieval of commonsense knowledge, which can be useful in planning tasks when presented with partially observable environments.

Figure 1 :
Figure 1: AI2-THOR simulator rendering a sample rollout from the ALFRED [26].The scenes show a room with household objects and the robot executing a task.Note that the robot does not have an arm, and the object automatically floats in front of the camera; it interacts with the environment through discrete actions.The discrete actions are shown underneath each frame in the form of PDDL commands.

Figure 2 :
Figure2: Decoder-only Transformer architecture.The input to the decoder is tokenized text, and the output is probabilities over the tokens in the tokenizer vocabulary.The positional encoding is added to the embedded input to account for the order.Transformer's decoder can have multiple transformer blocks, each of which contains multi-head attention with linear layers and layer normalization.GPT models uses GeLU activation between the last linear layers.

Figure 3 :
Figure 3: Illustration of a forward pass through RobLM for text generation with a greedy next-token selection strategy.The forward passes are repeated in a recursive manner until an end-of-text token is encountered or the defined sequence limit is reached.

Figure 4 :
Figure 4: Graph2NL example graph.After locating the root ('agent') and target node ('soapbar'), the shortest paths connecting those nodes are found and summarized in NL by mapping all edge attributes along the path.

1 .'
No context' -Only task goal 2. 'Scene knowledge' -List of all available objects in the environment, found in the PDDL problem 3. 'Scene graph' -Description of geometric relations to the target object, generated by Graph2NL 4. 'Full context' -Description of geometric relations of all objects, generated by Graph2NL Given a PDDL problem file, Graph2NL automatically generates the context in the specified text format.This context is provided to the LM for training and inference.

Figure 5 :
Figure 5: Prediction accuracy of actions and arguments for previously unseen data across a set of tasks.Neither RobLM model is able to outperform the baseline (blue) but shows high accuracy in the prediction of plan actions.Context-driven models (green, red, and purple) perform better than the model without any scene-related context (orange).

Figure 6 :
Figure 6: Prediction accuracy of actions and arguments for unseen tasks of a model with a refined context.This experiment compares a model fine-tuned to the task goal and a context consisting of a NL description of the first plan instruction (green) with the baseline (blue) and previous RobLM 'No context' model (orange).

Table 1 :
Graph2NL mapping table.Distances are mapped to NL vocabulary (or a symbol) in a one-to-one relation.Yaw describes the orientation along the surface normal when viewed from a top-down perspective, and Pitch describes the z-planar offset (altitude) in relation to the origin.
a washed slice of apple on the table.Put a washed slice of apple on the table.Put a washed slice of apple on the table.... Similar behavior has been observed for other LM falling into the ZSL category.The input sentence is not an open-ended question and requires a high level of reasoning capabilities.

Table 2 :
Success rates of sub-task completion in simulation -RobLM ('No context') compared to the baseline on seen and unseen validation data.

Table 3 :
Top-k and top-p sampling (k = 10 and p = 0.9) -tokens are sampled three times for the 'Pick Simple' task, giving only slight deviations in the final accuracy.

Table 4 :
Comparison of inference speeds -RobLM against baseline.GPU acceleration used for LM (NVIDIA® GeForce RTX 2080 SUPER).The timer starts only after the program or model has been loaded into memory, i.e., only computation (inference) time is measured.'No context' has a maximum token sequence length of 200 and 'Full context' has a maximum length of 1024 tokens for generation.Iterations per second (average over 800 samples) Baseline RobLM 'No context' RobLM 'Full context'