A human-centered safe robot reinforcement learning framework with interactive behaviors

Deployment of Reinforcement Learning (RL) algorithms for robotics applications in the real world requires ensuring the safety of the robot and its environment. Safe Robot RL (SRRL) is a crucial step toward achieving human-robot coexistence. In this paper, we envision a human-centered SRRL framework consisting of three stages: safe exploration, safety value alignment, and safe collaboration. We examine the research gaps in these areas and propose to leverage interactive behaviors for SRRL. Interactive behaviors enable bi-directional information transfer between humans and robots, such as conversational robot ChatGPT. We argue that interactive behaviors need further attention from the SRRL community. We discuss four open challenges related to the robustness, efficiency, transparency, and adaptability of SRRL with interactive behaviors.


Introduction
Deep learning has shown impressive performance in recent years [42].By leveraging deep learning, Reinforcement Learning (RL) has achieved remarkable successes in many scenarios and superhuman performance in some challenging tasks [25,24], e.g., autonomous driving [23], recommender system [81], robotics [10], games [65,16,26], and finance [71].Most RL methods aim to maximize reward performance without considering safety constraints.However, safety is critical when deploying RL in real-world applications, especially in robotics.In Robot RL (RRL), a robot interacts with static or dynamic environments to learn the probability of better actions.When humans are also part of the robot's environment, ensuring their safety is crucial.This paper proposes a framework to achieve Safe Robot RL (SRRL) by leveraging interactive behaviors.
Interactive behaviors are behaviors that can mutually influence the interacting elements.Interaction is everywhere in human life [35], and agentenvironment interaction is the basis of RL [70].When humans and robots act in a shared environment, their actions can be influenced by each other through interactive behaviors [53,32,72,52,31].For example, a robot navigating in a public space can plan its path to avoid collisions with pedestrians and provide signals to pedestrians to move aside.The pedestrians can also plan their path to avoid collisions with the robot and provide signals to the robot to move aside.Figure 1 shows a schematic of interactive behaviors consisting of three elements: robots, environment, and humans.The outer loop consists of feedback or signals from the robots to the humans and vice versa.The two inner loops consist of actions from robots and humans in the environment and feedback or rewards from the environment to robots and humans.Here, we use the term "feedback" to mean any type of information transfer between the two interacting elements.Interactive behaviors can lead to better SRRL by enabling bi-directional information transfer between the interacting elements.It is a core technology to improve the dialogue performance of a Large Language Model (LLM), e.g., ChatGPT [60] by leveraging interactive behaviors.For example, in ChatGPT, on the one hand, high-quality data from human feedback is collected to design the reward model [20,66].On the other hand, after having human feedback, the agent model will be trained to align human values, and then give safe feedback to humans .In most conventional SRRL approaches, there is no human interaction, e.g., CPO [1] and ATACOM [49].With interactive behaviors, the robot can learn about human behaviors and convey its features and decision-making processes to humans.Those approaches that consider the human-in-the-loop of the robot's learning process do not utilize feedback from the robot learner to the human teacher.

Action Feedback
In this paper, we investigate interactive behaviors to achieve three stages of human-centered SRRL: "safe exploration", "safety value alignment", and "safe human-robot collaboration".In the "safe exploration" stage, the robot must explore the unknown state space while preserving safety.In the "safety value alignment" stage, the robot has to align its intentions with the humans.Finally, in the "collaboration" stage, the robot should contribute towards achieving shared goals with humans.

Related Work on Safe Robot-Reinforcement Learning
Safe robot learning has received substantial attention over the last few decades [75,7,30,54,38].SRRL methods focus on robot action and state optimization and modeling to ensure the safety of robot learning.For instance, Gaussian models are used to model the safe state space [2,68,74,8,69,77]; formal methods are leveraged to verify safe action and state space [19,33,79]; control theory is applied to search safe action space [13,14,55,34,45].
Anayo et al. [2] introduced a reachability-based method to learn system dynamics by using Gaussian models, in which the agent can adaptively learn unknown system dynamics and the maximal safe set.Berkenkamp et al. [8] proposed a region of attraction method to guarantee safe state space based on Gaussian processes and Bayesian optimization.Although the method provided a theoretical analysis for safe robot learning, the assumptions in the study are quite strong and may not be applicable in practical environments.Sui et al. [68] present a safe exploration method considering noise evaluations, in which the safety is ensured with a high likelihood via a Gaussian process confidence bound.Moreover, the sample complexity and convergence of the method are analyzed, and real-world applications, such as movie recommen-dations and therapeutic spinal cord stimulation, are used to test how well the system works.Nonetheless, they considered the safe exploration as a bandit setting without any constraints [74].Turchetta et al. [74] developed an algorithm in which Gaussian processes are leveraged to model the safe constraints.However, a set of starting safe states is required from which the agent can begin to explore in the work of Sui et al. [68] and Turchetta et al. [74].
Some recent works have investigated the application of formal methods for SRRL.Nathan and Andre [19] used formal verification to check the correctness of the state transition model and select safe action during RL.They allow unsafe actions if the model is incorrect.Researchers have proposed three types of provably safe RL methods for hard safety [36]: action mask, action replacement, and action projection.Action masking approaches apply a safety layer to restrict the agent's actions to safe actions only [37].Action replacement approaches replace unsafe actions with safe actions [28].Action projection methods project the unsafe actions to close safe actions [33].
Control theory based SRRL approaches have utilized Lyapunov functions and Model Predictive Control (MPC).Chow et al. [13,14] introduced Lyapunov functions based on discrete and continuous control methods for the global safety of behavior policy in RL.However, designing Lyapunov functions for different environments may be difficult, and the requirement of a baseline policy may be challenging to satisfy during real-world applications.Koller et al. [34] proposed an MPC based method to ensure safe exploration using a statistical model of the system.Marvi and Kiumarsi [55] introduced a control barrier function method for safe off-policy robot learning without needing to have a thorough understanding of system dynamics, in which the cost functions are augmented by a control barrier function.
Most of the above methods do not consider human-robot interaction.There are some prior works that investigate human-centered SRRL.For instance, Kazantzidis1 et al. [31] introduced a mechanism to ensure safety during exploration by harnessing human preferences.Reddy et al. [61] present a method to learn the model of human objectives by leveraging human feedback based on hypothetical behaviors, and then the model can be used to ensure the safety of robot learning.Saunders et al. [63] try to guarantee reinforcement learning safety by human interventions, where human interventions are learned through a supervised learning model.However, most of these works do not consider the mutual influence of humans and robots in shared environments.Thus, interactive behaviors between humans and robots that leverage bi-directional information transfer, as shown in Figure 1, still need further investigation to ensure SRRL.Our proposed human-centered SRRL framework, as shown in Figure 2, consists of three stages: safe exploration, safety value alignment, and safe collaboration.When a robot enters a new environment, it must explore it through safe action optimization and state modeling.Then, the robot has to learn the safety value alignment from the human-robot interaction.Finally, the robot should be able to achieve safe collaboration with humans.Interactive behaviors can help successfully achieve each of these three stages.

Safe Exploration
A robot should explore and find helpful information when entering a new environment.During the exploration, the robot must ensure its and the environment's (including other agents') safety.Therefore, the first stage of human-centered SRRL, safe exploration, focuses on exploring new environments and getting information about the environment model through safe actions.Uncertainty about the environment makes it difficult to determine safe actions.For example, newborn infants sometimes end up putting harmful objects in their mouths during oral exploration.It is challenging to ensure safe actions using conventional RL methods that learn new skills through trial and error.Especially with model-free RL, a robot cannot avoid a destructive action unless it has already tried it (or a similar action) [63].
Several methods have been proposed to address the challenge of safe exploration for RL [9,21,49,48,73,78,51].For example, Garcia and Fernández [21] introduced a safe exploration method, in which a predefined baseline policy is required, and the baseline policy is approximated by behavioral cloning methods [3].Nonetheless, derived from the method, it would be hard to search for the optimal exploration policy, and the capabilities of the baseline policy could severely limit the method's performance.A conservative safety critic [9] is proposed to guarantee safety with high probability during robot exploration processes.Garcia and Fernández [21] introduced a smoother risk function for safe robot exploration, which can achieve monotonic reward improvement and ensure safety.This method needs a predefined baseline policy to explore the safe space.Liu et al. [49] developed a safe exploration method for robot learning by constructing a constrained manifold.This method can guarantee safety for robot exploration using model-free RL.However, it requires an accurate robot model or a perfect tracking controller, which may hinder their method's real-world applications.
Interactive behaviors can be used to transfer expert knowledge from the human to the robot for safe exploration, just like infants are safeguarded by their parents.Human intervention has been investigated to avoid catastrophes in RL [63], and human feedback can be a reference for RL to ensure robot safety during safe exploration [17].However, the amount of human labor required for complex real-world applications is prohibitive.The bi-directional feedback in interactive behaviors can reduce the human-time cost during safe exploration.On the one hand, the robot can actively query the human and provide explanations of its behavior.On the other hand, the human can also actively query the robot to learn about the robot's model, in addition to intervening for safe exploration.

Safety Value Alignment
In the second stage of safe robot learning, to train a robot to perform a task safely, we need to evaluate how well it performs in terms of its performance on safety.Whether it is a form of costs, rewards, or labels, we need some form of signals to guide the safety policies for robot learning.In some scenarios, the safety performance can be evaluated automatically, such as bumping into other vehicles of autonomous driving or breaking the arms in robot manipulation.In these cases, training signals can be straightforwardly defined for safe robot learning.Furthermore, to facilitate the safe deployment of agents in real-world tasks, agents also need to be compatible with users' ethical judgment.Taking automated vehicles as an example, should one autonomous vehicle cut in line to maximize its reward in achieving a goal?Instead of maximizing only the reward, the agents need the capacity to abide by human moral values, which is essential but lacks effort.
Current robot learning algorithms rely on humans to state these training signals [25,80] and assume that humans understand the dangers.For example, imitation learning inferences a reward function from human demonstrations; preference-based learning guides the robot based on human judgments.These classes of tasks involve "human" training signals.The related problem of how to align has been discussed in earlier literature [44,15,50,31] on how to align agents with user intentions, in which meaningful training signals can be hard to obtain, due to the unpredictable long-term effect of the behaviors, or potential influence to other agents and environments in large multi-agent systems.
Designing AI agents that can achieve arbitrary objectives, such as minimizing some cost or penalties, can be deficient in that the systems are intrinsically unpredictable and might result in negative and irreversible outcomes for humans.In the context of interactive learning, we consider how a robot can behave safely or align with the user's intentions whilst maintaining safety under interactive behaviors with humans, and we frame this as the safety value alignment problem: how to create robots that behave safely and align with the human's intentions?
Interactive behaviors allow agents to infer human values.While agents infer human values from their feedback, bi-directional feedback enables the agent to explain its decision-making process.One early attempt [80] studied bi-directional communication in tabular-based navigation tasks without considering more practical scenarios.The next step aims to study the generalization of such results to large-scale problems via more efficient algorithms.

Safe Human-Robot Collaboration
The third stage of our safe-robot learning framework aims to accomplish safe physical and cognitive collaboration between robots and humans.Collaboration has enabled humans to achieve great evolutionary success.Therefore, safe human-robot collaboration is essential for successful human-robot coexistence.
The four main categories of human-robot collaboration tasks explored in the literature are collaborative assembly, object handling, object handovers, and collaborative manufacturing [64].RL has been used for tuning impedance controllers in physical human-robot collaboration tasks such as lifting objects [62] and guided trajectory following [59].However, these works evaluated the controllers in simplified scenarios and did not evaluate the generalizability of the learned policies.RL has also been used for performing robot-to-human object handovers [40] and human-to-robot object handovers [12].Nevertheless, in some cases, the spatial generalizability of learned policies is low [39].Ghadirzadeh et al. [22] used deep q-learning to generate proactive robot actions in a human-robot collaborative packaging task.However, they only evaluated a specific task scenario with a highly engineered reward function.Also, they did not test the trained policy in the real world and for different human participants than the training set.
Deep RL methods have been applied in real-world learning scenarios for tasks like quadrupedal walking, grasping objects, and varied manipulation skills [29].One of the desired features of these works is the ability to perform training with little or no human involvement.However, scenarios of humanrobot collaboration typically involve multiple humans in the robot's learning process.Multi-agent RL methods such as self-play or population-play do not perform very well with human partners [11].One proposed solution called Fictitious Co-Play (FCP) involves training with a population of self-play agents and their past checkpoints taken throughout training [67].However, FCP was evaluated only in a virtual game environment.
Interactive RL (IRL) approaches involve a human-in-the-loop to guide the robot's RL process.IRL has been applied to various human-computer interaction scenarios [4].In addition, human social feedback in the form of evaluation, advice, or instruction has also been utilized for several robot RL tasks [47].However, more research is needed towards utilizing IRL for safe human-robot collaboration.Also, while some works have explored non-verbal cues to express the robot's uncertainty during the learning process [56], most existing IRL approaches do not involve feedback from robots to humans.As depicted in Figure 2, evaluative feedback from robots to humans could help improve human-robot collaboration.

Open Challenges
In this section, we describe four key open challenges towards utilizing interactive behaviors for SRRL.These open challenges are related to the robustness, efficiency, transparency, and adaptability of SRRL.
1. How can the robot learn robust behaviours with potential human adversaries?
2. How to improve data efficiency of SRRL for effective utilization of interactive behaviors?
3. How to design "transparent" user interfaces for interactive behaviors?
4. How to enhance adaptability of SRRL for handling multiple scenarios of interactive behaviors?
The first challenge is to achieve robust SRRL with respect to unintentional or intentionally erroneous human conduct in interactive behaviors.In existing SRRL methods, it is often neglected that humans might misstate the safety signals.Also, due to the potential involvement of multiple humans with different values, robots need to learn to strike a balance between them.In the extreme case, adversaries may intentionally state their signal to mislead the training of robots to achieve malicious objectives.Training robust agents against such malicious users needs further research.Adversarial training may be useful to ensure safety in such scenarios [57].However, adversarial training is not yet ready for real-world robot learning [41].
The second challenge is to improve the data efficiency of SRRL, given that real-world interactive behaviors are expensive.Data efficiency can determine how quickly robots learn new skills and adapt to new environments and how effectively interactive behaviors can be utilized in the learning process.The success of machine learning can be attributed to the availability of large datasets and simulation environments.Therefore, one possible solution to reduce the need for real-world interactive data is to build large datasets or simulations of interactive behaviors.For example, Lee et al. [43] present a mixed reality (MR) framework in which humans can interact with virtual robots in virtual or augmented reality (VR/AR) environments.It can serve as a platform for collecting data in various human-robot interaction and collaboration scenarios.However, this framework suffers from the limiting aspects of MR, such as the inconvenience of wearable interfaces, motion sickness, and fatigue.Improvements in MR technology will be crucial for the widespread use of such MR environments.
The third challenge is to maintain transparency during SRRL.Transparency is important for the effective utilization of interactive behaviors.James et al. [53] provide empirical results to demonstrate human feedback and robot policy can be interactively influenced by each other, and indicate that the assumption is that the feedback from a human is independent of the robot's current policy, may be incorrect.Transparency can be achieved in the context of interactive behaviors by explaining the robot's decision-making process to the human and exposing the robot's internal state.Several solutions have been proposed to improve the explainability of robot RL [53,56,27,46,76,6,5].For example, Likmeta et al. [46] introduced an interpretable rule-based controller for the transparency of RL in transportation applications.Nonetheless, the policy that the method provides may be too conservative, since it severely depends on the restricted rules.Matarese et al. [56] present a method to improve robot behaviors' transparency to human users by leveraging emotional-behavior feedback based on robot learning progress.However, further research is needed towards communicating the robot's internal state.Non-verbal communication in the form of gazes and gestures can be leveraged to communicate the robot's internal state.For example, if the robot is uncertain about its decisions, it can show hesitation gestures.
The fourth challenge is to enhance the adaptability of safe robot learning to handle a variety of settings involving interactive behaviors.As shown in Figure 3, the robot can encounter different environments.0 denotes completely trusted environments, in which the environment information is known and the environment is stable; 1 denotes Unstable environments for which environment information is known; 2 represents Uncertain environments, where the information about the environment is uncertain and partial; 3 represents Unknown environments, in which the environment information is E2 E3 E0 E1 Figure 3: Trusted Environments (E0) and "3U" Environments (E1: Unstable, E2: Uncertain, E3: Unknown).As the color gradually changes from gray to green, the information observed is more and more abundant, and the environment is more and more certain.

completely unknown.
Safety can be ensured in trusted environments, e.g., robots can safely grasp an object in a static environment.However, in "3U" environments, ensuring SRRL is challenging.For instance, during sim-to-real transfer, the discrepancies between the simulation models and real-world models are inevitable [58], and the real-world environments are replete with uncertain disturbances and unknown information, e.g., in a multi-agent system, guaranteeing each agent's safety may be difficult.Some works provide a potential direction to ensure multi-robot learning safety in unstable environments, for example, Multi-Agent Constrained Policy Optimization and Multi-Agent Proximal Policy Optimization Lagrangian [24].Nevertheless, the exploration involving interactive behaviors between agents and environments can be intricate and time-intensive.The incorporation of human insights and value alignment in the exploratory phase is instrumental in enhancing the adaptability of these agents within a human-in-the-loop learning system.As we envisage future trajectories of research, a salient focus is the attainment of SRRL characterized by interactive behaviors in "3U" environments.In this vein, game theory [18] emerges as a pivotal tool, offering nuanced strategies and frameworks for optimizing agent-environment interactions.Concurrently, the integration of advancements in cognitive science is anticipated to play a quintessential role.Specific areas of interest encompass the optimization of information management protocols between humans and robotic agents and the augmentation of robotic cognitive faculties through the effective utilization of perception devices.These integrated approaches aim to engender a more seamless, efficient, and adaptive interaction paradigm, catalyzing enhanced performance and adaptability in complex, dynamic environments.
However, the exploration with interactive behaviors between agents and environments may be time-consuming.Involving human knowledge and value alignment in the exploration can improve its adaptability in a human-loop learning system.Future work to achieve SRRL with interactive behaviors in "3U" environments can leverage game theory [18] and advances in cognitive science, for instance, how to manage the information between humans and robots, and robots how to leverage perception devices to enhance its cognitive abilities.

Conclusion
Robots need to ensure safety while leveraging RL in human environments.However, conventional SRRL algorithms do not consider the mutual influence between the robot and the human.In this paper, we proposed a human-centered SRRL framework consisting of three stages: safe exploration, value alignment, and human-robot collaboration.We discussed how these stages can leverage mutual influence or bidirectional information transfer between the robot and the human through interactive behaviors.Furthermore, we described four key open challenges related to the robustness, efficiency, transparency, and adaptability of SRRL for effective utilization of interactive behaviors.