ORIGINAL RESEARCH article

Front. Robot. AI, 12 February 2026

Sec. Computational Intelligence in Robotics

Volume 12 - 2025 | https://doi.org/10.3389/frobt.2025.1734564

This article is part of the Research Topic: Advanced Sensing, Learning and Control for Effective Human-Robot Interaction.

Adaptive querying for reward learning from human feedback

  • Oregon State University, Corvallis, OR, United States

Learning from human feedback is a popular approach to train robots to adapt to user preferences and improve safety. Existing approaches typically consider a single querying (interaction) format when seeking human feedback and do not leverage multiple modes of user interaction with a robot. We examine how to learn a penalty function associated with unsafe behaviors using multiple forms of human feedback, by optimizing both the query state and feedback format. Our proposed adaptive feedback selection is an iterative, two-phase approach that first selects critical states for querying, and then uses information gain to select a feedback format for querying across the sampled critical states. The feedback format selection also accounts for the cost and probability of receiving feedback in a certain format. Our experiments in simulation demonstrate the sample efficiency of our approach in learning to avoid undesirable behaviors. The results of our user study with a physical robot highlight the practicality and effectiveness of adaptive feedback selection in seeking informative, user-aligned feedback that accelerates learning. Experiment videos, code and supplementary materials can be found on our website: https://tinyurl.com/AFS-learning.

1 Introduction

A key factor affecting an autonomous agent’s behavior is its reward function. Due to the complexity of real-world environments and the practical challenges in reward design, agents often operate with incomplete reward functions corresponding to underspecified objectives, which can lead to unintended and undesirable behaviors such as negative side effects (NSEs) (Amodei et al., 2016; Saisubramanian et al., 2021a; Srivastava et al., 2023). For example, a robot optimizing the distance traveled to transport an object to its goal may damage items along the way if its reward function does not model the undesirability of colliding with other objects in its path (Figure 1).


Figure 1. An illustration of adaptive feedback selection. The robot arm learns to move the blue object to the white bin, without colliding with other objects in the way, by querying the human in different formats across the state space.

Human feedback offers a natural way to provide the missing knowledge, and several prior works have examined learning from various forms of human feedback to improve robot performance, including avoiding side effects (Cui and Niekum, 2018; Cui et al., 2021b; Lakkaraju et al., 2017; Ng and Russell, 2000; Saran et al., 2021; Zhang et al., 2020). In many real-world settings, the human can provide feedback in multiple forms, ranging from binary signals indicating action approval to corrections of robot actions, each varying in the granularity of information revealed to the robot and the human effort required to provide it. For instance, a person supervising a household robot may occasionally be willing to provide detailed corrections when the robot encounters a fragile vase but may only want to give quick binary approvals during a routine motion. Ignoring this variability either limits what the robot can learn or burdens the user. To efficiently balance the trade-off between seeking feedback in a format that accelerates robot learning and reducing the human effort involved, it is beneficial to seek detailed feedback sparingly in certain states and complement it with feedback types that require less human effort in other states. Such an approach could also reduce the sampling biases associated with learning from any one format, thereby improving learning performance (Saisubramanian et al., 2022). In fact, a recent study indicates that users are generally willing to engage with the robot in more than one feedback format (Saisubramanian et al., 2021b). However, existing approaches rarely exploit this flexibility and do not support gathering feedback in different formats in different regions of the state space (Cui et al., 2021a; Settles, 1995).

These practical considerations motivate the core question of this paper: “How can a robot identify when to query and in what format, while accounting for the cost and availability of different forms of feedback?” We present a framework for adaptive feedback selection (AFS) that enables a robot to seek feedback in multiple formats in its learning phase, such that its information gain is maximized. Rather than treating all states and feedback formats uniformly, AFS prioritizes human feedback in states where feedback is most valuable and chooses feedback types based on their expected cost and information gain. This design reduces user effort, accommodates different levels of feedback granularity, and focuses on states where learning improves safety. In the interest of clarity, the rest of this paper grounds the discussion of AFS as an approach for robots to learn to avoid negative side effects (NSEs) of their actions. NSEs refer to unintended and undesirable outcomes that arise as the agent performs its assigned task. In the object delivery example in Figure 1, the robot may inadvertently collide with other objects on the table, producing NSEs. Focusing on NSEs provides a well-defined and measurable setting (quantified by the number of NSE occurrences) to evaluate how AFS improves an agent’s learning efficiency and safety. However, note that AFS is a general technique that can be applied broadly to learn about various forms of undesirable behavior.

Minimizing NSEs using AFS involves four iterative steps (Figure 4): (1) states are partitioned into clusters, with each cluster’s weight proportional to the number of NSEs discovered in it; (2) a set of critical states (states where human feedback is crucial for learning an association between state features and NSEs, i.e., a predictive model of NSE severity) is formed by sampling from each cluster based on its weight; (3) a feedback format that maximizes the information gain in the critical states is identified using the human feedback preference model, while accounting for the cost of and uncertainty in receiving feedback; and (4) cluster weights and information gain are updated, and a new set of critical states is sampled to learn about NSEs, until the querying budget expires. The learned NSE information is mapped to a penalty function and added to the robot’s model to compute an NSE-minimizing policy for completing its task.

We evaluate AFS both in simulation and in a user study where participants interact with a robot arm. First, we evaluate the approach in three simulated proof-of-concept settings with simulated human feedback. Second, we conduct a pilot study where 12 human participants interact with and provide feedback to the agent in a simulated gridworld domain. Finally, we evaluate using a Kinova Gen3 7DoF arm and 30 human participants. Besides performance and sample efficiency, our experiments also provide insights into how the querying process can influence user trust. Together, these complementary studies demonstrate both the practicality and effectiveness of AFS.

2 Background and related work

2.1 Markov Decision Processes (MDPs)

MDPs are a popular framework for modeling sequential decision-making problems. An MDP is defined by the tuple M = ⟨S, A, T, R, γ⟩, where S is the set of states, A is the set of actions, T(s, a, s′) is the probability of reaching state s′ ∈ S after taking action a ∈ A in state s ∈ S, R(s, a) is the reward for taking action a in state s, and γ is the discount factor. An optimal deterministic policy π*: S → A is one that maximizes the expected reward. When the objective or reward function is incomplete, even an optimal policy can produce unsafe behaviors such as side effects. Negative Side Effects (NSEs) are immediate, undesired, unmodeled effects of an agent’s actions on the environment (Krakovna et al., 2018; Saisubramanian and Zilberstein, 2021; Srivastava et al., 2023). We focus on NSEs arising due to an incomplete reward function (Saisubramanian et al., 2021a), which we mitigate by learning a penalty function from human feedback.

2.2 Learning from human feedback

Learning from human feedback is a popular approach to train agents when reward functions are unavailable or incomplete (Abbeel and Ng, 2004; Ng and Russell, 2000; Ross et al., 2011; Najar and Chetouani, 2021), including to improve safety (Brown et al., 2020b; 2018; Hadfield-Menell et al., 2017; Ramakrishnan et al., 2020; Zhang et al., 2020; Saisubramanian et al., 2021a; Hassan et al., 2025). Feedback can take various forms such as demonstrations (Ramachandran and Amir, 2007; Saisubramanian et al., 2021a; Seo and Unhelkar, 2024; Zha et al., 2024), corrections (Cui et al., 2023; Bärmann et al., 2024), critiques (Cui and Niekum, 2018; Tarakli et al., 2024), ranking trajectories (Brown et al., 2020a; Xue et al., 2024; Feng et al., 2025), natural language instructions (Lou et al., 2024; Yang Y. et al., 2024; Hassan et al., 2025), or may be implicit in the form of facial expressions and gestures (Cui et al., 2021b; Strokina et al., 2022; Candon et al., 2023).

While existing approaches for learning from feedback have shown success, they typically assume that a single feedback type is used to teach the agent. This assumption limits learning efficiency and adaptability. Some efforts combine demonstrations with preferences (Bıyık et al., 2022; Ibarz et al., 2018), showing that utilizing more than one format accelerates learning. Extending this idea, recent works integrate richer modalities such as language and vision with demonstrations. Yang Z. et al. (2024) learn a reward function from comparative language feedback, while Sontakke et al. (2023) show that a single demonstration or natural language description can help define a proxy reward when used along with a vision-language model (VLM) pretrained on a large corpus of out-of-domain video demonstrations paired with language. Kim et al. (2023) use multimodal embeddings of visual observations and natural language descriptions to compute alignment-based rewards. A recent study further emphasizes that combining multiple feedback modalities can enhance learning outcomes (Beierling et al., 2025). Together, these works highlight that combining complementary feedback formats helps advance reward learning beyond using a fixed feedback format. Building on this insight, our approach uses multiple forms of human feedback for learning.

Learning from human feedback has also been used to model variations in human behavior. Huang et al. (2024) model heterogeneous human behaviors, capturing differences in feedback frequency, delay, strictness, and bias to improve robustness during learning, as optimal behaviors vary across users. Along the same lines, the reward learning approach proposed by Ghosal et al. (2023) selects a single feedback format based on the user’s ability to provide feedback in that format, resulting in an interaction that is tailored to a user’s skill level. Collectively, these works reveal a shift towards adaptive and user-aware querying mechanisms that improve reward inference and learning efficiency, motivating our approach to dynamically select both when to query and in what feedback format.

3 Problem formulation

Setting: Consider a robot operating in a discrete environment modeled as a Markov Decision Process (MDP), using its acquired model M = ⟨S, A, T, RT⟩. The robot optimizes the completion of its assigned task, which is its primary objective described by the reward RT. A primary policy, πM, is an optimal policy for the robot’s primary objective.

Assumption 1. Similar to Saisubramanian et al. (2021a), we assume that the agent’s model M has all the information necessary for the robot to successfully complete its assigned task but lacks superfluous details that are unrelated to the task.

Since the model is incomplete in ways unrelated to the primary objective, executing the primary policy produces negative side effects (NSEs) that are difficult to identify at design time. Following Saisubramanian et al. (2021a), we define NSEs as immediate, undesired, unmodeled effects of a robot’s actions on the environment. We focus on settings where the robot has no prior knowledge about the NSEs of its actions or the underlying true NSE penalty function RN. It learns to avoid NSEs by learning a penalty function R̂N from human feedback that is consistent with RN.

We target settings where the human can provide feedback in multiple ways and the robot can seek feedback in a specific format such as approval or corrections. This represents a significant shift from traditional active learning methods, which typically gather feedback only in a single format (Ramakrishnan et al., 2020; Saisubramanian et al., 2021a; Saran et al., 2021). Using the learned R̂N, the robot computes an NSE-minimizing policy to complete its task by optimizing R(s,a) = θ1·RT(s,a) + θ2·R̂N(s,a), where θ1 and θ2 are fixed, tunable weights denoting priority over the objectives.

Running Example: We illustrate the problem using a simple object delivery task using a Kinova Gen3 7DoF arm shown in Figure 1. The robot optimizes delivering the blue block to the white bin, by taking the shortest path. However, passing through states with a cardboard box or a glass bowl constitutes an NSE. Since the robot has no prior knowledge about NSEs of its actions, it may inadvertently navigate through these states causing NSEs.

Human’s Feedback Preference Model: The feedback format selection must account for the cost and human preferences in providing feedback in a certain format. The user’s feedback preference model is denoted by D = ⟨F, ψ, C⟩, where:

F is a predefined set of feedback formats the human can provide, such as demonstrations and corrections;

ψ: F → [0, 1] is the probability of receiving feedback in a format f, denoted ψ(f); and

C: F → ℝ is a cost function that assigns a cost to each feedback format f, representing the human’s time or cognitive effort required to provide that feedback.

This work assumes the robot has access to the user’s feedback preference model D—either handcrafted by an expert or learned from user interactions prior to robot querying, as in our user study experiments. Abstracting user feedback preferences into probabilities and costs enables generalizing the preferences across similar tasks. We take the pragmatic stance that ψ is independent of time and state, denoting the user’s preference about a format, such as not preferring formats that require constant supervision of robot performance. While this can be relaxed and the approach can be extended to account for state-dependent preferences, obtaining an accurate state-dependent ψ could be challenging in practice.
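To make the structure of D concrete, the following minimal Python sketch shows one possible representation of the preference model; the class and field names, probabilities, and costs below are illustrative placeholders rather than values elicited in our study.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class FeedbackPreferenceModel:
    """User's feedback preference model D = <F, psi, C> (names are ours)."""
    formats: List[str]          # F: feedback formats the human can provide
    psi: Dict[str, float]       # psi(f): probability of receiving feedback in format f
    cost: Dict[str, float]      # C(f): time/cognitive effort of providing format f

# Illustrative, hand-crafted instance; the numbers are placeholders, not study data.
D = FeedbackPreferenceModel(
    formats=["App", "Ann. App", "Corr", "Ann. Corr", "Rank", "DAM"],
    psi={"App": 0.9, "Ann. App": 0.8, "Corr": 0.6, "Ann. Corr": 0.5, "Rank": 0.85, "DAM": 0.4},
    cost={"App": 1.0, "Ann. App": 1.5, "Corr": 3.0, "Ann. Corr": 3.5, "Rank": 1.2, "DAM": 4.0},
)
```

In the user study, ψ(f) and C(f) were elicited from participants during the training phase; a handcrafted instance like the one above stands in for that step here.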

Assumption 2. Human feedback is immediate and accurate, when available.

Below, we describe the various feedback formats considered in this paper, and how the data from these formats are mapped to NSE severity labels.

3.1 Feedback formats studied

The agent learns an association between state-action pairs and NSE severity, based on the human feedback provided in response to agent queries. The NSE categories we consider in this work are {No NSE, Mild NSE, Severe NSE}. We focus on the following commonly used feedback types, each differing in the level of information conveyed to the agent and the human effort required to provide them.

Approval (App): The robot randomly selects N state-action pairs from all possible actions in critical states and queries the human for approval or disapproval. Approved actions are labeled as acceptable, while disapproved actions are labeled as unacceptable.

Annotated Approval (Ann. App): An extension of Approval, where the human specifies the NSE severity (or category) for each disapproved action in the critical states.

Corrections (Corr): The robot performs a trajectory of its primary policy in the critical states, under human supervision. If the robot’s action is unacceptable, the human intervenes with an acceptable action in these states. If all actions in a state lead to NSEs, the human specifies an action with the least NSE. When interrupted, the robot assumes all actions except the correction are unacceptable in that state.

Annotated Corrections (Ann. Corr): An extension of Corrections, where the human specifies the severity of NSEs caused by the robot’s unacceptable action in critical states.

Rank: The robot randomly selects N ranking queries of the form ⟨state, action1, action2⟩, by sampling two actions for each critical state. The human selects the safer action among the two options. If both are safe or both are unsafe, one of them is selected at random. The selected action is marked as acceptable and the other is treated as unacceptable.

Demo-Action Mismatch (DAM): The human demonstrates a safe action in each critical state, which the robot compares with its policy. All mismatched robot actions are labeled as unacceptable; matched actions are labeled as acceptable.

Mapping feedback data to NSE severity labels: We use la, lm, and lh to denote labels corresponding to no, mild and severe NSEs, respectively. An acceptable action in a state is mapped to la, i.e., (s, a) → la, while an unacceptable action is mapped to lh. When the severity of NSEs for unacceptable actions is known, actions producing mild NSEs are mapped to lm and those producing severe NSEs to lh. Mapping feedback to this common label set provides a consistent representation of NSE severity across diverse feedback types. The granularity of information and the sampling biases of the different feedback types affect the learned reward. Figure 2 illustrates this with the learned NSE penalty for the running example of moving an object to the bin (Figure 1), motivating the need for an adaptive approach that can learn from more than one feedback format. In the running example, the robot arm colliding with cardboard boxes is a mild NSE, and colliding with a glass bowl is a severe NSE.
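As a concrete illustration of this mapping, the sketch below converts a single feedback judgment into the common label set; the numeric encoding and function name are our own simplification, not the exact data structures used in the implementation.

```python
from typing import Optional

L_A, L_M, L_H = 0, 1, 2   # l_a (no NSE), l_m (mild NSE), l_h (severe NSE)

def label_from_feedback(acceptable: bool, severity: Optional[str] = None) -> int:
    """Map one human judgment about a queried (s, a) pair to an NSE severity label.

    acceptable: whether the human judged the queried action acceptable.
    severity: 'mild' or 'severe'; only available for annotated formats.
    """
    if acceptable:
        return L_A                 # acceptable actions map to l_a
    if severity == "mild":
        return L_M                 # annotated mild NSEs map to l_m
    # Without severity information (plain Approval, Corrections, Rank, DAM),
    # unacceptable actions default to the severe label l_h.
    return L_H
```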


Figure 2. Visualization of the reward learned using different feedback types. (Row 1) Black arrows indicate queries, and feedback is shown in speech bubbles. Cell colors indicate high, mild, and zero penalty. The outer box shows the true reward and the inner box shows the learned reward; mismatches between the outer and inner box colors indicate an incorrect learned model.

4 Adaptive feedback selection

Given an agent’s decision making model M and the human’s feedback preference model D, AFS enables the agent to query for feedback in critical states in a format that maximizes its information gain. We first formalize the NSE model learning process and then describe in detail how AFS selects critical states and the query format.

Formalizing NSE Model Learning: Let p*: S × A → {la, lm, lh} denote the true NSE severity label for each state-action pair, which is unknown to the agent but known to the human. The label la corresponds to no NSE, lm denotes mild NSE, and lh denotes severe NSE. Let p be a sampled approximation of p* (p ≈ p*), denoting the dataset of NSE labels collected via human feedback in response to the (s, a) pairs queried. That is, p^t denotes the data collected from human feedback until iteration t, where p^t(s, a) represents the categorical NSE severity label assigned to the state-action pair (s, a). Let q: S × A → {la, lm, lh} denote the labels predicted by the learned NSE model, which is trained using a supervised classifier with p as the training data. In this paper, we use a Random Forest (RF) classifier, though any classifier can be used in practice. Hyperparameters are optimized through randomized search with three-fold cross validation, and the configuration yielding the lowest mean-squared error is selected for training.
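A minimal sketch of how such a classifier could be trained, assuming scikit-learn and a feature representation of the queried state-action pairs; the hyperparameter ranges shown are illustrative and not the ones used in our experiments.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

def train_nse_model(X, y):
    """Fit an NSE severity classifier on accumulated feedback data.

    X: list/array of state-action feature vectors for queried (s, a) pairs.
    y: categorical NSE labels in {0, 1, 2} collected via human feedback (p).
    """
    param_dist = {                        # illustrative search space
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 5, 10, 20],
        "min_samples_leaf": [1, 2, 4],
    }
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions=param_dist,
        n_iter=10,
        cv=3,                             # three-fold cross validation
        scoring="neg_mean_squared_error", # keep the config with lowest MSE on the ordinal labels
        random_state=0,
    )
    search.fit(X, y)
    return search.best_estimator_         # used to predict q(s, a) for unqueried pairs
```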

Figure 3 shows an example of p and q for the object delivery task. We encode NSE categories as {0, 1, 2}, corresponding to no NSE, mild NSE, and severe NSE, respectively. Each state has four possible actions A = {a1, a2, a3, a4}, and the vector p(s) = [p(s, a1), p(s, a2), p(s, a3), p(s, a4)] (and similarly q(s)) encodes the categorical NSE labels for (s, a1), (s, a2), (s, a3), (s, a4) in that order. Since the human’s categorization of NSEs is initially unknown, p(s) is sampled from a uniform prior over the labels, and q(s) is initialized to [0, 0, 0, 0] (all actions are assumed to be safe) across all states.


Figure 3. Illustration of p (accumulated feedback) and q (generalized NSE labels) for the object delivery task. f*_{1:t−1} indicates the feedback formats selected until iteration t−1. Cell colors indicate no NSE, mild NSE, and severe NSE. Queried states in each iteration are highlighted in blue.

At iteration t−1, p^{t−1} reflects a single labeled state from the feedback received, while q^{t−1} reflects the NSE labels for the state after learning from p^{t−1}. For example, in iteration t−1, an action a3 in state s is randomly selected for querying using the Annotated Approval feedback format. The human labels it as a mild NSE, so p^{t−1}(s, a3) = 1, and consequently p^{t−1}(s) = [0, 0, 1, 0]. After training on p^{t−1}, the classifier may incorrectly predict q^{t−1}(s) = [0, 0, 0, 0], especially in early iterations when there is little data. At the next iteration t, the agent queries in a similar state using the Approval format, where the action a1 is randomly selected. Because the NSE severity level (i.e., mild/severe) cannot be indicated through the Approval format, p^t is updated as p^t(s) = [2, 0, 0, 0], and training now yields the prediction q^t(s) = [2, 0, 1, 0] (i.e., the NSE model predicts a severe NSE outcome for a1 and a mild NSE outcome for a3). This illustrates that q may initially disagree with p, but as feedback accumulates on related states, the generalization of q across actions begins to align with p.

Each predicted label is then mapped to a penalty value to form the learned penalty function R̂N, with penalties for la, lm and lh set to 0, 5 and 10, respectively, in our experiments. This penalty function is integrated into the agent’s reward model to compute an updated policy that minimizes NSEs while completing the primary task.
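The sketch below shows this label-to-penalty mapping and a cost-form combination of the task reward and learned penalty, following R(s,a) = θ1·RT(s,a) + θ2·R̂N(s,a); the penalty values 0, 5, and 10 follow the paper, while the function names and the assumption that the classifier from the previous snippet is used are ours.

```python
PENALTY = {0: 0.0, 1: 5.0, 2: 10.0}   # l_a, l_m, l_h -> penalty values used in the paper

def learned_penalty(nse_model, features):
    """R_hat_N(s, a): penalty predicted by the learned NSE model for one (s, a) pair."""
    label = int(nse_model.predict([features])[0])
    return PENALTY[label]

def combined_cost(task_cost, nse_penalty, theta1=1.0, theta2=1.0):
    """Cost form of R(s, a) = theta1 * RT(s, a) + theta2 * R_hat_N(s, a).

    Since the experiments optimize costs (negated rewards), the task cost and
    the NSE penalty are simply added with their respective weights.
    """
    return theta1 * task_cost + theta2 * nse_penalty
```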

In this learning setup, minimizing NSEs using AFS involves four iterative steps (Figure 4). In each learning iteration, AFS identifies (1) which states are most critical for querying (Section 4.1), and (2) which feedback format maximizes the expected information gain at the critical states, while accounting for user feedback preferences and the effort involved (Section 4.2). The information gain associated with a piece of feedback quantifies its effect in improving the agent’s understanding of the underlying reward function, and is measured using the Kullback-Leibler (KL) divergence (Ghosal et al., 2023; Tien et al., 2023). At the end of each iteration, the cluster weights and information gain are updated, and a new set of critical states is sampled to learn about NSEs, until the querying budget expires or the KL-divergence falls below a problem-specific, pre-defined threshold.


Figure 4. Solution approach overview. The critical states Ω for querying are selected by clustering the states. A feedback format f* that maximizes information gain is selected for querying the user across Ω. The NSE model is iteratively refined based on feedback. An updated policy is calculated using a penalty function R̂N, derived from the learned NSE model.

4.1 Critical states selection

When the budget for querying a human is limited, it is useful to query in states with a high learning gap measured as the KL-divergence between the agent’s knowledge of NSE severity and the true NSE severity given the feedback data collected so far. States with a high learning gap are called critical states (Ω) and querying in these states can reduce the learning gap.

Since p^t and q^t contain categorical values rather than probabilities, their corresponding empirical probability mass functions (PMFs) are computed over the three NSE categories (no NSE, mild NSE, and severe NSE), yielding p̂^t and q̂^t, respectively. Since we consider three NSE categories, p̂^t and q̂^t are vectors of length three.
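The following sketch shows one way to compute these per-state PMFs and the divergence used below; the small smoothing constant is our addition, needed because q̂ may assign zero probability to a category that appears in p̂ (as in the worked example that follows), which would otherwise make the divergence infinite.

```python
import numpy as np

def empirical_pmf(labels, n_categories=3):
    """Empirical PMF over NSE categories from a per-action label vector, e.g. [2, 0, 0, 0]."""
    counts = np.bincount(labels, minlength=n_categories)
    return counts / counts.sum()

def kl_divergence(p_hat, q_hat, eps=1e-6):
    """D_KL(p_hat || q_hat) with smoothing to keep the value finite (our addition)."""
    p = (np.asarray(p_hat, dtype=float) + eps) / (1 + eps * len(p_hat))
    q = (np.asarray(q_hat, dtype=float) + eps) / (1 + eps * len(q_hat))
    return float(np.sum(p * np.log(p / q)))

p_hat = empirical_pmf([2, 0, 0, 0])   # -> [0.75, 0.0, 0.25]
q_hat = empirical_pmf([0, 0, 0, 0])   # -> [1.0, 0.0, 0.0]
print(kl_divergence(p_hat, q_hat))    # a large divergence flags this state for re-querying
```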

In order to select critical states for querying, we compute the KL divergence between q̂^{t−1} and p̂^t, i.e., D_KL(p̂^t ‖ q̂^{t−1}). Although D_KL(p̂^t ‖ q̂^t) may appear to be a reasonable criterion to guide critical state selection, it only measures how well the agent learns from the feedback at t. It does not reveal states where the agent’s predictions were incorrect. For the example shown in Figure 3 with q^{t−1}(s) = [0, 0, 0, 0] and p^t(s) = [2, 0, 0, 0], p̂^t and q̂^{t−1} are calculated as the average occurrence of each NSE category (no NSE, mild NSE, severe NSE) across the four actions. That is, for q^{t−1}(s) = [0, 0, 0, 0], the frequencies are [4/4, 0/4, 0/4], resulting in q̂^{t−1}(s) = [1.0, 0.0, 0.0]. For p^t(s) = [2, 0, 0, 0], the frequencies are [3/4, 0/4, 1/4], yielding p̂^t(s) = [0.75, 0.0, 0.25]. Calculating the divergence between p̂^t(s) and q̂^{t−1}(s) reveals that the prediction was incorrect at s and that more data is required to align the learned model; hence s or similar states should be selected for querying. Therefore, the sampling weight of the cluster containing s (the region where the NSE model is still uncertain) is increased. In the following iteration, critical states are drawn from the reweighted clusters. Algorithm 1 outlines our approach for selecting critical states at each learning iteration, with the following three key steps.

1. Clustering states: Since NSEs are typically correlated with specific state features and do not occur at random, we cluster the states S into K clusters so as to group states with similar NSE severity (Lakkaraju et al., 2017). In our experiments, we use the KMeans clustering algorithm with the Jaccard distance to measure the distance between states based on their features. In practice, any clustering algorithm can be used, including manual clustering. The goal is to create meaningful partitions of the state space to guide critical state selection for querying the user.

2. Estimating information gain: We define the information gain of sampling from a cluster k ∈ K, based on the learning gap, as follows:

IG_k^t = \frac{1}{|\Omega_k^{t-1}|} \sum_{s \in \Omega_k^{t-1}} D_{KL}\left(\hat{p}^{t}(\cdot \mid s) \,\|\, \hat{q}^{t-1}(\cdot \mid s)\right)   (1)
= \frac{1}{|\Omega_k^{t-1}|} \sum_{s \in \Omega_k^{t-1}} \sum_{l \in \{l_a, l_m, l_h\}} \hat{p}^{t}(l \mid s) \log \frac{\hat{p}^{t}(l \mid s)}{\hat{q}^{t-1}(l \mid s)},   (2)

where Ω_k^{t−1} denotes the set of states sampled for querying from cluster k at iteration t−1, and p̂^t(l|s) and q̂^{t−1}(l|s) denote the probability of observing NSE category l ∈ {la, lm, lh} in state s, derived from p^t and q^{t−1}, respectively. This formulation quantifies how much the predicted NSE distribution diverges from the feedback received for each state, providing a principled measure of the expected information gain from querying in cluster k, as defined in Equation 1.

3. Sampling critical states: At each learning iteration t, the agent assigns a weight wk to each cluster k ∈ K, proportional to the new information on NSEs revealed by the most informative feedback format identified at t−1, using Equation 2. Clusters are given equal weights when there is no prior feedback (Line 4). Let N denote the number of critical states to be sampled in every iteration. We sample critical states in batches, but they can also be sampled sequentially. When sampling in batches of N states, the number of states nk to be sampled from each cluster is determined by its assigned weight. At least one state is sampled from each cluster to ensure sufficient information for calculating the information gain for every cluster (Line 5). The agent randomly samples nk states from the corresponding cluster and adds them to the set of critical states Ω (Lines 6, 7). If the total number of critical states sampled is less than N due to rounding, the remaining states are sampled from the cluster with the highest weight and added to Ω (Lines 9–12); a code sketch of this sampling procedure follows Algorithm 1 below.


Algorithm 1. Critical States Selection.
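A minimal sketch of the weighted sampling in Algorithm 1, under the simplifying assumptions that clusters are given as lists of states and that the weights have already been normalized using Equation 2; function and variable names are ours.

```python
import random

def select_critical_states(clusters, weights, n_total):
    """Sample critical states Omega from clusters in proportion to their weights.

    clusters: dict mapping cluster id -> list of states in that cluster.
    weights:  dict mapping cluster id -> normalized information-gain weight w_k.
    n_total:  N, the number of critical states to sample this iteration.
    """
    omega = []
    for k, states in clusters.items():
        # At least one state per cluster so IG_k can be estimated next iteration.
        n_k = max(1, int(weights[k] * n_total))
        omega.extend(random.sample(states, min(n_k, len(states))))
    # If rounding left us short of N, top up from the highest-weight cluster.
    if len(omega) < n_total:
        k_best = max(weights, key=weights.get)
        remaining = [s for s in clusters[k_best] if s not in omega]
        shortfall = min(n_total - len(omega), len(remaining))
        omega.extend(random.sample(remaining, shortfall))
    return omega
```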

4.2 Feedback format selection

To query in the critical states Ω, it is important to select a feedback format that not only maximizes the expected information gain about NSEs but also accounts for the likelihood and cost of the feedback. The information gain of a feedback format f at iteration t, for N = |Ω| critical states, is computed from the KL divergence between the observed and predicted NSE severity distributions, p̂^t and q̂^t:

V_f = \frac{1}{N} \sum_{s \in \Omega} D_{KL}\left(\hat{p}^{t}(\cdot \mid s) \,\|\, \hat{q}^{t}(\cdot \mid s)\right) I\left[f = f_H^{t}\right] + V_f \left(1 - I\left[f = f_H^{t}\right]\right),   (3)

where I[f = f_H^t] is an indicator function that checks whether the format provided by the human, f_H^t, matches the requested format f. If no feedback is received, the information gain for that format remains unchanged. The following equation is used to select the feedback format f*, accounting for feedback cost and user preferences:

f^* = \arg\max_{f \in F} \underbrace{\psi(f)\,\frac{V_f}{C(f)} + \frac{\log t}{n_f + \epsilon}}_{\text{Feedback utility of } f},   (4)

where ψ(f) is the probability of receiving feedback in format f and C(f) is the feedback cost, both determined using the human preference model D; t is the current learning iteration, n_f is the number of times feedback in format f has been received, and ϵ is a small constant for numerical stability. The selected format f* represents the most informative feedback format given the agent’s current knowledge, balancing exploration (less frequently used formats) and exploitation (formats known to provide high information gain).
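A sketch of the format-selection rule in Equation 4, using the reading above in which ψ(f)·V_f/C(f) is the exploitation term and log t/(n_f + ε) is the exploration bonus; treat the exact grouping of terms as our reconstruction of the garbled source equation.

```python
import math

def select_feedback_format(formats, V, psi, cost, n, t, eps=1e-3):
    """Pick f* maximizing the feedback utility of Equation 4.

    V:    dict f -> current information-gain estimate V_f (Equation 3).
    psi:  dict f -> probability of receiving feedback in format f.
    cost: dict f -> cost C(f) of providing feedback in format f.
    n:    dict f -> number of times feedback in format f has been received.
    t:    current learning iteration (>= 1).
    """
    def utility(f):
        exploitation = psi[f] * V[f] / cost[f]
        exploration = math.log(t) / (n[f] + eps)   # favors rarely used formats
        return exploitation + exploration
    return max(formats, key=utility)
```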


Algorithm 2. Feedback Selection for NSE Learning.

Algorithm 2 outlines our feedback format selection approach. Since the agent has no prior knowledge of how the human categorizes NSEs for each state-action pair, the labeling function p is instantiated by sampling from a uniform prior over the three NSE labels (la, lm, lh) for every (s, a), while q is initialized assuming all actions are safe (la) (Line 2). These initial labels are progressively refined as human feedback is received. At each iteration, the agent samples |Ω| critical states using Algorithm 1 (Line 4), and a feedback format f* is selected using Equation 4. The agent queries the human for feedback in f* (Line 5). If feedback is received (with probability ψ(f*)), the observed NSE labels p^t are updated and an NSE prediction model P is trained (Lines 6–8). The classifier P predicts the labels for the sampled critical states Ω, yielding q^t. We restrict the prediction to Ω since these states indicate regions of high uncertainty and contribute to reducing the divergence between the true and learned NSE distributions. Further, restricting predictions to Ω also reduces computational overhead during iterative querying. V_{f*} is recomputed using Equation 3, and n_{f*} is incremented (Lines 9–11). This repeats until either the querying budget is exhausted or the KL divergence between p̂^t and q̂^t over all states is within a problem-specific threshold δ.
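Putting the pieces together, the following high-level loop sketches the structure of Algorithm 2 using the helper functions from the earlier snippets; query_human, info_gain, and update_weights are hypothetical callables standing in for components whose details depend on the deployment, and the stopping check is simplified to the divergence over Ω rather than over all states.

```python
def afs_learning_loop(clusters, weights, D, budget, delta, n_per_iter,
                      query_human, info_gain, update_weights):
    """Simplified sketch of the Algorithm 2 querying loop.

    query_human(omega, f) -> list of (features, label) pairs, or None if no feedback;
    info_gain(omega, data, model) -> mean KL divergence over omega (Equation 3);
    update_weights(clusters, omega, data, model) -> new cluster weights (Equation 2).
    """
    V = {f: 0.0 for f in D.formats}      # information gain estimate per format
    n = {f: 0 for f in D.formats}        # times feedback in f was actually received
    p_data, t, nse_model = [], 1, None
    while budget > 0:
        omega = select_critical_states(clusters, weights, n_per_iter)
        f_star = select_feedback_format(D.formats, V, D.psi, D.cost, n, t)
        feedback = query_human(omega, f_star)          # received with probability psi(f*)
        if feedback:
            p_data.extend(feedback)                    # accumulate p^t
            X, y = zip(*p_data)
            nse_model = train_nse_model(list(X), list(y))   # re-fit to obtain q^t
            V[f_star] = info_gain(omega, p_data, nse_model)
            n[f_star] += 1
            weights = update_weights(clusters, omega, p_data, nse_model)
            if V[f_star] <= delta:                     # simplified KLD stopping criterion
                break
        budget -= D.cost[f_star]
        t += 1
    return nse_model
```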

Figure 5 illustrates the critical states and the most informative feedback formats selected at each iteration in the object delivery task using AFS, demonstrating that feedback utility changes over time, based on the robot’s current knowledge.


Figure 5. Feedback utility of each format across iterations. Numbers mark when a state was identified as critical, and circle colors denote the chosen feedback format.

4.3 Stopping criteria

Besides guiding the selection of critical states and feedback format, the KL-divergence also serves as an indicator of when to stop querying. The querying phase can be terminated when D_KL(p̂^t ‖ q̂^t) ≤ δ, where δ is a problem-specific threshold. When D_KL(p̂^t ‖ q̂^t) ≤ δ, it indicates that the learned model is a reasonable approximation of the underlying NSE distribution, and therefore querying can be terminated even if the allotted budget B has not been exhausted. The choice of δ provides a trade-off between thorough learning and human effort, and can be tuned based on domain-specific safety requirements.

5 Experiments in simulation

We first evaluate AFS on three simulated domains (Figure 6). Human feedback is simulated by modeling an oracle that selects safer actions with higher probability using softmax action selection (Ghosal et al., 2023; Jeon et al., 2020): the probability of choosing an action a from the set of all safe actions A* in state s is Pr(a|s) = e^{Q(s,a)} / Σ_{a′ ∈ A*} e^{Q(s,a′)}.
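A sketch of this simulated oracle, assuming access to safety-aware Q-values for the safe action set A*; how those Q-values are obtained is abstracted away, and the function name is ours.

```python
import numpy as np

def simulated_feedback(q_values, safe_actions, rng=None):
    """Simulated oracle: sample a safe action with probability proportional to exp(Q(s, a))."""
    rng = rng or np.random.default_rng(0)
    q = np.array([q_values[a] for a in safe_actions], dtype=float)
    probs = np.exp(q - q.max())          # subtract the max for numerical stability
    probs /= probs.sum()
    return safe_actions[rng.choice(len(safe_actions), p=probs)]
```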


Figure 6. Illustrations of evaluation domains. Red box denotes the agent and the goal location is in green. (a) Navigation: Unavoidable NSE. (b) Vase: Unavoidable NSE. (c) Safety-gym Push.

Baselines: (i) Naive Agent: The agent naively executes its primary policy without learning about NSEs, providing an upper bound on the NSE penalty incurred. (ii) Oracle: The agent has complete knowledge of RT and RN, providing a lower bound on the NSE penalty incurred. (iii) Reward Inference with β Modeling (RI) (Ghosal et al., 2023): The agent selects a feedback format that maximizes information gain according to the human’s inferred rationality β. (iv) Cost-Sensitive Approach: The agent selects the feedback format with the least cost, according to the preference model D. (v) Most-Probable Feedback: The agent selects the feedback format that the human is most likely to provide, based on D. (vi) Random Critical States: The agent uses our AFS framework to learn about NSEs, except that the critical states are sampled randomly from the entire state space. We use θ1 = 1 and θ2 = 1 for all our experiments. AFS uses the learned R̂N.

Domains, Metrics and Feedback Formats: We evaluate the performance of the various techniques on three domains in simulation (Figure 6): outdoor navigation, vase, and safety-gym’s push. We optimize costs (negations of rewards) and compare techniques using the average NSE penalty and the average cost to goal, averaged over 100 trials. For navigation, vase, and push, we simulate human feedback. The costs for la, lm, and lh are 0, +5, and +10, respectively.

Navigation: In this ROS-based city environment, the robot optimizes the shortest path to the goal location. A state is represented as ⟨x, y, f, p⟩, where x and y are the robot’s coordinates, f is the surface type (concrete or grass), and p indicates the presence of a puddle. The robot can move in all four directions and each move costs +1. Actions succeed with probability 0.8. Navigating over grass damages the grass and is a mild NSE. Navigating over grass with puddles is a severe NSE. The features used for training are ⟨f, p⟩. Here, NSEs are unavoidable.

Vase: In this domain, the robot must quickly reach the goal while minimizing breaking a vase as a side effect (Krakovna et al., 2020). A state is represented as ⟨x, y, v, c⟩, where x and y are the robot’s coordinates, v indicates the presence of a vase, and c indicates if the floor is carpeted. The robot moves in all four directions and each move costs +1. Actions succeed with probability 0.8. Breaking a vase placed on a carpet is a mild NSE and breaking a vase on the hard surface is a severe NSE. The features ⟨v, c⟩ are used for training. All instances have unavoidable NSEs.

Push: In this safety-gymnasium domain, the robot aims to push a box quickly to a goal state (Ji et al., 2023). Pushing a box onto a hazard zone (blue circles) produces NSEs. We modify the domain such that, in addition to the existing actions, the agent can also wrap the box, which costs +1. Every move action succeeds with probability 0.8, and the wrap action succeeds with probability 1.0. The NSEs can be avoided by pushing a wrapped box. A state is represented as ⟨x, y, b, w, h⟩, where x, y are the robot’s coordinates, b indicates carrying a box, w indicates if the box is wrapped, and h denotes if it is a hazard area. The features ⟨b, w, h⟩ are used for training.

5.1 Results and discussion

Effect of learning using AFS: We first examine the benefit of querying using AFS, by comparing the resulting average NSE penalties and the cost for task completion across domains and query budgets. Figure 7 shows the average NSE penalties when operating based on an NSE model learned using the different querying approaches. Clusters for critical state selection were generated using the KMeans clustering algorithm with K = 3 for the navigation, vase, and safety-gym push domains (Figures 7a–c). The results show that our approach consistently performs similarly to or better than the baselines.


Figure 7. Average penalty incurred when querying with different feedback selection techniques. (a) Navigation: Unavoidable NSE. (b) Vase: Unavoidable NSE. (c) Safety-gym Push.

There is a trade-off between optimizing task completion and mitigating NSEs, especially when NSEs are unavoidable. While some techniques are better at mitigating NSEs, they significantly impact task performance. Table 1 shows the average cost for task completion at B = 400. Lower values are better for both the NSE penalty and the task completion cost. While the Naive Agent has a lower cost for task completion, it incurs the highest NSE penalty as it has no knowledge of RN. RI causes more NSEs, especially when they are unavoidable, as its reward function does not fully model the penalties for mild and severe NSEs. Overall, the results show that our approach consistently mitigates avoidable and unavoidable NSEs without substantially affecting task performance.


Table 1. Average cost and standard error at task completion.

Figure 8 shows the average penalty when AFS uses KL-divergence (KLD) as the stopping criterion, compared to querying with budget B = 400. For comparison, we also annotate in the plot the querying budget used by AFS with KLD stopping at the time of termination. The results show that despite terminating earlier and using fewer queries, AFS with KLD stopping achieves performance comparable to that of AFS with query budget B = 400, demonstrating the usefulness of KLD as a stopping criterion.


Figure 8. Average penalty incurred when learning with AFS using querying budget B=400, and KL divergence (KLD) as the stopping criterion. The budget utilized by AFS with KLD stopping is annotated in the plot.

6 In-person user study with a physical robot arm

We conducted an in-person study with a Kinova Gen3 7DoF arm (Kinova, 2025) tasked with delivering two objects, an orange toy and a white box, across a workspace containing items of varying fragility (Figure 9). This setup involves users providing both interface-based and kinesthetic feedback to the robot. The study was approved by the Oregon State University IRB. Participants were compensated with a $15 Amazon gift card for their participation in the study.


Figure 9. Task setup for the human subject study. (a) Physical setup of the task for human subjects study; (b) Replication of the physical setup using PyBullet. A dialog box corresponding to the current feedback format is shown for every query.

This user study had three goals: (1) to measure our approach’s effectiveness in reducing NSEs for a real-world task, (2) to understand how users perceive the adaptivity, workload and competence of the robot operating in the AFS framework, and (3) to evaluate the extent to which AFS captures user preferences in practice, while ensuring maximum information gain during the learning process.

6.1 Methods

6.1.1 Participants

We conducted a pilot study in simulation to inform our overall design, the details of which are discussed under Section 2 in the Supplementary Material. We conducted another pilot study with N = 10 participants to evaluate the study setup with the Kinova arm. In particular, this pilot study assessed the clarity of instructions, survey wording, and feasibility of the task design in the object delivery task of the Kinova arm. Based on the participant feedback, we simplified the survey questions and included example trajectories that demonstrated safe and NSE-causing behaviors. For the main study, we recruited N = 30 participants with basic computer literacy from the general population through university mailing lists and public forums. Participants were aged 18–72 years (M = 32.10, SD = 13.11), with 53.3% men and 46.7% women. Participants reported varied prior experience with robots: 73.3% had general awareness of similar robot products, 6.7% had researched or investigated robots, 3.3% had interacted through product demos, and 13.3% had no prior awareness of similar products.

6.1.2 Robotic system setup

The Kinova Gen3 arm was equipped with a joint-space compliant controller which allowed participants to physically move the joints of the arm through space with gravity compensation when needed. Additionally, a task-space planner allowed for navigation to discrete grid positions for both feedback queries and policy execution (Kinova, 2025). Figure 9a shows the physical workspace and the two delivery objects, while Figure 9b shows the corresponding PyBullet simulation used for visualization during GUI-based feedback. A dialog box was displayed to prompt the participant whenever feedback was queried.

6.1.3 Interaction premise

The interaction simulated an assistive robot delivering objects to their designated bins. Specifically, the task required the Kinova arm to deliver an orange plush toy and a rigid white box to their respective bins while avoiding collisions with surrounding obstacles of different fragility. Collisions with fragile obstacles (e.g., a glass vase) during delivery of the plush toy were considered a mild NSE. Collisions involving the white rigid box were severe NSEs if with a fragile object and mild NSEs if with a non-fragile object. All other scenarios were considered safe. The workspace was discretized into a grid of cells marked with tape on the tabletop and mirrored in the GUI. Each cell represented a state corresponding to a possible end-effector position.

6.1.4 Study design

The robot’s state space was discretized and represented as ⟨x, y, i1, i2, o, f, g1, g2⟩, where (x, y) denotes the end-effector position, i1 and i2 indicate the presence of the orange plush toy or the white rigid box in the end effector, o indicates the presence of an obstacle, f indicates obstacle fragility, and g1 and g2 indicate whether each object was delivered to its corresponding goal location (i.e., the orange plush toy in the white bin and the white box in the wicker bin).

Participants interacted with the robot through four feedback formats, F = {App, Corr, Rank, DAM}, during both the training and main experience phases. Depending on the feedback format, the Kinova arm executed the queried action in the physical workspace or displayed a simulation of the action in the graphical user interface (GUI). The interaction in each of the four feedback formats is described below.

1. Approval: The robot executed a single action in simulation, and participants indicated whether it was safe by selecting “yes” or “no” in the GUI.

2. Correction: The robot first executed the action prescribed by its policy in simulation. If the participant deemed the simulated action unsafe, the robot in the physical setup moved to the queried location. Participants then corrected the robot by physically moving the robot arm to demonstrate a safe alternative action.

3. Demo-Action Mismatch: The robot first physically moved its arm to a specific end-effector position in the workspace. Participants then provided feedback by guiding the arm to a safe position, thereby demonstrating the safe action. The robot compared the action given by its policy to the demonstrated action; if the two did not match, the robot’s action was considered unsafe.

4. Ranking: Simulation clips of two actions selected at random in a given state were presented in GUI. Participants compared the two candidate actions and selected which was safer. If both actions were judged equally safe or unsafe, either option could be chosen.

Each participant experienced four learning conditions in a within-subjects, counterbalanced design:

1. The baseline RI approach proposed in Ghosal et al. (2023),

2. AFS with random Ω, where critical states are randomly selected,

3. AFS with a fixed feedback format (DAM) for querying, consistent with prior works that rely primarily on demonstrations, and

4. The proposed AFS approach, where both the feedback format and the critical states are selected to maximize information gain.

Each condition was a distinct feedback query selection strategy controlling how the robot queried participants during learning. These conditions are the independent variables. The dependent measures include NSE occurrences and their severity, perceived workload, trust, competence, and user alignment.

6.1.5 Hypotheses

We test the following hypotheses in the in-person study. These hypotheses were derived from trends observed in the experiments and human subjects study in simulation (Section 5 and Section 2 in the Supplementary Materials).

H1: Robots learning using AFS will have fewer NSEs in comparison to the baselines.

This hypothesis is derived from the results of our experiments on simulated domains (Figure 7) where AFS consistently reduced NSEs while completing the assigned task. We hypothesize that this trend extends to physical human-robot interactions.

H2: AFS will achieve comparable or better performance compared to the baselines, with a lower perceived workload for the users.

The results on simulated domains (Figure 8) show that AFS achieved better or comparable performance to the baselines, using fewer feedback queries. While the in-person user study requires relatively greater physical and cognitive effort, we expect the advantage of the sample efficiency to persist and investigate whether it translates to reduced perceived workload.

H3: Participants will report AFS as more trustworthy, competent, and aligned with user expectations, in comparison to the baselines.

In the human subjects simulation study (Supplementary Table S2), participants reported that AFS selected intelligent queries, targeted critical states, and improved the agent’s performance, reflecting indicators of trust, competence and user alignment. We hypothesize that this trend extends to physical settings as well.

Hypotheses H1 and H2 explore trends identified in simulation and are therefore confirmatory. Hypothesis H3 builds on the perception measures used in the human subjects study in simulation, and is hence treated as an extended confirmatory hypothesis.

6.1.6 Procedure

Each study session lasted approximately 1 hour and followed three phases.

6.1.6.1 Training

Participants were first introduced to the task objective, the workspace, and the four feedback formats. For each format, they provided feedback on four sample queries to practice both GUI-based and kinesthetic interactions. After completing each format, participants rated the following: (i) the probability of responding to a query in that format, ψ(f), (ii) the perceived cost or effort required to provide feedback, C(f), and (iii) the overall task workload. This phase helped establish measures such as feedback likelihood, perceived effort, and workload.

6.1.6.2 Main experience

Following training, participants completed the four learning conditions corresponding to the different approaches under evaluation. In each condition, participants provided feedback to train the robot to avoid collisions while performing the object-delivery task. Depending on the feedback format selected by the querying strategy, participants either evaluated short simulation clips on the GUI or physically guided the robotic arm. At the end of each condition, the robot executed the policy learned under that condition. Participants then observed its performance and completed a brief post-condition questionnaire assessing workload, trust, perceived competence, and user alignment.

6.1.6.3 Closing

At the end of the study, participants compared the four learning approaches in terms of trade-offs between learning speed and safety. Participants reported their preferences on providing feedback through multiple formats versus relying on a single feedback format. These responses offered qualitative insight into AFS’s practicality and user acceptance.

6.1.7 Measures

We collected both quantitative and qualitative measures. The quantitative measures captured task-level performance through the frequency and severity of NSEs (mild and severe). Qualitative measures captured participants’ perceptions of the following.

1. Workload: Participants’ perceived workload across the feedback formats and learning conditions was measured using the NASA Task Load Index (NASA-TLX) (Hart and Staveland, 1988). The questionnaire scales were transformed to seven-point subscales ranging from “Very Low” (1) to “Very High” (7). Responses were collected during the training phase and after each condition in the main experience phase.

2. Robot Attributes: Perceived robot attributes, like competence, warmth and discomfort, were measured using the nine-point Robotic Social Attributes Scale (RoSAS) (Carpinella et al., 2017), ranging from “Strongly Disagree” (1) to “Strongly Agree” (9). Participants completed this questionnaire after each learning condition.

3. Trust: A custom 10-point trust scale (0%–100%) was used to measure participants’ confidence in the robot’s ability to act safely under each learning condition. Participants rated their trust both before and after the robot’s training phase to capture changes due to its learning performance.

4. User Alignment: Participants’ perception of user alignment was assessed using a custom seven-point Likert scale ranging from “Strongly Disagree” (1) to “Strongly Agree” (7). Participants rated (i) how well the critical states queried by the robot aligned with their own assessment of which states were important for learning, and (ii) how well the feedback formats chosen across conditions matched their personal feedback preferences. Higher ratings indicated stronger perceived alignment between the robot’s querying strategy and the participants’ expectations.

6.1.8 Analysis

Survey responses were compiled into cumulative RoSAS (competence, warmth, discomfort) and NASA-TLX workload scores. A repeated-measures ANOVA (rANOVA) tested for significant differences across learning conditions; we report the F-statistic, p-value, and effect size as generalized eta-squared (ηG²). When effects were significant, Tukey’s post-hoc tests identified pairwise differences. All results are reported with means (M), standard errors (SE), and p-values.
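For reproducibility, this analysis could be run with standard Python statistics tooling, as in the hedged sketch below; statsmodels’ AnovaRM stands in for the rANOVA described here (the text does not specify the analysis software), the column names are illustrative, and the Tukey test shown ignores the repeated-measures structure, so it only approximates the post-hoc tests reported.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def analyze_workload(df: pd.DataFrame):
    """df: long-format table with one row per (participant, condition) observation.

    Illustrative columns: 'participant', 'condition', 'tlx' (NASA-TLX workload score).
    """
    # Repeated-measures ANOVA across the four learning conditions.
    rm = AnovaRM(df, depvar="tlx", subject="participant", within=["condition"]).fit()
    print(rm.anova_table)                       # F-statistic and p-value per within factor
    # Pairwise comparisons (Tukey HSD; treats observations as independent,
    # so it is only an approximation of the post-hoc tests reported in the paper).
    print(pairwise_tukeyhsd(df["tlx"], df["condition"]))
```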

6.2 Results

We evaluate hypotheses H1-H3 using both objective and subjective measures. Data from all 30 participants were included in the analysis, as all sessions were completed successfully.

6.2.1 Effectiveness of AFS in mitigating NSEs (H1)

Figure 10a shows the average penalty incurred under each condition. The AFS approach incurred the lowest NSE penalty (M = 3.83, SE = 1.21), substantially lower than AFS with random Ω (M = 11.55, SE = 1.57) and AFS with a fixed feedback format (M = 10.50, SE = 0.37). The RI baseline also incurred higher penalties (M = 5.00, SE = 0.00) than AFS. These results confirm hypothesis H1 and demonstrate that adaptively selecting both critical states and feedback formats reduced unsafe behaviors more effectively than random or fixed querying strategies.


Figure 10. Results from the user study on the Kinova 7DoF arm. (a) Average penalty incurred across methods in the human subjects study. (b) NASA-TLX workload across the four conditions.

6.2.2 Learning efficiency and workload (H2)

We first compare the perceived workload across the different feedback formats, followed by the results across learning conditions. Demonstration is the most widely used feedback format in existing works but was perceived as the most demanding (Figure 11c). While corrections offer a corrective action in addition to disapproving the agent’s action, they also imposed substantial effort on users. Approval required the least workload but conveyed limited information. A repeated-measures ANOVA revealed a significant effect of feedback format on perceived workload (F(3,87) = 3.33, p = 0.023, ηG² = 0.046). Post hoc comparisons indicated that Approval (M = 2.11, SE = 0.12) imposed a significantly lower workload (p = 0.026) than Demo-Action Mismatch (M = 2.62, SE = 0.19), while no other pairwise differences reached significance. This trade-off underscores the need for an adaptive selection strategy to balance informativeness with user effort.


Figure 11. User study results. (a,b) RoSAS competence and NASA Task-Load across the four conditions in the main study; (c) NASA Task-Load across feedback formats.

The rANOVA analysis across the four learning conditions further revealed a significant effect on the NASA-TLX workload ratings (F(3,87) = 3.73, p = 0.014, ηG² = 0.030). Among the four conditions, AFS achieved one of the lowest perceived workload ratings (M = 2.34, SE = 0.12), comparable to AFS with random Ω (M = 2.26, SE = 0.15) and lower than both AFS with fixed format (M = 2.56, SE = 0.19) and RI (M = 2.64, SE = 0.19). Tukey post-hoc tests showed that AFS with random Ω imposed a significantly lower workload than RI (p = 0.033). Overall, these results support H2, indicating that adaptively selecting queries helps reduce perceived workload relative to the baselines (Figure 10b).

6.2.3 Trust, competence, and preference alignment (H3)

Participants’ ratings of the robot’s ability to act safely increased after learning with AFS, as shown in Figure 11b. A significant effect was also found for perceived robot competence (F(3,87) = 10.6, p < 0.001, η_G² = 0.082) (Figure 11a). AFS was rated highest (M = 7.04, SE = 0.32), significantly higher than AFS with random Ω (M = 5.88, SE = 0.32, p = 0.002) and AFS with fixed format (M = 5.88, SE = 0.30, p < 0.001), and comparable to RI (M = 6.68, SE = 0.32). These results support H3: AFS was perceived as more competent and trustworthy than the baselines.

Descriptive analyses of user ratings of state criticality and feedback-format alignment indicated consistent trends across participants. Although differences between conditions were not statistically significant (p > 0.05), AFS consistently received higher ratings for feedback alignment (M = 3.79, SE = 0.42) than for state criticality (M = 3.14, SE = 0.40), suggesting that participants found AFS’s query selections relevant and aligned with their preferences. Participants, both those aware and unaware of similar robotic systems, perceived AFS’s queries as critical for learning and well-aligned with their feedback preferences. Participants with prior research experience rated state criticality and format alignment comparably, indicating confidence in the adaptivity of AFS’s querying process.

7 Discussion

Our experiments followed a progression of increasing realism. In the simulation experiments with both avoidable and unavoidable NSEs, AFS incurred lower penalties and overall costs than the baselines, demonstrating its ability to balance task performance with safety. The results of our pilot study, in which users interacted with a simulated agent, showed that AFS effectively learns each participant’s feedback preference model and uses it to select formats aligned with user expectations. Finally, the in-person user study with the Kinova arm showed the practicality of using AFS in real-world settings, achieving favorable ratings on trust, workload, and user-preference alignment. These findings support our three hypotheses regarding the performance of AFS: (H1) it reduces unsafe behaviors more effectively than the baselines, (H2) it improves learning efficiency while reducing user workload, and (H3) it is perceived as more trustworthy and competent. Collectively, the results highlight that adaptively selecting both the query format and the states in which to pose queries enhances learning efficiency and reduces user effort.

Beyond confirming these hypotheses, the findings provide important design implications for human-in-the-loop learning systems. By modeling the trade-off between informativeness and effort, AFS offers a framework for balancing user workload with the need for high-quality feedback. The learned feedback preference model allows the agent to adaptively select querying formats while minimizing human effort. Using KL divergence as a stopping criterion further enables adaptive termination of the querying process. This sidesteps the problem of determining the “right” querying budget for a problem and shows that AFS enables efficient learning while minimizing redundant human feedback. These design principles can inform the development of interactive systems that adapt query format and frequency based on the agent’s current knowledge and the user’s feedback preferences. Overall, the results show that AFS (1) consistently outperforms the baselines across different evaluation settings, and (2) can be effectively deployed in real-world human-robot interaction scenarios.
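To make these ideas concrete, the following sketch shows one plausible way to combine the quantities described above, namely the expected information gain of a format, the probability of receiving feedback in that format, and its cost, into a single score, together with a KL-divergence check between successive posteriors over the penalty function as a stopping rule. The function names, the weighting term lambda_cost, the threshold, and the toy numbers are illustrative assumptions; the exact objective and stopping test used by AFS are those defined in the method section.

```python
# Illustrative sketch only: the scoring rule, weighting term (lambda_cost), threshold,
# and toy numbers below are assumptions, not the exact objective used by AFS.
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def select_format(formats, info_gain, p_feedback, cost, lambda_cost=1.0):
    """Pick the feedback format with the best informativeness/effort trade-off."""
    scores = {f: p_feedback[f] * info_gain[f] - lambda_cost * cost[f] for f in formats}
    return max(scores, key=scores.get)

def should_stop(prev_posterior, curr_posterior, kl_threshold=1e-3):
    """Stop querying once successive posteriors over the penalty function barely change."""
    return entropy(curr_posterior, prev_posterior) < kl_threshold

# Toy usage with made-up values for four feedback formats.
formats = ["approval", "correction", "demonstration", "rank"]
info_gain = {"approval": 0.2, "correction": 0.6, "demonstration": 0.9, "rank": 0.35}
p_feedback = {"approval": 0.95, "correction": 0.7, "demonstration": 0.5, "rank": 0.8}
cost = {"approval": 0.1, "correction": 0.5, "demonstration": 0.9, "rank": 0.3}
print(select_format(formats, info_gain, p_feedback, cost, lambda_cost=0.5))

prev = np.array([0.50, 0.30, 0.20])  # posterior over candidate penalty values, step t-1
curr = np.array([0.52, 0.29, 0.19])  # posterior at step t
print(should_stop(prev, curr))       # whether to stop querying at this step
```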

A key strength of this work lies in its extensive evaluation, from simulation to real-robot studies, supporting AFS’s robustness and practicality. One limitation, however, is that the current evaluation focuses on discrete environments. Extending AFS to continuous domains introduces challenges such as identifying critical states and estimating divergence-based information gain in high-dimensional spaces. While gathering trajectory-level feedback is relatively easy in continuous settings, gathering state-level feedback, which is the focus of this work, is challenging. Addressing these challenges requires scalable state representations and efficient sampling strategies, which will be a focus of future work.

8 Conclusion and future work

The proposed Adaptive Feedback Selection (AFS) framework facilitates querying a human in different formats in different regions of the state space to effectively learn a reward function. Our approach uses information gain to identify critical states for querying and the most informative feedback format to use in those states, while accounting for the cost and uncertainty of receiving feedback in each format. Our empirical evaluations in four simulation domains and a human subjects study with a simulated agent demonstrate the effectiveness and sample efficiency of our approach in mitigating avoidable and unavoidable negative side effects (NSEs). The subsequent in-person user study with a Kinova Gen3 7DoF arm further validates these findings, showing that AFS not only improves NSE avoidance but also enhances user trust, perceived competence, and preference alignment. While AFS assumes that human feedback reflects a true underlying notion of safety, biased feedback can misguide the robot and lead to unintended NSEs. Understanding when such biases arise and how to correct for them remains an open challenge; extending AFS with bias-aware inference mechanisms is a promising future direction. Future work will also focus on extending AFS to continuous state and action spaces, strengthening AFS’s applicability to complex, safety-critical domains where user-aware interaction is essential.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement

The studies involving humans were approved by Human Research Protection Program and Institutional Review Board, Oregon State University. The studies were conducted in accordance with the local legislation and institutional requirements. The participants provided their written informed consent to participate in this study.

Author contributions

YA: Writing – review and editing, Investigation, Data curation, Methodology, Conceptualization, Writing – original draft, Visualization. NN: Writing – review and editing, Writing – original draft, Data curation. KS: Writing – review and editing, Data curation. NF: Resources, Writing – review and editing, Supervision. SS: Supervision, Funding acquisition, Resources, Writing – review and editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This work was supported in part by National Science Foundation grant number 2416459.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.


Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frobt.2025.1734564/full#supplementary-material

Footnotes

1. See Section 3.1 in the Supplementary Material for details on the dialog box and examples of each feedback format.


Keywords: information gain, interactive imitation learning, learning from human feedback, learning from multiple formats, robot learning

Citation: Anand Y, Nwagwu N, Sabbe K, Fitter NT and Saisubramanian S (2026) Adaptive querying for reward learning from human feedback. Front. Robot. AI 12:1734564. doi: 10.3389/frobt.2025.1734564

Received: 28 October 2025; Accepted: 15 December 2025;
Published: 12 February 2026.

Edited by:

Chao Zeng, University of Liverpool, United Kingdom

Reviewed by:

Pasqualino Sirignano, Sapienza University of Rome, Italy
Chuanfei Hu, Southeast University, China

Copyright © 2026 Anand, Nwagwu, Sabbe, Fitter and Saisubramanian. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Sandhya Saisubramanian, sandhya.sai@oregonstate.edu
