
ORIGINAL RESEARCH article

Front. Robot. AI, 12 January 2026

Sec. Robot Learning and Evolution

Volume 12 - 2025 | https://doi.org/10.3389/frobt.2025.1682200

Solving robotics tasks with prior demonstration via exploration-efficient deep reinforcement learning

  • 1Unicontrol ApS, Odense, Denmark
  • 2SDU Robotics, Maersk McKinney Møller Institute, University of Southern Denmark, Odense, Denmark

This paper proposes an exploration-efficient deep reinforcement learning with reference (DRLR) policy framework for learning robotics tasks incorporating demonstrations. The DRLR framework is developed based on an imitation bootstrapped reinforcement learning (IBRL) algorithm. Here, we propose to improve IBRL by modifying the action selection module. The proposed action selection module provides a calibrated Q-value, which mitigates the bootstrapping error that otherwise leads to inefficient exploration. Furthermore, to prevent the reinforcement learning (RL) policy from converging to a sub-optimal policy, soft actor–critic (SAC) is used as the RL policy instead of twin delayed DDPG (TD3). The effectiveness of our method in mitigating the bootstrapping error and preventing overfitting is empirically validated by learning two robotics tasks: bucket loading and open drawer, which require extensive interactions with the environment. Simulation results also demonstrate the robustness of the DRLR framework across tasks with both low and high state–action dimensions and varying demonstration qualities. To evaluate the developed framework on a real-world industrial robotics task, the bucket loading task is deployed on a real wheel loader. The sim-to-real results validate the successful deployment of the DRLR framework.

1 Introduction

Model-free deep reinforcement learning (DRL) has shown great potential in learning continuous control tasks in robotics (Allshire et al., 2022; Qi et al., 2023; Rudin et al., 2022; Haarnoja et al., 2018a; Nguyen and La, 2019; Ibarz et al., 2021). However, there remain challenges that limit the widespread applicability of these methods in real-world robotic applications. One major challenge is the poor sample efficiency of learning with model-free DRL; even relatively simple tasks can require millions of interaction steps, while learning policies from high-dimensional observations or complex environments may require significantly more interactions (Haarnoja et al., 2018b; Osinski et al., 2020; Raffin et al., 2021). A primary cause of the poor sample efficiency is on-policy learning (Haarnoja et al., 2018b), since some of the most widely used DRL algorithms, such as A3C (Mnih et al., 2016) and PPO (Schulman et al., 2017), require new interactions with the environment for each gradient step. Consequently, on-policy DRL is often impractical for real-world systems, as allowing untrained policies to interact with real systems can be both costly and dangerous. Even when learning occurs solely in simulation, it is still preferable to utilize previously collected data instead of starting from scratch (Levine et al., 2020). On the other hand, off-policy DRL methods improve sample efficiency by reusing past experience and have demonstrated strong performance on continuous control tasks (Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018b; Fujimoto et al., 2019; Fujimoto and Gu, 2021; Kumar et al., 2020). However, for complex robotics tasks where data collection itself is expensive, e.g., in construction machines, educational agents, or medical devices, even off-policy approaches become costly when the DRL policy requires extensive exploration. Under these scenarios, improving exploration efficiency is as crucial as sample efficiency to reduce the exploration needed for achieving a good policy.

Therefore, effectively leveraging prior demonstrations to facilitate efficient exploration is considered a promising strategy for the broad application of off-policy DRL in real-world industrial robotics. Two main research directions have emerged to achieve this goal:

1.1 Offline-to-online DRL

Pretraining DRL with prior expert demonstrations, along with continuous training with online data, has shown its impressive performance in exploration efficiency (Vecerik et al., 2017; Nair et al., 2020; Uchendu et al., 2023; Zhou et al., 2024; Goecks et al., 2019; Lee et al., 2022). Early studies initialize training by mixing offline demonstrations and online interaction in the replay buffer and use a prioritized replay mechanism to enable the reinforcement learning (RL) policy for efficient exploration (Vecerik et al., 2017; Song et al., 2022). More recent approaches separate offline pretraining from online fine-tuning and report superior exploration efficiency (Gao et al., 2018; Goecks et al., 2019; Nair et al., 2020; Lee et al., 2022). In offline training, a behavior cloning (BC) loss or Kullback–Leibler (KL) divergence is typically used to encourage the RL policy to closely follow the behavior policy, which is used to generate the demonstrations, thereby facilitating efficient exploration in online interactions. However, when transferring to the online interacting phase, some methods are required to “recalibrate” the offline Q-estimates to the new online distribution to maintain learning stability and mitigate forgetting of pre-trained initializations (Nair et al., 2020; Uchendu et al., 2023; Ball et al., 2023).

1.2 DRL-Ref policy

Some novel studies have proposed to explicitly integrate a reference policy, trained from the prior demonstration to guide DRL training (Zhang et al., 2023; Hu et al., 2024). In these works, a stand-alone reference policy is trained using offline demonstration and then used to provide additional guidance in the DRL online learning phase. In this work, we consider the imitation bootstrapped reinforcement learning (IBRL) framework as an ideal approach for learning robotics tasks with prior demonstrations as it prevents catastrophic forgetting of pre-trained initializations and automatically balances offline and online training (Hu et al., 2024).

However, the IBRL framework is built on off-policy RL and imitation learning (IL). It therefore risks the same challenges posed by bootstrapping errors in off-policy RL (Kumar et al., 2019; Kumar et al., 2020; Fujimoto et al., 2019; Fujimoto and Gu, 2021), where the target critic and actor networks are updated using out-of-distribution (OOD) actions with overestimated Q-values (Kumar et al., 2019; Kumar et al., 2020). Meanwhile, the IL policy in IBRL can also face state distribution shift (Hussein et al., 2017) when OOD actions keep being selected. To address these challenges, in this work, we propose an exploration-efficient DRL with reference (DRLR) policy framework, as shown in Figure 1, and summarize our contributions as follows:

1. Identify and analyze the main cause of the failure cases trained with the IBRL framework: distribution shift due to the bootstrapping error.

2. Propose a simple action selection module and use maximum entropy RL to mitigate inefficient exploration caused by bootstrapping errors and convergence to a sub-optimal policy due to overfitting.

3. Demonstrate the effectiveness and robustness of the proposed framework on tasks with both low and high state–action dimensions and demonstrations of different quality.

4. Showcase an implementation and deployment of the proposed framework on a real industrial task.


Figure 1. Overview of the proposed exploration-efficient DRLR framework. The proposed framework extends a sample-efficient DRL-Ref method with a simple action selection module to mitigate inefficient explorations caused by (1) bootstrapping errors leading to the RL policy selecting out-of-distribution actions; (2) Ref policy failing to provide good actions under state distribution shifts.

2 Problem statement

The proposed framework targets robotics tasks with the following characteristics: 1) collecting a large amount of data is costly; 2) learning requires extensive interaction; 3) a small number of expert demonstrations is available. Based on these characteristics, the bucket loading (Shen and Sloth, 2024a) and open drawer (Makoviychuk et al., 2021a) tasks are selected to evaluate the effectiveness of the proposed framework. The task environments are shown in Figure 2.


Figure 2. Selected tasks for testing the proposed framework. (a) Bucket loading. (b) Open drawer.

Compared to the selected DRL-Ref framework, IBRL, the proposed framework attempts to mitigate the distribution shift caused by bootstrapping errors and to prevent convergence to a sub-optimal policy caused by overfitting to the demonstrations.

Bootstrapping error can arise in off-policy RL when the value function is updated using Bellman backups. It occurs because the target value function and policy are updated using OOD actions with overestimated Q-values (Kumar et al., 2019). Studies have shown that bootstrapping error can lead to unstable training and even divergence from the optimal policy, particularly when the current policy output is far from the behavior policy that generated the transitions in the replay buffer (Fujimoto et al., 2019; Fujimoto and Gu, 2021; Kumar et al., 2019; Kumar et al., 2020).

In the IBRL, the critic (value) function’s parameters ϕ are updated with the following Bellman backup (Hu et al., 2024):

$$\mathcal{L}(\phi)=\mathbb{E}_{(s_t,a_t,r_t,s_{t+1})\sim\mathcal{B}}\Big[\big(\hat{Q}_\phi(s_t,a_t)-Q\big)^2\Big],\tag{1}$$

where

$$Q \leftarrow r_t+\gamma \max_{a\in\{a_{t+1}^{IL},\,a_{t+1}^{RL}\}} Q_{\phi}(s_{t+1},a).\tag{2}$$

Here, $\hat{Q}_\phi(s_t,a_t)$ is the estimated Q-value with states and actions sampled from the replay buffer $\mathcal{B}$, while the target value $Q_\phi(s_{t+1},a)$ in Equation 2 is estimated using the action proposed by either the current RL policy, $a_{t+1}^{RL}$, or the IL policy, $a_{t+1}^{IL}$. IBRL training starts with a replay buffer that mixes expert demonstrations and transitions collected during interaction, which introduces a mismatch between the current RL policy and the behavior policy. Although IBRL allows actions to be selected from the IL policy, whose output is closer to the behavior policy in the demonstrations, it relies on an accurate comparison between $Q_\phi(s_{t+1},a_{t+1}^{RL})$ and $Q_\phi(s_{t+1},a_{t+1}^{IL})$. However, because of the exploration noise during online interaction, the future rollout states $s_{t+1}$ sampled from $\mathcal{B}$ are likely OOD relative to the offline demonstration buffer $\mathcal{D}$ (Lee et al., 2022; Nakamoto et al., 2023; Zhou et al., 2024). When the IL policy proposes actions in these OOD states, the critic networks have no prior data for the resulting state–action pairs and may assign them lower Q-values than the OOD actions proposed by the RL agent. As a result, the lower bound provided by the IL policy fails when the RL policy is updated with bad OOD actions carrying overestimated Q-values. Such errors could be corrected by attempting the OOD action during online interaction and observing its actual return, but this, in turn, leads to inefficient policy exploration. Thus, finding a reliable and calibrated Q-value estimate is crucial for mitigating the bootstrapping error (Nakamoto et al., 2023).
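For concreteness, the sketch below shows one way the bootstrap target of Equations 1, 2 could be computed in PyTorch. It is a minimal illustration under assumed network shapes and names (MLP, q_target, rl_actor, il_actor); it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Small utility network standing in for the actors and the target critic."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, out_dim))
    def forward(self, *xs):
        return self.net(torch.cat(xs, dim=-1))

def ibrl_bootstrap_target(q_target, rl_actor, il_actor, r, s_next, done, gamma=0.99):
    """Bootstrap target of Equations 1-2: back up from whichever proposal
    (RL or IL action) the target critic scores higher."""
    with torch.no_grad():
        q_rl = q_target(s_next, rl_actor(s_next))   # Q(s_{t+1}, a_{t+1}^RL)
        q_il = q_target(s_next, il_actor(s_next))   # Q(s_{t+1}, a_{t+1}^IL)
        q_next = torch.maximum(q_rl, q_il)          # max over the two proposals
        return r + gamma * (1.0 - done) * q_next

# Toy batch with the open drawer dimensions (state 23, action 9) from Section 5.1.
batch, state_dim, action_dim = 32, 23, 9
q_target = MLP(state_dim + action_dim, 1)
rl_actor = MLP(state_dim, action_dim)
il_actor = MLP(state_dim, action_dim)
s_next = torch.randn(batch, state_dim)
r = torch.randn(batch, 1)
done = torch.zeros(batch, 1)
target = ibrl_bootstrap_target(q_target, rl_actor, il_actor, r, s_next, done)
```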

Another consequence of the bootstrapping error is that OOD actions selected by the RL policy during online interaction can lead to a state distribution shift. When the IL agent fails to provide high-quality actions for the unseen interaction states, the exploration efficiency of IBRL is degraded. Furthermore, although the IBRL authors state that both twin delayed DDPG (TD3) and soft actor–critic (SAC) can be used as the RL policy for continuous control tasks (Hu et al., 2024), they exclusively used TD3 in their experiments due to its strong performance and high sample efficiency in challenging image-based RL settings. However, we argue that the deterministic RL algorithm, TD3, is less suitable for high-dimensional, continuous state-based tasks, as it is more prone to overfitting offline data, converging to sub-optimal policies, and suffering from inefficient exploration (Haarnoja et al., 2018b). To prevent the RL policy from converging to a sub-optimal policy because of overfitting, a maximum entropy stochastic RL method, SAC, is adopted.

3 Preliminaries

This section presents an overview of maximum entropy DRL and IBRL.

3.1 Maximum entropy deep reinforcement learning

For sample efficiency, off-policy DRL methods have been widely studied due to their ability to learn from past experience. However, studies have also found that off-policy DRL methods struggle to maintain stability and convergence in high-dimensional continuous state–action spaces (Haarnoja et al., 2018b). To address this challenge, maximum entropy DRL has been proposed.

As the state–action spaces of the selected robotics tasks are continuous, we consider a Markov decision process (MDP) with continuous state–action spaces: an agent explores and interacts with an environment, and at each time step $t$, the agent observes the state $s_t$, takes an action $a_t$ according to the RL policy $\pi_\theta$ with parameters $\theta$, and receives a reward $r_t$. Different from standard RL, which aims to find a policy that maximizes the expected return in Equation 3,

$$J(\pi)=\sum_{t=0}^{T}\mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\big[\gamma^t\, r(s_t,a_t)\big],\qquad \gamma\in[0,1],\tag{3}$$

maximum entropy DRL aims to maximize both the discounted reward and the expected policy entropy $\mathcal{H}(\pi(\cdot\mid s_t))$ at each time step, as in Equation 4:

$$J(\pi)=\sum_{t=0}^{T}\mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\Big[\gamma^t\big(r(s_t,a_t)+\alpha\,\mathcal{H}(\pi(\cdot\mid s_t))\big)\Big],\tag{4}$$

where $T$ is the terminal time step, $\gamma\in[0,1]$ is the discount factor, and $\alpha$ is the temperature parameter, which determines the relative importance of the entropy term against the reward and thus controls the stochasticity of the optimal policy (Haarnoja et al., 2018b). With this objective, maximum entropy DRL methods have shown great potential for efficient online exploration under sparse reward settings (Hiraoka et al., 2021; Ball et al., 2023), which is consistent with the goal of this paper.
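As a small numerical illustration of Equations 3, 4, the sketch below computes the discounted return of a toy trajectory with and without the entropy bonus, using the closed-form entropy of a diagonal Gaussian policy. The trajectory, noise scale, and temperature are arbitrary placeholders, not values from the paper.

```python
import numpy as np

def discounted_return(rewards, entropies, gamma=0.99, alpha=0.2):
    """Equation 4: sum_t gamma^t * (r_t + alpha * H(pi(.|s_t))).
    Setting alpha = 0 recovers the standard objective of Equation 3."""
    t = np.arange(len(rewards))
    return np.sum((gamma ** t) * (np.asarray(rewards) + alpha * np.asarray(entropies)))

# Toy trajectory: sparse reward at the last step, diagonal Gaussian policy
# whose per-step entropy is 0.5 * sum(log(2*pi*e*sigma^2)).
rewards = [0.0, 0.0, 0.0, 1.0]
sigma = np.full(3, 0.3)                      # 3-D action, standard deviation 0.3
h = 0.5 * np.sum(np.log(2 * np.pi * np.e * sigma ** 2))
entropies = [h] * len(rewards)

print(discounted_return(rewards, entropies, alpha=0.0))  # objective of Eq. 3
print(discounted_return(rewards, entropies, alpha=0.2))  # objective of Eq. 4
```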

To apply maximum entropy RL in continuous spaces, one of the widely used methods, SAC (Haarnoja et al., 2018b), is applied.

3.2 Imitation bootstrapped reinforcement learning

IBRL is a sample-efficient DRL framework that combines a stand-alone IL policy with an off-policy DRL policy (Hu et al., 2024). First, IBRL requires an IL policy $\mu_\psi$ trained on expert demonstrations $\mathcal{D}$. The goal of $\mu_\psi$ is to mimic the expert behavior, and it can be trained by minimizing the BC loss $\mathcal{L}_{BC}$ in Equation 5:

$$\mathcal{L}_{BC}(\psi)=\mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\lVert\mu_\psi(s)-a\rVert_2^2\big].\tag{5}$$

IBRL then leverages the trained $\mu_\psi$ to help the DRL policy $\pi_\theta$ with online exploration and with its target value estimation, referred to as the actor proposal phase and the bootstrap proposal phase, respectively. In the actor proposal phase, IBRL selects between an IL action, $a^{IL}\sim\mu_\psi(s_t)$, and an RL action, $a^{RL}\sim\pi_\theta(s_t)$. The action with the higher Q-value computed by the target critic networks, $Q_\phi$, is selected for the online interaction. Thus, the action selection module in IBRL is defined in Equation 6:

$$a^*=\arg\max_{a\in\{a^{IL},\,a^{RL}\}} Q_{\phi}(s,a).\tag{6}$$

Furthermore, to prevent the Q-value update from converging to a local optimum, the soft version of IBRL selects actions according to a Boltzmann distribution over the Q-values instead of the argmax.

Similarly, in the bootstrap proposal phase, the future rollout is carried out by selecting the action via argmax, or its soft variant, between $Q_{\phi}(s_{t+1},a_{t+1}^{IL})$ and $Q_{\phi}(s_{t+1},a_{t+1}^{RL})$. The critic networks $Q_\phi(s_t,a_t)$ are updated as presented in Equation 1, and the RL policy network $\pi_\theta$ is updated as in the selected off-policy DRL algorithm.
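The following sketch reflects our reading of the actor proposal phase: the greedy selection of Equation 6 and the soft Boltzmann variant described above. The interfaces (a target critic callable and precomputed IL/RL action batches) are assumptions for illustration and follow the earlier target sketch; this is not the authors' implementation.

```python
import torch

def ibrl_propose(q_target, s, a_il, a_rl, temperature=None):
    """Actor proposal of Equation 6: act with whichever of a^IL, a^RL the target
    critic prefers. If a temperature is given, the choice is sampled from a
    Boltzmann distribution over the two Q-values (soft IBRL) instead of argmax."""
    with torch.no_grad():
        q = torch.stack([q_target(s, a_il), q_target(s, a_rl)], dim=0)  # (2, batch, 1)
        if temperature is None:
            choose_rl = (q[1] >= q[0]).float()                # greedy argmax choice
        else:
            p_rl = torch.softmax(q / temperature, dim=0)[1]   # Boltzmann weight of a^RL
            choose_rl = torch.bernoulli(p_rl)
        # Per-sample selection between the two candidate actions.
        return choose_rl * a_rl + (1.0 - choose_rl) * a_il
```

In the bootstrap proposal phase, the same comparison would be applied at $s_{t+1}$ to pick the action used in the Bellman backup.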

4 Methods

To reduce the exploration time wasted on correcting unreliable, overestimated Q-values and, in turn, improve exploration efficiency, it is crucial for the policy to favor distributions with more stable Q-values. This motivates selecting batches with reliable Q-value evaluations when updating both the critic and policy networks. Previous studies have shown that the Q-value estimate $Q_\phi(s_{t+1},a(s_{t+1}))$ is only reliable when $(s_{t+1},a(s_{t+1}))$ is sampled from the same distribution as the dataset used to train $\hat{Q}_\phi(s_t,a_t)$ (Kumar et al., 2019; Nakamoto et al., 2023). In our critic network update, instead of selecting between $Q_\phi(s_{t+1},\mu_\psi(s_{t+1}))$ and $Q_\phi(s_{t+1},\pi_\theta(s_{t+1}))$, where both $(s_{t+1},\mu_\psi(s_{t+1}))$ and $(s_{t+1},\pi_\theta(s_{t+1}))$ could be OOD state–action pairs, we propose to select between $Q_\phi(s_{t+1},\pi_\theta(s_{t+1}))$ and $Q_\phi(s'_{t+1},\mu_\psi(s'_{t+1}))$, where $s'_{t+1}$ is sampled only from $\mathcal{D}$. This modification ensures that $(s'_{t+1},\mu_\psi(s'_{t+1}))$ is always from the same distribution as $\mathcal{D}$, providing a reliable and calibrated Q-value estimate of the reference policy, whose values are on a similar scale to the true return of $\mathcal{D}$ (Nakamoto et al., 2023). With $\mathcal{D}$ fixed, we compare the mean estimated return of $(s_{t+1},\pi_\theta(s_{t+1}))$ sampled from $\mathcal{B}$ against the bootstrapping-error-free ground-truth mean return of $\mathcal{D}$, thereby reducing the accumulated bootstrapping error in the action selection process. Thus, when updating the critic network, Equation 2 becomes (Equation 7)

$$Q(s_t,a_t)\leftarrow r_t+\gamma\, Q_{\phi}\big(s_{t+1},a^*(s_{t+1})\big).\tag{7}$$

Compared with IBRL, the key modification is a simple action selection module, denoted $a^*(s)$:

$$a^*(s)=\begin{cases}\mu_\psi(s), & \bar{Q}_\phi\big(s',\mu_\psi(s')\big)>\bar{Q}_\phi\big(s,\pi_\theta(s)\big),\\ \pi_\theta(s), & \text{otherwise},\end{cases}\tag{8}$$

where $\bar{Q}$ denotes the mean of the estimated Q-values, $s$ is a state sampled from $\mathcal{B}$, and $s'$ is a state sampled only from $\mathcal{D}$.

In the bootstrap proposal phase, the future rollout states $s_{t+1}$ are sampled randomly from $\mathcal{B}$. One could select $s'_{t+1}$ as the state in $\mathcal{D}$ closest to $s_{t+1}$, enabling more precise comparisons between nearby state–action pairs. However, for implementation simplicity, $s'_{t+1}$ is currently sampled uniformly at random from $\mathcal{D}$. With simple random sampling, the expected sample mean Q-value of each batch, $\bar{Q}_\phi(s',\mu_\psi(s'))$, converges to the population mean Q-value of the expert buffer (Rice, 2007). Therefore, even though the comparison is made across different states, it remains valid because we compare the mean Q-values of the distributions induced by the IL policy and the RL policy.
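A minimal sketch of the proposed action selection module (Equation 8) is given below, assuming the same callable interfaces as the earlier sketches; buffer handling and variable names are illustrative assumptions, not the authors' code. Because the decision is made by comparing batch-mean Q-values, a single policy acts on the whole sampled batch.

```python
import torch

def drlr_select(q_target, pi_theta, mu_psi, s_batch, s_demo_batch):
    """Action selection of Equation 8.

    s_batch      : states sampled from the replay buffer B.
    s_demo_batch : states s' sampled only from the demonstration buffer D.
    The mean Q-value of the reference policy on D is compared against the
    mean Q-value of the RL policy on B; the winning policy acts on s_batch.
    """
    with torch.no_grad():
        q_ref_mean = q_target(s_demo_batch, mu_psi(s_demo_batch)).mean()  # Q-bar(s', mu(s'))
        q_rl_mean = q_target(s_batch, pi_theta(s_batch)).mean()           # Q-bar(s, pi(s))
        if q_ref_mean > q_rl_mean:
            return mu_psi(s_batch)      # reference policy proposes the actions
        return pi_theta(s_batch)        # otherwise keep the RL policy's actions
```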

Similarly, to align the policy used in the online interaction phase with the policy selected to propose future rollouts, the same action selection module (Equation 8) is used. With fewer OOD actions being selected, the state distribution shift is also mitigated. However, if $\mu_\psi(s)$ fails to provide good or recovering actions under a state distribution shift, the action selection module might fail: because $\mathcal{D}$ is fixed, $\bar{Q}_\phi(s',\mu_\psi(s'))$ is not updated, and the same bad behavior from the reference policy is repeatedly selected. Therefore, to leverage this action selection module for enhanced exploration efficiency, the initial online exploration states should lie within or near those in $\mathcal{D}$, and the reference policy should remain robust under small shifts in the state distribution.

Furthermore, to prevent the RL policy from overfitting the demonstration dataset and converging to a sub-optimal policy, we propose to replace TD3 with SAC. In SAC, the critic parameter $\phi$ is updated by minimizing the soft Bellman residual:

$$J_Q(\phi)=\mathbb{E}_{(s_t,a_t)\sim\mathcal{B}}\left[\tfrac{1}{2}\big(Q(s_t,a_t)-\hat{Q}_\phi(s_t,a_t)\big)^2\right],\tag{9}$$

where the target $Q(s_t,a_t)$ is computed using Equation 10:

$$Q(s_t,a_t)\leftarrow r_t+\gamma\Big[Q_{\phi}\big(s_{t+1},a^*(s_{t+1})\big)-\alpha\log\pi_\theta\big(f_\theta(\epsilon_{t+1};s_{t+1})\mid s_{t+1}\big)\Big].\tag{10}$$

The stochastic actor parameter θ is updated by minimizing the expected KL-divergence:

$$J_\pi(\theta)=\mathbb{E}_{s_t\sim\mathcal{B},\,\epsilon_t\sim\mathcal{N}}\Big[\alpha\log\pi_\theta\big(f_\theta(\epsilon_t;s_t)\mid s_t\big)-Q_\phi\big(s_t,f_\theta(\epsilon_t;s_t)\big)\Big],\tag{11}$$

where $f_\theta(\epsilon_t;s_t)$ is the reparameterized stochastic action and $\epsilon_t$ is an input noise vector sampled from some fixed distribution (Haarnoja et al., 2018b). We propose that this distribution could be derived from the demonstrations $\mathcal{D}$, but in this study, we only consider a simple Gaussian distribution $\mathcal{N}$. $\log\pi_\theta(f_\theta(\epsilon_t;s_t)\mid s_t)$ is the log-probability of the stochastic action $f_\theta(\epsilon_t;s_t)$ under the current policy $\pi_\theta$.
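To make the combined update concrete, the sketch below assembles the critic target of Equation 10 and the actor loss of Equation 11 for a reparameterized Gaussian policy. The a_star_next argument stands for the output of the action selection module in Equation 8, and all function names, interfaces, and hyperparameters are assumptions for this example rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def sac_losses(q_phi, q_target, actor_mean, actor_log_std,
               s, a, r, s_next, done, a_star_next, alpha=0.2, gamma=0.99):
    """Critic loss (Eq. 9) with the target of Eq. 10, and actor loss (Eq. 11)."""
    # --- Critic target (Eq. 10): bootstrap from a*(s_{t+1}) minus the entropy term.
    with torch.no_grad():
        dist_n = Normal(actor_mean(s_next), actor_log_std(s_next).exp())
        a_next = dist_n.rsample()                               # f_theta(eps; s_{t+1})
        logp_next = dist_n.log_prob(a_next).sum(-1, keepdim=True)
        target = r + gamma * (1.0 - done) * (
            q_target(s_next, a_star_next) - alpha * logp_next)

    critic_loss = 0.5 * F.mse_loss(q_phi(s, a), target)         # Eq. 9

    # --- Actor loss (Eq. 11): reparameterized action evaluated by the critic.
    dist = Normal(actor_mean(s), actor_log_std(s).exp())
    a_pi = dist.rsample()                                       # f_theta(eps; s)
    logp = dist.log_prob(a_pi).sum(-1, keepdim=True)
    actor_loss = (alpha * logp - q_phi(s, a_pi)).mean()
    return critic_loss, actor_loss
```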

Finally, to make the proposed framework robust to the quality of the demonstrations, we propose to use offline DRL as the reference policy (the IL policy in the IBRL framework) when the demonstration quality is unknown or imperfect. With its sequential decision-making ability, offline DRL can be more robust to demonstration quality than IL methods (Kumar et al., 2020; Fujimoto and Gu, 2021).

Combining all the modifications, DRLR is introduced in Algorithm 1; our new modifications are marked in red.


Algorithm 1.

5 Experiment design and evaluation

In this section, experiments are designed and conducted in the simulation to evaluate the proposed method. The experimental design and evaluation aim to answer the following core questions.

5.1 How generalizable is DRLR across environments with varying reward densities and state–action space complexities?

To answer the question, the tasks selected in the problem statement are studied under both dense reward and sparse reward settings. For the bucket loading task, the state and action dimensions are 4 and 3, respectively. The details, such as reward design, domain randomization, and prior demonstration collection, are provided in Section 6. For the open drawer task, the state and action dimensions are 23 and 9, respectively. The details of the open drawer task are provided in Makoviychuk et al. (2021a). The original reward design for the open drawer task is dense and contains distance reward, open drawer reward, and some bonus reward for opening the drawer properly. To study the same task with a sparse reward setting, we simply set the distance reward gain to 0. To collect simulated demonstrations for the open drawer task, a TD3 policy was trained with dense, human-designed rewards. A total of 30 prior trajectories are recorded by evaluating the trained TD3 with random noise added to the policy output.

Both tasks are trained with Isaac Gym (Makoviychuk et al., 2021b). All experiments with the open drawer task were run with 10 parallel environments, using two different random seeds (10 and 11) to ensure robustness and reproducibility. All experiments with the bucket loading task were run in a single environment, using two different random seeds (10 and 11). The detailed configurations for training each task are shown in Section 8.1.

This question is answered through the following evaluation results: Figure 5 demonstrates the performance of DRLR when learning the open drawer task with both sparse and dense rewards, achieving the highest reward in both reward settings; these results validate the robustness of DRLR toward varying reward densities. Figures 5, 6 present the performance of DRLR under different state–action space complexities. By outperforming IBRL on the open drawer task and achieving a comparable reward on the bucket loading task, the results validate the ability of DRLR to generalize across varying levels of state–action space complexity.

5.2 How effective is the proposed action selection module in addressing the bootstrapping error and improving exploration efficiency during learning compared to IBRL?

To examine the effectiveness of the action selection module in addressing bootstrapping error and improving exploration efficiency, we conducted experiments in which only the action selection module of the original IBRL framework was replaced. The reference policy is the IL policy, and the RL policy remains TD3 in both setups. Four criteria are recorded during training: 1) the Q-value of the Ref policy during action selection in the online interaction phase; 2) the Q-value of the RL policy during action selection in the online interaction phase; 3) the BC loss, $\mathcal{L}_{BC}(\theta)=\mathbb{E}_{(s,a)\sim\mathcal{B}}\big[\lVert\pi_\theta(s)-a\rVert_2^2\big]$, which measures the difference between the actions sampled from the replay buffer and the actions output by the RL policy; 4) reward convergence over training steps. Figures 3, 4 present a comparison of the considered criteria between the baseline IBRL and our proposed method across the two selected tasks in the sparse reward setting.


Figure 3. Exp2: Validation of the effectiveness of the proposed new action selection method using the open drawer task. (a) Q-value estimation with original IBRL. (b) Q-value estimation with our proposed action selection module. (c) BC loss. (d) Mean reward.


Figure 4. Exp3: Validation of the effectiveness of the proposed new action selection method using the bucket loading task. (a) Q-value estimation with original IBRL. (b) Q-value estimation with proposed new IBRL. (c) BC loss. (d) Mean reward.

The results for the open drawer task are shown in Figure 3. In Figure 3a, we compare the Q-values of the Ref policy and the RL policy during action selection in the online interaction phase of IBRL. The Q-values of the Ref policy are estimated to be close to those of the RL policy, and both exhibit high variance during training. Combined with the BC loss between the sampled actions and the agent's output actions in Figure 3c, this indicates a mismatch between the updated policy and the behavior policy, suggesting that OOD actions are being selected due to the bootstrapping error discussed in Section 2. As a result, the Ref policy fails to be selected and to provide reliable guidance, as reflected in the degraded performance in Figure 3d. Figure 3b presents a stable Q-value estimate for the Ref policy and a clearly higher mean value than the RL policy in the early training steps, which aligns with the core idea of the IBRL framework. The corresponding BC loss in Figure 3c is reduced by approximately 80% compared to that of IBRL, indicating that the bootstrapping error is effectively mitigated by our action selection method. Consequently, the Ref policy provides efficient guidance throughout RL training, as demonstrated by the improved reward convergence in Figure 3d. The proposed action selection module achieves a mean reward approximately four times higher than IBRL over the interaction steps.

The results for the bucket loading task are shown in Figure 4. Notably, the bucket loading experiments were run in a single environment since it is computationally expensive to simulate thousands of particles in parallel environments. Thus, the results of the bucket loading task have higher variance than those of the open drawer task, where 10 environments run in parallel. The results suggest that the action selection module has less effect on the low-dimensional state–action task, where the original IBRL can already achieve a near-optimal reward. This can also be attributed to the performance of the Ref policy: if the RL policy can easily attain a higher Q-value than the Ref policy, the effect of our action selection module is limited. Nevertheless, the stable Q-value estimate of the Ref policy in Figure 4b still validates the effectiveness of our action selection module in maintaining reliable Q-value estimation.

5.3 How effective is SAC in improving exploration efficiency during learning compared to the initial IBRL?

To examine the effectiveness of SAC in improving exploration efficiency, we compared 1) the original IBRL, denoted IBRL_TD3; 2) IBRL with our action selection module, denoted Ours_TD3; 3) IBRL with SAC as the RL policy, denoted IBRL_SAC; and 4) our DRLR framework, denoted Ours_SAC. The Ref policy remains the IL policy in all setups.

The reward convergence over training steps is recorded as the main evaluation criterion. Figures 5, 6 present a comparison of the considered experiments across the two selected tasks. The results for the open drawer task under the varied reward settings are shown in Figure 5. The reward convergence suggests that, with the same number of training steps, the experiments with SAC are able to reach higher rewards than those with TD3, which converge to a sub-optimal reward. The results for the bucket loading task in the sparse reward setting are shown in Figure 6. The results suggest that our method and IBRL achieve similar performance in low-dimensional state–action spaces.


Figure 5. Exp4: Validation of the effectiveness of SAC using the open drawer task. (a) Dense reward setting. (b) Sparse reward setting.


Figure 6. Exp5: Validation of the effectiveness of SAC using the bucket loading task.

The final evaluation results of each algorithm across the two tasks are shown in Table 1, which shows that DRLR achieves the best evaluation performance on both tasks. In the open drawer task with sparse rewards, DRLR improves the average reward by approximately 347%, a dramatic improvement.


Table 1. Average rewards obtained by evaluating each RL policy at the last training step over five episodes.

5.4 What is the impact of demonstration quality on the performance of our method?

To evaluate the robustness of the proposed method toward varying demonstration qualities, the following experiments were conducted. We fill the demonstration dataset with 1) 50% data generated by a random policy, denoted 50% demo, and 2) a sub-optimal demo, where noise is added to the expert policy outputs. For simplicity, a BC policy is selected as the IL policy, and a minimalist approach to offline RL, TD3 + BC (Fujimoto and Gu, 2021), is selected as our Ref policy. Due to the complexity of designing such experiments, only the open drawer task with sparse rewards, which is the most difficult to learn, is evaluated. The results are shown in Figure 7. Figure 7a demonstrates that TD3 + BC can learn a good policy even from the 50% demo, while BC cannot; TD3 + BC also learns a better policy from the sub-optimal demo. Figure 7b validates the robustness of our method toward varying demonstration qualities, achieving the same level of reward with both datasets.


Figure 7. Exp6: Validation of the robustness of our framework toward varying demonstration qualities using the open drawer task. (a) Comparison between the IL policy and offline RL policy. (b) Reward convergence with the varying demonstration qualities.

We have thus demonstrated the effectiveness of the proposed method. The method is also applied to a real industrial application to showcase the implementation process and the sim-to-real performance.

6 Real industrial applications

This section presents an application of the proposed framework to the wheel loader loading task, where only a limited number of expert demonstrations are used, demonstrating the data efficiency of the framework. The implementation is illustrated in Figure 8.


Figure 8. Illustration of the implementation of applying the proposed framework to the automatic wheel loader loading task.

6.1 Bucket–media simulation

Before learning with the proposed framework, it is important to create a simulation environment similar to the real world to enable policy exploration, while applying domain randomization to deal with observation shifts. In the simulation, the wheel loader is configured with the dynamic parameters obtained from the real machine. Because it is impractical to directly model the hydraulic actuation force or the bucket–media interaction force for different materials and geometries, this paper regularizes the external torque rather than modeling it: we propose to use an admittance controller to decrease the variance of the external torque by adjusting the position reference. The implementation of the admittance controller is provided in the Supplementary Appendices.
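Since the actual controller is given in the Supplementary Appendices, the sketch below only illustrates a generic one degree-of-freedom admittance law of the kind described here: the position reference is shifted so that the estimated external torque tracks a desired torque. The gains, integration scheme, signs, and example numbers are placeholder assumptions, not the deployed controller.

```python
import numpy as np

class AdmittanceController:
    """Generic 1-DOF joint-space admittance law (illustrative only).

    It shifts the position reference q_ref by delta, governed by
    M*dd(delta) + D*d(delta) + K*delta = tau_ext - tau_des.
    """
    def __init__(self, M=1.0, D=20.0, K=50.0, dt=0.1):
        self.M, self.D, self.K, self.dt = M, D, K, dt
        self.delta, self.delta_dot = 0.0, 0.0

    def step(self, q_ref, tau_ext, tau_des):
        torque_error = tau_ext - tau_des
        delta_ddot = (torque_error - self.D * self.delta_dot - self.K * self.delta) / self.M
        self.delta_dot += delta_ddot * self.dt      # explicit Euler integration
        self.delta += self.delta_dot * self.dt
        return q_ref + self.delta                   # modified position reference

# Example: adjust a boom reference when the measured torque exceeds the desired torque.
ctrl = AdmittanceController(dt=0.1)                 # 10 Hz, matching the sensor rate
q_mod = ctrl.step(q_ref=0.2, tau_ext=900.0, tau_des=800.0)
```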

Table 2 shows the parameters we randomized to simulate bucket–media interactions with different pile geometries and pile materials. A comparison of the estimated external torque during penetration of the pile between simulation and real-world experiments is presented in Figure 9. Different from real-world settings, the external torque is estimated from contact sensors in the simulation, due to the poor performance of the force sensor in Isaac Gym.


Table 2. Domain randomization parameters and their sampling ranges.


Figure 9. Comparison of the estimated external torque during penetration between simulation and real-world experiments. In the real-world experiment (orange), the external torque is measured while loading dry sand. In the simulation experiment, the external torque (green and blue) is generated by loading sand and stone piles, using the same penetration motion as in the real-world experiment.

6.2 DRLR implementation

Both the Ref and RL policies have four inputs, namely, [q1,q2,La,τ̂e], representing boom joint position, bucket joint position, advancing length, and estimated external torque, and three outputs, namely, [qd1,qd2,τd], where qd1,qd2 are desired position references for boom and bucket joint positions, respectively, and τd is the desired torque reference for admittance control that is only used during penetration.

To train the Ref policy, 10 expert demonstrations of loading dry sand piles with changing pile geometries are recorded. During demonstration, [q1,q2,La] are directly used as inputs, and τ̂e is scaled to [−1, 1]. Position references are obtained from the forward dynamics of the actuation signals sent during the demonstrations; they are first normalized and then used as [qd1,qd2], while the scaled τ̂e is directly assigned as τd. The state–action pairs used for training the reference policy are shown in Figure 10. For simplicity, BC is used to train the reference policy.


Figure 10. State–action pairs for training the Ref policy. Each curve represents the data recorded in one bucket loading demonstration.
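For illustration, a minimal behavior cloning loop on the state–action layout described above (four scaled inputs, three scaled outputs) might look as follows. The placeholder data, network size, and training schedule are assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

# Assumed layout: demos holds the scaled pairs [q1, q2, La, tau_e, qd1, qd2, tau_d]
# concatenated from the 10 recorded demonstrations (placeholder data here).
demos = torch.randn(2000, 7)
states, actions = demos[:, :4], demos[:, 4:]

bc_policy = nn.Sequential(nn.Linear(4, 128), nn.ReLU(),
                          nn.Linear(128, 128), nn.ReLU(),
                          nn.Linear(128, 3), nn.Tanh())   # outputs scaled to [-1, 1]
optim = torch.optim.Adam(bc_policy.parameters(), lr=1e-3)

for epoch in range(200):
    pred = bc_policy(states)
    loss = ((pred - actions) ** 2).sum(dim=-1).mean()     # BC loss of Equation 5
    optim.zero_grad()
    loss.backward()
    optim.step()
```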

The wheel loader loading process can be divided into three phases, as shown in Figure 11: penetrate, shovel, and lift (Sarata et al., 2004).


Figure 11. Three phases of the wheel loader loading process.

To train DRL, the bucket loading task is divided into two sub-tasks as shown in Equation 12:

$$\text{subtask}=\begin{cases}P_1, & q_{d2}\le -0.5,\\ P_2\,\&\,P_3, & \text{otherwise}.\end{cases}\tag{12}$$

In phase 1, P1, the boom and bucket penetrate the pile with an admittance controller tracking [qd1,qd2,τd], and the loader moves forward at a constant velocity. In phases 2 and 3, P2&P3, the controller switches to an inverse dynamics controller tracking only the position references [qd1,qd2], and the loader stops moving forward. The transition between P1 and P2&P3 is determined by the point at which the loader stops moving forward; based on the demonstrations, this transition is identified when the desired bucket reference position qd2 surpasses approximately −0.5.

The goal of the bucket loading task is to achieve a full bucket-fill rate with the boom–bucket joints reaching their designated end positions, corresponding to the maximum allowable values within the position reference range. This leads to a natural sparse reward setting, where the reward only occurs at the end of the task. However, a sparse reward requires a longer training time because it is more difficult for the RL agent to explore than under dense reward settings. Although Shen and Sloth (2024a) demonstrated successful performance with dense rewards, designing such rewards is challenging and may lead to sub-optimal actions. Since our framework has shown robust performance in sparse reward settings, a simpler sparse reward is designed in Equation 13:

$$r=\begin{cases}R_f+R_e, & t=T=50,\\ -10, & \text{Fail},\\ 0, & \text{otherwise},\end{cases}\tag{13}$$

where $T$ represents the final step of an episode. A loading failure (Fail) occurs if the bucket-fill rate reward $R_f$ and the end reward $R_e$ do not reach at least half of their maximum designed values by the final step $T$. The rewards $R_f$ and $R_e$ are defined in Equation 14:

$$R_f=\frac{V}{V_{\max}},\qquad R_e=1-\frac{d}{d_{\max}},\tag{14}$$

where $V_{\max}$ is the bucket capacity and $V$ is the current bucket load volume, estimated as $V=\hat{\tau}_e/(\rho_{rad}\,g\,l_1)$, where $\rho_{rad}$ is the particle density, $g$ is the gravitational acceleration, and $l_1$ is the length of the boom. $d$ is the Euclidean distance between the current boom–bucket joint position and the end position, while $d_{\max}$ is the Euclidean distance between the initial boom–bucket joint position and the end position.
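A sketch of the sparse reward in Equations 13, 14 is given below. The episode length, failure penalty, success test, and all numeric values (density, boom length, bucket capacity) follow our reconstruction of the equations and are placeholders rather than the deployed settings.

```python
def bucket_loading_reward(t, T, tau_e_hat, rho, g, l1, V_max, d, d_max):
    """Sparse reward of Eqs. 13-14 (illustrative reconstruction).

    Non-terminal steps return 0; at the final step T the agent receives
    R_f + R_e on success and a fixed penalty on failure.
    """
    if t < T:
        return 0.0
    V = tau_e_hat / (rho * g * l1)          # bucket load volume estimated from torque
    R_f = V / V_max                         # bucket-fill rate reward
    R_e = 1.0 - d / d_max                   # end-position reward
    success = R_f >= 0.5 and R_e >= 0.5     # both at least half of their maximum
    return (R_f + R_e) if success else -10.0

# Example terminal step: nearly full bucket, joints close to the end position.
print(bucket_loading_reward(t=50, T=50, tau_e_hat=7.0e3, rho=1500.0, g=9.81,
                            l1=1.2, V_max=0.45, d=0.05, d_max=1.0))
```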

6.3 Sim-to-real results

The reward convergence results for learning the bucket loading task are shown in Figure 6. The trained actor is deployed on a real MUSTANG 2040 wheel loader operating in wet sand and stone pile fields. The experiment site is shown in Figure 12.


Figure 12. Experiment site showing MUSTANG 2040 operating in wet sand and stone pile fields.

In the experiments, the inputs [q1,q2] are measured in radians with inertial measurement units (IMUs) mounted on the boom and bucket. La denotes the forward distance of the loader, determined using GNSS antennas mounted on the machine. τ̂e is computed from the pressure sensor readings obtained on both sides of the boom and bucket hydraulic pistons. All sensors operate at an update rate of 10 Hz. The outputs [qd1,qd2,τd] come from the deployed neural networks (NNs), while the loader's forward motion is manually controlled by an operator at a random speed; the operator halts the forward motion upon noticing the boom lifting.

First, a two-sided admittance controller with both position and torque references is tested. However, due to the highly compacted nature of the wet sand and stone piles, the downward curl of the bucket generates extremely large normal forces, causing the admittance controller to fail to track τd and, consequently, leading to boom and bucket vibrations during penetration and unstable outputs from the deployed actor network. These unstable NN outputs could result from a state distribution shift caused by the large normal forces during interaction with compacted material. In the simulation environment, such a compaction effect is not accurately modeled, as the material pile is simulated using discrete particles that lack adhesive or cohesive properties. A penalty for causing such unsafe behavior should be considered in future reward designs.

For safety and stable performance, only a one-sided admittance controller is used in the following experiments, with position references [qd1,qd2] and a torque saturation limit τsat = 800 N·m to prevent the bucket from getting stuck.

To evaluate the policy, 25 experiments were carried out, involving 10 trials for loading wet sand and 15 trials for loading stone. Sim-to-real results for loading stones are presented in Figure 13. Despite changing environments, including pile geometries, material types, and forwarding velocities, all the experiments successfully loaded and lifted the materials. The average bucket-fill rates for loading sand and stone in the simulation and real-world experiments are provided in Table 3. To compare the sim-to-real performance in terms of the bucket-fill rate, the bucket-fill rates in simulation are also recorded and averaged over five episodes. The bucket-fill rate differences between simulation and real-world experiments may stem from environmental uncertainties present under real-world conditions, such as the irregular pile shapes.


Figure 13. Sim-to-real results of 15 trials for loading stone with different pile geometries. Each curve represents the data recorded in one bucket-loading sim-to-real experiment.


Table 3. Average bucket-fill rates for simulation and real-world experiments.

7 Conclusion

This paper proposes and implements an exploration-efficient DRLR framework to reduce the need for extensive interaction when applying off-policy DRL to real-world robotic tasks. The designed experiments empirically validate the effectiveness of our framework in mitigating bootstrapping errors and addressing convergence to sub-optimal policies, ultimately reducing the exploration required to attain high-performing policies compared to IBRL. Furthermore, we demonstrated the implementation details for using the DRLR framework on a real industrial robotics task, wheel loader bucket loading. The sim-to-real results validate the successful deployment of the considered framework, demonstrating its potential for application to complex robotic tasks.

In future work, the action selection module could be improved by selecting $s'_{t+1}$ as the state closest to $s_{t+1}$ within $\mathcal{D}$, using a Euclidean or Mahalanobis distance, thereby enabling more precise comparisons between neighboring state–action pairs. To better demonstrate the advantages of DRLR, it should also be compared against established offline-to-online DRL baselines that explicitly address bootstrapping errors, such as CAL-QL, RLPD, and WSRL (Nakamoto et al., 2023; Ball et al., 2023; Zhou et al., 2024).

Moreover, one could also consider using deep ensembles to quantify the uncertainties in the demonstrations and utilize these uncertainties as prior data for the SAC entropy. Integrating the concepts of active learning and uncertainty-aware RL into the proposed framework could further improve the exploration efficiency.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

CeS: Writing – original draft, Software, Resources, Visualization, Validation, Formal Analysis, Methodology, Writing – review and editing, Data curation, Conceptualization. CrS: Writing – review and editing, Writing – original draft.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This work was supported and funded by Unicontrol ApS, and Innovation Fund Denmark, grant number 1044-5800117B. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article, or the decision to submit it for publication.

Acknowledgements

The authors would like to thank Unicontrol ApS for granting permission to use their wheel loader and Unicontrol’s 3D machine control system to collect demonstration data and conduct the sim-to-real experiments.

Conflict of interest

Author CeS was employed by Unicontrol ApS.

The remaining author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.


Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frobt.2025.1682200/full#supplementary-material

References

Allshire, A., MittaI, M., Lodaya, V., Makoviychuk, V., Makoviichuk, D., Widmaier, F., et al. (2022). “Transferring dexterous manipulation from gpu simulation to a remote real-world trifinger,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE), 11802–11809.

CrossRef Full Text | Google Scholar

Ball, P. J., Smith, L., Kostrikov, I., and Levine, S. (2023). “Efficient online reinforcement learning with offline data,” in International Conference on Machine Learning (PMLR), 1577–1594.

Google Scholar

Fujimoto, S., and Gu, S. S. (2021). A minimalist approach to offline reinforcement learning. Adv. Neural Information Processing Systems 34, 20132–20145. doi:10.48550/arXiv.2106.06860

CrossRef Full Text | Google Scholar

Fujimoto, S., Hoof, H., and Meger, D. (2018). “Addressing function approximation error in actor-critic methods,” in International conference on machine learning (PMLR), 1587–1596.

Google Scholar

Fujimoto, S., Meger, D., and Precup, D. (2019). “Off-policy deep reinforcement learning without exploration,” in International conference on machine learning (PMLR), 2052–2062.

Google Scholar

Gao, Y., Xu, H., Lin, J., Yu, F., Levine, S., and Darrell, T. (2018). Reinforcement learning from imperfect demonstrations. arXiv preprint arXiv:1802.05313

Google Scholar

Goecks, V. G., Gremillion, G. M., Lawhern, V. J., Valasek, J., and Waytowich, N. R. (2019). Integrating behavior cloning and reinforcement learning for improved performance in dense and sparse reward environments. arXiv preprint arXiv:1910.04281

Google Scholar

Haarnoja, T., Ha, S., Zhou, A., Tan, J., Tucker, G., and Levine, S. (2018a). Learning to walk via deep reinforcement learning. arXiv preprint arXiv:1812.11103

Google Scholar

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018b). “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International conference on machine learning (PMLR), 1861–1870.

Google Scholar

Hiraoka, T., Imagawa, T., Hashimoto, T., Onishi, T., and Tsuruoka, Y. (2021). Dropout q-functions for doubly efficient reinforcement learning. arXiv preprint arXiv:2110.02034

Google Scholar

Hu, H., Mirchandani, S., and Sadigh, D. (2024). “Imitation bootstrapped reinforcement learning,” in Robotics: science and systems (RSS).

Google Scholar

Hussein, A., Gaber, M. M., Elyan, E., and Jayne, C. (2017). Imitation learning: a survey of learning methods. ACM Comput. Surv. (CSUR) 50, 1–35. doi:10.1145/3054912

CrossRef Full Text | Google Scholar

Ibarz, J., Tan, J., Finn, C., Kalakrishnan, M., Pastor, P., and Levine, S. (2021). How to train your robot with deep reinforcement learning: lessons we have learned. Int. J. Robotics Res. 40, 698–721. doi:10.1177/0278364920987859

CrossRef Full Text | Google Scholar

Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. (2019). Stabilizing off-policy q-learning via bootstrapping error reduction. Adv. Neural Information Processing Systems 32. doi:10.48550/arXiv.1906.00949

CrossRef Full Text | Google Scholar

Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020). Conservative q-learning for offline reinforcement learning. Adv. Neural Information Processing Systems 33, 1179–1191. doi:10.48550/arXiv.2006.04779

CrossRef Full Text | Google Scholar

Lee, S., Seo, Y., Lee, K., Abbeel, P., and Shin, J. (2022). “Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble,” in Conference on Robot Learning (PMLR), 1702–1712.

Google Scholar

Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643

Google Scholar

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., et al. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. doi:10.48550/arXiv.1509.02971

CrossRef Full Text | Google Scholar

Makoviychuk, V., Wawrzyniak, L., Guo, Y., Lu, M., Storey, K., Macklin, M., et al. (2021a). isaac-sim/IsaacGymEnvs. Available online at: https://github.com/isaac-sim/IsaacGymEnvs/blob/main/isaacgymenvs/tasks/franka_cabinet.py.

Google Scholar

Makoviychuk, V., Wawrzyniak, L., Guo, Y., Lu, M., Storey, K., Macklin, M., et al. (2021b). Isaac gym: high performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470

Google Scholar

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., et al. (2016). “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning (PMLR), 1928–1937.

Google Scholar

Nair, A., Gupta, A., Dalal, M., and Levine, S. (2020). Awac: accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359

Google Scholar

Nakamoto, M., Zhai, S., Singh, A., Sobol Mark, M., Ma, Y., Finn, C., et al. (2023). Cal-ql: calibrated offline rl pre-training for efficient online fine-tuning. Adv. Neural Inf. Process. Syst. 36, 62244–62269. doi:10.48550/arXiv.2303.05479

CrossRef Full Text | Google Scholar

Nguyen, H., and La, H. (2019). “Review of deep reinforcement learning for robot manipulation,” in 2019 Third IEEE international conference on robotic computing (IRC) (IEEE), 590–595.

CrossRef Full Text | Google Scholar

Osinski, B., Finn, C., Erhan, D., Tucker, G., Michalewski, H., Czechowski, K., et al. (2020). Model-based reinforcement learning for atari. ICLR 1, 2.

Google Scholar

Qi, H., Kumar, A., Calandra, R., Ma, Y., and Malik, J. (2023). “In-hand object rotation via rapid motor adaptation,” in Conference on Robot Learning (PMLR), 1722–1732.

Google Scholar

Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., and Dormann, N. (2021). Stable-baselines3: reliable reinforcement learning implementations. J. Machine Learning Research 22, 1–8.

Google Scholar

Rice, J. A. (2007). Mathematical statistics and data analysis, 371. Belmont, CA: Thomson/Brooks/Cole.

Google Scholar

Rudin, N., Hoeller, D., Reist, P., and Hutter, M. (2022). “Learning to walk in minutes using massively parallel deep reinforcement learning,” in Conference on robot learning (PMLR), 91–100.

Google Scholar

Sarata, S., Osumi, H., Kawai, Y., and Tomita, F. (2004). “Trajectory arrangement based on resistance force and shape of pile at scooping motion,” in 2004 international conference on robotics and automation (ICRA) (New York: IEEE), 4, 3488–3493. doi:10.1109/robot.2004.1308793

CrossRef Full Text | Google Scholar

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347

Google Scholar

Shen, C., and Sloth, C. (2024a). “Generalized framework for wheel loader automatic shoveling task with expert initialized reinforcement learning,” in IEEE/SICE international symposium on system integration (SII), 382–389.

Google Scholar

Song, Y., Zhou, Y., Sekhari, A., Bagnell, J. A., Krishnamurthy, A., and Sun, W. (2022). Hybrid rl: using both offline and online data can make rl efficient. arXiv preprint arXiv:2210.06718

Google Scholar

Uchendu, I., Xiao, T., Lu, Y., Zhu, B., Yan, M., Simon, J., et al. (2023). “Jump-start reinforcement learning,” in International Conference on Machine Learning (PMLR), 34556–34583.

Google Scholar

Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., et al. (2017). Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817

Google Scholar

Zhang, H., Xu, W., and Yu, H. (2023). Policy expansion for bridging offline-to-online reinforcement learning. arXiv preprint arXiv:2302.00935

Google Scholar

Zhou, Z., Peng, A., Li, Q., Levine, S., and Kumar, A. (2024). Efficient online reinforcement learning fine-tuning need not retain offline data. arXiv preprint arXiv:2412.07762

Google Scholar

Keywords: deep reinforcement learning, learning from demonstration, automation in construction, robotics, sim-to-real

Citation: Shen C and Sloth C (2026) Solving robotics tasks with prior demonstration via exploration-efficient deep reinforcement learning. Front. Robot. AI 12:1682200. doi: 10.3389/frobt.2025.1682200

Received: 08 August 2025; Accepted: 08 December 2025;
Published: 12 January 2026.

Edited by:

Hongwei Mo, Harbin Engineering University, China

Reviewed by:

Shuo Ding, Nanjing University of Aeronautics & Astronautics, China
Ravishankar Prakash Desai, Amrita Vishwa Vidyapeetham, Amaravati Campus, India

Copyright © 2026 Shen and Sloth. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Chengyandan Shen, cshen@mmmi.sdu.dk
