- 1 Unicontrol ApS, Odense, Denmark
- 2 SDU Robotics, Maersk McKinney Møller Institute, University of Southern Denmark, Odense, Denmark
This paper proposes an exploration-efficient deep reinforcement learning with reference (DRLR) policy framework for learning robotics tasks incorporating demonstrations. The DRLR framework is developed based on an imitation bootstrapped reinforcement learning (IBRL) algorithm. Here, we propose to improve IBRL by modifying the action selection module. The proposed action selection module provides a calibrated Q-value, which mitigates the bootstrapping error that otherwise leads to inefficient exploration. Furthermore, to prevent the reinforcement learning (RL) policy from converging to a sub-optimal policy, soft actor–critic (SAC) is used as the RL policy instead of twin delayed DDPG (TD3). The effectiveness of our method in mitigating the bootstrapping error and preventing overfitting is empirically validated by learning two robotics tasks: bucket loading and open drawer, which require extensive interactions with the environment. Simulation results also demonstrate the robustness of the DRLR framework across tasks with both low and high state–action dimensions and varying demonstration qualities. To evaluate the developed framework on a real-world industrial robotics task, the bucket loading task is deployed on a real wheel loader. The sim-to-real results validate the successful deployment of the DRLR framework.
1 Introduction
Model-free deep reinforcement learning (DRL) has shown great potential in learning continuous control tasks in robotics (Allshire et al., 2022; Qi et al., 2023; Rudin et al., 2022; Haarnoja et al., 2018a; Nguyen and La, 2019; Ibarz et al., 2021). However, there remain challenges that limit the widespread applicability of these methods in real-world robotic applications. One major challenge is the poor sample efficiency of learning with model-free DRL; even relatively simple tasks can require millions of interaction steps, while learning policies from high-dimensional observations or complex environments may require significantly more interactions (Haarnoja et al., 2018b; Osinski et al., 2020; Raffin et al., 2021). A primary cause for the poor sample efficiency is on-policy learning (Haarnoja et al., 2018b) since some of the most widely used DRL algorithms, such as A3C (Mnih et al., 2016) and PPO (Schulman et al., 2017), require new interactions with the environments for each gradient step. Consequently, on-policy DRL is often impractical for real-world systems as allowing untrained policies to interact with real systems can be both costly and dangerous. Even when learning occurs solely in simulation, it is still preferred to utilize previously collected data instead of starting from scratch (Levine et al., 2020). On the other hand, off-policy DRL methods improve sample efficiency by reusing past experience and have demonstrated strong performance on continuous control tasks (Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018b; Fujimoto et al. 2019; Fujimoto and Gu, 2021; Kumar et al., 2020). However, for complex robotics tasks where data collection itself is expensive, e.g., in construction machines, educational agents, or medical devices, even off-policy approaches become costly when the DRL policy requires extensive explorations. Under these scenarios, improving exploration efficiency is as crucial as sample efficiency to reduce the exploration needed for achieving a good policy.
Therefore, effectively leveraging prior demonstrations to facilitate efficient exploration is considered a promising strategy for the broad application of off-policy DRL in real-world industrial robotics. Two main research directions have emerged to achieve this goal:
1.1 Offline-to-online DRL
Pretraining DRL on prior expert demonstrations and then continuing training with online data has shown impressive exploration efficiency (Vecerik et al., 2017; Nair et al., 2020; Uchendu et al., 2023; Zhou et al., 2024; Goecks et al., 2019; Lee et al., 2022). Early studies initialize training by mixing offline demonstrations and online interactions in the replay buffer and use a prioritized replay mechanism to enable efficient exploration by the reinforcement learning (RL) policy (Vecerik et al., 2017; Song et al., 2022). More recent approaches separate offline pretraining from online fine-tuning and report superior exploration efficiency (Gao et al., 2018; Goecks et al., 2019; Nair et al., 2020; Lee et al., 2022). In offline training, a behavior cloning (BC) loss or Kullback–Leibler (KL) divergence is typically used to encourage the RL policy to closely follow the behavior policy that generated the demonstrations, thereby facilitating efficient exploration in online interactions. However, when transferring to the online interaction phase, some methods need to "recalibrate" the offline Q-estimates to the new online distribution to maintain learning stability and mitigate forgetting of pre-trained initializations (Nair et al., 2020; Uchendu et al., 2023; Ball et al., 2023).
1.2 DRL-Ref policy
Several recent studies have proposed to explicitly integrate a reference policy, trained on prior demonstrations, to guide DRL training (Zhang et al., 2023; Hu et al., 2024). In these works, a stand-alone reference policy is trained on offline demonstrations and then used to provide additional guidance in the DRL online learning phase. In this work, we consider the imitation bootstrapped reinforcement learning (IBRL) framework an ideal approach for learning robotics tasks with prior demonstrations, as it prevents catastrophic forgetting of pre-trained initializations and automatically balances offline and online training (Hu et al., 2024).
However, the IBRL framework is built on off-policy RL and imitation learning (IL). It risks the same challenges posed by bootstrapping errors in off-policy RL (Kumar et al., 2019; Kumar et al., 2020; Fujimoto et al., 2019; Fujimoto and Gu, 2021), where the target critic and actor networks are updated using out-of-distribution (OOD) actions with overestimated Q-values (Kumar et al., 2019; Kumar et al., 2020). Meanwhile, the IL policy in IBRL can also face state distribution shift (Hussein et al., 2017) when OOD actions keep being selected. To address these challenges, in this work, we propose an exploration-efficient DRL with reference (DRLR) policy framework, as shown in Figure 1, and summarize our contributions as follows:
1. Identify and analyze the main cause of the failure cases trained with the IBRL framework: distribution shift due to the bootstrapping error.
2. Propose a simple action selection module and use a maximum entropy RL to mitigate inefficient explorations caused by bootstrapping errors and convergence on a sub-optimal policy due to overfitting.
3. Demonstrate the effectiveness and robustness of the proposed framework on tasks with both low and high state–action dimensions and demonstrations of different quality.
4. Showcase an implementation and deployment of the proposed framework on a real industrial task.
Figure 1. Overview of the proposed exploration-efficient DRLR framework. The proposed framework extends a sample-efficient DRL-Ref method with a simple action selection module to mitigate inefficient explorations caused by (1) bootstrapping errors leading to the RL policy selecting out-of-distribution actions; (2) Ref policy failing to provide good actions under state distribution shifts.
2 Problem statement
The proposed framework is generalized toward learning robotics tasks with the following characteristics: 1) collecting a large amount of data is costly; 2) learning requires extensive interactions; and 3) a small number of expert demonstrations is available. Based on these characteristics, the bucket loading (Shen and Sloth, 2024a) and open drawer (Makoviychuk et al., 2021a) tasks are selected to evaluate the effectiveness of the proposed framework. The task environments are shown in Figure 2.
Compared to the selected DRL-Ref framework, IBRL, the proposed framework attempts to mitigate distribution shift caused by bootstrapping errors and prevent convergence to a sub-optimal policy from overfitting to the demonstrations.
Bootstrapping error can arise in off-policy RL when the value function is updated using Bellman backups. It occurs because the target value function and policy are updated using OOD actions with overestimated Q-values (Kumar et al., 2019). Studies have shown that bootstrapping error can lead to unstable training and even divergence from the optimal policy (Kumar et al., 2019; Kumar et al., 2020), particularly when the current policy output is far from the behavior policy, which is used to generate the transitions in the replay buffer (Fujimoto et al., 2019; Fujimoto and Gu, 2021; Kumar et al., 2020; Kumar et al., 2019).
In IBRL, the critic (value) function's parameters $\phi$ are updated by minimizing the temporal-difference error

$\mathcal{L}(\phi) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\big[\big(Q_\phi(s,a) - y\big)^2\big],$

where the bootstrap target is computed from the proposals of both policies,

$y = r + \gamma \max_{a' \in \{\pi_{\bar{\theta}}(s'),\, \mu_\psi(s')\}} Q_{\bar{\phi}}(s', a').$

Here, $Q_{\bar{\phi}}$ and $\pi_{\bar{\theta}}$ denote the target critic and target actor, $\mu_\psi$ is the IL (reference) policy trained on the demonstrations, and $\mathcal{D}$ is the replay buffer. If the maximizing action is an OOD action with an overestimated Q-value, the error propagates through the Bellman backup, which is the bootstrapping error discussed above.
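For concreteness, a minimal PyTorch-style sketch of this bootstrap target is given below; the callables rl_actor_target, il_policy, and critic_target and the batch layout are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the IBRL-style critic target described above.
# `rl_actor_target`, `il_policy`, `critic_target`, and the batch keys are
# illustrative assumptions.
import torch

def ibrl_critic_target(batch, rl_actor_target, il_policy, critic_target, gamma=0.99):
    """Compute y = r + gamma * max over {a'_RL, a'_IL} of Q_target(s', a')."""
    s_next, reward, not_done = batch["next_obs"], batch["reward"], batch["not_done"]

    with torch.no_grad():
        a_rl = rl_actor_target(s_next)      # proposal from the RL target actor
        a_il = il_policy(s_next)            # proposal from the IL (reference) policy

        q_rl = critic_target(s_next, a_rl)  # Q-value of the RL proposal
        q_il = critic_target(s_next, a_il)  # Q-value of the IL proposal

        # Bootstrap with whichever proposal the target critic values higher.
        q_next = torch.maximum(q_rl, q_il)
        y = reward + gamma * not_done * q_next
    return y
```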
Another disadvantage of the bootstrapping error is that OOD actions selected by the RL policy during online interaction can lead to state distribution shift. When the IL agent fails to provide high-quality actions for the unseen interaction states, the exploration efficiency of IBRL degrades. Furthermore, although IBRL states that both twin delayed DDPG (TD3) and soft actor–critic (SAC) can be used as RL policies for continuous control tasks (Hu et al., 2024), the authors exclusively used TD3 in their experiments due to its strong performance and high sample efficiency in challenging image-based RL settings. However, we argue that the deterministic RL algorithm TD3 is less suitable for high-dimensional, continuous state-based tasks, as it is more prone to overfitting offline data, converging to sub-optimal policies, and exploring inefficiently (Haarnoja et al., 2018b). To prevent the RL policy from converging to a sub-optimal policy because of overfitting, a maximum entropy stochastic RL method, SAC, is adopted.
3 Preliminaries
This section presents an overview of maximum entropy DRL and IBRL.
3.1 Maximum entropy deep reinforcement learning
For sample efficiency, off-policy DRL methods have been widely studied due to their ability to learn from past experiences. However, studies have also found that off-policy DRL methods struggle to maintain stability and convergence in high-dimensional continuous state–action spaces (Haarnoja et al., 2018b). To address this challenge, maximum entropy DRL has been proposed.
As the state–action spaces are continuous in the selected robotics tasks, we consider a Markov decision process (MDP) with continuous state and action spaces: an agent explores and interacts with an environment, and at each time step $t$ it observes a state $s_t \in \mathcal{S}$, selects an action $a_t \in \mathcal{A}$ according to its policy $\pi(a_t \mid s_t)$, receives a reward $r(s_t, a_t)$, and transitions to the next state $s_{t+1}$ according to the transition probability $p(s_{t+1} \mid s_t, a_t)$.
In contrast to standard RL, which maximizes only the expected return, maximum entropy DRL aims to maximize the discounted reward and expected policy entropy,

$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[\gamma^t\big(r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\big)\big],$

where $\gamma$ is the discount factor, $\rho_\pi$ denotes the state–action marginals induced by $\pi$, $\mathcal{H}(\pi(\cdot \mid s_t)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a \mid s_t)]$ is the policy entropy, and $\alpha$ is the temperature parameter that trades off reward maximization against exploration.
To apply maximum entropy RL in continuous spaces, one of the widely used methods, SAC (Haarnoja et al., 2018b), is applied.
3.2 Imitation bootstrapped reinforcement learning
IBRL is a sample-efficient DRL framework that combines a stand-alone IL policy with an off-policy DRL policy (Hu et al., 2024). First, IBRL requires an IL policy, trained on the prior demonstrations, that serves as a fixed reference throughout online training.
Then, IBRL leverages the trained IL policy in two phases. In the online interaction phase, both the IL policy and the RL policy propose an action for the current state, and the proposal with the higher Q-value under the current critic is executed.
Furthermore, to prevent the Q-value update from being trapped in a local optimum, the soft version of IBRL selects actions according to a Boltzmann distribution over the Q-values of the two proposals instead of considering only the proposal with the maximum Q-value.
Similarly, in the bootstrap proposal phase, the future rollout is carried out by selecting the next action among the proposals of the target actor and the IL policy with the same selection rule, and the selected action is used to compute the bootstrap target in the critic update.
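A brief sketch of the two selection rules described above (greedy and Boltzmann) is given below, assuming an unbatched observation and a critic that returns a scalar Q-value; all function and argument names are illustrative.

```python
# Sketch of IBRL's action selection between the IL and RL proposals:
# hard (argmax over Q) when beta is None, soft (Boltzmann over Q) otherwise.
# Names and interfaces are illustrative assumptions.
import torch

def select_action(obs, rl_actor, il_policy, critic, beta=None):
    with torch.no_grad():
        proposals = torch.stack([rl_actor(obs), il_policy(obs)])            # (2, act_dim)
        q_values = torch.stack([critic(obs, a) for a in proposals]).reshape(-1)  # (2,)

        if beta is None:
            idx = torch.argmax(q_values)                     # hard IBRL: greedy over Q
        else:
            probs = torch.softmax(q_values / beta, dim=0)    # soft IBRL: Boltzmann over Q
            idx = torch.multinomial(probs, num_samples=1).squeeze(0)
    return proposals[idx]
```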
4 Methods
To reduce the exploration time wasted in correcting unreliable overestimated Q-values and, in turn, improve exploration efficiency, it is crucial for the policy to favor distributions with more stable Q-values. This motivates selecting batches with reliable Q-value evaluations when updating both the critic and policy networks. Previous studies have shown that the Q-value estimates of OOD actions are unreliable and prone to overestimation, whereas actions that remain close to the distribution of the replay buffer and the demonstrations yield more stable estimates (Kumar et al., 2019; Fujimoto et al., 2019).
Compared with IBRL, the key modification is a simple action selection module (Equation 8) that provides a calibrated Q-value for comparing the proposals of the Ref and RL policies, so that the proposal backed by the more reliable Q-value estimate is preferred when selecting actions.
In the bootstrap proposal phase, the future rollouts are again proposed by both the Ref and RL policies, and the action selected by the module is used to compute the bootstrap target; the target value is therefore backed up from an action with a reliable Q-value estimate rather than from an overestimated OOD action.
Similarly, to align the policy strategy in the online interaction phase with the policy selected to propose future rollouts, the same action selection module (Equation 8) is used. With fewer OOD actions getting selected, the state distribution shift is also mitigated. However, if the RL policy can easily obtain higher Q-values than the Ref policy, the effect of the action selection module is limited and the behavior approaches that of standard IBRL.
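Because Equation 8 is not reproduced above, the sketch below only illustrates the general idea conveyed by the text, namely preferring the Ref proposal unless the RL proposal is clearly better under a conservative (calibrated) twin-critic Q-estimate; the minimum-over-critics calibration, the margin epsilon, and all names are assumptions rather than the authors' exact rule.

```python
# Hedged sketch only: one plausible instantiation of the action selection idea,
# not the exact Equation 8. All names and the margin `epsilon` are assumptions.
import torch

def conservative_q(critics, obs, act):
    # Conservative (calibrated) estimate: minimum over the twin critics.
    return torch.min(torch.stack([q(obs, act) for q in critics]), dim=0).values

def select_action_drlr(obs, rl_actor, ref_policy, critics, epsilon=0.0):
    with torch.no_grad():
        a_rl, a_ref = rl_actor(obs), ref_policy(obs)
        q_rl = conservative_q(critics, obs, a_rl)
        q_ref = conservative_q(critics, obs, a_ref)
        # Keep the Ref proposal unless the RL proposal exceeds it by a margin,
        # which biases exploration toward in-distribution actions early on.
        return a_rl if (q_rl > q_ref + epsilon).all() else a_ref
```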
Furthermore, to prevent the RL policy from overfitting the demonstration dataset and converging on a sub-optimal policy, we propose to replace TD3 with SAC. In SAC, the critic parameters $\phi_i$ ($i = 1, 2$) are updated by minimizing the soft Bellman residual

$J_Q(\phi_i) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\big[\big(Q_{\phi_i}(s,a) - y\big)^2\big], \qquad y = r + \gamma\big(\min_{j=1,2} Q_{\bar{\phi}_j}(s', a') - \alpha \log \pi_\theta(a' \mid s')\big),$

where $a' \sim \pi_\theta(\cdot \mid s')$, $Q_{\bar{\phi}_j}$ denote the target critics, $\mathcal{D}$ is the replay buffer, and $\alpha$ is the entropy temperature.
The stochastic actor parameters $\theta$ are updated by minimizing

$J_\pi(\theta) = \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\theta}\big[\alpha \log \pi_\theta(a \mid s) - \min_{j=1,2} Q_{\phi_j}(s, a)\big],$

where the stochastic action is obtained with the reparameterization trick, $a = \tanh\big(\mu_\theta(s) + \sigma_\theta(s) \odot \epsilon\big)$ with $\epsilon \sim \mathcal{N}(0, I)$.
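A minimal PyTorch-style sketch of these standard SAC losses follows; the actor's sample method (returning a reparameterized action and its log-probability), the twin-critic containers, and the temperature variable log_alpha are assumed interfaces, not the authors' implementation.

```python
# Minimal sketch of the SAC critic and actor losses written out above
# (standard SAC; Haarnoja et al., 2018b). Module and batch names are assumptions.
import torch
import torch.nn.functional as F

def sac_losses(batch, actor, critics, target_critics, log_alpha, gamma=0.99):
    obs, act, reward, next_obs, not_done = (
        batch["obs"], batch["act"], batch["reward"], batch["next_obs"], batch["not_done"]
    )
    alpha = log_alpha.exp().detach()

    # Critic loss: soft Bellman residual with target critics.
    with torch.no_grad():
        next_act, next_logp = actor.sample(next_obs)        # reparameterized sample + log-prob
        q_next = torch.min(target_critics[0](next_obs, next_act),
                           target_critics[1](next_obs, next_act))
        y = reward + gamma * not_done * (q_next - alpha * next_logp)
    critic_loss = sum(F.mse_loss(q(obs, act), y) for q in critics)

    # Actor loss: maximize the entropy-regularized Q-value.
    new_act, logp = actor.sample(obs)
    q_new = torch.min(critics[0](obs, new_act), critics[1](obs, new_act))
    actor_loss = (alpha * logp - q_new).mean()
    return critic_loss, actor_loss
```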
Finally, to improve the robustness of the proposed framework toward demonstration quality, we propose to use offline DRL as the reference policy (the IL policy in the IBRL framework) when the quality of the demonstrations is unknown or imperfect. With its sequential decision-making ability, offline DRL can be more robust to demonstration quality than IL methods (Kumar et al., 2020; Fujimoto and Gu, 2021).
Combining all the modifications, DRLR is introduced in Algorithm 1; our new modifications are marked in red.
5 Experiment design and evaluation
In this section, experiments are designed and conducted in the simulation to evaluate the proposed method. The experimental design and evaluation aim to answer the following core questions.
5.1 How generalizable is DRLR across environments with varying reward densities and state–action space complexities?
To answer the question, the tasks selected in the problem statement are studied under both dense reward and sparse reward settings. For the bucket loading task, the state and action dimensions are 4 and 3, respectively. The details, such as reward design, domain randomization, and prior demonstration collection, are provided in Section 6. For the open drawer task, the state and action dimensions are 23 and 9, respectively. The details of the open drawer task are provided in Makoviychuk et al. (2021a). The original reward design for the open drawer task is dense and contains distance reward, open drawer reward, and some bonus reward for opening the drawer properly. To study the same task with a sparse reward setting, we simply set the distance reward gain to 0. To collect simulated demonstrations for the open drawer task, a TD3 policy was trained with dense, human-designed rewards. A total of 30 prior trajectories are recorded by evaluating the trained TD3 with random noise added to the policy output.
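As an illustration of this data-collection procedure, the sketch below rolls out a trained policy with Gaussian noise added to its output and records the trajectories; the gym-style environment interface, the action_dim attribute, and the noise scale are assumptions, not the exact collection script.

```python
# Sketch of collecting simulated demonstrations by evaluating a trained policy
# with random noise added to its output (gym-style env assumed).
import numpy as np

def collect_demos(env, policy, n_traj=30, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    demos = []
    for _ in range(n_traj):
        obs, traj, done = env.reset(), [], False
        while not done:
            act = policy(obs) + rng.normal(0.0, noise_std, size=env.action_dim)
            act = np.clip(act, -1.0, 1.0)              # keep actions in the valid range
            next_obs, reward, done, _ = env.step(act)
            traj.append((obs, act, reward, next_obs, done))
            obs = next_obs
        demos.append(traj)
    return demos
```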
Both tasks are trained with Isaac Gym (Makoviychuk et al., 2021b). All experiments with the open drawer task were run with 10 parallel environments, using two different random seeds (10 and 11) to ensure robustness and reproducibility. All experiments with the bucket loading task were run in a single environment, using two different random seeds (10 and 11). The detailed configurations for training each task are shown in Section 8.1.
This question is answered through the following evaluation results: Figure 5 demonstrates the performance of DRLR in learning the open drawer task with both sparse and dense rewards; by achieving the highest reward in both reward settings, the results validate the robustness of DRLR toward varying reward densities. Figures 5, 6 present the performance of DRLR under different state–action space complexities. By outperforming IBRL on the open drawer task and achieving a comparable reward in the bucket loading task, the results validate the ability of DRLR to generalize across varying levels of state–action space complexity.
5.2 How effective is the proposed action selection module in addressing the bootstrapping error and improving exploration efficiency during learning compared to IBRL?
To examine the effectiveness of the action selection module in addressing bootstrapping error and improving exploration efficiency, we conducted experiments in which only the action selection module of the original IBRL framework was replaced. The reference policy used is the IL policy, while the RL policy remains TD3 in both setups. Four criteria are recorded during training: 1) the Q-value of the Ref policy during action selection in the online interaction phase; 2) the Q-value of the RL policy during action selection in the online interaction phase; 3) the BC loss, that is, the mean squared error between the actions sampled from the replay buffer and the actions output by the current policy for the same states; and 4) the mean reward.
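The BC-loss diagnostic (criterion 3) can be computed as in the short sketch below, which measures the mean squared error between replay-buffer actions and the current policy's output for the same states; the batch layout is an assumption.

```python
# Sketch of the BC-loss metric tracked during training (names are assumptions).
import torch
import torch.nn.functional as F

def bc_loss_metric(policy, batch):
    with torch.no_grad():
        predicted = policy(batch["obs"])                 # current policy actions
    return F.mse_loss(predicted, batch["act"]).item()    # distance to behavior actions
```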
Figure 3. Exp2: Validation of the effectiveness of the proposed new action selection method using the open drawer task. (a) Q-value estimation with original IBRL. (b) Q-value estimation with our proposed action selection module. (c) BC loss. (d) Mean reward.
Figure 4. Exp3: Validation of the effectiveness of the proposed new action selection method using the bucket loading task. (a) Q-value estimation with original IBRL. (b) Q-value estimation with proposed new IBRL. (c) BC loss. (d) Mean reward.
The results for the open drawer task are shown in Figure 3. In Figure 3a, we compare the Q-values of the Ref policy and the RL policy during action selection in the online interaction phase of IBRL. The Q-values of the Ref policy are estimated close to those of the RL policy, and both Q-values have high variances during training. Combined with the BC loss between sampled actions and the agent's output actions in Figure 3c, this indicates a mismatch between the updated policy and the behavior policy, suggesting that OOD actions are being selected due to the bootstrapping error discussed in Section 2. As a result, the Ref policy fails to get selected to provide reliable guidance, as reflected in the degraded performance in Figure 3d. In contrast, Figure 3b presents a stable Q-value estimation of the Ref policy and a clearly higher mean value than that of the RL policy in the early training steps, which aligns with the core idea of the IBRL framework. The corresponding BC loss in Figure 3c is also significantly reduced, indicating that fewer OOD actions are selected during online interaction.
The results for the bucket loading task are shown in Figure 4. Notably, the experiments of the bucket loading task were run in a single environment since it is computationally expensive to simulate thousands of particles in parallel environments. Thus, the results of the bucket loading tasks have higher variance than those of the open drawer task, where 10 environments are running in parallel. The results suggest that the action selection module has less effect on the low-dimensional state–action task, and the original IBRL can already score a near-optimal reward. This can also be attributed to the performance of the Ref policy. If the RL policy can easily acquire a higher Q-value than the Ref policy, the effect of our action selection module will be limited. Nevertheless, the stable Q-value estimation of the Ref policy in Figure 4b still validates the effectiveness of our action selection module in maintaining reliable Q-value estimations.
5.3 How effective is SAC in improving exploration efficiency during learning compared to the initial IBRL?
To examine the effectiveness of SAC in improving exploration efficiency, we conducted experiments with 1) the original IBRL, which uses TD3 as the RL policy, and 2) the proposed DRLR, which uses SAC as the RL policy together with the proposed action selection module, on both selected tasks.
The reward convergence over training steps is recorded as the main evaluation criterion. Figures 5, 6 present a comparison of the considered experiments across the two selected tasks. The results for the open drawer task with varied reward settings are shown in Figure 5. The reward convergence suggests that, with the same training steps, the experiments using SAC as the RL policy converge to higher rewards than the original IBRL under both the dense and sparse reward settings.
Figure 5. Exp4: Validation of the effectiveness of SAC using the open drawer task. (a) Dense reward setting. (b) Sparse reward setting.
The final evaluation results of each algorithm across the two tasks are shown in Table 1. Table 1 shows that DRLR achieves the best evaluation performance in both tasks. In the open drawer task with sparse rewards, DRLR clearly improves the averaged reward over IBRL.
5.4 What is the impact of demonstration quality on the performance of our method?
To evaluate the robustness of the proposed method toward varying demonstration qualities, the following experiments were conducted: the demonstration dataset is filled with varying fractions of data generated by a random policy (e.g., 50% random data), and both an IL policy and an offline RL policy are evaluated as the reference policy; the results are shown in Figure 7.
Figure 7. Exp6: Validation of the robustness of our framework toward varying demonstration qualities using the open drawer task. (a) Comparison between the IL policy and offline RL policy. (b) Reward convergence with the varying demonstration qualities.
To this end, we have demonstrated the effectiveness of the proposed method. The method is also applied to a real industrial application to showcase the implementation process and sim-to-real performance.
6 Real industrial applications
This section presents an application of the proposed framework to the wheel loader loading task, where only a limited number of expert demonstrations is used, to demonstrate the data efficiency of the framework. The detailed implementation is illustrated in Figure 8.
Figure 8. Illustration of the implementation of applying the proposed framework to the automatic wheel loader loading task.
6.1 Bucket–media simulation
Before learning with the proposed framework, it is important to create an environment similar to the real world to enable policy exploration, while applying domain randomization to deal with observation shifts. In the simulation, the wheel loader is configured with the same dynamic parameters obtained from a real machine. Because it is impractical to directly model the hydraulic actuation force or the bucket–media interaction force under different materials and geometries, this paper attempts to regularize the external torque rather than modeling it. We propose to use admittance controllers to decrease the variance in the external torque by changing the position reference. The implementation of the admittance controller is provided in the Supplementary Appendices.
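Since the exact controller is given in the Supplementary Appendices, the sketch below only illustrates a simple one-degree-of-freedom admittance law in the same spirit: the position reference is shifted in proportion to the integrated torque error, which limits the variance of the external torque; the first-order form, gains, and limits are assumptions.

```python
# Hedged sketch of a simple 1-DOF admittance law (not the controller in the
# Supplementary Appendices): shift the joint position reference when the
# estimated external torque deviates from a reference torque.
class AdmittanceController:
    def __init__(self, k_adm=0.001, offset_limit=0.2):
        self.k_adm = k_adm                # admittance gain [rad/(N*m*s)], assumed
        self.offset_limit = offset_limit  # max allowed reference shift [rad], assumed
        self.offset = 0.0

    def update(self, q_ref, tau_ext, tau_ref, dt):
        # Integrate the torque error into a position-reference offset.
        self.offset += self.k_adm * (tau_ref - tau_ext) * dt
        self.offset = max(-self.offset_limit, min(self.offset_limit, self.offset))
        return q_ref + self.offset        # compliant position command
```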
Table 2 shows the parameters we randomized to simulate bucket–media interactions with different pile geometries and pile materials. A comparison of the estimated external torque during penetration of the pile between simulation and real-world experiments is presented in Figure 9. Different from real-world settings, the external torque is estimated from contact sensors in the simulation, due to the poor performance of the force sensor in Isaac Gym.
Figure 9. Comparison of the estimated external torque during penetration between simulation and real-world experiments. In the real-world experiment (orange), the external torque is measured while loading dry sand. In the simulation experiment, the external torque (green and blue) is generated by loading sand and stone piles, using the same penetration motion as in the real-world experiment.
6.2 DRLR implementation
Both the Ref and RL policies have four state inputs and three action outputs, matching the state and action dimensions of the bucket loading task given in Section 5.1.
To train the Ref policy, 10 expert demonstrations of loading dry sand piles with changing pile geometries are recorded. During the demonstrations, the state–action pairs shown in Figure 10 are recorded for each bucket loading cycle.
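Assuming the Ref policy is fitted by simple behavior-cloning regression on the recorded state–action pairs (a common choice for an IL reference policy; the network architecture, optimizer settings, and normalized action range below are assumptions), the training loop can be sketched as follows.

```python
# Sketch of behavior-cloning regression on the recorded demonstrations
# (Figure 10). Architecture and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

def train_ref_policy(states, actions, epochs=200, lr=1e-3):
    """states: (N, 4) tensor, actions: (N, 3) tensor, actions assumed in [-1, 1]."""
    policy = nn.Sequential(nn.Linear(states.shape[1], 256), nn.ReLU(),
                           nn.Linear(256, 256), nn.ReLU(),
                           nn.Linear(256, actions.shape[1]), nn.Tanh())
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(policy(states), actions)  # regression to expert actions
        loss.backward()
        opt.step()
    return policy
```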
Figure 10. States–action pairs for training the Ref policy. Each curve represents the data recorded in one bucket loading demonstration.
The wheel loader loading process can be divided into three phases, as shown in Figure 11: penetrate, shovel, and lift (Sarata et al., 2004).
To train the DRL policy, the bucket loading task is divided into two sub-tasks (Equation 12), aligned with the loading phases shown in Figure 11. In phase 1, the bucket penetrates the pile; the material is then shoveled and lifted in the subsequent phases.
The goal of the bucket loading task is to achieve a full bucket-fill rate with the boom–bucket joint reaching its designated end position, corresponding to the maximum allowable value within the position reference range. This leads to a natural sparse reward setting, where the reward only occurs at the end of the task. However, sparse rewards require longer training because they are more difficult for the RL agent to explore than dense reward settings. Although Shen and Sloth (2024a) demonstrated successful performance with dense rewards, designing such rewards is challenging and may lead to sub-optimal actions. Since our framework has shown robust performance in sparse reward settings, a simpler sparse reward is designed (Equation 13), in which a positive reward is given only at the end of an episode, when the bucket-fill rate is full and the boom–bucket joint has reached its designated end position, and zero reward is given otherwise.
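A hedged sketch of a terminal sparse reward consistent with the stated goal is given below; the exact form of Equation 13, the thresholds, and the scaling are not reproduced in the text and are therefore assumptions.

```python
# Hedged sketch of a terminal sparse reward for bucket loading (not the exact
# Equation 13): reward only at episode end, when the bucket is full and the
# boom-bucket joint is at its designated end position. Thresholds are assumed.
def sparse_bucket_reward(fill_rate, joint_pos, joint_end_pos,
                         fill_threshold=0.95, pos_tolerance=0.02, done=False):
    if not done:
        return 0.0                                  # no shaping during the episode
    task_success = (fill_rate >= fill_threshold and
                    abs(joint_pos - joint_end_pos) <= pos_tolerance)
    return 1.0 if task_success else 0.0             # terminal reward only
```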
6.3 Sim-to-real results
The reward convergence results for learning the bucket loading task are shown in Figure 6. The trained actor is deployed on a real MUSTANG 2040 wheel loader operating in wet sand and stone pile fields. The experiment site is shown in Figure 12.
In the experiments, the policy inputs are obtained from the machine's onboard sensing system.
First, a two-sided admittance controller with both position and torque references is tested. However, due to the highly compacted nature of the wet sand and stone piles, the downward curl of the bucket generates extremely large normal forces, causing the admittance controller to fail to track the torque reference.
For safety and stable performance, only a one-sided admittance controller with a position reference is used in the following experiments.
To evaluate the policy, 25 experiments were carried out, involving 10 trials for loading wet sand and 15 trials for loading stone. Sim-to-real results for loading stones are presented in Figure 13. Despite changing environments, including pile geometries, material types, and forwarding velocities, all the experiments successfully loaded and lifted the materials. The average bucket-fill rates for loading sand and stone in the simulation and real-world experiments are provided in Table 3. To compare the sim-to-real performance in terms of the bucket-fill rate, the bucket-fill rates in simulation are also recorded and averaged over five episodes. The bucket-fill rate differences between simulation and real-world experiments may stem from environmental uncertainties present under real-world conditions, such as the irregular pile shapes.
Figure 13. Sim-to-real results of 15 trials for loading stone with different pile geometries. Each curve represents the data recorded in one bucket-loading sim-to-real experiment.
7 Conclusion
This paper proposes and implements an exploration-efficient DRLR framework to reduce the need for extensive interaction when applying off-policy DRL to real-world robotic tasks. The designed experiments empirically validate the effectiveness of our framework in mitigating bootstrapping errors and addressing convergence to sub-optimal policies, ultimately reducing the exploration required to attain high-performing policies compared to IBRL. Furthermore, we demonstrated the implementation details for using the DRLR framework on a real industrial robotics task, wheel loader bucket loading. The sim-to-real results validate the successful deployment of the considered framework, demonstrating its potential for application to complex robotic tasks.
In future work, one could improve the action selection module by incorporating uncertainty estimates of the Q-values into the selection rule.
Moreover, one could also consider using deep ensembles to quantify the uncertainties in the demonstrations and utilize these uncertainties as prior data for the SAC entropy. Integrating the concepts of active learning and uncertainty-aware RL into the proposed framework could further improve the exploration efficiency.
Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Author contributions
CeS: Writing – original draft, Software, Resources, Visualization, Validation, Formal Analysis, Methodology, Writing – review and editing, Data curation, Conceptualization. CrS: Writing – review and editing, Writing – original draft.
Funding
The author(s) declared that financial support was received for this work and/or its publication. This work was supported and funded by Unicontrol ApS, and Innovation Fund Denmark, grant number 1044-5800117B. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article, or the decision to submit it for publication.
Acknowledgements
The authors would like to thank Unicontrol ApS for granting permission to use their wheel loader and Unicontrol’s 3D machine control system to collect demonstration data and conduct the sim-to-real experiments.
Conflict of interest
Author CeS was employed by Unicontrol ApS.
The remaining author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frobt.2025.1682200/full#supplementary-material
References
Allshire, A., Mittal, M., Lodaya, V., Makoviychuk, V., Makoviichuk, D., Widmaier, F., et al. (2022). “Transferring dexterous manipulation from gpu simulation to a remote real-world trifinger,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE), 11802–11809.
Ball, P. J., Smith, L., Kostrikov, I., and Levine, S. (2023). “Efficient online reinforcement learning with offline data,” in International Conference on Machine Learning (PMLR), 1577–1594.
Fujimoto, S., and Gu, S. S. (2021). A minimalist approach to offline reinforcement learning. Adv. Neural Information Processing Systems 34, 20132–20145. doi:10.48550/arXiv.2106.06860
Fujimoto, S., Hoof, H., and Meger, D. (2018). “Addressing function approximation error in actor-critic methods,” in International conference on machine learning (PMLR), 1587–1596.
Fujimoto, S., Meger, D., and Precup, D. (2019). “Off-policy deep reinforcement learning without exploration,” in International conference on machine learning (PMLR), 2052–2062.
Gao, Y., Xu, H., Lin, J., Yu, F., Levine, S., and Darrell, T. (2018). Reinforcement learning from imperfect demonstrations. arXiv preprint arXiv:1802.05313
Goecks, V. G., Gremillion, G. M., Lawhern, V. J., Valasek, J., and Waytowich, N. R. (2019). Integrating behavior cloning and reinforcement learning for improved performance in dense and sparse reward environments. arXiv preprint arXiv:1910.04281
Haarnoja, T., Ha, S., Zhou, A., Tan, J., Tucker, G., and Levine, S. (2018a). Learning to walk via deep reinforcement learning. arXiv preprint arXiv:1812.11103
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018b). “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International conference on machine learning (Pmlr), 1861–1870.
Hiraoka, T., Imagawa, T., Hashimoto, T., Onishi, T., and Tsuruoka, Y. (2021). Dropout q-functions for doubly efficient reinforcement learning. arXiv preprint arXiv:2110.02034
Hu, H., Mirchandani, S., and Sadigh, D. (2024). “Imitation bootstrapped reinforcement learning,” in Robotics: science and systems (RSS).
Hussein, A., Gaber, M. M., Elyan, E., and Jayne, C. (2017). Imitation learning: a survey of learning methods. ACM Comput. Surv. (CSUR) 50, 1–35. doi:10.1145/3054912
Ibarz, J., Tan, J., Finn, C., Kalakrishnan, M., Pastor, P., and Levine, S. (2021). How to train your robot with deep reinforcement learning: lessons we have learned. Int. J. Robotics Res. 40, 698–721. doi:10.1177/0278364920987859
Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. (2019). Stabilizing off-policy q-learning via bootstrapping error reduction. Adv. Neural Information Processing Systems 32. doi:10.48550/arXiv.1906.00949
Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020). Conservative q-learning for offline reinforcement learning. Adv. Neural Information Processing Systems 33, 1179–1191. doi:10.48550/arXiv.2006.04779
Lee, S., Seo, Y., Lee, K., Abbeel, P., and Shin, J. (2022). “Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble,” in Conference on Robot Learning (PMLR), 1702–1712.
Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., et al. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. doi:10.48550/arXiv.1509.02971
Makoviychuk, V., Wawrzyniak, L., Guo, Y., Lu, M., Storey, K., Macklin, M., et al. (2021a). isaac-sim/IsaacGymEnvs. Available online at: https://github.com/isaac-sim/IsaacGymEnvs/blob/main/isaacgymenvs/tasks/franka_cabinet.py.
Makoviychuk, V., Wawrzyniak, L., Guo, Y., Lu, M., Storey, K., Macklin, M., et al. (2021b). Isaac gym: high performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., et al. (2016). “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning (PmLR), 1928–1937.
Nair, A., Gupta, A., Dalal, M., and Levine, S. (2020). Awac: accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359
Nakamoto, M., Zhai, S., Singh, A., Sobol Mark, M., Ma, Y., Finn, C., et al. (2023). Cal-ql: calibrated offline rl pre-training for efficient online fine-tuning. Adv. Neural Inf. Process. Syst. 36, 62244–62269. doi:10.48550/arXiv.2303.05479
Nguyen, H., and La, H. (2019). “Review of deep reinforcement learning for robot manipulation,” in 2019 Third IEEE international conference on robotic computing (IRC) (IEEE), 590–595.
Osinski, B., Finn, C., Erhan, D., Tucker, G., Michalewski, H., Czechowski, K., et al. (2020). Model-based reinforcement learning for atari. ICLR 1, 2.
Qi, H., Kumar, A., Calandra, R., Ma, Y., and Malik, J. (2023). “In-hand object rotation via rapid motor adaptation,” in Conference on Robot Learning (PMLR), 1722–1732.
Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., and Dormann, N. (2021). Stable-baselines3: reliable reinforcement learning implementations. J. Machine Learning Research 22, 1–8.
Rice, J. A. (2007). Mathematical statistics and data analysis, 371. Belmont, CA: Thomson/Brooks/Cole.
Rudin, N., Hoeller, D., Reist, P., and Hutter, M. (2022). “Learning to walk in minutes using massively parallel deep reinforcement learning,” in Conference on robot learning (PMLR), 91–100.
Sarata, S., Osumi, H., Kawai, Y., and Tomita, F. (2004). “Trajectory arrangement based on resistance force and shape of pile at scooping motion,” in 2004 international conference on robotics and automation (ICRA) (New York: IEEE), 4, 3488–3493. doi:10.1109/robot.2004.1308793
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347
Shen, C., and Sloth, C. (2024a). “Generalized framework for wheel loader automatic shoveling task with expert initialized reinforcement learning,” in IEEE/SICE international symposium on system integration (SII), 382–389.
Song, Y., Zhou, Y., Sekhari, A., Bagnell, J. A., Krishnamurthy, A., and Sun, W. (2022). Hybrid rl: using both offline and online data can make rl efficient. arXiv Preprint arXiv:2210.06718.
Uchendu, I., Xiao, T., Lu, Y., Zhu, B., Yan, M., Simon, J., et al. (2023). “Jump-start reinforcement learning,” in International Conference on Machine Learning (PMLR), 34556–34583.
Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., et al. (2017). Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817
Zhang, H., Xu, W., and Yu, H. (2023). Policy expansion for bridging offline-to-online reinforcement learning. arXiv preprint arXiv:2302.00935
Keywords: deep reinforcement learning, learning from demonstration, automation in construction, robotics, sim-to-real
Citation: Shen C and Sloth C (2026) Solving robotics tasks with prior demonstration via exploration-efficient deep reinforcement learning. Front. Robot. AI 12:1682200. doi: 10.3389/frobt.2025.1682200
Received: 08 August 2025; Accepted: 08 December 2025;
Published: 12 January 2026.
Edited by:
Hongwei Mo, Harbin Engineering University, China
Reviewed by:
Shuo Ding, Nanjing University of Aeronautics & Astronautics, China
Ravishankar Prakash Desai, Amrita Vishwa Vidyapeetham, Amaravati Campus, India
Copyright © 2026 Shen and Sloth. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Chengyandan Shen, cshen@mmmi.sdu.dk