- China University of Petroleum (East China), Qingdao, China
The path planning capability of autonomous robots in complex environments is crucial for their widespread application in the real world. However, long-term decision-making and sparse reward signals pose significant challenges to traditional reinforcement learning (RL) algorithms. Offline hierarchical reinforcement learning offers an effective approach by decomposing tasks into two levels: high-level subgoal generation and low-level subgoal attainment. Advanced offline HRL methods, such as Guider and HIQL, typically introduce latent spaces in high-level policies to represent subgoals, thereby handling high-dimensional states and enhancing generalization. However, these approaches require the high-level policy to search for and generate subgoals within a continuous latent space. This remains a complex and sample-inefficient challenge for policy optimization algorithms, particularly policy-gradient methods such as PPO, often leading to unstable training and slow convergence. To address this core limitation, this paper proposes a novel offline hierarchical PPO framework, LG-H-PPO (Latent Graph-based Hierarchical PPO). The core innovation of LG-H-PPO lies in discretizing the continuous latent space into a structured “latent graph.” By transforming high-level planning from challenging “continuous creation” to simple “discrete selection,” LG-H-PPO substantially reduces the learning difficulty for the high-level policy. Preliminary experiments on standard D4RL offline navigation benchmarks demonstrate that LG-H-PPO achieves significant advantages over advanced baselines such as Guider and HIQL in both convergence speed and final task success rates. The main contribution of this paper is introducing graph structures into latent variable HRL planning. This effectively simplifies the action space for high-level policies, enhancing the training efficiency and stability of offline HRL algorithms for long-horizon navigation tasks. It lays the foundation for future offline HRL research combining latent variable representations with explicit graph planning.
1 Introduction
With the rapid advancement of robotics, endowing robots with the ability to autonomously navigate in unknown or complex environments has become one of the core challenges in the fields of artificial intelligence and robotics (Martinez-Baselga et al., 2023). Whether for household service robots, warehouse logistics AGVs, or planetary rovers, efficient and safe path planning forms the foundation for accomplishing their tasks. However, real-world navigation tasks often involve long-horizon decision making, in which robots must execute a long sequence of actions to reach their destination, while simultaneously facing sparse rewards, where clear positive feedback signals are obtained only when the robot ultimately reaches the goal or completes specific subtasks. These two characteristics pose significant challenges to traditional supervised learning and model-based planning methods. Reinforcement learning (RL), particularly deep reinforcement learning (DRL), is considered a powerful tool for addressing such problems due to its ability to learn optimal strategies through trial and error (Barto, 2021).
Standard online RL algorithms, such as Proximal Policy Optimization (PPO) (Schulman et al., 2017), have achieved success in many domains. However, their “learn-while-exploring” paradigm requires extensive interactions with the environment to gather sufficient effective experience. This is often costly, time-consuming, and even hazardous in real robotic systems (Chen et al., 2025).
To overcome the limitations of online RL, offline reinforcement learning (Offline RL) (Levine et al., 2020) emerged. Offline RL aims to learn policies using only pre-collected, fixed datasets, completely avoiding online interactions with the environment. This enables the utilization of large-scale, diverse historical data. However, Offline RL faces its own unique challenge: the out-of-distribution (OOD) action problem. Learned policies may select actions not present in the dataset, and their value estimates are often inaccurate, leading to a sharp decline in performance (Kumar et al., 2020). For long-horizon, sparse-reward problems, hierarchical reinforcement learning (HRL) offers an effective solution (Kulkarni et al., 2016). HRL decomposes complex tasks into multiple hierarchical sub-tasks. In a typical two-layer architecture, the high-level policy formulates a sequence of subgoals, while the low-level policy executes primitive actions to achieve the current subgoal. This decomposition not only reduces the temporal scale a single policy must handle but also facilitates credit assignment.
In recent years, Offline HRL—the integration of Offline RL and HRL—has emerged as a research hotspot, regarded as a promising direction for tackling complex robotic tasks. Guider (Shin and Kim, 2023) and HIQL (Park et al., 2023) share the common contribution of successfully leveraging latent spaces to handle high-dimensional states (e.g., images) and promote subgoal generalization, while improving sample efficiency through offline learning frameworks. However, they also share a core limitation: their high-level policies must search for and generate subgoals within a continuous latent space, which remains a difficult and sample-inefficient target for policy optimization and often results in unstable training and slow convergence. Compared with pure graph-search methods, learning a high-level policy with PPO over an explicit graph also brings two benefits:
Robustness: The value function learned by PPO allows the agent to identify and avoid edges that appear feasible in the graph structure but are unreliable for actual traversal, offering greater robustness against imperfect graph construction.
Generalization: A learned policy can better handle states that do not perfectly align with graph nodes, enabling smoother control through probabilistic selection, which is difficult for rigid graph search methods to achieve.
To this end, we propose the LG-H-PPO (Latent Graph-based Hierarchical PPO) framework. Our core idea is to transform the challenging continuous latent space used by Guider (Shin and Kim, 2023) and HIQL (Park et al., 2023) into a discrete, easily manageable latent variable graph, and then let the high-level PPO plan on this graph. Our preliminary experiments on D4RL benchmarks such as Antmaze validate this design: by simplifying the high-level policy’s action space from a continuous latent space to node selection on a discrete latent graph, LG-H-PPO demonstrates significant improvements over Guider (Shin and Kim, 2023) and HIQL (Park et al., 2023) in both convergence speed and final success rate.

The main contribution of this paper is the introduction of a new paradigm for offline HRL that combines latent variable representations with explicit graph structures. Theoretically, discretizing the continuous latent space significantly mitigates the complexity of the credit assignment problem in hierarchical policy gradients. In continuous latent space methods (like Guider), the high-level policy must learn a mapping from states to exact latent vectors, where slight deviations in the output can lead to vastly different low-level traversals, causing high variance in gradient estimation. By restricting the high-level policy’s output to a finite set of graph nodes (transforming ‘creation’ into ‘selection’), LG-H-PPO drastically reduces the variance of the policy gradient. This stabilization of the training process allows for more accurate value estimation and significantly improves sample efficiency. By discretizing the high-level action space, we effectively resolve the planning challenges faced by existing methods. This work lays the foundation for future exploration of more efficient and robust graph- and latent variable-based offline HRL algorithms.
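As a schematic illustration of this variance argument, consider the score-function term of the high-level policy gradient under the two parameterizations, assuming a Gaussian policy with fixed variance in the continuous case and a softmax-categorical policy over the K graph nodes in the discrete case. For a Gaussian policy over a continuous latent subgoal $z$ with mean $\mu_\theta(s)$ and variance $\sigma^2$,

$$\nabla_\theta \log \pi_\theta(z \mid s) = \frac{z - \mu_\theta(s)}{\sigma^{2}}\,\nabla_\theta \mu_\theta(s),$$

which grows without bound as $\sigma \to 0$ or as the sampled $z$ deviates from the mean. For a categorical policy over $K$ graph nodes with softmax logits $f_\theta(s) \in \mathbb{R}^{K}$,

$$\nabla_\theta \log \pi_\theta(k \mid s) = \nabla_\theta f_{\theta,k}(s) - \sum_{j=1}^{K} \pi_\theta(j \mid s)\,\nabla_\theta f_{\theta,j}(s),$$

a bounded combination of logit gradients, which is consistent with the lower gradient variance obtained with the discrete high-level action space.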
2 The proposed method
2.1 LG-H-PPO algorithm
In this section, we present our proposed LG-H-PPO (Latent Graph-based Hierarchical PPO) framework in detail. The core objective of this framework is to significantly reduce the complexity of long-term planning for high-level policies in offline hierarchical reinforcement learning (Offline HRL), particularly within PPO-based frameworks, by introducing a latent variable graph structure. The overall architecture of LG-H-PPO is illustrated in Figure 1; it comprises three closely integrated stages: latent variable encoder pre-training, latent variable graph construction, and graph-based hierarchical PPO training.
LG-H-PPO follows the fundamental paradigm of HRL by decomposing complex navigation tasks into two levels: high-level (subgoal selection) and low-level (subgoal attainment). The key innovation of LG-H-PPO lies in constructing a discrete latent variable graph from the offline dataset and restricting the high-level policy’s action space to selecting nodes on this graph as subgoals.
Figure 2. (a) The raw Antmaze environment with latent states extracted from the offline dataset. (b) The constructed discrete latent graph, whose nodes are cluster centroids in the latent space and whose edges encode reachability observed in the offline trajectories.
2.2 Training process
The first stage pretrains a latent variable encoder. The objective is to learn a high-quality, low-dimensional state representation that maps each raw, high-dimensional state into a compact latent space; this representation serves as the common foundation for graph construction and for high-level planning.
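To make this stage concrete, the following minimal sketch illustrates one way such an encoder could be pretrained, assuming a β-VAE-style reconstruction objective on states from the offline dataset; the network sizes, latent dimension, loss weight, and randomly generated placeholder data are illustrative assumptions rather than the configuration used in our experiments.

```python
# Stage 1 (sketch): pretrain a latent state encoder on offline states.
# Assumptions (not from the paper): a beta-VAE objective, 29-dim Antmaze
# states, an 8-dim latent space, and random placeholder data standing in
# for the D4RL offline dataset.
import torch
import torch.nn as nn

STATE_DIM, LATENT_DIM, BETA = 29, 8, 0.1

class StateVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(),
                                     nn.Linear(256, 2 * LATENT_DIM))
        self.decoder = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(),
                                     nn.Linear(256, STATE_DIM))

    def forward(self, s):
        mu, log_var = self.encoder(s).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterization trick
        return self.decoder(z), mu, log_var

def vae_loss(recon, s, mu, log_var):
    recon_loss = ((recon - s) ** 2).sum(dim=-1).mean()        # state reconstruction error
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(dim=-1).mean()
    return recon_loss + BETA * kl

if __name__ == "__main__":
    states = torch.randn(10_000, STATE_DIM)   # placeholder for D4RL offline states
    vae = StateVAE()
    opt = torch.optim.Adam(vae.parameters(), lr=3e-4)
    for epoch in range(5):
        for batch in states.split(256):
            recon, mu, log_var = vae(batch)
            loss = vae_loss(recon, batch, mu, log_var)
            opt.zero_grad(); loss.backward(); opt.step()
    # After training, the encoder mean mu serves as the low-dimensional representation of s.
```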
The second stage contains the core innovation: we construct a latent variable graph that discretizes the high-dimensional, continuous latent space into a finite set of representative nodes.
The K cluster centroids obtained by clustering the encoded latent states serve as the nodes of the graph, each standing for a representative region of the latent space.
To reflect dynamic reachability between states, we utilize trajectory information from the dataset to construct the graph’s edges: an edge is added between two nodes whenever the offline trajectories contain transitions between the corresponding regions of the latent space.
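A minimal sketch of this graph-construction stage is shown below. It assumes that latent states have already been produced by the pretrained encoder and, for the edge rule, that two nodes are connected whenever consecutive states of the same trajectory fall into their respective clusters; the cluster count and the synthetic data are illustrative placeholders rather than the exact settings of the paper.

```python
# Stage 2 (sketch): build the discrete latent graph from encoded offline data.
# Assumptions (not from the paper): the edge rule below, the node count K,
# and synthetic latent states standing in for encoder outputs on D4RL data.
import numpy as np
from sklearn.cluster import KMeans

K = 50            # number of graph nodes (hyperparameter)
LATENT_DIM = 8

def build_latent_graph(latents, traj_ids, k=K):
    """Cluster latent states into k nodes and connect temporally adjacent clusters."""
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(latents)                 # node assignment for every state
    adjacency = np.zeros((k, k), dtype=bool)
    for t in range(len(latents) - 1):
        if traj_ids[t] != traj_ids[t + 1]:
            continue                                 # never link across different trajectories
        a, b = labels[t], labels[t + 1]
        if a != b:
            adjacency[a, b] = True                   # directed reachability edge a -> b
    return km.cluster_centers_, adjacency

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.normal(size=(20_000, LATENT_DIM))  # placeholder encoded states
    traj_ids = np.repeat(np.arange(20), 1_000)       # 20 trajectories of 1,000 steps
    nodes, adjacency = build_latent_graph(latents, traj_ids)
    print(nodes.shape, int(adjacency.sum()), "edges")
```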
The third stage trains the hierarchical PPO policy on the constructed latent graph: the high-level policy selects a graph node as the current subgoal, while the low-level policy outputs primitive actions that drive the agent toward the selected subgoal.
Specifically, to stabilize the high-level policy updates and prevent catastrophic policy collapse, we employ the standard clipped surrogate objective of PPO. The optimization objective is

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the probability ratio between the new and old policies (for the high-level policy, the action $a_t$ is the selected graph node), $\hat{A}_t$ is the advantage estimated with generalized advantage estimation (Schulman et al., 2015), and $\epsilon$ is the clipping coefficient.
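To make the high-level update concrete, the sketch below implements this clipped surrogate loss for a categorical high-level policy that selects one of the K graph nodes; the network sizes, clipping coefficient, and placeholder batch are illustrative assumptions rather than the implementation used in our experiments, and the advantages would normally come from GAE rather than random values.

```python
# Stage 3 (sketch): clipped PPO update for a categorical high-level policy
# that selects one of K graph nodes as the next subgoal.
# Assumptions (not from the paper): network sizes, epsilon, and the random
# placeholder batch; advantages are stand-ins for GAE estimates.
import torch
import torch.nn as nn
from torch.distributions import Categorical

LATENT_DIM, K, EPS_CLIP = 8, 50, 0.2

class HighLevelPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, K))          # logits over graph nodes

    def forward(self, z):
        return Categorical(logits=self.net(z))

def clipped_surrogate(policy, z, node_idx, old_log_prob, advantage):
    """Standard PPO clipped objective, negated so it can be minimized."""
    dist = policy(z)
    ratio = (dist.log_prob(node_idx) - old_log_prob).exp()   # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - EPS_CLIP, 1 + EPS_CLIP) * advantage
    return -torch.min(unclipped, clipped).mean()

if __name__ == "__main__":
    policy = HighLevelPolicy()
    opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
    z = torch.randn(256, LATENT_DIM)                          # batch of latent states
    with torch.no_grad():
        old_dist = policy(z)
        node_idx = old_dist.sample()                          # subgoal nodes chosen by old policy
        old_log_prob = old_dist.log_prob(node_idx)
    advantage = torch.randn(256)                              # placeholder advantages
    loss = clipped_surrogate(policy, z, node_idx, old_log_prob, advantage)
    opt.zero_grad(); loss.backward(); opt.step()
```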
3 Experiment and result analysis
This section aims to evaluate the effectiveness of our proposed LG-H-PPO framework through a series of rigorous experiments. We compare LG-H-PPO with current offline hierarchical and non-hierarchical reinforcement learning algorithms on the challenging D4RL Antmaze navigation benchmark. The experimental design focuses on validating LG-H-PPO’s performance advantages in addressing long-horizon, sparse-reward problems, particularly in convergence speed, final performance, and training stability. Furthermore, we conduct ablation studies and qualitative analyses to examine the critical role of the latent variable graph structure and the internal workings of the framework.
3.1 Experimental design
We evaluate on the Antmaze navigation benchmark, focusing on antmaze-medium-diverse-v2 and antmaze-large-diverse-v2. These environments simulate navigation tasks for quadrupedal robots in medium and large mazes, characterized by high state space dimensions (29-dimensional), continuous action space (8-dimensional), limited field of view, sparse rewards (+1 only upon reaching the goal), and long task durations (up to 1,000 steps). They serve as an ideal platform for testing long-term planning and offline learning capabilities. Training is based on the antmaze-medium-diverse-v2 and antmaze-large-diverse-v2 offline datasets. The diverse dataset contains a large number of suboptimal trajectories generated by medium-level policy exploration, offering broad coverage but few successful trajectories. This places high demands on the algorithm’s trajectory stitching capabilities and its ability to learn optimal policies from suboptimal data (Baek et al., 2025; Shin and Kim, 2023). Each dataset contains approximately one million transition samples.
Baselines: To comprehensively evaluate LG-H-PPO’s performance, we selected the following representative baselines for comparison:
1. Guider (Shin and Kim, 2023): A state-of-the-art offline HRL algorithm based on VAE latent variables and continuous high-level actions. It serves as a crucial foundation and benchmark for our approach.
2. HIQL (Park et al., 2023): A state-of-the-art offline HRL algorithm based on implicit Q-learning and latent variable states as high-level actions. It represents another value-learning-based technical approach to HRL.
3. GAS (Baek et al., 2025): The latest state-of-the-art offline HRL algorithm based on graph structures and graph search, which does not learn explicit high-level policies and excels at trajectory stitching.
4. CQL + HER (Kumar et al., 2020): A state-of-the-art non-hierarchical offline RL algorithm, combined with Hindsight Experience Replay (Andrychowicz et al., 2017) to handle sparse rewards; it is used to demonstrate the advantages of hierarchical structures.
5. H-PPO (Continuous): Our baseline implementation adopts the same PPO algorithmic framework as LG-H-PPO, but its high-level policy directly generates actions in the continuous latent space rather than selecting nodes on the latent graph, isolating the contribution of the graph structure.
We use normalized scores as the primary performance metric. This score is linearly scaled based on the environment’s raw rewards, where 0 corresponds to the performance of a random policy and 100 corresponds to the performance of an expert policy. We run each algorithm and environment with five different random seeds, reporting the final policy’s average normalized score, standard deviation, maximum, and minimum over 100 evaluation episodes. Additionally, we plot the average normalized score curve (learning curve) during training to compare the convergence speed and training stability of different algorithms.

We implement the LG-H-PPO framework using PyTorch. Clustering Implementation Details: For the latent graph construction, we utilize the KMeans module from the Scikit-learn library. To ensure high-quality initialization of the cluster centers and faster convergence, we explicitly employ the ‘k-means++’ initialization strategy rather than random initialization. This is critical for generating representative graph nodes in the complex high-dimensional latent space.

Discussion on Node Count K: The number of graph nodes K is a key hyperparameter that trades off the granularity of the abstraction against the size of the high-level action space: too few nodes cannot capture the structure of the maze, whereas too many nodes enlarge the high-level action space and re-introduce part of the planning difficulty that discretization is intended to remove.
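As a small illustration of this evaluation protocol, the snippet below computes D4RL-style normalized scores and aggregates them across seeds; the random and expert reference returns are placeholders for the environment-specific constants provided by D4RL, and the binary placeholder returns mimic sparse Antmaze outcomes.

```python
# Evaluation metric (sketch): D4RL-style normalized score and aggregation
# across seeds. The reference returns below are placeholders; D4RL provides
# the actual per-environment random/expert constants.
import numpy as np

RANDOM_RETURN, EXPERT_RETURN = 0.0, 1.0   # placeholder reference returns

def normalized_score(raw_return):
    """Linearly map raw returns so that a random policy scores 0 and an expert scores 100."""
    return 100.0 * (raw_return - RANDOM_RETURN) / (EXPERT_RETURN - RANDOM_RETURN)

def aggregate(per_seed_returns):
    """per_seed_returns: array of shape (n_seeds, n_eval_episodes) of raw returns."""
    scores = normalized_score(np.asarray(per_seed_returns)).mean(axis=1)  # per-seed mean score
    return {"mean": scores.mean(), "std": scores.std(),
            "max": scores.max(), "min": scores.min()}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    returns = rng.binomial(1, 0.85, size=(5, 100))   # 5 seeds x 100 evaluation episodes
    print(aggregate(returns))
```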
3.2 Results and analysis
We summarize the final performance of LG-H-PPO and various baseline algorithms on the D4RL Antmaze task in Table 2. To present the results more comprehensively, we report the average normalized score, standard deviation, and maximum and minimum scores across 100 evaluation rounds for five random seeds.
Table 2 clearly demonstrates the superiority of LG-H-PPO. In the antmaze-medium environment, the average scores of all hierarchical methods significantly outperform the non-hierarchical CQL + HER, highlighting the inherent advantage of hierarchical structures in handling long-horizon problems. LG-H-PPO achieves a high score of 90.5 on this task, matching the performance of the top methods HIQL and GAS while significantly outperforming approaches that plan in a continuous latent space, such as Guider and H-PPO (Cont.). This performance advantage is further amplified in the more challenging antmaze-large environment. Confronted with longer paths and sparser rewards, the non-hierarchical method CQL + HER experiences a steep decline to 11.3. Guider and H-PPO (Cont.) also drop to 80.8 and 68.1, respectively, indicating that long-term planning in continuous latent spaces becomes significantly more difficult. In contrast, LG-H-PPO, leveraging its planning capability on the discrete latent graph, maintains a high score of 85.6, substantially outperforming both Guider and H-PPO (Cont.) and coming very close to HIQL’s performance. This strongly suggests that the latent graph structure is key to overcoming the bottlenecks of long-horizon offline HRL planning: by discretizing the action space of the high-level PPO, policy gradient methods can more effectively learn long-range dependencies and select optimal subgoal sequences. LG-H-PPO also exhibits a relatively small standard deviation, and the narrow gap between its maximum and minimum scores indicates stable performance across different random seeds.
To provide a more intuitive understanding of LG-H-PPO’s decision-making process, we visualize a planned trajectory in the Antmaze-Large environment in Figure 3. The visualization highlights the two-level hierarchical structure. As shown in Figure 3a, the high-level PPO policy selects a sequence of discrete graph nodes as subgoals; as shown in Figure 3b, the low-level policy then executes trajectories that reach each subgoal in turn, navigating from the initial state to the final goal.
Figure 3. Visualization of LG-H-PPO’s hierarchical planning and execution. (a) The high-level PPO policy selects a sequence of discrete graph nodes (yellow stars) as subgoals. (b) The low-level policy executes trajectories (red dotted line) to reach each sequential subgoal, successfully navigating from the initial state to the final goal.
4 Conclusion and future work
The main contribution of this paper is the proposal and validation of a novel offline HRL paradigm (LG-H-PPO) that integrates latent variable representation learning with explicit graph structure planning. By discretizing the action space of high-level PPO, we effectively overcome the bottlenecks of existing offline HRL methods based on policy gradients, which suffer from low planning efficiency and poor stability in continuous latent spaces. This work lays a solid foundation for future exploration of more efficient and robust offline HRL algorithms that integrate the abstractive capabilities of latent variables with the advantages of explicit structured planning, particularly in robotic applications requiring long-term reasoning and suboptimal data utilization.
Future research directions hold great promise. First, exploring the learning of edge weights in the graph—such as adopting time efficiency metrics from GAS (Baek et al., 2025) or directly learning edge reachability probabilities/transition costs—and integrating this information into the decision-making process of high-level PPO or as reward shaping signals for low-level policies could enable smarter path selection. Second, online dynamic graph expansion mechanisms can be investigated, allowing agents to dynamically add or modify graph nodes and edges based on new experiences during (limited) online interactions or deployment. This enables the discovery of optimal paths potentially missing in offline data, endowing the algorithm with lifelong learning capabilities. Finally, extending the LG-H-PPO framework to navigation tasks based on high-dimensional observations (e.g., images) represents a significant direction. This requires investigating more robust visual encoders and exploring how to effectively construct and utilize graph structures within visual latent spaces.
Data availability statement
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.
Author contributions
XH: Writing – original draft, Writing – review and editing.
Funding
The author(s) declared that financial support was not received for this work and/or its publication.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was used in the creation of this manuscript. Generative AI was used to assist in content summarization and minor text refinement within the Abstract section.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., et al. (2017). Hindsight experience replay. Adv. Neural Information Processing Systems 30.
Baek, S., Park, T., Park, J., Oh, S., and Kim, Y. (2025). Graph-assisted stitching for offline hierarchical reinforcement learning. arXiv Preprint arXiv:2506.07744.
Barto, A. G. (2021). Reinforcement learning: an introduction, by Richard S. Sutton. SIAM Rev. 6 (2), 423.
Chen, F., Jia, Z., Rakhlin, A., and Xie, T. (2025). Outcome-based online reinforcement learning: Algorithms and fundamental limits. arXiv Preprint arXiv:2505.20268.
Eysenbach, B., Salakhutdinov, R. R., and Levine, S. (2019). Search on the replay buffer: bridging planning and reinforcement learning. Adv. Neural Information Processing Systems 32.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International conference on machine learning (Stockholm, Sweden: PMLR), 1861–1870.
Hafner, D., Lillicrap, T., Norouzi, M., and Ba, J. (2020). Mastering atari with discrete world models. arXiv Preprint arXiv:2010.02193.
Kulkarni, T. D., Narasimhan, K., Saeedi, A., and Tenenbaum, J. (2016). Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. Adv. Neural Information Processing Systems 29.
Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020). Conservative q-learning for offline reinforcement learning. Adv. Neural Information Processing Systems 33, 1179–1191.
Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv Preprint arXiv:2005.01643.
Martinez-Baselga, D., Riazuelo, L., and Montano, L. (2023). Improving robot navigation in crowded environments using intrinsic rewards. arXiv Preprint arXiv:2302.06554, 9428–9434. doi:10.1109/icra48891.2023.10160876
Park, S., Ghosh, D., Eysenbach, B., and Levine, S. (2023). “Offline goal-conditioned rl with latent states as actions,” in ICML workshop on new frontiers in learning, control, and dynamical systems.
Peng, X. B., Kumar, A., Zhang, G., and Levine, S. (2019). Advantage-weighted regression: simple and scalable off-policy reinforcement learning. arXiv Preprint arXiv:1910.00177.
Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2015). High-dimensional continuous control using generalized advantage estimation. arXiv Preprint arXiv:1506.02438.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv Preprint arXiv:1707.06347.
Keywords: latent graph, offline hierarchical PPO, offline reinforcement learning, robot path planning, sparse reward
Citation: Han X (2026) LG-H-PPO: offline hierarchical PPO for robot path planning on a latent graph. Front. Robot. AI 12:1737238. doi: 10.3389/frobt.2025.1737238
Received: 01 November 2025; Accepted: 08 December 2025;
Published: 07 January 2026.
Edited by:
Jun Ma, Hong Kong University of Science and Technology, Hong Kong SAR, China
Reviewed by:
Pengqin Wang, Hong Kong University of Science and Technology, Hong Kong SAR, China
Mengwei Zhang, Tsinghua University, China
Copyright © 2026 Han. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Xiang Han, 2473495989@qq.com