- Learning and Intelligent Systems Lab (LiSL), School of Electrical Engineering and Computer Science, Ohio University, Athens, OH, United States
Training mobile robots through digital twins with deep reinforcement learning (DRL) has gained increasing attention to ensure efficient and safe navigation in complex environments. In this paper, we propose a novel physics-inspired DRL framework that achieves both effective and explainable motion planning. We represent the robot, destination, and obstacles as electrical charges and model their interactions using Coulomb forces. These forces are incorporated into the reward function, providing both attractive and repulsive signals to guide robot behavior. In addition, obstacle boundaries extracted from LiDAR segmentation are integrated as anticipatory rewards, allowing the robot to avoid collisions from a distance. The proposed model is first trained in Gazebo simulation environments and subsequently deployed on a real TurtleBot v3 robot. Extensive experiments in both simulation and real-world scenarios demonstrate the effectiveness of the proposed framework. Results show that our method significantly reduces collisions, maintains safe distances from obstacles, and generates safer trajectories toward the destinations.
1 Introduction
LiDAR-based mobile robot navigation marks a significant advancement in robotics, offering a wide range of advantages and applications. Unlike traditional global-map-based systems, LiDAR generates a real-time, detailed 3D map of the robot’s surroundings, enabling operators to make informed decisions with precise spatial data. This capability is crucial in dynamic and unpredictable environments, where adaptive, sensor-driven awareness outperforms reliance on fixed perspectives.
Motion planning and collision avoidance are critical components of high-performance robotic autonomy. Traditional motion planning approaches typically rely on geometric, sampling-based, or optimization-based methods to create feasible and efficient paths from a starting point to a goal while avoiding obstacles. Graph-based methods such as A* (Hart et al., 1968) and D* (Stentz, 1995), alongside sampling-based techniques like Rapidly-Exploring Random Trees* (RRT*) (Karaman and Frazzoli, 2011) and Probabilistic Roadmaps (PRM) (Yang et al., 2018), remain among the most widely adopted solutions.
In recent years, machine learning (ML)-based solutions have gained popularity for enabling mobile robots to perceive their environments and make maneuvering decisions. Supervised learning methods perform perception and decision-making simultaneously, directly predicting control policies from sensor data such as images (Kim and Chen, 2015; Giusti et al., 2015; Tai et al., 2016; Dai et al., 2020; Back et al., 2020) and LiDAR scans (Chen et al., 2020; Murillo, 2023). In contrast, reinforcement learning (RL) (Michels et al., 2005) allows robots to learn optimal navigation strategies through trial and error. By interacting with the environment and receiving feedback, robots can gradually enhance their navigation performance. When combined with neural networks, deep reinforcement learning (DRL) has demonstrated superhuman performance in various games (Mnih et al., 2015; Xie et al., 2017; He et al., 2020). More recently, DRL-based solutions for collision avoidance and goal-reaching have also been proposed (Singla et al., 2019; Xue and Gonsalves, 2021; Song et al., 2022; Chang et al., 2021; Ouahouah et al., 2021; Olayemi et al., 2023). To reduce costs and improve effectiveness, training is often initially conducted in simulated environments.
LiDAR-based DRL methods have been investigated in recent studies, with particular attention to intrinsic motivation as a means to improve generalization. Zhelo et al. (2018) addressed RL limitations in scenarios such as long corridors and dead ends by incorporating an intrinsic curiosity module, which enhanced exploration and outperformed predefined reward functions in virtual 2D tasks (Mirowski et al., 2016; Long et al., 2018). Shi et al. (2019) applied a similar curiosity-driven approach within an A3C framework using sparse LiDAR input, enabling policies to transfer effectively from simulation to realistic mixed environments.
In parallel, researchers have explored novel architectural designs to address persistent challenges in motion planning. Wang et al. (2018) decomposed planning into obstacle avoidance and goal navigation, employing raw laser rangefinder data within a dual-stream Q-network to generate force-based actions. Kim et al. (2021) proposed a DQN-GRU-based navigation method that incorporated action skipping to improve performance in partially observable MDP-modeled environments, achieving superior results in simulation compared to standard DQN and non-skipping baselines. To address cross-task generalization, Wang et al. (2020) introduced elastic weight consolidation (EWC) into a DDPG framework, enabling policies to preserve prior knowledge and mitigate catastrophic forgetting without full retraining. Yan et al. (2023) also employed DDPG for mapless navigation and demonstrated improved performance in unknown environments compared to A*.
While DRL-based LiDAR navigation methods show considerable promise, their architectures are often constrained by reward designs that lack strong physical grounding. In particular, the reward structures in existing approaches frequently lack clear physical interpretation. For instance, several studies (Xue et al., 2019; Gao et al., 2020; Song et al., 2021) employ a fixed distance penalty, yet its actual impact remains insufficiently examined. Furthermore, many classical path-planning techniques, such as artificial potential fields, have not been effectively integrated into DRL frameworks. Finally, many approaches remain limited to simulation-based validation, and even those tested on real-world robots often provide little quantitative evidence to confirm their effectiveness in practical deployment.
In this paper, we propose a physics-inspired DRL-based motion planning algorithm that generates continuous commands without relying on a map. We utilize Coulomb force to model interactions between the robot, its destination, and surrounding obstacles. To enhance safety, we incorporate object segmentation, enabling the robot to anticipate and avoid collisions from a distance. A 2D LiDAR sensor provides the data necessary to support the robot’s behaviors, including collision avoidance and goal-reaching. We also develop a carefully designed method to enhance the generalization of our solution, validated across various environment settings.
The proposed designs are trained and tested in Gazebo simulation scenes (Robotics, 2014) with large geometric obstacles and deployed in real TurtleBot v3 (TB3) robots (Robotics, 2017). Experiments conducted in both simulated and real environments demonstrate the effectiveness of our overall design and individual components.
The contributions of our work can be summarized as follows:
• We model the robot, the destination, and obstacles as electrical charges and use Coulomb forces to model the interactions between these charges.
• These Coulomb forces are integrated into our DRL framework, providing encouraging and preventative rewards for the robots. The proposed Coulomb-based rewards are smooth and pervasive, offering consistent guidance throughout the entire training field. To the best of our knowledge, this is the first work to employ Coulomb forces in path planning and reinforcement learning.
• Obstacle boundaries extracted from LiDAR segmentation enable the robot to anticipate and avoid collisions from a distance.
• The proposed Coulomb- and vision-based rewards have clear, interpretable effects on robot behavior and performance, thereby providing strong overall explainability of our model.
• We carefully design environment-invariant components in our DRL system to improve the generalization of our solution.
2 Background
2.1 Gravitational and Coulomb forces
Gravity, as formulated by Isaac Newton, is a force of attraction that acts between all objects with mass. Mathematically, it is expressed as
$$F_g = G \frac{m_1 m_2}{r^2},$$
where $G$ is the gravitational constant, $m_1$ and $m_2$ are the masses of the two objects, and $r$ is the distance between them.
Coulomb’s force, in contrast, describes the electrostatic interaction between charged particles. As illustrated in Figure 1, Coulomb’s law quantifies the strength of the repulsive or attractive force between two point charges, such as a proton and an electron in an atom. The law states that the electric force exerted by one charge on another depends on the magnitudes of the charges and the square of the distance between them:
$$F_c = k_e \frac{|q_1 q_2|}{r^2},$$
where $k_e$ is Coulomb’s constant, $q_1$ and $q_2$ are the magnitudes of the two charges, and $r$ is the distance between them.
Figure 1. Coulomb force between two particles
The two laws share a striking mathematical similarity: both forces decrease with the square of the distance between the interacting entities, and their formulas take nearly identical forms. Nonetheless, they differ fundamentally in that gravity is always attractive, while electrostatic forces can attract or repel. Another key difference is their relative strength: on atomic and subatomic scales, the electrostatic force between charged particles is far stronger than their mutual gravitational attraction. Over astronomical distances, however, the near-perfect neutrality of matter means that gravity dominates, shaping the large-scale structure of the universe by pulling planets, stars, and galaxies into stable orbits and clusters.
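To make the relative-strength claim concrete, the ratio of the two forces for a proton–electron pair can be computed directly from the physical constants. The short Python sketch below uses standard constant values; it is purely illustrative and independent of the proposed framework.

```python
# Compare the Coulomb and gravitational forces between a proton and an electron.
# Constants are standard physical values; the ratio is distance-independent
# because both laws share the same inverse-square form.

K_E = 8.988e9        # Coulomb constant, N·m²/C²
G = 6.674e-11        # gravitational constant, N·m²/kg²
Q_E = 1.602e-19      # elementary charge, C
M_P = 1.673e-27      # proton mass, kg
M_E = 9.109e-31      # electron mass, kg


def coulomb_force(q1: float, q2: float, r: float) -> float:
    """Magnitude of the electrostatic force between two point charges."""
    return K_E * abs(q1 * q2) / r**2


def gravitational_force(m1: float, m2: float, r: float) -> float:
    """Magnitude of the gravitational force between two point masses."""
    return G * m1 * m2 / r**2


if __name__ == "__main__":
    r = 5.3e-11  # Bohr radius in meters (any r gives the same ratio)
    f_c = coulomb_force(Q_E, -Q_E, r)
    f_g = gravitational_force(M_P, M_E, r)
    print(f"Coulomb force:       {f_c:.3e} N")
    print(f"Gravitational force: {f_g:.3e} N")
    print(f"Ratio F_C / F_G:     {f_c / f_g:.3e}")   # roughly 2e39
```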
2.2 Classical and RL-based motion planning methods
Robot motion planning methods can be broadly divided into classical and RL-based approaches. Classical planners follow a structured pipeline of global planning, local obstacle avoidance, and trajectory generation, often using graph search or sampling strategies such as A*, RRT*, or DWA (Fox et al., 2002). While reliable in static, well-mapped environments, they struggle in dynamic or uncertain settings due to rigid rule-based logic and the need for labor-intensive map construction. Early classical methods were limited to simple static models (Philippsen and Siegwart, 2003; Cosío and Castañeda, 2004) or treated dynamic objects as static at discrete time steps (Borenstein and Koren, 1990; Borenstein and Koren, 1991), restricting real-world applicability. More recent efforts have improved collision avoidance through algorithmic refinements, including A* and DWA for real-time navigation (Kherudkar et al., 2024), GBI-RRT combined with SLAM (Sun et al., 2023), and enhanced DWA variants for dynamic environments (Cao and Nor, 2024).
In contrast, RL-based motion planning can be broadly categorized into hybrid, end-to-end, and multi-robot approaches. Hybrid methods integrate RL with classical planners to combine reliability with adaptability, such as DRL-enhanced DWA for smoother trajectories (Wang and Huang, 2022), A2C with optimization outperforming Dijkstra + DWA (Xing et al., 2022), and DRL combined with A* to reduce computation (Liu et al., 2024). These methods, however, remain dependent on map quality and consistency. End-to-end methods learn directly from sensor data, with LiDAR-based DQN-GRU achieving superior performance over standard DQN (Kim et al., 2021), stochastic sampling with 2D LiDAR enabling faster training and improved collision avoidance (Beomsoo et al., 2021), and LSTM-TD3 models offering improved temporal decision-making (Wen et al., 2024).
While promising for handling unseen environments, RL-based solutions suffer from high training demands and limited real-world generalization. Multi-robot approaches often adopt centralized training and decentralized execution (CTDE), with examples including enhanced DQN for warehouse path planning, a DRL-MPC-GNN framework for task allocation and coordination (Li et al., 2024), and curriculum learning with LiDAR costmaps yielding strong real-world results (Yu et al., 2024). Memory-augmented DQN variants have also improved multi-robot coordination (Quraishi et al., 2025). Despite scalability and coordination benefits, these methods face challenges in training complexity, dynamic interactions, and partial observability, limiting their practical deployment in robot systems.
2.3 Artificial potential field (APF) algorithm
The APF algorithm, first proposed by Khatib (1986), is a classical physics-inspired framework for real-time path planning and obstacle avoidance. In this approach, the robot is modeled as a particle moving under the influence of a synthetic potential function
$$U(q) = U_{att}(q) + U_{rep}(q),$$
where $q$ denotes the robot’s position, $U_{att}(q)$ is the attractive potential generated by the goal, and $U_{rep}(q)$ is the repulsive potential generated by obstacles.
The attractive potential pulls the robot toward its goal and is commonly modeled as a quadratic function of the Euclidean distance:
$$U_{att}(q) = \frac{1}{2} k_{att}\, \rho_{goal}^2(q),$$
where $k_{att}$ is a positive scaling gain and $\rho_{goal}(q)$ is the Euclidean distance between the robot and the goal.
The repulsive potential is activated only within a finite influence range $\rho_0$ around each obstacle:
$$U_{rep}(q) = \begin{cases} \dfrac{1}{2} k_{rep} \left( \dfrac{1}{\rho(q)} - \dfrac{1}{\rho_0} \right)^2, & \rho(q) \le \rho_0, \\ 0, & \rho(q) > \rho_0, \end{cases}$$
where $k_{rep}$ is a positive scaling gain and $\rho(q)$ is the shortest distance from the robot to the nearest obstacle.
The resultant virtual force is obtained as the negative gradient of the total potential, $F(q) = -\nabla U(q)$, which simultaneously pulls the robot toward the goal and pushes it away from nearby obstacles.
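For reference, the classical APF force computation described above can be sketched in a few lines of Python. The gains $k_{att}$ and $k_{rep}$, the influence range $\rho_0$, and the point-obstacle representation are illustrative choices, not parameters of our framework.

```python
import numpy as np


def apf_force(robot, goal, obstacles, k_att=1.0, k_rep=1.0, rho_0=1.0):
    """Classical APF: attractive force toward the goal plus repulsive
    contributions from point obstacles within the influence range rho_0."""
    robot = np.asarray(robot, float)
    goal = np.asarray(goal, float)

    # Attractive force: negative gradient of 0.5 * k_att * ||robot - goal||^2
    f_att = -k_att * (robot - goal)

    # Repulsive force: negative gradient of the repulsive potential,
    # active only when an obstacle is closer than rho_0
    f_rep = np.zeros(2)
    for obs in obstacles:
        diff = robot - np.asarray(obs, float)
        rho = np.linalg.norm(diff)
        if 0.0 < rho <= rho_0:
            f_rep += k_rep * (1.0 / rho - 1.0 / rho_0) * (1.0 / rho**2) * (diff / rho)

    return f_att + f_rep  # resultant virtual force steering the robot
```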
The DRL framework presented in this work is motivated by the principles of the APF algorithm. To overcome its inherent limitations, we design RL reward functions that yield a physically grounded and globally smooth motion-guidance field. This field is realized using Coulomb-force–based rewards, with the full formulation presented in Section 3.
2.4 Generalization and sim-to-real transfer in DRL
Model generalization is a critical issue in machine learning, and it is especially important for DRL-based navigation and control. In robotics, the ability of a policy to adapt to new or changing environments is vital, as operating conditions are often unpredictable and diverse. A well-generalized policy can handle unseen scenarios, task variations, and sensor differences, whereas a poorly generalized one risks catastrophic failure outside its training distribution. Despite its importance, many DRL studies still evaluate methods only on the same environments they were trained on, such as Atari (Bellemare et al., 2013), Gazebo (Koenig and Howard, 2004), or OpenAI Gym (Brockman et al., 2016), providing limited insight into generalization.
Recent efforts have begun to address this issue. Yu et al. (2020) studied the generalization of multiple DRL algorithms by training across diverse environments, while Doukhi and Lee (2021) mapped sensor data, robot states, and goals to continuous velocity commands, though their work was restricted to unseen targets rather than unseen scenes. Increasingly, DRL-based robot obstacle avoidance research has emphasized sim-to-real transfer (Lee et al., 2022; Wu et al., 2023; Joshi et al., 2024). For instance, Anderson et al. (2021) introduced a subgoal model that aligns simulation-trained discrete actions with real-world continuous control, using domain randomization to reduce visual discrepancies. Similarly, Zhang et al. (2021) applied object detection to generate real-time 3D bounding boxes, mitigating the effect of varying obstacle shapes and appearances on robot navigation and improving robustness to sim-to-real differences.
3 Methods
In this work, a mobile robot begins at a specified location and autonomously navigates toward a target destination. Static and dynamic obstacles are placed along the straight line connecting the start and end points. The primary objective is to enable the robot to reach the destination while effectively avoiding collisions. This capability is achieved through a physics-inspired DRL framework with two key considerations: (1) obstacle avoidance and (2) generalization to previously unseen environments.
The proposed models are first trained in the Gazebo simulation environment and then deployed on a TB3, which functions as a simple two-wheeled differential-drive robot. The TB3 is equipped with multiple sensors, including a 360° LiDAR and an Inertial Measurement Unit (IMU). It also supports the Robot Operating System (ROS 2), which enables seamless communication and control between the DRL algorithms and the hardware platform.
Figure 2 illustrates the experimental setup. The green dashed line represents the straight path between the robot’s initial position and the destination, while the red dashed line shows the actual trajectory, deviating from the path to avoid obstacles. At each time step, the LiDAR produces a distance map of sampled points from the surrounding environment, represented by yellow dashed lines on the obstacles. The robot’s motion is controlled by two velocity components: (1) linear velocity, which determines the speed of forward movement, and (2) angular velocity, which controls the rotation rate of the two-wheeled base.
3.1 Overall design and key innovations
The primary innovation of this work lies in: (1) modeling the robot, destination, and obstacles as charged particles, and (2) utilizing Coulomb forces to represent their interactions while formulating DRL rewards based on these interactions. Specifically, the robot and its destination are modeled as electric charges with opposite signs, generating an attractive force that guides the robot toward the goal. Obstacles are represented as an array of charges with the same sign as the robot, producing a combined repulsive force that steers the robot away from collisions. In both cases, the force magnitudes scale inversely with the square of the distance between interacting entities.
Our design introduces two breakthrough innovations that enable highly effective agent learning. The first is the use of gradually varying and ubiquitous forces, which provide consistent guidance across the entire training field. The second is the inverse-square distance formulation, which is particularly effective for collision avoidance, as the repulsive force increases sharply when the robot approaches an obstacle. By incorporating the robot’s direction of movement, these forces are translated into reward signals for the DRL agent, either encouraging or redirecting its trajectory.
Our innovative designs have clear physical interpretations, making the behavior and performance of our model highly explainable. Furthermore, we propose an object-avoidance reward based on LiDAR scan segmentation, which enables the robot to avoid large obstacles from a distance, significantly enhancing the overall performance of the models.
The overall workflow of the proposed framework is illustrated in Figure 3, which outlines the interaction among the environment, a Twin Delayed Deep Deterministic Policy Gradient (TD3) control agent (Fujimoto et al., 2018), and Prioritized Experience Replay (PER) (Schaul et al., 2016). The environment interacts with the agent through continuous sensing and motion feedback. The proposed rewards are generated directly from these interactions and serve as the primary learning signals for the agent. A standard TD3 algorithm is employed as the training backbone to validate the effectiveness and generality of the proposed reward formulation, while PER is applied to improve sample efficiency during training.
This framework emphasizes the role of the proposed reward, which provides dense and physically interpretable feedback directly from environment interactions. The integration of Coulomb-force and LiDAR-vision rewards significantly improves learning stability and convergence speed compared with conventional sparse rewards. Experimental analysis further demonstrates reduced collision frequency and enhanced path planning capability, confirming that the proposed reward design improves both the efficiency and robustness of policy learning. These observed improvements in convergence speed, stability, and trajectory efficiency directly reflect the effectiveness of the proposed reward formulation, which operationalizes the main research contributions within the defined DRL framework.
3.2 DRL algorithm selection
The TD3 algorithm was chosen as the primary DRL framework due to its inherent stability, high sample efficiency, and robustness in continuous control tasks. TD3 extends the Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) method by introducing twin Q-networks to mitigate overestimation bias and delayed policy updates to prevent divergence, resulting in smoother convergence. Such properties are particularly important for velocity-based robot navigation, where unstable value estimation can lead to oscillatory motion or unsafe control.
Although Proximal Policy Optimization (PPO) (Schulman et al., 2017) has been successfully applied to various robotic control and path planning problems, its on-policy nature requires frequent policy rollouts and gradient updates, leading to lower sample efficiency and higher computational cost in long-horizon navigation tasks. In contrast, TD3’s off-policy structure enables efficient experience reuse from replay buffers, accelerating convergence while maintaining stable policy updates.
Meanwhile, Soft Actor–Critic (SAC) (Haarnoja et al., 2018) enhances exploration through entropy regularization, which is advantageous in sparse-reward or high-uncertainty environments. However, in our Coulomb-guided setup, where the reward function already provides strong directional gradients and the control space demands consistent velocity regulation, the additional stochasticity of SAC’s policy can introduce unnecessary action variance and make policy updates less stable.
As a result, a deterministic actor-critic algorithm such as TD3 offers a more direct and stable optimization path for continuous navigation tasks with dense, physics-guided rewards, aligning well with the objectives of this work.
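For completeness, the core TD3 update described above (clipped double-Q targets, target-policy smoothing, and delayed actor/target updates) is sketched below in PyTorch. This is a generic illustration of the standard algorithm (Fujimoto et al., 2018), not our exact training code: network definitions, replay sampling, and the PER importance weights are omitted, and the hyperparameter values shown are common defaults rather than the settings used in our experiments.

```python
import torch
import torch.nn.functional as F


def td3_update(actor, actor_tgt, critic1, critic2, critic1_tgt, critic2_tgt,
               actor_opt, critic_opt, batch, step,
               gamma=0.99, tau=0.005, policy_noise=0.2, noise_clip=0.5,
               policy_delay=2, max_action=1.0):
    """One TD3 step: clipped double-Q targets with target-policy smoothing,
    plus a delayed (every `policy_delay` steps) actor and target-network update."""
    state, action, reward, next_state, done = batch

    with torch.no_grad():
        # Target-policy smoothing: add clipped noise to the target action
        noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (actor_tgt(next_state) + noise).clamp(-max_action, max_action)
        # Clipped double-Q: take the minimum of the two target critics
        target_q = torch.min(critic1_tgt(next_state, next_action),
                             critic2_tgt(next_state, next_action))
        target_q = reward + gamma * (1.0 - done) * target_q

    # Update both critics toward the shared target
    critic_loss = (F.mse_loss(critic1(state, action), target_q) +
                   F.mse_loss(critic2(state, action), target_q))
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed policy update: refresh the actor and target networks less frequently
    if step % policy_delay == 0:
        actor_loss = -critic1(state, actor(state)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        # Soft (Polyak) update of all target networks
        for tgt, src in [(actor_tgt, actor), (critic1_tgt, critic1), (critic2_tgt, critic2)]:
            for p_tgt, p in zip(tgt.parameters(), src.parameters()):
                p_tgt.data.mul_(1 - tau).add_(tau * p.data)
```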
3.3 States and actions
States, actions, and rewards are the three fundamental components of most DRL algorithms. In this work, the state $s_t$ at time step $t$ is composed of the following components:
• LiDAR-based distance map
• Unit distance to the goal
• Goal angle
• Robot velocities: linear velocity and angular velocity
The design of these state components is intended to enhance generalization. The LiDAR-derived shortest distance captures proximity to obstacles in an environment-independent manner. The goal angle is expressed relative to the robot’s own heading rather than a global coordinate frame, which likewise avoids environment-specific references.
At each step, the action output by the agent specifies changes to the robot’s linear and angular velocities, which updates them from their values at the previous time step.
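A minimal sketch of how such a state vector and incremental velocity action could be assembled is given below. The normalization constants, increment limits, and velocity bounds are illustrative assumptions (the TB3 Burger’s published velocity limits appear only as example bounds), not the exact values used in our implementation.

```python
import numpy as np


def build_state(lidar_ranges, robot_xy, robot_yaw, goal_xy, v, w, d_max=3.5):
    """Assemble the observation from environment-invariant quantities:
    a normalized LiDAR distance map, normalized goal distance, the goal angle
    relative to the robot's heading, and the current velocities."""
    lidar = np.clip(np.asarray(lidar_ranges, float), 0.0, d_max) / d_max
    delta = np.asarray(goal_xy, float) - np.asarray(robot_xy, float)
    goal_dist = np.linalg.norm(delta)
    # Angle to the goal measured in the robot's own frame, wrapped to [-pi, pi]
    goal_angle = np.arctan2(delta[1], delta[0]) - robot_yaw
    goal_angle = np.arctan2(np.sin(goal_angle), np.cos(goal_angle))
    return np.concatenate([lidar, [min(goal_dist, d_max) / d_max, goal_angle, v, w]])


def apply_action(v, w, action, dv_max=0.05, dw_max=0.3,
                 v_bounds=(0.0, 0.22), w_bounds=(-2.84, 2.84)):
    """Interpret the policy output as bounded increments of the linear and
    angular velocities (TB3 Burger limits used only as example bounds)."""
    v = float(np.clip(v + action[0] * dv_max, *v_bounds))
    w = float(np.clip(w + action[1] * dw_max, *w_bounds))
    return v, w
```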
3.4 Reward design
In this work, the overall reward
where:
•
•
•
•
•
•
•
The four (4) reward terms in the box of Equation 2 represent the baseline rewards, indicating that these terms are included in the reward function of every model.
Among these baseline rewards,
where
where
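As a rough illustration of this structure, the sketch below combines hypothetical baseline terms with the proposed $R_{Coulomb}$ and $R_{vision}$ rewards. The goal-reaching bonus, collision penalty, progress term, per-step cost, and all weights shown here are placeholders for illustration only, not the paper’s exact terms or values.

```python
def baseline_reward(reached_goal, collided, dist_prev, dist_now,
                    r_goal=100.0, r_collision=-100.0, k_progress=10.0, r_step=-0.1):
    """Hypothetical baseline terms: a terminal goal bonus, a terminal collision
    penalty, progress toward the goal, and a small per-step cost."""
    if reached_goal:
        return r_goal
    if collided:
        return r_collision
    return k_progress * (dist_prev - dist_now) + r_step


def total_reward(r_base, r_coulomb, r_vision, use_coulomb=True, use_vision=True):
    """The overall reward combines the baseline with the proposed terms,
    which are enabled or disabled depending on the model configuration."""
    return r_base + (r_coulomb if use_coulomb else 0.0) + (r_vision if use_vision else 0.0)
```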
3.4.1 Coulomb force rewards $R_{Coulomb}$
As previously introduced, we model the robot, destination, and obstacles as charged particles and utilize Coulomb’s law to represent their interactions. These interactions form the basis of the Coulomb-based reward,
Attraction Reward
Figure 4a illustrates the attractive force exerted by the destination on the robot. To map this attraction into a reward for the DRL agent, we compute the inner product between the attractive force vector and the robot’s direction of movement, so that motion toward the goal yields a positive reward while motion away from it is discouraged.
Figure 4. An illustration of the Coulomb force rewards
This resultant force is then used to compute the Coulomb-based repulsive reward applied to the robot. For simplicity, Coulomb’s constant is set to 1. The repulsive reward is obtained by projecting the summed repulsive force onto the robot’s direction of movement, mirroring the construction of the attraction reward.
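A minimal sketch of how the attraction and repulsion rewards could be computed from these force–motion inner products is given below; the charge magnitudes, the small stabilizing epsilon, and the exact projection convention are illustrative assumptions rather than the paper’s precise formulation.

```python
import numpy as np


def coulomb_attraction_reward(robot_xy, goal_xy, move_dir, q=1.0, eps=1e-6):
    """Attraction reward: the destination carries a charge of opposite sign to
    the robot; the attractive force (magnitude q / r^2, Coulomb's constant set
    to 1) is projected onto the robot's unit movement direction."""
    diff = np.asarray(goal_xy, float) - np.asarray(robot_xy, float)
    r = np.linalg.norm(diff) + eps
    force = q * diff / r**3                      # q / r^2 along the unit vector diff / r
    return float(np.dot(force, np.asarray(move_dir, float)))


def coulomb_repulsion_reward(robot_xy, obstacle_points, move_dir, q=1.0, eps=1e-6):
    """Repulsion reward: each LiDAR-detected obstacle point carries a charge of
    the same sign as the robot; the summed repulsive force is projected onto
    the movement direction, penalizing motion toward obstacles."""
    robot_xy = np.asarray(robot_xy, float)
    force = np.zeros(2)
    for p in obstacle_points:
        diff = robot_xy - np.asarray(p, float)   # points away from the obstacle
        r = np.linalg.norm(diff) + eps
        force += q * diff / r**3
    return float(np.dot(force, np.asarray(move_dir, float)))
```

Because both terms inherit the inverse-square form, the repulsive penalty grows sharply as the robot closes in on an obstacle while remaining a gentle, ever-present signal elsewhere in the field.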
Our Coulomb-force–based reward was originally inspired by the classical APF formulation; both use a physics-inspired combination of attractive and repulsive influences to guide robot motion. In this sense, our formulation conceptually extends the idea of shaping a force field that directs the robot toward the goal while avoiding obstacles. Our method, however, has several key differences and advancements over APF:
• Algorithmic vs. Learning-Based Mechanism: The classical APF method is algorithmic and deterministic, meaning its behavior remains fixed and does not refine or improve with repeated experience. In contrast, our Coulomb-force rewards are integrated into a learning-based DRL framework. The agent receives reward signals from its local neighborhood and continuously improves its policy through interaction and extensive training. This training process allows the agent to discover effective behavioral patterns that reflect the underlying physical field, leading to more sophisticated navigation and improved escape from local minima where traditional APF typically becomes trapped.
• Smooth, Globally Consistent Reward Field: The reward formulation we propose produces a smooth, ubiquitous, and gradually varying guidance field. This contrasts with classical APF, where handcrafted potential shapes and discontinuous distance thresholds often create non-convex fields prone to sharp gradients and local minima. The smoothness and physical consistency of the Coulomb-based reward help the learned policy achieve more stable motion guidance and reduce susceptibility to local minima.
• Exploration and Stochasticity in DRL: As a DRL approach, our learning model possesses an exploration capability that acts analogously to a simulated annealing process. This property helps the agent occasionally deviate from local optima and, consequently, discover more globally efficient paths.
3.4.2 Vision rewards $R_{vision}$
Figure 5. Example of LiDAR segmentation. (a) shows the original LiDAR scan with 40 samples; (b) displays the corresponding segmentation result after applying DIET, where different colors represent distinct objects.
Since obstacle avoidance is less urgent when obstacles are farther away, we normalize the reward using a Gaussian function of the potential collision time
Formally, the vision reward is defined as:
where
In more detail, we employ the DIET (Dietmayer, 2001) algorithm for LiDAR segmentation. The procedure, illustrated in Figure 7, operates by examining the distances between adjacent LiDAR scan points to identify potential object boundaries. This allows neighboring points that belong to the same physical surface to be grouped together, thereby enabling effective segmentation of obstacles. The DIET function is defined as:
where
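The sketch below illustrates this kind of adjacent-point segmentation together with a Gaussian time-weighted anticipatory penalty. The threshold form follows the spirit of the breakpoint test in Dietmayer (2001), but the constants, the Gaussian width, and the penalty scale are illustrative assumptions, not the paper’s exact parameters.

```python
import numpy as np


def diet_segment(ranges, angles, c0=0.1, c1=0.05):
    """Group adjacent LiDAR returns into objects: a new segment starts whenever
    the gap between consecutive points exceeds a range-dependent threshold
    c0 + c1 * min(r_i, r_{i-1})."""
    ranges = np.asarray(ranges, float)
    angles = np.asarray(angles, float)
    pts = np.stack([ranges * np.cos(angles), ranges * np.sin(angles)], axis=1)
    labels = np.zeros(len(pts), dtype=int)
    for i in range(1, len(pts)):
        gap = np.linalg.norm(pts[i] - pts[i - 1])
        threshold = c0 + c1 * min(ranges[i], ranges[i - 1])
        labels[i] = labels[i - 1] + (1 if gap > threshold else 0)
    return labels


def vision_reward(time_to_collision, sigma=2.0, scale=-1.0):
    """Anticipatory penalty weighted by a Gaussian of the potential collision
    time: obstacle boundaries that would be reached sooner contribute a
    stronger (more negative) reward."""
    return scale * np.exp(-time_to_collision**2 / (2.0 * sigma**2))
```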
4 Environment and model setups
We design training and testing environments in Gazebo to simulate the TB3 Burger’s motion planning and collision avoidance under realistic conditions. After training and validating the models in simulation, we deploy them on the TB3 to evaluate their performance in real-world scenarios. Training is conducted on an NVIDIA RTX A6000 GPU. In this framework, simulation serves as the primary stage for model development and verification, while real-world experiments provide the final assessment of robustness and reliability.
4.1 Simulation environments
The motion planning policies are trained in Gazebo using a digital twin of the TB3 robot provided by the manufacturer. The simulation environments, illustrated in Figure 8, include both training and testing setups. Figures 8a,b (referred to as Scene 0 and Scene 1) depict the same environment, with 8(b) containing additional moving obstacles. The DRL model is trained in 8(b), Scene 1, since it represents a more complex environment. Figures 8c,d (Scene 2 and Scene 3) present an unseen environment used exclusively for testing, designed to evaluate the model’s generalization and robustness. In 8(a) and 8(c), only static obstacles (walls) are present, while 8(b) and 8(d) include dynamic obstacles represented by gray cylinders.
Figure 8. Simulation environments for model training and testing. (a) Test scene 0 without moving obstacles; (b) training and test scene 1 with moving obstacles; (c) test scene 2 without moving obstacles; and (d) test scene 3 with moving obstacles. Gray cylinders in (b,d) denote dynamic (moving) obstacles.
In all environments, the robot starts at the center of the scene (0,0). At the beginning of each epoch, a goal is randomly selected from a predefined set of locations. If the robot reaches the goal without collision, it continues toward a new goal from its current position. In the event of a collision, the environment is reset by relocating the robot to the center and assigning a new random goal.
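A minimal sketch of this episode-management logic is shown below; the goal coordinates and the reset callback are placeholders for illustration.

```python
import random

# Placeholder goal coordinates; the actual predefined set differs per scene.
GOAL_CANDIDATES = [(2.0, 2.0), (-2.0, 1.5), (1.5, -2.0), (-1.5, -1.5)]


def on_episode_end(collided, reset_robot_to_center):
    """Episode management described above: after a successful goal-reach the
    robot keeps its current pose; after a collision it is reset to the scene
    center (0, 0). In both cases a new goal is sampled from the candidate set."""
    if collided:
        reset_robot_to_center()          # teleport back to (0, 0) in simulation
    return random.choice(GOAL_CANDIDATES)
```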
4.2 Real environments
In the real-world experiments, we used the TB3 to evaluate our DRL-based models. The hardware setup is shown in Figure 9a. From top to bottom, the robot is equipped with a
Figure 9. Deployment of a TB3 in real-world environments. (a) Our robot; (b) Real Test Scene 1; (c) Real Test Scene 2: with an extended obstacle.
Sim-to-real transfer remains a major challenge for RL-based algorithms, as models that perform well in simulation may fail in physical environments. Therefore, real-world testing is essential. Figure 9b shows the first real-world test environment (referred to as Real Test Scene 1), which contains two obstacles to evaluate each model’s collision avoidance and goal-reaching capabilities. Figure 9c depicts a second test environment (Real Test Scene 2), where one obstacle is extended to further test the robot’s ability to find alternative paths. The third test environment, not shown in the figure, builds on the first by introducing dynamic obstacles, allowing assessment of the robot’s performance under moving hazards. Success is defined as reaching within a radius of
4.3 Model setups
To evaluate the impact of our proposed reward terms on collision avoidance, goal reaching, and sim-to-real generalization performance, we design five (5) models for comparative experiments. The first model,
The second model,
The third model,
The fourth model,
These five models are constructed through different combinations of reward terms. By comparing models with and without the
4.4 Evaluation metrics
To more comprehensively assess the navigation and collision avoidance performance of different models, three quantitative metrics were employed: Success Rate (SR), Collision Ratio (CR), and Average Goal Distance (GD). These metrics jointly evaluate navigation reliability, safety, and efficiency, enabling a more rigorous comparison among all tested models and environments. The details of these three metrics are described below:
• SR: Success rate is defined as the ratio of successful rollouts, where the robot reaches the goal without collision, to the total number of rollouts. It measures the overall reliability of the navigation policy in completing tasks successfully. A higher SR indicates stronger obstacle avoidance and goal-reaching capability. SR serves as the primary evaluation metric in both simulation and real-world experiments.
• CR: Collision ratio is designed to evaluate navigation safety and efficiency when the total runtime cannot be directly compared. In our simulation, each episode restarts from the initial position after collision, and start-goal pairs differ across different model tests, making navigation time or path length unsuitable for fair comparison. Therefore, CR reflects the frequency of collisions normalized by total steps, indicating how safely and efficiently the robot navigates over longer trajectories. A lower CR means the model can operate longer with fewer collisions, demonstrating better collision-avoidance capability and stability. CR is evaluated only in simulation experiments.
• Avg. GD: Average goal distance captures the model’s tendency to approach the goal, even in failed attempts. While SR only measures how often a model reaches the goal, GD quantifies how close the robot remains to the target at the end of each rollout. A lower GD indicates that the model either successfully reaches the goal or, in failure cases, terminates nearer to it, which demonstrates stronger goal-reaching tendency and better awareness of feasible solutions. This metric is evaluated only in simulation for controlled quantitative comparison.
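Assuming each rollout is logged as a record of its outcome, collision count, step count, and final goal distance, the three metrics could be computed as in the following sketch (the record format is an assumption for illustration).

```python
def evaluate(rollouts):
    """Compute SR, CR, and average GD from a list of rollout records.
    Each record is assumed to contain 'reached_goal' (bool), 'collisions' (int),
    'steps' (int), and 'final_goal_distance' (meters)."""
    n = len(rollouts)
    sr = 100.0 * sum(r["reached_goal"] for r in rollouts) / n          # Success Rate (%)
    cr = sum(r["collisions"] for r in rollouts) / sum(r["steps"] for r in rollouts)  # Collision Ratio
    gd = sum(r["final_goal_distance"] for r in rollouts) / n           # Avg. GD (m)
    return {"SR (%)": sr, "CR": cr, "Avg. GD (m)": gd}
```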
5 Experiments and results
The major innovation of our design lies in the two reward components,
5.1 Results in simulation environments
The five (5) models are trained in the training environment for 7,000 epochs, and the corresponding training data are shown in Figure 10. The figure plots the average reward obtained over each set of 10 epochs against the training epochs. From Figure 10, we observe that
After training converged, the models were evaluated in the test scenes using success rate as the performance metric. Table 1 summarizes the results of the five trained models tested across four (4) simulation environments. Dynamic indicates the presence of moving obstacles, while unseen refers to deployment in a previously unseen (new) environment.
Table 1. Quantitative test results for
The key observations are as follows:
1. The
2. Incorporating Coulomb reward components
3. Adding vision rewards
4. The overall best-performing model in simulation, based on success rate, collision ratio and average goal distance is
5.2 Evaluation of robot performance in real environments
To evaluate the trained policies on a real robot, we deployed the five DRL models onto a TB3. As described in the previous subsection, the test environments are categorized into three types: Real Test Scene 1, Scene 2, and Scene 3. The robot trajectories are visualized in RViz using green dots, which represent real-time TB3 position data.
The setups and results are as follows. In Real Test Scene 1, two static obstacles are placed directly between the start point and the goal, requiring the TB3 to navigate around them to reach its destination. This setup evaluates each model’s collision-avoidance capability under the sim-to-real challenge. From the trajectories shown in Figure 11, we observe that although
Figure 11. Plots of trajectories for the models in Real Test Scene 1 with static obstacles. (a)
In contrast, the other four models successfully navigate around the obstacles and reach the goal. Their trajectories in this test scene are generally similar, all bypassing the obstacles from the right side with slight variations in clearance. Among them,
In Real Test Scene 2, to further challenge the models, we extended the second obstacle to evaluate their ability to find an alternative path when the previous route was no longer feasible. Figure 12 illustrates the trajectory results of the five models. It can be observed that
Figure 12. Plots of trajectories for models under Real Test Scene 2 with static obstacles. (a)
Among the successful models,
The effectiveness of the Coulomb reward
Figure 13. Environment overview with a red bounding box in (a) indicating the zoom-in region, and detailed views of the highlighted area in (b,c). In (b,c), the black arrowed curves show the robot’s motion under obstacle influence with
In Real Test Scene 3, moving obstacles were placed along the TB3’s travel path to evaluate its ability to avoid obstacles in dynamic environments. Based on visual inspection,
We conducted a statistical analysis of the five models across the three real-world test scenes. Each model was run five times (i.e., Epoch = 5), and the average success rates are summarized in Table 2. These results confirm our visual observations:
Table 2. Statistical results for
After examining the actual trajectories taken by the robots in simulation tests, we identified a plausible explanation, illustrated in Figure 14. As shown in the figure, under identical start and goal positions,
Figure 14. Schematic path comparison under the same start and goal: (a)
To summarize, TB robots equipped with the trained motion planning policies (
5.3 Additional cluttered simulated environments and test results
To further validate the robustness and generalization of the proposed DRL framework, two additional cluttered simulation environments were constructed, as shown in Figure 15.
Figure 15. Additional cluttered simulation environments. (a) Test scene 4 includes static obstacles only; (b) test scene 5 introduces additional moving obstacles, which are cylindrical in shape.
Test Scene 4 (illustrated in Figure 15a) contains densely arranged static obstacles, forming multiple narrow corridors and enclosed regions that challenge precise local navigation and obstacle avoidance. Test Scene 5 (shown in Figure 15b) extends this design by introducing multiple cylindrical obstacles that move dynamically along predefined trajectories. As these cylindrical obstacles move, the layout of traversable space changes over time, forming a dense and dynamic scene that explicitly requires adaptive path planning.
Both environments substantially increase the navigation difficulty compared with the Test Scenes 0–3, imposing tighter spatial constraints and more complex interactions between the robot, obstacles, and goal. The policies trained in earlier experiments were directly deployed in these new scenes without retraining to evaluate cross-environment generalization.
The quantitative outcomes summarized in Table 3 provide a detailed characterization of policy robustness as environmental complexity increases from the previously tested sparse settings (Test Scenes 0–3) to the newly introduced cluttered environments (Test Scenes 4–5). Across all five (5) models, performance declined notably as the environment became more cluttered. This drop was primarily caused by the reduced navigable space and the increased probability of the agent becoming trapped in local minima created by narrow corridors and complex obstacle layouts.
Table 3. Quantitative results in additional cluttered environments. SR denotes Success Rate (%), CR denotes Collision Ratio, and Avg. GD is the average distance from the robot to the goal at the end of each run, measured in meters. Rollout = 200.
Nevertheless, the Coulomb- and vision-guided model
When compared to the earlier, sparser Test Scenes 0–3, the benefit of the Coulomb- and vision-guided policy becomes more obvious as the environment becomes more cluttered, while purely using
6 Conclusion
In this paper, we presented a physics-inspired DRL framework for mobile robot motion planning that leverages Coulomb-force modeling to provide interpretable and effective guidance. By representing the robot, goal, and obstacles as electrical charges, we introduced a novel Coulomb-based reward mechanism that delivers smooth, pervasive, and consistent signals during training. To the best of our knowledge, this is the first work to employ Coulomb forces in path planning and reinforcement learning.
Our approach further incorporates obstacle boundaries extracted from LiDAR segmentation, enabling the robot to anticipate and avoid collisions in advance. Through training in a digital twin environment and deployment on a real TB3 robot, we demonstrated that the proposed framework significantly reduces collisions, maintains safe obstacle clearances, and improves trajectory smoothness across both simulated and real-world scenarios. These results confirm not only the effectiveness but also the strong explainability of our Coulomb- and vision-based rewards in shaping robot behavior.
Finally, by carefully designing environment-invariant components, our system exhibits enhanced generalization, suggesting broad applicability to diverse navigation tasks. Moving forward, this framework provides a promising foundation for extending physics-inspired reinforcement learning to multi-robot systems, more complex environments, and real-time adaptive planning.
Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Author contributions
SS: Software, Writing – review and editing, Methodology, Supervision, Funding acquisition, Conceptualization, Writing – original draft, Investigation, Formal Analysis, Visualization, Resources, Project administration, Validation, Data curation. TB: Writing – original draft, Resources, Conceptualization, Writing – review and editing, Methodology, Supervision, Investigation. JL: Data curation, Validation, Conceptualization, Project administration, Methodology, Visualization, Investigation, Supervision, Resources, Writing – review and editing, Funding acquisition, Formal Analysis, Writing – original draft.
Funding
The author(s) declared that financial support was received for this work and/or its publication. This work was partially supported by the Ohio University Research Committee (OURC) Fund.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was used in the creation of this manuscript. GenAI was used only for final sentence refinement.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Anderson, P., Shrivastava, A., Truong, J., Majumdar, A., Parikh, D., Batra, D., et al. (2021). “Sim-to-real transfer for vision-and-language navigation,” in Conference on robot learning (PMLR), 671–681.
Back, S., Cho, G., Oh, J., Tran, X.-T., and Oh, H. (2020). Autonomous uav trail navigation with obstacle avoidance using deep neural networks. J. Intelligent and Robotic Syst. 100, 1195–1211. doi:10.1007/s10846-020-01254-5
Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279. doi:10.1613/jair.3912
Beomsoo, H., Ravankar, A. A., and Emaru, T. (2021). “Mobile robot navigation based on deep reinforcement learning with 2d-lidar sensor using stochastic approach,” in 2021 IEEE international conference on intelligence and safety for robotics (ISR) (IEEE), 417–422.
Borenstein, J., and Koren, Y. (1990). “Real-time obstacle avoidance for fast Mobile robots in cluttered environments,” in Proceedings., IEEE international conference on robotics and automation (IEEE), 572–577.
Borenstein, J., Koren, Y., et al. (1991). The vector field histogram-fast obstacle avoidance for mobile robots. IEEE Transactions Robotics Automation 7, 278–288. doi:10.1109/70.88137
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., et al. (2016). Openai gym.
Cao, Y., and Nor, N. M. (2024). An improved dynamic window approach algorithm for dynamic obstacle avoidance in mobile robot formation. Decis. Anal. J. 11, 100471. doi:10.1016/j.dajour.2024.100471
Chang, L., Shan, L., Jiang, C., and Dai, Y. (2021). Reinforcement based mobile robot path planning with improved dynamic window approach in unknown environment. Aut. Robots 45, 51–76. doi:10.1007/s10514-020-09947-4
Chen, W., Sun, J., Li, W., and Zhao, D. (2020). A real-time multi-constraints obstacle avoidance method using lidar. J. Intelligent and Fuzzy Syst. 39, 119–131. doi:10.3233/jifs-190766
Cosío, F. A., and Castañeda, M. P. (2004). Autonomous robot navigation using adaptive potential fields. Math. Computer Modelling 40, 1141–1156. doi:10.1016/j.mcm.2004.05.001
Da, L., Turnau, J., Kutralingam, T. P., Velasquez, A., Shakarian, P., and Wei, H. (2025). A survey of sim-to-real methods in rl: progress, prospects and challenges with foundation models. arXiv Preprint arXiv:2502.13187.
Dai, X., Mao, Y., Huang, T., Qin, N., Huang, D., and Li, Y. (2020). Automatic obstacle avoidance of quadrotor uav via cnn-based learning. Neurocomputing 402, 346–358. doi:10.1016/j.neucom.2020.04.020
Dietmayer, K. (2001). Model-based object classification and object tracking in traffic scenes from range-images. IV2001, 25–30.
Doukhi, O., and Lee, D.-J. (2021). Deep reinforcement learning for end-to-end local motion planning of autonomous aerial robots in unknown outdoor environments: Real-time flight experiments. Sensors 21, 2534. doi:10.3390/s21072534
Fox, D., Burgard, W., and Thrun, S. (2002). The dynamic window approach to collision avoidance. IEEE Robotics and Automation Magazine 4, 23–33. doi:10.1109/100.580977
Fujimoto, S., Hoof, H., and Meger, D. (2018). “Addressing function approximation error in actor-critic methods,” in International conference on machine learning (Stockholm, Sweden: Proceedings of Machine Learning Research), 1587–1596.
Gao, J., Ye, W., Guo, J., and Li, Z. (2020). Deep reinforcement learning for indoor mobile robot path planning. Sensors 20, 5493. doi:10.3390/s20195493
Giusti, A., Guzzi, J., Cireşan, D. C., He, F.-L., Rodríguez, J. P., Fontana, F., et al. (2015). A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics Automation Lett. 1, 661–667. doi:10.1109/LRA.2015.2509024
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proceedings of the 35th international conference on machine learning (Stockholm, Sweden: Proceedings of Machine Learning Research), 1861–1870.
Hart, P. E., Nilsson, N. J., and Raphael, B. (1968). A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions Syst. Sci. Cybern. 4, 100–107. doi:10.1109/tssc.1968.300136
He, L., Aouf, N., Whidborne, J. F., and Song, B. (2020). “Integrated moment-based lgmd and deep reinforcement learning for uav obstacle avoidance,” in 2020 IEEE international conference on robotics and automation (ICRA) (IEEE), 7491–7497.
Joshi, B., Kapur, D., and Kandath, H. (2024). “Sim-to-real deep reinforcement learning based obstacle avoidance for uavs under measurement uncertainty,” in 2024 10th international conference on automation, robotics and applications (ICARA) (IEEE), 278–284.
Karaman, S., and Frazzoli, E. (2011). Sampling-based algorithms for optimal motion planning. Int. J. Robotics Res. 30, 846–894. doi:10.1177/0278364911406761
Khatib, O. (1986). Real-time obstacle avoidance for manipulators and mobile robots. International Journal Robotics Research 5, 90–98. doi:10.1177/027836498600500106
Kherudkar, R., Tiwari, S., Vedantham, U., Chouti, N., Prasad, B. P., Vanahalli, M. K., et al. (2024). “Implementation and comparison of path planning algorithms for autonomous navigation,” in 2024 IEEE conference on engineering informatics (ICEI) (IEEE), 1–9.
Kim, D. K., and Chen, T. (2015). Deep neural network for real-time autonomous indoor navigation. arXiv Preprint arXiv:1511.04668. doi:10.48550/arXiv.1511.04668
Kim, I., Nengroo, S. H., and Har, D. (2021). “Reinforcement learning for navigation of mobile robot with lidar,” in 2021 5th international conference on electronics, communication and aerospace technology (ICECA) (IEEE), 148–154.
Koenig, N., and Howard, A. (2004). “Design and use paradigms for gazebo, an open-source multi-robot simulator,” in 2004 IEEE/RSJ international conference on intelligent robots and systems (IROS) (IEEE), 3, 2149–2154. doi:10.1109/iros.2004.1389727
Lee, J.-W., Kim, K.-W., Shin, S.-H., and Kim, S.-W. (2022). “Vision-based collision avoidance for mobile robots through sim-to-real transfer,” in 2022 international conference on electronics, information, and communication (ICEIC) (IEEE), 1–4.
Li, Z., Shi, N., Zhao, L., and Zhang, M. (2024). Deep reinforcement learning path planning and task allocation for multi-robot collaboration. Alexandria Eng. J. 109, 408–423. doi:10.1016/j.aej.2024.08.102
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., et al. (2015). Continuous control with deep reinforcement learning.
Liu, H., Shen, Y., Yu, S., Gao, Z., and Wu, T. (2024). Deep reinforcement learning for mobile robot path planning. arXiv Preprint arXiv:2404.06974 4, 37–44. doi:10.53469/jtpes.2024.04(04).07
Long, P., Fan, T., Liao, X., Liu, W., Zhang, H., and Pan, J. (2018). “Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning,” in 2018 IEEE international conference on robotics and automation (ICRA) (IEEE), 6252–6259.
Michels, J., Saxena, A., and Ng, A. Y. (2005). “High speed obstacle avoidance using monocular vision and reinforcement learning,” in Proceedings of the 22nd international conference on Machine learning, 593–600.
Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A. J., Banino, A., et al. (2016). Learning to navigate in complex environments.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature 518, 529–533. doi:10.1038/nature14236
Murillo, J. (2023). Deep learning for autonomous vehicle real-time hazard detection and avoidance. J. AI-Assisted Sci. Discov. 3, 175–194.
Olayemi, K. B., Van, M., McLoone, S., McIlvanna, S., Sun, Y., Close, J., et al. (2023). The impact of lidar configuration on goal-based navigation within a deep reinforcement learning framework. Sensors 23, 9732. doi:10.3390/s23249732
Ouahouah, S., Bagaa, M., Prados-Garzon, J., and Taleb, T. (2021). Deep reinforcement learning based collision avoidance in uav environment. IEEE Internet Things J. 9, 4015–4030. doi:10.1109/jiot.2021.3118949
Philippsen, R., and Siegwart, R. (2003). “Smooth and efficient obstacle avoidance for a tour guide robot,” in 2003 IEEE international conference on robotics and automation (IEEE), 1, 446–451. doi:10.1109/robot.2003.1241635
Quraishi, A., Gudala, L., Keshta, I., Putha, S., Nimmagadda, V. S. P., and Thakkar, D. (2025). “Deep reinforcement learning-based multi-robotic agent motion planning,” in 2025 4th OPJU international technology conference (OTCON) on smart computing for innovation and advancement in industry 5.0 (IEEE), 1–6.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). “Prioritized experience replay,” in International conference on learning representations (ICLR).
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms.
Shi, H., Shi, L., Xu, M., and Hwang, K.-S. (2019). End-to-end navigation strategy with deep reinforcement learning for mobile robots. IEEE Trans. Industrial Inf. 16, 2393–2402. doi:10.1109/tii.2019.2936167
Singla, A., Padakandla, S., and Bhatnagar, S. (2019). Memory-based deep reinforcement learning for obstacle avoidance in uav with limited environment knowledge. IEEE Transactions Intelligent Transportation Systems 22, 107–118. doi:10.1109/tits.2019.2954952
Song, S., Zhang, Y., Qin, X., Saunders, K., and Liu, J. (2021). “Vision-guided collision avoidance through deep reinforcement learning,” in NAECON 2021-IEEE national aerospace and electronics conference (IEEE), 191–194.
Song, S., Saunders, K., Yue, Y., and Liu, J. (2022). “Smooth trajectory collision avoidance through deep reinforcement learning,” in 2022 21st IEEE international conference on machine learning and applications (ICMLA) (IEEE), 914–919.
Stentz, A. (1995). Optimal and efficient path planning for partially known environments. Intell. Unmanned Ground Veh., 203–220. doi:10.1007/978-1-4615-6325-9_11
Sun, J., Zhao, J., Hu, X., Gao, H., and Yu, J. (2023). Autonomous navigation system of indoor mobile robots using 2d lidar. Mathematics 11, 1455. doi:10.3390/math11061455
Tai, L., Li, S., and Liu, M. (2016). “A deep-network solution towards model-less obstacle avoidance,” in 2016 IEEE/RSJ international conference on intelligent robots and systems (IROS) (IEEE), 2759–2764.
Wang, J., and Huang, R. (2022). “A mapless navigation method based on deep reinforcement learning and path planning,” in 2022 IEEE international conference on robotics and biomimetics (ROBIO) (IEEE), 1781–1786.
Wang, Y., He, H., and Sun, C. (2018). Learning to navigate through complex dynamic environment with modular deep reinforcement learning. IEEE Trans. Games 10, 400–412. doi:10.1109/tg.2018.2849942
Wang, N., Zhang, D., and Wang, Y. (2020). “Learning to navigate for mobile robot with continual reinforcement learning,” in 2020 39th Chinese Control Conference (CCC) (IEEE), 3701–3706.
Wen, T., Wang, X., Zheng, Z., and Sun, Z. (2024). A drl-based path planning method for wheeled mobile robots in unknown environments. Comput. Electr. Eng. 118, 109425. doi:10.1016/j.compeleceng.2024.109425
Wu, J., Zhou, Y., Yang, H., Huang, Z., and Lv, C. (2023). Human-guided reinforcement learning with sim-to-real transfer for autonomous navigation. IEEE Trans. Pattern Analysis Mach. Intell. 45, 14745–14759. doi:10.1109/TPAMI.2023.3314762
Xie, L., Wang, S., Markham, A., and Trigoni, N. (2017). Towards monocular vision based obstacle avoidance through deep reinforcement learning. arXiv Preprint arXiv:1706.09829. doi:10.48550/arXiv.1706.09829
Xing, X., Ding, H., Liang, Z., Li, B., and Yang, Z. (2022). Robot path planner based on deep reinforcement learning and the seeker optimization algorithm. Mechatronics 88, 102918. doi:10.1016/j.mechatronics.2022.102918
Xue, Z., and Gonsalves, T. (2021). Vision based drone obstacle avoidance by deep reinforcement learning. AI 2, 366–380. doi:10.3390/ai2030023
Xue, X., Li, Z., Zhang, D., and Yan, Y. (2019). “A deep reinforcement learning method for mobile robot collision avoidance based on double dqn,” in 2019 IEEE 28th international symposium on industrial electronics (ISIE) (IEEE), 2131–2136.
Yan, C., Chen, G., Li, Y., Sun, F., and Wu, Y. (2023). Immune deep reinforcement learning-based path planning for mobile robot in unknown environment. Appl. Soft Comput. 145, 110601. doi:10.1016/j.asoc.2023.110601
Yang, C., Li, Y., Zheng, Y., He, F., and Yan, C. (2018). Asynchronous multithreading reinforcement-learning-based path planning and tracking for unmanned underwater vehicle. IEEE Trans. Syst. Man, Cybern. Syst. 48, 1055–1066. doi:10.1109/TSMC.2021.3050960
Yu, J., Su, Y., and Liao, Y. (2020). The path planning of mobile robot by neural networks and hierarchical reinforcement learning. Front. Neurorobot., 63. doi:10.3389/fnbot.2020.00063
Yu, W., Peng, J., Qiu, Q., Wang, H., Zhang, L., and Ji, J. (2024). “Pathrl: an end-to-end path generation method for collision avoidance via deep reinforcement learning,” in 2024 IEEE international conference on robotics and automation (ICRA) (IEEE), 9278–9284.
Zhang, T., Zhang, K., Lin, J., Louie, W.-Y. G., and Huang, H. (2021). Sim2real learning of obstacle avoidance for robotic manipulators in uncertain environments. IEEE Robotics Automation Lett. 7, 65–72. doi:10.1109/lra.2021.3116700
Keywords: Coulomb force, deep reinforcement learning, Gazebo, lidar, motion planning, TurtleBot3
Citation: Song S, Bihl T and Liu J (2026) Coulomb force-guided deep reinforcement learning for effective and explainable robotic motion planning. Front. Robot. AI 12:1697155. doi: 10.3389/frobt.2025.1697155
Received: 01 September 2025; Accepted: 15 December 2025;
Published: 30 January 2026.
Edited by:
Giovanni Iacca, University of Trento, Italy
Reviewed by:
Feitian Zhang, Peking University, China
Samantha Rajapaksha, Sri Lanka Institute of Information Technology, Sri Lanka
Copyright © 2026 Song, Bihl and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Jundong Liu, liuj1@ohio.edu