
ORIGINAL RESEARCH article

Front. Robot. AI, 30 January 2026

Sec. Robot Learning and Evolution

Volume 12 - 2025 | https://doi.org/10.3389/frobt.2025.1697155

Coulomb force-guided deep reinforcement learning for effective and explainable robotic motion planning

  • Learning and Intelligent Systems Lab (LiSL), School of Electrical Engineering and Computer Science, Ohio University, Athens, OH, United States

Training mobile robots through digital twins with deep reinforcement learning (DRL) has gained increasing attention to ensure efficient and safe navigation in complex environments. In this paper, we propose a novel physics-inspired DRL framework that achieves both effective and explainable motion planning. We represent the robot, destination, and obstacles as electrical charges and model their interactions using Coulomb forces. These forces are incorporated into the reward function, providing both attractive and repulsive signals to guide robot behavior. In addition, obstacle boundaries extracted from LiDAR segmentation are integrated as anticipatory rewards, allowing the robot to avoid collisions from a distance. The proposed model is first trained in Gazebo simulation environments and subsequently deployed on a real TurtleBot v3 robot. Extensive experiments in both simulation and real-world scenarios demonstrate the effectiveness of the proposed framework. Results show that our method significantly reduces collisions, maintains safe distances from obstacles, and generates safer trajectories toward the destinations.

1 Introduction

LiDAR-based mobile robot navigation marks a significant advancement in robotics, offering a wide range of advantages and applications. Unlike traditional global-map-based systems, LiDAR generates a real-time, detailed 3D map of the robot’s surroundings, enabling operators to make informed decisions with precise spatial data. This capability is crucial in dynamic and unpredictable environments, where adaptive, sensor-driven awareness outperforms reliance on fixed perspectives.

Motion planning and collision avoidance are critical components of high-performance robotic autonomy. Traditional motion planning approaches typically rely on geometric, sampling-based, or optimization-based methods to create feasible and efficient paths from a starting point to a goal while avoiding obstacles. Graph-based methods such as A* (Hart et al., 1968) and D* (Stentz, 1995), alongside sampling-based techniques like Rapidly-Exploring Random Trees* (RRT*) (Karaman and Frazzoli, 2011) and Probabilistic Roadmap (PRM) (Yang et al., 2018), remain among the most widely adopted solutions.

In recent years, machine learning (ML)-based solutions have gained popularity for enabling mobile robots to perceive their environments and make maneuvering decisions. Supervised learning methods perform perception and decision-making simultaneously, directly predicting control policies from sensor data such as images (Kim and Chen, 2015; Giusti et al., 2015; Tai et al., 2016; Dai et al., 2020; Back et al., 2020) and LiDAR scans (Chen et al., 2020; Murillo, 2023). In contrast, reinforcement learning (RL) (Michels et al., 2005) allows robots to learn optimal navigation strategies through trial and error. By interacting with the environment and receiving feedback, robots can gradually enhance their navigation performance. When combined with neural networks, deep reinforcement learning (DRL) has demonstrated superhuman performance in various games (Mnih et al., 2015; Xie et al., 2017; He et al., 2020). More recently, DRL-based solutions for collision avoidance and goal-reaching have also been proposed (Singla et al., 2019; Xue and Gonsalves, 2021; Song et al., 2022; Chang et al., 2021; Ouahouah et al., 2021; Olayemi et al., 2023). To reduce costs and improve effectiveness, training is often initially conducted in simulated environments.

LiDAR-based DRL methods have been investigated in recent studies, with particular attention to intrinsic motivation as a means to improve generalization. Zhelo et al. (2018) addressed RL limitations in scenarios such as long corridors and dead ends by incorporating an intrinsic curiosity module, which enhanced exploration and outperformed predefined reward functions in virtual 2D tasks (Mirowski et al., 2016; Long et al., 2018). Shi et al. (2019) applied a similar curiosity-driven approach within an A3C framework using sparse LiDAR input, enabling policies to transfer effectively from simulation to realistic mixed environments.

In parallel, researchers have explored novel architectural designs to address persistent challenges in motion planning. Wang et al. (2018) decomposed planning into obstacle avoidance and goal navigation, employing raw laser rangefinder data within a dual-stream Q-network to generate force-based actions. Kim et al. (2021) proposed a DQN-GRU-based navigation method that incorporated action skipping to improve performance in partially observable MDP-modeled environments, achieving superior results in simulation compared to standard DQN and non-skipping baselines. To address cross-task generalization, Wang et al. (2020) introduced elastic weight consolidation (EWC) into a DDPG framework, enabling policies to preserve prior knowledge and mitigate catastrophic forgetting without full retraining. Yan et al. (2023) also employed DDPG for mapless navigation and demonstrated improved performance in unknown environments compared to A*.

While DRL-based LiDAR navigation methods show considerable promise, their architectures are often constrained by reward designs that lack strong physical grounding. In particular, the reward structures in existing approaches frequently lack clear physical interpretation. For instance, several studies (Xue et al., 2019; Gao et al., 2020; Song et al., 2021) employ a fixed distance penalty, yet its actual impact remains insufficiently examined. Furthermore, many classical path-planning techniques, such as artificial potential fields, have not been effectively integrated into DRL frameworks. Finally, many approaches remain limited to simulation-based validation, and even those tested on real-world robots often provide little quantitative evidence to confirm their effectiveness in practical deployment.

In this paper, we propose a physics-inspired DRL-based motion planning algorithm that generates continuous commands without relying on a map. We utilize Coulomb force to model interactions between the robot, its destination, and surrounding obstacles. To enhance safety, we incorporate object segmentation, enabling the robot to anticipate and avoid collisions from a distance. A 2D LiDAR sensor provides the data necessary to support the robot’s behaviors, including collision avoidance and goal-reaching. We also develop a carefully designed method to enhance the generalization of our solution, validated across various environment settings.

The proposed designs are trained and tested in Gazebo simulation scenes (Robotics, 2014) with large geometric obstacles and deployed on real TurtleBot v3 (TB3) robots (Robotics, 2017). Experiments conducted in both simulated and real environments demonstrate the effectiveness of our overall design and individual components.

The contributions of our work can be summarized as follows:

• We model the robot, the destination, and obstacles as electrical charges and use Coulomb forces to model the interactions between these charges.

• These Coulomb forces are integrated into our DRL framework, providing encouraging and preventative rewards for the robots. The proposed Coulomb-based rewards are smooth and pervasive, offering consistent guidance throughout the entire training field. To the best of our knowledge, this is the first work to employ Coulomb forces in path planning and reinforcement learning.

• Obstacle boundaries extracted from LiDAR segmentation enable the robot to anticipate and avoid collisions from a distance.

• The proposed Coulomb- and vision-based rewards have clear, interpretable effects on robot behavior and performance, thereby providing strong overall explainability of our model.

• We carefully design environment-invariant components in our DRL system to improve the generalization of our solution.

2 Background

2.1 Gravitational and Coulomb forces

Gravity, as formulated by Isaac Newton, is a force of attraction that acts between all objects with mass. Mathematically, it is expressed as

\[ F_{\mathrm{gravity}} = G\,\frac{m_1 m_2}{r^2}, \]

where m1 and m2 are the masses of the two objects, r is the distance between their centers, and G is the universal gravitational constant. This relationship indicates that the gravitational force grows stronger with increasing mass and weaker with increasing distance, following an inverse-square law.

Coulomb’s force, in contrast, describes the electrostatic interaction between charged particles. As illustrated in Figure 1, Coulomb’s law quantifies the strength of the repulsive or attractive force between two point charges, such as a proton and an electron in an atom. The law states that the electric force exerted by one charge on another depends on the magnitudes of the charges and the square of the distance r between them. In its simplest form, Coulomb’s law for the magnitude of this force is expressed as:

\[ F_{12} = F_{21} = k_e \frac{|q_1 q_2|}{r^2}, \tag{1} \]

where q1 and q2 are electric charges, r is the distance between the charges, and ke is the Coulomb constant. This force can be attractive when q1 and q2 have opposite signs or repulsive when q1 and q2 have the same sign. Like gravity, it follows an inverse-square dependence on distance.
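As a minimal illustration (not part of the proposed framework), the following Python sketch evaluates Equation 1 for two point charges; the function name is ours, and the default k_e = 1 mirrors the simplification adopted later in the reward design.

```python
def coulomb_force_magnitude(q1: float, q2: float, r: float, k_e: float = 1.0) -> float:
    """Magnitude of the Coulomb force between two point charges (Equation 1)."""
    if r <= 0:
        raise ValueError("distance r must be positive")
    return k_e * abs(q1 * q2) / r ** 2

# Two unit charges 0.5 m apart with k_e = 1: the magnitude is 1 / 0.25 = 4.0.
print(coulomb_force_magnitude(1.0, 1.0, 0.5))
```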


Figure 1. Coulomb force between two particles q1 and q2 can be either attractive or repulsive depending on the signs of the charges.

The two forces share a striking mathematical similarity, both decreasing with the square of the distance between interacting entities. In fact, the forms of Newton’s law of gravitation and Coulomb’s law look very much alike, reflecting the inverse-square nature common to both. Nonetheless, they differ fundamentally in that gravity is always attractive, while electrostatic forces can attract or repel. Another key difference is their relative strength: on atomic and subatomic scales, the electrostatic force between charged particles is far stronger than their mutual gravitational attraction. Yet over astronomical distances, neutrality of charge means that gravity dominates, shaping the large-scale structure of the universe by pulling together planets, stars, and galaxies into stable orbits and clusters.

2.2 Classical and RL-Based motion planning methods

Robot motion planning methods can be broadly divided into classical and RL-based approaches. Classical planners follow a structured pipeline of global planning, local obstacle avoidance, and trajectory generation, often using graph search or sampling strategies such as A*, RRT*, or DWA (Fox et al., 2002). While reliable in static, well-mapped environments, they struggle in dynamic or uncertain settings due to rigid rule-based logic and the need for labor-intensive map construction. Early classical methods were limited to simple static models (Philippsen and Siegwart, 2003; Cosío and Castañeda, 2004) or treated dynamic objects as static at discrete time steps (Borenstein and Koren, 1990; Borenstein and Koren, 1991), restricting real-world applicability. More recent efforts have improved collision avoidance through algorithmic refinements, including A* and DWA for real-time navigation (Kherudkar et al., 2024), GBI-RRT combined with SLAM (Sun et al., 2023), and enhanced DWA variants for dynamic environments (Cao and Nor, 2024).

In contrast, RL-based motion planning can be broadly categorized into hybrid, end-to-end, and multi-robot approaches. Hybrid methods integrate RL with classical planners to combine reliability with adaptability, such as DRL-enhanced DWA for smoother trajectories (Wang and Huang, 2022), A2C with optimization outperforming Dijkstra + DWA (Xing et al., 2022), and DRL combined with A* to reduce computation (Liu et al., 2024). These methods, however, remain dependent on map quality and consistency. End-to-end methods learn directly from sensor data, with LiDAR-based DQN-GRU achieving superior performance over standard DQN (Kim et al., 2021), stochastic sampling with 2D LiDAR enabling faster training and improved collision avoidance (Beomsoo et al., 2021), and LSTM-TD3 models offering improved temporal decision-making (Wen et al., 2024).

While promising for handling unseen environments, RL-based solutions suffer from high training demands and limited real-world generalization. Multi-robot approaches often adopt centralized training and decentralized execution (CTDE), such as enhanced DQN for warehouse path planning, a DRL-MPC-GNN framework for task allocation and coordination (Li et al., 2024), and curriculum-learning with LiDAR costmaps yielding strong real-world results (Yu et al., 2024). Memory-augmented DQN variants have also improved multi-robot coordination (Quraishi et al., 2025). Despite scalability and coordination benefits, these methods face challenges in training complexity, dynamic interactions, and partial observability, limiting their practical deployment in robot systems.

2.3 Artificial potential field (APF) algorithm

The APF algorithm, first proposed by Khatib (1986), is a classical physics-inspired framework for real-time path planning and obstacle avoidance. In this approach, the robot is modeled as a particle moving under the influence of a synthetic potential function U(x) that combines attractive and repulsive components:

\[ U(x) = U_{\mathrm{att}}(x) + U_{\mathrm{rep}}(x), \qquad F(x) = -\nabla U(x), \]

where F(x) represents the virtual force acting on the robot.

The attractive potential pulls the robot toward its goal and is commonly modeled as a quadratic function of the Euclidean distance:

\[ U_{\mathrm{att}}(x) = \frac{1}{2} k_{\mathrm{att}} \left\lVert x - x_g \right\rVert^2, \qquad F_{\mathrm{att}}(x) = -k_{\mathrm{att}} \left( x - x_g \right), \]

where katt>0 controls the rate of attraction, x is the robot’s current location and xg denotes the goal position.

The repulsive potential is activated only within a finite influence range d0 to maintain a safety margin from obstacles:

\[ U_{\mathrm{rep}}(x) = \begin{cases} \dfrac{1}{2} k_{\mathrm{rep}} \left( \dfrac{1}{d(x)} - \dfrac{1}{d_0} \right)^2, & d(x) \le d_0, \\ 0, & d(x) > d_0, \end{cases} \qquad F_{\mathrm{rep}}(x) = -\nabla U_{\mathrm{rep}}(x), \]

where krep>0 is a scaling factor, d(x) is the distance from robot to obstacle, and d0 defines the repulsion boundary.

The resultant virtual force F(x)=Fatt(x)+Frep(x) guides the robot toward the goal while avoiding obstacles. Despite its simplicity and intuitive physical interpretation, APF suffers from two major limitations: (1) the handcrafted potential shapes and distance thresholds make the field non-convex and highly sensitive to parameter tuning, and (2) the method is prone to local minima in concave or cluttered environments where attractive and repulsive forces may cancel, preventing progress toward the goal.
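To ground the formulation, a minimal Python sketch of the resultant APF force is shown below, assuming 2D positions; the gains k_att, k_rep and the influence range d_0 are illustrative placeholder values, not tuned parameters.

```python
import numpy as np

def apf_force(x, x_goal, obstacles, k_att=1.0, k_rep=0.5, d0=1.0):
    """Resultant APF virtual force F(x) = F_att(x) + F_rep(x) for a 2D robot position."""
    x, x_goal = np.asarray(x, float), np.asarray(x_goal, float)
    # Attractive component: -k_att * (x - x_goal), pulling toward the goal.
    f = -k_att * (x - x_goal)
    for obs in obstacles:
        diff = x - np.asarray(obs, float)
        d = np.linalg.norm(diff)
        if 0.0 < d <= d0:
            # Negative gradient of U_rep, pushing the robot away from the obstacle.
            f += k_rep * (1.0 / d - 1.0 / d0) * (1.0 / d ** 2) * (diff / d)
        # No repulsion beyond the influence range d0.
    return f

# Example: goal straight ahead, one obstacle slightly off the line of travel;
# the attractive pull is nudged away from the obstacle.
print(apf_force([0.0, 0.0], [3.0, 0.0], obstacles=[[0.6, 0.1]]))
```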

The DRL framework presented in this work is motivated by the principles of the APF algorithm. To overcome its inherent limitations, we design RL reward functions that yield a physically grounded and globally smooth motion-guidance field. This field is realized using Coulomb-force–based rewards, with the full formulation presented in Section 3.

2.4 Generalization and sim-to-real transfer in DRL

Model generalization is a critical issue in machine learning, and it is especially important for DRL-based navigation and control. In robotics, the ability of a policy to adapt to new or changing environments is vital, as operating conditions are often unpredictable and diverse. A well-generalized policy can handle unseen scenarios, task variations, and sensor differences, whereas a poorly generalized one risks catastrophic failure outside its training distribution. Despite its importance, many DRL studies still evaluate methods only on the same environments they were trained on, such as Atari (Bellemare et al., 2013), Gazebo (Koenig and Howard, 2004), or OpenAI Gym (Brockman et al., 2016), providing limited insight into generalization.

Recent efforts have begun to address this issue. Yu et al. (2020) studied the generalization of multiple DRL algorithms by training across diverse environments, while Doukhi and Lee (2021) mapped sensor data, robot states, and goals to continuous velocity commands, though their work was restricted to unseen targets rather than unseen scenes. Increasingly, DRL-based robot obstacle avoidance research has emphasized sim-to-real transfer (Lee et al., 2022; Wu et al., 2023; Joshi et al., 2024). For instance, Anderson et al. (2021) introduced a subgoal model that aligns simulation-trained discrete actions with real-world continuous control, using domain randomization to reduce visual discrepancies. Similarly, Zhang et al. (2021) applied object detection to generate real-time 3D bounding boxes, mitigating the effect of varying obstacle shapes and appearances on robot navigation and improving robustness to sim-to-real differences.

3 Methods

In this work, a mobile robot begins at a specified location and autonomously navigates toward a target destination. Static and dynamic obstacles are placed along the straight line connecting the start and end points. The primary objective is to enable the robot to reach the destination while effectively avoiding collisions. This capability is achieved through a physics-inspired DRL framework with two key considerations: (1) obstacle avoidance and (2) generalization to previously unseen environments.

The proposed models are first trained in the Gazebo simulation environment and then deployed on a TB3, which functions as a simple two-wheeled differential-drive robot. The TB3 is equipped with multiple sensors, including a 360° LiDAR and an Inertial Measurement Unit (IMU). It also supports the Robot Operating System (ROS 2), which enables seamless communication and control between the DRL algorithms and the hardware platform.

Figure 2 illustrates the experimental setup. The green dashed line represents the straight path between the robot’s initial position and the destination, while the red dashed line shows the actual trajectory, deviating from the path to avoid obstacles. At each time step, the LiDAR produces a distance map of sampled points from the surrounding environment, represented by yellow dashed lines on the obstacles. The robot’s motion is controlled by two velocity components: (1) linear velocity, which determines the speed of forward movement, and (2) angular velocity, which controls the rotation rate of the two-wheeled base.


Figure 2. An illustration of the overall motion planning setup. Refer to text for details.

3.1 Overall design and key innovations

The primary innovation of this work lies in: (1) modeling the robot, destination, and obstacles as charged particles, and (2) utilizing Coulomb forces to represent their interactions while formulating DRL rewards based on these interactions. Specifically, the robot and its destination are modeled as electric charges with opposite signs, generating an attractive force that guides the robot toward the goal. Obstacles are represented as an array of charges with the same sign as the robot, producing a combined repulsive force that steers the robot away from collisions. In both cases, the force magnitudes scale inversely with the square of the distance between interacting entities.

Our design introduces two breakthrough innovations that enable highly effective agent learning. The first is the use of gradually varying and ubiquitous forces, which provide consistent guidance across the entire training field. The second is the inverse-square distance formulation, which is particularly effective for collision avoidance, as the repulsive force increases sharply when the robot approaches an obstacle. By incorporating the robot’s direction of movement, these forces are translated into reward signals for the DRL agent, either encouraging or redirecting its trajectory.

Our innovative designs have clear physical interpretations, making the behavior and performance of our model highly explainable. Furthermore, we propose an object-avoidance reward based on LiDAR scan segmentation, which enables the robot to avoid large obstacles from a distance, significantly enhancing the overall performance of the models.

The overall workflow of the proposed framework is illustrated in Figure 3, which outlines the interaction among the environment, a Twin Delayed Deep Deterministic Policy Gradient (TD3) control agent (Fujimoto et al., 2018), and Prioritized Experience Replay (PER) (Schaul et al., 2016). The environment interacts with the agent through continuous sensing and motion feedback. The proposed rewards are generated directly from these interactions and serve as the primary learning signals for the agent. A standard TD3 algorithm is employed as the training backbone to validate the effectiveness and generality of the proposed reward formulation, while PER is applied to improve sample efficiency during training.
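The sketch below outlines this interaction loop at a high level. It is an abstract simplification rather than the actual implementation: env, agent, and per_buffer stand for a Gazebo/ROS 2 environment wrapper, a TD3 actor-critic agent, and a prioritized replay buffer, and only their assumed interfaces are shown in the comments.

```python
import random

def train(env, agent, per_buffer, epochs=7000, batch_size=64, expl_noise=0.1):
    """High-level sketch of the loop in Figure 3 (assumed interfaces, not the actual code)."""
    for _ in range(epochs):
        state, done = env.reset(), False
        while not done:
            # The TD3 actor proposes (linear, angular) velocities; Gaussian noise aids exploration.
            action = agent.select_action(state)
            action = [a + random.gauss(0.0, expl_noise) for a in action]

            # The environment returns the next state and the physics-based reward
            # (baseline + Coulomb + vision terms) computed from LiDAR and goal data.
            next_state, reward, done = env.step(action)

            # Store the transition; PER resamples it according to TD-error priority.
            per_buffer.add((state, action, reward, next_state, done))
            agent.train(per_buffer.sample(batch_size))  # twin critics, delayed actor update
            state = next_state
```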


Figure 3. Overview of the proposed Coulomb-guided TD3 framework with LiDAR-based perception.

This framework emphasizes the role of the proposed reward, which provides dense and physically interpretable feedback directly from environment interactions. The integration of Coulomb-force and LiDAR-vision rewards significantly improves learning stability and convergence speed compared with conventional sparse rewards. Experimental analysis further demonstrates reduced collision frequency and enhanced path planning capability, confirming that the proposed reward design improves both the efficiency and robustness of policy learning. These observed improvements in convergence speed, stability, and trajectory efficiency directly reflect the effectiveness of the proposed reward formulation, which operationalizes the main research contributions within the defined DRL framework.

3.2 DRL algorithm selection

The TD3 algorithm was chosen as the primary DRL framework due to its inherent stability, high sample efficiency, and robustness in continuous control tasks. TD3 extends the Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) method by introducing twin Q-networks to mitigate overestimation bias and delayed policy updates to prevent divergence, resulting in smoother convergence. Such properties are particularly important for velocity-based robot navigation, where unstable value estimation can lead to oscillatory motion or unsafe control.

Although Proximal Policy Optimization (PPO) (Schulman et al., 2017) has been successfully applied to various robotic control and path planning problems, its on-policy nature requires frequent policy rollouts and gradient updates, leading to lower sample efficiency and higher computational cost in long-horizon navigation tasks. In contrast, TD3’s off-policy structure enables efficient experience reuse from replay buffers, accelerating convergence while maintaining stable policy updates.

Meanwhile, Soft Actor–Critic (SAC) (Haarnoja et al., 2018) enhances exploration through entropy regularization, which is advantageous in sparse-reward or high-uncertainty environments. However, in our Coulomb-guided setup, where the reward function already provides strong directional gradients and the control space demands consistent velocity regulation, the additional stochasticity of SAC’s policy can introduce unnecessary action variance and make policy updates less stable.

As a result, a deterministic actor-critic algorithm such as TD3 offers a more direct and stable optimization path for continuous navigation tasks with dense, physics-guided rewards, aligning well with the objectives of this work.

3.3 States and actions

States, actions, and rewards are the three fundamental components of most DRL algorithms. In this work, the state at time t, denoted st, is defined as:

\[ s_t = \left[ \mathrm{dist}_{\mathrm{obst}}(t),\ \mathrm{unit\_dist}_{\mathrm{goal}}(t),\ \mathrm{angle}_{\mathrm{goal}}(t),\ \mathrm{lin\_speed}(t),\ \mathrm{ang\_speed}(t) \right], \]

where:

• LiDAR-based distance map (distobst(t)): obtained from 40 LiDAR samples (one sample every 9°). The minimum value in the scan is extracted at each step, reflecting the likelihood of collision.

• Unit distance to the goal (unit_distgoal(t)): defined as the ratio of the current distance to the goal over the maximum (initial) distance, serving as a normalized proximity measure.

• Goal angle (anglegoal(t)): the angular difference between the robot’s current heading and the destination direction.

• Robot velocities: linear velocity lin_speed(t) and angular velocity ang_speed(t), where angular velocity is expressed in degrees per second clockwise.

The design of these state components is intended to enhance generalization. The LiDAR-derived shortest distance captures proximity to obstacles in an environment-independent manner. The goal angle anglegoal(t) generates a scale- and environment-independent rotational factor, while the unit distance unit_distgoal(t) both drives forward motion and quantifies progress toward the destination.

At each step, the action at is represented as:

\[ a_t = \left[ \mathrm{lin\_speed}(t+1),\ \mathrm{ang\_speed}(t+1) \right], \]

which updates the robot’s linear and angular velocities from their values at time t. The design of the reward functions, derived from these states and actions, will be detailed in the next subsection.
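The following sketch shows one way to assemble the state vector s_t from raw sensing; it is a simplified illustration of the description above, and the helper and argument names are ours rather than taken from the implementation.

```python
import math
import numpy as np

def build_state(lidar_ranges, robot_xy, robot_yaw, goal_xy, initial_goal_dist,
                lin_speed, ang_speed):
    """Assemble s_t from 40 LiDAR ranges (one every 9 degrees), goal geometry, and velocities."""
    lidar = np.asarray(lidar_ranges, dtype=float)

    goal_dist = math.hypot(goal_xy[0] - robot_xy[0], goal_xy[1] - robot_xy[1])
    unit_dist_goal = goal_dist / initial_goal_dist  # normalized proximity to the goal

    # Signed angular difference between the heading and the goal direction, in [-pi, pi].
    goal_bearing = math.atan2(goal_xy[1] - robot_xy[1], goal_xy[0] - robot_xy[0])
    angle_goal = (goal_bearing - robot_yaw + math.pi) % (2 * math.pi) - math.pi

    return np.concatenate([lidar, [unit_dist_goal, angle_goal, lin_speed, ang_speed]])

# Example with dummy values; the action a_t is simply the next (lin_speed, ang_speed) pair.
s = build_state([3.5] * 40, robot_xy=(0.0, 0.0), robot_yaw=0.0, goal_xy=(3.0, 0.0),
                initial_goal_dist=3.0, lin_speed=0.1, ang_speed=0.0)
print(s.shape)  # (44,)
```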

3.4 Reward design

In this work, the overall reward rt at time t is designed to incorporate multiple components, each corresponding to a desired system behavior. The design goals include: (1) encouraging the robot to reach the destination, (2) avoiding collisions, and (3) enhancing model generalization. The overall reward is expressed in Equation 2,

\[ r_t = \boxed{R_{\mathrm{towards}} + R_{\mathrm{stable}} + R_{\mathrm{succ}} + R_{\mathrm{col}}} + R_{\mathrm{obst}} + R_{\mathrm{Coulomb}} + R_{\mathrm{vision}}, \tag{2} \]

where:

Rtowards: incentivizes the robot to move closer to the target,

Rstable: rewards smooth and stable motion by minimizing excessive rotation and acceleration,

Rsucc: provides a significant positive reward when the robot successfully reaches its destination,

Rcol: imposes a penalty (negative reward) if the robot collides with obstacles,

Robst: an interval-based function that penalizes the robot when it gets too close to obstacles,

RCoulomb: combines attractive and repulsive terms derived from Coulomb forces, guiding the robot toward the goal while repelling it from obstacles,

Rvision: allows the robot to look ahead from a distance and encourages it to move toward regions free of collisions.

The four (4) reward terms in the box of Equation 2 represent the baseline rewards, indicating that these terms are included in the reward function of every model.

Among the baseline rewards, Rtowards can be formulated in different ways. In this study, we define it as the sum of two components: (1) a reward for reducing anglegoal(t), which encourages the robot to align its heading with the target, and (2) a reward for reducing the normalized distance to the destination unit_distgoal(t). These two components, shown in Equation 3, work together to effectively incentivize progress toward the goal:

\[ R_{\mathrm{towards}} = -c_1 \left| \mathrm{angle}_{\mathrm{goal}} \right| + \frac{2 \times \mathrm{dist}_{\mathrm{goal\text{-}ini}}}{\mathrm{dist}_{\mathrm{goal\text{-}ini}} + \mathrm{dist}_{\mathrm{goal}}}, \tag{3} \]

where c1 is a positive scaling coefficient, distgoal−ini is the initial distance to the goal, and distgoal is the current distance.

Rstable is designed to ensure that the robot moves smoothly and steadily, avoiding sudden turns or unstable oscillations. To achieve this, the stable reward consists of two components: (1) a penalty on the rotational angular velocity ang_speed(t), particularly when it is excessively high, and (2) a penalty on the linear velocity lin_speed(t) when the robot moves too slowly, i.e., when its velocity is significantly lower than the maximum robot speed, Max_Speed. In this work, the TB3 used in real-world experiments has a maximum speed of 0.22 m per second. The stable reward is defined as:

\[ R_{\mathrm{stable}} = -c_{21}\,\mathrm{ang\_speed}(t)^2 - c_{22} \left( \mathrm{Max\_Speed} - \mathrm{lin\_speed}(t) \right)^2, \]

where c21 and c22 are positive constants.

Robst is designed to repel the robot when it gets too close to obstacles. This is implemented by applying a constant penalty whenever the minimum distance between the robot and the nearest obstacle falls below a predefined threshold. In our setup, the threshold is set to 0.22 m, corresponding to the distance the robot would travel in one second at its maximum speed, which could otherwise result in a collision. Formally, Robst is defined as:

\[ R_{\mathrm{obst}} = \begin{cases} -20, & \text{if } \min\left( \mathrm{dist}_{\mathrm{obst}}(t) \right) \le \mathrm{threshold}, \\ 0, & \text{otherwise}, \end{cases} \]
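A compact sketch of these three shaping terms, following the formulas as written above, is given below; the coefficient values c1, c21, and c22 are placeholders for illustration rather than the tuned values used in training.

```python
def baseline_rewards(angle_goal, dist_goal, dist_goal_ini,
                     lin_speed, ang_speed, min_obst_dist,
                     c1=1.0, c21=0.1, c22=0.1,
                     max_speed=0.22, obst_threshold=0.22):
    """Sketch of R_towards, R_stable, and R_obst (illustrative coefficients)."""
    # R_towards: align the heading with the goal and shrink the normalized goal distance.
    r_towards = -c1 * abs(angle_goal) + 2.0 * dist_goal_ini / (dist_goal_ini + dist_goal)

    # R_stable: penalize fast rotation and speeds far below the maximum (0.22 m/s on the TB3).
    r_stable = -c21 * ang_speed ** 2 - c22 * (max_speed - lin_speed) ** 2

    # R_obst: constant penalty when the nearest LiDAR return is within one second of travel.
    r_obst = -20.0 if min_obst_dist <= obst_threshold else 0.0

    return r_towards, r_stable, r_obst
```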

RCoulomb and Rvision are reward components designed to guide the robot’s movement. Specifically, RCoulomb drives the robot toward the goal while simultaneously repelling it from nearby obstacles, and Rvision encourages the robot to avoid obstacles from a greater distance. Both rewards will be explained in detail in the following subsections.

3.4.1 Coulomb force rewards RCoulomb

As previously introduced, we model the robot, destination, and obstacles as charged particles and utilize Coulomb’s law to represent their interactions. These interactions form the basis of the Coulomb-based reward, RCoulomb, which consists of two components: an attractive force reward from the goal that pulls the robot toward the destination, and a repulsive force reward from obstacles that pushes the robot away to avoid collisions. The overall formulation is given by:

\[ R_{\mathrm{Coulomb}} = R_{\mathrm{attr}} + R_{\mathrm{repu}}. \]

Attraction Reward Rattr is designed to attract the robot to the destination as quickly as possible. In this model, the robot is represented as a positive charge and the destination as a negative charge. By Coulomb’s law, oppositely charged particles attract each other, and the attractive force between the robot and the destination is calculated accordingly. The closer the robot is to the destination, the stronger this attraction becomes.

Figure 4a illustrates the attractive force exerted by the destination on the robot. To map this attraction into a reward for the DRL agent, we compute the inner product between the attractive force vector Fattr and the robot’s direction of motion (the unit velocity vector v̂). This inner product serves as the attraction reward, encouraging the robot to align its heading with the direction toward the destination as closely as possible. Inspired by Coulomb’s law, Rattr is formulated as:

\[ R_{\mathrm{attr}} = \vec{F}_{\mathrm{attr}} \cdot \hat{v} = \left\lVert \vec{F}_{\mathrm{attr}} \right\rVert \cos\left( \mathrm{angle}_{\mathrm{goal}} \right) = c_3\,\frac{1}{\mathrm{dist}_{\mathrm{goal}}^2} \cos\left( \mathrm{angle}_{\mathrm{goal}} \right), \]

where c3 is a positive constant, anglegoal is the angle between the force Fattr and robot’s motion direction v̂, and distgoal is the distance between the robot and the destination. This relationship is illustrated in Figure 4a.


Figure 4. An illustration of the Coulomb force rewards RCoulomb setup. (a) Attractive force from the destination; (b) repulsive forces from obstacle points. Refer to text for details.

Rrepu is designed to keep the robot away from obstacles. As the robot approaches an obstacle, a repulsive force is generated to push it away, thereby preventing collisions. We model the robot as a positively charged particle with a charge of q1=1 (see Figure 4b). Each LiDAR scan point is likewise modeled as a positively charged particle with a charge of q2=1. According to Coulomb’s law, these like-charged particles exert repulsive forces on one another, as expressed in Equation 1. By summing the individual repulsive force vectors, we obtain a resultant force that prevents the robot from moving closer to obstacles, as illustrated in Figure 4b.

This resultant force is then used to compute the Coulomb-based repulsive reward applied to the robot. For simplicity, Coulomb’s constant is set to 1. The final reward function for the repulsive force is defined as:

\[ R_{\mathrm{repu}} = c_4 \sum_{i=1}^{N} \frac{1}{d_i^2} \cos\theta_i, \]

where c4 is a positive constant; di is the distance between the robot and the i-th obstacle point (from LiDAR), and θi is the angle between the i-th repulsive force vector and the robot’s direction of motion. In this study, N is set to 40.
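The sketch below combines the attraction and repulsion terms into R_Coulomb with the Coulomb constant set to 1, as in the text; the constants c3 and c4 are illustrative placeholders.

```python
import math

def coulomb_reward(dist_goal, angle_goal, lidar_dists, lidar_angles, c3=1.0, c4=0.01):
    """Sketch of R_Coulomb = R_attr + R_repu (illustrative constants, k_e = 1).

    lidar_dists[i]  : distance d_i from the robot to the i-th LiDAR scan point (N = 40).
    lidar_angles[i] : angle theta_i between the i-th repulsive force (pointing from the
                      scan point toward the robot) and the robot's direction of motion.
    """
    # Attraction: inverse-square pull toward the goal, projected on the motion direction.
    r_attr = c3 * math.cos(angle_goal) / dist_goal ** 2

    # Repulsion: sum of inverse-square pushes from all scan points, projected likewise.
    # Moving toward an obstacle makes cos(theta_i) negative, so the term acts as a penalty.
    r_repu = c4 * sum(math.cos(th) / d ** 2
                      for d, th in zip(lidar_dists, lidar_angles) if d > 0)

    return r_attr + r_repu
```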

Our Coulomb-force–based reward was originally inspired by the classical APF formulation; both use a physics-inspired combination of attractive and repulsive influences to guide robot motion. In this sense, our formulation conceptually extends the idea of shaping a force field that directs the robot toward the goal while avoiding obstacles. Our method, however, has several key differences from and advancements over APF:

• Algorithmic vs. Learning-Based Mechanism. The classical APF method is algorithmic and deterministic, meaning its behavior remains fixed and does not refine or improve with repeated experience. In contrast, our Coulomb-force rewards are integrated into a learning-based DRL framework. The agent receives reward signals from its local neighborhood and continuously improves its policy through interaction and extensive training. This training process allows the agent to discover effective behavioral patterns that reflect the underlying physical field, leading to more sophisticated navigation and improved escape from local minima where traditional APF typically becomes trapped.

• Smooth, Globally Consistent Reward Field. The reward formulation we propose produces a smooth, ubiquitous, and gradually varying guidance field. This contrasts with classical APF, where handcrafted potential shapes and discontinuous distance thresholds often create non-convex fields prone to sharp gradients and local minima. The smoothness and physical consistency of the Coulomb-based reward help the learned policy achieve more stable motion guidance and reduce susceptibility to local minima.

• Exploration and Stochasticity in DRL. Our learning model, as a DRL approach, possesses an exploration capability which acts analogously to a simulated annealing process. This property helps the agent occasionally deviate from local optima and, consequently, discover more globally efficient paths.

3.4.2 Vision rewards Rvision

Rvision encourages the robot to move tangentially along obstacle edges within collision-free zones, enabling early avoidance and improving both the success rate and efficiency of obstacle evasion. To identify the boundary of the object ahead, we first perform LiDAR scan segmentation on all scan points. By classifying these points, we can distinguish individual objects, as illustrated in Figure 5, which shows the segmentation result using 40-sample LiDAR scans. From the classified points, we select the object directly facing the robot and determine its leftmost and rightmost LiDAR points. The corresponding angles with respect to the robot’s heading direction are denoted as θ1 and θ2. The smaller of the two angles is chosen to minimize the effort required for obstacle avoidance, and a fixed offset is applied to this angle to maintain a safety buffer.


Figure 5. Example of LiDAR segmentation. (a) shows the original LiDAR scan with 40 samples; (b) displays the corresponding segmentation result after applying DIET, where different colors represent distinct objects.

Since obstacle avoidance is less urgent when obstacles are farther away, we normalize the reward using a Gaussian function of the potential collision time tc. In this formulation, imminent collisions generate stronger reward signals, whereas distant collisions contribute weaker signals. This design ensures that the reward varies smoothly with decreasing collision time, resulting in smoother trajectories and more efficient obstacle avoidance.

Formally, the vision reward is defined as:

\[ R_{\mathrm{vision}} = c_5 \cos\left( \min\left( \theta_1, \theta_2 \right) + \mathrm{buffer} \right) \exp\left( -\frac{t_c^2}{2\sigma^2} \right), \]

where c5 is a positive constant, σ is set to 1, θ1 and θ2 denote the angles between the robot’s orientation and the lines connecting the robot to the leftmost and rightmost boundaries of the object ahead (with a fixed safety buffer, as shown in Figure 6), and tc is the estimated collision time, calculated from the obstacle distance and the robot’s speed.
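A minimal sketch of this computation is shown below; the constant c5 and the buffer angle are illustrative placeholders, while sigma = 1 follows the text.

```python
import math

def vision_reward(theta1, theta2, collision_time, c5=1.0, buffer_angle=0.2, sigma=1.0):
    """Sketch of R_vision (illustrative c5 and buffer angle; sigma = 1 as in the text).

    theta1, theta2 : angles (rad) from the heading to the leftmost/rightmost edges of the
                     object ahead, obtained from LiDAR segmentation.
    collision_time : estimated time to collision t_c = obstacle distance / robot speed.
    """
    steer_angle = min(abs(theta1), abs(theta2)) + buffer_angle
    urgency = math.exp(-collision_time ** 2 / (2.0 * sigma ** 2))  # Gaussian weighting of t_c
    return c5 * math.cos(steer_angle) * urgency
```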


Figure 6. An illustration of the LiDAR vision reward setup. Refer to text for details.

In more detail, we employ the DIET (Dietmayer, 2001) algorithm for LiDAR segmentation. The procedure, illustrated in Figure 7, operates by examining the distances between adjacent LiDAR scan points to identify potential object boundaries. This allows neighboring points that belong to the same physical surface to be grouped together, thereby enabling effective segmentation of obstacles. The DIET function is defined as:

\[ \mathrm{DIET} = C_0 + C_1 \min\left( r_i, r_{i+1} \right), \]

where C0 is a positive constant used to reduce noise; C1 is a positive constant dependent on the angle α between two LiDAR beams; ri and ri+1 denote the distances from two adjacent scan points to the robot’s position (0, 0).
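The sketch below applies the DIET breakpoint test to an ordered list of LiDAR ranges. The values of C0 and the angle-dependent gain C1 are illustrative assumptions (here C1 is tied to the chord-length factor 2 sin(α/2) for adjacent beams), not the tuned values used in our system.

```python
import math

def diet_segmentation(ranges, angle_increment=math.radians(9.0), c0=0.1, c1_gain=None):
    """Group adjacent LiDAR returns into objects via the DIET breakpoint test (a sketch)."""
    if c1_gain is None:
        # One plausible choice for the alpha-dependent gain; the tuned value may differ.
        c1_gain = 2.0 * math.sin(angle_increment / 2.0)

    labels, current = [0], 0
    for i in range(1, len(ranges)):
        r_prev, r_curr = ranges[i - 1], ranges[i]
        # Euclidean distance between consecutive scan points (law of cosines).
        d = math.sqrt(r_prev ** 2 + r_curr ** 2
                      - 2.0 * r_prev * r_curr * math.cos(angle_increment))
        threshold = c0 + c1_gain * min(r_prev, r_curr)
        if d > threshold:          # breakpoint: the two returns belong to different objects
            current += 1
        labels.append(current)
    return labels

# Example: a jump in range between the 3rd and 4th beams starts a new segment.
print(diet_segmentation([1.0, 1.02, 1.01, 3.0, 3.05]))  # [0, 0, 0, 1, 1]
```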


Figure 7. Illustration of the DIET algorithm applied to LiDAR scan segmentation.

4 Environment and model setups

We design training and testing environments in Gazebo to simulate the TB3 Burger’s motion planning and collision avoidance under realistic conditions. After training and validating the models in simulation, we deploy them on the TB3 to evaluate their performance in real-world scenarios. Training is conducted on an NVIDIA RTX A6000 GPU. In this framework, simulation serves as the primary stage for model development and verification, while real-world experiments provide the final assessment of robustness and reliability.

4.1 Simulation environments

The motion planning policies are trained in Gazebo using a digital twin of the TB3 robot provided by the manufacturer. The simulation environments, illustrated in Figure 8, include both training and testing setups. Figures 8a,b (referred to as Scene 0 and Scene 1) depict the same environment, with Figure 8b containing additional moving obstacles. The DRL model is trained in Scene 1 (Figure 8b), since it represents a more complex environment. Figures 8c,d (Scene 2 and Scene 3) present an unseen environment used exclusively for testing, designed to evaluate the model’s generalization and robustness. In Figures 8a,c, only static obstacles (walls) are present, while Figures 8b,d include dynamic obstacles represented by gray cylinders.


Figure 8. Simulation environments for model training and testing. (a) Test scene 0 without moving obstacles; (b) training and test scene 1 with moving obstacles; (c) test scene 2 without moving obstacles; and (d) test scene 3 with moving obstacles. Gray cylinders in (b,d) denote dynamic (moving) obstacles.

In all environments, the robot starts at the center of the scene (0,0). At the beginning of each epoch, a goal is randomly selected from a predefined set of locations. If the robot reaches the goal without collision, it continues toward a new goal from its current position. In the event of a collision, the environment is reset by relocating the robot to the center and assigning a new random goal.

4.2 Real environments

In the real-world experiments, we used the TB3 to evaluate our DRL-based models. The hardware setup is shown in Figure 9a. From top to bottom, the robot is equipped with a 360° LiDAR for scan acquisition, a Raspberry Pi 4 for wireless communication over WiFi, an OpenCR board that controls the robot and exchanges data with the Raspberry Pi via USB, and a LiPo battery at the base that powers the entire system. The DRL models are executed on a remote PC, with action values transmitted to the Raspberry Pi 4 in real time through ROS 2.
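A minimal rclpy sketch of this command path is shown below: the remote PC wraps each action as a geometry_msgs/Twist message and publishes it on the TB3's cmd_vel topic. The node and method names are ours, and in practice the clockwise degrees-per-second angular speed reported earlier would be converted to the counter-clockwise radians-per-second convention expected by ROS.

```python
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Twist

class VelocityCommander(Node):
    """Minimal ROS 2 node that forwards the agent's (linear, angular) action to the TB3."""

    def __init__(self):
        super().__init__('drl_velocity_commander')
        # The TB3 base subscribes to geometry_msgs/Twist on the standard cmd_vel topic.
        self.publisher = self.create_publisher(Twist, 'cmd_vel', 10)

    def send_action(self, lin_speed: float, ang_speed: float):
        msg = Twist()
        msg.linear.x = float(lin_speed)    # forward speed, capped at 0.22 m/s on the TB3
        msg.angular.z = float(ang_speed)   # rotation rate (rad/s, CCW positive in ROS)
        self.publisher.publish(msg)

def main():
    rclpy.init()
    node = VelocityCommander()
    node.send_action(0.1, 0.0)   # e.g., drive straight ahead at 0.1 m/s
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()
```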


Figure 9. Deployment of a TB3 in real-world environments. (a) Our robot; (b) Real Test Scene 1; (c) Real Test Scene 2 with an extended obstacle.

Sim-to-real transfer remains a major challenge for RL-based algorithms, as models that perform well in simulation may fail in physical environments. Therefore, real-world testing is essential. Figure 9b shows the first real-world test environment (referred to as Real Test Scene 1), which contains two obstacles to evaluate each model’s collision avoidance and goal-reaching capabilities. Figure 9c depicts a second test environment (Real Test Scene 2), where one obstacle is extended to further test the robot’s ability to find alternative paths. The third test environment, not shown in the figure, builds on the first by introducing dynamic obstacles, allowing assessment of the robot’s performance under moving hazards. Success is defined as reaching within a radius of r=0.3m from the goal at coordinates (3,0). In all real-world tests, the robot starts at (0,0), as indicated in Figures 9b,c, and attempts to reach the goal near the distant yellow area at (3,0).

4.3 Model setups

To evaluate the impact of our proposed reward terms on collision avoidance, goal reaching, and sim-to-real generalization performance, we design five (5) models for comparative experiments. The first model, Wobst, serves as the baseline and is implemented using the TD3 algorithm with baseline reward terms and Robst. This model is a basic version that can be trained to convergence in the simulation environment.

The second model, WC, is designed to test the effect of the RCoulomb term, with a reward function comprising the baseline and RCoulomb. Ideally, due to the repulsive and attractive forces, WC should achieve a higher obstacle-avoidance success rate and reach the goal more efficiently compared to Wobst.

The third model, WC+v, is designed to test whether the combination of RCoulomb and Rvision further improves the success rate of obstacle avoidance. From a design perspective, we expect this reward combination to yield more robust maneuvering ability, with a higher success rate than the previous two models.

The fourth model, WC+obst, incorporates the baseline reward terms along with Robst and RCoulomb. We expect this model to perform slightly better than WC, but not as well as WC+v. The fifth model, WC+v+obst, includes all the reward terms in Equation 2. Since our proposed reward terms, RCoulomb and Rvision, are designed to enhance the robot’s collision-avoidance ability, we expect this model to outperform Wobst, WC, and WC+obst, while achieving performance comparable to WC+v.

These five models are constructed through different combinations of reward terms. By comparing models with and without the RCoulomb term, we can evaluate its contribution to collision-free motion planning. Similarly, by comparing models with and without the Rvision term, we can evaluate the role of Rvision in robot maneuvering. Using this controlled-variable approach, we can systematically test and validate the effectiveness of the proposed reward terms, RCoulomb and Rvision.

4.4 Evaluation metrics

To more comprehensively assess the navigation and collision avoidance performance of different models, three quantitative metrics were employed: Success Rate (SR), Collision Ratio (CR), and Average Goal Distance (GD). These metrics jointly evaluate navigation reliability, safety, and efficiency, enabling a more rigorous comparison among all tested models and environments. The details of these three metrics are described below:

• SR: Success rate is defined as the ratio of successful rollouts, where the robot reaches the goal without collision, to the total number of rollouts. It measures the overall reliability of the navigation policy in completing tasks successfully. A higher SR indicates stronger obstacle avoidance and goal-reaching capability. SR serves as the primary evaluation metric in both simulation and real-world experiments.

\[ \mathrm{SR} = \frac{\mathrm{success\_count}}{\mathrm{rollout}}, \]

• CR: Collision ratio is designed to evaluate navigation safety and efficiency when the total runtime cannot be directly compared. In our simulation, each episode restarts from the initial position after collision, and start-goal pairs differ across different model tests, making navigation time or path length unsuitable for fair comparison. Therefore, CR reflects the frequency of collisions normalized by total steps, indicating how safely and efficiently the robot navigates over longer trajectories. A lower CR means the model can operate longer with fewer collisions, demonstrating better collision-avoidance capability and stability. CR is evaluated only in simulation experiments.

\[ \mathrm{CR} = \frac{\mathrm{collision\_count}}{\mathrm{total\_steps}/1000}, \]

• Avg. GD: Average goal distance captures the model’s tendency to approach the goal, even in failed attempts. While SR only measures how often a model reaches the goal, GD quantifies how close the robot remains to the target at the end of each rollout. A lower GD indicates that the model either successfully reaches the goal or, in failure cases, terminates nearer to it, which demonstrates stronger goal-reaching tendency and better awareness of feasible solutions. This metric is evaluated only in simulation for controlled quantitative comparison.

\[ \mathrm{Avg.\ GD} = \frac{\mathrm{total\_goal\_distance}}{\mathrm{rollout}}. \]
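For completeness, the three metrics can be computed from rollout logs as in the sketch below; the function and argument names are ours.

```python
def evaluation_metrics(successes, collisions, total_steps, final_goal_dists):
    """Compute SR, CR, and Avg. GD from rollout logs (a sketch)."""
    rollouts = len(final_goal_dists)
    sr = successes / rollouts                  # success rate
    cr = collisions / (total_steps / 1000.0)   # collisions per 1,000 steps
    avg_gd = sum(final_goal_dists) / rollouts  # mean final distance to the goal (m)
    return sr, cr, avg_gd

# Example: 160 successes over 200 rollouts, 12 collisions in 50,000 steps.
print(evaluation_metrics(160, 12, 50_000, [0.2] * 160 + [1.5] * 40))
```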

5 Experiments and results

The major innovation of our design lies in the two reward components, RCoulomb and Rvision, in Equation 2. We conducted a series of experiments to evaluate their effectiveness through statistical analysis and visual inspection. The models are tested and compared in both simulation and real-world environments.

5.1 Results in simulation environments

The five (5) models are trained in the training environment for 7,000 epochs, and the corresponding training data are shown in Figure 10. The figure plots the average reward obtained over each set of 10 epochs against the training epochs. From Figure 10, we observe that Wobst converges the slowest, stabilizing only after approximately 4,800 epochs, and its performance between 4,800 and 7,000 epochs remains less consistent than that of the other four models. WC and WC+obst converge much faster, beginning around 800 epochs, with stable performance after convergence. WC+v and WC+v+obst converge even faster, showing rapid convergence around 400 epochs. These results suggest that the RCoulomb term significantly enhances training efficiency, while the Rvision term further accelerates convergence.


Figure 10. Plots of training results (epoch = 7,000) for models: (a) Wobst, (b) WC, (c) WC+v, (d) WC+obst and (e) WC+v+obst.

After training converged, the models were evaluated in the test scenes using success rate as the performance metric. Table 1 summarizes the results of the five trained models tested across four (4) simulation environments. Dynamic indicates the presence of moving obstacles, while unseen refers to deployment in a previously unseen (new) environment.


Table 1. Quantitative test results for Wobst, WC, WC+v, WC+obst and WC+v+obst in simulation environments. Metrics: SR (in %), CR and Avg. GD (in meters). Rollout = 200.

The key observations are as follows:

1. The Wobst model performs the worst across all test environments, exhibiting the lowest success rate, the highest collision ratio and the highest average goal distance. This confirms its limited effectiveness when used alone.

2. Incorporating Coulomb reward components (RCoulomb) consistently boosts robot performance across all test scenes with higher success rate, lower collision ratio and lower average goal distance, validating the effectiveness of our reward design.

3. Adding vision rewards (Rvision) further improves success rate and reduces collision ratio and average goal distance, as seen in the comparisons of WC vs. WC+v and WC+obst vs. WC+v+obst.

4. The overall best-performing model in simulation, based on success rate, collision ratio, and average goal distance, is WC+v+obst.

5.2 Evaluation of robot performance in real environments

To evaluate the trained policies on a real robot, we deployed the five DRL models onto a TB3. As described in the previous subsection, the test environments are categorized into three types: Real Test Scene 1, Scene 2, and Scene 3. The robot trajectories are visualized in RViz using green dots, which represent real-time TB3 position data.

The setups and results are as follows. In Real Test Scene 1, two static obstacles are placed directly between the start point and the goal, requiring the TB3 to navigate around them to reach its destination. This setup evaluates each model’s collision-avoidance capability under the sim-to-real challenge. From the trajectories shown in Figure 11, we observe that although Wobst initially moves toward the goal, it fails to find a feasible path after attempting to avoid an obstacle and ultimately collides with a table leg.


Figure 11. Plots of trajectories for the models in Real Test Scene 1 with static obstacles. (a) Wobst failed, (b) WC succeeded, (c) WC+v succeeded, (d) WC+obst succeeded and (e) WC+v+obst succeeded. In each test, the start position is marked by a yellow star, the goal position by a red dot, and collisions by blue dots.

In contrast, the other four models successfully navigate around the obstacles and reach the goal. Their trajectories in this test scene are generally similar, all bypassing the obstacles from the right side with slight variations in clearance. Among them, WC+obst maintained the greatest distance from obstacles during navigation. This behavior can be attributed to the influence of the Robst and Rrepu terms in the reward function, which encouraged the TB3 to maintain a larger buffer from obstacles.

In Real Test Scene 2, to further challenge the models, we extended the second obstacle to evaluate their ability to find an alternative path when the previous route was no longer feasible. Figure 12 illustrates the trajectory results of the five models. It can be observed that Wobst and WC failed to reach the goal: Wobst collided with a table leg after moving toward the goal, while WC followed its previous path and crashed into the extended obstacle. In contrast, WC+v, WC+obst, and WC+v+obst all successfully reached the goal.


Figure 12. Plots of trajectories for models under Real Test Scene 2 with static obstacles. (a) Wobst failed, (b) WC failed, (c) WC+v succeeded, (d) WC+obst succeeded and (e) WC+v+obst succeeded.

Among the successful models, WC+v demonstrated the most reasonable trajectory, maintaining the largest collision-free buffer. The effect of the vision reward (Rvision) can be observed by comparing WC+v (Figure 12c) with models that do not incorporate vision (Figures 12a,b,d). Vision reward, based on LiDAR segmentation, enables the robot to detect obstacles from a distance, leading to earlier avoidance and smoother trajectories.

The effectiveness of the Coulomb reward (RCoulomb) compared to the obstacle reward (Robst) in Test Scene 2 is illustrated in Figure 13. Figures 13b,c show a zoomed-in region of the scene (highlighted by the red box in Figure 13a). As shown in Figure 13b, the obstacle reward essentially creates rigid surrounding zones that the robot cannot penetrate. Consequently, the robot is repeatedly repelled, first from the upper red-box region and then from the blue-box region. This sequence of rejections prevents the robot from finding a feasible path between the two obstacles. In contrast, the Coulomb reward, RCoulomb, varies smoothly and provides global guidance across the entire field. While still discouraging the robot from moving too close to obstacles, it forms a watershed-like potential field that directs the robot toward the goal more effectively. This mechanism partly explains the superior performance of WC+v, as shown in Figure 12c.


Figure 13. Environment overview with a red bounding box in (a) indicating the zoom-in region, and detailed views of the highlighted area in (b,c). In (b,c), the black arrowed curves show the robot’s motion under obstacle influence with Robst and RCoulomb respectively.

In Real Test Scene 3, moving obstacles were placed along the TB3’s travel path to evaluate its ability to avoid obstacles in dynamic environments. Based on visual inspection, WC+v performed best, consistently avoiding obstacles and reaching the goal with relatively short paths. The second-best model was WC+v+obst, which also avoided obstacles and reached the goal with high probability. For demonstration, see the accompanying YouTube video.

We conducted a statistical analysis of the five models across the three real-world test scenes. Each model was run five times (i.e., Epoch = 5), and the average success rates are summarized in Table 2. These results confirm our visual observations: WC+v achieved the best overall performance across all scenes, followed by WC+v+obst. However, this finding contrasts with the simulation results, where WC+v+obst showed the best performance.


Table 2. Statistical results for Wobst, WC, WC+v, WC+obst, and WC+v+obst under real-world test scenes. SR denotes Success Rate (%). Rollout = 5.

After examining the actual trajectories taken by the robots in simulation tests, we identified a plausible explanation, illustrated in Figure 14. As shown in the figure, under identical start and goal positions, WC+v+obst tended to generate more detours. While such behavior had little impact in idealized simulations, it posed challenges in real-world deployment. Increased detours elevated the risk of failure due to collisions, loss of the goal, or becoming stuck, ultimately contributing to the lower success rate observed for WC+v+obst in real-world tests. Our findings are consistent with those of Da et al. (2025), who noted that the sim-to-real gap arises primarily from perception-related limitations and execution discrepancies.


Figure 14. Schematic path comparison under the same start and goal: (a) WC+v, (b) WC+v+obst. Green curves indicate illustrative trajectories only and are not directly generated by the robot.

To summarize, the TB3 robot equipped with the trained motion planning policies (WC+v and WC+v+obst) exhibits strong maneuvering capabilities in complex simulated and real environments, demonstrating the effectiveness of our design.

5.3 Additional cluttered simulated environments and test results

To further validate the robustness and generalization of the proposed DRL framework, two additional cluttered simulation environments were constructed, as shown in Figure 15.

Figure 15. Additional cluttered simulation environments. (a) Test Scene 4 contains static obstacles only; (b) Test Scene 5 additionally introduces moving obstacles, which are cylindrical in shape.

Test Scene 4 (illustrated in Figure 15a) contains densely arranged static obstacles, forming multiple narrow corridors and enclosed regions that challenge precise local navigation and obstacle avoidance. Test Scene 5 (shown in Figure 15b) extends this design by introducing multiple cylindrical obstacles that move dynamically along predefined trajectories. As these cylindrical obstacles move, the layout of traversable space changes over time, forming a dense and dynamic scene that explicitly requires adaptive path planning.
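
As an illustration of how such predefined motion can be generated, the sketch below shuttles an obstacle back and forth along a straight segment at constant speed. The waypoints, speed, and sampling times are placeholders rather than the actual Scene 5 trajectories, and in Gazebo the returned position would still have to be applied to the obstacle model (e.g., through a model-state update).

```python
import numpy as np

def obstacle_position(t, p_start, p_end, speed=0.2):
    """Position at time t (s) of an obstacle that moves back and forth
    along the segment p_start -> p_end at constant speed (m/s)."""
    p_start = np.asarray(p_start, dtype=float)
    p_end = np.asarray(p_end, dtype=float)
    seg_len = np.linalg.norm(p_end - p_start)
    # Fold the traveled distance into a forward/backward sweep (triangle wave).
    s = (speed * t) % (2.0 * seg_len)
    s = s if s <= seg_len else 2.0 * seg_len - s
    return p_start + (p_end - p_start) * (s / seg_len)

# Example: sample the obstacle position at t = 0, 5, and 10 seconds.
for t in (0.0, 5.0, 10.0):
    print(t, obstacle_position(t, (1.0, 1.0), (1.0, 3.0)))
```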

Both environments substantially increase the navigation difficulty compared with Test Scenes 0–3, imposing tighter spatial constraints and more complex interactions between the robot, obstacles, and goal. The policies trained in the earlier experiments were deployed directly in these new scenes, without retraining, to evaluate cross-environment generalization.

The quantitative outcomes summarized in Table 3 provide a detailed characterization of policy robustness as environmental complexity increases from the previously tested sparse settings (Test Scenes 0–3) to the newly introduced cluttered environments (Test Scenes 4–5). Across all five models, performance declined notably as the environment became more cluttered. This drop was primarily caused by the reduced navigable space and the increased probability of the agent becoming trapped in local minima created by narrow corridors and complex obstacle layouts.

Table 3. Quantitative results in the additional cluttered environments. SR denotes Success Rate (%), CR denotes Collision Rate, and Avg. GD (m) is the average robot-to-goal distance at the end of each run, in meters. Rollout = 200.
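
For clarity, these three quantities can be aggregated from per-rollout outcomes as in the sketch below; the record fields are hypothetical names used for illustration and do not correspond to our evaluation code.

```python
def summarize_rollouts(rollouts):
    """Aggregate per-rollout records into the reported metrics.
    Each record is a dict with hypothetical fields:
      'success'   - True if the goal was reached,
      'collided'  - True if the run ended in a collision,
      'goal_dist' - robot-to-goal distance (m) when the run ended."""
    n = len(rollouts)
    sr = 100.0 * sum(r["success"] for r in rollouts) / n    # Success Rate (%)
    cr = 100.0 * sum(r["collided"] for r in rollouts) / n   # Collision Rate (%)
    avg_gd = sum(r["goal_dist"] for r in rollouts) / n      # Avg. GD (m)
    return sr, cr, avg_gd

# Example with three toy rollouts.
print(summarize_rollouts([
    {"success": True,  "collided": False, "goal_dist": 0.0},
    {"success": False, "collided": True,  "goal_dist": 1.8},
    {"success": False, "collided": False, "goal_dist": 0.6},
]))
```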

Nevertheless, the Coulomb- and vision-guided model WC+v consistently achieved the highest success rates and the lowest collision rates and average goal distances in both cluttered settings. In Test Scene 5 (dynamic clutter), moving obstacles further lowered the success rate for all methods, but WC+v remained the most robust. This superior performance confirms that the Coulomb-based reward, which effectively models obstacle and goal forces, provides a strong global guidance signal. The inclusion of vision input improves performance further by giving the agent richer spatial information and helping it adapt as the free space changes over time. The low CR values reflect the model’s superior collision-avoidance capability, while the low average GD suggests that WC+v either reached the goal or, in failure cases, terminated closer to it than the other models.

Compared with the sparser Test Scenes 0–3, the benefit of the Coulomb- and vision-guided policy becomes more pronounced as the environment becomes more cluttered, whereas the performance of the model relying solely on Robst drops quickly. This trend aligns with the real-world outcomes discussed previously: WC+v+obst showed degraded performance in the physical setup because the environment is more cluttered and the Robst reward can dominate in narrow passages, effectively blocking feasible routes. The cluttered-scene results further confirm that WC+v is the most effective model, showing strong path adaptability and efficient route generation even in dense or dynamic environments.

6 Conclusion

In this paper, we presented a physics-inspired DRL framework for mobile robot motion planning that leverages Coulomb-force modeling to provide interpretable and effective guidance. By representing the robot, goal, and obstacles as electrical charges, we introduced a novel Coulomb-based reward mechanism that delivers smooth, pervasive, and consistent signals during training. To the best of our knowledge, this is the first work to employ Coulomb forces in path planning and reinforcement learning.

Our approach further incorporates obstacle boundaries extracted from LiDAR segmentation, enabling the robot to anticipate and avoid collisions in advance. Through training in a digital twin environment and deployment on a real TB3 robot, we demonstrated that the proposed framework significantly reduces collisions, maintains safe obstacle clearances, and improves trajectory smoothness across both simulated and real-world scenarios. These results confirm not only the effectiveness but also the strong explainability of our Coulomb- and vision-based rewards in shaping robot behavior.

Finally, by carefully designing environment-invariant components, our system exhibits enhanced generalization, suggesting broad applicability to diverse navigation tasks. Moving forward, this framework provides a promising foundation for extending physics-inspired reinforcement learning to multi-robot systems, more complex environments, and real-time adaptive planning.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

SS: Software, Writing – review and editing, Methodology, Supervision, Funding acquisition, Conceptualization, Writing – original draft, Investigation, Formal Analysis, Visualization, Resources, Project administration, Validation, Data curation. TB: Writing – original draft, Resources, Conceptualization, Writing – review and editing, Methodology, Supervision, Investigation. JL: Data curation, Validation, Conceptualization, Project administration, Methodology, Visualization, Investigation, Supervision, Resources, Writing – review and editing, Funding acquisition, Formal Analysis, Writing – original draft.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This work was partially supported by the Ohio University Research Committee (OURC) Fund.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was used in the creation of this manuscript. GenAI was used only for final sentence refinement.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Anderson, P., Shrivastava, A., Truong, J., Majumdar, A., Parikh, D., Batra, D., et al. (2021). “Sim-to-real transfer for vision-and-language navigation,” in Conference on robot learning (PMLR), 671–681.

Back, S., Cho, G., Oh, J., Tran, X.-T., and Oh, H. (2020). Autonomous UAV trail navigation with obstacle avoidance using deep neural networks. J. Intelligent and Robotic Syst. 100, 1195–1211. doi:10.1007/s10846-020-01254-5

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279. doi:10.1613/jair.3912

Beomsoo, H., Ravankar, A. A., and Emaru, T. (2021). “Mobile robot navigation based on deep reinforcement learning with 2D-LiDAR sensor using stochastic approach,” in 2021 IEEE international conference on intelligence and safety for robotics (ISR) (IEEE), 417–422.

Borenstein, J., and Koren, Y. (1990). “Real-time obstacle avoidance for fast mobile robots in cluttered environments,” in Proceedings of the IEEE international conference on robotics and automation (IEEE), 572–577.

Borenstein, J., and Koren, Y. (1991). The vector field histogram-fast obstacle avoidance for mobile robots. IEEE Transactions Robotics Automation 7, 278–288. doi:10.1109/70.88137

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., et al. (2016). OpenAI Gym.

Cao, Y., and Nor, N. M. (2024). An improved dynamic window approach algorithm for dynamic obstacle avoidance in mobile robot formation. Decis. Anal. J. 11, 100471. doi:10.1016/j.dajour.2024.100471

Chang, L., Shan, L., Jiang, C., and Dai, Y. (2021). Reinforcement based mobile robot path planning with improved dynamic window approach in unknown environment. Aut. Robots 45, 51–76. doi:10.1007/s10514-020-09947-4

Chen, W., Sun, J., Li, W., and Zhao, D. (2020). A real-time multi-constraints obstacle avoidance method using LiDAR. J. Intelligent and Fuzzy Syst. 39, 119–131. doi:10.3233/jifs-190766

Cosío, F. A., and Castañeda, M. P. (2004). Autonomous robot navigation using adaptive potential fields. Math. Computer Modelling 40, 1141–1156. doi:10.1016/j.mcm.2004.05.001

Da, L., Turnau, J., Kutralingam, T. P., Velasquez, A., Shakarian, P., and Wei, H. (2025). A survey of sim-to-real methods in RL: progress, prospects and challenges with foundation models. arXiv Preprint arXiv:2502.13187.

Dai, X., Mao, Y., Huang, T., Qin, N., Huang, D., and Li, Y. (2020). Automatic obstacle avoidance of quadrotor UAV via CNN-based learning. Neurocomputing 402, 346–358. doi:10.1016/j.neucom.2020.04.020

Dietmayer, K. (2001). Model-based object classification and object tracking in traffic scenes from range-images. IV2001, 25–30.

Doukhi, O., and Lee, D.-J. (2021). Deep reinforcement learning for end-to-end local motion planning of autonomous aerial robots in unknown outdoor environments: real-time flight experiments. Sensors 21, 2534. doi:10.3390/s21072534

Fox, D., Burgard, W., and Thrun, S. (2002). The dynamic window approach to collision avoidance. IEEE Robotics and Automation Magazine 4, 23–33. doi:10.1109/100.580977

Fujimoto, S., Hoof, H., and Meger, D. (2018). “Addressing function approximation error in actor-critic methods,” in International conference on machine learning (Stockholm, Sweden: Proceedings of Machine Learning Research), 1587–1596.

Gao, J., Ye, W., Guo, J., and Li, Z. (2020). Deep reinforcement learning for indoor mobile robot path planning. Sensors 20, 5493. doi:10.3390/s20195493

Giusti, A., Guzzi, J., Cireşan, D. C., He, F.-L., Rodríguez, J. P., Fontana, F., et al. (2015). A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics Automation Lett. 1, 661–667. doi:10.1109/LRA.2015.2509024

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). “Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proceedings of the 35th international conference on machine learning (Stockholm, Sweden: Proceedings of Machine Learning Research), 1861–1870.

Hart, P. E., Nilsson, N. J., and Raphael, B. (1968). A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions Syst. Sci. Cybern. 4, 100–107. doi:10.1109/tssc.1968.300136

He, L., Aouf, N., Whidborne, J. F., and Song, B. (2020). “Integrated moment-based LGMD and deep reinforcement learning for UAV obstacle avoidance,” in 2020 IEEE international conference on robotics and automation (ICRA) (IEEE), 7491–7497.

Joshi, B., Kapur, D., and Kandath, H. (2024). “Sim-to-real deep reinforcement learning based obstacle avoidance for UAVs under measurement uncertainty,” in 2024 10th international conference on automation, robotics and applications (ICARA) (IEEE), 278–284.

Karaman, S., and Frazzoli, E. (2011). Sampling-based algorithms for optimal motion planning. Int. J. Robotics Res. 30, 846–894. doi:10.1177/0278364911406761

Khatib, O. (1986). Real-time obstacle avoidance for manipulators and mobile robots. International Journal Robotics Research 5, 90–98. doi:10.1177/027836498600500106

Kherudkar, R., Tiwari, S., Vedantham, U., Chouti, N., Prasad, B. P., Vanahalli, M. K., et al. (2024). “Implementation and comparison of path planning algorithms for autonomous navigation,” in 2024 IEEE conference on engineering informatics (ICEI) (IEEE), 1–9.

Kim, D. K., and Chen, T. (2015). Deep neural network for real-time autonomous indoor navigation. arXiv Preprint arXiv:1511.04668. doi:10.48550/arXiv.1511.04668

Kim, I., Nengroo, S. H., and Har, D. (2021). “Reinforcement learning for navigation of mobile robot with LiDAR,” in 2021 5th international conference on electronics, communication and aerospace technology (ICECA) (IEEE), 148–154.

Koenig, N., and Howard, A. (2004). “Design and use paradigms for Gazebo, an open-source multi-robot simulator,” in 2004 IEEE/RSJ international conference on intelligent robots and systems (IROS) (IEEE Cat. No. 04CH37566) (IEEE), 3, 2149–2154. doi:10.1109/iros.2004.1389727

Lee, J.-W., Kim, K.-W., Shin, S.-H., and Kim, S.-W. (2022). “Vision-based collision avoidance for mobile robots through sim-to-real transfer,” in 2022 international conference on electronics, information, and communication (ICEIC) (IEEE), 1–4.

Li, Z., Shi, N., Zhao, L., and Zhang, M. (2024). Deep reinforcement learning path planning and task allocation for multi-robot collaboration. Alexandria Eng. J. 109, 408–423. doi:10.1016/j.aej.2024.08.102

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., et al. (2015). Continuous control with deep reinforcement learning.

Liu, H., Shen, Y., Yu, S., Gao, Z., and Wu, T. (2024). Deep reinforcement learning for mobile robot path planning. arXiv Preprint arXiv:2404.06974 4, 37–44. doi:10.53469/jtpes.2024.04(04).07

Long, P., Fan, T., Liao, X., Liu, W., Zhang, H., and Pan, J. (2018). “Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning,” in 2018 IEEE international conference on robotics and automation (ICRA) (IEEE), 6252–6259.

Michels, J., Saxena, A., and Ng, A. Y. (2005). “High speed obstacle avoidance using monocular vision and reinforcement learning,” in Proceedings of the 22nd international conference on machine learning, 593–600.

Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A. J., Banino, A., et al. (2016). Learning to navigate in complex environments.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature 518, 529–533. doi:10.1038/nature14236

Murillo, J. (2023). Deep learning for autonomous vehicle real-time hazard detection and avoidance. J. AI-Assisted Sci. Discov. 3, 175–194.

Olayemi, K. B., Van, M., McLoone, S., McIlvanna, S., Sun, Y., Close, J., et al. (2023). The impact of LiDAR configuration on goal-based navigation within a deep reinforcement learning framework. Sensors 23, 9732. doi:10.3390/s23249732

Ouahouah, S., Bagaa, M., Prados-Garzon, J., and Taleb, T. (2021). Deep reinforcement learning based collision avoidance in UAV environment. IEEE Internet Things J. 9, 4015–4030. doi:10.1109/jiot.2021.3118949

Philippsen, R., and Siegwart, R. (2003). “Smooth and efficient obstacle avoidance for a tour guide robot,” in 2003 IEEE international conference on robotics and automation (IEEE), 1, 446–451. doi:10.1109/robot.2003.1241635

Quraishi, A., Gudala, L., Keshta, I., Putha, S., Nimmagadda, V. S. P., and Thakkar, D. (2025). “Deep reinforcement learning-based multi-robotic agent motion planning,” in 2025 4th OPJU international technology conference (OTCON) on smart computing for innovation and advancement in industry 5.0 (IEEE), 1–6.

Open Robotics (2014). Gazebo simulation environment.

Open Robotics (2017). TurtleBot3.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). “Prioritized experience replay,” in International conference on learning representations (ICLR).

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms.

Shi, H., Shi, L., Xu, M., and Hwang, K.-S. (2019). End-to-end navigation strategy with deep reinforcement learning for mobile robots. IEEE Trans. Industrial Inf. 16, 2393–2402. doi:10.1109/tii.2019.2936167

Singla, A., Padakandla, S., and Bhatnagar, S. (2019). Memory-based deep reinforcement learning for obstacle avoidance in UAV with limited environment knowledge. IEEE Transactions Intelligent Transportation Systems 22, 107–118. doi:10.1109/tits.2019.2954952

Song, S., Zhang, Y., Qin, X., Saunders, K., and Liu, J. (2021). “Vision-guided collision avoidance through deep reinforcement learning,” in NAECON 2021 - IEEE national aerospace and electronics conference (IEEE), 191–194.

Song, S., Saunders, K., Yue, Y., and Liu, J. (2022). “Smooth trajectory collision avoidance through deep reinforcement learning,” in 2022 21st IEEE international conference on machine learning and applications (ICMLA) (IEEE), 914–919.

Stentz, A. (1995). Optimal and efficient path planning for partially known environments. Intell. Unmanned Ground Veh., 203–220. doi:10.1007/978-1-4615-6325-9_11

Sun, J., Zhao, J., Hu, X., Gao, H., and Yu, J. (2023). Autonomous navigation system of indoor mobile robots using 2D LiDAR. Mathematics 11, 1455. doi:10.3390/math11061455

Tai, L., Li, S., and Liu, M. (2016). “A deep-network solution towards model-less obstacle avoidance,” in 2016 IEEE/RSJ international conference on intelligent robots and systems (IROS) (IEEE), 2759–2764.

Wang, J., and Huang, R. (2022). “A mapless navigation method based on deep reinforcement learning and path planning,” in 2022 IEEE international conference on robotics and biomimetics (ROBIO) (IEEE), 1781–1786.

Wang, Y., He, H., and Sun, C. (2018). Learning to navigate through complex dynamic environment with modular deep reinforcement learning. IEEE Trans. Games 10, 400–412. doi:10.1109/tg.2018.2849942

Wang, N., Zhang, D., and Wang, Y. (2020). “Learning to navigate for mobile robot with continual reinforcement learning,” in 2020 39th Chinese control conference (CCC) (IEEE), 3701–3706.

Wen, T., Wang, X., Zheng, Z., and Sun, Z. (2024). A DRL-based path planning method for wheeled mobile robots in unknown environments. Comput. Electr. Eng. 118, 109425. doi:10.1016/j.compeleceng.2024.109425

Wu, J., Zhou, Y., Yang, H., Huang, Z., and Lv, C. (2023). Human-guided reinforcement learning with sim-to-real transfer for autonomous navigation. IEEE Trans. Pattern Analysis Mach. Intell. 45, 14745–14759. doi:10.1109/TPAMI.2023.3314762

Xie, L., Wang, S., Markham, A., and Trigoni, N. (2017). Towards monocular vision based obstacle avoidance through deep reinforcement learning. arXiv Preprint arXiv:1706.09829. doi:10.48550/arXiv.1706.09829

Xing, X., Ding, H., Liang, Z., Li, B., and Yang, Z. (2022). Robot path planner based on deep reinforcement learning and the seeker optimization algorithm. Mechatronics 88, 102918. doi:10.1016/j.mechatronics.2022.102918

Xue, Z., and Gonsalves, T. (2021). Vision based drone obstacle avoidance by deep reinforcement learning. AI 2, 366–380. doi:10.3390/ai2030023

Xue, X., Li, Z., Zhang, D., and Yan, Y. (2019). “A deep reinforcement learning method for mobile robot collision avoidance based on double DQN,” in 2019 IEEE 28th international symposium on industrial electronics (ISIE) (IEEE), 2131–2136.

Yan, C., Chen, G., Li, Y., Sun, F., and Wu, Y. (2023). Immune deep reinforcement learning-based path planning for mobile robot in unknown environment. Appl. Soft Comput. 145, 110601. doi:10.1016/j.asoc.2023.110601

Yang, C., Li, Y., Zheng, Y., He, F., and Yan, C. (2018). Asynchronous multithreading reinforcement-learning-based path planning and tracking for unmanned underwater vehicle. IEEE Trans. Syst. Man, Cybern. Syst. 48, 1055–1066. doi:10.1109/TSMC.2021.3050960

Yu, J., Su, Y., and Liao, Y. (2020). The path planning of mobile robot by neural networks and hierarchical reinforcement learning. Front. Neurorobot. 14, 63.

Yu, W., Peng, J., Qiu, Q., Wang, H., Zhang, L., and Ji, J. (2024). “PathRL: an end-to-end path generation method for collision avoidance via deep reinforcement learning,” in 2024 IEEE international conference on robotics and automation (ICRA) (IEEE), 9278–9284.

Zhang, T., Zhang, K., Lin, J., Louie, W.-Y. G., and Huang, H. (2021). Sim2real learning of obstacle avoidance for robotic manipulators in uncertain environments. IEEE Robotics Automation Lett. 7, 65–72. doi:10.1109/lra.2021.3116700

Zhelo, O., Zhang, J., Tai, L., Liu, M., and Burgard, W. (2018). Curiosity-driven exploration for mapless navigation with deep reinforcement learning. arXiv Preprint arXiv:1804.00456. doi:10.48550/arXiv.1804.00456

Keywords: Coulomb force, deep reinforcement learning, Gazebo, LiDAR, motion planning, TurtleBot3

Citation: Song S, Bihl T and Liu J (2026) Coulomb force-guided deep reinforcement learning for effective and explainable robotic motion planning. Front. Robot. AI 12:1697155. doi: 10.3389/frobt.2025.1697155

Received: 01 September 2025; Accepted: 15 December 2025;
Published: 30 January 2026.

Edited by:

Giovanni Iacca, University of Trento, Italy

Reviewed by:

Feitian Zhang, Peking University, China
Samantha Rajapaksha, Sri Lanka Institute of Information Technology, Sri Lanka

Copyright © 2026 Song, Bihl and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Jundong Liu, liuj1@ohio.edu
