Reinforcement learning for path planning of free-ﬂ oating space robotic manipulator with collision avoidance and observation noise

This study introduces a novel approach for the path planning of a 6-degree-of-freedom free-ﬂ oating space robotic manipulator, focusing on collision and obstacle avoidance through reinforcement learning. It addresses the challenges of dynamic coupling between the spacecraft and the robotic manipulator, which signi ﬁ cantly affects control and precision in the space environment. An innovative reward function is introduced in the reinforcement learning framework to ensure accurate alignment of the manipulator ’ s end effector with its target, despite disturbances from the spacecraft and the need for obstacle and collision avoidance. A key feature of this study is the use of quaternions for orientation representation to avoid the singularities associated with conventional Euler angles and enhance the training process ’ ef ﬁ ciency. Furthermore, the reward function incorporates joint velocity constraints to re ﬁ ne the path planning for the manipulator joints, enabling ef ﬁ cient obstacle and collision avoidance. Another key feature of this study is the inclusion of observation noise in the training process to enhance the robustness of the agent. Results demonstrate that the proposed reward function enables effective exploration of the action space, leading to high precision in achieving the desired objectives. The study provides a solid theoretical foundation for the application of reinforcement learning in complex free-ﬂ oating space robotic operations and offers insights for future space missions.


Introduction
Space exploration represents a sign of human curiosity that drives scientific and technological advancements at the frontier of our knowledge.Space technologies have materialized our wildest dreams, from the exploration of space to the potential colonization of different planets.Among these technologies, robotic manipulators designed for space operations have emerged as critical tools for various missions, including on-orbit servicing, space debris removal, and inspace assembly of large structures due to their high technological readiness level (Papadopoulos et al., 2021).These manipulators perform tasks that would be too dangerous, costly, or time-consuming for human astronauts to perform (Flores-Abad et al., 2014).
Research in space robotic manipulator divides into two distinct categories: free-flying and free-floating (Dubowsky and Papadopoulos, 1993).Free-flying manipulators are characterized by the active control of position and orientation (attitude) of the base spacecraft through the use of thrusters, while the manipulator executes its designated tasks (Seddaoui et al., 2021).This approach, which is analogue to the operation of a fixed-base manipulator, has been extensively studied with focus on path planning and control (Pfeiffer, 1968;Wang et al., 2021).However, the reliance on the thrusters for pose (position and attitude) control incurs significant fuel consumption, potentially reducing the operational lifespan of the system in orbit.
Conversely, free-floating manipulator systems operate with a base spacecraft whose pose is not actively controlled.Instead, the pose of the spacecraft is allowed to naturally react to the manipulator's joint movements due to the conservation of momentum (Ratajczak and Tchon, 2021;Tsiotras et al., 2023).Intuitively, this approach offers significant advantages, such as the saving of spacecraft fuel and a reduced risk of collision with targets, which might otherwise result from the active attitude control of the base spacecraft.Despite these advantages, controlling freefloating space robots represents a significant challenge.The dynamic interaction between the manipulator and its base spacecraft, due to the conservation of momentum, introduces a complex coupling effect.This effect significantly complicates path planning and motion control, as actions performed by the manipulator can inadvertently after the pose of the base spacecraft, and vice versa.The determination of the pose of the end-effector of the manipulator depends not only on the current joint angles on the manipulator but also on the previous velocities of both the joints and the base spacecraft (Nanos and Papadopoulos, 2017).Additionally, the nonholonomic nature of angular momentum further complicates control strategies, making traditional path-planning solutions, typically used for fixed-base manipulators, ineffective for freefloating space manipulators (Rybus et al., 2017;Dai et al., 2022), highlighting the need for innovative solutions in this area.

Related works
To bridge the research gap in the control and path planning of free-floating space manipulator, a challenge not encountered with fixed-base manipulators, various techniques have been proposed (Agrawal et al., 1996;Luo et al., 2018).Early efforts, as outlined in (Agrawal et al., 1996) applied inverse kinematics to path planning while respecting the nonholonomic constraints of angular momentum.However, this approach has limitations in guaranteeing path optimality, as dynamic singularities, which are often unpredictable uniquely from the manipulator's kinematics, can adversely affect the manipulator's performance (Xu et al., 2011).In response to these limitations, more recent studies have turned to optimization theory to enhance path-planning solutions, such as particle swarm optimization (Wang et al., 2015) and non-singular terminal sliding mode control (Shao et al., 2021).Despite their promise, these optimization-based solutions often demand substantial computational resources, complicating their application in real-time implementation, especially when facing uncertain disturbances.Additionally, any changes in the initial conditions of the manipulator or its target pose necessitate a comprehensive re-optimization process.
To address this shortfall, researchers have explored the potential of using machine learning (Ye et al., 2019), and more specifically, reinforcement learning (RL), to overcome these limitations (Xie et al., 2020).In essence, RL employs an agent that learns a decisionmaking policy dynamically through trial-and-error interactions with its environment (in this case, the space manipulator) to maximize a predefined reward function (Sutton and Barto, 2018;Nguyen and Hung, 2019).This learning process, which is within a Markov Decision Process framework, allows the agent to issue commands to alter the environment state and receive feedback in the form of rewards based on the efficacy of its actions.Remarkably, this process does not rely on prior data samples or predefined rules.Hence, the motivation behind exploring the application of RL in tasks like capturing a target by a free-floating space manipulator system lies in its ability to navigate the complexities of the task easily while operating within the constraints of space-based environments.After the training phase concludes, the trained agent is encapsulated solely by a set of weights and biases, represented within neural networks.These parameters are lightweight and can be readily deployed on various hardware platforms without necessitating extensive computational resources.This characteristic is particularly advantageous in space missions where acquiring powerful computational resources is inherently challenging.
Several RL techniques, such as the Deep Deterministic Policy Gradient (DDPG), Actor-Critic algorithm, and Q-learning, have proven effective in path planning for manipulator operating in continuous action spaces.For instance, a study implemented soft Q-learning for path planning of a 3-DOF (Degrees of Freedom) space robotic manipulator (Yan et al., 2018).The reward function in this study was designed based on the distance between the endeffector of the manipulator and the target.It returns positive rewards for minimizing the distance and imposing penalties (negative rewards) for cumulative joint torques.
While the value-based Q-learning performs well when agents select actions from a finite set, policy-gradient based methods like DDPG excel for continuous state and action spaces (Liu and Huang, 2021).In another instance, the DDPG algorithm was applied to path planning for a simple 2-DOF space manipulator (Hu et al., 2018).The reward function in the study included terms for obstacle avoidance, penalties for disturbances from the base spacecraft, and a small penalty for the path length.Additionally, DDPG was used to enhancing the training efficiency of a 3-DOF space robotic manipulator through pre-training (Du et al., 2019).Similarly, a dualarm space robotic manipulator, each arm with 4-DOF, was trained using DDPG, with the reward function incorporating velocity constraints and penalties to prevent collisions between the arms (Li Haoxuan et al., 2021).Furthermore, the Actor-Critic algorithm was utilized for path planning for a 6-DOF space robotic manipulator, aiming for position and orientation alignment (Liang et al., 2021).The reward function in this case was based not only on the distance difference but also a term for the Euler angle difference between the end effector and the target.In a separate context, DDPG was applied for collision-free path planning for a fixed-base manipulator with 7-DOF, using artificial potential fields within the reward functions to represent obstacles (Li Yinkang et al., 2021).The same algorithm was later applied to a redundant 7-DOF manipulator in a free-floating condition, aiming for target capture with a reward function also based on artificial potential fields (Li et al., 2022).
Table 1 summarizes the current state of RL applications in path planning for space robotic manipulators.This emerging field has seen limited publications to date.The majority of research has focused on training 3-DOF space manipulators, with fewer studies exploring 6-DOF or higher.A significant oversight in many studies in the neglect of joint velocity limitations.Movover, there is a lack of incorporation of quaternion-based expressions for attitude control and consideration of obstacle avoidance in path planning, both of which are crucial for practical applications.Additionally, the use of scalar reward functions in the Markov Decision Process can potentially restrict the search space or result in local optima if not designed with caution.The challenge of balancing exploration-exploitation in RL continues to be a prominent area of research (Jepma and Sander, 2011).
The development of RL path planning in free-floating space manipulators remains scarce.Research suggests that constructing an effective reward function capable of optimizing the control strategy of an agent in a perturbed environment is still a significant challenge.
This paper proposes a deep RL algorithm that integrates DDPG and Actor-Critic neural networks for efficient path planning for a 6-DOF free-floating space manipulator.Our methodology aims to expedite training convergence in dynamic settings by designing reward functions that consider several critical factors.These include aligning the manipulator's end effector (EE) with the target using a novel quaternion-based approach for orientation comparison and incorporating joint velocity constraints to achieve smoother EE trajectories.Additionally, our algorithm introduces measures to prevent self-collision of the manipulator's links and, importantly, incorporates external obstacle avoidance capabilities during target approach -a novel feature not previously implemented in the literature.Recognizing the imperfection of real-world sensors, we further enhance our model's practical applicability by including white noise in the observations during training.This ensures that the trained agent is resilient to signal noise and capable of achieving successful target capture.The effectiveness and advantages of our proposed algorithm are validated through numerical simulations.
3 Mathematic model of a free-floating space manipulator A general free-floating space robotic manipulator is composed of an n-DOF manipulator mounted on a floating base spacecraft, as shown in Figure 1.An inertial reference frame, designated as OXYZ, is defined in space.Within this frame, the base spacecraft (B 0 ) is treated as link 0 with 6-DOF and the first link of the manipulator is affixed to the base spacecraft at joint J 1 .The manipulator itself is constructed from n rigid links connected in series by n independently actuated revolute joints.The position of the EE is denoted by a vector p e ∈ R 3 .Schematic of a general free-floating space robotic manipulator with n rigid links.
The position (p e ) and orientation (A e ) of the EE can be accurately determined by propagating the relative position and orientation of each joint from the base spacecraft to the endeffector, such that where p i is the position vector of the ith joint in the inertial frame and A i is the rotation matrix of the ith link.Differentiating Eqs 1, 2 with respect to time results in the linear (v e ∈ R 3 ) and angular (ω e ∈ R 3 ) velocities of the EE, such that ) where _ φ i and j i are the ith joint's angular velocity and direction in the inertial frame, respectively.Rewriting the velocity equations into a compact matrix form with the base state _ where T is the joint angular velocity vector, and (J b , J m ) are the Jacobian matrices dependent on the base spacecraft and manipulator motions, respectively.
In a free-floating space manipulator system, the total momentum is conserved.Assuming that the initial momenta of the system is zero, we can express the conservation of momentum conservation as or in the compact matric form as where m i is the mass of the ith link, I i is the inertial moment of the ith link at its center of mass (CM), (a i , b i ) are fractional coefficients measuring the distances from the CM of the ith link to its two ends and satisfy the condition of a i + b i 1, M b is the inertia matrix of the base spacecraft, and M m is the coupling inertia matrix of the manipulator.The detailed expression of these inertia matrices can be found in Wilde et al. (2018).Combining Eqs. 5, 8 yields where J g is the generalized Jacobian matrix dependent on the base spacecraft attitude, joint angles, and mass properties of the base and links.Based on the above kinematics, the dynamics of the free-floating n-link robotic manipulator system, as shown in Figure 1, can be derived by Lagrange equations (Wilde et al., 2018). where and q [x b , Φ] T is the generalized displacement vector of the freefloating space robotic manipulator, M bm is the inertia matrix reflecting the dynamic coupling between the base spacecraft and the manipulator, c b and c m represents the Coriolis forces acting on the base spacecraft and the manipulator, and u is the control torque at the joints.This equation serves as the mathematical base in our dynamic simulation environment of a free-floating manipulator, which will be used for reinforcement learning training.
To ascertain the algorithm's performance and effectiveness under real-world conditions, the parameters used in the simulations are based on actual space robotic manipulator missions (Li et al., 2019).Table 2 shows a brief summary of past space robotic manipulator missions.Notably the mass ratios in most cases are small except for the Orbital Express mission, with a 7.4% mass ratio being the highest.Accordingly, this mass ratio was selected in study to show the need to consider the free-floating base condition.
4 Reinforcement learning for path planning

DDPG reinforcement learning algorithm
The control objective of the free-floating manipulator is to move the EE smoothly towards a specified target with precise position and orientation without any collisions between the manipulator's links, the target, and any obstacles along the path.With this in mind, the observations that are critical to the learning process are defined as where subscripts "T", "e", and "Ob" refer to the target, the endeffector, and Obstacle, respectively, d and ϑ are the distance and orientation difference between the EE and the target, respectively, to be described in details in Section 4. The actions that control the manipulator are the torque input at each joint: Assuming that the continuous system dynamics between states s and the actions a are discretized in the time domain at time step t i , we obtain: where u i is the control input applied to the system and w s,i is the noise representing the unmodeled system dynamics in the environment.
Applying RL algorithms, such as Q-learning, to a continuous system like robotic path planning is challenging.This challenge arises from the necessity of optimizing actions a i at each timestep i through a greedy policy.An alternative, the Deep Q-Network (DQN) algorithm (Silver et al., 2014) offers a solution by estimating the Q function with deep neural networks (DNNs) that take states and actions as inputs and return the expected cumulative reward as output.The DQN algorithm uses two distinct DNNs, the main network and target, which operate concurrently.DQN is able to learn value functions using in a stable and robust way due to two innovations (Lillicrap et al., 2015).The network is trained off-policy with samples from a replay buffer to minimize correlations between samples, and the network is trained with a target Q network to give consistent targets during temporal difference backups.This is also used in the DDPG algorithm.Moreover, the response to the input actions by the robotic manipulator is deterministic.Thus, instead of learning a set of probability distributions, a deterministic action can be directly learned from given states for the robotic manipulator system (Silver et al., 2014).Accordingly, the DDPG algorithm is used here to compute the policy gradient ∇J(θ).
The stability of policy gradient methods has been well-studied under the Lyapunov framework.For instance, Hejase and Ozguner, (2023) proved that the policy gradient is negative statistically, which is a crucial factor in ensuring the stability of DDPG methods from a statistical standpoint.
In the DDPG, the actor network deterministically returns in a continuous action space based on its interaction with the environment by the parameterized policy μ θ Q and is updated by the application of gradient ascent on the policy gradient, where N is the time steps in each episode, μ(θ) is a deterministic policy, and β(θ) is a behavior policy with different trajectories β θ (a | s) from the policy trajectories μ θ (a | s).The deterministic policy gradient can be calculated by trajectories that are sampled from a distinct behavior policy β θ (a | s) based on an Actor-Critic approach (Degris et al., 2012), Here, Q μ θ (s, a) is the Q-function that represents the expected discounted return calculated by starting from the ith time step, taking the state-action pair [s i , a i μ θ (s i )] and generating rewards based on the policy μ θ until the end of the episode.Notably, the Q-function depends exclusively on environment, meaning that it can be learned off policy by the Q-learning algorithm (Watkins and Peter, 1992).The parameter γ ∈ [0, 1] is a discount factor.If γ 1, the agent treats all future rewards equally, which is suitable for longterm planning.On the opposite, the agent is for short-term planning if γ ≪ 1.
The critic network evaluates the agent's performance by estimating the Q-value of state-action pairs (s i , a i ).The Q-function is approximated by the Bellman equation (Bellman, 1996).
and using Temporal difference, the Q-function can be calculated recursively over a parameter θ Q , to guide the next-generation of actions.
The critic parameters are updated by reducing the loss function, such that, To ensure the samples are independently distributed, a replay buffer is created to store the tuple (s i+1 , s i , a i , R i ) that are sampled from the environment according to the exploration policy (Lillicrap et al., 2015).At each time step, the actor and critic are updated by sampling a minibatch uniformly from the buffer.When the replay buffer is full the oldest samples are discarded.Because DDPG is an off-policy algorithm, the replay buffer can be large, allowing the algorithm to benefit from learning across a set of uncorrelated transitions.Directly implementing Q learning with neural networks proved to be unstable in many environments.However, for DDPG the target actor and critic parameters are soft updated, rather than directly copying the weights.The weights of these target networks are updated by having them slowly track the learned networks such that θ Q,i+1 τθ Q,i + (1 − τ)θ Q,i+1 and θ μ,i+1 τθ μ + (1 − τ)θ μ,i+1 to improve the Q-learning stability, where τ ≪ 1 is the smoothing factor (Lillicrap et al., 2015).This means that the target values are constrained to change slowly, greatly improving the stability of learning.This simple change moves the relatively unstable problem of learning the action-value function closer to the case of supervised learning.The pseudo-code that describes the detailed training process for the DDPG algorithm is shown in Table 3.

Actor-Critic networks
The actor and critic DNNs are created using standard feedforward neural networks, each one with two hidden layers.The first and second hidden layers have 400 and 300 neurons, respectively.The actor network inputs are the states of the system, and its outputs are the actions.However, the critic network inputs are the stateaction pairs, and its outputs are the action values.In this study, the robotic manipulator includes 6 links, i.e., n = 6.Thus, the dimensions of the state and actions are s ∈ S 32 , and a ∈ A 6 , respectively.

Reward function
As the most critical part of reinforcement learning, the reward function is directly responsible for how the algorithm trains and what objectives are achieved.With that in mind, an innovative reward function is developed to minimize the pose difference between the EE and the target as well as the sudden changes in joint velocities.The reward function (R t ) is evaluated at each time step, and the sum of these evaluations across all time steps constitutes the episode reward.
The pose difference is represented by the position error d and the orientation error ϑ.The primary part of the reward function must include: where K d and K ϑ are positive reward coefficients.As the pose difference is minimized, the reward R t is maximized.Notably this guidance term will ensure the agent is learning to align the EE and the target.
The position error is determined straightforward by: The orientation error of the EE can be defined either by a unit quaternion (ξ) or a set of Euler angles (α, β, γ).In the previous work as shown in Table 1, the expression of the orientation error is based on a set of Euler angles, such that, where (α e , β e , γ e ) and (α T , β T , γ T ) are the Euler angles of the EE and the target, respectively.It is worth pointing out that the orientation difference in such a form loses the information of rotation sequence of Euler angles.This increases the search space in the learning process and may even lead to divergence.Thus, the Euler angle representation is not used in the current work.Instead, an innovative reward function is built using the quaternion representation of orientations.
Let the orientations of the EE and the target be represented by two unit quaternions ξ E and ξ T , respectively, as shown in Figure 2.  Fill in the critic parameters by reducing L (loss function) Fill in the actor parameters by increasing the anticipated reward using J (gradient) Fill in the target actor/critic with where τ is the smoothing factor

End for FIGURE 2
Quaternion representation of orientation difference.

Frontiers in Control Engineering
frontiersin.org The orientation error of the EE with respect to the target can be expressed as (Campa and Camarillo, 2008): where ξ * e is the conjugate of the quaternion ξ e .Accordingly, an orientation difference between the EE and the target is determined by: ϑ 2 arccos Re ξ (24) As the EE approaches the desired pose, the pose error (d, ϑ) diminishes as well as the reward function in Eq. 20.To ensure the EE stays within the desired pose for capture, large rewards (r d , r ϑ ) are added to the reward function in Eq. 20 if d and ϑ are within predefined ranges, such that: where d 0 and ϑ 0 are pre-defined indicators for close proximity operation region for position and orientation, respectively.Once the end-effector enters into this region, large positive rewards are assigned to the reward function to encourage and accelerate the agent converging towards the desired pose.Furthermore, a large negative penalty (p l ) is added to the reward function in Eq. 25 to avoid self-collision between the robotic links and the target.If the distance between any two is less than a predefined safe distance, the penalty p l is added to the reward function, such that: where p l > 0 is a penalty for two links being too close to each other.
To facilitate obstacle avoidance, a substantial negative penalty (p Ob ) is incorporated into the reward function in Eq. 26.This penalty (p Ob ) is activated whenever the distance between the obstacle's CM and any link's CM falls below a predefined safety range: Figure 3 shows a simplified illustration of how the safety distance defined between any link and the obstacle is defined.
Finally, in order to ensure a smooth path and avoid sudden changes in motion, a velocity penalty is incorporated as a negative reward (p j ) into the reward function described in Eq. 27.This leads to the final reward function R t , where p j > 0 is a velocity penalty for smooth joint velocities by multiplying the squared sum of joint angular velocities.
In summary, the core aspect of RL lies in crafting a robust reward function.The adaptability of our approach shines as we can tailor the reward terms in the reward function to suit varied missions and tasks.This adaptability is evidently demonstrated by the above process to augment the reward function with multiple terms-comprising rewards or penalties-that correspond to the specific objectives at hand.This iterative process ensures our algorithm remains versatile and effective across diverse scenarios.
5 Simulation outcomes and analysis

Simulation environment
In this study, the proposed RL approach for motion planning of a free-floating space robotic manipulator is verified by simulation using MATLAB ® and Simulink ® .The physical parameters of the manipulator are given in Table 4. Table 5 provides the RL training hyperparameters and the initial conditions for all simulation cases.These simulations were performed on a PC equipped with an Intel ® Core ™ i7-10600K Processor and 32 GB of RAM.

Simulation results
Three different cases of path planning were conducted to demonstrate the capabilities of the proposed RL approach, each employing different reward functions.The first case used the reward function in Eq. 27 without velocity constraints.The second case used the reward function in Eq. 28 with the velocity constraints.The third case used the same reward function as the second case.However, it included noisy observations by adding white noise, thereby closely reflecting realistic sensor observations.

Case 1-approach target without velocity constraints
In this case, the free-floating space robotic manipulator has been trained by the reward function in Eq. 27 toward the desired pose (x, y, z, ξ) T .The parameters in this reward function are defined as k d 0.1, k ϑ 0.001, d 0 0.2m, ϑ 0 30 + , r d 1, r ϑ 2, p Ob 0.5, d Ob 0.3.These parameters are used in all simulation cases described below.
The training progress is shown in Figure 4, where the blue line is the cumulative rewards from all timesteps in each episode, and the red line is the moving average of the cumulative rewards of all episodes.These curves show that the agent learns to reach the desired pose by searching for the highest reward.The term (−K d d − K ϑ ϑ) in the reward function efficiently guides the EE towards the capture range within the first 5,000 episodes, while the term (−p l − p Ob ) in the reward function effectively prevents collisions of the manipulator with the obstacle, the target and itself.
Upon entering the capture range, the large rewards (r d + r ϑ ) in the reward function are activated to keep the EE within the capture range while simultaneously refining the EE's pose throughout the remaining timesteps to prepare the gripper for capture.The large variation of the episode reward reflects the activation and deactivation of the large rewards (r d + r ϑ ) in this process.Figure 5 shows several snapshots of the output path for the space robotic manipulator towards the target.
Figure 6A shows the pose error between the EE and the target over time with the left subfigure representing the distance error (d) and the right subfigure depicting the orientation difference (ϑ).The trained agent guides the EE towards the desired target pose range (d 0 , ϑ 0 ) within approximately 6 s.Subsequently, the EE stays within that region to achieve capture for the remaining duration of the simulation.It is worth mentioning that the threshold (d 0 0.20m) does not imply that the minimum distance error the trained agent can reach; instead, it signifies that crossing this threshold secures a large positive reward (r d ).With this large positive reward, the agent achieves a minimum distance error of Δd = 0.058 m and orientation error of Δθ = 4.85 °.
Figure 6B shows the time histories of actions (control torques) at all 6 joints as given by the agent.The agent effectively provides the necessary torques to achieve a valid path for the manipulator to move the EE towards the target, ensuring no self-collision occurs between manipulator links, and most importantly, avoiding the obstacle placed between the EE and the target.
It is worth pointing out that the RL hyperparameters play a pivotal role in governing various aspects of the learning process, including learning dynamics, exploration versus exploitation trade-offs, convergence, adaptation to changing environments,   RL training of case 1 for 10,000 episodes.
Frontiers in Control Engineering frontiersin.organd computational resource utilization.The training is greatly influenced by the learning rate, replay buffer and discount factor.The learning rate dictates the pace at which the RL agent updates its estimates based on new information, influencing both the speed and stability of learning.Similarly, a larger replay buffer enhances sample efficiency and stability by enabling the agent to learn from a diverse array of past experiences while mitigating overfitting risks.Moreover, the discount factor is essential for the agent's long-term decisionmaking, allowing it to balance immediate rewards with future gains.This factor significantly influences the convergence and stability of RL algorithms and aids in handling uncertainty and delayed rewards, thus facilitating effective exploration and exploitation strategies.Case 1 solution for space manipulator path reaching the target.

Case 2-approach target with velocity constraints
To achieve smoother motion and eliminate undesired sudden changes in the path, as shown in Figure 6A at the beginning of the path, often associated with high joint angular velocities, a velocity penalty term for all six joint velocities (−p j ( _ φ) 2 ) has been added to the reward function, as shown in Eq. 27.The parameters for this simulation case are the same as those in case 1, with the inclusion of an additional parameter of p j 0.005.
The training results are shown in Figures 7, 8. Figure 7 shows the convergence of the reward function.Figure 8A shows the pose error between the EE and the target over time, and Figure 8B shows the time histories of actions (control torques) for all 6 joints given by the agent.
In comparison to Figures 4, 7 shows that the training process is similar to Case 1.The training converges to the capture range in approximately 5,000 episodes.Upon entering the capture range, the large rewards (r d + r ϑ ) in the reward function are activated to keep the EE within the capture range while simultaneously refining the EE's pose throughout the remaining timesteps to prepare the gripper for capture.However, it is noted that in Case 2, the introduction of a velocity penalty term (−p j ( _ φ) 2 ) to the reward function results in a smoother path.Comparing Figure 8A with Figure 6A reveals an improvement in the convergence of the path in the first 3 s.Furthermore, the end-effector successfully converges towards the capture range.The high rewards (r d + r ϑ ) helps the agent staying in this region for future capture and servicing with a minimum  To further examine the effect of the velocity constraint term on agent training, the square sum of all six joint velocities ( ( _ φ) 2 ) along the path is shown in Figure 9.It is observed that the square sum of velocities obtained from the path in Case 2 (right plot) barely surpasses 80 in comparison to the 160 reached in Case 1 (left plot).These results clearly demonstrate the efficacy of modifying the reward function on the solution.Such findings open the door for future research to enhance path optimization by designing better reward functions, potentially incorporating torque penalties, energy consumption penalties, and the integration of optimal path constraints commonly used in optimization to RL.

Case 3-approach target with noisy observations
In real missions, assuming perfect observations of the state is unrealistic, particularly in estimating the target's pose using vision sensors like cameras.To evaluate the robustness of the proposed reward function in handling noisy observations, training of the agent was proceeded with the same reward function and parameters in Case 2, except introducing white noise to the observations that would be obtained from cameras in real mission.The noise is added to the target pose and obstacle position: p T , α T , β T , γ T , p Ob with a signal-to-noise ratio (SNR) of 10/1 to simulate realistic conditions: where R n ∈ [−0.1, 0.1] is a random number with Mean = 0 and Variance = 1.
The training outcome is shown in Figure 10.It is noted that the agent is able to obtain high rewards despite the presence of observational noise.Figure 11A shows the pose error between the EE and the target over time.Despite the presence of noise in the target position and orientation observations, the agent effectively manages to align the EE and the target for capture.The agent achieves a minimum distance error of Δd = 0.048 m and orientation difference of Δθ = 7.79 °.Notably, as the EE approaches the target, the absolute value of the noise decreases, imitating the characteristic behavior observed in real camera observations.The agent successfully overcomes the challenge of noisy inputs, which proves the effectiveness and robustness of the proposed solution.
Figure 11B shows the time histories of actions (control torques) for all six joints given by the agent.Observations reveal that the torques required for capture exhibit more oscillations compared to the previous solution in Figure 8.This increase in oscillation results from the noisy observations that add uncertainties during training.It is worth mentioning that the agent trained in Case 3 (noise on observations) still manages to achieve the desired task when implemented in a noise-free environment.Conversely, the agent trained in Case 2 (no noise on observations) fails its task in a noisy environment.This highlights the importance of RL training under realistic and challenging conditions.Exposing the agent to noise during training equips it with the ability to effectively achieve its task in real-world applications.

Conclusion
This study investigated the obstacle avoidance path planning problem for a free-floating space robotic manipulator using RL.The DDPG algorithm was used to train an agent to control a 6-DOF space robotic manipulator to achieve the desired goals.Specifically, it aims to find a feasible  path towards position and orientation alignment between the EE and the target, without any self-collation between manipulator links, while avoiding external obstacles along the path.The proposed method was verified using simulation with three different simulation cases-without joint velocity constraints, with joint velocity constraints, and with noisy observations.The results show the proper construction of a reward function is critical for training the agent using RL and affects the quality of the path.The inclusion of observation noise in the training will significantly enhance the robustness of the agent.The offline nature of the training in the RL approaches diminishes the need for immediate computational efficiency considerations.After the training phase concludes, the trained agent is encapsulated solely by a set of weights and biases, represented within neural networks.These parameters are lightweight and can be readily deployed on various hardware platforms without necessitating extensive computational resources.This characteristic is particularly advantageous in space missions where acquiring powerful computational resources is inherently challenging.

For
each timestep According to current state s, choose action a μ(s | θ μ ) + N. N is a random process exploration noise Perform action a, then detect reward r, and new state s′ Record the incident (s, a, r, s′) in replay buffer Select a small batch of incidents that occurred (s, a, r, s′) out of the buffer If the final timestep is reached, the target value function is stored y r Else store the target value function y r + γQ′(s′, μ′(s′ | θ μ′ ) | θ Q′ )

FIGURE 3
FIGURE 3Simplified illustration of obstacle avoidance safety distances.

FIGURE 5
FIGURE 5 FIGURE 6 (A) Position and orientation tracking errors vs. time.(B) Time histories of joint torques (actions) given by the agent.

FIGURE 7 RL
FIGURE 7RL training of case 2 for 10,000 episodes.

FIGURE 8 (
FIGURE 8 (A) Position and orientation tracking errors vs. time, (B) Time histories of joint torques (actions) given by agent.

FIGURE 9
FIGURE 9Square sum of all joint angular velocities in Case 1 (left) and Case 2 (right).

FIGURE 10 RL
FIGURE 10RL training of case 3 for 10,000 episodes.

TABLE 1
Summary of current research on RL-based path planning for space manipulators.

TABLE 2
Past space robotic capture missions.