^{1}Department of Engineering Cybernetics, Norwegian University of Science and Technology, Trondheim, Norway^{2}Mathematics and Cybernetics, SINTEF Digital, Trondheim, Norway^{3}School of Mechanical and Aerospace Engineering, Oklahoma State University, Stillwater, OK, United States

Control theory provides engineers with a multitude of tools to design controllers that manipulate the closed-loop behavior and stability of dynamical systems. These methods rely heavily on insights into the mathematical model governing the physical system. However, in complex systems, such as autonomous underwater vehicles performing the dual objective of path following and collision avoidance, decision making becomes nontrivial. We propose a solution using state-of-the-art Deep Reinforcement Learning (DRL) techniques to develop autonomous agents capable of achieving this hybrid objective without having a priori knowledge about the goal or the environment. Our results demonstrate the viability of DRL in path following and avoiding collisions towards achieving human-level decision making in autonomous vehicle systems within extreme obstacle configurations.

## 1 Introduction

Autonomous Underwater Vehicles (AUVs) are used in many subsea commercial applications such as seafloor mapping, inspection of pipelines and subsea structures, ocean exploration, environmental monitoring, and various research operations. The wide range of operational contexts implies that truly autonomous vehicles must be able to follow spatial trajectories (path following), avoid collisions along these trajectories (collision avoidance), and maintain a desired velocity profile (velocity control). In addition, AUVs are often underactuated by the fact that they operate with three generalized actuators (propeller, elevation, and rudder fins) in six degrees of freedom (6-DOF) (Fossen, 2011). The complexity that arises when combining the control objectives, a complicated hydrodynamic environment and disturbances, and the physical design with three generalized actuators spurs an intriguing control challenge. The current work is an attempt to address these challenges. Since path following and collision avoidance are the two main challenges addressed in this paper, a brief overview of the state of the art is provided in the following subsections.

### 1.1 Path Following

The path-planning and path-following problems are heavily researched and documented in classical control literature. The control objective is to plan a collision-free optimal path, defined relative to some inertial frame, and minimize tracking errors, i.e., the distance between the vehicle and the path. For road vehicles, several path/motion planning strategies are demonstrated (Karaman and Frazzoli, 2011). In their recent studies, Ljungqvist et al. (2019) and Cirillo (2017) demonstrated the use of the Lattice-based method for path/motion planning. In both works where the focus was path planning, the algorithms developed were tested on real vehicles. Pivtoraiko et al. (2009) contributed with a principled mechanism to construct an efficient, precise, and deferentially constrained search space upon which any planner might operate. However, most of these works were limited to fully actuated surface vehicles. Three-dimensional (3D) path following involves tracking errors that are composed of horizontal and vertical components and forms an accurate representation of the real engineering operations for AUVs (Chu and Zhu, 2015). Typically, a variant of the Proportional Integral Derivative (PID) controller based on reduced order models (ROM) is used to control elevator and rudder to eliminate tracking errors (Fossen, 2011, ch. 12).

More advanced approaches are also available. A classical nonlinear approach is found in Encarnacao and Pascoal (2000), where a kinematic controller was designed based on Lyapunov theory and integrator backstepping. To extend the nonlinear approach reliably in the presence of disturbances and parametric uncertainties, Chu and Zhu (2015) proposed an adaptive sliding mode controller, where an adaptive control law is implemented using a radial basis function neural network. To alleviate chattering, a well-known “zig-zag” phenomenon occurring when implementing sliding mode controllers due to a finite sampling time, an adaptation rate was selected based on the so-called minimum disturbance estimate. Xiang et al. (2017) proposed fuzzy logic for adaptive tuning of a feedback linearization PID controller. The heuristic, adaptive scheme accounts for modeling errors and time-varying disturbances. They also compare the performance on 3D path following with conventional PID and nonadaptive backstepping-based controllers, both tuned with inaccurate and accurate model parameters, to demonstrate the robust performance of the suggested controller. Liang et al. (2018) suggested using fuzzy backstepping sliding mode control to tackle the control problem. Here, the fuzzy logic was used to approximate terms for the nonlinear uncertainties and disturbances, specifically for use in the update laws for the controller design parameters.

Many other methods exist, but most published works on the 3D path-following problem incorporate either fuzzy logic, variants of PID control, backstepping techniques, or any combination thereof. More recently, there have been numerous attempts to achieve path following and motion control for AUVs by applying machine learning (ML) directly to low-level control. Specifically, Deep Reinforcement Learning (DRL) seems to be the favored approach. DRL controllers are based on experience gained from self-play or exploration, using algorithms that can learn to execute tasks by reinforcing good actions based on a performance metric. Yu et al. (2017) used a DRL algorithm known as Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al., 2015) to obtain a controller that outperformed PID on trajectory tracking for AUVs. A DRL controller for underactuated marine vessels was implemented in Martinsen and Lekkas (2018a) to achieve path following for straight-line paths and later in Martinsen and Lekkas (2018b) for curved paths using transfer learning from the first study. The DRL controller demonstrated excellent performance, even compared to traditional Line-of-Sight (LOS) guidance. More recently, Meyer et al. (2020b) demonstrated that a DRL controller can achieve very impressive results in achieving the combined objective of path following and collision avoidance with a complex layout of stationary obstacles. Exciting results validating the real-world applications of DRL controllers for AUVs and unmanned surface vehicles can be found in Carlucho et al. (2018) and Woo et al. (2019). The first paper implemented the controller on an AUV equipped with six thrusters configured to generate actuation in pitch moment, yaw moment, and surge force. They demonstrated velocity control in both linear and angular velocities. The latter paper implemented a DRL controller on an unmanned surface vehicle with path following as the control objective and presented impressive experimental results from the full-scale test. Common to all these studies is that all the potential of DRL in path following is demonstrated in a 2D context only.

### 1.2 Collision Avoidance

Collision Avoidance (COLAV) system is an important part of the control systems for all types of autonomous vehicles. AUVs are costly to produce and typically equipped with expensive gears as well. Therefore, maximum efforts must be made to ensure their safe movements at all times. COLAV systems must be able to do *obstacle detection* using sensor data and information processing and *obstacle avoidance* by applying steering commands based on detection and avoidance logic. The two fundamental perspectives of COLAV control architectures are described in the literature: *deliberate* and *reactive* (Tan, 2006).

Deliberate architectures are plan driven and therefore necessitates a priori information about the environment and terrain. It could be integrated as part of the on-board guidance system (McGann et al., 2008), or at an even higher level in the control architecture, such as a waypoint planner (Ataei and Yousefi-Koma, 2015). Popular methods to solve the path-planning problem include A* algorithms (Carroll et al., 1992; Garau et al., 2005), genetic algorithms (Sugihara and Yuh, 1996), and probabilistic roadmaps (Kavraki et al., 1996; Cashmore et al., 2014). Deliberate methods are computationally expensive, due to information processing about the global environment. However, they are more likely to make the vehicle converge to the objective (Eriksen et al., 2016). Reactive methods on the other hand are faster and process only real-time sensor data to make decisions. In this sense, the reactive methods are considered local and are used when rapid action is required. Examples of reactive methods are the dynamic window approach (Fox et al., 1997; Eriksen et al., 2016), artificial potential fields (Williams et al., 1990), and constant avoidance angle (Wiig et al., 2018). A potential pitfall with reactive methods is local minima manifested as dead ends (Eriksen et al., 2016).

To improve both the deliberate and the reactive approach, a *hybrid* approach is used in practice by combining the strengths of both. Such architectures are comprised of deliberate, reactive, and execution layers. The deliberate layer handles high-level planning, while the reactive layer tackles incidents happening in real time. The execution layer facilitates the interaction between the deliberate and reactive architectures and decides the final commanded steering (Tan, 2006). The hybrid approach is demonstrated in Meyer et al. (2020a) where a DRL agent trained in a purely synthetic environment could achieve the combined objective of path following and collision avoidance with real sea traffic data (moving obstacles) in the Trondheim fjord while complying with collision avoidance regulations. There are still challenges in state-of-the-art COLAV methods for vehicles subjected to nonholonomic constraints, such as AUVs. Instability issues, neglecting vehicle dynamics and actuator constraints leading to infeasible reference paths, and algorithms causing the vehicle to stop are recurring challenges seen in the literature. Additionally, extensive research discusses methods for COLAV in 2D that cannot be directly applied to 3D. In many cases where such methods are adapted to 3D, however, they do not optimally take advantage of the extra dimension (Wiig et al., 2018).

### 1.3 Research Goals and Methods

The current study is aimed at the following:

• Figuring out if it is possible to tame an underactuated AUV with 6-DOF to achieve the combined objective of path following and collision avoidance in 3D using DRL. It is hoped that this will provide additional insight into the dynamical system from a new perspective.

• Exploring the potential of curriculum learning (Bengio et al., 2009) in the context of training AUVs. The basic idea behind curriculum learning is to start small, learn easier aspects of the task or easier subtasks, and then gradually increase the difficulty level.

In order to achieve the research goals, we employ a DRL controller as the motion control system operating the control fins of the AUV, to learn the control law through exploration. As the DRL algorithm, we have used the Policy Proximal Optimization (PPO), proposed by Schulman et al. (2017). The agent commands the control fins, while a traditional PI-controller maintains a desired cruise speed. The key idea lies in the fact that the agent learns to operate both the elevator and rudder at the same time and should therefore be able to learn an optimal strategy for navigating in both planes. The challenge of DRL control is establishing a reward function such that safe and effective tracking behavior is incentivized. To implement curriculum learning, scenarios ranging from *beginner* to *expert* level difficulty are constructed. Initially, the agent starts with only a path without any obstacle or ocean current disturbances and trains until it masters that difficulty level. Then, obstacles are progressively added and eventually ocean current disturbances are introduced to form the expert level scenario. The scenarios are detailed in Section 3.1. In a COLAV sense, the predefined path can be viewed as the deliberate architecture, where it is assumed that the waypoints are generated by some path-planning scheme, and the random and unforeseen obstacles are placed on this presumed collision-free path. The DRL agent operates in effect as the reactive system that must handle the threat of collisions rapidly but at the same time chooses effective trajectories to reach the target.

The article is organized as follows: In Section 2, the background theory on AUV modeling, the path-generation, and DRL is given. The implementation of the simulation model and the utilized DRL algorithm are briefly described in Section 3. Section 4 presents the results followed by conclusions and proposed future work in Section 5.

## 2 Theory

### 2.1 AUV Model

This section introduces a dynamic model that can be used to accurately simulate an AUV in a hydrodynamic environment. This is done by using a 6-DOF maneuvering model which is represented by 12 highly coupled and nonlinear first-order ordinary differential equations (ODEs). Dynamic models for AUVs comprise a *kinematic* (Section 2.1.2) and a *kinetic* (Section 2.1.3) part. Kinematics represent the geometrical evolution of the vehicle and involves a coordinate transformation between two important reference frames. Kinetics considers the forces and moments causing vehicle motion. The kinetic analysis is typically important when designing motion control systems because actuation can only be achieved by applying control forces and moments. Before delving into the details of the kinematic and kinetic equations, the notation used to detail the model’s states and parameters are presented in Table 1. This notation is used by the Society of Naval Architects and Marine Engineers (SNAME (1950)) (Fossen, 2011, p. 16).

##### 2.1.1 Reference Frames

Two reference frames are especially important in the modeling of vehicle dynamics. The North-East-Down (NED) frame denoted {n} and the body frame denoted {b}. The NED coordinate system is considered to be inertial, with principal axis pointing towards true north, east, and downwards—normal to Earth’s surface—for the

where

The Euler angles describing the vehicle’s attitude is contained in

##### 2.1.2 Kinematic Equations

The kinematic state vector is the concatenation of the position of the vehicle in NED coordinates and the vehicle’s attitude with respect to the NED frame. This vector is symbolized by

where the body-fixed velocity vector is defined as

To write a differential equation for the whole kinematic state vector, an equation describing the time-evolution of the Euler angles is obtained by transforming the linear velocities expressed in {b}, according to Eq. 3. Note that this transformation is not well defined for

where

##### 2.1.3 Kinetic Equations

The kinetic equations of motion for a marine craft can be expressed as a mass-spring-damper system. The mass terms naturally stem from vessel body, while the spring forces acting on the body arise from buoyancy. The damping is a result of the hydrodynamic forces caused by motion. The model implemented is adapted from da Silva et al. (2007) and all model parameters can be seen in Table 2. The AUV specifications on which the model parameters are based are given by the tables given in the appendix. Furthermore, it is based on the following assumptions:

• The AUV operates at a depth below disturbances from wind and waves.

• The maximum speed is

• The moment of inertia can be approximated by that of a spheroid.

• The AUV is passively stabilized in roll and pitch by placing the CM a distance

• The AUV shape is top-bottom and port-starboard symmetric.

• As a fail-safe mechanism, the AUV is slightly buoyant.

The vessel’s motion is governed by the nonlinear kinetic equations expressed in {b} according to Equation 5:

where

##### 2.1.3.1 *Mass Forces*

The systems inertia matrix,

##### 2.1.3.2 *Coriolis Forces*

Naturally, the added mass will also affect the Coriolis-centripetal matrix,

##### 2.1.3.3 *Damping Forces*

The components of hydrodynamic damping modeled are linear viscous damping, nonlinear (quadratic) damping due to vortex shedding, and lift forces from the body and control fins. Thus, the damping matrix,

The linear damping is given by

The nonlinear damping is given by

Finally, the lift is given by

##### 2.1.3.4 *Restoring Forces*

The restoring forces working on the AUV body are functions of the orientation, weight, and buoyancy of the vehicle. Because the vehicle is assumed to be slightly buoyant and the passive stabilization of roll and pitch, the restoring force vector can be written as

##### 2.1.3.5 *Control Inputs*

There are three control inputs: propeller thrust, rudder, and elevator fins denoted as

**TABLE 3**. Specifications for simulated AUV adapted from da Silva et al. (2007).

This completes the details of the model implemented. The numerical values used in the simulation can be found in Table 2. For a complete derivation of the model and how the numerical values are obtained from the specifications and assumptions, da Silva et al. (2007) and Fossen (2011) are referred to for extensive explanations.

##### 2.1.4 Simulation Model for Ocean Current

To simulate the environmental disturbances in the form of ocean currents, a 3D irrotational ocean current model is implemented. The model is based on generating the intensity of the current, *Gauss-Markov Process* (Fossen, 2011, Ch. 8):

where *w* is *white noise* and

There are no dynamics associated with the sideslip angle and the angle of attack in the simulations. The current direction stays fixed throughout an episode. To obtain the linear velocities in the body frame, we apply the inverse Euler-angle rotation matrix, as seen in Eq. 13:

Since the ocean current is defined to be irrotational, the full current velocity vector is written

##### 2.1.5 Control Fin Dynamics

To simulate the operation of the control fins more realistically, the output of the controller passes through a first-order low-pass filter with time constant

where the filter parameter *a* is related to the time constant by

### 2.2 3D Path Following

In this section, the path-following problem is formally introduced. A set of *s*, such that *m* has a Euclidian distance

And the coefficients can be found by solving the following matrix equations:

A path represented by

and

Note that the first and the last polynomial are not overlapping any of the others; hence, the membership functions can be thought of as equal to one and zero in these regions. In the intermediate waypoints, the polynomials are blended smoothly by linearly increasing and decreasing the contribution of the two polynomials. In the paper by Chang and Huh (2015), they go on to prove

##### 2.2.1 Guidance Laws for 3D Path Following

To define the tracking errors, the Serret-Frenet ({SF}) reference frame associated with each point of the path is introduced. The

To achieve path following, it is important to align the velocity vector of the vessel in *n*,

where

where *e* and *h* to zero will in turn drive the correction angles

### 2.3 Deep Reinforcement Learning

In RL, an algorithm, an **agent**, makes an **observation****environment** and performs an **action***T*, we say that the problem is episodic and that each iteration through the task is an episode.

After performing an action, the agent receives a scalar **reward** signal **policy**

##### 2.3.1 Proximal Policy Optimization

The actor-critic algorithm known as Proximal Policy Optimization was proposed by Schulman et al. (2017). We briefly present the general theory and the algorithm which is used in this research. Let the value function *a*. Then the advantage function

The advantage function represents the difference in expected return by taking action *a* in state *s*, as opposed to the following policy. Because both

Here, *T* is a truncation point which is typically much smaller than the duration of an entire episode. As before, γ is the discount factor. As the GAE is a sum of uncertain terms, the tunable parameter

The second key component in PPO is introducing a surrogate objective. It is hard to apply gradient ascent directly to the RL objective. Therefore, Schulman et al. suggest a surrogate objective such that an increase in the surrogate provably leads to an increase in the original objective Schulman et al. (2017). The proposed surrogate objective function is given by Equation 27.

The tuning parameter *N* actors (parallelized agents) are enabled to execute the policy and in that way sample trajectories for *T* timesteps. Then the GAE is computed based on the sampled trajectories, and the advantage estimation is used to optimize the surrogate objective for K epochs using minibatches of size M per update. The PPO method is seen in its general form in Algorithm 1 (Schulman et al., 2017).

Algorithm 1: Proximal Policy Optimization, actor-critic style

**for** iteration: 1,2... **do**

**for** actor: 1,2...*N***do**

Run policy

Compute advantage estimate

**end**

Optimize surrogate L w.r.t. *M* < *NT*

**end**

## 3 Method and Implementation

The simulation environments are built to comply with the **OpenAI Gym** (Brockman et al., 2016) standard interface. For the RL algorithms, the **Stable Baselines** package which offers improved parallelizable implementations based on **OpenAI Baselines** (Dhariwal et al., 2017) library is utilized. The complete code can be found on Github^{1}. Ten different scenarios are created: five for training and five for testing.

### 3.1 Environment Scenarios

Training scenarios are constructed by generating a path from a random set of *beginner*, where only a path and no obstacles or ocean current is present. Then the agent is introduced to the *intermediate* level, where a single obstacle is placed on the half-way mark. The next level is called *proficient*. Here, two more obstacles are placed equally distanced from the half-way mark.

The last part of training happens in the *advanced* and *expert* level scenarios. In the advanced level difficulty, we generate the *proficient* challenge, but additionally five more obstacles are placed randomly off-path, such that an avoidance maneuver could induce a new collision course. The distinction between the expert and the advanced level is the inclusion of the ocean current disturbance. In all scenarios, the first and the last third of the path are collision-free, in order to keep part of the curriculum from the beginner scenario (pure path following) present throughout the learning process. This enables the agent to not forget knowledge learned from doing path following only. Figure 1 illustrates the different training contexts the agent is exposed to. In addition to training, the agent progressively through these scenarios, quantitative evaluation is performed by sampling a number of episodes such that the agents’ performance across the various difficulty levels can be established. After evaluating the controllers by statistical averages, qualitative analysis is done in designated test scenarios. These are designed to test specific aspects of the agents’ behavior. The first scenario tests a pure path following on a nonrandom path (in order for results to be reproducible) both with and without the presence of an ocean current. Next, special (extreme) cases where it would be preferable to use only one actuator for COLAV, i.e., horizontally and vertically stacked obstacles, are generated. The agents are also tested in a typical pitfall scenario for reactive COLAV algorithms: a dead end. See Section 4.2 for illustrations of the test scenarios.

### 3.2 Obstacle Detection

Being able to react to the unforeseen obstacles requires the AUV to perceive the environment through sensory inputs. This perception, or obstacle detection, is simulated by providing the agent a 2D sonar image, representing distance measurements to a potential intersecting object in front of the AUV. This setup emulates a Forward Looking Sonar (FLS) mounted on the front of the AUV. A 3D rendering of the FLS simulation is seen in Figure 2. The specific sensor suite, the sonar range, and the sonar apex angle are configurable and can therefore be thought of as hyperparameters.

Depending on the sensor suite of choice, the number of sensor rays can get quite large. It is also notable that this issue is exponentially larger in 3D compared to 2D, slowing the simulation speed significantly as searching through the sonar rays (line search) is computationally expensive. For this reason, the sensor suite used in this research is 15 by 15, providing a grid with

### 3.3 Reward Function

Reward function design is a crucial part of any RL process. The goal is to establish an incentive so the agent learns certain behavioral aspects. This is done by trying to imitate human-like behavior. For instance, following the path is objectively desirable, but this goal must be suspended in the case of a potential collision. When (and by what safety margin) to react is inherently a subjective choice. Regulating this trade-off is a balancing act, where following the path notoriously would result in many collisions and being too cautious would be ineffective. Additionally, a configuration involving excessive roll, i.e., the angular displacement of the AUV around its own longitudinal axis, is undesirable because that implies inverting or even swapping the two actuators’ effect (the rudder would operate as the elevator and vise versa) in terms of combating course and elevation errors. Not using the actuators too aggressively is therefore key to achieving smooth and safe operation. Thus, a reward function incorporating these important aspects of AUV motion control is developed.

The first part focuses on path following and simply penalizes errors between desired and actual course and elevation angle, as given by Equation 28:

where *obstacle closeness*. This quantity is written as *i*’th and *j*’th pixel distance measurement and

Since the vessel-relative orientation of an obstacle determines whether a collision is likely, the penalty related to a specific closeness measurement is scaled by an orientation factor dependent on the relative orientation. The vessel-relative scaling factor is written as

**Figure 3**. How the reward is scaled according to the sonar-data’s vessel-relative direction. Note that the grid illustrated is much finer than the 15 by 15 sensor suite used during simulation.

To find the right balance between penalizing being off-track and avoiding obstacles—which are competing objectives—the weight parameter

### 3.4 Feedback/Observations

The list of state observations, referring to the states of the dynamical model, and the agents' inputs during training and in operation is seen in Table 4. The inputs are normalized by the true or the empirical maximum, so that values passed into the neural network are in the range *o* to indicate that these are the actual values passed as observations. The nonlinear activation functions of neural networks tend to saturate if the inputs get too large; hence, normalization is a means used to counteract this effect. Furthermore, large input values might lead to huge error gradients, which in turn causes unstable training. Normalization is therefore a simple form of preprocessing contributing to faster and more stable training (Yann LeCun et al., 1998).

**TABLE 4**. Observation table for end-to-end training for path following. All states and errors are normalized by the empirical or true maximum value.

In addition to the state observations, the neural network inputs a flattened 2D sonar image measuring closeness. It is possible to pass the sonar image directly through the neural network, essentially learning to map raw sensor data to control action. By the fact that neural networks are capable of representing any continuous nonlinear function, this should be feasible in theory (Nielsen, 2015). However, as this requires a high-dimensional observation space, a larger neural network is needed to learn a control law. In turn, a larger neural net requires more data and more updates to converge, prolonging an already time-consuming process. To address this issue, dimensionality reduction is performed by max pooling the raw closeness image from

For the neural networks, we utilize the *MLP-Policy* (multilayer perceptron) provided by Stable Baselines which incorporates a fully connected, two hidden-layer neural networks with 64 neurons in each layer using hyperbolic tangents

## 4 Results and Discussions

This section covers the results obtained from applying the finalized DRL controllers in the various scenarios introduced in Section 3.1. Firstly, test reports from quantitative tests, which are obtained by running the simulation for a large sample of episodes and calculating statistical averages, are given. In light of these results, the behavioral aspects can be interpolated to visualize and pinpoint clearer trends. Secondly, the reports from testing the controllers in special-purpose scenarios are shown and analyzed to qualify if the agents have indeed learned to operate the AUV intelligently. Three values for the trade-off parameter

### 4.1 Quantitative Results

The quantitative results are obtained by running each training scenario, configured randomly in each episode, for

**Figure 4**. Curve-fitted data from Table 4. The average tracking error and the collision are fitted to exponential functions, while the success rate is fitted to a quadratic polynomial.

### 4.2 Qualitative Results

In the qualitative tests, four different scenarios (Section 3.1) are set up in order to test different behavioral aspects of the controllers. The first test sees the controllers tackle a pure path-following test, both with and without the presence of an ocean current. Figure 5 plots the results of simulating one episode.

**Figure 5**. The pure path-following test. As expected, higher

For

A recurring problem seen when applying purely reactive algorithms is getting trapped in local minima, which in a practical sense materialize as dead ends. Therefore, the last test investigates if the agents have acquired the intelligence to solve a local minima trap in the form of a dead-end challenge. In addition, this can affirm the robustness and generality learned by the agents, as this is a completely novel situation. The obstacles are configured as a half sphere with radius 20 m. This means that the agent will sense the dead end

**Figure 6**. A dead-end test, where the obstacles are configuration as a half sphere with a radius of

In the next test, we dissect if the agents learned to operate the actuators effectively according to how the obstacles are posed. In extreme cases, obstacles would be stacked horizontally and vertically, and optimally no control energy should be spent on taking the AUV towards “the long way around.” Instead, it should use the actuator in order to avoid the path on the lateral side of the stacking direction. Surely, an intelligent pilot would pass the obstacles in this manner. From the plots (Figure 7), it is observed that all agents waste little control energy using the “opposite” control fin. The agent with

**Figure 7**. The horizontal and vertical obstacle test. Here, we are interested in seeing if the agent has learned which actuator to use to avoid the obstacles.

The results obtained from the test scenarios demonstrate a clear connection to the reward function, as intended. In a pure path-following test, the agent biased towards path following manages to track the path with great precision. On the other hand, regulating the trade-off closer to COLAV yields agents that are willing to go further off track to find safe trajectories. This is reflected in the average tracking error and in the collision rate. Furthermore, it is seen that the latter controllers seem to react by spending less aggressive control. The controller tuned with

A current limitation in the simulated setup is the assumption that all states, including the ocean current, is available for feedback. We have therefore omitted the *navigation* part of the classical feedback loop for marine crafts. In a full-scale test, state estimation and sensor noise would naturally be part of the feedback loop, necessitating the need for a navigation module.

## 5 Conclusion and Future Work

In this research, DRL agents were trained using state-of-the-art RL algorithm PPO and deployed to tackle the hybrid objective of 3D path following and collision avoidance by an AUV. A *curriculum learning* approach was utilized to train the agent with increasing levels of complexities starting with path following, followed by the introduction of complexities in the obstacle layouts and ultimately the introduction of ocean currents. The AUV was operated by commanding three actuator signals in the form of propeller thrust, rudder, and elevator fin angles. A PI-controller maintained a desired cruise speed, while the DRL agent operated the control fins. The agent made decisions based on the observation of the state variables of the dynamical model, control errors, the disturbances, and sensory inputs from an FLS. The main conclusions are as follows:

• It was observed that agents biased towards path following achieved the objective with an average error of

• Quantitative evaluation was performed using statistical averages by sampling

• A reward system based on quadratic penalizations was designed to incentivize the agent to follow the path and also was willing to deviate if further on-path progress was unsafe. In addition, avoiding excessive roll and use of control actuation was avoided by penalizing such behavior. As path following and avoiding collisions are competing objectives, the agent must trade off one for the others in order to achieve a successful outcome in an episode. Since this trade-off is nontrivial, a regulating parameter

From the current studies, it is clear that DLR using curriculum learning can be an effective approach to taming an underactuated AUV with 6-DOF to achieve the combined objective of path following and collision avoidance in 3D. However, it is also important to stress that despite the demonstrated potential of the DRL approach holds, it will have very limited acceptability in safety-critical applications because the whole learning process happens in a black-box way, thereby lacking its explainability and analysability. Part of this black-box nature is attributed to the deep neural network that is at the heart of DRL because they lack functional expressibility. To address this issue, the learning of the trained agent can be put in the form of equations using symbolic regression. The symbolic regression based on gene expression programming has been demonstrated to discover new physics and equations directly from sparse data Vaddireddy et al. (2020). This will enable stability analysis of the system to make them more reliable but this kind of research is in its infancy at the moment.

## Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

## Author Contributions

SH developed the software framework facilitating this research and is the lead author. AR and OS supervised the research and provided guidance throughout the process, as well as proofreading.

## Conflict of Interest

AR is employed by SINTEF Digital.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer ML declared a shared affiliation, with no collaboration, with the authors SH and AR to the handling editor at the time of the review.

## Footnotes

^{1}Link: https://github.com/simentha/gym-auv

## References

Ataei, M., and Yousefi-Koma, A. (2015). Three-dimensional optimal path planning for waypoint guidance of an autonomous underwater vehicle. *Robot. Autonom. Syst.* 67, 23–32. doi:10.1016/j.robot.2014.10.007

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). “Curriculum learning,” in *Proceedings of the 26th annual international conference on machine learning.* (New York, NY: Association for Computing Machinery)), 41–48. doi:10.1145/1553374.1553380

Breivik, M., and Fossen, T. I. (2009). “Guidance laws for autonomous underwater vehicles,” in *Underwater vehicles*, Editor. A. V. Inzartsev (Rijeka: IntechOpen). doi:10.5772/6696

Brockman, Greg & Cheung, Vicki & Pettersson, Ludwig & Schneider, Jonas & Schulman, John & Tang, Jie & Zaremba, Wojciech. (2016). OpenAI Gym.

Carlucho, I., Paula, M. D., Wang, S., Petillot, Y., and Acosta, G. G. (2018). Adaptive low-level control of autonomous underwater vehicles using deep reinforcement learning. *Robot. Autonom. Syst.* 107, 71–86. doi:10.1016/j.robot.2018.05.016

Carroll, K. P., McClaran, S. R., Nelson, E. L., Barnett, D. M., Friesen, D. K., and William, G. N. (1992). “Auv path planning: an a* approach to path planning with consideration of variable vehicle speeds and multiple, overlapping, time-dependent exclusion zones,” in *Proceedings of the 1992 symposium on autonomous underwater vehicle technology.* (New York, NY, United States: IEEE), 79–84. doi:10.1109/AUV.1992.225191

Cashmore, M., Fox, M., Larkworthy, T., Long, D., and Magazzeni, D. (2014). “Auv mission control via temporal planning,” in 2014 IEEE international conference on Robotics and automation (ICRA), Hong Kong, China, June 7, 2014 (New York, NY, United States: IEEE), 6535–6541. doi:10.1109/ICRA.2014.6907823

Chang, S.-R., and Huh, U.-Y. (2015). Curvature-continuous 3d path-planning using qpmi method. *Int. J. Adv. Rob. Syst.* 12, 76. doi:10.5772/60718

Chu, Z., and Zhu, D. (2015). 3d path-following control for autonomous underwater vehicle based on adaptive backstepping sliding mode. 2015 IEEE international conference on information and automation, Lijiang, China, August 8, 2015 (New York, NY, United States: IEEE), 1143–1147. doi:10.1109/ICInfA.2015.7279458

Cirillo, M. (2017). “From videogames to autonomous trucks: a new algorithm for lattice-based motion planning,” in 2017 IEEE intelligent vehicles symposium, Redondo Beach, CA, August 8, 2015 (New York, NY, United States: IEEE), 148–153.

da Silva, Jorge Estrela, et al. (2007). “Modeling and simulation of the LAUV autonomous underwater vehicle.” 13th IEEE IFAC International Conference on Methods and Models in Automation and Robotics. Szczecin, Poland Szczecin, Poland, 2007.

Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., et al. (2017). Openai baselines. *GitHub repository*.

Encarnacao, P., and Pascoal, A. (2000). 3d path following for autonomous underwater vehicle. *Proceedings of the 39th IEEE conference on decision and control (cat. No.00CH37187)* 3, 2977–2982. doi:10.1109/CDC.2000.914272

Eriksen, B. H., Breivik, M., Pettersen, K. Y., and Wiig, M. S. (2016). “A modified dynamic window algorithm for horizontal collision avoidance for auvs,” in 2016 IEEE conference on control applications, Buenos Aires, Argentina, September 19–22, 2014 (New York, NY, United States: IEEE), 499–506. doi:10.1109/CCA.2016.7587879

Fossen, T. (2011). *Handbook of Marine Craft Hydrodynamics and Motion Control*. New Jersey, NJ: John Wiley & Sons.

Fox, D., Burgard, W., and Thrun, S. (1997). The dynamic window approach to collision avoidance. *IEEE Robot. Autom. Mag.* 4, 23–33. doi:10.1109/100.580977

Garau, B., Alvarez, A., and Oliver, G. (2005). “Path planning of autonomous underwater vehicles in current fields with complex spatial variability: an a* approach,” in Proceedings of the 2005 IEEE international Conference on Robotics and automation, Barcelona, Spain, April 18–22, 2005 (New York, NY, United States: IEEE), 194–198. doi:10.1109/ROBOT.2005.1570118

Karaman, S., and Frazzoli, E. (2011). Sampling-based algorithms for optimal motion planning. *Int. J. Robot Res.* 30, 846–894. doi:10.1177/02F0278364911406761

Kavraki, L. E., Svestka, P., Latombe, J. ., and Overmars, M. H. (1996). Probabilistic roadmaps for path planning in high-dimensional configuration spaces. *IEEE Trans. Robot. Autom.* 12, 566–580. doi:10.1109/70.508439

Liang, X., Qu, X., Wan, L., and Ma, Q. (2018). Three-dimensional path following of an underactuated auv based on fuzzy backstepping sliding mode control. *Int. J. Fuzzy Syst.* 20, 640–649. doi:10.1007/s40815-017-0386-y

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N. M. O., Erez, T., Tassa, Y., et al. (2015). Continuous control with deep reinforcement learning. *CoRR abs*. 1509, 02971.

Ljungqvist, O., Evestedt, N., Axehill, D., Cirillo, M., and Pettersson, H. (2019). A path planning and path-following control framework for a general 2-trailer with a car-like tractor. *J. Field Robot.* 36, 1345–1377. doi:10.1002/rob.21908

Martinsen, A. B., and Lekkas, A. M. (2018a). “Curved path following with deep reinforcement learning: results from three vessel models,” in OCEANS 2018 MTS/IEEE charleston, Charleston, SC, October 22–25, 2018 (New York, NY, United States: IEEE), 1–8. doi:10.1109/OCEANS.2018.8604829

Martinsen, A. B., and Lekkas, A. M. (2018b). Straight-path following for underactuated marine vessels using deep reinforcement learning. *IFAC-PapersOnLine* 51, 329–334. doi:10.1016/j.ifacol.2018.09.502

McGann, C., Py, F., Rajan, K., Thomas, H., Henthorn, R., and McEwen, R. (2008). A deliberative architecture for auv control. 2008 IEEE international conference on robotics and automation, Pasadena, CA, May 19–23, 2018 (New York, NY, United States: IEEE) 1049–1054. doi:10.1109/ROBOT.2008.4543343

Meyer, E., Heiberg, A., Rasheed, A., and San, O. (2020a). COLREG-compliant collision avoidance for unmanned surface vehicle using deep reinforcement learning. *IEEE Access* 8, 165344–165364. doi:10.1109/ACCESS.2020.3022600

Meyer, E., Robinson, H., Rasheed, A., and San, O. (2020b). Taming an autonomous surface vehicle for path following and collision avoidance using deep reinforcement learning. *IEEE Access* 8, 41466–41481.

Nielsen, M. A. (2015). *Neural networks and deep learning*. Determination Press. http://neuralnetworksanddeeplearning.com/

Pivtoraiko, M., Knepper, R. A., and Kelly, A. (2009). Differentially constrained mobile robot motion planning in state lattices. *J. Field Robot.* 26, 308–333. doi:10.1002/rob

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2015). *High-Dimensional Continuous Control Using Generalized Advantage Estimation*.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. *CoRR abs* 1707, 06347.

Sugihara, K., and Yuh, J. (1996). “Ga-based motion planning for underwater robotic vehicles,” in Proc. 10th international symp. on unmanned untethered submersible technology, Lee, NH, September 7–10, 1996 (Lee, NH, United States: Autonomous Undersea Systems Institute). 406–415.

Tan, C. S. (2006). A Collision avoidance system for autonomous underwater vehicles. PhD thesis. Plymouth (United Kingdom): University of Plymouth.

Vaddireddy, H., Rasheed, A., Staples, A. E., and San, O. (2020). Feature engineering and symbolic regression methods for detecting hidden physics from sparse sensors. *Phys. Fluids* 32, 015113. doi:10.1063/1.5136351

Wiig, M. S., Pettersen, K. Y., and Krogstad, T. R. (2018). “A 3d reactive collision avoidance algorithm for nonholonomic vehicles,” in 2018 IEEE conference on control technology and applications (CCTA), Copenhagen, Denmark, August 21–24, 2018 (New York, NY, United States: IEEE), 67–74. doi:10.1109/CCTA.2018.8511437

Williams, G. N., Lagace, G. E., and Woodfin, A. (1990). “A collision avoidance controller for autonomous underwater vehicles,” in Symposium on autonomous underwater vehicle technology, Washington, DC, June 5–6, 1990 (New York, NY, United States: IEEE), 206–212. doi:10.1109/AUV.1990.110458

Woo, J., Yu, C., and Kim, N. (2019). Deep reinforcement learning-based controller for path following of an unmanned surface vehicle. *Ocean Engineering* 183, 155–166. doi:10.1016/j.oceaneng.2019.04.099

Xiang, X., Yu, C., and Zhang, Q. (2017). Robust fuzzy 3D path following for autonomous underwater vehicle subject to uncertainties. *Comput. Oper. Res.* 84, 165–177. doi:10.1016/j.cor.2016.09.017

Yann LeCun, G. B. O., Leon, B., and Müller, K.-R. (1998). *Efficient BackProp*. Berlin, Heidelberg: Springer.

Yu, R., Shi, Z., Huang, C., Li, T., and Ma, Q. (2017). “Deep reinforcement learning based optimal trajectory tracking control of autonomous underwater vehicle” in 2017 36th Chinese control Conference, Dalian, China, June 5–6, 1990 (New York, NY, United States: IEEE), 4958–4965. doi:10.23919/ChiCC.2017.8028138

Keywords: continuous control, collision avoidance, path following, deep reinforcement learning, autonomous under water vehicle, curriculum learning

Citation: Havenstrøm ST, Rasheed A and San O (2021) Deep Reinforcement Learning Controller for 3D Path Following and Collision Avoidance by Autonomous Underwater Vehicles. *Front. Robot. AI* 7:566037. doi: 10.3389/frobt.2020.566037

Received: 26 May 2020; Accepted: 08 December 2020;

Published: 25 January 2021.

Edited by:

Holger Voos, University of Luxembourg, LuxembourgReviewed by:

Marcello Cirillo, Scania, SwedenMartin Ludvigsen, Norwegian University of Science and Technology, Norway

Copyright © 2021 Havenstrøm, Rasheed and San. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Adil Rasheed, adil.rasheed@ntnu.no