
ORIGINAL RESEARCH article

Front. Mar. Sci., 21 January 2026

Sec. Marine Affairs and Policy

Volume 12 - 2025 | https://doi.org/10.3389/fmars.2025.1756233

This article is part of the Research Topic: Emerging Computational Intelligence Techniques to Address Challenges in Oceanic Computing.

COLREGs-compliant ship collision avoidance strategy based on proximal policy optimization algorithm

Qiaosheng Zhao1,2, Tianyu Yang3*, Chaoxu Mu1,2, Qiyu Chen4, Tao Luo4, Mingkai Liu4, Xin Wang4*
  • 1Tianjin University, Tianjin, China
  • 2China Ship Scientific Research Center, Wuxi, China
  • 3School of Ocean and Civil Engineering, Shanghai Jiao Tong University, Shanghai, China
  • 4College of Navigation, Dalian Maritime University, Dalian, China

The safe and efficient collision avoidance of multiple ships is essential for maritime navigation and intelligent shipping systems. In this paper, we propose a novel COLREGs-compliant multi-ship collision avoidance strategy based on deep reinforcement learning. A cooperative training framework using the Proximal Policy Optimization (PPO) algorithm enables multiple ship agents to learn optimal collision avoidance actions while considering the interactions and motions of neighboring ships. Encounter situation awareness mechanisms and carefully designed reward functions are integrated to ensure strict adherence to the International Regulations for Preventing Collisions at Sea (COLREGs), while a multi-objective optimization approach embedded in the reward function balances collision risk, navigational efficiency, route smoothness, and destination achievement. Extensive simulations covering diverse ship encounter scenarios demonstrate the effectiveness, robustness, and COLREGs compliance of the proposed strategy, highlighting its practical potential for multi-ship navigation systems.

1 Introduction

The ship collision avoidance problem represents a critical challenge in maritime navigation, with profound implications for safety, economic efficiency, and environmental sustainability. Collisions at sea can cause severe human casualties, substantial economic losses, and significant environmental damage, such as oil spills or hazardous material release. As maritime traffic continues to increase, especially in congested waterways and busy ports, effective collision avoidance becomes even more crucial. Consequently, this challenge has spurred extensive research efforts across multiple disciplines, leading to diverse methodological approaches and technological innovations, ranging from rule-based strategies and optimization methods to advanced machine learning and multi-agent reinforcement learning techniques.

Considering the importance of effective interaction between ships, Hu et al. (2006) proposed a decentralized two-ship negotiation protocol for give-way and stand-on responsibilities in the open sea, compliant with the Convention on the International Regulations for Preventing Collisions at Sea (COLREGs), to address the multi-ship collision avoidance problem. The initial negotiation protocol was further improved by Hu et al. (2008) by integrating the planned routes of both ships into the negotiation. Liu et al. (2007) presented a multiagent-based simulation system for decision-making research on ship collision avoidance. The system features flexible agents, variable topology, an isomorphic function structure, distributed knowledge storage, and an integrated control method. Zhang et al. (2012; 2015) developed a multi-ship real-time collision avoidance algorithm based on the COLREGs rules, in which ships dynamically assess neighboring ships' movements to determine appropriate avoidance maneuvers as either give-way or stand-on ships. While this approach enables autonomous decision-making for individual ships, it lacks a cooperative mechanism for multi-ship coordination, instead treating each encounter as an independent two-ship scenario. Kim et al. (2017) introduced a Distributed Stochastic Search Algorithm (DSSA), which allows each ship to change her intention in a stochastic manner immediately after receiving all of the intentions from the target ships. The algorithm incorporates an advanced cost function that balances collision risk against operational efficiency in its distributed decision-making process. However, COLREGs and differences in ship maneuvering performance were not considered in that study. Liu et al. (2022) studied the collision avoidance decision-making and coordination mechanism of conventional and intelligent ships in mixed navigation scenarios.

Intelligent prediction technology based on machine learning has developed rapidly in the past decade. Reinforcement learning (RL), deep reinforcement learning (DRL), and other deep learning (DL) methods have been increasingly applied in the field of ship collision avoidance. Yoo and Kim (2016) solved the minimum-time path planning problem for marine vehicles in ocean current environments through the Q-learning algorithm, combined with state space discretization and path smoothing techniques, while satisfying nonholonomic motion constraints. Cheng and Zhang (2018) proposed a concise deep reinforcement learning obstacle avoidance algorithm (CDRLOA), combining convolutional neural networks for data fusion with Q-learning decision-making, for real-time obstacle avoidance of underactuated unmanned marine vessels in unknown environments. Zhou et al. (2019) adopted a deep Q-network (DQN) framework and utilized deep neural networks and a comprehensive reward function design to achieve cooperative path planning for unmanned surface vehicles (USVs) and their formations in complex marine environments, effectively addressing challenges such as obstacle avoidance, formation maintenance, and shape adaptation. Cao et al. (2019) combined cutting-edge exploration methods with asynchronous advantage actor-critic (A3C) networks, using dual-stream Q-learning for navigation to achieve efficient target search and dynamic obstacle avoidance for AUVs in unknown underwater environments. Xie et al. (2020) developed a composite learning method that integrates A3C reinforcement learning, long short-term memory (LSTM) inverse control, and Q-learning adaptive decision-making to enhance the learning efficiency and strategy optimization of multi-ship collision avoidance. Chun et al. (2021), based on the Proximal Policy Optimization (PPO) algorithm, combined the ship domain and the Closest Point of Approach (CPA) to assess collision risk and generated autonomous ship collision avoidance paths that conform to COLREGs. Wu et al. (2023) proposed a collision avoidance path planning algorithm combining deep reinforcement learning (the PPO algorithm) and the Dynamic Window Approach (DWA), which alleviates the sparse reward problem by improving the action space and reward function and is applicable to dynamic obstacle avoidance of autonomous surface ships in complex environments at sea. Yu et al. (2025) introduced a hierarchical reinforcement learning framework that combines high-level global intent planning with low-level fine rudder control and integrates multi-dimensional uncertainty modeling to enhance the adaptability and stability of collision avoidance strategies for autonomous ships in dynamic and uncertain scenarios. Cui et al. (2024) proposed a multi-unmanned-surface-vessel collision avoidance decision-making strategy based on an improved deep deterministic policy gradient (DDPG) algorithm, combining a prioritized experience replay mechanism and a gated recurrent unit (GRU) network to enhance convergence speed and decision accuracy. Fan et al. (2025) designed an intelligent collision avoidance algorithm based on progressive deep reinforcement learning, adopting a three-layer training structure with transfer learning (from obstacle-free to multi-obstacle environments) and a multi-step bootstrapping method to enhance the obstacle avoidance efficiency of unmanned surface vessels in complex maritime environments. Jia et al. (2025) proposed a collision avoidance method based on meta-reinforcement learning, which uses a two-layer recurrent model for rapid adaptation and combines high-risk scenario task sampling with risk-aware objective functions (such as CVaR) to enhance the safety of autonomous ships in various encounter scenarios. While deep reinforcement learning algorithms can generate collision avoidance strategies for ships by learning through environmental interaction, they face several challenges, including limited generalization, insufficient safety assurances, difficulties in real-world implementation, and the inherent constraints of the various reinforcement learning methods.

Based on the above observations, intelligent ships exhibit distinct competitive advantages over traditional vessels in energy efficiency, environmental sustainability, safety, and economic viability, primarily driven by their advanced cognitive capabilities, decision-making processes, and sophisticated control technologies. However, a central challenge in their autonomous navigation lies in ensuring reliable collision avoidance with other sailing ships or dynamic obstacles. The design of a practically viable ship collision avoidance decision strategy is rendered highly complex by multiple factors: the intricate constraints of maritime environments, the mandatory compliance with the COLREGs, and the under-actuated nature of ship dynamics. To ensure safe navigation in mixed scenarios, reliable ship collision avoidance is the most important issue that needs to be resolved.

To enable rapid collision avoidance decisions that satisfy both safety requirements and COLREGs in complex multi-ship maritime traffic environments, a novel COLREGs-compliant ship collision avoidance strategy is proposed in this paper. The main contributions of this paper are summarized as follows:

1. A dynamic risk assessment model is integrated into the PPO algorithm, which comprehensively evaluates collision risks in real time by fusing multiple factors such as DCPA/TCPA, relative distance, heading, and speed between ships, enabling more accurate identification of potential collision threats compared to traditional static threshold-based methods.

2. A multi-objective optimization algorithm is employed to generate collision avoidance trajectories, balancing safety (minimum distance to obstacles), efficiency (minimum deviation from the original route), and COLREGs compliance, thus ensuring the generated strategies are both practically feasible and in line with maritime rules.

3. The proposed strategy is validated through extensive simulations in typical ship encounter scenarios (including head-on, crossing, and overtaking situations), demonstrating its superiority.

The rest of the paper is organized as follows. Preliminaries on the ship collision avoidance problem are described in Section 2. Section 3 introduces the proposed ship collision avoidance strategy. The simulation design and the analysis of experimental results are presented in Section 4. Conclusions and future work are presented in Section 5.

2 Preliminary

2.1 Reference frame and ship motion mathematical model

In real-world navigation practice, ship motion is typically characterized by a six-degree-of-freedom (6-DOF) model, which comprehensively describes a ship's dynamic behavior through three translational (surge, sway, heave) and three rotational (roll, pitch, yaw) motions. This model captures the ship's response to environmental forces and control inputs, providing a holistic representation of its operational dynamics. However, inherent challenges such as large inertia, time delays, and nonlinearity in ship motion pose significant difficulties for accurate modeling and prediction. To address these complexities, the Maneuvering Model Group (MMG) model is employed, enhancing the framework by integrating detailed hydrodynamic interactions and environmental factors. Developed by Ogawa and Kasai (1978), the MMG model offers a more precise and practical representation of ship motion, making it a critical tool for designing robust navigation and path planning systems in complex maritime environments.

Given the study's focus on ship path planning, the analysis simplifies ship motion to three key degrees of freedom: surge, sway, and yaw. These three motions are sufficient to capture the essential dynamics required for navigation and obstacle avoidance. Additionally, the research assumes calm water conditions, excluding external environmental influences such as wind, waves, and currents. This simplification allows for a concentrated analysis of ship collision avoidance while remaining applicable to practical maritime scenarios.

To accurately describe ship motion, two reference frames are established: the inertial reference frame (IRF) x0o0y0 and the body-fixed reference frame (BRF) xoy, as depicted in Figure 1.

Figure 1
Diagram illustrating a moving object on a coordinate plane, with vectors labeled U, v sub m, u, and r indicating direction and magnitude. Angles ψ and β are marked, along with axes labeled x, y, x sub 0, and y sub 0.

Figure 1. Reference frames.

The lateral velocity v at the ship's center of gravity, the drift angle β, and the resultant velocity U of the ship are, respectively:

$$v = v_m + x_G r$$

$$\beta = \arctan\left(\frac{v_m}{u}\right)$$

$$U = \sqrt{u^2 + v_m^2}$$

The standardized 3-DOF MMG model is:

$$\begin{cases}
(m + m_x)\dot{u} - (m + m_y)v_m r - x_G m r^2 = X_H + X_R + X_P \\
(m + m_y)\dot{v}_m + (m + m_x)u r + x_G m \dot{r} = Y_H + Y_R \\
(I_{zG} + x_G^2 m + J_z)\dot{r} + x_G m (\dot{v}_m + u r) = N_H + N_R
\end{cases}$$

The hull hydrodynamic forces and moment $X_H$, $Y_H$ and $N_H$ are:

$$\begin{cases}
X_H = \frac{1}{2}\rho L_{pp} d U^2 X'_H(v'_m, r') \\
Y_H = \frac{1}{2}\rho L_{pp} d U^2 Y'_H(v'_m, r') \\
N_H = \frac{1}{2}\rho L_{pp}^2 d U^2 N'_H(v'_m, r')
\end{cases}$$

where $v'_m$ and $r'$ are dimensionless values. The propeller propulsion force $X_P$ is:

$$\begin{cases}
X_P = (1 - t_P)\rho n_P^2 D_P^4 \left(k_{t2} J_P^2 + k_{t1} J_P + k_{t0}\right) \\
J_P = \dfrac{u\left[1 - w_{P0}\exp\left(-C_0 \beta_P^2\right)\right]}{n_P D_P} \\
\beta_P = \beta - x'_P r'
\end{cases}$$

where $x'_P$ is the dimensionless value of the propeller position. The rudder forces $X_R$, $Y_R$ and $N_R$ during steering are:

$$\begin{cases}
X_R = -(1 - t_R) F_N \sin\delta \\
Y_R = -(1 + a_H) F_N \cos\delta \\
N_R = -(x_R + a_H x_H) F_N \cos\delta
\end{cases}$$

The symbolic meanings in the formula are listed as Table 1.

Table 1

Table 1. Symbols used in the ship motion model.
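To make the model concrete, the following is a minimal sketch of one evaluation of the 3-DOF MMG equations of motion solved for the accelerations; the parameter dictionary keys and the assumption that the force terms X_H + X_R + X_P, Y_H + Y_R and N_H + N_R are computed elsewhere are illustrative choices, not the authors' implementation.

```python
import numpy as np

def mmg_3dof_derivatives(state, forces, params):
    """Evaluate the 3-DOF MMG equations of motion for the accelerations (sketch).

    state  = (u, vm, r): surge speed, lateral speed at midship, yaw rate
    forces = (X, Y, N):  summed surge force, sway force and yaw moment,
                         i.e. X_H + X_R + X_P, Y_H + Y_R, N_H + N_R
    params : dict of mass and inertia terms (keys are assumed names)
    """
    u, vm, r = state
    X, Y, N = forces
    m, mx, my = params["m"], params["mx"], params["my"]
    xG, IzG, Jz = params["xG"], params["IzG"], params["Jz"]

    # Surge equation solved directly for u_dot
    u_dot = (X + (m + my) * vm * r + xG * m * r**2) / (m + mx)

    # Sway and yaw equations are coupled in (vm_dot, r_dot): solve the 2x2 system
    A = np.array([[m + my, xG * m],
                  [xG * m, IzG + xG**2 * m + Jz]])
    b = np.array([Y - (m + mx) * u * r,
                  N - xG * m * u * r])
    vm_dot, r_dot = np.linalg.solve(A, b)
    return np.array([u_dot, vm_dot, r_dot])
```

The returned derivatives can then be integrated with any fixed-step scheme (e.g., Euler or Runge-Kutta) to propagate the ship state during simulation.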

2.2 COLREGs and ship encounter situations identification

The 1972 Convention on the International Regulations for Preventing Collisions at Sea, which entered into force in July 1977, constitutes a globally ratified regulatory framework with near-universal acceptance among maritime nations (IMO, 1972). As a foundational normative instrument governing navigational behavior, COLREGs assumes irreplaceable significance in addressing multi-ship collision avoidance scenarios, where adherence to its provisions serves as a prerequisite for ensuring navigational safety. Specifically, COLREGs articulates a set of obligations regarding evasive maneuvering: in situations where vessels are on a collision-bearing trajectory, the regulations explicitly mandate that the give-way vessel shall undertake timely and unambiguous evasive actions, thereby establishing a clear behavioral paradigm for resolving potential collision risks. These stipulations not only codify the division of navigational responsibilities but also provide a shared cognitive reference for seafarers, facilitating predictable interaction dynamics even in spatially complex and temporally constrained maritime environments (He et al., 2025).

All crewed vessels are legally obligated to adhere to the collision avoidance protocols outlined in the International Maritime Organization’s (IMO) COLREGs. Specifically, this study focuses on the core COLREGs rules governing “vessels in sight of each other” that directly guide collision avoidance decision-making. To guarantee precise interpretation and consistent implementation of these regulations, a thorough analysis of their original textual provisions is indispensable. Against this backdrop, the following section presents a succinct synthesis of pertinent COLREGs clauses, with each entry retaining its original rule number to facilitate cross-referencing and clarity of reference.

Rule 8 of COLREGs - Action to avoid collision: any action to avoid collision shall be taken in ample time. If there is sufficient sea-room, alteration of course alone may be the most effective action. If necessary, a vessel shall slacken her speed, stop or reverse. Action by a ship may be required if there is a risk of collision, even when the ship has the right-of-way.

Rule 13 of COLREGs - Overtaking: any vessel overtaking any other shall keep out of the way of the vessel being overtaken.

Rule 14 of COLREGs - Head-on Situation: When two power-driven vessels are meeting on reciprocal or nearly reciprocal courses so as to involve risk of collision, each shall alter her course to starboard so that each shall pass on the port side of the other.

Rule 15 of COLREGs - Crossing Situation: when two power-driven vessels are crossing so as to involve risk of collision, the vessel which has the other on her own starboard side shall keep out of the way and shall, if the circumstances of the case admit, avoid crossing ahead of the other vessel.

Rule 16 of COLREGs - Actions by give-way vessel: every vessel which is directed to keep clear of another vessel shall, so far as possible, take early and substantial action to keep well clear.

Rule 17 of COLREGs - Actions by stand-on vessel: where one of two vessels is to keep out of the way, the other shall keep her course and speed. The latter vessel may however take action to avoid collision by her manoeuvre alone, as soon as it becomes apparent to her that the vessel required to keep out of the way is not taking appropriate action in compliance with these rules.

Under Rules 13 to 15 of the COLREGs, scenarios involving interacting vessels are classified into three distinct encounter categories: overtaking, head-on, and crossing situations. Each classification is accompanied by tailored stipulations that delineate the navigational duties of the involved vessels, thereby underpinning safe passage at sea. Specifically, the roles of “stand-on” and “give-way” vessels critical to resolving these encounters are explicitly codified in Rules 8, 16, and 17, which outline the respective obligations of each party. A visual representation of these three encounter types as defined by the COLREGs is provided in Figure 2.

Figure 2
Diagram illustrating three ship encounter scenarios: (a) Encountering ship from the left, own ship turns right. (b) Encountering ship from the front, both ships turn right. (c) Encountering ship from the right, own ship turns left. Arrows indicate directions.

Figure 2. The illustration of ship encounter types and their responsibilities in the COLREGs. (a) overtaking, (b) head-on, (c) crossing.

In the present study, the ship encounter classification framework proposed by Thyri et al. (2020) is adopted. Specifically, the head-on situation is defined when the relative bearing falls within 345°-15°, the crossing situation is identified for relative bearings of 15°-112.5° and 247.5°-345°, and the overtaking situation corresponds to the range of 112.5°-247.5°. This framework enables the OS to precisely identify the specific situation of its encounter with other ships by virtue of analyzing the relative bearing between them, as visually depicted in Figure 3.

Figure 3
Diagram showing a circular view from a ship. Sectors labeled “Head-on” (345° to 15°), “Crossing” on both sides, and “Overtaking” between 112.5° and 247.5°. Arrows represent direction angles.

Figure 3. The ship encounter types classification framework.
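As a concrete illustration of this classification, the short helper below maps a relative bearing to the encounter type using the thresholds of Thyri et al. (2020) adopted above; the handling of the exact boundary values (15°, 112.5°, 247.5°, 345°) is an assumption, since the paper does not specify it.

```python
def classify_encounter(relative_bearing_deg: float) -> str:
    """Classify the encounter type from the TS relative bearing in degrees.

    Head-on: 345-15 deg, crossing: 15-112.5 and 247.5-345 deg,
    overtaking: 112.5-247.5 deg (Thyri et al., 2020).
    """
    b = relative_bearing_deg % 360.0
    if b >= 345.0 or b < 15.0:
        return "head-on"
    if 112.5 <= b < 247.5:
        return "overtaking"
    return "crossing"
```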

2.3 PPO algorithm

As a strategy iteration-based reinforcement learning technique, the Policy Gradient (PG) algorithm operates by sampling state-action-reward tuples to maximize the expected cumulative reward. An important improvement upon this approach is the Proximal Policy Optimization (PPO) algorithm. Unlike traditional policy gradient algorithms that often struggle with unstable updates due to large policy changes, PPO introduces a clipped surrogate objective function to restrict the divergence between new and old policies during training. This innovative design enables stable learning even with small batches of data and multiple training epochs, addressing the challenge of determining optimal step sizes in classic policy gradient approaches.

Structurally, PPO adopts a classic actor-critic framework consisting of an actor network (policy function) and a critic network (value function). These networks work in tandem to optimize the policy while ensuring stable and efficient learning. The actor network, parameterized by θ, defines a stochastic policy π_θ(a|s) that outputs a probability distribution over actions given the current state s. The critic network, parameterized by ϕ, estimates the state-value function, defined as:

$$V_\phi(s) = \mathbb{E}\left[G_t \mid S_t = s\right]$$

The actor is trained to maximize the clipped surrogate objective function, which ensures that policy updates remain within a safe region to prevent drastic changes that could destabilize learning. The objective function is given by:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where ϵ is the clipping hyperparameter, which limits the magnitude of policy updates, $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ is the probability ratio between the new and old policies, and $\hat{A}_t$ is the generalized advantage estimate (GAE), which measures how much better or worse a particular action is compared to the average action in a given situation.

The clipping mechanism in PPO serves as a fundamental stabilization technique that constrains policy updates within a trusted region by bounding the probability ratio between consecutive policies. Specifically, this mechanism enforces a conservative update strategy through a carefully designed objective function that clips the policy ratio rt(θ) within the range [1-ϵ, 1+ϵ]. By maintaining this bounded ratio, the algorithm effectively prevents excessively large policy updates that could destabilize the learning process while still permitting meaningful policy improvements.

As for the critic network, it serves as a learned value function approximator that estimates the expected cumulative return for a given state under the current policy. The critic network, parameterized by weights ϕ, outputs the state-value function V_ϕ(s), which is trained to minimize the mean squared error (MSE) between its predictions and either the empirical returns or bootstrapped value targets. This optimization is achieved through the value loss function:

$$L^{VF}(\phi) = \mathbb{E}\left[\left(V_\phi(s_t) - \hat{V}_t\right)^2\right]$$

Through this architecture, the critic network learns a stable and informative value function that progressively improves in accuracy as training proceeds, while the actor network leverages these estimates to make increasingly refined policy updates within the bounds of the clipping mechanism. In practice, the actor and critic networks often share lower-level feature extraction layers while maintaining separate output heads, allowing for efficient learning of shared representations while preserving the distinct objectives of each network component.
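For readers who prefer a concrete form of the update, the following is a minimal PyTorch sketch of the clipped surrogate loss and the value loss defined above; the tensor names (states, actions, old_log_probs, advantages, returns) and the assumption that the actor returns a torch distribution are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ppo_losses(actor, critic, states, actions, old_log_probs,
               advantages, returns, eps=0.2):
    """Clipped surrogate loss L^CLIP and value loss L^VF (illustrative sketch)."""
    # Probability ratio r_t(theta) between the new and old policies
    dist = actor(states)
    log_probs = dist.log_prob(actions).sum(-1)
    ratio = torch.exp(log_probs - old_log_probs)

    # Clipped surrogate objective (negated, since optimizers minimize)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    actor_loss = -torch.min(unclipped, clipped).mean()

    # Value-function loss: MSE between predicted values and value targets
    values = critic(states).squeeze(-1)
    critic_loss = F.mse_loss(values, returns)
    return actor_loss, critic_loss
```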

3 Proposed ship collision avoidance strategy

3.1 Calculation of ship collision risk

Ship collision risk calculation serves as a quantitative foundation for collision avoidance decision-making, enabling the assessment of potential collision probabilities between other sailing ships and facilitating proactive collision avoidance maneuvering. This section outlines the core methodologies and key factors involved in quantifying collision risk employed in this research.

To accurately determine the spatial-temporal relationship between the own ship (OS) and the target ship (TS), extant research on ship collision risk has made extensive use of the notions of Distance to Closest Point of Approach (DCPA) and Time to Closest Point of Approach (TCPA). Accordingly, these two metrics are selected as the main indicators of collision risk assessment in this paper.

Based on the reference frame defined in Figure 1, the initial position of the midship of OS and TS can be denoted by (Xo0, Yo0) and (Xt0, Yt0), respectively. The coordinates of the OS and TS at any subsequent time t can be calculated as:

$$\begin{cases}
X_o(t) = X_{o0} + \int_0^t \left(u_o\cos\psi_o - v_o\sin\psi_o\right)dt \\
Y_o(t) = Y_{o0} + \int_0^t \left(u_o\sin\psi_o + v_o\cos\psi_o\right)dt
\end{cases}$$

$$\begin{cases}
X_t(t) = X_{t0} + \int_0^t \left(u_t\cos\psi_t - v_t\sin\psi_t\right)dt \\
Y_t(t) = Y_{t0} + \int_0^t \left(u_t\sin\psi_t + v_t\cos\psi_t\right)dt
\end{cases}$$

Then, the relative distance R(t) between the OS and TS can be expressed as

$$\begin{cases}
\Delta X(t) = X_t(t) - X_o(t) \\
\Delta Y(t) = Y_t(t) - Y_o(t)
\end{cases}$$

$$R(t) = \sqrt{\Delta X^2(t) + \Delta Y^2(t)}$$

And the relative bearing αR(t) can be calculated by

$$\alpha_R(t) = \arctan\frac{\Delta Y(t)}{\Delta X(t)} - \psi_o + \Delta\alpha$$

$$\Delta\alpha = \begin{cases}
0 & \text{if } \Delta X \geq 0,\ \Delta Y \geq 0 \\
360^{\circ} & \text{if } \Delta X \geq 0,\ \Delta Y < 0 \\
180^{\circ} & \text{otherwise}
\end{cases}$$

Similarly, the relative velocities from the TS to the OS and the relative resultant velocity UR(t) can be calculated as

$$\begin{cases}
\Delta U_{Rx}(t) = \left(u_t\cos\psi_t - v_t\sin\psi_t\right) - \left(u_o\cos\psi_o - v_o\sin\psi_o\right) \\
\Delta U_{Ry}(t) = \left(u_t\sin\psi_t + v_t\cos\psi_t\right) - \left(u_o\sin\psi_o + v_o\cos\psi_o\right)
\end{cases}$$

$$U_R(t) = \sqrt{\Delta U_{Rx}^2(t) + \Delta U_{Ry}^2(t)}$$

Similar to the relative bearing, the relative heading angle ψ_R(t) can be expressed as

$$\psi_R(t) = \arctan\frac{\Delta U_{Ry}(t)}{\Delta U_{Rx}(t)} - \psi_o + \Delta\psi_R$$

$$\Delta\psi_R = \begin{cases}
0 & \text{if } \Delta U_{Rx} \geq 0,\ \Delta U_{Ry} \geq 0 \\
360^{\circ} & \text{if } \Delta U_{Rx} \geq 0,\ \Delta U_{Ry} < 0 \\
180^{\circ} & \text{otherwise}
\end{cases}$$

The DCPA and TCPA between the OS and the TS are expressed as:

$$\begin{cases}
DCPA = R(t)\sin\left[\psi_R(t) - \alpha_R(t) - \pi\right] \\
TCPA = R(t)\cos\left[\psi_R(t) - \alpha_R(t) - \pi\right] / U_R(t)
\end{cases}$$

Based on the abovementioned quantities, the specific values of DCPA and TCPA at any given time t can be determined. Notably, these values are not static or fixed; they evolve dynamically as ships adjust their maneuvers, which makes continuous recalculation essential for accurate collision risk assessment. However, their inherent dimensional differences, with DCPA quantified in spatial units (e.g., nautical miles) and TCPA in temporal units (e.g., minutes or seconds), introduce challenges when integrating them into a unified collision risk metric or index. This disparity complicates direct comparison or aggregation, as a "small" DCPA does not inherently correspond to a "small" TCPA, nor can their individual thresholds be easily weighted to reflect overall risk severity.
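A compact way to see how these quantities are obtained at a single time instant is the sketch below; the state tuples (x, y, u, v, ψ) and the use of atan2 in place of the explicit quadrant corrections Δα and Δψ_R are simplifying assumptions for illustration, not the authors' implementation.

```python
import math

def dcpa_tcpa(os_state, ts_state):
    """Relative range, bearing and DCPA/TCPA at one time instant (sketch)."""
    xo, yo, uo, vo, psio = os_state
    xt, yt, ut, vt, psit = ts_state

    # Relative position and range R(t)
    dx, dy = xt - xo, yt - yo
    R = math.hypot(dx, dy)

    # Relative velocity components and resultant relative speed U_R(t)
    dux = (ut * math.cos(psit) - vt * math.sin(psit)) - (uo * math.cos(psio) - vo * math.sin(psio))
    duy = (ut * math.sin(psit) + vt * math.cos(psit)) - (uo * math.sin(psio) + vo * math.cos(psio))
    UR = math.hypot(dux, duy)

    # Relative bearing and relative heading, both referenced to the OS heading
    alpha_R = (math.atan2(dy, dx) - psio) % (2.0 * math.pi)
    psi_R = (math.atan2(duy, dux) - psio) % (2.0 * math.pi)

    # Closest-point-of-approach geometry
    dcpa = R * math.sin(psi_R - alpha_R - math.pi)
    tcpa = R * math.cos(psi_R - alpha_R - math.pi) / max(UR, 1e-6)
    return dcpa, tcpa
```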

To address this, a space-time collision risk model proposed by Zhen et al. (2022) is employed in this paper and the specific ship collision risk calculation method is as follows:

$$CRI_{DCPA} = k_1 e^{-\alpha_1 \frac{DCPA}{D}}$$

$$CRI_{TCPA} = k_2 e^{-\alpha_2 \frac{TCPA}{T}}$$

where CRI_DCPA is the space collision risk and CRI_TCPA is the time collision risk, D is the safe distance between the OS and the TS, T is the corresponding safe time of the two ships, and k1, k2, α1, α2 are adjustment coefficients.

Based on the CRI_DCPA and the CRI_TCPA, the overall ship collision risk RI_collision can be expressed as

$$RI_{collision} = \omega_1 CRI_{DCPA} + \omega_2 CRI_{TCPA}$$

The proposed collision risk assessment framework constitutes a dynamic risk assessment model, as the risk index RI_collision is continuously updated along the ship trajectories based on the real-time kinematic states of both the OS and the TS. At each time step, the relative distance, relative velocity, DCPA and TCPA are recalculated using the latest motion information, enabling the collision risk to evolve with the maneuvering behaviors of the encountered ships.

Unlike static risk evaluation approaches that rely on fixed encounter geometries, the proposed model explicitly incorporates the temporal evolution of ship motion, allowing it to capture changes in collision risk induced by course alterations, speed variations, and encounter geometry transitions. Consequently, the dynamic collision risk index serves as a time-varying safety indicator, which can effectively reflect the impact of different maneuvering strategies on navigational safety.
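Under the coefficients reported in Section 4.1 (k1 = k2 = 1, α1 = 3, α2 = 2), the space-time risk model can be sketched as follows; the safe distance D, safe time T and the weights ω1, ω2 are placeholder values chosen only for illustration.

```python
import math

def collision_risk(dcpa, tcpa, D=500.0, T=300.0,
                   k1=1.0, k2=1.0, a1=3.0, a2=2.0, w1=0.5, w2=0.5):
    """Dynamic collision risk index RI_collision (illustrative sketch)."""
    # Space risk decays exponentially as DCPA grows relative to the safe distance D
    cri_dcpa = k1 * math.exp(-a1 * abs(dcpa) / D)
    # Time risk decays as TCPA grows relative to the safe time T; a non-positive
    # TCPA (closest approach already passed) is treated as zero time risk here
    cri_tcpa = k2 * math.exp(-a2 * tcpa / T) if tcpa > 0.0 else 0.0
    # Weighted aggregation into the overall risk index
    return w1 * cri_dcpa + w2 * cri_tcpa
```

Recomputing this index at every simulation step with the latest DCPA/TCPA values yields the time-varying safety indicator described above.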

3.2 Design of deep neural network

The neural network of the PPO algorithm in this study is composed of two main components: the actor network and the critic network. The actor network, as the policy executor, is responsible for mapping the current ship navigational state to a specific action distribution, ensuring that the output maneuvers are both feasible within the ship's maneuvering constraints (e.g., maximum rudder angle) and aligned with the learned strategy. Meanwhile, the critic network functions as a value estimator, evaluating the expected cumulative reward of the current state under the actor's policy. By predicting the state value, it provides a baseline for assessing the advantage of each action, i.e., how much better an action is compared to the average expected outcome, thereby guiding the actor network to refine its policy toward higher-reward behaviors. This division of labor between the actor and critic networks creates a feedback loop: the actor generates actions to interact with the environment, the critic evaluates the quality of those actions, and both networks are updated iteratively using PPO's clipped surrogate objective to ensure stable and efficient learning.

Specifically, the actor network adopts a dual-layer architecture with layer normalization to balance expressiveness and training stability in high-dimensional navigation scenarios. The network processes input features including relative positions, velocities, and COLREGs-compliance metrics through two hidden layers of 256 neurons each, each followed by a layer normalization (LN) layer to mitigate internal covariate shift and improve gradient flow. Leaky ReLU activation with a slope of 0.01 is applied post-normalization to retain non-linearity while avoiding dead neurons, ensuring robust encoding of complex state-action relationships. Correspondingly, the critic network complements the actor by adopting a symmetrical architecture with layer normalization to estimate state values accurately. It processes the same high-dimensional input features as the actor but diverges in its output layer, which generates a scalar value representing the expected cumulative reward from the current state. This design mirrors the actor's structure, i.e., two hidden layers of 256 neurons with Leaky ReLU activation and layer normalization to ensure consistent feature extraction, while orthogonal initialization with gain = 0.01 for the value head stabilizes the estimation of large-scale rewards.

The specific structures of the employed actor network and the critic network are shown in Table 2.

Table 2

Table 2. Neural network structures.
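For concreteness, a minimal PyTorch sketch of the actor and critic architectures summarized in Table 2 is given below; the input and output dimensions and the Gaussian action head are assumptions for illustration rather than the authors' exact design.

```python
import torch
import torch.nn as nn

def backbone(state_dim, hidden=256):
    # Two hidden layers of 256 units with layer normalization and LeakyReLU(0.01)
    return nn.Sequential(
        nn.Linear(state_dim, hidden), nn.LayerNorm(hidden), nn.LeakyReLU(0.01),
        nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.LeakyReLU(0.01),
    )

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim=1):
        super().__init__()
        self.body = backbone(state_dim)
        self.mu = nn.Linear(256, action_dim)                  # mean rudder command
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # learned action spread

    def forward(self, state):
        mu = torch.tanh(self.mu(self.body(state)))  # bounded output, scaled to the rudder limit elsewhere
        return torch.distributions.Normal(mu, self.log_std.exp())

class Critic(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        self.body = backbone(state_dim)
        self.value = nn.Linear(256, 1)
        nn.init.orthogonal_(self.value.weight, gain=0.01)     # small-gain value head
        nn.init.zeros_(self.value.bias)

    def forward(self, state):
        return self.value(self.body(state))
```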

3.3 State space and reward function design

In the ship collision avoidance problem, the state space encompasses a comprehensive set of navigational features that collectively characterize the dynamic motion information. It typically includes the OS kinematic parameters such as position, course, speed, and rudder angle. In this study, the state space is defined by

$$State = [x, y, u, v, r, \psi]$$

Collision avoidance requires a navigator to adjust the heading of the Own Ship (OS), a skill dependent on long-term experience. Autonomous ships aim to emulate this expert decision-making by undergoing a simulated learning process. Within this framework, the agent's action is defined as a rudder angle, the execution of which modifies the OS's trajectory. In this study, the action space that can be selected by the OS is defined as

$$action = [-\delta_{max}, \delta_{max}], \quad \delta_{max} = 35^{\circ}$$

Besides the aforementioned state space and action space, the reward function is another core component of the reinforcement learning framework for ship collision avoidance, serving as a feedback mechanism to guide the OS toward safe and efficient navigational behaviors. Its design must balance multiple, often competing objectives, such as minimizing collision risk, adhering to COLREGs, reducing deviations from the original route, and optimizing efficiency. In this study, the rewards accrued at each time step are categorized into two primary types. The first, rewards for collision avoidance (RCA), quantify the OS's collision avoidance status relative to target ships (TS). RCA is further subdivided into three components: collision penalties, warning signals, and COLREGs compliance rewards. The second type, rewards for path following (RPF), evaluates the OS's adherence to the reference route to the destination, with subcategories including time efficiency, goal achievement, and route deviation metrics.

To standardize the impact of dense rewards, each component is normalized to the range [-1, 1]. Conversely, sparse rewards triggered by critical events are scaled to ±1000 to accelerate training convergence by emphasizing high-priority outcomes. A detailed breakdown of each reward component is provided in the following sections.

$$R_{dcpa} = \begin{cases}
1, & DCPA > 0 \\
-1, & \text{otherwise}
\end{cases}$$

$$R_{route} = 2e^{-2 d_O} - 1, \quad d_O = \sqrt{\left(x_{OS} - X_{OS}\right)^2 + \left(y_{OS} - Y_{OS}\right)^2}$$

$$R_{goal} = \begin{cases}
1000, & L \leq 0.1\ \text{n mile} \\
2e^{-2L} - 1, & \text{otherwise}
\end{cases}, \quad L = \sqrt{\left(x_{OS} - X\right)^2 + \left(y_{OS} - Y\right)^2}$$

$$R_{COLREGs} = 500, \quad \text{if the maneuver satisfies COLREGs}$$

For the OS to arrive at the destination from a start node, a number of time steps is required. One iteration is performed after the number of time steps reaches the batch size, and the iteration process continues until the maximum number of iterations is reached. The reward summed over each iteration is called the total reward. As learning progresses and iterations accumulate, the collision avoidance performance of the OS can be assessed by examining the change in the total reward.

$$R_{time} = -1$$

$$reward = R_{time} + R_{dcpa} + R_{route} + R_{goal} + R_{collision} + R_{COLREGs}$$

Specifically, the total reward incorporates sub-rewards designed to guide the agent toward optimal performance across critical dimensions. A collision avoidance safety reward grants positive feedback when the OS maintains a safe distance from other ships or obstacles (as reflected by the Distance to Closest Point of Approach, DCPA, meeting safety thresholds) and imposes severe negative rewards (penalties) in the event of a collision to discourage unsafe behaviors. A COLREGs compliance reward provides positive incentives when the OS's maneuvering actions (e.g., give-way behavior in crossing situations, course adjustment in head-on scenarios) strictly adhere to the relevant COLREGs rules (e.g., Rules 13-16), while applying negative rewards for non-compliant behaviors. A time efficiency reward encourages the OS to reach the destination within a reasonable time frame by rewarding shorter navigation durations and penalizing unnecessary delays. A route optimization reward promotes smooth and efficient navigation by rewarding the OS for maintaining a stable course, minimizing excessive maneuvering, and adhering to the planned route trajectory. Finally, a destination achievement reward offers a positive reward upon successful arrival at the preset destination to reinforce the primary navigational goal. As the reinforcement learning process progresses and the number of iterations increases, the collision avoidance performance of the OS can be effectively evaluated by examining the variation trend of the total reward: consistent increases in total reward typically indicate that the agent is gradually mastering safe, compliant, and efficient navigational strategies, while stable convergence of the total reward reflects the robustness and reliability of the trained collision avoidance policy.

It is worth emphasizing that the above reward formulation inherently constitutes a multi-objective optimization framework. Specifically, multiple navigation objectives—including collision avoidance safety, compliance with COLREGs, time efficiency, route smoothness, and destination achievement—are simultaneously considered and integrated into the learning process through distinct sub-reward components.

Rather than optimizing a single scalar objective, the proposed method transforms these competing and potentially conflicting objectives into a unified optimization problem by constructing a composite reward function. Each sub-reward corresponds to an individual optimization objective, and their joint contribution guides the agent toward a balanced navigation policy that achieves safety, legality, and efficiency concurrently.
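To show how these sub-objectives combine into the single scalar signal used during training, the sketch below assembles the per-step reward from the terms defined above; the helper signature and the collision-penalty magnitude (-1000, mirroring the goal bonus) are assumptions for illustration only.

```python
import math

def step_reward(dcpa, d_route, d_goal_nm, collided, colregs_ok):
    """Composite per-step reward (illustrative sketch of Section 3.3).

    dcpa       : DCPA with the most dangerous target ship
    d_route    : deviation from the reference route
    d_goal_nm  : remaining distance to the destination in nautical miles
    collided   : True if a collision occurred at this step
    colregs_ok : True if the current maneuver satisfies COLREGs
    """
    r_time = -1.0                                      # constant per-step time penalty
    r_dcpa = 1.0 if dcpa > 0.0 else -1.0               # safety term
    r_route = 2.0 * math.exp(-2.0 * d_route) - 1.0     # route-deviation term in [-1, 1]
    r_goal = 1000.0 if d_goal_nm <= 0.1 else 2.0 * math.exp(-2.0 * d_goal_nm) - 1.0
    r_collision = -1000.0 if collided else 0.0         # sparse collision penalty (assumed value)
    r_colregs = 500.0 if colregs_ok else 0.0           # COLREGs compliance bonus
    return r_time + r_dcpa + r_route + r_goal + r_collision + r_colregs
```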

4 Simulations and discussion

4.1 Simulation setting

In this section, all training tasks were performed on a server with an NVIDIA RTX 4060 Ti GPU, an Intel i7-12700 CPU, and 32 GB of DDR4 memory. Neural network models were trained via the PyTorch deep learning framework. Detailed configurations of the PPO hyperparameters are provided in Table 3. It should be noted that no comparison with a standard PPO baseline is included in this section. This is because ship collision avoidance is inherently risk-sensitive, and decision-making performance is tightly coupled with collision risk assessment. A standard PPO framework without risk modeling tends to produce unsafe or overly aggressive trajectories that are not compliant with maritime practice, making such a comparison less meaningful and potentially misleading in safety-critical contexts. Therefore, this experiment focuses on validating the feasibility and safety of the proposed method rather than conducting algorithmic benchmarking.

Table 3

Table 3. Hyper parameters of the PPO algorithm.

In the simulations, the key parameters of the proposed algorithm are set as k1 = 1, k2 = 1, α1 = 3, α2 = 2. The target ship is modeled using the widely adopted KVLCC2 tanker, a standard benchmark in ship maneuverability research. The ship has a length of 7.0 m, a beam of 1.269 m, a draft of 0.455 m, and a block coefficient of 0.8098. The hydrodynamic forces and moments in the MMG model are determined using the comprehensive hydrodynamic coefficients reported by Yasukawa and Yoshimura (2014).

4.2 Scenario 1: static obstacle avoidance

To verify the effectiveness of the proposed PPO-based collision avoidance strategy in simple and deterministic environments, the first simulation scenario focuses on static obstacle avoidance. In this case, the own ship navigates through a water area where several immovable obstacles are randomly distributed. The purpose of this experiment is to evaluate the ability of the trained policy to generate safe and smooth trajectories in the absence of dynamic encounter situations.

The environment is modeled in a two-dimensional Cartesian coordinate system, with the initial position of the own ship set to (x0, y0) = (1000, 1000) and the target position located at (xg, yg) = (4000,4000). Two circular obstacles with radii ranging from 100 m to 200 m are placed along the path between the initial and target positions to emulate stationary obstacles.

The simulation results demonstrate that the own ship guided by PPO successfully learns to avoid static obstacles without direct supervision. As shown in Figure 4, the planned trajectory smoothly curves around the obstacles while maintaining a stable heading toward the target. The ship adjusts its rudder angle gradually, avoiding abrupt maneuvers, and restores its heading once it clears the obstacle region. This indicates that the policy achieves a balance between safety and navigation efficiency.

Figure 4
A graph depicting a ship's path from a green start node to a red target node, avoiding static obstacles marked as gray circles. The path curves upward and rightward on an X, Y coordinate plane, with axes labeled in meters.

Figure 4. Ship trajectory for static obstacle avoidance generated by the proposed algorithm.

More specifically, the trajectory shows that the own ship initiates an early deviation when an obstacle enters its perception range, indicating that the learned policy has developed anticipatory behavior rather than reactive avoidance. The turning maneuver is performed with a moderate curvature, keeping the ship’s lateral acceleration within a safe and realistic range. During the avoidance process, the surge velocity remains nearly constant, and only small fluctuations in yaw rate are observed, demonstrating that the PPO agent effectively coordinates propulsion and steering control. The minimum distance between the ship and the nearest obstacle is consistently greater than the predefined safety margin (50 m in this simulation), ensuring collision-free navigation. After bypassing the obstacle cluster, the ship gradually realigns with the nominal path and proceeds directly toward the target waypoint, showing strong path recovery capability. The resulting motion pattern is smooth, interpretable, and physically feasible, which is essential for real-world maritime applications. In summary, the results in Figure 4 confirm that the proposed PPO-based control strategy can generate safe, smooth, and efficient trajectories for static obstacle avoidance. This verifies that the trained policy has effectively captured the spatial characteristics of the environment and can generalize well to different obstacle configurations without additional retraining.

In addition, the minimum passing distance between the ship and the nearest obstacle remains larger than the predefined safe threshold, confirming that the learned strategy effectively enforces spatial separation. Compared with a conventional rule-based approach, the PPO-based controller exhibits better adaptability to different obstacle layouts and produces smoother motion profiles. Overall, the results of this scenario validate the feasibility of the proposed method in simple environments and establish a solid foundation for more complex dynamic encounter cases, which are discussed in the following sections.

4.3 Scenario 2: two-ship system: crossing situation

To further validate the effectiveness and generalization capability of the proposed COLREGs-compliant PPO-based collision avoidance strategy, the second simulation scenario considers a two-ship encounter situation. In this case, both the own ship and the target ship are moving, which requires the decision-making policy to adaptively generate collision-free maneuvers while complying with the COLREGs.

In this scenario, two surface ships navigate within the same operating area. The own ship starts at (x0, y0) = (1000, 1000) with an initial heading of 45°, while the target ship is initially positioned at (xt, yt) = (1000, 4000) with a heading of 135°, moving toward the lower-right direction. The own ship follows the PPO-trained policy, whereas the target ship maintains a constant speed and heading.

As illustrated in Figure 5, this scenario represents a relatively complex crossing encounter situation between the own ship and the target ship. According to Rule 15 of COLREGs - Crossing Situation: when two power-driven vessels are crossing so as to involve risk of collision, the vessel which has the other on her own starboard side shall keep out of the way and shall, if the circumstances of the case admit, avoid crossing ahead of the other vessel. The trajectories clearly show that both ships’ paths intersect within the operational area, requiring the PPO-based agent to make timely and rule-compliant maneuvering decisions. When the potential collision zone is detected, the own ship initiates a smooth starboard alteration in advance, creating a safe passing distance while still progressing toward its destination.

Figure 5
Graph showing a ship's path from a green start node to a red target node. A blue line represents the ship's path, intersecting with a red dashed line indicating a dynamic obstacle. Blue arrows point along the path, illustrating movement. Static obstacles are absent. Axes are labeled X in meters and Y in meters.

Figure 5. Simulation results of the two-ship crossing scenario.

4.4 Scenario 3: two-ship system: head-on situation

In this scenario, the own ship starts at (x0, y0) = (1000, 1000) with an initial heading of 45°, while the target ship is initially positioned at (xt, yt) = (4000, 4000) with a heading of 225°, moving toward the lower-left direction. This constitutes a classic head-on encounter as defined by COLREGs, where the two vessels approach each other on reciprocal courses with a high risk of collision if no evasive action is taken. As illustrated in Figure 6, the proposed collision avoidance strategy successfully triggers compliant maneuvering in accordance with COLREGs requirements. Specifically, it adheres to the core principles of Rule 14 (head-on situation) by initiating timely and appropriate course or speed adjustments to ensure safe separation, thereby verifying the strategy's effectiveness in handling head-on encounters while strictly complying with international navigational regulations.

Figure 6
Graph showing a ship's path in blue from a green start node at (1000, 1000) to a red target node at (4000, 4000). A blue arrow indicates direction. A red dashed line signifies a dynamic obstacle.

Figure 6. Simulation results of the two-ship head-on scenario.

The resulting trajectory demonstrates the policy's ability to handle nontrivial interaction geometry without abrupt or oscillatory behavior. The own ship effectively predicts the motion trend of the target ship and adjusts its control inputs continuously to ensure compliance with the COLREGs rules, giving way to the target ship approaching from its starboard side. This behavior highlights the model's capacity for context-aware decision-making and confirms that the PPO-based control strategy can manage multi-directional motion interactions while maintaining navigation safety and smoothness.

5 Conclusion

This study proposed a COLREGs-compliant ship collision avoidance strategy based on DRL, specifically using the PPO algorithm. The method enables an autonomous ship to make real-time, safe, and regulation-abiding navigation decisions in complex maritime environments without relying on explicit encounter classification or handcrafted maneuvering rules.

A dynamic ship motion model was employed to capture realistic maneuvering characteristics, and the reward function was carefully designed to integrate COLREGs rules, safety constraints, and navigation efficiency. Through extensive simulations, the proposed PPO-based policy demonstrated the ability to generate smooth, collision-free, and regulation-compliant trajectories across different scenarios, including static obstacle avoidance and two-ship encounters. The results confirmed that the trained agent can effectively adapt to varying encounter geometries, predict the potential collision risks, and perform appropriate maneuvers such as turning to starboard or adjusting speed to maintain safe distances.

Compared with traditional rule-based or optimization-based methods, the proposed DRL framework exhibits better adaptability, autonomy, and generalization in uncertain and dynamic maritime environments. The integration of COLREGs-based rewards ensures interpretable behavior consistent with human navigational practices, while the continuous control formulation produces stable and realistic motion commands suitable for real-world implementation.

Future work will primarily focus on extending the proposed framework to multi-ship scenarios with cooperative decision-making and improving the training efficiency through approaches such as curriculum learning or transfer learning. Additionally, efforts will be made to enhance robustness under realistic conditions by integrating sensor uncertainty and environmental disturbances, as well as conducting critical evaluations of real-time feasibility. Comparative studies with baseline controllers (e.g., APF, DDPG, A3C) and ablation experiments to assess the influence of the COLREGs reward will also be pursued to validate and strengthen the practical applicability of the proposed approach.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding authors.

Author contributions

QZ: Writing – original draft, Writing – review & editing. TY: Writing – original draft, Writing – review & editing. CM: Writing – original draft, Writing – review & editing. QC: Writing – original draft, Writing – review & editing. TL: Writing – original draft, Writing – review & editing. ML: Writing – original draft, Writing – review & editing. XW: Writing – original draft, Writing – review & editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This research was funded by the National Key R&D Program of China, grant number 2021YFC2803401, and by the National Natural Science Foundation of China, grant number 51909022.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Cao X., Sun C. Y., and Yan M. Z. (2019). Target search control of AUV in underwater environment with deep reinforcement learning. IEEE Access 7, 96549–96559. doi: 10.1109/ACCESS.2019.2929120


Cheng Y. and Zhang W. D. (2018). Concise deep reinforcement learning obstacle avoidance for underactuated unmanned marine vessels. J. Neurocomput. 272, 63–73. doi: 10.1016/j.neucom.2017.06.066


Chun D.-H., Roh M.-I., Lee H.-W., Ha J., and Yu D. (2021). Deep reinforcement learning-based collision avoidance for an autonomous ship. J. Ocean Eng. 234, 109216. doi: 10.1016/j.oceaneng.2021.109216


Cui Z., Guan W., and Zhang X. (2024). Collision avoidance decision-making strategy for multiple USVs based on Deep Reinforcement Learning algorithm. J. Ocean Eng. 308, 118323. doi: 10.1016/j.oceaneng.2024.118323


Fan Y., Sun Z., and Wang G. (2025). Progressive deep reinforcement learning for intelligent collision avoidance in unmanned surface vehicles. J. Ocean Eng. 332, 121438. doi: 10.1016/j.oceaneng.2025.121438


He Y., Zou L., Wu Z. X., Liu S. Y., Chen W. M., Zou Z. J., et al. (2025). Integrated path following and collision avoidance control for an underactuated ship based on MFAPC. Ocean Eng. 324, 120706. doi: 10.1016/j.oceaneng.2025.120706


Hu Q., Shi C., Chen H., and Hu Q. (2006). “Enabling vessel collision avoidance expert systems to negotiate,” in Proceedings of the Korean Institute of Navigation and Port Research Conference, vol. 1. (Korean Institute of Navigation and Port Research), 77–82.


Hu Q., Yang C., Cheng H., and Xiao B. (2008). Planned route based negotiation for collision avoidance between vessels. TransNav.: Int. J. Mar. Navig. Saf. Sea Transp. 2, 363–368.


IMO (1972). Convention on the international Regulations for Preventing Collisions at Sea (COLREGs) London, United Kingdom: International Maritime Organization (IMO).


Jia X., Gao S., and He W. (2025). Meta-reinforcement learning-based collision avoidance for autonomous ship. J. Ocean Eng. 339, 122064. doi: 10.1016/j.oceaneng.2025.122064


Kim D., Hirayama K., and Okimoto T. (2017). Distributed stochastic Search algorithm for multi-ship encounter situations. J. Navig. 70, 699–718. doi: 10.1017/S037346331700008X


Liu Y., Yang C., and Du X. (2007). “A multiagent-based simulation system for ship collision avoidance,” in International Conference on Intelligent Computing (Berlin, Heidelberg: Springer), 316–326.


Liu J. J., Zhang J. F., Yan X. P., and Guedes Soares C. (2022). Multi-ship collision avoidance decision-making and coordination mechanism in Mixed Navigation Scenarios. J. Ocean Eng. 257, 111666. doi: 10.1016/j.oceaneng.2022.111666


Ogawa A. and Kasai H. (1978). On the mathematical model of manoeuvring motion of ships. Int. shipbuilding Prog. 25, 306–319. doi: 10.3233/ISP-1978-2529202


Thyri E. H., Basso E. A., Breivik M., Pettersen K. Y., Skjetne R., and Lekkas A. M. (2020). “Reactive collision avoidance for ASVs based on control barrier functions,” in 2020 IEEE Conference on Control Technology and Applications (CCTA). 380–387.


Wu C., Yu W., Li G., and Liao W. (2023). Deep reinforcement learning with dynamic window approach based collision avoidance path planning for maritime autonomous surface ships. J. Ocean Eng. 284, 115208. doi: 10.1016/j.oceaneng.2023.115208


Xie S., Chu X. M., Zheng M., and Liu C. (2020). A composite learning method for multi-ship collision avoidance based on reinforcement learning and inverse control. J. Neurocomput. 411, 375–392. doi: 10.1016/j.neucom.2020.05.089


Yasukawa H. and Yoshimura Y. (2014). Introduction of MMG standard method for ship maneuvering predictions. J. Mar. Sci. Technol. 20, 37–52. doi: 10.1007/s00773-014-0293-y


Yoo B. and Kim J. (2016). Path optimization for marine vehicles in ocean currents using reinforcement learning. J. Mar. Sci. Technol. 21, 334–343. doi: 10.1007/s00773-015-0355-9


Yu S., Li Y., and Gong J. (2025). Hierarchical reinforcement learning for dynamic collision avoidance of autonomous ships under uncertain scenarios. J. Knowledge-Based Syst. 2025, 114528. doi: 10.1016/j.knosys.2025.114528


Zhang J., Yan X., Chen X., Sang L., and Zhang D. (2012). A novel approach for assistance with anti-collision decision making based on the International Regulations for Preventing Collisions at Sea. Proc. IME M J. Eng. Marit. Environ. 226, 250–259. doi: 10.1177/1475090211434869


Zhang J., Zhang D., Yan X., Haugen S., and Guedes Soares C. (2015). A distributed anti- collision decision support formulation in multi-ship encounter situations under COLREGs. Ocean Eng. 105, 336–348. doi: 10.1016/j.oceaneng.2015.06.054


Zhen R., Shi Z., Shao Z., and Liu J. (2022). A novel regional collision risk assessment method considering aggregation density under. J. Navig. 75, 76–94. doi: 10.1017/S0373463321000849


Zhou X. Y., Wu P., Zhang H. F., Guo W., and Liu Y. (2019). Learn to navigate: cooperative path planning for unmanned surface vehicles using deep reinforcement learning. IEEE Access 7, 165262–165278. doi: 10.1109/ACCESS.2019.2953326


Keywords: collision avoidance, COLREGs, PPO, reinforcement learning, ship-ship encounter

Citation: Zhao Q, Yang T, Mu C, Chen Q, Luo T, Liu M and Wang X (2026) COLREGs-compliant ship collision avoidance strategy based on proximal policy optimization algorithm. Front. Mar. Sci. 12:1756233. doi: 10.3389/fmars.2025.1756233

Received: 28 November 2025; Revised: 18 December 2025; Accepted: 26 December 2025;
Published: 21 January 2026.

Edited by:

Maohan Liang, National University of Singapore, Singapore

Reviewed by:

Guibing Zhu, Zhejiang Ocean University, China
Jiansen Zhao, Shanghai Maritime University, China

Copyright © 2026 Zhao, Yang, Mu, Chen, Luo, Liu and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Xin Wang, xin.wang@dlmu.edu.cn; Tianyu Yang, tianyuyang2023@sjtu.edu.cn
