Multi-UAV simultaneous target assignment and path planning based on deep reinforcement learning in dynamic multiple obstacles environments

Target assignment and path planning are crucial for the cooperativity of multiple unmanned aerial vehicle (UAV) systems. However, they are challenging given the dynamics of environments and the partial observability of UAVs. In this article, the problem of multi-UAV target assignment and path planning is formulated as a partially observable Markov decision process (POMDP), and a novel deep reinforcement learning (DRL)-based algorithm is proposed to address it. Specifically, a target assignment network is introduced into the twin-delayed deep deterministic policy gradient (TD3) algorithm to solve the target assignment problem and path planning problem simultaneously. The target assignment network executes target assignment for each step of the UAVs, while TD3 guides the UAVs to plan paths for this step based on the assignment result and provides training labels for the optimization of the target assignment network. Experimental results demonstrate that the proposed approach ensures an optimal complete target allocation and achieves a collision-free path for each UAV in three-dimensional (3D) dynamic multiple-obstacle environments, and presents superior performance in target completion and better adaptability to complex environments compared with existing methods.


Introduction
Recently, unmanned aerial vehicles (UAVs) have been widely applied to a variety of fields due to their high flexibility, low operating cost, and ease of deployment. In the military field, UAVs have become an important part of modern warfare and can be used for missions such as reconnaissance (Qin et al., 2021), strikes (Chamola et al., 2021), and surveillance (Liu et al., 2021), reducing casualties and enhancing combat efficiency. In agriculture, UAVs are well applied in plant protection (Xu et al., 2019; Chen et al., 2021), agricultural monitoring (Zhang et al., 2021), and so on, improving the efficiency and precision of agricultural operations. In environmental protection, UAVs are extensively employed in environmental monitoring (Yang et al., 2022), pollution source tracking (Liu et al., 2023), nature reserve inspection (Su et al., 2018), and other tasks, effectively supporting environmental protection work. In addition, for search and rescue tasks (Fei et al., 2022; Lyu et al., 2023), UAVs can quickly obtain disaster information through airborne sensors, providing efficient and timely assistance for subsequent rescue. However, a single UAV is difficult to apply to complex and diverse missions due to its limited functionality and payload. Cooperation among multiple UAVs (Song et al., 2023) greatly expands the ability and scope of task execution and has gradually replaced the single UAV as the key technology for various complex tasks. The key to solving multi-UAV cooperative problems (Wang T. et al., 2020; Xing et al., 2022; Wang et al., 2023) is target assignment and path planning for UAVs, which guarantee task completion.
The above problem consists of two fundamental sub-problems. Target assignment (Gerkey and Matarić, 2004) means assigning one UAV to each target to maximize the overall efficiency or minimize the total cost. It has many effective solutions, such as the genetic algorithm (GA) (Tian et al., 2018) and the Hungarian algorithm (Kuhn, 1955). Lee et al. (2003) introduced greedy eugenics into GA to improve its performance on weapon-target assignment problems. Aiming at the multi-task allocation problem, Samiei et al. (2019) proposed a novel cluster-based Hungarian algorithm. Path planning (Aggarwal and Kumar, 2020) refers to each drone planning an optimal path from its initial location to its designated target under a collision-free constraint. It has been studied extensively, and A* (Grenouilleau et al., 2019), the rapidly-exploring random tree algorithm (RRT) (Li et al., 2022), and particle swarm optimization (PSO) (Fernandes et al., 2022) are classical methods. Fan et al. (2023) incorporated the artificial potential field method into RRT to reduce the cost of path planning. He W. et al. (2021) proposed a novel hybrid algorithm for UAV path planning by combining PSO with the symbiotic organism search. However, most previous works tackle the problem in static environments, and a common feature of these solutions is that they rely on global information about the task environment for explicit planning, which may lead to unexpected failure in the face of uncertain circumstances or unpredictable obstacles.
Therefore, some studies resort to learning-based approaches such as deep learning (DL) (Kouris and Bouganis, 2018; Mansouri et al., 2020; Pan et al., 2021). Pan et al. (2021) combined DL and GA to plan the path for UAV data collection. The proposed method collected various paths and states in different task environments by GA and used them to train the neural network of DL, which can give an optimal path in familiar scenarios with real-time requirements. Kouris and Bouganis (2018) proposed a self-supervised CNN-based approach for indoor UAV navigation. This method used an indoor-flight dataset to train the CNN and utilized the CNN to predict collision distance based on an on-board camera. However, deep learning-based approaches require labels for training, and they become infeasible when the environment is highly variable.
Unlike DL methods, reinforcement learning (RL) (Thrun and Littman, 2000; Busoniu et al., 2008; Zhang et al., 2016) can optimize strategies directly through trial-and-error interaction with the environment without prior knowledge, which makes it adaptable to dynamic environments. Moreover, deep reinforcement learning (DRL) (Mnih et al., 2015) combines DL and RL to implement end-to-end learning. It frees RL from the limitation of low-dimensional spaces and greatly expands its scope of application (Wang C. et al., 2020; Chane-Sane et al., 2021; He L. et al., 2021; Kiran et al., 2021; Luo et al., 2021; Wu et al., 2021; Yan et al., 2022; Yue et al., 2023; Zhao et al., 2023). Wu et al. (2021) introduced a curiosity-driven method into DRL to improve training efficiency and performance in autonomous driving tasks. Yan et al. (2022) proposed a simplified, unified, and applicable DRL method for vehicular systems. Chane-Sane et al. (2021) designed a new RL method with imagined subgoals to facilitate learning of complex tasks such as challenging navigation and vision-based robotic manipulation. Luo et al. (2021) designed a DRL-based method to autonomously generate solutions for the missile-target assignment problem. He L. et al. (2021) presented an autonomous path planning method based on DRL for quadrotors in unknown environments. Wang C. et al. (2020) proposed a DRL algorithm with non-expert helpers to address the autonomous navigation problem for UAVs in large-scale complex environments.
DRL is suitable for solving the target assignment problem and path planning problem of UAVs, but there are still some challenges when multiple UAVs perform tasks in dynamic environments. The first challenge is inefficient target assignment. Typically, UAVs execute target assignment first and then perform path planning based on the assignment result. However, the dynamism and uncertainty of the environment often lead to an inaccurate assignment result, which directly affects the subsequent path planning. In this respect, UAVs need to perform autonomous target assignment and path planning simultaneously. Only a few scholars have studied this field. Qie et al. (2019) constructed the multi-UAV target assignment and path planning problem as a multi-agent system and used the multi-agent deep deterministic policy gradient (MADDPG) (Lowe et al., 2017) framework to train the system to solve the two problems simultaneously. They traverse all targets and select the agent closest to each target after each step of the agents, which often results in an incomplete assignment of targets when two agents are at the same and shortest distance from one target. Han et al. (2020) proposed a navigation policy for multiple robots in a dynamic environment based on the Proximal Policy Optimization (PPO) (Schulman et al., 2017) algorithm. Their target assignment scheme depends on the distance between robots and targets. However, this assignment method does not take into account the obstacles in the task environment, which makes it prone to inaccurate allocation in multi-obstacle environments similar to the real world. The second challenge is that UAVs' onboard sensors have a limited detection range. The real-time decision-making of UAVs depends on observations returned by sensors, especially in dynamic and uncertain environments. If the detection range of sensors is limited, the current state cannot fully represent the global environmental information, which greatly increases the difficulty of autonomous flight.
To overcome these challenges, this article models the multi-UAV target assignment and path planning problem as a partially observable Markov decision process (POMDP) (Spaan, 2012) and designs a simultaneous target assignment and path planning method based on DRL to solve it. Among DRL-based methods, the twin-delayed deep deterministic policy gradient (TD3) (Fujimoto et al., 2018) is a state-of-the-art (SOTA) DRL algorithm and has been widely used in training UAV policies. It significantly improves the learning speed and performance of the deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015) algorithm by reducing the overestimation of DDPG. Zhang et al. (2022) introduced the spatial change information of the environment into TD3 and used it to guide a UAV to complete navigation tasks in complex environments with multiple obstacles. Hong et al. (2021) proposed an advanced TD3 model to perform energy-efficient path planning at the edge-level drone. In this regard, a more effective DRL algorithm based on TD3 is proposed in this article to solve the POMDP.
The main contributions of this article can be summarized as follows:
• A DRL framework for multi-UAV target assignment and path planning is developed in 3D dynamic multiple-obstacle environments, where the target assignment and path planning problem is modeled as a POMDP.
• A simultaneous target assignment and path planning method taking into account UAVs, targets, and moving obstacles is proposed, which can achieve an optimal target assignment and complete collision-free path planning for each UAV simultaneously.
• A 3D stochastic complex simulation environment is built to train the algorithm, and the experimental results validate the effectiveness of the proposed method.
The remainder of this article is organized as follows: the background is presented in Section 2, and Section 3 introduces the formulation of the multi-UAV problem. In Section 4, a detailed introduction to our method is provided. Section 5 presents the simulation experiments and results. Finally, the conclusion of this paper and future work are summarized in Section 6.

Background
This section first gives a brief introduction to the multi-UAV target assignment and path planning problem considered in this article, and then formulates the multi-UAV problem as a POMDP in 3D dynamic environments.

Multi-UAV target assignment and path planning problem
The multi-UAV target assignment and path planning scenario of this paper is shown in Figure 1:
(1) A series of UAVs are commanded to fly across a 3D mission area until they reach the targets distributed in different locations.
(2) The mission area is scattered with static and irregularly moving obstacles.
(3) UAVs are required to avoid collisions with each other and with obstacles.
(4) UAVs are isomorphic and targets are identical.
The objective of multi-UAV target assignment and path planning is to minimize the total flight path length of all UAVs [Equation (1)] under the constraints of complete target assignment and collision avoidance [Equation (2)], where the first constraint requires that the UAVs do not collide with each other at any time, while the second requires that each UAV's path be collision-free with obstacles.

Modeling the multi-UAV problem as a POMDP
The multi-UAV problem can be modeled as a POMDP, which is composed of a tuple ⟨N, S, O, A, P, R⟩. In this tuple, N = {1, 2, ..., N} represents the collection of N UAVs; S is the state space of the UAVs; O = {o_1, o_2, ..., o_N} is the observation of all UAVs, where o_i represents the observation of UAV i. When the environment is partially observable, at time t, each UAV only obtains its own local observation o_{t,i} ∈ o_i. A = {a_1, a_2, ..., a_N} is the action collection of the UAVs, where a_i is the action taken by UAV i; P : S × A × S′ → [0, 1] denotes the probability that the state transfers from S to S′ after performing action A; and R is the reward function. The multi-UAV reinforcement learning process in a partially observable environment is shown in Figure 2. At each step t, UAV i selects its optimal action a_{t,i} based on the policy π to maximize the joint cumulative reward of all UAVs, where π(a|s) = P[A_t = a | S_t = s] represents the probability of taking action a in state s. Then the joint action A_t = {a_{t,1}, a_{t,2}, ..., a_{t,N}} of the UAVs is executed to control their movement, the joint state changes to S_{t+1}, and UAV i receives the reward R_{t,i}. The cumulative reward of UAV i is defined as Equation (4), where γ ∈ [0, 1] is the discount factor that balances current and future rewards.
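The discounted cumulative reward referenced as Equation (4) can be written out explicitly; this is the standard form implied by the surrounding definitions:

```latex
G_{t,i} = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k,\,i}, \qquad \gamma \in [0,1]
```

Maximizing the joint cumulative reward then means choosing the policy $\pi$ that maximizes $\sum_{i=1}^{N} G_{t,i}$.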

Problem formulation
UAVs use onboard sensors to acquire their internal state information and environmental state information, execute actions according to the DRL model, and obtain the corresponding reward. Figure 3 describes the problem formulation.

State space
The state space consists of the internal state of the UAV and the environmental information within the maximum detection distance d_det of the onboard sensors. The state space of UAV i is accordingly defined as the combination of these two parts.

Action space
In this paper, the action space of UAV i is defined as shown in Figure 3B, where F_X^i, F_Y^i, and F_Z^i represent the component forces applied to UAV i in the X, Y, and Z directions, respectively. The force produces an acceleration that changes the velocity of the UAV.
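As a rough sketch of the point-mass dynamics this action space implies (the mass, time step, and speed limit below are illustrative assumptions, not values from the paper):

```python
import math

def step_dynamics(pos, vel, force, mass=1.0, dt=0.1, v_max=0.05):
    """One integration step of a point-mass UAV: force -> acceleration -> velocity -> position."""
    acc = [f / mass for f in force]                 # F = m * a
    vel = [v + a * dt for v, a in zip(vel, acc)]    # velocity update
    speed = math.sqrt(sum(v * v for v in vel))
    if speed > v_max:                               # clip to an assumed maximum speed
        vel = [v * v_max / speed for v in vel]
    pos = [p + v * dt for p, v in zip(pos, vel)]    # position update
    return pos, vel

pos, vel = step_dynamics([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0])
```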

Reward function
In this paper, the goal of the reward function is to guide UAVs to fly to the assigned target without any collision. In order to address the low training efficiency caused by sparse rewards, the reward function in this article uses a combination of guided rewards and sparse rewards. In the process of interacting with the environment, if a UAV reaches the target, collides with another UAV, or hits an obstacle, a sparse reward is applied; when none of these three situations occurs, a guided reward is applied.
(1) Approaching the target
This reward function guides the UAV to head for and reach the target. When a UAV moves away from the target, it receives a larger penalty related to the distance between the UAV and the target, and a reward of 0 is given when it arrives at the target. Consequently, the reward for UAV i approaching the target is defined as Equation (5), where d_i^t denotes the distance between UAV i and the target, r_u is the radius of UAVs, and r_t is the radius of targets.

(2) Avoiding collision with other UAVs
This reward avoids collisions with other UAVs in the process of approaching the target. When the distance between UAV i and UAV j is shorter than their minimum safe distance, a collision occurs and the penalty value is set as Equation (6), where d_i^j represents the distance between UAV i and UAV j, and d_u^safe = 2r_u is the minimum safe distance between UAVs.
(3) Avoiding obstacles
The aim of this reward function is to keep UAVs away from obstacles. If an obstacle appears within the detection range d_det, the UAV obtains a punishment, and the closer the UAV gets to the obstacle, the greater the penalty. When the distance between the UAV and the obstacle is less than their minimum safe distance, a penalty of −1 is given to the UAV. In Equation (7), d_i^o is the distance between UAV i and the nearest obstacle within the detection range, d_o^safe = r_u + r_o is the minimum safe distance between UAVs and obstacles, and r_o is the radius of obstacles.
In conclusion, the reward function received by UAV i can be summarized as Equation (8). As can be seen from the reward function designed in this article, the guided rewards are all negative. This means that each additional step taken by the UAV before reaching the target incurs a negative step penalty. Therefore, the reward value in this article reflects the length of the flight path: a longer flight path corresponds to a smaller reward value.
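The structure of the combined reward (Equations 5-8) can be sketched as follows; the exact penalty magnitudes are not reproduced in this excerpt, so the scaling of the guided penalties is an illustrative assumption:

```python
def reward(d_target, d_uav, d_obs, r_u=0.02, r_t=0.12, r_o=0.1, d_det=0.5):
    """Combined sparse + guided reward for one UAV (penalty scaling illustrative).
    d_uav / d_obs may be None when no other UAV / obstacle is detected."""
    d_u_safe = 2 * r_u          # minimum safe distance between UAVs (Eq. 6)
    d_o_safe = r_u + r_o        # minimum safe distance to obstacles (Eq. 7)
    if d_target <= r_u + r_t:   # sparse: the UAV arrived at the target (Eq. 5)
        return 0.0
    if d_uav is not None and d_uav < d_u_safe:   # sparse: UAV-UAV collision
        return -1.0
    if d_obs is not None and d_obs < d_o_safe:   # sparse: obstacle collision
        return -1.0
    r = -d_target               # guided: penalty grows with distance to target
    if d_obs is not None and d_obs < d_det:      # guided: penalty grows near obstacles
        r -= (d_det - d_obs) / d_det
    return r
```

Note how every guided reward is negative, so accumulated reward shrinks with path length, as the text observes.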

Algorithm
In this section, the proposed algorithm, TANet-TD3, is illustrated in detail.

TD3 algorithm
This paper uses TD3 as the basic algorithm to address the multi-UAV target assignment and path planning problem. As an improvement of the DDPG algorithm, TD3 also uses an Actor-Critic structure, but it introduces three techniques to prevent the overestimation problem of DDPG: (1) Clipped double-Q learning.
TD3 has two Critic networks Q_θn parameterized by θ_n, n = 1, 2, and two corresponding Critic target networks. The smaller of the two target Q-values is used to calculate the target value function y [Equation (9)] to alleviate the overestimation of the value function, as shown in (1) of Figure 4.
Therefore, the two Critic networks are updated by minimizing the loss function in Equation (10).
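A minimal numeric sketch of the clipped double-Q target (Equation 9) and the per-sample Critic loss (Equation 10):

```python
def td3_target(reward, gamma, q1_next, q2_next, done=False):
    """Clipped double-Q target: use the smaller of the two target critics (Eq. 9)."""
    if done:
        return reward
    return reward + gamma * min(q1_next, q2_next)

def critic_loss(q_pred, y):
    """Per-sample MSE between a critic's estimate and the target value (Eq. 10)."""
    return (q_pred - y) ** 2
```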

FIGURE: The framework of target assignment.
(2) Delayed policy update. TD3 updates the policy only after getting an accurate estimation of the value function to ensure more stable training of the Actor network. Typically, the Actor is updated once for every two Critic updates, as shown in (2) of Figure 4.
(3) Target policy smoothing. The policy in DDPG is susceptible to the function approximation error. TD3 adds clipped noise to the target policy to make the value estimate more accurate, as shown in (3) of Figure 4.
In Equation (11), s′ and a′ represent the state and action at the next time step, respectively. π_φ′ represents the Actor target network with parameter φ′, and ε denotes the clipped noise.
Each UAV executes action a to transfer the state s to the next state s′ and obtains a reward R from the environment. The data (s, a, R, s′) is stored in the replay buffer D as a tuple. A mini-batch of transitions is sampled randomly from D, and s′ is input into the Actor target network π_φ′ to get the next action a′. Then, (s′, a′) is input into the two Critic target networks Q_θ′1, Q_θ′2 to calculate the Q-values, and the smaller one is selected to calculate the target value y. In the meantime, (s, a) is input into the two Critic networks Q_θ1, Q_θ2, and the MSE with y is calculated to update the parameters θ1, θ2 of the two Critic networks. After that, the Q-value acquired from Critic network Q_θ1 is input into the Actor network π_φ, and its parameter φ is updated in the direction of increasing the Q-value as Equation (12). Finally, the Actor target network's parameter φ′ and the two Critic target networks' parameters θ′1, θ′2 are updated by soft update as in Equations (13) and (14).
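The soft update of Equations (13) and (14) is Polyak averaging; the value of `tau` below is illustrative:

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta' (Eqs. 13-14)."""
    return [tau * p + (1 - tau) * tp
            for tp, p in zip(target_params, online_params)]

new_targets = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

Because `tau` is small, the target networks track the online networks slowly, which stabilizes the bootstrapped targets.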
Framework of the TANet-TD3

This paper proposes the twin-delayed deep deterministic policy gradient algorithm with target assignment network (TANet-TD3). Different from existing methods that assign targets for the whole task first and then plan paths according to the assignment results, TANet-TD3 solves multi-UAV target assignment and path planning simultaneously in dynamic multi-obstacle environments. The framework of TANet-TD3 is shown in Figure 5; the objective of the task is to minimize the total flight path length of all UAVs under the complete target assignment constraint and the collision-free constraint. TANet-TD3 introduces a target assignment network into the framework of TD3 to solve the two problems simultaneously. In the overall process, the target assignment network provides the optimal complete assignment of targets for each step of the UAVs (the green dashed box), and then the TD3 algorithm guides each UAV to plan a feasible path for this step (the blue dashed box) according to the assignment result (the yellow dashed box). In the meantime, the training labels of the assignment network are obtained from the path planning process driven by the TD3 algorithm (the purple dashed box). This method not only takes into account the distance between UAVs and targets but also considers the dynamic obstacles in task environments, so it can generate an optimal assignment and path.

Framework of target assignment
(1) Update of the assignment network
The assignment network outputs the probability of each target being assigned to the UAV in the current state, and then the cross-entropy with the assignment labels is calculated to update the assignment network [Equation (16)].

(2) Construction of the assignment label
From the bottom section of Figure 6, it can be seen that the training labels of the assignment network are provided by the TD3 framework. The task objective is to achieve a complete assignment and minimize the total flight path, but it is not accurate to consider only the distance between UAVs and targets when making decisions in random and dynamic environments. As mentioned in Section 2.2, a multi-UAV problem in DRL means maximizing the joint cumulative reward of all UAVs; that is, each UAV chooses the action that maximizes the Q-value based on its current state. Compared with selecting the target only according to distance, this method determines the assigned target according to the Q-value, comprehensively taking into account UAVs, targets, and obstacles, even when obstacles are moving, so the targets can obtain an optimal assignment.
For UAV i, a 1 × N_T Q-value list Q_i1, Q_i2, ..., Q_iN_T can be obtained at each step from the initial position by considering each target T_j, j = 1, 2, ..., N_T as the destination UAV i will eventually reach, and for N_U UAVs, an N_U × N_T value matrix is formed as Equation (17) by traversing all targets. In order to ensure the constraint of complete target assignment, the Hungarian algorithm is adopted: among many methods, it has fast solution speed and stable solution quality, and, with the aid of the independent zero-element theorem, it obtains the exact solution through elementary transformations of the matrix in a finite number of steps. After the Hungarian transformation, the Q-value matrix is transformed into a permutation matrix with only 0 and 1 elements, as in Equation (18). If the element 1 of row i is located in column j, it means that the j-th target is assigned to the i-th UAV. Thus, the target assignment can be achieved according to the Q-value Q_i,j at each step, and the result of the Hungarian transformation is used as the training label of the target assignment network.
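The assignment step can be sketched as follows. The paper uses the Hungarian algorithm on the Q-value matrix of Equation (17); for the small matrices here, a brute-force search over permutations yields the same optimal permutation matrix (Equation 18) and keeps the sketch dependency-free:

```python
from itertools import permutations

def assign_targets(q_matrix):
    """Complete one-to-one assignment maximizing total Q-value.
    Stand-in for the Hungarian algorithm: exhaustive search over permutations,
    which returns the same optimum for small N_U = N_T."""
    n = len(q_matrix)
    best_perm, best_val = None, float("-inf")
    for perm in permutations(range(n)):                  # perm[i] = target of UAV i
        val = sum(q_matrix[i][perm[i]] for i in range(n))
        if val > best_val:
            best_val, best_perm = val, perm
    # Build the 0/1 permutation matrix of Equation (18).
    return [[1 if best_perm[i] == j else 0 for j in range(n)] for i in range(n)]

Q = [[-1.0, -3.0], [-2.0, -1.5]]
P = assign_targets(Q)   # UAV 0 -> target 0, UAV 1 -> target 1
```

In production one would use an O(n³) Hungarian implementation (e.g. `scipy.optimize.linear_sum_assignment`) instead of the factorial-time search.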
(3) Construction of environmental state information with a new sequence
After the target assignment network of UAV i has been fully trained, a list of probabilities of UAV i moving to each target in the current state can be obtained, in which the assigned target has the largest probability. As shown in the top section of Figure 6, if target T_j is assigned to UAV i, then the index of target T_j can be calculated by Equation (19): index(T_j) = argmax(P_i1, P_i2, ..., P_iN_T). The original environmental state information o_i is then reordered into õ_i; that is, the assigned target T_j is placed first in the target sequence to guide UAV i to recognize its own target.
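Equation (19) and the reordering step amount to an argmax followed by moving the assigned target to the front of the sequence:

```python
def reorder_targets(target_states, probs):
    """Place the assigned target (largest probability, Eq. 19) first in the sequence."""
    idx = max(range(len(probs)), key=probs.__getitem__)   # argmax over probabilities
    return [target_states[idx]] + target_states[:idx] + target_states[idx + 1:]

seq = reorder_targets(["T1", "T2", "T3"], [0.2, 0.7, 0.1])
```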
The target assignment network realizes the optimal target assignment at every step in the dynamic environment, and then UAV i uses the TD3 algorithm to plan the path to the assigned target according to the new state information s̃_i = (s_ui, õ_i). The Actor network is updated according to Q_i,n and the state information s̃_i with the new sequence using Equation (20). The TANet-TD3 is described in Algorithm 1.

Experiments and results
In this section, the simulation environment is introduced first. Then, the training experiments, testing experiments, and statistical experiments are presented to verify the effectiveness of the proposed method in different scenarios.

9: Randomly sample a mini-batch of transitions from D
10: Calculate target actions using Equation (11)
11: Calculate Q-targets using Equation (9)
12: Update θ_i,1 and θ_i,2 using Equation (10)
    Obtain the assignment label using Equation (18)
15: Update the target assignment network using Equation (16)
16: Construct the observation s̃_i with the new sequence
17: If t mod d then
18:     Update the Actor using Equation (20)
19:     Update target networks:
20:     Update θ′_i,1 and θ′_i,2 using Equation (13)
21:     Update φ′_i using Equation (14)
22: End if
23: End for
24: End for
Algorithm 1. TANet-TD3.

Experimental settings

A 3D simulation environment with two-dimensional three views is constructed based on the OpenAI platform to implement multi-UAV simultaneous target assignment and path planning in dynamic multiple-obstacle environments. As shown in Figure 7, the simulation environment covers a 2 × 2 × 2 cubic area; UAVs, targets, and obstacles are simplified to spheres and randomly initialized in this area. The radius of UAVs is r_u = 0.02, and the maximum detection range d_det of UAVs is set to 0.5, which is denoted by the colored spherical shades around the UAVs. The radius of targets is set to r_t = 0.12. The obstacles have a static mode and a mobile mode, with radius r_o = 0.1. In motion mode, they move linearly with a randomly initialized direction and velocity v_i ∈ [−0.05, 0.05], i ∈ {X, Y, Z}, where v_i represents the sub-velocity of obstacles in the X, Y, and Z directions. When an obstacle hits the boundary of the simulation environment, it moves in the opposite direction with the same speed.
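The boundary-bouncing obstacle motion described above can be sketched as follows, assuming per-axis coordinates in [−1, 1] (the exact coordinate convention of the 2 × 2 × 2 area is not stated in this excerpt):

```python
def move_obstacle(pos, vel, lo=-1.0, hi=1.0, dt=1.0):
    """Linear obstacle motion; reverse the sub-velocity on hitting a boundary."""
    new_pos, new_vel = [], []
    for p, v in zip(pos, vel):
        p = p + v * dt
        if p < lo or p > hi:             # bounce: same speed, opposite direction
            v = -v
            p = max(lo, min(hi, p))      # clamp back inside the task area
        new_pos.append(p)
        new_vel.append(v)
    return new_pos, new_vel
```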
In this paper, the network of TD3 is shown in Figure 8; N UAVs include N Actor-Critic structures. For UAV i, the Actor network is constructed as s_i × 64 × 128 × 64 × a_i, where the input s_i represents the state of UAV i and the output a_i represents the action performed by UAV i. The first three layers use a rectified linear unit (ReLU) as the activation function, and the last layer uses a hyperbolic tangent (tanh) activation function to limit the action output to the range [−1, 1]. The Critic has a network structure of (s_i + a_i) × 64 × 128 × 64 × Q_i; after three fully connected layers (FCs) activated by ReLU, the Critic maps the combination of the state and action of UAV i to the Q-value evaluated by UAV i. The hyperparameters of TANet-TD3 and TD3 are given in Table 1.
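A minimal NumPy sketch of the Actor forward pass of Figure 8 (ReLU hidden layers, tanh output); the state and action dimensions below are illustrative, not taken from the paper:

```python
import numpy as np

def mlp_actor(state, weights, biases):
    """Actor of Figure 8: s_i -> 64 -> 128 -> 64 -> a_i, ReLU hidden, tanh output."""
    x = np.asarray(state, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, x @ W + b)             # ReLU hidden layers
    return np.tanh(x @ weights[-1] + biases[-1])   # tanh bounds actions to [-1, 1]

rng = np.random.default_rng(0)
dims = [10, 64, 128, 64, 3]                        # state dim 10, action dim 3 (illustrative)
Ws = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(dims[:-1], dims[1:])]
bs = [np.zeros(n) for n in dims[1:]]
action = mlp_actor(rng.standard_normal(10), Ws, bs)
```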

Training experiments
Training experiments include two sections: the first verifies the advantages of TD3 in path planning, and the second validates the effectiveness of TANet-TD3 in multi-UAV simultaneous target assignment and path planning. These algorithms are trained in the dynamic and mixed task environments depicted in Figure 7, and in each episode, UAVs, targets, and obstacles are randomly initialized in the task area. Three indicators are used to measure training performance, shown in Equation (21): the average reward, the average arrival rate, and the average target completion rate, where N_ver is the number of verification episodes, R_i is the reward of the i-th verification episode, N_U^i is the number of UAVs that reach a target in the i-th verification episode, and N_T^i is the number of targets reached by a UAV in the i-th verification episode.
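The three indicators of Equation (21) can be computed as simple averages over the verification episodes:

```python
def training_metrics(rewards, arrivals, completions, n_uav, n_targets):
    """Average reward, arrival rate, and target completion rate over N_ver episodes (Eq. 21).
    rewards[i]: episode reward; arrivals[i]: UAVs that reached a target;
    completions[i]: targets reached by some UAV."""
    n_ver = len(rewards)
    avg_reward = sum(rewards) / n_ver
    avg_arrival = sum(arrivals) / (n_ver * n_uav)            # fraction of UAVs arriving
    avg_completion = sum(completions) / (n_ver * n_targets)  # fraction of targets completed
    return avg_reward, avg_arrival, avg_completion
```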

Training experiments for path planning tasks
Firstly, the TD3 algorithm is trained in single-UAV and multi-UAV dynamic scenarios, respectively. Scenario I: one UAV, one target, and 20 moving obstacles; Scenario II: three UAVs, one target, and 20 moving obstacles. Each experiment is trained for 5,000 episodes, and 50 verification episodes are conducted after every 10 training episodes. The average reward and average arrival rate of these 50 verification episodes are counted to evaluate the algorithm. As a comparison, the DDPG algorithm is trained in the same task scenarios as TD3 with the same hyperparameters in Table 1.
As can be seen from the training results depicted in Figures 9A, B, after adequate training, the TD3 algorithm achieves a good average arrival rate for path planning tasks, whether in scenario I with a single UAV (95%) or in scenario II with multiple UAVs (90%). It is evident that, compared to the DDPG algorithm, TD3 has a better convergence effect and a faster convergence speed.
Therefore, in this paper, the TD3 algorithm is used as the basic algorithm for path planning, which can provide accurate assignment labels for the training of the target assignment network.

Training experiments for simultaneous target assignment and path planning tasks
Next, the proposed algorithm TANet-TD3 is trained in the dynamic environment (five UAVs, five targets, and 20 moving obstacles) and the mixed environment (five UAVs, five targets, 10 static obstacles, and 10 moving obstacles). Each experiment has 10,000 episodes, and 50 verification episodes are conducted after every 10 training episodes. To verify the feasibility of the assignment network of TANet-TD3, DDPG with the target assignment network (TANet-DDPG) is introduced for comparison. In addition, the scheme of target assignment based on the distance between targets and UAVs is introduced into DDPG (DDPG(distance)) and TD3 (TD3(distance)), respectively, to verify the advantages of TANet-TD3. The four algorithms are trained with the same hyperparameters in Table 1, and the target completion rate and the average reward are used as indicators to evaluate their performance.
As shown in Figures 9C, E, in the initial stage, all algorithms generate training samples through the interaction between UAVs and the environment, and training starts once the number of samples reaches the batch size. The reward is very low and the UAVs do not know what the goal is during the first 3,000 episodes. With the gradual accumulation of samples in the replay buffer, each UAV gradually learns more intelligent strategies and finally reaches convergence. The bold values in Table 2 represent the training results of our algorithm, which are optimal among the four algorithms.

Additionally, the relevant statistics in Table 2 illustrate that TANet-TD3 has the highest average target completion rate and average reward in both the dynamic and mixed environments.
Overall, the improvement of TANet-TD3 and TANet-DDPG over DDPG(distance) and TD3(distance) is remarkable, which demonstrates the effectiveness of the assignment method proposed in this paper in achieving simultaneous target assignment and path planning. Moreover, the training results also illustrate that TANet-TD3 outperforms TANet-DDPG in terms of convergence effect and convergence speed, mainly due to the superiority of TD3 in path planning tasks, which provides better Q-value labels for the optimization of the target assignment network.

Testing experiments and results
In order to evaluate the application efficiency of the algorithms after convergence and further verify the advantages of TANet-TD3 in simultaneous target assignment and path planning for multiple UAVs, a series of test experiments are conducted, in which the converged network parameters of TANet-TD3 and TANet-DDPG are used to control the movement of UAVs in two environments. One is a dynamic environment, where all obstacles are mobile; the other is a mixed environment, where obstacles are static or mobile. As shown in Figure 10, UAVs, targets, and obstacles are randomly deployed in the task areas. Figure 10A presents the 3D scenario of a dynamic environment with five UAVs, five targets, and 20 mobile obstacles, and Figure 10B presents the mixed environment with five UAVs, five targets, 10 static obstacles, and 10 moving obstacles. As a result, TANet-TD3 presents better adaptability to dynamic environments compared to TANet-DDPG. Besides, the statistical test results shown in Table 3 illustrate that TANet-TD3 exceeds TANet-DDPG in both the number of targets reached by UAVs and the reward value. According to the design of the reward function in Section 3.3, the reward value also reflects the length of the flight path of UAVs, which indicates that the paths derived by TANet-TD3 are shorter than those derived by TANet-DDPG.

Statistical experiments
In this section, the statistical experiments about different numbers of UAVs and different numbers of obstacles are presented to further verify the advantage of TANet-TD3.

Adaptability to different numbers of UAVs
In this experiment, the average target completion rates of TANet-DDPG and TANet-TD3 are compared as the number of UAVs varies from 3 to 7. The obstacle setting is 20 moving obstacles in the dynamic environment, and 10 static plus 10 moving obstacles in the mixed environment. Each experiment with a specific number of UAVs is repeated for 1,000 episodes, and in each episode, UAVs, targets, and obstacles are initialized with random positions and velocities. Figures 15A, B depict the statistical results in the dynamic environment and the mixed environment, respectively.
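The evaluation protocol above (many randomized episodes whose per-episode completion rates are averaged) can be sketched as a simple Monte Carlo loop. This is an illustrative scaffold, not code from the paper: `policy` and `make_episode` are hypothetical placeholders for a trained controller and one randomized simulation episode returning (targets reached, total targets).

```python
import random

def evaluate(policy, make_episode, episodes=1000, seed=0):
    """Monte Carlo estimate of the average target completion rate.

    `policy` is a trained controller (e.g., a converged TANet-TD3 agent)
    and `make_episode(policy, rng)` runs one randomly initialized episode
    and returns (targets_reached, total_targets). Both are placeholders.
    """
    rng = random.Random(seed)  # fixed seed for reproducible statistics
    rates = []
    for _ in range(episodes):
        reached, total = make_episode(policy, rng)
        rates.append(reached / total)
    return sum(rates) / len(rates)
```

With a dummy episode that always reports 4 of 5 targets reached, `evaluate(None, lambda p, r: (4, 5), episodes=10)` returns an average completion rate of 0.8.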
As the number of UAVs increases, the difficulty of the simultaneous target assignment and path planning task increases dramatically, and the average target completion rates of TANet-DDPG and TANet-TD3 gradually decrease in both the dynamic and mixed environments. Faced with the complex mission scenario of seven UAVs and 20 obstacles, TANet-TD3 can maintain an average target completion rate of more than 71% (71.54%, 71.06%). In contrast, TANet-DDPG drops to just over 70% (70.35%, 70.45%) with six UAVs and falls sharply below 65% (64.93%, 63.03%) with seven UAVs. In addition, as the number of UAVs increases, the gap between TANet-DDPG and TANet-TD3 grows wider.

Adaptability to different numbers of obstacles
This experiment verifies the effect of different numbers of obstacles on TANet-TD3 and TANet-DDPG. Specifically, the two algorithms are compared in dynamic and mixed environments with five UAVs and 10, 15, 20, 25, or 30 obstacles. Each experiment is repeated for 1,000 episodes, and the states of UAVs, targets, and obstacles are randomly initialized for each episode. The comparison results for the average target completion rate are presented in Figures 15C, D.
As shown in Figure 15, the increase in the number of obstacles affects the performance of both algorithms in the dynamic and mixed environments, but TANet-TD3 consistently outperforms TANet-DDPG in all scenarios. Additionally, when the number of obstacles reaches 25, the average target completion rate of TANet-DDPG falls below 80% in both the dynamic (79.36%) and mixed (79.35%) environments, while the average target completion rate of TANet-TD3 remains above 81% (81.36%, 82.40%) even in the complex environment with 30 obstacles.
In summary, TANet-TD3 can effectively complete simultaneous target assignment and path planning. Moreover, the experiments demonstrate that TANet-TD3 adapts better to dynamic and random environments than TANet-DDPG.

Conclusion and discussion
This paper proposes TANet-TD3, a novel DRL-based method for multi-UAV target assignment and path planning in dynamic multi-obstacle environments. The problem is formulated as a POMDP, and a target assignment network is introduced into the TD3 algorithm to complete target assignment and path planning simultaneously. Specifically, each UAV in turn treats each target as its final destination and executes the action derived by TD3 for the next step. A Q-value matrix is obtained from the reward function, and the Hungarian algorithm is applied to this matrix to achieve an exact match between UAVs and targets. The matching result is used as the label to train the target assignment network, so as to obtain the optimal target allocation. Each UAV then moves to its assigned target under the planning of the TD3 algorithm. The experimental results demonstrate that TANet-TD3 achieves simultaneous target assignment and path planning in dynamic multiple-obstacle environments, and that it outperforms existing methods in both convergence speed and target completion rate.
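The assignment step described above, matching UAVs to targets so that the summed Q-values are maximal, can be sketched as follows. The paper uses the Hungarian algorithm (of which `scipy.optimize.linear_sum_assignment` is a standard implementation); this stdlib-only illustration finds the same optimum by exhaustive search over permutations, which is only practical for small numbers of targets.

```python
from itertools import permutations

def optimal_assignment(q):
    """Return the UAV-to-target assignment maximizing the total Q-value.

    q[i][j] is the Q-value of UAV i flying to target j (square matrix).
    Exhaustive search stands in for the Hungarian algorithm here and is
    equivalent for small matrices; the Hungarian algorithm scales to
    larger ones in polynomial time.
    """
    n = len(q)
    best_perm, best_val = None, float("-inf")
    for perm in permutations(range(n)):  # perm[i] = target assigned to UAV i
        val = sum(q[i][perm[i]] for i in range(n))
        if val > best_val:
            best_perm, best_val = perm, val
    return list(best_perm), best_val
```

For a hypothetical 3x3 Q-value matrix whose diagonal dominates, the optimum assigns each UAV i to target i, satisfying the complete-assignment constraint that each target is matched to exactly one UAV.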
For future research, we will further improve the proposed method by combining it with specific applications, such as multi-UAV target search and multi-UAV target tracking. Additionally, we will study methods for computing the Q-value matrix in high-dimensional scenarios to handle complex tasks with a large number of targets. Furthermore, we will build a more realistic simulation environment, in which the shapes and movements of obstacles are more complex, to verify the effectiveness of the proposed algorithm.

FIGURE Schematic diagram of multi-UAV target assignment and path planning.

FIGURE Multi-UAV reinforcement learning process in a partially observable environment.
other UAVs with the velocity v_u and the radius r_u. s_O represents the state of obstacles. If an obstacle is within the max detection range, s_O = (p_o, v_o, r_o) comprises the relative position p_o = (x_o − x_i, y_o − y_i, z_o − z_i), the velocity v_o, and the radius r_o of the obstacle; otherwise, s_O = (±d_det, ±d_det, ±d_det, 0, 0).
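The detection-range rule above can be sketched as a small observation builder. Note one inconsistency in the extracted text: the fallback sentinel lists five entries, while the network input uses seven values per obstacle; this sketch assumes the 7-dimensional form (3D relative position, 3D velocity, radius) with a zero-velocity, zero-radius sentinel, and the positive-sign sentinel is likewise an assumption for illustration. The detection range value is hypothetical.

```python
import math

D_DET = 5.0  # hypothetical max detection range d_det

def obstacle_state(uav_pos, obs_pos, obs_vel, obs_radius, d_det=D_DET):
    """Build the per-obstacle state s_O for one UAV.

    Within detection range: (relative position, velocity, radius).
    Out of range: a sentinel (d_det, d_det, d_det, 0, 0, 0, 0) meaning
    'nothing detected'. All inputs are tuples of floats.
    """
    rel = tuple(o - u for o, u in zip(obs_pos, uav_pos))  # p_o relative to UAV i
    if math.dist(obs_pos, uav_pos) <= d_det:
        return rel + tuple(obs_vel) + (obs_radius,)
    return (d_det, d_det, d_det, 0.0, 0.0, 0.0, 0.0)
```

A nearby obstacle yields its true relative state, while a distant one collapses to the constant sentinel, keeping the network input dimension fixed regardless of how many obstacles are actually visible.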

FIGURE Problem formulation. (A) State space. (B) Action space. (C) Reward design.

FIGURE Simulation environment. (A) The 3D simulation environment. (B) The simulation environment from X-Y view. (C) The simulation environment from Y-Z view. (D) The simulation environment from X-Z view. The colored spherical shade around each UAV in (A) denotes the detection range of the UAV.

Figure 6 illustrates the overall framework of target assignment. It is composed of three parts: the target assignment network, construction of the assignment label, and construction of the environmental state information with a new sequence. (1) Target assignment network: as shown in the middle section of Figure 6, the network consists of a (7 + 4(N_U − 1) + 4N_T + 7N_O) × 64 × 128 × 64 × N_T fully-connected neural network.
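A minimal numpy sketch of the forward pass of such a network is shown below, with randomly initialized (untrained) weights standing in for learned parameters. The scenario sizes and the ReLU hidden activations are assumptions for illustration; only the layer widths and the final Softmax follow the description above.

```python
import numpy as np

# Hypothetical scenario sizes (for illustration only):
N_U, N_T, N_O = 5, 5, 20
IN_DIM = 7 + 4 * (N_U - 1) + 4 * N_T + 7 * N_O  # per-UAV state dimension

rng = np.random.default_rng(0)
# Random weights stand in for trained parameters of the four FC layers.
sizes = [IN_DIM, 64, 128, 64, N_T]
layers = [(rng.standard_normal((a, b)) * 0.1, np.zeros(b))
          for a, b in zip(sizes[:-1], sizes[1:])]

def assign_probs(s):
    """Map a state vector s_i to per-target probabilities via Softmax."""
    h = s
    for i, (w, b) in enumerate(layers):
        h = h @ w + b
        if i < len(layers) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers (assumed)
    e = np.exp(h - h.max())         # numerically stable Softmax
    return e / e.sum()

probs = assign_probs(rng.standard_normal(IN_DIM))
```

For the sizes above, IN_DIM = 7 + 16 + 20 + 140 = 183, and the output is a length-N_T probability vector summing to one, from which UAV i's target can be selected.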

FIGURE 9 Convergence curves. (A) The average arrival rate of DDPG and TD3 in scenario I. (B) The average arrival rate of DDPG and TD3 in scenario II. (C) The average target completion rate of TANet-DDPG and TANet-TD3 in the dynamic environment. (D) The average reward of TANet-DDPG and TANet-TD3 in the dynamic environment. (E) The average target completion rate of TANet-DDPG and TANet-TD3 in the mixed environment. (F) The average reward of TANet-DDPG and TANet-TD3 in the mixed environment. The solid line denotes the statistical means and the % confidence interval of the means is shown shaded.

FIGURE 10 The test scenarios. (A) The 3D scenario of the dynamic environment. (B) The dynamic environment from X-Y view. (C) The dynamic environment from Y-Z view. (D) The dynamic environment from X-Z view. (E) The 3D scenario of the mixed environment. (F) The mixed environment from X-Y view. (G) The mixed environment from Y-Z view. (H) The mixed environment from X-Z view.
FIGURE The 3D trajectories and corresponding 2D three views of UAVs driven by TANet-DDPG at different times in a dynamic environment. (A) The 3D trajectories at t = s. (B-D) The corresponding 2D three views of (A). (E) The 3D trajectories at t = s. (F-H) The corresponding 2D three views of (E). (I) The 3D trajectories at t = s. (J-L) The corresponding 2D three views of (I).

FIGURE The 3D trajectories and corresponding 2D three views of UAVs driven by TANet-DDPG at different times in a mixed environment. (A) The 3D trajectories at t = s. (B-D) The corresponding 2D three views of (A). (E) The 3D trajectories at t = s. (F-H) The corresponding 2D three views of (E). (I) The 3D trajectories at t = s. (J-L) The corresponding 2D three views of (I).

FIGURE The 3D trajectories and corresponding 2D three views of UAVs driven by TANet-TD3 at different times in a mixed environment. (A) The 3D trajectories at t = s. (B-D) The corresponding 2D three views of (A). (E) The 3D trajectories at t = s. (F-H) The corresponding 2D three views of (E). (I) The 3D trajectories at t = s. (J-L) The corresponding 2D three views of (I).

FIGURE 15 The comparison results of the average target completion rate of TANet-DDPG and TANet-TD3. (A) The comparison result under different numbers of UAVs in the dynamic environment. (B) The comparison result under different numbers of UAVs in the mixed environment. (C) The comparison result under different numbers of obstacles in the dynamic environment. (D) The comparison result under different numbers of obstacles in the mixed environment.
U_i, i = 1, 2, . . ., N_U denotes the UAVs and O_k, k = 1, 2, . . ., M denotes the obstacles. U_i^t and O_k^t represent the positions of UAV i and obstacle k at time t, respectively. d_i is the flight length of UAV i; r_u and r_o are the radii of the UAVs and obstacles. Equation (2) denotes the complete target assignment constraint, which means each target is assigned to exactly one UAV. Equation (3) defines the collision-free constraint, whose first inequality means any two UAVs cannot collide.
The target assignment network has (7 + 4(N_U − 1) + 4N_T + 7N_O) input dimensions, corresponding to the state information s_i = (s_ui, o_i) of each UAV in a scenario with N_U UAVs, N_T targets, and N_O obstacles within the detection range. For UAV i, after four fully-connected layers, the network maps the state information s_i = (s_ui, o_i) to the probabilities (P_iT1, P_iT2, . . ., P_iTNT) of UAV i flying to targets (T_1, T_2, . . ., T_NT). The probabilities are normalized by the Softmax function [Equation (15)].
TABLE The hyperparameters of TANet-TD3.
1: Initialize critic networks Q_θi,1, Q_θi,2 and actor network π_φi with random parameters θ_i,1, θ_i,2, φ_i for each UAV i;
2: Initialize target networks for each UAV i: θ'_i,1 ← θ_i,1, θ'_i,2 ← θ_i,2, φ'_i ← φ_i;
The training results are listed in Table 2. Figures 9C, D present that TANet-TD3 has the fastest convergence rate in the dynamic environment.
TABLE The training results of TANet-DDPG and TANet-TD3.
TABLE The test statistical results of TANet-DDPG and TANet-TD3.