MW-MADDPG: a meta-learning based decision-making method for collaborative UAV swarm

Unmanned Aerial Vehicles (UAVs) have gained popularity due to their low lifecycle cost and minimal human risk, resulting in their widespread use in recent years. In the UAV swarm cooperative decision domain, multi-agent deep reinforcement learning has significant potential. However, current approaches are challenged by the multivariate mission environment and mission time constraints. In light of this, the present study proposes a meta-learning based multi-agent deep reinforcement learning approach that provides a viable solution to this problem. This paper presents an improved MAML-based multi-agent deep deterministic policy gradient (MADDPG) algorithm that achieves an unbiased initialization network by automatically assigning weights to meta-learning trajectories. In addition, a Reward-TD prioritized experience replay technique is introduced, which takes into account immediate reward and TD-error to improve the resilience and sample utilization of the algorithm. Experiment results show that the proposed approach effectively accomplishes the task in the new scenario, with significantly improved task success rate, average reward, and robustness compared to existing methods.


1. Introduction
As reusable vehicles, Unmanned Aerial Vehicles (UAVs) do not need to be piloted. Instead, they are capable of accomplishing the given tasks by remote control or autonomous control (Silveira et al., 2020; Yao et al., 2021). This has received much attention from industry in recent years. UAVs have several advantages, including low life-cycle cost (Lei et al., 2021), low personnel risk (Rodriguez-Fernandez et al., 2017), long duration of flight (Ge et al., 2022; Pasha et al., 2022), and maneuverability, size, and speed (Poudel and Moh, 2022). UAVs are increasingly being used in fields such as target tracking (Hu et al., 2023), agriculture (Liu et al., 2022b), rescue (Jin et al., 2023), and transportation (Li et al., 2021) for "Dull, Dirty, Dangerous, and Deep" (4D) missions (Aleksander, 2018; Chamola et al., 2021). The applications of UAVs are illustrated in Figure 1. During a mission, UAVs typically operate in swarms to accomplish their objectives. Consequently, the cooperative control and decision-making methods used by UAV swarms have become increasingly critical. Effective collaborative decision-making techniques can enhance the efficiency and effectiveness of mission accomplishment. However, current cooperative decision-making methods, including non-learning methods and traditional heuristics for UAVs, have limited capacity to manage conflicts between multiple aircraft and to balance adaptation to variable mission environments against mission time constraints. Therefore, this area has received significant attention from researchers seeking to develop more robust and versatile methods for UAV cooperative decision-making.
At present, methods for cooperative control and decision-making of UAV swarms are typically classified into two main categories: top-down and bottom-up (Giles and Giammarco, 2019). Top-down approaches are primarily utilized for centralized collaborative control and decision-making, while bottom-up approaches are mainly applied to distributed collaborative decision-making and control (Wang et al., 2022).
The main advantage of the top-down approach is its ability to decompose complex tasks into smaller, more manageable components. In the context of UAV swarm collaborative decision-making, this approach can be used to break down the task into a task assignment problem, a trajectory planning problem, and a swarm control problem (Tang et al., 2023). For example, Zhang et al. (2022) proposed a method for assigning search and rescue tasks to a combination of helicopters and UAVs. They analyzed the search and rescue level of each point and the hovering endurance of the UAV using principal component analysis and cluster analysis. They then constructed a multi-objective optimization model and solved it using the non-dominated sorting genetic algorithm-II to assign tasks to the UAVs. Liu et al. (2021) utilized the "Divide and Conquer" approach to create a hierarchical task scheduling framework that decomposed the UAV scheduling problem into several subproblems. They proposed a tabu-list-based simulated annealing (SATL) algorithm for task assignment and a variable neighborhood descent (VND) algorithm for generating the scheduling scheme. In another study, Liu et al. (2022a) proposed a particle swarm optimization algorithm for cluster scheduling of UAVs performing remote sensing tasks in emergency scenarios. While centralized decision-making methods have better global reach and simpler structures, their communication and computational costs increase significantly with the number of UAVs in the swarm. Therefore, there is a need to develop distributed cooperative decision-making methods for UAV swarms.
The bottom-up approach facilitates cooperative decision-making of UAV swarms through the observation, judgment, decision-making, and distributed negotiation of individual UAVs. This approach aligns well with the observe-orient-decide-act (OODA) theory and is particularly suited for distributed decision-making scenarios (Puente-Castro et al., 2022), which are increasingly becoming the future trend (Ouyang et al., 2023). Wang and Zhang (2022) proposed a UAV cluster task allocation method based on the bionic wolf pack approach, which decomposes task allocation into three processes: task assignment, path planning, and coverage search. The UAV swarm is modeled according to the characteristics of a wolf pack, and distributed collaborative decision-making is achieved through information sharing within the UAV swarm. Yang et al. (2022) presented a distributed task reallocation method for dynamic environments where tasks need to be reassigned among a UAV swarm. They proposed a distributed decision framework based on time-type processing policies and used a partial reassignment algorithm (PRA) to generate conflict-free solutions with less data communication and faster execution. Wei et al. (2021) introduced a distributed UAV cluster computational offloading method that leverages distributed Q-learning and proposes a cooperative exploration-based, prioritized experience replay method using distributed deep reinforcement learning techniques. This approach achieves distributed computational offloading and outperforms traditional methods in terms of average processing time, energy-task efficiency, and convergence rate (Ouyang et al., 2023).
In recent years, deep reinforcement learning has shown promising results in various fields, such as training championship-level racers in Gran Turismo (Wurman et al., 2022), achieving an all-time top-three Stratego game ranking (Perolat et al., 2022), and optimizing matrix multiplication operations (Fawzi et al., 2022). However, when addressing the challenge of cooperative decision-making in UAV swarms, reinforcement learning suffers from weak generalization ability, low sample utilization, and slow learning speed (Beck et al., 2023). To address these challenges, researchers have turned to meta-reinforcement learning, which is currently a hot topic in machine learning.
Meta-learning, also referred to as learning to learn, is a technique that involves training on related tasks to learn meta-knowledge, which can then be applied to a new environment. This approach reduces the number of samples required and increases the training speed in the new environment (Hospedales et al., 2022). Researchers have proposed meta-reinforcement learning methods by combining meta-learning with reinforcement learning techniques. Meta-reinforcement learning enhances generalization ability and learning efficiency by utilizing the acquired meta-knowledge to guide the subsequent training process and achieve cross-task learning with limited samples (Beck et al., 2023). Despite its successful implementation in various fields (Chen et al., 2022; Jiang et al., 2022; Zhao et al., 2023), meta-reinforcement learning has not yet been widely adopted in the field of cooperative decision-making for heterogeneous UAV swarms.
The experience replay mechanism is a critical technique in deep reinforcement learning, first proposed in the deep Q-network model (Mnih et al., 2015). It improves data utilization, increases policy stability, and breaks correlations between states in the training data. To measure the priority of experience, Hou et al. (2017) proposed a method that uses the Temporal-Difference (TD) error, which improves the convergence speed of the algorithm. Pan et al. (2022) proposed a TD-error- and time-based experience sampling method to reduce the influence of outdated experience. Li et al. (2022) introduced a clustering experience replay (CER) method that clusters and replays transitions using a divide-and-conquer framework based on time division, effectively exploiting the experience hidden in all explored transitions in the current training. However, prioritized experience replay algorithms that consider only the TD-error tend to ignore the role of immediate rewards and of experiences with small TD-errors, and the learning effectiveness of the algorithm is susceptible to the detrimental effects of TD-error outliers.
In this paper, we propose an improved MAML-based MADDPG algorithm to enhance the generalization capability, learning rate, and robustness of deep reinforcement learning methods used in collaborative decision-making for heterogeneous UAV swarms. The proposed algorithm incorporates a Reward-TD prioritized experience replay mechanism and a buffer experience forgetting mechanism to improve the overall performance of the system. Firstly, the paper describes the problem of cooperative attack on ground targets by UAV swarms, models the UAV motion, and formulates the cooperative decision-making problem as a POMDP model. Next, inspired by the Meta Weight Learning algorithm (Xu et al., 2021), the paper proposes an improved meta-weight multi-agent deep deterministic policy gradient (MW-MADDPG) algorithm that obtains an unbiased initialization model by setting replay weights for trajectories and updates the meta-weights by gradient and momentum. To increase the effectiveness of the experience replay mechanism, the paper proposes a Reward-TD prioritized experience replay method with a forgetting mechanism. Finally, experiments are conducted to verify the generalization, robustness, and learning rate of the proposed approach. The main contributions of this paper include:
1. Proposing the meta-weight multi-agent deep deterministic policy gradient (MW-MADDPG) algorithm for UAV swarm collaborative decision-making, which achieves end-to-end learning across tasks and can be applied to new scenarios quickly and stably after training.
2. Introducing the Reward-TD prioritized experience replay method to improve the convergence speed and the utilization of experiences in the MW-MADDPG algorithm. The proposed method determines the priority of experience replay based on immediate reward and TD-error, thereby enhancing the quality of experience replay.
3. Employing a forgetting mechanism in the proposed MW-MADDPG algorithm to improve algorithm robustness and reduce overfitting. A threshold on sampling times is set to reduce the repetition of a small number of experiences during the experience replay process.

2.1. Reinforcement learning
Reinforcement learning is a trial-and-error technique for continuous learning, where an agent interacts with its external environment. The objective of the agent is to obtain the maximum cumulative reward from the external environment. Typically, reinforcement learning models the problem as a Markov decision process (MDP) or a partially observable Markov decision process (POMDP), which allows the agent to make decisions based on current states and future rewards, without requiring knowledge of the full environment model. Through repeated interactions with the environment, the agent learns through experience to select actions that lead to higher cumulative rewards, thereby improving its performance over time. An MDP is usually represented by the tuple M = ⟨S, A, T, R, γ⟩, where S = {s_1, s_2, ..., s_n} is the set of all possible states; A = {a_1, a_2, ..., a_m} is the set of all possible actions; and γ ∈ [0, 1] is the discount factor, which indicates the degree of influence of future rewards on the current behavior of the agent: γ = 1 indicates that a future reward has the same effect as the current reward, while γ = 0 indicates that future rewards do not affect the agent's current action. In the reinforcement learning process, at each time step t, the agent is in state s_t, observes the environment, takes action a_t, receives feedback R_t from the environment, and moves to the next state s_{t+1}. In an MDP, a state is called a Markov state when it satisfies P(s_{t+1} | s_t) = P(s_{t+1} | s_1, ..., s_t); the property that the next state depends only on the current state and not on past states is known as the Markov property. The state transition matrix P (also known as the state transition probability matrix) specifies the probability of transitioning from the current state s to the subsequent state s′. Specifically, each element P_{ss′} = P(s_{t+1} = s′ | s_t = s) represents the probability of transitioning from state s to state s′ under a given action.
The return G_t, also called the cumulative reward, is the sum of all discounted rewards from time step t to the end of the episode: G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}. The reward function gives the expected reward after the agent takes action a in state s: R(s, a) = E[R_{t+1} | s_t = s, a_t = a].
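The discounted return defined above can be computed recursively from the end of an episode. The following is a minimal illustrative sketch, not code from the paper:

```python
def discounted_return(rewards, gamma):
    """Discounted return G = sum_k gamma^k * r_k, accumulated backwards.

    rewards: list of per-step rewards in chronological order.
    gamma:   discount factor in [0, 1].
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = r_t + gamma * G_{t+1}
    return g

# gamma = 1: all future rewards count fully; gamma = 0: only the first reward matters
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5*1 + 0.25*1 = 1.75
```

Accumulating backwards avoids recomputing powers of γ for every step.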

2.2. Multi-agent reinforcement learning
In a multi-agent system, each agent has a limited observation range and can only obtain local information, making it challenging to observe the global environment. This problem is modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) defined by the tuple M = ⟨N, S, A, P, R, O, γ⟩, where N represents the set of agents, S the set of states, A the set of joint actions, P the state transition function, R the reward function, O the set of observations, and γ the discount factor. In a Dec-POMDP, all agents select actions based on their own observations O_i in the state s_t, leading to a transition to the next state s_{t+1} and an environmental reward value r_i. The goal of each agent is to maximize the cumulative reward G = Σ_{t=0}^{T} γ^t r_t^i. This paper employs the classical MARL algorithm MADDPG, with further details provided in Section 4.1.

2.3. Meta-learning
Meta-learning, also known as learning to learn, is a recent research direction aimed at training an initial model that can quickly adapt to new tasks with less data. Meta-learning comprises three phases: meta-training, meta-validation, and meta-testing. In the meta-training phase, a neural network uses support-set data to train on a set of tasks and learn general knowledge for these tasks. In the meta-validation phase, the neural network uses query-set data to verify model generalization and adjust the hyperparameters used in meta-learning. Finally, in the meta-testing stage, the model is tested on new tasks to evaluate its training effect. The meta-learning paradigm is depicted in Figure 2. The formal definition of meta-reinforcement learning is presented below. A reinforcement learning task is defined as T = (L_T, P_T(s), P_T(s_{t+1} | s_t, a_t), H), where L_T represents the loss function that maps a given trajectory τ = (s_0, a_1, s_1, r_1, ..., a_H, s_H, r_H) to a loss value, P_T(s) denotes the initial state distribution, P_T(s_{t+1} | s_t, a_t) refers to the state transition probability distribution, and H corresponds to the trajectory length.
This paper builds on Model-Agnostic Meta-Learning (MAML), a model-independent general meta-learning algorithm that can be applied to any model trained using gradient descent. MAML is adapted to deep neural network models through the use of meta-gradient updates and can be used with various neural network architectures, such as convolutional, fully connected, and recurrent neural networks. Additionally, it can be applied to different types of machine learning problems, such as regression, classification, clustering, and reinforcement learning.
The main idea of Model-Agnostic Meta-Learning (MAML) is to obtain an initial model that can be applied to a range of tasks and requires only a small amount of task-specific training to achieve good performance. Specifically, for each task T, the agent interacts with the environment through the policy π_θ, collecting K trajectories τ_θ^{1:K}, and performs the inner-loop update θ′_T = θ − α ∇_θ L_T(τ_θ^{1:K}), where α is the inner-loop learning rate. (Figure 2 shows a schematic diagram of the meta-learning process: meta-learning facilitates rapid adaptation to new tasks by leveraging knowledge acquired from previous tasks.)
Here, L_T(τ_θ^{1:K}) is the average loss over the K trajectories, where τ_θ^k ∼ P_T(τ | θ). The loss function L_T(τ_θ) for each trajectory τ_θ is the negative expected return, L_T(τ_θ) = −E_{τ_θ}[Σ_{t=1}^{H} r_t]. The meta-update then optimizes the initialization across tasks, θ ← θ − β ∇_θ Σ_{T∼P(T)} L_T(τ_{θ′_T}), where β is the meta-learning rate.
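As a minimal sketch of the MAML inner/outer loop (not the paper's implementation), consider toy one-dimensional quadratic "tasks" L_T(θ) = (θ − c_T)², where the task constants c_T, the learning rates, and the loss are all illustrative assumptions:

```python
# MAML sketch: each task T has loss L_T(theta) = (theta - c_T)^2.
# Inner step adapts theta per task; outer (meta) step moves the shared
# initialization using the gradient of the post-adaptation loss.

def grad(c, theta):
    """dL/dtheta for L = (theta - c)^2."""
    return 2.0 * (theta - c)

def maml_step(theta, tasks, alpha=0.1, beta=0.05):
    meta_grad = 0.0
    for c in tasks:
        theta_adapted = theta - alpha * grad(c, theta)   # inner-loop update
        # Chain rule: d/dtheta L(theta') = L'(theta') * d(theta')/d(theta),
        # and here d(theta')/d(theta) = 1 - 2*alpha for the quadratic loss.
        meta_grad += grad(c, theta_adapted) * (1.0 - 2.0 * alpha)
    return theta - beta * meta_grad / len(tasks)         # meta update

theta = 0.0
for _ in range(200):
    theta = maml_step(theta, tasks=[-1.0, 3.0])
# The initialization drifts toward the point equidistant from both task optima.
```

After training, a single inner step from θ moves close to either task's optimum, which is exactly the "fast adaptation" property MAML seeks.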
3. Problem formulation

3.1. Task description
The objective of the UAVs in this paper is to destroy the opponent's (blue side's) strategic key location while ensuring the survival of our own side as much as possible. The blue side's strategic location is protected by surface-to-air missiles (SAMs), which have a longer detection and attack range than our UAVs. Thus, it is imperative for the red UAVs to behave cooperatively to achieve the mission objective, which may involve the strategic "sacrifice" of detection UAVs to locate SAM positions when necessary, while minimizing the loss of attack UAVs. The strategy the neural network generates through learning depends on the adversary's strategy during training. Typically, the opponent's strategies are formulated by humans, which limits the ability of the training samples to cover the full range of situations. To circumvent this issue, this work incorporates a large number of random variables into the SAM strategy modeling, such as randomization of the firing timing, firing number, and firing units. These variations produce a dynamic battlefield environment in each confrontation, posing a challenge for the neural network. Although we know the location of the blue side's strategic key location beforehand, we do not know the locations of its SAMs, which can vary from mission to mission. Therefore, the red-side UAV algorithm needs fast adaptation capability. Figure 3 shows the experimental environment.

Blue side:
• Strategic key location: command post, airport;
• SAM: three sets, each called a fire unit, with an attack range of 35 km and a guidance radar detection range of 40 km.

3.2. Winning rules
Red side:
• Victory condition: the command post is destroyed;
• Failure condition: the command post is not destroyed by the end of the game.
Blue side:
• Victory condition: the command post is not destroyed by the end of the game;
• Failure condition: the command post is destroyed.

3.2.1. Battlefield environment settings
• The red side is unable to detect the position of the blue side's SAMs until the guidance radar of the blue side's fire unit is activated;
• The information collected by the red detection UAVs regarding fire units is automatically synchronized and shared with the other red UAVs;
• In each game, the position of each fire unit remains unchanged;
• The guidance radar of a fire unit must be activated before it can launch missiles;
• Once the guidance radar of a fire unit is turned on, it cannot be turned off again;
• If the guidance radar of a fire unit is destroyed, the fire unit becomes inoperable and unable to launch missiles;
• The guidance radar must remain active during the guidance procedure;
• If the guidance radar of a fire unit is destroyed, any missiles launched by that unit immediately self-destruct;
• The ARM and ATG have a shooting range of 30 km and an 80% hit rate;
• In the kill zone, the ARM and ATG have a high kill probability of 75% and a low kill probability of 55%.

3.3. UAV kinematic model
Typically, the flight control of UAVs involves six degrees of freedom, including heading, pitch, and roll. However, this paper focuses on the application of deep reinforcement learning methods to multi-UAV cooperative mission planning while taking into account the maneuvering performance of UAVs, which generally do not perform large-angle maneuvers or drastic changes in acceleration. Therefore, we establish a simplified planar UAV motion model:

ẋ_i = v_i cos φ_i,  ẏ_i = v_i sin φ_i,  φ̇_i = ω_i,  v̇_i = u_i,

where (x_i, y_i) denotes the position of UAV i, φ_i and v_i denote the heading angle and velocity of UAV i, and ω_i and u_i denote the angular velocity and acceleration of UAV i.
The UAV motion model is subject to the following motion constraints: v_min ≤ v_i ≤ v_max, |ω_i| ≤ ω_max, and |u_i| ≤ u_max.
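Under the kinematics and constraints above, one Euler integration step can be sketched as follows. The limit values and time step are illustrative assumptions, not the paper's settings:

```python
import math

def uav_step(state, omega, u, dt=1.0,
             v_min=50.0, v_max=200.0, omega_max=0.3):
    """One Euler step of the simplified planar UAV model.

    state = (x, y, phi, v); omega = commanded angular velocity [rad/s];
    u = commanded acceleration [m/s^2]. Limits are illustrative placeholders.
    """
    x, y, phi, v = state
    omega = max(-omega_max, min(omega_max, omega))  # turn-rate constraint
    x += v * math.cos(phi) * dt                     # x' = v cos(phi)
    y += v * math.sin(phi) * dt                     # y' = v sin(phi)
    phi = (phi + omega * dt) % (2.0 * math.pi)      # phi' = omega
    v = max(v_min, min(v_max, v + u * dt))          # v' = u, speed-limited
    return (x, y, phi, v)

# Straight, level flight heading east at 100 m/s for one second:
s = uav_step((0.0, 0.0, 0.0, 100.0), omega=0.0, u=0.0)
```

Clamping the commands inside the step function keeps the constraints enforced regardless of what the policy network outputs.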

3.4. POMDP model
This section models the decision problem for the UAVs as a POMDP and defines the observation space, action space, and reward function.

3.4.1. Observation space
In this paper, the state space for the UAV decision-making process includes the necessary information for the UAVs. For UAV i, the observation space consists of its own state together with the information c_ij of friendly UAVs and the information f_ik of fire units within its observation range. Here, c_ij = (x_j, y_j, φ_j, v_j, a_j^{t−1}) represents the information obtained by UAV i from UAV j within its observation range. The action of UAV j at the previous moment is denoted by a_j^{t−1} = (ω_j(t−1), u_j(t−1), M_j(t−1)), where M_j(t−1) represents the missile-firing action taken by UAV j. Additionally, f_ik represents the information of fire unit k within UAV i's observation range, where R_k^{t−1} denotes the state of the radar of fire unit k at the previous moment and M_k^{t−1} denotes the missile-firing action taken by fire unit k at the previous moment.
Let the set of all UAVs be defined as D = {UAV_1, ..., UAV_i, ..., UAV_n}, where UAV_i represents the UAV numbered i and n is the total number of UAVs. Similarly, let the set of all fire units be defined as F = {F_1, ..., F_k, ..., F_h}, where F_k denotes the fire unit numbered k and h is the total number of fire units.

3.4.2. Action space
The action space in this paper includes angular velocity, acceleration, launch missile, and radar state.The specific action space is defined as shown in Table 1.

3.4.3. Reward function
The reward design must account for the large number of units on both the red and blue sides, which results in a large state and action space. Providing a single reward value at the end of each battle round may result in sparse rewards and make it difficult for agents to explore winning states independently. Therefore, it is essential to create a well-designed reward function that can guide the agent's learning process effectively.
The approach is to assign a reward value for each type of unit on both the red and blue sides, such that the loss or victory of a unit during the battle triggers an appropriate bonus value (negative for losses suffered by our side, positive for those suffered by the opposing side).Additionally, to encourage the UAV to approach the fire unit, a reward is provided when the UAV moves closer to the target.
Providing rewards solely based on wins and losses can result in long training times and sparse rewards, particularly due to the duration of each round.To expedite the training process and enhance the quality of feedback provided during training, additional reward types such as episodic rewards, key event-driven rewards, and distance-based rewards are incorporated.The detailed reward design is presented in Table 2.
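As an illustration of combining event-driven and distance-based shaping, the following sketch uses placeholder event names and reward values; the paper's actual values are those in Table 2:

```python
def shaped_reward(event, dist_prev=None, dist_now=None):
    """Illustrative event-driven + distance-based reward.

    event: a hypothetical event label; values below are placeholders,
    not the paper's Table 2 entries.
    dist_prev/dist_now: optional distances to the target before/after
    the step, used for the approach-the-target shaping term.
    """
    table = {
        "enemy_sam_destroyed": +10.0,     # key event: opposing unit lost
        "command_post_destroyed": +100.0, # episodic victory reward
        "own_uav_lost": -10.0,            # negative reward for own losses
        "step": -0.01,                    # small per-step time penalty
    }
    r = table.get(event, 0.0)
    if dist_prev is not None and dist_now is not None:
        r += 0.001 * (dist_prev - dist_now)  # reward for closing on target
    return r
```

Summing a sparse win/loss reward with dense event and distance terms is what keeps the gradient signal alive over long engagements.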

4. Method

4.1. MADDPG-based collaborative decision-making method
Traditional single-agent reinforcement learning algorithms face challenges when dealing with collaborative multi-UAV tasks, such as large action spaces and unstable environments. In a multi-agent system, the increase in the number of agents leads to a larger state and action space. In addition, each agent's actions dynamically affect the environment in a way that does not exist in a static environment. For these reasons, traditional single-agent reinforcement learning algorithms are ineffective in a multi-agent environment. To address this problem, this paper employs the MADDPG algorithm in the framework of centralized training and decentralized execution. This approach alleviates the difficulties associated with fully centralized or fully decentralized algorithms by striking a balance between the two. In contrast to traditional DRL algorithms, the MADDPG algorithm can leverage global information during training while utilizing only local information for decision-making. The method is as follows. Suppose there are M agents in the multi-agent system, with a set of strategy networks denoted as μ = (μ_1, μ_2, ..., μ_M), where μ_i represents the strategy network of the i-th agent, and a set of value networks denoted as q = (q_1, q_2, ..., q_M), where q_i represents the value network of the i-th agent. The parameter set for the strategy networks is denoted as θ = (θ_1, ..., θ_M), where θ_i represents the strategy parameters of the i-th agent; similarly, the parameter set for the value networks is denoted as ω = (ω_1, ..., ω_M), where ω_i represents the value network parameters of the i-th agent. The objective function for the i-th agent is J(θ_i) = E[Σ_{t=0}^{T} γ^t r_t^i]. For the deterministic strategy μ_i, the strategy gradient can be expressed as ∇_{θ_i} J(θ_i) = E_{s∼D}[∇_{θ_i} μ_i(o_i; θ_i) ∇_{a_i} q_i(s, a_1, ..., a_M; ω_i) |_{a_i = μ_i(o_i; θ_i)}], where ∇ represents the gradient operator.
A state s_t is sampled from the experience pool D, serving as an observation of the random variable s. Each agent's action is obtained from its policy network as a_j = μ_j(o_j; θ_j). The gradient of the objective function is then estimated as g_i = ∇_{θ_i} μ_i(o_i; θ_i) ∇_{a_i} q_i(s_t, a_1, ..., a_M; ω_i) |_{a_i = μ_i(o_i; θ_i)}, and the policy network parameters are updated as θ_i ← θ_i + α_1 g_i, where α_1 represents the Actor learning rate. The value network is updated through the TD algorithm as follows. For the value network q_i(s, a; ω_i) of agent i, given the tuple (s_t, a_t, r_t, s_{t+1}), the actions at the next state are computed according to the target policy networks as a_{t+1}^j = μ_j(o_{t+1}^j; θ_j⁻). The TD target is computed as y_t^i = r_t^i + γ q_i(s_{t+1}, a_{t+1}; ω_i⁻), and the TD-error is calculated as δ_t^i = q_i(s_t, a_t; ω_i) − y_t^i. The value network parameters are then updated using gradient descent w.r.t. ω_i.
The target network parameters for each agent i are updated by soft update: θ_i⁻ ← τ_1 θ_i + (1 − τ_1) θ_i⁻ and ω_i⁻ ← τ_1 ω_i + (1 − τ_1) ω_i⁻, where τ_1 is the soft update parameter.
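The TD target and the soft (Polyak) target update above can be sketched with plain scalars; the parameter values are illustrative:

```python
def td_target(r, q_next, gamma=0.95):
    """TD target y = r + gamma * Q'(s', a'), with Q' from the target critic."""
    return r + gamma * q_next

def soft_update(target_params, online_params, tau=0.01):
    """Polyak averaging: theta_target <- tau*theta + (1 - tau)*theta_target."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]

y = td_target(1.0, q_next=2.0, gamma=0.95)            # 1.0 + 0.95*2.0 = 2.9
new_t = soft_update([0.0, 0.0], [1.0, 1.0], tau=0.1)  # each moves 10% toward 1.0
```

A small τ makes the target networks change slowly, which is what stabilizes the bootstrapped TD target.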

4.2. Improved algorithm for MAML
This paper presents an improvement to the traditional MAML algorithm. The original MAML algorithm employs an average update method during gradient updates for each task in the task distribution. However, this can lead to biased models that perform better on one task than on others. To overcome this issue, we propose an improved MAML method that introduces weights during the gradient update of different trajectories and incorporates an automatic weight calculation method. This approach aims to obtain an unbiased initialized network model.
The traditional MAML method updates the gradients of different trajectories without any distinction during the trajectory update process. This paper proposes a trajectory weighting method that leverages the concept of the Adam algorithm and utilizes gradient and momentum values to set the weights. This approach avoids subjective weight assignment and accelerates the convergence of the objective function to its minimum value.
The objective function for meta-learning in this paper is the weighted sum of the trajectory losses, min_ξ Σ_{k=1}^{K} W_k L_T(τ_k^ξ). Here, to satisfy the normalization condition, W_k = w_k / (w_1 + w_2 + ... + w_K) is the weight of the k-th trajectory, where K is the total number of trajectories.
To obtain the optimal weights w_k* that minimize the objective function, we update the weights w_k by computing the gradient g_t of the objective function w.r.t. the weights. Drawing inspiration from the Adam optimization algorithm, we set the following quantities. First-order momentum: m_t = β_1 m_{t−1} + (1 − β_1) g_t; second-order momentum: v_t = β_2 v_{t−1} + (1 − β_2) g_t². With the bias-corrected estimates m̂_t = m_t / (1 − β_1^t) and v̂_t = v_t / (1 − β_2^t), the updated weight for the next time step is w_{t+1} = w_t − α m̂_t / (√v̂_t + ε),
where β_1 and β_2 are the exponential decay rates for the moment estimates, ε = 10^{−8} is the fuzz factor, and α is the weight learning rate.
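A sketch of the Adam-style weight update described above, operating on scalar per-trajectory weights. The gradient values and hyperparameters are illustrative, and the final normalization assumes the weights stay positive; this is not the paper's exact code:

```python
def update_weights(w, grads, m, v, t, alpha=0.01,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style update of per-trajectory meta-weights w_k (a sketch).

    w, grads, m, v: per-trajectory weights, gradients, and momenta.
    t: 1-based update step, used for bias correction.
    Returns updated (w, m, v) plus the normalized weights W_k.
    """
    new_w, new_m, new_v = [], [], []
    for wk, gk, mk, vk in zip(w, grads, m, v):
        mk = beta1 * mk + (1 - beta1) * gk        # first-order momentum
        vk = beta2 * vk + (1 - beta2) * gk * gk   # second-order momentum
        m_hat = mk / (1 - beta1 ** t)             # bias-corrected estimates
        v_hat = vk / (1 - beta2 ** t)
        wk = wk - alpha * m_hat / (v_hat ** 0.5 + eps)
        new_w.append(wk); new_m.append(mk); new_v.append(vk)
    total = sum(new_w)                            # assumes weights > 0
    W = [wk / total for wk in new_w]              # normalized weights W_k
    return new_w, new_m, new_v, W

# One step with two trajectories and illustrative gradients:
w, m, v, W = update_weights([1.0, 1.0], [0.5, -0.5], [0.0, 0.0], [0.0, 0.0], t=1)
```

The trajectory whose loss gradient is negative gains weight relative to the other, which is the intended rebalancing behavior.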

Meta update: ξ ← ξ − β ∇_ξ Σ_{k=1}^{K} W_k L_T(τ_k^{ξ′_k}), where β is the meta-learning rate. The proposed improved MAML algorithm is presented in Algorithm 1.
Input: Weight learning rate α, meta-learning rate β, and exponential decay rates β_1, β_2;
Input: The distribution over tasks P_T(s);
1: Randomly initialize the model parameters ξ
2: for each iteration do
3:   Sample a batch of tasks from P_T
4:   for each task do
5:     Collect K trajectories and compute the task loss
6:     Optimize ξ with gradient descent
7:   end for
8:   Compute the gradient of the objective function w.r.t. the weights w_k
9:   Compute the first-order and second-order momentum
10:  Compute the bias-corrected first- and second-moment estimates
11:  Update the model weights for each trajectory
12: end for
Algorithm 1. The improved MAML algorithm.

4.3. Improved prioritized experience replay mechanism

4.3.1. Prioritized experience replay method based on immediate rewards and TD-error

Experience replay methods typically prioritize replay based on the size of the TD-error to enhance neural network convergence speed and experience utilization. In this approach, the sampling probability is proportional to the absolute value of the TD-error, without considering the quality of the experience in supporting task performance. To address this limitation, this paper proposes an experience replay method based on reward and TD-error that includes the immediate rewards of actions in the prioritization process.
By considering the immediate reward as well as the TD-error, this improved approach can more accurately prioritize experiences that contribute most effectively to task completion.
The TD-error-based priority of experience i is defined as P_T(i) = |δ_i| + ε, where δ_i is the TD-error of experience i and ε is a small constant that ensures the priority value is not zero.
The priority based on immediate rewards is given by the immediate reward of the transition, P_r(i) = r_i. By sorting these priorities by size, we obtain the ranks rank_r(i) and rank_T(i). The combined ranking takes both priorities into account and is computed as rank_C(i) = ρ · rank_r(i) + (1 − ρ) · rank_T(i). Here, ρ denotes the coefficient of importance of the experience, which regulates the relative significance of the two priorities under consideration. When ρ = 0, only the TD-error is considered, while when ρ = 1, only the immediate reward is considered.
The combined priority of an experience is given as p_i = 1 / rank_C(i). The experience sampling probability is obtained by normalizing this combined priority with respect to all experiences in the replay buffer: P(i) = p_i^η / Σ_k p_k^η, where η is the priority importance parameter that determines the degree of consideration given to priority; when η = 0, experiences are sampled uniformly. This probability is used to sample experiences from the replay buffer during the learning process: experiences with higher combined priorities are more likely to be sampled.
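A sketch of the rank-based Reward-TD prioritization described above. The exact way the two ranks are blended is an assumption consistent with the text (ρ = 0 uses only the TD-error, ρ = 1 only the immediate reward):

```python
def sampling_probs(rewards, td_errors, rho=0.5, eta=0.6):
    """Reward-TD prioritization sketch.

    Rank experiences by |TD-error| and by immediate reward (best rank = 1),
    blend the ranks with rho, and turn the combined rank into sampling
    probabilities with exponent eta (eta = 0 gives uniform sampling).
    """
    n = len(rewards)
    order_td = sorted(range(n), key=lambda i: -abs(td_errors[i]))
    order_r = sorted(range(n), key=lambda i: -rewards[i])
    rank_td = [0] * n
    rank_r = [0] * n
    for pos, i in enumerate(order_td):
        rank_td[i] = pos + 1
    for pos, i in enumerate(order_r):
        rank_r[i] = pos + 1
    combined = [rho * rank_r[i] + (1 - rho) * rank_td[i] for i in range(n)]
    p = [(1.0 / c) ** eta for c in combined]   # rank-based priority
    z = sum(p)
    return [pi / z for pi in p]

# Experience 0 has the largest TD-error and the second-best reward:
probs = sampling_probs([1.0, 0.0, 5.0], [2.0, 0.2, 0.1])
```

Ranking before combining makes the scheme robust to TD-error outliers, since only order matters, not magnitude.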

4.3.2. Forgetting mechanism
The immediate reward and TD-error are used to evaluate the learning value of experiences in the replay buffer, but excessive sampling of high-priority experiences can lead to overfitting. To alleviate this issue, this paper introduces a forgetting mechanism.
The forgetting mechanism introduced in this paper includes setting a sampling threshold ψ.When the number of times an experience has been sampled, denoted as m i , exceeds this threshold, its sampling probability is set to zero.This helps prevent overfitting by reducing the impact of experiences that have been repeatedly sampled.
The updated sampling probability of experience i after being processed by the forgetting mechanism is denoted as p_i′ and is given by:

p_i′ = p_i, if m_i ≤ ψ;  p_i′ = 0, if m_i > ψ.

Input: Action noise N_t, discount factor γ, constant ε, coefficient of importance ρ, priority importance parameter η, sampling threshold ψ, actor learning rate α_1, and soft update parameter τ_1;
1: Initialize strategy networks μ = (μ_1, μ_2, ..., μ_M), value networks q = (q_1, q_2, ..., q_M), and replay buffer D
2: for each episode do
3:   Observe initial state s_1
4:   for t = 1 to max-episode-length do
5:     Select actions, execute them, observe the rewards and next state, and store the transition in D
6:     Compute the target y_t^i = r_t^i + γ q_i(s_{t+1}, a_{t+1}; ω_i⁻)
7:     Compute the TD-error
8:     Compute the ranks rank_T(i) and rank_r(i) based on P_T(i) and P_r(i), respectively
9:     Compute the combined priority P_C(i) and the sampling probabilities p_i′
10:    Sample a minibatch of B samples from D using the probabilities p_i′
11:    Compute the gradient g_θ^i of the policy network of agent i
12:    Update the policy network parameters
13:    Update the value network parameters by minimizing the loss w.r.t. the TD-error
14:    Update the target network parameters for each agent i
15:   end for
16: end for
Algorithm 2. MADDPG with improved prioritized experience replay.
Here, if m_i is less than or equal to the sampling threshold ψ, the sampling probability of experience i remains unchanged (p_i); if m_i is greater than ψ, it is set to zero. When the replay buffer reaches capacity, experiences are removed in order of sampling priority, from smallest to largest, to make room for incoming experiences. This ensures that new experiences can enter the experience pool and contribute to the learning process.
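The two rules above, zeroing the probability of over-sampled experiences and evicting lowest-priority experiences when the buffer is full, can be sketched as follows. The class and its interface are illustrative assumptions, not the paper's implementation:

```python
import random

class ForgettingBuffer:
    """Minimal sketch of a replay buffer with the forgetting mechanism:
    tracks the sample count m_i per experience and the threshold psi."""

    def __init__(self, capacity, psi):
        self.capacity = capacity   # replay buffer size
        self.psi = psi             # sampling threshold
        self.data = []             # entries: [priority, experience, m_i]

    def add(self, experience, priority):
        if len(self.data) >= self.capacity:
            # Buffer full: evict the lowest-priority experience first.
            self.data.remove(min(self.data, key=lambda e: e[0]))
        self.data.append([priority, experience, 0])

    def sample(self):
        # Experiences sampled more than psi times get probability zero.
        weights = [p if m <= self.psi else 0.0 for p, _, m in self.data]
        entry = random.choices(self.data, weights=weights)[0]
        entry[2] += 1              # increment the sample count m_i
        return entry[1]

buf = ForgettingBuffer(capacity=2, psi=5)
buf.add("a", priority=1.0)
buf.add("b", priority=2.0)
buf.add("c", priority=3.0)       # "a" (lowest priority) is evicted
exp = buf.sample()
```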
The MADDPG algorithm with an improved prioritized experience replay mechanism is shown in Algorithm 2.
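The target and TD-error computation at the heart of the value-network update can be illustrated in isolation; the discount value below is illustrative, not the paper's hyperparameter setting:

```python
def td_error(reward, q_next, q_current, gamma=0.95):
    """TD-error for one transition, following the target form
    y = r + gamma * q(s', a') and delta = y - q(s, a)."""
    target = reward + gamma * q_next
    return target - q_current

# Positive delta: the critic underestimated this transition's value.
delta = td_error(reward=1.0, q_next=2.0, q_current=2.5)
```

The absolute TD-error is one of the two signals (alongside the immediate reward) that feed the combined Reward-TD priority.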

. Experiment
. . Experiment setup
To assess the efficacy of the proposed method, the algorithm was validated in two simulation scenarios (depicted in Figure 4 for training and Figure 5 for testing) and compared against the MADDPG algorithm. The simulation scenarios are designed based on the force settings and battlefield environment assumptions described in Section 3. The evaluation focuses on the improved MAML method and the Reward-TD prioritized experience replay method proposed in this paper.
The simulation scenario consists of four red reconnaissance UAVs and three attack UAVs, whose objective is to destroy the opponent's command post. During training, the positions of the red UAVs are fixed at the beginning of each episode, while the positions of the opponent's command post and SAM are varied across the two training scenarios to enable meta-training of the neural network. The training hardware includes an eight-core Intel Xeon E5-4655V4 CPU, 512 GB RAM, and an RTX 3060 GPU with 12 GB of video memory. The proposed method is implemented using a standard fully connected multilayer perceptron (MLP) with ReLU nonlinearities, consisting of three hidden layers. The experimental environment measures 240 km × 240 km, and the hyperparameters used in the experiments are shown in Table 3, with settings adapted from Xu et al. (2021). The meta-training process lasts for 5 × 10^5 episodes to allow sufficient learning and optimization of the neural network.
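The described backbone, a fully connected MLP with three ReLU hidden layers, can be sketched as a plain forward pass. The layer widths and input/output dimensions below are illustrative assumptions; the paper's Table 3 settings are not reproduced here:

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Forward pass of a fully connected MLP with ReLU on every layer
    except the last (the output layer stays linear)."""
    h = np.asarray(x, dtype=float)
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = h @ W + b
        if i < len(weights) - 1:       # ReLU on the three hidden layers only
            h = np.maximum(h, 0.0)
    return h

rng = np.random.default_rng(0)
sizes = [10, 64, 64, 64, 2]            # obs dim -> 3 hidden layers -> action dim
Ws = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(sizes, sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]
action = mlp_forward(rng.standard_normal(10), Ws, bs)
```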

. . Experiment results
This section evaluates the meta-learning and cold-start capability of the proposed MW-MADDPG algorithm in new task environments, as well as its generalization, convergence speed, and robustness compared to existing algorithms. Additionally, the performance of the proposed Reward-TD prioritized experience replay method with the forgetting mechanism is evaluated against conventional methods.

. . . Cross-task performance comparison
The performance of the three algorithms (MW-MADDPG, MAML-MADDPG, and MADDPG) is evaluated using the reward value as the evaluation index across five random seeds in the two scenarios, as shown in Figure 6. The results demonstrate that the MW-MADDPG and MAML-MADDPG algorithms with meta-learning outperform the MADDPG algorithm without meta-learning in both scenarios from the beginning episodes; the average reward of the MW-MADDPG method is −1.39, the best of the three. This indicates that the use of meta-learning allows the agents to adapt to the new environment and alleviates the cold-start problem, showcasing an advantage over existing methods.
Figure 7 depicts the success rate of task execution for the red side. The MW-MADDPG method achieves a success rate of 77.71 and 72.21% in the two scenarios, respectively, significantly higher than the success rate of the other two methods (p < 0.05); these results indicate that the proposed method can effectively improve the performance of the agent under new tasks. Additionally, the variance of the MW-MADDPG method is smaller than that of the MAML-MADDPG method, indicating that the stability of the proposed method is better than that of the traditional meta-learning method.
Overall, the experiments demonstrate that the MW-MADDPG algorithm proposed in this paper can effectively learn the features of similar tasks and exploit historical experience to obtain more effective strategies. The proposed method exhibits better initial performance, a faster learning rate, better expected performance, a higher task success rate, and improved strategy stability in terms of reward and task execution success rate.

FIGURE
Red side task execution success rate. In various test scenarios, the proposed method exhibits a higher winning rate compared to both the traditional meta-learning method and the non-meta-learning method.

. . . Reward-TD and FIFO performance
This section verifies the effectiveness of the proposed Reward-TD prioritized experience replay method and forgetting mechanism. Two sets of experiments apply the above experience replay mechanisms to the MADDPG algorithm in training scenario 1 and training scenario 2, respectively. The reward curves obtained by the agent are analyzed across five random seeds to evaluate the performance of each method.
Figure 8 illustrates the reward curves of the different experience replay methods in scenario 1 and scenario 2, with RPER denoting the Reward-TD prioritized experience replay method, PER the TD-error prioritized experience replay method, and VER the random experience replay method. The final rewards obtained with the RPER mechanism are significantly better than those of the other two methods (p < 0.05), indicating that the RPER mechanism can effectively improve the final reward level. In contrast, the difference between the final rewards of the PER and VER methods is not significant, suggesting that TD-error-based prioritization alone has little effect on the final reward.
Regarding robustness, the RPER mechanism outperforms the PER mechanism, while the PER mechanism outperforms the VER mechanism. This indicates that prioritized experience replay is better than random uniform experience replay, and that Reward-TD-based prioritization is better than TD-error-based prioritization.
In terms of convergence speed, the RPER algorithm converges significantly faster than the PER and VER algorithms. Specifically, RPER reaches convergence at around 8 × 10^5 episodes in both scenarios, while PER and VER converge only after around 9 × 10^5 episodes. These results demonstrate that the RPER mechanism improves the convergence speed of the algorithm, whereas PER and VER have no significant impact on it. Figure 9 shows the reward curves of the different experience retention methods, comparing the forgetting mechanism (FM) and the first-in-first-out (FIFO) mechanism while using RPER with the MADDPG algorithm.
From Figure 9, it can be observed that training with the forgetting mechanism converges significantly faster than with the FIFO mechanism (p < 0.05). This suggests that the proposed forgetting mechanism effectively retains experience fragments that are beneficial to the agent and improves the training speed.
In terms of robustness, the FM mechanism exhibits fewer curve fluctuations and a narrower error band than the FIFO mechanism, as seen from the shading in the figure. The variance is reduced by 27.35% using FM compared to FIFO, indicating that FM improves the algorithm's robustness during training.
Notably, there is no significant difference between the final rewards of the two experience retention mechanisms, suggesting that the choice of experience retention mechanism has no significant effect on the final training result.
Table 4 compares the proposed method with the original MADDPG method in terms of task success rate, number of strategic locations destroyed, and other metrics. The table shows that the proposed method outperforms the MADDPG method on both training and testing tasks. Specifically, the MW-MADDPG method exhibits significantly better attack-UAV survival than reconnaissance-UAV survival on testing tasks, indicating that it learns an efficient strategy for the attack UAVs. These results suggest that the MW-MADDPG method proposed in this paper can effectively learn the common knowledge among tasks from the training tasks and apply it to test scenarios, showcasing better cross-task capability.
Furthermore, the proposed Reward-TD prioritized experience replay method with the forgetting mechanism can improve the algorithm's robustness, exhibiting less variance and greater robustness for the MW-MADDPG method.

. Conclusion
In summary, this paper proposes the MW-MADDPG algorithm for the cross-task heterogeneous UAV swarm cooperative decision-making problem. The proposed algorithm combines the improved MAML meta-learning method and the Reward-TD prioritized experience replay method with a forgetting mechanism, enabling cross-task intelligent UAV decision-making on top of the MADDPG algorithm and achieving the expected goals. Experimental results demonstrate that the proposed method achieves better task success rates, robustness, and rewards than traditional methods, while also exhibiting better generalization performance and overcoming the cold-start problem of traditional methods. The proposed algorithm has the potential to be extended to larger-scale scenarios and to provide a solution to the cross-task heterogeneous UAV swarm penetration and defense problem.
In the future, meta-learning methods can be introduced into intelligent decision-making for air defense systems to enable self-play between UAV penetration and air defense. Additionally, combining transfer learning with meta-learning may further improve generalization performance. Furthermore, we plan to build a high-fidelity battlefield environment that provides a more accurate simulation of the battle process and enables more realistic testing of the proposed algorithms.

FIGURE
FIGURE Unmanned aerial vehicles (UAVs) application scope diagram. UAVs have been widely utilized across various fields due to their numerous advantages.
the joint action set of agents, where the action set of agent i is A_i, with i ∈ [1, N]. The state transition function P : S × A × S → [0, 1] gives the state transition probabilities. R is the reward function for all agents, and O = O_1 × O_2 × ⋯ × O_N is the joint observation set, where O_i denotes the observation of agent i. Finally, γ ∈ [0, 1] is the discount factor.

FIGURE
FIGURE Experimental environment diagram. The objective of the red UAV swarm is to eliminate the blue airports and command posts, while the blue SAM is tasked with defending these targets.

FIGURE
FIGURE Test scenario experiment setup diagram; (A) and (B) show the reward curves of the two test scenarios. Various test scenarios were employed to evaluate the generalization capacity of the proposed algorithm.

FIGURE
FIGURE Reward curves for different experience replay methods. (A) Training scenario 1. (B) Training scenario 2. The RPER method outperforms the other two experience replay methods in terms of final reward, robustness, and algorithm convergence speed.

FIGURE
FIGURE Reward curves for different experience retention methods. (A) Training scenario 1. (B) Training scenario 2. The forgetting mechanism shows better convergence speed and robustness compared to the first-in-first-out mechanism in different training tasks, but the difference in the final reward is not significant.
TABLE Actions definition.
ω_i(t) — Angular velocity of UAV i at moment t
ū_i(t) — Acceleration of UAV i at moment t
M_i(t) — The number of missile attacks fired by UAV/launch unit i at time t (initial value 0)
R_k^t — Radar state of fire unit k at moment t (0 for off, 1 for on)
TABLE UAV reward definition.
d_i = min(√((x_i − x_k)² + (y_i − y_k)²)), i ∈ D, k ∈ F. The weighting factor λ determines the magnitude of the distance-based reward.
TABLE Hyperparameter setting for training process.
TABLE Algorithm performance comparison.