ORIGINAL RESEARCH article

Front. Phys., 07 May 2026

Sec. Interdisciplinary Physics

Volume 14 - 2026 | https://doi.org/10.3389/fphy.2026.1817865

Joint security and energy optimization in UAV-enabled smart grid networks

  • Information and Communication Branch, State Grid Shanxi Electric Power Company Limited, Taiyuan, Shanxi, China

Abstract

Introduction:

Recent years have witnessed an increasing number of Internet of Things devices (IoTDs) deployed in power grids to monitor bidirectional information and power transfer, transforming them into smart grids. The densification of IoTDs in smart grids demands communication solutions that are simultaneously secure against eavesdropping and energy-efficient for sustainable operation.

Methods:

This article proposes an unmanned aerial vehicle (UAV) and reconfigurable intelligent surface (RIS)-assisted framework for smart grids that maximizes the worst-case secrecy energy efficiency via joint optimization of the UAV's trajectory, the transmit beamforming, and the phase shifts of the RIS. A twin attention-driven deep reinforcement learning algorithm, the Twin Attention Mechanism with approximate Regret Reward TD3 (TAMRRTD3), is developed, featuring attention-based state representation and regret-aware reward design to enhance learning accuracy and convergence.

Results and Discussion:

Simulation results indicate that the proposed algorithm converges faster and achieves higher secrecy energy efficiency than the benchmark algorithms.

1 Introduction

Recent years have witnessed extensive deployments of Internet of Things (IoT) nodes in power grids to monitor the operation of distributed energy sources, transmission, transformation, distribution, and consumption. Specifically, through devices such as smart meters, operational and real-time energy consumption data can flow bidirectionally within the grid infrastructure, thereby transforming the traditional grid into a smart grid (SG). Meanwhile, unmanned aerial vehicles (UAVs) are widely adopted as aerial mobile base stations (BSs) [1, 2] and relay nodes to provide communication support for ground devices, or used as data collectors to enlarge coverage in IoT-driven smart grids [3]. Because UAVs can establish line-of-sight (LoS) links, they increase the capacity and coverage of wireless networks. Thus, UAVs have become an important part of 5G wireless communication systems [4].

However, the LoS links of UAV communications are susceptible to obstacles, such as dense buildings in urban areas, which can significantly reduce the reliability and capacity of UAV communications [5]. Furthermore, the openness of wireless channels leaves transmitted information vulnerable to eavesdropping by unauthorized devices, leading to insecure communication [6]. Therefore, providing secure UAV communications for legitimate devices in areas with dense buildings is a challenge for practical implementations.

With advances in materials and hardware technology, reconfigurable intelligent surfaces (RISs) have emerged as a critical enabler in wireless networks, reshaping the wireless environment through precise programming of their reflection units. When LoS links are blocked, an RIS can create reliable non-line-of-sight (NLoS) reflective links for UAVs by programmatically controlling the reflection units [7]. By controlling the propagation direction and strength of the signal, an RIS can also prevent signal leakage into unauthorized areas [8]. Physical-layer security (PLS) technology can further secure UAV communications by enhancing the capacity of legitimate users.

Based on PLS technology, researchers have proposed algorithms to enhance the secrecy rate of various UAV-enabled communication systems. A PLS scheme for UAV-to-ground communication was proposed in which the average secrecy rate was maximized through trajectory and power adaptation [9]. The authors then jointly optimized the UAV path and the power of a fixed legitimate transmitter to maximize the average secrecy rate between the UAV and the ground system, solving the problem with the block coordinate descent (BCD) and successive convex approximation (SCA) algorithms [10]. Considering the uncertainty of the eavesdropper's location and the maximum tolerable signal-to-noise ratio (SNR) leakage, an optimal path with maximum secrecy energy efficiency was derived [11]. The average secrecy rate was also maximized by designing an optimal UAV trajectory with appropriate transmit power [12].

[13] used a mobile UAV as a dynamic jamming source to maximize the average secrecy rate from a ground source to its destination within a limited duration by jointly optimizing the UAV trajectory and jamming power. Considering incomplete information about the eavesdropper's location, a PLS enhancement scheme based on UAV mobility was proposed to maximize the minimum average secrecy rate of the receiver [14]. A UAV trajectory maximizing the system secrecy rate was designed for dynamic, low-delay practical communication scenarios [15]. Because an RIS can effectively reshape signal propagation paths and steer signals away from eavesdropping risk areas, an alternating optimization-based algorithm was proposed to maximize the worst-case average secrecy rate by jointly optimizing the UAV trajectory, the passive beamforming of the RIS, and the transmit power [16]. Power efficiency was improved while guaranteeing worst-case secrecy rate constraints under UAV jitter in an active RIS-assisted UAV system [17]. The average secrecy rate was maximized by collaboratively optimizing power allocation, RIS phase shifts, and the UAV trajectory [18].

Although these traditional convex optimization methods provide secrecy-optimal trajectories for UAV-assisted systems, they adapt poorly to complex dynamic environments and rely heavily on high-precision channel state information. Unlike traditional optimization methods, deep reinforcement learning (DRL) can adapt to complex and dynamic communication environments, providing a novel way to solve dynamic optimization problems in UAV-enabled systems [19]. A reinforcement learning (RL) method combining Q-learning with neural networks was put forward to improve downlink capacity in RIS-aided UAV networks [20]. In addition, a model-free RL algorithm was proposed to improve the uplink capacity for UAVs in complex environments [21]. A distributed DRL algorithm was designed to reduce the total energy consumption in multi-UAV cooperative systems [22]. A Double Deep Q-Network (DDQN)-based algorithm was proposed to improve capacity and system energy efficiency in RIS-aided full-duplex systems [23]. System energy efficiency was improved significantly by combining the Dueling Deep Q-Network (Dueling DQN) with prioritized experience replay (PER) [24]. Different from the above methods, [25] proposed a DRL framework combining post-decision state (PDS) and PER to optimize beamforming against eavesdroppers and improve the system secrecy rate. When extended to cellular-connected UAV scenarios, the PDS-DRL framework not only facilitates the generation of optimal trajectories for evading potential threats [39] but also ensures communication continuity in environments compromised by malicious jamming sources [40]. Building upon this foundation, integrating the PDS approach with the Multi-Agent Twin Delayed Deep Deterministic Policy Gradient (MATD3) algorithm enables the minimization of mission execution time while strictly adhering to communication continuity constraints [41].

In scenarios in which UAVs are equipped with sensing capabilities, transfer learning-driven DRL algorithms effectively address situations involving temporary outages of ground IoT devices. By orchestrating collaboration between UAVs and multi-modal sensing entities, this approach generates robust and efficient trajectory planning strategies specifically tailored for emergency rescue operations [42]. In an RIS-assisted UAV non-orthogonal multiple access (NOMA) network, [26] designed a Deep Deterministic Policy Gradient (DDPG)-based algorithm to enhance downlink secure capacity. [27] constructed a twin TD3-based framework to improve secure energy efficiency (SEE) in complex RIS-UAV-enabled NOMA networks. Considering the existence of flying eavesdroppers, [28] employed the Dinkelbach and Taylor expansion methods, as well as a Proximal Policy Optimization (PPO)-based algorithm, to improve secure capacity in a UAV-enabled multi-access edge computing (MEC) network with multiple intelligent reflecting surfaces (IRSs).

A twin DDPG (TDDPG)-based algorithm was designed to maximize the secrecy sum rate (SSR) of authorized users under non-ideal channel state information with a faster convergence speed [29]. Meanwhile, a Twin-TD3 (TTD3) algorithm was developed to improve the system's SEE [30]. A dual PPO-based algorithm was developed to maximize the SEE of legitimate users [31]. A double-layer TTD3 algorithm was designed to maximize the weighted sum of the satellite-to-UAV transmission rate and the legitimate ground users' secrecy rate in a UAV-RIS-enabled satellite network [32]. [33] designed a DRL framework with dual DDPG networks to jointly optimize the active and passive beamforming to improve the secrecy rate.

As communication environments become more complex and highly dynamic, the high dimensionality of the state and action spaces increases algorithmic complexity. [34] employed a self-attention mechanism with actor-critic frameworks and PER to enhance decision-making on discrete problems. [35] integrated multi-head attention (MHA) and PER into the DDPG framework to improve the execution efficiency of the DRL algorithm.

Table 1 indicates that while various DRL approaches have been employed to enhance secure capacity or SEE in RIS-UAV wireless networks, twin-architecture DRL frameworks demonstrate superior efficiency in decoupling and jointly optimizing the high-dimensional tasks of RIS beamforming and UAV trajectory planning. Evidence from complex Markov decision processes (MDPs) in other domains suggests that attention mechanisms are highly effective in extracting critical features and accelerating the convergence of DRL algorithms [34, 35].

TABLE 1

Scenario | Purpose of the article | DRL method
IRS-aided wireless secure communication system [25] | Improving the system secrecy rate | DRL with post-decision state (PDS) and PER
RIS-UAV-enabled NOMA networks [26] | Enhancing downlink secure capacity | DDPG
RIS-UAV-enabled NOMA networks [27] | Improving secure energy efficiency | Twin TD3
UAV-enabled multi-access edge computing (MEC) network [28] | Improving secure capacity | PPO
RIS-assisted mmWave UAV communication systems [29] | Maximizing the secrecy sum rate | Twin DDPG
RIS-UAV-enabled network [30] | Improving secure energy efficiency | Twin TD3
Solving discrete optimization problems [34] | Accelerating convergence | Self-attention mechanism with actor-critic frameworks and PER

Comparison between the DRL algorithms in secure communication.

This work focuses on securing communications for legitimate IoT devices in SG environments, where UAVs serve as mobile base stations, and RIS is strategically deployed to overcome building blockages by establishing virtual NLoS links. To achieve significant improvements in SEE, we integrate an attention mechanism directly into critic networks, building upon a twin-agent decoupling architecture. This design enables twin agents to adaptively identify and amplify critical features, specifically channel state information (CSI) and real-time UAV positions, thereby optimizing system performance.

We formulate an SEE maximization problem. To address the high dimensionality and strong coupling of the optimization variables, we propose the Twin Attention Mechanism with approximate Regret Reward TD3 (TAMRRTD3) algorithm. Utilizing twin agents, our approach jointly optimizes the active and passive beamforming of the UAV and the RIS, as well as the UAV trajectory, under worst-case channel conditions. The main contributions are as follows.

We establish a long-term SEE maximization framework for a UAV-RIS-enabled IoT network within smart grids. This complex optimization problem is formulated as an MDP. A twin-agent architecture is proposed to simultaneously explore beamforming strategies and trajectories.

We propose the novel TAMRRTD3 algorithm to solve the formulated MDP. Specifically, additive attention mechanism layers are integrated into the hidden layers of the twin TD3 networks to effectively capture dependencies among diverse input states. An approximate regret-based reward mechanism is incorporated into all critic networks to stabilize training and converge to an optimized policy.

We achieve the joint optimization of active/passive beamforming and the UAV trajectory. Simulation results demonstrate that, compared with benchmark algorithms, TAMRRTD3 yields significant improvements in SEE with a faster convergence rate. Moreover, the proposed algorithm also performs well when device locations follow a Gaussian distribution.

2 System model

The system model of this article is shown in Figure 1, and the main notations used in this paper are summarized in Table 2. The UAV is furnished with an L-element uniform linear array (ULA) and communicates with multiple legitimate IoT devices over millimeter-wave channels [36]. To address signal blockages caused by high buildings, an RIS comprising M uniform planar array (UPA) reflecting elements is deployed. The RIS establishes reliable NLoS virtual links via intelligent reflection, thereby significantly increasing signal strength in shadowed areas and extending effective coverage. The steering vector of the ULA is denoted by $\mathbf{a}(\theta) = [1, e^{j\frac{2\pi d}{\lambda}\sin\theta}, \ldots, e^{j\frac{2\pi d}{\lambda}(L-1)\sin\theta}]^T$, where $\theta$ is the azimuth angle of departure (AoD), $d$ is the antenna inter-spacing, and $\lambda$ is the carrier wavelength. The steering vector of the UPA is denoted by $\mathbf{b}(\phi, \psi)$, where $\phi$ and $\psi$ are the azimuth AoD and the angle of arrival (AoA), respectively.

FIGURE 1

TABLE 2

Notation | Description
$M$ | Number of RIS elements
$\mathbf{a}(\theta)$ | Steering vector of the uniform linear array (ULA)
$\theta$ | Azimuth angle of departure
$\mathbf{b}(\phi, \psi)$ | Steering vector of the uniform planar array (UPA)
$\phi$, $\psi$ | Azimuth angle of departure (AoD) and the angle of arrival (AoA)
$E$ | Eavesdroppers
$K$ | Legitimate devices
$T$ | UAV's flight time
$N$ | Number of time slots
$\delta$ | Length of each time slot
$n$ | Any time slot
$\mathbf{q}_R$ | Location of the RIS
$\mathbf{w}_k(n)$, $\mathbf{w}_e(n)$ | Locations of legitimate devices and the eavesdropper at time slot $n$
$B$ | Area bound
$H$ | Height of the UAV
$\mathbf{q}(n)$ | Coordinate of the UAV at time slot $n$
$D_{\max}$ | Maximum distance that the UAV flies in time slot $n$
$v(n)$ | Velocity of the UAV at time slot $n$
$E_p(n)$ | Propulsion energy consumption of the UAV
$\mathbf{h}_{uk}$ | Channel gains from the UAV to the legitimate devices
$\mathbf{h}_{ue}$ | Channel gains from the UAV to the eavesdropper
$\mathbf{h}_{rk}$ | Channel gains from the RIS to the devices
$\mathbf{h}_{re}$ | Channel gains from the RIS to the eavesdropper
$\mathbf{H}_{ur}$ | Channel gains from the UAV to the RIS
$\mathbf{h}_{uv}$ | Channel gains from the UAV to the devices or the eavesdropper
$\boldsymbol{\Theta}$ | Phase shift matrix of RIS
$\boldsymbol{\theta}$ | Beamforming matrix of RIS
$\bar{\mathbf{h}}_v$ | Channel gains from the UAV to the receivers
$y_v(n)$ | Signal received by the receiver from the UAV
$\mathbf{G}$ | Beamforming matrix of the UAV
$n_v(n)$ | Noise
$\mathbf{g}_k$ | $k$-th row of $\mathbf{G}$
$R_k(n)$ | Achievable data rate of device $k$
$R_{e,k}(n)$ | Feasible rate at the eavesdropper
$R_{sec,k}(n)$ | Secrecy rate from the UAV to device $k$
$\mathbf{Q}$ | UAV's trajectory
$P_{\max}$ | UAV's maximal transmit power
$AR(n)$ | Approximate regret at time slot $n$
$\theta_{i,1}$, $\theta_{i,2}$, $\theta_{i,1}'$, $\theta_{i,2}'$ | Parameters of critic and critic target networks
$\phi_i$, $\phi_i'$ | Parameters of the actor and the actor target network
$W_n$ | Local information
$\hat{C}_n$ | CSI predicted by the UAV
$s_n$ | States
$a_n$ | Actions
$r_n$ | Reward
$(s_n, a_n, r_n, s_{n+1})$ | Experience tuple
$N_b$ | Batch size
$\tau$ | Update coefficient

Summary of main notations.

To minimize hardware costs and power consumption, the UAV is assumed to operate in an omnidirectional mode, with signal focusing achieved primarily by optimizing the RIS reflection coefficients. There are $E$ eavesdroppers and $K$ legitimate devices, each equipped with a single antenna. The UAV's flight time $T$ is divided into $N$ time slots, where $\delta$ is the length of each time slot. At time slot $n$, the location of the RIS is $\mathbf{q}_R$, and the locations of the legitimate devices and the eavesdropper are denoted as $\mathbf{w}_k(n)$ and $\mathbf{w}_e(n)$, respectively.

The UAV flies in the area bounded by $B$ at a height of $H$, with coordinate $\mathbf{q}(n)$ at time slot $n$, which must satisfy the constraint conditions shown in Equation 1:

$$\|\mathbf{q}(n) - \mathbf{q}(n-1)\| \le D_{\max}, \quad \mathbf{q}(n) \in B, \tag{1}$$

where $D_{\max}$ is the maximum distance that the UAV can fly in time slot $n$. The velocity of the UAV at time slot $n$ is shown in Equation 2:

$$v(n) = \frac{\|\mathbf{q}(n) - \mathbf{q}(n-1)\|}{\delta}. \tag{2}$$

Based on the velocity $v(n)$, the propulsion energy consumption of the UAV is computed as Equation 3:

$$E_p(n) = \delta\left[P_0\left(1 + \frac{3v(n)^2}{U_{tip}^2}\right) + P_i\left(\sqrt{1 + \frac{v(n)^4}{4v_0^4}} - \frac{v(n)^2}{2v_0^2}\right)^{1/2} + \frac{1}{2}d_0\rho s A\, v(n)^3\right]. \tag{3}$$

Here, $P_0$ and $P_i$ are the blade profile power and the induced power of the UAV in hover, respectively; $U_{tip}$ is the rotor blade tip speed; $v_0$ is the average rotor induced speed in hover; $d_0$ is the fuselage drag ratio; $s$ is the rotor solidity; $\rho$ is the air density; and $A$ is the rotor disc area.
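To make the energy model concrete, the following is a minimal sketch assuming Equation 3 follows the standard rotary-wing propulsion power model that the listed parameters come from; the default values mirror Table 4, and the function name is illustrative.

```python
import numpy as np

# A minimal sketch, assuming Equation 3 is the standard rotary-wing
# propulsion power model (blade profile power P0, induced power Pi,
# tip speed U_tip, mean rotor induced velocity v0, fuselage drag ratio d0,
# rotor solidity s, air density rho, rotor disc area A).

def propulsion_power(v, P0=582.65, Pi=790.67, U_tip=200.0,
                     v0=2.567, d0=0.3, s=0.05, rho=1.225, A=0.97):
    """Propulsion power (W) of a rotary-wing UAV flying at speed v (m/s)."""
    blade = P0 * (1.0 + 3.0 * v**2 / U_tip**2)                  # blade profile power
    induced = Pi * np.sqrt(np.sqrt(1.0 + v**4 / (4.0 * v0**4))
                           - v**2 / (2.0 * v0**2))              # induced power
    parasite = 0.5 * d0 * rho * s * A * v**3                    # fuselage drag power
    return blade + induced + parasite

# Propulsion energy over one time slot of length delta (s):
# E_p(n) = propulsion_power(v_n) * delta
```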

Let $\mathbf{h}_{uk}$, $\mathbf{h}_{ue}$, $\mathbf{h}_{rk}$, $\mathbf{h}_{re}$, and $\mathbf{H}_{ur}$ denote the channel gains from the UAV to the legitimate devices, from the UAV to the eavesdropper, from the RIS to the devices, from the RIS to the eavesdropper, and from the UAV to the RIS, respectively. The channel gains are computed as Equation 4:

$$\mathbf{h}_{iv} = \sqrt{\beta_0 d_{iv}^{-\alpha_{iv}}}\left(\sqrt{\frac{K_{iv}}{K_{iv}+1}}\,\mathbf{h}_{iv}^{\mathrm{LoS}} + \sqrt{\frac{1}{K_{iv}+1}}\,\mathbf{h}_{iv}^{\mathrm{NLoS}}\right), \tag{4}$$

where $\beta_0$ denotes the power gain at a reference distance of 1 m, $d_{iv}$ denotes the distance between the UAV or RIS and receiver $v$, $\alpha_{iv}$ denotes the path loss exponent of the link, and $K_{iv}$ is the Rician K-factor of the link. The UAV-RIS channel $\mathbf{H}_{ur}$ is computed as Equation 5:

$$\mathbf{H}_{ur} = \sqrt{\beta_0 d_{ur}^{-\alpha_{ur}}}\left(\sqrt{\frac{K_{ur}}{K_{ur}+1}}\,\mathbf{H}_{ur}^{\mathrm{LoS}} + \sqrt{\frac{1}{K_{ur}+1}}\,\mathbf{H}_{ur}^{\mathrm{NLoS}}\right), \tag{5}$$

where $d_{ur}$ is the distance between the UAV and the RIS, $\alpha_{ur}$ is the path loss exponent, $K_{ur}$ is the corresponding Rician factor, $\mathbf{H}_{ur}^{\mathrm{LoS}}$ is the deterministic LoS array response, and $\mathbf{H}_{ur}^{\mathrm{NLoS}}$ contains i.i.d. entries following $\mathcal{CN}(0, 1)$. Therefore, the direct channel from the UAV to the devices or the eavesdropper can be denoted as $\mathbf{h}_{uv}$, $v \in \{k, e\}$. The phase shift matrix of the RIS is $\boldsymbol{\Theta} = \mathrm{diag}(\beta_1 e^{j\phi_1}, \ldots, \beta_M e^{j\phi_M})$, where, for the $m$-th reflection element, the phase shift is $\phi_m \in [0, 2\pi)$ and the amplitude reflection coefficient is $\beta_m \in [0, 1]$. For simplicity, $\beta_m = 1$. The beamforming of the RIS can be vectorized as $\boldsymbol{\theta} = [e^{j\phi_1}, \ldots, e^{j\phi_M}]^T$, so the channel gains from the UAV to the receivers are denoted as $\bar{\mathbf{h}}_v = \mathbf{h}_{uv}^H + \mathbf{h}_{rv}^H \boldsymbol{\Theta} \mathbf{H}_{ur}$, $v \in \{k, e\}$. The signal received by receiver $v$ from the UAV is denoted as Equation 6:

$$y_v(n) = \bar{\mathbf{h}}_v(n) \sum_{j=1}^{K} \mathbf{g}_j^H(n)\, s_j(n) + n_v(n). \tag{6}$$

The average transmit power of each signal is normalized to 1, as shown in Equation 7:

$$\mathbb{E}\left[|s_k(n)|^2\right] = 1, \tag{7}$$

where $s_k(n)$ is the signal intended for device $k$, $\mathbf{G}$ is the beamforming matrix of the UAV, $n_v(n) \sim \mathcal{CN}(0, \sigma^2)$ is the noise, and $\mathbf{g}_k$ is the $k$-th row of $\mathbf{G}$. Then, the achievable data rate of device $k$ is computed as Equation 8:

$$R_k(n) = \log_2\left(1 + \frac{|\bar{\mathbf{h}}_k(n)\,\mathbf{g}_k^H(n)|^2}{\sum_{j \ne k}|\bar{\mathbf{h}}_k(n)\,\mathbf{g}_j^H(n)|^2 + \sigma^2}\right). \tag{8}$$

Suppose the eavesdropper intercepts the signal of legitimate device $k$. The feasible rate at the eavesdropper is computed as Equation 9:

$$R_{e,k}(n) = \log_2\left(1 + \frac{|\bar{\mathbf{h}}_e(n)\,\mathbf{g}_k^H(n)|^2}{\sum_{j \ne k}|\bar{\mathbf{h}}_e(n)\,\mathbf{g}_j^H(n)|^2 + \sigma^2}\right). \tag{9}$$

The secrecy rate from the UAV to device $k$ is denoted as Equation 10:

$$R_{sec,k}(n) = \left[R_k(n) - R_{e,k}(n)\right]^+, \tag{10}$$

where $[x]^+ = \max(x, 0)$.

Finally, the SEE of the system is computed as Equation 11:

$$\mathrm{SEE} = \frac{\sum_{n=1}^{N}\sum_{k=1}^{K} R_{sec,k}(n)}{\sum_{n=1}^{N} E_p(n)}. \tag{11}$$
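Under the notation reconstructed in this section, the following sketch shows how the per-device secrecy rate (Equations 8-10) and the system SEE (Equation 11) could be evaluated; the variable names (h_bar, G, sigma2) follow this section's reconstruction and are illustrative rather than the authors' implementation.

```python
import numpy as np

# A minimal sketch, assuming: h_bar is a combined 1 x L channel (device or
# eavesdropper), G is the K x L UAV beamforming matrix whose k-th row
# serves device k, and sigma2 is the noise power.

def rate(h_bar, G, k, sigma2):
    """Achievable rate (bit/s/Hz) of stream k at a receiver with channel h_bar."""
    gains = np.abs(h_bar @ G.conj().T) ** 2          # |h_bar g_j^H|^2 for all j
    signal = gains[k]
    interference = gains.sum() - signal
    return np.log2(1.0 + signal / (interference + sigma2))

def secrecy_rate(h_bar_k, h_bar_e, G, k, sigma2):
    """[R_k - R_{e,k}]^+ as in Equation 10."""
    return max(rate(h_bar_k, G, k, sigma2) - rate(h_bar_e, G, k, sigma2), 0.0)

def secrecy_energy_efficiency(sum_secrecy_rates, propulsion_energies):
    """Equation 11: total secrecy rate divided by total propulsion energy."""
    return np.sum(sum_secrecy_rates) / np.sum(propulsion_energies)
```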

3 Problem formulation

The purpose of this article is to maximize the long-term SEE between the legitimate devices and the UAV by jointly optimizing the active beamforming matrix $\mathbf{G}$, the passive beamforming matrix $\boldsymbol{\Theta}$ of the RIS, and the UAV's trajectory $\mathbf{Q}$. The optimization problem is shown as Equation 12:

$$\begin{aligned} \max_{\mathbf{G}, \boldsymbol{\Theta}, \mathbf{Q}} \ & \mathrm{SEE} \\ \mathrm{s.t.} \ & \mathrm{C1}: \ \|\mathbf{q}(n) - \mathbf{q}(n-1)\| \le D_{\max}, \ \mathbf{q}(n) \in B, \\ & \mathrm{C2}: \ \Pr\{R_{sec,k}(n) \ge R_{th}\} \ge \rho, \\ & \mathrm{C3}: \ \phi_m \in [0, 2\pi), \ m = 1, \ldots, M, \\ & \mathrm{C4}: \ \mathrm{tr}(\mathbf{G}\mathbf{G}^H) \le P_{\max}. \end{aligned} \tag{12}$$

Constraint C1 specifies the conditions that the UAV trajectory must meet. Constraint C2 specifies that the probability of the secure communication rate exceeding the threshold $R_{th}$ is no less than $\rho$. Constraint C3 specifies the conditions that the phase shifts of the RIS must satisfy. C1, C2, and C3 are nonconvex. Constraint C4 is the power condition that the UAV beamforming must satisfy, which involves nonlinear matrix operations; $P_{\max}$ is the UAV's maximal transmit power. It is challenging to tackle such a probability-constrained nonconvex optimization problem, so we propose a DRL-based solution in the next section.

4 Twin Attention Mechanism Approximate Regret Reward TD3 algorithm

4.1 Attention mechanism

An attention mechanism computes the association between a query and a set of keys through an additive scoring function to obtain attention weights [37]. The similarity scores between the query and the keys are computed first. These scores are transformed into weights using a softmax function, and the values are then weighted by them and summed to generate a representation containing contextual information.

Given a query vector $\mathbf{q}$ and a set of key-value pairs $\{(\mathbf{k}_i, \mathbf{v}_i)\}$, the query and keys are mapped to a common space as Equation 13:

$$\tilde{\mathbf{q}} = \mathbf{W}_q \mathbf{q}, \quad \tilde{\mathbf{k}}_i = \mathbf{W}_k \mathbf{k}_i. \tag{13}$$

Then, $\tilde{\mathbf{q}}$ is combined with $\tilde{\mathbf{k}}_i$, and the output passes through a nonlinear tanh activation function to enhance the expressive power, as Equation 14:

$$e_i = \mathbf{w}^T \tanh(\tilde{\mathbf{q}} + \tilde{\mathbf{k}}_i), \tag{14}$$

where $\mathbf{w}$ is a learnable vector.

The scores computed using Equation 14 are passed through a softmax function to convert them into attention weights $\alpha_i$, which represent the degree of importance assigned to each value $\mathbf{v}_i$, as Equation 15:

$$\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}. \tag{15}$$

The final context vector is computed as Equation 16:

$$\mathbf{c} = \sum_i \alpha_i \mathbf{v}_i. \tag{16}$$
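The following is a compact PyTorch sketch of the additive (Bahdanau-style) attention layer described by Equations 13-16; the dimensions and class name are illustrative, and the paper embeds such layers inside the hidden layers of the critic networks.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Additive attention: score with w^T tanh(W_q q + W_k k_i), then softmax."""

    def __init__(self, query_dim, key_dim, hidden_dim):
        super().__init__()
        self.W_q = nn.Linear(query_dim, hidden_dim, bias=False)  # Eq. 13
        self.W_k = nn.Linear(key_dim, hidden_dim, bias=False)    # Eq. 13
        self.w = nn.Linear(hidden_dim, 1, bias=False)            # vector w in Eq. 14

    def forward(self, query, keys, values):
        # query: (B, query_dim); keys: (B, N, key_dim); values: (B, N, value_dim)
        scores = self.w(torch.tanh(self.W_q(query).unsqueeze(1)
                                   + self.W_k(keys)))            # Eq. 14: (B, N, 1)
        weights = torch.softmax(scores, dim=1)                   # Eq. 15
        context = (weights * values).sum(dim=1)                  # Eq. 16: (B, value_dim)
        return context, weights.squeeze(-1)
```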

4.2 Approximate regret reward

Because there are two critic networks $Q_1$ and $Q_2$ in the TD3 network, at time slot $n$, the approximate regret is adopted to denote the divergence between the valuations of the same action output by the two critic networks, which can be computed as Equation 17:

$$AR(n) = \left|Q_1(s_n, a_n) - Q_2(s_n, a_n)\right|. \tag{17}$$

If the approximate regret value $AR(n)$ is large, there is uncertainty in the Q-value estimates made by the critic networks, and the update of the selected critic network should be adjusted accordingly to reduce this uncertainty in the subsequent decision-making process. The approximate regret reward is designed accordingly by Equation 18.

Then, the loss function can be computed using Equation 19.
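A small sketch of the approximate regret in Equation 17 follows. Since the exact form of the reward shaping (Equation 18) is not fully recoverable here, the `regret_aware_reward` helper with weight `kappa` is an assumption, not the paper's exact design.

```python
import torch

def approximate_regret(q1, q2, state, action):
    """Equation 17: AR(n) = |Q1(s, a) - Q2(s, a)| for a batch of transitions."""
    with torch.no_grad():
        return (q1(state, action) - q2(state, action)).abs()

def regret_aware_reward(reward, ar, kappa=0.1):
    """Hypothetical shaping: penalize the reward when critic disagreement is high."""
    return reward - kappa * ar
```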

4.3 Framework of the Twin Attention Mechanism Approximate Regret Reward TD3 algorithm

The optimization problem in Equation 12 can be modeled as a Markov decision process. However, it is challenging to solve directly because the UAV trajectory is highly coupled with the beamforming matrices $\mathbf{G}$ and $\boldsymbol{\Theta}$. To solve this problem, we propose the Twin Attention Mechanism with approximate Regret Reward TD3 (TAMRRTD3) algorithm. As shown in Figure 2, the proposed algorithm adopts a twin-parallel agent architecture comprising AMRRTD3-Agent1 and AMRRTD3-Agent2. The former optimizes the beamforming matrices $\mathbf{G}$ and $\boldsymbol{\Theta}$, while the latter handles UAV trajectory planning. These modules operate in parallel to collaboratively solve the high-dimensional coupled problem.

FIGURE 2

Each AMRRTD3 agent contains six networks. Specifically, AMRRTD3 Agent 1 contains an actor (policy) network with parameter $\phi_1$ and its corresponding target network with parameter $\phi_1'$, two critic networks equipped with attention layers with parameters $\theta_{1,1}$ and $\theta_{1,2}$, and their respective target critic networks with parameters $\theta_{1,1}'$ and $\theta_{1,2}'$. Similarly, AMRRTD3 Agent 2 follows the same architecture, containing an actor (policy) network with parameter $\phi_2$, a target actor network with parameter $\phi_2'$, two attention-enhanced critic networks with parameters $\theta_{2,1}$ and $\theta_{2,2}$, and two target critic networks with parameters $\theta_{2,1}'$ and $\theta_{2,2}'$. The integration of attention layers within the critic networks enables the agents to effectively extract complex dependencies between different input states.

In each actor-critic framework, the current state $s_n$ from the environment is fed into the actor (policy) network, and the action $a_n = \pi(s_n) + \epsilon$ is selected, where $\epsilon$ is the introduced noise, specifically Gaussian noise, to enhance exploration. When the chosen action is executed, the agent gains reward $r_n$ and the environment transitions to a new state $s_{n+1}$. The tuple $(s_n, a_n, r_n, s_{n+1})$ is stored in the experience pool, from which a batch of experience samples is randomly selected for training.

As depicted in Figure 2, our proposed framework utilizes two specialized AMRRTD3 agents. In this parallel setup, Agent 1 takes the CSI to generate the near-optimal beamforming matrix $\mathbf{G}$ and phase shifts $\boldsymbol{\Theta}$, while Agent 2 processes local environmental information $W$ to determine a high-quality trajectory $\mathbf{Q}$. At each time slot $n$, both agents observe their respective states $s_{1,n}$ and $s_{2,n}$ and select actions $a_{1,n}$ and $a_{2,n}$ based on their own policies $\pi_1$ and $\pi_2$, respectively. The environment transitions to new states $s_{1,n+1}$ and $s_{2,n+1}$ and yields rewards $r_{1,n}$ and $r_{2,n}$, respectively. The relevant experiences are stored in the respective experience pools. During the training phase, mini-batches sampled from the buffers are used to update the networks: the twin critic networks estimate action values to minimize the Bellman error, while the actor networks are optimized via the policy gradient. This iterative process continues until the policies converge toward $\pi_1^*$ and $\pi_2^*$, respectively. A sketch of one such interaction step is shown below.
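The following is a hedged sketch of one interaction step of the twin-agent loop just described; `env`, `agent1`, `agent2`, and the buffer API are illustrative placeholders rather than the paper's implementation.

```python
def twin_step(env, agent1, agent2, buffer1, buffer2, noise_std=0.1):
    s1, s2 = env.observe_csi(), env.observe_local_info()
    a1 = agent1.act(s1, noise_std)      # beamforming matrix G and phase shifts
    a2 = agent2.act(s2, noise_std)      # UAV flight direction
    (s1_next, s2_next), (r1, r2) = env.step(a1, a2)
    buffer1.store(s1, a1, r1, s1_next)  # separate experience pools per agent
    buffer2.store(s2, a2, r2, s2_next)
    return (s1_next, s2_next), (r1, r2)
```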

The state space, action space, and reward function are set as follows.

4.3.1 State space

In the proposed TAMRRTD3 algorithm, the two parallel agents operate on distinct state spaces, denoted as $\mathcal{S}_1$ and $\mathcal{S}_2$ for AMRRTD3 Agent 1 and AMRRTD3 Agent 2, respectively. At time slot $n$, the state of AMRRTD3 Agent 1 is the CSI predicted by the UAV, denoted as $\hat{C}_n$, and the state of AMRRTD3 Agent 2 is the local information, denoted as $W_n$. Consequently, the specific states observed by the agents are defined as Equation 20:

$$s_{1,n} = \{\hat{C}_n\}, \quad s_{2,n} = \{W_n\}. \tag{20}$$

4.3.2 Action space

In the TAMRRTD3 algorithm, the action spaces of the two AMRRTD3 agents are defined as $\mathcal{A}_1$ and $\mathcal{A}_2$, respectively. AMRRTD3 Agent 1 takes the CSI as input and generates a near-optimal joint beamforming strategy, comprising the UAV active beamforming matrix $\mathbf{G}(n)$ and the RIS passive beamforming matrix $\boldsymbol{\Theta}(n)$. AMRRTD3 Agent 2 takes the local information as its state input to obtain a high-quality UAV trajectory $\mathbf{Q}$. At each time slot $n$, AMRRTD3 Agent 2 outputs a flight direction $\mathbf{d}(n)$, from which the UAV's coordinates for the next time slot $\mathbf{q}(n+1)$ can be computed. Thus, the actions executed by the agents at time slot $n$ are formally defined as Equation 21:

$$a_{1,n} = \{\mathbf{G}(n), \boldsymbol{\Theta}(n)\}, \quad a_{2,n} = \{\mathbf{d}(n)\}. \tag{21}$$
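A minimal sketch of the trajectory action follows: the next coordinate is obtained from the current position and the output flight direction, subject to the per-slot distance bound $D_{\max}$ from Equation 1. The normalization and clipping scheme is an assumption, not the paper's exact rule.

```python
import numpy as np

def next_position(q, direction, d_max, bound):
    """q(n+1) from q(n), a flight direction, and the per-slot range d_max."""
    step = direction / (np.linalg.norm(direction) + 1e-9) * d_max
    q_next = q + step
    return np.clip(q_next, 0.0, bound)   # assumed: keep the UAV inside area B
```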

4.3.3 Reward function

To guide the agents toward the global optimization objective at each time slot, an instantaneous reward mechanism is established to evaluate behavioral performance. Taking UAV energy consumption into account, both agents share a unified reward function designed to balance communication performance and energy efficiency. The reward function is formulated as Equation 22, in which penalty terms are applied when the constraints of Equation 12 are not met, an additional term penalizes high energy consumption, and weight coefficients balance these terms. The energy consumption penalty term is represented as Equation 23.

For generalization purposes, the energy consumption is normalized to the range $[0, 1]$. The normalized energy consumption is denoted as Equation 24.

The design of this reward function aims to prevent agents from blindly reducing energy consumption without considering the optimization of the secrecy rate.

The TAMRRTD3 algorithm adopts a centralized training and decentralized execution approach. During training, a batch of $N_b$ experiences is sampled from the experience pool, where $N_b$ is the batch size. During the computation of the target Q-value, the action for the subsequent state is generated by the target actor network as Equation 25:

$$a_{n+1}' = \pi'(s_{n+1}) + \epsilon'. \tag{25}$$

The target Q-value is computed using the smaller of the two target critic outputs $Q_{\theta_1'}$ and $Q_{\theta_2'}$, as shown in Equation 26:

$$y_n = r_n + \gamma \min_{i=1,2} Q_{\theta_i'}(s_{n+1}, a_{n+1}'), \tag{26}$$

where $\gamma$ is the discount factor; taking the minimum clips the Q-values to prevent overestimation of the target Q-value.
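The following is a sketch of Equations 25-26 in the usual TD3 style: a target-policy action with clipped exploration noise, then the minimum of the two target critics to curb overestimation. The batch layout and hyperparameter values are assumptions.

```python
import torch

def td_target(batch, actor_target, q1_target, q2_target,
              gamma=0.99, noise_std=0.2, noise_clip=0.5):
    s_next, r, done = batch["s_next"], batch["r"], batch["done"]
    with torch.no_grad():
        a_next = actor_target(s_next)                              # Equation 25
        noise = (torch.randn_like(a_next) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = a_next + noise
        q_min = torch.min(q1_target(s_next, a_next),
                          q2_target(s_next, a_next))               # Equation 26
        return r + gamma * (1.0 - done) * q_min
```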

The approximate regret reward (ARR) is computed according to Equation 17. If $AR(n)$ is small, both Q-networks update their parameters by minimizing a mean-squared-error loss function, thereby assessing the quality of the actions more accurately. The mean squared error losses of the two networks are computed based on Equations 27, 28:

$$L(\theta_i) = \frac{1}{N_b}\sum \left(y_n - Q_{\theta_i}(s_n, a_n)\right)^2, \quad i = 1, 2. \tag{27, 28}$$

If $AR(n)$ is large, only the Q-network with the lower Q-value updates its parameters by minimizing the mean squared error according to the following Equation 29:

$$L(\theta_{i^*}) = \frac{1}{N_b}\sum \left(y_n - Q_{\theta_{i^*}}(s_n, a_n)\right)^2, \quad i^* = \arg\min_{i} Q_{\theta_i}(s_n, a_n). \tag{29}$$
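A hedged sketch of this ARR-conditioned critic update follows: when the critics agree (small regret), both minimize the MSE to the target (Equations 27, 28); when they disagree (large regret), only the critic with the lower estimate is updated (Equation 29). The threshold `ar_thresh` is an assumed hyperparameter.

```python
import torch.nn.functional as F

def update_critics(q1, q2, opt1, opt2, s, a, target, ar_thresh=0.5):
    q1_val, q2_val = q1(s, a), q2(s, a)
    ar = (q1_val.detach() - q2_val.detach()).abs().mean()   # Equation 17
    if ar <= ar_thresh:
        updates = [(q1_val, opt1), (q2_val, opt2)]           # Equations 27, 28
    elif q1_val.mean() < q2_val.mean():
        updates = [(q1_val, opt1)]                           # Equation 29
    else:
        updates = [(q2_val, opt2)]
    for q_val, opt in updates:
        loss = F.mse_loss(q_val, target)   # target precomputed under no_grad
        opt.zero_grad()
        loss.backward()
        opt.step()
```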

The parameters of the policy network are updated less frequently than those of the Q-networks. The policy network is updated by maximizing the Q-value via the deterministic policy gradient in Equation 30:

$$\nabla_{\phi} J(\phi) = \frac{1}{N_b}\sum \nabla_{a} Q_{\theta_1}(s, a)\big|_{a = \pi_{\phi}(s)}\, \nabla_{\phi}\pi_{\phi}(s). \tag{30}$$

The parameters of the target networks are updated using a soft update method, computed as Equations 31, 32:

$$\theta_i' \leftarrow \tau \theta_i + (1 - \tau)\theta_i', \quad i = 1, 2, \tag{31}$$

$$\phi' \leftarrow \tau \phi + (1 - \tau)\phi', \tag{32}$$

where $\tau$ is the soft update coefficient.
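The soft (Polyak) update of Equations 31-32 corresponds to the following one-liner per parameter tensor; the default value of `tau` is illustrative.

```python
def soft_update(net, target_net, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta' (Equations 31-32)."""
    for p, p_targ in zip(net.parameters(), target_net.parameters()):
        p_targ.data.mul_(1.0 - tau).add_(tau * p.data)
```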

The TAMRRTD3 algorithm for training the networks is presented in Table 3.

TABLE 3

Algorithm TAMRRTD3
Input: Environment, parameters of actor networks, critic networks, target networks, noise, and experience pool
Output: Parameters for twin AMRRTD3 networks
1. Initialize the corresponding network parameters for agent 1 and agent 2
2. For episode = 1, 2, … , Nep do
3.   Reset the phase shift matrix of the RIS and the positions of the UAV, the legitimate devices, and the eavesdropper
4.   For step = 1, 2, … , Nstep do
5.     Observe the environment to get states $s_{1,n}$ and $s_{2,n}$
6.     Select actions $a_{1,n}$ and $a_{2,n}$; obtain rewards $r_{1,n}$, $r_{2,n}$ and next states $s_{1,n+1}$, $s_{2,n+1}$
7.     Store $(s_{1,n}, a_{1,n}, r_{1,n}, s_{1,n+1})$ and $(s_{2,n}, a_{2,n}, r_{2,n}, s_{2,n+1})$ in their respective experience pools
8.     Sample mini-batches to update the critic and actor networks and their target networks for both agents
9.   end For
10. end For

TAMRRTD3 algorithm.

In the initial stage of the TAMRRTD3 algorithm, the twin AMRRTD3 network parameters are initialized and the corresponding optimizers are configured, including creating a fixed-capacity experience buffer and setting up action noise generators (e.g., Gaussian noise) to enhance exploration. At the beginning of each training episode, the environment resets the phase shift matrix of the RIS and the positions of the UAV, the legitimate users, and the eavesdropper, and acquires the initial state. In the time-step loop, the agents select actions, execute them with superimposed noise, and receive rewards and next-state information. The four-tuple $(s_n, a_n, r_n, s_{n+1})$ is then stored in the experience pool. This process iterates until the experience pool is filled, after which training proceeds until the end of the episodes.

5 Simulation results

The performance of the TAMRRTD3 algorithm is simulated to demonstrate its advantages. According to the parametric settings in the literature [29], the size of the simulation scene is (50 m, 50 m, 50 m), and the initial positions of the legitimate devices are (25 m, 25 m, 0 m) and (4 m, 47 m, 0 m), respectively. The initial position of the UAV is (0 m, 25 m, 50 m), and the positions of the RIS and the eavesdropper are (0 m, 50 m, 12.5 m) and (47 m, −4 m, 0 m), respectively. The remaining system parameters and the path loss exponents of each link also follow the settings in [29]. The energy consumption parameters of the UAV and the hyperparameters of the proposed TAMRRTD3 algorithm are summarized in Tables 4, 5.

TABLE 4

Parameter | Value
Blade profile power | 582.65 W
Induced power | 790.67 W
Rotor blade tip speed | 200 m/s
Air density | 1.225 kg/m³
Fuselage drag ratio | 0.3
Rotor solidity | 0.05
Rotor disc area | 0.97 m²
Average rotor induced speed | 2.567 m/s

UAV energy consumption parameters.

TABLE 5

Hyperparameter | Value
AMRRTD3 agent 1 size | 27 × 800 × 600 × 515 × 256 × 20
AMRRTD3 agent 2 size | 3 × 400 × 300 × 256 × 128 × 2
Actor learning rate | 0.0001
Critic learning rate | 0.001
Nep | 300
Nstep | 100
 | 1
Batch size | 64
Experience pool size | 30,000
Actor update interval | 2

Hyperparameters for the TAMRRTD3 algorithm.

The baseline algorithms are as follows:

  • The Twin-DDPG algorithm is used to optimize the active beamforming matrix $\mathbf{G}$, the passive beamforming matrix $\boldsymbol{\Theta}$, and the UAV flight trajectory $\mathbf{Q}$, respectively, in [29].

  • The Twin-TD3 algorithm is used to optimize the active beamforming matrix $\mathbf{G}$, the passive beamforming matrix $\boldsymbol{\Theta}$, and the UAV flight trajectory $\mathbf{Q}$, respectively, in [30].

Figures 3, 4 show the convergence performance and secure capacity of the proposed algorithm compared to the baseline methods. Figure 3 presents a comprehensive ablation study, where TAMTD3 denotes TTD3 augmented with the attention mechanism. Compared to the baselines (TDDPG and TTD3), TAMTD3 significantly boosts the reward by effectively extracting critical state features. Building on this, the integration of the ARR mechanism further mitigates Q-value overestimation, which smooths the training curves and ensures stability in our full TAMRRTD3 method. TAMRRTD3 demonstrates superior convergence characteristics, achieving a higher reward and faster convergence than both TDDPG and TTD3. Similarly, Figure 4 shows that TAMRRTD3 attains the highest secure capacity among all evaluated algorithms.

FIGURE 3

FIGURE 4

The significant advantages of TAMRRTD3 in terms of convergence speed, final reward, secure capacity, and stability can be attributed to its architectural enhancements. Traditional TD3 critics process all state features with static weights, making them susceptible to noise in high-dimensional environments; our approach instead integrates an additive attention mechanism and ARR. The attention mechanism empowers the critic to dynamically re-weight input features, prioritizing critical state variables while suppressing irrelevant noise. This selective focus reduces the variance of Q-value estimation and mitigates overestimation bias, leading to more stable policy updates. Together, these mechanisms account for the efficiency and reliability of TAMRRTD3 in complex decision-making scenarios.

Figures 5a,b compare the average sum secrecy rate (SSR) and the SEE of the TAMRRTD3 and TTD3 algorithms with and without an energy penalty (EP). As depicted in Figure 5a, incorporating EP effectively improves the average SSR: by constraining the energy usage of the UAV, EP lowers energy consumption while helping maintain a high average SSR. TAMRRTD3 (EP) achieves the highest average SSR, and TAMRRTD3 performs similarly to TAMRRTD3 (EP), indicating that the TAMRRTD3 algorithm can effectively balance performance and energy consumption in complex environments. As shown in Figure 5b, TAMRRTD3 (EP) achieves a higher average SEE throughout the training process, and TAMRRTD3 significantly outperforms the TTD3 and TTD3 (EP) algorithms. By incorporating EP, the TAMRRTD3 (EP) algorithm improves the average SEE compared with TAMRRTD3.

FIGURE 5

Figures 6, 7 demonstrate the convergence performance and secure capacity of the different algorithms when the devices follow a Gaussian distribution (GD) [38]. As shown in Figure 6, by fusing the additive attention mechanism with ARR, the GD-TAMRRTD3 algorithm exhibits better stability, higher reward, and faster convergence than the benchmark algorithms, which verifies its stable performance in complex environments. As shown in Figure 7, GD-TTD3 fluctuates more at the beginning but becomes steady as the training episodes increase. The GD-TDDPG algorithm directly optimizes the policy function, but its secure capacity grows most slowly. The GD-TAMRRTD3 algorithm achieves the highest secure capacity among all evaluated algorithms.

FIGURE 6

FIGURE 7

Figure 8 compares the impacts of the different algorithms on the average SSR and SEE with and without EP under GD. Regardless of whether EP is introduced, the TAMRRTD3 algorithm maintains a high average SSR under GD. Different from Figure 5a, the average SSR of GD-TAMRRTD3 exceeds that of its EP counterpart. In addition, with EP, the average SSR of GD-TTD3 (EP) is higher than that of GD-TTD3. Considering the results in Figures 5a, 8a together, the TAMRRTD3 algorithm shows remarkable performance compared to the benchmark algorithms across different scenarios. As shown in Figure 8b, all algorithms achieve a higher average SEE than their counterparts in Figure 5b. In addition, TAMRRTD3 and TAMRRTD3 (EP) reach a similarly high average SEE after 150 episodes. With EP, the algorithms exhibit more stable SEE, indicating that the EP term can effectively constrain the energy usage of the UAV and reduce energy consumption while increasing the average SEE.

FIGURE 8

6 Conclusion

We address secure communication in RIS-aided UAV networks for smart grids, focusing on maximizing the secrecy energy efficiency (SEE) under imperfect CSI and worst-case channel conditions. We propose TAMRRTD3, a novel algorithm combining an additive attention mechanism with an approximate regret-based reward. Utilizing a twin-agent architecture, the algorithm simultaneously optimizes the UAV/RIS beamforming and the UAV trajectory. The attention mechanism facilitates dynamic feature extraction in high-dimensional state spaces, while the regret-based approach ensures stable Q-value estimation. Extensive simulations show that TAMRRTD3 outperforms the baseline TDDPG and TTD3 algorithms in terms of convergence speed, secure capacity, and SEE, and maintains superior performance when device locations follow a Gaussian distribution.

Statements

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

JW: Writing – original draft, Software, Resources, Writing – review and editing, Methodology, Investigation, Data curation, Conceptualization, Project administration, Validation. XH: Validation, Conceptualization, Supervision, Writing – original draft. JM: Validation, Writing – original draft, Supervision. YL: Formal Analysis, Writing – original draft, Funding acquisition, Validation.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This research was funded by the State Grid Shanxi Electric Power Company’s Science and Technology Projects (No. 52051C230102). The funder had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Acknowledgments

The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflict of interest

Authors JW, XH, JM, and YL were employed by the Information and Communication Branch, State Grid Shanxi Electric Power Company Limited.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Glossary

  • AoA

    Angle of arrival

  • AoD

    Angle of departure

  • ARR

    Approximate regret reward

  • BCD

    Block coordinate descent

  • BSs

    Base stations

  • CSI

    Channel state information

  • DDPG

    Deep Deterministic Policy Gradient

  • DDQN

    Double Deep Q-network

  • DRL

    Deep reinforcement learning

  • Dueling DQN

    Dueling Deep Q-network

  • EP

    Energy penalty

  • GD

    Gaussian distribution

  • GD-TAMRRTD3

    Gaussian Distribution-Twin Attention Mechanism with Approximate Regret Reward TD3

  • IoT

    Internet of Things

  • IRS

    Intelligent reflecting surface

  • LoS

    Line-of-sight

  • MATD3

    Multi-Agent Twin Delayed Deep Deterministic Policy Gradient

  • MDP

    Markov decision process

  • MEC

    Multi-access edge computing

  • MHA

    Multi-head attention

  • NLoS

    Non-line-of-sight

  • NOMA

    Non-orthogonal multiple access

  • PDS

    Post-decision state

  • PER

    Prioritized experience replay

  • PLS

    Physical layer security

  • PPO

    Proximal Policy Optimization

  • RIS

    Reconfigurable intelligent surface

  • RL

    Reinforcement learning

  • SCA

    Successive convex approximation

  • SEE

    Secure energy efficiency

  • SG

    Smart grid

  • SNR

    Signal-to-noise ratio

  • SSR

    Secrecy sum rate

  • TAMRRTD3

    Twin Attention Mechanism with Approximate Regret Reward TD3

  • UAVs

    Unmanned aerial vehicles

  • ULA

    Uniform linear array

  • UPA

    Uniform planar array

References

  • 1. Chou SF, Pang AC, Yu YJ. Energy-aware 3D unmanned aerial vehicle deployment for network throughput optimization. IEEE Trans Vehicular Techn (2025) 69(1):56378. 10.1109/TWC.2019.2946822

  • 2. Chu NH, Hoang DT, Nguyen DN, Van Huynh N, Dutkiewicz E. Joint speed control and energy replenishment optimization for UAV-assisted IoT data collection with deep reinforcement transfer learning. IEEE Internet Things J (2023) 10(7):577893. 10.1109/JIOT.2022.3151201

  • 3. Li X, Li Q, Kong D, Zhang X, Wang X. Learning based trajectory design for low-latency communication in UAV-enabled smart grid networks. In: 2020 IEEE 92nd Vehicular Technology Conference (VTC2020-Fall) (2020). p. 15. 10.1109/VTC2020-Fall49728.2020.9348839

  • 4. Guo H, Zhou X, Liu J, Zhang Y. Vehicular intelligence in 6G: networking, communications, and computing. Vehicular Commun (2022) 33:100399. 10.1016/j.vehcom.2021.100399

  • 5. Yang X, Jia H, Zhou F, Wu S, Jiang C, Kuang L. User association and trajectory optimization for UAV-assisted communication in urban environments. In: 2024 IEEE/CIC International Conference on Communications in China (ICCC) (2024). p. 150712. 10.1109/ICCC62479.2024.10681950

  • 6. Nguyen TT, Tran MH, Le TTH, Tran XN. Joint resource and trajectory optimization for secure UAV-based relay NOMA system. Vehicular Commun (2023) 43:100650. 10.1016/j.vehcom.2023.100650

  • 7. Rafi RM, Sudha V, Reshma P. Ergodic capacity analysis of distributed RIS wireless communication system: how many RIS elements are required to beat direct LoS path? Sādhanā (2024) 49(4):268. 10.1007/s12046-024-02605-w

  • 8. Jangale P. Reconfigurable intelligent surfaces for RF signal enhancement in 5G and 6G wireless networks. Int J Scientific Res Eng Manag (2024) 8(12):15. 10.55041/IJSREM17080

  • 9. Zhang G, Wu Q, Cui M, Zhang R. Securing UAV communications via trajectory optimization. In: Globecom 2017 (2017). p. 16. 10.1109/GLOCOM.2017.8254971

  • 10. Zhang G, Wu Q, Cui M, Zhang R. Securing UAV communications via joint trajectory and power control. IEEE Trans Wireless Commun (2019) 18(2):137689. 10.1109/TWC.2019.2892461

  • 11. Cai Y, Wei Z, Li R, Kwan Ng DW, Yuan J. Energy-efficient resource allocation for secure UAV communication systems. In: 2019 IEEE Wireless Communications and Networking Conference (WCNC) (2019). p. 18. 10.1109/WCNC.2019.8885416

  • 12. Duo B, Luo J, Li Y, Hu H, Wang Z. Joint trajectory and power optimization for securing UAV communications against active eavesdropping. China Commun (2021) 18(1):8899. 10.23919/JCC.2021.01.008

  • 13. Li A, Wu Q, Zhang R. UAV-enabled cooperative jamming for improving secrecy of ground wiretap channel. IEEE Wireless Commun Lett (2018) 8(1):1814. 10.1109/LWC.2018.2865774

  • 14. Liu Z, Zhu B, Xie Y, Ma K, Guan X. UAV-aided secure communication with imperfect eavesdropper location: robust design for jamming power and trajectory. IEEE Trans Vehicular Techn (2023) 73(5):727686. 10.1109/TVT.2023.3347769

  • 15. Yang H, Lin K, Wang C, Xia W. Cooperative jamming and trajectory optimization for UAV-enabled reliable and secure communications. In: 2024 IEEE 100th Vehicular Technology Conference (VTC2024-Fall) (2024). p. 15. 10.1109/VTC2024-Fall63153.2024.10757714

  • 16. Li S, Duo B, Di Renzo M, Tao M, Yuan X. Robust secure UAV communications with the aid of reconfigurable intelligent surfaces. IEEE Trans Wireless Commun (2021) 20(10):640217. 10.1109/TWC.2021.3073746

  • 17. Ge Y, Fan J, Zhang J. Active reconfigurable intelligent surface enhanced secure and energy-efficient communication of jittering UAV. IEEE Internet Things J (2023) 10(24):22386400. 10.1109/JIOT.2023.3304004

  • 18. Guo YJX, Li Y. Optimization of aerial UAV assisted secure wireless communication system based on intelligent reflecting surface. In: 2024 IEEE 100th Vehicular Technology Conference (VTC2024-Fall) (2024). p. 15. 10.1109/VTC2024-Fall63153.2024.10757882

  • 19. Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, et al. Playing Atari with deep reinforcement learning (2013). arXiv preprint arXiv:1312.5602.

  • 20. Zhang Q, Saad W, Bennis M. Reflections in the sky: millimeter wave communication with UAV-carried intelligent reflectors. In: 2019 IEEE Global Communications Conference (GLOBECOM) (2019). p. 16. 10.1109/GLOBECOM38437.2019.9013626

  • 21. Yin S, Zhao S, Zhao Y, Yu FR. Intelligent trajectory design in UAV-aided communications with reinforcement learning. IEEE Trans Vehicular Techn (2019) 68(8):822731. 10.1109/TVT.2019.2923214

  • 22. Liu CH, Ma X, Gao X, Tang J. Distributed energy-efficient multi-UAV navigation for long-term communication coverage by deep reinforcement learning. IEEE Trans Mobile Comput (2019) 19(6):127485. 10.1109/TMC.2019.2908171

  • 23. Wang Y, Deng Y, Kang L, Jiang F, Jiang F. Reinforcement learning-based energy efficiency optimization for RIS-assisted UAV hybrid uplink and downlink system. Computer Networks (2024) 245:110390. 10.1016/j.comnet.2024.110390

  • 24. Tang R, Wang J, Jiang F, Zhang X, Du J. Joint 3D trajectory and phase shift optimization via deep reinforcement learning for RIS-assisted UAV communication systems. Phys Commun (2024) 66:102456. 10.1016/j.phycom.2024.102456

  • 25. Yang H, Xiong Z, Zhao J, Niyato D, Xiao L, Wu Q. Deep reinforcement learning-based intelligent reflecting surface for secure wireless communications. IEEE Trans Wireless Commun (2021) 20(1):37588. 10.1109/TWC.2020.3024860

  • 26. Bu S, Ma P, Yu D, Luan C, Su H. Secure transmission in RIS-UAV assisted communications based on deep reinforcement learning. In: 2024 12th International Conference on Information Systems and Computing Technology (ISCTech) (2024). p. 16. 10.1109/ISCTech63666.2024.10845550

  • 27. Zheng H, Zhao S, Huang G, Tang D. RIS-assisted UAV NOMA secure communication based on deep reinforcement learning. Phys Commun (2025) 72:102713. 10.1016/j.phycom.2025.102713

  • 28. Gao Y, Wang Z, Zhang Y, Lu W, Tang J, Zhao N, et al. Multi-IRS-aided secure communication in UAV-MEC networks. IEEE Trans Vehicular Techn (2025) 74(5):732738. 10.1109/TVT.2025.3527586

  • 29. Guo X, Chen Y, Wang Y. Learning-based robust and secure transmission for reconfigurable intelligent surface aided millimeter wave UAV communications. IEEE Wireless Commun Lett (2021) 10(8):17959. 10.1109/LWC.2021.3081464

  • 30. Tham ML, Wong YJ, Iqbal A, Ramli NB, Zhu Y, Dagiuklas T. Deep reinforcement learning for secrecy energy-efficient UAV communication with reconfigurable intelligent surface. In: 2023 IEEE Wireless Communications and Networking Conference (WCNC) (2023). p. 16. 10.1109/WCNC55385.2023.10118891

  • 31. Zhang W, Zhao R, Xu Y. Aerial reconfigurable intelligent surface-assisted secrecy energy-efficient communication based on deep reinforcement learning. In: 2024 12th International Conference on Intelligent Computing and Wireless Optical Communications (ICWOC) (2024). p. 605. 10.1109/ICWOC62055.2024.10684922

  • 32. Xu K, Long K, Lu Y, Zhang H. Joint secure transmission and trajectory optimization for reconfigurable intelligent surface-aided non-terrestrial networks. J Electron and Inf Techn (2025) 47(2):296304. 10.11999/JEIT240981

  • 33. Summaq A, Kumar MP, Chinnadurai S. Synergistic beamforming in 6G: dual-agent learning for secure high-power transmission in PIRS-empowered wireless systems. In: 2025 17th International Conference on COMmunication Systems and NETworks (COMSNETS) (2025). p. 11427. 10.1109/COMSNETS63942.2025.10885699

  • 34. Sun Y, Yang B. A priority experience replay actor-critic algorithm using self-attention mechanism for strategy optimization of discrete problems. PeerJ Comp Sci (2024) 10:e2161. 10.7717/peerj-cs.2161

  • 35. Chen J, Jiang Y, Pan H, Yang M. Path planning in complex environments using attention-based deep deterministic policy gradient. Electronics (2024) 13(18):3746. 10.3390/electronics13183746

  • 36. Wei Z, Cai Y, Sun Z, Ng DWK, Yuan J, Zhou M, et al. Sum-rate maximization for IRS-assisted UAV OFDMA communication systems. IEEE Trans Wireless Commun (2021) 20(4):253050. 10.1109/TWC.2020.3042977

  • 37. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. ICLR (2015). 10.48550/arXiv.1409.0473

  • 38. Uğurel E, Huang S, Chen C. Learning to generate synthetic human mobility data: a physics-regularized Gaussian process approach based on multiple kernel learning. Transportation Res B: Methodological (2024) 189:103064. 10.1016/j.trb.2024.103064

  • 39. Zhou Q, Wang Y. Design of anti-interference path planning for cellular-connected UAVs based on improved DDPG. In: Proceedings of the 2024 IEEE 10th International Conference on High Performance and Smart Computing (2024). p. 716. 10.1109/HPSC62738.2024.00020

  • 40. Zhou Q, Wang Y, Shen R, Nakazato J, Tsukada M, Guan Z. Cellular connected UAV anti-interference path planning based on PDS-DDPG and TOPEM. IEEE J Miniatur Air Space Syst (2025) 6(1):218. 10.1109/JMASS.2024.3490762

  • 41. Zhou Q, Mao W, Nakazato J, Ji Y, Tsukada M. Uncertainty-aware multi-agent reinforcement learning for anti-interference trajectory planning of cellular-connected UAVs. IEEE Trans Veh Technol (2026) 75(2):286480. 10.1109/TVT.2025.3606201

  • 42. Liu Y, Zhou Q, Mao W, Li X, Huangfu W, Tsukada M, et al. Multi-modal trajectory planning for emergency-oriented air-ground collaborative sensing and communication. IEEE Trans Cogn Commun Netw (2025) 11(5):3094111. 10.1109/TCCN.2025.3585254


Keywords

deep reinforcement learning, energy efficiency, reconfigurable intelligent surface, secure communication, smart grid, unmanned aerial vehicle

Citation

Wu J, Hao X, Ma J and Li Y (2026) Joint security and energy optimization in UAV-enabled smart grid networks. Front. Phys. 14:1817865. doi: 10.3389/fphy.2026.1817865

Received

26 February 2026

Revised

24 March 2026

Accepted

27 March 2026

Published

07 May 2026

Volume

14 - 2026

Edited by

Junfeng Miao, University of Science and Technology Beijing, China

Reviewed by

Chaowei Wang, Beijing University of Posts and Telecommunications (BUPT), China

Quanxi Zhou, The University of Tokyo, Japan

Guangyu Liao, Harbin Engineering University, China


Copyright

*Correspondence: Jian Wu,

