
ORIGINAL RESEARCH article

Front. Commun. Netw., 26 January 2026

Sec. IoT and Sensor Networks

Volume 6 - 2025 | https://doi.org/10.3389/frcmn.2025.1764320

This article is part of the Research Topic: Non-Terrestrial IoT Networks: Architectures, Applications, and Future Challenges.

DDPG-based energy efficiency optimization for ABS-assisted beyond-5G cellular networks with sleep mode management

  • 1Department of Electrical Engineering, Shiraz University of Technology, Shiraz, Iran
  • 2Electrical and Computer Engineering Department, University of Alberta, Edmonton, AB, Canada

Introduction: Energy efficiency is a critical challenge in Beyond-5G (B5G) cellular networks, where ground base stations (GBSs) are responsible for a substantial portion of network energy consumption. Reducing this consumption while maintaining minimum user data rate requirements remains a key research problem.

Methods: This paper proposes an Aerial Base Station (ABS)-assisted energy optimization framework that integrates ABS deployment with low-power sleep states of GBSs. Traffic is selectively offloaded from lightly loaded GBSs to ABSs, enabling energy savings without violating user quality-of-service constraints. A Deep Deterministic Policy Gradient (DDPG) algorithm is employed to jointly optimize ABS positioning, GBS sleep mode scheduling, and resource allocation under dynamic traffic conditions.

Results: Simulation results demonstrate that the proposed DDPG-based framework significantly reduces network energy consumption while improving achievable user data rates compared to baseline schemes without ABS assistance or learning-based optimization.

Discussion: The results highlight the effectiveness of integrating ABSs with GBS low-power sleep states using reinforcement learning. By enforcing minimum data rate constraints and dynamically adapting to traffic variations, the proposed approach provides a scalable and energy-efficient solution for sustainable operation.

1 Introduction

The evolution of Beyond 5G (B5G) technology builds upon the worldwide deployment of fifth-generation (5G) networks, aiming to meet growing connectivity demands through more advanced standards and innovative communication solutions. B5G is expected to deliver enhanced mobility, superior reliability, ultra-high data rates, intelligent network management, and improved energy efficiency. With increasing dependency on high-speed and ubiquitous communication, B5G networks are anticipated to progress toward greater resilience and maturity. To enable this transition, key technologies such as Artificial Intelligence (AI), Edge Computing, Reconfigurable Intelligent Surfaces (RIS), Terahertz (THz) communication, Quantum Computing, and Unmanned Aerial Vehicles (UAVs) are being actively explored (Sufyan et al., 2023; Puspitasari et al., 2023; Dogra et al., 2020; Alsharif et al., 2018).

The potential of UAVs in cellular and wireless networks has gained substantial attention in recent literature. UAV-assisted communication systems have shown promise in extending network coverage beyond the limitations of terrestrial infrastructure, enhancing link reliability, and offering flexible, resilient, and sustainable connectivity in diverse deployment scenarios (Gryech et al., 2024). A prominent outcome of this integration is the development of Aerial Access Networks (AANs), where UAVs are utilized to provide communication services from the air. Unlike Terrestrial Access Networks (TANs), AANs overcome geographic and infrastructural constraints, offering wide-area coverage, improved communication quality, and high mobility support—particularly in remote or hard-to-reach environments where TANs may be infeasible (Behjati et al., 2025). In such systems, UAVs often operate as Aerial Base Stations (ABSs) or airborne relays, enabling direct access to users and thereby improving service availability and network adaptability.

Significant research efforts have been directed toward the integration of UAVs into B5G networks, addressing challenges such as trajectory optimization, user association mechanisms, transmission power management, and the cooperative deployment of UAVs with Intelligent Reflecting Surfaces (IRS) for improved signal propagation (Banafaa et al., 2024; Qazzaz et al., 2024; Shahzadi et al., 2021; Geraci et al., 2022; Gu and Zhang, 2023; Sarkar and Gul, 2023; Jangsher et al., 2022). These advancements improve performance, reliability, and energy efficiency in future wireless communication systems (Amponis et al., 2022).

A diverse set of techniques has been proposed to effectively reduce energy consumption in drone networks (Abubakar et al., 2023). These approaches encompass resource management (Masroor et al., 2021; Basharat et al., 2022), flight and transmission scheduling (Wu et al., 2022), path planning (Azadur et al., 2024), and optimal placement and trajectory design (Elnabty et al., 2022; Won et al., 2023; Azarhava et al., 2024; Tung et al., 2022).

Beyond UAV-assisted solutions, a key energy challenge in B5G networks is the high power consumption of Ground Base Stations (GBSs). To mitigate this issue, researchers are actively developing optimization strategies to improve GBS energy efficiency. One widely adopted approach, applicable to both terrestrial networks and ABS-assisted frameworks, is the GBS sleep strategy, which dynamically deactivates underutilized GBSs with a small number of associated users while ensuring that user data rate requirements are satisfied.

GBS sleep strategies are designed to identify optimal opportunities for base stations to enter sleep mode without compromising network coverage or service quality (López-Pérez et al., 2022). These strategies can be broadly classified into binary on/off schemes and multi-level sleep modes, each offering distinct mechanisms for reducing energy consumption while maintaining network performance.

The binary scheme conserves energy by deactivating underutilized GBSs; however, this approach may negatively impact data transmission rates (Kim et al., 2015; Kooshki et al., 2023). To optimize energy efficiency in ultra-dense networks, researchers in (Amine et al., 2022) and (Ju et al., 2022) proposed reinforcement learning (RL)-based cell switching algorithms for managing small cells. Specifically, the work in (Ju et al., 2022) introduces a Decision Selection Network (DSN) to streamline the action space within a Deep Reinforcement Learning (DRL) framework, demonstrating effective management of active and sleep modes while maintaining essential data rate requirements.

In contrast, multi-level sleep modes leverage mobile traffic prediction to dynamically transition idle small cells into different sleep states, further optimizing energy efficiency while ensuring network performance (Kim et al., 2023).

An often-overlooked application of drones is their use as Aerial Base Stations (ABSs) to facilitate GBS sleep strategies. In Chowdary et al. (2021), the authors propose a resource allocation algorithm that leverages active GBSs to serve users in areas where some GBSs have entered sleep mode. While this approach ensures user connectivity in sleeping areas, it introduces challenges such as service instability for users in non-sleeping areas and increased algorithmic complexity due to resource reallocation after GBS deactivation. Moreover, this work does not explore the potential of drones operating explicitly as ABSs to enable GBS sleep modes.

Motivated by these limitations, we propose an ABS-assisted GBS sleep strategy that selectively allows lightly loaded GBSs to enter sleep mode during periods of reduced traffic demand. This approach minimizes overall network energy consumption while ensuring that each ABS satisfies the minimum data rate required to maintain Quality of Service (QoS) during GBS downtime. Our analysis demonstrates that effective ABS deployment not only enhances overall network transmission rates but also significantly reduces GBS power consumption.

To further amplify the energy-saving benefits, we introduce a joint optimization framework that integrates GBS sleep scheduling, resource allocation, and ABS position optimization to minimize network-wide energy consumption. The resulting decision-making process forms a complex binary integer programming problem, motivating the need for efficient learning-based optimization techniques.

To address this challenge, we propose a hybrid Deep Reinforcement Learning (DRL) framework that combines Deep Deterministic Policy Gradient (DDPG) and Deep Double Q-Learning (DDQL) algorithms. The rationale for this integration is that the considered optimization includes both continuous variables (e.g., ABS horizontal positioning and transmit power allocation) and discrete decisions (e.g., GBS sleep mode control and association selection). DDPG is well-suited to continuous control, whereas DDQL is effective in discrete action spaces and reduces Q-value overestimation. By combining them, the proposed framework can efficiently handle the mixed discrete–continuous action space in a unified learning process.

Unless otherwise stated, users are assumed quasi-static during each optimization interval, i.e., user locations remain fixed while decisions on association, sleep scheduling, and resource allocation are optimized.

The remainder of this paper is organized as follows: Section 2 introduces the system model, including the sleep model, terrestrial and aerial channel models, and the power consumption model, followed by the optimization problem formulation. Section 3 presents the theoretical preliminaries of the DRL algorithms and details the proposed hybrid framework. Section 4 discusses the simulation results and provides an in-depth analysis. Finally, Section 5 concludes the paper with key findings.

2 System model

This section provides a structured overview of the key components considered in this study, including the network architecture, user/ABS location assumptions, channel models, energy consumption model, and problem formulation.

As illustrated in Figure 1, this study focuses on an ABS-assisted downlink wireless network. The analysis is conducted within an $L_x \times L_y$ km area, where $K$ users are served by $M$ strategically positioned Ground Base Stations (GBSs). User distribution follows a Poisson point process, modeling realistic spatial randomness. In line with many related works and to focus on network-level optimization, users are assumed static during each optimization interval (i.e., user locations do not change during a decision epoch).


Figure 1. A wireless network consisting of multiple Ground Base Stations (GBSs) and Aerial Base Stations (ABSs), where the GBSs are equipped with sleep mode capability to optimize energy efficiency.

Additionally, the network includes $U$ Aerial Base Stations (ABSs), each hovering at an altitude $h_u$, where $1 \leq u \leq U$. The x-y coordinates of the ABSs will be optimized as part of the network design to enhance coverage, ensure seamless connectivity, and improve overall network performance.

The network employs Orthogonal Frequency Division Multiplexing (OFDM) to serve users across $L$ subcarriers, where $L \geq K$. Each GBS $m$ transmits a signal with power $p_{m,l}$ on the $l$th subcarrier, subject to the total power constraint (Equation 1):

$\sum_{l=1}^{L} p_{m,l} \leq P_m, \qquad (1)$

where $P_m$ represents the maximum transmit power of GBS $m$. Similarly, the $u$th ABS transmits with power $p_{u,l}$ on the $l$th subcarrier, constrained by (Equation 2):

$\sum_{l=1}^{L} p_{u,l} \leq P_u, \qquad (2)$

where $P_u$ is the maximum transmit power of the ABS.

To model user association, we define a binary variable $\beta_{m,k,l} \in \{0,1\}$ that indicates whether the $k$th user is associated with GBS $m$ on subcarrier $l$. Specifically, $\beta_{m,k,l} = 1$ implies that $p_{m,l} > 0$ and that the $k$th user is receiving a signal from GBS $m$. The set of users served by GBS $m$ is denoted as (Equation 3):

$\mathcal{D}_m = \{1, 2, \ldots, N_m\}, \qquad (3)$

where the total number of users in the network satisfies (Equation 4):

$\sum_{m=1}^{M} N_m = K. \qquad (4)$

To represent the operational status of each GBS, we introduce a binary sleep indicator $\alpha_m$, defined as (Equation 5):

$\alpha_m = \begin{cases} 1, & \text{if GBS } m \text{ is active}, \\ 0, & \text{if GBS } m \text{ is in sleep mode}. \end{cases} \qquad (5)$

The $u$th UAV's location is denoted by (Equation 6):

$\bar{d}_u = \left(\bar{x}_u, \bar{y}_u, \bar{z}_u\right), \qquad (6)$

while each GBS is positioned at the center of its respective cell. The distance between the $u$th UAV and the $k$th user at time slot $n$ is given by (Equation 7):

$d_{u,k}[n] = \sqrt{\bar{z}_u^2 + \left\| w_u[n] - w_k \right\|^2}, \qquad (7)$

where $w_u[n] = \left[\bar{x}_u[n], \bar{y}_u[n]\right]$ and $w_k = \left[x_k, y_k\right]$ denote the horizontal locations of the $u$th UAV and the $k$th user, respectively. Furthermore, $\bar{z}_u$ represents the altitude of the UAV, which, without loss of generality, is assumed to be constant in this work.

2.1 Air-to-ground (A2G) channel model

The Line-of-Sight (LoS) channel model is commonly employed in UAV-assisted networks to facilitate communication between Aerial Base Stations (ABSs) and Cellular Users (CUs) (Kim et al., 2015; Khawaja et al., 2019). The expected channel power gain from the $k$th user to the UAV on the $l$th subchannel can be expressed as (Equation 8):

$G_{u,k,l}[n] = \left( P_{\mathrm{LoS}}^{u,k}[n] + \frac{1 - P_{\mathrm{LoS}}^{u,k}[n]}{\hat{\kappa}} \right) h_0 \, d_{u,k}[n]^{-\hat{\alpha}}, \qquad (8)$

where $h_0 = \left(\frac{\lambda}{4\pi}\right)^2$ represents the channel power gain under LoS conditions at a reference distance of 1 m, $\lambda$ denotes the carrier wavelength, $\hat{\alpha}$ is the path-loss exponent constrained within $2 < \hat{\alpha} < 6$, and $\hat{\kappa} > 1$ is the additional attenuation factor due to Non-Line-of-Sight (NLoS) propagation.

Furthermore, the probability of establishing an LoS link, $P_{\mathrm{LoS}}^{u,k}[n]$, is modeled as (Equation 9):

$P_{\mathrm{LoS}}^{u,k}[n] = \frac{1}{1 + \alpha \exp\left(-\beta \left(\theta_{u,k}[n] - \alpha\right)\right)}, \qquad (9)$

where $\alpha$ and $\beta$ are environment-dependent parameters, and (Equation 10)

$\theta_{u,k}[n] = \sin^{-1}\left(\frac{\bar{z}_u}{d_{u,k}[n]}\right) \qquad (10)$

denotes the elevation angle between the $u$th UAV and the $k$th user at time slot $n$.
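To make the A2G model concrete, the following minimal NumPy sketch evaluates Equations 8-10 for one UAV-user pair. The parameter values (carrier frequency, environment constants $\alpha$ and $\beta$, $\hat{\kappa}$, $\hat{\alpha}$) are illustrative assumptions rather than the simulation settings of this paper, and the elevation angle is taken in degrees, as is common for this class of environment constants.

```python
import numpy as np

C = 3e8  # speed of light (m/s)

def a2g_gain(uav_xy, uav_alt, user_xy, fc=2e9,
             a=9.61, b=0.16, kappa_hat=20.0, alpha_hat=2.3):
    """Expected A2G channel power gain, Equations 8-10 (illustrative values)."""
    lam = C / fc
    h0 = (lam / (4 * np.pi)) ** 2                    # LoS gain at 1 m reference
    diff = np.asarray(uav_xy, float) - np.asarray(user_xy, float)
    d = np.sqrt(uav_alt ** 2 + np.sum(diff ** 2))    # 3D distance, Equation 7
    theta = np.degrees(np.arcsin(uav_alt / d))       # elevation angle, Equation 10
    p_los = 1.0 / (1.0 + a * np.exp(-b * (theta - a)))   # Equation 9
    # Equation 8: LoS term plus NLoS term attenuated by kappa_hat > 1
    return (p_los + (1.0 - p_los) / kappa_hat) * h0 * d ** (-alpha_hat)

print(a2g_gain(uav_xy=(50.0, 80.0), uav_alt=100.0, user_xy=(0.0, 0.0)))
```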

2.2 Ground-to-ground (G2G) channel model

For the terrestrial part of the network, which refers to the channel between Ground Base Stations (GBSs) and users, we adopt a fading channel model. The small-scale fading coefficient, $\hat{H}_{m,k,l} \sim \mathcal{CN}(0,1)$, follows a complex Gaussian distribution, while the large-scale fading coefficient, $g_{m,k}$, is modeled using the Hata-COST231 model (Singh, 2012). The frequency-domain downlink channel gain $G_{m,k,l}$ between the $m$th GBS and the $k$th Cellular User (CU) on the $l$th subchannel consists of small-scale fading, path loss, and shadowing components. The path loss is given by (Equation 11):

$PL_{m,k}^{*} = \begin{cases} L + 35\log_{10} d_{m,k}, & d_{m,k} > d_1, \\ L + 15\log_{10} d_0 + 20\log_{10} d_{m,k}, & d_0 < d_{m,k} \leq d_1, \\ L + 15\log_{10} d_1 + 20\log_{10} d_0, & d_{m,k} \leq d_0, \end{cases} \qquad (11)$

where $d_{m,k}$ is the distance between the $k$th user and the $m$th GBS. The parameter $L$ is obtained as (Equation 12):

$L = 46.3 + 33.9\log_{10} f - 13.82\log_{10} h_{\mathrm{GBS}} - \left[\left(1.1\log_{10} f - 0.7\right) h_{\mathrm{user}} - \left(1.56\log_{10} f - 0.8\right)\right], \qquad (12)$

where $h_{\mathrm{GBS}}$ and $h_{\mathrm{user}}$ denote the heights of the GBSs and users, respectively, and $f$ represents the carrier frequency.

Adding shadowing to the path loss, we have (Equation 13):

$PL_{m,k} = PL_{m,k}^{*} \times 10^{\frac{\chi_{m,k}\,\sigma_{\mathrm{th}}}{10}}, \qquad (13)$

where $10^{\chi_{m,k}\sigma_{\mathrm{th}}/10}$ represents the shadow fading. Here, $\chi_{m,k} \sim \mathcal{N}(0,1)$ is a Gaussian random variable modeling the shadowing effect, and $\sigma_{\mathrm{th}}$ is the standard deviation that determines the scale of the shadowing effect.

Hence, the channel gain is expressed as (Equation 14):

$G_{m,k,l} = PL_{m,k} \times \hat{H}_{m,k,l}. \qquad (14)$
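As a sanity check on the G2G model, the sketch below draws one channel realization from Equations 11-14. The breakpoint distances, antenna heights, and shadowing deviation are illustrative assumptions, and the dB path loss is mapped to a linear amplitude gain before multiplying the $\mathcal{CN}(0,1)$ small-scale coefficient, which is one reasonable reading of Equation 14.

```python
import numpy as np

rng = np.random.default_rng(0)

def g2g_gain(d_mk, fc_mhz=2000.0, h_gbs=30.0, h_user=1.5,
             d0=10.0, d1=50.0, sigma_th=8.0):
    """One complex G2G channel realization, Equations 11-14 (values illustrative)."""
    # Equation 12: COST231-Hata constant (f in MHz, heights in meters)
    L = (46.3 + 33.9 * np.log10(fc_mhz) - 13.82 * np.log10(h_gbs)
         - ((1.1 * np.log10(fc_mhz) - 0.7) * h_user
            - (1.56 * np.log10(fc_mhz) - 0.8)))
    # Equation 11: three-segment distance dependence
    if d_mk > d1:
        pl_db = L + 35 * np.log10(d_mk)
    elif d_mk > d0:
        pl_db = L + 15 * np.log10(d0) + 20 * np.log10(d_mk)
    else:
        pl_db = L + 15 * np.log10(d1) + 20 * np.log10(d0)
    pl_db += sigma_th * rng.standard_normal()            # shadowing, Equation 13
    h_small = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
    return 10 ** (-pl_db / 20) * h_small                 # Equation 14, amplitude form

print(abs(g2g_gain(120.0)) ** 2)  # channel power gain for a 120 m link
```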

The Signal-to-Interference-plus-Noise Ratio (SINR) of the $k$th user served by the $m$th GBS is given by (Equation 15):

$\gamma_{m,k,l} = \frac{p_{m,l}\,|G_{m,k,l}|^2}{\sum_{\tilde{m}=1,\,\tilde{m}\neq m}^{M} p_{\tilde{m},l}\,|G_{\tilde{m},k,l}|^2 + \sigma_{m,l}^2}. \qquad (15)$

Similarly, for the $k$th user served by the $u$th ABS, the SINR is (Equation 16):

$\gamma_{u,k,l} = \frac{p_{u,l}\,|G_{u,k,l}|^2}{\sum_{\tilde{u}=1,\,\tilde{u}\neq u}^{U} p_{\tilde{u},l}\,|G_{\tilde{u},k,l}|^2 + \sigma_{u,l}^2}. \qquad (16)$

The achievable data rate for user $k$ is then expressed as (Equation 17):

$R_k = \log_2\left(1 + \gamma_{q,k,l}\right), \quad q \in \{m, u\}. \qquad (17)$

If no user is served on a given subcarrier $\hat{l}$, the corresponding transmit power is set to zero, i.e., $p_{q,\hat{l}} = 0$. Recall that the binary sleep indicator $\alpha_m \in \{0,1\}$ of Equation 5 determines whether the $m$th GBS is active or in sleep mode. If a GBS is forced into sleep mode, all users associated with it will be disconnected and must connect to neighboring GBSs. Consequently, the transmit power of a sleeping GBS is forced to zero, i.e., $\alpha_m P_m = 0$.
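The per-user rate computation of Equations 15-17 can be expressed compactly by stacking GBSs and ABSs into a single node index, as in the hypothetical helper below; the stacked-node convention, argument shapes, and noise power are assumptions of this sketch.

```python
import numpy as np

def per_user_rates(p, G, serving, noise=1e-13):
    """Per-user SINR and rate on one subcarrier, Equations 15-17.
    p[j]      : transmit power of node j (GBSs and ABSs stacked together)
    G[j, k]   : complex channel gain from node j to user k
    serving[k]: index of the node serving user k"""
    n_nodes, n_users = G.shape
    rates = np.empty(n_users)
    for k in range(n_users):
        q = serving[k]
        signal = p[q] * np.abs(G[q, k]) ** 2
        interference = sum(p[j] * np.abs(G[j, k]) ** 2
                           for j in range(n_nodes) if j != q)
        rates[k] = np.log2(1.0 + signal / (interference + noise))  # Equation 17
    return rates
```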

2.3 Power consumption model

The total power consumption in UAV-assisted networks comprises two primary components: the power consumption of the ground network and the power consumption of UAVs.

2.3.1 Power consumption of the ground network

The power consumption of the ground network includes the following components:

1. Transmission Power: $P_{m,k,l}^{t}$, the transmit power of the $m$th GBS to the $k$th user on subcarrier $l$.

2. Circuit Power: $P_m^{\mathrm{cte}}$, the power consumed by the circuits and other electronic components of the $m$th GBS, which is considered constant.

3. Mode-dependent Power Consumption: The power consumed by a GBS due to its active or sleep mode operation, including power supply and air conditioning, given by (Equation 18):

$P_m^{\mathrm{mod}} = \alpha_m P_m^{\mathrm{on}} + \left(1 - \alpha_m\right) P_m^{\mathrm{off}}, \qquad (18)$

where $P_m^{\mathrm{on}}$ and $P_m^{\mathrm{off}}$ represent the power consumption of the GBS in active mode and sleep mode, respectively.

4. Mode Transition Power: The power associated with transitioning base station $m$ between operational modes, given by (Equation 19):

$P_m^{\mathrm{tran}} = \left|\alpha_m - \alpha_m^{\mathrm{prev}}\right| P^{\mathrm{tran}}, \qquad (19)$

where $\alpha_m^{\mathrm{prev}}$ denotes the operational state (active or sleep) of the $m$th GBS during the previous time slot, and $P^{\mathrm{tran}}$ represents the energy consumed per mode transition.
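The terms above map directly to a few lines of Python; a minimal sketch follows, with all wattage defaults as illustrative placeholders rather than the paper's simulation values.

```python
def gbs_power(alpha, alpha_prev, p_tx_sum, P_on=130.0, P_off=14.0,
              P_cte=50.0, P_tran=20.0):
    """Per-GBS power terms of Section 2.3.1 (all watt values illustrative).
    alpha, alpha_prev: current / previous on-off state (0 or 1);
    p_tx_sum: total transmit power of this GBS over all subcarriers."""
    p_mode = alpha * P_on + (1 - alpha) * P_off           # Equation 18
    p_trans = abs(alpha - alpha_prev) * P_tran            # Equation 19
    return alpha * p_tx_sum + p_mode + p_trans + P_cte
```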

2.3.2 Power consumption of UAVs

The power consumption of a UAV consists of three main components (Equation 20):

$P_u = P_u^{\mathrm{com}} + P_u^{\mathrm{hov}} + P_u^{\mathrm{cte}}, \qquad (20)$

where $P_u^{\mathrm{com}}$ is the power consumed by the UAV's communication systems, including the radio transmitter, receiver, antennas, and other related components, expressed as (Equation 21):

$P_u^{\mathrm{com}} = \sum_{l=1}^{L} p_{u,l}. \qquad (21)$

$P_u^{\mathrm{cte}}$ is the power consumed by the UAV's internal circuits and other onboard electronic devices, and $P_u^{\mathrm{hov}}$ represents the power consumed when the UAV is hovering or stationary in the air. According to Ghorbel et al. (2019), the hovering power is given by (Equation 22):

$P^{\mathrm{hov}} = \sqrt{\frac{\left(m_{\mathrm{tot}}\, g\right)^3}{2\pi\, r_p^2\, n_p\, \rho}}, \qquad (22)$

where $m_{\mathrm{tot}}$ is the total mass of the UAV, $g$ is the gravitational acceleration, $\rho$ is the air density, and $r_p$ and $n_p$ are the radius and number of propellers, respectively.
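To get a feel for the magnitudes involved, Equation 22 can be evaluated directly; the 2 kg quadrotor parameters below are illustrative assumptions, not the platform used in this paper.

```python
import numpy as np

def hover_power(m_tot=2.0, r_p=0.15, n_p=4, rho=1.225, g=9.81):
    """Hovering power of Equation 22 for an assumed 2 kg quadrotor."""
    return np.sqrt((m_tot * g) ** 3 / (2.0 * np.pi * r_p ** 2 * n_p * rho))

print(f"{hover_power():.1f} W")  # about 104 W for these illustrative values
```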

2.3.3 Total weighted power consumption

In UAV-based networks, the energy consumed by UAVs for hovering and flying is typically much higher than the power required for communication. To address this imbalance, a weighting factor w is introduced to balance the trade-off between hovering/flying power and communication-related power consumption (Zhu et al., 2021). This approach allows for an optimized allocation of energy resources across the network.

The total weighted power consumption of the UAV-assisted network is given by (Equation 23):

$P_{\mathrm{tot}} = w\left[\sum_{m=1}^{M}\sum_{l=1}^{L}\alpha_m P_{m,k,l}^{t} + \sum_{m=1}^{M}\left(P_m^{\mathrm{mod}} + P_m^{\mathrm{tran}} + P_m^{\mathrm{cte}}\right) + \sum_{u=1}^{U}\left(P_u^{\mathrm{com}} + P_u^{\mathrm{cte}}\right)\right] + \left(1 - w\right)\sum_{u=1}^{U} P_u^{\mathrm{hov}}. \qquad (23)$

In the simulations, the weighting factor is set to w=0.5, providing a balanced trade-off between GBS-related power consumption and ABS hovering energy, consistent with Fährmann et al. (2022).

2.4 Energy efficiency

The Energy Efficiency (EE) criterion serves as a critical metric for evaluating the effectiveness of resource allocation within the network, particularly when GBSs are in sleep mode. By computing this metric, we ensure that reducing the power consumption of GBSs does not compromise users' quality of service. The EE criterion is defined as (Equation 24):

$\eta_{EE} = \frac{\tau_K}{P_{\mathrm{tot}}}, \qquad (24)$

where $\tau_K = \sum_{k=1}^{K} R_k$ represents the total achievable data rate of all $K$ users, and $P_{\mathrm{tot}}$ denotes the total power consumption of the network.
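Putting the pieces together, the sketch below computes the weighted total power of Equation 23 and the resulting EE metric of Equation 24, reusing per-GBS totals such as those produced by the gbs_power sketch in Section 2.3.1; the argument layout is an assumption of this sketch.

```python
import numpy as np

def energy_efficiency(rates, p_gbs, p_abs_com, p_abs_cte, p_abs_hov, w=0.5):
    """Weighted total power (Equation 23) and EE metric (Equation 24).
    p_gbs: per-GBS totals (transmit + mode + transition + circuit), e.g.
    from gbs_power(); the remaining arrays are per-ABS power terms."""
    p_tot = (w * (np.sum(p_gbs) + np.sum(p_abs_com) + np.sum(p_abs_cte))
             + (1.0 - w) * np.sum(p_abs_hov))          # Equation 23, w = 0.5
    return np.sum(rates) / p_tot                       # Equation 24
```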

2.5 Optimization problem formulation

The optimization problem for maximizing the energy efficiency of the cellular network is formulated as (Equation 25):

$\max_{\alpha_m,\ \beta_{k,m,l},\ \beta_{k,u,l},\ p_{m,l},\ p_{u,l},\ x_u,\ y_u}\ \ \eta_{EE} \qquad (25)$

subject to the following constraints:

$\mathrm{C1:}\ \ 0 \leq p_{m,l} \leq P_{G,L}^{\max},\ \ 0 \leq p_{u,l} \leq P_{U,L}^{\max}, \quad \forall m, u, l \qquad (26a)$
$\mathrm{C2:}\ \ \alpha_m \in \{0, 1\}, \quad \forall m \qquad (26b)$
$\mathrm{C3:}\ \ \sum_{m=1}^{M} \alpha_m \beta_{k,m,l} + \sum_{u=1}^{U} \beta_{k,u,l} = 1, \quad \forall k, l \qquad (26c)$
$\mathrm{C4:}\ \ \sum_{l=1}^{L} p_{m,l} \leq \alpha_m P_m, \quad \forall m \qquad (26d)$
$\mathrm{C5:}\ \ \sum_{l=1}^{L} p_{u,l} \leq P_u^{\max}, \quad \forall u \qquad (26e)$
$\mathrm{C6:}\ \ 0 \leq x_u \leq L_x, \quad \forall u \qquad (26f)$
$\mathrm{C7:}\ \ 0 \leq y_u \leq L_y, \quad \forall u \qquad (26g)$
$\mathrm{C8:}\ \ R_k \geq R_{\min}, \quad \forall k \qquad (26h)$

Constraint (C1) ensures that the transmission power per subcarrier remains within the maximum allowable thresholds for both GBSs and ABSs, as defined in (Equation 26a).

Constraint (C2) determines the operational status of each GBS, where $\alpha_m = 1$ indicates that the GBS is active and $\alpha_m = 0$ signifies that it is switched to sleep mode, as given in (Equation 26b).

Constraint (C3) guarantees exclusive user association by ensuring that each user is connected to exactly one serving node—either a single GBS or a single ABS—on each subcarrier. The inclusion of the activity indicator $\alpha_m$ prevents user association with sleeping GBSs, thereby avoiding multiple or invalid connections, as stated in (Equation 26c).

Constraint (C4) enforces that the total transmit power of each GBS does not exceed its maximum permissible value, accounting for both active and sleep states, as given in (Equation 26d). Constraint (C5) ensures that ABSs operate within their defined power limitations, as specified in (Equation 26e).

Constraints (C6) and (C7) confine ABSs within the designated region of interest, ensuring that they remain within operational limits, as enforced in (Equation 26f) and (Equation 26g), respectively. Finally, Constraint (C8) guarantees that each user achieves the minimum required data rate, thereby maintaining the network's quality of service (QoS), as defined in (Equation 26h).
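Before moving to the learning-based solution, note that verifying a candidate solution against C1-C8 is cheap; the hypothetical checker below does exactly that, with argument shapes (association indicators indexed as [k, m, l] and [k, u, l], powers as [m, l] and [u, l]) assumed for illustration.

```python
import numpy as np

def is_feasible(alpha, beta_g, beta_a, p_g, p_a, xy_abs, rates,
                P_gl_max, P_ul_max, P_m, P_u_max, Lx, Ly, R_min):
    """Check constraints C1-C8 of Equation 26 for one candidate solution."""
    c1 = (np.all((p_g >= 0) & (p_g <= P_gl_max))
          and np.all((p_a >= 0) & (p_a <= P_ul_max)))            # C1
    c2 = np.isin(alpha, (0, 1)).all()                            # C2
    served = np.einsum('m,kml->kl', alpha, beta_g) + beta_a.sum(axis=1)
    c3 = np.all(served == 1)                                     # C3: one server
    c4 = np.all(p_g.sum(axis=1) <= alpha * P_m)                  # C4: sleepers off
    c5 = np.all(p_a.sum(axis=1) <= P_u_max)                      # C5
    c67 = np.all((xy_abs[:, 0] >= 0) & (xy_abs[:, 0] <= Lx)
                 & (xy_abs[:, 1] >= 0) & (xy_abs[:, 1] <= Ly))   # C6, C7
    c8 = np.all(rates >= R_min)                                  # C8: QoS
    return bool(c1 and c2 and c3 and c4 and c5 and c67 and c8)
```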

The optimization problem formulated in (Equation 25) is a mixed-integer non-convex problem involving both discrete and continuous variables, rendering it intractable for conventional mathematical optimization techniques. To address this complexity, a learning-based approach is introduced in the following section.

3 DRL-based framework for optimizing complex problems

This section presents the hybrid DDPG–DDQL framework developed to address the energy efficiency optimization problem in ABS-assisted B5G networks incorporating a sleep strategy, as formulated in (Equation 25).

3.1 Basics of deep reinforcement learning (DRL)

Reinforcement Learning (RL) has significantly advanced Artificial Intelligence (AI) by enabling agents to make decisions, observe outcomes, and iteratively refine their strategies to determine an optimal policy (Morocho-Cayamcela et al., 2019; Huang et al., 2019). However, due to its reliance on extensive exploration, traditional RL can be slow and computationally expensive, limiting its applicability in large-scale networks.

Deep Reinforcement Learning (DRL) integrates Deep Neural Networks (DNNs) into RL, significantly enhancing learning speed and efficiency. In applications such as IoT and UAV-assisted networks, devices often need to make independent decisions to optimize network performance. These scenarios are frequently modeled as Markov Decision Processes (MDPs), which are formally defined as a quintuple (S,A,P,R,ζ):

$S$ represents a $K$-dimensional state space, with each state at time $t$ denoted as $s_t$.

$A$ defines the finite action space available to the agent.

$P: S \times A \times S \to [0,1]$ is the transition probability function, specifying the likelihood of transitioning from state $s$ to state $s'$ after taking action $a$, expressed as $P(s, a, s')$.

$R: S \times A \to \mathbb{R}$ is the expected reward function, quantifying the anticipated reward upon executing action $a$ in state $s$.

$\zeta \in [0,1]$ is the discount factor that determines the importance of future rewards.

Although traditional methods such as dynamic programming and value iteration can solve MDPs, they become computationally impractical for large-scale and complex networks. DRL techniques, particularly Deep Q-Learning (DQL), provide scalable solutions by approximating value functions using deep neural networks.

3.2 Deep Q-learning (DQL) and its limitations

Deep Q-Learning (DQL) is a fundamental DRL algorithm that estimates Q-values for state-action pairs using neural networks (Braga et al., 2020). For an agent parameterized by $\theta^{Q}$ at time $t$, the DQL update equation after taking action $a_t$ in state $s_t$, receiving immediate reward $r_{t+1}$, and transitioning to the next state $s_{t+1}$ is (Equation 27):

$Q\left(s, a \mid \theta_{t+1}^{Q}\right) = Q\left(s, a \mid \theta_t^{Q}\right) + \nu\left[r_{t+1} + \zeta \max_{a'} Q\left(s_{t+1}, a' \mid \theta_t^{Q}\right) - Q\left(s, a \mid \theta_t^{Q}\right)\right], \qquad (27)$

Here, $\nu$ represents the learning rate. However, the Q-learning update tends to overestimate Q-values due to bootstrapping, where estimates are derived from other estimates. This bias is exacerbated by using the same Q-network for both action selection and evaluation.

3.3 Double deep Q-learning (DDQL)

To mitigate the overestimation bias in DQL, Double Deep Q-Learning (DDQL) (Van Hasselt et al., 2016) decouples action selection from action evaluation by employing two distinct Q-networks, an approach also applied in (Fährmann et al., 2022; Shokrnezhad et al., 2024):

• The primary Q-network $Q$ selects actions.

• The target Q-network $Q'$ evaluates actions, using separate parameters $\theta^{Q}$ and $\theta^{Q'}$.

The DDQL update equation is (Equation 28):

$Q\left(s, a \mid \theta_{t+1}^{Q}\right) = Q\left(s, a \mid \theta_t^{Q}\right) + \nu\left[r_{t+1} + \zeta\, Q\left(s_{t+1},\ \arg\max_{a'} Q\left(s_{t+1}, a' \mid \theta_t^{Q}\right) \,\middle|\, \theta_t^{Q'}\right) - Q\left(s_t, a_t \mid \theta_t^{Q}\right)\right]. \qquad (28)$

The target network $Q'$ is periodically updated using Polyak averaging (Equation 29):

$\theta_{t+t_0}^{Q'} = \left(1 - \tau\right)\theta_t^{Q'} + \tau\,\theta_t^{Q}, \qquad (29)$

where $\tau \in [0,1]$ controls the update rate.
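In PyTorch terms, the decoupling of Equations 28-29 amounts to a few lines; the sketch below assumes q_net and q_target are modules mapping a state batch to per-action Q-values, a standard but assumed interface rather than the authors' implementation.

```python
import torch

def ddql_target(q_net, q_target, s_next, r, zeta=0.99):
    """Double-DQN target of Equation 28: the online network selects the
    greedy action, the target network evaluates it."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)       # selection
        q_eval = q_target(s_next).gather(1, a_star).squeeze(1)   # evaluation
        return r + zeta * q_eval

def polyak_update(q_target, q_net, tau=0.005):
    """Soft target update of Equation 29."""
    for p_t, p in zip(q_target.parameters(), q_net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```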

Although DDQL effectively reduces overestimation and improves convergence in discrete action spaces (such as GBS sleep mode decisions), it struggles with continuous action spaces, such as power allocation and ABS positioning.

3.4 Deep deterministic policy gradient (DDPG)

For continuous action spaces, the Deep Deterministic Policy Gradient (DDPG) algorithm (Yu et al., 2021) is a more suitable approach. DDPG is an actor-critic algorithm that efficiently handles sequential decision-making. It optimizes a policy function $\pi$, mapping states to actions, by maximizing the objective function (Equation 30):

$J(\theta) = \mathbb{E}\left[\,Q(s, a) \mid s = s_t,\ a = \pi\left(a \mid s_t\right)\right]. \qquad (30)$

Unlike DQL, which scores a finite set of discrete actions, DDPG directly maps states to continuous actions through a policy network $\pi(a \mid s_t)$. The policy network parameters $\theta^{\mu}$ are updated using the gradient (Equation 31):

$\nabla_{\theta^{\mu}} J \approx \nabla_a Q(s, a)\, \nabla_{\theta^{\mu}} \mu\left(s \mid \theta^{\mu}\right), \qquad (31)$

where $\mu(s \mid \theta^{\mu})$ is the policy network (the actor), and $\nabla_a Q(s, a)$ represents the gradient derived from a Q-network (the critic).

In large-scale environments with numerous actions, the actor-critic framework efficiently approximates Q-values using (Equation 32):

$\max_a Q(s, a) \approx Q\left(s, a \mid \theta^{Q}\right)\Big|_{a = \mu\left(s \mid \theta^{\mu}\right)}. \qquad (32)$
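One gradient step of Equation 31 then looks as follows; actor and critic are assumed to be torch.nn.Modules with the signatures shown, so this is a sketch of the standard actor update rather than the authors' exact implementation.

```python
import torch

def ddpg_actor_step(actor, critic, actor_opt, states):
    """One policy-gradient step of Equation 31: ascend Q(s, mu(s)) by
    descending its negation; backprop applies the chain rule of Eq. 31."""
    actor_opt.zero_grad()
    loss = -critic(states, actor(states)).mean()   # -J(theta_mu)
    loss.backward()
    actor_opt.step()
```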

Similarly to DDQL, DDPG enhances stability by using:

• Experience replay to train the critic network.

• Target networks for both actors and critics, updated using Polyak averaging.

3.5 Hybrid DDPG-DDQL for UAV-assisted B5G networks

Given the nature of our optimization problem, which involves both discrete decisions (e.g., GBS sleep mode and discrete association choices) and continuous variables (e.g., power allocation and ABS positioning), we propose a Hybrid DDPG–DDQL framework. In this hybrid design, DDQL handles the discrete decision component and mitigates overestimation bias, while DDPG learns a deterministic continuous-control policy through actor–critic training. This explicit separation helps stabilize learning and reduces the overall search complexity compared to using a single algorithm to handle a mixed action space.
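As a concrete illustration of this separation, the sketch below composes one mixed action from the two learners; the module interfaces and the epsilon-greedy exploration on the discrete branch are assumptions of this sketch, not details specified in the paper.

```python
import torch

def hybrid_action(actor, q_net, state, eps=0.05):
    """Compose one mixed action: DDPG's actor emits the continuous part
    (powers and ABS position) and the DDQL network picks the discrete part
    (sleep-configuration index)."""
    with torch.no_grad():
        a_cont = actor(state)                 # continuous: powers + (x, y)
        q_vals = q_net(state)
        if torch.rand(()) < eps:              # assumed exploration scheme
            a_disc = torch.randint(q_vals.shape[-1], ())
        else:
            a_disc = q_vals.argmax()
    return a_cont, int(a_disc)
```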

3.6 Proposed hybrid DDPG-DDQL framework with ABS-assisted sleep strategy

The objective of this research is to develop a DRL-based framework that optimizes the sleep scheduling of GBSs, the power allocation vector, and the ABS positioning vector, based on a given Channel State Information (CSI) matrix, defined as (Equation 33):

$H = \left[G_1, G_2, \ldots, G_K\right], \qquad (33)$

where, without loss of generality, a single UAV is assumed in the network and hence the UAV index of the channel matrix is discarded. Furthermore, $G_k = \left[G_{k,1}, \ldots, G_{k,L}\right]$. To achieve this, we propose a hybrid DRL system that integrates DDPG and DDQL to jointly learn optimal sleeping configurations and continuous control actions in relation to the CSI matrix $H$ (or any related network metric). A schematic representation of the proposed hybrid DDPG–DDQL architecture is illustrated in Figure 2.


Figure 2. Hybrid DDPG–DDQL framework for optimizing ABS-assisted networks.

For a given sleep configuration, the actor–critic DDPG algorithm (Zhou et al., 2022) is used to optimize the power allocation and ABS horizontal positioning. DDPG outputs the continuous power-allocation vector $p \in \mathbb{R}^{LK \times 1}$ together with the ABS location vector $w = (x, y)$.

To determine the optimal sleep configuration, the DDQL algorithm (Van Hasselt et al., 2016) is used, since the number of possible sleep configurations is finite and each configuration index is discrete.

3.7 Optimization problem formulation

As illustrated in Figure 2, both algorithms interact with a simulated ABS-assisted network environment to address the optimization problem formulated in Equation 25.

The network environment state is represented as (Equation 34):

$S = \left\{s_1, s_2, \ldots, s_K\right\}, \qquad (34)$

where each state $s_k$ comprises the user's Signal-to-Interference-plus-Noise Ratio (SINR) together with the current sleep, association, and power-allocation variables, represented as (Equation 35):

$s_k = \left(\gamma_k,\ \alpha_m,\ \beta_{m,k,l},\ \beta_{k,u,l},\ p_{m,l},\ p_{u,l}\right). \qquad (35)$

3.8 Reward function and action space

The immediate reward function r plays a crucial role in estimating optimal policies and Q-values. While energy efficiency is the primary objective, the reward must also discourage QoS violations. Therefore, we define the reward as the energy efficiency when all users satisfy the minimum rate requirement, and apply a penalty otherwise (Equation 36):

$r_t = \begin{cases} \eta_{EE}^{t}, & \text{if } R_k^{t} \geq R_{\min}\ \ \forall k, \\ \eta_{EE}^{t} - \lambda \sum_{k=1}^{K} \max\left(0,\ R_{\min} - R_k^{t}\right), & \text{otherwise}, \end{cases} \qquad (36)$

where $\lambda > 0$ controls the penalty severity. This reward structure encourages the agent to improve energy efficiency while maintaining feasibility with respect to the QoS constraints.
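The penalized reward of Equation 36 is a one-liner in practice; the penalty weight below is an illustrative assumption.

```python
import numpy as np

def reward(rates, eta_ee, r_min, lam=10.0):
    """Penalized reward of Equation 36; lam > 0 is an illustrative weight."""
    shortfall = np.maximum(0.0, r_min - np.asarray(rates)).sum()
    return eta_ee if shortfall == 0.0 else eta_ee - lam * shortfall
```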

In this framework, as shown in Figure 2, the action space A consists of an action pair (Equation 37):

$a = \left(a^{b}, a^{c}\right) = \left(\omega, c\right). \qquad (37)$

Here,

• The continuous action vector ω, which includes power allocation and ABS position, is generated using the DDPG algorithm.

• The discrete action vector $c$, which includes the GBS sleep configuration (and any discrete association selection, if applicable), is produced by the DDQL algorithm.

3.9 Computational complexity analysis

The computational complexity of the proposed method mainly arises during the offline training stage due to iterative neural network updates in both DDPG and DDQL. For a fully connected network layer $i$, let $m_i$ and $n_i$ denote the input and output dimensions, and let $b$ be the batch size. The dominant cost per training update is due to matrix multiplications, leading to (Equation 38):

$\sum_i O\left(m_i\, n_i\, b\right). \qquad (38)$

After training, the online inference stage requires only forward passes through the trained networks. With $b = 1$ (as in online RL decision-making), the complexity becomes (Equation 39):

$\sum_i O\left(m_i\, n_i\right). \qquad (39)$

In addition, the DDQL component evaluates a discrete action among a finite set of sleep configurations; thus, the per-step selection overhead scales with the number of discrete actions (sleep configurations) considered by the DDQL output layer. Overall, the proposed approach is computationally intensive during offline training but has low online computational overhead, making it suitable for real-time operation once trained.
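To make the scaling of Equations 38-39 tangible, the toy estimator below counts multiply-accumulate operations for a fully connected stack; the layer widths of the input/output ends and the batch size in the example are assumptions, since the state and action dimensionalities are not specified here.

```python
def mac_count(layer_dims, batch=64):
    """Multiply-accumulate count per pass for a fully connected stack
    (Equations 38-39): sum of m_i * n_i * b over consecutive layer widths."""
    return sum(m * n * batch for m, n in zip(layer_dims[:-1], layer_dims[1:]))

# Illustrative state/action sizes (assumed, not from the paper):
print(mac_count([32, 256, 128, 8], batch=64))  # actor-style stack, training
print(mac_count([32, 64, 64, 16], batch=1))    # DDQL stack, online inference
```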

4 Results and discussions

In this section, we present the results of the proposed hybrid-DRL algorithm to optimize energy efficiency in ABS-assisted cellular networks. The simulation parameters are summarized in Table 1.


Table 1. Simulation parameters.

The actor and critic networks, along with their respective target networks, are designed with two hidden layers comprising 256 and 128 neurons, respectively. In contrast, the DDQL network includes two fully connected layers with 64 neurons each, followed by ReLU activation functions, and terminates with a linear output layer. To enhance convergence and ensure training stability, the Adam optimizer is employed for the critic network, using its default hyperparameters $\beta_1$ and $\beta_2$, which control the exponential moving averages of the gradients and their squared values, respectively (Van Hasselt et al., 2016).
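These two architectures can be stated directly in PyTorch; the layer widths follow the description above, while the tanh squashing on the actor output (for bounded power and position values) is an assumption of this sketch.

```python
import torch.nn as nn

def make_actor(state_dim, act_dim):
    """Actor with the 256/128 hidden layout described above; tanh squashing
    of the outputs (for bounded powers and positions) is an assumption."""
    return nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                         nn.Linear(256, 128), nn.ReLU(),
                         nn.Linear(128, act_dim), nn.Tanh())

def make_ddql_net(state_dim, n_actions):
    """DDQL head: two 64-neuron ReLU layers and a linear output, as stated."""
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))
```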

Figure 3 illustrates the relationship between the number of sleeping GBSs and the rate requirement across three optimization scenarios: (i) optimizing the achievable rate without considering EE, (ii) prioritizing EE while ignoring rate constraints, and (iii) jointly optimizing EE under the minimum-rate constraint. The results show that $R_{\min}$ strongly limits the number of GBSs that can enter sleep mode. In the first scenario, no GBS enters sleep mode due to strict rate-driven operation. In the second scenario, more GBSs can be deactivated, improving EE but violating user rate requirements. The third scenario balances both objectives and is effective in the range $R_{\min} \in [0.1, 0.6]$ bps/Hz. When $R_{\min} = 0.6$ bps/Hz, up to seven GBSs can be switched off, while at $R_{\min} = 0.1$ bps/Hz, only one GBS is switched off. For $R_{\min} > 0.6$ bps/Hz, the method cannot place GBSs into sleep mode without violating the QoS constraint.


Figure 3. The number of sleeping GBSs as a function of rate requirement $R_{\min}$.

It can be observed from Figure 3 that when the minimum rate requirement exceeds Rmin=0.6 bps/Hz, the number of GBSs entering sleep mode drops to zero. This outcome highlights the trade-off between stringent QoS constraints and achievable energy savings. Under high QoS requirements, maintaining user data rates necessitates keeping all GBSs active, which limits the applicability of sleep strategies. Consequently, the proposed ABS-assisted framework is most effective in low-to-medium QoS regimes, such as off-peak or lightly loaded traffic conditions, where energy efficiency gains can be achieved without compromising user performance.

As illustrated in Figure 4, the energy efficiency criterion is plotted as a function of training episodes. To evaluate the role of the ABS in the proposed system, three scenarios were considered: one with the ABS at a higher altitude, another at a lower altitude, and a scenario without an ABS. In this context, $R_{\min}$ was set to 0.2 bps/Hz based on the results in Figure 3. The results indicate that the learning process improves energy efficiency over training as the agent refines sleep scheduling, power allocation, and ABS positioning decisions. Figure 5 presents the energy efficiency criterion as a function of the minimum rate requirement. The observations indicate that, at $R_{\min} = 0.4$ bps/Hz, the proposed methods employing low-altitude and high-altitude ABSs improve the energy efficiency criterion by 30% and 25%, respectively, compared to the scenario without an ABS.


Figure 4. Energy efficiency criterion as a function of the number of training episodes for $R_{\min} = 0.2$ bps/Hz.


Figure 5. Energy efficiency criterion as a function of rate requirement $R_{\min}$.

5 Conclusion

This paper presents an ABS-assisted energy optimization framework for beyond-5G (B5G) cellular networks, utilizing selective ground base station (GBS) sleep modes and traffic offloading through aerial base stations (ABSs). To address the dynamic and non-convex nature of the problem, a hybrid reinforcement learning algorithm combining Deep Deterministic Policy Gradient (DDPG) and Double Deep Q-Learning (DDQL) is developed. This algorithm jointly optimizes ABS positioning, GBS sleep scheduling, and resource allocation. Simulation results demonstrate that the proposed framework significantly reduces overall network energy consumption while maintaining service quality.

These findings underscore the potential of hybrid deep reinforcement learning techniques in enabling intelligent and energy-efficient wireless communication systems. Future research directions include incorporating renewable-powered ABSs, implementing cooperative multi-ABS coordination, and developing real-time adaptive mechanisms to further enhance system scalability and performance.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.

Author contributions

VS: Writing – review and editing, Writing – original draft. ME: Writing – original draft, Writing – review and editing. KK: Writing – review and editing, Writing – original draft.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was used in the creation of this manuscript. Generative AI was used exclusively to edit and improve the clarity of the manuscript’s text.


Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Abubakar, A. I., Ahmad, I., Omeke, K., Ozturk, M., Ozturk, C., Abdel-Salam, A., et al. (2023). A survey on energy optimization techniques in UAV-based cellular networks: from conventional to machine learning approaches. Drones 7 (3), 214. doi:10.3390/drones7030214

Alsharif, M., Nordin, R., Abdullah, N. F., and Kelechi, A. H. (2018). How to make key 5G wireless technologies environmental friendly: a review. Trans. Emerg. Telecommun. Technol. 29 (1), e3254. doi:10.1002/ett.3254

Amine, A. E., Chaiban, J. P., Hassan, H. A. H., Dini, P., Nuaymi, L., and Achkar, R. (2022). Energy optimization with multi-sleeping control in 5G heterogeneous networks using reinforcement learning. IEEE Trans. Netw. Serv. Manag. 19 (4), 4310–4322. doi:10.1109/tnsm.2022.3157650

Amponis, G., Lagkas, T., Zevgara, M., Katsikas, G., Xirofotos, T., Moscholios, I., et al. (2022). Drones in B5G/6G networks as flying base stations. Drones 6 (2), 39. doi:10.3390/drones6020039

Azadur, R.Md., Pawase, C. J., and Chang, K. (2024). Multi-UAV path planning utilizing the PGA algorithm for terrestrial IoT sensor network under ISAC framework. Trans. Emerg. Telecommun. Technol. 35 (1), e4916. doi:10.1002/ett.4916

Azarhava, H., Abdollahi, M. P., and Musevi Niya, J. (2024). Placement and power assignment for hierarchical UAV networks under hovering fluctuations in mmWave communications. Trans. Emerg. Telecommun. Technol. 35 (11), e70002. doi:10.1002/ett.70002

Banafaa, M. K., Pepeoğlu, Ö., Shayea, I., Alhammadi, A., Shamsan, Z. A., Razaz, M. A., et al. (2024). A comprehensive survey on 5G-and-Beyond networks with UAVs: applications, emerging technologies, regulatory aspects, research trends and challenges. IEEE Access 12, 7786–7826. doi:10.1109/access.2023.3349208

Basharat, M., Naeem, M., Qadir, Z., and Anpalagan, A. (2022). Resource optimization in UAV-Assisted wireless networks—A comprehensive survey. Trans. Emerg. Telecommun. Technol. 33 (7), e4464. doi:10.1002/ett.4464

Behjati, M., Alobaidy, H. A. H., Nordin, R., and Abdullah, N. F. (2025). UAV-assisted federated learning with hybrid LoRa P2P/LoRaWAN for sustainable biosphere. Front. Commun. Netw. 6, 1529453. doi:10.3389/frcmn.2025.1529453

Braga, I. M., Cavalcante, E. d. O., Fodor, G., Silva, Y. C. B., e Silva, C. F. M., and Freitas, W. C. (2020). User scheduling based on multi-agent deep Q-learning for robust beamforming in multicell MISO systems. IEEE Commun. Lett. 24 (12), 2809–2813. doi:10.1109/lcomm.2020.3015462

Chowdary, A., Ramamoorthi, Y., Kumar, A., and Cenkeramaddi, L. R. (2021). Joint resource allocation and UAV scheduling with ground radio station sleeping. IEEE Access 9, 124505–124518. doi:10.1109/access.2021.3111087

Dogra, A., Jha, R. K., and Jain, S. (2020). A survey on beyond 5G network with the advent of 6G: architecture and emerging technologies. IEEE Access 9, 67512–67547. doi:10.1109/access.2020.3031234

Elnabty, I. A., Fahmy, Y., and Kafafy, M. (2022). A survey on UAV placement optimization for UAV-Assisted communication in 5G and beyond networks. Phys. Commun. 51, 101564. doi:10.1016/j.phycom.2021.101564

Fährmann, D., Jorek, N., Damer, N., Kirchbuchner, F., and Kuijper, A. (2022). Double deep Q-Learning with prioritized experience replay for anomaly detection in smart environments. IEEE Access 10, 60836–60848. doi:10.1109/access.2022.3179720

Geraci, G., Garcia-Rodriguez, A., Azari, M. M., Lozano, A., Mezzavilla, M., Chatzinotas, S., et al. (2022). What will the future of UAV cellular communications be? A flight from 5G to 6G. IEEE Commun. Surv. Tutor. 24 (3), 1304–1335. doi:10.1109/comst.2022.3171135

Ghorbel, M. B., Rodriguez-Duarte, D., Ghazzai, H., Hossain, M. J., and Menouar, H. (2019). Joint position and travel path optimization for energy efficient wireless data gathering using unmanned aerial vehicles. IEEE Trans. Veh. Technol. 68 (3), 2165–2175. doi:10.1109/tvt.2019.2893374

Gryech, I., Vinogradov, E., Saboor, A., Bithas, P. S., Mathiopoulos, P. T., and Pollin, S. (2024). A systematic literature review on the role of UAV-enabled communications in advancing the UN's sustainable development goals. Front. Commun. Netw. 5, 1286073. doi:10.3389/frcmn.2024.1286073

Gu, X., and Zhang, G. (2023). A survey on UAV-assisted wireless communications: recent advances and future trends. Comput. Commun. 208, 44–78. doi:10.1016/j.comcom.2023.05.013

Huang, Y., Xu, C., Zhang, C., Hua, M., and Zhang, Z. (2019). An overview of intelligent wireless communications using deep reinforcement learning. J. Commun. Inf. Netw. 4 (2), 15–29. doi:10.23919/jcin.2019.8917869

Jangsher, S., Al-Jarrah, M., Al-Dweik, A., Alsusa, E., and Kong, P. Y. (2022). Energy constrained sum-rate maximization in IRS-assisted UAV networks with imperfect channel information. IEEE Trans. Aerosp. Electron. Syst. 59 (3), 2898–2908. doi:10.1109/taes.2022.3220493

Ju, H., Kim, S., Kim, Y., and Shim, B. (2022). Energy-efficient ultra-dense network with deep reinforcement learning. IEEE Trans. Wirel. Commun. 21 (8), 6539–6552. doi:10.1109/twc.2022.3150425

Khawaja, W., Guvenc, I., Matolak, D. W., Fiebig, U. C., and Schneckenburger, N. (2019). A survey of air-to-ground propagation channel modeling for unmanned aerial vehicles. IEEE Commun. Surv. Tutor. 21 (3), 2361–2391. doi:10.1109/comst.2019.2915069

Kim, J., Jeon, W. S., and Jeong, D. G. (2015). Base-station sleep management in open-access femtocell networks. IEEE Trans. Veh. Technol. 65 (5), 3786–3791. doi:10.1109/tvt.2015.2445922

Kim, T., Lee, S., Choi, H., Park, H. S., and Choi, J. (2023). An energy-efficient multi-level sleep strategy for periodic uplink transmission in industrial private 5G networks. Sensors 23, 9070. doi:10.3390/s23229070

Kooshki, F., Armada, A. G., Mowla, M. M., Flizikowski, A., and Pietrzyk, S. (2023). Energy-efficient sleep mode schemes for cell-less RAN in 5G and beyond 5G networks. IEEE Access 11, 1432–1444. doi:10.1109/access.2022.3233430

López-Pérez, D., De Domenico, A., Piovesan, N., Xinli, G., Bao, H., Qitao, S., et al. (2022). A survey on 5G radio access network energy efficiency: massive MIMO, lean carrier design, sleep modes, and machine learning. IEEE Commun. Surv. Tutor. 24 (1), 653–697. doi:10.1109/comst.2022.3142532

Masroor, R., Naeem, M., and Ejaz, W. (2021). Resource management in UAV-assisted wireless networks: an optimization perspective. Ad Hoc Netw. 121, 102596. doi:10.1016/j.adhoc.2021.102596

Morocho-Cayamcela, M. E., Lee, H., and Lim, W. (2019). Machine learning for 5G/B5G Mobile and wireless communications: potential, limitations, and future directions. IEEE Access 7, 137184–137206. doi:10.1109/access.2019.2942390

Puspitasari, A., An, T. T., Alsharif, M. H., and Lee, B. M. (2023). Emerging technologies for 6G communication networks: machine learning approaches. Sensors 23, 7709. doi:10.3390/s23187709

Qazzaz, M. M. H., Zaidi, S. A., McLernon, D. C., Hayajneh, A. M., Salama, A., and Aldalahmeh, S. A. (2024). Non-terrestrial UAV clients for beyond 5G networks: a comprehensive survey. Ad Hoc Netw. 157, 103440. doi:10.1016/j.adhoc.2024.103440

Sarkar, N. I., and Gul, S. (2023). Artificial intelligence-based autonomous UAV networks: a survey. Drones 7 (5), 322. doi:10.3390/drones7050322

Shahzadi, R., Ali, M., Khan, H. Z., and Naeem, M. (2021). UAV assisted 5G and beyond wireless networks: a survey. J. Netw. Comput. Appl. 189, 103114. doi:10.1016/j.jnca.2021.103114

Shokrnezhad, M., Taleb, T., and Dazzi, P. (2024). Double deep Q-Learning-Based path selection and service placement for latency-sensitive beyond 5G applications. IEEE Trans. Mob. Comput. 23 (5), 5097–5110. doi:10.1109/tmc.2023.3301506

Singh, Y. (2012). Comparison of okumura, hata and COST-231 models on the basis of path loss and signal strength. Int. J. Comput. Appl. 59 (11), 37–41. doi:10.5120/9594-4216

Sufyan, A., Khan, K. B., Khashan, O. A., Mir, T., and Mir, U. (2023). From 5G to beyond 5G: a comprehensive survey of wireless network evolution, challenges, and promising technologies. Electronics 12, 2200. doi:10.3390/electronics12102200

Tung, T. V., An, T. T., and Lee, B. M. (2022). Joint resource and trajectory optimization for energy efficiency maximization in UAV-based networks. Mathematics 10 (20), 3840. doi:10.3390/math10203840

Van Hasselt, H., Guez, A., and Silver, D. (2016). "Deep reinforcement learning with double Q-learning," in Proc. AAAI Conf. Artif. Intell.

Won, J., Kim, D. Y., Park, Y. I., and Lee, J. W. (2023). A survey on UAV placement and trajectory optimization in communication networks: from the perspective of air-to-ground channel models. ICT Express 9 (3), 385–397. doi:10.1016/j.icte.2022.01.015

Wu, W., Sun, S., Shan, F., Yang, M., and Luo, J. (2022). Energy-constrained UAV flight scheduling for IoT data collection with 60 GHz communication. IEEE Trans. Veh. Technol. 71 (10), 10991–11005. doi:10.1109/tvt.2022.3184869

Yu, Y., Tang, J., Huang, J., Zhang, X., So, D. K. C., and Wong, K. K. (2021). Multi-objective optimization for UAV-assisted wireless powered IoT networks based on extended DDPG algorithm. IEEE Trans. Commun. 69 (9), 6361–6374. doi:10.1109/tcomm.2021.3089476

Zhou, Q., Guo, C., Wang, C., and Cui, L. (2022). Radio resource management for C-V2X using graph matching and actor–critic learning. IEEE Wirel. Commun. Lett. 11 (12), 2645–2649. doi:10.1109/lwc.2022.3213176

Zhu, B., Bedeer, E., Nguyen, H. H., Barton, R., and Henry, J. (2021). UAV trajectory planning in wireless sensor networks for energy consumption minimization by deep reinforcement learning. IEEE Trans. Veh. Technol. 70 (9), 9540–9554. doi:10.1109/tvt.2021.3102161

Keywords: double deep Q-learning (DDQL), deep deterministic policy gradient (DDPG), energy efficiency, sleeping ground BS, ABS-assisted beyond-5G network

Citation: Saleh V, Eslami M and Kazemi K (2026) DDPG-based energy efficiency optimization for ABS-assisted beyond-5G cellular networks with sleep mode management. Front. Commun. Netw. 6:1764320. doi: 10.3389/frcmn.2025.1764320

Received: 09 December 2025; Accepted: 29 December 2025;
Published: 26 January 2026.

Edited by:

Mehran Behjati, Sunway University, Malaysia

Reviewed by:

Mohammed Sani Adam, National University of Malaysia, Malaysia
Javad Haghighat, TED University, Türkiye

Copyright © 2026 Saleh, Eslami and Kazemi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Mohsen Eslami, meslami1@ualberta.ca
