A Modified Long Short-Term Memory-Deep Deterministic Policy Gradient-Based Scheduling Method for Active Distribution Networks

To improve the decision-making level of active distribution networks (ADNs), this paper proposes a novel coordinated scheduling framework that combines a long short-term memory (LSTM) network with deep reinforcement learning (DRL). Considering the interaction characteristics of ADNs with distributed energy resources (DERs), the scheduling objective is constructed to reduce the operation cost and optimize the voltage distribution. To tackle this problem, an LSTM module is employed to perform feature extraction on the ADN environment, enabling the recognition and learning of massive temporal structure data. The ADN real-time scheduling model is formulated as a finite Markov decision process (FMDP), and a modified deep deterministic policy gradient (DDPG) algorithm is proposed to solve the resulting decision-making problem. Experimental results on a modified IEEE 33-bus system demonstrate the validity and superiority of the proposed method.


INTRODUCTION
To reduce greenhouse gas emissions, numerous government policies have been established to encourage the development of renewable energy sources. Along with this trend, conventional distribution networks are being transformed into active distribution networks (ADNs) (Wei et al., 2021). Meanwhile, the intermittent and volatile output of high-penetration distributed energy resources (DERs), such as photovoltaic generations (PVs), energy storage systems (ESSs), and wind farms, increases the uncertainty of ADNs (Usman et al., 2018; Ehsan and Yang, 2019). In particular, the increasingly severe issues of voltage violation and network loss have attracted widespread attention. Thus, it is necessary to coordinate the scheduling of DERs to promote the flexibility and interaction of ADNs.
Recently, considerable research effort has been devoted to coordinated scheduling policies that optimize the decision-making and control of DERs. Studies (Zamzam et al., 2022; Prabawa and Choi, 2021) maintain voltage quality and reduce power losses by coordinating ESSs with charging stations (CSs). In (Zamzam et al., 2022), the scheduling of DERs at a fast time resolution is solved by the interior point method, verifying that ESSs together with CSs are promising entities for reducing network voltage deviations and system losses. Similarly, Prabawa et al. propose a hierarchical volt/var control (VVC) framework to minimize the total active power losses and voltage deviations through the coordination of smart CSs, PVs, and ESSs at both global and local stages (Prabawa and Choi, 2021). However, limited by the model complexity and computational efficiency, the proposed VVC method may be incapable of handling a large distribution network with various DERs. Additionally, studies (Ma et al., 2021; Zhu et al., 2020) analyze the random fluctuation characteristics of PV plants via multi-scenario modeling, improving the efficiency and economics of ADNs. To reduce the PV curtailment and network loss, a non-dominated sorting genetic algorithm II (NSGA-II)-based voltage regulation method is proposed in (Ma et al., 2021). Although the NSGA-II algorithm is easy to implement, it does not guarantee the global optimum in practical applications. The study reported in (Zhu et al., 2020) constructs a typical scenario set-based approach to address stochastic economic dispatching, pre-establishing charging and discharging schemes for controllable generation units, PV systems, wind farms, and ESSs. However, it suffers a heavy computational burden because many scenarios must be considered. Furthermore, studies (Li et al., 2020a; Luo et al., 2021) establish robust optimal operation strategies to deal with the randomness of DERs. In (Luo et al., 2021), the uncertainty of DERs is described based on the beta distribution, and a robust optimization model is established to optimize the network loss, power purchase cost, and voltage distribution. Li et al. propose a distributed adaptive robust VVC method (Li et al., 2020a) that robustly mitigates the network loss while keeping the voltage within the regulation scope. However, the decisions made by the above methods rely only on the current status of ADNs, and long-term information and objectives are ignored.
Other studies (Zhang Z. et al., 2021; Chen et al., 2021; Sheng et al., 2021) consider the cooperative relationship between fast and slow response resources and mainly establish a multi-timescale scheduling architecture to improve the economics of ADNs. For example, (Sheng et al., 2021) propose a day-ahead economic scheduling model and establish a real-time scheduling method using model predictive control (MPC). The authors of (Zhang Z. et al., 2021) formulate a double-layer MPC method to achieve minute-level control of mechanical voltage regulation devices and distributed generations (DGs). Furthermore, an MPC method combined with decentralized inter-area coordination is proposed by (Li et al., 2020b) to cope efficiently with the high volatility of DGs.
Although the aforementioned methodologies clarify the nature of coordinated scheduling decision-making for ADNs, conventional physical model-based methods rely heavily on specific optimization models, resulting in low computational efficiency and unstable solution performance. As time-varying DERs increasingly penetrate ADNs, it is challenging for these methods to respond quickly to real-time dispatching demands.
Fortunately, in recent literature, deep reinforcement learning (DRL) has received growing interest in addressing the ADN scheduling issue. The nonlinear programming problem is formulated as a finite Markov decision process (FMDP) in (Cao et al., 2021a), and proximal policy optimization is utilized to coordinate ESSs and wind farms. Bahrami et al. develop a deep neural network as the approximator of the state-action value function to benefit load aggregators and users (Bahrami et al., 2021). Further, reference (Zhang Y. et al., 2021) controls switchable capacitors, voltage regulators, and smart inverters via a deep Q-network (DQN) and designs a delicate reward function to maintain the voltage range. Besides, other studies (Gao et al., 2021; Cao et al., 2021b; Zhang J. et al., 2021) introduce multi-agent DRL technology into ADN control and decision-making. Based on a multi-agent and multi-objective architecture, DRL is adopted in (Gao et al., 2021) to develop operation schedules for voltage regulators, on-load tap changers, and capacitors, improving the communication efficiency among agents. Research (Cao et al., 2021b) proposes a multi-agent soft actor-critic approach to analyze the impact of PV fluctuation on voltage distribution. However, the state vector consists of node active power, reactive power, and PV output, and for optimization problems in a large power system, perceiving such state variables usually leads to low training efficiency and poor optimization solutions. In reference (Zhang J. et al., 2021), DQN and deep deterministic policy gradient (DDPG) are utilized to control discrete and continuous variables, respectively. The method rapidly responds to the state changes of distribution networks through the coordinated training of two agents. Other studies (Sun and Qiu, 2021a; Sun and Qiu, 2021b) focus on the collaborative optimization of conventional programming methods and DRL methods. Sun et al. (Sun and Qiu, 2021a) present a two-stage control method to alleviate fast voltage violations. The day-ahead scheduling model is established as a mixed-integer second-order cone programming (MISOCP) problem, while the real-time scheduling problem is solved by a multi-agent DDPG scheme. A similar situation is discussed in (Sun and Qiu, 2021b), where the day-ahead scheduling of ADNs, considering the active and reactive power capacity of electric vehicles (EVs), is constructed as an MISOCP. Moreover, the DDPG algorithm is adopted to formulate the reactive power control and V2G control schedules.
Given the state-of-the-art ADN scheduling solutions in this field, there are still two significant limitations. Firstly, the DRL algorithms represented by DQN and DDPG still suffer shortcomings in terms of low training efficiency, overlearning, and poor stability. Secondly, in terms of application, DQN-based methods fail to learn the mapping relationship between continuous state and action spaces. Although DDPG-based methods output continuous actions, they lack an understanding of temporal structural characteristics and are incapable of handling large state spaces. This results in poor perception of the continuous state information of ADNs.
In summary, methods for extracting high-dimensional temporal characteristics in the real-time scheduling of ADNs are limited, and DRL-based methods lack an assessment of the integration of multiple extensions. To fill these research gaps, this paper presents a long short-term memory (LSTM) and modified DDPG (namely, MLDDPG)-based coordinated scheduling solution. The comprehensive optimization objective is constructed to minimize the operating cost and maintain the voltage range of ADNs.
The temporal features of the ADN environment are extracted by an LSTM module, while the DDPG agent is leveraged to devise real-time operation schemes for DERs. The main contributions of this paper are threefold. 1) To the best of our knowledge, existing DRL-based approaches struggle to handle the massive temporal structure data generated by ADNs. Relying on its ability to understand and mine high-dimensional data, we employ an LSTM module to characterize the temporal data of ADNs. It helps the DRL agent extract and learn the changes of temporal characteristics from both the generation and demand sides and improves the modeling ability for node features. 2) Although the classic DDPG can rapidly respond to the scheduling requirements, it still suffers from overlearning, cold start, and poor stability issues. Thus, a learning rate decay strategy is proposed to balance the exploration and exploitation of DRL agents. Besides, a collaborative assistance policy combined with a modified prioritized experience replay mechanism is proposed to prevent the agent from falling into non-optimal strategies. The combination of extensions improves the convergence speed and application stability and enhances the agent's reliability in decision-making scenarios. 3) A modified LSTM-DDPG (MLDDPG) method is developed to tackle the ADN scheduling issue, which is formulated as an FMDP. In this way, the optimal ADN scheduling decisions can better satisfy the real-time response requirements of DERs. The simulation results demonstrate that our approach significantly improves the operation efficiency and economy of ADNs while optimizing voltage distributions.
The remainder of this paper is organized as follows. Problem Formulation Section sketches the modeling process of the ADN coordinated scheduling problem. The proposed solution approach is then presented in Proposed Real-Time Scheduling Method Section. Case studies are reported in Case Studies Section. Finally, Conclusion Section concludes the paper.

PROBLEM FORMULATION
Figure 1 exhibits the established ADN coordinated scheduling architecture based on LSTM and the modified DDPG algorithm. The ADN control problem involving DERs is formulated as an FMDP. Specifically, an LSTM module is utilized to capture the temporal information characteristics of the ADN load and PV output, which, together with the real-time information of CSs and ESSs, constitute the environment state. A DRL-based agent is developed to formulate the ADN control strategy and evaluate the environmental feedback. Further, the agent is trained and optimized based on a modified DDPG module to accelerate the convergence and improve the application stability of the algorithm. Finally, the optimal mapping from the environment state to the control strategy is output to realize the optimal economic operation of ADNs. The details of the modeling process are as follows.

Objective Function
The sub-objectives consist of the substation power purchase cost, the ESS charging and discharging degradation cost, and the CS response cost, which together realize the economic operation of ADNs. Mathematically, the comprehensive objective is expressed in terms of the following quantities. $C^{\mathrm{sub}}_t$, $C^{\mathrm{ESS}}_t$, and $C^{\mathrm{CS}}_t$ separately represent the substation power purchase cost, ESS charging and discharging degradation cost, and CS response cost. $\Omega_T$ represents the set of time periods. $\Omega_{\mathrm{sub}}$, $\Omega_{\mathrm{ESS}}$, and $\Omega_{\mathrm{CS}}$ are the sets of substation, ESS, and CS nodes, respectively. $\pi^{\mathrm{sub}}$, $\pi^{\mathrm{ESS}}$, and $\pi^{\mathrm{CS}}$ indicate the price of electricity purchased from the transmission network, the ESS degradation unit cost, and the CS scheduling unit cost, respectively. $P^{\mathrm{sub}}_{i,t}$ is the power exchanged with the transmission network. $P^{\mathrm{ESS}}_{i,t}$ indicates the active power of the ESS. $\Delta P^{\mathrm{CS}}_{i,t}$ denotes the active power change of the CS.
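Because the cost terms are described here only through their symbols, the following is one plausible way to write them, given as a sketch rather than as the original Equations 1-4. It assumes that each cost is a unit price multiplied by the energy exchanged over a step of length $\Delta t$. The objective symbol $F$, the step length $\Delta t$, and the absolute values are assumptions introduced for illustration.

```latex
\begin{aligned}
\min \; F &= \sum_{t \in \Omega_T} \left( C^{\mathrm{sub}}_t + C^{\mathrm{ESS}}_t + C^{\mathrm{CS}}_t \right), \\
C^{\mathrm{sub}}_t &= \sum_{i \in \Omega_{\mathrm{sub}}} \pi^{\mathrm{sub}} \, P^{\mathrm{sub}}_{i,t} \, \Delta t, \\
C^{\mathrm{ESS}}_t &= \sum_{i \in \Omega_{\mathrm{ESS}}} \pi^{\mathrm{ESS}} \, \bigl| P^{\mathrm{ESS}}_{i,t} \bigr| \, \Delta t, \\
C^{\mathrm{CS}}_t &= \sum_{i \in \Omega_{\mathrm{CS}}} \pi^{\mathrm{CS}} \, \bigl| \Delta P^{\mathrm{CS}}_{i,t} \bigr| \, \Delta t.
\end{aligned}
```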

Power Flow Constraints
where: $\Omega_{\mathrm{bus}}$ is the set of buses in the ADN. $P^{\mathrm{PV}}_{j,t}$, $P^{\mathrm{ESS}}_{j,t}$, $P^{\mathrm{L}}_{j,t}$, and $P^{\mathrm{CS}}_{j,t}$ indicate the active power of the PV, ESS, load, and CS, respectively. $Q^{\mathrm{PV}}_{j,t}$, $Q^{\mathrm{ESS}}_{j,t}$, and $Q^{\mathrm{L}}_{j,t}$ are the reactive power of the PV, ESS, and load, respectively. $P_{ij,t}$ and $Q_{ij,t}$ separately represent the active and reactive power flowing from the $i$th bus to the $j$th bus. $r_{ij}$ and $x_{ij}$ are the branch resistance and reactance, respectively. $g_j$ and $b_j$ indicate the conductance and susceptance, respectively. $\tilde{I}_{ij,t}$ and $\tilde{U}_{i,t}$ represent the squares of the branch current and bus voltage, respectively. Constraints (5-8) represent the second-order cone programming-based DistFlow constraints.
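To make the role of the squared variables concrete, the sketch below evaluates the voltage-drop relation and the relaxed cone condition of the widely used branch flow (DistFlow) model for a single branch. It assumes the standard textbook form of that model, which may differ in detail from Constraints (5-8); the function names and the per-unit numbers are hypothetical.

```python
def receiving_voltage_sq(u_i_sq, p_ij, q_ij, r_ij, x_ij, i_ij_sq):
    """Squared voltage at bus j from the standard DistFlow voltage equation.

    u_i_sq and i_ij_sq are the squared bus voltage and squared branch current.
    """
    return u_i_sq - 2.0 * (r_ij * p_ij + x_ij * q_ij) + (r_ij**2 + x_ij**2) * i_ij_sq


def soc_relaxation_holds(u_i_sq, p_ij, q_ij, i_ij_sq, tol=1e-9):
    """Check the relaxed second-order cone condition P^2 + Q^2 <= I~ * U~."""
    return p_ij**2 + q_ij**2 <= i_ij_sq * u_i_sq + tol


# Hypothetical per-unit branch data (not taken from the test system).
u_j_sq = receiving_voltage_sq(1.0, 0.1, 0.05, 0.01, 0.02, 0.0125)
```

When the cone condition holds with equality, the relaxation is tight and the squared current equals $(P^2 + Q^2)/\tilde{U}$, which is the case for the numbers used above.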

Safety Operation Constraints
where: $U_{i,t}$ indicates the voltage of the $i$th node at time $t$. $U^{\max}_i$ and $U^{\min}_i$ are the maximum and minimum voltage values, respectively. $I_{ij,t}$ is the branch current at time $t$. $I^{\max}_{ij}$ and $I^{\min}_{ij}$ represent the maximum and minimum current values, respectively.

Operation Constraints of ESSs
where: $P^{\mathrm{ESS}}_{i,\max}$ and $P^{\mathrm{ESS}}_{i,\min}$ represent the ESS maximum and minimum active power, respectively. $Q^{\mathrm{ESS}}_{i,\max}$ and $Q^{\mathrm{ESS}}_{i,\min}$ are the ESS maximum and minimum reactive power, respectively. $S^{\mathrm{ESS}}_{i,\max}$ stands for the maximum apparent power of the $i$th ESS. $E^{\mathrm{ESS}}_{i,t}$ indicates the stored energy in the $i$th ESS at time $t$. $E^{\mathrm{ESS}}_{i,\max}$ and $E^{\mathrm{ESS}}_{i,\min}$ are the maximum and minimum stored energy, respectively. $\eta^{c}_i$ and $\eta^{d}_i$ separately denote the charging and discharging efficiencies. Equations 11-13 limit the power output ranges of ESSs, while Equations 14-16 indicate the energy constraints of ESSs.
For the CSs, the constraints involve the upper and lower energy boundaries of the $j$th EV. $t^{\mathrm{sta}}_j$ and $t^{\mathrm{fin}}_j$ separately represent the start and finish charging times of the $j$th EV. $P^{\mathrm{cha}}_i$ is the charging pile output power. $E^{\mathrm{exp}}_j$ represents the expected charging energy of the $j$th EV. $E^{\mathrm{CS}}_{i,t,\max}$ and $E^{\mathrm{CS}}_{i,t,\min}$ separately indicate the upper and lower energy boundaries of the $i$th CS. Equations 17 and 18 limit the energy boundaries of EVs, and Equations 19 and 20 constrain the response power capacities of CSs. Eq. 21 represents the time translation characteristics of the CSs' demand response.
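The ESS power and energy limits can be illustrated with a short sketch. It assumes a common stored-energy recursion in which charging is scaled by $\eta^{c}_i$ and discharging is divided by $\eta^{d}_i$, with positive power meaning charging; this sign convention, the function names, and the active power limit used in the example are assumptions rather than the exact form of Equations 11-16.

```python
def ess_step(energy_kwh, p_kw, dt_h, eta_c, eta_d):
    """Advance the stored energy by one step (p_kw > 0 charges, p_kw < 0 discharges)."""
    if p_kw >= 0:
        return energy_kwh + eta_c * p_kw * dt_h
    return energy_kwh + p_kw * dt_h / eta_d


def ess_feasible(p_kw, q_kvar, energy_kwh, p_max, s_max, e_min, e_max):
    """Check active power, apparent power, and stored-energy limits."""
    within_power = abs(p_kw) <= p_max and p_kw**2 + q_kvar**2 <= s_max**2
    within_energy = e_min <= energy_kwh <= e_max
    return within_power and within_energy


# 5-min step with 0.9 efficiencies; the 120 kW set-point is hypothetical.
e_next = ess_step(300.0, 120.0, 1.0 / 12.0, 0.9, 0.9)
```

With a 600 kWh unit limited to 250 kVA and a storage range of 10% to 90%, the energy bounds become 60 kWh and 540 kWh, so the step above remains feasible.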

Long Short-Term Memory for Information Perception
DERs with different operating characteristics bring high-dimensional and complex information to ADNs, and it is challenging for DRL agents to capture the resulting high-dimensional feature changes. On the other hand, the ADN load and PV output are less affected by control decisions and show high correlation on the time scale. As an improved version of the recurrent neural network (RNN), LSTM effectively mitigates the vanishing and exploding gradient issues and shows remarkable performance in time series prediction and feature extraction. Therefore, an LSTM module is employed to extract the temporal characteristics of loads and PVs and further improve the long-term performance of the scheduling model. The temporal structure information generated by ADNs forms the input $X$, in which $L$ represents the time-step. LSTM defines the input gate, forget gate, and output gate based on the RNN. The formulations of all nodes in an LSTM structure are given by Equations 23-27.
where: $W_f$, $W_i$, $W_c$, and $W_o$ are the weight matrices of the forget gate, input gate, cell state, and output gate, respectively. $b_f$, $b_i$, $b_c$, and $b_o$ are the corresponding biases. $\sigma(\cdot)$ and $\tanh(\cdot)$ denote the sigmoid activation function and the tanh function, respectively. $\tilde{c}_t$ indicates the candidate cell state. Eq. 25 denotes that the forget gate controls what to forget from the previous cell state $c_{t-1}$, while the input gate decides what to preserve from the candidate cell state $\tilde{c}_t$. Eq. 27 represents that the output gate controls what to pass from the cell state $c_t$ (Kong et al., 2019). The temporal characteristics of loads and PVs are captured relying on the feature extraction ability of the LSTM module. The LSTM output $h_t$ is taken as the temporal information perception required by the DRL agent.
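For reference, the gate definitions described above correspond to the standard LSTM cell, which can be written as follows. This is the textbook formulation using the symbols defined in the text; the input $x_t$, the concatenation $[h_{t-1}, x_t]$, and the element-wise product $\odot$ are notational assumptions, and the grouping into Equations 23-27 may differ.

```latex
\begin{aligned}
f_t &= \sigma\left( W_f \, [h_{t-1}, x_t] + b_f \right), \\
i_t &= \sigma\left( W_i \, [h_{t-1}, x_t] + b_i \right), \\
\tilde{c}_t &= \tanh\left( W_c \, [h_{t-1}, x_t] + b_c \right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
o_t &= \sigma\left( W_o \, [h_{t-1}, x_t] + b_o \right), \\
h_t &= o_t \odot \tanh\left( c_t \right).
\end{aligned}
```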

Finite Markov Decision Process-Based Scheduling Model
After the temporal environment information is extracted, the agent completes the scheduling of the ADN by making a sequence of decisions on DERs. We construct the ADN scheduling problem as an FMDP. The details of the FMDP formulation are described as follows.
1) State: the agent captures the real-time environment information. In this study, the environment information is divided into two parts: temporal information and instant information. The temporal information of loads and PVs is extracted by the LSTM module. The instant information consists of the real-time states of ESSs and CSs. Thus, the environment state $s_t$ combines the LSTM output $h_t$ with $z_t$, where $z_t$ indicates the feature information of ESSs and CSs as shown in Eq. 29.
2) Action: the agent selects the action to be executed according to the ADN state. Slow devices are usually scheduled in an offline manner due to their limited allowable daily switching times. Thus, to sufficiently absorb the PV power, the active and reactive output of ESSs and the response power of CSs are regarded as the action $a_t$.
3) Reward: the feedback value that the agent obtains from the environment after executing the control action. The substation power purchase cost $C^{\mathrm{sub}}_t$, ESS charging and discharging degradation cost $C^{\mathrm{ESS}}_t$, and CS response cost $C^{\mathrm{CS}}_t$ are taken as the feedback reward. Additionally, given the significance of the safe operation of ADNs, a voltage violation penalty is also considered in the reward $r_t$. Here, $D_t$ represents the penalty caused by voltage violation, quantifying the voltage deviation level in ADNs (Zhang Y. et al., 2021), and $\pi^{\mathrm{vol}}$ is a large penalty coefficient.
4) State-action value function: the total expected reward that the current policy $\pi$ can bring after executing the action $a_t$, denoted by $Q^{\pi}(s, a)$. Here, $\pi$ is the policy that maps a comprehensive state to a schedule plan, $K$ represents the horizon of time steps, and $\gamma$ indicates the discount rate, balancing future rewards and immediate rewards. The primary purpose of the ADN scheduling problem is to find the optimal policy $\pi^{*}$, which is equivalent to maximizing the state-action value function, i.e., $\pi^{*} = \arg\max_{\pi} Q^{\pi}(s, a)$.
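The reward and the discounted return in items 3) and 4) can be sketched as follows. The sketch assumes that the reward is the negative of the three costs plus $\pi^{\mathrm{vol}} D_t$ with a negative $\pi^{\mathrm{vol}}$, and that $D_t$ sums the amounts by which node voltages leave an allowed band; the 0.95-1.05 p.u. band and all function names are assumptions made for illustration.

```python
def voltage_deviation(voltages, u_min=0.95, u_max=1.05):
    """Sum of the amounts by which node voltages leave [u_min, u_max] (assumed band)."""
    return sum(max(0.0, u_min - u) + max(0.0, u - u_max) for u in voltages)


def reward(c_sub, c_ess, c_cs, d_t, pi_vol=-5000.0):
    """Negative operating cost plus the (negative) voltage violation penalty."""
    return -(c_sub + c_ess + c_cs) + pi_vol * d_t


def discounted_return(rewards, gamma):
    """Accumulate rewards backwards: G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


d_t = voltage_deviation([1.0, 0.94, 1.06])  # two nodes violate by 0.01 p.u. each
r_t = reward(10.0, 1.0, 2.0, d_t)
```

Because the penalty coefficient is several orders of magnitude larger than typical per-step costs, even a small deviation dominates the reward, which pushes the agent toward violation-free schedules.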

PROPOSED REAL-TIME SCHEDULING METHOD

Classic Deep Deterministic Policy Gradient
DDPG adopts a classic actor-critic architecture and realizes agent learning and training through four deep neural networks. The actor network $\mu(s|\theta^{\mu})$ and critic network $Q(s, a|\theta^{Q})$ realize the policy action and action evaluation, respectively. The target actor network $\mu'(s|\theta^{\mu'})$ is utilized to select an action $a_{j+1}$ for the state $s_{j+1}$ extracted from the replay buffer, and the target critic network $Q'(s, a|\theta^{Q'})$ is applied to calculate the state-action value function of the historical sample. The action of DERs can be expressed as:
$$a_t = \mu(s_t|\theta^{\mu}) + \mathcal{N}$$
where: $\mathcal{N}$ represents the exploration noise, which is usually the Ornstein-Uhlenbeck (OU) process. Unlike inverted pendulums and aircraft systems, the ADN is not a high-inertia system (Fujimoto et al., 2018). Thus, we adopt the Gaussian noise $\mathcal{N}(0, \sigma_t)$ instead of the OU process. The standard deviation of the noise decreases linearly to 0 as the training episode increases. The critic network can be updated by minimizing the loss function $L_Q$:
$$L_Q = \frac{1}{N_b} \sum_{j=1}^{N_b} \left( y_j - Q(s_j, a_j|\theta^{Q}) \right)^2$$
$$y_j = \begin{cases} r_j, & s_{j+1} \text{ is terminal} \\ r_j + \gamma Q'\left(s_{j+1}, \mu'(s_{j+1}|\theta^{\mu'}) \,\middle|\, \theta^{Q'}\right), & \text{otherwise} \end{cases}$$
where: $N_b$ represents the mini-batch size sampled from the replay buffer. $y_j$ is the target value. The parameters of the actor network can be updated based on the policy gradient, which can be expressed as:
$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N_b} \sum_{j=1}^{N_b} \nabla_a Q(s, a|\theta^{Q})\big|_{s=s_j,\, a=\mu(s_j)} \, \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})\big|_{s=s_j}$$
Then, the weights of the target networks are soft-updated via Eq. 39.
$$\begin{aligned} \theta^{\mu',k+1} &= \tau \theta^{\mu,k} + (1 - \tau)\,\theta^{\mu',k} \\ \theta^{Q',k+1} &= \tau \theta^{Q,k} + (1 - \tau)\,\theta^{Q',k} \end{aligned} \tag{39}$$
where: $k$ is the learning iteration. $\tau$ indicates the soft-update parameter, and $\tau \ll 1$.
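Three of the scalar rules used in this update cycle, namely the target value, the soft update, and the linearly decaying noise level, are easy to check numerically. The following is a minimal sketch with parameters held in plain lists; the function names, the initial noise level, and the example values of $\gamma$ and $\tau$ are hypothetical.

```python
def td_target(r_j, q_next, gamma, terminal):
    """Target value y_j: the reward alone at terminal states, bootstrapped otherwise."""
    return r_j if terminal else r_j + gamma * q_next


def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward the online parameter."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online_params, target_params)]


def noise_std(episode, max_episodes, sigma0):
    """Standard deviation of the Gaussian exploration noise, decaying linearly to 0."""
    return sigma0 * max(0.0, 1.0 - episode / max_episodes)


y = td_target(1.0, 2.0, 0.9, terminal=False)
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.01)
```

With $\tau \ll 1$ the target networks trail the online networks slowly, which keeps the target value $y_j$ from shifting abruptly between iterations.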

Proposed Modified Strategies
The classic DDPG algorithm is widely applied in continuous action decision processing. Nevertheless, it has the following two significant shortcomings in practical application.
1) DDPG updates the network parameters with a fixed learning rate α, expressed as Eq. 40. A larger learning rate may lead to overlearning and affect the agent's stability, while a lower learning rate slows down the convergence speed.
2) Based on the experience replay buffer, the prioritized experience replay buffer refines the learning efficiency of the agent (Hou et al., 2017). In the early training stage, however, the samples with the larger deviations are frequently selected for training, which may cause the overfitting issue. The repeated training of such samples makes the agent fall into the locally optimal solution, and the agent's generalization ability is significantly reduced.
To address these shortcomings of the classic DDPG algorithm, we propose three improved strategies to enhance the basic agent: a learning rate decay strategy, a collaborative assistance policy, and a modified prioritized experience replay. The details of the proposed modified model are given below.

Learning Rate Decay
An exponential decay model is introduced to change the learning rate $\alpha$ appropriately and balance the exploration and exploitation abilities (Wang et al., 2022). The learning rate in each episode can be calculated by Eq. 41, where: $\alpha_0$ is the initial learning rate. $c_d$ indicates the decay rate. $n$ stands for the current training episode. $n_d$ is the decay step.
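A common exponential schedule built from these four quantities is $\alpha_n = \alpha_0 \, c_d^{\,n / n_d}$. The sketch below assumes this form, which may differ from the exact expression in Eq. 41; the decay rate and decay step in the example are hypothetical.

```python
def decayed_learning_rate(alpha0, decay_rate, episode, decay_step):
    """Exponential decay: alpha0 * decay_rate ** (episode / decay_step)."""
    return alpha0 * decay_rate ** (episode / decay_step)


# Initial rate 0.015 with hypothetical decay settings.
alpha_start = decayed_learning_rate(0.015, 0.5, 0, 100)
alpha_later = decayed_learning_rate(0.015, 0.5, 100, 100)
```

Under this form the learning rate is multiplied by $c_d$ every $n_d$ episodes, so early episodes take large exploratory steps while later episodes fine-tune the policy.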

Collaborative Assistance
Generally, agents acquire a certain scheduling ability for the ADN environment after a long training period. However, considering the importance of ADN security indicators, it is often difficult to trust agents in some critical decision-making scenarios. To this end, we propose the collaborative assistance mechanism to help the agent efficiently learn the coordinated control strategy. Specifically, we first generate $N_s$ scenes before the training and then capture the environment state $s_t$. The CPLEX solver is applied to calculate the optimal solution of the control variable, namely, the action $a_t$. Next, the reward $r_t$ and new state $s_{t+1}$ are obtained, and the above "successful" samples containing the optimal actions are placed in the replay buffer $D$. These pre-generated samples help the agent converge faster and prevent it from getting stuck in non-optimal strategies. In the training stage, successful samples generated by CPLEX and historical samples obtained from the FMDP are combined to form the mini-batch used to optimize the agent parameters.

Modified Prioritized Experience Replay
The main idea of the modified prioritized experience replay is to reconstruct the replay buffer $D$ and the mini-batch sampling method. Firstly, the replay buffer with a capacity $|D|$ is divided into two equal pools used to store successful and historical samples, respectively. The cooperative training on different samples speeds up the convergence while avoiding the local optimum problem. Secondly, successful and historical samples are sampled with different probabilities. The successful samples are extracted from the replay buffer with uniform probability, eliminating the correlation between different scenes. The historical samples are sampled with a priority specified according to the temporal-difference error (TD-error). The proportion of the two types of samples participating in training is shown in Eq. 42, where: $N_s$ and $N_h$ represent the numbers of successful and historical samples in the mini-batch. $\beta$ is the proportion parameter, which decreases linearly as the episode increases. In this way, the agent gradually accumulates high-quality historical samples to significantly reduce the possibility of voltage violation.
Table 1 demonstrates the training process of our proposed solution approach for solving the ADN scheduling problem described in Problem Formulation Section. In each episode, we first use LSTM to extract the temporal feature $h_t$ of PVs and loads, which is combined with the instant information $z_t$ to serve as the environment state $s_t$. The agent formulates the scheduling strategies of ESSs and CSs using the actor network $\mu(s|\theta^{\mu})$. Upon executing the action $a_t$, the agent obtains the reward $r_t$ and observes the new state $s_{t+1}$. The historical samples are accumulated via the above interactions and stored in the replay buffer $D$. Note that half of the replay buffer has already been filled with successful samples via the collaborative assistance policy. Then, a mini-batch is extracted based on the modified prioritized experience replay mechanism, and the network parameters are updated. After each 24-h scheduling is completed, the learning rate $\alpha$ decays exponentially, and the training proportion $\beta$ is also adjusted. The above steps are repeated until the maximum training episode is reached.
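The two-pool sampling described above can be sketched as follows. The sketch assumes that $\beta$ is the share of successful samples, so that $N_s = \beta N_b$ and $N_h = N_b - N_s$, and that historical samples are drawn with probability proportional to the absolute TD-error; both are illustrative readings of Eq. 42 and of the priority rule, and all names are hypothetical.

```python
import random


def split_batch(batch_size, beta):
    """Number of successful and historical samples in one mini-batch."""
    n_s = round(beta * batch_size)
    return n_s, batch_size - n_s


def sample_minibatch(successful, historical, td_errors, batch_size, beta, rng=random):
    """Uniform draw from the successful pool, TD-error-weighted draw from the historical pool."""
    n_s, n_h = split_batch(batch_size, beta)
    n_s = min(n_s, len(successful))
    picked_s = rng.sample(successful, n_s)  # uniform, without replacement
    weights = [abs(e) + 1e-6 for e in td_errors]  # small offset keeps every sample reachable
    picked_h = rng.choices(historical, weights=weights, k=n_h) if historical else []
    return picked_s + picked_h


rng = random.Random(0)
successful_pool = [("solver", k) for k in range(100)]
historical_pool = [("agent", k) for k in range(100)]
errors = [0.1 * (k + 1) for k in range(100)]
batch = sample_minibatch(successful_pool, historical_pool, errors, 64, 0.25, rng)
```

As $\beta$ shrinks over the episodes, the mini-batch shifts from solver-generated samples toward the agent's own experience.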

Case Study Setup
In this study, the performance of the proposed approach is illustrated using a modified IEEE 33-bus distribution system. The system consists of two PV plants at buses 14 and 24, two CSs at buses 7 and 32, and an ESS at bus 17. The capacity of each PV plant is 400 kWp, and their power generation characteristics are described by real-world data. The installed capacity of the ESS is 600 kWh, and the charging and discharging capacity limit is 250 kVA. The charging efficiency $\eta^{c}_i$ and discharging efficiency $\eta^{d}_i$ are set as 0.9. The lower and upper boundaries of the storage capacity are set as 0.1 and 0.9, respectively. The CSs are assumed to serve 200 EVs per day, wherein the configuration and operation data of EVs and CSs come from the Charging Bar (http://admin.bjev520.com). The electricity price for power loss is modeled by the time-of-use (TOU) price. The unit costs of the CS scheduling $\pi^{\mathrm{CS}}$ and ESS degradation $\pi^{\mathrm{ESS}}$ are set as 0.2 ¥/kWh and 0.06 ¥/kWh, respectively (Cui et al., 2020). The penalty coefficient $\pi^{\mathrm{vol}}$ for voltage violation is −5000. A workstation with an AMD Ryzen 9 3950X CPU and an NVIDIA GeForce RTX 2080 Ti GPU is used for the simulation.

Training Process
Let the simulation step length be 5 min, and the temporal data over the past 12 time steps are fed into the LSTM module. Table 2 details the parameters of the proposed method, and Figure 2 illustrates the obtained rewards under 1000 training episodes.
As attested by Figure 2, the agent learns from the ADN environment through trial and error, and the rewards oscillate noticeably in the initial stage. The solution process then converges steadily from the middle to the late stage. In particular, the initial learning rate is 0.015, so the agent is encouraged to explore the environment with a high probability in the first 30 episodes. Consequently, the rewards fluctuate noticeably, and the average reward in this stage is −201.36. From 30 to 300 episodes, the agent quickly learns successful samples via the collaborative assistance policy and accumulates a

Practical Application Results
Figures 3 and 4 separately exhibit the active power output and voltage amplitude distribution in the testing period. As attested by Figure 3, the well-trained agent can schedule the output of the ESS and CSs and cooperate with PVs to respond to the power demand of the ADN. Herein, the agent chooses to charge the ESS during 0:00-7:00 due to the low load level and TOU price. This plays a positive role in reducing the load peak-valley difference and network loss, and the average power loss is 58.06 kW in this period. At around 20:00, the operating pressure of the ADN is alleviated by reducing the charging load and adjusting the ESS discharging power. The final operating cost of the distribution network throughout the day is ¥31,936.62. Moreover, it can be seen from Figure 4 that the operating voltage of each node in the ADN is within the safe range. The minimum voltage is 0.9663 p.u., which appears on bus 18 at 11:35.

Numerical Comparison of Different Methods
To comprehensively evaluate the implementation effect of our method, DRL algorithms, including DQN, DDPG, and LSTM-DDPG (LDDPG), are considered for comparison. Figure 5 details the reward in each episode for the different DRL algorithms, and Figure 6 exhibits the cumulative costs of their online testing over 100 days. As depicted, although the DQN algorithm converges rapidly, it has relatively weak convergence and stability in dealing with decision-making problems with high-dimensional state and action spaces. The average convergence reward of DQN is −126.79. Owing to its capacity for coping with continuous action spaces, DDPG outperforms DQN in terms of convergence performance and stability. The LDDPG method initially shows the worst convergence performance, and its reward stabilizes at −117.78 after about 500 episodes. The proposed MLDDPG method improves the convergence performance using the three modified mechanisms, and its rewards are stable at −111.59, a 9.72% improvement over the classic DDPG algorithm. Moreover, MLDDPG also achieves excellent decision-making results in the online testing stage, reducing the operation cost by 18.89%.
Furthermore, we define the ADN voltage qualification rate $R^{\mathrm{vol}}$ as shown in Eq. 43, and the optimization comparison results are listed in Table 3. The prediction horizon, control horizon, and sampling time interval of the MPC algorithm are 1 h, 20 min, and 5 min, respectively. The DQN and DDPG methods improve the voltage distribution of the ADN, but they still suffer from voltage violation issues. Owing to the information perception ability of the LSTM module for PVs and residential loads, both the LDDPG and MLDDPG algorithms successfully restrict the node voltage within an acceptable range. Besides, the MLDDPG agent is capable of adapting to various environments via the collaborative assistance policy combined with the modified prioritized experience replay mechanism. Thus, the proposed method shows remarkable results in reducing power loss and operating cost, and its total operating cost is ¥32,203.17, which is 5.19% lower than that of LDDPG. The MPC can also cope with uncertainties based on rolling optimization, and its performance is close to that of LDDPG. In addition, the proposed method takes only 0.16 s to solve the scheduling scheme, which is much less than the 381.37 s required by CPLEX. Therefore, although the difference between the MLDDPG result and the optimal solution is 0.12%, the proposed method still achieves an excellent optimization decision-making effect while meeting the real-time scheduling requirements.
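One natural way to compute a voltage qualification rate such as $R^{\mathrm{vol}}$ is as the percentage of node-time samples whose voltage lies within the allowed band. The sketch below assumes this definition and a 0.95-1.05 p.u. band, neither of which is confirmed by Eq. 43; the function name and the sample values are hypothetical.

```python
def voltage_qualification_rate(voltages, u_min=0.95, u_max=1.05):
    """Percentage of node-time voltage samples inside [u_min, u_max].

    voltages is a list of rows, one row of per-unit node voltages per time step.
    """
    samples = [u for row in voltages for u in row]
    qualified = sum(1 for u in samples if u_min <= u <= u_max)
    return 100.0 * qualified / len(samples)


# Two time steps, two nodes each; one sample falls below the assumed band.
rate = voltage_qualification_rate([[1.0, 0.96], [0.94, 1.02]])
```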

Sensitivity Analysis
Furthermore, the influence of ESS and CS planning schemes and running states on the proposed model is analyzed. Figure 7 illustrates the influence of ESS capacity and CS capacity planning schemes on the ADN operation cost. As attested, the total cost of the ADN gradually decreases with the increase of ESS capacity. For every 100-kWh increase in the ESS capacity, the total operation cost of the ADN is reduced by ¥29.18. Meanwhile, for every 40-kW increase in the CS capacity, the total cost only decreases by ¥6.80. Notably, when the CS capacity is larger than 400 kW, there is little impact on the operating cost of the ADN, indicating that the CS capacity configuration far covers the EV charging demand.
Assuming that there are 2,000 vehicles in this area, Figure 8 shows the impact of the EV penetration rate and the charger power on the total cost of the ADN. As the EV penetration rate increases, the total cost increases gradually. For every 1% increase in the EV penetration rate, the operation cost of the ADN increases by ¥31.20. Increasing the charger power improves the carrying capacity of the CS but also increases the operation burden of the ADN. For every 1-kW increase in the charger power, the operation cost increases by ¥15.61. Note that the total cost remains stable once the EV penetration rate reaches a specific value. For example, when the charger power is 20 kW, the operation cost stabilizes at around ¥33,998.07 after the EV penetration rate reaches 20%, which means that the CS carrying capacity and dispatchable potential have reached their upper limits.
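The marginal figures reported above can be combined into a rough first-order estimate of how the operation cost responds to a planning change. The sketch below is only illustrative: it assumes the four effects are linear and additive, and it ignores the saturation regions noted in the text (CS capacity above 400 kW, and the penetration rate at which the cost plateaus). The function and constant names are hypothetical.

```python
# Marginal sensitivities as reported in the text (yuan per step).
ESS_YUAN_PER_100_KWH = -29.18   # cost falls as ESS capacity grows
CS_YUAN_PER_40_KW = -6.80       # cost falls as CS capacity grows
EV_YUAN_PER_PERCENT = 31.20     # cost rises with EV penetration rate
CHARGER_YUAN_PER_KW = 15.61     # cost rises with charger power


def estimate_cost_change(d_ess_kwh=0.0, d_cs_kw=0.0,
                         d_ev_pct=0.0, d_charger_kw=0.0):
    """First-order change in ADN operation cost (yuan).

    Assumes linear, independent effects; this is an approximation
    for illustration, not a result established by the paper.
    """
    return (ESS_YUAN_PER_100_KWH * d_ess_kwh / 100.0
            + CS_YUAN_PER_40_KW * d_cs_kw / 40.0
            + EV_YUAN_PER_PERCENT * d_ev_pct
            + CHARGER_YUAN_PER_KW * d_charger_kw)


# Example: add 200 kWh of ESS capacity and 80 kW of CS capacity.
delta = estimate_cost_change(d_ess_kwh=200, d_cs_kw=80)
print(f"Estimated cost change: {delta:.2f} yuan")  # -71.96
```

The comparison makes the asymmetry explicit: under these figures, two 100-kWh ESS steps save more than four times as much as two 40-kW CS steps.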

Scalability Performance
Finally, simulations are performed on a modified IEEE 123-bus test system to evaluate the scalability of the proposed method. As shown in Figure 9, the test system is modified by integrating 6 PV units, 3 ESSs, and 3 CSs. The parameter settings of each unit are the same as those in the Case Study Setup section. Table 4 lists the numerical results for the modified IEEE 123-bus test system, and Figure 10 shows the voltage distribution at peak power consumption.
Voltage violations occur when no control is applied, especially during peak power consumption. The uncontrolled mode also suffers from high network loss and operating cost due to the lack of coordination. Limited by the dimension of the environmental states, the DDPG algorithm makes only slight improvements in dealing with voltage violations. By contrast, the proposed method captures the temporal trends and high-dimensional features of DERs to cope with uncertainties and provides a basic state for the coordination of each unit. The total operating cost of the MLDDPG method is ¥25,813.86, which is 25.39% lower than that of the uncontrolled mode. The results demonstrate that the proposed MLDDPG method effectively improves economic performance and mitigates voltage violations, which validates its scalability in a larger system.
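The relative reduction quoted above follows the usual definition, (baseline − value) / baseline. As a small worked sketch, the snippet below defines that calculation and back-computes the uncontrolled cost implied by the two reported figures. The helper names are hypothetical, and the back-computed baseline is only an approximation subject to rounding; the actual value is the one listed in Table 4, which is not reproduced here.

```python
def percent_reduction(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (baseline - value) / baseline


def implied_baseline(value, reduction_pct):
    """Baseline that would yield `value` after a `reduction_pct` reduction."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    return value / (1.0 - reduction_pct / 100.0)


# Figures reported in the text for the modified IEEE 123-bus system.
mlddpg_cost = 25813.86       # yuan
reported_reduction = 25.39   # percent, relative to the uncontrolled mode

# Approximate uncontrolled cost implied by those two numbers.
uncontrolled_estimate = implied_baseline(mlddpg_cost, reported_reduction)
print(f"Implied uncontrolled cost: about {uncontrolled_estimate:.0f} yuan")
```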

CONCLUSION
Based on the LSTM and a modified DDPG algorithm, this paper proposes a novel DRL method for the coordinated scheduling of ADNs. Specifically, the LSTM is employed to capture the temporal information of DERs. Then, the extracted state features are fed into the modified DDPG to formulate the operation schedules for CSs and ESSs. Case studies are carried out on a modified IEEE 33-bus system embedded with PVs, ESSs, and CSs. The training and testing results show that the proposed MLDDPG method can not only keep voltages within the safe range but also reduce the economic cost of ADNs. The convergence performance and stability of the proposed method are also improved, with a converged reward 9.72% higher than that of the classic DDPG algorithm. Furthermore, a sensitivity analysis is performed, and the scalability of the proposed method is validated on a modified IEEE 123-bus test system. One future direction is to evaluate the sensitivity of the DRL training parameters and further enhance the robustness of the proposed method. In addition, slow devices will be considered for coordination with the proposed method to further improve the scalability of the scheduling model.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.