ORIGINAL RESEARCH article

Front. Energy Res., 19 September 2024

Sec. Smart Grids

Volume 12 - 2024 | https://doi.org/10.3389/fenrg.2024.1464151

Safe dynamic optimization of automatic generation control via imitation-based reinforcement learning

  • Zhiyuan Zhang 1

  • Yongjun Wu 1

  • Zhenkun Hao 1

  • Minghui Song 1

  • Peipei Yu 2*

  • 1. State Grid Beijing Electric Power Company, Beijing, China

  • 2. Engineering Research Center of Offshore Wind Technology Ministry of Education (Shanghai University of Electric Power), Shanghai, China

Abstract

Introduction:

The increasing penetration of distributed generation (e.g., solar power and wind power) in the energy market has caused unpredictable disturbances in power systems and accelerated the application of intelligent control, such as reinforcement learning (RL), in automatic generation control (AGC). However, traditional RL cannot ensure constraint safety during training and frequently violates the constraints (e.g., frequency limitations), further threatening the safety and stability of grid operation.

Methods:

To address the safety issue, we propose a novel safe RL framework that combines expert experiences with the RL controller to achieve imitation-based RL. This method allows an initialized safe policy by imitating expert experiences to prevent random explorations at the beginning. Specifically, we first formulate the AGC problem mathematically as a Markov decision process. Then, the imitation mechanism is developed atop a soft actor–critic RL algorithm.

Results and discussion:

Finally, numerical studies are conducted with an IEEE 39-bus network, which show that the proposed method satisfies the frequency control performance standard better and improves the RL training efficiency.

1 Introduction

Automatic generation control (AGC) is a fundamental part of a power system that is important for realizing system frequency stability and smoothing tie-line power among interconnected grids. Regional power grid dispatch centers are often required to achieve closed-loop correction control on area control errors (ACEs) based on real-time deviations. Generally, these ACEs are influenced by large fluctuations and uncertain photovoltaic outputs that decrease the power quality significantly. Many researchers have focused on different methods to handle these quality problems, such as AGC and demand-side resources. The control performance standard (CPS) for assessing AGC strategies was established in 1999 by the North American Electric Reliability Council (NERC) and focuses on the medium- and long-term stability performances of the system frequency as well as tie-line power. Therefore, an efficient AGC strategy is of great significance in improving the CPS and realizing the economical distribution of grids.

Generally, the time scale for AGC strategies is rather short and of the order of 2–8 s. These AGC strategies entail two control processes: (1) determination of the total power adjustment according to the observed system operating state; (2) allocation of this determined total power adjustment among the AGC units to correct the ACEs and minimize energy costs. At present, research on conventional AGC strategies has achieved fruitful results, such as proportional integral derivative (PID) control, model predictive control, and learning-based intelligent control. However, conventional AGC has a typical feedback delay that may lead to overregulation or underregulation when coordinating different AGC units (e.g., water and thermal power units). In addition, given the increasing penetration of wind generation, centralized grid connections of wind power can cause large amounts of minute-level power fluctuations. This further complicates AGC-based regulation and places a greater burden on real-time coordinated AGC. To cope with the increasing power fluctuations and hysteresis issues, the concept of dynamic optimization of AGC has been proposed, whose key idea is optimization of the AGC units in advance based on ultrashort-term forecasting of the future loads and renewables (e.g., wind power). Unlike economic dispatch (for the next 15 min) and conventional AGC (response within 2–8 s), dynamically optimized AGC is considered a middle process for optimizing the AGC units within 15 min at an optimization step of 1 min. The main advantage of AGC dynamic optimization is that it can effectively handle short-term fluctuations (within 15 min) caused by renewables because it takes into account the future load and renewables. Therefore, AGC dynamic optimization has significant impacts on power systems with stochastic renewables.

Generally, optimization programming is adopted as the most common approach to solve the AGC dynamic control problem using the probabilistic model of wind power, such as robust optimization. For instance, one study proposed a chance-constrained programming method to solve the dynamic dispatch of AGC units by combining the evolutionary programming algorithm with the point estimation method to solve the stochastic wind power model. Another study developed an improved multiobjective optimization model of AGC dispatch using the genetic algorithm to solve for the dispatch model; this work established an accurate dispatch model based on real-time data of the phasor measurement units. A third study used the model predictive control framework to effectively address real-time dispatch given the dynamic variations of AGC signals between adjacent dispatch intervals. A fourth study used robust optimization to address uncertain wind power information by converting it into boundary information of the prediction interval; then, a decentralized robust optimization method was proposed based on approximate dynamic programming to solve for the robust AGC dispatch model. However, all of the above works rely heavily on an accurate probability model of renewables, which is difficult to obtain in practice. Moreover, stochastic programming is usually non-convex owing to the uncertainty involved and is difficult to solve as it entails a large computational burden. Hence, the future fluctuations of renewables cannot be effectively considered in the AGC dispatch process.

Through the adoption of neural networks for uncertainty predictions, deep reinforcement learning (RL) has become increasingly popular for handling the AGC dynamic optimization problem as it is robust with stable convergence results. For instance, one study proposed a multiple-experience pool-replay-twin-delayed deep deterministic policy gradient to solve for AGC dispatch that effectively improved the training efficiency and action quality via four improvements, including the multiple-experience pool probability replay strategy. Another study designed a linear active disturbance rejection control scheme based on the tie-line bias control mode and solved the control problem using the soft actor–critic (SAC) RL algorithm. A third study adopted the proximal policy optimization RL algorithm to optimize power regulation among the AGC units in advance so as to ensure that the frequency characteristics could better satisfy the CPS under large fluctuations in power systems. However, given that online training interacts with real-world systems, any RL strategy must be trained extensively through trial and error before being considered intelligent. This means that some “bad” decisions may be made during training, some of which may cause critical frequency violations. This is unsafe and unacceptable for real-world AGC problems. Therefore, direct application of traditional RL methods is not ideal for coping with such critical constraints because the strategy involves learning with frequent constraint violations.

To address the limitations of conventional RL algorithms, we propose a safe RL framework to ensure that the critical constraints are satisfied during training. Generally, trial-and-error conditions occur during the initial stages of training because the initialized policy is random and not satisfactory. Hence, to avoid early random explorations in RL, we adopt imitation learning to train an initialized policy that is similar to expert experiences; this training is performed offline without interactions with real-world grids. Then, based on the imitated policy as the initialization, the SAC RL algorithm is used to further train an optimal AGC strategy online. The main contributions of this work are as follows: (1) the AGC dynamic optimization problem is formulated as a Markov decision process (MDP) to consider both the dispatch economy and CPS; (2) imitation learning based on expert experiences is designed on top of the traditional RL framework to prevent significant frequency violations during training; (3) the state-of-the-art SAC algorithm is adopted as it is model-free and can effectively cope with uncertainties from the short-term fluctuations of renewables.

The remainder of this manuscript is organized as follows. Section 2 introduces the AGC problem and its mathematical formulation as an MDP. Section 3 presents the imitation-based safe RL framework for solving the proposed MDP. Section 4 outlines the numerical studies conducted based on the proposed approach. Section 5 presents the conclusions of this study.

2 Problem statement and MDP formulation for AGC

2.1 System model and AGC problem statement

Figure 1 shows the conventional AGC dynamic optimization scheme for optimizing the regulated power of AGC units in steps of 1 min over a duration of 15 min based on the deviation information for system frequency, ACE, and tie-line power. Generation units in a power grid are of two types, namely AGC and non-AGC units, whose power outputs are denoted as and , respectively. Here, and are both the indices of the AGC and non-AGC units. In this work, the AGC units participate in both primary and secondary frequency control, while the non-AGC units only provide primary frequency control. Hence, the dynamic power outputs can be calculated as follows:where and are the frequency deviations at time and , respectively; is the power adjustment of the AGC unit at time for secondary frequency control; and are the frequency regulation constants of the AGC unit and non-AGC unit , respectively. Here, Equation 2 includes two parts, which are the primary frequency control power of the AGC unit and power increment of the AGC unit at each optimization time. Thus, the AGC units respond to the uncertain power fluctuations in the grid through power regulation at time .

FIGURE 1

From Figure 1, we see that the AGC strategy requires three dynamic system parameters as inputs: frequency , ACE , and tie-line power . Hence, their system dynamics are as follows:where is the equivalent inertial constant; is the equivalent damping coefficient; , and are power adjustments of wind, generation, and demands at time t; is the frequency regulation constant of the control system in megawatts per 0.1 Hz (positive value); and are the tie-line synchronization coefficient and frequency deviation of the connected -region, respectively; is the tie-line power outflow that is considered to be positive; is the tie-line power deviation. Equations 1–5 describe the system dynamics of key variables.
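As a rough illustration of how the ACE input combines the tie-line and frequency deviations, the sketch below assumes the common tie-line bias form, ACE = ΔP_tie + 10·B·Δf, with the frequency regulation constant B taken as positive and expressed in MW per 0.1 Hz. The function name and the sample numbers are hypothetical, and sign conventions differ between operators, so this should not be read as the exact formulation used in this work.

```python
def area_control_error(delta_p_tie_mw, delta_f_hz, bias_mw_per_0p1hz):
    """Tie-line bias ACE in MW (assumed form; B > 0 in MW per 0.1 Hz)."""
    # The factor 10 converts the bias from MW/0.1 Hz to MW/Hz.
    return delta_p_tie_mw + 10.0 * bias_mw_per_0p1hz * delta_f_hz


# Hypothetical operating point: 5 MW excess export and a 0.02 Hz frequency dip.
ace = area_control_error(delta_p_tie_mw=5.0, delta_f_hz=-0.02, bias_mw_per_0p1hz=38.0)
print(f"ACE = {ace:.2f} MW")
```

With these illustrative inputs, the frequency term outweighs the tie-line term, so the resulting ACE is negative and calls for upward regulation.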

In the present work, our control objective is to schedule the AGC units so as to satisfy both the minimum economic cost of auxiliary services as well as stability and safety of the CPS. Therefore, the objective can be expressed mathematically as follows:where is the auxiliary service cost coefficient of the AGC unit ; is the climbing power at time of the AGC unit ; indicates the change in the direction of the power output, where denotes a decrease, denotes an increase, and denotes that there is no change.

The assessment indexes of the CPS include CPS1 and CPS2. Here, CPS1 is defined to evaluate the correlation between the system frequency deviation and ACE, while CPS2 is defined as the average ACE over 15 min, indicating that the ACE is maintained within a tolerance range to ensure that the power exchanged between the regions does not exceed the specified limits. The detailed definitions of CPS1 and CPS2 are as follows:where is the time interval; is the rated frequency; represents the frequency control target, which is usually taken as the root mean-squared (RMS) value of the mean frequency deviation of 1 min in the previous year. In practice, when CPS1 satisfies the condition , CPS2 will not be assessed (; otherwise, CPS2 must be less than the following threshold:where is the RMS value of the mean frequency deviation over 15 min in the previous year; is the equivalent frequency regulation constant for the entire interconnection power grid. If CPS1 is suitably satisfied in this work, CPS2 will not be considered an objective or a hard constraint.
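To make the CPS1 index more concrete, the following sketch assumes the widely used NERC-style definition, CPS1 = (2 − CF) × 100%, where the compliance factor CF is the average of ACE·Δf/(10·B) divided by the squared frequency control target. The function name, the positive-B convention, and the sample values are assumptions for illustration only.

```python
import numpy as np


def cps1_percent(ace_mw, delta_f_hz, bias_mw_per_0p1hz, eps1_hz):
    """CPS1 in percent from 1-min averages of ACE and frequency deviation (assumed form)."""
    ace = np.asarray(ace_mw, dtype=float)
    df = np.asarray(delta_f_hz, dtype=float)
    # Compliance factor: mean correlation of ACE and frequency deviation,
    # normalized by the bias (MW/Hz) and the squared frequency control target.
    compliance = np.mean(ace * df / (10.0 * bias_mw_per_0p1hz)) / eps1_hz ** 2
    return (2.0 - compliance) * 100.0


# Hypothetical 1-min averages: ACE and frequency deviation share the same sign,
# so the area is aggravating the frequency error and CPS1 drops below 200%.
cps1 = cps1_percent([3.8], [0.01], bias_mw_per_0p1hz=38.0, eps1_hz=0.1)
print(f"CPS1 = {cps1:.2f}%")
```

Under this definition, an area whose ACE has the opposite sign to the frequency deviation is helping the interconnection, which pushes CPS1 above 200%.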

The operational constraints for AGC dynamic optimization include system power balance, AGC unit regulation characteristics, and limits for the frequency and tie-line power deviations. The basic power balance constraint must satisfy the conditionwhere and are the forecast values of the wind power and loads for period ; is the line transmission loss.

As shown in Figure 1, the saturation function and ramp rate limiter will take effect on the control signals before being executed. Specifically, the power output and ramp power constraints of AGC units are defined as follows:where and are the upper and lower limits of the output power of the -th AGC unit, respectively; and are the upper and lower limits of the ramp power of the -th AGC unit, respectively. Moreover, the limitations on the frequency deviation and tie-line power are as follows:where and are the corresponding upper and lower limits of the system frequency deviation, respectively; and are upper and lower limits of the tie-line power, respectively.
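The effect of the ramp rate limiter and the saturation function on a commanded adjustment can be sketched as two successive clipping operations. The function name and the sample operating point are hypothetical; the limit values are taken from the test settings in Table 1, and the order of the two blocks is an assumption.

```python
import numpy as np


def limit_action(p_prev, delta_p_cmd, p_min, p_max, ramp_min, ramp_max):
    """Apply ramp-rate and output-power limits to a commanded power adjustment."""
    delta_p = np.clip(delta_p_cmd, ramp_min, ramp_max)   # ramp-rate limiter
    p_new = np.clip(p_prev + delta_p, p_min, p_max)      # output saturation
    return p_new, p_new - p_prev                         # new output, executed adjustment


p_prev = np.array([790.0, 500.0, 1000.0])    # hypothetical current outputs (MW)
delta_cmd = np.array([30.0, -60.0, 20.0])    # hypothetical commanded adjustments (MW)
p_new, executed = limit_action(
    p_prev, delta_cmd,
    p_min=200.0, p_max=np.array([800.0, 860.0, 1100.0]),
    ramp_min=-45.0, ramp_max=45.0,
)
print(p_new, executed)
```

In this example the first unit is capped by its upper output limit, the second by its ramp limit, and the third passes through unchanged.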

2.2 MDP formulation

In this work, the MDP is a mathematical framework used to model the AGC dynamic optimization problem as a sequential decision-making process, as shown in Figure 2. The MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $\mathcal{P}$ is a transition model that gives the probability of moving from one state to another when given an action, $\mathcal{R}$ is a reward function that provides the immediate reward for state transitions, and $\gamma$ is a discount factor that determines the importance of future rewards. In the MDP, the AGC strategy is considered an agent that observes the grid operating state $s_t$ and outputs an action $a_t$ at each time step $t$. Then, the agent will receive an immediate reward $r_t$. The sequential continuous experience is then recorded as $(s_t, a_t, r_t, s_{t+1})$. The objective of the agent is to maximize its cumulative reward by iteratively updating its policy $\pi$: $\mathcal{S} \rightarrow \mathcal{A}$, which is given by

$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right] \tag{15}$$
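The cumulative reward that the agent maximizes can be illustrated with a short helper that accumulates discounted rewards backward over one episode. The function name and the reward sequence are hypothetical; the default discount factor of 0.95 matches the agent settings reported in Section 4.1.

```python
def discounted_return(rewards, gamma=0.95):
    """Sum of gamma**t * r_t over one episode, accumulated from the last step backward."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical three-step episode with a unit reward at each step.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(example_return)
```

A discount factor close to 1 makes the agent weigh the later minutes of the 15 min control period almost as heavily as the current step.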

FIGURE 2

Based on the objectives and constraints defined in Equations 6–14, we further present a well-designed MDP formulation. In this work, the control variables are the regulation direction and regulation power of each AGC unit. To simplify the action space scale using smaller action dimensions, the action is defined as the power adjustment of the AGC unit:where denotes the set of AGC units. Before execution, all actions are subjected to the saturation and ramp limits shown in Figure 1.

The design of the state space must capture necessary information based on two aspects: (1) conditions of the current operating system; (2) uncertain environments that must be forecast. For the former, we take into account four factors, including the current power outputs of the AGC units , frequency deviation , tie-line power deviation , and ACE value . For uncertain environments, we only consider wind power forecasting in this work because the load fluctuations within 15 min are usually negligible. Hence, we introduce historical wind power information to the state space for better forecasting. The system state is defined aswhere is the period of historical observations; denotes the difference between successive wind power outputs.

The reward design for the MDP should consider both objectives and constraints. Here, we design the reward function based on three aspects as follows:where , , and are weight factors that balance the tradeoffs between the three subreward items; is the economy objective defined in Equation 6. Note that the presence of too many items will complicate the design of the corresponding weights and lead to convergence failure. Hence, we design the third item as a penalty for the total violations of the upper/lower limitations, such as those of the output power, ramp power, tie-line power, and frequency in Equations 11–14. The penalty term is formulated aswhere represents the index of the constraints defined in Equations 11–14, i.e., ; is a linear rectification function for measuring the violations; and represent the actual index value and required limit value, respectively. Taking the constraint in Equation 13 as an example, and can be defined as and , respectively; here, we separate the original Equation 13 into two inequalities as and , which are expressed using the same structure. Hence, Equations 16–19 show our design of the four key factors in the MDP formulation.
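The penalty item described above can be sketched as a sum of rectified violations, with each two-sided limit split into an upper and a lower inequality. The function names and sample values are hypothetical, and no normalization between quantities with different units is applied because the source does not specify one.

```python
def relu(x):
    """Linear rectification: returns the violation amount, or 0 if the limit holds."""
    return max(0.0, x)


def violation_penalty(values, lower_limits, upper_limits):
    """Total violation of the upper and lower limits over all constrained quantities."""
    return sum(
        relu(value - upper) + relu(lower - value)
        for value, lower, upper in zip(values, lower_limits, upper_limits)
    )


# Hypothetical step: frequency deviation of 0.076 Hz (limits of +/-0.05 Hz) and
# tie-line power deviation of -35 MW (limits of +/-30 MW).
penalty = violation_penalty([0.076, -35.0], [-0.05, -30.0], [0.05, 30.0])
penalty_reward = -20.0 * penalty  # weight of -20 as reported in the agent settings
print(penalty, penalty_reward)
```

Because the rectifier returns zero whenever a limit is respected, the penalty only shapes the reward in steps where at least one constraint is violated.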

3 Imitation-based SAC for solving the MDP formulation

Figure 3 depicts the framework for the proposed imitation-based RL, which introduces imitation learning to the conventional RL scheme to improve the initialized random policy. We introduce the behavioral-cloning-based imitation learning and SAC RL algorithm separately in the following subsections.

FIGURE 3

3.1 Imitation learning based on behavioral cloning

Behavioral cloning (BC) is a common method for implementing imitation learning, where the demonstrator (i.e., expert experiences) can be imitated directly without interacting with a real-world environment. The key idea of BC is to replicate the expert policy using a classifier or regressor based on previously collected training data from the encountered states and demonstrator actions. Therefore, BC-based imitation learning can be used in an MDP framework without defining a reward function. The learning objective for the agent here is to obtain an imitation policy as the initial policy, i.e., , which is necessary to behave like an expert. Here, we adopt the variable to represent the expert demonstrations.

Given the precollected set of state–action pairs $\mathcal{D}^{E} = \{(s_i, a_i)\}_{i=1}^{N}$, the objective of the agent is to seek an imitation policy $\pi_{\phi}$ that best matches the provided set of state–action pairs. The policy network parameter $\phi$ is then updated using maximum-likelihood estimation, i.e., the optimal $\phi^{*}$ is defined as

$$\phi^{*} = \arg\max_{\phi} \sum_{(s_i, a_i) \in \mathcal{D}^{E}} \log \pi_{\phi}(a_i \mid s_i) \tag{20}$$

Considering that the designed action space is continuous, we assume that the policy follows a Gaussian distribution over each action dimension. In this work, we adopt a neural network to approximate the policy and use the same network structure as that of the actor in the RL framework. Then, the Adam stochastic gradient descent optimizer is adopted to solve for $\phi^{*}$ in Equation 20, where the gradient descent approach aims to find changes in $\phi$ that can increase the accuracy of each imitated demonstrator action based on the imitation policy $\pi_{\phi}$. The pseudocode for this process is summarized in Algorithm 1.

Algorithm 1

  • 1: Initialize as a random policy network

  • 2: Define the loss function and Adam optimizer

  • 3: Collect expert demonstration data

  • 4: Preprocess the data and split them into training and validation sets

  • 5: for each policy imitation epoch do

  • 6:  for each batch in the training data do

  • 7:   Get the batch state–action pairs

  • 8:   Forward pass the process

  • 9:   Compute loss, backward pass, and optimize the parameter

  • 10:  end for

  • 11:  Validate the policy on the validation dataset

  • 12:  Output the validation loss for the current epoch

  • 13: end for

  • 14: Test on the test dataset and output the test results

Behavioral-cloning-based imitation learning.
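The loop in Algorithm 1 can be illustrated with a deliberately small, dependency-light sketch. It assumes a one-dimensional linear Gaussian policy with fixed unit standard deviation instead of a neural network, full-batch plain gradient descent instead of mini-batch Adam, and a synthetic proportional expert; all names, data, and hyperparameters here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=(200, 1))   # synthetic normalized frequency deviations
actions = -20.0 * states                          # hypothetical proportional expert (MW)

split = int(0.8 * len(states))                    # 80% training, 20% validation
s_train, a_train = states[:split], actions[:split]
s_val, a_val = states[split:], actions[split:]


def gaussian_nll(w, b, s, a, sigma=1.0):
    """Mean negative log-likelihood of actions under N(w * s + b, sigma**2)."""
    mu = s * w + b
    return float(np.mean(0.5 * ((a - mu) / sigma) ** 2 + np.log(sigma) + 0.5 * np.log(2 * np.pi)))


w, b, lr = 0.0, 0.0, 0.5
for epoch in range(200):
    err = (s_train * w + b) - a_train             # forward pass and residual
    w -= lr * float(np.mean(err * s_train))       # gradient of the NLL with respect to w
    b -= lr * float(np.mean(err))                 # gradient of the NLL with respect to b

val_loss = gaussian_nll(w, b, s_val, a_val)
print(f"w = {w:.3f}, b = {b:.3f}, validation NLL = {val_loss:.4f}")
```

With a fixed standard deviation, maximizing the Gaussian likelihood reduces to least-squares regression on the expert actions, which is why the learned gain approaches the expert gain in this toy setting.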

3.2 SAC algorithm

To solve for the optimal policy in Equation 15, we adopt the SAC algorithm in this work to maximize the agent’s cumulative rewards while satisfying the safety constraints. Compared with conventional RL algorithms, SAC uses a stochastic policy that inherently encourages exploration by adding entropy to the reward. Hence, SAC is less likely to be stuck in local optima and can better explore the action space. Moreover, incorporating policy learning with entropy regularization helps the agent to become more stable during training. The entropy term prevents the policy from becoming too deterministic too early, thereby providing a more balanced and robust learning process. The SAC objective is given as

$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t}\left(r_t + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\right)\right] \tag{21}$$

where $\alpha$ is the hyperparameter that balances the importance of the entropy term $\mathcal{H}$ with the reward $r_t$. The entropy is calculated as follows:

$$\mathcal{H}(\pi(\cdot \mid s_t)) = -\mathbb{E}_{a_t \sim \pi}\left[\log \pi(a_t \mid s_t)\right] \tag{22}$$

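For a given policy $\pi$, the state–action value function $Q^{\pi}(s_t, a_t)$ is defined to evaluate the expected values of the pair $(s_t, a_t)$ at time step $t$ to guide policy learning and optimization. Generally, a larger $Q$ value indicates better policy control performance. The $Q$ function is defined using the Bellman equation as follows:

$$Q^{\pi}(s_t, a_t) = r_t + \gamma\, \mathbb{E}_{s_{t+1}, a_{t+1}}\left[Q^{\pi}(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \mid s_{t+1})\right] \tag{23}$$

To improve the stability and accuracy of the estimated $Q$ values, the SAC algorithm incorporates two $Q$ networks to mitigate the overestimation bias that can occur in the learning process. We use $Q_{\theta_1}$ and $Q_{\theta_2}$ to denote the two $Q$ networks with parameters $\theta_1$ and $\theta_2$ separately. Then, each $Q$ network is updated by minimizing the mean-squared error (MSE) between the current and target $Q$ values. The loss for each $Q$ network is defined as

$$J_Q(\theta_i) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}}\left[\left(Q_{\theta_i}(s_t, a_t) - y_t\right)^{2}\right] \tag{24}$$

$$y_t = r_t + \gamma\left(\min_{j=1,2} Q_{\bar{\theta}_j}(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \mid s_{t+1})\right) \tag{25}$$

where $\bar{\theta}_1$ and $\bar{\theta}_2$ are the parameters of the two target networks $Q_{\bar{\theta}_1}$ and $Q_{\bar{\theta}_2}$, respectively. For each iteration, the parameters of each network are updated using gradient descent computed from the above loss functions:

$$\theta_i \leftarrow \theta_i - \lambda_Q \nabla_{\theta_i} J_Q(\theta_i) \tag{26}$$

$$\bar{\theta}_i \leftarrow \tau \theta_i + (1 - \tau)\bar{\theta}_i \tag{27}$$

where $\lambda_Q$ is the learning rate of the $Q$ network; $\tau$ is a hyperparameter that controls the update rate of each target network based on the moving average value. From Equations 20–27, we can effectively achieve the policy iteration update in the SAC algorithm.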
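The two ingredients of the critic update described above, the target value built from the smaller of the two target-network estimates and the moving-average update of the target networks, can be sketched with plain NumPy. The function names, the temperature value of 0.2, and the sample numbers are hypothetical; terminal-state handling is omitted for brevity.

```python
import numpy as np


def soft_td_target(reward, q1_next, q2_next, log_prob_next, gamma=0.95, alpha=0.2):
    """Entropy-regularized target: r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    return reward + gamma * (np.minimum(q1_next, q2_next) - alpha * log_prob_next)


def polyak_update(target_params, online_params, tau=0.01):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


# Hypothetical transition: the smaller target estimate (-12) is used to curb overestimation.
target = soft_td_target(reward=-1.0, q1_next=-10.0, q2_next=-12.0, log_prob_next=-1.5)
new_target_params = polyak_update([np.array([0.0])], [np.array([1.0])], tau=0.01)
print(target, new_target_params)
```

A small smoothing factor such as 0.01 keeps the target networks changing slowly, which stabilizes the regression targets seen by the two critics.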

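For the policy network $\pi_{\phi}$, the newly updated policy at each iteration is improved using the information projection defined in terms of the Kullback–Leibler (KL) divergence $D_{\mathrm{KL}}$. Specifically, for each state, the policy is updated as follows:

$$\pi_{\mathrm{new}} = \arg\min_{\pi'} D_{\mathrm{KL}}\left(\pi'(\cdot \mid s_t)\,\middle\|\,\frac{\exp\left(\frac{1}{\alpha} Q^{\pi_{\mathrm{old}}}(s_t, \cdot)\right)}{Z^{\pi_{\mathrm{old}}}(s_t)}\right) \tag{28}$$

where $Z^{\pi_{\mathrm{old}}}(s_t)$ is the normalization item that does not influence the policy gradient calculation. Based on the projection, the new policy has a higher value than the old one, subject to the maximum entropy objective. Furthermore, we can rewrite the gradient of the stochastic policy using a noise vector $\epsilon_t$, which is added to the action as $a_t = f_{\phi}(\epsilon_t; s_t)$. The expected KL divergence in Equation 28 can be rewritten as

$$J_{\pi}(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}}\left[\alpha \log \pi_{\phi}\left(f_{\phi}(\epsilon_t; s_t) \mid s_t\right) - Q_{\theta}\left(s_t, f_{\phi}(\epsilon_t; s_t)\right)\right] \tag{29}$$

where $\pi_{\phi}$ is defined implicitly in terms of $f_{\phi}$. Then, the policy gradient for Equation 29 is approximated as

$$\hat{\nabla}_{\phi} J_{\pi}(\phi) = \nabla_{\phi}\, \alpha \log \pi_{\phi}(a_t \mid s_t) + \left(\nabla_{a_t} \alpha \log \pi_{\phi}(a_t \mid s_t) - \nabla_{a_t} Q_{\theta}(s_t, a_t)\right) \nabla_{\phi} f_{\phi}(\epsilon_t; s_t) \tag{30}$$

where $a_t$ is evaluated after adding noise as $f_{\phi}(\epsilon_t; s_t)$. This method can be easily extended from the deterministic policy gradient to any tractable stochastic policy. Finally, the policy updates itself through the learning rate $\lambda_{\pi}$ as

$$\phi \leftarrow \phi - \lambda_{\pi} \hat{\nabla}_{\phi} J_{\pi}(\phi) \tag{31}$$

Note that in the SAC approach, the update rule for the temperature parameter $\alpha$ involves minimizing a specific objective function to ensure that the entropy of the policy remains at a desired level. Hence, the objective function for $\alpha$ is designed to minimize the difference between the current policy entropy and a target entropy $\bar{\mathcal{H}}$. The loss function for $\alpha$ is given by

$$J(\alpha) = \mathbb{E}_{a_t \sim \pi_{\phi}}\left[-\alpha \log \pi_{\phi}(a_t \mid s_t) - \alpha \bar{\mathcal{H}}\right] \tag{32}$$

The gradient of the loss function with respect to $\alpha$ is further calculated as

$$\nabla_{\alpha} J(\alpha) = \mathbb{E}_{a_t \sim \pi_{\phi}}\left[-\log \pi_{\phi}(a_t \mid s_t) - \bar{\mathcal{H}}\right] \tag{33}$$

$$\alpha \leftarrow \alpha - \lambda_{\alpha} \nabla_{\alpha} J(\alpha) \tag{34}$$

where $\lambda_{\alpha}$ is the learning rate of $\alpha$; $\bar{\mathcal{H}}$ is usually given as a hyperparameter according to the specific task or desired level of exploration. Equations 30–34 give the gradient calculation of the hyperparameter $\alpha$ and policy network parameters. The pseudocode for the SAC algorithm is summarized as Algorithm 2.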

Algorithm 2

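  • 1: Initialize the parameters $\theta_1$, $\theta_2$ for the two $Q$ networks and $\phi$ for the policy network

  • 2: Copy the target network weights $\bar{\theta}_1 \leftarrow \theta_1$, $\bar{\theta}_2 \leftarrow \theta_2$

  • 3: Initialize an empty replay pool as $\mathcal{D} \leftarrow \emptyset$

  • 4: for each iteration do

  • 5:  for each environment step do

  • 6:   Sample the action from the policy as $a_t \sim \pi_{\phi}(a_t \mid s_t)$

  • 7:   Sample the transition from the environment as $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$

  • 8:   Store the transition in the replay pool as $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_t, a_t, r_t, s_{t+1})\}$

  • 9:  end for

  • 10:  for each gradient step do

  • 11:   Update the Q function parameters as $\theta_i \leftarrow \theta_i - \lambda_Q \hat{\nabla}_{\theta_i} J_Q(\theta_i)$ for $i \in \{1, 2\}$

  • 12:   Update the policy weights as $\phi \leftarrow \phi - \lambda_{\pi} \hat{\nabla}_{\phi} J_{\pi}(\phi)$

  • 13:   Adjust the temperature as $\alpha \leftarrow \alpha - \lambda_{\alpha} \hat{\nabla}_{\alpha} J(\alpha)$

  • 14:   Update the target network weights as $\bar{\theta}_i \leftarrow \tau \theta_i + (1 - \tau)\bar{\theta}_i$ for $i \in \{1, 2\}$

  • 15:  end for

  • 16: end for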

Soft actor–critic algorithm.
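The temperature adjustment in step 13 of Algorithm 2 can be sketched as a single gradient step that raises the temperature when the policy entropy falls below the target and lowers it otherwise. The function name, the direct update of the temperature (many implementations optimize its logarithm instead), the clamp at zero, and the target entropy of −3 (the negative action dimension for three AGC units, a common heuristic) are assumptions for illustration.

```python
import numpy as np


def temperature_step(alpha, log_probs, target_entropy, lr=0.001):
    """One gradient-descent step on J(alpha) = E[-alpha * (log pi + target_entropy)]."""
    grad = -float(np.mean(np.asarray(log_probs, dtype=float) + target_entropy))
    return max(alpha - lr * grad, 0.0)  # clamp added here as a guard against negative values


# Hypothetical batch: the entropy estimate is 2.0 (mean log-probability of -2.0),
# which is above the target of -3.0, so the temperature is reduced.
new_alpha = temperature_step(alpha=0.2, log_probs=[-1.0, -3.0], target_entropy=-3.0)
print(new_alpha)
```

Lowering the temperature in this case reduces the entropy bonus in the objective, so the policy is allowed to become more deterministic as training progresses.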

4 Case studies and discussion

To demonstrate the effectiveness of the proposed method, we present the results and analysis based on tests with the modified IEEE 39-bus system.

4.1 Test system settings

The system settings consist of two aspects, which are the physical grid environment and RL agent. For the environment settings, the single-line diagram of the modified IEEE 39-bus system is shown in Figure 4, which includes three AGC units and seven non-AGC units. The three AGC units are installed at buses 31, 38, and 39; a wind farm of 300 MW capacity is installed at bus 39, and an external power grid is connected through a tie-line at bus 29. The parameter settings for the power system are listed in Table 1. Note that the control period and control step are set as 15 min and 1 min in this work, and the initialized deviations of the system frequency and tie-line power are assumed to be 0. The wind fluctuations were obtained from the New England power grid, and the loads were assumed to be the same over the 15 min duration because load fluctuations are usually smooth compared to wind power fluctuations. Figure 5 shows the minute-level fluctuations of wind power in three random periods; it is seen that the stochastic wind power changes heavily even in adjacent time steps, with a maximum fluctuation of over 10 MW.

FIGURE 4

TABLE 1

| Description | Value |
|---|---|
| Rated frequency | 50 Hz |
| Auxiliary service cost coefficient | 0.5 $/kWh |
| Upper output power of the AGC units | 800/860/1,100 MW |
| Lower output power of the AGC units | 200 MW |
| Frequency regulation constant | 25 |
| Tie-line synchronizing coefficient | 0.5 |
| Initial tie-line power | 200 MW |
| Target bound of 1-min and 15-min average frequency error | 0.4 and 0.021 Hz |
| Equivalent frequency regulation constants | 38 and 50 MW/0.1 Hz |
| Limits of the tie-line power deviation | −30 and 30 MW |
| Limits of the frequency deviation | −0.05 and 0.05 Hz |
| Limits of the ramp power | −45 and 45 MW/min |

Test system settings.

FIGURE 5

For the agent settings, the discount factor is set as 0.95 because the current power adjustment of the AGC units significantly impacts the future operating state of the system. The neural network structures of all the actor–critic networks are the same (i.e., , , , , and ) and comprise two hidden layers of size The smoothing factor for the two target networks is designed as = 0.01. Adam optimizer was adopted for gradient optimization with the learning rate = = 0.001. The three weight factors are set as = -0.05, = -1, and = -20. The replay buffer size is set to 100,000. The introduced noise follows a Gaussian distribution of the form . The simulations were implemented in Python using an Intel Core i7 CPU @3.0 GHz and 16 GB memory.

In this work, we evaluated three benchmarks to validate the benefits of the proposed method: 1) proposed imitation-based SAC strategy denoted as ISAC; 2) traditional SAC strategy that uses the random initialization policy; 3) a classical proportional integral (PI) strategy. The PI strategy regulates the AGC units in proportion to the system frequency deviations. The following subsections show the results of both the training and control processes.

4.2 Offline and online training performances

The training process of the RL agent involves two stages: offline imitation learning and online RL training. The offline dataset is split into two subsets by random sampling, where 80% of the data are used for training and the remaining 20% are reserved for testing (i.e., validation). For the imitation network, the input data are the observed system state, and the label is the power regulation of the AGC units based on the classical PI strategy. This means that the final converged imitation policy is similar to the PI strategy. Figures 6, 7 show the imitation learning curves through the root mean-squared error (RMSE) and mean absolute error (MAE) indices. The black line represents the loss of the test dataset, and the red line represents the loss of the training dataset. It is observed that the training of the imitation network converges efficiently after approximately 200 epochs. Although the test loss is larger than the training loss initially, the converged strategy achieves training and test losses that are both less than 0.01. The training convergence is achieved at 10.04 s. These results indicate that the imitation learning process is both fast and stable, making it feasible for practical implementation.
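The random 80/20 split and the two error indices used for the imitation learning curves can be sketched as follows. The function names, the seed, and the sample values are hypothetical.

```python
import numpy as np


def random_split(n_samples, train_fraction=0.8, seed=0):
    """Return shuffled index arrays for the training and test subsets."""
    indices = np.random.default_rng(seed).permutation(n_samples)
    cut = int(train_fraction * n_samples)
    return indices[:cut], indices[cut:]


def rmse(pred, target):
    """Root mean-squared error between predicted and expert actions."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def mae(pred, target):
    """Mean absolute error between predicted and expert actions."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(np.abs(diff)))


train_idx, test_idx = random_split(10)
print(len(train_idx), len(test_idx), rmse([1, 2, 3], [1, 2, 5]), mae([1, 2, 3], [1, 2, 5]))
```

RMSE weights large imitation errors more heavily than MAE, so reporting both gives a view of typical and worst-case mismatches with the expert.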

FIGURE 6

FIGURE 7

To demonstrate the convergence of the proposed ISAC method, Figure 8 shows the cumulative reward of each episode (blue line) and the moving average reward (orange line). The episode reward increases from approximately −55 initially to approximately −15 after more than 2,000 episodes, which implies successful convergence. Note that the cumulative reward of each episode still oscillates even late in training. This is because the wind power fluctuations are uncertain and vary widely over different periods, so the optimal cost changes from episode to episode. However, the moving average curve shows that the episode reward improves steadily with continued training until convergence.
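A moving average of the kind used to smooth the episode rewards can be computed with a simple convolution. The function name, the window length, and the sample rewards are hypothetical, since the window used for Figure 8 is not stated.

```python
import numpy as np


def moving_average(values, window):
    """Average over a sliding window; returns len(values) - window + 1 points."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(values, dtype=float), kernel, mode="valid")


# Hypothetical episode rewards that oscillate while improving on average.
smoothed = moving_average([-55.0, -40.0, -45.0, -20.0, -25.0, -15.0], window=2)
print(smoothed)
```

The smoothed curve filters out the episode-to-episode variation caused by different wind profiles and exposes the underlying training trend.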

FIGURE 8

To investigate the effects of the imitation network, we applied a random initialization policy in the SAC approach for comparison. Figure 9 shows the CPS1 training results based on two RL methods (i.e., SAC and ISAC). The blue line shows the CPS1 result for the ISAC method, and the orange line denotes the CPS1 result for the conventional SAC method. First, we see that the CPS1 results in both methods converge to 2 after 1,500 episodes. At the beginning of the training process, the value of the CPS1 index in the conventional SAC method is much lower than −2, with a minimum value of almost −12. This means that the system frequency stability is unacceptable for real-world grids. However, in the proposed ISAC method, the CPS1 result is always maintained within a small fluctuation range of [-2, 2] even at the beginning. This is because the proposed imitation network provides a satisfactory initialization policy by collecting effective samples for better policy optimization. It is noted that the sampling efficiency is sacrificed in the ISAC method to ensure safe exploration, where the CPS1 value is stable after approximately 1,500 episodes. In the conventional SAC method, the CPS1 value reaches 2 at nearly 1,000 episodes; this is because a large number of unsafe samples accelerate the learning process. Therefore, these results indicate that the proposed imitation network effectively prevents the unsafe random explorations observed in conventional RL methods, which can help cope with safety issues in real-world grids, especially for safety-critical AGC problems.

FIGURE 9

4.3 System control performance

We applied two well-trained agents and a PI controller for comparison in the AGC environment. Figure 10 presents the system frequency fluctuations over 15 min with the same wind inputs based on the three control strategies, where the purple line denotes PI control, the blue line denotes the SAC agent, and the red line represents the proposed ISAC agent. At time step t = 0, the initial frequency deviations are all 0. It is seen that the maximum frequency deviation of the PI controller is 0.076 Hz, which violates the upper limit of 0.05 Hz. However, the maximum values are only 0.045 Hz and 0.028 Hz in the SAC and proposed ISAC methods, respectively. This means that the two RL methods can adhere to the limits by effectively predicting wind fluctuations and taking actions in advance, with the proposed ISAC slightly outperforming the conventional SAC method with smaller frequency deviations.

FIGURE 10

Table 2 presents a detailed comparison of the system control results, including the ancillary service costs, CPS1 index values, maximum frequency deviations, and average frequency deviations. The ancillary costs in the SAC and ISAC methods ($171.49 and $244.06, respectively) are significantly higher than that with the PI method ($136.25) because accurate responses to wind fluctuations require more power regulation in the AGC units. Correspondingly, the CPS1 values are effectively improved from 192.58% with PI to 196.46% in the SAC and 199.27% in the proposed ISAC approaches. This means that there exists a tradeoff between the frequency stability and economic benefit depending on the system preference. In addition, the average frequency deviation with PI is 0.029 Hz, which decreases to 0.009 Hz in the proposed ISAC to achieve a more stable system frequency. Hence, the converged SAC agent can also achieve satisfactory control results, with the main disadvantage being the unsafe training process. With the proposed imitation learning approach, the training process is safer and the final converged policy is improved through consideration of expert experiences.

Figure 11 shows the AGC power regulation curves of the three AGC units. Although the regulation capacities and ramp limits are different, we can see that the curve trends for the three AGC units are quite similar. Specifically, for AGC units 1 and 2, the power deviations start at approximately −3 MW and decrease, reaching approximately −13 MW. At the end time, slight improvements are observed, with the deviations moving toward −5 MW. For AGC unit 3, the power deviation starts at approximately −5 MW and decreases to approximately −20 MW, showing more significant regulation than the other two units. This is because AGC unit 3 has the largest upper output power of 1,100 MW. In summary, all three AGC units experience significant negative power deviations initially, possibly owing to load changes or outdoor environment conditions, before recovering toward the end.

TABLE 2

| Index | PI | SAC | ISAC |
|---|---|---|---|
| Ancillary costs ($) | 136.25 | 171.49 | 244.06 |
| CPS1 (%) | 192.58 | 196.46 | 199.27 |
| Max (Hz) | 0.076 | 0.045 | 0.028 |
| Average (Hz) | 0.029 | 0.018 | 0.009 |

Results of the three controllers.
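For readers who want to see how a CPS1 value of this kind is typically obtained, the sketch below uses the commonly cited NERC form, CPS1 = (2 − CF) × 100% with CF = mean(ACE × Δf / (−10B)) / ε1². It is an assumption that this matches the article's exact definition, which is given in an earlier section; the function name, the bias B, the bound ε1, and the sample values are all hypothetical.

```python
import numpy as np

def cps1_percent(ace_mw, delta_f_hz, bias_mw_per_0p1hz, eps1_hz):
    """CPS1 in percent from one-minute averages (commonly cited NERC form).

    ace_mw            : one-minute average area control error (MW)
    delta_f_hz        : one-minute average frequency deviation (Hz)
    bias_mw_per_0p1hz : frequency bias B in MW/0.1 Hz (negative by convention)
    eps1_hz           : target bound on the one-minute frequency error (Hz)
    """
    ace = np.asarray(ace_mw, dtype=float)
    df = np.asarray(delta_f_hz, dtype=float)
    # Compliance factor: mean of ACE * df / (-10 * B), normalized by eps1^2.
    cf = np.mean(ace * df / (-10.0 * bias_mw_per_0p1hz)) / eps1_hz**2
    return (2.0 - cf) * 100.0

# Hypothetical parameters, chosen only to make the arithmetic easy to follow.
B = -10.0     # MW/0.1 Hz
EPS1 = 0.02   # Hz

perfect = cps1_percent([0.0, 0.0], [0.01, -0.01], B, EPS1)  # ACE is always zero
loaded = cps1_percent([4.0], [0.01], B, EPS1)               # CF works out to 1
```

With zero ACE the compliance factor vanishes and CPS1 sits at 200%, which is why values close to 200% in Table 2 indicate tight frequency control; the second call gives CF = 1 and hence CPS1 = 100%.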

FIGURE 11

5 Conclusion

To achieve real-time frequency response control, this work proposes an imitation-learning-based safe RL framework for AGC dynamic optimization. In the proposed method, an imitator is first used to provide a safe initial policy. The AGC problem is then formulated as an MDP and solved using an SAC algorithm combined with the imitator; SAC is a model-free method that can handle wind power uncertainties through its forecasting capability. Finally, the proposed methodology is tested on a modified IEEE 39-bus system. The numerical results show that the proposed method copes effectively with stochastic disturbances and improves the CPS1 value from 192.58% (PI control) to 199.27%. Moreover, compared with conventional RL methods, the proposed offline imitation learning achieves safer training by reducing constraint violations.
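The imitation step summarized above can be illustrated with a minimal behavioral cloning (BC) sketch. This is not the authors' implementation: the PI-like expert, its gains, the two-dimensional state, and the linear policy are all assumptions chosen so that the MSE fit has a closed-form solution, and the demonstrations are synthetic. In the article's framework, the cloned policy would serve as the starting point for the SAC actor rather than as the final controller.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(state, kp=0.8, ki=0.2):
    """Hypothetical PI-like expert; state = [ACE, integral of ACE]."""
    return -(kp * state[..., 0] + ki * state[..., 1])

# 1) Collect expert demonstrations (synthetic states, not the article's data).
states = rng.normal(size=(500, 2))
actions = expert_action(states)

# 2) Behavioral cloning: minimize the MSE between policy and expert actions.
#    A linear policy a = s @ w is assumed, so least squares solves it exactly.
w, *_ = np.linalg.lstsq(states, actions, rcond=None)

def cloned_policy(state):
    """Policy initialized by imitation; an RL agent would fine-tune it later."""
    return np.asarray(state, dtype=float) @ w

# Imitation error on the demonstration set.
bc_mse = float(np.mean((cloned_policy(states) - actions) ** 2))
```

Because the hypothetical expert is itself linear, the fit recovers its gains and `bc_mse` is numerically zero; with a neural-network actor the same MSE objective would be minimized by gradient descent instead.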

Statements

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors without undue reservation.

Author contributions

PY: writing–review and editing and writing–original draft. ZZ: writing–review and editing and writing–original draft. YW: writing–review and editing. ZH: writing–review and editing. MS: writing–review and editing.

Funding

The authors declare that financial support was received for the research, authorship, and/or publication of this article.

Conflict of interest

Authors ZZ, YW, ZH, and MS were employed by the State Grid Beijing Electric Power Company.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The authors declare that this study received funding from the Science and Technology Project of State Grid Beijing Electric Power Company (grant number: 520210240001). The funder had the following involvement in the study: the writing of this article and the decision to submit it for publication.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations or those of the publisher, editors, and reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Abbreviations

AGC, automatic generation control; RL, reinforcement learning; ACE, area control error; CPS, control performance standard; NERC, North American Electric Reliability Council; PID, proportional integral derivative; SAC, soft actor–critic; MDP, Markov decision process; RMSE, root mean-squared error; BC, behavioral cloning; MSE, mean-squared error; KL, Kullback–Leibler; PI, proportional integral; MAE, mean absolute error.

Footnotes

1.^When the discount factor γ = 0, the impact of the current decision on the future operating status of the system is not considered; when γ = 1, the impact of the current decision on the operating status of the system at every future moment is considered equally.

2.^https://www.iso-ne.com/isoexpress/web/reports.
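Footnote 1 can be made concrete with the standard finite-horizon discounted return. This is a textbook definition rather than a formula quoted from this article, and the symbols $G_t$, $r_{t+k}$, $\gamma$, and the horizon $T$ are notational assumptions.

```latex
G_t = \sum_{k=0}^{T-t} \gamma^{k}\, r_{t+k}, \qquad
\gamma = 0 \;\Rightarrow\; G_t = r_t, \qquad
\gamma = 1 \;\Rightarrow\; G_t = \sum_{k=0}^{T-t} r_{t+k}.
```

With $\gamma = 0$ only the immediate reward survives (taking $0^0 = 1$), whereas with $\gamma = 1$ every future reward up to the horizon carries the same weight.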

Summary

Keywords

automatic generation control, renewable energy, deep reinforcement learning, safe optimization and control, imitation learning

Citation

Zhang Z, Wu Y, Hao Z, Song M and Yu P (2024) Safe dynamic optimization of automatic generation control via imitation-based reinforcement learning. Front. Energy Res. 12:1464151. doi: 10.3389/fenrg.2024.1464151

Received

13 July 2024

Accepted

14 August 2024

Published

19 September 2024

Volume

12 - 2024

Edited by

Zhengmao Li, Aalto University, Finland

Reviewed by

Yitong Shang, Hong Kong University of Science and Technology, Hong Kong, SAR China

Hongyi Li, Iowa State University, United States

Jiarong Li, Harvard University, United States

*Correspondence: Peipei Yu,
