Intelligent Frequency Control Strategy Based on Reinforcement Learning of Multi-Objective Collaborative Reward Function

Large-scale integration of wind power into the grid poses a serious challenge to power system frequency control. If the Control Performance Standard (CPS) index is the only measure of frequency quality, short-term concentrated frequency limit violations can easily occur, which degrades the effect of intelligent Automatic Generation Control (AGC) on frequency quality. To solve this problem, a multi-objective collaborative reward function is constructed by introducing a collaborative evaluation mechanism with multiple evaluation indexes. In addition, a Negotiated W-Learning strategy is proposed to globally optimize the solution of the objective function across multiple dimensions, which avoids the poor learning efficiency of the traditional Greedy strategy. Simulation of an AGC control model of a standard two-area interconnected power grid shows that the proposed intelligent strategy effectively improves frequency control performance and the frequency quality of the system over the full time scale.


INTRODUCTION
Automatic Generation Control (AGC) is an important means of balancing active power supply and load demand in the power system. The quality of the frequency control strategy is an important factor in AGC performance (Alhelou et al., 2018;Shen et al., 2021a;Shen and Raksincharoensak, 2021a). However, the control strategies applied in engineering, such as the threshold zone AGC control strategy that takes into account the combined effects of the proportional component, integral component and Control Performance Standard (CPS) control component of the regional control deviation (Arya and Kumar, 2017;Shen et al., 2020a;Xi et al., 2020;Shen and Raksincharoensak, 2021b), have been unable to adapt to the increasingly complex frequency control of interconnected power grids (Shen et al., 2017;Zhang and Luo, 2018).
In recent years, intelligent frequency control strategies based on reinforcement learning have received considerable attention (Yu et al., 2011;Abouheaf et al., 2019;Xi et al., 2019;Shen et al., 2020b;Liu et al., 2020), because reinforcement learning does not rely on models and does not require precise training samples or prior knowledge of the system (Watkins and Dayan, 1992;Yang et al., 2018;Li et al., 2020;Yang et al., 2021a;Shen et al., 2021b).
However, most intelligent control strategies are built on the CPS frequency control performance evaluation standard. The CPS index has low sensitivity when evaluating short-term inter-area power support, and cannot take into account the short-term benefits of frequency control performance (Kumar and Singh, 2019;Yang et al., 2019;Zhu et al., 2019). In a system with large-scale grid-connected wind power, the ability of each region to comply with CPS indicators is limited. An intelligent AGC control strategy that considers only the CPS control criteria can easily cause short-term concentrated frequency limit crossings, which seriously affects its control effect (Wang and James, 2013;Xie et al., 2017;Yang et al., 2021b).
In fact, with the development of grid-connected new energy sources and smart grids, the grid frequency control evaluation standard is transitioning from single-scale evaluation to multi-time-scale and multi-dimensional evaluation. The North American Electric Reliability Corporation (NERC) proposed a new frequency performance evaluation index named Balancing Authority ACE Limits (BAAL), which is used to ensure the short-term frequency quality of the system by constraining the mean value of the frequency deviation over any 30 min so that it does not exceed the limit. However, intelligent AGC control under both the BAAL and CPS indicators is a multi-objective control problem, and it has not yet been studied in the literature.
In response to the above problems, this paper proposes an intelligent frequency control strategy based on the collaborative evaluation of multi-dimensional control standards. The strategy constructs a collaborative reward function that considers both the CPS index and the BAAL index and introduces it into the multi-objective reinforcement learning algorithm. Then, the Negotiated W-Learning strategy is used to learn the action space of the agent, which effectively solves the problem that the agent cannot fully explore the action space (Nathan and Ballard, 2003;Liu et al., 2018;Wang et al., 2019). Simulation examples show that the proposed intelligent control strategy can effectively improve the overall frequency quality of the power system.

CPS1 Frequency Control Performance Evaluation Standard
NERC uses the BAL (BAL-001) disturbance control series of indicators to evaluate the frequency control quality of the interconnected power grid. Among these, the CPS1 (BAL-001-2: R1) indicator is the most widely used in China, as shown in Eq. 1, where $\Delta F^{1\min}$ and $ACE_m^{1\min}$ are, respectively, the average values of the frequency deviation and the power deviation of the control area within 1 min; $B_m$ is the frequency deviation coefficient of area m, which represents the frequency adjustment responsibility assigned to area m; $\mathrm{AVG}_{1,T}(\cdot)$ denotes the average over 12 months; and $\varepsilon$ is the upper limit on the frequency deviation that area m must control.
Taking the case in which the actual frequency is higher than the planned frequency as an example, Eq. 1 can be expanded as Eq. 2, where T is the entire time period, $\Delta F/\varepsilon$ is the frequency deviation contribution of the region itself, $\Delta P_{tie}/(-10 B_m \varepsilon)$ is the frequency contribution of other regions to this region, and $\Delta P_{tie}/(-10 B_m \varepsilon) + \Delta F/\varepsilon$ is the comprehensive frequency deviation contribution. For convenience of analysis, define $\psi = (\Delta F/\varepsilon)\cdot[\Delta P_{tie}/(-10 B_m \varepsilon) + \Delta F/\varepsilon]$ as the comprehensive frequency deviation factor. The CPS1 indicator statistically evaluates the rolling root mean square of the frequency deviation time series of the evaluated area over the period T. When T is large enough, the system frequency deviation qualification rate is greater than 99.99%. Therefore, CPS1 is a long-term evaluation index that reflects the frequency quality of interconnected power grids.
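Eq. 1 and Eq. 2 are not reproduced in this text. As a rough illustration only, the sketch below assumes the commonly published NERC form, in which the compliance factor is the period average of $(ACE/(-10B_m))\cdot\Delta F$ divided by $\varepsilon^2$, CPS1 = (2 − compliance factor) × 100%, and $ACE = \Delta P_{tie} - 10 B_m \Delta F$. Under these assumptions the compliance factor is simply the mean of ψ. The function names and the sample values are hypothetical.

```python
from statistics import mean


def psi(delta_f, delta_p_tie, b_m, eps):
    """Comprehensive frequency deviation factor for one 1-min sample."""
    own = delta_f / eps                         # contribution of the area itself
    others = delta_p_tie / (-10.0 * b_m * eps)  # contribution of other areas
    return own * (others + own)


def cps1_percent(delta_f_series, delta_p_tie_series, b_m, eps):
    """CPS1 in percent, ASSUMING CPS1 = (2 - mean(psi)) * 100."""
    factors = [psi(df, dp, b_m, eps)
               for df, dp in zip(delta_f_series, delta_p_tie_series)]
    return (2.0 - mean(factors)) * 100.0


# Hypothetical numbers: B_m = -100 MW/0.1 Hz, eps = 0.01 Hz.
print(cps1_percent([0.0, 0.005], [0.0, 2.0], b_m=-100.0, eps=0.01))
```

With zero deviations the sketch returns the 200% target value; positive ψ values pull CPS1 below 200%.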

BAAL Frequency Control Performance Evaluation Standard
NERC proposed the BAAL (BAL-001-2: R2) evaluation index in 2013 and began to implement it in 2016, as shown in Eqs 3 and 4, where $F_A$ is the actual frequency value; $F_S$ is the planned frequency value; $F_{FTL\text{-}high}$ and $F_{FTL\text{-}low}$ are the high and low frequency trigger limits; $T_v$ is the specified allowable continuous time limit; and $T[\cdot]$ is the continuous over-limit time. Taking the case in which the actual frequency is higher than the planned frequency as an example, Eq. 3 can be transformed in the same way.
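Eqs 3 and 4 are not reproduced in this text. The following sketch assumes the form of the high limit published in NERC BAL-001-2, $BAAL_{high} = -10 B_m (F_{FTL\text{-}high} - F_S)\cdot(F_{FTL\text{-}high} - F_S)/(F_A - F_S)$, and checks the continuous over-limit time against $T_v$. All function names and numeric values are hypothetical, and one sample per clock-minute is assumed.

```python
def baal_high(f_actual, f_sched, ftl_high, b_m):
    """High BAAL limit (MW) for a minute with f_actual > f_sched (assumed form)."""
    margin = ftl_high - f_sched
    return (-10.0 * b_m * margin) * margin / (f_actual - f_sched)


def longest_violation(ace_series, baal_series):
    """Longest run of consecutive minutes in which ACE exceeds the limit."""
    longest = run = 0
    for ace, limit in zip(ace_series, baal_series):
        run = run + 1 if ace > limit else 0
        longest = max(longest, run)
    return longest


def baal_compliant(ace_series, baal_series, t_v=30):
    """True when the continuous over-limit time does not exceed T_v minutes."""
    return longest_violation(ace_series, baal_series) <= t_v


# Hypothetical 50 Hz example with B_m = -100 MW/0.1 Hz.
print(baal_high(50.02, 50.0, 50.05, -100.0))
```

Because the limit is inversely proportional to $F_A - F_S$, the allowed ACE tightens as the frequency moves away from its planned value, which is what makes BAAL a short-term constraint.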

Performance Analysis Under the Joint Control of BAAL Standard and CPS1 Standard
To further study the features of the two indexes, Figure 1 shows the curve of the comprehensive frequency deviation factor ψ over time under different performance indicators. As shown in Figure 1, taking point A as the critical point at which the frequency crosses the limit, when only CPS1 is considered, the system frequency can still meet the requirements of the control performance index, but the safe operation of equipment in the system is affected and power quality is reduced. If only the BAAL indicator is considered, the system frequency may exhibit "vertical drop" and "tip oscillation," as shown at point B in Figure 1. In this case, the synchronous generator frequently receives opposing frequency deviation signals within a short period of time, which increases the wear of the unit. When the CPS1 and BAAL indicators are considered at the same time, the frequency reverses under the influence of the BAAL index after a short-term limit violation.
In summary, if the CPS1 and BAAL indicators are coordinated to jointly constrain the system frequency, both the long-term frequency quality and the short-term frequency safety can be guaranteed.

INTELLIGENT AGC CONTROL STRATEGY CONSIDERING COOPERATIVE EVALUATION OF MULTI-DIMENSIONAL CONTROL STANDARDS
Based on the analysis in Section 2.3, this paper constructs an AGC control model that uses a reinforcement learning frequency control strategy with a multi-objective collaborative reward function. As shown in Figure 2A, the model mainly consists of the following parts: the system governor, the equivalent module of the generator, the dynamic model of the system frequency deviation, and the intelligent brain controller. Here $R$, $T_g$, $T_t$, $M$ and $D$ are, respectively, the equivalent unit adjustment coefficient, the governor time constant, the equivalent generator time constant, the equivalent inertia coefficient and the equivalent damping coefficient of the power system in area m; $\Delta P_{tie}$ is the tie-line exchange power deviation of area m; $\Delta X_g$, $\Delta P_g$ and $\Delta P_d$ are, respectively, the changes in the regulating valve position, the generator output power and the load disturbance; and $\Delta P_\Sigma$ is the total adjustment command of the unit.
Frequency controller intelligent learning stage: This article uses a multi-objective reinforcement learning (MORL) strategy with a collaborative reward function to train the intelligent frequency controller. The strategy has two main parts: the cooperative reward function of the CPS1 and BAAL indexes, and the Negotiated W-Learning based intelligent frequency control learning algorithm. First, the MORL idea is used to construct the instant reward functions of the CPS1 index and the BAAL index, with dynamic coordination factors characterizing the impact of the different indicators on environmental changes. Then, the instant rewards obtained during MORL learning are used to update the respective state-action sets of the CPS1 index and the BAAL index. Finally, Negotiated W-Learning conducts a global search to obtain the final action, which satisfies the CPS1 and BAAL indicators and reflects the environmental feedback information.
Frequency controller online deployment stage: In each AGC control cycle, the trained frequency controller collects frequency deviation, ACE, CPS, BAAL, and other data in real time from the SCADA database of the Energy Management System (EMS), and issues real-time frequency control actions.

Collaborative Reward Function of CPS1 Indicator and BAAL Indicator
This paper constructs a cooperative reward function based on the CPS1 indicator and the BAAL indicator, in which $R_i(s, s', a)$ is the instant reward obtained for the ith objective when the system moves from state s to state s′ through action a; ACE(t) is the real-time value of the regional control deviation at the current moment; s is the system state [ACE(t)] at time t; s′ is the state [ACE(t + 1)] at time t + 1; and a is the system action ($\Delta P_\Sigma(t)$) that takes the system from s to s′. BAAL(t) is the instantaneous value of BAAL at time t, CPS1(t) is the instantaneous value of CPS1 at time t, and CPS1* is the target value, generally 200%. $\lambda_i$ is the dynamic coordination factor of the cooperative reward function; that is, $\lambda_i$ changes dynamically with each state transition. To determine the value of the dynamic coordination factor, this paper adopts comprehensive weighting with multiplicative combination, which considers both the preferences of decision makers and the inherent statistical regularities in the index data.
First, define the parameter K for evaluating the importance of the frequency performance evaluation indicators: $K_{i,j}$ represents the importance of evaluation index i relative to index j in the frequency performance evaluation. When an out-of-bounds situation such as ACE < BAAL or CPS1 > 200 occurs, the importance of the corresponding indicator increases accordingly. When the two indicators play equal or unimportant roles in the frequency evaluation process, the corresponding $K_{i,j}$/$K_{j,i}$ values are all 4 or 0. Each time the relative importance of an index increases by one point, the corresponding $K_{i,j}$/$K_{j,i}$ value increases by 1 and the $K_{j,i}$/$K_{i,j}$ value decreases by 1. The weighting factors of each target in each action cycle are then obtained (Eq. 8). To eliminate subjectivity, the entropy method is used to calculate the coefficient of difference $\beta_i$ between the two indicators:
$$\beta_i = \frac{1 + \ln^{-1}(N)\sum_{y=1}^{K} P_{y,i}\ln P_{y,i}}{\sum_{i=1}^{N}\left[1 + \ln^{-1}(N)\sum_{y=1}^{K} P_{y,i}\ln P_{y,i}\right]} \quad (9)$$
where $x_{y,i}$ is the standardized value of the ith frequency control performance evaluation index at the yth time, K is the number of samples of the ith frequency control performance evaluation index from time 0 to the current time t, and N is the number of targets. $P_{y,i}$ is the proportion of $x_{y,i}$ in the total of the index values from 0 to t. Finally, the coordination factor is determined by the multiplicative weighting method, that is, by combining Eqs 8 and 9.
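The entropy step and the multiplicative combination can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the entropy is normalized by ln(K), the number of samples, which is the common entropy-weight convention and differs from the ln(N) printed in Eq. 9; the combination $\lambda_i = w_i\beta_i/\sum_j w_j\beta_j$ is an assumed form because the combined equation is not reproduced here; and the subjective weights `subjective` stand in for the result of Eq. 8. All names are hypothetical.

```python
import math


def entropy_weights(x):
    """x[y][i]: standardized positive value of index i at sample y.

    Returns objective weights beta_i (normalized difference coefficients).
    Assumes at least two samples so that ln(K) is nonzero.
    """
    k, n = len(x), len(x[0])
    diffs = []
    for i in range(n):
        column = [row[i] for row in x]
        total = sum(column)
        p = [v / total for v in column]
        # ASSUMPTION: entropy normalized by ln(K), not ln(N) as printed.
        entropy = -sum(pv * math.log(pv) for pv in p if pv > 0) / math.log(k)
        diffs.append(max(0.0, 1.0 - entropy))
    s = sum(diffs)
    if s < 1e-12:              # all indexes equally uninformative
        return [1.0 / n] * n
    return [d / s for d in diffs]


def coordination_factors(subjective, objective):
    """Multiplicative combination of subjective and entropy weights (assumed form)."""
    products = [a * b for a, b in zip(subjective, objective)]
    s = sum(products)
    return [p / s for p in products]


samples = [[1.0, 1.0], [1.0, 3.0]]   # hypothetical standardized index values
beta = entropy_weights(samples)
print(coordination_factors([0.5, 0.5], beta))
```

An index whose values vary more over the window carries more information and receives the larger objective weight, which the subjective weights then scale.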

Negotiated W-Learning Intelligent Frequency Control Learning Algorithm
The update formula of MORL is the same as the state-action value function update of traditional Q-learning, as shown in Eq. 11. To facilitate the selection of the optimal action that satisfies each goal, this paper uses the vector MQ(s, a) to represent the Q values of action a in state s for the N goals, as shown in Eq. 12; the optimal action strategy $\pi^*_{MQ}$ for each target in the current state is expressed in Eq. 13. In Eq. 11, $\alpha$ (0 < α < 1) is the learning rate, which is set to 0.01 in this article; $\gamma$ is the discount coefficient, which is set to 0.9 in this article; and $Q_i(s, a)$ is the Q value of the ith target for choosing action a in state s. However, the above optimal action selection strategy cannot guarantee that the agent fully explores the entire state-action space. In this paper, the Negotiated W-Learning strategy is used to optimize the MQ(s, a) vector space. This strategy defines the variable $W_i$ as a leader parameter. The operation steps are as follows, and Figure 2B is a reference flow chart:
Step 1: Choose an objective function in the MQ(s, a) vector space as the guide objective function, whose investigation parameter is denoted $W_i$. For the first guide objective function, the guidance value is uniformly set to $W_{cir} = 0$, and the guide action $a_{cir}$ is obtained as in Eq. 14.
Step 2: For the remaining objective functions, calculate $W_i$ as shown in Eq. 15.
Step 3: Take the maximum value $W_{i,\max}$ of $W_i$ over the objective functions other than the guide objective function, and compare it with $W_{cir}$. If $W_{i,\max} > W_{cir}$, select the objective function corresponding to $W_{i,\max}$ as the new guide objective function, update the guidance value $W_{cir}$ to $W_{i,\max}$, take the corresponding action a as the new guidance action $a_{cir}$, and return to Step 2; iterate until this condition is no longer met. If $W_{i,\max} \le W_{cir}$, record the guidance action $a_{cir}$ and the guide objective function at this time as the final result.
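The update rule and the three steps can be sketched as follows. Eqs 11, 14 and 15 are not reproduced in this text, so the sketch assumes the standard tabular Q-learning update and the usual Negotiated W-Learning definitions: the guide action is the greedy action of the guide objective, and $W_i = \max_a Q_i(s, a) - Q_i(s, a_{cir})$ is the loss objective i would suffer if $a_{cir}$ were executed. Function names and the table layout `mq[i][s][a]` are hypothetical.

```python
def q_update(q, s, a, reward, s_next, alpha=0.01, gamma=0.9):
    """One tabular Q-learning update for a single objective (assumed Eq. 11)."""
    target = reward + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


def negotiated_w_action(mq, s, leader=0):
    """Select an action in state s; returns (action, guide objective index)."""
    def greedy(i):
        row = mq[i][s]
        return max(range(len(row)), key=row.__getitem__)

    w_cir = 0.0                 # Step 1: first guidance value is zero
    a_cir = greedy(leader)
    while True:
        # Step 2: loss of every other objective under the guide action
        w = {i: max(mq[i][s]) - mq[i][s][a_cir]
             for i in range(len(mq)) if i != leader}
        if not w:
            return a_cir, leader
        challenger = max(w, key=w.get)
        # Step 3: stop once no objective loses more than the guidance value
        if w[challenger] <= w_cir:
            return a_cir, leader
        leader, w_cir = challenger, w[challenger]
        a_cir = greedy(leader)


# Two objectives (e.g. CPS1 and BAAL), one state, two actions; values are made up.
mq = [[[1.0, 0.0]], [[0.5, 0.6]]]
print(negotiated_w_action(mq, s=0))
```

The loop terminates because the guidance value strictly increases at every change of guide objective and only finitely many values are possible.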

SIMULATION RESULTS
This paper builds a typical two-area interconnected power grid AGC model for load frequency control. The parameter settings of the two areas in the model system are the same, and the system base capacity is 1000 MW. Figures 3A,B show the pre-learning process for the single CPS1 target and for the Negotiated W-Learning algorithm. In the pre-learning stage, a continuous sinusoidal load disturbance with a period of 1,200 s, an amplitude of 100 MW and a duration of 20,000 s is applied to area A, and the 2-norm criterion on the Q function matrix, $\|Q_t(s, a) - Q_{t-1}(s, a)\|_2 \le \zeta$ (ζ is a constant), is used as the standard for deciding that pre-learning has reached the optimal strategy (Imthias Ahamed et al., 2002).
It can be seen from Figure 3A that after many iterations, the Q function tends to stabilize, reaching the optimal strategy for the CPS1 target. Figure 3B shows the 10-min average value of CPS1 ($CPS1_{avg\text{-}10min}$) in area A during the pre-learning process. The curve remains at an almost stable and acceptable value in the later stage, which shows that the Negotiated W-Learning algorithm has approached the optimal CPS1 control strategy. At the same time, the Q matrix corresponding to the BAAL target has also converged.
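The pre-learning disturbance and the stopping criterion described above can be sketched as follows. This is an illustration under assumptions: the 2-norm is taken entrywise over the Q table (the Frobenius norm) rather than as a spectral norm, and the function names are hypothetical.

```python
import math


def sinusoidal_load(t, amplitude_mw=100.0, period_s=1200.0):
    """Load disturbance (MW) at time t seconds."""
    return amplitude_mw * math.sin(2.0 * math.pi * t / period_s)


def q_converged(q_prev, q_curr, zeta):
    """True when the entrywise 2-norm of the change in Q is at most zeta."""
    squared = sum((c - p) ** 2
                  for row_p, row_c in zip(q_prev, q_curr)
                  for p, c in zip(row_p, row_c))
    return math.sqrt(squared) <= zeta


print(sinusoidal_load(300.0))                         # quarter period: peak
print(q_converged([[0.0, 0.0]], [[0.001, 0.0]], 0.01))
```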
In addition, to compare learning time, the four algorithms listed below were each simulated many times and the average calculation time was recorded; see Table 1 for details. Owing to its smaller number of optimization targets and the lower difficulty of calculating the coordination factor, the single-target CPS1-MORL has the shortest calculation time. Since the CoordinateQ-MORL algorithm cannot fully explore the action set, its calculation time is the second shortest. Compared with Greedy-MORL, the global search algorithm Negotiated W-Learning goes through more search steps, so its time is the longest.
In order to further verify the adaptability of Negotiated W-Learning in the constantly changing power grid environment, this paper applies a random disturbance with a period of 1,200 s and an amplitude of 100 MW in area A. Four types of algorithms are set for comparison as follows.
Algorithm 1. Traditional single-objective reinforcement learning algorithm for intelligent frequency control based on CPS1 frequency control performance evaluation index (CPS1-MORL).
Algorithm 2. Multi-objective reinforcement learning algorithm for intelligent frequency control that uses the traditional greedy strategy with multi-dimensional frequency control performance evaluation indexes and a multi-objective Q function (CoordinateQ-MORL).
Algorithm 3. Multi-objective reinforcement learning algorithm for intelligent frequency control that uses the traditional greedy strategy with a cooperative reward function based on multi-dimensional frequency control performance evaluation indicators (Greedy-MORL).
Algorithm 4. The Negotiated W-Learning algorithm proposed in this paper, which uses the collaborative reward function under the multi-dimensional frequency control performance evaluation indexes for multi-objective reinforcement learning and intelligent frequency control (Negotiated W-MORL).
Figure 3C shows the frequency deviation self-contribution degree (Δf/ε) and CPS1 index change curve of Algorithm 1 and Algorithm 4. In this paper, the threshold is used for calculation, where E is 0.01. The frequency contribution degree reflects the frequency quality achieved by the different algorithms. If the frequency contribution degree exceeds ±1, the frequency at that time has exceeded the prescribed limit 3ε. It can be seen that the frequency contribution curve of Algorithm 1 exceeds the short-term continuous over-limit time specified in this article and drops steeply in this interval, which has a greater influence on the safety of system operation. In contrast, the frequency contribution curve of Algorithm 4 stays within the defined range. There are two main reasons for this. First, Algorithm 4 controls the frequency by adjusting the weights of the two indicators in real time. If frequency fluctuations or "frequency drops" occur, the BAAL indicator is given greater weight. If the frequency continuously exceeds the limit during the simulation period, CPS1 is given a larger weight for regulation. Second, Algorithm 4 lets both indicators participate in the evaluation of AGC control at the same time, while Algorithm 1 considers only the impact of CPS1. In addition, the CPS1 curve of Algorithm 4 in Figure 3D fluctuates less throughout the simulation cycle, while that of Algorithm 1 fluctuates more, which further shows that Algorithm 4 is superior to Algorithm 1 in terms of frequency control effect.
In summary, combining the BAAL and CPS1 indicators to constrain the system frequency can effectively improve the frequency quality of the system at the full time scale.

The Influence of Cooperative Reward Function on Frequency Control Performance
To verify the effectiveness of the collaborative reward function proposed in this paper, the control performance indicators of Algorithm 2 and Algorithm 3 are compared. The control performance indicators of Algorithm 3 are better than those of Algorithm 2. This is because introducing coordination factors between the multi-objective state-action value functions may prevent the agent from fully exploring the action set, so that key actions are omitted, whereas the collaborative reward function effectively avoids this problem.
In summary, the introduction of a collaborative reward function can effectively improve the system frequency quality and various frequency performance indicators.

The Influence of Different Learning Strategies on Control Performance
To verify the effectiveness of Algorithm 4 proposed in this paper, Figure 3D shows the CPS1 curve of Algorithm 3 and Algorithm 4. It can be seen from Figure 3E that, after a load disturbance occurs, Algorithm 4 converges faster and fluctuates less than Algorithm 3. This is because the Negotiated W-Learning strategy selects actions from a global perspective, which overcomes the tendency of the traditional greedy strategy to fall into a local optimal solution.
In summary, the global search strategy Negotiated W-Learning is more time-consuming than the local search strategies Greedy and CoordinateQ, but the search quality is higher.

CONCLUSION
This paper proposes a multi-intelligence frequency control strategy based on multi-dimensional evaluation criteria and cooperative reward function.
The simulation results show that: 1) Compared with the general algorithm, the Negotiated W-Learning algorithm effectively improves the quality of the system frequency over the full time scale and explores the global action space better. 2) The collaborative reward function proposed in this paper improves on the linear weighting of the traditional multi-objective Q function. In general, the intelligent AGC control strategy proposed in this paper, based on the collaboration of the CPS1 and BAAL criteria, can effectively deal with the short-term power disturbance problem caused by the grid connection of new energy sources such as wind power, and improves the stability of the system.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.