A Learning-Based Bidding Approach for PV-Attached BESS Power Plants

Large-scale renewable photovoltaic (PV) and battery energy storage system (BESS) units are expected to become significant electricity suppliers in the future electricity market. A bidding model is proposed for PV-integrated BESS power plants in a pool-based day-ahead (DA) electricity market, in which the uncertainty of PV generation output is considered. In the proposed model, the market clearing process is treated as the external environment, and each agent updates its bid price through interaction with the market environment to maximize its revenue. A multiagent reinforcement learning (MARL) algorithm called win-or-learn-fast policy-hill-climbing (WoLF-PHC) is used to explore optimal bid prices without any information about opponents. The case study validates the computational performance of WoLF-PHC in the proposed model, and the bidding strategy of each participating agent is then analyzed.


INTRODUCTION
The share of photovoltaic (PV) installations is growing exponentially worldwide and accounts for most of the electricity supply of renewable energy (Zucker and Hinchliffe, 2014). However, the actual output of PV power may differ from the scheduled production, which poses an inevitable challenge for real-time balancing of the power system. Battery energy storage system (BESS) units can deal with the uncertainty of PV production through their flexible up-and-down regulation capability (Li et al., 2013). Hence, the combination of PV farms and BESS sets is a promising form of virtual power plant, which will actively participate in the future energy spot market with more deregulated paradigms. Thus, it is necessary to investigate the decision making of such PV-BESS generation as prosumers in the market.
In the work of Shafie-khah and Catalao (2015), bidding strategies of large-scale renewable resources in oligopoly electricity markets were formulated as mathematical programming with equilibrium constraints (MPEC), with the uncertainty of market competitors modeled using incomplete-information dynamic game theory. However, the equilibrium of such models is often difficult to obtain because of the computational burden, and their complexity increases when numerous complicated real-world assumptions and constraints are considered (Ventosa et al., 2005). Moreover, whenever the market situation changes, the complex set of equations has to be solved again to find the new market equilibrium (Salehizadeh and Soltaniyan, 2016). With the development of artificial intelligence (AI) techniques in recent years, AI algorithms have been applied in the power system to deal with various problems such as renewable energy forecasting (Zeng et al., 2020), price prediction (Kebriaei et al., 2015), and energy management (Wang et al., 2019). The electricity market can be modeled as an AI-enabled energy platform, where market participants are regarded as AI agents. Agents make bidding decisions by gradually learning through repeated interaction with the AI-enabled market platform. The common AI learning technologies applied in the electricity market include heuristic search, artificial neural networks, and reinforcement learning. With heuristic search, market participants make bidding decisions using the shuffled frog-leaping algorithm (Jonnalagadda and DullaMallesham, 2013), the genetic algorithm (Praça et al., 2003), and the fuzzy adaptive gravitational search algorithm (Vijaya Kumar et al., 2013). Some reinforcement learning methods are also used to address bidding problems, for example, the traditional Q-learning algorithms in the work of Najafi et al. (2019) and a deep reinforcement learning-based approach (Ye et al., 2019).
However, the abovementioned methods develop each market player's bidding strategy without considering other competitors. In the real-world electricity market, each agent pursues its objective in response to other agents' bidding behaviors. Considering this, a multiagent multiobjective architecture with reinforcement learning was proposed to minimize energy costs for EV owners, in which agents must communicate with all friendly agents and obtain their reward functions (Da Silva et al., 2019). The Markov game approach is utilized to update multiagent competitive bidding strategies in the work of Rashedi et al. (1049), although it requires other agents' previous bids. In practice, however, market participants are unwilling to share either complete or partial information. Future research is therefore expected to develop bidding strategies through a fully distributed online training procedure without any information communicated among agents.
Two bidding strategies are formulated considering the uncertainty of PV prediction in the work of Bo et al. (2017). The bidding strategy of battery storage systems in the secondary control reserve market is investigated in the work of Merten et al. (2020). Chen et al. (2021) studied the optimal bidding strategy of a PV-BESS VPP in frequency control ancillary services markets. A two-stage bidding strategy of household PV-BESSs has been proposed in a peer-to-peer market. Niknam et al. (2012) introduced a bidding strategy of combined PV-storage systems in the day-ahead (DA) market, in which the PV-storage systems are considered as price takers. So far, to the best of the authors' knowledge, little research considers PV-attached BESS power plants in a pool-based DA wholesale market as oligopolists that make their bidding decisions without any information about opponents. Furthermore, prior studies consider aggregated PV-BESSs that develop bidding strategies with either complete or partial information about other strategic players. In other words, previous work cannot deal with the situation in which each strategic PV-BESS participant does not share any information with its rivals. This challenging issue needs to be addressed properly. In this study, we propose a DA bidding strategy for PV-attached BESS power plants that maximizes their benefits through self-bidding that does not rely on any information about competitors. A multiagent reinforcement learning algorithm, win-or-learn-fast policy-hill-climbing (WoLF-PHC), is used to solve the proposed bidding problem.
The main contributions of this study are summarized as follows:
1) A stochastic bidding strategy model of PV-attached BESS power plants in a pool-based DA wholesale market is developed to maximize the revenues of PV-attached BESS power plants considering the uncertainty of potential maximum PV power production.
2) A multiagent stochastic game framework with incomplete information is used to describe the proposed bidding model, and the proposed model is then solved by the multiagent reinforcement learning algorithm WoLF-PHC without any information about opponents.
3) The proposed model and the WoLF-PHC algorithm are validated on the modified IEEE 6-bus and 118-bus systems.
The remaining part of this article is arranged as follows. Proposed Bidding Model introduces the proposed bilevel stochastic bidding model. In Methodology, the WoLF-PHC is used to solve the proposed bidding problem. Simulation results and analysis are presented in Case Study, while Conclusion concludes the whole article.

PROPOSED BIDDING MODEL
The DA wholesale pool-based market is considered in this study. The strategic participants, PV-attached BESS power plants, submit bid prices and power capacities to the market operator (MO) on an hourly basis. The MO runs the market clearing process to determine the locational marginal price (LMP) and the scheduled power production of the PV-attached BESS power plants. The overall structure of the proposed model is presented in Figure 1.
The assumptions of the proposed market model are as follows: 1) Uncertainty of potential maximum PV power production is considered in this study and is handled by a scenario-based stochastic optimization method. The uncertainty is modeled as a set of scenarios derived from a scenario generation process based on the roulette wheel mechanism in the work of Niknam et al. (2012) and an efficient scenario-reduction method in the work of Morales et al. (2009). In this way, the stochastic optimization problem can be converted into a deterministic one and solved with many methods. A stochastic bidding model is introduced in which the total cost is minimized by the MO when clearing the market, while the respective revenues are maximized by the strategic participants, the PV-attached BESS power plants.
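As an illustration of the roulette wheel idea, the following minimal sketch draws one 24 h PV scenario by selecting, for each hour, a forecast-error level with probability proportional to its weight. The function name `roulette_wheel_scenario`, the error levels, and their probabilities are hypothetical and are not taken from Niknam et al. (2012); the scenario-reduction step of Morales et al. (2009) is not shown.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def roulette_wheel_scenario(forecast_mw, error_levels, level_probs, rng):
    """Draw one PV scenario; each hour picks an error level by roulette wheel.

    error_levels are relative deviations from the forecast (assumed values),
    level_probs are their (possibly unnormalized) probabilities.
    """
    total = sum(level_probs)
    # Cumulative normalized probabilities split [0, 1] into wheel slots.
    cumulative = list(accumulate(p / total for p in level_probs))
    scenario = []
    for forecast in forecast_mw:
        u = rng.random()
        # First slot whose upper edge is >= u; the min() guards rounding.
        slot = min(bisect_left(cumulative, u), len(error_levels) - 1)
        # PV output cannot be negative.
        scenario.append(max(0.0, forecast * (1.0 + error_levels[slot])))
    return scenario


if __name__ == "__main__":
    rng = random.Random(42)
    forecast = [0.0] * 5 + [20.0] * 14 + [0.0] * 5  # illustrative MW profile
    print(roulette_wheel_scenario(forecast, [-0.1, 0.0, 0.1], [0.2, 0.6, 0.2], rng))
```

Repeating the draw many times and merging similar scenarios would give a reduced scenario set with probabilities that play the role of the scenario weights used in the model.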

Market Clearing Model
In the market clearing process, the suppliers (PV-attached BESS power plants) and the loads first submit their bid prices $\pi^{bid}_{co,t}$ and $\pi^{D}_{d,t}$, respectively, to the MO. The MO then clears the market by minimizing the total cost based on the optimal power flow (OPF). Finally, the dispatched power production of the PV-attached BESS power plants $P^{CO}_{co,\alpha,t}$ and the LMP $\lambda_{n,\alpha,t}$ are returned to the strategic PV-attached BESS power plants, which use them to maximize their revenues.
subject to

$$\theta_{n,\alpha,t} = 0, \quad \forall t, \alpha, n: \mathrm{ref}, \tag{1.7}$$

$$-\pi \le \theta_{n,\alpha,t} \le \pi, \quad \forall t, \alpha, n \backslash n: \mathrm{ref}. \tag{1.8}$$

The objective of Eq. 1 is to minimize the total cost. The first term is the cost of purchasing electricity from the PV-attached BESS power plants, $\pi^{bid}_{co,t} \cdot P^{CO}_{co,\alpha,t}$, while the second term represents the revenue of selling electricity to the load demands, $\pi^{D}_{d,t} \cdot P^{D}_{d,t}$. $P^{CO}_{co,\alpha,t}$ and $P^{D}_{d,t}$ are the power output of the co-th PV-attached BESS power plant and the power consumption of the d-th load in each hour. $\alpha$ indexes the PV scenarios, and $\tau_{\alpha}$ is the corresponding probability. The constraint of Eq. 1.1 is the balance of power production and consumption at node $n$, with a dual variable $\lambda_{n,\alpha,t}$ denoting the LMP, where $B_{nm}$ is the susceptance of the line connecting nodes $n$ and $m$, and $\theta$ is the voltage angle. Eq. 1.2 represents the scheduled power of the PV-attached BESS power plants $P^{CO}_{co,\alpha,t}$, which is supplied by the PVs $P^{PV}_{pv,\alpha,t}$ and the BESS units $P^{BE}_{be,\alpha,t}$. The scheduled power of the PV-attached BESS power plants should be nonnegative, as shown in Eq. 1.3. The maximum and minimum capacity limits of the PV units and BESS units are considered in Eq. 1.4 and Eq. 1.5, respectively. Inequality Eq. 1.6 limits the flow on each transmission line to its thermal capacity $f^{max}_{nm}$. Eq. 1.7 and inequality Eq. 1.8 set the voltage angle limits at the slack bus and the other buses, respectively. Inequality Eq. 1.9 represents the SOC range of the BESS at the present hour, while constraints Eq. 1.10.1 and Eq. 1.10.2 give the time-series SOC formulation linking the present and previous hours. $\eta_{c}$ and $\eta_{dis}$ are the charging and discharging efficiencies of the BESS, respectively. $E^{max}_{be}$ refers to the maximum capacity of the BESS.
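The SOC constraints can be illustrated with a common storage energy balance. The sketch below is an assumption about the general form only: the exact expressions of Eq. 1.9, Eq. 1.10.1, and Eq. 1.10.2 may differ, and the function name `soc_trajectory` and all numerical values are hypothetical.

```python
def soc_trajectory(soc0_mwh, charge_mw, discharge_mw, eta_c, eta_dis,
                   e_max_mwh, e_min_mwh=0.0, dt_h=1.0):
    """Propagate the stored energy hour by hour and check the SOC range.

    Assumed balance: E_t = E_{t-1} + (eta_c * P_ch - P_dis / eta_dis) * dt.
    """
    soc = [soc0_mwh]
    for p_ch, p_dis in zip(charge_mw, discharge_mw):
        nxt = soc[-1] + (eta_c * p_ch - p_dis / eta_dis) * dt_h
        # Range check in the spirit of the SOC limits (small numeric slack).
        if not (e_min_mwh - 1e-9 <= nxt <= e_max_mwh + 1e-9):
            raise ValueError(f"SOC {nxt:.3f} MWh outside [{e_min_mwh}, {e_max_mwh}]")
        soc.append(nxt)
    return soc


if __name__ == "__main__":
    # Charge 2 MW for one hour, then discharge 1.8 MW for one hour.
    print(soc_trajectory(5.0, [2.0, 0.0], [0.0, 1.8], 0.9, 0.9, e_max_mwh=10.0))
```

In an optimization model these relations would appear as linear constraints on the decision variables rather than as a forward simulation; the simulation form is used here only to make the bookkeeping explicit.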

The PV-Attached BESS Power Plant Revenue Model
The revenue of the co-th PV-attached BESS power plant, as a strategic player, is maximized in expectation over the scenarios $\alpha$ with corresponding probabilities $\tau_{\alpha}$, as represented in Eq. 2, where the LMP $\lambda_{n,\alpha,t}$ and the scheduled power output of the PV-attached BESS power plant $P^{CO}_{co,\alpha,t}$ are obtained from the market clearing process. The revenue of a PV-attached BESS power plant in Eq. 2 includes the income of selling electricity to the electricity market, $\lambda_{n,\alpha,t} \cdot P^{CO}_{co,\alpha,t}$, and the battery degradation cost, $(K_{b}/\varpi_{b}) \cdot P^{E}_{be,\alpha,t}$. $K_{b}$ and $\varpi_{b}$ are battery lifetime and battery capital cost, respectively. The absolute-value function in Eq. 2 can be handled by the linear programming simplex method in the work of Hill and Ravindran (1975).
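The revenue described above can be sketched as a probability-weighted sum over scenarios. In this minimal example the ratio $K_{b}/\varpi_{b}$ is passed as a single coefficient `degradation_coeff`, the absolute value is applied to the battery power, and the function name and all numbers are hypothetical.

```python
def expected_revenue(probabilities, lmp, plant_power, battery_power, degradation_coeff):
    """Expected revenue: sum over scenarios of tau * sum_t (LMP*P_CO - coeff*|P_E|).

    lmp, plant_power, and battery_power are lists of per-scenario hourly lists.
    """
    revenue = 0.0
    for tau, lmp_s, p_s, b_s in zip(probabilities, lmp, plant_power, battery_power):
        scenario_revenue = sum(
            price * power - degradation_coeff * abs(batt)
            for price, power, batt in zip(lmp_s, p_s, b_s)
        )
        revenue += tau * scenario_revenue
    return revenue


if __name__ == "__main__":
    value = expected_revenue(
        probabilities=[0.5, 0.5],
        lmp=[[10.0, 20.0], [10.0, 20.0]],          # $/MWh
        plant_power=[[1.0, 2.0], [3.0, 4.0]],      # MW
        battery_power=[[-1.0, 1.0], [0.0, 2.0]],   # MW, sign marks direction
        degradation_coeff=1.0,                      # $/MWh, assumed
    )
    print(value)
```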

Introduction to Multiagent Reinforcement Learning
The proposed bidding model raises a fundamental question: how can the strategic market participants act as AI agents to learn and determine the optimal bid prices? This research suggests that, in the electricity market, agents can be trained with AI algorithms to better solve bidding optimization problems. The common core techniques of AI are classified as artificial neural networks, reinforcement learning, genetic algorithms, and multiagent systems (Xu et al., 2019). In reinforcement learning (RL), the agent makes its decisions through interaction with the external environment, as in Figure 2 (Hwang et al., 2017). First, at each step $n$, the agent perceives from the environment a state $x_{n}$ and a reward $r_{n}$ that results from its past action $a_{n-1}$. Then, its learning is reinforced by comparing the returned scalar reward signal $r_{n}$ with the one from the last round, $r_{n-1}$, to evaluate the quality of its behavior in the environment. Specifically, the probability $p$ of the action is increased if the comparison is favorable and decreased otherwise. Last, the action $a_{n}$ with the highest probability is chosen through this self-learning. There are three main classes of methods that make use of RL principles, namely, dynamic programming methods, Monte Carlo techniques, and temporal difference learning methods (Tellidou and Bakirtzis, 2006). Dynamic programming presupposes that complete system information is available. Although Monte Carlo techniques can cope with unknown environments, the solution process is very time consuming, because learning must wait for the final outcome. Temporal difference learning methods, which learn from an unknown environment after every step without waiting for the final result, are more suitable for the problem presented in this study, and Q-learning is one of the most frequently used RL approaches of this kind.
In Q-learning, the sets of $g$ states and $k$ actions of each agent are represented as $\chi = \{x_{1}, x_{2}, \ldots, x_{g}\}$ and $\Lambda = \{a_{1}, a_{2}, \ldots, a_{k}\}$. The Q values are then updated in the nth step as in Eq. 3,

$$Q_{n+1}(x_{n}, a_{n}) \leftarrow (1 - \alpha) Q_{n}(x_{n}, a_{n}) + \alpha \left[ r_{n} + \beta \max_{a_{n}'} Q_{n+1}(x_{n}, a_{n}') \right], \tag{3}$$

in which $x_{n} \in \chi$, $a_{n} \in \Lambda$, and $r_{n}$ refers to each pair $(x_{n}, a_{n})$. Here $\alpha$ and $\beta$ are the learning rate and discount factor, respectively, which are both in the range (0,1].
Multiagent reinforcement learning (MARL) is developed from single-agent RL by adding the game relationship among all agents, which are similar to strategic players in the electricity market. Let a tuple $(K, \chi, \Lambda, P, r)$ represent a multiagent game framework, where $K = \{1, 2, \ldots, k\}$ is a set of agents and $\chi$ is a set of states $\{x_{g}\}$. The set of actions of each agent $a_{i}$ is described as $a_{i} = \{aa_{min}, \ldots, aa_{max}\}$ in $\Lambda = \{a_{1}, \ldots, a_{i}, \ldots, a_{k}\}$. $P$ refers to the transition function, written as $\chi \times \Lambda \times \chi \rightarrow [0, 1]$. $r = \{r_{1}, \ldots, r_{i}, \ldots, r_{k}\}$ is the set of reward functions of all agents, where $r_{i}: (x_{g}, a_{i}) \rightarrow R$ is the ith agent's reward function for a pair $(x_{g}, a_{i})$. In each episode, the agent observes the state $x_{g} \in \chi$, selects and executes an action $a_{i}$ according to the policy of the learning algorithm, and then steps into the next state $x_{g} \in \chi$.
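A tabular Q-learning update can be sketched in a few lines. The example below uses the standard temporal-difference target, which evaluates the maximum Q value at the next state; the function name `q_update` and the numerical values are hypothetical.

```python
def q_update(q, state, action, reward, next_state, alpha, beta):
    """One tabular Q-learning step; q is a list of per-state action-value lists.

    alpha is the learning rate and beta the discount factor, both in (0, 1].
    """
    best_next = max(q[next_state])  # greedy value of the next state
    q[state][action] = (1.0 - alpha) * q[state][action] + alpha * (reward + beta * best_next)
    return q[state][action]


if __name__ == "__main__":
    q_table = [[0.0, 0.0], [1.0, 2.0]]  # 2 states x 2 actions
    print(q_update(q_table, state=0, action=1, reward=1.0, next_state=1, alpha=0.5, beta=0.9))
```

In the bidding setting, each action index would correspond to one candidate bid price and the reward to the revenue returned after market clearing.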

Assumptions and Definitions
The proposed bidding model in Proposed Bidding Model can be expressed in the multiagent game framework as follows.
Agents: $K$ is the set of strategic participants, that is, the PV-attached BESS power plants.
States: $\chi$ is defined as the different capacity levels of the PV-attached BESS power plants. $P^{CO}_{co,\alpha,t}$ is obtained from the market clearing, which determines the state $x_{co}$ selected after the interaction with the external environment.
Actions: $\Lambda$ is used to update the bid prices $\{\pi^{bid}_{co,t}\}_{co \in \{a_{j}\}}$.
Reward function: $r_{i}(x_{co}, a_{i}) \rightarrow R$ is the revenue of the co-th player with the bid price $\pi^{bid}_{co,t}$ at the capacity level $x_{co}$ of the PV-attached BESS power plant.
In this way, $K$, $\chi$, $\Lambda$, and $r$ have been defined. What remains is to find the optimal policy $p$, which is used to choose an action in the current state. For this purpose, the win-or-learn-fast policy-hill-climbing (WoLF-PHC) algorithm is introduced in this study.

The Step-by-Step Implementation of the Proposed Model With WoLF-PHC
The WoLF-PHC is developed from Q-learning and requires two learning parameters, one for winning, $\xi_{w}$, and one for losing, $\xi_{l}$. Convergence is enhanced by these two learning rates. By definition, $\xi_{w}$ should be smaller than $\xi_{l}$. If the agent loses, it learns faster with $\xi_{l}$ to update its action. On the contrary, the agent remains cautious with $\xi_{w}$ when it wins. Winning or losing is judged by comparing the expected revenue with the average profit, in which the average strategy replaces the original equilibrium policy. The WoLF-PHC algorithm of agent $i$ is represented as follows.
The specific learning procedure for the ith PV-attached BESS power plant bidding strategically with WoLF-PHC is described in the following steps.
1) The bid price $\lambda_{i}$, the parameters $\alpha$, $\beta$, $\eta$, $\xi_{w}$, and $\xi_{l}$, and $Q_{i}$, $p_{i}$, and $c(x_{co})$ are initialized.
2) In the nth episode, market clearing is completed as (1)-(10b). After that, the reward of the ith agent $r_{in}$ can be obtained as (2). Then, $Q_{i}$, the average policy $\bar{p}_{i}$, $\xi$, and the policy $p_{i}$ are updated in sequence as (9), (10)-(11), (15), and (12)-(14), respectively. Last, the bid price of the ith agent $\lambda_{i}$ is updated according to the updated policy $p_{i}$.
3) $n = n + 1$ is set, and step 2) is repeated until $n$ exceeds the number of intervals.
The abovementioned implementation of WoLF-PHC for solving the bidding problems of PV-attached BESS power plants in Proposed Bidding Model is shown in Figure 3.
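The learning loop of one agent can be sketched as below. This is a minimal illustration of the generic WoLF-PHC update (Q-learning step, average-policy update, choice between the winning and losing step sizes, and a hill-climbing move toward the greedy action), not the authors' implementation. The class name `WoLFPHCAgent`, the toy reward table, and the parameter values are hypothetical, and the market clearing is replaced by a fixed reward per action.

```python
import random


class WoLFPHCAgent:
    """Tabular WoLF-PHC learner for one agent (requires n_actions >= 2)."""

    def __init__(self, n_states, n_actions, alpha, beta, xi_w, xi_l, rng=None):
        self.n_actions = n_actions
        self.alpha, self.beta = alpha, beta
        self.xi_w, self.xi_l = xi_w, xi_l  # xi_w < xi_l: learn fast when losing
        self.rng = rng or random.Random()
        uniform = 1.0 / n_actions
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.policy = [[uniform] * n_actions for _ in range(n_states)]
        self.avg_policy = [[uniform] * n_actions for _ in range(n_states)]
        self.visits = [0] * n_states

    def act(self, state):
        # Sample an action (e.g. an index into a grid of bid prices).
        return self.rng.choices(range(self.n_actions), weights=self.policy[state])[0]

    def learn(self, state, action, reward, next_state):
        q_s, pi, avg = self.q[state], self.policy[state], self.avg_policy[state]
        # 1) Q-learning update.
        target = reward + self.beta * max(self.q[next_state])
        q_s[action] = (1.0 - self.alpha) * q_s[action] + self.alpha * target
        # 2) Running average of the policy in this state.
        self.visits[state] += 1
        for a in range(self.n_actions):
            avg[a] += (pi[a] - avg[a]) / self.visits[state]
        # 3) Winning if the current policy beats the average policy under Q.
        winning = sum(p * q for p, q in zip(pi, q_s)) > sum(p * q for p, q in zip(avg, q_s))
        xi = self.xi_w if winning else self.xi_l
        # 4) Hill climbing: shift probability mass toward the greedy action.
        best = max(range(self.n_actions), key=q_s.__getitem__)
        for a in range(self.n_actions):
            if a != best:
                step = min(pi[a], xi / (self.n_actions - 1))
                pi[a] -= step
                pi[best] += step


if __name__ == "__main__":
    agent = WoLFPHCAgent(1, 3, alpha=0.5, beta=0.9, xi_w=0.05, xi_l=0.2,
                         rng=random.Random(0))
    toy_revenue = [1.0, 5.0, 2.0]  # stand-in for the market clearing result
    for _ in range(200):
        a = agent.act(0)
        agent.learn(0, a, toy_revenue[a], 0)
    print(agent.policy[0])
```

In a full simulation, each plant would hold its own agent, the actions would index candidate bid prices, and the reward passed to `learn` would be the revenue returned by the market clearing, so that no agent needs any information about its opponents. Because actions are sampled directly from the policy without extra exploration, this sketch can settle on a suboptimal action.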

CASE STUDY
The proposed model is tested on the IEEE 6-node and 118-node systems. Scenarios for the electricity output capacities of the PV unit are derived from historical data from the work of Agathokleous and Steen (2019) and are represented in Figure 4 after scenario generation and reduction (Niknam et al., 2012; Morales et al., 2009). The 10 corresponding probabilities are shown in Table 1. Parameters of the BESS and WoLF-PHC are shown in Table 2 and Table 3, respectively. We run all simulations in MATLAB on a computer with a 1.6 GHz Intel Core i5-5250U processor.

Case 1
In the 6-node system, three suppliers (PV-attached BESS power plants) are located at buses 1-3, one per bus, and three loads are connected to buses 4-6, one per bus. The bid prices of the loads are assumed constant over 24 h at 59.4 $/MWh, 50.8 $/MWh, and 39.7 $/MWh (Zugno et al., 2013).
The three suppliers (PV-attached BESS power plants) represent three strategic participants in this case. With the parameter settings given above, their bid prices and revenues are shown in Figure 5 and Figure 6. The results demonstrate that WoLF-PHC can optimize the bid price of each competing PV-attached BESS power plant. During this process, each strategic participant obtains its optimal bid price by relying only on the interaction with the external environment, the ISO. Rivals' cost functions, bidding information, and historical bidding information are not open to the agent. WoLF-PHC thus protects the private information of market players. The power outputs and SOC of the three PV-attached BESS power plants are shown in Figures 7-9, respectively. There is no solar power output during 1:00-5:00 and 20:00-23:00, when the BESSs supply the loads by discharging, while the PV units meet the load demand and charge the BESSs during 6:00-19:00.

Comparison
The proposed bidding model of the PV-attached BESS power plant is compared with two other models, which consider only PV units and only BESSs as strategic market participants. The revenue comparison of the proposed model, the PV unit, and the BESS for 24 h is represented in Table 4. Due to the limited daylight hours and the degradation of the battery, the revenues of the PV unit and the BESS acting separately as strategic players are both lower than the profit of the proposed model. The social welfare of the proposed model is higher than that of the other two models.

FIGURE 10 | Bid prices of three suppliers (PV-attached BESS power plants) with increased load in the 6-node system.

Case 2
In this case, bidding prices are analyzed for different load levels and different numbers of strategic players. First, the total load demand is set equal to the total capacity of the three strategic players. Figure 10 represents the bid prices of the three suppliers over 100 iterations. Compared with the bid prices in Figure 5, the bid prices of the three participants are higher in Figure 10. The lack of competition among market participants means that they do not need to lower their bid prices to sell more. On the contrary, each player tries to raise its bid price to earn more profit. Then, the total demand is set to half of the load in case 1. The bidding results are shown in Figure 11. More competition drives all participants to reduce their bid prices below those in Figure 5. Last, the number of strategic players is increased to five, and the corresponding curves of bidding prices over 100 iterations are represented in Figure 12. Suppliers adopt relatively conservative behaviors, so their price levels are lower than those in Figure 5.

Case 3
The proposed model is applied to the IEEE 118-node system with three and nine PV-attached BESS power plants, respectively, in this case. The three suppliers are located at nodes 12, 29, and 98; each is then duplicated into three strategic players at the same node, which gives nine players. The convergence of bidding prices for three suppliers and nine suppliers is shown in Figure 13 and Figure 14, respectively, which imply that each agent can reach a convergent bid price with WoLF-PHC in a larger power system with more participants. In this process, no information about opponents is required. Each supplier communicates only with the ISO in the clearing process, which ensures the privacy of suppliers. Additionally, the overall bid level of the nine strategic suppliers in Figure 14 is lower than that of the three suppliers in Figure 13. This is because more competition compels market players to reduce bid prices in order to sell more.

CONCLUSION
A bidding model with incomplete information that considers the uncertainty of the generation output of PV units is proposed. The MARL algorithm WoLF-PHC is used to explore optimal bid prices for strategic PV-attached BESS power plants, and it protects the privacy and respects the autonomy of market players. Three cases are implemented in the modified IEEE 6-node system and a larger IEEE 118-node system, with the following conclusions: 1) multiple strategic market players can obtain their bid prices individually with the WoLF-PHC in the electricity market; 2) compared with models in which the PV unit and the BESS act independently as strategic participants, the revenue of the proposed model is higher; and 3) decreased load and an increased number of market players bring more competition, resulting in strategic suppliers bidding at lower prices.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.