Modeling and Simulation of Large-Scale Wind Power Base Output Considering the Clustering Characteristics and Correlation of Wind Farms

The rapid development of renewable energy raises the requirements for renewable energy output simulation. Accounting for the clustering characteristics and correlation of renewable energy can improve the accuracy of power output simulation. To clarify the typical power output process of a large-scale wind power base, this paper proposes a novel method for wind power output scene simulation. Firstly, the genetic algorithm (GA)-Kmeans method is used to divide the wind farms into clusters, and the wind power output of each cluster is calculated with the wind turbine model. Then, the Copula principle is used to describe the correlation between wind farm clusters. Finally, the power output scenes are simulated by the Markov chain Monte Carlo (MCMC) method. To verify the effectiveness of the proposed method, the wind power base in the downstream Yalong River basin is taken as the case study. The results show that the 65 wind farms should be divided into 6 clusters. Five typical power output scenes are simulated for each of the winter–spring and summer–autumn seasons based on the clustering characteristics and correlation of the wind farms. This study provides a valuable reference for other large-scale renewable power bases around the world.


INTRODUCTION
An energy structure dominated by fossil energy brings many problems, such as environmental pollution, climate change, and energy depletion, which seriously restrict the development of the social economy (Zhang et al., 2018; Wang et al., 2018). Since the beginning of the 21st century, energy structure transformation has become a focus of countries worldwide (Hou et al., 2019). Guided by this transformation, the development of the global renewable energy industry has been accelerating, and the installed capacity of renewable energy has increased from 812 GW in 2004 to about 3,089 GW in 2021. However, because of the high randomness, intermittency, and uncontrollability of renewable energy, large-scale wind and photovoltaic (PV) power generation presents large challenges for integration into a power grid (Liu et al., 2020). Therefore, clarifying the characteristics of large-scale renewable energy and simulating the power output scenes are of great significance for renewable energy development (Kim et al., 2020).
Currently, numerous studies focus on the analysis of the spatial and temporal distribution of renewable energy, as well as the evaluation of the complementary characteristics of various clean energy power stations. De Blasis et al. (2021) applied a high-order multivariate Markov model to clarify the cross- and autocorrelation characteristics between wind speed and direction. Almeida et al. (2021) proposed a Monte Carlo-based multi-area reliability assessment method to represent the relevant features and intermittency of variable renewable energy resources. Xu et al. (2017) constructed the relation function of the marginal cumulative distribution functions of wind speed and light irradiance intensity through the Copula function and used the Kendall rank correlation coefficient to describe the spatial and temporal characteristics of wind and PV indirectly. Huang et al. (2021) used the Copula method to analyze the uncertainties of wind and solar power for quantifying the risk of a wind-solar-hydro complementary system. Cantao et al. (2017) used hydro-wind correlation maps, which are quantitative and more intuitive, to analyze the complementarity of wind and hydropower. A novel method based on the variable-structure Copula function has also been proposed to describe the correlation and complementarity of distributed wind power and load for optimizing the planning capacity of distributed wind power. Antunes Campos et al. (2020) assessed the complementarity between wind and PV power with Pearson's correlation coefficient and optimized the energy storage capacity in utility-scale hybrid power plants. However, research that combines the temporal-spatial distribution characteristics of renewable energy with output simulation or prediction is still insufficient.
Based on the complementary characteristics of new energy sources such as wind and solar energy, many scholars have studied the clustering characteristics of new energy in different regions. Dai et al. (2017) proposed an evaluation method of cluster output smoothness and quantified the contribution of wind power clustering to reducing the fluctuation of wind power output. Yesilbudak (2016) adopted the Kmeans clustering method with Squared Euclidean, City-Block, Cosine, and Pearson Correlation distance measures to analyze the clustering characteristics of wind speed in 75 provinces of Turkey. According to the aggregate effect of wind and solar power plants, Liu et al. (2020) aggregated all the power plants of the study area into a virtual wind power plant and a virtual solar power plant. Chidean et al. (2018) presented the Second-Order Data-Coupled Clustering (SODCC) algorithm to analyze the wind power resource in the Iberian Peninsula. Yan et al. (2020) proposed a scenario generation method and established a planning model of renewable energy based on cluster partition. Nevertheless, little research has focused on the correlation of multiple renewable energy clusters.
To develop and utilize large-scale renewable energy and reduce the adverse impact of renewable energy uncertainty, many scholars have conducted research on renewable energy scene simulation and power forecasting. Renewable power output scene simulation aims to fully exploit the overall characteristics and statistical laws of renewable energy, generate typical power output scenes, and provide a basis for renewable power system planning (Densing and Wan, 2022). In the previous literature, Deng et al. (2018) used a typical scenario simulation method of renewable power output to calculate the renewable energy accommodation capability. Ding et al. (2016) proposed a short-term stochastic simulation method based on the renewable power output error and applied the method to a real power grid in Northwest China. In contrast to renewable power output scene simulation, renewable energy prediction provides a basis for making power system generation plans and for power grid dispatching operation. Renewable energy forecasting generally uses statistical regression methods and machine learning technologies to estimate the future power output process. A hybrid wind power forecasting approach based on Bayesian model averaging and ensemble learning has been proposed. Neshat et al. (2021) proposed a novel three-stage composite deep learning-based evolutionary approach to forecast the power output of wind-turbine farms with the chaotic characteristics of wind speed series. Singh et al. (2021) reported the short-term wind power forecasting accuracy of five machine learning methods: k-nearest neighbor (kNN), decision tree, extra tree regression, random forest, and gradient boosting machine (GBM). However, most of the existing research ignores the characteristic differences between wind farms, and there are only a few studies on simulating or forecasting the output of large-scale wind power bases based on the clustering method.
At present, research on the characteristics of new energy resources, cluster division, and renewable power output forecasting and simulation has achieved phased results, but there are still some deficiencies. In the planning and design stage of a renewable energy system, simulated scenes of renewable power output are frequently used, and unreasonable wind power output scenes would seriously affect the development and management of the system. In particular, previous research on renewable energy simulation assumes that the power output is consistent across the whole area, so the power output scenes of one representative wind farm are usually used to describe all wind farms in the region. However, for large-scale wind power bases, the wind power output characteristics differ within the region. Ignoring the correlation and complementarity of wind farm clusters will lead to a large deviation in the simulation results of the wind power output. Consequently, research on power output scene simulation of large-scale wind power bases that considers the power station cluster division and the power output correlation of adjacent clusters is necessary and urgent.
To fill this gap and obtain accurate power output scenes of large-scale wind power bases, this paper proposes a power output scene simulation method considering power station clustering and cluster correlation. Firstly, the wind farm clusters are divided by the genetic algorithm (GA)-Kmeans method with similar distances. Secondly, based on the conversion relationship of wind speed and electric power, the wind power output physical model is used to calculate the wind power output of each wind

METHODOLOGY
The method used for simulating the power output scenes of a large-scale wind power base mainly consists of four parts. The technical route of the large-scale wind power base output simulation method considering correlation is shown in Figure 1.
The nomenclature table of abbreviations, variables, and constants is shown in Table 1.

Wind Farm Cluster Division With the GA-Kmeans Method
The GA-Kmeans method was used to divide the wind farm clusters. With the conventional Kmeans method, the uncertainty caused by the clustering number K and the clustering centers $\{c_1, c_2, \ldots, c_K\}$ is difficult to resolve. The GA, which has a fast computing speed, stable operation, and a strong global searching ability, was therefore combined with the Kmeans method in this article. The GA-Kmeans method can effectively reduce the influence of the initial cluster number and the selection of cluster centers on the clustering result. In addition, it can improve the accuracy of the clustering results and keep Kmeans from converging to a local optimum (Yesilbudak, 2016). The correlation distance was selected as the distance evaluation index in the Kmeans clustering process. The fitness function optimized by the GA was constructed from the intra-class distance and inter-class distance of the clusters.
The correlation distance between $X = (x_1, x_2, \ldots, x_n)$ and $C = (c_1, c_2, \ldots, c_n)$ is one minus their correlation coefficient:

$$d(X, C) = 1 - \frac{\sum_{i=1}^{n}(x_i - \bar{x})(c_i - \bar{c})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(c_i - \bar{c})^2}}$$

where $\bar{x}$ and $\bar{c}$ are the mean values of the elements of X and C.

The similarity of objects within a Kmeans cluster is expressed by the average class inner distance, which is computed from the distances $d(x_{ij}, c_i)$ between the objects and their cluster center, where $x_{ij}$ is the jth object of class i, $N_i$ is the sample size of class i, $c_i$ is the cluster center of class i, and d(x, y) is the relative distance between two samples. The difference between objects of different clusters is expressed by the class spacing.

The fitness function is built from these two quantities, and its value is determined by the quality of the clustering result. The fitness function value is larger when the average in-class distance is smaller and the class spacing is larger; in that case, the clustering effect is better.
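As a rough illustration of how such a fitness value could be computed, the sketch below implements the correlation distance and one candidate fitness function in plain Python. The exact expressions for the average class inner distance, the class spacing, and the fitness function are not available here, so the ratio used in `fitness` (mean distance between cluster centers divided by mean distance of members to their own center) is an assumption, as are all function names.

```python
import math


def correlation_distance(x, c):
    """One minus the Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_c = sum(x) / n, sum(c) / n
    cov = sum((a - mean_x) * (b - mean_c) for a, b in zip(x, c))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_c = math.sqrt(sum((b - mean_c) ** 2 for b in c))
    return 1.0 - cov / (norm_x * norm_c)


def fitness(clusters, centers):
    """Assumed fitness: mean inter-center distance over mean intra-class distance.

    clusters[i] is the list of member series of class i and centers[i] is its
    center. At least two clusters are required, and members must not all
    coincide with their centers (otherwise the denominator is zero).
    """
    k = len(centers)
    # Average member-to-center distance per class, then averaged over classes.
    intra = sum(
        sum(correlation_distance(x, c) for x in members) / len(members)
        for members, c in zip(clusters, centers)
    ) / k
    # Average pairwise distance between cluster centers.
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    inter = sum(correlation_distance(centers[i], centers[j]) for i, j in pairs) / len(pairs)
    return inter / intra
```

Under this assumed form, a GA would search over the cluster number and the initial centers for the partition with the largest `fitness`; tighter clusters and better separated centers both raise the value.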

Calculation Method of Wind Power Output
The wind speed to electric power conversion model is adopted to calculate the output power of a single wind turbine. The total output process of a wind farm is then obtained by scaling the single-turbine output by the ratio of the installed capacity of the farm to the unit capacity. The power conversion relation of a wind turbine is:

$$P_{W,t} = \begin{cases} 0, & v < v_{ci} \ \text{or} \ v > v_{co} \\ \frac{1}{2} C_p \rho S v^3, & v_{ci} \le v < v_r \\ P_r, & v_r \le v \le v_{co} \end{cases}$$

$$\rho = \rho_0 \cdot \frac{273.15}{273.15 + T} \cdot \frac{P - 0.378 P_w}{1013.25}$$

where $P_{W,t}$ is the power output of a single wind turbine at time t, kW; $P_r$ is the rated power, kW; $C_p$ is the wind energy utilization coefficient; S is the swept area of the blades, m²; v is the real wind speed at time t, m/s; $v_{ci}$, $v_{co}$, and $v_r$ are the cut-in, cut-out, and rated wind speeds, respectively, m/s; ρ is the moist-air density, kg/m³; $\rho_0$ is the dry-air density at normal pressure and temperature, $\rho_0 = 1.293$ kg/m³; T is the temperature, °C; P is the air pressure at the hub height of the wind turbine, hPa; and $P_w$ is the water vapor pressure, hPa.
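The conversion from wind speed to turbine output can be sketched in a few lines of Python. This is a minimal illustration that assumes a cubic power law between the cut-in and rated wind speeds and the usual moist-air density correction; the function names and every numeric parameter in the usage lines are hypothetical (they are not the Table 2 values), and the swept area is simply derived from an assumed 121 m rotor diameter.

```python
import math


def moist_air_density(temp_c, pressure_hpa, vapor_hpa, rho0=1.293):
    """Moist-air density in kg/m^3 from temperature (deg C) and pressures (hPa)."""
    return rho0 * (273.15 / (273.15 + temp_c)) * ((pressure_hpa - 0.378 * vapor_hpa) / 1013.25)


def turbine_power_kw(v, rho, p_rated_kw, swept_area_m2, cp, v_ci, v_r, v_co):
    """Single-turbine output in kW for wind speed v (m/s)."""
    if v < v_ci or v > v_co:
        return 0.0  # below cut-in or above cut-out: no generation
    if v >= v_r:
        return p_rated_kw  # rated region
    # Cubic region: 0.5 * Cp * rho * S * v^3 gives watts, so convert to kW.
    p_kw = 0.5 * cp * rho * swept_area_m2 * v ** 3 / 1000.0
    return min(p_kw, p_rated_kw)


# Hypothetical parameters for illustration only.
rho = moist_air_density(temp_c=10.0, pressure_hpa=800.0, vapor_hpa=8.0)
swept_area = math.pi * (121.0 / 2.0) ** 2  # assumed 121 m rotor diameter
for v in (2.0, 6.0, 12.0, 30.0):
    print(v, turbine_power_kw(v, rho, 2500.0, swept_area, 0.4, 3.0, 9.5, 22.0))
```

The farm-level series would then follow by multiplying the single-turbine output by the number of equivalent units (installed capacity divided by unit capacity).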

Wind Farm Clusters Correlation Analysis With the Copula Principle
The correlation analysis method of adjacent wind farm clusters based on the Copula principle includes the marginal distribution model of wind farm cluster power output, the Copula function type and conditional distribution of adjacent wind power cluster output, and the goodness-of-fit test method of the distribution model.

The Marginal Distribution of Wind Power Output
The marginal distribution functions widely used in statistical analysis include the Pearson type III distribution (P-III), the lognormal distribution (Ln), the generalized extreme value distribution (Gev), and the Weibull distribution. In this paper, these four distributions are used to fit the marginal distribution of each wind power cluster's output.
It is worth noting that if the wind power output is taken as the random variable, the sample sequence contains many repeated minimum and maximum values. Moreover, the probabilities of the minimum and maximum values are not equal to 0, which makes the probability density function and cumulative distribution function of wind power output discontinuous. Therefore, the probability distribution of wind power output needs to be described by the interception distribution model, in which the probability density and cumulative distribution function combine point masses at the minimum and maximum output with a continuous part for the values in between. Here, β 1 and β 2 represent the probability of occurrence of the minimum and maximum events, respectively; δ(·) is the Dirac delta function, which represents the point masses; and f c (x; R) is a continuous function that describes the continuous part.

Table 1. Nomenclature (excerpt)
C(·): Copula joint distribution
c(·): Copula joint probability density
c i: The cluster center of class i
Correlation distance between two samples X and C
F(v|u): Copula conditional probability distribution function
The average class inner distance
Icd: Class spacing distance
K: Clustering number
N i: The sample size of class i
Pe i: Empirical frequency
P i: Theoretical frequency
P: The air pressure at the hub height of wind turbine
P W,t: The power output of unit wind turbine at time t
S: The swept leaf area

The Copula Function Type and the Conditional Distribution
Sklar (1959) introduced the theory of Copula into statistics, providing an effective method for multivariate analysis. For two-dimensional random variables, the random variables X and Y obey the marginal distributions $F_X(x)$ and $F_Y(y)$, respectively, and $F(x, y)$ represents their joint distribution function. Then there exists a Copula function C(·) such that:

$$F(x, y) = C(F_X(x), F_Y(y))$$

where x ∈ [0, 1] and y ∈ [0, 1].
If $F_X$ and $F_Y$ are continuous functions, then C(·) is unique, and the joint probability density function can be expressed as:

$$f(x, y) = c(u, v)\, f_X(x)\, f_Y(y), \qquad c(u, v) = \frac{\partial^2 C(u, v)}{\partial u\, \partial v}$$

where $u = F_X(x)$ and $v = F_Y(y)$ are random variables on [0, 1], and $f_X(x)$ and $f_Y(y)$ are the marginal probability density functions. Analyzing the correlation between variables is the basis of constructing the Copula joint distribution. The Pearson linear correlation coefficient ($r_n$), Spearman correlation coefficient ($\rho_n$), and Kendall rank correlation coefficient ($\tau_n$) were used to describe the correlation of wind energy among the wind power clusters.
Nelsen (1999) gave a detailed introduction to the Copula function and its properties. Generally, Copula functions can be divided into three types: Elliptic, Archimedean, and Quadratic. The Archimedean Copula with one parameter is the most widely used.
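In this paper, three Archimedean Copulas (Gumbel Copula, Clayton Copula, and Frank Copula) are used to construct the joint distribution of the wind power of the wind farm clusters. The joint distribution functions $C(u, v)$ and conditional distribution functions $F(v|u) = \partial C(u, v)/\partial u$ of the three Copula types are as follows:
1) The joint distribution function and conditional distribution function of the Gumbel Copula are:

$$C(u, v) = \exp\left\{-\left[(-\ln u)^{\theta} + (-\ln v)^{\theta}\right]^{1/\theta}\right\}$$

$$F(v|u) = C(u, v)\,\frac{(-\ln u)^{\theta - 1}}{u}\left[(-\ln u)^{\theta} + (-\ln v)^{\theta}\right]^{1/\theta - 1}$$

where θ is the parameter of the Gumbel Copula function, and θ ∈ [1, ∞).
2) The joint distribution function and conditional distribution function of the Clayton Copula are:

$$C(u, v) = \left(u^{-\theta} + v^{-\theta} - 1\right)^{-1/\theta}$$

$$F(v|u) = u^{-\theta - 1}\left(u^{-\theta} + v^{-\theta} - 1\right)^{-1/\theta - 1}$$

where θ is the parameter of the Clayton Copula function, and θ ∈ (0, ∞).
3) The joint distribution function and conditional distribution function of the Frank Copula are:

$$C(u, v) = -\frac{1}{\theta}\ln\left[1 + \frac{(e^{-\theta u} - 1)(e^{-\theta v} - 1)}{e^{-\theta} - 1}\right]$$

$$F(v|u) = \frac{e^{-\theta u}\,(e^{-\theta v} - 1)}{(e^{-\theta} - 1) + (e^{-\theta u} - 1)(e^{-\theta v} - 1)}$$

where θ is the parameter of the Frank Copula function, and θ ∈ R, θ ≠ 0.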
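The three Copula families can be written compactly in code. The sketch below uses the standard textbook forms of the Gumbel, Clayton, and Frank Copulas and their conditional distributions, taken as the partial derivative of C(u, v) with respect to u; the function names are illustrative, and the final loop only checks each conditional formula against a numerical derivative at one arbitrary point.

```python
import math


def gumbel_cdf(u, v, theta):
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-s ** (1.0 / theta))


def gumbel_cond(u, v, theta):
    """F(v|u) for the Gumbel Copula, theta >= 1."""
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return gumbel_cdf(u, v, theta) * s ** (1.0 / theta - 1.0) * (-math.log(u)) ** (theta - 1.0) / u


def clayton_cdf(u, v, theta):
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def clayton_cond(u, v, theta):
    """F(v|u) for the Clayton Copula, theta > 0."""
    return u ** (-theta - 1.0) * (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta - 1.0)


def frank_cdf(u, v, theta):
    a = math.exp(-theta * u) - 1.0
    b = math.exp(-theta * v) - 1.0
    d = math.exp(-theta) - 1.0
    return -math.log(1.0 + a * b / d) / theta


def frank_cond(u, v, theta):
    """F(v|u) for the Frank Copula, theta != 0."""
    a = math.exp(-theta * u) - 1.0
    b = math.exp(-theta * v) - 1.0
    d = math.exp(-theta) - 1.0
    return math.exp(-theta * u) * b / (d + a * b)


# Compare each closed-form conditional with a central finite difference of C in u.
h = 1e-6
for cdf, cond, theta in ((gumbel_cdf, gumbel_cond, 2.0),
                         (clayton_cdf, clayton_cond, 1.5),
                         (frank_cdf, frank_cond, 4.0)):
    numeric = (cdf(0.4 + h, 0.7, theta) - cdf(0.4 - h, 0.7, theta)) / (2.0 * h)
    print(cond.__name__, cond(0.4, 0.7, theta), numeric)
```

The parameter values in the loop are arbitrary test inputs, not fitted values from the case study.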

The Goodness-of-Fit Test Index
Root Mean Square Error (RMSE) and Akaike Information Criterion (AIC) were used to evaluate the goodness of fit of the Copula joint distribution function.
1) RMSE is the most commonly used index for the goodness-of-fit test:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Pe_i - P_i\right)^2}$$

where $Pe_i$ and $P_i$ are the empirical frequency and theoretical frequency, respectively, and n is the sample size.
2) AIC considers both the deviation of the Copula function fitting and the uncertainty caused by the number of parameters of the Copula function:

$$AIC = n \ln\left(\frac{1}{n}\sum_{i=1}^{n}\left(Pe_i - P_i\right)^2\right) + 2m$$

where m is the number of model parameters. The smaller the AIC and RMSE values, the better the fit of the Copula function.
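Both indices are simple functions of the empirical and theoretical frequencies. The following sketch computes them in plain Python, assuming the mean-squared-error form of the AIC; the function names and the sample frequencies are illustrative only.

```python
import math


def rmse(empirical, theoretical):
    """Root mean square error between empirical and theoretical frequencies."""
    n = len(empirical)
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(empirical, theoretical)) / n)


def aic(empirical, theoretical, m):
    """AIC = n * ln(MSE) + 2m, where m is the number of model parameters.

    Undefined for a perfect fit (MSE = 0), because ln(0) does not exist.
    """
    n = len(empirical)
    mse = sum((e - t) ** 2 for e, t in zip(empirical, theoretical)) / n
    return n * math.log(mse) + 2 * m


# Hypothetical frequencies: the candidate with the smaller RMSE and AIC fits better.
empirical = [0.10, 0.35, 0.60, 0.85]
candidate_a = [0.12, 0.33, 0.61, 0.86]
candidate_b = [0.20, 0.25, 0.70, 0.75]
for name, fitted in (("A", candidate_a), ("B", candidate_b)):
    print(name, rmse(empirical, fitted), aic(empirical, fitted, m=1))
```

Since the three Archimedean Copulas considered each have one parameter, the 2m term is the same for all of them and the ranking is driven by the squared error.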

Large-Scale Wind Power Output Scene Simulation Considering the Correlation
According to the correlation characteristics among wind power clusters, the MCMC method is used in this study to randomly sample from the conditional distribution of each variable and its related variables in a fixed order to form the output scenario set of large-scale wind power bases, and the sampled output scenarios are reduced based on the synchronous backstepping method to extract representative typical output scenarios. The steps of output scenario simulation of large-scale wind power are as follows:
1) Generate N random numbers $a_1, a_2, \ldots, a_N \in (0, 1)$; let $a_1$ be the marginal probability of the wind power output of the first cluster, that is, $P(X_1 \le x_1) = a_1$; substitute $a_1$ into the inverse function of the marginal distribution, $F_1^{-1}(a_1) = x_1$, and solve for $x_1$, which is the power output of the first cluster.
2) Let $a_i$, $i \in [2, N]$, be the conditional transition probabilities from the second cluster to the last cluster, $P(X_i \le x_i \mid X_{i-1} = x_{i-1}) = a_i$; substitute $a_i$ into the conditional distribution between adjacent clusters one by one, $F(v|u) = P(X_i \le x_i \mid X_{i-1} = x_{i-1})$, and calculate the marginal probability $v_i$; according to the inverse function of the marginal distribution, $F_i^{-1}(v_i) = x_i$, solve for $x_i$, which is the power output of cluster i.
3) Calculate the output of all wind farm clusters $(x_1, x_2, \ldots, x_N)$ and accumulate the wind farm clusters' power output to obtain the output scenario of a large-scale wind power base.
4) Repeat steps (1) to (3) M times to obtain the output scenario set of a large-scale wind power base.
5) Based on the Kmeans scenario reduction method, the representative typical output scenarios are extracted from the output scenario set of a large-scale wind power base.
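Steps (1) to (4) amount to sampling along a chain of adjacent clusters. The sketch below illustrates the idea under several loudly stated assumptions: a Clayton Copula is swapped in for every adjacent pair because its conditional distribution can be inverted in closed form, Weibull quantile functions stand in for the fitted marginal distributions, and all parameter values and function names are hypothetical. It does not cover the scenario reduction in step (5).

```python
import math
import random

EPS = 1e-12


def clamp_unit(p):
    """Keep a probability strictly inside (0, 1) to avoid division by zero."""
    return min(max(p, EPS), 1.0 - EPS)


def clayton_inverse_conditional(u, a, theta):
    """Solve F(v|u) = a for v under a Clayton Copula with theta > 0."""
    return (u ** -theta * (a ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)


def weibull_quantile(p, scale, shape):
    """Inverse marginal distribution (assumed Weibull for every cluster)."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)


def simulate_scenario(marginals, thetas, rng):
    """One scenario: cluster outputs sampled along the chain, plus their sum."""
    u = clamp_unit(rng.random())                      # step 1: marginal probability
    outputs = [weibull_quantile(u, *marginals[0])]
    for (scale, shape), theta in zip(marginals[1:], thetas):
        a = clamp_unit(rng.random())                  # step 2: conditional probability
        u = clamp_unit(clayton_inverse_conditional(u, a, theta))
        outputs.append(weibull_quantile(u, scale, shape))
    return outputs, sum(outputs)                      # step 3: accumulate


# Step 4: repeat M times with hypothetical marginals (scale, shape) and thetas.
rng = random.Random(0)
marginals = [(0.35, 1.8), (0.30, 2.0), (0.40, 1.6)]
thetas = [2.0, 1.5]
scenario_set = [simulate_scenario(marginals, thetas, rng) for _ in range(1000)]
```

For Copulas without a closed-form inverse of F(v|u), such as the Gumbel Copula, the same loop would need a numerical root finder (for example bisection on v) in place of `clayton_inverse_conditional`.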

CASE STUDY
This study focuses on the Yalong River (the longest tributary of the Jinsha River), which is located in Southwest China. Its geographical location is 96°52′E to 102°48′E and 26°32′N to 33°58′N. The Yalong River basin is rich in wind and solar energy resources. There are abundant wind and PV power resources on both sides of the river basin, and it has great development potential (Liu et al., 2019). The complementary characteristics of wind, PV, and hydropower resources within the year are fully used to improve the comprehensive benefits. According to the preliminary plan of the watershed-type multi-energy complementary bases (WMCB) in the downstream Yalong River basin, there are more than 65 wind farms with a total capacity of 7 GW and nearly 19 PV power stations with a total capacity of about 5.6 GW, and the hydropower installed capacity of the downstream Yalong River basin is 14.7 GW.
According to the planned location of the wind farms in the lower reaches of the Yalong River (as shown in Figure 2), the wind energy reanalysis data at each station location are extracted, and the wind speed power conversion model is used to calculate the long-series output process of each wind farm. The advanced GW121-2.5MW wind turbine is selected as the reference unit in the research process. The main technical parameters of the GW121-2.5MW wind turbine are shown in Table 2.

RESULTS AND DISCUSSION
To numerically verify the effectiveness of the proposed model and method, the results for the wind farm clusters in the downstream Yalong River basin are presented and discussed.

Dividing the Wind Farm Clusters
The Kmeans method first requires the clustering number k to be determined. Generally, the optimal clustering number lies in $[2, \sqrt{N}]$, where N represents the number of wind farms to be clustered. In this study, 65 wind farms in the downstream Yalong River basin are clustered. Considering the geographical location, scale, wind energy, and other specific conditions of the wind farms, the maximum clustering number is 8 and the minimum clustering number is 3. The clustering results of wind farms in the downstream Yalong River basin under different clustering numbers are shown in Figure 3. As can be seen from Figure 3, with the increase of the clustering number, the concentration of each wind farm cluster increases. However, when the clustering number is too large, individual clusters contain too few wind farms.
Therefore, comparing the clustering results under different cluster numbers, the optimal clustering number is k = 6. The cluster division results and cluster centers of wind farms in the downstream Yalong River basin are shown in Figure 4. The cluster center, representative wind farm, and capacity of each wind farm cluster are shown in Table 3. From Figure 4, the clustering results calculated by GA-Kmeans show obvious regional characteristics, and the characteristics are consistent with the actual situation of the Yalong River basin.
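The search range for the clustering number follows directly from the square-root rule. A minimal sketch of that arithmetic, with a hypothetical helper name, is:

```python
import math


def candidate_cluster_numbers(n_farms, k_min=2):
    """Integer cluster numbers in [k_min, floor(sqrt(n_farms))]."""
    k_max = math.isqrt(n_farms)  # integer part of the square root
    return list(range(k_min, k_max + 1))


# 65 wind farms with the lower bound of 3 used in this study.
print(candidate_cluster_numbers(65, k_min=3))
```

For 65 wind farms the square root is just above 8, which is consistent with the maximum clustering number of 8 used above.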

Power Output Characteristic of Wind Farm Clusters
According to the wind power cluster division results, the power output of each cluster is calculated by the wind power output physical model, and the typical daily power output of the 6 clusters in each month is shown in Figure 5. It can be seen from Figure 5 that the wind power output follows obvious daily and annual patterns. In the short term, the wind power output is low from 10:00 to 15:00 and usually reaches its peak at about 20:00, which coincides with the peak load. In the long term, the output of the wind power clusters shows an obvious seasonal pattern. From June to October, the power output of each cluster is significantly lower than in other months. Therefore, the year can be divided into two characteristic periods: summer-autumn and winter-spring. The daily power output intervals of the 6 wind farm clusters in the winter-spring and summer-autumn seasons are shown in Figure 6. From Figure 6, there are significant differences in the daily power output intervals of different seasons and different clusters. In the winter-spring season, the mean value and variation range of daily power output are relatively large, while in the summer-autumn season, both are small.
Note to Table 4: The bold values represent the adjacent wind clusters with the best correlation.

Correlation Analysis of Wind Farm Clusters Based on Copula
Based on the wind farm cluster division results in the downstream Yalong River basin and the power output sequence and characteristics of each wind farm cluster, the correlation of adjacent wind farm clusters is analyzed with three types of Copula function.

The Correlation Coefficient of Adjacent Wind Farm Clusters
In this study, the Pearson, Spearman, and Kendall correlation coefficients are used to evaluate the correlation among six wind farm clusters in downstream Yalong River basin, as shown in Table 4. The scatter matrix of wind farm clusters is drawn in Figure 7.
Because it is hard to analyze the correlation of multiple wind farm clusters directly, this study uses a set of correlations of adjacent wind farm clusters to represent the correlation of multiple wind farm clusters. According to the three correlation coefficients and scatter matrix of each wind power cluster, the adjacent wind farm clusters with strong correlation are selected to form the adjacent wind farm clusters connected head to tail: Cluster3-Cluster5, Cluster5-Cluster4, Cluster4-Cluster1, Cluster1-Cluster2, and Cluster2-Cluster6. Figure 7 and Figure 4 indicate that the selected adjacent wind power clusters are consistent with the spatial distribution law of wind farm clusters in downstream Yalong River basin.
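The three correlation coefficients can be computed with a few lines of plain Python. The sketch below is illustrative only: the function names are hypothetical, the rank function assumes no tied values (real output series with repeated minimum or maximum values would need average ranks), and the Kendall coefficient is the simple tau-a form.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Ranks starting at 1; assumes all values are distinct."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman coefficient as the Pearson coefficient of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)
```

Applied to the output series of two clusters, a pair with high values of all three coefficients would be a natural candidate for an adjacent link in the head-to-tail chain described above.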

The Marginal Distribution of Each Wind Farm Cluster
In this paper, the generalized extreme value distribution, Weibull distribution, Pearson type III distribution, and lognormal distribution are selected as the marginal distribution function to fit the power output of each wind farm cluster, and the marginal distribution parameters are estimated by the maximum likelihood method. The cumulative distribution curves of the power output of six wind power clusters in downstream Yalong River basin are shown in Figure 8.
From Figure 8, comparing the empirical frequency with the cumulative frequency of each marginal distribution, it can be found that the goodness of fit of the four distribution curves is roughly the same, and all four distribution types fit the data samples well. After screening, the optimal marginal distribution of Cluster1, Cluster3, Cluster5, and Cluster6 is the lognormal distribution, and the optimal marginal distribution of Cluster2 and Cluster4 is the Weibull distribution. The optimal marginal distribution of each wind farm cluster is used to construct the Copula joint distribution. Moreover, the interception distribution model used in this study can effectively fit the samples with a power output of 0 and 1 in the data series.

The Copula Joint Function of Adjacent Wind Farm Clusters
According to the adjacent wind farm clusters and the marginal distribution of each wind power cluster, the Gumbel Copula, Clayton Copula, and Frank Copula are used to construct the joint distribution of adjacent wind farm clusters. The Copula joint distribution parameters are estimated by the maximum likelihood method. The AIC and RMSE criteria are used to test the goodness of fit of the Copula functions, as shown in Table 5. As can be seen from Table 5, the best joint distribution for Cluster3-Cluster5 is the Frank Copula function. The best joint distribution for Cluster5-Cluster4, Cluster4-Cluster1, Cluster1-Cluster2, and Cluster2-Cluster6 is the Gumbel Copula function. The best Copula joint distribution of adjacent wind farm clusters is shown in Figure 9. Figure 9 indicates that the Copula joint distribution diagram can intuitively reflect the joint probability of adjacent wind farm clusters. According to the joint distribution of wind farm clusters, when the power output of a wind farm cluster is known, the conditional probability of the power output of its adjacent cluster can be determined. Conversely, given the joint probability of adjacent wind farm clusters and the power output of one of the clusters, the corresponding power output of the other cluster can be deduced.

Output Scenario Combination of Large-Scale Wind Power
Based on the Copula joint distribution, the 10,000 sets of power output scenarios of downstream Yalong River wind power base in summer-autumn and winter-spring are simulated by the MCMC method and shown in Figures 10A, B. From Figure 6 and Figures 10A, B, the simulated power output scenario sets of wind power base in summer-autumn and winter-spring have the same law as the daily power output interval.
However, the power output scenario sets of the wind power base in summer-autumn and winter-spring are too complex to use directly, so they need to be reasonably reduced. The Kmeans scenario reduction model is used to reduce the 10,000 sets of scenarios into five typical scenarios. The representative typical wind power output scenarios and the corresponding scenario probabilities in the winter-spring and summer-autumn seasons are shown in Figures 10C, D. From Figure 10, the typical power output scenes can basically cover the original scenario set, and each typical power output scenario is highly representative. Moreover, comparing the typical power output scenes of the wind power base in the summer-autumn and winter-spring seasons, the probabilities of the scenes in the winter-spring season are relatively similar, each about 0.2. In the summer-autumn season, the probabilities of the scenes are quite different: the probability of Scene2 and Scene3 is close to 0.3, while the probability of Scene5 is only 0.055. In general, the power output simulation method of large-scale wind power base can

CONCLUSION
The continuous expansion of new energy such as wind power directly increases power system uncertainty and makes the grid integration of new energy more difficult. It is beneficial to divide a large-scale wind power base into wind power clusters and quantify the correlation of the clusters. Therefore, this paper proposed a power output scene simulation method for large-scale wind power bases considering power station clustering and cluster correlation characteristics. The method is applied to the downstream Yalong River, and the main conclusions of this paper can be summarized as follows:
1) The GA-Kmeans clustering method with the correlation distance as the evaluation criterion can quickly and accurately divide the clusters of renewable energy power stations and effectively reduce the influence of the cluster number and initial cluster centers on the Kmeans clustering results. In the case study, this method is applied to divide the 65 wind farms in the downstream Yalong River basin into 6 clusters, and the cluster division results are consistent with the spatial distribution characteristics of wind energy resources in the basin.
2) The Copula function can effectively reflect the output correlation of multi-dimensional wind farm clusters and significantly improve the simulation or prediction of the power output of large-scale wind power bases. In the case study, the Copula function is constructed to determine the best joint distribution of adjacent pairs among the 6 wind farm clusters in the downstream Yalong River basin. Then, based on the correlation characteristics, the MCMC sampling method is used to simulate the typical power output of the downstream Yalong River wind power base in the winter-spring and summer-autumn seasons, respectively.
3) Compared with the power output scenario sets, the typical power output scenes can effectively remove the redundant information in many scenario sets and highlight the representative situation of the integrated output of a large-scale wind power base. Furthermore, the typical power output scenes could be conducive to the application of scenes in practical work such as planning, design, scheduling, and operation of large-scale wind power base.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.

AUTHOR CONTRIBUTIONS
MZ designed the framework and analyzed the data of this study; YW and XW provided significant suggestions on the methodology and structure of the manuscript; YZ, JC, and TL collected the data; MZ wrote the paper.

FUNDING
This research was funded by the Natural Science Foundation of China (grant numbers U1965202, 52009101, and 51909207).