Customer Load Forecasting Method Based on the Industry Electricity Consumption Behavior Portrait

With the sharp growth in energy demand and the rising operating pressure on power systems, power grid planning and optimized operation face higher requirements. Deeply extracting the characteristics of user load is important for refining distribution network planning. First, a load characteristic analysis process from the user level to the industry level is proposed, which divides the electricity consumption patterns of each industry and thereby builds a panoramic portrait of industry electricity consumption behavior. Then, by expanding the information that customers traditionally fill in, the feature vector of each user is extracted, and the users' industry electricity consumption patterns are used as labels. On this basis, a method for identifying a customer's electricity consumption pattern based on the BB-stacking model fusion framework is proposed, which yields preliminary forecasts of customer load from the customer's load accounting results. Finally, comparative simulations with different methods verify the effectiveness of the proposed algorithm, which can provide useful guidance for actual distribution network planning.


Development of Customer Portrait
More recently, as enterprises gradually shift from product-oriented to user-oriented production, fully understanding customers and their needs has become important for formulating marketing strategies and designing products (Wu et al., 2020). At present, in-depth mining of user data and research on accurate user portraits are steadily advancing. Accurate user portrait technology has been widely used in the Internet, finance, retail, operators, advertising, and other industries (Liu and Du, 2020; Zhang et al., 2020). In terms of advertising, Google uses big data to locate high-value users, which helps mobile e-commerce apps push advertisements to them, and adopts flexible data tracking methods for users of different tag categories, so that brands can achieve accurate advertising push for target users (Shan, 2018). In terms of auxiliary business decision-making, eBay carries out business circle customer group analysis, product marketability analysis, store operation analysis, personalized consumption analysis, and customer loss analysis by integrating massive online and offline data, providing decision support for commercial real estate and comprehensive big data analysis and prediction for project parties and brands (Chen et al., 2021). User portrait technology has thus become an effective method to improve customer service quality and customer experience, and it is an important basis for integrating high-quality resources and realizing enterprise and user value (Pitner et al., 2012; Yu et al., 2017).
In the field of electric power research, building user portraits for different application scenarios according to actual needs has become a rising research direction (Qiu et al., 2017). In early studies, traditional user portraits were mainly used in the marketing portal to build user electricity sensitivity or credit portraits to guide the electricity recovery work (Sanchez et al., 2008; Han et al., 2014; Ampimah et al., 2017). To meet the requirements of demand response, methods for establishing an electricity consumption behavior tag library and portraying the electricity consumption behavior patterns of different types of users were proposed in (Qiu et al., 2017) and (Zhong et al., 2018). With the gradual popularization of new energy (Yang et al., 2017; Yang et al., 2018; Yang et al., 2019a; Yang et al., 2019b; Yang et al., 2020), the load changes and influence mechanisms on the user side are becoming more and more complex. Therefore, the user portrait method is becoming more and more important.
The above literature all portrays users who are already connected to the power grid, so as to formulate appropriate strategies that guide those users to change their electricity consumption behavior. However, there is little research on reported customers who are not yet connected to the power grid. In distribution network planning, reference (Lian et al., 2014) makes use of the complementarity between user loads to optimize the access decision, which can effectively improve the load distribution and the utilization rate of power supply equipment. Moreover, the multidimensional analysis of user load can also provide effective guidance for the optimal scheduling and control of the smart grid (Zhang et al., 2015; Xi et al., 2016; Zhang et al., 2016; Zhang et al., 2021). However, because the load of unconnected customers has not been studied, optimization is limited to transferring the user load of the existing distribution network. For the more common scenario of business expansion, the data support and optimization ability are insufficient.

Research and Contribution of the Paper
In view of the above research results, the contributions of this paper can be summarized as follows:
1) Unlike the traditional single-user portrait, this paper mines the power consumption behavior patterns of massive users and constructs a more general and widely applicable power consumption behavior portrait for each industry.
2) A load forecasting method that differs from load forecasting in the general sense is presented. Aimed at new customers who are not yet connected to the power grid, it extracts the time sequence characteristics of customer load from limited data information, which provides an effective reference for actual distribution network planning.
3) Software integrating the above functions has been developed and applied in engineering, which provides an effective tool for the decision-making of distribution network planners.

PORTRAIT OF INDUSTRY ELECTRICITY CONSUMPTION BEHAVIOR
In this paper, a load characteristic analysis method from the user level to the industry level is proposed. The massive users of various industries in the region are systematically analyzed and integrated, and a comprehensive and practical industry electricity consumption behavior portrait of the region is then built.
The key steps are shown in Figure 1.

Load Data Preprocessing
For continuous load data, Lagrange interpolation (Criscuolo et al., 1984) is used to fill missing values and replace bad data, which repairs the data well. For each data point that needs to be filled, the m normal data points adjacent to it are selected as samples to construct the Lagrange interpolation polynomial:

$$L(t) = \sum_{i=1}^{m} x_i \prod_{j=1,\, j \neq i}^{m} \frac{t - t_j}{t_i - t_j} \tag{1}$$

where $t_i$ and $t_j$ are the sampling times of the i-th and j-th data points in a day, $x_i$ is the load value of the i-th data point, and m is the number of sample points used to construct the polynomial. Finally, the approximate value of the missing point is obtained by substituting the missing time t into $L(t)$.
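As a minimal sketch of the repair step described above, the following Python function evaluates the Lagrange polynomial through m neighboring samples at a missing time. The function name `lagrange_fill`, the choice of four neighbors, and the sample values are illustrative assumptions, not taken from the paper.

```python
def lagrange_fill(times, values, t_missing):
    """Evaluate the Lagrange polynomial through (times, values) at t_missing."""
    m = len(times)
    total = 0.0
    for i in range(m):
        # Basis polynomial: equals 1 at times[i] and 0 at every other node.
        basis = 1.0
        for j in range(m):
            if j != i:
                basis *= (t_missing - times[j]) / (times[i] - times[j])
        total += values[i] * basis
    return total


# Hypothetical example: sampling slots 1, 2, 4, 5 are valid and slot 3 is missing.
times = [1, 2, 4, 5]
values = [10.0, 12.0, 16.0, 18.0]  # load values that lie on the line x = 2t + 8
filled = lagrange_fill(times, values, 3)
print(filled)  # approximately 14.0, since the polynomial reproduces linear data
```

Using only a few adjacent points keeps the polynomial degree low, which limits the oscillation that high-degree interpolation can introduce.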
To avoid the influence of the actual load magnitude in the follow-up analysis, the repaired load data are normalized by linear proportional normalization, which can be expressed as

$$x_i' = \frac{x_i}{x_{\max}} \tag{2}$$

where $x_i'$ represents the i-th load data point after normalization, $x_i$ is the i-th load data point before normalization, and $x_{\max}$ is the peak load of the user.

Load Analysis of User Level
User Load Curve Analysis
The user load curve has strong randomness, so the load curve of any single day cannot be assumed to be typical for the user. In this paper, the time series mining method in (Lin et al., 2017) is used to extract the typical daily load curve of the user over a period of time T. The specific process is as follows:
1) Dimensionality reduction of load data: let n be the number of load data points per day and w be the number of segments into which the daily load data are divided by the piecewise aggregate approximation (PAA) method. The daily load time series after dimensionality reduction is $\bar{X} = [\bar{x}_1, \ldots, \bar{x}_w]$, calculated as

$$\bar{x}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w} i} x_j \tag{3}$$

where $\bar{x}_i$ is the mean value of the i-th segment of the time series after dimensionality reduction.
2) Symbolic representation: the SAX method (Notaristefano et al., 2013) is used to assign a symbol to each segment of the reduced daily load time series according to the mean value of the segment, with the optimal value α selected as the size of the symbol set. The discretized symbol sequence can be expressed as $Y = [y_1, \ldots, y_w]$.
3) Frequency statistics of symbol sequences: after the symbolic representation of the daily load data is completed, the frequency of each symbol sequence is counted. The most common load curve form is then screened out and abnormal load curves are eliminated. Let s be the number of times the most frequent symbol sequence appears.
4) Extraction of the typical daily load curve: the daily load curves corresponding to the most frequent symbol sequence are stacked and averaged to obtain the final typical load curve $L^d = [l^d_1, \ldots, l^d_n]$, which can be computed by

$$l^d_i = \frac{1}{s} \sum_{r=1}^{s} x_{r,i} \tag{4}$$

where $l^d_i$ stands for the i-th data point of the typical daily load curve $L^d$ and $x_{r,i}$ is the i-th data point of the r-th sample daily load curve.
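The four steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the load values are assumed to be already normalized to [0, 1], and the symbolization uses equal-width bins on [0, 1] as a simplification of the breakpoints used by standard SAX. The function names and the sample data are hypothetical.

```python
from collections import Counter


def paa(series, w):
    """Piecewise aggregate approximation: mean of each of w equal-length segments."""
    n = len(series)
    if n % w != 0:
        raise ValueError("series length must be divisible by w")
    seg = n // w
    return [sum(series[i * seg:(i + 1) * seg]) / seg for i in range(w)]


def symbolize(means, alpha):
    """Map each segment mean in [0, 1] to one of alpha letters (equal-width bins)."""
    letters = "abcdefghijklmnopqrstuvwxyz"[:alpha]
    return "".join(letters[min(int(v * alpha), alpha - 1)] for v in means)


def typical_daily_curve(days, w, alpha):
    """Average the daily curves whose symbol sequence is the most frequent one."""
    words = [symbolize(paa(day, w), alpha) for day in days]
    top_word, s = Counter(words).most_common(1)[0]
    selected = [day for day, word in zip(days, words) if word == top_word]
    n = len(days[0])
    curve = [sum(day[i] for day in selected) / s for i in range(n)]
    return top_word, curve


# Hypothetical normalized daily curves with n = 8 points each.
days = [
    [0.1, 0.1, 0.5, 0.6, 0.9, 1.0, 0.4, 0.2],
    [0.2, 0.1, 0.6, 0.6, 1.0, 0.9, 0.4, 0.3],
    [0.9, 1.0, 0.9, 1.0, 0.1, 0.1, 0.1, 0.1],  # unusual day with a different shape
]
word, curve = typical_daily_curve(days, w=4, alpha=4)
print(word, curve)
```

In this toy input the first two days share one symbol sequence, so the third day is excluded before averaging, which is the filtering effect that step 3 describes.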

User Load Characteristics Analysis
The user load curve directly describes the trend of user load over a day, while the load characteristics characterize the user's electricity consumption behavior more comprehensively and from multiple perspectives. In this paper, user load characteristics are selected and calculated from three aspects, i.e., user load level, typical peak-valley characteristics, and load development law. The names and expressions of each index are shown in Table 1.
In Table 1, the load level indicators represent the volume of user power consumption, which is the key information in the distribution network planning. Indicators of typical peak-valley characteristics can reflect the changing trend of user load in a day with less information granularity. The index of long-term load development reflects the rule that a user's load gradually increases to saturation after access to electricity and the relationship between saturated load and its area occupied. It is an important basis for reasonable estimation of the user's load.

Preliminary Classification of Users Based on Industry Classification Standards
To serve distribution network planning, the results of single user load characteristic analysis lack universality. Thus, it is necessary to explore the general law of massive users' load changes.
Generally, the users are classified into various industries according to the National Economic Industry Classification national standard, forming a user group in each industry, which has similarity in the long-term development law of load. In this way, the summary of the law and subsequent application is facilitated. After the preliminary classification of users is completed, due to the different characteristics of user peaks and valleys, it is necessary to further carry out industry power mode analysis and feature extraction.

Analysis of Industry Electricity Consumption Pattern
In the same industry, each user's electricity consumption behavior is still different. The gray wolf optimized fuzzy c-means clustering algorithm (GWO-FCM) with high clustering accuracy and fast computing speed in (Gao et al., 2019) is used to further subdivide the user groups in the industry. The specific process is as follows:
1) Initialize the cluster number $c = c_{\min}$.
2) Cluster the typical daily load curves $L^d_k$ of the users in the industry into c clusters with GWO-FCM.
3) Calculate the silhouette coefficient (SC) of each sample. For a sample $L^d_k$ assigned to cluster $C_u$, $SC(L^d_k)$ represents the SC of $L^d_k$, which can be computed by

$$SC(L^d_k) = \frac{D_b(L^d_k) - D_a(L^d_k)}{\max\{D_a(L^d_k),\, D_b(L^d_k)\}} \tag{7}$$

where $D_a(L^d_k)$ is the average distance between $L^d_k$ and the other samples in cluster $C_u$, and $D_b(L^d_k)$ is the minimum average distance between $L^d_k$ and the samples of a cluster other than $C_u$. The closer $SC(L^d_k)$ is to 1, the better the compactness and separability of the samples and the better the clustering effect.
The mean value of the SC of all N samples is calculated to evaluate the effectiveness of the clustering, which can be computed by

$$\overline{SC} = \frac{1}{N} \sum_{k=1}^{N} SC(L^d_k) \tag{8}$$

4) Traverse the optional range of cluster numbers: if the current cluster number c has reached $c_{\max}$, go to step 5). Otherwise, set $c = c + 1$ and go to step 2).
5) The number of clusters with the maximum mean SC is selected as the optimal number of clusters $c_{best}$. After the clustering centers are normalized, the typical curve of each power consumption pattern of the industry is obtained. The typical curve of industry electricity consumption pattern $C_u$ can be expressed as $L^c_u = [l^c_{u,1}, \ldots, l^c_{u,n}]$.
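The selection of the cluster number by the mean SC can be sketched as follows. Note the substitution: this example uses a plain fuzzy c-means routine without the gray wolf optimization step, so it is a stand-in for GWO-FCM rather than the algorithm of (Gao et al., 2019). The function names, the fuzzifier m = 2, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def fcm(X, c, m=2.0, iters=100, seed=0):
    """Plain fuzzy c-means (no gray wolf optimization); returns hard labels."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)  # memberships of each sample sum to 1
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = dist ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U.argmax(axis=1)


def mean_silhouette(X, labels):
    """Mean of SC = (D_b - D_a) / max(D_a, D_b) over all samples."""
    clusters = sorted(set(labels.tolist()))
    if len(clusters) < 2:
        return -1.0  # SC is undefined for a single cluster; treated as worst case
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for k in range(len(X)):
        same = labels == labels[k]
        if same.sum() == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        d_a = D[k, same].sum() / (same.sum() - 1)  # the self-distance is 0
        d_b = min(D[k, labels == u].mean() for u in clusters if u != labels[k])
        scores.append((d_b - d_a) / max(d_a, d_b))
    return float(np.mean(scores))


def best_cluster_number(X, c_min=2, c_max=8):
    """Try every c in [c_min, c_max] and keep the one with the largest mean SC."""
    scores = {c: mean_silhouette(X, fcm(X, c)) for c in range(c_min, c_max + 1)}
    return max(scores, key=scores.get), scores


# Synthetic stand-in for typical daily load curves: three groups of 4-point curves.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(mu, 0.05, size=(30, 4)) for mu in (0.2, 0.5, 0.8)])
c_best, scores = best_cluster_number(X, c_min=2, c_max=5)
print(c_best, scores)
```

Hard labels are taken from the largest membership so that the SC, which is defined on crisp clusters, can be applied to a fuzzy clustering result.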

Characteristic Extraction of Industry Electricity Consumption
After completing the division of industry electricity consumption pattern, it is necessary to extract the corresponding characteristics of each electricity consumption pattern in the industry.
Considering the nature of the characteristics, it is not significant to summarize and extract the load level and typical peak-valley characteristics at the industry level. Considering the actual needs of the power grid, it is of great significance to extract the long-term load development law of various electricity consumption patterns in the industry. Nonparametric kernel density estimation (Lambert et al., 1999) is a data sample-driven method, which can fit the probability distribution of features without prior knowledge. In this paper, a nonparametric kernel density estimation method is used to fit the probability density of long-term load development characteristics, and then typical characteristics are extracted.
Taking the saturated load density as an example, suppose that there are $N_u$ users belonging to industry electricity consumption pattern $C_u$, where the saturated load density of user k is $D_k$; then the probability density function $f(D)$ of D can be computed by

$$f(D) = \frac{1}{N_u h} \sum_{k=1}^{N_u} K\left(\frac{D - D_k}{h}\right) \tag{9}$$

where $K(\cdot)$ means the kernel function and h denotes the bandwidth.
To ensure the continuity of the probability density function, the kernel function needs to be a smooth probability density function. The Gaussian kernel function is often selected, which can be expressed as follows:

$$K(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) \tag{10}$$

The point with the largest value of the fitted probability density function $f(D)$ is selected as the typical characteristic value of the electricity consumption pattern $C_u$, which can be computed by

$$D_u = \arg\max_{D} f(D) \tag{11}$$

where $D_u$ stands for the typical saturated load density value of the electricity consumption pattern $C_u$.
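A minimal sketch of this typical-value extraction is given below: a Gaussian kernel density estimate is evaluated on a grid and the grid point with the highest density is returned. The grid search, the function names, the bandwidth, and the sample densities are illustrative assumptions; the paper does not state how the maximum is located.

```python
import math


def gaussian_kernel(x):
    """Standard Gaussian kernel K(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def kde(d, samples, h):
    """Kernel density estimate f(D) = 1/(N h) * sum_k K((D - D_k) / h)."""
    return sum(gaussian_kernel((d - dk) / h) for dk in samples) / (len(samples) * h)


def typical_value(samples, h, grid_points=1001):
    """Grid search for the mode of the KDE between the sample minimum and maximum."""
    lo, hi = min(samples), max(samples)
    grid = [lo + (hi - lo) * i / (grid_points - 1) for i in range(grid_points)]
    return max(grid, key=lambda d: kde(d, samples, h))


# Hypothetical saturated load densities of the users in one consumption pattern.
densities = [38.0, 40.0, 41.0, 42.0, 43.0, 44.0, 60.0, 75.0]
d_u = typical_value(densities, h=2.0)
print(round(d_u, 2))
```

Taking the mode rather than the mean makes the typical value insensitive to a few users with unusually high density, such as the two largest samples here.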

Construction of Industry Electricity Consumption Behavior Portrait
After the load analysis from the user level to the industry level, the load curves and related characteristics of the power users and industry modes are obtained, respectively, at the user level and the industry level, which is the description of the electricity consumption behavior of the industry from different angles and at different levels.
To enable the distribution network planners to grasp the electricity consumption of various industries intuitively, massive analysis results are sorted and visualized. Then, a panoramic portrait of industry electricity consumption behavior in a region is built. The form and content of the portrait are shown in Figure 2.

PATTERN CLASSIFICATION OF NEW CUSTOMERS' ELECTRICITY CONSUMPTION
In traditional business expansion and installation, for customers who have not yet been connected to the power grid, planners calculate the customer's load from the customer's reported capacity and their own business experience and then work out the business expansion and access scheme. This process is coarse and does not consider the time sequence characteristics of user load.
After the construction of a panoramic portrait of industry electricity consumption behavior in the region, if we can reasonably infer the electricity consumption pattern according to the relevant power consumption information provided by customers, it can provide more effective technical support for business decision-making of planners.
The basic flow of the electricity consumption pattern classification in this paper is shown in Figure 3. The training model obtains the electricity pattern classification ability by using the users' characteristics as the training samples. After training the model, the corresponding characteristic vectors are extracted from the electricity consumption information of the customers and input into the classification model to obtain the electricity consumption pattern classification results of the customers.

Customer Classification Algorithm Based on BB-Stacking Model Fusion Framework
In classification problems, ensemble learning (EL) has attracted the attention of many scholars because it can combine the advantages of single models and compensate for their individual shortcomings (Wang et al., 2015). Among EL methods, bagging (Hu et al., 2011) and boosting (Lu et al., 2006) are the most classic and widely used integration methods, each with its own characteristics in model generalization and model accuracy. Specifically, bagging builds multiple datasets by bootstrap sampling, trains multiple single models in parallel, and integrates their outputs by voting. The advantages of each model are thereby integrated, which effectively prevents overfitting. The boosting framework integrates models serially: the output of each model is used as the input of the next model, and the minimum deviation is used as the loss function to continuously improve the accuracy of each model. The characteristics of the two integration methods are shown in Table 2.
To give full play to the advantages of the two integration methods, a model fusion framework of BB-stacking (bagging and boosting in stacking) based on the stacking method is proposed. It combines the two integration methods reasonably to realize the organic trade-off between high-precision and strong generalization performance, yielding the high-precision classification of customers.
Traditional stacking is composed of the base model in the lower layer and the metamodel in the upper layer. Firstly, the initial data is divided into subdatasets by k-fold, and then the subdatasets are input into the base model for training and classification. Then the output of the lower layer is reconstructed and sent to the upper model for training and classification. The BB-stacking model fusion algorithm proposed in this paper changes the lower base model layer into the integration layer on the basis of the two-tier structure of stacking, in which bagging and boosting methods are used to preliminarily integrate the original model and then transferred to the upper metamodel to coordinate the advantages of the two integration modes, obtaining a model with higher accuracy and more stable generalization ability.
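The two-layer structure described above can be approximated with scikit-learn as a rough sketch. The paper does not specify the base learners or the metamodel, so the decision-tree bagging ensemble, the gradient boosting ensemble, the logistic regression metamodel, and the synthetic four-class data used here are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (peak-valley feature vector, consumption pattern) pairs.
X, y = make_blobs(n_samples=400, centers=4, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Integration layer: one bagging ensemble and one boosting ensemble.
integration_layer = [
    ("bagging", BaggingClassifier(DecisionTreeClassifier(), n_estimators=20,
                                  random_state=0)),
    ("boosting", GradientBoostingClassifier(n_estimators=50, random_state=0)),
]

# Metamodel trained on the k-fold out-of-fold predictions of the integration layer.
bb_stacking = StackingClassifier(
    estimators=integration_layer,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
bb_stacking.fit(X_train, y_train)
accuracy = bb_stacking.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

The `cv=5` argument plays the role of the k-fold split in stacking: the metamodel only sees predictions that each ensemble made on data it was not trained on.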

Expansion of New Customers' Electricity Consumption Information
In the traditional business process, customers provide limited information in the information filling process. To make the pattern classification of electricity consumption more accurate, this paper proposes an appropriate expansion of the electricity consumption information reported by the customers. In addition to the traditional information of electricity consumption type, electricity consumption location, and electricity consumption capacity, the peak electricity consumption time schedule, peak electricity consumption level estimation, and valley electricity consumption level estimation are added as the customer's electricity consumption information. Table 3 shows the demonstration of extended power consumption information of a customer.
According to the example of extended electricity consumption information of the reported customer in Table 3, the reference load curve of the reported customer is obtained by time simulation. According to the definition of the peak-valley characteristic index in Table 1, the peak-valley characteristic vector of the customer can be calculated as the input of the electricity consumption pattern classification model.
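As a rough illustration of how a reference load curve might be built from the extended information, the sketch below generates a piecewise-constant 96-point daily curve from a peak schedule, a peak level, and a valley level, and then computes two example indicators. The input format, the function names, and the two indicators are hypothetical; Table 1 and Table 3 of the paper may define the fields and indices differently.

```python
def reference_curve(peak_periods, peak_level, valley_level, points_per_day=96):
    """Piecewise-constant daily curve: peak level inside peak periods, else valley."""
    step_h = 24.0 / points_per_day
    curve = []
    for i in range(points_per_day):
        hour = i * step_h
        in_peak = any(start <= hour < end for start, end in peak_periods)
        curve.append(peak_level if in_peak else valley_level)
    return curve


def peak_valley_features(curve):
    """Two illustrative indicators computed from a daily load curve."""
    peak, valley = max(curve), min(curve)
    mean = sum(curve) / len(curve)
    return {
        "load_rate": mean / peak,
        "peak_valley_difference_rate": (peak - valley) / peak,
    }


# Hypothetical customer: peaks at 08:00-12:00 and 14:00-18:00, levels in kW.
curve = reference_curve([(8, 12), (14, 18)], peak_level=400.0, valley_level=100.0)
features = peak_valley_features(curve)
print(features)
```

A vector of such indicators is the kind of input that the electricity consumption pattern classification model would receive for a customer with no measured load history.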

Calculation of Customer Load Level
The next step is to calculate the load level of the customers. Based on the regional industry electricity consumption behavior portrait, the method corresponding to each customer's type of electricity consumption is adopted to calculate the customer's load (Yaoyao et al., 2013). The specific calculation method is as follows:

1) Commercial, nonindustrial, and residential customers
The load density method is recommended for these customers because their load level is closely related to their building area. Assuming that the classification result of the customer is industry electricity consumption pattern $C_u$, the customer accounting load $P$ can be computed by

$$P = D_u S_{bz} \tag{12}$$

where $D_u$ is the typical saturated load density value of the electricity consumption pattern $C_u$ and $S_{bz}$ means the building area of the customer.

2) Industrial customers
The load level of industrial customers generally depends on the capacity of production equipment and has only a weak correlation with the building area. The utility rate method is therefore recommended to calculate the customer load. Similarly, assuming that the classification result of the customer is industry electricity consumption pattern $C_u$, the customer accounting load $P$ can be computed by

$$P = \epsilon_u P_{bz} \tag{13}$$

where $\epsilon_u$ stands for the typical utility rate value of the electricity consumption pattern $C_u$ and $P_{bz}$ means the reported capacity of the customer.

Forecasting of Customer Load Curve
After the customer load level estimation is completed, the load curve can be forecasted according to the electricity consumption pattern classification result of the customer in the industry. The forecast load curve of a customer belonging to electricity consumption pattern $C_u$ is $L_{yg}$, calculated as

$$L_{yg} = P \cdot L^c_u \tag{14}$$

where $P$ is the customer accounting load calculated above and $L^c_u$ is the typical load curve of industry electricity consumption pattern $C_u$.
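The load accounting and curve forecasting steps can be sketched together as follows, assuming the forecast curve is the normalized typical curve of the pattern scaled by the accounting load. The function names, the customer type labels, and all numeric values are hypothetical.

```python
def accounting_load(customer_type, typical_value, reported_quantity):
    """Accounting load of a new customer.

    For industrial customers: typical utility rate times reported capacity.
    For other customers: typical saturated load density times building area.
    """
    if customer_type not in ("industrial", "commercial", "nonindustrial", "residential"):
        raise ValueError(f"unknown customer type: {customer_type}")
    return typical_value * reported_quantity


def forecast_curve(load, typical_curve):
    """Scale the normalized typical curve of the pattern by the accounting load."""
    return [load * v for v in typical_curve]


# Hypothetical industrial customer: utility rate 0.45, reported capacity 1000.
p_industrial = accounting_load("industrial", 0.45, 1000.0)

# Hypothetical commercial customer: 80 W per square meter, 5000 square meters.
p_commercial = accounting_load("commercial", 80.0, 5000.0)

# Hypothetical normalized typical curve with four points (peak value 1.0).
typical = [0.3, 0.6, 1.0, 0.8]
curve = forecast_curve(p_industrial, typical)
print(p_industrial, p_commercial, curve)
```

Because the typical curve is normalized to a peak of 1, the peak of the forecast curve equals the accounting load, while the shape comes entirely from the classified consumption pattern.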

Evaluation Index of Customer Load Forecasting Results
The customer load forecasting technology is based on the industry electricity consumption behavior portrait of the region. Through the limited electricity consumption information, customer load after access is estimated. To provide guidance for planners to make customer access decisions, the accuracy of the expected results needs to be evaluated. Assuming that the actual load curve of the customer is L sj , maximum absolute error ratio (MAER) and Euclidean distance (ED) are used to evaluate the accuracy of the forecasting results.
1) To directly reflect the maximum error level between the forecast load curve and the actual load curve of the customer, MAER is used to measure the error, which is computed from the load points of the two curves, where $l_{sj,i}$ means the i-th load point of $L_{sj}$ and $l_{yg,i}$ denotes the i-th load point of $L_{yg}$.
2) To ensure that ED is not affected by the actual load level of the customer, the maximum value of the actual load curve $L_{sj}$ is taken as the benchmark, and the ED between the actual load curve and the forecast load curve can then be calculated by

$$ED = \sqrt{\sum_{i=1}^{n} \left( \frac{l_{sj,i} - l_{yg,i}}{\max(L_{sj})} \right)^2} \tag{16}$$

where $l_{sj,i}$ means the i-th load point of $L_{sj}$, $l_{yg,i}$ denotes the i-th load point of $L_{yg}$, and n is the number of load points per day.
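The two indices can be sketched in Python as follows. The ED follows the description above, with the peak of the actual curve as the benchmark. For MAER, the source text does not show the denominator, so this example assumes the maximum absolute error is also divided by the peak of the actual curve; that normalization is an assumption, and the sample curves are hypothetical.

```python
import math


def maer(actual, forecast):
    """Maximum absolute error ratio (ASSUMED normalization: peak of actual curve)."""
    peak = max(actual)
    return max(abs(a - f) for a, f in zip(actual, forecast)) / peak


def normalized_ed(actual, forecast):
    """Euclidean distance with every error scaled by the peak of the actual curve."""
    peak = max(actual)
    return math.sqrt(sum(((a - f) / peak) ** 2 for a, f in zip(actual, forecast)))


# Hypothetical actual and forecast load curves with four points each (kW).
actual = [100.0, 200.0, 400.0, 300.0]
forecast = [120.0, 180.0, 400.0, 330.0]
print(maer(actual, forecast), normalized_ed(actual, forecast))
```

Scaling by the peak of the actual curve lets errors of large and small customers be averaged on a common footing, which is what the comparison across test customers requires.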

CASE STUDIES
The dataset used in this paper is the load data of distribution transformers in an economically developed city from 2015 to 2019. One measurement is recorded every 15 min, giving a total of 96 load data points per day. From this large dataset, the distribution transformer data with high data quality and with load that has developed to saturation are selected. Combined with the industry information of the corresponding distribution transformer users, the data cleaning and the load characteristic analysis from the user level to the industry level are completed. Finally, the regional industry electricity consumption behavior portrait of the city is constructed. Among all industries, the metal products industry accounts for the highest proportion of users and is a typical industry in the region. Taking the metal products industry as an example, this paper analyzes and demonstrates the output results and verifies the effectiveness of the proposed method.

Display of Industry Electricity Consumption Behavior Portrait
Analysis Results of User Level Load Characteristics

1) User load curve analysis
Taking a user in the metal products industry as an example, the typical daily load curve of the user is extracted. The PAA segment number is set to 6 and the size of the symbol set is set to 4. The symbol sequence with the highest frequency is abdddb. The typical load curve of the user can be obtained through Eq. 4, as shown in Figure 4. For the same user, the relevant load information is collected and the corresponding user load characteristics are calculated, as shown in Table 4. It can be found that the user has large-scale production and a high daily peak-valley difference. From the long-term development point of view, when the user load develops to saturation, the utility rate is low.

1) Industry load curve analysis
After extracting the typical load curves of 2,478 users in the metal products industry, the load curves of this user group are clustered. The minimum cluster number $c_{\min}$ is set to 2 and the maximum cluster number $c_{\max}$ is set to 8. The algorithm flow is then executed, the average value of SC of each clustering result is calculated through Eqs 5-8, and the result is shown in Figure 5.
It can be seen that the average value of SC of the GWO-FCM clustering algorithm reaches its maximum when $c = 4$, so the optimal clustering number is determined as $c_{best} = 4$, and the corresponding division of typical industry electricity consumption patterns is shown in Figure 6. It can be found that the load curves of the electricity consumption patterns are clearly different from one another, and the division of electricity consumption patterns is satisfactory.

1) Industry load characteristic extraction
After dividing the typical electricity consumption pattern of the industry, the characteristic extraction of the utility rate of 342 users in electricity consumption pattern one is taken as an example. The bandwidth h is set as 0.01. The probability density function of the user utility rate of the group is obtained by fitting Eq. 9, and the results are shown in Figure 7.

Customers' Electricity Consumption Pattern Classification
To verify the effectiveness of the load forecasting technology for new customers, 141 users in the metal products industry with complete electricity consumption data are selected as hypothetical customers. The corresponding characteristic vectors are then extracted and input into the trained BB-stacking model for electricity consumption pattern classification. In this paper, the traditional decision tree model (Safavian and Landgrebe, 1991), GRNN model (Specht, 1991), and PNN model (Oh and Pedrycz, 2002) are compared with the BB-stacking model. The classification accuracy under different training sample sizes is shown in Figure 8.
It can be found that, in the training process, the BB-stacking classification model achieves a better classification effect than the other classification models with a small training sample size. When the training sample size increases, the BB-stacking classification model maintains a more stable generalization ability. The BB-stacking classification model thus prevents the decline in classification ability caused by underfitting and delivers better classification performance and stability.

Analysis of Customer Load Forecast Results
After the electricity consumption pattern classification of the customers is completed, the forecast load curves of the 141 customers are obtained through Eqs 12-14 combined with the typical electricity consumption characteristics of the industry. To compare the effectiveness of various methods, the similarity between the forecast load curves and the actual load curves is evaluated by ED and MAER, and the comparative analysis is carried out from the following two aspects.

1) Comparative analysis of different clustering numbers
Considering that different numbers of clusters selected by the clustering algorithm lead to different electricity consumption patterns, which may affect the accuracy of the forecasting results of customers, this paper uses the GWO-FCM algorithm to test cluster numbers c from 2 to 8. The mean values of ED and MAER under each clustering number are compared in Figure 9.

2) Comparative analysis of different methods
To further study the effectiveness of the proposed customer load forecasting method, this paper compares the traditional load estimation method with the proposed method. In addition, the sectional load curve obtained by the segment simulation method in the process of electricity consumption pattern classification is compared with the proposed method.
The comparison between the load curve forecasting result of a certain customer and the results of the other methods is shown in Figure 10. It can be found that there is little difference between the maximum load level of the traditionally estimated load and that of the actual load curve, but the estimate has no time sequence characteristics. To some extent, the segment simulation method reflects the time sequence characteristics of customers, but its error is large when the load level changes. The overall trend of the forecast load curve of the proposed method is very close to the actual load curve, which can meet the optimal access requirements in the planning field that consider the time sequence characteristics of customers' load.
Comparing the error index of actual load curve and load curve forecasting results under different methods, the mean value comparison of test samples is shown in Table 5. It can be found that the proposed method performs best in both of the two evaluation indexes. Significantly, the accuracy is greatly improved compared with the traditional load estimation method.
Counting the two evaluation indexes of the load estimation results of each hypothetical customer under different methods, two box diagrams are drawn to observe the distribution of the sample indexes of each method, which are shown in Figures 11 and 12. According to the box diagrams, the sample median level of the two error evaluation indexes of the traditional load estimation method is the highest, and the error fluctuation range is the widest; the results show that the segmented simulation method achieves the suboptimal effect in the two evaluation indexes, and the error fluctuation is also controlled in a small range. It is worth noting that the median of the evaluation index samples of the proposed method is the lowest, the falling range of the error index samples is the narrowest, and the fluctuation range of the algorithm performance is smaller, showing the best stability.
Frontiers in Energy Research | www.frontiersin.org | October 2021 | Volume 9 | Article 742993

CONCLUSION
Aiming at the problem of distribution network planning refinement, in this paper, the methods for regional industry electricity consumption behavior portrait construction are proposed. Further, the load forecasting technology of new customers is studied. The main conclusions are as follows: 1) The methods for regional industry electricity consumption behavior portrait construction can fully mine the multidimensional characteristics of users in various industries, providing effective data support for distribution network planning. The division of electricity consumption pattern can comprehensively and accurately reflect the power consumption characteristics and general laws of various industries.
2) The proposed load forecasting method has a good effect on the load forecasting of the customers who are not connected to the power grid, providing effective guidance for the optimization of the customers' access decision and load scheduling. Moreover, the proposed method achieves significant performance in terms of the forecasting effect, algorithm stability, and practical application value.

DATA AVAILABILITY STATEMENT
The datasets presented in this article are not readily available because regional power consumption data are subject to confidentiality requirements. Requests to access the datasets should be directed to 1085283057@qq.com.

AUTHOR CONTRIBUTIONS
Conceptualization, methodology, and writing-original draft, WG; data curation, DZ; investigation, HY; writing-review and editing, BP; formal analysis and visualization, YW; resources and funding acquisition, TY; supervision, KW.