Practical Method for Data-Driven User Phase Identification in Low-Voltage Distribution Networks

For low-voltage distribution networks (LVDNs), accurate models of network and phase connectivity are crucial to analysis, planning, and operation. However, phase connectivity data in LVDNs are often incorrect or missing. Wrong or incomplete phase information can lead to unbalanced operation of three-phase distribution systems and increased power loss. Building on the advanced metering infrastructure (AMI) deployed in smart grids, this study proposes a novel data-driven phase identification algorithm. First, features are extracted from voltage–time matrices using a non-linear dimension reduction algorithm. Second, the density-based spatial clustering of applications with noise (DBSCAN) algorithm is used to divide customers into clusters of arbitrary shape. Finally, the algorithm is tested on the IEEE European Low Voltage Test Feeder of the IEEE PES AMPS DSAS Test Feeder working group. The results show an accuracy of over 90% for the method.


INTRODUCTION
Since the introduction of the concept of the "Digital Power Grid" (Islam, 2016) and the development of measurement technology, the handling of electrical data in smart grids has become a focus of research. At the same time, distributed energy resources (DERs) are being deployed in electric power distribution systems at an unprecedented pace (Yang et al., 2016; Yang et al., 2017; Yang et al., 2018; Yang et al., 2019a). To fully exploit the benefits of DERs, the distribution network must be actively managed (Yang et al., 2019b; Xi et al., 2020; Yang et al., 2020; Li et al., 2021). The low-voltage distribution network is the last link connecting users to the power system. Therefore, its capacity for information interaction directly affects the user experience.
The introduction of the smart grid and advanced metering infrastructure (AMI) concepts has brought new opportunities for developing distribution networks. To operate the distribution system efficiently and reliably, distribution system operators typically need to perform a series of tasks, including three-phase optimal power flow, distribution system restoration and reconfiguration, and evaluation of the three-phase unbalance degree. Although network connectivity models are often accurate, phasing errors are common. Therefore, an accurate phase identification method is needed.
Electric utility companies typically do not have accurate information on phase connectivity. Moreover, the phase connectivity of a distribution network may change when new distribution lines are constructed and added to the network. Correct phase connectivity data are essential to the efficient and reliable operation of a distribution system, especially as more advanced applications are deployed. A model has been established to identify transformers and user phases based on voltage correlation using linear regression (Short, 2013). The association between a circuit and its transformers can be determined by analyzing the voltage correlation between buses and user meters from the perspective of power flow (Luan et al., 2013; Tang and Milanovic, 2018). Topology can also be identified by analyzing the load correlation between lines at upper and lower levels (Pappu et al., 2018; Lisowski et al., 2019). Most of these studies focus on medium-voltage distribution networks, while topology identification in LVDNs has received less attention. There are two main approaches to phase identification in LVDNs. The first is based on the law of conservation of energy. With all possible user phases enumerated, a mixed-integer programming model is used to find the optimal solution, taking into consideration the degree of three-phase unbalance and line loss (Tian et al., 2016; Tang and Milanovic, 2018; Zhou et al., 2020). This approach is computationally expensive and does not consider the electrical features shared by users on the same phase. When user power values are missing, its accuracy is not guaranteed. The second approach is based on clustering algorithms from machine learning. User phases in an LVDN are identified by clustering users across the three phases (Wen et al., 2015; Wang et al., 2016; Liu et al., 2020). However, the difference in load fluctuation between phases is not distinct enough after three-phase treatment in LVDNs.
To address the shortcomings of the existing solutions, this study proposes a data-driven phase identification algorithm based on the advanced metering infrastructure (AMI) that allows for in-depth exploitation of data features. First, the LargeVis (Large-scale Visualizing Data) dimensionality reduction algorithm is used to extract features from the high-dimensional time-voltage matrices of LVDN users, resulting in low-dimensional data that retain only the main features. Then, the DBSCAN (density-based spatial clustering of applications with noise) algorithm is used to cluster users according to these features and identify the specific user phases. The method may improve the efficiency and accuracy of topology identification.

TOPOLOGY OF LOW-VOLTAGE DISTRIBUTION NETWORKS
Electric energy is transmitted to the distribution network through high-voltage transmission lines. After the distribution transformer steps the voltage down to 400 V, the energy is delivered to customers through the three-phase feeder.
Three-phase gate ammeters are installed at the outlet of the distribution transformer to record voltage, current, active power, reactive power, load, and other electrical quantities for each of the three phases of the feeder. Figure 1 shows the relationship between the gate ammeter at the bus and the meter of each user in a single-phase line.
As low-voltage distribution feeders are shorter than high-voltage lines, no more than 500 m in most cases, the influence of line reactance is not considered in this study. Reactive power effects on the lines are not considered either, as they are negligible in well-managed networks. In the single-phase line of Figure 1, $U_0$ represents the voltage at the bus node; $U_1$, $U_{m-1}$, $U_m$, and $U_n$ are the voltage values at the corresponding nodes; $R_1$, $R_2$, $R_m$, and $R_n$ are the impedance values of the corresponding line segments; and $\Delta U_m$ is the voltage difference between adjacent nodes $m$ and $m-1$.
The voltage of the users closest to the bus depends only on the bus voltage and their own load. Adjacent nodes on the same feeder that are electrically close have similar voltage values, and their correlation coefficient is higher than that of nodes on different feeders. By analyzing how user voltages change over time, the phase relationships of users can be identified.
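The correlation argument above can be illustrated with a short sketch. The data here are synthetic and purely hypothetical: two phases with independent bus-voltage fluctuations, three users per phase, and an arbitrary load-dependent voltage drop. The function name `voltage_correlation` is also an assumption for illustration, not part of the study.

```python
import numpy as np

def voltage_correlation(U):
    """Pearson correlation between the voltage time series of all users.

    U has shape (N, M): one row per user, one column per time step.
    """
    return np.corrcoef(U)

rng = np.random.default_rng(0)
M = 1440  # one day at 1-min resolution

# Hypothetical bus-voltage profiles of two phases (independent fluctuations)
bus_a = 230 + rng.normal(0, 1.0, M)
bus_b = 230 + rng.normal(0, 1.0, M)

# Each user sees the bus voltage of its phase minus a small load-dependent drop
users_a = bus_a - np.abs(rng.normal(1.0, 0.3, (3, M)))
users_b = bus_b - np.abs(rng.normal(1.0, 0.3, (3, M)))
U = np.vstack([users_a, users_b])

C = voltage_correlation(U)
same_phase = C[0, 1]   # two users fed from the same phase
cross_phase = C[0, 3]  # two users fed from different phases
print(f"same phase: {same_phase:.2f}, different phases: {cross_phase:.2f}")
```

In this toy setting, users on the same phase share the bus fluctuation and are therefore strongly correlated, while users on different phases are nearly uncorrelated.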

Time-Series Voltage Data Pre-Processing
Each user's smart meter collects voltage data at a fixed interval and uploads them to the terminal. The time-series matrix of user voltage amplitudes in the station area is obtained from the historical data in the terminal. If some data are missing, interpolation is used to fill in the missing values. The resulting matrix $U \in \mathbb{R}^{N \times M}$ is as follows:
$$U = \begin{bmatrix} u_{1,t_1} & u_{1,t_2} & \cdots & u_{1,t_M} \\ u_{2,t_1} & u_{2,t_2} & \cdots & u_{2,t_M} \\ \vdots & \vdots & \ddots & \vdots \\ u_{N,t_1} & u_{N,t_2} & \cdots & u_{N,t_M} \end{bmatrix}$$
where $u_{i,t_j}$ represents the measured voltage of user $i$ at time $t_j$, $N$ is the total number of users, and $M$ is the number of voltage samples within the analysis period. The voltage of user nodes near the bus is affected only by the bus voltage and their own load, so their voltage curve will be close to that of the bus if their load is low. This can strongly disturb the subsequent clustering. To avoid this problem, these nodes are separated from the rest of the matrix and put into a separate cluster based on their voltage amplitude and their voltage correlation with the bus. The rest of the data are standardized to eliminate the influence of differences in voltage fluctuation between phases. The dimensionality of the time-voltage matrix does not change after standardization. The formula used for standardization is as follows:
$$U'_{t_j} = \frac{U_{t_j} - \mu(U_{t_j})}{\sigma(U_{t_j})}, \qquad U' = \left[U'_{t_1}, U'_{t_2}, \ldots, U'_{t_M}\right]$$
where $U'_{t_j}$ represents the standardized voltage at $t_j$; $U_{t_j}$ represents the initial values of the voltage at $t_j$; $\mu(U_{t_j})$ represents the mean voltage over all measurement points at $t_j$; $\sigma(U_{t_j})$ represents the standard deviation of the voltage over all measurement points at $t_j$; and $U'$ represents the standardized user voltage dataset.
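The two pre-processing steps described above, interpolation of missing values and per-time-step standardization, might be sketched as follows. This is a minimal sketch under stated assumptions: the text does not specify the interpolation method, so linear interpolation along time is assumed, and the function names and the small example matrix are hypothetical.

```python
import numpy as np

def interpolate_missing(U):
    """Fill NaN gaps in each user's series by linear interpolation over time.

    Linear interpolation is an assumption; gaps at the start or end of a
    series are filled with the nearest available reading.
    """
    U = np.asarray(U, dtype=float).copy()
    t = np.arange(U.shape[1])
    for row in U:  # rows are views, so the assignment below edits U in place
        missing = np.isnan(row)
        if missing.any() and not missing.all():
            row[missing] = np.interp(t[missing], t[~missing], row[~missing])
    return U

def standardize(U):
    """Z-score each time column across all users (mean 0, std 1 per column)."""
    mu = U.mean(axis=0, keepdims=True)
    sigma = U.std(axis=0, keepdims=True)
    sigma[sigma == 0] = 1.0  # guard against columns where all users agree
    return (U - mu) / sigma

# Hypothetical 3-user, 3-sample voltage matrix with two missing readings
U_raw = np.array([[230.0, np.nan, 232.0],
                  [228.0, 229.0, 230.0],
                  [231.0, 231.0, np.nan]])
U_filled = interpolate_missing(U_raw)
U_std = standardize(U_filled)
```

The standardized matrix has the same shape as the input, consistent with the statement that standardization does not change the dimensionality.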

Feature Dimension Reduction Based on LargeVis
As the data collection period grows, the dimensionality of the time-voltage matrix also increases. High-dimensional datasets contain redundant information and noise and require more complex and time-consuming computation. These problems can be mitigated by feature dimension reduction.
PCA, a linear dimension reduction method, applies a projection transformation and finds the low-dimensional space that preserves the maximum variance of the samples. It is fast, but the information loss is serious when the target dimension is low. In this study, the LargeVis method is therefore used to reduce the dimensionality of the data, keeping only the main features.
LargeVis, proposed by Tang et al. (2016), is a non-linear dimensionality reduction algorithm that can reduce the high-dimensional user voltage matrix to a two- or three-dimensional space for visualization while retaining the distribution characteristics of the original dataset. The dimension reduction process is as follows:
1) In high-dimensional space, LargeVis retains only the weights of the K-nearest-neighbor (KNN) edges in the mapping process. These edges are called positive edges, while edges between nodes that are not directly adjacent are called negative edges. The Euclidean distance between users is transformed into a probability similarity $w_{ij}$ between user $i$ and user $j$, which is obtained by adding the two conditional probabilities so that outlier nodes are not isolated. $W_i$ is the similarity matrix between user $i$ and the other users in the same station area, $W$ is the Gaussian probability distribution matrix of the normalized voltage dataset, and $\sigma_i$ is the standard deviation of the Gaussian model.
2) In low-dimensional space, the low-dimensional coordinates are determined by the probability of observation, and the probability of an edge connection between two points is set as follows:
$$P(e_{ij} = 1) = f\left(\lVert y_i - y_j \rVert\right), \qquad P(e_{ij} = w_{ij}) = P(e_{ij} = 1)^{w_{ij}}$$
where $e_{ij}$ represents the weighted edge between nodes $i$ and $j$, and $f(x)$ is a probability function of the distance between vertices $y_i$ and $y_j$. The closer two points are in the high-dimensional space, the closer they are in the low-dimensional space.
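3) In the dimension reduction process, the final objective function is as follows:
$$O = \sum_{(i,j) \in E} w_{ij} \log P(e_{ij} = 1) + \sum_{(i,j) \in \bar{E}} c \log\left(1 - P(e_{ij} = 1)\right)$$
where $E$ is the set of positive edges, $\bar{E}$ is the complement of $E$, and $c$ is the uniform weight assigned to the negative edges.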
Using the edge sampling technique of LINE (Tang et al., 2015), a weighted edge is regarded as $w_{ij}$ unit edges. The positive edges are sampled directly and used for training to obtain the low-dimensional feature set $Y$. $Y$ is consistent with the feature distribution of the standardized voltage dataset $U'_{L \times M}$.
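The three steps above can be made concrete with a small NumPy sketch of the quantities involved. It is not the LargeVis implementation and makes several simplifying assumptions, each marked in the code: a single fixed `sigma` replaces the per-user $\sigma_i$ (which LargeVis tunes per point), the KNN graph is built by brute force, $f(x) = 1/(1 + x^2)$ is assumed as the probability function, the default value of `c` is arbitrary, and the objective is evaluated exhaustively rather than optimized by edge and negative sampling. Function names are hypothetical.

```python
import numpy as np

def knn_similarity(X, k=5, sigma=1.0):
    """Symmetric similarities w_ij on the K-nearest-neighbour graph of X.

    Simplification: one fixed sigma for all points instead of a per-point
    sigma_i.
    """
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)              # a point is not its own neighbour
    nbrs = np.argsort(d2, axis=1)[:, :k]      # indices of the k nearest points
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    P = np.zeros((n, n))
    P[rows, cols] = np.exp(-d2[rows, cols] / (2 * sigma ** 2))
    P /= P.sum(axis=1, keepdims=True)         # conditional probabilities p_{j|i}
    return (P + P.T) / (2 * n)                # add both directions and normalise

def largevis_objective(Y, W, c=7.0):
    """Objective of a layout Y: positive edges weighted by w_ij, every other
    pair treated as a negative edge with uniform weight c."""
    n = Y.shape[0]
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    p = 1.0 / (1.0 + d2)                      # assumed f(x) = 1 / (1 + x^2)
    iu = np.triu_indices(n, k=1)              # count each unordered pair once
    w, p = W[iu], np.clip(p[iu], 1e-12, 1 - 1e-12)
    pos = w > 0
    return (w[pos] * np.log(p[pos])).sum() + c * np.log(1 - p[~pos]).sum()

# Hypothetical data: two well-separated groups of 10 users in 10 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 10)), rng.normal(5, 0.1, (10, 10))])
W = knn_similarity(X, k=3)

Y_good = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
Y_collapsed = np.zeros((20, 2))               # all users mapped to one point
print(largevis_objective(Y_good, W), largevis_objective(Y_collapsed, W))
```

A layout that keeps the two groups apart scores higher than one that collapses all users onto a single point, because the negative-edge term penalizes unrelated users that are placed close together.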

Phase Identification Based on DBSCAN Algorithm
Clustering of unlabeled voltage-time datasets can be performed with unsupervised learning algorithms. As a density-based clustering algorithm, DBSCAN (density-based spatial clustering of applications with noise) can divide regions of sufficient density into clusters and identify clusters of arbitrary shape in spatial databases with noise. After dimension reduction, the Euclidean distance between two points is used as the distance between them. Users on the same phase lie relatively close to each other and form a cluster. Therefore, the DBSCAN method is suitable.
Frontiers in Energy Research | www.frontiersin.org | November 2021 | Volume 9 | Article 752571
The core points of DBSCAN are determined by two parameters, the neighborhood radius (Eps) and the minimum number of sample points (MinPts). To limit the space of density clustering and achieve better visual performance, the feature set $Y$ obtained after dimension reduction is min-max normalized column by column. The formula is as follows:
$$y'_{ij} = \frac{y_{ij} - \min(y_{pj})}{\max(y_{pj}) - \min(y_{pj})}$$
where $y_{ij}$ is an element of matrix $Y$, $\max(y_{pj})$ is the maximum value of the $j$-th column vector of $Y$, $\min(y_{pj})$ is the minimum value of the $j$-th column vector of $Y$, and $y'_{ij}$ belongs to the normalized dataset $Y'$. Based on the dataset $Y'$, the distance between all pairs of nodes is calculated to form a matrix $D \in \mathbb{R}^{L \times L}$:
$$D = \begin{bmatrix} d(Y_1, Y_1) & \cdots & d(Y_1, Y_L) \\ \vdots & \ddots & \vdots \\ d(Y_L, Y_1) & \cdots & d(Y_L, Y_L) \end{bmatrix}, \qquad d(Y_i, Y_j) = \lVert Y_i - Y_j \rVert_2$$
Here, $d(Y_i, Y_j)$ indicates the Euclidean distance between $Y_i$ and $Y_j$. The initial values of the DBSCAN parameters Eps and MinPts are then calculated from $D$, where $Z$ is the number of predicted clusters and $\mathrm{count}(D_i < \mathrm{Eps})$ is the number of entries of the distance vector $D_i$ that are less than Eps, that is, the number of nodes within distance Eps of node $i$. Starting from these values, Eps and MinPts are adjusted with a fixed step size (0.01 for Eps and 1 for MinPts), and the most suitable parameters are determined according to the silhouette coefficient. The formula for calculating the silhouette coefficient is as follows:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
Here, $a(i)$ represents the average distance between node $i$ and the other nodes in the same cluster, and $b(i)$ represents the average distance between node $i$ and the nodes of the nearest other cluster. The closer $s(i)$ is to 1, the more reasonable the clustering result is. Conversely, the closer $s(i)$ is to -1, the more unreasonable the clustering result is.
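The parameter search described above might be sketched with scikit-learn as follows. This is an illustration under assumptions rather than the procedure of the study: the formulas for the initial Eps and MinPts are not reproduced, so the search simply scans a hypothetical grid; parameter combinations that do not yield the expected number of clusters are skipped; noise points are excluded from the silhouette score; and the three-blob dataset is synthetic. Note that scikit-learn's `min_samples` counts the point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def min_max_normalize(Y):
    """Scale each column of Y to the range [0, 1]."""
    y_min, y_max = Y.min(axis=0), Y.max(axis=0)
    return (Y - y_min) / (y_max - y_min)

def tune_dbscan(Y_norm, eps_values, min_pts_values, n_clusters=3):
    """Grid search over (Eps, MinPts); keep the labelling with the best mean
    silhouette among those that yield the expected number of clusters."""
    best = None
    for eps in eps_values:
        for min_pts in min_pts_values:
            labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(Y_norm)
            clustered = labels != -1          # -1 marks noise points
            if len(set(labels[clustered])) != n_clusters:
                continue
            score = silhouette_score(Y_norm[clustered], labels[clustered])
            if best is None or score > best[0]:
                best = (score, eps, min_pts, labels)
    return best

# Hypothetical 2-D features: three groups of 20 users each
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
Y = np.vstack([c + rng.normal(0, 0.2, (20, 2)) for c in centres])
Y_norm = min_max_normalize(Y)

score, eps, min_pts, labels = tune_dbscan(
    Y_norm, np.arange(0.05, 0.20, 0.01), range(2, 6))
print(f"Eps={eps:.2f}, MinPts={min_pts}, silhouette={score:.2f}")
```

Restricting the search to labellings with the expected number of clusters is a design choice made here so that the silhouette comparison is between like results; other selection rules are possible.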
After clustering, each cluster group is obtained, and the phase recognition results of users are tested according to the phase tags of clustering results. The specific flow chart is shown in Figure 2.

ANALYSIS OF EXAMPLES
The dataset used in this paper is the IEEE European Low Voltage Test Feeder of the IEEE PES AMPS DSAS Test Feeder working group (IEEE and PES, 2019). The low-voltage test feeder is a radial distribution feeder with a base frequency of 50 Hz. The feeder is connected to the medium-voltage power system through the substation transformer, which steps the voltage down from 11 kV to 416 V. There are 55 users in total, all of them single-phase: 21 households are connected to phase A, 19 to phase B, and 15 to phase C.
According to the configuration file, the power factor of all loads was set to 0.95 over the whole simulation range. Based on the load curves of the 55 users, the power flow was calculated with the OpenDSS software, and 24 h voltage curves with a resolution of 1 min were obtained.

Parameter Settings of LargeVis and DBSCAN
In a real environment, smart meter measurements are more complicated and measurement error is inevitable. To evaluate the effectiveness of the algorithm under realistic conditions, it must be tested on a noisy dataset. Smart meters have a non-negligible uncertainty, and their accuracy levels vary between countries. Meter accuracy can be roughly divided into the classes 0.2, 0.5, 1, and 2, corresponding to uncertainties of 0.2%, 0.5%, 1%, and 2%, respectively.
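A noisy dataset of this kind could be produced from simulated voltages along the following lines. The text does not state how the error is distributed, so a relative error drawn uniformly within the accuracy class is assumed here; the function name and the flat 240 V example matrix are hypothetical.

```python
import numpy as np

def add_meter_error(U, accuracy_class, rng=None):
    """Perturb voltages with a relative error bounded by the accuracy class.

    accuracy_class is given in percent (0.2, 0.5, 1 or 2). The error is drawn
    uniformly from [-class %, +class %]; this distribution is an assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    bound = accuracy_class / 100.0
    rel_err = rng.uniform(-bound, bound, size=np.shape(U))
    return np.asarray(U, dtype=float) * (1.0 + rel_err)

# Hypothetical example: 55 users, one day at 1-min resolution, class 1 meters
U_clean = np.full((55, 1440), 240.0)
U_noisy = add_meter_error(U_clean, accuracy_class=1, rng=np.random.default_rng(0))
```

With a class 1 meter and a 240 V reading, every perturbed value stays within 2.4 V of the clean value.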
Nodes near the bus, which would otherwise be clustered incorrectly, accounted for 5 to 8% of the total number of nodes. These nodes are classified separately according to their voltage amplitude and their correlation with the bus voltage. The voltage time-series matrix of the remaining meters is standardized by the Z-score to obtain the matrix $U'$. The LargeVis algorithm is then used to reduce the dimension of $U'$, yielding the low-dimensional voltage feature matrix $Y$, which is min-max normalized to $Y'$. The distance matrix between user nodes is calculated and, with $Z = 3$, the initial DBSCAN parameter values are obtained (Eps = 0.126, MinPts = 4). Eps is varied with a step size of 0.005 and MinPts with a step size of 1, and the final clustering parameters are determined by the silhouette coefficient method.

Analysis of Numerical Example Results
Eps = 0.131 and MinPts = 3 were selected as the final clustering parameters. Three clusters are formed after clustering. By comparing the correlation coefficient between the user voltage at each cluster center and the bus voltage, the phase of the users in the station area can be determined. To further assess the accuracy of the proposed method in phase recognition, it is compared with k-means, PCA combined with k-means (Wen et al., 2015), and the spectral clustering algorithm. The number of clusters for each method is preset to 3. The recognition accuracy of each method is shown in Table 1.
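The step from clusters to phase labels can be sketched as follows. Several details are assumptions made for illustration: the mean voltage profile of a cluster stands in for the "cluster center", each cluster is assigned to the phase whose bus voltage it correlates with best, and noise points (label -1) remain unassigned. The function names and the synthetic data are hypothetical; only the user counts per phase (21, 19, 15) are taken from the test feeder description.

```python
import numpy as np

def assign_phases(U, labels, bus_voltages):
    """Map each cluster to the bus phase whose voltage it correlates with best.

    U: (N, M) user voltages; labels: (N,) cluster ids, -1 for noise;
    bus_voltages: dict such as {"A": series, "B": series, "C": series}.
    The cluster mean profile stands in for the cluster centre (assumption).
    """
    phases = np.full(len(labels), None, dtype=object)
    for cluster in set(labels) - {-1}:
        members = labels == cluster
        centre = U[members].mean(axis=0)
        corr = {ph: np.corrcoef(centre, v)[0, 1] for ph, v in bus_voltages.items()}
        phases[members] = max(corr, key=corr.get)
    return phases

def accuracy(predicted, actual):
    """Fraction of users whose predicted phase matches the recorded phase."""
    predicted = np.asarray(predicted, dtype=object)
    actual = np.asarray(actual, dtype=object)
    return float(np.mean(predicted == actual))

# Synthetic check with 21, 19 and 15 users on phases A, B and C
rng = np.random.default_rng(1)
M = 96
bus = {ph: 240 + rng.normal(0, 1.0, M) for ph in "ABC"}
actual = np.array(["A"] * 21 + ["B"] * 19 + ["C"] * 15, dtype=object)
U = np.vstack([bus[ph] - rng.uniform(0.5, 1.5) + rng.normal(0, 0.1, M)
               for ph in actual])
labels = np.array([{"A": 2, "B": 0, "C": 1}[ph] for ph in actual])  # arbitrary ids

predicted = assign_phases(U, labels, bus)
print(accuracy(predicted, actual))
```

Because cluster ids are arbitrary, the mapping to phases has to come from an external reference such as the bus voltage; the accuracy is then computed against the recorded phase of each user.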
The method proposed in this study showed the highest accuracy among the compared methods. One reason might be that the users near the bus were put into a separate cluster to avoid interference. Moreover, the LargeVis algorithm retains data features after dimensionality reduction, and the DBSCAN algorithm can find clusters of arbitrary shape, making the combination well suited to this kind of dataset. For k-means, ammeter errors may introduce redundant information into the time-voltage matrix, resulting in unstable clustering results. For PCA, the linear dimensionality reduction used to remove redundant information may lose data details and thus reduce accuracy. As for spectral clustering, the clustering effect depends directly on the similarity matrix generated in advance, which requires high precision in the original data.
To verify the usability of the proposed algorithm in engineering practice, disturbance errors were introduced and the accuracy was analyzed under different sampling intervals. The sampling interval of the ammeters was set to 15 min, 30 min, 1 h, or 2 h, and the disturbance error was set to no error, 1%, or 2%. The results are shown in Table 2. Even when the metering error is small, the identification accuracy of the algorithm decreases. When the metering error increases, the accuracy of phase identification can be guaranteed only at a high sampling frequency (15 min interval). Reducing the sampling frequency lowers the recognition accuracy. When the sampling interval is increased to 2 h, the identification cannot be completed, which indicates that a sufficient sampling frequency must be guaranteed for phase identification based on users' daily voltage variation characteristics.
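The coarser sampling intervals can be emulated from the 1 min simulation output as sketched here. How the study derived the coarser series is not stated, so block averaging is an assumption; taking every k-th reading would be an equally plausible alternative. The function name and example matrix are hypothetical.

```python
import numpy as np

def downsample(U, factor):
    """Average consecutive blocks of `factor` samples along the time axis.

    Block averaging is an assumption; a meter could instead report
    instantaneous readings, i.e. U[:, ::factor].
    """
    n_users, n_steps = U.shape
    usable = (n_steps // factor) * factor  # drop an incomplete last block
    return U[:, :usable].reshape(n_users, -1, factor).mean(axis=2)

# One day at 1-min resolution for two hypothetical users
U_1min = np.arange(2 * 1440, dtype=float).reshape(2, 1440)
resampled = {minutes: downsample(U_1min, minutes) for minutes in (15, 30, 60, 120)}
for minutes, U_k in resampled.items():
    print(minutes, U_k.shape)
```

At a 2 h interval only 12 samples per user remain for a 24 h period, which gives an intuition for why identification becomes difficult at that resolution.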

CONCLUSION AND DISCUSSION
This paper presents a data-driven method for user phase identification in LVDNs. The LargeVis dimensionality reduction method is used to extract features from the standardized voltage time-series matrix. Then, the low-dimensional dataset is clustered with the DBSCAN method to identify user phases. Simulations show that the proposed method is more reliable than other unsupervised learning algorithms for single-phase user identification in LVDNs. The method only needs the users' smart meter voltage data for analysis, without additional hardware costs or personnel to check users one by one, so it can reduce the cost of user phase verification in the low-voltage distribution network.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, and further inquiries can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
HY was involved in conceptualization, developed the methodology, and wrote the article. YW curated the data. WG was involved in formal analysis. QL conducted the investigation. TY obtained the resources and acquired the funding.