Invalid Data Rejection of Audible Noise on AC Transmission Lines Based on Moving Window Kernel Principal Component Analysis

The statistical characteristics of the nighttime noise data of 1000 kV AC transmission lines were investigated. The noise data of the Huainan-Shanghai 1000 kV AC transmission line collected at night (0:00 to 6:00) from September 25, 2015, to February 16, 2016, were statistically analyzed using the nonparametric Kolmogorov-Smirnov (K-S) test, and outliers were detected using moving window kernel principal component analysis (MWKPCA). The results show that after the invalid data are removed by MWKPCA, the 5%, 50%, and 95% values of the data are basically unchanged. To a certain extent, the proposed method can therefore remove the invalid audible noise (AN) data of 1000 kV AC transmission lines without affecting the subsequent study of AN. Several machine learning algorithms were used to predict the A-weighted sound level (Awsl) before and after the invalid data rejection, and the results show that the rejection improves the prediction accuracy of the transmission line AN Awsl.


INTRODUCTION
Audible noise (AN) of transmission lines is one of the design criteria of transmission lines and affects the conductor selection, corridor width, insulator string length, and conductor arrangement. However, a large amount of ambient noise is present when transmission line AN is collected, and the data collection is easily disturbed by it. If the transmission line AN is lower than the ambient noise, the recorded values are dominated by the ambient noise and become invalid data in the data set, and these invalid data affect the transmission line evaluation.
Previous research on transmission line AN includes empirical formulas for transmission line AN in various countries (Juette and Zaffanella, 1970; Trinh and Maruvada, 1977; Perry et al., 1979; Chartier and Stearns, 2007; Tang et al., 2010; Chen et al., 2012), analysis of the time-domain and frequency-domain characteristics of transmission line AN (Cheng et al., 2019), and the influence of transmission line design parameters, meteorological factors, and environmental factors on transmission line AN (Li et al., 2016; Guo et al., 2019; Zao et al., 2021; Xie et al., 2016; Du et al., 2016; Xie et al., 2017; Yang et al., 2016; Li et al., 2018; Pengfei et al., 2019). To reduce the influence of ambient noise on data acquisition, Yuanqing Liu et al. studied the frequency spectra of corona AN and ambient noise for positive and negative conductors of DC transmission lines at different voltages through corona cage tests, and studied the conversion relationship between the A-weighted sound level (Awsl) and the 8 kHz component of DC transmission line AN, so as to avoid the interference of ambient noise (Liu et al., 2014a). Yingyi Liu et al. studied the relationship between corona current and AN on transmission lines and summarized an empirical formula for calculating the Awsl from the corona current, so as to indirectly obtain effective AN data free of ambient noise interference (Liu et al., 2019). Li Xebao et al. removed the ambient noise by correlation analysis and impulse characteristics in order to accurately study the time-domain characteristics of the AN generated by a single corona discharge (Li et al., 2015). Liu Yuanqing et al. used a finite impulse response filter to reject the invalid data of AN on DC transmission lines. The above-mentioned research on the effective data of transmission line AN falls into two types: indirect acquisition of effective data and rejection of invalid data.
The research on the rejection of invalid data uses methods for single-dimensional data, which directly process the original data of the sound signal or the Awsl and repair the sound pressure data disturbed by ambient noise, ignoring the connection between the individual octave band components of the sound signal (Liu et al., 2014b). Therefore, this paper introduces a data-driven approach based on the determination of multidimensional data, and the data disturbed by environmental noise are directly eliminated.
Data-driven methods are widely applied in power system stability, energy optimization and dispatch, voltage and current monitoring, transportation, etc. (Zhang and Luo, 2018; Zhu et al., 2019; Li et al., 2020; Yang et al., 2020; Shen and Raksincharoensak, 2021). In this paper, each data record consists of the 10 octave band components of AN, from the 16 Hz octave band to the 8 kHz octave band, together with the Awsl. Moving window kernel principal component analysis (MWKPCA) is used to evaluate these records: the SPE statistic is established in the residual subspace and the $T^2$ statistic in the principal component subspace, and the records that exceed the threshold of the SPE statistic or the $T^2$ statistic are excluded, so that the AN invalid data in the dataset are removed.

AN DISTRIBUTION CHARACTERISTICS
Noise data for a total of 69 days were collected on the Huainan-Shanghai AC transmission line at night (0:00 to 6:00) from September 25, 2015, to February 16, 2016. The conductor is 8×LGJ-630/45, with a subconductor diameter of 33.6 mm, a subconductor spacing of 400 mm, and an operating voltage of 1050 kV. The surface gradients of phase A, phase B, and phase C are 14.44, 14.82, and 14.73 kV/cm, respectively. The distribution characteristics of each octave band of AN and of the Awsl were analyzed in turn using the Kolmogorov-Smirnov (K-S) test. The null hypothesis $H_0$ is that the population from which the sample comes conforms to the normal distribution, and the alternative hypothesis $H_1$ is that it does not. The test statistic is defined as
$$D = \max_x \left| f(x) - g(x) \right| \tag{1}$$
where $f(x)$ is the cumulative probability of the sample value in the normal distribution and $g(x)$ is the actual cumulative probability.
Since the actual $f(x)$ and $g(x)$ are discrete values, Equation 1 is modified to
$$D_M = \max_{1 \le i \le n} \max\left\{ \left| f(x_i) - g(x_i) \right|, \left| f(x_i) - g(x_{i-1}) \right| \right\} \tag{2}$$
where $n$ is the sample size. When the data size is large and the null hypothesis holds, the scaled statistic $Z = \sqrt{n}\, D_M$ approximately conforms to the Kolmogorov distribution, whose distribution function is
$$K(x) = \sum_{j=-\infty}^{\infty} (-1)^j e^{-2 j^2 x^2}, \quad x > 0 \tag{3}$$
Taking the significance level α as 0.05, the test statistic Z and the corresponding probability p are calculated. If p is less than the significance level α, the null hypothesis $H_0$ is rejected and the distribution of the population from which the sample comes is considered to be significantly different from the normal distribution. If p is greater than α, $H_0$ is not rejected and the distribution of the population is considered not significantly different from the normal distribution.
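The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually run in the study: the function names are hypothetical, the normal distribution is assumed to be fitted with the sample mean and standard deviation, and the p value uses the asymptotic Kolmogorov series, so it is only approximate for small samples.

```python
import math
from statistics import NormalDist, mean, stdev


def kolmogorov_p_value(z, terms=100):
    """Asymptotic P(sqrt(n) * D > z) under the Kolmogorov distribution."""
    if z < 0.2:  # the series converges poorly here; the true value is ~1
        return 1.0
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * z * z)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def ks_normality_test(sample, alpha=0.05):
    """Return (D, Z, p, keep_h0) for a one-sample K-S test against a normal fit."""
    xs = sorted(sample)
    n = len(xs)
    # Assumption: the reference normal uses the sample mean and std deviation.
    reference = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = reference.cdf(x)  # theoretical cumulative probability f(x_i)
        # Compare with the empirical CDF just after and just before x_i.
        d = max(d, abs(f - i / n), abs(f - (i - 1) / n))
    z = math.sqrt(n) * d
    p = kolmogorov_p_value(z)
    return d, z, p, p >= alpha
```

Applied once per day and per octave band, the last returned value indicates whether the normality hypothesis is kept at the chosen significance level.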
The normal distribution analysis was carried out day by day for the 69 days of data. The 16 Hz octave band of AN has the highest number of days conforming to the normal distribution, 46 days; the octave band with the fewest conforming days has only 23 days; and on average 33 days conform to the normal distribution. The test was repeated for the 44 days on which the data size exceeded the average of 110 groups per day. The 16 Hz octave band of AN again has the highest number of days conforming to the normal distribution, 29 days; the octave band with the fewest conforming days has only 9 days; and on average 17.8 days conform to the normal distribution.

AN INVALID DATA DETERMINATION
Correlation Analysis of Each Octave Band Component
When the electric field strength on the surface of an AC transmission line conductor exceeds the critical strength, a large number of ionization effects produce an ionization zone around the conductor. Under the action of the electric field, positive ions in the positive zone and negative ions in the negative zone each move radially outward. Under the alternating electric field, the charged ions around the conductor move back and forth and produce a "humming" sound; this noise is a "pure tone," and its frequency is a multiple of 50 Hz. At the same time, the rapid movement of these ions produces corona current pulses around the conductor, while a large number of ions moving away from the conductor collide with air molecules and produce sound pressure pulses. The AN generated jointly by the sound pressure pulses and the corona current pulses is broadband noise and belongs to the medium- and high-frequency AN (Fa Yuan et al., 2016; Zelong et al., 2012; Cheng, 2020). Both the "pure tone" and the broadband noise are sound waves that propagate periodically outward because of the pressure exerted on the air layer by ion motion under the alternating electric field (Di et al., 2012). Many sound sources produce various ambient noises during the acquisition of transmission line AN. The frequency spectra of different types of sound sources are not the same (Lu et al., 2010; Liu et al., 2018), and the final collected sound signal is the result of the joint action of the noise components belonging to different octave bands. Therefore, it is necessary to consider the noise component data at the different octave band center frequencies as a whole and to determine the invalid data for the data set composed of them.
Frontiers in Energy Research | www.frontiersin.org November 2021 | Volume 9 | Article 775519
Eqs 4, 5 were used to calculate Pearson's correlation coefficient and the gray correlation coefficient between each pair of octave band components, respectively.
$$r = \frac{\sum_{i=1}^{N} (x_i - \mu)(y_i - \nu)}{\sqrt{\sum_{i=1}^{N} (x_i - \mu)^2}\, \sqrt{\sum_{i=1}^{N} (y_i - \nu)^2}} \tag{4}$$
where $x_i$ and $y_i$ are the sample observations of variable X and variable Y, respectively; $\mu$ and $\nu$ are the mean values of variables X and Y, respectively; N is the total number of samples.
$$\xi_i(k) = \frac{\min_i \min_k \Delta_i(k) + \rho \max_i \max_k \Delta_i(k)}{\Delta_i(k) + \rho \max_i \max_k \Delta_i(k)} \tag{5}$$
where $\Delta_i(k) = |y(k) - x_i(k)|$ is the absolute value of the difference between the variable $y(k)$ and the corresponding element of the variable $x_i(k)$ and $\rho$ is the resolution factor; usually $\rho$ is 0.5. A total of 55 pairs of correlation coefficients were obtained after calculating the Pearson correlation coefficients between the AN components by Equation 4, of which 33 pairs had correlation coefficients less than 0.5 and 28 pairs had correlation coefficients less than 0.4. The 55 gray correlation coefficients obtained by Equation 5, which capture the nonlinear relationship between the AN components, are all greater than 0.7. Since the linear correlation is weak for most pairs while the gray correlation is uniformly high, there is a strong nonlinear relationship between the octave band components, so it is necessary to treat the octave band components as a whole composed of multidimensional data. The K-S test above shows that the data do not satisfy the normal distribution in most cases; in addition, the time span of the transmission line AN collection is long and the meteorological factors change considerably during the data collection. MWKPCA is therefore used to determine the invalid data day by day to reduce the influence of the change of meteorological factors on the determination results.
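The two coefficients can be illustrated with a short Python sketch. It is a simplified reading of Eqs 4, 5 under stated assumptions: the function names are hypothetical, the gray relational grade is taken as the mean of the coefficients over k, and the minimum and maximum differences are taken over a single pair of series rather than over all compared series. In practice the series are usually brought to a common scale first.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def gray_relational_grade(reference, compared, rho=0.5):
    """Mean gray relational coefficient between a reference and one series."""
    deltas = [abs(r - c) for r, c in zip(reference, compared)]
    d_min, d_max = min(deltas), max(deltas)
    if d_max == 0.0:  # identical series
        return 1.0
    coefficients = [(d_min + rho * d_max) / (d + rho * d_max) for d in deltas]
    return sum(coefficients) / len(coefficients)


# Hypothetical toy series, only to show the calls.
band_a = [1.0, 2.0, 3.0, 4.0, 5.0]
band_b = [1.0, 4.0, 9.0, 16.0, 25.0]
linear = pearson(band_a, band_b)
gray = gray_relational_grade(band_a, band_b)
```

With 11 variables (10 octave band components plus the Awsl), looping over all unordered pairs gives the 55 coefficient values discussed above.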

Algorithm Principle of MWKPCA
KPCA can be viewed as a principal component analysis in a high-dimensional feature space (Zhang and Luo, 2018; Zhu et al., 2021). Compared with traditional PCA, it needs to project the dataset $X = [x_1, x_2, \ldots, x_N]$ into the high-dimensional feature space Γ through a nonlinear mapping $\phi$ to obtain a new dataset:
$$\phi(X) = [\phi(x_1), \phi(x_2), \ldots, \phi(x_N)] \tag{6}$$
where X is a matrix of N rows and M columns, $\phi(X)$ is a matrix of D rows and M columns, and D > N.
Then the covariance matrix in the higher dimensional space is $C^{\Gamma}$:
$$C^{\Gamma} = \frac{1}{N} \sum_{i=1}^{N} \phi(x_i)\, \phi(x_i)^{T} \tag{7}$$
The kernel matrix $K \in \mathbb{R}^{N \times N}$ is usually obtained in the high-dimensional feature space using the kernel function instead of the mapping function, followed by the calculation of the centered kernel matrix $\bar{K}$:
$$\bar{K} = K - 1_N K - K 1_N + 1_N K 1_N \tag{8}$$
where K is the kernel matrix and $1_N$ is an N × N matrix in which each element is $1/N$. The eigenvectors $(P_1, P_2, \ldots, P_A)$ and the corresponding eigenvalues $(\lambda_1, \lambda_2, \ldots, \lambda_A)$ are obtained by the singular value decomposition of the covariance matrix R of the matrix $\bar{K}$, where A (A < N) is the number of principal components determined by the cumulative variance contribution. The covariance matrix is shown in the following equation:
$$R = \frac{1}{N-1} \bar{K}^{T} \bar{K} = P \Lambda P^{T} + P_e \Lambda_e P_e^{T} \tag{9}$$
where P is the principal component load matrix, $P_e$ is the residual load matrix, and $\Lambda_e$ is the diagonal matrix of the residual eigenvalues. With the KPCA model established, the $T^2$ statistic is used to evaluate the projection of a sample into the principal component subspace:
$$T^2 = \bar{k}^{T} P \Lambda^{-1} P^{T} \bar{k} \tag{10}$$
$$T^2_{UCL} = \frac{A(n-1)}{n-A} F_{\alpha}(A, n-A) \tag{11}$$
where $\bar{k}$ is the centered kernel vector of the sample, $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_A)$ is the principal variance matrix, n is the number of samples, A is the number of principal components, and $F_{\alpha}(A, n-A)$ is the critical value of the F distribution with degrees of freedom A and n − A. With the confidence coefficient α, the control threshold of the $T^2$ statistic is $T^2_{UCL}$.
The SPE statistic in the residual subspace is used to determine data anomalies. The SPE statistic is given in Eq. 12:
$$SPE = \left\| \left( I - P P^{T} \right) \bar{k} \right\|^2 \tag{12}$$
The control threshold $SPE_{UCL}$ is given in Equation 13:
$$SPE_{UCL} = \theta_1 \left[ \frac{C_{\alpha} \sqrt{2 \theta_2 h_0^2}}{\theta_1} + 1 + \frac{\theta_2 h_0 (h_0 - 1)}{\theta_1^2} \right]^{1/h_0} \tag{13}$$
where α is the confidence level, $C_{\alpha}$ is the critical value of the normal distribution at the detection level of α, $h_0 = 1 - 2\theta_1\theta_3 / (3\theta_2^2)$, and $\theta_i = \sum_{j=A+1}^{m} \lambda_j^{i}$, i = 1, 2, 3, with m the total number of eigenvalues. MWKPCA introduces a moving window on the basis of KPCA. For cases such as this paper, where the time span is up to 6 months, the invalid data are determined day by day, and the training data and test data are continuously updated together with $SPE_{UCL}$ and $T^2_{UCL}$, so as to reduce the negative impact of changes in meteorological factors on the results of invalid data determination.
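As an illustration of the SPE control threshold, the sketch below evaluates the standard Jackson-Mudholkar expression from the residual eigenvalues. It assumes that the formula used here has this standard form, consistent with the definitions of h0 and θi in the text; the function name and the example eigenvalues are hypothetical. The T² threshold is not included because the F-distribution quantile it needs is not available in the Python standard library.

```python
import math
from statistics import NormalDist


def spe_control_limit(residual_eigenvalues, confidence=0.85):
    """SPE threshold from the eigenvalues left out of the principal subspace."""
    theta1 = sum(lam for lam in residual_eigenvalues)
    theta2 = sum(lam ** 2 for lam in residual_eigenvalues)
    theta3 = sum(lam ** 3 for lam in residual_eigenvalues)
    h0 = 1.0 - 2.0 * theta1 * theta3 / (3.0 * theta2 ** 2)
    # Critical value of the standard normal distribution at the given level.
    c_alpha = NormalDist().inv_cdf(confidence)
    bracket = (c_alpha * math.sqrt(2.0 * theta2 * h0 ** 2) / theta1
               + 1.0
               + theta2 * h0 * (h0 - 1.0) / theta1 ** 2)
    return theta1 * bracket ** (1.0 / h0)


# Hypothetical residual eigenvalues, only to show the call.
limit_085 = spe_control_limit([0.5, 0.3, 0.2], confidence=0.85)
limit_099 = spe_control_limit([0.5, 0.3, 0.2], confidence=0.99)
```

A higher confidence level gives a larger threshold, so fewer samples are flagged; the sketch does not guard against degenerate eigenvalue sets for which h0 is zero or the bracket is negative.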
The flow of MWKPCA calculation is shown in Figure 1.
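The KPCA step of this flow can be sketched with NumPy as follows. This is a hedged illustration rather than the implementation behind the reported results. The class and function names are hypothetical. The radial basis function is assumed to be exp(-||x - y||² / gamma), which is only one possible reading of the kernel width. The SPE is computed as the squared feature-space residual after projection onto the retained components, a common variant that may differ in detail from the formula used in the paper.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Assumed kernel form: k(x, y) = exp(-||x - y||^2 / gamma)."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / gamma)


class KPCAMonitor:
    def __init__(self, gamma=16.0, n_components=9):
        self.gamma = gamma
        self.n_components = n_components

    def fit(self, x_train):
        self.x_train = np.asarray(x_train, dtype=float)
        n = len(self.x_train)
        self.k_train = rbf_kernel(self.x_train, self.x_train, self.gamma)
        one_n = np.full((n, n), 1.0 / n)
        # Centered kernel matrix: K - 1_N K - K 1_N + 1_N K 1_N.
        k_c = (self.k_train - one_n @ self.k_train - self.k_train @ one_n
               + one_n @ self.k_train @ one_n)
        eigvals, eigvecs = np.linalg.eigh(k_c / n)
        top = np.argsort(eigvals)[::-1][: self.n_components]
        self.eigvals = eigvals[top]
        # Scale so that the feature-space eigenvectors have unit length.
        self.alphas = eigvecs[:, top] / np.sqrt(n * self.eigvals)
        return self

    def statistics(self, x_new):
        """Return (T2, SPE) arrays for the rows of x_new."""
        x_new = np.asarray(x_new, dtype=float)
        k_new = rbf_kernel(x_new, self.x_train, self.gamma)
        col_mean = self.k_train.mean(axis=0)
        total_mean = self.k_train.mean()
        row_mean = k_new.mean(axis=1, keepdims=True)
        k_c = k_new - col_mean[None, :] - row_mean + total_mean
        scores = k_c @ self.alphas
        t2 = (scores ** 2 / self.eigvals).sum(axis=1)
        # Squared norm of the centered feature vector; k(x, x) = 1 for this kernel.
        self_c = 1.0 - 2.0 * row_mean[:, 0] + total_mean
        spe = np.maximum(self_c - (scores ** 2).sum(axis=1), 0.0)
        return t2, spe
```

Samples whose T² or SPE exceeds the corresponding control threshold would then be treated as invalid; in the moving window variant the model is refitted as the training window is updated.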

Multidimensional Invalid Data Determination
The 484 sets of data in which each octave band component is close to the average value of that component are selected as the initial training data. The training data are updated as the invalid data are determined day by day: the data judged as normal on a given day are added to the training data, and the corresponding number of data are eliminated from the previous training data, so that abnormal data are detected day by day for the 7,658 sets of test data. For the initial training model, the confidence level is α = 0.85 and the kernel width of the radial basis function is gamma = 16; the control threshold $SPE_{UCL}$ for the SPE statistic and the control threshold $T^2_{UCL}$ for the $T^2$ statistic are computed accordingly, and the corresponding number of principal components is 9. The final outlier determination results are shown in Figure 2: 1,013 groups exceeded the threshold of the SPE statistic, 703 groups exceeded the threshold of the $T^2$ statistic, and 1,475 groups in total were finally rejected.
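The day-by-day update of the training data can be sketched independently of the detector. In the Python sketch below, the function name and the `fit` and `is_invalid` callables are hypothetical placeholders, and dropping the oldest training samples first is an assumption; the text only states that a corresponding number of previous training data are eliminated. A simple three-sigma rule stands in for the MWKPCA thresholds purely to make the example runnable.

```python
from collections import deque
from statistics import mean, stdev


def moving_window_rejection(initial_training, daily_batches, fit, is_invalid):
    """Flag invalid samples day by day while sliding the training window."""
    # A deque with maxlen keeps the window size fixed: appending a new
    # normal sample silently drops the oldest one (assumed update rule).
    window = deque(initial_training, maxlen=len(initial_training))
    rejected = []
    for day, batch in enumerate(daily_batches):
        model = fit(list(window))  # refit once per day on the current window
        for sample in batch:
            if is_invalid(model, sample):
                rejected.append((day, sample))
            else:
                window.append(sample)
    return rejected, list(window)


# Stand-in detector (NOT the MWKPCA statistics): mean and std of the window.
def fit_mean_std(samples):
    return mean(samples), stdev(samples)


def outside_three_sigma(model, sample):
    centre, spread = model
    return abs(sample - centre) > 3.0 * spread


rejected, window = moving_window_rejection(
    [10.0, 10.5, 9.5, 10.2, 9.8],
    [[10.1, 50.0], [9.9, 10.3]],
    fit_mean_std,
    outside_three_sigma,
)
```

Replacing the stand-in callables with a KPCA model and its SPE and T² thresholds would give the moving window scheme described above.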

PREDICTION OF AWSL EFFECTIVE DATA
Percentile Comparison
Table 1 shows the percentiles of each octave band component of AN for the original data and for the data after MWKPCA (Ln in the table denotes the value at the top n% position when the data are arranged in descending order). For most of the octave band components, L5, L50, and L95 do not change much after the invalid data are removed, so the elimination of invalid data using the method of this paper basically does not affect the study of AN data (Liu et al., 2014a).

Prediction Result Comparison
Before prediction, each feature is normalized by Eq. 14, and the values are converted to between 0 and 1 to avoid the effect of the difference in magnitude between different features on the prediction accuracy.
$$S = \frac{s - S_{\min}}{S_{\max} - S_{\min}} \tag{14}$$
where S is the normalized result of each feature; s is the original data of each feature; $S_{\max}$ and $S_{\min}$ are the maximum and minimum values of each feature. To prevent chance effects caused by the random combination of data when dividing the training and test sets, this paper divides the data set into 10 parts for 10-fold cross validation, taking one part in turn as the test set and the remaining nine as the training set. The error of the model prediction results is quantified by the root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and symmetric mean absolute percentage error (SMAPE), as shown in Eqs. 15-18; the smaller the error, the better the prediction.
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \tag{15}$$
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right| \tag{16}$$
$$MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \tag{17}$$
$$SMAPE = \frac{100\%}{n} \sum_{i=1}^{n} \frac{\left| \hat{y}_i - y_i \right|}{\left( |\hat{y}_i| + |y_i| \right) / 2} \tag{18}$$
where $y_i$ and $\hat{y}_i$ represent the true and predicted values; n represents the number of predicted versus true values.
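The normalization and the four error measures can be written compactly in Python. This is a minimal sketch using the standard definitions of these measures; the function names are hypothetical, and the percentage errors assume that no true value is exactly zero, which would need separate handling if the target itself were min-max normalized.

```python
import math


def min_max_normalize(values):
    """Scale one feature to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def regression_errors(y_true, y_pred):
    """Return RMSE, MAE, MAPE (%) and SMAPE (%) for paired values."""
    n = len(y_true)
    diffs = [p - t for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    mae = sum(abs(d) for d in diffs) / n
    mape = 100.0 / n * sum(abs(d / t) for d, t in zip(diffs, y_true))
    smape = 100.0 / n * sum(
        abs(d) / ((abs(t) + abs(p)) / 2.0)
        for d, t, p in zip(diffs, y_true, y_pred)
    )
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "SMAPE": smape}


# Hypothetical Awsl values in dB(A), only to show the calls.
scaled = min_max_normalize([40.0, 45.0, 50.0])
errors = regression_errors([50.0, 40.0], [45.0, 44.0])
```

In a 10-fold setting these errors would be computed on each held-out fold and then averaged.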
To better reflect the improvement of the prediction accuracy brought by the outlier rejection algorithm, this paper uses LightGBM and XGBoost, which are based on the Boosting model, the hyperplane-based SVR, the distance-based KNN, and the elastic net and linear regression to predict the Awsl. Predictions were made using the data sets before and after the invalid data rejection in this paper, respectively, and the mean values of the final Awsl prediction errors are shown in Table 2. The mean error of the prediction results after invalid data rejection using MWKPCA is lower than that of the original data, so the invalid data rejection has contributed to the improvement of the prediction accuracy.
The above six algorithms were also used to predict the effective Awsl data after eliminating invalid data by IF, DBSCAN, LOF, KPCA, and MWKPCA. The comparison of the mean error values of the prediction results is shown in Table 2; the mean error values after eliminating invalid data by MWKPCA are significantly lower than those of the other four methods.

CONCLUSION
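A method is proposed to reject the invalid data of AN on transmission lines using MWKPCA. After the invalid transmission line AN data are rejected with this method, the 5%, 50%, and 95% values of the data are basically unchanged, so the subsequent study of AN is not affected, and the accuracy of the Awsl prediction is improved.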