Review and Perspectives of Machine Learning Methods for Wind Turbine Fault Diagnosis

Wind turbines (WTs) generally comprise several complex and interconnected systems, such as the hub, converter, gearbox, generator, yaw system, pitch system, hydraulic system, control system, integration control system, and auxiliary system. Fault diagnosis therefore plays an important role in ensuring WT safety. In the past decades, machine learning (ML) has shown a powerful capability in fault detection and diagnosis of WTs, thereby remarkably reducing equipment downtime and minimizing financial losses. This study provides a comprehensive review of recent studies on ML methods and techniques for WT fault diagnosis. These studies are classified as supervised, unsupervised, and semi-supervised learning methods. Existing state-of-the-art methods are analyzed and their characteristics are discussed. Perspectives on challenges and further directions are also provided.


INTRODUCTION
Wind power has gained remarkable attention in the past decade because wind energy is one of the most rapidly growing clean energy sources and has received worldwide support for renewable energy development (MUA, 2017). In recent years, to achieve its carbon peak and carbon neutrality goals, China has commercialized and expanded the use of renewable energy and has demonstrated its determination to reach peak carbon dioxide emissions by 2030 and carbon neutrality by 2060. As the main force of global renewable energy development, China attaches great importance to new energy, especially wind power generation. According to the statistics of the Global Wind Energy Council (GWEC), the newly installed capacity of the country reached 65.1 GW in 2019 (Elizondo et al., 2019). The large-scale development and utilization of wind energy have brought huge opportunities for the development of the market economy and have also raised crucial challenges related to reliability, cost-effectiveness, and energy security. On the one hand, wind turbines (WTs) are often located in remote areas, operate in harsh working environments for long periods, and must withstand randomly varying weather conditions, wind shear, temperature, wind speed, and load, which leads to frequent WT failures. As shown in Figure 1, the WT component with the highest fault rate is the electrical system (Hahn et al., 2007), followed by the control system and sensors. On the other hand, the high cost of operation and maintenance (OM) of WTs underscores the urgency of fault diagnosis. Evidently, fault diagnosis and the timely maintenance of WTs can avoid huge financial losses.
Given the preceding reasons, fault warning and fault diagnosis of WTs should be performed. Fault diagnosis based on machine learning (ML) is suggested for monitoring the operating conditions of WTs because it can minimize downtime, reduce the OM cost of WTs, and extend the service life of these turbines. With the advance of fault diagnosis technology, many local and international experts and scholars have proposed efficient fault diagnosis methods for various components and fault types, such as power system (Qiao and Lu, 2015; Zappalá et al., 2019), mechanical (Wang et al., 2016a; Chen et al., 2016b; Garg and Dahiya, 2017; Salameh et al., 2018), and driving faults (Nasiri et al., 2015; Zeng et al., 2015). Among these, generator (Hossain et al., 2015; Yang et al., 2017) and gearbox faults (Wang et al., 2016b; Igba et al., 2016; Teng et al., 2016; Wang et al., 2019) are the most studied. Fault diagnosis methods are classified into analytical model-based, knowledge-based, and data-driven methods (Chen et al., 2016a).
The analytical model-based WT fault diagnosis methods need to analyze and model the system to achieve real-time diagnosis of faults, which are often directly related to WT model parameters (Gao et al., 2015; Zhong et al., 2018). As the understanding of the WT fault mechanism deepens, modeling is refined to increase the accuracy of fault diagnosis. However, these methods use system residuals to model the internal subsystems of the WT for state estimation and online approximation, and this process has difficulty in ensuring the accuracy of fault diagnosis (Cho et al., 2018). Inevitable modeling errors and unknown interference terms arise, so the aforementioned process is insufficient to guarantee robustness.
Knowledge-based WT fault diagnosis methods rely on expert experience in wind power-related fields (da Silva et al., 2012; Yang et al., 2016). The accuracy of the diagnosis results depends on the breadth of experience and the knowledge level of WT fault diagnosis experts, and these methods lack self-learning and recognition abilities. They cannot acquire new knowledge from the engineering cases diagnosed during WT operation. Hence, poor diagnosis accuracy may result.
Without relying on prior experience, data-driven WT fault diagnosis methods use data mining technology to obtain hidden useful information that characterizes the fault and normal states of the system, and eventually realize real-time fault diagnosis (Ding, 2012; Qin, 2012). The WT supervisory control and data acquisition (SCADA) system contains real-time online data and extensive offline data. Mining these data is necessary to obtain detailed fault characteristics, thereby realizing real-time WT fault diagnosis. Data-driven WT fault diagnosis methods include ML, multivariate statistical analysis, signal analysis, and information fusion methods (Yin et al., 2014).
As shown in Figure 2, ML-based WT fault diagnosis methods can be generally divided into supervised, unsupervised, and semi-supervised learning methods. Although some literature reviews on WT fault diagnosis and condition monitoring (de Azevedo et al., 2016) have been published, a comprehensive review of ML-based fault diagnosis methods is still lacking. Therefore, the current study provides a systematic and pertinent state-of-the-art review of recent studies on ML methods and techniques that have been used for WT fault diagnosis. In particular, this research summarizes the research methods in WT fault diagnosis, presents the strengths and shortcomings of existing methods, and identifies the challenges and recommended directions for future research in this domain.

Fault Diagnosis of Wind Turbine
Numerous countries have previously conducted research on WT technology, and European countries and the US have made some progress in fault diagnosis and prediction (Habibi et al., 2019). For example, Siemens' SCADA system is widely used in major wind power generation industries (Dao et al., 2018).
Compared with European and American countries, China's wind power industry started late, but WT fault diagnosis research has made some progress in recent years. With the progress of artificial intelligence and ML in recent years, WT fault diagnosis methods have been intensively studied. The WT structure is shown in Figure 3. The main components of a WT include the wind wheel, gearbox, generator, converter, yaw system, pitch system, hydraulic system, control system, integration control system, and auxiliary system (Lin et al., 2016).
The wind wheel is key to the energy conversion of a WT, and its operational stability directly affects the efficiency and safety of the WT. As the operating time of a WT increases, the failure rate of the wind wheel and other components also increases, which seriously affects the working performance of the WT. In a non-stationary state, the frequency components of a WT fault at the generator output spread over a bandwidth proportional to the speed, which makes diagnosis considerably more complicated. Therefore, Dahiya (2018) proposed a WT fault diagnosis method based on wavelet analysis that uses electrical signals to diagnose rotor eccentricity faults. The effectiveness of this method under varying speed and load conditions has been verified through experiments.
Gearboxes are among the most important, and the most expensive, WT sub-assemblies. They often operate under extreme temperatures and high rotation speeds, which cause a high fault rate and irreversible damage to the WT. At present, many studies have been conducted on the fault diagnosis of WT gearboxes (Salameh et al., 2018). Du et al. (2015) proposed a convex optimization-based WT generator gearbox fault diagnosis method. This method identifies multiple features from the superimposed signal of the WT gearbox and makes full use of potential a priori information, namely that multiple faults with similar spectra have different morphological waveforms, which can be sparsely represented over a union of redundant dictionaries. The proposed framework is verified by diagnosing multiple gearbox faults in a wind farm. Zhang et al. (2017) used the Morlet wavelet-based continuous wavelet transform for actual WT gear fault diagnosis. This signal analysis method provides considerably refined time-frequency characteristics and achieved satisfactory results.
A generator is the core equipment for generating electricity in a WT; it converts kinetic energy into electrical energy. Generators also experience a high failure rate owing to the harsh environment, large load fluctuations, and diverse operating parameters. Numerous publications have specifically addressed WT generator fault diagnosis. To avoid incorrect recognition of internal patterns caused by heavy noise, Chen et al. (2016b) extracted inherent modulation information by decomposing the signal into mono-components on an orthogonal basis using the empirical wavelet transform (EWT). Moreover, before EWT, they applied wavelet spatial adjacent coefficient denoising with a data-driven threshold to improve the signal-to-noise ratio (SNR); the approach is considered a powerful tool for WT generator fault diagnosis. Yang et al. (2017) considered the shortcoming that sparse representation results are affected by the dictionary and proposed a novel data-driven fault diagnosis method for WT generators based on shift-invariant dictionary learning and sparse representation, which can effectively identify WT generator faults. With the learned shift-invariant dictionary, the coefficients obtained are considerably sparse, and the extracted impulse signal closely approximates the real signal.
The converter is a critical component of WT energy conversion; through it, the WT outputs current with stable frequency and amplitude to the grid. Converters have poor stability and are often subjected to high-temperature and high-pressure working conditions, and long-term operation causes irreversible damage to the WT system. Toubakh and Sayed-Mouchaweh (2016) analyzed converter faults caused by parameter drift and proposed a WT converter fault diagnosis method based on a hybrid dynamic classifier, which can monitor the normal operation of converters in the discrete modes affected by parameter faults. The parameter drift under these conditions is used for early-stage fault diagnosis of the WT converter. Liang et al. (2020) proposed a fault diagnosis method for WT converters. A series of intrinsic mode functions is obtained through overall empirical mode decomposition of the measured output voltage. Thereafter, the standard entropy is calculated from the statistical characteristics of the intrinsic mode functions, the extracted information is used to describe the diagnostic characteristics, and fault diagnosis of the WT system is performed. The reported diagnostic accuracy is 99.57%.
The yaw system is an important WT component that drives the WT nacelle to rotate around the tower centerline, thereby keeping the swept plane of the wind wheel perpendicular to the wind direction. Yaw system failures often occur owing to the harsh operating environment and load fluctuations, thereby affecting the power generation efficiency of WTs. To qualitatively evaluate the zero-offset error of the yaw system, Pei et al. (2018) proposed a data-driven method for WT yaw system fault diagnosis, which detects the zero-point shifting fault by analyzing the power characteristics at different yaw angles. If the yaw angle measurement error is greater than a predetermined threshold, then the zero-point shift fault is triggered, so that the fault can be detected in time and the WT performance improved. For yaw system faults, Ouanas et al. (2018) proposed a fault diagnosis method for the WT yaw system based on signal analysis. The inverter signal provided by the yaw drive is filtered, and the discrete wavelet transform and empirical mode decomposition are used to eliminate redundant information. Faults are then detected from the envelope obtained with the Hilbert transform, which verifies the effectiveness of the method.
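The threshold logic described for the zero-point shift fault can be illustrated with a minimal Python sketch. It is not the procedure of Pei et al. (2018); it only assumes that power is highest when the real misalignment is zero, and it uses hypothetical data generated with a cos^3 power law, a hypothetical 5 degree threshold, and hypothetical function names.

```python
import math


def estimate_zero_offset(samples):
    """Return the measured yaw angle (deg) at which the mean power is highest.

    samples: iterable of (measured_yaw_deg, power_kw) pairs, assumed to be
    recorded at comparable wind speeds.
    """
    by_angle = {}
    for yaw, power in samples:
        by_angle.setdefault(yaw, []).append(power)
    return max(by_angle, key=lambda a: sum(by_angle[a]) / len(by_angle[a]))


def zero_point_shift_fault(samples, threshold_deg=5.0):
    """Flag a zero-point shift when the estimated offset exceeds the threshold."""
    offset = estimate_zero_offset(samples)
    return abs(offset) > threshold_deg, offset


def synthetic_samples(true_offset_deg):
    """Hypothetical data: power falls off as cos^3 of the real misalignment."""
    return [
        (a, 1000.0 * math.cos(math.radians(a - true_offset_deg)) ** 3)
        for a in range(-20, 21, 2)
    ]


healthy = synthetic_samples(0)   # sensor zero is correct
shifted = synthetic_samples(8)   # sensor zero is off by 8 degrees
```

With these synthetic inputs, `zero_point_shift_fault(healthy)` reports no fault, whereas `zero_point_shift_fault(shifted)` reports a fault with an estimated offset of 8 degrees. Real SCADA data would additionally require binning by wind speed and averaging over many samples.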
The pitch control system is the speed control device of a WT and adjusts the power by changing the blade angle of attack. Owing to the variable external wind conditions and the complicated internal structure of the pitch system, its failure rate is high, and failures can easily cause abnormal output power, blade damage, and even unit collapse. Many studies have proposed fault diagnosis methods for pitch systems. Habibi et al. (2017) proposed a fault diagnosis method for the WT pitch system using a nonlinear model and formulated the problem of maximizing energy extraction by designing the optimal desired state. Experiments have been performed to verify the practicability of the proposed method. Lan et al. (2018) conducted a study based on the state estimation and fault indicator functions of an adaptive step-by-step sliding window observer for a pitch system, which can effectively deal with the nonlinear fault distribution function and identify the pitch faults of WTs.
A hydraulic system is an important WT component and plays an essential role in the yaw, pitch, and transmission chain braking of WTs. Hydraulic systems operate in all-weather, open-air, and high-altitude conditions and are prone to failures such as oil leakage and spool jamming, which makes maintenance difficult. For WT hydraulic system faults, Yang et al. (2011) proposed a fault detection method for the WT hydraulic system based on the Petri net model. First, Petri net theory is used to establish a model for each discrete operating state of the WT hydraulic pitch system, and a fault Petri net model is built. Thereafter, a system reliability index is obtained based on the qualitative fault analysis and calculation of the Petri net. The Petri net model is simple to compute, is well suited to WT hydraulic system fault diagnosis, and has broad application prospects.

Machine Learning Methods for Wind Turbine Fault Diagnosis
ML refers to a computer learning from a limited amount of data, without specialist intervention, to train an inductive model that is then used to guide future decisions (Clifton et al., 2013; Stetco et al., 2019). The ML method has been used for fault diagnosis in WTs (Leahy et al., 2016) and consists of inputs, outputs, models, and objective functions. Let the WT data be $x = \{x_1, x_2, \ldots, x_n\}$, a data set containing $n$ samples, and let $y$ be the fault category. We use the training samples $\{x_i, y_i\}_{1}^{M}$ (with $\{x_i, y_i\}_{1}^{M} \in \{x, y\}$), where $M$ is the total number of training samples, to train the model and obtain the approximation $f(x)$ that fits the real value $y$. The learned mapping between $x$ and $y$ is denoted $y_p$:
$$y_p = \arg\min_{f} E_{x,y}\, L\big(y, f(x)\big) \qquad (1)$$
In Equation 1, $L$ represents a loss function, and the average loss over the training set is called the empirical risk. The goal of ML is to minimize the empirical risk. Frequently employed loss functions include the 0-1, square, absolute value, and log loss functions.
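To make the empirical risk of Equation 1 concrete, the following minimal Python sketch evaluates the four loss functions named above on a small binary example. The labels (1 = fault, 0 = normal) and predicted probabilities are hypothetical and are not taken from any cited study.

```python
import math


def zero_one_loss(y, p):
    """0-1 loss on a hard label prediction: 1 if wrong, 0 if correct."""
    return 0.0 if y == p else 1.0


def square_loss(y, p):
    return (y - p) ** 2


def absolute_loss(y, p):
    return abs(y - p)


def log_loss(y, p, eps=1e-12):
    """Binary log loss; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def empirical_risk(loss, ys, preds):
    """Average loss over the M training samples."""
    return sum(loss(y, p) for y, p in zip(ys, preds)) / len(ys)


# Hypothetical labels and predicted fault probabilities.
labels = [1, 0, 0, 1]
probs = [0.9, 0.2, 0.6, 0.4]
hard = [1 if p >= 0.5 else 0 for p in probs]  # thresholded predictions

risk_01 = empirical_risk(zero_one_loss, labels, hard)
risk_sq = empirical_risk(square_loss, labels, probs)
risk_abs = empirical_risk(absolute_loss, labels, probs)
risk_log = empirical_risk(log_loss, labels, probs)
```

The 0-1 loss only counts misclassifications, whereas the square, absolute value, and log losses also penalize how far a predicted probability lies from the label, which is why they are preferred when the model is trained by gradient-based optimization.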
Overfitting is one of the key issues in ML. Therefore, both the empirical and structural risks should be minimized. The regular term $J(f)$ is introduced to measure the model complexity. Frequently employed regular terms are the L1 (Lasso) and L2 (Ridge) penalties. The final optimized objective function can be expressed as follows:
$$y_p = \arg\min_{f} \Big[ E_{x,y}\, L\big(y, f(x)\big) + \lambda J(f) \Big] \qquad (2)$$
where $\lambda \geq 0$ weights the regular term against the empirical risk.
ML methods are divided into supervised, unsupervised, and semi-supervised learning methods (Lei et al., 2020). The current study also classifies the ML-based WT fault diagnosis methods as supervised, unsupervised, and semi-supervised learning methods, which are analyzed and discussed in the following sections.
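The effect of the regular term J(f) can be sketched for the simplest possible case. The example below assumes a single-weight linear model f(x) = w x with square loss, which is chosen purely for illustration; `lam` is a hypothetical coefficient that weights the penalty, and the data are made up.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def fit_single_weight(xs, ys, lam=0.0, penalty="ridge"):
    """Minimize 0.5 * sum((y - w*x)^2) + lam * J(w) for one weight w.

    ridge: J(w) = 0.5 * w^2   (L2 penalty)
    lasso: J(w) = |w|         (L1 penalty)
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    if penalty == "ridge":
        return sxy / (sxx + lam)
    if penalty == "lasso":
        return soft_threshold(sxy, lam) / sxx
    raise ValueError("penalty must be 'ridge' or 'lasso'")


# Hypothetical data generated by y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_plain = fit_single_weight(xs, ys)                         # no penalty
w_ridge = fit_single_weight(xs, ys, lam=14.0)               # shrinks w
w_lasso = fit_single_weight(xs, ys, lam=7.0, penalty="lasso")
w_zero = fit_single_weight(xs, ys, lam=30.0, penalty="lasso")
```

Both penalties pull the weight below the unregularized value of 2. The Ridge penalty shrinks it smoothly, whereas a sufficiently large Lasso penalty sets it exactly to zero, which is why Lasso is often used for feature selection.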

Supervised Learning Methods for Wind Turbine Fault Diagnosis
Supervised learning is a process of adjusting classifier parameters using samples of a known class to achieve the desired performance. In supervised learning (Schwenker and Trentin, 2014; Zhou, 2018), the computer receives example inputs together with their required outputs, and the target is to learn a general rule that maps inputs to outputs. Supervised learning methods are widely used in the WT fault diagnosis field. As shown in Figure 4, supervised learning methods have different algorithms for specific problems. First, taking WT fault diagnosis (Jiménez et al., 2019) as the research object, data are obtained from the SCADA system of the WTs; the data are divided into training, validation, and test sets; and the data set is preprocessed, with the data normalized after the missing values are handled. Second, an ML algorithm is chosen and trained on the training set to build the model. Third, the test set is used to evaluate the model quality. Lastly, an accurate fault classification is obtained by continuously optimizing the WT fault diagnosis model.
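The data preparation steps listed above (splitting, missing-value handling, and normalization) can be sketched as follows. The SCADA rows, feature names, split ratios, and function names are hypothetical; mean imputation and z-score normalization are one common choice and are assumed here rather than taken from the cited studies.

```python
import random
import statistics


def split(rows, train=0.6, val=0.2, seed=0):
    """Shuffle and divide the rows into training, validation, and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * val)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def fit_preprocessor(train_rows):
    """Learn per-feature mean and std from the training set only (None = missing)."""
    n_features = len(train_rows[0][0])
    stats = []
    for j in range(n_features):
        col = [x[j] for x, _ in train_rows if x[j] is not None]
        mean = statistics.fmean(col)
        std = statistics.pstdev(col) or 1.0  # guard against constant features
        stats.append((mean, std))
    return stats


def transform(rows, stats):
    """Impute missing values with the training mean, then z-score normalize."""
    out = []
    for x, y in rows:
        z = [((m if v is None else v) - m) / s for v, (m, s) in zip(x, stats)]
        out.append((z, y))
    return out


# Hypothetical rows: ([wind_speed_m_s, gearbox_temp_C], label), 1 = fault.
rows = [
    ([5.1, 60.0], 0), ([6.3, None], 0), ([7.0, 63.5], 0), ([None, 61.2], 0),
    ([8.2, 65.0], 0), ([9.1, 80.3], 1), ([4.8, 59.1], 0), ([10.4, 84.0], 1),
    ([6.9, 62.7], 0), ([7.7, 64.4], 0),
]

train_rows, val_rows, test_rows = split(rows)
stats = fit_preprocessor(train_rows)
train_z = transform(train_rows, stats)
test_z = transform(test_rows, stats)
```

The statistics are estimated on the training set and reused for the validation and test sets, so that no information from the evaluation data leaks into the model.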

Artificial Neural Network
Artificial neural network (ANN) (Agatonovic-Kustrin et al., 2000; Xi et al., 2020) is one of the most frequently used supervised learning algorithms. An ANN consists of numerous neurons and is divided into an input layer, a hidden layer, and an output layer. ANN is widely used in the fault diagnosis field (Samanta et al., 2003; Saravanan and Ramachandran, 2010). By learning from known fault samples, the mapping relationship between fault characteristics and fault categories is established to detect whether a device is faulty. Figure 5 shows a simple three-layer WT fault diagnosis model based on ANN, where $x_1, x_2, \ldots, x_n$ are the input characteristics of the WT, $n$ is the number of input characteristics, and $m$ is the number of WT fault types.
Frequently employed neural network methods include the adaptive resonance theory (ART), self-organizing map (SOM), and radial basis function (RBF) neural networks.
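The three-layer structure can be illustrated with a forward pass written in plain Python. This is a minimal sketch under stated assumptions: sigmoid hidden units and a softmax output are assumed, and the weights, biases, and input values are hypothetical placeholders for parameters that would normally be learned from fault samples.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Input layer -> hidden layer (sigmoid) -> output layer (softmax)."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    scores = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_out, b_out)
    ]
    return softmax(scores)


# Hypothetical network: n = 3 input characteristics, 2 hidden neurons,
# m = 3 fault types.
w_hidden = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
b_hidden = [0.0, 0.1]
w_out = [[1.0, -1.0], [-1.0, 1.0], [0.2, 0.2]]
b_out = [0.0, 0.0, 0.0]

x = [0.6, 0.1, 0.9]  # normalized input characteristics (made up)
probs = forward(x, w_hidden, b_hidden, w_out, b_out)
predicted_fault = max(range(len(probs)), key=probs.__getitem__)
```

The output is a probability for each of the m fault types, and the predicted fault type is the index with the highest probability.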
Zhang and Wang (2014) proposed an ANN-based fault diagnosis method for the WT main bearing based on WT SCADA system data. The difference between the theoretical and actual parameter values is used to identify main bearing faults of the WT at an early stage. To decrease the time needed by ANN for WT fault detection, Bielecki et al. (2014) proposed a hybrid method of ART and RBF neural networks for online detection of the WT operating status, which can monitor the WT status in time, identify early fault conditions, and has good real-time performance. However, in actual wind farm engineering, complete fault information cannot be collected, so the ANN cannot make an accurate fault diagnosis. Therefore, Zhao et al. (2015) proposed applying the SOM neural network to WT fault diagnosis and training the network with sample data of the normal WT state. Whether the WT malfunctions is judged according to the position of the output neuron in the output layer. Accordingly, the SOM neural network method can effectively diagnose WT faults with good robustness.
Although ANN-based fault diagnosis has high precision and good robustness, this method requires numerous parameters for modeling, and training the model takes a long time. China's wind power industry started late, although WT fault diagnosis research has developed in recent years. However, WT fault data samples are considerably lacking, and the accuracy and completeness of the WT data samples directly affect the accuracy of fault diagnosis classification. This issue is currently the main drawback restricting the development of ANN in WT fault diagnosis.

Support Vector Machine
Support vector machine (SVM) is a kernel-based ML method for classification and regression introduced by Vapnik (2013). The main idea is to find two parallel hyperplanes that separate two sets of data in a multi-dimensional space and to maximize the margin between them. The SVM formulation constructs the decision hyperplane under structural risk minimization to balance the empirical risk against the model complexity (Deka, 2014). SVM is mainly used in nonlinear problems, building a classification hyperplane as the decision plane such that the separation boundary between negative and positive samples is maximized. As shown in Figure 6, any hyperplane can be represented by a normal vector $W$ and a constant $b$ (intercept) as follows:
$$W^{T} x + b = 0$$
For a point $A(x_1, y_1)$, the geometric interval to the hyperplane is $d$. SVM seeks the hyperplane that separates the data points and maximizes the minimum geometric distance. The SVM solution process can be regarded as solving a convex quadratic programming problem, which has a global optimal solution. Thus, SVM is widely used in the fault diagnosis field.
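The idea of maximizing the minimum geometric distance can be checked numerically. The sketch below does not solve the SVM optimization problem; it only compares two hand-picked candidate hyperplanes on hypothetical two-dimensional data, using the standard point-to-hyperplane distance |W.x + b| / ||W||.

```python
import math


def decision_value(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def geometric_distance(w, b, x):
    """Distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm


def separates(w, b, points, labels):
    """True if every point lies on the side given by its label (+1 or -1)."""
    return all(y * decision_value(w, b, x) > 0 for x, y in zip(points, labels))


def margin(w, b, points):
    """Minimum geometric distance of the data set to the hyperplane."""
    return min(geometric_distance(w, b, x) for x in points)


# Hypothetical samples: -1 = normal, +1 = fault.
points = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (5.0, 5.0)]
labels = [-1, -1, 1, 1]

plane_a = ((1.0, 1.0), -6.0)  # passes midway between the two classes
plane_b = ((1.0, 1.0), -5.0)  # also separates, but lies closer to one class

margin_a = margin(*plane_a, points)
margin_b = margin(*plane_b, points)
```

Both hyperplanes classify the four samples correctly, but `plane_a` has the larger minimum distance, so it is the one an SVM would prefer.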
To solve the local optimum problem caused by the improper selection of sample parameters, Laouti et al. (2011) chose a radial basis function as the kernel function of SVM, which can promptly detect WT blade pitch position and generator faults and has good generalization performance. To further solve the problem of overfitting or underfitting caused by the improper selection of kernel parameters, Tang et al. (2014) proposed a method of WT fault diagnosis based on the Shannon wavelet SVM (SWSVM) and manifold learning. In this method, mixed-domain features are extracted to construct a high-dimensional feature set, manifold learning is used to compress the high-dimensional feature set into low-dimensional eigenvectors, and the low-dimensional eigenvectors are input into an SWSVM to recognize WT gearbox faults. Gao et al. (2018) proposed a novel fault diagnosis method of WT that combines mean decomposition, multi-scale entropy, and the least squares SVM. In this method, the raw WT vibration signal is divided into several groups for preprocessing. Thereafter, the mean decomposition method is applied to the grouped signals to obtain the product functions. The multi-scale entropy of the main product functions is then computed to obtain the feature vector, which is input into the least squares SVM for training. This method can significantly enhance the fault classification ability of a single SVM and classify the fault type precisely. To address the limitations of a single kernel and of parameter optimization, Zhao et al. (2018) proposed a fault diagnosis method of WT based on random subspace identification and multi-kernel SVM. Compared with the traditional SVM, the multi-kernel SVM can successfully identify the bearing fault of the WT and has higher fault diagnosis accuracy. Classification problems include not only binary problems but also multi-class problems.
SVM also shows good classification ability on multi-class problems. Liu K. et al. (2020) used a multi-class SVM (multi-SVM) to diagnose faults of a renewable energy power grid, which effectively improved the accuracy of fault diagnosis. Xue et al. (2017) proposed an intelligent fault diagnosis method combining the optimal composition of symptom parameters (SPOC) and multi-SVM to diagnose motor faults, and realized the detection and identification of multiple motor faults. In recent years, with the wide application of SVM, researchers have optimized and improved SVM, put forward ML algorithms derived from it, applied them to fault diagnosis, and achieved good results. Zhang and Zhou (2014) and Tang et al. (2019) introduced the margin mean and margin variance on the basis of SVM and proposed the large margin distribution machine (LDM), which has better classification performance than SVM. Tang et al. (2020a) used LDM to detect faults of the WT pitch system and optimized it with the state transition algorithm (STA), which significantly improved the accuracy of fault detection.
SVM uses an inner-product kernel function to map the raw data into a high-dimensional space in which they become linearly separable. However, modeling WT big data is difficult, and the selection of kernel parameters also affects the fault diagnosis accuracy. Moreover, guaranteeing classification performance on multi-type WT fault problems is difficult.
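The sensitivity to kernel parameters can be seen directly in the radial basis function kernel, K(x, z) = exp(-gamma * ||x - z||^2). The following minimal sketch computes the kernel and its Gram matrix for hypothetical feature vectors; the parameter name `gamma` and the sample values are assumptions made for illustration.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: an inner product in an implicit high-dimensional space."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def gram_matrix(points, gamma=0.5):
    """Pairwise kernel values, the only data-dependent input an SVM needs."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]


# Hypothetical normalized feature vectors.
points = [(0.0, 0.0), (1.0, 1.0), (3.0, 3.0)]

narrow = gram_matrix(points, gamma=5.0)   # similarity decays very quickly
wide = gram_matrix(points, gamma=0.05)    # distant points still look similar
```

With a large `gamma`, every sample is similar only to itself and the classifier tends to overfit, whereas with a small `gamma` all samples look alike and the classifier tends to underfit. This is the trade-off behind the kernel parameter selection issue noted above.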

Decision Tree
Decision tree (DT) is composed of multiple judgment nodes, and a classification or regression model is formed by the tree structure (Safavian and Landgrebe, 1991). The basic idea is simple, and Figure 7 shows a WT fault diagnosis model based on DT.
Benkercha and Moulahoum (2018) proposed a fault diagnosis method for a grid-connected WT generator system based on the DT algorithm with high prediction performance and high accuracy. Abdallah et al. (2018) adopted the DT algorithm to perform fault diagnosis on WTs, continuously sampling extensive data from thousands of WTs at a high rate and training an ensemble DT classifier. Compared with other ML algorithms, DT is easy to implement, but it has limitations in dealing with missing values. In WT fault diagnosis, fault samples are few and fault-free samples are many. When DT deals with such imbalanced class sizes, information gain is biased toward features with more distinct values, which makes the tree prone to overfitting; DT is therefore seldom used in WT fault diagnosis.
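The bias of information gain toward features with many distinct values can be reproduced on a small imbalanced example. The feature names and samples below are hypothetical; entropy and information gain follow their standard definitions.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting on every distinct feature value."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical imbalanced data: 6 fault-free samples and 2 fault samples.
labels = ["normal"] * 6 + ["fault"] * 2

# A physically meaningful binary feature (over-temperature flag).
high_temp = [0, 0, 0, 0, 0, 1, 1, 1]
# A meaningless feature with a unique value per sample (record identifier).
record_id = [0, 1, 2, 3, 4, 5, 6, 7]

gain_temp = information_gain(high_temp, labels)
gain_id = information_gain(record_id, labels)
```

The record identifier obtains the maximum possible gain because each of its values isolates a single sample, even though it cannot generalize to new data. Criteria such as the gain ratio are commonly used to reduce this bias.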

Ensemble Learning
The basic concept of ensemble learning (Polikar, 2012; Liu et al., 2019) is to train multiple base learners as ensemble members and combine them into a strong learner that should perform better on average than any single ensemble member. Thereafter, a model is established by optimizing the loss function to improve the fault classification performance. Frequently employed ensemble learning methods include bagging and boosting. Bootstrap aggregating, also called bagging (Breiman, 1996), is applied in regression and statistical classification. It is an ML ensemble method that obtains new data sets by sampling with replacement, trains a base learner on each new data set, and eventually combines the base learners. The algorithm reduces variance and helps to prevent overfitting. A typical bagging algorithm is random forest (RF). Cabrera et al. (2015) presented an RF-based diagnosis method for detecting faults of WT gearboxes. First, the condition parameters of the vibration signal are extracted by wavelet packet decomposition and used as the input features of the classification problem. Second, the parameter space is searched to find the best set of mother wavelets, and the best features are selected through the internal ranking of the RF classifier. Lastly, the RF algorithm is used to detect the fault of the WT gearbox. To further improve the fault detection rate, a method based on deep RF fusion (DRFF) was proposed to improve the fault detection performance for WT gearboxes. Two deep Boltzmann machines are used to characterize the parameter values of the wavelet packet transform, and their outputs are fused into an integrated DRFF model using an RF algorithm. The results indicate that DRFF may improve fault diagnosis capabilities for gearboxes compared with conventional RF.
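The bagging procedure described above (sampling with replacement, training one base learner per sample, and combining the learners) can be sketched in a few lines. This is not random forest: the base learner is a hypothetical one-feature threshold rule, and the vibration amplitudes and labels are made up for illustration.

```python
import random
from collections import Counter


def bootstrap(data, rng):
    """Draw len(data) samples with replacement."""
    return [rng.choice(data) for _ in data]


def fit_stump(data):
    """Base learner: threshold t minimizing training errors of 'fault if x > t'."""
    best_t, best_err = None, None
    for t in sorted({x for x, _ in data}):
        err = sum((1 if x > t else 0) != y for x, y in data)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagging_fit(data, n_learners=15, seed=0):
    """Train one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap(data, rng)) for _ in range(n_learners)]


def bagging_predict(thresholds, x):
    """Combine the base learners by majority vote."""
    votes = Counter(1 if x > t else 0 for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical (vibration_amplitude, label) pairs; 1 = gearbox fault.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]

thresholds = bagging_fit(data)
low = bagging_predict(thresholds, 0.05)
high = bagging_predict(thresholds, 0.95)
```

Each stump sees a slightly different data set and therefore learns a slightly different threshold; averaging their votes is what reduces the variance of the combined prediction.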
Boosting (Freund and Schapire, 1996) adjusts the algorithm by giving greater weight to misclassified samples, which results in significant improvements in classification performance. Boosting focuses on reducing bias. Many algorithms are based on boosting, such as XGBoost and LightGBM. Zhang et al. (2018) proposed an efficient WT fault detection method based on RF and XGBoost. RF is used to rank the features by importance, while XGBoost trains the ensemble classifier for each specific fault based on the top-ranking features. The proposed ensemble classifier can protect against overfitting, and experiments verify the robustness of this method. To enhance the fault diagnosis accuracy, Tang et al. (2020b) proposed the adaptive LightGBM method for WT gearbox fault detection. The correlation of the WT data samples is analyzed using the maximum information coefficient to realize the feature selection for fault detection. Meanwhile, the LightGBM method after Bayesian hyperparameter optimization is used for the fault detection of the WT gearbox. Experiments show that this method has low false alarm and missed detection rates.
Ensemble learning is widely used in fault diagnosis and early warning of WTs with high accuracy. However, some algorithms converge slowly, their weak learners rely heavily on one another, and overfitting can occur. When using ensemble methods, the number of iterations, the number of base learners, and the weights should be considered.
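The reweighting of misclassified samples can be illustrated with one round of AdaBoost, the boosting algorithm of Freund and Schapire. XGBoost and LightGBM are gradient boosting methods and work differently, so this sketch only illustrates the general principle. The labels and weak-learner predictions are hypothetical.

```python
import math


def adaboost_round(weights, labels, predictions):
    """One AdaBoost round with labels and predictions in {-1, +1}.

    Returns the weak learner's vote weight (alpha) and the updated,
    normalized sample weights. Requires a weighted error strictly
    between 0 and 1.
    """
    err = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    alpha = 0.5 * math.log((1.0 - err) / err)
    updated = [
        w * math.exp(-alpha * y * h)
        for w, y, h in zip(weights, labels, predictions)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Hypothetical round: five samples, the weak learner misclassifies the last.
labels = [1, 1, -1, -1, 1]
predictions = [1, 1, -1, -1, -1]
weights = [0.2] * 5

alpha, new_weights = adaboost_round(weights, labels, predictions)
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is forced to concentrate on it.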

Deep Learning
Deep learning, proposed by Hinton et al. (LeCun et al., 2015), is an ML process that trains a multi-level deep network structure on sample data. It combines low-level features to form considerably more abstract high-level representations and thereby discovers distributed feature representations of data. Deep learning (Schmidhuber, 2015; Goodfellow et al., 2016) is widely used in image processing, data mining, fault diagnosis (Helbing and Ritter, 2018), and other fields. Different deep learning (Jiang et al., 2018) configurations have also been introduced, such as deep belief nets (DBNs), deep auto-encoder (DAE) networks, and convolutional neural networks (CNNs). For WT gearbox faults, Qin et al. (2018) proposed a novel fault diagnosis method that combines DBNs and an improved logistic Sigmoid unit. An integrated approach that uses the optimized Morlet wavelet transform, kurtosis index, and soft-thresholding extracts impulse components from the original signals to improve the accuracy of diagnosis. Compared with the traditional Sigmoid method, the WT gearbox fault diagnosis method based on DBNs and the improved logistic Sigmoid unit has higher comprehensive performance. To achieve anomaly diagnosis and fault analysis of WT components, Zhao et al. (2018) proposed a deep learning method based on DAE networks using WT SCADA data, in which Boltzmann machines are used to build the DAE network model. This method can realize early warning of the faulty component and derive the physical location of the faulty WT component through the residual of the DAE network model. The diverse operating states of WTs and the large amount of noise interference decrease the accuracy of WT fault diagnosis. To solve this problem, Chang et al. (2020) proposed a WT fault diagnosis method based on a concurrent convolution neural network (CeCNN). The method requires no prior knowledge; the characteristics are learned adaptively and directly from the raw WT data, with high accuracy and powerful generalization ability. The convolutional layers of different branches select kernels of varying scales at the same level, thereby improving the accuracy of WT fault diagnosis. Yi and Jiang (2020) proposed a DAE-based discriminative feature learning method for WT blade icing fault detection.
November 2021 | Volume 9 | Article 751066
Although deep learning has a strong learning ability and high fault diagnosis accuracy, it requires extensive data and computing power, which entails high cost and demanding hardware. These issues should be considered.
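The residual-based idea described above for autoencoder models (train on healthy data, raise an alarm when the reconstruction residual is large, and inspect the per-channel residual to locate the fault) can be illustrated with a minimal sketch. This is not the method of any cited study: a PCA projection is used here as a linear stand-in for a deep autoencoder, and the synthetic six-channel data, the noise level, and the injected bias on channel 4 are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical healthy data: two latent operating variables, each observed
# by three redundant sensor channels, plus small measurement noise.
z = rng.normal(size=(500, 2))
mixing = np.array([[1, 1, 1, 0, 0, 0],
                   [0, 0, 0, 1, 1, 1]], dtype=float)
healthy = z @ mixing + 0.05 * rng.normal(size=(500, 6))

# "Train" the linear stand-in on healthy data only: keep the two leading
# principal directions as the encoder/decoder basis.
mean = healthy.mean(axis=0)
_, _, vt = np.linalg.svd(healthy - mean, full_matrices=False)
basis = vt[:2].T  # shape (6, 2)

def residual(x):
    """Per-channel reconstruction residual of one sample or a batch."""
    centered = x - mean
    return centered - centered @ basis @ basis.T

# Alarm threshold (assumption): largest residual norm seen on healthy data.
threshold = np.linalg.norm(residual(healthy), axis=1).max()

# Simulated fault (assumption): a +3.0 bias on channel 4 of a healthy sample.
faulty = healthy[0].copy()
faulty[4] += 3.0

r = residual(faulty)
is_fault = np.linalg.norm(r) > threshold    # detection
suspect_channel = int(np.abs(r).argmax())   # localization via the residual
```

In a deep autoencoder the projection would be replaced by a trained nonlinear encoder and decoder, but the thresholding and localization steps would keep the same form.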

Unsupervised Learning Methods for Wind Turbine Fault Diagnosis
The basic idea of unsupervised learning is that a machine learns from unlabeled data to reveal hidden structure, explain the key features of the data, and divide the data into several categories. A representative technique is clustering, and many algorithms typically used in unsupervised learning are based on it. Unlike supervised learning methods, which analyze class-labeled instances, unsupervised learning (Figueiredo et al., 2002; Zhang T. and Zhou Z. H., 2018) does not need label information and is trained on unlabeled samples. The sample set is clustered according to the similarity between samples so as to minimize the intraclass variance and maximize the interclass variance, thereby establishing the model. Unsupervised learning methods can classify and predict test data by extracting hidden concepts and relationships in the data set, and they are widely used in fault diagnosis, data mining, and image processing, among others. Typical methods include the K-means algorithm, fuzzy C-means (FCM), the hierarchical clustering method, and the Gaussian mixture model (Hastie et al., 2009).

K-Means
K-means clustering (Kanungo et al., 2002; Jain, 2010) is a simple unsupervised learning method that aims to divide n observations into K clusters, in which each observation belongs to the cluster with the nearest mean. First, K initial points in the data are chosen as the cluster centers. Second, the distance from each point to every center is calculated and the point is assigned to the nearest cluster. Third, each cluster center is recalculated as the mean of its assigned points, which minimizes the within-cluster sum of squared distances. Lastly, the assignment and update operations are repeated until the cluster centers no longer change, that is, until all points are allocated to the same clusters as in the previous iteration. For example, given a data set of WTs, the K-means algorithm was used for clustering, and the result with five clusters is shown in Figure 8.
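The four steps can be sketched in a few lines of NumPy. The function name `kmeans`, its parameters, and the random choice of initial centers are illustrative assumptions, not the implementation used in any cited study.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means (Lloyd's algorithm). Returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once no point changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each center to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's center in place
                centers[j] = members.mean(axis=0)
    return centers, labels
```

Because the result depends on the initial centers, a call such as `kmeans(X, k=5)` is usually repeated with several seeds and the run with the smallest within-cluster sum of squares is kept.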
To overcome the sensitivity of K-means to the choice of the initial cluster centers, Yiakopoulos et al. (2011) proposed a K-means clustering method for fault diagnosis of rolling bearings in which the initial centers are selected using features extracted from simulated signals. Fault detection experiments on three types of bearings show that this method can successfully classify faults. Khediri et al. (2012) proposed an unsupervised learning process based on kernel technology, which can separate different non-linear process modes, effectively detect faults, and reduce the false alarm rate. Kusiak and Verma (2012) used three different operating curves (i.e., power, rotor, and blade pitch curves) to monitor the performance of wind farms, and proposed a multivariate outlier detection approach based on Mahalanobis distance and K-means clustering. This method uses the skewness and kurtosis of bivariate data as metrics to evaluate WT performance; it is simple to apply and converges rapidly.
K-means clustering is simple to implement and has a good effect on WT fault diagnosis. However, the number of clusters K and the initial cluster centers are difficult to choose, and the algorithm may even have difficulty converging on non-convex data sets. WTs produce many fault-free samples and few fault samples, and this imbalance results in a poor clustering effect.

Fuzzy C-Means
The FCM algorithm (Bezdek et al., 1984; Pal et al., 2005) is a clustering algorithm in which each data point can belong to more than one cluster. The basic idea is to maximize the similarity between objects divided into the same cluster while minimizing the similarity between different clusters.
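A minimal sketch of the standard FCM iteration is given below, assuming the usual objective (membership-weighted squared Euclidean distances with fuzzifier m > 1). The function name `fuzzy_c_means`, the default m = 2, and the random initial memberships are illustrative assumptions.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=200, tol=1e-6, seed=0):
    """Standard FCM. Returns (centers, U), where U[i, j] is the degree
    to which sample i belongs to cluster j (each row sums to 1)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships, normalized so that each row sums to 1.
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        # Cluster centers: membership-weighted means of the samples.
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distances to every center, floored to avoid division by zero.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)
        # Membership update: u_ij is proportional to d_ij^(-2/(m-1)).
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U
```

Unlike K-means, the output is a soft partition; a hard label can be obtained with `U.argmax(axis=1)`, and samples whose largest membership is low can be treated as ambiguous.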
For WT gearbox fault detection, Luo and Huang (2014) proposed a fault diagnosis method based on global local mean decomposition and FCM clustering. In this method, the known samples are clustered using FCM, and the test samples are then classified and recognized; the method is simple to implement and gives good diagnosis results. However, such WT fault diagnosis methods require supervision and training based on historical samples of known faults, and collecting these samples is time-consuming and expensive. Given the incomplete characteristics of known samples in WT, a method based on kernel FCM (KFCM) clustering was presented for the fault diagnosis of the WT gearbox. The KFCM clustering algorithm is used to cluster the samples of known faults, and the cluster center of each known fault is obtained. Similarity parameters are then calculated to decide whether new data samples belong to the known faults. This method can accurately and effectively diagnose both known and unknown faults of WT.
Some issues should be considered when the FCM algorithm is used in WT fault diagnosis. For example, a large fault-free sample size combined with an extremely small fault sample size may prevent the algorithm from finding the optimal solution of the fault diagnosis model.

Hierarchical Clustering
Hierarchical clustering (Johnson, 1967; Corpet, 1988) is a cluster analysis method in unsupervised learning that builds a model by establishing a hierarchical structure of clusters. The result can be represented as a tree structure (i.e., a "tree diagram", or dendrogram), which includes a root and leaves. In the clustering tree, the original data points are the leaves at the lowest level, and the top level is the root node, a single cluster containing all points. As shown in Figure 9, the hierarchical clustering method (Navarro et al., 1997) is either a process that starts from the leaves and successively merges clusters, called agglomerative hierarchical clustering, or a process that begins from the root and recursively splits clusters, called divisive hierarchical clustering. The method commonly uses the Euclidean distance to measure the distance between data points. Li Y. et al. (2018) proposed a fault diagnosis method based on adaptive multi-scale morphological filters and improved hierarchical permutation entropy to identify various health conditions of gearboxes, and used a hierarchical aggregation method to reduce noise in the fault features extracted from the signal. Liu and Ge (2018) presented a weighted random forest (RF) scheme based on hierarchical clustering selection for fault classification in complex industrial processes. Applying the hierarchical clustering method to offline model selection in RF can reduce the complexity of online fault classification.
In the WT fault diagnosis process, hierarchical clustering requires calculating the proximity matrix, which is time-consuming and makes the method unsuitable for big WT data sets. The hierarchical clustering method is appropriate for clustering small data sets, and real-time issues should be considered when dealing with WT big data.
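The agglomerative (bottom-up) variant can be sketched as follows. The single-linkage merge rule and the function name `agglomerative` are assumptions chosen for brevity; other linkage rules differ only in how the distance between two clusters is computed.

```python
import numpy as np

def agglomerative(X, n_clusters):
    """Naive bottom-up clustering with single linkage and Euclidean
    distance. Returns an integer cluster label for every sample."""
    # Start from the leaves: every point is its own cluster.
    clusters = [[i] for i in range(len(X))]
    # Proximity matrix of all pairwise distances: O(n^2) memory.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        # Merge the closest pair of clusters (b > a, so index a is stable).
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```

The explicit n-by-n proximity matrix and the repeated search over cluster pairs make the cost visible: both grow at least quadratically with the number of samples, which is the scaling limitation noted above.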

Gaussian Mixture Model
The Gaussian mixture model (GMM) (Reynolds, 2009) assumes that all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. A GMM can be regarded as fitting a linear combination of multiple Gaussian distribution functions to the data distribution. Heyns et al. (2012) proposed a Gaussian mixture model to detect WT gearbox failures by calculating the negative log-likelihood of gearbox bearing vibration signal segments under a model that represents the healthy gearbox. This method is suitable for nonlinear and nonstationary WT gearbox vibration signals. Given the highly complex and unstable operating conditions of WT, Dong et al. (2013) proposed a multi-parameter WT health assessment framework that considers dynamic operating conditions. After characteristic parameter selection and GMM-based multiregime modeling, the operation status of the WT can be evaluated and WT faults can be detected effectively. In response to frequent WT faults, Avendaño-Valencia et al. (2017) proposed a fault diagnosis method for WT based on a GMM random coefficient model. The vibration response signals of the WT, which change over time with the environmental and operating conditions, are extracted, and the model coefficients are determined through the GMM random coefficient framework. The method offers significant performance improvements, and most fault levels and types are reported to be correctly diagnosed.
GMM is effective in handling large volumes of WT data samples, but it has a large computational cost and slow convergence. The number of sub-models is difficult to select in advance, and the model is sensitive to abnormal points. When only small WT data sets are available, the results cannot meet the requirements.
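Negative log-likelihood scoring under a healthy-state GMM, as mentioned above, can be sketched for the one-dimensional case. The function name `gmm_nll` and the example mixture parameters are hypothetical; in practice the weights, means, and variances would be fitted to healthy data, for example with the EM algorithm.

```python
import numpy as np

def gmm_nll(x, weights, means, variances):
    """Negative log-likelihood of each 1-D sample under a Gaussian mixture
    p(x) = sum_k weights[k] * N(x | means[k], variances[k])."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]   # shape (n, 1)
    w = np.asarray(weights, dtype=float)[None, :]            # shape (1, K)
    mu = np.asarray(means, dtype=float)[None, :]
    var = np.asarray(variances, dtype=float)[None, :]
    # Log of each weighted component density.
    log_comp = np.log(w) - 0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var
    # Log-sum-exp over components for numerical stability.
    top = log_comp.max(axis=1, keepdims=True)
    log_p = top[:, 0] + np.log(np.exp(log_comp - top).sum(axis=1))
    return -log_p

# Hypothetical healthy-state model with two operating regimes.
scores = gmm_nll([2.0, 8.0], weights=[0.5, 0.5], means=[-2.0, 2.0], variances=[1.0, 1.0])
```

A sample far from every healthy regime (here 8.0) receives a higher score than one near a regime center (here 2.0); an alarm threshold on the score would have to be calibrated on healthy data.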

Semi-supervised Learning Methods for Wind Turbine Fault Diagnosis
Semi-supervised learning (Chapelle et al., 2009; Zhou et al., 2014) is a learning paradigm that detects features common to labeled and unlabeled data samples to help determine the model characteristics and to propagate labels from labeled data to unlabeled data. It is an ML method between supervised and unsupervised learning. Combining unlabeled and labeled samples in the training process can improve training accuracy. There are four mainstream paradigms for semi-supervised learning (Zhu, 2005): the semi-supervised SVM (S3VM), generative model-based, disagreement-based, and graph-based methods. Disagreement-based semi-supervised learning (Blum and Mitchell, 1998) started with the work on co-training (Zhou and Li, 2005) by Blum and Mitchell; it is less affected by the non-convexity of the loss function and the data size, and is mainly used in the field of human-computer interaction. The graph-based method (Camps-Valls et al., 2007) developed from the graph min-cut method (Blum and Chawla, 2001), but it is rarely seen in WT fault diagnosis. By contrast, methods based on S3VM and generative models have been applied in WT fault diagnosis.

S3VM
S3VM (Bennett and Demiriz, 1998) is the extension of SVM to semi-supervised learning. The main idea of S3VM is to assign labels to the unlabeled samples such that the margin of the separating hyperplane is maximized. The most frequently used S3VM is the transductive SVM (TSVM), whose basic procedure can be presented in five steps. The first step trains an SVM classifier on the labeled samples. The second step uses this SVM to predict labels for the unlabeled data. The third step finds pairs of unlabeled samples that were assigned opposite labels and are likely misclassified, swaps their labels, and retrains the SVM on the labeled samples together with the pseudo-labeled samples. The fourth step repeats the second and third steps until the best S3VM classifier is obtained. The fifth step uses the S3VM classifier to label the unlabeled samples and predict the classification results.
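The margin-maximization idea behind these steps is commonly written as the following optimization problem for a linear TSVM. The notation is an assumption introduced here: l labeled samples (x_i, y_i), u unlabeled samples x_j^* with unknown labels y_j^*, slack variables \xi, and trade-off constants C and C^*.

```latex
\min_{w,\, b,\, \xi,\, \xi^{*},\, y^{*}} \quad
  \frac{1}{2}\lVert w \rVert^{2}
  + C \sum_{i=1}^{l} \xi_i
  + C^{*} \sum_{j=1}^{u} \xi_j^{*}

\text{s.t.} \quad
  y_i \,(w^{\top} x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, l,

  \qquad\;
  y_j^{*} \,(w^{\top} x_j^{*} + b) \ge 1 - \xi_j^{*}, \quad \xi_j^{*} \ge 0, \quad
  y_j^{*} \in \{-1, +1\}, \quad j = 1, \dots, u.
```

Because the labels y_j^* are discrete optimization variables, the problem is non-convex, which is why the procedure above alternates between training an SVM and swapping pseudo-labels rather than solving the problem in one step.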
The S3VM methods are widely used in the field of WT fault diagnosis. Liu C. et al. (2020) proposed a fault diagnosis method for rolling bearings based on S3VM that uses only a few labeled samples to build a model with a good classification effect. To reduce false alarm rates and improve the discriminative ability of incipient fault features, Mao et al. (2020) proposed an online method for early fault detection of bearings using a semi-supervised architecture. A safe semi-supervised SVM (S4VM) is introduced to identify the sequentially arriving data of the target bearing as normal or fault states, and a stacked denoising autoencoder is used to extract deep features from the normal-state and fault-state data of the bearing. The occurrence of an incipient fault is then identified adaptively according to the generalization error upper bound of S4VM. The optimal margin distribution machine (ODM), which also classifies by means of a separating hyperplane, has been extended in recent years to a semi-supervised variant (ssODM), which has been applied to wind turbine fault detection with good performance (Zhang T. and Zhou Z.-H., 2018).
S3VM predicts labels for the unlabeled samples and adds the predictions to the labeled data set, which improves the fault diagnosis rate. However, S3VM needs a few labeled WT data samples as a guide. Such informative labeled samples cannot always be guaranteed, and how many labeled WT data samples are needed to build an effective S3VM model remains uncertain.

Generative Models
The main idea of the semi-supervised generative model is to treat the probability that each unlabeled sample belongs to each category as a set of missing parameters. The expectation maximization (EM) algorithm is then used to perform maximum likelihood estimation of the parameters of the generative model. Generative model methods (Zhu, 2005; Kingma et al., 2014) include the Gaussian mixture distribution, the mixture multinomial distribution, and the hidden Markov model. Xin et al. (2018) proposed a semi-automatic fault detection method based on a probabilistic model in the form of a Gaussian mixture, which has good robustness. Wang et al. (2015) proposed a comprehensive method based on semi-supervised learning that uses a small amount of labeled data and a large amount of unlabeled process data to construct a neighborhood weighted graph. By solving an optimization problem, the optimal regression function and the optimal predicted label matrix of the unlabeled data are acquired. This method can obtain promising results for fault detection and fault diagnosis in process monitoring. To achieve automatic detection, Geramifard et al. (2013) introduced a semi-parametric method based on the hidden Markov model for fault detection and diagnosis of synchronous motors. After the hidden Markov model classifiers are trained (parametric stage), probabilistic inference based on each hidden Markov model (non-parametric stage) is used to compute two matrices, which addresses the efficiency problem in the fault classification process. Li X. et al. (2018) presented a fault detection method based on a multivariate Bayesian control scheme and a hidden semi-Markov model to predict early bearing failures of gearboxes. The method uses a continuous-time hidden semi-Markov decision process to characterize the failure process of the gearbox bearing system, and can predict early failure of the gearbox bearing and estimate the remaining useful life at each sampling epoch.
The semi-supervised generative model method has good robustness, but the fault diagnosis model has low accuracy, long model training time, and many iterations. These issues must be considered in WT fault diagnosis.
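The EM-based estimation described at the start of this subsection can be made concrete for a mixture model with K classes. The notation is an assumption introduced here: L is the labeled set, U is the unlabeled set, \pi_k are the class priors, p(x | \theta_k) is the class-conditional density, and n_k is the number of labeled samples in class k.

```latex
% Log-likelihood over labeled and unlabeled samples
\ell(\theta) =
  \sum_{(x_i, y_i) \in L} \ln\bigl[\pi_{y_i}\, p(x_i \mid \theta_{y_i})\bigr]
  + \sum_{x_j \in U} \ln \sum_{k=1}^{K} \pi_k\, p(x_j \mid \theta_k)

% E-step: posterior class membership of each unlabeled sample
\gamma_{jk} =
  \frac{\pi_k\, p(x_j \mid \theta_k)}{\sum_{l=1}^{K} \pi_l\, p(x_j \mid \theta_l)}

% M-step (class priors): labeled counts plus soft unlabeled counts
\pi_k = \frac{n_k + \sum_{x_j \in U} \gamma_{jk}}{\lvert L \rvert + \lvert U \rvert}
```

The remaining M-step updates depend on the chosen density; for Gaussian components, the means and covariances are weighted averages in which labeled samples carry weight 1 for their own class and unlabeled samples carry weight \gamma_{jk}.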

CONCLUSION AND PERSPECTIVES
Given the rapid development of early wind power generation, wind power equipment has entered a period of high failure rates, which places high requirements on WT fault diagnosis methods for stable operation and maintenance. Accordingly, the development of fault prediction, fault diagnosis, fault detection, and condition monitoring of WT has advanced. Various studies have proposed methods and strategies for the fault diagnosis and detection of various WT components (Faiz and Moosavi, 2016). Drawing on the most recent research, the current study reviews WT fault diagnosis methods and techniques based on ML. Given the many uncertainties in WT operation, the following issues should continue to be considered in ML-based fault diagnosis of WT.
Improvement of ML algorithm performance. Each type of ML algorithm has advantages and disadvantages. Future research fields therefore include improving algorithm performance, optimizing algorithm parameters, combining algorithms, and studying new algorithms. Exploiting the advantages of an algorithm while avoiding its disadvantages has become an urgent issue. Moreover, a single algorithm cannot detect all WT faults. Hence, combined algorithms will become a hot research topic, and the limitations of existing algorithms indicate that future research will also involve proposing new algorithms and improving current ones.
Comprehensive simulation of WT fault conditions. The wind power generation system is a typical complex system with uncertain fault severity and probability. Research on the WT fault mechanism usually models only a single fault, whereas a single-component fault in a WT is often accompanied by multiple faults, which can cause serious damage to the WT. All WT units are interconnected and their variables are highly coupled, so a fault in one component affects the remaining units. Therefore, additional compound fault models should be established to conduct a comprehensive analysis of the WT system.
Research on the feature selection method. WTs have many characteristic parameters because their operation state is time-varying. Redundant and useless feature parameters will inevitably exist in WT feature extraction. Given the need to extract additional fault features, research on optimized feature extraction algorithms will become popular in the future, enabling better description and detection of WT status.
Multi-parameter information fusion. A single sensor or a single parameter cannot capture complete WT operating status information, which makes it difficult to accurately reflect the fault or normal state of each WT component. Therefore, multi-parameter information fusion methods can be adopted to obtain additional parameter information from multiple sensors and improve the efficiency of fault diagnosis.
Establishment of remote WT fault diagnosis system. The WT fault diagnosis system should be able to predict faults and provide periodic maintenance plans in order to minimize WT downtime and support long-distance condition monitoring. The system should also provide long-term historical data so that correct alarms can be set for preventive maintenance. In large wind farms, fault diagnosis and early warning must cover multiple wind power generation systems. The need to develop a low-cost and highly efficient remote WT fault diagnosis system should also be considered in the future.