Intelligent Fault Diagnosis Method of Wind Turbines Planetary Gearboxes Based on a Multi-Scale Dense Fusion Network

Owing to its powerful feature extraction capability, the convolutional neural network (CNN) is increasingly applied to the fault diagnosis of key components of rotating machinery. In traditional CNN-based fault diagnosis methods, however, successive convolution and pooling operations steadily reduce the feature resolution, which may cause subtle fault information in the samples to be lost. This paper proposes a CNN-based model with an improved structure, the multi-scale dense fusion network (MSDFN), to diagnose faults of wind turbine planetary gearboxes under complicated working conditions. First, the continuous wavelet transform is applied to preprocess the vibration signals, and the two-dimensional wavelet time-frequency diagrams are used as the network input. Then, a multi-scale feature fusion (MSFF) module and a feature of maximum (FoM) module are used in the extraction and classification stages of fault features, respectively. Next, the multi-scale features of each network layer are fused to enhance the fault features. Finally, high fault diagnosis accuracy is achieved by extracting the separable fusion result of fault features. The proposed method achieves an average fault diagnosis accuracy of more than 99% on a planetary gearbox dataset. Comparative experiments verify the effectiveness of the proposed method and its superiority to some mainstream approaches. An ablation study further confirms that the MSFF and FoM modules both contribute positively to fault diagnosis.


INTRODUCTION
The planetary gearbox is a key component in the transmission system of wind turbines (WT) (Feng and Liang, 2014; Wang et al., 2019). Owing to the harsh working environment and complex structure, the key gear components in wind turbine planetary gearboxes are prone to damage, which adversely affects the entire transmission system. Since wind turbines are often installed in places with inconvenient transportation (Lu et al., 2020; Sun et al., 2021), any gear fault in a planetary gearbox may cause long downtime of the corresponding wind turbine and high operation, maintenance, and repair costs (Cao et al., 2019; Sun et al., 2019). Over the service life of a wind turbine, maintenance and operation costs account for about 75% of the total investment (Lin et al., 2018; Zhu et al., 2021). Monitoring the health status of each planetary gearbox therefore plays an important role in the normal operation of a wind turbine. The gear fault types of wind turbine planetary gearboxes mainly include chipped teeth, missing teeth, cracked teeth, and surface wear (Wang et al., 2018; Liang et al., 2020). At present, most fault diagnosis research on the gears of planetary gearboxes is based on vibration signals (Lei et al., 2020).
Owing to non-stationary working conditions and a complex structure, it is extremely difficult to establish a general mathematical model for the vibration signals of planetary gearboxes (Feng and Zuo, 2012). Since these vibration signals have three main characteristics, namely composite signals, the pass-through effect, and nonlinearity, it is difficult to extract fault features directly by observing vibration responses. In addition, traditional signal processing methods struggle to process the massive condition monitoring data of modern industry in a timely manner (Pan et al., 2019). Therefore, deep learning-based methods with powerful feature extraction capabilities are increasingly applied to the fault diagnosis of key components of rotating machinery. They usually comprise three main steps: data preprocessing, fault feature extraction, and fault classification (Liu et al., 2018; Ma et al., 2019; Liang et al., 2020).
Deep learning-based fault diagnosis methods often preprocess the original vibration signals first. The wavelet transform, which has excellent time-frequency analysis ability for non-stationary vibration signals, is often applied to the fault diagnosis of rotating machinery. An optimized Morlet wavelet transform has been used to process the vibration signals to obtain better time- and frequency-domain statistical feature sets (Wang et al., 2018). Fault diagnosis can also be achieved well using the wavelet packet coefficient matrix of vibration signals or time-frequency images (Liang et al., 2020; Zhao et al., 2020c; Cheng et al., 2021).
Since deep learning can automatically extract abstract features from the original data (Wen et al., 2017), various deep learning-based models are often applied to extract and classify fault features. These methods avoid the use of complex signal processing methods to calculate feature parameters that express fault information. Taking vibration signals, statistical feature sets, frequency spectra, or time-frequency spectra of vibration signals as input, DBN (Wang et al., 2018; Kang et al., 2020), LSTM (Cao et al., 2019), SAE (Jiang et al., 2017), RNN (Miao et al., 2020), and CNN (Jiang et al., 2018; Wang et al., 2020a) can obtain relatively high fault diagnosis accuracy.
Considering the powerful feature extraction capabilities of deep learning, this paper proposes a CNN-based fault diagnosis model for wind turbine planetary gearboxes. The proposed model uses a CNN to extract fault features from the time-frequency images of vibration signals and to classify faults. However, in actual wind turbine applications, owing to noise interference and the difficulty of determining the impact of faults, the fault information of the vibration signals mapped onto time-frequency images may be extremely subtle, especially for early-stage faults (Wei et al., 2019). When a CNN is used to extract fault features, the continual decrease of feature resolution may cause some important fault information to be lost, with an adverse impact on diagnosis (Wang et al., 2020b). In traditional CNN methods, each network layer learns only from the features learned in the previous layer. If part of the information in the learned features is lost, the features learned later are adversely affected. To solve this issue, this paper proposes an intelligent fault diagnosis method based on a multi-scale dense fusion network (MSDFN).
MSDFN consists of two parts, a feature extraction network and a classifier. A dense feature fusion structure (Huang et al., 2017) based on projection and back-projection operators (Irani and Peleg, 1991; Dai et al., 2007) is used to optimize the traditional CNN, and the outputs of all layers of the feature extraction network are fused as the input of the classifier. Through the dense fusion of multi-scale features to supplement the fault information, the fault diagnosis of wind turbine planetary gearboxes under complicated working conditions is realized. The main contributions of this paper can be summarized as follows.

1) A CNN variant, MSDFN, is proposed for the fault diagnosis of wind turbine planetary gearboxes under complex working conditions. It extracts enhanced fusion results of fault features from the time-frequency images of vibration signals to improve diagnosis accuracy.
2) The MSFF module, designed with projection and back-projection operators, is embedded in each network layer. A fault feature enhancement algorithm based on multi-scale feature fusion is used to supplement the missing information of every layer in time. The fused features express fault information more effectively and have stronger separability.
3) A FoM module is designed to fuse the output fault features of each feature extraction layer. Specifically, this module uses adaptive maximum pooling to convert the features of each layer to the same resolution and concatenates them. As a result, the input features of the classifier carry more complete fault information, and the corresponding diagnosis accuracy is improved.
The rest of this paper is organized as follows. Section 2 reviews the CNN-based fault diagnosis research of key components of rotating machinery in recent years; Section 3 describes the proposed fault diagnosis method based on MSDFN; Section 4 compares the proposed method with several existing fault diagnosis networks to verify its effectiveness and also conducts the corresponding ablation study to test the performance of each module; and Section 5 concludes this paper.

RELATED WORK
In existing deep learning-based fault diagnosis research on rotating machinery, CNN is one of the most commonly used models. Compared with SAE, DBN, and other models, CNN and its variants, such as the deep residual network (DRN) and the deep convolutional neural network (DCN), are easier to train when vibration signals or their time-frequency features are used as input (Wen et al., 2017; Zhao et al., 2017), because their local receptive fields and weight-sharing strategy usually require fewer parameters. In recent years, many CNN-based fault diagnosis solutions for rotating machinery have been published.
A CNN was applied to the fault diagnosis of rotating machinery, with fault features extracted from the spectrum of vibration signals, and achieved better performance than classical classifiers such as random forest and SVM (Janssens et al., 2016). A two-dimensional DCN was used to extract fault features from the wavelet packet energy images of vibration signals and achieved high fault diagnosis accuracy (Ding and He, 2017). With time-frequency information composed of the original vibration signals and their frequency spectra as input, a one-dimensional CNN was applied to the fault diagnosis of planetary gearboxes (Jing et al., 2017). Fault features were extracted from multi-sensor data by a CNN with a multi-input branch structure to realize the fault diagnosis of rotating machinery (Xia et al., 2017). A one-dimensional CNN-based method realized the end-to-end fault diagnosis of rotating machinery (Wu et al., 2019). The discrete wavelet transform was used to obtain the time-frequency matrix of the vibration signals, and a CNN was applied to extract the fault features of planetary gearboxes. Vibration signals were first analyzed by recursive graphs, and then a CNN was used to diagnose faults of rotating machinery from the obtained recursive matrix (Wang et al., 2020a). Since the second-order cyclostationary behavior of vibration signals can reveal valuable health information, cyclic spectral coherence (CSCoh) analysis of vibration signals was used to preprocess the original data, which reduced the difficulty of feature learning in the deep diagnosis model and improved diagnosis accuracy.
The above work focuses on the preprocessing of raw data and studies the effects of various data processing methods on fault diagnosis. However, such research relies on rich knowledge of vibration signal processing and also increases the workload of data processing. The following studies show that improving the network structure helps extract detailed fault features and achieve high diagnostic accuracy. Zhao et al. showed that the deep residual network (DRN) can efficiently extract the high-level fault features contained in the wavelet packet coefficients (Zhao et al., 2020b). On the basis of DRN, a dynamic weight module was introduced to weight the fault features of different frequency bands in the time-frequency images, which improved the diagnosis accuracy. A multi-branch CNN structure was used to extract features at different scales and improve the diagnostic ability of the model (Pan et al., 2019; Peng et al., 2020). MSCNN used multiple CNN branches to process vibration signals at multiple scales, which improved the fault diagnosis accuracy of wind turbine planetary gearboxes (Jiang et al., 2018). An SE-Res module (Hu et al., 2018) was added to an ordinary CNN to reduce the interference of redundant information and enhance fault features (Cao et al., 2020). A CNN model using dilated (hollow) convolutions was applied to enlarge the receptive fields and improve the gear fault diagnosis accuracy of planetary gearboxes. To identify the health status of rotating machinery accurately and automatically, a normalized convolutional neural network was proposed for fault type diagnosis under variable operating conditions (Zhao et al., 2020a). A multi-core cascade CNN structure was used in place of a single core for fault diagnosis (Wang et al., 2020b).
Xu et al. developed a method, VMD-DCNNs, that integrates convolutional neural networks with the variational mode decomposition (VMD) algorithm (Xu et al., 2020). This method uses CNNs to extract features from each intrinsic mode function (IMF) and processes the original vibration signals directly in an end-to-end manner, without manual experience or intervention, to diagnose faults of the key components of wind turbines.
All the above methods use the CNN model as the fault feature extractor and classifier, and achieve good performance in fault diagnosis of the key components of rotating machinery. However, these methods ignore that when CNN performs feature extraction, the reduction of feature resolution may cause the loss of partial fault information and even the decrease of fault diagnosis accuracy (Wang et al., 2020b). The dense feature fusion structure makes the input passed to the next network layer come from all the extracted features, which can effectively supplement the fault information. Therefore, according to the inspiration and guidance of the above-mentioned work, this paper focuses on improving the network by using the feature fusion structure, and develops a multi-scale dense fusion network (MSDFN). The proposed model can extract the enhanced fusion results of fault features from the time-frequency images of vibration signals of wind turbine planetary gearboxes under complicated working conditions, so the relatively complete fault information can be obtained, and the fault diagnosis accuracy can be improved. Section 3 specifies the overall structure and working principle of MSDFN.

MSDFN-BASED FAULT DIAGNOSIS METHOD
Fault diagnosis of wind turbine planetary gearboxes is essential to the normal operation of wind turbines, but traditional CNN-based fault diagnosis methods may lose fault information. This paper therefore proposes an MSDFN-based intelligent fault diagnosis method for wind turbine planetary gearboxes. Figure 1 shows the diagnosis process. The proposed method uses the continuous wavelet transform to preprocess the original vibration signal data. A multi-scale feature fusion (MSFF) module is embedded into each feature extraction network layer. An FoM module fuses the outputs of all feature extraction network layers to obtain the classifier input. In this way, the information loss during fault feature extraction is reduced, and the fault diagnosis accuracy is improved. The feature extraction network contains five layers and can therefore fuse features of at most five scales. A structure that is too shallow results in poor accuracy, whereas a structure that is too deep significantly increases the amount of computation without improving the accuracy. This is verified in Section 4.
Frontiers in Energy Research | www.frontiersin.org | November 2021 | Volume 9 | Article 747622

The Generation of Time-Frequency Images
The wavelet transform is widely used in vibration signal processing of rotating machinery owing to its excellent time-frequency analysis ability for non-stationary signals. The commonly used expression of the wavelet transform is shown in Eq. 1.
$$W_f(a, \tau) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} f(t)\, \psi^{*}\!\left(\frac{t - \tau}{a}\right) \mathrm{d}t \tag{1}$$
where $f(t)$ is the input signal, $\psi(t)$ is the wavelet basis function ($\psi^{*}$ denotes its complex conjugate), $a$ is the scale factor that controls the expansion and contraction of the wavelet basis function, and $\tau$ is the translation amount that controls the translation of the wavelet basis function. The scale corresponds to frequency, and the amount of translation corresponds to time. As the wavelet function is translated along the time axis at each scale, it is multiplied by the input signal, so the frequency components of the signal in each time period can be obtained. The selection of the wavelet basis function is an important step in using the wavelet transform. Similarity coefficients (Mao et al., 2019; Liang et al., 2020) are used to select the wavelet basis function quantitatively. The expression of the similarity coefficient is shown in Eq. 2.
where $\delta$ is the dimensionless similarity coefficient; $k$ is the number of peaks; and $s_i$, $m_i$, and $\alpha_i$ are the area, maximum, and weighting coefficient of the $i$-th peak, respectively, after taking the absolute value of the wavelet basis function. Table 1 lists the similarity coefficients of several wavelets commonly used for fault diagnosis. The larger the similarity coefficient of a wavelet, the closer the wavelet is to the original signals, which means that the wavelet captures more fault information and facilitates diagnosis. Therefore, the complex Morlet (cmor) wavelet defined in Eq. 3 is used. The complex Morlet wavelet is a complex sinusoid modulated by a Gaussian envelope:
$$\psi(t) = \frac{1}{\sqrt{\pi F_b}}\, e^{2 i \pi F_c t}\, e^{-t^{2} / F_b} \tag{3}$$
where $F_b$ is the bandwidth of the wavelet, $F_c$ is the central frequency of the wavelet function, and $i$ is the imaginary unit. This paper takes $F_c = F_b = 3$.
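The generation of a time-frequency image can be illustrated with a minimal NumPy sketch. It is not the authors' Matlab implementation: the function names `cmor` and `cwt_cmor`, the toy 30 Hz tone, the sampling rate, and the frequency grid are all assumptions made for illustration, and the integral of the wavelet transform is approximated by a plain Riemann sum. Each analysis frequency f is mapped to the scale a = F_c / f.

```python
import numpy as np


def cmor(u, fb=3.0, fc=3.0):
    """Complex Morlet wavelet: a complex sinusoid under a Gaussian envelope."""
    return np.exp(2j * np.pi * fc * u) * np.exp(-u ** 2 / fb) / np.sqrt(np.pi * fb)


def cwt_cmor(signal, fs, freqs, fb=3.0, fc=3.0):
    """Return |W(a, tau)| with one row per analysis frequency (scale a = fc / f)."""
    t = np.arange(len(signal)) / fs              # sample times in seconds
    rows = []
    for f in freqs:
        a = fc / f                               # scale whose pseudo-frequency is f
        u = (t[None, :] - t[:, None]) / a        # (t - tau) / a, rows index tau
        kernel = np.conj(cmor(u, fb, fc))
        # Riemann-sum approximation of the transform integral (dt = 1 / fs)
        rows.append((signal[None, :] * kernel).sum(axis=1) / (fs * np.sqrt(a)))
    return np.abs(np.array(rows))


# Toy usage (hypothetical signal): a 30 Hz tone should dominate the 30 Hz row.
fs = 256
t = np.arange(2 * fs) / fs
tone = np.sin(2 * np.pi * 30 * t)
freqs = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
tf_image = cwt_cmor(tone, fs, freqs)
dominant = freqs[tf_image.mean(axis=1).argmax()]
```

The direct O(N^2) evaluation per scale is chosen for readability only; a practical pipeline would more likely rely on an FFT-based routine from an established wavelet library.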

Time-Frequency Image Preprocessing
Time-frequency images need to be preprocessed before being input into the network to improve training performance. For image datasets, commonly used data enhancement methods include rotation, flipping, and random cropping. The purpose of data enhancement is to provide diverse training samples, so that the network learns to extract key features from samples undergoing various changes and thereby generalizes better. However, since the vibration signal sequence is time-dependent, it can be regarded as a discrete function of time. Flipping, rotation, and random cropping disrupt the relationship between time and frequency features, resulting in poor training performance. Therefore, the proposed image processing method first resizes the input images and then normalizes them to improve the training speed. Bilinear interpolation is used to convert the resolution of the time-frequency images. Bilinear interpolation is the extension of linear interpolation to two variables: it performs linear interpolation in two directions to obtain the value of the unknown function f (x, y) at the point P (x, y). The transformation expression of bilinear interpolation is shown in Eq. 4.
$$f(x, y) \approx \frac{f(Q_{11})(x_2 - x)(y_2 - y) + f(Q_{21})(x - x_1)(y_2 - y) + f(Q_{12})(x_2 - x)(y - y_1) + f(Q_{22})(x - x_1)(y - y_1)}{(x_2 - x_1)(y_2 - y_1)} \tag{4}$$
where $Q_{11}(x_1, y_1)$, $Q_{21}(x_2, y_1)$, $Q_{12}(x_1, y_2)$, and $Q_{22}(x_2, y_2)$ are the four points closest to the target point $P(x, y)$, $f(Q_{ij})$ is the value at the point $Q_{ij}$, and $x_i$ and $y_j$ are the abscissa and ordinate of the point $Q_{ij}$, respectively. Image normalization scales the value of each image pixel to a small specific interval, removes the unit of the data, and converts it into a dimensionless value, which facilitates the comparison and weighting of indicators with different units or magnitudes. The normalization formula used in this paper is shown in Eq. 5.
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{5}$$
where the pixel value of a point is converted from $x$ to $x'$, and $x_{\max}$ and $x_{\min}$ are the maximum and minimum pixel values of the sample image, respectively. Normalization converts the values of all points to the interval 0-1, which improves the convergence speed and accuracy of the model.
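The two preprocessing steps can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and the corner-aligned sampling grid (the first and last target pixels coincide with the first and last source pixels) is an assumption, since the text does not state which grid convention is used. On a unit-spaced pixel grid the denominator of the bilinear formula equals 1, so only the four weighted neighbours remain.

```python
import numpy as np


def bilinear_resize(img, out_h, out_w):
    """Resize a 2-D array by bilinear interpolation on a corner-aligned grid."""
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)         # target rows in source coordinates
    xs = np.linspace(0, in_w - 1, out_w)         # target columns in source coordinates
    y1 = np.clip(np.floor(ys).astype(int), 0, in_h - 2)   # row index of Q11 and Q21
    x1 = np.clip(np.floor(xs).astype(int), 0, in_w - 2)   # column index of Q11 and Q12
    wy = (ys - y1)[:, None]                      # (y - y1) / (y2 - y1) with y2 - y1 = 1
    wx = (xs - x1)[None, :]                      # (x - x1) / (x2 - x1) with x2 - x1 = 1
    q11 = img[y1][:, x1]
    q21 = img[y1][:, x1 + 1]
    q12 = img[y1 + 1][:, x1]
    q22 = img[y1 + 1][:, x1 + 1]
    return (q11 * (1 - wx) * (1 - wy) + q21 * wx * (1 - wy)
            + q12 * (1 - wx) * wy + q22 * wx * wy)


def min_max_normalize(img):
    """Scale pixel values to [0, 1]; assumes the image is not constant."""
    return (img - img.min()) / (img.max() - img.min())


# Toy usage: enlarge a 4 x 4 ramp to the 256 x 256 network input size.
small = np.arange(16.0).reshape(4, 4)
network_input = min_max_normalize(bilinear_resize(small, 256, 256))
```

Because the interpolation reproduces any function that is linear in x and in y, the ramp image used here is resized without error, which makes it a convenient sanity check.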

MSDFN
As shown in Figure 1, the MSDFN consists of two parts: a feature extraction network and a classifier. Table 2 shows the specific composition of MSDFN. In Table 2, "Conv" refers to a convolutional layer; "RG" refers to a residual group; "AMP" refers to adaptive maximum pooling; "AGAP" refers to adaptive global average pooling; "Fc" refers to a fully connected layer; and "numclass" refers to the number of fault categories. In each feature extraction network layer, the convolutional layer transforms the number of channels and reduces the feature resolution, the residual group (RG) extracts fault features, and the MSFF module performs multi-scale feature fusion. The FoM module merges the output features of all layers of the feature extraction network again and feeds the result into the classifier, which is built from a fully connected layer, to obtain a prediction vector. The residual group shown in Figure 2A is composed of three residual blocks of the type shown in Figure 2B. The ReLU function is used as the activation function. Although multi-scale feature fusion creates redundancy, it supplies more complete fault information, so that subsequent network layers can extract more comprehensive fault information, which enhances the extracted fault features and improves diagnosis accuracy.

MSFF Module
Following the feature fusion method proposed in (Dong et al., 2020), the MSFF module applies projection and back-projection operators to the fault diagnosis of planetary gearboxes. As shown in Figure 1, the MSFF module fuses the features of all layers to supplement important time-frequency and fault information. Figure 3 shows the structure of the MSFF module. The MSFF module of the n-th network layer is defined in Eq. 6.
$$\tilde{j}^{n} = D^{n}\left(j^{n}, \{\tilde{j}^{1}, \tilde{j}^{2}, \ldots, \tilde{j}^{n-1}\}\right) \tag{6}$$
where $j^{n}$ is the latent feature of the n-th feature extraction network layer, $\tilde{j}^{n}$ is the enhanced feature obtained through dense fusion, and $\{\tilde{j}^{1}, \tilde{j}^{2}, \ldots, \tilde{j}^{n-1}\}$ are the enhanced fusion features from the n − 1 MSFF modules that precede this layer in the network. This paper uses the enhanced features $\tilde{j}^{n-t}$, $t \in \{1, 2, \ldots, n-1\}$, one per iteration, to gradually improve the feature $j^{n}$. The update process is as follows.
1) As shown in Eq. 7, the difference $e^{n}_{t}$ between the enhanced feature $j^{n}_{t}$ of the t-th iteration and the t-th enhanced feature $\tilde{j}^{t}$ is calculated.
$$e^{n}_{t} = q^{n}_{t}\left(j^{n}_{t}\right) - \tilde{j}^{t} \tag{7}$$
where $q^{n}_{t}$ is the back-projection operator, which upsamples the enhanced feature $j^{n}_{t}$ to the same dimension as $\tilde{j}^{t}$.
2) The enhanced feature $j^{n}_{t}$ is updated by projecting the difference calculated by Eq. 7 back onto it, as shown in Eq. 8.
$$j^{n}_{t+1} = p^{n}_{t}\left(e^{n}_{t}\right) + j^{n}_{t} \tag{8}$$
where $p^{n}_{t}$ is the projection operator, which downsamples the difference $e^{n}_{t}$ of the t-th iteration to the same dimension as $j^{n}_{t}$.
3) After iterating over all previous enhanced features, the enhanced feature $\tilde{j}^{n}$ is finally obtained.
TABLE 2 | Detailed structure of MSDFN. "Conv", convolutional layer; "RG", residual group; "AMP", adaptive maximum pooling; "AGAP", adaptive global average pooling; "Fc", fully connected layer; "numclass", the number of fault categories.
Unlike in traditional back-projection techniques, the sampling operators $q^{n}_{t}$ and $p^{n}_{t}$ in the network are unknown. The proposed method learns the downsampling operator with strided convolutional layers and the upsampling operator with deconvolutional layers in an end-to-end manner. To avoid introducing too many parameters, this paper uses a stride of 2 and stacks (n − t) deconvolutional and convolutional layers to learn the upsampling in $q^{n}_{t}$ and the downsampling in $p^{n}_{t}$, respectively. In summary, the multi-scale feature fusion (MSFF) algorithm used to enhance fault features is described in Algorithm 1.
Algorithm 1. Fault feature enhancement algorithm based on multi-scale feature fusion
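The data flow of the iterative update (difference, then projection back onto the current feature) can be sketched as follows. This is only an illustration and not the paper's Algorithm 1: the learned deconvolution and strided-convolution operators are replaced by fixed stand-ins (nearest-neighbour upsampling for q and average pooling for p), all layers are assumed to share the same channel count, and the function names are hypothetical. The resolution is assumed to halve from one layer to the next, so layer t is 2^(n − t) times larger than layer n.

```python
import numpy as np


def upsample(x, factor):
    """Stand-in for q: nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def downsample(x, factor):
    """Stand-in for p: average pooling with window and stride equal to factor."""
    c, h, w = x.shape
    return x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))


def msff(latent, enhanced_prev):
    """Refine the n-th latent feature with the enhanced features of layers 1..n-1.

    enhanced_prev[t - 1] is assumed to be 2 ** (n - t) times larger than latent.
    """
    n = len(enhanced_prev) + 1
    j = latent
    for t in range(1, n):
        factor = 2 ** (n - t)
        error = upsample(j, factor) - enhanced_prev[t - 1]   # difference step
        j = downsample(error, factor) + j                    # update step
    return j


# Toy usage for n = 3: two previous enhanced features at 4x and 2x resolution.
latent = np.ones((2, 4, 4))
previous = [upsample(latent, 4), upsample(latent, 2)]
enhanced = msff(latent, previous)
```

With these fixed operators, a latent feature that already agrees with every previous enhanced feature produces zero differences and is returned unchanged. In the network itself the operators are trained, so the correction applied at each iteration is learned rather than prescribed.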

FoM
The MSFF module achieves feature fusion in the feature extraction process, but the issue of incomplete information still exists when the network uses the extracted features to classify faults. Therefore, the FoM module is placed before the classifier, and its definition fom is shown in Eq. 9. As shown in Figure 4A, the FoM module converts the output features of each network layer to a uniform size and concatenates them along the channel dimension to fuse the features used for prediction.
$$fom(y_1, y_2, \ldots, y_L) = cat\left(amp_{W \times H}(y_1), amp_{W \times H}(y_2), \ldots, amp_{W \times H}(y_L)\right) \tag{9}$$
where $cat$ represents the concatenation of the input features along the channel dimension, and $amp_{W \times H}$ represents the adaptive maximum pooling operation with an output size of W × H. As shown in Figure 4B, the features of each scale are converted into feature vectors with the same resolution and different channel numbers to represent the diagnosis result obtained by each layer. $y_1, y_2, \ldots, y_L$ are the output features of the first to L-th network layers (L = 5 in this paper). The FoM module performs a second fusion of fault features, so that the feature vectors used for classification do not degrade the diagnosis result because of insufficient or missing information.
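A minimal NumPy sketch of the FoM operation is given below. The channel counts and spatial sizes of the five layer outputs are hypothetical placeholders (the actual values are listed in Table 2), and the pooling windows follow a common adaptive-pooling bin rule (window i spans floor(i·H/out) to ceil((i+1)·H/out)), which is an assumption about the implementation.

```python
import math

import numpy as np


def adaptive_max_pool(x, out_h, out_w):
    """Adaptive max pooling of a (C, H, W) map to (C, out_h, out_w)."""
    c, h, w = x.shape
    out = np.empty((c, out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        r0, r1 = (i * h) // out_h, math.ceil((i + 1) * h / out_h)
        for k in range(out_w):
            c0, c1 = (k * w) // out_w, math.ceil((k + 1) * w / out_w)
            out[:, i, k] = x[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out


def fom(features, out_h, out_w):
    """Pool every layer output to a common size and concatenate along channels."""
    pooled = [adaptive_max_pool(y, out_h, out_w) for y in features]
    return np.concatenate(pooled, axis=0)


# Toy usage with L = 5 layer outputs of hypothetical shapes.
rng = np.random.default_rng(0)
channels = [8, 16, 32, 64, 128]
sizes = [32, 16, 8, 4, 2]
layer_outputs = [rng.standard_normal((c, s, s)) for c, s in zip(channels, sizes)]
fused = fom(layer_outputs, 2, 2)
```

The fused tensor keeps one block of channels per layer, so the classifier sees the strongest response of every scale at the same spatial resolution.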

EXPERIMENT
To verify the effectiveness of the proposed MSDFN-based fault diagnosis method for wind turbine planetary gearboxes and the performance of the MSFF and FoM feature fusion modules, comparative and ablation experiments were carried out on the planetary gearbox dataset of the State Key Laboratory of Mechanical Transmission of Chongqing University (Wang et al., 2018) and the gearbox dataset of the University of Connecticut (Cao et al., 2018). The multistage gearboxes of the experimental platforms are used to simulate the gearbox of a wind turbine for fault diagnosis research.

An Introduction of the Dataset
The vibration signal datasets of the planetary gearboxes used in this research come from the University of Connecticut gearbox data (Cao et al., 2018) and experimental data of the State Key Laboratory of Mechanical Transmission of Chongqing University (Wang et al., 2018). The two datasets are described below.
1) Chongqing University gearbox dataset. As shown in Figure 5A, the fault diagnosis experiment platform for planetary gearboxes is mainly composed of a motor, a planetary gearbox, a parallel shaft gearbox, and an electromagnetic brake. A multi-axis accelerometer is installed on the housing directly above the second-stage sun gear of the planetary gearbox to collect the original vibration signals, and the rotation frequency of the second-stage sun gear is set to 4.17 Hz. As shown in Figure 6, the gear fault types in the experiments are surface wear of gear teeth, cracked gear teeth, chipped gear teeth, and missing gear teeth. For the collection of vibration signals, the sampling frequency is set to 5,120 Hz, and each group is sampled for 200 s, so each type of fault includes 1,024,000 sampling points under each load condition. Under actual engineering conditions, the working conditions are often unmeasured and fluctuate easily. To examine the fault diagnosis performance of network models on planetary gearboxes under different load conditions, four load conditions of 0 N·m, 1.4 N·m, 2.8 N·m, and 25.2 N·m are set for each type of sun gear fault. The collected data is high-pass filtered to remove some low-frequency noise interference from the vibration signals. All the experimental data used in this paper are time sequences of filtered vibration signals (Wang et al., 2018).
2) University of Connecticut gearbox data. The experimental data are collected from a benchmark two-stage gearbox with replaceable gears, as shown in Figure 5B. The gear speed is controlled by a motor. The torque is supplied by a magnetic brake, which can be adjusted by changing its input voltage. A 32-tooth pinion and an 80-tooth gear are installed on the first-stage input shaft. The second stage consists of a 48-tooth pinion and a 64-tooth gear. The input shaft speed is measured by a tachometer, and gear vibration signals are measured by an accelerometer. The signals are recorded through a dSPACE system (DS1006 processor board, dSPACE Inc.) with a sampling frequency of 20 kHz. Nine different gear conditions are introduced to the pinion on the input shaft: healthy condition, missing tooth, root crack, spalling, and chipping tip with five different levels of severity (Cao et al., 2018). This dataset contains 104 samples per class, and every sample contains 3,600 points.
The Chongqing University gearbox dataset contains the usual types of gear failures and covers a wide range of working conditions, so it can better simulate the variable load conditions of wind turbine gearboxes (Cao et al., 2019). In addition, the gearbox data shared by the team of Professor Jiong Tang from the University of Connecticut was selected to further verify the effectiveness of the proposed method. This dataset contains nine gear conditions, which can better test the model's ability to distinguish samples of the same type but with different failure levels. The experimental platform configurations and gearbox structures of the two datasets are reasonable, as shown in related research on wind turbine gearbox fault diagnosis (Jiang et al., 2018; Lu et al., 2020).

Data Preprocessing
According to the method proposed in Section 3, the original vibration signals are preprocessed to obtain the datasets of time-frequency images. 1) Chongqing University gearbox dataset. Since the rotation frequency of the faulty gears is 4.17 Hz, the data collected in one second contain the vibration information of multiple rotation periods. Each sample therefore consists of one second of data, and 200 time-frequency images are obtained for each fault type under each load condition. The four load conditions are marked as L1, L2, L3, and L4, and each of them has 1,000 image samples. 2) University of Connecticut gearbox data. According to the dataset structure and sampling frequency, 104 samples of each class are obtained. Table 3 shows the average accuracy of MSDFN with different ratios of training and testing sets from the Chongqing University gearbox dataset. The accuracy does not change significantly or show an obvious monotonic trend; it only fluctuates slightly as the ratio changes. Considering the number of samples, and in order to obtain a higher diagnostic accuracy, all image samples are randomly divided into training and testing sets at a ratio of 7:3. The data collected under different load conditions are combined to obtain a mixed dataset. The size of the obtained time-frequency images is 256 × 256. Tables 4 and 5 show the details of the obtained datasets.
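The sample counts quoted above follow from simple arithmetic, which the short sketch below reproduces: 1,024,000 points at 5,120 Hz give 200 one-second segments per fault type and load condition, and a 7:3 split of the 1,000 images of one load condition gives 700 training and 300 testing samples. The helper names and the fixed random seed are illustrative assumptions, not part of the original pipeline.

```python
import random

FS = 5120                 # sampling frequency in Hz
DURATION_S = 200          # recording length per fault type and load condition
N_LABELS = 5              # fault labels 0-4 in the Chongqing University dataset


def segment_bounds(n_points, segment_len):
    """Start/stop indices of consecutive, non-overlapping segments."""
    last_start = n_points - segment_len
    return [(s, s + segment_len) for s in range(0, last_start + 1, segment_len)]


def split_indices(n_samples, train_ratio=0.7, seed=0):
    """Randomly split sample indices into training and testing subsets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = round(n_samples * train_ratio)
    return idx[:n_train], idx[n_train:]


segments = segment_bounds(FS * DURATION_S, FS)          # one recording
train_idx, test_idx = split_indices(N_LABELS * len(segments))
```

Each `(start, stop)` pair selects the 5,120 points that are converted into one time-frequency image.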

Hyperparameter Setting
In the training of deep learning networks, the hyperparameter settings have a considerable impact on training performance. In the experiments, the batch size of the network input (the number of samples fed to the network at a time) affects the testing accuracy of the model. If the batch size is too large, the model is difficult to fit or fits poorly. If the batch size is too small, the model is difficult to converge, and the accuracy oscillates. Considering the number of samples and the testing results, a batch size of 16 is chosen.
Dataset     Label 0   Label 1   Label 2   Label 3   Label 4   Load condition    Training   Testing
DatasetA    200       200       200       200       200       L1                700        300
DatasetB    200       200       200       200       200       L2                700        300
DatasetC    200       200       200       200       200       L3                700        300
DatasetD    200       200       200       200       200       L4                700        300
DatasetE    200       200       200       200       200       L1, L2, L3, L4    700        300
The learning rate is another important hyperparameter. It determines the step size with which the network parameters are updated. If the learning rate is too small, training is inefficient and takes too much time. If the learning rate is too large, the model may settle in a local optimum, the loss may oscillate, or the model may fail to converge. A dynamic adjustment mechanism for the learning rate is therefore adopted. The initial learning rate is 0.001, and the learning rate transformation coefficient is 0.9; that is, the learning rate is multiplied by 0.9 every 10 training cycles. This ensures a fast training speed in the initial stage of training. When the convergence of the model slows down, the learning rate is reduced so that the parameters gradually approach the optimal values.
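The schedule described above is a step decay, which can be written as a one-line function. The sketch below is illustrative; the function name is hypothetical, and it assumes that a "training cycle" corresponds to one epoch and that the decay is applied after every 10 completed epochs.

```python
def step_decay_lr(epoch, base_lr=0.001, gamma=0.9, step=10):
    """Learning rate in effect after `epoch` completed epochs under step decay."""
    return base_lr * gamma ** (epoch // step)


# Learning rate at a few selected epochs.
schedule = {epoch: step_decay_lr(epoch) for epoch in (0, 9, 10, 25)}
```

In PyTorch, the same behaviour is available through `torch.optim.lr_scheduler.StepLR` with `step_size=10` and `gamma=0.9`, although the text does not state which scheduler was actually used.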

Other Training Details
All experiments are run on a computer with an RTX 3090 GPU, 16 GB RAM, and an Intel i7-10700 CPU. The network model is implemented, trained, and tested with the PyTorch 1.7.0 deep learning framework. MATLAB 2018a is used to divide the original vibration signals into sample sequences and to generate the time-frequency image samples. This paper uses the softmax cross-entropy loss function. The loss for a single sample is given in Eq. 11.

$$\mathrm{loss}(x, c) = -\log\left(\frac{\exp(x_c)}{\sum_{j}\exp(x_j)}\right) \tag{11}$$

where $c$ is the label (the category index), the vector $x$ is the network output containing the prediction for each category, and $x_c$ is the $c$-th element of $x$. As training proceeds, the softmax probability assigned to the true category, $\exp(x_c)/\sum_j \exp(x_j)$, approaches 1, so the loss approaches 0. The loss of each batch is given in Eq. 12.

$$\mathrm{Loss} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{loss}(x_i, C_i) \tag{12}$$

Here $x_i$ and $C_i$ are the network output and the label category of the $i$-th sample, respectively, and $N$ is the batch size. The batch loss is the average of the per-sample losses.
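The per-sample and per-batch losses can be illustrated in a few lines of plain Python. This is a sketch of the standard softmax cross-entropy, which is assumed here to be the form behind Eqs. 11 and 12; the function names `sample_loss` and `batch_loss` are hypothetical, and the example inputs are toy values rather than network outputs from the paper.

```python
import math


def sample_loss(x, c):
    """Softmax cross-entropy for one sample: -log(exp(x[c]) / sum_j exp(x[j]))."""
    m = max(x)  # subtract the largest output for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in x))
    return log_sum_exp - x[c]


def batch_loss(outputs, labels):
    """Average of the per-sample losses over a batch of N samples."""
    if not outputs or len(outputs) != len(labels):
        raise ValueError("outputs and labels must be non-empty and equal in length")
    return sum(sample_loss(x, c) for x, c in zip(outputs, labels)) / len(outputs)


# Toy batch with five categories (labels 0 to 4).
toy_outputs = [[0.0, 0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0, 0.0]]
toy_labels = [2, 0]
toy_batch_loss = batch_loss(toy_outputs, toy_labels)
```

With identical outputs for all five categories, the per-sample loss equals log 5; when the output for the true category dominates, the loss is close to 0, which matches the behavior described in the text.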
To speed up convergence and avoid becoming trapped in a local optimum, stochastic gradient descent with momentum (Momentum-SGD) is used to optimize the network model, with the momentum parameter set to 0.9.
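A single Momentum-SGD update can be sketched as follows. The text does not give the exact update rule, so this sketch assumes the common form in which a velocity accumulates past gradients (v = momentum * v + g, then p = p - lr * v). The function names and the toy quadratic objective are hypothetical and serve only as illustration.

```python
def momentum_sgd_step(params, grads, velocity, lr=0.001, momentum=0.9):
    """Apply one Momentum-SGD update and return (new_params, new_velocity).

    Assumed rule: v <- momentum * v + g, then p <- p - lr * v.
    """
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


def minimize_quadratic(w0=1.0, steps=200, lr=0.1):
    """Toy example: minimize f(w) = w^2, whose gradient is 2w."""
    params, velocity = [w0], [0.0]
    for _ in range(steps):
        grads = [2.0 * params[0]]
        params, velocity = momentum_sgd_step(params, grads, velocity, lr=lr)
    return params[0]
```

Because the velocity carries over part of the previous update, consecutive gradients that point in the same direction reinforce each other, which is the mechanism behind the faster convergence mentioned above.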

Experimental Results and Analysis
The depth of the feature extraction network in MSDFN affects both feature fusion and fault diagnosis accuracy, so the best depth is selected experimentally. Table 6 shows the diagnostic accuracy of MSDFN on Dataset E for feature extraction networks of different depths. A shallow structure lowers the diagnostic accuracy, whereas an overly deep structure does not improve it.
Adding one layer also increases the number of model parameters several times. Therefore, the feature extraction network is set to five layers.

Performance Comparison of Different Network Models
To verify the effectiveness of the proposed method, MSDFN is compared with DWWC + DRN, WT-CNN (Liang et al., 2020), and three traditional image classification networks: ResNet18, DenseNet201, and VGG11. The comparative experiments are set up as follows.
Since the innovations of the proposed method lie in the fault diagnosis network, only the networks proposed by the existing methods are used in the comparison. The data preprocessing methods of the original literature are not included; instead, all networks are trained and tested on the same dataset of time-frequency images. Accordingly, this paper selects the networks proposed in Zhao et al. (2017) (whose original input is a wavelet packet coefficient matrix) and Liang et al. (2020) (whose original input is time-frequency images) for comparison.
In addition, several traditional image classification networks are included in the performance comparison. The DBN-based method (Wang et al., 2018) comes from the paper that provides the dataset used in this article. Since the original DBN structure is not suited to processing time-frequency images and its training method differs considerably from the others, this paper does not re-test DBN; all of its listed results are taken from the original literature (Wang et al., 2018).

Comparative experimental results. Table 7 shows the accuracy of fault diagnosis on each dataset. Compared with DBN and the CNNs with traditional structures, the residual network and MSDFN have obvious advantages in accuracy. Compared with the residual network, the proposed network uses fewer layers and still achieves a certain improvement in fault diagnosis accuracy. The accuracy on Dataset F, the University of Connecticut dataset, further supports the effectiveness of the proposed gearbox fault diagnosis method. All networks perform better on Dataset F, which reflects the difference in complexity between the data distributions of the datasets.

Figure 7 shows the accuracies of several CNN-based networks obtained in 10 repeated experiments. MSDFN achieves a higher and more stable accuracy on randomly allocated datasets, which means that the proposed network is more robust. Figures 8A,B compare the accuracy and loss curves, respectively, of each network during training on Dataset E. The convergence speed of MSDFN is close to that of ResNet, and its accuracy fluctuates little.

The Role of Dense Feature Fusion
To verify the impact of the proposed multi-scale dense fusion mechanism on diagnosis performance, an ablation study is performed by removing the feature fusion modules from the complete model. The models with different degrees of ablation are shown in Table 8. With all parts of the feature fusion structure removed, the diagnosis network shown in Figure 1 serves as the backbone; it is mainly composed of a convolutional layer, a residual block, and a fully connected layer. The complete network D is MSDFN.

The diagnostic accuracy of the ablated models on each dataset is shown in Table 9. The feature fusion modules clearly improve the accuracy of fault diagnosis. The accuracy of each ablated model does not fluctuate greatly across working conditions, but it is still affected by them, and different models perform differently on different datasets. As the load increases, the diagnostic accuracy of models B, C, and D also improves, whereas model A does not follow this trend, which suggests that the feature fusion modules increase the model's sensitivity to changes in load conditions.

Figure 9 shows the confusion matrices of the diagnosis results obtained by each ablated network model on the testing set of Dataset E, which contains 1,200 images. After the feature fusion structure is added to the network, the accuracy improves significantly. Additionally, because of the dense feature fusion at each network layer, the MSFF module plays a more important role than the FoM module. Although the projection and back-projection operations of each layer increase the amount of computation, the fault diagnosis accuracy is effectively improved.
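As a reading aid for the confusion matrices in Figure 9, the sketch below shows how such a matrix, the overall accuracy, and the per-category recall can be computed from true and predicted labels. The function names are hypothetical, the five categories follow the labels 0 to 4 used for the datasets, and the label lists are toy values, not results from the paper.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes=5):
    """Rows index the true category, columns index the predicted category."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def overall_accuracy(matrix):
    """Fraction of samples on the diagonal (correctly diagnosed)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / total


def per_class_recall(matrix):
    """Share of each true category that is diagnosed correctly."""
    return [row[k] / sum(row) if sum(row) else float("nan")
            for k, row in enumerate(matrix)]


# Toy labels for illustration only (two samples per category).
toy_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
toy_pred = [0, 0, 1, 2, 2, 2, 3, 3, 4, 0]
toy_matrix = confusion_matrix(toy_true, toy_pred)
```

Uneven per-category recall indicates that a model recognizes some fault types better than others, whereas errors spread evenly over the categories indicate no such bias.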
The three images misdiagnosed by the complete network D belong to three different categories and are predicted as three different categories, which indicates that MSDFN shows no obvious bias in its ability to diagnose different fault types. Networks A, B, and C, in contrast, do show such a bias. The integration of the two feature fusion modules, MSFF and FoM, therefore effectively stabilizes the recognition ability of the network across fault types.

To examine the extracted features, 240 images are randomly selected from the testing set of Dataset E. For the networks with different degrees of ablation, Figure 10 shows the distribution of the output features after t-SNE dimensionality reduction. This distribution gives an intuitive view of the separability of the output features, from which the fault feature extraction performance of each ablated network can be evaluated.
The output features of the first two feature extraction layers of each network model fall into two clusters, one large and one small, because the separability of the features is still low at this stage.
Since testing is carried out on a mixed dataset, there are obvious distribution differences between the data of the different working conditions. These differences mainly stem from the load: L1 (0 N·m), L2 (1.4 N·m), and L3 (2.8 N·m) are close to one another, whereas L4 (25.2 N·m) is much larger, so the two clusters in the scatter diagram have an approximate size ratio of 3:1. The contribution of the two feature fusion modules to fault diagnosis can be analyzed by comparing the feature distributions of networks A, B, C, and D. As the network depth increases, the differences caused by working conditions gradually decrease, so the features of the same fault type gradually become concentrated. Different ablated networks show different recognition capabilities. Comparing the third-layer outputs of the different networks shows that introducing the MSFF module enhances the ability to distinguish features. Across the layer-by-layer outputs of all models, network D achieves the best performance.
The fault recognition capability of each network can be assessed by observing the output of the FoM module in each model. Compared with the other models, the FoM output features of model A (the backbone network) contain many errors, and the features of the fault categories lie relatively close together. In the complete network, model D, the FoM output features are highly separable, and 100% accuracy is achieved on the set of 240 random samples.

CONCLUSION
Monitoring the health status of each wind turbine planetary gearbox is of great significance for reducing the operation and maintenance cost of wind turbines. To address the loss of fault information in the diagnosis process and to improve the diagnosis ability of the model, this paper proposes an intelligent fault diagnosis method for wind turbine planetary gearboxes based on MSDFN. The MSFF and FoM modules perform feature fusion in the feature extraction and fault classification stages, respectively, which effectively mitigates the loss of fault information caused by continuous convolution and pooling in CNN. The proposed method is compared with two mainstream methods and three traditional image classification networks to verify its effectiveness. Compared with traditional CNN-based networks, MSDFN uses feature fusion twice to improve the accuracy of fault diagnosis under both single and mixed loads. In repeated experiments, the proposed method achieves an accuracy of more than 99.5%. The ablation study verifies that feature fusion benefits the gear fault diagnosis of planetary gearboxes. On the mixed dataset, the MSFF and FoM modules increase the accuracy by 4.5% and 3.42%, respectively. To enable the proposed model to maintain high fault diagnosis accuracy under large changes in working conditions, future work will focus on using transfer learning to improve the model's adaptability to working conditions.

DATA AVAILABILITY STATEMENT
The data analyzed in this study is subject to the following licenses/restrictions: The dataset is provided by the author of the paper "https://content.iospress.com/articles/journal-of-intelligent-and-fuzzy-systems/ifs169538". Requests to access these datasets should be directed to https://content.iospress.com/articles/journal-of-intelligent-and-fuzzy-systems/ifs169538.

AUTHOR CONTRIBUTIONS
Conceptualization, YC and XH; Methodology, YL and XH; Software, XH; Writing, XH and YC.