Prediction of Protein–Protein Interactions in Arabidopsis, Maize, and Rice by Combining Deep Neural Network With Discrete Hilbert Transform

Protein–protein interactions (PPIs) in plants play an essential role in the regulation of biological processes. However, traditional experimental methods are expensive, time-consuming, and require sophisticated technical equipment. These drawbacks motivated the development of novel computational approaches to predict PPIs in plants. In this article, a new deep learning framework, which combines the discrete Hilbert transform (DHT) with deep neural networks (DNN), is presented to predict PPIs in plants. More specifically, each plant protein sequence is first transformed into a position-specific scoring matrix (PSSM). Then, the DHT is employed to capture features from the PSSM. To improve the prediction accuracy, we used the singular value decomposition (SVD) algorithm to decrease noise and reduce the dimensionality of the feature descriptors. Finally, these feature vectors were fed into the DNN for training and prediction. When applying our method to three plant PPI datasets, Arabidopsis thaliana, maize, and rice, we achieved good predictive performance with average area under the receiver operating characteristic curve (AUC) values of 0.8369, 0.9466, and 0.9440, respectively. To fully verify the predictive ability of our method, we compared it with different feature descriptors and machine learning classifiers. Moreover, to further demonstrate the generality of our approach, we also tested it on yeast and human PPI datasets. The experimental results indicate that our method is an efficient and promising computational model for predicting potential plant protein interaction pairs.


INTRODUCTION
Identification of protein–protein interactions (PPIs) in plants is essential for exploring the mechanisms underlying biological processes, such as organ formation, homeostasis control (Canovas et al., 2004), plant defense (Zhang et al., 2010), signal transduction (Khan and Kihara, 2016), and stress response (Bracha-Drori et al., 2004). Although numerous high-throughput techniques have been developed to identify PPIs in model species, such as affinity purification mass spectrometry (Fukao, 2012; Armean et al., 2013) and yeast two-hybrid assays (Causier and Davies, 2002; Fang et al., 2002), these approaches are cumbersome, costly, particularly time-consuming, and often suffer from high false positive rates. To overcome these problems, there is an urgent need for sequence-based computational methods that can accurately predict potential PPIs while analyzing the functions of plant genes.
In recent years, many methods have been introduced for detecting PPIs. These methods can be broadly classified into several categories: protein structure-based methods (Hayashi et al., 2018), genomic information-based methods (Zahiri et al., 2014), evolutionary relationship-based approaches (Xu et al., 2011), and protein sequence-based methods (Richoux et al., 2019). The first three categories often achieve good prediction performance; however, they typically require structural details of proteins, such as 3D structure and protein homology information. If this prior knowledge is not available, these methods will not perform as expected. Theoretically, the amino acid sequence contains all the information necessary to detect PPIs. In addition, with the improvement of sequencing technology, more and more plant genome sequences are becoming available. Hence, it is meaningful to develop computational methods that predict potential PPIs from sequence information alone.
To date, various approaches have been proposed to predict PPIs using feature descriptors of the protein sequence, such as the composition-transition-distribution descriptor (Yang et al., 2010), the auto-covariance descriptor (Guo et al., 2008), the Zernike moments descriptor (Wang et al., 2017), and local descriptors (Davies et al., 2008). These descriptors summarize specific aspects of the amino acid sequence, including frequencies of local patterns, physicochemical properties, and the positional distribution of residues. However, the coverage of these feature descriptors is still limited. Recently, many deep learning techniques have also been applied to PPI prediction. For example, Du et al. (2017) presented an approach called DeepPPI, which adopted deep neural networks (DNN) to extract high-level features from raw input features of protein sequences to identify PPIs. Zeng et al. (2020), inspired by deep learning, proposed a framework called DeepPPISP, which extracts local and global features from amino acid sequences and employs a DNN to predict PPIs. Sun et al. (2017) employed a stacked autoencoder (SAE), a deep learning algorithm, to predict PPIs from human protein sequences. Hashemifar et al. (2018) developed a novel sequence-based approach called DPPI that used Siamese-like convolutional neural networks (CNN) combined with data augmentation and random projection to improve PPI prediction. Sledzieski et al. (2021) proposed a model named D-SCRIPT, which showed that deep language modeling of protein sequence data is effective for PPI prediction. Chen et al. (2019) put forward an end-to-end framework that combined contextualized information and local features with a deep residual recurrent CNN in a Siamese architecture to predict PPIs using only protein sequence information. Yi et al. (2018) proposed the RPI-SAN model, which uses a deep stacked autoencoder network to extract features from RNA and amino acid sequences and then feeds these features to a random forest (RF) model for training and prediction. Despite these advances, there is still a need to improve the accuracy and efficiency of PPI prediction models.
In this article, we combined a DNN with the discrete Hilbert transform (DHT) and singular value decomposition (SVD) to predict PPIs in plants. More specifically, for each plant primary sequence, a position-specific scoring matrix (PSSM) was constructed, and the DHT was then applied to gather important information from the PSSM. Subsequently, the SVD algorithm was adopted to reduce feature dimensionality and noise interference, finally generating a 600-dimensional feature vector for each protein pair. Lastly, a deep neural network was applied to make predictions between target plant proteins. When the proposed method was applied to the Arabidopsis thaliana, maize (Zea mays), and rice (Oryza sativa) PPI datasets, it yielded promising average AUC (area under the ROC curve) values of 0.8369, 0.9466, and 0.9440, respectively. When compared with several different feature extraction methods and state-of-the-art machine learning classifiers, our method obtained better results. In addition, to provide more convincing evidence, we also applied our method to yeast and human PPI datasets. These combined results suggest that the proposed approach is effective and trustworthy for predicting potential PPIs in plants.

Data Collection and Construction of the Benchmarking Set
To validate the robustness and effectiveness of the proposed model, we applied it to three plant PPI datasets: A. thaliana, Z. mays, and O. sativa. The A. thaliana dataset was collected from TAIR (Rhee et al., 2003), IntAct (Kerrien et al., 2012), and BioGRID (Stark et al., 2006). After removing redundancy, the final A. thaliana positive dataset comprised 28,110 PPI pairs involving 7,437 A. thaliana proteins. These interacting protein pairs constitute the primary A. thaliana PPI network. For the construction of the negative dataset, we employed a bipartite graph to formulate the network of plant PPIs, where the nodes represent the plant proteins and the links denote the interactions between them. Taking A. thaliana as an example, the total number of possible associations between the 7,437 proteins is 55,308,969 (7,437 × 7,437) in the corresponding graph. However, only 28,110 pairs have been demonstrated to interact. Thus, the number of possible negative pairs is 55,280,859 (55,308,969 − 28,110), which is significantly more than the number of positive samples. To balance this binary classification problem, we randomly collected 28,110 non-interacting pairs as the negative dataset. In theory, the negative samples may contain a small number of true positives; however, given the size of the whole set of non-interacting pairs, the probability of this is very small. In this way, the whole A. thaliana dataset consists of 56,220 protein pairs.
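The negative-sampling step described above can be sketched as follows (a minimal illustration in Python; the protein identifiers and set sizes are toy placeholders, not the actual dataset):

```python
import random

def sample_negatives(proteins, positives, n, seed=0):
    """Randomly draw n protein pairs that are not in the known positive set.

    `positives` is a set of frozensets so that (a, b) and (b, a) are
    treated as the same interaction.
    """
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n:
        a, b = rng.sample(proteins, 2)
        pair = frozenset((a, b))
        if pair not in positives:
            negatives.add(pair)
    return negatives

# Toy example: 6 proteins, 3 known interactions, draw 3 negatives.
proteins = ["P1", "P2", "P3", "P4", "P5", "P6"]
positives = {frozenset(p) for p in [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]}
negatives = sample_negatives(proteins, positives, 3)
assert not (negatives & positives)  # negatives never overlap the positives
```

Because the rejected candidates are exactly the known positives, the sampled negatives are balanced in number with the positive set while remaining disjoint from it.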

Representation of the Plant Amino Acid Sequence
To mine highly efficient features for training the models, each protein pair is encoded as an 800-dimensional feature vector derived from the PSSM (Gribskov et al., 1987). The PSSM has been successfully employed in various fields of biological research, including the prediction of PPI sites, subcellular localization, and DNA-binding protein identification. In this section, we applied the PSI-BLAST tool (Altschul and Koonin, 1998) to represent each protein sequence as a U × 20 matrix that captures the evolutionary information of the plant sequence. The PSSM can be defined as

Q = {η(a, b) : a = 1, ..., U; b = 1, ..., 20},  (1)

where η(a, b) represents the probability that the a-th residue mutates to the b-th amino acid during the evolutionary process. In the experiment, plant protein sequences were adopted as seeds to search and align homologous sequences from the SwissProt database with the PSI-BLAST tool. The tool recognizes members of a gene family and evolutionary relationships between plant protein sequences, and for each position it generates a 20-dimensional vector denoting the probabilities of conservation against mutations to each of the 20 amino acids. The number of iterations was set to 3 and the E-value cutoff to 0.001 to obtain homologous sequences. The PSI-BLAST tool and SwissProt database can be accessed online at http://blast.ncbi.nlm.nih.gov/Blast.cgi.
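As an illustration of how such a profile might be read in practice, the sketch below parses the ASCII PSSM emitted by PSI-BLAST (via its `-out_ascii_pssm` option) into a U × 20 matrix of log-odds scores. The row layout assumed here (position, residue, 20 scores, 20 weighted percentages, information content, weight) follows the NCBI BLAST+ output; the subsequent mapping of scores to the probabilities η(a, b) is not shown.

```python
def parse_ascii_pssm(text):
    """Extract the U x 20 log-odds score block from a PSI-BLAST ASCII PSSM.

    Each data row looks like: position, residue, 20 log-odds scores,
    20 weighted observed percentages, information content, weight.
    """
    matrix = []
    for line in text.splitlines():
        tokens = line.split()
        # Data rows start with the 1-based sequence position.
        if len(tokens) >= 22 and tokens[0].isdigit():
            matrix.append([int(t) for t in tokens[2:22]])
    return matrix
```

Each returned row is the 20-dimensional conservation profile of one residue position, giving the U × 20 matrix Q used above.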

Discrete Hilbert Transform
In this section, we introduce the discrete Hilbert transform (DHT; Cizek, 1970) to extract feature descriptors from the PSSM and make the prediction more convenient and accurate. The DHT is a tool for signal analysis in the time and frequency domains. Before describing the 2-dimensional DHT, we note that the 1-D DHT (Ponomareva et al., 2018) operates in the spatial and frequency domains and has been described previously (Stark, 1971; Bracewell and Bracewell, 1986; Zhu et al., 1990; Onodera et al., 2005).
To better extract the feature descriptors, we used the 2-D DHT to construct the local energy of the PSSM. In this work, we applied the 2-D DHT as defined by Read and Treitel (1973) in the frequency domain. Our Matlab code is shown as follows:

    function x = hilbert2(xr)
    %HILBERT2 Discrete-time 2D analytic signal via Hilbert transform.
    %   X = HILBERT2(Xr) computes the 2D discrete-time analytic signal
    %   X = Xr + i * Xi such that Xi is the Hilbert transform of real image Xr.
    %   If the input Xr is complex, then only the real part is used:
    xr = real(xr);
    [m, n] = size(xr);
    % Analytic-signal masks: DC and Nyquist bins kept, positive frequencies doubled:
    hm = zeros(m, 1); hm(1) = 1; hm(2:floor((m + 1)/2)) = 2;
    if rem(m, 2) == 0, hm(m/2 + 1) = 1; end
    hn = zeros(n, 1); hn(1) = 1; hn(2:floor((n + 1)/2)) = 2;
    if rem(n, 2) == 0, hn(n/2 + 1) = 1; end
    x = ifft2(fft2(xr) .* (hm * hn.'));   % separable frequency-domain mask

In this work, PSI-BLAST encoded each protein sequence as a U × 20 matrix. Because protein sequences have different lengths, the size of each PSSM also differs. To handle this problem, we transformed each variably sized PSSM into a 20 × 20 matrix, and the 2-D DHT was applied to extract feature vectors from the PSSM profile. In this way, each plant protein sequence is converted into a 400-dimensional vector by the 2-D DHT. As a non-linear filtering technique, SVD has been widely applied to noise reduction of vibration signals, because the denoised signals exhibit a small phase shift and no time-delay effect. To improve the prediction accuracy and reduce the dimensionality of the input feature matrix, we applied the SVD algorithm (Klema and Laub, 1980) to reduce the size of each feature vector from 400 to 300. At the same time, the lower dimensionality reduces the complexity of the model and improves the generalization ability of the classifier. Finally, each protein pair is represented as a 600-dimensional DHT descriptor.
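For readers without Matlab, the core of the transform can also be sketched in pure Python. The snippet below implements the 1-D DHT through the standard frequency-domain mask (naive O(N²) DFT, which is adequate for 20-point rows) and then applies it along rows and columns of a 20 × 20 matrix to produce a 400-dimensional descriptor. This separable row/column variant and the helper names are illustrative, not the paper's exact 2-D definition, and the SVD reduction step is omitted.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def dht_1d(x):
    """Discrete Hilbert transform: imaginary part of the analytic signal."""
    N = len(x)
    h = [0.0] * N
    h[0] = 1.0
    for k in range(1, (N + 1) // 2):
        h[k] = 2.0                      # double the positive frequencies
    if N % 2 == 0:
        h[N // 2] = 1.0                 # keep the Nyquist bin
    X = dft(x)
    return [z.imag for z in idft([X[k] * h[k] for k in range(N)])]

def dht_features(pssm20):
    """Apply the 1-D DHT along rows, then columns, of a 20 x 20 matrix
    and flatten into a 400-dimensional descriptor (SVD step not shown)."""
    rows = [dht_1d(row) for row in pssm20]
    cols = list(zip(*rows))
    out = list(zip(*[dht_1d(list(c)) for c in cols]))
    return [v for row in out for v in row]
```

A sanity check on the 1-D transform: the DHT of cos(2πn/N) is sin(2πn/N), matching the textbook Hilbert pair.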

Deep Neural Networks
Given the large number of hidden layers that can be used for training, artificial neural networks consisting of two or more hidden layers are often referred to as DNNs, as shown in Figure 1. The depth of a neural network corresponds to the number of hidden layers, and the largest number of neurons in a layer determines the width of the DNN (Hinton and Salakhutdinov, 2006).
In terms of structure, a DNN is composed of many simple modules arranged as a multilayer stack. The data are first received by the input layer and then transformed non-linearly across many hidden layers. Before computing the final output, the average gradient is computed and the corresponding weights are adjusted. Neurons of a hidden layer are connected to the neurons of the preceding layer, and each neuron computes a weighted sum of its inputs and applies a non-linear activation function to produce its output. Common non-linear activation functions include the sigmoid, the rectified linear unit (ReLU), and the hyperbolic tangent; in this work, we used the sigmoid and ReLU. We constructed a DNN-based model using the TensorFlow platform, shown in Figure 1. This model consists of two hidden layers with 48 neurons each. The DHT feature descriptors are employed as inputs to the DNN model, and these features are then fed into the hidden layers for training and predicting PPIs. The Adam algorithm (Kingma and Ba, 2014), an adaptive learning rate approach, was adopted to accelerate the training process. At the same time, to avoid overfitting, the dropout technique was applied to our model (Khan et al., 2019). We also used the cross-entropy loss and the ReLU activation function to speed up training and achieve better predictive performance (Hinton et al., 2015). The layer outputs can be calculated by the following formulas:

R^m_{i1} = σ_1(T_{i1} X_{i1} + b_{i1}), i = 1, 2, ..., n; m = 1, 2,  (2)

R^m_{ij} = σ_1(T_{ij} R^m_{i(j−1)} + b_{ij}), i = 1, 2, ..., n; j = 2, 3, ..., h_1; m = 1, 2.  (3)

In Eqs. 2-6, n denotes the number of protein pairs to be trained, m indexes the individual network, h_1 represents the depth of the two individual networks, and h_2 denotes the depth of the fused network. σ_1 is the ReLU activation function, σ_2 is the sigmoid used in the output layer, and ⊕ is the concatenation operator.
R represents the output of a hidden layer, and y is the corresponding desired output. T and b denote the weight matrices and bias vectors, respectively.
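To make the layer equations concrete, a minimal pure-Python forward pass for one two-hidden-layer branch is sketched below. This is illustrative only: the weight names are placeholders, and the actual model is trained in TensorFlow with Adam, dropout, and the cross-entropy loss, none of which are shown here.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, b, x):
    """One fully connected layer: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def forward(x, W1, b1, W2, b2, w_out, b_out):
    """Two ReLU hidden layers followed by a sigmoid output unit."""
    h1 = relu(dense(W1, b1, x))    # R_{i1} = sigma_1(T_{i1} X_{i1} + b_{i1})
    h2 = relu(dense(W2, b2, h1))   # R_{i2} = sigma_1(T_{i2} R_{i1} + b_{i2})
    return sigmoid(sum(w * h for w, h in zip(w_out, h2)) + b_out)
```

With all weights and biases at zero, the sigmoid output is 0.5, i.e., an uninformed prior over the interacting/non-interacting classes.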

Evaluation Criteria
To prevent overfitting and validate the robustness of our method, the five-fold cross-validation (CV) scheme was applied. Specifically, the entire plant PPI dataset is randomly split into five equal parts; four of them are employed for training and the remaining one is used for testing. The training and testing data never overlap. The final validation results are the mean values obtained over the five folds. The predictive performance of the proposed approach is assessed by five different measurements: accuracy (Acc), precision (PR), sensitivity (Sens), specificity (Spec), and the Matthews correlation coefficient (MCC). They can be represented as

Acc = (TP + TN) / (TP + TN + FP + FN),
PR = TP / (TP + FP),
Sens = TP / (TP + FN),
Spec = TN / (TN + FP),
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),

where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively. In addition, receiver operating characteristic (ROC) curves (Hand, 2009) were plotted to better assess the predictive performance of the proposed model, and the AUC (area under the ROC curve; Huang and Ling, 2005) was also used as an evaluation criterion. Figures 2-4 illustrate the ROC curves obtained on the A. thaliana, Z. mays, and O. sativa datasets; in these figures, the x-axis is the false positive rate and the y-axis is the true positive rate. Based on the experimental results, the proposed model is effective for identifying PPIs in plants. We attribute this prediction performance to the powerful DHT-SVD descriptors and the DNN classifier. The PSSM not only encodes the sequence as a matrix but also captures sufficient prior information about plant proteins. In addition, the DHT extracts robust feature descriptors from the PSSM, and the SVD algorithm then reduces the noise and the dimensionality of the feature matrix, which further improves the prediction performance.
As a popular deep learning classifier, the DNN shows powerful training and prediction ability, which further convinces us that our method can be a useful tool for plant PPI prediction.
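The five evaluation measurements above follow directly from the confusion-matrix counts; a small reference implementation of the standard definitions (not code from the paper) is:

```python
import math

def metrics(tp, fp, tn, fn):
    """Acc, PR, Sens, Spec, and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pr = tp / (tp + fp)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return acc, pr, sens, spec, mcc

# Example: a balanced test fold with 10 errors of each kind.
acc, pr, sens, spec, mcc = metrics(tp=90, fp=10, tn=90, fn=10)
# -> acc = 0.9, pr = 0.9, sens = 0.9, spec = 0.9, mcc = 0.8
```

Note that MCC, unlike accuracy, penalizes both error types symmetrically, which is why it is reported alongside Acc for this balanced binary task.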

Comparison With Random Forest and K-Nearest Neighbor Classifier
Many machine learning classifiers have been applied to predict PPIs; K-nearest neighbor (KNN) (Keller et al., 1985) and random forest (RF) (Breiman, 2001) are among the most widely used. The KNN algorithm is one of the simplest classification approaches and has been widely applied to detect PPIs. RF is a decision tree-based ensemble learning method known for its powerful classification ability (Li et al., 2012). To further verify the predictive ability of the DNN classifier, we compared it with the KNN and RF models under the five-fold CV scheme using the same DHT feature descriptors. The results listed in Table 4 illustrate that our method achieved higher AUC values across the A. thaliana, Z. mays, and O. sativa datasets: the average AUC values of the DNN classifier are 0.1023, 0.1215, and 0.1354 higher than those of the KNN classifier, and 0.0036, 0.0130, and 0.0241 higher than those of the RF classifier, respectively. From the comparison results shown in Figure 5, we conclude that the combination of the DNN classifier and DHT descriptors can significantly improve performance in plant PPI prediction.

To evaluate the performance of the PSSM, we compared it with the substitution matrix representation (SMR), which was proposed by Yu et al. (2012) to represent protein sequences. In this experiment, we employed the BLOSUM62 matrix to encode each A. thaliana protein sequence as a 20 × 20 matrix. Then, the DHT algorithm was applied to extract feature descriptors from the SMR matrix, and SVD was again adopted to reduce the feature dimensions. In this way, we generated a 600-dimensional SMR-DHT descriptor for each protein pair. The five-fold CV results of the SMR-DHT descriptors combined with the DNN classifier on the A. thaliana dataset are summarized in Table 5. It can be observed that the PSSM-based method performs significantly better than the SMR-based method.
For example, the accuracy and AUC gaps between the PSSM-based and SMR-based methods are 4.38 and 4.94%, respectively. The higher predictive accuracy and lower standard deviations further indicate that our method outperforms the SMR-based approach (Figure 6).
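For reference, the k-nearest-neighbor rule used as a baseline in this comparison can be sketched in a few lines (a generic Euclidean-distance, majority-vote implementation; the value of k and the toy data are illustrative, not the settings used in the experiments):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points closest to x."""
    dists = sorted(
        (math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y)
    )
    votes = Counter(yi for _, yi in dists[:k])
    return votes.most_common(1)[0][0]

# Toy example: two well-separated classes in 2-D.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
train_y = [0, 0, 0, 1, 1, 1]
assert knn_predict(train_X, train_y, (0.1, 0.1)) == 0
assert knn_predict(train_X, train_y, (5.0, 5.1)) == 1
```

Its simplicity is also its weakness on 600-dimensional DHT descriptors, where distance concentration blunts the neighborhood structure, consistent with the AUC gap reported above.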

Predictive Ability on Yeast and Human Dataset
To further validate the potential of the presented method, we applied it to the yeast and human PPI datasets introduced by Guo et al. (2008) and Huang et al. (2015), respectively. The predictive results on the two datasets are listed in Tables 7 and 8.

DISCUSSION
In this article, we proposed a deep learning framework to predict PPIs in plants using only amino acid sequence information. This approach is based on a DNN combined with DHT descriptors and the PSSM. More specifically, we first used the PSSM to represent plant protein sequences and then extracted feature vectors from these matrices by the DHT. To improve the prediction accuracy and reduce the computational complexity, the SVD algorithm was adopted to reduce the feature dimensions. Lastly, these feature descriptors were fed to the DNN classifier for training and prediction. To verify the performance of the proposed approach, we evaluated it on the A. thaliana, Z. mays, and O. sativa datasets. To assess the power of the DNN-based classifier, we compared it with the KNN and RF classifiers using the same DHT descriptors. In addition, we compared the DHT with several different feature descriptors. To further demonstrate the generality of our model, we also applied it to the yeast and human datasets. The experimental results indicate that our model performs well in predicting PPIs in plants. In future work, we will continue to design more effective computational models for better analyzing biomolecular interactions in plants.

AUTHOR CONTRIBUTIONS
JP, L-PL, and Z-HY: conceptualization, methodology, software, validation, formal analysis, investigation, resources, and data curation. C-QY and Z-HR: writing (original draft preparation, review, and editing), visualization, and supervision. Y-JG: project administration. Z-HY: funding acquisition. All authors read and approved the final manuscript.

FUNDING
This research was funded by the National Natural Science Foundation of China, grant numbers 62002297 and 61722212.

ACKNOWLEDGMENTS
Our deepest gratitude goes to the editor Robert Friedman and three reviewers for their careful work and thoughtful suggestions that have helped improve this manuscript substantially.