Prediction of Plant Resistance Proteins Based on Pairwise Energy Content and Stacking Framework

Plant resistance proteins (R proteins) recognize effector proteins secreted by pathogenic microorganisms and trigger an immune response against infection. Accurate identification of plant R proteins is therefore an important research topic in plant pathology, and considerable progress has been made on this prediction task. Recently, several machine learning-based methods have emerged to identify plant R proteins, but most rely solely on protein sequence features and ignore inter-residue features, limiting further improvement in prediction performance. In this manuscript, we propose a method called StackRPred to predict plant R proteins. Specifically, StackRPred first derives feature information from the pairwise energy content of residues; the obtained features are then fed into a stacking framework to train a prediction model for plant R proteins. Results from both five-fold cross-validation and independent test validation show that our proposed method outperforms other state-of-the-art methods, indicating that StackRPred is an effective tool for predicting plant R proteins and should make a favorable contribution to the study of these proteins.


INTRODUCTION
The rapid multiplication and spread of pathogens affect plant growth and development and pose a serious threat to crops and food security. Resistance (R) proteins are of increasing interest because of their important role in plant defense against pathogens. R proteins are plant proteins that contain a variety of structural domains, such as the nucleotide-binding domain (NB-ARC), leucine-rich repeats (LRR), the Toll/interleukin-1 receptor domain (TIR), coiled-coil (CC) structures, and kinases (KIN) (Sanseverino and Ercolano, 2012; Kushwaha et al., 2016). The exploration of R proteins and proteins with R-protein characteristics can play a key role in plant defense against different pathogens. In recent years, computational methods have been widely used in R-protein prediction studies.
NLR-parser predicts NLR-like sequences based on MAST motif search (Steuernagel et al., 2015). RGAugury predicts different R protein subclasses by integrating the results generated by several computational tools (Li et al., 2016), including BLAST (Camacho et al., 2009), Hmmer3 (Eddy, 2011), Phobius (Käll et al., 2004), and TMHMM (Bateman et al., 2004). Restrepo-Montoya et al. (2020) developed a computational approach to classify RLK and RLP proteins using SignalP 4.0 (Petersen et al., 2011), TMHMM 2 (Krogh et al., 2001), and PfamScan (Finn et al., 2014). In general, sequence alignment-based methods have low sensitivity and are time-consuming, which makes it difficult for them to predict proteins with low sequence similarity. The application of machine learning methods for predicting plant R proteins has thus become of increasing interest.
Machine learning methods have been widely used to study plant and animal protein data (Sun et al., 2020a,b, 2021). Several common machine learning-based methods for predicting R proteins are described below: NBSPred (Kushwaha et al., 2016), DRPPP (Pal et al., 2016), prPred (Wang et al., 2021b), and prPred-DRLF (Wang et al., 2022). NBSPred (Kushwaha et al., 2016) is a high-throughput pipeline based on a support vector machine (SVM) that distinguishes NBSLRR and NBSLRR-like proteins from non-NBSLRR proteins in genomic, transcriptomic, and protein sequences; it was tested and validated on input sequences from three dicot and two monocot plants. Similarly, DRPPP (Pal et al., 2016) is an SVM-based approach for predicting disease resistance proteins in plants. The method applied 16 feature extraction methods to obtain 10,270 features and performed ten-fold cross-validation to optimize radial basis function SVM parameters, achieving an overall accuracy of 91.11% on the test dataset. Recently, two machine learning-based methods, prPred (Wang et al., 2021b) and prPred-DRLF (Wang et al., 2022), were proposed by Wang et al. to predict plant R proteins. prPred used two feature extraction methods, the composition of k-spaced amino acid pairs (CKSAAP) and the composition of k-spaced amino acid group pairs (CKSAAGP), to obtain plant R protein sequence features, and then used a two-step feature selection strategy to remove irrelevant and redundant features. The prediction accuracy of the prPred model was 93.5%. The prPred-DRLF method applied bi-directional long short-term memory (BiLSTM) and unified representation (UniRep) embeddings to represent plant R protein sequence features and used a light gradient boosting machine (LGBM) classifier to identify plant R proteins, achieving a prediction accuracy of 95.6% in independent tests.
Although considerable progress has been made with existing machine learning methods for predicting plant R proteins, significant challenges remain. For example, most prediction methods target only the sequence features of plant R proteins, ignoring protein structure and the physicochemical properties of residues. In contrast, protein residue pairwise energy content matrices (RECM) have been used to predict intrinsically disordered proteins because of their ability to capture energy information between residue pairs (Jones and Cozzetto, 2015; Mészáros et al., 2018). Mishra et al. (2019) used residue pairwise energy content features to predict DNA- and RNA-binding proteins, and Fu et al. (2020) used them to predict cell-penetrating peptides.
In recent years, the stacking framework has been widely used in biological sequence prediction, including protein, non-coding RNA, and RNA-protein interaction prediction. Mishra et al. (2019) proposed a method for predicting DNA-binding proteins by combining evolutionary information with a stacking framework; Yi et al. (2020) proposed a method to predict ncRNA-protein interactions by fusing multiple sources of information in a stacking framework; Fu et al. (2020) used the stacking framework to construct a prediction model for cell-penetrating peptides and their uptake efficiency; Basith et al. (2022) applied 11 different encodings representing three different feature types and fed them into a stacking model to predict prokaryotic lysine acetylation sites; and Wang et al. (2021a) proposed a hybrid framework based on a stacking strategy to predict non-coding RNAs.
In this manuscript, we propose a machine learning-based predictor, called StackRPred, to further improve plant R protein prediction accuracy. The main contributions of StackRPred are as follows.
(i) We employ RECM to encode plant R proteins and combine the discrete wavelet transform (DWT) (Shensa, 1992) and the pseudo position-specific score matrix (PsePSSM) (Chou and Shen, 2007) to obtain plant R protein feature representations. (ii) We use a stacking-based machine learning model to efficiently predict plant R proteins. The model consists of two layers: the first layer (base layer) uses these features to train an ensemble of predictors; the second layer (meta-layer) combines the outputs of the base-layer predictors. (iii) The prediction results show that StackRPred outperforms state-of-the-art methods for plant R protein prediction. The superior performance of StackRPred could motivate researchers to explore plant R proteins even further.

Framework of the Proposed Method
In this study, we present a sequence-based plant R protein prediction model called StackRPred. The StackRPred prediction model consists of two major parts: feature extraction and classifier construction. (1) Feature extraction: we first calculate the RECM matrix (see Section "Residue Pairwise Energy Content Matrices") for each plant R protein in the benchmark dataset according to the physicochemical properties of its sequence, and extract the PsePSSM and DWT features of each protein based on the RECM matrix. Then, we use the SVM-RFE + CBR (Yan and Zhang, 2015) method to reduce the dimensionality of the feature information.
(2) Classifier construction: we construct a stacking model as the classification model. Our proposed stacking classifier consists of two layers: the first layer (base layer) contains multiple classifiers; the second layer (meta-layer) includes a single classifier. The base layer consists of eXtreme Gradient Boosting (XGBoost), SVM, K-Nearest Neighbor (KNN), Gradient Boosting Decision Tree (GBDT), Light Gradient Boosting Machine (LightGBM), and Random Forest (RF); the meta-layer uses SVM as the meta-classifier. The overall framework of the proposed method for predicting plant R proteins is shown in Figure 1.
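The two-layer design described above can be sketched with scikit-learn's StackingClassifier. This is an illustrative sketch, not the authors' implementation: random placeholder data stands in for the real 112-dimensional feature vectors, and for brevity only the scikit-learn base learners (KNN, GBDT, SVM, RF) are shown; XGBClassifier and LGBMClassifier would plug in the same way through their scikit-learn-compatible wrappers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Placeholder data standing in for the 112-dimensional optimized features
X, y = make_classification(n_samples=364, n_features=112, random_state=0)

# Base layer: a subset of the six classifiers named in the text
base_layer = [
    ("knn", KNeighborsClassifier()),
    ("gbdt", GradientBoostingClassifier(random_state=0)),
    ("svm", SVC(kernel="rbf", probability=True, random_state=0)),
    ("rf", RandomForestClassifier(random_state=0)),
    # ("xgb", XGBClassifier()) and ("lgbm", LGBMClassifier()) would go here
]

# Meta-layer: an SVM trained on the base classifiers' out-of-fold predictions
stack = StackingClassifier(estimators=base_layer,
                           final_estimator=SVC(kernel="rbf"),
                           cv=5)
stack.fit(X, y)
print(stack.score(X, y))
```

StackingClassifier generates the meta-layer's training input via internal cross-validation (`cv=5`), which avoids leaking the base classifiers' training-set fit into the meta-classifier.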

Datasets
The dataset used in this study was derived from the study by Wang et al. (2021b). The specific data were obtained as follows: R proteins of 35 plant species were obtained from the PRGdb database (Osuna-Cruz et al., 2018) and the protein sequences of these 35 plant species were downloaded from the NCBI database to construct the positive sample set; proteins with sequence similarity greater than 30% were then removed from the non-R protein dataset using CD-HIT (Fu et al., 2012). The resulting dataset contains 456 protein sequences, with 152 positive and 304 negative samples. The dataset is divided into training and test sets at a ratio of 8:2, giving 364 training samples and 92 independent test samples. The training set consists of 121 plant R protein sequences and 243 non-R protein sequences; the independent test set consists of 31 R protein sequences and 61 non-R protein sequences.
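The 8:2 stratified split described above can be reproduced with scikit-learn. In this sketch the feature matrix is a placeholder (the real features come from the RECM-based extraction described later); labels use 1 for R proteins and 0 for non-R proteins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 152 positive (R protein) and 304 negative (non-R) samples, as in the benchmark set
y = np.array([1] * 152 + [0] * 304)
X = np.arange(456).reshape(-1, 1)  # placeholder for sequence-derived features

# Stratified 8:2 split preserves the 1:2 class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(len(X_train), len(X_test))  # 364 training, 92 independent test samples
```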

Residue Pairwise Energy Content Matrices
The interaction energy between protein residues helps ensure protein structural stability, and the energy contribution of residue interactions can be approximated by an energy function extracted from known structures (Mishra et al., 2016). Dosztanyi et al. (2005) performed a least-squares fit of the contact energies derived from the primary sequences of 674 proteins to the tertiary structures of 785 proteins and constructed the RECM, a 20 × 20 matrix whose rows and columns represent the 20 standard amino acids. Table 1 shows the RECM applied in this manuscript (Dosztanyi et al., 2005).

Discrete Wavelet Transform Features
The discrete wavelet transform (DWT) (Shensa, 1992) is a transform that captures both frequency and position information from a discretely sampled sequence signal. The transform is performed by projecting the signal onto a wavelet function. When applied to plant R protein sequence analysis, DWT can decompose the physicochemical properties of the sequence into lists of coefficients at different resolutions and also remove noise from the high-pass component. In this manuscript, the RECM matrix is calculated for each given plant R protein sequence; each RECM matrix is then regarded as a two-dimensional signal, which is denoised as a whole by the discrete wavelet transform.

The wavelet transform (WT) is defined as the projection of the signal f(t) onto the wavelet function ψ:

T(a, b) = (1/√a) ∫ f(t) ψ((t − b)/a) dt    (1)

where a (a > 0) is a scale factor and b is a translation factor, both real-valued. ψ((t − b)/a) is the analyzing wavelet function, and T(a, b) is the wavelet transform coefficient of the signal at the position t = b and the wavelet period determined by the scale factor a. The DWT can decompose a sequence into coefficients at different dilations and then remove the noise components. Nanni et al. (2012, 2014) proposed an efficient algorithm for performing the DWT: denoting the discrete signal f(t) as x[n], it is defined as

y_low[n] = Σ_{k=0}^{N−1} x[k] g[2n − k],  y_high[n] = Σ_{k=0}^{N−1} x[k] h[2n − k]    (2)

where N is the length of the discrete signal, y_low[n] are the approximation coefficients of the signal (low-frequency component), y_high[n] are the detail coefficients (high-frequency component), g is a low-pass filter, and h is a high-pass filter. As the level of decomposition increases, more detailed signal characteristics can be observed. Figure 2 shows an example of a 4-level discrete wavelet transform.
At each level, the data are divided into a high-frequency band containing more noise and a low-frequency band containing more useful signal; the low-frequency band is then transformed again at the next level. Inspired by the work of Nanni et al. (2012, 2014), we calculate the maximum, minimum, mean, and standard deviation of each band. Four statistics are obtained for the high-frequency band and four for the low-frequency band, giving eight features per level. In addition, because the high-frequency component is noisy, the low-frequency component is more informative: we also extract the first five discrete cosine coefficients from the approximation coefficients, which capture the most important information about the sequence in the compressed low band. Therefore, we obtain 4 + 4 + 5 = 13 features from each level of the DWT, for a total of 52 features over the 4-level transform.
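The per-level statistics can be sketched in plain numpy. This is an illustrative implementation under stated assumptions: the text does not name the mother wavelet, so a Haar filter pair is used here, and the helper names (`haar_dwt`, `dct2_first`, `dwt_features`) are hypothetical. The demo signal stands in for a single RECM-derived column.

```python
import numpy as np

def haar_dwt(x):
    """One level of Haar DWT: returns (approximation, detail) coefficients."""
    n = len(x) // 2 * 2                      # truncate to even length
    pairs = x[:n].reshape(-1, 2)
    low = pairs.sum(axis=1) / np.sqrt(2)     # low-pass (approximation) band
    high = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)  # high-pass (detail) band
    return low, high

def dct2_first(x, k=5):
    """First k DCT-II coefficients of signal x."""
    n = len(x)
    idx = np.arange(n)
    return np.array([np.sum(x * np.cos(np.pi * j * (2 * idx + 1) / (2 * n)))
                     for j in range(k)])

def dwt_features(signal, levels=4):
    """13 statistics per level (4 low-band + 4 high-band + 5 DCT) -> 52 features."""
    feats, approx = [], np.asarray(signal, dtype=float)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        for band in (approx, detail):        # max/min/mean/std of each band
            feats += [band.max(), band.min(), band.mean(), band.std()]
        feats += list(dct2_first(approx, 5)) # compressed low-band information
    return np.array(feats)

col = np.random.default_rng(0).normal(size=128)  # demo stand-in for one RECM column
print(len(dwt_features(col)))                    # 52
```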
From the RECM matrix, we can extract 52 features for each of the 20 attributes (amino acid columns) using the 4-level DWT method, obtaining 1,040 features in total. Chou and Shen (2007) proposed the pseudo position-specific score matrix (PsePSSM) feature extraction method, which is widely used for protein sequence feature extraction. Analogously, we establish a new feature extraction method based on the RECM matrix, called PseRECM, which can be used for feature extraction from plant R protein sequences. PseRECM is defined as follows.

PsePSSM Features
Following the PsePSSM formulation, for a protein of length L the first component is the column mean

p̄_j = (1/L) Σ_{i=1}^{L} p_{i,j},  j = 1, 2, ..., 20

where p_{i,j} represents the value in the i-th row and j-th column of the RECM matrix. The second component is

G_j^λ = (1/(L − λ)) Σ_{i=1}^{L−λ} (p_{i,j} − p_{i+λ,j})²,  j = 1, 2, ..., 20

where G_j^λ is the average correlation of amino acid residues with a separation distance λ (λ < L) in the sequence.
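The encoding can be sketched directly from these two components. This is a sketch assuming the standard Chou-Shen PsePSSM form applied to the RECM-derived matrix; the function name `pse_recm` is hypothetical, and the demo matrix is random rather than a real protein's RECM profile.

```python
import numpy as np

def pse_recm(recm, lam=3):
    """PseRECM sketch: 20 column means plus 20*lam lambda-correlation terms.

    recm: an L x 20 matrix (one row per residue, one column per standard
    amino acid), standing in for the RECM-derived profile of a protein.
    """
    L = recm.shape[0]
    mean_part = recm.mean(axis=0)                       # p-bar_j, j = 1..20
    corr_part = [np.mean((recm[:L - d] - recm[d:]) ** 2, axis=0)
                 for d in range(1, lam + 1)]            # G_j^lambda terms
    return np.concatenate([mean_part] + corr_part)

demo = np.random.default_rng(1).normal(size=(60, 20))   # a 60-residue demo protein
print(pse_recm(demo, lam=3).shape)                      # (80,) = 20 + 20*3
```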

Feature Optimization Algorithm
After extracting feature information for the full plant R protein dataset, we employ the SVM-RFE + CBR (Yan and Zhang, 2015) algorithm to select the best feature subset, in order to eliminate noise and redundant features from the original feature space, reduce overfitting, and improve performance. The SVM-RFE + CBR algorithm has been successfully applied to many systems biology problems (Fu et al., 2018, 2019a; Chen et al., 2021). We first use SVM-RFE + CBR to rank all feature vectors and select a set of top-ranked features, and then reorganize the selected features into a new, ordered feature vector. After applying the SVM-RFE + CBR algorithm, a 112-dimensional feature vector is obtained as the model input for training.
The SVM-RFE algorithm is an embedded feature selection method based on the maximum-margin principle of SVM; it was proposed by Guyon et al. for cancer classification and has been successfully applied to many systems biology problems (Yin et al., 2016; Chowdhury et al., 2017). SVM-RFE trains a model on the samples, ranks each feature by its score, removes the lowest-scoring feature, and then retrains the model on the remaining features for the next iteration, until the desired number of features is reached. To reduce the potential bias between the non-linear and linear versions of SVM-RFE, Yan et al. incorporated a correlation bias reduction (CBR) strategy and proposed the SVM-RFE + CBR algorithm. To incorporate CBR into the feature elimination process, half of the remaining features are removed in each iteration at the beginning of the algorithm; once the number of remaining features falls below an elimination threshold, features are removed one at a time in subsequent iterations for better accuracy.
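The core SVM-RFE step can be sketched with scikit-learn's RFE. Note the assumptions: the CBR extension is not available in scikit-learn, so only standard SVM-RFE is shown; a linear kernel is used because RFE needs per-feature weights; `step=0.5` mimics the "remove half the features per iteration" phase; and the data are random placeholders with 300 features rather than the real 1,040-plus feature space.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Placeholder data: 300 candidate features, only some informative
X, y = make_classification(n_samples=200, n_features=300,
                           n_informative=20, random_state=0)

# Standard SVM-RFE down to the 112 features used by the final model;
# step=0.5 removes large batches per iteration, as in the early CBR phase
selector = RFE(SVC(kernel="linear"), n_features_to_select=112, step=0.5)
selector.fit(X, y)
print(selector.support_.sum())  # 112 features retained
```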
The SVM-RFE + CBR algorithm requires the following main parameters: kerType, rfeC, rfeG, useCBR, Rth. The values and descriptions of these parameters in this paper are shown in Table 2.

Classification Models
The stacking method combines models by using the outputs of one layer of models (the base models) as feature input to the next layer. Specifically, the output of the first layer of the stacking model is used as the input of the second layer, the output of the second layer as the input of the third, and so on, with the output of the last layer taken as the final result. In this manuscript, a stacking model is constructed as the classification model for plant R proteins. The StackRPred model consists of two layers: the first layer (base layer) contains multiple classifiers, called base classifiers; the second layer (meta-layer) includes a single classifier, called the meta-classifier. The output of the base classifiers is used as input to the meta-classifier, so that the meta-classifier can detect and correct the biases of the base classifiers and learn inductively from their results, thereby improving the generalization accuracy of the ensemble. Choosing suitable base and meta-classifiers is the key to improving the generalization ability of the StackRPred model. In this study, after several experimental tests, we selected six classification algorithms as base classifiers for the first layer, namely KNN, GBDT, SVM, XGBoost, LightGBM, and RF, and chose SVM as the meta-classifier.
K-nearest neighbor (KNN) (Altman, 1992) is a non-parametric statistical method for classification and regression. Its core idea is that if most of the K nearest neighbors of a sample in the feature space belong to a certain category, the sample also belongs to that category and shares the characteristics of the samples in it; the classification decision is determined solely by the categories of the nearest one or several samples. Support vector machine (SVM) (Vapnik, 1999) is a generalized linear classifier trained by supervised learning, whose decision boundary is the maximum-margin hyperplane over the learning samples. In this study, we choose the radial basis function (RBF) as the SVM kernel function and employ grid search to optimize the kernel parameter γ and the cost parameter C.
Gradient boosting decision tree (GBDT) (Friedman, 2001) is an iterative algorithm that consists of multiple decision trees, with the conclusions of all the trees accumulated to make the final decision. Together with SVM, it was regarded as one of the more generalizable algorithms when first proposed.
Random forest (RF) (Svetnik et al., 2003) is a classifier that contains multiple decision trees and whose output class is determined by the plurality of the classes output by the individual trees. RF randomly combines multiple decision trees into a forest and determines the final class of a test sample by the vote of each decision tree. eXtreme gradient boosting (XGBoost) (Chen and Guestrin, 2016) is an algorithm that integrates and boosts multiple weak classifiers into a strong classifier. Compared with the gradient boosting classifier (GBC), XGBoost uses a more regularized model formulation to control overfitting, thereby improving performance.
LightGBM is a gradient boosting tree framework proposed by Ke et al. (2017). It is an implementation of the GBDT algorithm that supports efficient parallel training and offers faster training speed, lower memory consumption, better accuracy, and distributed support for processing large amounts of data quickly.
The following describes the parameter settings of the six classifiers. XGBoost/RF: the number of trees in the model, i.e., the "n_estimators" variable, is fine-tuned using the grid search method, with the remaining parameters left at their defaults:
100 ≤ n_estimators ≤ 1000, with step 25.
SVM: we choose the radial basis function as the kernel function of the SVM and use grid search to optimize the parameters C and γ over the following ranges:
2^−5 ≤ C ≤ 2^15, with step 2; 2^−15 ≤ γ ≤ 2^15, with step 2^−1.
LightGBM: the three key parameters "n_estimators," "max_depth," and "learning_rate" are fine-tuned using the grid search method.
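The SVM tuning step can be sketched with GridSearchCV. This is illustrative only: the grid below is a coarse subsample of the exponential C/γ ranges above (the full grid is much larger), and the data are random placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# Coarse subsample of the exponential C / gamma ranges described above
param_grid = {
    "C": [2.0 ** e for e in range(-5, 16, 4)],      # 2^-5 .. 2^15
    "gamma": [2.0 ** e for e in range(-15, 16, 6)], # 2^-15 .. 2^15
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```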

Evaluation Criteria
To evaluate the performance of the proposed plant R protein prediction model, four metrics were introduced in this study: Precision, Recall, Accuracy (ACC), and F1-score, defined as

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
ACC = (TP + TN) / (TP + TN + FP + FN)
F1-score = 2 × Precision × Recall / (Precision + Recall)

where TP, TN, FP, and FN denote the number of true positives, true negatives, false positives, and false negatives, respectively.
In addition, the receiver operating characteristic (ROC) curve was plotted on the basis of sensitivity and specificity, and the area under the ROC curve (AUC) was calculated using the trapezoidal approximation. The AUC provides a measure of classifier performance; larger AUC values correspond to better performance.
Measuring Algorithm Performance Through Five-Fold Cross-Validation

K-fold cross-validation is one of the most common ways to measure the performance of a computational model. In this manuscript, we apply five-fold cross-validation to the training set and calculate the Accuracy, Precision, Recall, and AUC evaluation metrics. The average five-fold cross-validation results are: Accuracy 0.975, Precision 0.984, Recall 0.942, and AUC 0.995. The prPred-DRLF model uses three feature optimization methods: LGBM, RF, and MRMD3.0 (Zou et al., 2016; He et al., 2020). The best accuracies of the prPred-DRLF model with these three optimization algorithms are 0.97, 0.964, and 0.964, respectively, all of which are lower than the accuracy achieved by our method (0.975).
Therefore, a comparison of the experimental results shows that our proposed model is superior to the prPred-DRLF model. It is worth noting that prPred-DRLF uses far more feature dimensions than our method, indicating that our method captures more effective feature information with fewer dimensions. For a better presentation of the results, we plotted the average ROC curve for the five-fold cross-validation, as shown in Figure 3.
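The five-fold cross-validation protocol can be sketched with scikit-learn. This is illustrative only: random placeholder data stands in for the real 112-dimensional features, and a single RBF SVM stands in for the full stacking model, so the scores here are not comparable to the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder data with the training-set shape (364 samples, 112 features)
X, y = make_classification(n_samples=364, n_features=112, random_state=0)

# Five-fold cross-validated accuracy, averaged as in the evaluation above
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5, scoring="accuracy")
print(scores.mean())
```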

Measuring Algorithm Performance Through Independent Test Validation
To further compare the performance of our proposed method in independent tests, we compared it with the prPred and prPred-DRLF methods, choosing the best experimental results reported in their respective papers. The comparison was performed on the same dataset, and the results are shown in Table 3. The prPred model achieves an accuracy of 0.935 and a Precision of 1, the highest Precision among all models. The prPred-DRLF model uses three feature embeddings: TAPE-BERT, BiLSTM, and UniRep. prPred-DRLF1 in Table 3 denotes the prPred-DRLF model with the BiLSTM + UniRep combination, which gives the best accuracy reported for prPred-DRLF; prPred-DRLF2 denotes the model with all three embeddings (TAPE-BERT, BiLSTM, and UniRep), which performs worse than prPred-DRLF1.
As can be seen in Table 3, the Accuracy, Precision, Recall, F1-score, and AUC of our proposed method StackRPred are 0.967, 0.980, 0.968, 0.980, and 0.997, respectively; all except Precision are the highest values, indicating the superiority of our method in predicting plant R proteins. To make the results more visual, we also plotted the ROC curves, as shown in Figure 4.

CONCLUSION
The discovery and study of plant R proteins is of great importance to agricultural production. In this study, we propose a novel plant R protein predictor, StackRPred, which introduces the DWT and PsePSSM methods to extract plant R protein feature information based on the residue pairwise energy content, then applies the SVM-RFE + CBR technique to select an optimal 112-dimensional feature subset; finally, the 112-dimensional feature vector is fed into the constructed stacking model for training. The stacking model has two layers: the first layer contains six classifiers, namely KNN, GBDT, SVM, XGBoost, LightGBM, and RF, and SVM is selected as the classifier in the second layer. Precision, Recall, Accuracy (ACC), F1-score, and AUC were used to evaluate model performance under both five-fold cross-validation and independent test validation. The experimental results show that the proposed StackRPred model outperforms other state-of-the-art algorithms. The StackRPred model is useful for further exploration of plant R proteins and is expected to extend to other protein and peptide research areas. In the future, we will focus more on the interpretability of plant R protein prediction models, one of the key directions of current bioinformatics research (Cai et al., 2021a,b); exploring model interpretability will benefit further functional studies of plant R proteins.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
YC and ZhL performed the experiments. YC wrote the manuscript. All authors conceived the concept of the work, contributed to the article, and approved the submitted version.