KK-DBP: A Multi-Feature Fusion Method for DNA-Binding Protein Identification Based on Random Forest

A DNA-binding protein (DBP) is a protein with a special DNA-binding domain and is associated with many important molecular biological mechanisms. The rapid development of computational methods has made it possible to predict DBP on a large scale; however, existing methods do not fully integrate DBP-related features, which limits their prediction accuracy. In this article, we develop a DNA-binding protein identification method called KK-DBP. To improve prediction accuracy, we propose a feature extraction method that fuses multiple PSSM features. The experimental results show a prediction accuracy of 81.22% on the independent test dataset PDB186, which is the highest among the existing methods compared.


INTRODUCTION
Proteins are spatially structured substances formed when amino acids are joined into polypeptide chains by dehydration condensation and then folded. Proteins are the material basis of life and are required for every vital activity. Given the vast number of proteins and their roles, protein classification has always been central to the study of proteomics. DNA-binding proteins (DBP) are a very specific class of proteins whose specific binding to DNA guarantees the accuracy of biological processes and whose nonspecific binding to DNA guarantees the high efficiency of biological processes (Gao et al., 2008). DNA-protein interactions, which underlie processes such as gene expression and transcriptional regulation, occur ubiquitously throughout the biological activities of living bodies (Shen and Zou, 2020;Xu et al., 2021a). All of these interactions are tightly linked to DBP, where the fraction of DNA-binding proteins in eukaryotic genes is approximately 6-7%.
The role of DBP in biological activities has gained a lot of attention in recent years, as various large genome projects and research on DBP identification have rapidly progressed. However, identifying DBP using traditional biochemical analyses is inefficient and expensive (Li and Li, 2012;Xu et al., 2021b). In recent years, machine learning methods have been widely used in the field of bioinformatics (Jiang et al., 2013;Geete and Pandey, 2020;Tao et al., 2020;Wang et al., 2021a;Long et al., 2021). Using machine learning methods for DNA-binding protein identification can enable rapid and accurate prediction of DBP from a large number of proteins, while drastically reducing prediction costs (Fu et al., 2018). Because proteins are numerous and highly diverse, solving every classification and prediction problem with a single method is difficult, if not impossible (Wang et al., 2021b). Therefore, we must continue to propose effective methods for high-quality DBP prediction and identification in order to better understand vital activities and to promote further progress within the bioinformatics field.
Feature extraction methods can be broadly classified into two categories: those based on structural information and those based on sequence information (Kim et al., 2004;Meng and Kurgan, 2016;Qu et al., 2019;Ao et al., 2021a;Lv et al., 2021a;Liu et al., 2021;Tang et al., 2021;Wu and Yu, 2021). Stawiski et al. (Stawiski et al., 2003) proposed a model based on protein structure that utilises a neural network approach incorporating information such as residue and hydrogen bond potential. Liu et al. (Liu et al., 2014) developed a model called IDNA-prot|dis, based on the pseudo amino acid composition (PseAAC) of protein sequence information. iDNAPro-PseAAC (Liu et al., 2015), which uses a similar feature extraction method, adopts a prediction model based on a support vector machine to predict DBP. IDNAprot (Lin et al., 2011) was constructed based on physicochemical properties and random forest (RF) classification. In addition, a support vector machine model based on k-mer and autocovariance transformation was proposed by Dong et al. (Liu et al., 2016). Local-DPP (Wei et al., 2017a) used random forests based on PSE-PSSM features to predict DBP. MK-FSVM-SVDD is a multiple kernel SVM prediction tool, based on heuristic kernel alignment, developed to identify DBP. In addition, two models for predicting DBP were developed: DNAprot (Kumar et al., 2009) and DNAbinder (Kumar et al., 2007). Lu et al. (Lu et al., 2020) developed a prediction model for DBP based on support vector machines using Chou's five-step rule.
Currently, a number of DNA-binding protein prediction methods based on different strategies exist. Unfortunately, most of these DBP prediction methods fail to extract features based on evolutionary information, so their robustness and prediction accuracy have much room for improvement. To address these issues, more research is needed with regard to feature extraction and the selection of classifiers (Zuo et al., 2017;Zheng et al., 2019).
In this paper, we propose a new DNA-binding protein prediction method called KK-DBP. We first obtained the position specificity score matrix (PSSM) of the protein sequence of each sample used to train the model. The PSSM was then used to extract three features for each sample: PSSM-COMPOSITION (Zou et al., 2013), RPSSM (Ding et al., 2014) and AADP-PSSM (Liu et al., 2010), which were combined to form the initial feature set of the sample. The initial feature set of each sample has 930 dimensions (400 + 110 + 420). To avoid feature redundancy and improve prediction accuracy, KK-DBP used the max relevance max distance (MRMD) (Zou et al., 2016) feature ranking method to establish the optimal feature subset for model training. Finally, a new DBP prediction model was constructed using the random forest learning method. The complete method framework is shown in Figure 1.

Dataset
The dataset is one of the key factors determining the quality of a predictive model and is the cornerstone of machine learning; it directly affects the final performance of the model, so dataset construction must be meticulous (Liang et al., 2017;Su et al., 2021). Other researchers have proposed many prediction models for DNA-binding proteins, so using existing benchmark data allows an objective comparison with them. In the present study, we used protein sequences from the PDB database as our training and test datasets. Table 1 shows the contents of the dataset: the training set PDB1075 contains 525 DNA-binding proteins and 550 non-DNA-binding proteins, and the test set PDB186 contains 93 DNA-binding proteins and 93 non-DNA-binding proteins. The dataset is constructed as follows:

$$S = S^{+} \cup S^{-}$$

where $S^{+}$ is the positive subset containing only DNA-binding proteins, and $S^{-}$ is the negative subset containing only non-DNA-binding proteins.

Feature Extraction
Feature extraction is very important to modeling sequence classification, as it directly affects the accuracy of predictive models (Zhang et al., 2020a;Lv et al., 2021b). Evolutionary information is among the most important information we have regarding protein function and genetics (Zuo et al., 2014). Position specificity score matrices (PSSM) can intuitively display protein evolutionary information. Thus, feature extraction methods based on PSSM are widely used in protein classification.

Figure 1. The KK-DBP method framework. Step A: Construction of position specificity score matrices for protein sequences. Step B: Extraction of three features (AADP-PSSM, PSSM-COMPOSITION, and RPSSM) as the initial feature set for a single sample. Step C: Feature ranking and selection using the MRMD algorithm. Step D: Identification of DBP using random forests.

Position Specificity Score Matrices
Altschul et al. (Altschul et al., 1990) proposed the BLAST algorithm. Given a protein sequence, BLAST can represent the evolutionary information of the protein by aligning it against a specific database and extracting a position specificity score matrix (PSSM). To improve prediction accuracy, our method predominantly utilises protein evolutionary information to extract features. For the training and test sets used in our method, the PSSM of each sequence was generated by three PSI-BLAST iterations with an E-value of 0.001. The PSSM is a matrix of size L × 20, where L is the length of the protein sequence and 20 is the number of amino acid types. Element (i, j) of the PSSM is the log score for the amino acid at position i of the sequence being replaced by the jth amino acid type. A value greater than 0 indicates a large probability that, during the alignment, the amino acid at the corresponding position is mutated to the corresponding one of the 20 native amino acids. A negative value indicates that the substitution is unlikely, and the larger its magnitude, the less prone the position is to that alteration. This numerical pattern indicates the probability of mutation of each residue in a given protein sequence. The matrix has the following form:

$$P_{PSSM} = \begin{bmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,20} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ p_{L,1} & p_{L,2} & \cdots & p_{L,20} \end{bmatrix}$$

Reduced Position Specificity Score Matrices and Position Specificity Score Matrices-Composition
PSSM-COMPOSITION is generated by summing the rows of the original PSSM that correspond to the same amino acid, dividing by the sequence length and scaling to [-1, 1]. For the PSSM of each protein sequence, a 400-dimensional feature vector $\{d_1, d_2, d_3, \ldots, d_{400}\}$ is generated. Li et al. (Li et al., 2003) first proposed that 10 might be the minimum number of residue types (letters) needed to construct a reasonably folded model. Reduced PSSM (RPSSM) borrowed this idea and simplified the original PSSM of form L × 20 to one of form L × 10.
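The PSSM-COMPOSITION construction described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `pssm_composition`, the column order `ARNDCQEGHILKMFPSTWYV` (the usual PSI-BLAST order) and the toy matrix are assumptions, and the final scaling to [-1, 1] is left out because the text does not specify the scaling scheme.

```python
import numpy as np

# Assumed PSSM column order (not stated in the text).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def pssm_composition(sequence, pssm):
    """Return the unscaled 400-dimensional PSSM-COMPOSITION vector."""
    pssm = np.asarray(pssm, dtype=float)
    if pssm.shape != (len(sequence), 20):
        raise ValueError("pssm must have shape (len(sequence), 20)")
    index = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
    composition = np.zeros((20, 20))
    for residue, row in zip(sequence, pssm):
        k = index.get(residue)
        if k is not None:  # non-standard residues such as 'X' are skipped
            composition[k] += row  # sum rows belonging to the same amino acid
    return (composition / len(sequence)).ravel()


# Toy example: a 4-residue sequence with a made-up PSSM.
toy_sequence = "ACAA"
toy_pssm = np.arange(80, dtype=float).reshape(4, 20)
features = pssm_composition(toy_sequence, toy_pssm)
print(features.shape)
```

Each block of 20 consecutive values in the result holds the averaged PSSM rows of one amino acid type, so amino acids absent from the sequence contribute a block of zeros.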
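Let $a_1 a_2 \ldots a_L$ be a protein in the dataset, assume that $a_i$ is mutated to $s$, and let $p_{i,s}$ represent the pseudo composition component of amino acid $a_i$. The pseudo composition of all amino acids in protein $a_1 a_2 \ldots a_L$ is defined as:

$$D_s = \frac{1}{L}\sum_{i=1}^{L}\left(p_{i,s} - \bar{p}_s\right)^2, \quad \bar{p}_s = \frac{1}{L}\sum_{i=1}^{L} p_{i,s}, \quad s = 1, 2, \ldots, 10;\ i = 1, 2, \ldots, L$$

The dipeptide composition was later incorporated into the RPSSM method in order to overcome its inability to extract full sequence information. Assuming that $a_{i+1}$ is replaced by $t$, the dipeptide pseudo composition of $a_i a_{i+1}$ is defined as:

$$x_{i,i+1} = \left(p_{i,s} - \frac{p_{i,s} + p_{i+1,t}}{2}\right)^2 + \left(p_{i+1,t} - \frac{p_{i,s} + p_{i+1,t}}{2}\right)^2 = \frac{\left(p_{i,s} - p_{i+1,t}\right)^2}{2}$$

where $x_{i,i+1}$ represents the deviation of $p_{i,s}$ and $p_{i+1,t}$ from their mean value. Finally, because each protein sequence in the dataset consists of the pseudo composition of all of its dipeptides, we can generate the dipeptide part of the RPSSM feature, defined as follows:

$$D_{s,t} = \frac{1}{L-1}\sum_{i=1}^{L-1} x_{i,i+1}, \quad s, t = 1, 2, \ldots, 10$$

Together, the 10 values $D_s$ and the 100 values $D_{s,t}$ form the 110-dimensional RPSSM feature vector.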
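As a rough sketch of how the 110-dimensional RPSSM vector could be computed, the code below assumes the variance-style definitions commonly given for RPSSM: $D_s$ is the mean squared deviation of column $s$ from its mean, and $D_{s,t}$ is the mean of $(p_{i,s} - p_{i+1,t})^2 / 2$ over adjacent positions. The function name `rpssm_features` is hypothetical, and the input is taken to be an already reduced L × 10 matrix, since the text does not state the grouping that maps 20 columns to 10.

```python
import numpy as np


def rpssm_features(reduced_pssm):
    """Return the 110-dimensional RPSSM vector from an L x 10 reduced PSSM."""
    r = np.asarray(reduced_pssm, dtype=float)
    if r.ndim != 2 or r.shape[1] != 10 or r.shape[0] == 1:
        raise ValueError("expected an L x 10 matrix with L >= 2")
    # D_s: mean squared deviation of each column from its mean (10 values).
    d_single = ((r - r.mean(axis=0)) ** 2).mean(axis=0)
    # x_{i,i+1} = (p_{i,s} - p_{i+1,t})^2 / 2 for every pair (s, t).
    diff = r[:-1, :, None] - r[1:, None, :]  # shape (L-1, 10, 10)
    d_pair = (diff ** 2 / 2.0).mean(axis=0)  # average over the L-1 positions
    return np.concatenate([d_single, d_pair.ravel()])


# Toy example: three positions, a single non-zero entry.
toy = np.zeros((3, 10))
toy[1, 0] = 2.0
rpssm = rpssm_features(toy)
print(rpssm.shape)
```

The first 10 entries are the single-residue terms and the remaining 100 are the dipeptide terms in row-major (s, t) order.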

AADP-Position Specificity Score Matrices
A protein's structure is closely related to its amino acid composition. For every amino acid sequence in the dataset, AADP-PSSM produces a vector with 20 + 400 = 420 dimensions. AADP-PSSM consists of two parts. First, the amino acid composition is extracted from the PSSM: the vector of the 20 column averages of the PSSM is called AAC-PSSM, where $x_j$ corresponds to the jth amino acid type in the PSSM and represents the average score of mutation to that amino acid during evolution. It is defined as follows:

$$x_j = \frac{1}{L}\sum_{i=1}^{L} p_{i,j}, \quad j = 1, 2, \ldots, 20$$

The traditional dipeptide composition was later extended to the PSSM and denoted DPC-PSSM to avoid the loss of information due to an X in the protein; it is defined as a vector of 400 dimensions:

$$y_{i,j} = \frac{1}{L-1}\sum_{k=1}^{L-1} p_{k,i}\, p_{k+1,j}, \quad i, j = 1, 2, \ldots, 20$$
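The two parts of AADP-PSSM can be illustrated with a short NumPy sketch. It assumes the usual definitions: AAC-PSSM is the vector of column means, and DPC-PSSM averages the products of scores at adjacent positions, $y_{i,j} = \frac{1}{L-1}\sum_{k} p_{k,i}\, p_{k+1,j}$. The function name `aadp_pssm` and the toy matrix are illustrative only.

```python
import numpy as np


def aadp_pssm(pssm):
    """Return the 420-dimensional AADP-PSSM vector (AAC-PSSM + DPC-PSSM)."""
    p = np.asarray(pssm, dtype=float)
    if p.ndim != 2 or p.shape[1] != 20 or p.shape[0] == 1:
        raise ValueError("expected an L x 20 matrix with L >= 2")
    length = p.shape[0]
    aac = p.mean(axis=0)                   # x_j: 20 column averages
    dpc = p[:-1].T @ p[1:] / (length - 1)  # y_{i,j}: 20 x 20 dipeptide terms
    return np.concatenate([aac, dpc.ravel()])


# Toy example with two non-zero scores at adjacent positions.
toy = np.zeros((3, 20))
toy[0, 0] = 1.0
toy[1, 1] = 2.0
vector = aadp_pssm(toy)
print(vector.shape)
```

The matrix product `p[:-1].T @ p[1:]` sums `p[k, i] * p[k + 1, j]` over all adjacent position pairs in one step, which avoids an explicit triple loop.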

Feature Selection
Feature redundancy and the curse of dimensionality often arise during feature extraction. Feature selection not only reduces the risk of overfitting but also improves the model's generalization ability and computational efficiency (Yang et al., 2021a;Ao et al., 2021b;Zhao et al., 2021). In the present paper, we use the max relevance max distance (MRMD) feature selection method to reduce the dimensions of the initial feature set (He et al., 2020).
In MRMD, feature selection is based primarily on the relevance of the feature subset to the target vector and on the redundancy within the subset. To measure relevance, MRMD uses the Pearson correlation coefficient, which is defined as:

$$PCC(X, Y) = \frac{\sum_{k=1}^{N}\left(x_k - \bar{x}\right)\left(y_k - \bar{y}\right)}{\sqrt{\sum_{k=1}^{N}\left(x_k - \bar{x}\right)^2}\,\sqrt{\sum_{k=1}^{N}\left(y_k - \bar{y}\right)^2}}$$

where X and Y are two vectors, $x_k$ and $y_k$ are the kth elements of X and Y, $\bar{x}$ and $\bar{y}$ are their means, and N is the total number of samples. The initial feature set constructed using this method is $F = \{f_1, f_2, f_3, \ldots, f_{930}\}$. The maximum relevance value $maxMR_i$ between feature $f_i$ and the target class vector C is defined as:

$$maxMR_i = \left|PCC\left(\vec{f_i}, \vec{C_i}\right)\right|, \quad 1 \le i \le M$$

where M is the dimension of the initial feature set, $\vec{f_i}$ is the vector composed of the ith feature of each instance, and $\vec{C_i}$ is the vector composed of the target category of each instance.
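To evaluate the similarity between two vectors, MRMD uses three distance functions, the Euclidean distance (ED), cosine similarity (COS) and Tanimoto coefficient (TC):

$$ED(X, Y) = \sqrt{\sum_{k=1}^{N}\left(x_k - y_k\right)^2}$$

$$COS(X, Y) = \frac{X \cdot Y}{\|X\|\,\|Y\|}$$

$$TC(X, Y) = \frac{X \cdot Y}{\|X\|^2 + \|Y\|^2 - X \cdot Y}$$

We use the mean of the three as the maximum distance $maxMD_i$ of feature i:

$$maxMD_i = \frac{1}{3}\left(ED_i + COS_i + TC_i\right)$$

where $ED_i$, $COS_i$ and $TC_i$ denote the respective measures between feature i and the other features. The MRMD values of all features are calculated under the above two criteria. The PageRank algorithm is then used to rank the features of the initial feature set from high to low importance. Features are added to the feature subset one at a time in this order, and each subset is used to train the model to determine which subset is the best.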
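The relevance and distance measures used by MRMD can be sketched as follows. This is a simplified, literal reading of the description above and not the MRMD implementation itself: it assumes each feature's distance term is the average of ED, COS and TC over all other features, it combines relevance and distance by a plain sum, and it omits the PageRank-based ranking. All function names are hypothetical.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    xc = np.asarray(x, dtype=float) - np.mean(x)
    yc = np.asarray(y, dtype=float) - np.mean(y)
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))


def euclidean(x, y):
    return float(np.linalg.norm(np.asarray(x, float) - np.asarray(y, float)))


def cosine(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


def tanimoto(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(x @ y / (x @ x + y @ y - x @ y))


def mrmd_scores(features, target):
    """Score each column of `features` (n_samples x n_features).

    Assumes no constant or all-zero columns, which would divide by zero.
    """
    f = np.asarray(features, dtype=float)
    m = f.shape[1]
    scores = np.zeros(m)
    for i in range(m):
        relevance = abs(pearson(f[:, i], target))
        others = [k for k in range(m) if k != i]
        ed = np.mean([euclidean(f[:, i], f[:, k]) for k in others])
        cs = np.mean([cosine(f[:, i], f[:, k]) for k in others])
        tc = np.mean([tanimoto(f[:, i], f[:, k]) for k in others])
        scores[i] = relevance + (ed + cs + tc) / 3.0  # assumed combination
    return scores


toy_features = [[1, 2, 1], [2, 1, 3], [3, 4, 2], [4, 3, 5]]
toy_target = [0, 0, 1, 1]
toy_scores = mrmd_scores(toy_features, toy_target)
ranking = np.argsort(-toy_scores)  # feature indices, highest score first
print(ranking)
```

Sorting by the combined score stands in for the ranking step here only to keep the sketch short.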

Classification Algorithm
Protein prediction is usually described as a binary classification problem (Zhai et al., 2020;Zhang et al., 2021;Zulfiqar et al., 2021). We selected the random forest learning method for prediction modelling in the present study. Because the random forest method randomly samples both features and instances when constructing its set of decision trees, it is well suited to problems with high feature dimensions. Using RandomizedSearchCV and GridSearchCV for parameter selection, the final random forest model comprises 800 trees. Each decision tree is allowed to use all features and has a maximum depth of 50, with no further limit placed on the individual trees.
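For illustration, the reported configuration maps onto scikit-learn's `RandomForestClassifier` roughly as shown below. The mapping is an assumption: "800 subtrees" is read as `n_estimators=800`, "allowed to use all features" as `max_features=None`, and the maximum depth as `max_depth=50`. The helper name `build_forest` and the fixed `random_state` are not from the text.

```python
from sklearn.ensemble import RandomForestClassifier


def build_forest(random_state=0):
    """Random forest with the hyperparameters reported in the text (assumed mapping)."""
    return RandomForestClassifier(
        n_estimators=800,    # number of trees in the forest
        max_depth=50,        # maximum depth of each decision tree
        max_features=None,   # every split may consider all features
        random_state=random_state,
    )


model = build_forest()
print(model.get_params()["n_estimators"])
```

The returned estimator is trained with `model.fit(X, y)` on the selected feature subset. The search ranges passed to `RandomizedSearchCV` and `GridSearchCV` are not given in the text, so they are not reproduced here.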

Measurements
We selected four different performance measures, accuracy (ACC), specificity (SP), sensitivity (SN) and Matthew's correlation coefficient (MCC), to evaluate the predictive ability of the model used in this study (Wei et al., 2014;Wei et al., 2017b;Manavalan et al., 2019a;Manavalan et al., 2019b;Jin et al., 2019;Su et al., 2019;Li et al., 2020a;Liu et al., 2020a;Ao et al., 2020;Li et al., 2020b;Zhang et al., 2020b;Yu et al., 2020;Zhao et al., 2020;Wang et al., 2021c;Zhu et al., 2021). The equations for these four measures are shown below:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

$$SN = \frac{TP}{TP + FN}$$

$$SP = \frac{TN}{TN + FP}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP represents positive samples predicted to be positive by the model, FP represents negative samples predicted to be positive by the model, TN represents negative samples predicted to be negative by the model, and FN represents positive samples predicted to be negative by the model. In addition to the above four performance measures, the ROC curve is also used to assess the quality of our predictions.
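The four measures follow directly from the confusion-matrix counts, as the sketch below shows using their standard definitions. The example counts (TP = 91, FN = 2, TN = 60, FP = 33) are not given in the paper. They are back-calculated here from the SN and SP reported for the 93 positive and 93 negative samples of PDB186, so they should be treated as an illustrative assumption.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return ACC, SN, SP and MCC computed from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sn = tp / (tp + fn)  # sensitivity: fraction of positives recovered
    sp = tn / (tn + fp)  # specificity: fraction of negatives recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


# Assumed counts, back-calculated from the reported SN and SP on PDB186.
metrics = binary_metrics(tp=91, fp=33, tn=60, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts, ACC, SN and SP round to 81.2%, 97.8% and 64.5%, matching the values reported for the independent test set.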

Performance of Different Features on Training Set PDB1075
A large amount of information on homologous proteins is contained in evolutionarily informative features based on the PSSM. In our method, we selected the evolutionary information-based features PSSM-COMPOSITION, RPSSM, and AADP-PSSM for experimentation. To better show the efficiency of prediction models under different combinations of features, the receiver operating characteristic (ROC) curve was used for analysis. The closer the curve is to the y-axis (that is, to the upper left corner), the better the classification results will be. The area under the curve (AUC) is defined as the area enclosed by the ROC curve and the coordinate axis. The closer the area is to 1, the better the prediction model will be. Random forests can achieve better prediction performance when dealing with high-dimensional features. In this section, we use random forests with default hyperparameters to perform 10-fold cross validation of different feature fusion schemes on the training set PDB1075 and identify the fusion scheme that maximizes the AUC. As shown in Figure 2, the prediction performance of RF was the best after fusing the three features, with an AUC of 0.963. In addition, we also tested the predictive performance of SVM and KNN under different feature fusion schemes, and their optimal feature fusion schemes had AUCs of 0.828 and 0.790, respectively. The ROC curves of SVM and KNN are given in Figure 1 and Figure 2 of the supplementary material, respectively.
Frontiers in Genetics | www.frontiersin.org November 2021 | Volume 12 | Article 811158
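The comparison of feature fusion schemes described in this section can be sketched with scikit-learn as follows. This is an illustrative setup, not the authors' script: the function name `fusion_auc`, the synthetic data and the reduced fold and tree counts in the demo are assumptions chosen so the example runs quickly, and `n_estimators=100` is scikit-learn's default rather than a value from the text.

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def fusion_auc(blocks, y, n_splits=10, n_estimators=100, seed=0):
    """Mean cross-validated AUC for every non-empty combination of feature blocks."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    names = sorted(blocks)
    results = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Fuse the chosen blocks by concatenating them column-wise.
            fused = np.hstack([blocks[name] for name in combo])
            model = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
            scores = cross_val_score(model, fused, y, cv=cv, scoring="roc_auc")
            results[combo] = float(scores.mean())
    return results


# Synthetic demo: only the first block carries class information.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
demo_blocks = {
    "AADP-PSSM": rng.normal(size=(60, 6)) + 2 * y[:, None],
    "PSSM-COMPOSITION": rng.normal(size=(60, 5)),
    "RPSSM": rng.normal(size=(60, 4)),
}
demo_results = fusion_auc(demo_blocks, y, n_splits=5, n_estimators=20)
best_combo = max(demo_results, key=demo_results.get)
print(best_combo, round(demo_results[best_combo], 3))
```

With three feature blocks there are seven non-empty combinations, each scored by its mean AUC across the folds.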

Performance After Feature Selection
For the 930-dimensional initial feature set, we ranked all features from high to low based on their MRMD scores. After obtaining the final feature ranking, we took the first feature as the feature subset and used random forest to check the performance of that subset in 10-fold cross validation on PDB1075. Subsequently, we added features to the subset one at a time, following the ranking order, and repeated the evaluation until all features of the initial feature set were included. Finally, we determined the best predictive accuracy and the optimal feature subset. The results are shown in Figure 3. The feature subset achieves the best accuracy when it contains 267 features, so the optimal feature subset used for training the model is 267-dimensional. The optimal feature subset contains 98 AADP-PSSM features, 142 PSSM-COMPOSITION features, and 27 RPSSM features. The details of the optimal feature subset are given in the supplementary materials. The composition of the optimal feature subset suggests that differences in the distribution of amino acid pairs are the key to identifying DBP among a large number of proteins.
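The incremental search over ranked features can be written as a small loop. The sketch below is generic rather than the authors' code: `incremental_selection` and the `evaluate` callback are hypothetical names, and the toy evaluator simply rewards two "useful" features and penalizes subset size so that the example runs without training a model.

```python
def incremental_selection(ranking, evaluate):
    """Grow the feature subset along `ranking` and keep the best-scoring prefix.

    ranking:  feature indices ordered from most to least important.
    evaluate: callable taking a list of feature indices and returning a score.
    """
    ranking = list(ranking)
    best_score = float("-inf")
    best_size = 0
    history = []
    for size in range(1, len(ranking) + 1):
        score = evaluate(ranking[:size])  # e.g. cross-validated accuracy
        history.append(score)
        if score > best_score:
            best_score, best_size = score, size
    return ranking[:best_size], best_score, history


# Toy evaluator: features 3 and 7 are informative, extra features cost a little.
useful = {3, 7}


def toy_evaluate(subset):
    return len(useful & set(subset)) - 0.01 * len(subset)


selected, best, history = incremental_selection([3, 7, 1, 0], toy_evaluate)
print(selected, round(best, 2))
```

In the setting described above, `evaluate` would train a random forest on the given columns and return its 10-fold cross-validated accuracy, for example via scikit-learn's `cross_val_score`.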

Performance of Different Classification Algorithms
To determine the prediction model with the best performance, we fed the optimal feature subset into four widely used classification algorithms with default hyperparameters (KNN, SVM, RF and naïve Bayes) and compared their performance using 10-fold cross validation. We use ACC, SN, SP, MCC and AUC to evaluate the performance. Experimental results show that the random forest method demonstrates the best classification performance (Figure 4). As shown in Figure

Performance of Different Methods on Test Set PDB186
To evaluate the generalization ability of the prediction model proposed in this paper, we tested the model independently using dataset PDB186. Table 2 compares the performance of this study to other prediction methods on the dataset PDB186.
From Table 2, we can see that on the independent test set PDB186, the ACC, SN and SP of KK-DBP reach 81.2, 97.8 and 64.5%, respectively. In terms of prediction accuracy, KK-DBP is higher than the other existing methods. Compared with Local-DPP, the existing method with the highest accuracy, KK-DBP improves ACC and SN by 2.2 and 5.3%, respectively. Its SP is slightly lower than that of Local-DPP and IDNA-Prot. The results of the independent verification experiments confirm that KK-DBP has reliable predictive performance and can recognize DBP among a large number of unknown proteins more accurately than existing DBP recognition methods.

DISCUSSION AND CONCLUSION
A large number of studies have shown that the classification of DNA-binding proteins has important theoretical and practical significance for future genomics and proteomics research. This paper proposes a DNA-binding protein prediction method, called KK-DBP, that is based on multi-feature fusion and improves the feature extraction method in DNA-binding protein prediction. This method uses PSSM features that contain dipeptide composition information for multi-feature fusion to construct the initial feature set, and it obtains the optimal feature subset for modeling using the max relevance max distance (MRMD) method. Finally, PDB186 was used as an independent test set to further evaluate the effectiveness of our method. On the independent test set, the prediction accuracy, sensitivity and specificity of the model reached 81.2, 97.8 and 64.5%, respectively. KK-DBP surpasses existing methods in prediction accuracy, confirming that our method can identify DBP more accurately than existing methods.
Although our method improves the prediction accuracy of DNA-binding proteins, we still do not know how to construct a better feature extraction algorithm based on sequence and structure information. Therefore, our future research direction will be towards finding more distinguishable feature extraction algorithms (Ding et al., 2016;Zeng et al., 2020a;Yang et al., 2021b;Wang et al., 2021d;Jin et al., 2021) and more suitable classifiers (Ding et al., 2019;Ding et al., 2020a;Ding et al., 2020b;Yang et al., 2021c;Guo et al., 2021) and prediction models (Liu et al., 2020b;Zeng et al., 2020b;Chen et al., 2021;Xu et al., 2021c;Song et al., 2021;Xiong et al., 2021) to better recognise DNA-binding proteins.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding authors.

AUTHOR CONTRIBUTIONS
YJ conceived the algorithm, performed the experiments, analyzed the data, and drafted the manuscript. TZ designed the experiments and revised the manuscript. YJ, SH, and TZ provided suggestions for the study design and the writing of the manuscript. All authors approved the final manuscript.

FUNDING
This work was supported by the Fundamental Research Funds for the Central Universities (2572021BH01) and the National Natural Science Foundation of China (62172087, 62172129).