MoRF-FUNCpred: Molecular Recognition Feature Function Prediction Based on Multi-Label Learning and Ensemble Learning

Intrinsically disordered regions (IDRs), which lack stable structure, are important for protein structure and function. Some IDRs undergo a transition from disorder to order when they bind molecular partners; these regions are called molecular recognition features (MoRFs). There are five main functions of MoRFs: molecular recognition assembler (MoR_assembler), molecular recognition chaperone (MoR_chaperone), molecular recognition display sites (MoR_display_sites), molecular recognition effector (MoR_effector), and molecular recognition scavenger (MoR_scavenger). Research on the functions of MoRFs is important for pharmaceutical development and for understanding disease pathogenesis. However, the existing computational methods can only predict the MoRFs in proteins and fail to distinguish their different functions. In this paper, we treat MoRF function prediction as a multi-label learning task and solve it with the Binary Relevance (BR) strategy. We then use Support Vector Machine (SVM), Logistic Regression (LR), Decision Tree (DT), and Random Forest (RF) as basic models to construct MoRF-FUNCpred through ensemble learning. Experimental results show that MoRF-FUNCpred performs well for MoRF function prediction. To the best of our knowledge, MoRF-FUNCpred is the first predictor of the functions of MoRFs. Availability and Implementation: The standalone package of MoRF-FUNCpred can be accessed from https://github.com/LiangYu-Xidian/MoRF-FUNCpred.


INTRODUCTION
Intrinsically disordered regions (IDRs) and intrinsically disordered proteins (IDPs) are sequence regions and proteins that lack stable 3D structures (Deng et al., 2012; Deng et al., 2015). IDPs and IDRs are widely distributed in organisms. Research on IDPs and IDRs contributes to biomedicine and biology, for example in drug discovery and protein structure prediction. Molecular recognition features (MoRFs) are regions within IDRs that undergo a transition from a disordered state to an ordered state (Cheng et al., 2007). Studies of MoRFs indicate that these functional sites may serve as druggable disease targets, and some drugs have been discovered through these sites of action (Kumar et al., 2017; Li et al., 2020; Wang et al., 2020; Zhang et al., 2020; Lv et al., 2021a; Joshi et al., 2021; Shaker et al., 2021; Yan et al., 2021).
Existing methods for determining MoRF functions rely on biological analyses; for example, a target's function can be recognized through analysis of cellular viability by flow cytometry (Johansson et al., 1998). Accurate prediction of the function of a MoRF region is conducive to understanding the mechanism of cancer and discovering targeted drugs. DisProt is a database of IDPs that also provides functional annotations of IDPs (Piovesan et al., 2017). In our research, we found that the five MoRF functions annotated in DisProt are not mutually exclusive. Therefore, the prediction of MoRF function is a multi-label task. Automatic discovery methods are needed to expand the MoRF functional annotation.
In this study, we propose MoRF-FUNCpred, the first computational method for predicting the functions of MoRFs in IDPs. The method works at the residue level: for each residue of an IDP, it predicts the probability that the residue carries each of the five MoRF functions. MoRF-FUNCpred uses an ensemble learning (Dietterich, 2000) model to produce these probabilities. The individual classifiers are Support Vector Machine (SVM) (Vapnik and Vapnik, 1998), Logistic Regression (LR) (Cessie and Houwelingen, 1992), Decision Tree (DT) (Safavian and Landgrebe, 1991) and Random Forest (RF) (Breiman, 2001). The four models are integrated using a weighted averaging strategy, and the weights of the models are obtained through a genetic algorithm (Maulik and Bandyopadhyay, 2000).
The innovation of this work lies in the following: 1) we construct a dataset of intrinsically disordered proteins with MoRF function annotations; 2) we use an ensemble model to integrate the different advantages of the individual models; 3) we propose the first model, MoRF-FUNCpred, for predicting the functions of molecular recognition features in intrinsically disordered proteins.

Datasets
The data were extracted from the DisProt database, which is a database of IDPs and provides functional annotations of IDPs (Piovesan et al., 2017). The data can be downloaded from the site: https://disprot.org/api/search?release=2020_12&show_ambiguous=true&show_obsol (Hatos et al., 2020). This version of the data provides 1590 intrinsically disordered proteins, 596 of which have functional annotations of disordered regions. The 7 functions of intrinsically disordered regions were divided into functions of MoRFs (MoR_assembler, MoR_chaperone, MoR_display_sites, MoR_effector and MoR_scavenger) and other functions (entropic chain and biological condensation).
After further screening of the 596 protein sequences obtained above, 3 proteins were deleted because of incorrect residue expression. Proteins whose annotated residues carry only a mixture of MoRF functions and other functions were also deleted. To better construct the training set and testing set, some protein sequences with multi-MoRF functional residues were deleted, and finally, we obtained 565 sequences.
To reduce the similarity between the training set and the testing set, we ran BlastClust (Altschul et al., 1990) with length coverage >70% and identity threshold = 25% on the 565 sequences, which grouped them into 508 clusters. Next, we randomly divided the clusters into a training set and a testing set so that the two sets contained sequences in a ratio of about 1:1. This produced one set containing 243 clusters and another containing 265 clusters, with 283 and 282 sequences, respectively.
In this study, residues were used as training and testing samples, and we selected them as follows: residues with none of the 7 IDR functions were dropped; residues with both other functions and any of the 5 MoRF functions were also dropped; residues with only MoRF functions were selected as positive samples; and the remaining residues, which have only other functions, were selected as negative samples. See Table 1 for the number of sequences with different functional residues and the number of different functional residues in the training set and testing set.
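The cluster-level split described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `split_by_cluster`, the toy cluster contents, and the greedy balancing rule (each shuffled cluster goes to the currently smaller set) are assumptions chosen so that whole clusters stay together and the two sets end up with roughly equal numbers of sequences.

```python
import random

def split_by_cluster(clusters, seed=0):
    """Split clusters into two sets so that no cluster spans both sets.

    clusters: list of lists of sequence IDs (one inner list per cluster).
    Returns (train_ids, test_ids).
    """
    rng = random.Random(seed)
    shuffled = list(clusters)      # copy so the caller's list is untouched
    rng.shuffle(shuffled)          # random order of clusters
    train, test = [], []
    for cluster in shuffled:
        # Assumed balancing rule: add the whole cluster to the smaller set.
        target = train if len(train) <= len(test) else test
        target.extend(cluster)
    return train, test

# Hypothetical clusters of sequence IDs (e.g. parsed from clustering output).
clusters = [["P1", "P2"], ["P3"], ["P4"], ["P5", "P6", "P7"], ["P8"]]
train_ids, test_ids = split_by_cluster(clusters)
```

Because similar sequences share a cluster, keeping each cluster on one side of the split limits similarity between training and testing data.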
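The residue selection rule can be written as a small function. The sketch below is illustrative only: `label_residue` is a hypothetical helper, annotations are assumed to arrive as a list of function names per residue, and negative samples are assumed to receive an all-zero label vector.

```python
MORF_FUNCTIONS = ("MoR_assembler", "MoR_chaperone", "MoR_display_sites",
                  "MoR_effector", "MoR_scavenger")
OTHER_FUNCTIONS = ("entropic chain", "biological condensation")

def label_residue(annotations):
    """Return a 5-dimensional 0/1 label vector, or None if the residue is dropped."""
    ann = set(annotations)
    morf = ann & set(MORF_FUNCTIONS)
    other = ann & set(OTHER_FUNCTIONS)
    if not morf and not other:
        return None   # residue has none of the 7 IDR functions
    if morf and other:
        return None   # residue mixes MoRF functions and other functions
    # Positive sample if any MoRF function is present, otherwise negative (all zeros).
    return [1 if f in morf else 0 for f in MORF_FUNCTIONS]

positive = label_residue(["MoR_assembler", "MoR_effector"])
negative = label_residue(["entropic chain"])
```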

Architecture of MoRF-FUNCpred
The flowchart of MoRF-FUNCpred is shown in Figure 1, which includes protein sequences, PSFM representation and training phase.

PSFM Representation
In this study, protein evolutionary information was used as the protein sequence representation. The position specific frequency matrix (PSFM) is a kind of protein evolutionary information that gives the frequency of each of the 20 amino acids at each sequence position. PSFM has been used as a protein sequence representation in many studies (Wang et al., 2006; Liu et al., 2012; Zhu et al., 2019). In our paper, the PSFM was generated by using PSI-BLAST (Altschul et al., 1997) to search against the non-redundant database NRDB90 (Holm and Sander, 1998) with default parameters, except that the number of iterations and the e-value were set to 10 and 0.001, respectively.
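Protein sequence P of length L can be expressed as:
$$P = R_1 R_2 \cdots R_L$$
where $R_i$ represents an amino acid of the protein sequence, and the subscript i indicates the ith residue in this protein.
The PSFM profile of protein P is a matrix whose dimensions are L × 20:
$$\mathrm{PSFM} = \begin{bmatrix} F_{1,1} & F_{1,2} & \cdots & F_{1,20} \\ F_{2,1} & F_{2,2} & \cdots & F_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ F_{L,1} & F_{L,2} & \cdots & F_{L,20} \end{bmatrix}$$
where 20 is the total number of standard amino acids. The element $F_{i,j}$ is the probability of amino acid j occurring at position i of P.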
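To make the L × 20 layout concrete, the toy sketch below builds a frequency matrix from a handful of aligned sequences. It is not how the paper's PSFM is produced (that comes from PSI-BLAST); the function `frequency_matrix`, the alphabet ordering, and the example alignment are assumptions for illustration only.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the 20 standard residues

def frequency_matrix(aligned_seqs):
    """Return an L x 20 matrix of per-position amino acid frequencies."""
    length = len(aligned_seqs[0])
    matrix = []
    for i in range(length):
        # Ignore gaps and non-standard symbols in this alignment column.
        column = [s[i] for s in aligned_seqs if s[i] in AMINO_ACIDS]
        total = len(column)
        row = [column.count(a) / total if total else 0.0 for a in AMINO_ACIDS]
        matrix.append(row)
    return matrix

# Hypothetical alignment of four sequences over three positions.
psfm = frequency_matrix(["MKV", "MRV", "MKL", "-KV"])
```

Each row of `psfm` is the 20-dimensional feature vector of one residue, which matches the residue-level representation used here.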

Multi-Label Learning Strategy
The functions of MoRFs can be divided into five categories: MoR_assembler, MoR_chaperone, MoR_display_sites, MoR_effector and MoR_scavenger. According to the DisProt database, the MoRF functional regions overlap, so a single residue can carry multiple functions. Therefore, we treat MoRF function prediction as a multi-label learning problem.
In this study, we wanted to make full use of positive samples. Therefore, the multi-label learning strategy "Binary Relevance" (BR) (Boutell et al., 2004) was employed. Under the "BR" strategy, the multi-label samples can be used as positive samples in each predictor of the corresponding label. We called this advantage "crossing training".
To explore the impact of different machine learning models on this task, four machine learning classifiers with the "BR" strategy were used to predict the probability of each MoRF function. Therefore, as Figure 1 shows, for each machine learning model, five classifiers are trained to predict the different MoRF functions. Each classifier is trained on the residue features and the labels of one MoRF function, so that it predicts that function. In total, 20 classifiers are trained in our model.
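The decomposition behind the "BR" strategy can be sketched in a few lines. The helper `binary_relevance_targets` and the toy label vectors are hypothetical; the sketch only shows how one multi-label dataset becomes five binary targets and why 4 models × 5 functions gives 20 classifiers.

```python
LABELS = ("MoR_assembler", "MoR_chaperone", "MoR_display_sites",
          "MoR_effector", "MoR_scavenger")
MODELS = ("SVM", "LR", "DT", "RF")

def binary_relevance_targets(label_vectors):
    """Turn 5-dimensional label vectors into one binary target list per function."""
    return {name: [vec[j] for vec in label_vectors]
            for j, name in enumerate(LABELS)}

label_vectors = [
    [1, 0, 0, 1, 0],  # multi-label residue: assembler and effector
    [0, 0, 1, 0, 0],  # display_sites only
    [0, 0, 0, 0, 0],  # negative residue
]
targets = binary_relevance_targets(label_vectors)

# One classifier per (model, function) pair.
classifier_slots = [(m, label) for m in MODELS for label in LABELS]
```

The first residue is a positive sample in both the MoR_assembler and the MoR_effector targets, which is the reuse of multi-label samples that the text calls "crossing training".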

Ensemble Learning
Ensemble learning is used in many protein-related tasks and performs well, for example in recognition of multiple lysine PTM sites and the different types of these sites (Qiu et al., 2016a), recognition of phosphorylation sites in proteins (Qiu et al., 2016b) and recognition of protein folds. An ensemble model usually performs better than its individual predictors.
The flowchart of the ensemble strategy on different machine learning methods is given in the training phase of Figure 1.

Basic Classifiers
The general structure of ensemble learning is to (i) generate a set of basic classifiers and (ii) select a combination strategy to ensemble the basic classifiers. This structure raises two common questions: which basic classifiers to choose, and which combination strategy to select.
For the basic classifiers, we choose four common machine learning models: Support Vector Machine (SVM), Logistic Regression (LR), Decision Tree (DT) and Random Forest (RF). The four models are chosen because SVM can use the kernel trick to obtain nonlinear fitting ability, LR can solve the problem of linear fitting, DT usually has good performance in dealing with continuous features, and RF can balance errors when dealing with unbalanced datasets. To illustrate the complementarity of the four classifiers at the data level, we define the distance function between the classifiers (Liu et al., 2017):
$$\mathrm{Distance}[C(i), C(j)] = 1 - \frac{1}{m}\sum_{k=1}^{m} d_{ik}\,\Delta\,d_{jk}$$
where m represents the number of samples in the data, $d_{ik}$ represents the misclassification probability of classifier C(i) on the kth sample, and $d_{ik}\,\Delta\,d_{jk}$ can be calculated by (Liu et al., 2017):
$$d_{ik}\,\Delta\,d_{jk} = \begin{cases} 1, & \text{if } C(i) \text{ and } C(j) \text{ both incorrectly predict the } k\text{th sample} \\ 0, & \text{otherwise} \end{cases}$$
The value of Distance[C(i),C(j)] ranges from 0 to 1, where 0 means that classifier C(i) and classifier C(j) are completely noncomplementary, and 1 means that classifier C(i) and classifier C(j) are completely complementary (Liu et al., 2017). The value of Distance[C(i),C(i)] is also between 0 and 1 and reflects the predictive ability of classifier C(i): 1 means that classifier C(i) predicts all the data correctly, and 0 means that classifier C(i) predicts all the data incorrectly.
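A small sketch of this distance, assuming it equals one minus the fraction of samples that both classifiers misclassify (which is consistent with the stated end points 0 and 1). The function name `distance` and the boolean error lists are illustrative choices, not part of the original method description.

```python
def distance(errors_i, errors_j):
    """Complementarity distance between two classifiers.

    errors_i, errors_j: lists of booleans, True where the classifier
    misclassifies the corresponding sample.
    """
    m = len(errors_i)
    both_wrong = sum(1 for a, b in zip(errors_i, errors_j) if a and b)
    return 1 - both_wrong / m

# Hypothetical error patterns on four samples.
errors_a = [True, False, False, False]
errors_b = [False, True, False, False]
d_ab = distance(errors_a, errors_b)   # the classifiers never fail together
d_aa = distance(errors_a, errors_a)   # equals the accuracy of classifier a
```

With this definition, the distance of a classifier to itself reduces to its accuracy, which is why the self-distance can be read as predictive ability.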
For the combination strategy, so that each model plays the same role for every residue, the weighted averaging strategy was used to ensemble the 4 basic machine learning methods. The weighted averaging strategy can be represented as follows:
$$\text{MoRF-FUNCpred} = W_{SVM} \cdot \mathrm{SVM} + W_{LR} \cdot \mathrm{LR} + W_{DT} \cdot \mathrm{DT} + W_{RF} \cdot \mathrm{RF}$$
where $W_{SVM}$, $W_{LR}$, $W_{DT}$ and $W_{RF}$ represent the weight of each model in the ensemble model, the sum of the four values is 1, and SVM, LR, DT, and RF represent the 4 models that use the corresponding machine learning methods.
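The weighted averaging step can be illustrated as below. The function `weighted_average`, the weights, and the per-model probabilities are made-up values for demonstration; each model is assumed to output a 5-dimensional probability vector for one residue.

```python
def weighted_average(model_probs, weights):
    """Combine per-model probability vectors with weights that sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-6:
        raise ValueError("weights must sum to 1")
    n_labels = len(next(iter(model_probs.values())))
    return [sum(weights[m] * model_probs[m][j] for m in weights)
            for j in range(n_labels)]

# Hypothetical weights and predictions for a single residue.
weights = {"SVM": 0.4, "LR": 0.3, "DT": 0.2, "RF": 0.1}
model_probs = {
    "SVM": [0.9, 0.1, 0.2, 0.6, 0.0],
    "LR":  [0.8, 0.4, 0.3, 0.7, 0.1],
    "DT":  [1.0, 0.0, 0.0, 1.0, 0.0],
    "RF":  [0.7, 0.2, 0.1, 0.5, 0.0],
}
ensemble = weighted_average(model_probs, weights)
```

Because the weights are non-negative and sum to 1, every ensemble output stays in the range 0 to 1 and can still be read as a probability.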

Genetic Algorithm
To obtain a set of weights $W_{SVM}$, $W_{LR}$, $W_{DT}$ and $W_{RF}$ that maximizes the Macro_Accuracy (see this metric in section Performance Evaluation Strategy) of MoRF-FUNCpred on the training set, we formulate the choice of weights as a constrained optimization problem. Since the search space for this problem is large, a genetic algorithm is used to search for the optimal solution efficiently. In our study, the Macro_Accuracy on the training set was used as the fitness, and the fitness was used to select outstanding individuals and eliminate individuals not adapted to the current environment. The characteristics of the better individuals are passed on to the next generation. The genetic algorithm generates new individuals through crossover and mutation. In this way, the attributes that adapt to the environment are retained, and new attributes are introduced. After hundreds of iterations, the optimal weights can be obtained (Maulik and Bandyopadhyay, 2000).
The population size is set to 50, the constraint condition is $W_{SVM} + W_{LR} + W_{DT} + W_{RF} = 1$ with $0 \le W_{SVM} \le 1$, $0 \le W_{LR} \le 1$, $0 \le W_{DT} \le 1$, $0 \le W_{RF} \le 1$, the mutation probability is 0.001, and the maximum number of iterations is 800.
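The sketch below shows one way such a weight search could look. Only the population size (50), mutation probability (0.001), and iteration cap (800) come from the text; the truncation selection, single-point crossover, normalization used to enforce the sum-to-1 constraint, and the toy fitness function are all assumptions. In the real setting the fitness would be the training-set Macro_Accuracy of the weighted ensemble.

```python
import random

def normalize(w):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(w)
    return [x / total for x in w] if total > 0 else [1.0 / len(w)] * len(w)

def genetic_search(fitness, n_weights=4, pop_size=50, mutation_prob=0.001,
                   max_iter=800, seed=0):
    rng = random.Random(seed)
    pop = [normalize([rng.random() for _ in range(n_weights)])
           for _ in range(pop_size)]
    for _ in range(max_iter):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_weights)     # single-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: each gene is re-drawn with a small probability.
            child = [rng.random() if rng.random() < mutation_prob else g
                     for g in child]
            children.append(normalize(child))
        pop = survivors + children
    return max(pop, key=fitness)

# Toy fitness (an assumption): closeness to an arbitrary target weight vector.
target = [0.4, 0.3, 0.2, 0.1]
def toy_fitness(w):
    return -sum((a - b) ** 2 for a, b in zip(w, target))

best = genetic_search(toy_fitness, max_iter=100)
```

Normalizing every individual keeps the whole population inside the feasible region, so the constraint never has to be checked separately during the search.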

Performance Evaluation Strategy
In this paper, we use four metrics to measure the quality of a classifier: (i) the accuracy of each function, (ii) the overall metric Macro_Accuracy, which measures the performance of the model across functions, (iii) sensitivity (sn), which measures the model's performance on positive samples, and (iv) specificity (sp), which measures the model's performance on negative samples (Guo et al., 2020; Tao et al., 2020; Zhai et al., 2020; Wang et al., 2021; Yang et al., 2021).
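The prediction of a residue by the model is a vector of dimension 5. Each element is a fraction from 0 to 1 and represents the probability that the residue has the MoR_assembler, MoR_chaperone, MoR_display_sites, MoR_effector or MoR_scavenger function, respectively. The fraction can also be converted to a value of 0 or 1 by setting the threshold value to 0.5. The accuracy of each function can be calculated by (Zhang and Zhou, 2013):
$$\mathrm{Accuracy}_f = \frac{TP_f + TN_f}{TP_f + TN_f + FP_f + FN_f}$$
$$\mathrm{Macro\_Accuracy} = \frac{1}{N}\left(\mathrm{MoR\_assembler\_Accuracy} + \mathrm{MoR\_chaperone\_Accuracy} + \mathrm{MoR\_display\_sites\_Accuracy} + \mathrm{MoR\_effector\_Accuracy} + \mathrm{MoR\_scavenger\_Accuracy}\right)$$
where $TP_f$, $TN_f$, $FP_f$ and $FN_f$ are the numbers of true positive, true negative, false positive and false negative residues for function f; MoR_assembler_Accuracy, MoR_chaperone_Accuracy, MoR_display_sites_Accuracy, MoR_effector_Accuracy, and MoR_scavenger_Accuracy represent the accuracy of each function; and N represents the number of labels.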
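To calculate the prediction performance of the model for positive and negative samples of each function in the testing set, we calculated the sensitivity (sn) and specificity (sp) for each MoRF function (Jiang et al., 2013; Zhang and Zhou, 2013; Lv et al., 2020b; Tahir and Idris, 2020; Wan and Tan, 2020; Xie and Zhao, 2020; Lv et al., 2021b; Gao et al., 2021):
$$sn = \frac{TP}{TP + FN}$$
$$sp = \frac{TN}{TN + FP}$$
where TP, TN, FP and FN are the numbers of true positive, true negative, false positive and false negative residues for the function being evaluated.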
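The metrics in this section can be computed with a few helper functions. This sketch assumes the standard confusion-matrix definitions of accuracy, sensitivity, and specificity, takes Macro_Accuracy as the mean of the per-function accuracies, and treats a probability of exactly 0.5 as positive; the helper names and the toy data are illustrative.

```python
def binarize(probs, threshold=0.5):
    """Convert probabilities to 0/1 labels (assumes >= threshold means positive)."""
    return [1 if p >= threshold else 0 for p in probs]

def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(y_true, y_pred):
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if tp + fn else 0.0

def specificity(y_true, y_pred):
    _, tn, fp, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp) if tn + fp else 0.0

def macro_accuracy(true_by_label, pred_by_label):
    """Mean of the per-function accuracies."""
    accs = [accuracy(true_by_label[k], pred_by_label[k]) for k in true_by_label]
    return sum(accs) / len(accs)

# Hypothetical labels and predicted probabilities for one MoRF function.
y_true = [1, 1, 0, 0, 0]
y_pred = binarize([0.9, 0.2, 0.1, 0.4, 0.7])
```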

Performance Comparison
We tuned the parameters of the four models on the training set with a grid search strategy. The parameters adopted for SVM were C = 16, gamma = 32, and kernel = rbf. The parameters adopted for LR were penalty = l2 and C = 0.03125. The parameters adopted for DT were criterion = gini and splitter = best. The parameters adopted for RF were n_estimators = 80 and max_features = sqrt. See Table 2 for the value ranges of the hyperparameters.
We evaluate the overall metrics Macro_Accuracy and accuracy of each function (using MoR_assembler_Accuracy, MoR_chaperone_Accuracy, MoR_display_sites_Accuracy, MoR_effector_Accuracy and MoR_scavenger_Accuracy to represent the accuracy of different functions) of four basic models in the testing set. We can see the metrics of the four models in the testing set in Table 3.
From this table, we can find the following: (i) For all models, the accuracy on the MoR_assembler and MoR_effector functions is lower than on the other three functions. A key reason is that our dataset contains more positive samples for the MoR_assembler and MoR_effector functions, so all models try to learn more information from positive samples. Although accuracy is reduced, more positive samples are predicted correctly. (ii) The difference between the basic models is large. SVM and RF perform better than LR and DT not only in the overall metric (Macro_Accuracy) but also in the accuracy of each function. This is because different models emphasize different aspects; for example, some try to predict as many positive samples as possible, whereas others try to predict all negative samples correctly. (iii) The LR model is the worst of the four basic models in every metric, and in MoR_assembler function prediction, the accuracy of the LR model is lower than 0.5. The large gap between the SVM model and the LR model probably shows that the PSFM feature is not strictly linearly separable in the task of MoRF function classification, and that LR tries to predict more positive samples, which causes low accuracy. However, the LR model still has its advantages. To find more specific differences between the models, we use the metrics sn and sp to see the extent to which positive and negative samples can be predicted for each function. Results are provided in Table 4.
As we can see in Table 4, regardless of the proportion of positive and negative samples in the training data, the LR model's results on the testing data vary less than those of the other models. In fact, the greatest advantage of LR is that its prediction ability on positive samples is much better than that of the other three models. However, the LR model performs poorly on negative samples. In contrast, SVM, DT, and RF behave similarly: these models perform well on negative samples, and on positive samples they perform better for the MoR_assembler and MoR_effector functions than for the other functions. These differences between the models make ensemble learning worthwhile.
The sn of the SVM, DT, and RF models is low, and the sp of these models is high. When a MoRF function has many positive samples, such as MoR_assembler and MoR_effector, sn is higher and sp is lower than for the MoRF functions with fewer positive samples.
Frontiers in Pharmacology | www.frontiersin.org | March 2022 | Volume 13 | Article 856417

Complementarity of the Four Basic Classifiers
We calculate the distance between each pair of models on the training set for each of the five MoRF functions. The experimental results are shown in Figure 2. As seen from Figure 2, (i) for each MoRF function, the distance between a model and itself is greater than 0.75, which shows that each of the four models has good predictive capability. (ii) For all 5 MoRF functions, the distance between any two different models is greater than 0.95, which shows that each pair of models is highly complementary. (iii) DT and RF have similar distances to the four models. The main reason for this phenomenon is that RF itself is a model formed by integrating many decision trees.

Performance of Ensemble Model
We adopt the weighted average method in ensemble learning; that is, four weights were set for the four models, and the sum of the weights was 1. The weight of each model represents its importance. Through the genetic algorithm, we calculated that the weight of SVM was 0.31455477, the weight of LR was 0.32997175, the weight of DT was 0.28779645 and the weight of RF was 0.06767703. The final ensemble learning results are shown in Table 5. In terms of the overall indicator Macro_Accuracy, the ensemble learning results are better than the best results of a single model. However, MoR_chaperone_Accuracy and MoR_scavenger_Accuracy are slightly worse than the best result of a single model; that is, because the ensemble model is optimized for the overall metric, it improves only some of the per-function metrics. For example, it may enhance the accuracy on positive samples for some functions at the price of reducing the accuracy on negative samples for some functions. Because of the imbalanced dataset, improving the ability to predict positive samples cannot always improve both sn and sp.

Performance in Entire Protein Sequence
MoRF-FUNCpred is trained on the PSFM features and corresponding labels of residues screened from the protein sequences. When other researchers use the provided interface to predict the MoRF functions of a protein, however, they input the entire protein sequence. MoRFs usually appear as sequence segments of 5-70 residues. Therefore, our MoRF function predictions should also appear as sequence segments with lengths of 5-70. To verify whether our prediction model has this property, we randomly extracted a sequence from the testing set and input it to the web server. As shown in Figure 3, we input the protein sequence with the identifier DP01087. Three long sequence fragments were predicted as MoR_assembler functions, which is very similar to the real MoR_assembler annotation of residues 1-101 in the DisProt database, but many discrete residue fragments were also predicted as MoR_assembler functions.
Therefore, although MoRF-FUNCpred takes the features and labels of individual residues as input, its predictions still retain the segment-level properties of MoRFs. From Figure 3, we can also see that several discrete residues are still predicted to have the MoR_assembler function. This phenomenon is mainly due to the input of our models and the PSFM features.
The input of the model is the features and labels of residues, and residue features cannot completely reflect sequence properties. Only PSFM features are used in MoRF-FUNCpred, and the ability of the PSFM features to capture sequence properties is limited, so MoRF-FUNCpred still has room for improvement.
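One way to check whether residue-level predictions form MoRF-like segments is to extract runs of consecutive positive residues and keep those within the 5-70 length range mentioned above. The sketch below is an illustrative analysis aid, not part of MoRF-FUNCpred; the function names and the toy prediction vector are assumptions.

```python
from itertools import groupby

def runs_of_ones(labels):
    """Return (start, end) pairs, 1-based and inclusive, for each run of 1s."""
    runs, pos = [], 1
    for value, group in groupby(labels):
        n = len(list(group))
        if value == 1:
            runs.append((pos, pos + n - 1))
        pos += n
    return runs

def morf_like_segments(labels, min_len=5, max_len=70):
    """Keep only runs whose length falls in the typical MoRF range."""
    return [(s, e) for s, e in runs_of_ones(labels)
            if min_len <= e - s + 1 <= max_len]

# Hypothetical per-residue predictions for one MoRF function.
pred = [0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1]
all_runs = runs_of_ones(pred)
segments = morf_like_segments(pred)
```

Runs that fall outside the length range, such as the single residue at position 9 in this toy example, correspond to the discrete predictions discussed above.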

CONCLUSION
The existing methods for determining the functions of MoRFs in IDPs rely mainly on analysis of cellular viability by flow cytometry. The problem with these methods is that the experimental period is long and the experiments are expensive. Predicting the functions of MoRFs by computational methods can both save time and reduce experimental costs. Computational methods can be used to screen IDPs initially, and the functions of MoRFs can then be measured more accurately with biological experiments.
In this study, we propose the first MoRF function predictor, MoRF-FUNCpred, which predicts the functions of MoRFs at the residue level. MoRF-FUNCpred regards the residue MoRF function prediction task as a multi-label learning task. MoRF-FUNCpred uses PSFM features as the feature representation of residues and uses SVM, LR, DT, and RF combined with the "BR" strategy to carry out the MoRF function prediction task. To utilize the complementarity between the models, SVM, LR, DT, and RF are integrated through the weighted averaging method of ensemble learning, and the weight of each model is obtained through the genetic algorithm. With the best parameters found by grid search, each single machine learning model (SVM, LR, DT, and RF) achieves an overall Macro_Accuracy greater than 0.5 for MoRF function prediction. Compared with the single machine learning models, the ensemble model MoRF-FUNCpred shows better performance. In addition, although MoRF-FUNCpred is trained using residue data, its prediction results retain part of the sequence-level nature of MoRFs. At the same time, this paper constructs the first dataset on the functions of MoRFs, which will help further research on this task.
The main dilemma facing MoRF function prediction is that the existing IDPs containing MoRF functions are few, and it is difficult to complete the training tasks at the protein level. MoRF-FUNCpred mainly has the following problems. The use of a single feature of PSFM to represent residues may result in insufficient expression of residues. Using the "BR" strategy to complete the multi-label learning task may cause the model to ignore the correlation between the labels. In future work, we can explore the following aspects. 1) Use more complex features to represent residues, such as fusing multiple features to represent residues.