Prediction of Hormone-Binding Proteins Based on K-mer Feature Representation and Naive Bayes

Hormone binding protein (HBP) is a soluble carrier protein that interacts selectively with different types of hormones and has various effects on the body’s life activities. HBPs play an important role in the growth process of organisms, but their specific role is still unclear. Therefore, correctly identifying HBPs is the first step towards understanding and studying their biological function. However, due to their high cost and long experimental period, it is difficult for traditional biochemical experiments to correctly identify HBPs from an increasing number of proteins, so the real characterization of HBPs has become a challenging task for researchers. To measure the effectiveness of HBPs, an accurate and reliable prediction model for their identification is desirable. In this paper, we construct the prediction model HBP_NB. First, HBPs data were collected from the UniProt database, and a dataset was established. Then, based on the established high-quality dataset, the k-mer (K = 3) feature representation method was used to extract features. Second, the feature selection algorithm was used to reduce the dimensionality of the extracted features and select the appropriate optimal feature set. Finally, the selected features are input into Naive Bayes to construct the prediction model, and the model is evaluated by using 10-fold cross-validation. The final results were 95.45% accuracy, 94.17% sensitivity and 96.73% specificity. These results indicate that our model is feasible and effective.


INTRODUCTION
With the rapid development of society, people have higher and higher requirements for medical and health care (Lin, 2020). Therefore, it is urgent to learn more about the structure and function of proteins in order to explain more of the meaning of life and promote the development of biomedicine and other fields (Wang et al., 2020a;Qu et al., 2021). However, there is a difficulty in the current research, that is, how to use its sequence information to predict proteins effectively. Although effective prediction of protein sequences can be made using physical, chemical and biological experiments, these methods are costly and time consuming.
Hormone binding proteins (HBPs) are carrier proteins that bind specifically to targeted hormones and were first identified in the plasma of pregnant mice, rabbits and humans (Mortezaeefar et al., 2019;Niu et al., 2021a). They are involved in hormonal regulation in living organisms. HBPs not only regulate the amount of hormones reaching the target cell to produce the desired effect  but also regulate non-protein-binding or free-circulating active steroid hormones, which are thought to be the main gatekeepers of steroid effects. Sexual HBPs, mainly produced in the liver, combine with sexual steroid hormones to regulate their bioavailability. The incorrect expression of HBPs, however, can cause various diseases (Tan et al., 2019).
Therefore, understanding the function and regulatory mechanism of HBPs has become very important. Accurately identifying HBPs is the first step in studying their function. Traditional HBPs identification methods involve wet biochemical experiments, such as immunoprecipitation, chromatography, or cross-linking (Sohm et al., 1998;Zhang and Marchant, 1999;Einarsdóttir et al., 2014;Cheng et al., 2016;Fang et al., 2019). These experimental methods are time-consuming and expensive, and with the discovery of a large number of protein sequences, it is difficult to determine HBPs through biochemical experiments. Therefore, it is necessary to establish an effective recognition model to identify HBPs (Akbar et al., 2020). The description of the characteristics of the protein sequence method contains a lot of information, such as the chemical and physical properties of amino acids, sequence characteristics, feature extraction algorithm for classification algorithm which has great impact on the design and the classification of results. Generally, prediction techniques based on machine learning consist of three steps: feature extraction, construction of predictors, and performance evaluation (Liu, 2017;Wang et al., 2018;Zhang et al., 2019). In 2018, Tang et al. (Hua et al., 2018). developed a method based on support vector machines to identify HBPs, which uses the optimal characteristic coding protein obtained by using the optimized dipeptide composition. Subsequently, Basith et al. developed the computational predictor iGHBP, which combined the dipeptide composition and the value of the amino acid index to obtain the optimal selection and predict the construction model (Basith et al., 2018). In this paper, we constructed a prediction model, HBP_NB, to correctly identify HBPs. First, the k-mer (Liu et al., 2008;Christopher et al., 2013;Liu et al., 2015a;Manavalan et al., 2019) method was used to obtain the frequency characteristics of protein sequences, and then the F-score value method was used to select the feature subset. Finally, input the obtained features into Naive Bayes (Gong and Tian, 2010;He et al., 2010;Gumus et al., 2014;Hu et al., 2020;Hu et al., 2021a;Hu et al., 2021b) to construct the prediction model.

Main Process of the Article
Machine learning frameworks have been used to identify multiple protein types, such as DNA binding proteins (Zeng et al., 2015;Qu et al., 2017;Shen and Zou, 2020), RNA binding proteins (Xiao et al., 2017;Lei et al., 2021), lncRNA interacting proteins Liu, 2020), and drug targets (Yan et al., 2016;Wang et al., 2020b;Wang et al., 2020c). Since the recognition of protein sequences includes two important steps sequence feature extraction and classifier selection the effective combination of feature extraction algorithms and classifiers has also been extensively studied . In this paper, we propose a predictive model for identifying hormone-binding proteins based on Naïve Bayes.
HBPs prediction analysis was carried out through the following five steps: 1) HBPs and non-HBPs were searched and downloaded from UniProt, and the similarity threshold of protein sequences was set by the CD-HIT program to construct a high-quality dataset ; 2) feature extraction of protein sequences was performed using the k-mer feature coding method; 3) the extracted features were selected to improve the accuracy of classification; 4) different classification methods were used to classify and predict the selected feature subset and select the best classification methods; and 5) Performance evaluation. Figure 1 shows the structural framework for identifying HBPs in this paper. This section will introduce dataset establishment, feature selection methods and classification methods in detail.

Dataset
It is necessary to collect sufficient correlation function data as the basis of statistical model prediction. Therefore, it is first necessary to construct an objective dataset to ensure the effectiveness and robustness of the model. Therefore, we adopt the benchmark dataset constructed by Tang et al. (Tang et al., 2018). To build this dataset, follow these steps. The first step was to search and collect HBPs from UniProt (Bairoch et al., 2009;Schneider, 2012) and to generate the original HBPs dataset by selecting the hormone binding keywords in the molecular function items of the gene body (Ashburner et al., 2000). Consequently, 357 HBPs with manual annotation and review were selected. In the second step, to avoid the high similarity of protein sequences affecting the results, we used the CD-HIT (Li and Godzik, 2006;Fu et al., 2012) program to set the truncation threshold to 0.6 to remove highly similar HBPs sequences. In the third step, when the protein sequence in the dataset contains unknown residues (such as "X," "Z," and "B"), it will affect the model prediction results, so protein sequences containing unknown residues need to be excluded. After the above steps, a total of 122 HBPs were obtained, which were regarded as positive data. As a control, 121 non-HBPs were randomly selected from UniProt as negative data using a similar selection strategy. The data of the model can be freely download from https://github.com/GUOYUXINXIN/-. The benchmark dataset can be expressed as: Among them, subset D p contains 122 HBPs, and subset D n contains 121 non-HBPs.

Feature Extraction
Protein sequence is a string generated by the permutation and combination of 20 English letters with different lengths. Currently, general machine learning algorithms can only deal with feature vectors, so when machine learning methods are used, protein sequences need to be transformed into numerical vectors representing the characteristics of protein sequences. As the first step in building a biological sequence analysis model, feature extraction is an important part of correctly predicting protein sequences, an efficient feature extraction method can obtain a Frontiers in Genetics | www.frontiersin.org November 2021 | Volume 12 | Article 797641 high performance classification model. The extracted features should not only retain the protein sequence information to the maximum extent, but also have a greater correlation with protein classification. Given a protein sequence, express it as: where Pstands for protein sequence, R i represents theithamino acid residue of proteinP(i 1, 2, /, L).

K-Mer
K-mer (Liu et al., 2015b;Niu et al., 2021b) is the most basic method of expressing protein sequences as digital vectors (Liu et al., 2016), in which k-mer frequency coding refers to the occurrence frequency of all possible nucleotide sequences with k length in a given sequence (Liu et al., 2015c;Bin et al., 2017). The k-mer feature extraction algorithm is used to convert the protein sequence into a vector with a fixed length, which is used as the input vector of the machine learning classifier. For example, setting k to 2 produces a 400-dimensional vector (AA, AC, AD, /, AY, YA, YC, /, YY).
To avoid the problem of overfitting, we generally setk < 4 because whenk > 4 , more dimensions will be generated, resulting in dimension disaster . Therefore, we set k to 3 so that the input protein sequence could be converted into a vector with 8,000 dimensions of fixed length.

Distance-Based Residual
DR (Liu et al., 2014) is a feature expression method based on protein sequences that uses the distance between residue pairs to represent the feature vector of the protein. The feature vector is expressed by calculating the number of occurrences of residual pairs within a certain distance threshold. The feature vector dimension obtained by the DR feature extraction method is 20 + 20 × 20 × d MAX dimensions, where in 20 in 20 + 20 × 20 × d MAX represents the types of amino acids that make up the protein; d MAX is a distance threshold that can be set manually, which represents the maximum distance between pairs of amino acid residues.

Profile-Based Cross-Covariance
Since machine learning-based technologies such as random forest (RF) and logistic regression (LR) require the input of fixed-length vectors as input vectors for training, it is necessary to convert protein sequences of different lengths into fixed length vectors as input vector machine learning. Because each residue in a protein has many physical and chemical properties, protein sequences can be regarded as time series with similar properties. Therefore, CC-PSSM (Dong et al., 2009) is used in this article to convert protein sequences of different lengths into fixed length vectors. PSSM algorithm is a common algorithm in the field of bioinformatics, known as the "positionspecific scoring matrix," which can store the evolutionary information of protein sequences so that it can be used for protein prediction. It is a matrix that calculates the percentage of different residues at each position in a multi sequence alignment, the matrix size is L × 20 (L for protein sequence length). Among them, CC is a measure of correlation between FIGURE 1 | Structure flow chart. The first step is to search and download HBPs and non-HBPs from the protein resource database and then use CD-HIT to perform protein de-redundancy operations. The threshold is set to 60%. Finally, protein sequences containing unknown residues are removed to generate the final protein dataset. The second step is to extract features of the protein, and the third step is to use different classification methods to classify the selected features.
Frontiers in Genetics | www.frontiersin.org November 2021 | Volume 12 | Article 797641 two different properties of amino acid residues and can be calculated using the following equation: (3) wherei1, i2represents amino acids, and S i1 , S i2 represents the average score of i1, i2along the protein sequence. LAG is the maximum lag, lag is an integer value from 1 to LAG, and the total number of CC variables is 380 × LAG. In this paper, we set the value ofLAG to 2 to obtain a 720(380 × 2)-dimensional vector.

Feature Selection
When the feature size is large, there may be irrelevant features or inter-dependence between features, which will easily affect the accuracy of the prediction results. In particular, the more feature dimensions, the more likely it is to lead to "dimension disaster," model complexity and model generalization ability decline. Therefore, removing irrelevant or redundant features through feature selection can improve the accuracy of classification performance and reduce the running time of the model (Polat and Güneş, 2009;Quan et al., 2016;Zou et al., 2016;Guohua and Jincheng, 2018;Wei et al., 2018;Riaz and Li, 2019;He et al., 2020). In this paper, the F-score value is used to select the optimal feature (Chen and Lin, 2008;Cheng et al., 2019;Wei et al., 2019), which is a method to measure the distinguishing ability of features between the two categories, and the most effective feature selection can be achieved through this method. Therefore, we can use (Eq. 4) to describe the contribution of each feature and perform feature selection: whereF(i) is the score of theith feature of the F-score. Generally, the larger the value of F(i) is, the stronger the ability to recognize samples.s 2 w (i) is the intragroup variance, ands 2 b (i) is the intergroup variance. Their calculation formula is as follows: wheress b (i)is the sum of squares between groups; ss w (i)is the sum of squares within the group; Kis the total number of classes; andNis the total number of samples.

Classifier
In this paper, Naive Bayes, Random forests, logistic regression, linear discriminant and other classification algorithms are used to predict HBPs.

Naïve Bayes
The Naive Bayes method is a classification method based on Bayes' theorem and the assumption of the independence of characteristic conditions. It is characterized by combining prior probability and posterior probability and a very widely used algorithm. The main idea of the naive Bayes classifier is to solve the posterior probability P(Y|X) through joint probability modeling and use Bayes' theorem. Then, the category corresponding to the largest posterior probability is used as the predicted category. Suppose there is a sample dataset D {d 1 , d 2 , /, d n }, the feature dataset corresponding to the sample dataset is X {x 1 , x 2 , /, x d }, features are independent and random, and the class variable is Y {y 1 , y 2 , /y m }.
According to the Naive Bayes algorithm, the posterior probability of the sample belonging to categoryycan be expressed as: WhereP(Y)is the prior probability, Naive Bayes is based on the independence of each feature. In the case of a given category, Eq. 6 can be further expressed as the following equation: The posterior probability can be calculated from the above two Eqs 6, 7: Since the magnitude of P(X)is fixed, when comparing the posterior probability, only the molecular part of the above equation can be compared. Therefore, a naive Bayesian calculation of sample data belonging to category y i can be obtained:

Random Forests
RF is a flexible, easy-to-use machine learning algorithm that contains multiple decision trees. It is an optimized version of bagging Zeng et al., 2020). The idea of bagging is to vote on the results of multiple weak classifiers to combine them into a strong classifier, thereby improving the prediction accuracy of the model. In the training phase, RF uses the bootstrap sampling method to collect multiple different subsets from the input training dataset and then uses the different collected subsets to train the internal decision tree. Then, in the prediction phase, RF votes for the prediction results of multiple internal decision trees and then outputs the prediction results. Its advantages are as follows: 1) it can process high-dimensional data without feature selection; 2) accuracy can be maintained even if many of the features are missing; and 3) it has a fast training speed (Jiao et al., 2021).

Logistic Regression
As a classification model, LR can deal with the 0/1 classification problem because of the nonlinear factor introduced by the Frontiers in Genetics | www.frontiersin.org November 2021 | Volume 12 | Article 797641 sigmoid function. The image of the logical function is an S-shaped curve with values between (0, 1). The farther away from 0 a function is, the closer to 0 or 1 the value of the function will be. Therefore, this feature can be used to solve the problem of binary classification. The function formula is as follows: Among them, z θ T x n i 0 θ i x i θ 0 x 0 + θ 1 x 1 + θ 2 x 2 + /+ θ n x n ; therefore, the predictive function of logistic regression can be expressed as:

Linear Discriminant Analysis
LDA is a classical linear learning method, also known as "Fisher" discriminant analysis in dichotomies. Unlike the perception machine, the principle of LDA is dimension reduction. In other words, given a set of training samples, the article tries to sample projections to a straight line, keeping the points with the same classification as close as possible and the classification of different points as far apart as possible, i.e., maximizing and minimizing the variance between variance. LDA can, therefore, make use of sample points in the projection line (or projection location) to determine the type of sample.

Performance Evaluation
In this article, we use the specificity (SP), sensitivity (SN), accuracy (ACC) (Yang et al., 2021) and Matthews correlation coefficient (MCC) to evaluate our proposed method (Snow et al., 2005;Cheng et al., 2018), which can be expressed as: 1. Accuracy: ACC represents the probability that all positive and negative samples will be correctly predicted.

ACC
TP + TN TP + TN + FN + FP (12) 2. Sensitivity: SN represents the probability that the actual hormone-binding protein is predicted to be a hormone-binding protein.
3. Specificity: SP represents the probability that a nonhormone-binding protein is predicted to be a non-hormonebinding protein. 6. F1-Score: The F1 score is balanced by taking into account both accuracy and recall, so that both are maximized at the same time.
Where, the recall rate is: recell TP TP+FN 7. The ROC curve: Receiver operating characteristic curve (the area under the curve is AUROC), X-axis is false positive rate (FPR), Y-axis is true positive rate (TPR): 8. PRC: PRC takes precision rate as Y-axis and recall rate as X-axis.
Where TPrefers to the model correctly predicting positive category samples; FPrefers to the model incorrectly predicting negative category samples as positive category; TN refers to the model correctly predicting negative category samples; and FNrefers to the model incorrectly predicting positive category samples as negative category (Ding et al., 2020a;Ding et al., 2020b).
In machine learning, a test set is needed to test the model and describe its generalization ability. However, in practical applications, due to the limited number of datasets, cross validation is used as a test method. There are three types of cross validation: K-fold cross validation, fold cross validation and independent data verification. In this article, we use K-fold cross-validation to test the constructed model. K-fold cross-validation divides the training data into K parts, of which (K-1) pieces of data are used to train the model, and the remaining 1 piece of data is used to evaluate the quality of the model. This process is cycled K times, and the K evaluation results obtained are combined, such as averaging or voting. The flow chart of K-fold cross verification is shown in Figure 2.

RESULTS AND DISCUSSION
In machine learning, the predicted results of the model can be tested through cross-validation. In this article, we use 10-fold cross-validation to evaluate the built model.

Performance Comparison of Different Feature Expression Methods
According to the feature extraction part, protein sequences are transformed into feature vectors of different sizes through different feature extraction methods. Therefore, in this study we tested the performance of three feature extraction methods: k-mer (K 2), k-mer (K 3), DR and CC-PSSM.
Frontiers in Genetics | www.frontiersin.org November 2021 | Volume 12 | Article 797641 First, use the F-score feature selection method to reduce the dimensionality of the feature vectors obtained by different feature extraction methods to 250 dimensions, then use the selected best feature vector as the input vector of the naive Bayes algorithm and perform 10-fold cross-validation, and finally draw forecast results. The prediction results are shown in Table 1 (the maximum value is in bold). As shown in Table 1, the k-mer (k 3) feature extraction algorithm used in this model performs best in all indicators, among which the values of ACC, MCC,SP and SN are,respectively,95.45,91.36,96.73,and 94.17%. These results prove the validity of our model.

Comparison With Other Classifiers
To show the superiority of naive Bayes in HBPs recognition, we can compare the HBPs recognition performance of different classification algorithms based on the same feature subset (i.e. 250 optimal features). In this paper, we used the constructed HBP_NB model to perform performance comparison with RF, LDA, Logistic regression and other models under the condition of 10-fold cross-validation, and the comparison results are shown as follows.  Figures 3, 4 respectively show the boxplot diagram of different models, ROC and PRC curves schematic diagram. These results show that our model has good classification ability. Therefore, we construct the final model based on naive Bayes. Where, the line in the middle of the box in the boxplot is the median of the data, representing the average level of the sample data; The top of the box represents the upper quartile and the bottom quartile represents the lower quartile, which means the box contains 50% of the data, so the width of the box reflects, to some extent, how much the data fluctuates; at the same time, the lines above and below the box represent the maximum and minimum values of data. The ROC curve is a curve that evaluates the effect of binary model on positive FIGURE 2 | K-fold cross-validation diagram. Divide the data into K parts, where k-1 parts are used as the training dataset, and the remaining part is used as the test set. The mean value of the results of the k groups is calculated as the performance index of the current k-fold cross-validation evaluation model.    Frontiers in Genetics | www.frontiersin.org November 2021 | Volume 12 | Article 797641 7 category prediction. X-axis is false positive rate (FPR), Y-axis is true positive rate (TPR), which indicates that the optimal classifier with the best performance is located in the upper left corner of the image (coordinate 0,1), and the area under its ROC curve is AUROC, with an area value between 0,1. PRC takes presion rate as Y-axis and recall rate as X-axis, and lines are drawn according to changes in the value of probability threshold. The ideal model would be at the point (1,1). The model with excellent performance is as close to this point as possible.

Performance Comparison With the Existing Optimal Algorithm
This section compares the model constructed in the article with other existing methods, in which the results of HBPred (Hua et al., 2018) and iGHBP (Basith et al., 2018) are directly obtained from the literature. The comparison results are shown in Table 3 (the maximum value is in bold). As seen from Table 3, the HBP_NB model constructed in this paper has the best performance in all indicators, among which ACC, SP and SN have reached maximum values of 95.45, 96.73 and 94.17%, respectively. The effect is significantly better than that of the other two methods, which also proves the effectiveness of the HBP_NB model constructed in this paper.

CONCLUSION
As a carrier protein related to the regulation of hormones in the circulatory system, HBPs can cause various diseases when they are abnormally expressed. Therefore, it is very important to understand their function and regulatory mechanism, and the correct identification of HBPs is the first step in understanding their biological process and is necessary to further study their function. There is growing evidence that it is crucial to develop an efficient computational model to identify hormone-binding proteins. In this study, we used a reliable predictive model for HBP_NB to identify HBPs. First, the model uses the k-mer feature extraction method to extract the features of HBPs. Then, to remove redundancy and noise and improve the accuracy of model prediction, the F-score value is used to sort the features and select the optimal features. Secondly, the reduced feature set is input into naive Bayes classifier and the 10-fold cross validation is used to judge the quality of the prediction model. Finally, the accuracy, sensitivity and specificity of the HBP_NB model reached 95.45, 94.17 and 96.73%, respectively, in 10-fold cross validation. The feasibility and validity of our model are illustrated.
However, there is room for improvement in our current approach. Since the data set selected in this experiment is small, we will collect more data for model training and independent test set experiments in the future to improve the model's robustness and generalization ability. At the same time, we will further learn more effective feature representation methods and classification algorithms to gain an in-depth understanding of machine learning and establish a more stable model. In addition, we also hope that our work can help scholars to study hormone binding proteins, to promote research on hormone-binding protein drugs.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding authors.