RF_phage virion: Classification of phage virion proteins with a random forest model

Introduction: Phages play essential roles in biological processes, and the virion proteins encoded by the phage genome constitute critical elements of the assembled phage particle. Methods: This study uses machine learning methods to classify phage virion proteins. We proposed a novel approach, RF_phage virion, for the effective classification of the virion and non-virion proteins. The model uses four protein sequence coding methods as features, and the random forest algorithm was employed to solve the classification problem. Results: The performance of the RF_phage virion model was analyzed by comparing the performance of this algorithm with that of classical machine learning methods. The proposed method achieved a specificity (Sp) of 93.37%, sensitivity (Sn) of 90.30%, accuracy (Acc) of 91.84%, Matthews correlation coefficient (MCC) of .8371, and an F1 score of .9196.


Introduction
Phages integrate their DNA sequences with bacterial genomes following infection and play a role in maintaining the diversity of microorganisms (Shen et al., 2007; Xia et al., 2010; Wetie Ngounou et al., 2014; Zou et al., 2016). If the abundance of a particular type of bacteria increases rapidly in a bacterial population, the corresponding phage specifically infects and kills the rapidly proliferating bacteria. The entire bacterial population returns to equilibrium following this process. Phages also participate in the Earth's material cycle and are essential to the human microbiome (Brohee and Van Helden, 2006; Shen et al., 2019; Zhang and Quan, 2020). There are approximately 10^14 bacteria in each individual's gut, while the number of bacteriophages is 10^15 to 10^16, which is at least ten times higher than the number of bacteria. These findings indicate that phage proteins play several crucial roles in biological processes (Ngo et al., 1994; Godzik et al., 1995; Whisstock and Lesk, 2003; Wu et al., 2009; De Las Rivas and Fontanillo, 2010; Awais et al., 2019).
Phage proteins can be classified as virion and non-virion proteins. The virion proteins encoded by the phage genes are essential components of the assembled phage particle and include the capsid protein, envelope protein, and virion enzymes (Chatterjee et al., 2011; You et al., 2013; Peng et al., 2017). These virion proteins determine the specificity for recognizing host bacteria and play essential roles in the recombination of phage viruses, receptor recognition, bacterial attachment, and penetration. The non-virion proteins of phages are synthesized in the infected cells and are also encoded by the phage genome. However, the non-virion proteins cannot be packaged into mature phage particles. The non-virion proteins primarily include enzymes and regulatory proteins, which play important roles in the processes of gene replication, transcription, and gene expression in phages (Sato et al., 1994; Schwikowski et al., 2000; Wei et al., 2017).
Several computational methods have been reported for classifying the functions of phage genes and virion proteins over the past few decades. Li et al. proposed a novel tool named SynFPS for classifying closely related genomes in whole genome comparison studies (Coates and Hall, 2003). The method employs a support vector machine (SVM) classifier and uses gene-to-gene distances as a feature. Feng et al. proposed a naïve Bayes method for classifying phage virion proteins based on the composition of primary amino acids and dipeptides as coding schemes (Free et al., 2009). Ding et al. proposed a method for classifying virion proteins using an SVM-based approach (Kim and Subramaniam, 2006). In these models, the key features among g-gap dipeptide compositions were initially determined by analysis of variance. Yang et al. described an ensemble algorithm-based method for classifying organellar proteins, in which the amino acid composition, physicochemical properties, sequence distribution, and structural characteristics of the sequences were used as features (Zhang et al., 2012). Han et al. proposed a two-layer multi-class SVM model for classifying subcellular localizations (Vazquez et al., 2003). After the first layer of SVM classification is completed, each amino acid sequence is represented by a k-dimensional vector, and each element in the vector corresponds to a classification result of the classifier (Yang et al., 2020). The output of the first layer is used as the input for the next layer, and the second layer uses SVM to determine the final result. Jia et al. proposed a random forest algorithm-based method that used different features extracted from protein sequences (You et al., 2017).
The method used a voting system for computing the final classification results, which depended on seven independent models. Bahri et al. proposed an ensemble method named Greedy-Boost based on the adaptive combination, which improves the accuracy of detection (Guo et al., 2008). Although the smoothing method improves the stability of the classification system, the method has a high computational cost. Zhang et al. proposed a method based on logistic models for classifying samples using the amino acid composition, transformation, and distribution features and pseudo-amino acid composition as features (Koike and Takagi, 2004). The final results were computed based on the results obtained from the classification models. Liu et al. used different weights for classifying the four SVMs used in their study (You et al., 2015a). The method determined the final classification by traversing and selecting appropriate parameters. These findings indicate that ensemble algorithms can improve the accuracy of the final classification.
This study aimed to develop a method for the classification of phage virion proteins using machine learning methods. A novel method, RF_phage virion, is proposed herein for the effective classification of the virion and non-virion proteins. The method uses four protein sequence coding methods as features, and the random forest algorithm is used for solving the classification problem. The performance of the RF_phage virion model was determined by comparing the performance of this algorithm with some classical machine learning methods. A schematic representation of the RF_phage virion method is provided in Figure 1.

Dataset
Ding's dataset, which primarily focuses on phage virion proteins, was used for classifying the phage proteins in this study (Bradford and

Amino acid composition (AAC)
The AAC feature describes the distribution of amino acid residues (Li et al., 2012). The feature focuses on the frequency of occurrence of each amino acid residue and thus provides typical statistical information regarding the identified protein sequences. The formula used for determining the AAC is provided in Eq. 1:

$$\mathrm{AAC}(i) = \frac{\mathrm{aac}(i)}{\mathrm{length}}, \quad i = 1, 2, \ldots, 20$$

where length represents the length of the identified phage virion protein sequence, and aac(i) represents the number of occurrences of the ith amino acid residue in the protein sequence. The index i runs over the twenty amino acids present in protein sequences. The twenty AAC values sum to 1.
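As an illustration of the AAC definition, the following minimal Python sketch counts each of the twenty standard residues and divides by the sequence length. The function name `aac`, the residue ordering in `AMINO_ACIDS`, and the example sequence are assumptions made for this illustration and are not taken from the original implementation.

```python
# Minimal AAC sketch. The residue order below is an arbitrary (alphabetical)
# choice for illustration, not necessarily the order used by the authors.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Return the 20-dimensional amino acid composition of a sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # aac(i) / length for each of the twenty standard residues
    return [seq.count(residue) / len(seq) for residue in AMINO_ACIDS]


# Hypothetical toy sequence: two A, one C, one D
example = aac("AACD")
print(example[:3])  # frequencies of A, C, and D
```

For a sequence made up only of standard residues, the twenty values sum to 1, as stated in the text.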

Composition of k-spaced amino acid pairs (CKSAAP)
Although the AAC feature captures which amino acids are present in a protein sequence, it does not provide any positional information regarding those amino acids (Chen and Liu, 2005; You et al., 2015b; Wang et al., 2018). The CKSAAP feature describes the relationship between two amino acid residues in a protein sequence and focuses on the frequency of amino acid residue pairs that are separated by n neighboring amino acid residues. For instance, n = 0 indicates that the two amino acids are successive. There are 400 types of amino acid pairs, and CKSAAP computes the frequency of occurrence of each pair. The formula used for determining the CKSAAP is provided in Eq. 2:

$$\mathrm{CKSAAP}_n(a, b) = \frac{N_n(a, b)}{\mathrm{length} - n - 1}$$

where N_n(a, b) is the number of times residue a is followed by residue b with n residues in between, and length − n − 1 is the total number of such pairs in the sequence. In this study, the value of n was set to 3, so the gaps n = 0, 1, 2, and 3 each contribute 400 values and the scale of the CKSAAP feature reaches 1600.

FIGURE 2
The structure of the random forest algorithm.

FIGURE 3
The ROC curves of the AAC feature. Note: LD means the linear discriminant classifier, SVM means the support vector machine, DT means the decision tree, RF means the random forest, and CNNBilstm means the convolutional neural network with bidirectional long short-term memory.

Frontiers in Genetics frontiersin.org
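The CKSAAP computation can be sketched in Python as follows. This is a minimal illustration under two assumptions: each gap's counts are normalized by the number of residue pairs available at that gap, and the gaps 0 through 3 are concatenated to give 1600 values. The names `cksaap`, `PAIRS`, and `max_gap`, and the toy sequence, are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, e.g. "AA", "AC", ..., "YY"
PAIRS = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]


def cksaap(sequence, max_gap=3):
    """Return the concatenated k-spaced pair frequencies for gaps 0..max_gap."""
    seq = sequence.upper()
    features = []
    for gap in range(max_gap + 1):
        n_pairs = len(seq) - gap - 1  # number of pairs separated by `gap` residues
        if n_pairs < 1:
            raise ValueError("sequence too short for the requested gap")
        counts = dict.fromkeys(PAIRS, 0)
        for i in range(n_pairs):
            pair = seq[i] + seq[i + gap + 1]
            if pair in counts:  # skip pairs with non-standard residues
                counts[pair] += 1
        features.extend(counts[pair] / n_pairs for pair in PAIRS)
    return features


# Hypothetical toy sequence
example = cksaap("ACDEFG")
print(len(example))  # 1600
```

With `max_gap=3` the vector has 4 × 400 = 1600 entries, which matches the feature size reported in the text.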

Di-peptide composition (DPC)
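The DPC feature focuses on the correlation between two successive amino acid residues (Sun et al., 2017). The scale of this feature is 400. The DPC feature was calculated using the formula in Eq. 3:

$$\mathrm{DPC}(i) = \frac{N_i}{\mathrm{length} - 1}, \quad i = 1, 2, \ldots, 400$$

where N_i is the number of occurrences of the ith dipeptide in the sequence, and the sum of all 400 elements equals 1. In other words, the DPC can be treated as a second-order composition defined over pairs of amino acids.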
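A minimal Python sketch of the DPC calculation is given below. It assumes that dipeptide counts are normalized by the number of adjacent residue pairs (length − 1), which is consistent with the statement that the elements sum to 1. The function name `dpc` and the toy sequence are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]


def dpc(sequence):
    """Return the 400-dimensional dipeptide composition of a sequence."""
    seq = sequence.upper()
    n_pairs = len(seq) - 1  # number of adjacent residue pairs
    if n_pairs < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n_pairs):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs with non-standard residues
            counts[pair] += 1
    return [counts[pair] / n_pairs for pair in DIPEPTIDES]


# Hypothetical toy sequence with dipeptides AC, CA, AC
example = dpc("ACAC")
print(example[DIPEPTIDES.index("AC")], example[DIPEPTIDES.index("CA")])
```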

Dipeptide deviation extraction (DDE)
The DDE feature focuses on a theoretical sequence model based on binomial and uniform distributions, but does not consider alignment relationships between proteins (Zhang et al., 2019). The feature can elucidate the interrelationships within a set of proteins. The DDE feature comprises three key parameters, namely, the dipeptide composition (D_c), the theoretical mean (T_m), and the theoretical variance (T_v). Because there are 400 possible pairs of successive amino acid residues, the scale of the DDE feature is 400, as depicted in Eq. 5:

$$\mathrm{DDE}(i) = \frac{D_c(i) - T_m(i)}{\sqrt{T_v(i)}}, \quad i = 1, 2, \ldots, 400$$

The formulae used for estimating D_c, T_m, and T_v are provided in Eq. (6) (7) (8):

$$D_c(i) = \frac{N_i}{\mathrm{length} - 1}$$

There are 400 combinations of amino acid pairs. Therefore, D_c(i) can be treated as an element of the related DPC feature, with N_i denoting the number of occurrences of the ith dipeptide.

$$T_m(i) = \frac{C_{i1}}{C_N} \times \frac{C_{i2}}{C_N}$$

where T_m represents the theoretical average, C_{i1} represents the occurrence of the first amino acid residue of dipeptide i, C_{i2} represents the occurrence of the second amino acid residue, and C_N represents the total over the entire set of amino acids.

$$T_v(i) = \frac{T_m(i)\,\bigl(1 - T_m(i)\bigr)}{\mathrm{length} - 1}$$

where T_v represents the theoretical variance of the dipeptides.

FIGURE 4
The ROC curves of the CKSAAP feature.
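The DDE calculation can be sketched in Python as follows. This sketch assumes the commonly used DDE formulation, in which C_i1 and C_i2 are the numbers of codons encoding the first and second residue of a dipeptide and C_N = 61 is the number of sense codons in the standard genetic code. The text above describes these quantities only as occurrences, so the codon-based reading is an assumption, as are the names `dde` and `CODON_COUNTS` and the toy sequence.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Number of codons per amino acid in the standard genetic code (assumed basis for T_m)
CODON_COUNTS = {
    "A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3, "K": 2, "L": 6,
    "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6, "T": 4, "V": 4, "W": 1, "Y": 2,
}
TOTAL_CODONS = 61  # sense codons, stop codons excluded
DIPEPTIDES = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]


def dde(sequence):
    """Return the 400-dimensional dipeptide deviation from expected mean."""
    seq = sequence.upper()
    n_pairs = len(seq) - 1
    if n_pairs < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n_pairs):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    features = []
    for pair in DIPEPTIDES:
        d_c = counts[pair] / n_pairs                    # observed dipeptide composition
        t_m = (CODON_COUNTS[pair[0]] / TOTAL_CODONS) * (
            CODON_COUNTS[pair[1]] / TOTAL_CODONS)       # theoretical mean
        t_v = t_m * (1 - t_m) / n_pairs                 # theoretical variance
        features.append((d_c - t_m) / math.sqrt(t_v))
    return features


# Hypothetical toy sequence with dipeptides AC, CD, DA
example = dde("ACDA")
print(len(example))  # 400
```

Dipeptides that are absent from the sequence receive negative values, because their observed composition of zero lies below the theoretical mean.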

Random forest algorithm
The random forest algorithm was proposed by L. Breiman in 2001 and has been successfully used for dealing with classification and regression problems in related areas (Saha et al., 2014; Liu et al., 2018). The algorithm builds a collection of randomized decision trees and subsequently aggregates their predictions by averaging. This algorithm can deal with high-dimensional small-sample problems. In other words, the algorithm performs well in identification problems using datasets where the number of variables is much larger than the number of samples. The random forest algorithm is also used for large-dataset problems. The steps of the random forest algorithm are outlined in Figure 2.

FIGURE 5
The ROC curves of the DPC feature.
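To make the two sources of randomness in a random forest concrete, the toy Python sketch below trains an ensemble of one-level decision trees (stumps), each on a bootstrap sample and a random subset of features, and aggregates their outputs by majority vote. It is a deliberately simplified stand-in and not the model used in this study: real random forests grow much deeper trees, and in practice a library implementation such as scikit-learn's `RandomForestClassifier` would normally be used. All function names, parameters, and data below are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y, feature_ids):
    """Pick the (feature, threshold, left label) with the fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            for left_label in (0, 1):
                errors = sum(
                    (left_label if row[f] <= t else 1 - left_label) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, left_label)
    return best[1:]


def predict_stump(stump, row):
    f, t, left_label = stump
    return left_label if row[f] <= t else 1 - left_label


def fit_forest(X, y, n_trees=25, max_features=1, seed=0):
    """Train stumps on bootstrap samples with random feature subsets."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample (with replacement)
        feats = rng.sample(range(d), max_features)    # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: label 1 stands for virion, label 0 for non-virion
X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.3, 0.25],
     [0.8, 0.9], [0.9, 0.7], [0.7, 0.8], [0.85, 0.95]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
print(predict_forest(forest, [0.05, 0.05]), predict_forest(forest, [0.99, 0.99]))
```

In the setting of this study, each row of `X` would instead be a feature vector such as the AAC, CKSAAP, DPC, or DDE encoding of a protein sequence.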

Measurement of performance
The samples in this classification problem fall into two categories, namely, virion and non-virion protein sequences of phages. The positive samples comprised the virion protein sequences, while the negative samples comprised the non-virion protein sequences. By definition, each classified sample yields one of four outcomes: true positive (TP), false positive (FP), true negative (TN), or false negative (FN). The performance measures, including the sensitivity (Sn), specificity (Sp), accuracy (Acc), F1 score, and Matthews correlation coefficient (MCC), were obtained using the formulae in Eq. (4) (5) (6) (7) (8):

$$\mathrm{Sn} = \frac{TP}{TP + FN}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP}$$

$$\mathrm{Acc} = \frac{TP + TN}{TP + FP + TN + FN}$$

$$\mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where P and N denote positive and negative samples, respectively, and T and F denote true (correct) and false (incorrect) predictions, respectively. All of the performance measures are built from the four basic parameters, namely, TP, FP, TN, and FN. The F1 score evaluates how well the positive class is identified in two-class problems and can be treated as the harmonic mean of precision and recall. Another important measure of performance is the MCC, whose values range from −1 to 1.
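The performance measures can be computed directly from the four confusion-matrix counts. The Python sketch below uses the standard definitions of Sn, Sp, Acc, F1, and MCC. The function name `classification_metrics` and the example counts are hypothetical and are not results from this study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute Sn, Sp, Acc, F1, and MCC from confusion-matrix counts.

    Assumes that both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    sn = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)  # accuracy
    f1 = 2 * tp / (2 * tp + fp + fn)       # harmonic mean of precision and recall
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "F1": f1, "MCC": mcc}


# Hypothetical counts for illustration only
metrics = classification_metrics(tp=90, fp=20, tn=80, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Returning 0 for the MCC when the denominator is zero is a common convention and is a design choice of this sketch.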

Results
The random forest model was used in this study for classifying the virion and non-virion proteins of phages using four typical protein features, namely, the AAC, CKSAAP, DPC, and DDE. The performance of the method was determined by comparing it with state-of-the-art methods. As depicted in Figure 3 and Table 2, the values of Sp, Sn, Acc, MCC, and F1 score for the SVM-based method were 46.99%, 52.24%, 49.61%, −.0077, and .4825, respectively, while the values of these indices for the decision tree model were 63.47%, 74.50%, 68.99%, .3821, and .6718, respectively. The values of Sp, Sn, Acc, MCC, and F1 score for the random forest algorithm using the AAC feature were 74.83%, 76.94%, 75.89%, .5178, and .7563, respectively, while the values of these indices for the deep learning algorithm, which is a convolutional neural network, were 99.47%, 0%, 49.74%, −.0514, and .6643, respectively.

FIGURE 6
The ROC curves of the DDE feature.

FIGURE 7
The ROC curves of combination feature.

Discussions
In the Results section, the AAC, CKSAAP, DPC, and DDE features were each used individually. We therefore also combined the four features and evaluated the performance of the resulting combination feature in this work.
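One simple way to build such a combination feature is to concatenate the four vectors. The Python sketch below assumes plain concatenation, which the text does not state explicitly, and uses placeholder vectors with the feature sizes given in this article. The names `combine_features` and `FEATURE_SIZES` are hypothetical.

```python
# Feature sizes as stated in the article
FEATURE_SIZES = {"AAC": 20, "CKSAAP": 1600, "DPC": 400, "DDE": 400}


def combine_features(aac, cksaap, dpc, dde):
    """Concatenate the four encodings into one feature vector (assumed scheme)."""
    return list(aac) + list(cksaap) + list(dpc) + list(dde)


# Placeholder vectors of the right sizes stand in for real encodings
placeholders = [[0.0] * size for size in FEATURE_SIZES.values()]
combined = combine_features(*placeholders)
print(len(combined))  # 2420
```

The resulting length, 20 + 1600 + 400 + 400 = 2420, agrees with the size of the combination feature reported in the Conclusion.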

Conclusion
The present study uses machine learning methods to classify phage virion proteins. Four protein sequence coding methods, namely AAC, CKSAAP, DPC, and DDE, were used as features for the effective classification of the virion and non-virion proteins. The random forest algorithm was subsequently used to solve the classification problem. By combining each of the four features with the classification algorithm, we observed that the performance of the model was best when the combination feature was used.
The classification of phage virion proteins can be regarded as a typical binary classification problem in the field of machine learning. In this work, we employed Ding's dataset, which is a balanced dataset. In practice, however, the number of positive samples is rarely equal to the number of negative samples. The AAC, CKSAAP, DPC, and DDE features and their combination were employed as the input of the RF_phage virion model. Several other features are used in the field of protein research, and these can also be employed in future work, as can other typical classification algorithms. The size of the combination feature reaches 2420 (20 + 1600 + 400 + 400). Given this size, approaches that remove uninformative features can also be explored in future work.

Data availability statement
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.