Identification of Phage Viral Proteins With Hybrid Sequence Features

The unique properties of bacteriophages give them an important role in bioinformatics research. In real applications, the function of bacteriophage virion proteins is the main area of interest. It is therefore very important to distinguish bacteriophage virion proteins from non-phage virion proteins accurately. Extracting comprehensive and effective sequence features plays a vital role in protein classification. To represent protein information more fully, this paper combines the features extracted by a feature representation algorithm based on sequence information (CCPA) with those extracted by a feature representation algorithm based on sequence and structure information. After feature extraction, the Max-Relevance-Max-Distance (MRMD) algorithm is used to select the optimal feature set, i.e., the set with the strongest correlation with the class labels and low redundancy between features. Because the random forest algorithm draws its training samples at random and selects the candidate variables at each node at random, a random forest classifier is employed and evaluated with 10-fold cross-validation on the bacteriophage protein classification task. The accuracy of this model reaches 93.5% in the classification of phage proteins in this study. This study also found that, among the eight physicochemical properties considered, the charge property has the greatest impact on the classification of bacteriophage proteins. These results indicate that the model discussed in this paper is an important tool in bacteriophage protein research.


INTRODUCTION
In the biological world, bacteriophages are ubiquitous, with different genomes and lifestyles. According to their morphology, they can be classified as tailed, tailless, or filamentous bacteriophages. According to morphology and nucleic acid type, phages are further divided into those that infect bacteria and those that infect archaea. A bacteriophage must attach to a host cell for growth and reproduction (Seguritan et al., 2012), and directly affects the host population by lysing host cells. In addition, each bacteriophage is specific to its host, which greatly reduces the damage to host cells (Haq et al., 2012). Identification and classification of various bacteria can be performed based on the universality, diversity, dependence, and specificity of bacteriophages (Marks and Sharp, 2015). The structure of bacteriophages is simple, consisting of only a protein shell and genetic material (DNA or RNA) (Haq et al., 2012), making them important for simplifying experimental research in bioinformatics. As a bacteriophage can insert genes into host cells (Ding et al., 2014), it is an important tool for studying genetics (Hu et al., 2018). Hershey and Chase (1952) performed biological experiments using the T2 bacteriophage and bacteria, and confirmed that DNA is the genetic material of bacteriophages and other organisms. The significance of this research in the development of biological science earned Hershey and coworkers the Nobel Prize in Physiology or Medicine. Bacteriophages provided experimental systems and tools for the revolution in molecular biology. The rapid development of bacteriophage research has led to the detection of basic principles of ecology and evolution. In addition, bacteriophages are relatively easy to synthesize and have modular characteristics, which suits the needs of synthetic biologists who engineer and implement biological functions.
Bacteriophage proteins are classified into virion and non-virion proteins, with most practical interest focusing on the function of bacteriophage virion proteins (Feng et al., 2013b). Therefore, bacteriophage proteins must be accurately classified and identified so that researchers can further study the structure and function of a particular bacteriophage. After the human genome project was officially launched in 1990, the number of bacteriophage protein sequences with unknown functions increased dramatically (Seguritan et al., 2012; Chen et al., 2018a). Faced with a large volume of data, traditional biological experimental methods could no longer keep up with the post-genomic era (Chen W. et al., 2016; Cheng et al., 2019; Mrozek et al., 2016; Hu et al., 2018). For this reason, researchers introduced different machine learning algorithms into bacteriophage classification and prediction research. For example, Li et al. (2007) developed a support vector machine system called SynFPS that uses the gene-gene distance determined by k-means clustering to identify closely related genomes and perform gene function prediction. Using amino acid frequencies and isoelectric point information, Seguritan et al. (2012) developed an artificial neural network method to classify viral structures. Feng et al. (2013b) used the main amino acid and dipeptide components as an encoding scheme, and modified a naive Bayes classifier to identify bacteriophage proteins. Ding et al. (2014) used g-gap dipeptide composition to represent protein sequence information, incremental feature selection to analyze the variance and identify the optimal feature set, and a support vector machine for classification. Zhang et al. (2015) obtained sequence feature vectors with various techniques, and then used the incremental feature selection algorithm to select the optimal feature subsets.
Finally, the prediction results of individual classifiers trained in different feature spaces were integrated to produce the final classification result. Machine learning algorithms (Robert, 2012; Stephenson et al., 2018) automatically analyze data, derive rules from them, and use these rules to predict unknown data (Chen and Yan, 2013; Yu et al., 2015, 2016a; Chen and Huang, 2017; Chen et al., 2018h; Wang et al., 2018). This saves time and money, but the results from such algorithms are not as convincing as those from biological experiments. Therefore, it is especially important to choose an appropriate machine learning algorithm to ensure the most accurate classification results (Liu, 2017; Yao et al., 2017; Yu et al., 2017a). In a protein classification experiment, the classification performance depends largely on the feature set extracted (Zou et al., 2013; Bin et al., 2015; Mrozek et al., 2015; Jia et al., 2016; Yu et al., 2016b, 2018; Zhang et al., 2016; Huang et al., 2017; Qu et al., 2017; Jiang et al., 2018; Qiao et al., 2018; Xiong et al., 2018; Xu et al., 2018b). To date, feature extraction methods are divided into sequence-based and structure-based approaches (Qu et al., 2017). The feature set in this study is obtained by combining the features extracted by these two kinds of feature extraction methods.
In this study, we examined the final classification performance of the selected methods and the stability of the dataset when the feature dimension was reduced. First, CD-Hit was used to remove redundant data from the reference dataset, resulting in a dataset that contains comprehensive information and less redundancy. Pearson's correlation coefficient and three distance functions (Euclidean and cosine distances and the Tanimoto coefficient) were then used to calculate the correlation between features and class labels and the redundancy between features. Finally, the optimal feature subset with the strongest correlation between features and class labels and low redundancy between features was selected. According to some recent studies (Wu et al., 2009; Yi et al., 2011; Chen and Lin, 2012; Yang et al., 2015; Yu et al., 2017b; Zhang and Liu, 2017; Xu et al., 2018a; Liu et al., 2019), the best algorithms for protein classification are support vector machines and random forest algorithms. However, support vector machines are more suitable for small sample sets in which the number of dimensions is greater than the number of samples. Thus, the random forest algorithm was used in this study. The random forest algorithm (Breiman, 2001; Yao et al., 2017) combines multiple weak classifiers to produce a final result that has higher accuracy and better generalization performance. It achieves good results mainly because of the random nature of the "forest," which makes the algorithm resistant to overfitting and more precise. Our results show that, for bacteriophage protein classification, both the combined feature set and the feature selection applied to it have a positive impact on classification performance. Our results also show that, among the eight physicochemical properties of amino acids, the charge property has the greatest influence on the classification of bacteriophage proteins.
To evaluate the performance of the models used in this study, the results were compared with those given by the methods introduced by Feng et al. (2013b), Ding et al. (2014), and Zhang et al. (2015). Figure 1 shows the workflow of this study.

Dataset Processing
Source: UniProt (Rolf, 2004; Consortium, 2012) is a widely used protein sequence database that offers low protein sequence redundancy and complete protein function interpretation (Cao and Cheng, 2016a; Jiang et al., 2016). As this website is free and open, researchers can download the desired protein sequences at no cost. The original positive samples used in this study (a total of 15,765 data), i.e., the bacteriophage virion proteins, were downloaded from this database. After obtaining the bacteriophage virion protein (positive) sample set, the PFAM families of the positive samples were excluded from the set of all PFAM families, such that the remaining families were families of non-phage virion proteins. Finally, the longest protein sequence of each remaining family was extracted to form a negative sample set. The positive and negative datasets obtained as described above may both contain homologous sequences. Using such sample sets would cause the classification accuracy to be overestimated, which is not conducive to the establishment of prediction models. Therefore, we used the CD-Hit tool to remove redundant positive and negative samples from the datasets.
Data integration: The CD-Hit (Li et al., 2001; Li and Godzik, 2006; Huang et al., 2010; Fu et al., 2012) redundancy-removal tool effectively clusters similar sequences. The basic principle is to sort the protein sequences in the dataset in descending order of length. The longest sequence is taken as the first class, and it is then compared with the second-longest protein sequence in terms of similarity. If the similarity between the two is greater than some threshold, they are deemed to belong to the same class. Otherwise, the second-longest sequence forms a new class. Because the bacteriophage virion protein sequences were downloaded from UniProt, which ensures relatively low redundancy, the similarity threshold was set to 0.8. The non-phage virion proteins had a higher degree of redundancy, so their similarity threshold was set to 0.4. Thus, 6,251 bacteriophage virion protein sequences and 9,514 non-phage virion protein sequences were obtained. The union of the resulting positive and negative sample datasets gives the total dataset, and the intersection of the two is empty.
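The greedy clustering principle described above can be illustrated with a minimal sketch. This is not the CD-Hit implementation: the similarity function below is a stand-in based on Python's `difflib`, whereas CD-Hit computes alignment-based sequence identity, and the names `similarity` and `greedy_cluster` are hypothetical. Each sequence is compared against the representatives of all existing classes, which generalizes the two-sequence description in the text.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] (assumption; CD-Hit uses sequence identity)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold):
    """Greedy incremental clustering, processing the longest sequence first."""
    representatives = []  # one representative (the longest member) per class
    clusters = []         # members of each class, in the same order
    for seq in sorted(sequences, key=len, reverse=True):
        for idx, rep in enumerate(representatives):
            if similarity(rep, seq) > threshold:
                clusters[idx].append(seq)  # similar enough: join this class
                break
        else:
            representatives.append(seq)    # no match: the sequence opens a new class
            clusters.append([seq])
    return representatives, clusters
```

Keeping only the representatives yields a reduced dataset; a lower threshold merges more sequences into each class and therefore removes more of them.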

Representation Algorithms for Amino Acid Composition and Eight Physicochemical Properties
In this study, a feature set containing 188 dimensions was extracted based on amino acid composition and eight physicochemical properties. The amino acid composition is one of the most basic features of proteins (Cao and Cheng, 2016b). Eight physicochemical properties of amino acids also play a role in the functional properties of bacteriophage proteins. Coia et al. (1988) found that amino acids with lighter side chain groups are more likely to constitute bacteriophage virion sequences. Marvin et al. (1994) proposed that hydrophilicity, hydrophobicity, and charge have a greater impact on the function of bacteriophage virion proteins. Shen and Chou (2008) identified the vital role that the hydrophilicity and hydrophobicity of amino acids play in the folding of proteins. Ting et al. (2014) used logistic regression to integrate several biological features, including physicochemical properties, for predicting lysine acetylation, thus demonstrating the effect of physicochemical properties on protein structure and function. Therefore, the amino acid composition and the eight physicochemical properties are used to extract features that reflect the characteristics of bacteriophage proteins.
The 20 most common amino acids are as follows:

$$A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y$$

The occurrence frequency of each amino acid in a protein sequence can be expressed as:

$$f_i = \frac{n_i}{L}, \quad i = 1, 2, \ldots, 20$$

where $n_i$ is the number of times amino acid $i$ occurs in the protein sequence and $L$ is the length of the protein sequence.
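As a minimal sketch of the 20-dimensional composition features, the following computes the ratio of each amino acid count to the sequence length. The function name and the alphabetical ordering of the residues are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def amino_acid_composition(sequence):
    """Return the 20 frequencies n_i / L for a protein sequence."""
    length = len(sequence)
    if length == 0:
        raise ValueError("sequence must not be empty")
    return [sequence.count(aa) / length for aa in AMINO_ACIDS]
```

For the toy sequence `"AACD"`, the entries for A, C, and D are 0.5, 0.25, and 0.25, and all other entries are zero.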
In addition, these 20 amino acids can be classified into three types according to their physicochemical properties (Chou and Com, 2010), as shown in Figure 2.
The composition, transition, and distribution descriptors of amino acids were introduced by Dubchak et al. (1995) as a global description of protein sequences. The feature extraction methods for the eight physicochemical properties of a protein sequence are as follows. Taking charged polarity as an example (denoted by $p$), the 20 amino acids are divided into high-, medium-, and low-charged polarity groups, which are denoted by $p_h$, $p_p$, $p_l$, respectively. The composition, transition, and distribution of the amino acids can then be represented by equations (3)-(7).
Composition features (Dubchak et al., 1995) (frequency of each charged polarity group in a sequence):

$$f_{21} = \frac{n_1}{L}, \quad f_{22} = \frac{n_2}{L}, \quad f_{23} = \frac{n_3}{L}$$

where $f_{21}$, $f_{22}$, $f_{23}$ denote the content of the high-, medium-, and low-charged polarity groups in a sequence, respectively, $L$ is the length of the protein sequence, and $n_1$, $n_2$, $n_3$ are the numbers of times the three polarity groups appear in the sequence.

Transition features (Dubchak et al., 1995) (frequency of occurrence of adjacent pairs drawn from two different groups):

$$f_{31} = \frac{m_1}{L - 1}, \quad f_{32} = \frac{m_2}{L - 1}, \quad f_{33} = \frac{m_3}{L - 1}$$

where $f_{31}$, $f_{32}$, $f_{33}$ denote the content of the three pair types $p_{hl}$, $p_{hp}$, $p_{pl}$, and $m_1$, $m_2$, $m_3$ are the numbers of times these three pair types appear in the sequence. The denominator follows from the fact that, in a protein sequence of length $L$, any two adjacent amino acids constitute a pair, so the sequence contains $L - 1$ pairs (Zou et al., 2013).

Distribution features (Dubchak et al., 1995) (amino acid distribution of the high-, medium-, and low-charged polarity groups): these are defined in terms of position indices, where $a_{1\%}$, $a_{25\%}$, $a_{50\%}$, $a_{75\%}$, $a_{100\%}$ represent the positions of the first, 25, 50, 75, and 100% of the high-charged polarity group residues in a sequence, $b_{1\%}$, $b_{25\%}$, $b_{50\%}$, $b_{75\%}$, $b_{100\%}$ represent the corresponding positions for the medium-charged polarity group, and $c_{1\%}$, $c_{25\%}$, $c_{50\%}$, $c_{75\%}$, $c_{100\%}$ represent the corresponding positions for the low-charged polarity group.

In summary, (3 + 3 + 3 × 5) = 21-dimensional features can be extracted from each physicochemical property, and so 8 × 21 = 168-dimensional features can be extracted from the eight physicochemical properties. The 188-dimensional features (20-dimensional + 168-dimensional) are used to express the characteristics of bacteriophage proteins, and are extracted based on the content ratio of each of the 20 amino acids in the sequence and the eight physicochemical properties.
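The 21 features for a single physicochemical property can be sketched as follows. Several details here are assumptions rather than the settings used in this study: the three-way grouping for the charge property, the counting of a pair type in both orders, the rounding rule used to locate the 25, 50, and 75% residues, and the normalization of positions by the sequence length.

```python
import math
from itertools import combinations

# Assumed grouping for the charge property; group labels are arbitrary.
CHARGE_GROUPS = {"1": "KR", "2": "ANCQGHILMFPSTWYV", "3": "DE"}


def ctd_features(sequence, groups=CHARGE_GROUPS):
    """Composition (3), transition (3), and distribution (15) features."""
    lookup = {aa: g for g, members in groups.items() for aa in members}
    s = "".join(lookup[aa] for aa in sequence)  # sequence rewritten as group labels
    length = len(s)
    if length < 2:
        raise ValueError("sequence must contain at least two residues")
    labels = sorted(groups)

    # Composition: n_j / L for each group.
    composition = [s.count(g) / length for g in labels]

    # Transition: adjacent pairs spanning two different groups, over L - 1 pairs.
    pairs = list(zip(s, s[1:]))
    transition = [
        sum(1 for a, b in pairs if {a, b} == {g, h}) / (length - 1)
        for g, h in combinations(labels, 2)
    ]

    # Distribution: positions of the first, 25%, 50%, 75%, and last residue of a group.
    distribution = []
    for g in labels:
        positions = [i + 1 for i, c in enumerate(s) if c == g]
        for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                distribution.append(0.0)
                continue
            k = max(1, math.ceil(fraction * len(positions)))
            distribution.append(positions[k - 1] / length)  # normalized position (assumption)
    return composition + transition + distribution
```

Applying the same function with eight different groupings and concatenating the results with the 20 composition frequencies would give the 188 dimensions counted above.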

Adaptive k-skip-n-Gram Algorithm
A feature set containing 400 dimensions is extracted based on the adaptive k-skip-n-gram method (Feng et al., 2013c; Cao et al., 2017; Wei et al., 2017a; Tang et al., 2018). In this study, the value of n was set to 2 ($20^2 = 400$).
The value of k represents the separation distance between two amino acids. For example, in the protein sequence $S = A_1 A_2 A_3 \cdots A_L$ (where $L$ is the length of the sequence), $A_i$ and $A_j$ are the ith and jth amino acids of $S$. In a bacteriophage protein dataset, the sequences have very different lengths. If the parameter k is fixed to a specific value, the sequence information cannot be properly represented, which will affect the final classification performance. Therefore, the value of k was set to be adaptive so that k could vary with the length of the sequence.
For n = 2, the combinations of the 20 most common amino acids and the number of occurrences of each combination in the sample datasets are as shown in Figure 3.
This process is similar to a fully connected layer in a neural network. Among the 20 common amino acids, any one can pair with any other amino acid (or with itself), and the combination is arbitrary. As with full connection, this leads to overfitting when there are too many combinations. Therefore, n should not be too high when using an adaptive k-skip-n-gram method. When n = 1, we have the traditional n-gram model proposed by Guthrie et al. (2006), which does not apply to shorter protein sequences. Therefore, n was set to 2 in this study.
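A minimal sketch of the k-skip-2-gram extraction discussed in this section is given below. The rule by which k adapts to the sequence length is not stated here, so `adaptive_k` and its `k_max` parameter are purely hypothetical placeholders; the pair collection and the 400 relative frequencies follow the description in the text.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def skip_bigrams(sequence, k):
    """Collect every ordered pair with 0..k residues skipped between its members."""
    pairs = []  # kept as a list so that repeated pairs are counted
    for gap in range(k + 1):
        for i in range(len(sequence) - gap - 1):
            pairs.append(sequence[i] + sequence[i + gap + 1])
    return pairs


def adaptive_k(length, k_max=5):
    """Hypothetical rule: cap the skip distance so short sequences still yield pairs."""
    return max(0, min(k_max, length - 2))


def skip_bigram_features(sequence, k_max=5):
    """Return 400 frequencies: count of each residue pair over the total pair count."""
    if len(sequence) < 2:
        raise ValueError("sequence must contain at least two residues")
    pairs = skip_bigrams(sequence, adaptive_k(len(sequence), k_max))
    counts = {}
    for pair in pairs:
        counts[pair] = counts.get(pair, 0) + 1
    return [counts.get(a + b, 0) / len(pairs) for a, b in product(AMINO_ACIDS, repeat=2)]
```

For the toy sequence `"ACD"` the collected pairs are AC, CD, and AD, so three of the 400 entries equal 1/3 and the rest are zero.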
In this feature extraction method, a combination set is first formed from all pairs of amino acids separated by a specified interval (Wei et al., 2017a). In addition, $C_{skipgram}$ is used to represent the collection of pairs of amino acids combined at all intervals in a sequence (Wei et al., 2017a). Finally, each feature (Wei et al., 2017a) is the ratio

$$\frac{N(a_{m1} a_{m2} \cdots a_{mn})}{N(C_{skipgram})}$$

where $N(C_{skipgram})$ is the total number of elements in $C_{skipgram}$, $a_{m1} a_{m2} \cdots a_{mn}$ ranges over the $20^n$ kinds of amino acid combinations of length $n$, and $N(a_{m1} a_{m2} \cdots a_{mn})$ is the number of times the combination $a_{m1} a_{m2} \cdots a_{mn}$ occurs in $C_{skipgram}$.

The third feature representation is based on the profile of PSI-BLAST (Altschul et al., 1997) and the profile of PSI-PRED (Jones, 1999), which contain rich evolutionary information and secondary structure information, respectively. In this way, the 473-dimensional feature can be extracted.
1) Extracting 20-dimensional features based on PSI-BLAST: $S_{z,i}$ denotes the score with which the residue at position $z$ of sequence $S$ mutates to residue type $i$ during evolution, where $i$ is one of the 20 common residues. $S_i$ denotes the average score with which the residues in sequence $S$ mutate to the ith residue during evolution, i.e., the mean of $S_{z,i}$ over all positions $z$.
2) Extracting 420-dimensional features based on n-gram: the adaptive k-skip-n-gram algorithm without the k value is the n-gram method. Here, n is taken as 1 and as 2, giving 20 + 400 = 420 features.
3) Extracting six features based on the secondary structure sequence (Wei et al., 2015): Three features describe the spatial arrangement, where $P^{H}_{z}$ represents the position index of the zth H in the secondary structure of the sequence $S$ and $n_H$ represents the total number of occurrences of H in the secondary structure of the sequence. Two features describe the percentage of the maximum continuous length (Wei et al., 2015), where $C_H$ represents the length of a fragment in which H appears consecutively in the secondary structure sequence. One further feature distinguishes between the two structural classes α + β and α/β (Wei et al., 2015):

$$f_{\beta\alpha\beta} = \frac{n_{\beta\alpha\beta}}{L_{seg} - 2}$$

This formula calculates the frequency at which βαβ appears in the segment sequence $S_{seg}$, where $n_{\beta\alpha\beta}$ represents the number of times βαβ appears in $S_{seg}$ and $L_{seg}$ indicates the length of $S_{seg}$.
4) Extracting 27 features based on structural probability matrices: three features from the overall information and 24 features from local information.
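Two of the simpler quantities above can be sketched directly, assuming the PSI-BLAST scores are already available as a matrix with one row of 20 scores per sequence position and that the segment sequence is already given as a string. The ASCII letters `a` and `b` stand for α and β, overlapping occurrences of the pattern are counted, and both function names are hypothetical.

```python
def pssm_column_means(pssm):
    """Average each of the 20 score columns over all sequence positions."""
    length = len(pssm)
    if length == 0:
        raise ValueError("the score matrix must contain at least one row")
    return [sum(row[i] for row in pssm) / length for i in range(20)]


def beta_alpha_beta_frequency(segments):
    """Frequency of the pattern 'bab' among the L_seg - 2 consecutive triples."""
    l_seg = len(segments)
    if l_seg < 3:
        return 0.0  # no triple exists, so the pattern cannot occur
    n_bab = sum(1 for i in range(l_seg - 2) if segments[i:i + 3] == "bab")
    return n_bab / (l_seg - 2)
```

For the segment string `"babab"`, two of the three triples match the pattern, giving a frequency of 2/3.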

Feature Selection
Based on the feature extraction methods described in section Feature extraction, we extracted a 188-dimensional and a 400-dimensional feature set based on sequence information, and a 473-dimensional feature set based on sequence and secondary structure information, representing the entire bacteriophage protein sequence dataset. Some redundant or irrelevant features were still present in these sets. The existence of invalid features wastes time and computational resources, and affects the classification accuracy of the model (Chen et al., 2018b,f,g; Dao et al., 2018; Yang et al., 2018; Zhu et al., 2018a,b). In this paper, the Max-Relevance-Max-Distance (MRMD) method was used to select features and identify higher-quality feature sets, i.e., the optimal feature subset. In this method, Pearson's correlation coefficient is used to calculate the correlation between features and class labels (MR), thus enabling the selection of features with strong correlation to the target class. Three distance functions (Euclidean and cosine distances and the Tanimoto coefficient) are used to calculate the redundancy between features (MD) and identify features with low redundancy. Taking two feature vectors $(X, Y)$ as an example, Pearson's correlation coefficient (Pearson, 1909) is expressed as follows:

$$\mathrm{PCC}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$

where $\sigma_X$ and $\sigma_Y$ denote the standard deviations of the two vectors and $\mathrm{cov}(X, Y)$ is the covariance, which is used to measure the relationship between two random variables. The covariance formula is as follows:

$$\mathrm{cov}(X, Y) = E\left[(X - \bar{X})(Y - \bar{Y})\right]$$

where $\bar{X}$ and $\bar{Y}$ denote the means of the respective vectors. The formula for the Euclidean distance (Larson and Edwards, 1991; Deza and Deza, 2009) is:

$$\mathrm{ED}(X, Y) = \sqrt{\sum_{q=1}^{n} (x_q - y_q)^2}$$

where $n$ is the total number of elements in each vector, $x_q$, $y_q$ are the q-th elements in $X$, $Y$, respectively, and $M$ is the number of feature vectors over which the pairwise distances are computed.
The cosine distance formula (Tan et al., 2005) is:

$$\cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$$

The Tanimoto coefficient (Rogers and Tanimoto, 1960) is given by:

$$\mathrm{TC}(X, Y) = \frac{X \cdot Y}{\lVert X \rVert^2 + \lVert Y \rVert^2 - X \cdot Y}$$

Using these measures, we identified the features with the strongest correlation with the class labels and the minimum redundancy among themselves. In different scenarios, the weights $w_r$ and $w_d$ of MR and MD in the objective $\max(w_r \times MR_i + w_d \times MD_i)$ can be adjusted to ensure the acquired features are suitable for the classification task.
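The scoring idea can be sketched in plain Python as follows. This is a simplified illustration, not the MRMD tool itself: the relevance term uses the absolute Pearson correlation, the distance term uses only the mean Euclidean distance to the other features, and the function names and default weights are assumptions. The cosine and Tanimoto measures are included to show the formulas, and could be averaged into the distance term in the same way.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def tanimoto(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)


def mrmd_rank(features, labels, w_r=1.0, w_d=1.0):
    """Rank feature columns by w_r * MR_i + w_d * MD_i, best first."""
    m = len(features)
    scores = []
    for i, column in enumerate(features):
        relevance = abs(pearson(column, labels))
        others = [euclidean(column, g) for j, g in enumerate(features) if j != i]
        distance = sum(others) / (m - 1) if m > 1 else 0.0
        scores.append(w_r * relevance + w_d * distance)
    return sorted(range(m), key=lambda i: scores[i], reverse=True)
```

Taking the first entries of the returned ranking gives a candidate feature subset; how many to keep would be decided by the downstream classification performance.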

EXPERIMENTS

Performance Evaluation Criteria
A 10-fold cross-validation method was employed to evaluate the models. There are four common evaluation indicators, namely the accuracy (ACC), sensitivity (SN), specificity (SP), and Matthews' correlation coefficient (MCC) (Feng et al., 2013a; Chen W. et al., 2016; Wei et al., 2017b,c; Xu et al., 2017; Jingjing et al., 2018). These are expressed as follows (Zou et al., 2013; Chen et al., 2014; Qu et al., 2017):

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{SN} = \frac{TP}{TP + FN}$$

$$\mathrm{SP} = \frac{TN}{TN + FP}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP denotes true positive, i.e., the number of positive samples that are predicted to be positive samples, TN denotes true negative, i.e., the number of negative samples that are predicted to be negative samples, FP denotes false positive, i.e., the number of negative samples that are predicted to be positive samples, and FN denotes false negative, i.e., the number of positive samples that are predicted to be negative samples.
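The four indicators can be computed from the confusion-matrix counts as in the following sketch. The function name and the convention of returning 0 for MCC when its denominator vanishes are assumptions, and the counts in the usage note are made up for illustration rather than taken from the experiments.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Return ACC, SN, SP, and MCC from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SN": tp / (tp + fn),
        "SP": tn / (tn + fp),
        # MCC is undefined when a row or column of the matrix is empty; use 0 then.
        "MCC": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }
```

With the illustrative counts TP = 40, TN = 45, FP = 5, FN = 10, this gives ACC = 0.85, SN = 0.8, SP = 0.9, and an MCC of about 0.70.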

Classification Effects of Different Classifiers
Experiment 1: This part of the experiment is based on the feature sets of 188, 400, and 473 dimensions extracted by the method in Feature extraction. The accuracy of each classification algorithm before and after using the MRMD feature selection algorithm is presented in Table 1.
The data in Table 1 indicate that, for the classification of bacteriophage proteins, the random forest algorithm gives the best classification performance regardless of which feature extraction algorithm is used and whether or not feature selection is performed.
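The 10-fold cross-validation protocol used to compare classifiers can be sketched independently of any particular learning algorithm. In the sketch below, `fit_predict` is a hypothetical callable that trains on nine folds and predicts the held-out fold; a random forest or any other classifier would be supplied in that role. The `majority_class` function is only a trivial placeholder for demonstration, and the folds are not stratified by class.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, fit_predict, k=10, seed=0):
    """Return the overall accuracy when each fold is held out exactly once."""
    correct = 0
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        train_x = [samples[i] for i in range(len(samples)) if i not in held]
        train_y = [labels[i] for i in range(len(samples)) if i not in held]
        test_x = [samples[i] for i in held_out]
        predictions = fit_predict(train_x, train_y, test_x)
        correct += sum(p == labels[i] for p, i in zip(predictions, held_out))
    return correct / len(samples)


def majority_class(train_x, train_y, test_x):
    """Placeholder classifier: always predict the most frequent training label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common] * len(test_x)
```

Because every sample is held out exactly once, the returned accuracy is computed over the whole dataset rather than over a single train/test split.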

Performance of Different Feature Extraction Methods
Experiment 2: Experiment 1 showed that the random forest algorithm produces the best classification of bacteriophage proteins. In this second experiment, the 188-dimensional and 400-dimensional datasets extracted based on sequence information (Seq Based), a 473-dimensional dataset extracted based on sequence and structure information (Seq and stru Based), and two combined feature sets (Com Based) were each fed into the random forest algorithm, and the resulting performance was compared. The experimental results are presented in Table 2. Feature fusion can boost the recognition performance by combining the complementary information of different features (Zhu et al., 2016, 2018c). A 588-dimensional feature set was obtained by combining the features of the 188- and 400-dimensional feature sets, and a 661-dimensional feature set was obtained by combining the features of the 188- and 473-dimensional feature sets. According to the experimental results, the 188-, 473-, 588-, and 661-dimensional feature set models all give good bacteriophage protein classification performance. However, judged on the other three evaluation indicators, the 661-dimensional feature set, obtained by combining the 188-dimensional feature set extracted based on sequence information with the 473-dimensional feature set extracted based on sequence and secondary structure information, is the best. This indicates that, in phage protein classification, a feature set that contains both sequence information and structural information has the most favorable influence on classification performance, and also shows that combining feature sets is effective for improving performance in protein classification.
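Feature combination here amounts to concatenating, for each protein, the vectors produced by the individual representations. A minimal sketch under that assumption is shown below; the function name is hypothetical and the inputs are placeholder values, not real features.

```python
def fuse(*feature_sets):
    """Concatenate per-sample feature vectors from several representations."""
    n_samples = len(feature_sets[0])
    if any(len(fs) != n_samples for fs in feature_sets):
        raise ValueError("all feature sets must describe the same samples")
    fused = []
    for i in range(n_samples):
        row = []
        for fs in feature_sets:
            row.extend(fs[i])  # append this representation's features for sample i
        fused.append(row)
    return fused


# Placeholder matrices for three proteins: 188 + 473 = 661 and 188 + 400 = 588.
seq_188 = [[0.0] * 188 for _ in range(3)]
skip_400 = [[0.0] * 400 for _ in range(3)]
struct_473 = [[0.0] * 473 for _ in range(3)]
combined_661 = fuse(seq_188, struct_473)
combined_588 = fuse(seq_188, skip_400)
```

The rows must be in the same protein order in every input, since the concatenation is done position by position.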

Importance of Feature Selection
Experiment 3: This experiment used the random forest classification algorithm to classify the feature sets after MRMD. The results are given in Table 3.
A comparison of the data in Tables 2 and 3 shows that, after using the feature selection algorithm (MRMD), the classification performance does not deteriorate as the dimension decreases; in some cases it even improves. After removing the redundant features, the best classification performance is still given by the dataset obtained by feature combination, that is, the 256-dimensional feature set obtained by removing redundant features from the 661-dimensional feature set.

Comparison With Recent Methods
Experiment 4: To provide an objective demonstration of the performance of the model described in this paper, this experiment compared the optimal proposed model with bacteriophage protein classification models proposed in recent years. The results are presented in Table 4. It can be seen from Table 4 that the bacteriophage classification model proposed in this paper achieves good classification performance, with a classification accuracy of 93.5%. Compared with the method of Feng et al., the accuracy increased by 14%; compared with the methods of Ding et al. and Zhang et al., it increased by 8%. The other three evaluation indicators also improve to different degrees, indicating that the model proposed in this paper is an effective tool for phage protein classification.

Analyzing the Impact of Eight Physicochemical Properties
This section summarizes the top eight feature dimensions that have a significant impact on the classification of bacteriophage proteins. These eight features are listed in Table 5 in order of their impact.
According to the information in this table, the effects of the eight physicochemical properties of amino acids on the classification of bacteriophage proteins are evenly distributed, and the property with the greatest impact on the classification is the charge property of amino acids.

CONCLUSION
Bacteriophage proteins are of special significance for cell typing and pathological research. It is very important to correctly classify virion and non-virion bacteriophage proteins. Therefore, this paper has proposed the following classification model: (1) higher-quality feature datasets are extracted with extraction algorithms based on feature combination; (2) the optimal feature subset is selected using the MRMD algorithm for feature selection; and (3) the random forest algorithm is applied to perform protein classification. The model achieves an accuracy of up to 93.5% for the classification of bacteriophage proteins. This demonstrates that the model developed in this paper is an important tool for the classification of bacteriophage proteins. As a future direction, link prediction paradigms, which have been successfully applied in the prediction of disease genes (Zeng et al., 2017) and miRNAs (Zeng et al., 2018), can be considered for the identification of bacteriophage proteins. It might also be important to integrate evolutionary information using tools like evolutionary trees and networks (Yang et al., 2013, 2014). Finally, computational intelligence methods such as neural networks (Song et al., 2018a,b) and evolutionary algorithms (Hang et al., 2018) can be applied in this field.

AUTHOR CONTRIBUTIONS
XR implemented the experiments and drafted the manuscript.
LL and CW initiated the idea, conceived the whole process, and finalized the paper. All authors have read and approved the final manuscript.

ACKNOWLEDGMENTS
The work was supported by the Natural Science Foundation of China (No. 61872114, 91735306), and the National Key Research and Development Plan Task of China (No. 2016YFC0901902). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.