Identification of Phage Viral Proteins With Hybrid Sequence Features

Ru, Xiaoqing; Li, Lihong; Wang, Chunyu

doi:10.3389/fmicb.2019.00507

ORIGINAL RESEARCH article

Front. Microbiol., 26 March 2019

Sec. Systems Microbiology

Volume 10 - 2019 | https://doi.org/10.3389/fmicb.2019.00507

Identification of Phage Viral Proteins With Hybrid Sequence Features

1. School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
2. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China

Abstract

The uniqueness of bacteriophages plays an important role in bioinformatics research. In real applications, the function of the bacteriophage virion proteins is the main area of interest. Therefore, it is very important to classify bacteriophage virion proteins and non-phage virion proteins accurately. Extracting comprehensive and effective sequence features from proteins plays a vital role in protein classification. In order to more fully represent protein information, this paper is more comprehensive and effective by combining the features extracted by the feature information representation algorithm based on sequence information (CCPA) and the feature representation algorithm based on sequence and structure information. After extracting features, the Max-Relevance-Max-Distance (MRMD) algorithm is used to select the optimal feature set with the strongest correlation between class labels and low redundancy between features. Given the randomness of the samples selected by the random forest classification algorithm and the randomness features for producing each node variable, a random forest method is employed to perform 10-fold cross-validation on the bacteriophage protein classification. The accuracy of this model is as high as 93.5% in the classification of phage proteins in this study. This study also found that, among the eight physicochemical properties considered, the charge property has the greatest impact on the classification of bacteriophage proteins These results indicate that the model discussed in this paper is an important tool in bacteriophage protein research.

Introduction

In the biological world, bacteriophages are ubiquitous, with different genomes and lifestyles. According to their morphology, they can be classified as either tail, tailless, or filamentous bacteriophages. According to morphology and nucleic acid, phages are classified as infect bacteria and infect archaea. The bacteriophage must be attached to a host cell for growth and reproduction (Seguritan et al., 2012), and directly affects the host population by lysing host cells. In addition, each bacteriophage is specific and greatly reduces the damage to host cells (Haq et al., 2012). Identification and classification of various bacteria can be performed based on the universality, diversity, dependence, and specificity of bacteriophages (Marks and Sharp, 2015). The structure of bacteriophages is simple, consisting of only a protein shell and genetic material (DNA or RNA) (Haq et al., 2012), making them important substances for simplifying experimental research in bioinformatics. As a bacteriophage can insert genes into host cells (Ding et al., 2014), it is an important tool for studying genetics (Cheng et al., 2018; Hu et al., 2018). Hershey (Hershey and Chase, 1952) performed biological experiments using the T2 bacteriophage and bacteria in 1952, and finally confirmed that DNA is the genetic material of bacteriophages and other organisms. The significance of this research in the development of biological science earned Hershey and coworkers the Nobel Prize in Physiology. Bacteriophage provide experimental systems and tools for the molecular biological science revolution. The bacteriophage rapid development has led to dection of basic principles of ecology and evolution. Besides, it is relatively easy to synthesize and has modular characteristic, which cater to the needs of synthetic biologists and carry out engineering research and implementation of biological function.

Bacteriophage proteins are classified into virion and non-viron proteins (Zhang et al., 2015), with most practical interest focusing on the function of bacteriophage virion proteins (Feng et al., 2013b). Therefore, bacteriophage proteins must be accurately classified and identified so that researchers can further study the structure and function of a particular bacteriophage. After the human genome project was officially launched in 1990, the number of bacteriophage protein sequences with unknown functions increased dramatically (Seguritan et al., 2012; Chen et al., 2018a). Faced with a large volume of data, traditional biological experimental methods could no longer keep up with the post-gene era (Chen W. et al., 2016; Cheng et al., 2019; Mrozek et al., 2016; Hu et al., 2018). For this reason, researchers introduced different machine learning algorithms into bacteriophage classification and prediction research. For example, Li et al. (2007) developed a support vector machine system called SynFPS that uses the gene–gene distance determined by k-means clustering to identify closely related genomes and perform gene function prediction. Using the protein appearance frequency of amino acids and information of isoelectric points, Seguritan et al. (2012) developed an artificial neural network method to classify viral structures. Feng et al. (2013b) used the main amino acid and dipeptide components as an encoding scheme, and modified a naive Bayes classifier to identify bacteriophage proteins. Ding et al. (2014) used g-gap dipeptide composition to represent protein sequence information, incremental feature selection to analyze the variance and identify the optimal feature set, and a support vector machine for classification. Zhang et al. (2015) obtained sequence feature vectors with various techniques, and then used the incremental feature selection algorithm to select the optimal feature subsets. Finally, the prediction results of individual classifiers trained in different feature spaces were integrated to produce the final classification effect. Machine learning algorithm (Robert, 2012; Stephenson et al., 2018) automatically analyze and obtain rules from data and use them to predict unknown data (Chen and Yan, 2013; Yu et al., 2015, 2016a; Chen and Huang, 2017; Chen et al., 2018h; Wang et al., 2018). This saves time and money, but the results from such algorithms are not as convincing as those from biological experiments. Therefore, it is especially important to choose an appropriate machine learning algorithm to ensure the most accurate classification results (Liu, 2017; Yao et al., 2017; Yu et al., 2017a). In a protein classification experiment, the classification effect depends largely on the feature set extracted (Zou et al., 2013; Bin et al., 2015; Mrozek et al., 2015; Jia et al., 2016; Yu et al., 2016b, 2018; Zhang et al., 2016; Huang et al., 2017; Qu et al., 2017; Jiang et al., 2018; Qiao et al., 2018; Xiong et al., 2018; Xu et al., 2018b). To date, feature extraction methods are divided into sequence-based and structure-based approaches (Huang et al., 2017; Qu et al., 2017) The feature set extraction part of this study is obtained by combining the features extracted by the two feature extraction methods.

In this study, we examined the final classification effect of the selected methods and the stability of the dataset when the feature dimension was reduced. First, to remove the imbalance in the reference dataset, CD-Hit was used to remove redundant data, resulting in a balanced dataset that contains comprehensive information and less redundancy. Pearson's correlation coefficient and three distance functions (Euclidean and cosine distances and the Tanimoto coefficient) (Zou et al., 2016) were then used to calculate the correlation between features and class labels and the redundancy between features. Finally, the optimal feature subset with the strongest correlation between features and class labels and low redundancy between features was selected. According to some recent studies(Wu et al., 2009; Yi et al., 2011; Chen and Lin, 2012; Yang et al., 2015; Yu et al., 2017b; Zhang and Liu, 2017; Xu et al., 2018a; Liu et al., 2019), the best algorithms for protein classification are support vector machines and random forest algorithms. However, support vector machines are more suitable for small sample sets in which the number of dimensions is greater than the number of samples. Thus, the random forest algorithm was used in this study. The random forest algorithm (Breiman, 2001; Yao et al., 2017) combines multiple weak classifiers to produce a final result that has higher accuracy and better generalization performance. It can achieve good results, mainly because of the random nature of the “forest,” which makes the algorithm resistant to overfitting and more precise. Finally, in terms of bacteriophage protein classification, the data set extracted by combining the features and the feature selection of the feature set have a positive impact on the protein classification effect. Our results also show that, among the eight physicochemical properties of amino acids, the charge property has the greatest influence on the classification of bacteriophage proteins. To evaluate the performance of the models used in this study, the results were compared with those given by the methods introduced in (Feng et al., 2013b; Ding et al., 2014; Zhang et al., 2015). Figure 1 shows the workflow of this study.

Figure 1

Methods

Dataset Processing

Source: UniProt (Rolf, 2004; Consortium, 2012) is a widely used protein sequence database that offers low protein sequence redundancy and complete protein function interpretation (Cao and Cheng, 2016a; Jiang et al., 2016). As this website is free and open, researchers can download the desired protein sequence for free. The original positive samples used in this study (a total of 15,765 data), e.g., the number of bacteriophage virion proteins, were downloaded from this database. After obtaining the bacteriophage virion protein (positive) sample set, the PFAM family of positive samples was excluded from all PFAM families, such that the remaining samples were families of non-phage virion proteins. Finally, the longest protein sequence of the remaining families was extracted to form a negative sample set. The positive and counterexample datasets obtained as described above may all contain homologous sequences. Using such sample sets would result in the classification accuracy being overestimated, which is not conducive to the establishment of prediction models. Therefore, we used the CD-Hit tool to remove redundant positive and negative samples from the datasets.

Data integration: The CD-Hit (Li et al., 2001; Li and Godzik, 2006; Huang et al., 2010; Fu et al., 2012; Chen et al., 2017) redundancy tool effectively clusters similar sequences. The basic principle is to sort protein sequences in the dataset in descending order. The longest sequence is taken as the first class, and then this is compared with the second-longest protein sequence in terms of their similarity. If the similarity between the two is greater than some threshold, they are deemed to belong to the same class. Otherwise, the second-longest sequence forms a new class. Because the bacteriophage virion protein sequences were downloaded from UniProt, which ensures relatively low redundancy, the interrupt threshold was set to 0.8. The non-phage virion proteins had a higher degree of redundancy, so their interrupt threshold was set to 0.4. Thus, 6,251 bacteriophage virion protein sequences and 9,514 non-phage virion protein sequences were obtained. The union of the resulting positive and negative sample datasets gives the total dataset, and the intersection of the two is empty.

Feature Extraction

Representation Algorithms for Amino Acid Composition and Eight Physicochemical Properties

In this study, a feature set containing 188 dimensions was extracted based on amino acid composition and eight physicochemical properties. The amino acid composition is one of the most basic features of proteins (Zhang et al., 2015; Cao and Cheng, 2016b). Eight physicochemical properties of amino acids also play a role in the functional properties of bacteriophage proteins. In 1988, Coia et al. (1988) found that amino acids having lighter side chain groups are more likely to constitute bacteriophage virion sequences. In 1994, Marvin et al. (1994) proposed that hydrophilicity, hydrophobicity, and charge have a greater impact on the function of bacteriophage virion proteins. In 2008, Shen and Chou (2008) identified the vital role that the hydrophilicity and hydrophobicity of amino acids play in the folding of proteins. In 2014, Ting et al. (2014) used logistic regression to integrate several biological features, including physicochemical properties for predicting lysine acetylation, thus demonstrating the effect of physicochemical properties on protein structure and function. Therefore, the amino acid composition and its eight physicochemical properties are used to extract features that reflect the characteristics of bacteriophage proteins.

The 20 most common amino acids are as follows:

The occurrence frequency of each amino acid in a protein sequence can be expressed as:

Where n_i is the frequency with which amino acid i occurs in the protein sequence and L is the length of the protein sequence.

In addition, these 20 amino acids can be classified into three types according to their physicochemical properties (Chou and Com, 2010), as shown in Figure 2.

Figure 2

The composition, transformation, and distribution of amino acids were determined by Dubchak et al. (1995) based on a global description of protein sequences. The feature extraction methods for the eight physicochemical properties of a protein sequence are as follows. Taking the electrode polarity as an example (expressed by p), the 20 amino acids are divided into high-, medium-, and low-charged polarity groups, which are expressed by p_h, p_p, p_l, respectively. The composition, transformation, and distribution of the amino acids at this time can be represented by equations (3)–(7).

Composition features (Dubchak et al., 1995) (frequency of each charged electrode group in a sequence):

where f₂₁, f₂₂, f₂₃ denote the content of the high-, medium-, and low-charged polarity groups in a sequence, respectively,L is the length of the protein sequence,n₁, n₂, n₃ are the frequencies with which the three electrode groups appear in the sequence.

Conversion feature (Dubchak et al., 1995) (frequency of occurrence of bigeminal sequences):

Where f₃₁, f₃₂, f₃₃ denote the content of the three bigeminal groups p_hl, p_hp, p_pl, and m₁, m₂, m₃ are the frequencies of these three bigeminal groups appearing in sequence. There are three possible sequences of the charged polarity: p_hl, p_hp, p_pl In addition, in a protein sequence of length L, assuming that any two adjacent amino acids constitute a pair, the protein sequence contains L − 1 paired sequences (Zou et al., 2013).

Distribution features (Dubchak et al., 1995) (amino acid distribution of the high-, medium-, and low-charged polarity groups):

Where a_1%, a_25%, a_50%a_75%a_100% represent the positions of the first, 25, 50, 75, and 100% high-charged polarity groups in a sequence, b_1%, b_25%, b_50%, b_75%, b_100% represent the positions of the first, 25, 50, 75, and 100% medium-charged polarity groups in a sequence and c_1%, c_25%, c_50%, c_75%, c_100% represent the positions of the first, 25, 50, 75, and 100% low-charged polarity groups in a sequence.

In summary, (3 + 3 + 3 × 5) = 21-dimensional features can be extracted from each physicochemical property, and so 8 × 21 = 168-dimensional features can be extracted from the eight physicochemical properties. The 188-dimensional features (20-dimensional + 168-dimensional) are used to express the characteristics of bacteriophage proteins, and are extracted based on the content ratio of each of the 20 amino acids in the sequence and the eight physicochemical properties.

Adaptive k-skip-n-Gram Algorithm

A feature set containing 400 dimensions is extracted based on the adaptive k-skip-n-gram method (Feng et al., 2013c; Cao et al., 2017; Wei et al., 2017a; Tang et al., 2018). In this study, the value of n was set to 2 (20² = 400).

The K value represents the separation distance between two amino acids. For example, in the protein sequence S = A₁A₂A₃⋯A_L (where L is the length of the sequence),

And A_i,A_j are the ith and jth amino acids of S.

In a bacteriophage protein dataset, the sequences have very different lengths. If the parameter K is fixed to a specific value, the sequence information cannot be properly represented, which will affect the final classification effect. Therefore, the value of k was set to be adaptive so that K could vary with the length of the sequence.

For n = 2, the combinations of the 20 most common amino acids and the number of occurrences of each combination in the sample datasets are as shown in Figure 3.

Figure 3

This process is similar to full connection in a neural network. Among the 20 common amino acids, anyone can combine with another amino acid (or itself) in pairs, and the combination is random. In the same way as full connection, this leads to overfitting when there are too many data. Therefore, n should not be too high when using an adaptive k-skip-n-gram method. When n = 1, we have the traditional n-gram model proposed by Guthrie et al. (2006), which does not apply to shorter protein sequences. Therefore, n was set to 2 in this study.

In this feature extraction method, the combination set of two specified interval amino acids (Wei et al., 2017a) is given by:

In addition, C is used to represent a set of two amino acids that are combined at all intervals in a sequence (Wei et al., 2017a). Namely:

Finally, the feature extraction formula (Wei et al., 2017a) is:

Where N(C_skipgram) is the total number of elements in set C,a_m1a_m2⋯a_mn are the 20ⁿ kinds of amino acid combinations of length n, N(a_m1a_m2⋯a_mn) is the frequency that the two-two combination in a_m1a_m2⋯a_mn occurs in C_skipgram

Mixed Representation Algorithm (Seq-Str)

Some researchers have combined different feature extraction methods and achieved very good classification results (Dehzangi et al., 2013; Zou et al., 2014; Leyi et al., 2015, 2018; Chen X. et al., 2016; Ding et al., 2016, 2017a,b; Li et al., 2016; Chen et al., 2017,a,b, 2018c,d,e; Su et al., 2018 Shen et al., 2019; Wei et al., 2019; Zhu et al., 2019). Wei et al. (2015) proposed a novel feature extraction method that uses both the profile of PSI-BLAST (Altschul et al., 1997) and the profile of PSI-PRED (Jones, 1999), which contain rich evolutionary information and secondary structure information, respectively. In this way, the 473-dimensional feature can be extracted.

Extract 20-dimensional features based on PSI-BLAST as follows:
S_{z, i}indicates that during the evolution process, the residue at the “z” position in the sequence S is mutated to the fraction of the “i” species, and “i” is one of the 20 common residues. S_i indicates that during the evolution, the residue in sequence S is mutated to the average score of the ith residue.
Extracting 420-dimensional features based on n-gram: The Adaptive k-skip-n-gram algorithm that does not consider the k value is the n-gram method. Here, take n equal to 1 and n equal to 2
Based on the secondary structure sequence, the following six features are extracted (Wei et al., 2015): Three feature extraction formulas for spatial arrangement
Where P_{H_z}represents the position index of the zth H in the secondary structure of the sequence S. n_H represents the total number of occurrences of H in the secondary structure of sequence.
Two feature extraction formulas for the percentage of the maximum continuous length (Wei et al., 2015).
C_H represents the length of the fragment in which H appears consecutively in the sequence of the secondary structure.
A new feature for distinguishing between two structural classes, α + β and : (Wei et al., 2015)
This formula calculates the frequency at which βαβ appears in the fragmented sequence S_seg, n_βαβ represents the number of times βαβ appears in S_seg, L_segindicates the length of S_seg.
Extracting 27 features based on structural probability matrices: Three features from the overall information and 24 features from local information

Feature Selection

Based on the feature extraction methods described in section Feature extraction, We extracted a 188-dimensional, 400-dimensional feature set based on sequence information, and a 473-dimensional data set based on sequence and secondary structure information representing the entire bacteriophage protein sequence dataset. Some redundant or irrelevant cases were still present in these features. The existence of invalid features wastes time and computational resources, and affects the classification accuracy of the model (Chen et al., 2018b,f,g; Dao et al., 2018; Yang et al., 2018; Zhu et al., 2018a,b). In this paper, the Max-Relevance-Max-Distance (MRMD) (Zou et al., 2016) method was used to select features and identify higher-quality feature sets, i.e., the optimal feature subset. In this method, Pearson's correlation coefficient is used to calculate the correlation between features and class labels (MR), thus enabling the selection of features with strong correlation to the target class. Three distance functions (Euclidean and cosine distances and the Tanimoto coefficient) are used to calculate the redundancy between features (MD) and identify features with low redundancy.

Taking the two eigenvectors (X,Y) as an example, Pearson's correlation coefficient (Pearson, 1909) expressed as follows:

Where σ_X and σ_Y denote the standard deviation of the two vectors, cov(X, Y) is the covariance, which is used to measure the relationship between two random variables. The covariance formula is as follows:

Where and denote the mean of the respective vectors.

The formula for the Euclidean distance (Larson and Edwards, 1991; Deza and Deza, 2009) is:

Where M is the number of feature vectors,n is the total number of elements in each vector, and x_q, y_q are the q-th elements in X, Y, respectively.

The cosine distance formula (Tan et al., 2005) is:

Where

The Tanimoto coefficient (Rogers and Tanimoto, 1960) is given by:

Using these distance metrics, we identified the features with the strongest correlation and minimum redundancy with respect to the class labels. In different scenarios, we can increase the weights of MR and MD (max(wr × MR_i + wd × MD_i)) to ensure the acquired features are suitable for the classification task.

Experiments

Performance Evaluation Criteria

A 10-fold cross-validation method was employed to evaluate the models. There are four common evaluation indicators, namely the accuracy (ACC), sensitivity (SN), specificity (SP), and Matthews' correlation coefficient (MCC) (Feng et al., 2013a, 2018; Chen W. et al., 2016; Wei et al., 2017b,c; Xu et al., 2017; Jingjing et al., 2018). These are expressed as follows (Zou et al., 2013; Chen et al., 2014; Qu et al., 2017):

Where TP denotes true positive, i.e., the number of positive samples that are predicted to be positive samples, TN denotes true negative, i.e., the number of negative samples that are predicted to be negative samples, FP denotes false positive, i.e., the number of negative samples that are predicted to be positive samples, and FN denotes false negative, i.e., the number of positive samples that are predicted to be negative samples.

Classification Effects of Different Classifiers

Experiment 1: This part of the experiment is based on the feature sets of 188, 400, and 473 dimensions extracted by the method in Feature extraction. The accuracy of each classification algorithm before and after using the MRMD feature selection algorithm is presented in Table 1.

Table 1

Feature_extraction	Feature_selection	number of D	LibSVM (%)	Naive Bayes (%)	Random forest (%)
CCPA		188D	68.5	78.3	91.3
	MRMD	185D	68.5	78.3	91.5
AKSNG		400D	60.3	71.8	88.7
	MRMD	252D	60.3	72.8	89.0
Seq-Str		473D	80.6	80.9	92.6
	MRMD	189D	82.0	83.1	93.2

Classification results of three data sets under different classification algorithms.

The data in Table 1 indicate that, for the classification of bacteriophage proteins, no matter which feature extraction algorithm is used, whether or not feature selection is performed, the random forest algorithm is the best classification effect.

Performance of Different Feature Extraction Methods

Experiment 2: Experiment 1 showed that the random forest algorithm produces the best classification of bacteriophage proteins. In this second experiment, the 188-dimensional and 400-dimensional datasets extracted based on sequence information (Seq Based), a 473-dimensional dataset extracted based on structure (Seq and stru Based), and two combined feature sets (Com Based) were integrated into the random forest algorithm, and the resulting performance was compared. The experimental results are presented in Table 2.

Table 2

Extraction method	Number of D	SN (%)	SP (%)	ACC (%)	MCC (%)
Seq based	188D	87.4	93.6	91.3	81.5
	400D	82.8	92.4	88.7	76.1
Seq and str based	473D	86.2	97.2	92.6	85.1
Com based	588D	87.1	93.2	91.2	80.7
	661D	87.5	96.5	93.1	85.3

Classification performance under different feature extraction methods.

Feature fusion can boost the recognition performance by combining the complementary information of different features (Zhu et al., 2016, 2018c). A 588-dimensional feature set was obtained by combining the features of the 188- and 400-dimensional feature sets, and a 661-dimensional feature set was obtained by combining the features of the 188- and 473-dimensional feature sets. According to the experimental results, the 188-, 473-, 588-, and 661-dimensional feature set models give better bacteriophage protein classification performance, However, based on the data of the other three evaluation indicators, the 661-dimensional feature set obtained by combining the 188-dimensional feature set extracted based on the sequence information and the features of the 473-dimensional feature set extracted based on the sequence and the secondary structure is the best. This indicates that the feature set extracted by the feature representation algorithm containing both sequence information and structural information in phage protein classification has the best influence on the classification effect, and also shows that combining some feature sets in protein classification is effective for improving classification performance.

Importance of Feature Selection

Experiment 3: This experiment used the random forest classification algorithm to classify the feature sets after MRMD. The results are given in Table 3.

Table 3

Model	Feature_extraction	SN (%)	SP (%)	ACC (%)	MCC (%)
Mode l	CCPA (188)	87.5	93.4	91.5	81.4
Mode 2	AKSNG (400)	82.9	92.2	89.0	76.0
Mode 3	Seq-Str (473)	86.7	96.6	93.2	84.8
Mode 4	Combine (588)	87.6	93.5	91.5	81.5
Mode 5	Combine (661)	87.9	96.3	93.5	85.3

Classification performance under each model.

The comparison of the data in Tables 2, 3 shows that after using the feature selection algorithm (MRMD), the classification effect does not change with the decrease of the dimension, and even with the decrease of the dimension, the classification effect becomes better. After removing the redundant features, the best classification performance is still the data set obtained by feature combination, that is, the 256-dimensional feature set obtained by removing redundant features from the 661-dimensional feature set.

Comparison With Recent Methods

Experiment 4: To provide an objective demonstration of the performance of the model described in this paper, this experiment compared the optimal proposed model with bacteriophage protein classification models proposed in recent years. The results are presented in Table 4.

Table 4

Model	SN (%)	SP (%)	ACC (%)	MCC (%)
Feng et al. (2013b)	75.7	80.7	79.1	54.9
Ding et al. (2014)	75.7	89.4	85.0	65.5
Zhang et al. (2015)	87.0	83.0	85.0	70.1
This search	87.9	96.3	93.5	85.3

Performance comparison against recent methods.

It is clear from Table 4 that the bacteriophage classification model proposed in this paper achieves a good classification effect, with a classification accuracy of 93.5%. Compared with Feng, it has increased by 14%, compared with Ding and Zhang by 8%. In the other three evaluation indicators, there are also different degrees of improvement, indicating that the model proposed in this paper is an effective tool for phage protein classification.

Analyzing the Impact of Eight Physicochemical Properties

This section summarizes the first eight dimensional features that have a significant impact on the classification effect of bacteriophage proteins. The top eight features are listed in Table 5 in order of their impact.

Table 5

NO.	Fea name	Score	Implication
1	Fea 120	1.0	Position of the 100%th neutral electrical storage amino acid in a sequence
2	Fea 157	0.9968696407744475	Position of the 100%th helical amino acid in a sequence
3	Fea 178	0.9950260206126923	Position of the 100%th soluble amino acid in a sequence
4	Fea 99	0.9949600329187752	Position of the 100%th neutral polarizability amino acid in a sequence
5	Fea 136	0.9948079966447566	Position of the 100%th large tensile amino acid in a sequence
6	Fea 83	0.994509178771573	Position of the 100%th high-electrode amino acid in a sequence
7	Fea 52	0.994137797849692	Position of the 100%th small van der Waals volume amino acid in a sequence
8	Fea 31	0.9937317569946658	Position of the 100%th hydrophilic amino acid in a sequence

Impact of physicochemical properties on classification.

According to the information in this table, the effects of eight physicochemical properties of amino acids on the classification of bacteriophage proteins are evenly distributed, and that which has the greatest impact on the classification is the charge property of amino acids.

Conclusion

Bacteriophage proteins are of special significance for cell typing and pathological research. It is very important to correctly classify virion and non-virion bacteriophage proteins. Therefore, this paper has proposed the following classification model: (1) higher-quality feature datasets are extracted with extraction algorithms based on feature combination; (2) the optimal feature subset is selected using the MRMD algorithm for feature selection; and (3) the random forest algorithm is applied to perform protein classification. The model can achieve accuracy of up to 93.5% for the classification of bacteriophage proteins. This demonstrates that the model developed in this paper is an important tool for the classification of bacteriophage proteins. For the future direction, link prediction paradigms, which have been successfully applied in the prediction of disease genes (Zeng et al., 2017) and miRNAs (Liu et al., 2016; Zeng et al., 2018), can be considered for identification of bacteriophage proteins. It might also be important to integrate evolutionary information using tools like evolutionary trees and networks (Yang et al., 2013, 2014). Finally, computational intelligence such as neural networks (Song et al., 2018a,b) and evolutionary algorithms (Hang et al., 2018) can be applied in this field.

Statements

Author contributions

XR implemented the experiments and drafted the manuscript. LL and CW initiated the idea, conceived the whole process, and finalized the paper. All authors have read and approved the final manuscript.

Acknowledgments

The work was supported Natural Science Foundation of China (No.61872114, 91735306), and the National Key Research and Development Plan Task of China (No. 2016YFC0901902). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1
AltschulS. F.MaddenT. L.SchafferA. A.ZhangJ.ZhangZ.MillerW.et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res.25, 3389–3402. 10.1093/nar/25.17.3389
2
BinL.FuleL.XiaolongW.JunjieC.LongyunF.Kuo-ChenC. (2015). Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res.43, W65–W71. 10.1093/nar/gkv458
- CrossRef
- Google Scholar
3
BreimanL. (2001). Random Forests. Mach. Learn.45, 5–32. 10.1023/A:1010933404324
- CrossRef
- Google Scholar
4
CaoR.ChengJ. (2016a). Integrated protein function prediction by mining function associations, sequences, and protein-protein and gene-gene interaction networks. Methods93, 84–91. 10.1016/j.ymeth.2015.09.011
5
CaoR.ChengJ. (2016b). Protein single-model quality assessment by feature-based probability density functions. Sci. Rep.6:23990. 10.1038/srep23990
6
CaoR.FreitasC.ChanL.SunM.JiangH.ChenZ. (2017). ProLanGO: protein function prediction using neural machine translation based on a recurrent neural network. Molecules22:1732. 10.3390/molecules22101732
7
ChenJ.GuoM.WangX.LiuB. (2018a). A comprehensive review and comparison of different computational methods for protein remote homology detection. Brief Bioinform.19, 231–244. 10.1093/bib/bbw108
8
ChenW.FengP.DingH.LinH. (2018b). Classifying included and excluded exons in exon skipping event using histone modifications. Front. Genet.9:433. 10.3389/fgene.2018.00433
9
ChenW.FengP.TangH.DingH.LinH. (2016). RAMPred: identifying the N1-methyladenosine sites in eukaryotic transcriptomes. Sci. Rep.6:31080. 10.1038/srep31080
- CrossRef
- Google Scholar
10
ChenW.FengP.YangH.DingH.LinH.ChouK. C. (2018c). iRNA-3typeA: identifying three types of modification at RNA's adenosine sites. Molecular therapy. Nucleic Acids11, 468–474. 10.1016/j.omtn.2018.03.012
- CrossRef
- Google Scholar
11
ChenW.FengP. M.LinH.ChouK. C. (2014). iSS-PseDNC: identifying splicing sites using pseudo dinucleotide composition. Biomed Res. Int.2014, 1–12. 10.1155/2014/623149
12
ChenW.LinH. (2012). Identification of voltage-gated potassium channel subfamilies from sequence information using support vector machine. Comput. Biol. Med.42, 504–507. 10.1016/j.compbiomed.2012.01.003
13
ChenW.YangH.FengP.DingH.LinH. (2017). iDNA4mC: identifying DNA N4-methylcytosine sites based on nucleotide chemical properties. Bioinformatics33, 3518–3523. 10.1093/bioinformatics/btx479
14
ChenX.GuanN. N.SunY. Z.LiJ. Q.QuJ. (2018d). MicroRNA-small molecule association identification: from experimental results to computational models. Brief. Bioinform. 2018:bby098. 10.1093/bib/bby098
- CrossRef
- Google Scholar
15
ChenX.HuangL. (2017). LRSSLMDA: Laplacian regularized sparse subspace learning for MiRNA-disease association prediction. PLoS Comput. Biol.13:e1005912. 10.1371/journal.pcbi.1005912
16
ChenX.SunY. Z.GuanN. N.QuJ.HuangZ. A.ZhuZ. X.et al. (2018e). Computational models for lncRNA function prediction and functional similarity calculation. Brief Funct. Genomics18, 58–82. 10.1093/bfgp/ely031
- CrossRef
- Google Scholar
17
ChenX.WangL.QuJ.GuanN. N.LiJ. Q. (2018f). Predicting miRNA-disease association based on inductive matrix completion. Bioinformatics34, 4256–4265. 10.1093/bioinformatics/bty503
18
ChenX.XieD.WangL.ZhaoQ.YouZ. H.LiuH. (2018g). BNPMDA: bipartite network projection for MiRNA-disease association prediction. Bioinformatics34, 3178–3186. 10.1093/bioinformatics/bty333
19
ChenX.XieD.ZhaoQ.YouZ. H. (2017b). MicroRNAs and complex diseases: from experimental results to computational models. Brief. Bioinform.2017:bbx130. 10.1093/bib/bbx130
- CrossRef
- Google Scholar
20
ChenX.YanC. C.ZhangX.YouZ. H. (2017a). Long non-coding RNAs and complex diseases: from experimental results to computational models. Brief. Bioinform.18, 558–576. 10.1093/bib/bbw060
21
ChenX.YanC. C.ZhangX.ZhangX.DaiF.YinJ.et al. (2016). Drug-target interaction prediction: databases, web servers and computational models. Brief. Bioinform.17, 696–712. 10.1093/bib/bbv066
22
ChenX.YanG. Y. (2013). Novel human lncRNA-disease association inference based on lncRNA expression profiles. Bioinformatics29, 2617–2624. 10.1093/bioinformatics/btt426
23
ChenX.YinJ.QuJ.HuangL. (2018h). MDHGI: Matrix decomposition and heterogeneous graph inference for miRNA-disease association prediction. PLoS Comput. Biol.14:e1006418. 10.1371/journal.pcbi.1006418
24
ChengL.HuY.SunJ.ZhouM.JiangQ. (2018). DincRNA: a comprehensive web-based bioinformatics toolkit for exploring disease associations and ncRNA function. Bioinformatics34, 1953–1956. 10.1093/bioinformatics/bty002
25
ChengL.SunJ.XuW.DongL.HuY.ZhouM. (2016). OAHG: an integrated resource for annotating human genes with multi-level ontologies. Sci. Rep.6, 34820. 10.1038/srep34820
26
ChengL.YangH.ZhaoH.PeiX.ShiH.SunJ.et al. (2019). MetSigDis: a manually curated resource for the metabolic signatures of diseases. Brief. Bioinform.20, 203–209. 10.1093/bib/bbx103
27
ChouK.ComM. P. (2010). Prediction of protein cellular attributes using pseudo-amino acid composition. Protein Struct. Funct. Bioinform.43:246–255. 10.1002/prot.1035
28
CoiaG.ParkerM. D.SpeightG.ByrneM. E.WestawayE. G. (1988). Nucleotide and complete amino acid sequences of Kunjin virus: definitive gene order and characteristics of the virus-specified proteins. J. Gen. Virol.69, 1–21. 10.1099/0022-1317-69-1-1
29
ConsortiumU. P. (2012). Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res.40, D71–D75. 10.1093/nar/gkr981
- CrossRef
- Google Scholar
30
DaoF. Y.LvH.WangF.FengC. Q.DingH.ChenW.et al. (2018). Identify origin of replication in Saccharomyces cerevisiae using two-step feature selection technique. Bioinformatics. 2018:bty943. 10.1093/bioinformatics/bty943
- CrossRef
- Google Scholar
31
DehzangiA.PaliwalK.SharmaA.DehzangiO.SattarA. (2013). A combination of feature extraction methods with an ensemble of different classifiers for protein structural class prediction problem. IEEE/ACM Trans. Comput. Biol. Bioinform.10, 564–575. 10.1109/TCBB.2013.65
32
DezaM. M.DezaE. (2009). Encyclopedia of distances. Refer. Rev.24, 1–583. 10.1007/978-3-642-00234-2
- CrossRef
- Google Scholar
33
DingH.FengP. M.ChenW.LinH. (2014). Identification of bacteriophage virion proteins by the ANOVA feature selection and analysis. Mol. Biosyst.10, 2229–2235. 10.1039/C4MB00316K
34
DingY.TangJ.GuoF. (2016). Predicting protein-protein interactions via multivariate mutual information of protein sequences. BMC Bioinform.17:398. 10.1186/s12859-016-1253-9
35
DingY.TangJ.GuoF. (2017a). Identification of protein-ligand binding sites by sequence information and ensemble classifier. J. Chem. Inf. Model.57, 3149–3161. 10.1021/acs.jcim.7b00307
36
DingY.TangJ.GuoF. (2017b). Identification of drug-target interactions via multiple information integration. Inf. Sci.418, 546–560. 10.1016/j.ins.2017.08.045
- CrossRef
- Google Scholar
37
DubchakI.MuchnikI.HolbrookS. R.KimS. H. (1995). Prediction of protein folding class using global description of amino acid sequence. Proc. Natl. Acad. Sci. U.S.A.92, 8700–8704. 10.1073/pnas.92.19.8700
38
FengC. Q.ZhangZ. Y.ZhuX. J.LinY.ChenW.TangH.et al. (2018). iTerm-PseKNC: a sequence-based tool for predicting bacterial transcriptional terminators. Bioinformatics. 2018:bty827. 10.1093/bioinformatics/bty827
- CrossRef
- Google Scholar
39
FengP. M.ChenW.LinH.ChouK. C. (2013a). iHSP-PseRAAAC: identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Anal. Biochem.442, 118–125. 10.1016/j.ab.2013.05.024
40
FengP. M.DingH.ChenW.LinH. (2013b). Naïve Bayes classifier with feature selection to identify phage virion proteins. Comput. Math. Methods Med.2013:530696. 10.1155/2013/530696
41
FengP. M.LinH.ChenW. (2013c). Identification of antioxidants from sequence information using naïve Bayes. Comput. Math. Methods Med.2013, 1–5. 10.1155/2013/567529
42
FuL.NiuB.ZhuZ.WuS.LiW. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics28, 3150–3152. 10.1093/bioinformatics/bts565
43
GuthrieD.AllisonB.LiuW.GuthrieL.WilksY. (2006). A closer look at skip-gram modelling in Proceedings of the 5th International Conference on Language Resources and Evaluation (Genoa: LREC), 1–4.
- Google Scholar
44
HangX.ZengW.ZengX.YenG. G. (2018). An evolutionary algorithm based on minkowski distance for many-objective optimization. IEEE Trans. Cybern.99, 1–12. 10.1109/TCYB.2018.2856208
- CrossRef
- Google Scholar
45
HaqI. U.ChaudhryW. N.AkhtarM. N.AndleebS.QadriI. (2012). Bacteriophages and their implications on future biotechnology: a review. Virol. J., 9:9. 10.1186/1743-422X-9-9
46
HersheyA. D.ChaseM. (1952). Independent functions of viral protein and nucleic acid in growth of bacteriophage. J. Gen. Physiol.36, 39–56. 10.1085/jgp.36.1.39
47
HuY.ZhaoT.ZhangN.ZangT.ZhangJ.ChengL. (2018). Identifying diseases-related metabolites using random walk. BMC Bioinform.19(Suppl. 5):116. 10.1186/s12859-018-2098-1
48
HuangL.LiX.GuoP.YaoY.LiaoB.ZhangW.et al. (2017). Matrix completion with side information and its applications in predicting the antigenicity of influenza viruses. Bioinformatics33, 3195–3201. 10.1093/bioinformatics/btx390
49
HuangY.NiuB.GaoY.FuL.LiW. (2010). CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics26, 680–682. 10.1093/bioinformatics/btq003
50
JiaJ.LiuZ.XiaoX.LiuB.ChouK. C. (2016). iPPBS-Opt: a sequence-based ensemble classifier for identifying protein-protein binding sites by optimizing imbalanced training datasets. Molecules21:95. 10.3390/molecules21010095
51
JiangS.LiuB.ZouQ. (2018). HITS-PR-HHblits: protein remote homology detection by combining PageRank and hyperlink-induced topic search. Brief. Bioinform.2018:bby104. 10.1093/bib/bby104
- CrossRef
- Google Scholar
52
JiangY.OronT. R.ClarkW. T.BankapurA. R.D'AndreaD.LeporeR.et al. (2016). An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol.17:184. 10.1186/s13059-016-1037-6
53
JingjingH.TingF.ZizhengZ.BeiH.XiaoleiZ.YiX. (2018). PseUI: Pseudouridine sites identification based on RNA sequence information. BMC Bioinform.19:306. 10.1186/s12859-018-2321-0
- CrossRef
- Google Scholar
54
JonesD. T. (1999). Protein secondary structure prediction based on position-specific scoring matrices11Edited by G. Von Heijne. J. Mol. Biol.292, 195–202. 10.1006/jmbi.1999.3091
- CrossRef
- Google Scholar
55
LarsonR. E.EdwardsB. H. (1991). Elementary Linear Algebra.2nd Edn. Lexington, MA: D.C. Heath and Company.
- Google Scholar
56
LeyiW.HuangrongC.RanS. (2018). M6APred-EL: a sequence-based predictor for identifying N6-methyladenosine sites using ensemble learning. Mol. Ther.2018, 635–644. 10.1016/j.omtn.2018.07.004
- CrossRef
- Google Scholar
57
LeyiW.MinghongL.XingG.QuanZ. (2015). An improved protein structural classes prediction method by incorporating both sequence and structure information. IEEE Trans. Nanobiosci.14, 339–349. 10.1109/TNB.2014.2352454
- CrossRef
- Google Scholar
58
LiJ.HalgamugeS. K.KellsC. I.TangS. L. (2007). Gene function prediction based on genomic context clustering and discriminative learning: an application to bacteriophages. BMC Bioinform.8:S6. 10.1186/1471-2105-8-S4-S6
59
LiW.GodzikA. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics22, 1658–1659. 10.1093/bioinformatics/btl158
- CrossRef
- Google Scholar
60
LiW.JaroszewskiL.GodzikA. (2001). Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics17, 282–283. 10.1093/bioinformatics/17.3.282
61
LiZ.TangJ.GuoF. (2016). Learning from real imbalanced data of 14-3-3 proteins binding specificity. Neurocomputing217, 83–91. 10.1016/j.neucom.2016.03.093
- CrossRef
- Google Scholar
62
LiuB. (2017). BioSeq-analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Brief. Bioinform.2017:bbx165. 10.1093/bib/bbx165
- CrossRef
- Google Scholar
63
LiuY.WangX.LiuB. (2019). A comprehensive review and comparison of existing computational methods for intrinsically disordered protein and region prediction. Brief. Bioinform.20, 330–346. 10.1093/bib/bbx126
64
LiuY.ZengX.HeZ.ZouQ. (2016). Inferring microRNA-disease associations by random walk on a heterogeneous network with multiple data sources. IEEE/ACM Trans. Comput. Biol. Bioinform.14, 905–915. 10.1109/TCBB.2016.2550432
65
MarksT.SharpR. (2015). Bacteriophages and biotechnology: a review. J. Chem. Technol. Biotechnol.75, 6–17. 10.1002/(SICI)1097-4660(200001)75:1<6::AID-JCTB157>3.0.CO;2-A
- CrossRef
- Google Scholar
66
MarvinD. A.HaleR. D.NaveC.Helmer-CitterichM. (1994). Molecular models and structural comparisons of native and mutant class I filamentous bacteriophages Ff (fd, f1, M13), If1 and IKe. J. Mol. Biol.235, 260–286. 10.1016/S0022-2836(05)80032-4
67
MrozekD.DaniłowiczP.Małysiak-MrozekB. (2016). HDInsight4PSi: Boosting performance of 3D protein structure similarity searching with HDInsight clusters in Microsoft Azure cloud. Inf. Sci.349, 77–101. 10.1016/j.ins.2016.02.029
- CrossRef
- Google Scholar
68
MrozekD.SochaB.KozielskiS.Małysiak-MrozekB. (2015). An efficient and flexible scanning of databases of protein secondary structures. J. Intell. Inf. Syst.46, 213–233. 10.1007/s10844-014-0353-0
- CrossRef
- Google Scholar
69
PearsonK. (1909). Determination of the coefficient of correlation. Science30, 23–25. 10.1126/science.30.757.23
70
QiaoY.XiongY.GaoH.ZhuX.ChenP. (2018). Protein-protein interface hot spots prediction based on a hybrid feature selection strategy. BMC Bioinform.19:14. 10.1186/s12859-018-2009-5
71
QuK.HanK.WuS.WangG.WeiL. (2017). Identification of DNA-binding proteins using mixed feature representation methods. Molecules22:E1602. 10.3390/molecules22101602
72
RobertC. (2012). Machine learning, a probabilistic perspective. Chance27, 62–63. 10.1080/09332480.2012.726570
- CrossRef
- Google Scholar
73
RogersD. J.TanimotoT. T. (1960). A computer program for classifying plants. Science132, 1115–1118. 10.1126/science.132.3434.1115
74
RolfA. (2004). UniProt: the Universal Protein knowledgebase. Nucleic Acids Res.32, D115–D119. 10.1093/nar/gkh131
- CrossRef
- Google Scholar
75
SeguritanV.AlvesN.Jr.ArnoultM.RaymondA.LorimerD.BurginA. B.Jr.et al. (2012). Artificial neural networks trained to detect viral and phage structural proteins. PLoS Comput. Biol.8:e1002657. 10.1371/journal.pcbi.1002657
76
ShenH. B.ChouK. C. (2008). PseAAC: A flexible web server for generating various kinds of protein pseudo amino acid composition. Anal. Biochem.373, 386–388. 10.1016/j.ab.2007.10.012
77
ShenY.TangJ.GuoF. (2019). Identification of protein subcellular localization via integrating evolutionary and physicochemical information into Chou's general PseAAC. J. Theor. Biol.462, 230–239. 10.1016/j.jtbi.2018.11.012
78
SongT.Rodríguez-PatónA.ZhengP.ZengX. (2018a). Spiking neural P systems with colored spikes. IEEE Trans. Cogn. Dev. Syst.10, 1106–1115. 10.1109/TCDS.2017.2785332
- CrossRef
- Google Scholar
79
SongT.ZengX.ZhengP.JiangM.Rodríguez-PatónA. (2018b). A parallel workflow pattern modeling using spiking neural p systems with colored spikes. IEEE Trans. Nanobiosci.17, 474–484. 10.1109/TNB.2018.2873221
80
StephensonN.ShaneE.ChaseJ.RowlandJ.RiesD.JusticeN.et al. (2018). Survey of machine learning techniques in drug discovery. Curr. Drug Metab. 10.2174/1389200219666180820112457. [Epub ahead of print].
81
SuR.WuH.XuB.LiuX.WeiL. (2018). Developing a multi-dose computational model for drug-induced hepatotoxicity prediction based on toxicogenomics dataIEEE/ACM Trans. Comput. Biol. Bioinform. 10.1109/TCBB.2018.2858756. [Epub ahead of print].
82
TanP. N.SteinbachM.KumarV. (2005). Introduction to Data Mining, Boston, MA: Pearson Addison Wesley.
- Google Scholar
83
TangH.ZhaoY. W.ZouP.ZhangC. M.ChenR.HuangP.et al. (2018). HBPred: a tool to identify growth hormone-binding proteins. Int. J. Biol. Sci.14, 957–964. 10.7150/ijbs.24174
84
TingH.GuangyongZ.PingyuZ.JiaJ.JingL.LuX.et al. (2014). LAceP: lysine acetylation site prediction using logistic regression classifiers. PLoS ONE9:e89575. 10.1371/journal.pone.0089575
- CrossRef
- Google Scholar
85
WangP.ZhuW.LiaoB.CaiL.PengL.YangJ. (2018). Predicting influenza antigenicity by matrix completion with antigen and antiserum similarity. Front. Microbiol.9:2500. 10.3389/fmicb.2018.02500
86
WeiL.LiaoM.GaoX.ZouQ. (2015). Enhanced protein fold prediction method through a novel feature extraction technique. IEEE Trans. Nanobiosci.14, 649–659. 10.1109/TNB.2015.2450233
87
WeiL.SuR.WangB.LiX.ZouQ.GaoX. (2019). Integration of deep feature representations and handcrafted features to improve the prediction of N6-methyladenosine sites. Neurocomputing324, 3–9. 10.1016/j.neucom.2018.04.082
- CrossRef
- Google Scholar
88
WeiL.TangJ.ZouQ. (2017a). SkipCPP-Pred: an improved and promising sequence-based predictor for predicting cell-penetrating peptides. BMC Genomics18:742. 10.1186/s12864-017-4128-1
89
WeiL.WanS.GuoJ.WongK. K. (2017b). A novel hierarchical selective ensemble classifier with bioinformatics application. Artif. Intell. Med, 83, 82–90. 10.1016/j.artmed.2017.02.005
90
WeiL.XingP.ZengJ.ChenJ.SuR.GuoF. (2017c). Improved prediction of protein-protein interactions using novel negative samples, features, and an ensemble classifier. Artif. Intell. Med.83, 67–74. 10.1016/j.artmed.2017.03.001
91
WuL. C.LeeJ. X.HuangH. D.LiuB. J.HorngJ. T. (2009). An expert system to predict protein thermostability using decision tree. Expert Syst. Appl.36, 9007–9014. 10.1016/j.eswa.2008.12.020
- CrossRef
- Google Scholar
92
XiongY.WangQ.YangJ.ZhuX.WeiD.-Q. (2018). PredT4SE-stack: prediction of bacterial type IV secreted effectors from protein sequences using a stacked ensemble method. Front. Microbiol.9:2571. 10.3389/fmicb.2018.02571
93
XuL.LiangG.ShiS.LiaoC. (2018a). SeqSVM: a sequence-based support vector machine method for identifying antioxidant proteins. Int. J. Mol. Sci.19:1773.
- Pubmed Abstract
- Google Scholar
94
XuL.LiangG.WangL.LiaoC. (2018b). A novel hybrid sequence-based model for identifying anticancer peptides. Genes9:E158. 10.3390/genes9030158
95
XuQ.XiongY.DaiH.KumariK. M.XuQ.OuH. Y.et al. (2017). PDC-SGB: Prediction of effective drug combinations using a stochastic gradient boosting algorithm. J. Theor. Biol.417, 1–7. 10.1016/j.jtbi.2017.01.019
96
YangH.LvH.DingH.ChenW.LinH. (2018). iRNA-2OM: a sequence-based predictor for identifying 2'-O-Methylation sites in homo sapiens. J. Comput. Biol.25, 1266–1277. 10.1089/cmb.2018.0004
97
YangJ.GrunewaldS.WanX. F. (2013). Quartet-net: a quartet-based method to reconstruct phylogenetic networks. Mol. Biol. Evol.30, 1206–1217. 10.1093/molbev/mst040
98
YangJ.GrünewaldS.XuY.WanX.-F. (2014). Quartet-based methods to reconstruct phylogenetic networks. BMC Syst. Biol.8, 21–21. 10.1186/1752-0509-8-21
99
YangR.ZhangC.GaoR.ZhangL. (2015). An ensemble method with hybrid features to identify extracellular matrix proteins. PLoS ONE10:e0117804. 10.1371/journal.pone.0117804
100
YaoY.LiX.LiaoB.HuangL.HeP.WangF.et al. (2017). Predicting influenza antigenicity from Hemagglutintin sequence data based on a joint random forest method. Sci. Rep.7:1545. 10.1038/s41598-017-01699-z
101
YiX.JuanL.Dong-QingW. (2011). An accurate feature-based method for identifying DNA-binding residues on protein surfaces. Proteins Struct. Funct. Bioinform.79, 509–517. 10.1002/prot.22898
- CrossRef
- Google Scholar
102
YuL.HuangJ.MaZ.ZhangJ.ZouY.GaoL. (2015). Inferring drug-disease associations based on known protein complexes. BMC Med. Genomics8:S2. 10.1186/1755-8794-8-S2-S2
103
YuL.MaX.ZhangL.ZhangJ.GaoL. (2016a). Prediction of new drug indications based on clinical data and network modularity. Sci. Rep.6:32530. 10.1038/srep32530
104
YuL.SuR.WangB.ZhangL.ZouY.ZhangJ.et al. (2017a). Prediction of novel drugs for hepatocellular carcinoma based on multi-source random walk. IEEE/ACM Trans. Comput. Biol. Bioinform.14, 966–977. 10.1109/TCBB.2016.2550453
105
YuL.WangB.MaX.GaoL. (2016b). The extraction of drug-disease correlations based on module distance in incomplete human interactome. BMC Syst. Biol.10:111. 10.1186/s12918-016-0364-2
106
YuL.ZhaoJ.GaoL. (2017b). Drug repositioning based on triangularly balanced structure for tissue-specific diseases in incomplete interactome. Artif. Intell. Med.77, 53–63. 10.1016/j.artmed.2017.03.009
107
YuL.ZhaoJ.GaoL. (2018). Predicting Potential Drugs for Breast Cancer based on miRNA and Tissue Specificity. Int. J. Biol. Sci.14, 971–982. 10.7150/ijbs.23350
108
ZengX.DingN.Rodríguez-PatónA.ZouQ. (2017). Probability-based collaborative filtering model for predicting gene disease associations. BMC Med. Genomics10:76. 10.1186/s12920-017-0313-y
109
ZengX.LiuL.LüL.ZouQ. (2018). Prediction of potential disease-associated microRNAs using structural perturbation method. Bioinformatics34, 2425–2432. 10.1093/bioinformatics/bty112
110
ZhangJ.JuY.LuH.XuanP.ZouQ. (2016). Accurate Identification of cancerlectins through hybrid machine learning technology. Int. J. Genomics2016, 1–11. 10.1155/2016/7604641
111
ZhangJ.LiuB. (2017). PSFM-DBT: identifying DNA-binding proteins by combing position specific frequency matrix and distance-bigram transformation. Int. J. Mol. Sci.18:E1856. 10.3390/ijms18091856
112
ZhangL.ZhangC.GaoR.YangR. (2015). An ensemble method to distinguish bacteriophage virion from non-virion proteins based on protein sequence characteristics. Int. J. Mol. Sci.16, 21734–21758. 10.3390/ijms160921734
113
ZhuP.HuQ.HanY.ZhangC.DuY. (2016). Combining neighborhood separable subspaces for classification via sparsity regularized optimization. Inf. Sci.370, 270–287. 10.1016/j.ins.2016.08.004
- CrossRef
- Google Scholar
114
ZhuP.HuQ.HuQ.ZhangC.FengZ. (2018c). Multi-view label embedding. Pattern Recognit.84, 126–135. 10.1016/j.patcog.2018.07.009
- CrossRef
- Google Scholar
115
ZhuP.XuQ.HuQ.ZhangC. (2018a). Co-regularized unsupervised feature selection. Neurocomputing275, 2855–2863. 10.1016/j.neucom.2017.11.061
- CrossRef
- Google Scholar
116
ZhuP.XuQ.HuQ.ZhangC.ZhaoH. (2018b). Multi-label feature selection with missing labels. Pattern Recognit.74, 488–502. 10.1016/j.patcog.2017.09.036
- CrossRef
- Google Scholar
117
ZhuX.-J.FengC.-Q.LaiH.-Y.ChenW.HaoL. (2019). Predicting protein structural classes for low-similarity sequences by evaluating different features. Knowledge Based Syst.163, 787–793. 10.1016/j.knosys.2018.10.007
- CrossRef
- Google Scholar
118
ZouQ.LiX. B.JiangW. R.LinZ. Y.LiG. L.ChenK. (2014). Survey of MapReduce frame operation in bioinformatics. Brief. Bioinform.15, 637–647. 10.1093/bib/bbs088
119
ZouQ.WangZ.GuanX.LiuB.WuY.LinZ. (2013). An approach for identifying cytokines based on a novel ensemble classifier. Biomed Res. Int.2013:686090. 10.1155/2013/686090
120
ZouQ.ZengJ.CaoL.JiR. (2016). A novel features ranking metric with application to scalable visual and bioinformatics data classification. Neurocomputing173, 346–354. 10.1016/j.neucom.2014.12.123
- CrossRef
- Google Scholar

Summary

Keywords

phage virion proteins, machine learning, feature extraction, feature selection, hybrid sequence features

Citation

Ru X, Li L and Wang C (2019) Identification of Phage Viral Proteins With Hybrid Sequence Features. Front. Microbiol. 10:507. doi: 10.3389/fmicb.2019.00507

Received

24 December 2018

Accepted

27 February 2019

Published

26 March 2019

Volume

10 - 2019

Edited by

Hongsheng Liu, Liaoning University, China

Reviewed by

Zhiwei Ji, University of Texas Health Science Center, United States; Nuria Quiles Puchalt, University of Glasgow, United Kingdom

Updates

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Chunyu Wang chunyu@hit.edu.cn

This article was submitted to Systems Microbiology, a section of the journal Frontiers in Microbiology

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Systems Microbiology

ORIGINAL RESEARCH article

Identification of Phage Viral Proteins With Hybrid Sequence Features

Abstract

Introduction