ORIGINAL RESEARCH article

Front. Microbiol., 26 March 2019

Sec. Systems Microbiology

Volume 10 - 2019 | https://doi.org/10.3389/fmicb.2019.00507

Identification of Phage Viral Proteins With Hybrid Sequence Features

  • 1. School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China

  • 2. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China

Abstract

Bacteriophages occupy a distinctive place in bioinformatics research. In practical applications, the function of bacteriophage virion proteins is the main area of interest, so it is important to distinguish bacteriophage virion proteins from non-phage virion proteins accurately. Extracting comprehensive and effective sequence features plays a vital role in protein classification. To represent protein information more fully, this paper combines the features extracted by a feature representation algorithm based on sequence information (CCPA) with those extracted by a feature representation algorithm based on sequence and structure information. After feature extraction, the Max-Relevance-Max-Distance (MRMD) algorithm is used to select the optimal feature set, i.e., the set with the strongest correlation with the class labels and low redundancy between features. A random forest classifier, which randomizes both the samples drawn for each tree and the candidate features considered at each node, is then evaluated with 10-fold cross-validation on the bacteriophage protein classification task. The model reaches an accuracy of 93.5% in classifying phage proteins. This study also found that, among the eight physicochemical properties considered, the charge property has the greatest impact on the classification of bacteriophage proteins. These results indicate that the model discussed in this paper is an important tool in bacteriophage protein research.

Introduction

In the biological world, bacteriophages are ubiquitous, with diverse genomes and lifestyles. According to their morphology, they can be classified as tailed, tailless, or filamentous bacteriophages. Phages, which are also grouped by morphology and nucleic acid type, include those that infect bacteria and those that infect archaea. A bacteriophage must attach to a host cell in order to grow and reproduce (Seguritan et al., 2012), and it directly affects the host population by lysing host cells. In addition, each bacteriophage is highly specific, which greatly reduces the damage to host cells (Haq et al., 2012). Identification and classification of various bacteria can be performed based on the universality, diversity, dependence, and specificity of bacteriophages (Marks and Sharp, 2015). The structure of bacteriophages is simple, consisting of only a protein shell and genetic material (DNA or RNA) (Haq et al., 2012), which makes them valuable for simplifying experimental research in bioinformatics. As a bacteriophage can insert genes into host cells (Ding et al., 2014), it is an important tool for studying genetics (Cheng et al., 2018; Hu et al., 2018). Hershey and Chase (1952) performed biological experiments using the T2 bacteriophage and bacteria, and confirmed that DNA is the genetic material of bacteriophages and other organisms. The significance of this research in the development of biological science earned Hershey and coworkers the Nobel Prize in Physiology or Medicine. Bacteriophages provide experimental systems and tools for the revolution in molecular biology. The rapid development of bacteriophage research has led to the discovery of basic principles of ecology and evolution. In addition, bacteriophages are relatively easy to synthesize and have modular characteristics, which meets the needs of synthetic biologists who engineer and implement biological functions.

Bacteriophage proteins are classified into virion and non-virion proteins (Zhang et al., 2015), with most practical interest focusing on the function of bacteriophage virion proteins (Feng et al., 2013b). Therefore, bacteriophage proteins must be accurately classified and identified so that researchers can further study the structure and function of a particular bacteriophage. After the human genome project was officially launched in 1990, the number of bacteriophage protein sequences with unknown functions increased dramatically (Seguritan et al., 2012; Chen et al., 2018a). Faced with a large volume of data, traditional biological experimental methods could no longer keep up in the post-genomic era (Chen W. et al., 2016; Cheng et al., 2019; Mrozek et al., 2016; Hu et al., 2018). For this reason, researchers introduced different machine learning algorithms into bacteriophage classification and prediction research. For example, Li et al. (2007) developed a support vector machine system called SynFPS that uses the gene–gene distance determined by k-means clustering to identify closely related genomes and perform gene function prediction. Using the occurrence frequency of amino acids and isoelectric point information, Seguritan et al. (2012) developed an artificial neural network method to classify viral structures. Feng et al. (2013b) used the main amino acid and dipeptide components as an encoding scheme, and modified a naive Bayes classifier to identify bacteriophage proteins. Ding et al. (2014) used g-gap dipeptide composition to represent protein sequence information, incremental feature selection to analyze the variance and identify the optimal feature set, and a support vector machine for classification. Zhang et al. (2015) obtained sequence feature vectors with various techniques, and then used the incremental feature selection algorithm to select the optimal feature subsets. Finally, the prediction results of individual classifiers trained in different feature spaces were integrated to produce the final classification result.

Machine learning algorithms (Robert, 2012; Stephenson et al., 2018) automatically analyze data, obtain rules from them, and use these rules to predict unknown data (Chen and Yan, 2013; Yu et al., 2015, 2016a; Chen and Huang, 2017; Chen et al., 2018h; Wang et al., 2018). This saves time and money, but the results from such algorithms are not as convincing as those from biological experiments. Therefore, it is especially important to choose an appropriate machine learning algorithm to ensure the most accurate classification results (Liu, 2017; Yao et al., 2017; Yu et al., 2017a). In a protein classification experiment, the classification performance depends largely on the feature set extracted (Zou et al., 2013; Bin et al., 2015; Mrozek et al., 2015; Jia et al., 2016; Yu et al., 2016b, 2018; Zhang et al., 2016; Huang et al., 2017; Qu et al., 2017; Jiang et al., 2018; Qiao et al., 2018; Xiong et al., 2018; Xu et al., 2018b). To date, feature extraction methods are divided into sequence-based and structure-based approaches (Huang et al., 2017; Qu et al., 2017). The feature set in this study is obtained by combining the features extracted by these two kinds of methods.

In this study, we examined the final classification performance of the selected methods and the stability of the dataset when the feature dimension was reduced. First, to reduce the imbalance in the reference dataset, CD-Hit was used to remove redundant data, resulting in a more balanced dataset that contains comprehensive information and less redundancy. Pearson's correlation coefficient and three distance functions (Euclidean and cosine distances and the Tanimoto coefficient) (Zou et al., 2016) were then used to calculate the correlation between features and class labels and the redundancy between features. Finally, the optimal feature subset with the strongest correlation between features and class labels and low redundancy between features was selected. According to some recent studies (Wu et al., 2009; Yi et al., 2011; Chen and Lin, 2012; Yang et al., 2015; Yu et al., 2017b; Zhang and Liu, 2017; Xu et al., 2018a; Liu et al., 2019), the best algorithms for protein classification are support vector machines and random forest algorithms. However, support vector machines are more suitable for small sample sets in which the number of dimensions is greater than the number of samples. Thus, the random forest algorithm was used in this study. The random forest algorithm (Breiman, 2001; Yao et al., 2017) combines multiple weak classifiers to produce a final result that has higher accuracy and better generalization performance. It can achieve good results mainly because of the random nature of the “forest,” which makes the algorithm resistant to overfitting and more precise. Our results show that, for bacteriophage protein classification, both combining feature sets and applying feature selection to the combined set have a positive impact on classification performance. Our results also show that, among the eight physicochemical properties of amino acids, the charge property has the greatest influence on the classification of bacteriophage proteins. To evaluate the performance of the models used in this study, the results were compared with those given by the methods of Feng et al. (2013b), Ding et al. (2014), and Zhang et al. (2015). Figure 1 shows the workflow of this study.

Figure 1

Methods

Dataset Processing

Source: UniProt (Rolf, 2004; Consortium, 2012) is a widely used protein sequence database that offers low protein sequence redundancy and complete protein function annotation (Cao and Cheng, 2016a; Jiang et al., 2016). As this database is free and open, researchers can download the desired protein sequences at no cost. The original positive samples used in this study, i.e., the bacteriophage virion proteins (a total of 15,765 data), were downloaded from this database. After obtaining the bacteriophage virion protein (positive) sample set, the PFAM families of the positive samples were excluded from all PFAM families, such that the remaining families were families of non-phage virion proteins. Finally, the longest protein sequence of each remaining family was extracted to form the negative sample set. The positive and negative datasets obtained in this way may both contain homologous sequences. Using such sample sets would result in the classification accuracy being overestimated, which is not conducive to the establishment of prediction models. Therefore, we used the CD-Hit tool to remove redundant positive and negative samples from the datasets.

Data integration: The CD-Hit (Li et al., 2001; Li and Godzik, 2006; Huang et al., 2010; Fu et al., 2012; Chen et al., 2017) redundancy-removal tool effectively clusters similar sequences. The basic principle is to sort the protein sequences in the dataset in descending order of length. The longest sequence is taken as the representative of the first class, and the second-longest protein sequence is then compared with it in terms of similarity. If the similarity between the two is greater than some threshold, they are deemed to belong to the same class. Otherwise, the second-longest sequence forms a new class, and the procedure continues with the remaining sequences. Because the bacteriophage virion protein sequences were downloaded from UniProt, which ensures relatively low redundancy, the similarity threshold was set to 0.8. The non-phage virion proteins had a higher degree of redundancy, so their similarity threshold was set to 0.4. Thus, 6,251 bacteriophage virion protein sequences and 9,514 non-phage virion protein sequences were obtained. The union of the resulting positive and negative sample datasets gives the total dataset, and the intersection of the two is empty.
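The clustering principle described above can be illustrated with a minimal sketch. It is not CD-Hit itself: the function names are hypothetical, and the similarity measure (the `difflib.SequenceMatcher` ratio from the Python standard library) is only a stand-in for the sequence identity that CD-Hit computes.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; CD-Hit uses its own identity measure."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold):
    """Greedy incremental clustering, longest sequence first.

    Each sequence joins the first cluster whose representative it resembles
    by more than `threshold`; otherwise it starts a new cluster.
    """
    clusters = []  # first element of each cluster is its representative
    for seq in sorted(sequences, key=len, reverse=True):
        for cluster in clusters:
            if similarity(cluster[0], seq) > threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


def representatives(sequences, threshold):
    """Non-redundant set: the longest member of every cluster."""
    return [cluster[0] for cluster in greedy_cluster(sequences, threshold)]


toy = ["MKTAYIAKQRQISFVKSHFSRQ", "MKTAYIAKQRQISFVKSHFSR", "GGGGGGGGGGPPPPPPPP"]
print(representatives(toy, 0.8))
```

In this toy input the second sequence is a one-residue truncation of the first, so it is absorbed into the first cluster, while the third sequence shares no residues with the others and forms its own cluster.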

Feature Extraction

Representation Algorithms for Amino Acid Composition and Eight Physicochemical Properties

In this study, a feature set containing 188 dimensions was extracted based on amino acid composition and eight physicochemical properties. The amino acid composition is one of the most basic features of proteins (Zhang et al., 2015; Cao and Cheng, 2016b). Eight physicochemical properties of amino acids also play a role in the functional properties of bacteriophage proteins. In 1988, Coia et al. (1988) found that amino acids having lighter side chain groups are more likely to constitute bacteriophage virion sequences. In 1994, Marvin et al. (1994) proposed that hydrophilicity, hydrophobicity, and charge have a greater impact on the function of bacteriophage virion proteins. In 2008, Shen and Chou (2008) identified the vital role that the hydrophilicity and hydrophobicity of amino acids play in the folding of proteins. In 2014, Ting et al. (2014) used logistic regression to integrate several biological features, including physicochemical properties for predicting lysine acetylation, thus demonstrating the effect of physicochemical properties on protein structure and function. Therefore, the amino acid composition and its eight physicochemical properties are used to extract features that reflect the characteristics of bacteriophage proteins.

The 20 most common amino acids, in one-letter code, are as follows: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y.

The occurrence frequency of each amino acid in a protein sequence can be expressed as the ratio ni/L, where ni is the number of times amino acid i occurs in the protein sequence and L is the length of the protein sequence.
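As a small illustration of this 20-dimensional composition feature, the following sketch counts each standard residue and divides by the sequence length. The function name and the alphabetical ordering of the residues are assumptions made for the example; residues outside the 20 standard letters contribute to L but to no count.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # ordering is an assumption of this sketch


def amino_acid_composition(sequence):
    """Return the 20 ratios n_i / L for one protein sequence."""
    length = len(sequence)
    if length == 0:
        raise ValueError("sequence must not be empty")
    return [sequence.count(aa) / length for aa in AMINO_ACIDS]


features = amino_acid_composition("AACD")
print(len(features), features[:3])
```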

In addition, these 20 amino acids can be classified into three types according to their physicochemical properties (Chou and Com, 2010), as shown in Figure 2.

Figure 2

The composition, transition, and distribution descriptors of amino acids were introduced by Dubchak et al. (1995) as a global description of protein sequences. The feature extraction methods for the eight physicochemical properties of a protein sequence are as follows. Taking the charged polarity as an example (denoted by p), the 20 amino acids are divided into high-, medium-, and low-charged polarity groups, which are denoted by ph, pp, and pl, respectively. The composition, transition, and distribution of the amino acids can then be represented by equations (3)–(7).

Composition features (Dubchak et al., 1995) (frequency of each charged polarity group in a sequence):

$$ f_{21} = \frac{n_1}{L}, \quad f_{22} = \frac{n_2}{L}, \quad f_{23} = \frac{n_3}{L} $$

where f21, f22, f23 denote the content of the high-, medium-, and low-charged polarity groups in a sequence, respectively, L is the length of the protein sequence, and n1, n2, n3 are the numbers of times the three groups appear in the sequence.

Conversion feature (Dubchak et al., 1995) (frequency of occurrence of bigeminal sequences, i.e., adjacent residue pairs drawn from two different groups):

$$ f_{31} = \frac{m_1}{L - 1}, \quad f_{32} = \frac{m_2}{L - 1}, \quad f_{33} = \frac{m_3}{L - 1} $$

where f31, f32, f33 denote the content of the three bigeminal groups phl, php, ppl, and m1, m2, m3 are the numbers of times these three bigeminal groups appear in the sequence. There are three possible bigeminal groups for the charged polarity: phl, php, and ppl. In addition, in a protein sequence of length L, assuming that any two adjacent amino acids constitute a pair, the protein sequence contains L − 1 pairs (Zou et al., 2013).

Distribution features (Dubchak et al., 1995) (amino acid distribution of the high-, medium-, and low-charged polarity groups):

where a1%, a25%, a50%, a75%, a100% represent the positions of the first, 25, 50, 75, and 100% high-charged polarity groups in a sequence; b1%, b25%, b50%, b75%, b100% represent the positions of the first, 25, 50, 75, and 100% medium-charged polarity groups in a sequence; and c1%, c25%, c50%, c75%, c100% represent the positions of the first, 25, 50, 75, and 100% low-charged polarity groups in a sequence.

In summary, (3 + 3 + 3 × 5) = 21-dimensional features can be extracted from each physicochemical property, and so 8 × 21 = 168-dimensional features can be extracted from the eight physicochemical properties. The 188-dimensional features (20-dimensional + 168-dimensional) are used to express the characteristics of bacteriophage proteins, and are extracted based on the content ratio of each of the 20 amino acids in the sequence and the eight physicochemical properties.
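The 21 descriptors for a single property can be sketched as follows. Several details are assumptions made for illustration rather than the authors' exact choices: the three-way grouping by charge used in the example (positive KR, neutral, negative DE), the rule that picks the k-th occurrence for each percentage (ceiling of the fraction times the group count), the division of positions by L, and the function name itself.

```python
import math


def ctd_features(sequence, groups):
    """21 composition/transition/distribution descriptors for one property.

    `groups` holds three strings of residues (high, medium, low group).
    """
    group_of = {aa: g for g, members in enumerate(groups) for aa in members}
    labels = [group_of[aa] for aa in sequence]  # KeyError on unknown residues
    length = len(labels)
    if length < 2:
        raise ValueError("need at least two residues")

    # Composition: n_g / L for each of the three groups.
    composition = [labels.count(g) / length for g in range(3)]

    # Transition: adjacent pairs from two different groups, in either order,
    # over the L - 1 adjacent pairs (order: high-low, high-medium, medium-low).
    pairs = list(zip(labels, labels[1:]))
    transition = [
        sum(1 for a, b in pairs if {a, b} == {g, h}) / (length - 1)
        for g, h in ((0, 2), (0, 1), (1, 2))
    ]

    # Distribution: relative position of the first, 25%, 50%, 75% and 100%
    # occurrence of each group (0.0 when the group is absent).
    distribution = []
    for g in range(3):
        positions = [i + 1 for i, lab in enumerate(labels) if lab == g]
        for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                distribution.append(0.0)
                continue
            k = max(1, math.ceil(fraction * len(positions)))
            distribution.append(positions[k - 1] / length)

    return composition + transition + distribution


CHARGE_GROUPS = ("KR", "ANCQGHILMFPSTWYV", "DE")  # hypothetical grouping
print(len(ctd_features("KKAADE", CHARGE_GROUPS)))
```

Repeating this for eight properties and prepending the 20 composition ratios gives the 20 + 8 × 21 = 188 dimensions described above.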

Adaptive k-skip-n-Gram Algorithm

A feature set containing 400 dimensions is extracted based on the adaptive k-skip-n-gram method (Feng et al., 2013c; Cao et al., 2017; Wei et al., 2017a; Tang et al., 2018). In this study, the value of n was set to 2 (20^2 = 400).

The value of k represents the separation distance between two amino acids in the protein sequence S = A1A2A3…AL, where L is the length of the sequence and Ai, Aj are the i-th and j-th amino acids of S.

In a bacteriophage protein dataset, the sequences have very different lengths. If the parameter k is fixed to a specific value, the sequence information cannot be properly represented, which will affect the final classification performance. Therefore, the value of k was made adaptive so that it could vary with the length of the sequence.

For n = 2, the combinations of the 20 most common amino acids and the number of occurrences of each combination in the sample datasets are as shown in Figure 3.

Figure 3

This process is similar to full connection in a neural network. Among the 20 common amino acids, any one can pair with any other amino acid (or with itself), so all combinations are possible. As with full connection, this can lead to overfitting when the number of combinations becomes too large. Therefore, n should not be too high when using an adaptive k-skip-n-gram method. When n = 1, we have the traditional n-gram model proposed by Guthrie et al. (2006), which does not apply to shorter protein sequences. Therefore, n was set to 2 in this study.

In this feature extraction method, the combination set of two specified interval amino acids (Wei et al., 2017a) is given by:

In addition, C is used to represent a set of two amino acids that are combined at all intervals in a sequence (Wei et al., 2017a). Namely:

Finally, the feature extraction formula (Wei et al., 2017a) is:

$$ \frac{N(a_{m_1} a_{m_2} \cdots a_{m_n})}{N(C_{skipgram})} $$

where N(Cskipgram) is the total number of elements in set C, am1am2…amn ranges over the 20^n kinds of amino acid combinations of length n, and N(am1am2…amn) is the number of times the combination am1am2…amn occurs in Cskipgram.
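For n = 2, the counting behind this ratio can be sketched as below. The sketch assumes that the set C collects every residue pair separated by 0 to k intermediate residues, and it takes k as an ordinary argument because the rule that adapts k to the sequence length is not spelled out in the text; the function name is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # ordering is an assumption of this sketch


def k_skip_bigram_features(sequence, k):
    """400 ratios N(pair) / N(C) over all pairs with 0..k skipped residues."""
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0  # N(C_skipgram): every pair collected, whatever its letters
    for gap in range(k + 1):
        for i in range(len(sequence) - gap - 1):
            pair = sequence[i] + sequence[i + gap + 1]
            total += 1
            if pair in counts:
                counts[pair] += 1
    if total == 0:
        raise ValueError("sequence too short to form any pair")
    return [count / total for count in counts.values()]


features = k_skip_bigram_features("ACA", 1)
print(len(features), features[0], features[1], features[20])
```

For the toy sequence ACA with k = 1, the collected pairs are AC and CA (no skip) and AA (one skip), so each of the three corresponding features equals 1/3.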

Mixed Representation Algorithm (Seq-Str)

Some researchers have combined different feature extraction methods and achieved very good classification results (Dehzangi et al., 2013; Zou et al., 2014; Leyi et al., 2015, 2018; Chen X. et al., 2016; Ding et al., 2016, 2017a,b; Li et al., 2016; Chen et al., 2017,a,b, 2018c,d,e; Su et al., 2018; Shen et al., 2019; Wei et al., 2019; Zhu et al., 2019). Wei et al. (2015) proposed a novel feature extraction method that uses both the profile of PSI-BLAST (Altschul et al., 1997) and the profile of PSI-PRED (Jones, 1999), which contain rich evolutionary information and secondary structure information, respectively. In this way, a 473-dimensional feature set (20 + 420 + 6 + 27 = 473) can be extracted, as follows.

  • Extract 20-dimensional features based on PSI-BLAST as follows:

    $$ S_i = \frac{1}{L} \sum_{z=1}^{L} S_{z,i}, \quad i = 1, 2, \ldots, 20 $$

    Sz,i denotes the score for the residue at position z of sequence S mutating into residue type i during evolution, where i is one of the 20 common residues and L is the length of S. Si denotes the average score, over all positions of S, for mutation into the i-th residue.

  • Extract 420-dimensional features based on n-gram: the adaptive k-skip-n-gram algorithm without the skip parameter k reduces to the n-gram method. Here n is taken as 1 and 2, giving 20 + 400 = 420 features.

  • Extract the following six features based on the secondary structure sequence (Wei et al., 2015):

    Three feature extraction formulas for spatial arrangement:

    where PHz represents the position index of the z-th H in the secondary structure of sequence S, and nH represents the total number of occurrences of H in the secondary structure of the sequence.

    Two feature extraction formulas for the percentage of the maximum continuous length (Wei et al., 2015):

    where CH represents the length of a fragment in which H appears consecutively in the secondary structure sequence.

    A new feature for distinguishing between the two structural classes α + β and α/β (Wei et al., 2015):

    This formula calculates the frequency at which βαβ appears in the segmented sequence Sseg, where nβαβ represents the number of times βαβ appears in Sseg and Lseg indicates the length of Sseg.

  • Extract 27 features based on structural probability matrices: three features from the overall information and 24 features from local information.
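Three of the quantities named in this list can be illustrated with a short sketch: the per-residue average of a PSI-BLAST profile, the longest consecutive run of one secondary structure state as a fraction of the sequence length, and the count of βαβ patterns in a segmented structure string. The sketch assumes the three-letter alphabet H (helix), E (strand), and C (coil) for the predicted secondary structure, assumes that coil is dropped before segmentation, and uses hypothetical function names. It returns the raw βαβ count and segment length rather than a normalized frequency.

```python
from itertools import groupby


def pssm_column_means(pssm):
    """Average each column of an L x 20 profile over all L positions."""
    if not pssm:
        raise ValueError("profile must contain at least one row")
    width = len(pssm[0])
    return [sum(row[i] for row in pssm) / len(pssm) for i in range(width)]


def longest_run_fraction(structure, state):
    """Length of the longest run of `state`, divided by the sequence length."""
    longest = max(
        (len(list(run)) for s, run in groupby(structure) if s == state),
        default=0,
    )
    return longest / len(structure)


def beta_alpha_beta_count(structure):
    """Count E-H-E patterns in the segmented structure string.

    Coil is removed, consecutive identical states are merged into one
    segment, and overlapping occurrences of 'EHE' are counted.
    Returns (count, number_of_segments).
    """
    segments = "".join(s for s, _ in groupby(structure.replace("C", "")))
    count = sum(
        1 for i in range(len(segments) - 2) if segments[i:i + 3] == "EHE"
    )
    return count, len(segments)


print(pssm_column_means([[1, 2], [3, 4]]))
print(longest_run_fraction("CHHHHCEEC", "H"))
print(beta_alpha_beta_count("EEECCHHHCCEEHHEE"))
```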

Feature Selection

Based on the feature extraction methods described in the Feature Extraction section, we extracted a 188-dimensional and a 400-dimensional feature set based on sequence information, and a 473-dimensional feature set based on sequence and secondary structure information, each representing the entire bacteriophage protein sequence dataset. Some redundant or irrelevant features were still present in these sets. Such invalid features waste time and computational resources, and affect the classification accuracy of the model (Chen et al., 2018b,f,g; Dao et al., 2018; Yang et al., 2018; Zhu et al., 2018a,b). In this paper, the Max-Relevance-Max-Distance (MRMD) (Zou et al., 2016) method was used to select features and identify higher-quality feature sets, i.e., the optimal feature subset. In this method, Pearson's correlation coefficient is used to calculate the correlation between features and class labels (MR), thus enabling the selection of features with strong correlation to the target class. Three distance functions (Euclidean and cosine distances and the Tanimoto coefficient) are used to calculate the redundancy between features (MD) and identify features with low redundancy.

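Taking two feature vectors (X, Y) as an example, Pearson's correlation coefficient (Pearson, 1909) is expressed as follows:

$$ \mathrm{PCC}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} $$

where σX and σY denote the standard deviations of the two vectors and cov(X, Y) is the covariance, which is used to measure the relationship between two random variables. The covariance formula is as follows:

$$ \operatorname{cov}(X, Y) = \frac{1}{n - 1} \sum_{q=1}^{n} (x_q - \bar{x})(y_q - \bar{y}) $$

where x̄ and ȳ denote the means of the respective vectors, n is the number of elements in each vector, and xq, yq are the q-th elements of X and Y.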

The formula for the Euclidean distance (Larson and Edwards, 1991; Deza and Deza, 2009) is:

$$ \mathrm{ED}(X, Y) = \sqrt{\sum_{q=1}^{n} (x_q - y_q)^2} $$

where M is the number of feature vectors, n is the total number of elements in each vector, and xq, yq are the q-th elements in X, Y, respectively.

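The cosine distance formula (Tan et al., 2005) is:

$$ \cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} $$

where X · Y = Σq xq yq is the dot product of the two vectors and ‖X‖ = √(Σq xq^2) is the vector norm.

The Tanimoto coefficient (Rogers and Tanimoto, 1960) is given by:

$$ \mathrm{TC}(X, Y) = \frac{X \cdot Y}{\lVert X \rVert^2 + \lVert Y \rVert^2 - X \cdot Y} $$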

Using these metrics, we identified the features that are most strongly correlated with the class labels and least redundant with respect to one another. Depending on the scenario, the weights of MR and MD can be adjusted, with features selected to maximize wr × MRi + wd × MDi, so that the acquired features suit the classification task.
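The weighted ranking can be sketched in a few lines. This is a simplified illustration rather than the MRMD implementation: relevance is the absolute Pearson correlation with the labels, and the distance term uses only the mean Euclidean distance to the other features, with the cosine and Tanimoto terms left out. The function names and default weights are assumptions.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant vector: correlation undefined, treated as 0
    return cov / math.sqrt(var_x * var_y)


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def rank_features(columns, labels, w_r=1.0, w_d=1.0):
    """Feature indices sorted by w_r * MR_i + w_d * MD_i, best first."""
    scores = []
    for i, column in enumerate(columns):
        relevance = abs(pearson(column, labels))  # MR_i
        others = [c for j, c in enumerate(columns) if j != i]
        distance = (  # MD_i, Euclidean term only
            sum(euclidean(column, c) for c in others) / len(others)
            if others else 0.0
        )
        scores.append(w_r * relevance + w_d * distance)
    return sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)


columns = [[0, 0, 1, 1], [0, 1, 0, 1]]  # one list per feature
labels = [0, 0, 1, 1]
print(rank_features(columns, labels))
```

In this toy input the first feature reproduces the labels exactly and the second is uncorrelated with them, so the first feature is ranked ahead of the second.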

Experiments

Performance Evaluation Criteria

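A 10-fold cross-validation method was employed to evaluate the models. Four common evaluation indicators were used, namely the accuracy (ACC), sensitivity (SN), specificity (SP), and Matthews' correlation coefficient (MCC) (Feng et al., 2013a, 2018; Chen W. et al., 2016; Wei et al., 2017b,c; Xu et al., 2017; Jingjing et al., 2018). These are expressed as follows (Zou et al., 2013; Chen et al., 2014; Qu et al., 2017):

$$ \mathrm{SN} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP} $$

$$ \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} $$

$$ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $$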

Where TP denotes true positive, i.e., the number of positive samples that are predicted to be positive samples, TN denotes true negative, i.e., the number of negative samples that are predicted to be negative samples, FP denotes false positive, i.e., the number of negative samples that are predicted to be positive samples, and FN denotes false negative, i.e., the number of positive samples that are predicted to be negative samples.
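These four indicators follow directly from the confusion counts, as the sketch below shows using their standard definitions. The function name and the toy counts are illustrative only and are not taken from the experiments.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


print(classification_metrics(tp=90, tn=80, fp=20, fn=10))
```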

Classification Effects of Different Classifiers

Experiment 1: This part of the experiment is based on the 188-, 400-, and 473-dimensional feature sets extracted by the methods described in the Feature Extraction section. The accuracy of each classification algorithm before and after applying the MRMD feature selection algorithm is presented in Table 1.

Table 1

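| Feature extraction | Feature selection | Number of dimensions | LibSVM (%) | Naive Bayes (%) | Random forest (%) |
|---|---|---|---|---|---|
| CCPA | None | 188 | 68.5 | 78.3 | 91.3 |
| CCPA | MRMD | 185 | 68.5 | 78.3 | 91.5 |
| AKSNG | None | 400 | 60.3 | 71.8 | 88.7 |
| AKSNG | MRMD | 252 | 60.3 | 72.8 | 89.0 |
| Seq-Str | None | 473 | 80.6 | 80.9 | 92.6 |
| Seq-Str | MRMD | 189 | 82.0 | 83.1 | 93.2 |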

Classification results of three data sets under different classification algorithms.

The data in Table 1 indicate that, for the classification of bacteriophage proteins, the random forest algorithm gives the best classification performance regardless of which feature extraction algorithm is used and whether or not feature selection is performed.

Performance of Different Feature Extraction Methods

Experiment 2: Experiment 1 showed that the random forest algorithm produces the best classification of bacteriophage proteins. In this second experiment, the 188-dimensional and 400-dimensional feature sets extracted based on sequence information (Seq based), the 473-dimensional feature set extracted based on sequence and structure information (Seq and str based), and two combined feature sets (Com based) were each fed into the random forest algorithm, and the resulting performance was compared. The experimental results are presented in Table 2.

Table 2

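| Extraction method | Number of dimensions | SN (%) | SP (%) | ACC (%) | MCC (%) |
|---|---|---|---|---|---|
| Seq based | 188 | 87.4 | 93.6 | 91.3 | 81.5 |
| Seq based | 400 | 82.8 | 92.4 | 88.7 | 76.1 |
| Seq and str based | 473 | 86.2 | 97.2 | 92.6 | 85.1 |
| Com based | 588 | 87.1 | 93.2 | 91.2 | 80.7 |
| Com based | 661 | 87.5 | 96.5 | 93.1 | 85.3 |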

Classification performance under different feature extraction methods.

Feature fusion can boost the recognition performance by combining the complementary information of different features (Zhu et al., 2016, 2018c). A 588-dimensional feature set was obtained by combining the features of the 188- and 400-dimensional feature sets, and a 661-dimensional feature set was obtained by combining the features of the 188- and 473-dimensional feature sets. According to the experimental results, the models built on the 188-, 473-, 588-, and 661-dimensional feature sets all reach an accuracy above 91%. However, taking the other three evaluation indicators into account, the 661-dimensional feature set, obtained by combining the 188-dimensional feature set extracted from sequence information with the 473-dimensional feature set extracted from sequence and secondary structure information, performs best overall. This indicates that, in phage protein classification, a feature set containing both sequence information and structural information benefits classification performance most, and also shows that combining suitable feature sets is an effective way of improving classification performance.
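Assuming that fusion here amounts to concatenating the per-protein feature vectors, which is what the resulting dimensions suggest (188 + 400 = 588 and 188 + 473 = 661), the operation can be sketched as follows. The function name and the placeholder vectors are hypothetical.

```python
def fuse(*feature_vectors):
    """Concatenate several feature vectors of one protein into a single vector."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused


ccpa = [0.0] * 188      # placeholder for the 188-dimensional features
aksng = [0.0] * 400     # placeholder for the 400-dimensional features
seq_str = [0.0] * 473   # placeholder for the 473-dimensional features
print(len(fuse(ccpa, aksng)), len(fuse(ccpa, seq_str)))
```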

Importance of Feature Selection

Experiment 3: This experiment used the random forest classification algorithm to classify the feature sets obtained after MRMD feature selection. The results are given in Table 3.

Table 3

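| Model | Feature extraction | SN (%) | SP (%) | ACC (%) | MCC (%) |
|---|---|---|---|---|---|
| Model 1 | CCPA (188) | 87.5 | 93.4 | 91.5 | 81.4 |
| Model 2 | AKSNG (400) | 82.9 | 92.2 | 89.0 | 76.0 |
| Model 3 | Seq-Str (473) | 86.7 | 96.6 | 93.2 | 84.8 |
| Model 4 | Combine (588) | 87.6 | 93.5 | 91.5 | 81.5 |
| Model 5 | Combine (661) | 87.9 | 96.3 | 93.5 | 85.3 |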

Classification performance under each model.

A comparison of the data in Tables 2 and 3 shows that, after applying the feature selection algorithm (MRMD), the classification performance does not deteriorate as the dimension decreases; in fact, the accuracy of every model improves slightly. After removing the redundant features, the best classification performance is still obtained with the combined feature set, that is, the 256-dimensional feature set obtained by removing redundant features from the 661-dimensional feature set.

Comparison With Recent Methods

Experiment 4: To provide an objective demonstration of the performance of the model described in this paper, this experiment compared the optimal proposed model with bacteriophage protein classification models proposed in recent years. The results are presented in Table 4.

Table 4

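| Model | SN (%) | SP (%) | ACC (%) | MCC (%) |
|---|---|---|---|---|
| Feng et al. (2013b) | 75.7 | 80.7 | 79.1 | 54.9 |
| Ding et al. (2014) | 75.7 | 89.4 | 85.0 | 65.5 |
| Zhang et al. (2015) | 87.0 | 83.0 | 85.0 | 70.1 |
| This study | 87.9 | 96.3 | 93.5 | 85.3 |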

Performance comparison against recent methods.

It is clear from Table 4 that the bacteriophage classification model proposed in this paper achieves good classification performance, with a classification accuracy of 93.5%. This is 14.4 percentage points higher than the accuracy reported by Feng et al. (2013b) and 8.5 percentage points higher than that of both Ding et al. (2014) and Zhang et al. (2015). The other three evaluation indicators also improve to varying degrees, indicating that the model proposed in this paper is an effective tool for phage protein classification.

Analyzing the Impact of Eight Physicochemical Properties

This section summarizes the eight features that have the most significant impact on the classification of bacteriophage proteins. These top eight features are listed in Table 5 in order of their impact.

Table 5

| No. | Feature | Score | Implication |
|---|---|---|---|
| 1 | Fea 120 | 1.0 | Position of the last (100%) neutral-charge amino acid in a sequence |
| 2 | Fea 157 | 0.9968696407744475 | Position of the last (100%) helical amino acid in a sequence |
| 3 | Fea 178 | 0.9950260206126923 | Position of the last (100%) soluble amino acid in a sequence |
| 4 | Fea 99 | 0.9949600329187752 | Position of the last (100%) neutral polarizability amino acid in a sequence |
| 5 | Fea 136 | 0.9948079966447566 | Position of the last (100%) large tensile amino acid in a sequence |
| 6 | Fea 83 | 0.994509178771573 | Position of the last (100%) high-electrode amino acid in a sequence |
| 7 | Fea 52 | 0.994137797849692 | Position of the last (100%) small van der Waals volume amino acid in a sequence |
| 8 | Fea 31 | 0.9937317569946658 | Position of the last (100%) hydrophilic amino acid in a sequence |

Impact of physicochemical properties on classification.

According to the information in this table, the eight top-ranked features are spread evenly across the eight physicochemical properties of amino acids, with one feature from each property, and the property with the greatest impact on the classification of bacteriophage proteins is the charge of amino acids.
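One way to read the feature numbers in Table 5 is to map each index back to its position in the 188-dimensional vector. The sketch below assumes a particular layout that the text does not state explicitly: features 1–20 are the amino acid composition, followed by eight blocks of 21 descriptors, each ordered as 3 composition, 3 transition, and 3 × 5 distribution values. The function name is hypothetical.

```python
QUANTILES = ("first", "25%", "50%", "75%", "100%")


def describe_feature(index):
    """Locate a 1-based index in the assumed 188-dimensional layout.

    Returns (property_block, kind, group_or_residue, quantile), where
    property_block is 0 for the amino acid composition part.
    """
    if not 1 <= index <= 188:
        raise ValueError("index must be between 1 and 188")
    if index <= 20:
        return (0, "amino acid composition", index, None)
    block, within = divmod(index - 21, 21)
    if within < 3:
        return (block + 1, "composition", within + 1, None)
    if within < 6:
        return (block + 1, "transition", within - 2, None)
    group, quantile = divmod(within - 6, 5)
    return (block + 1, "distribution", group + 1, QUANTILES[quantile])


for index in (120, 157, 178, 99, 136, 83, 52, 31):
    print(index, describe_feature(index))
```

Under this assumed layout, each of the eight listed features is a 100% distribution descriptor and each falls in a different property block, which agrees with the descriptions in the Implication column.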

Conclusion

Bacteriophage proteins are of special significance for cell typing and pathological research, so it is very important to correctly classify virion and non-virion bacteriophage proteins. To this end, this paper has proposed the following classification model: (1) higher-quality feature sets are extracted with extraction algorithms based on feature combination; (2) the optimal feature subset is selected using the MRMD algorithm for feature selection; and (3) the random forest algorithm is applied to perform protein classification. The model achieves an accuracy of up to 93.5% for the classification of bacteriophage proteins, which demonstrates that it is an important tool for this task. As future directions, link prediction paradigms, which have been successfully applied in the prediction of disease genes (Zeng et al., 2017) and miRNAs (Liu et al., 2016; Zeng et al., 2018), can be considered for the identification of bacteriophage proteins. It might also be important to integrate evolutionary information using tools like evolutionary trees and networks (Yang et al., 2013, 2014). Finally, computational intelligence methods such as neural networks (Song et al., 2018a,b) and evolutionary algorithms (Hang et al., 2018) can be applied in this field.

Statements

Author contributions

XR implemented the experiments and drafted the manuscript. LL and CW initiated the idea, conceived the whole process, and finalized the paper. All authors have read and approved the final manuscript.

Acknowledgments

The work was supported by the Natural Science Foundation of China (Nos. 61872114 and 91735306), and the National Key Research and Development Plan Task of China (No. 2016YFC0901902). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  • 1

    Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402. doi: 10.1093/nar/25.17.3389

  • 2

    Bin, L., Fule, L., Xiaolong, W., Junjie, C., Longyun, F., and Kuo-Chen, C. (2015). Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 43, W65–W71. doi: 10.1093/nar/gkv458

  • 3

    Breiman, L. (2001). Random forests. Mach. Learn. 45, 5–32. doi: 10.1023/A:1010933404324

  • 4

    Cao, R., and Cheng, J. (2016a). Integrated protein function prediction by mining function associations, sequences, and protein-protein and gene-gene interaction networks. Methods 93, 84–91. doi: 10.1016/j.ymeth.2015.09.011

  • 5

    Cao, R., and Cheng, J. (2016b). Protein single-model quality assessment by feature-based probability density functions. Sci. Rep. 6:23990. doi: 10.1038/srep23990

  • 6

    Cao, R., Freitas, C., Chan, L., Sun, M., Jiang, H., and Chen, Z. (2017). ProLanGO: protein function prediction using neural machine translation based on a recurrent neural network. Molecules 22:1732. doi: 10.3390/molecules22101732

  • 7

    Chen, J., Guo, M., Wang, X., and Liu, B. (2018a). A comprehensive review and comparison of different computational methods for protein remote homology detection. Brief. Bioinform. 19, 231–244. doi: 10.1093/bib/bbw108

  • 8

    Chen, W., Feng, P., Ding, H., and Lin, H. (2018b). Classifying included and excluded exons in exon skipping event using histone modifications. Front. Genet. 9:433. doi: 10.3389/fgene.2018.00433

  • 9

    Chen, W., Feng, P., Tang, H., Ding, H., and Lin, H. (2016). RAMPred: identifying the N1-methyladenosine sites in eukaryotic transcriptomes. Sci. Rep. 6:31080. doi: 10.1038/srep31080

  • 10

    Chen, W., Feng, P., Yang, H., Ding, H., Lin, H., and Chou, K. C. (2018c). iRNA-3typeA: identifying three types of modification at RNA's adenosine sites. Mol. Ther. Nucleic Acids 11, 468–474. doi: 10.1016/j.omtn.2018.03.012

  • 11

    Chen, W., Feng, P. M., Lin, H., and Chou, K. C. (2014). iSS-PseDNC: identifying splicing sites using pseudo dinucleotide composition. Biomed Res. Int. 2014, 1–12. doi: 10.1155/2014/623149

  • 12

    Chen, W., and Lin, H. (2012). Identification of voltage-gated potassium channel subfamilies from sequence information using support vector machine. Comput. Biol. Med. 42, 504–507. doi: 10.1016/j.compbiomed.2012.01.003

  • 13

    Chen, W., Yang, H., Feng, P., Ding, H., and Lin, H. (2017). iDNA4mC: identifying DNA N4-methylcytosine sites based on nucleotide chemical properties. Bioinformatics 33, 3518–3523. doi: 10.1093/bioinformatics/btx479

  • 14

    Chen, X., Guan, N. N., Sun, Y. Z., Li, J. Q., and Qu, J. (2018d). MicroRNA-small molecule association identification: from experimental results to computational models. Brief. Bioinform. 2018:bby098. doi: 10.1093/bib/bby098

  • 15

    Chen, X., and Huang, L. (2017). LRSSLMDA: Laplacian regularized sparse subspace learning for MiRNA-disease association prediction. PLoS Comput. Biol. 13:e1005912. doi: 10.1371/journal.pcbi.1005912

  • 16

    Chen, X., Sun, Y. Z., Guan, N. N., Qu, J., Huang, Z. A., Zhu, Z. X., et al. (2018e). Computational models for lncRNA function prediction and functional similarity calculation. Brief. Funct. Genomics 18, 58–82. doi: 10.1093/bfgp/ely031

  • 17

    Chen, X., Wang, L., Qu, J., Guan, N. N., and Li, J. Q. (2018f). Predicting miRNA-disease association based on inductive matrix completion. Bioinformatics 34, 4256–4265. doi: 10.1093/bioinformatics/bty503

  • 18

    Chen, X., Xie, D., Wang, L., Zhao, Q., You, Z. H., and Liu, H. (2018g). BNPMDA: bipartite network projection for MiRNA-disease association prediction. Bioinformatics 34, 3178–3186. doi: 10.1093/bioinformatics/bty333

  • 19

    Chen, X., Xie, D., Zhao, Q., and You, Z. H. (2017b). MicroRNAs and complex diseases: from experimental results to computational models. Brief. Bioinform. 2017:bbx130. doi: 10.1093/bib/bbx130

  • 20

    Chen, X., Yan, C. C., Zhang, X., and You, Z. H. (2017a). Long non-coding RNAs and complex diseases: from experimental results to computational models. Brief. Bioinform. 18, 558–576. doi: 10.1093/bib/bbw060

  • 21

    Chen, X., Yan, C. C., Zhang, X., Zhang, X., Dai, F., Yin, J., et al. (2016). Drug-target interaction prediction: databases, web servers and computational models. Brief. Bioinform. 17, 696–712. doi: 10.1093/bib/bbv066

  • 22

    Chen, X., and Yan, G. Y. (2013). Novel human lncRNA-disease association inference based on lncRNA expression profiles. Bioinformatics 29, 2617–2624. doi: 10.1093/bioinformatics/btt426

  • 23

    Chen, X., Yin, J., Qu, J., and Huang, L. (2018h). MDHGI: matrix decomposition and heterogeneous graph inference for miRNA-disease association prediction. PLoS Comput. Biol. 14:e1006418. doi: 10.1371/journal.pcbi.1006418

  • 24

    Cheng, L., Hu, Y., Sun, J., Zhou, M., and Jiang, Q. (2018). DincRNA: a comprehensive web-based bioinformatics toolkit for exploring disease associations and ncRNA function. Bioinformatics 34, 1953–1956. doi: 10.1093/bioinformatics/bty002

  • 25

    Cheng, L., Sun, J., Xu, W., Dong, L., Hu, Y., and Zhou, M. (2016). OAHG: an integrated resource for annotating human genes with multi-level ontologies. Sci. Rep. 6:34820. doi: 10.1038/srep34820

  • 26

    Cheng, L., Yang, H., Zhao, H., Pei, X., Shi, H., Sun, J., et al. (2019). MetSigDis: a manually curated resource for the metabolic signatures of diseases. Brief. Bioinform. 20, 203–209. doi: 10.1093/bib/bbx103

  • 27

    Chou, K., and Com, M. P. (2010). Prediction of protein cellular attributes using pseudo-amino acid composition. Protein Struct. Funct. Bioinform. 43, 246–255. doi: 10.1002/prot.1035

  • 28

    Coia, G., Parker, M. D., Speight, G., Byrne, M. E., and Westaway, E. G. (1988). Nucleotide and complete amino acid sequences of Kunjin virus: definitive gene order and characteristics of the virus-specified proteins. J. Gen. Virol. 69, 1–21. doi: 10.1099/0022-1317-69-1-1

  • 29

    Consortium, U. P. (2012). Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 40, D71–D75. doi: 10.1093/nar/gkr981

  • 30

    Dao, F. Y., Lv, H., Wang, F., Feng, C. Q., Ding, H., Chen, W., et al. (2018). Identify origin of replication in Saccharomyces cerevisiae using two-step feature selection technique. Bioinformatics 2018:bty943. doi: 10.1093/bioinformatics/bty943

  • 31

    Dehzangi, A., Paliwal, K., Sharma, A., Dehzangi, O., and Sattar, A. (2013). A combination of feature extraction methods with an ensemble of different classifiers for protein structural class prediction problem. IEEE/ACM Trans. Comput. Biol. Bioinform. 10, 564–575. doi: 10.1109/TCBB.2013.65

  • 32

    Deza, M. M., and Deza, E. (2009). Encyclopedia of distances. Refer. Rev. 24, 1–583. doi: 10.1007/978-3-642-00234-2

  • 33

    Ding, H., Feng, P. M., Chen, W., and Lin, H. (2014). Identification of bacteriophage virion proteins by the ANOVA feature selection and analysis. Mol. Biosyst. 10, 2229–2235. doi: 10.1039/C4MB00316K

  • 34

    Ding, Y., Tang, J., and Guo, F. (2016). Predicting protein-protein interactions via multivariate mutual information of protein sequences. BMC Bioinform. 17:398. doi: 10.1186/s12859-016-1253-9

  • 35

    Ding, Y., Tang, J., and Guo, F. (2017a). Identification of protein-ligand binding sites by sequence information and ensemble classifier. J. Chem. Inf. Model. 57, 3149–3161. doi: 10.1021/acs.jcim.7b00307

  • 36

    Ding, Y., Tang, J., and Guo, F. (2017b). Identification of drug-target interactions via multiple information integration. Inf. Sci. 418, 546–560. doi: 10.1016/j.ins.2017.08.045

  • 37

    Dubchak, I., Muchnik, I., Holbrook, S. R., and Kim, S. H. (1995). Prediction of protein folding class using global description of amino acid sequence. Proc. Natl. Acad. Sci. U.S.A. 92, 8700–8704. doi: 10.1073/pnas.92.19.8700

  • 38

    Feng, C. Q., Zhang, Z. Y., Zhu, X. J., Lin, Y., Chen, W., Tang, H., et al. (2018). iTerm-PseKNC: a sequence-based tool for predicting bacterial transcriptional terminators. Bioinformatics 2018:bty827. doi: 10.1093/bioinformatics/bty827

  • 39

    Feng, P. M., Chen, W., Lin, H., and Chou, K. C. (2013a). iHSP-PseRAAAC: identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Anal. Biochem. 442, 118–125. doi: 10.1016/j.ab.2013.05.024

  • 40

    Feng, P. M., Ding, H., Chen, W., and Lin, H. (2013b). Naïve Bayes classifier with feature selection to identify phage virion proteins. Comput. Math. Methods Med. 2013:530696. doi: 10.1155/2013/530696

  • 41

    Feng, P. M., Lin, H., and Chen, W. (2013c). Identification of antioxidants from sequence information using naïve Bayes. Comput. Math. Methods Med. 2013, 1–5. doi: 10.1155/2013/567529

  • 42

    Fu, L., Niu, B., Zhu, Z., Wu, S., and Li, W. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152. doi: 10.1093/bioinformatics/bts565

  • 43

    Guthrie, D., Allison, B., Liu, W., Guthrie, L., and Wilks, Y. (2006). "A closer look at skip-gram modelling," in Proceedings of the 5th International Conference on Language Resources and Evaluation (Genoa: LREC), 1–4.

  • 44

    Hang, X., Zeng, W., Zeng, X., and Yen, G. G. (2018). An evolutionary algorithm based on Minkowski distance for many-objective optimization. IEEE Trans. Cybern. 99, 1–12. doi: 10.1109/TCYB.2018.2856208

  • 45

    Haq, I. U., Chaudhry, W. N., Akhtar, M. N., Andleeb, S., and Qadri, I. (2012). Bacteriophages and their implications on future biotechnology: a review. Virol. J. 9:9. doi: 10.1186/1743-422X-9-9

  • 46

    Hershey, A. D., and Chase, M. (1952). Independent functions of viral protein and nucleic acid in growth of bacteriophage. J. Gen. Physiol. 36, 39–56. doi: 10.1085/jgp.36.1.39

  • 47

    Hu, Y., Zhao, T., Zhang, N., Zang, T., Zhang, J., and Cheng, L. (2018). Identifying diseases-related metabolites using random walk. BMC Bioinform. 19(Suppl. 5):116. doi: 10.1186/s12859-018-2098-1

  • 48

    Huang, L., Li, X., Guo, P., Yao, Y., Liao, B., Zhang, W., et al. (2017). Matrix completion with side information and its applications in predicting the antigenicity of influenza viruses. Bioinformatics 33, 3195–3201. doi: 10.1093/bioinformatics/btx390

  • 49

    Huang, Y., Niu, B., Gao, Y., Fu, L., and Li, W. (2010). CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 26, 680–682. doi: 10.1093/bioinformatics/btq003

  • 50

    Jia, J., Liu, Z., Xiao, X., Liu, B., and Chou, K. C. (2016). iPPBS-Opt: a sequence-based ensemble classifier for identifying protein-protein binding sites by optimizing imbalanced training datasets. Molecules 21:95. doi: 10.3390/molecules21010095

  • 51

    Jiang, S., Liu, B., and Zou, Q. (2018). HITS-PR-HHblits: protein remote homology detection by combining PageRank and hyperlink-induced topic search. Brief. Bioinform. 2018:bby104. doi: 10.1093/bib/bby104

  • 52

    Jiang, Y., Oron, T. R., Clark, W. T., Bankapur, A. R., D'Andrea, D., Lepore, R., et al. (2016). An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol. 17:184. doi: 10.1186/s13059-016-1037-6

  • 53

    Jingjing, H., Ting, F., Zizheng, Z., Bei, H., Xiaolei, Z., and Yi, X. (2018). PseUI: pseudouridine sites identification based on RNA sequence information. BMC Bioinform. 19:306. doi: 10.1186/s12859-018-2321-0

  • 54

    Jones, D. T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292, 195–202. doi: 10.1006/jmbi.1999.3091

  • 55

    Larson, R. E., and Edwards, B. H. (1991). Elementary Linear Algebra. 2nd Edn. Lexington, MA: D.C. Heath and Company.

  • 56

    Leyi, W., Huangrong, C., and Ran, S. (2018). M6APred-EL: a sequence-based predictor for identifying N6-methyladenosine sites using ensemble learning. Mol. Ther. 2018, 635–644. doi: 10.1016/j.omtn.2018.07.004

  • 57

    Leyi, W., Minghong, L., Xing, G., and Quan, Z. (2015). An improved protein structural classes prediction method by incorporating both sequence and structure information. IEEE Trans. Nanobiosci. 14, 339–349. doi: 10.1109/TNB.2014.2352454

  • 58

    Li, J., Halgamuge, S. K., Kells, C. I., and Tang, S. L. (2007). Gene function prediction based on genomic context clustering and discriminative learning: an application to bacteriophages. BMC Bioinform. 8:S6. doi: 10.1186/1471-2105-8-S4-S6

  • 59

    Li, W., and Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659. doi: 10.1093/bioinformatics/btl158

  • 60

    Li, W., Jaroszewski, L., and Godzik, A. (2001). Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 17, 282–283. doi: 10.1093/bioinformatics/17.3.282

  • 61

    Li, Z., Tang, J., and Guo, F. (2016). Learning from real imbalanced data of 14-3-3 proteins binding specificity. Neurocomputing 217, 83–91. doi: 10.1016/j.neucom.2016.03.093

  • 62

    Liu, B. (2017). BioSeq-analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Brief. Bioinform. 2017:bbx165. doi: 10.1093/bib/bbx165

  • 63

    Liu, Y., Wang, X., and Liu, B. (2019). A comprehensive review and comparison of existing computational methods for intrinsically disordered protein and region prediction. Brief. Bioinform. 20, 330–346. doi: 10.1093/bib/bbx126

  • 64

    Liu, Y., Zeng, X., He, Z., and Zou, Q. (2016). Inferring microRNA-disease associations by random walk on a heterogeneous network with multiple data sources. IEEE/ACM Trans. Comput. Biol. Bioinform. 14, 905–915. doi: 10.1109/TCBB.2016.2550432

  • 65

    Marks, T., and Sharp, R. (2015). Bacteriophages and biotechnology: a review. J. Chem. Technol. Biotechnol. 75, 6–17. doi: 10.1002/(SICI)1097-4660(200001)75:1<6::AID-JCTB157>3.0.CO;2-A

  • 66

    Marvin, D. A., Hale, R. D., Nave, C., and Helmer-Citterich, M. (1994). Molecular models and structural comparisons of native and mutant class I filamentous bacteriophages Ff (fd, f1, M13), If1 and IKe. J. Mol. Biol. 235, 260–286. doi: 10.1016/S0022-2836(05)80032-4

  • 67

    Mrozek, D., Daniłowicz, P., and Małysiak-Mrozek, B. (2016). HDInsight4PSi: boosting performance of 3D protein structure similarity searching with HDInsight clusters in Microsoft Azure cloud. Inf. Sci. 349, 77–101. doi: 10.1016/j.ins.2016.02.029

  • 68

    Mrozek, D., Socha, B., Kozielski, S., and Małysiak-Mrozek, B. (2015). An efficient and flexible scanning of databases of protein secondary structures. J. Intell. Inf. Syst. 46, 213–233. doi: 10.1007/s10844-014-0353-0

  • 69

    Pearson, K. (1909). Determination of the coefficient of correlation. Science 30, 23–25. doi: 10.1126/science.30.757.23

  • 70

    Qiao, Y., Xiong, Y., Gao, H., Zhu, X., and Chen, P. (2018). Protein-protein interface hot spots prediction based on a hybrid feature selection strategy. BMC Bioinform. 19:14. doi: 10.1186/s12859-018-2009-5

  • 71

    Qu, K., Han, K., Wu, S., Wang, G., and Wei, L. (2017). Identification of DNA-binding proteins using mixed feature representation methods. Molecules 22:E1602. doi: 10.3390/molecules22101602

  • 72

    Robert, C. (2012). Machine learning, a probabilistic perspective. Chance 27, 62–63. doi: 10.1080/09332480.2012.726570

  • 73

    Rogers, D. J., and Tanimoto, T. T. (1960). A computer program for classifying plants. Science 132, 1115–1118. doi: 10.1126/science.132.3434.1115

  • 74

    Rolf, A. (2004). UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. 32, D115–D119. doi: 10.1093/nar/gkh131

  • 75

    Seguritan, V., Alves, N. Jr., Arnoult, M., Raymond, A., Lorimer, D., Burgin, A. B. Jr., et al. (2012). Artificial neural networks trained to detect viral and phage structural proteins. PLoS Comput. Biol. 8:e1002657. doi: 10.1371/journal.pcbi.1002657

  • 76

    Shen, H. B., and Chou, K. C. (2008). PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition. Anal. Biochem. 373, 386–388. doi: 10.1016/j.ab.2007.10.012

  • 77

    Shen, Y., Tang, J., and Guo, F. (2019). Identification of protein subcellular localization via integrating evolutionary and physicochemical information into Chou's general PseAAC. J. Theor. Biol. 462, 230–239. doi: 10.1016/j.jtbi.2018.11.012

  • 78

    Song, T., Rodríguez-Patón, A., Zheng, P., and Zeng, X. (2018a). Spiking neural P systems with colored spikes. IEEE Trans. Cogn. Dev. Syst. 10, 1106–1115. doi: 10.1109/TCDS.2017.2785332

  • 79

    Song, T., Zeng, X., Zheng, P., Jiang, M., and Rodríguez-Patón, A. (2018b). A parallel workflow pattern modeling using spiking neural P systems with colored spikes. IEEE Trans. Nanobiosci. 17, 474–484. doi: 10.1109/TNB.2018.2873221

  • 80

    Stephenson, N., Shane, E., Chase, J., Rowland, J., Ries, D., Justice, N., et al. (2018). Survey of machine learning techniques in drug discovery. Curr. Drug Metab. doi: 10.2174/1389200219666180820112457. [Epub ahead of print].

  • 81

    Su, R., Wu, H., Xu, B., Liu, X., and Wei, L. (2018). Developing a multi-dose computational model for drug-induced hepatotoxicity prediction based on toxicogenomics data. IEEE/ACM Trans. Comput. Biol. Bioinform. doi: 10.1109/TCBB.2018.2858756. [Epub ahead of print].

  • 82

    Tan, P. N., Steinbach, M., and Kumar, V. (2005). Introduction to Data Mining. Boston, MA: Pearson Addison Wesley.

  • 83

    Tang, H., Zhao, Y. W., Zou, P., Zhang, C. M., Chen, R., Huang, P., et al. (2018). HBPred: a tool to identify growth hormone-binding proteins. Int. J. Biol. Sci. 14, 957–964. doi: 10.7150/ijbs.24174

  • 84

    Ting, H., Guangyong, Z., Pingyu, Z., Jia, J., Jing, L., Lu, X., et al. (2014). LAceP: lysine acetylation site prediction using logistic regression classifiers. PLoS ONE 9:e89575. doi: 10.1371/journal.pone.0089575

  • 85

    Wang, P., Zhu, W., Liao, B., Cai, L., Peng, L., and Yang, J. (2018). Predicting influenza antigenicity by matrix completion with antigen and antiserum similarity. Front. Microbiol. 9:2500. doi: 10.3389/fmicb.2018.02500

  • 86

    Wei, L., Liao, M., Gao, X., and Zou, Q. (2015). Enhanced protein fold prediction method through a novel feature extraction technique. IEEE Trans. Nanobiosci. 14, 649–659. doi: 10.1109/TNB.2015.2450233

  • 87

    Wei, L., Su, R., Wang, B., Li, X., Zou, Q., and Gao, X. (2019). Integration of deep feature representations and handcrafted features to improve the prediction of N6-methyladenosine sites. Neurocomputing 324, 3–9. doi: 10.1016/j.neucom.2018.04.082

  • 88

    Wei, L., Tang, J., and Zou, Q. (2017a). SkipCPP-Pred: an improved and promising sequence-based predictor for predicting cell-penetrating peptides. BMC Genomics 18:742. doi: 10.1186/s12864-017-4128-1

  • 89

    Wei, L., Wan, S., Guo, J., and Wong, K. K. (2017b). A novel hierarchical selective ensemble classifier with bioinformatics application. Artif. Intell. Med. 83, 82–90. doi: 10.1016/j.artmed.2017.02.005

  • 90

    Wei, L., Xing, P., Zeng, J., Chen, J., Su, R., and Guo, F. (2017c). Improved prediction of protein-protein interactions using novel negative samples, features, and an ensemble classifier. Artif. Intell. Med. 83, 67–74. doi: 10.1016/j.artmed.2017.03.001

  • 91

    Wu, L. C., Lee, J. X., Huang, H. D., Liu, B. J., and Horng, J. T. (2009). An expert system to predict protein thermostability using decision tree. Expert Syst. Appl. 36, 9007–9014. doi: 10.1016/j.eswa.2008.12.020

  • 92

    Xiong, Y., Wang, Q., Yang, J., Zhu, X., and Wei, D.-Q. (2018). PredT4SE-stack: prediction of bacterial type IV secreted effectors from protein sequences using a stacked ensemble method. Front. Microbiol. 9:2571. doi: 10.3389/fmicb.2018.02571

  • 93

    Xu, L., Liang, G., Shi, S., and Liao, C. (2018a). SeqSVM: a sequence-based support vector machine method for identifying antioxidant proteins. Int. J. Mol. Sci. 19:1773.

  • 94

    Xu, L., Liang, G., Wang, L., and Liao, C. (2018b). A novel hybrid sequence-based model for identifying anticancer peptides. Genes 9:E158. doi: 10.3390/genes9030158

  • 95

    Xu, Q., Xiong, Y., Dai, H., Kumari, K. M., Xu, Q., Ou, H. Y., et al. (2017). PDC-SGB: prediction of effective drug combinations using a stochastic gradient boosting algorithm. J. Theor. Biol. 417, 1–7. doi: 10.1016/j.jtbi.2017.01.019

  • 96

    Yang, H., Lv, H., Ding, H., Chen, W., and Lin, H. (2018). iRNA-2OM: a sequence-based predictor for identifying 2'-O-methylation sites in Homo sapiens. J. Comput. Biol. 25, 1266–1277. doi: 10.1089/cmb.2018.0004

  • 97

    Yang, J., Grunewald, S., and Wan, X. F. (2013). Quartet-net: a quartet-based method to reconstruct phylogenetic networks. Mol. Biol. Evol. 30, 1206–1217. doi: 10.1093/molbev/mst040

  • 98

    Yang, J., Grünewald, S., Xu, Y., and Wan, X.-F. (2014). Quartet-based methods to reconstruct phylogenetic networks. BMC Syst. Biol. 8:21. doi: 10.1186/1752-0509-8-21

  • 99

    Yang, R., Zhang, C., Gao, R., and Zhang, L. (2015). An ensemble method with hybrid features to identify extracellular matrix proteins. PLoS ONE 10:e0117804. doi: 10.1371/journal.pone.0117804

  • 100

    Yao, Y., Li, X., Liao, B., Huang, L., He, P., Wang, F., et al. (2017). Predicting influenza antigenicity from Hemagglutintin sequence data based on a joint random forest method. Sci. Rep. 7:1545. doi: 10.1038/s41598-017-01699-z

  • 101

    Yi, X., Juan, L., and Dong-Qing, W. (2011). An accurate feature-based method for identifying DNA-binding residues on protein surfaces. Proteins Struct. Funct. Bioinform. 79, 509–517. doi: 10.1002/prot.22898

  • 102

    Yu, L., Huang, J., Ma, Z., Zhang, J., Zou, Y., and Gao, L. (2015). Inferring drug-disease associations based on known protein complexes. BMC Med. Genomics 8:S2. doi: 10.1186/1755-8794-8-S2-S2

  • 103

    Yu, L., Ma, X., Zhang, L., Zhang, J., and Gao, L. (2016a). Prediction of new drug indications based on clinical data and network modularity. Sci. Rep. 6:32530. doi: 10.1038/srep32530

  • 104

    Yu, L., Su, R., Wang, B., Zhang, L., Zou, Y., Zhang, J., et al. (2017a). Prediction of novel drugs for hepatocellular carcinoma based on multi-source random walk. IEEE/ACM Trans. Comput. Biol. Bioinform. 14, 966–977. doi: 10.1109/TCBB.2016.2550453

  • 105

    Yu, L., Wang, B., Ma, X., and Gao, L. (2016b). The extraction of drug-disease correlations based on module distance in incomplete human interactome. BMC Syst. Biol. 10:111. doi: 10.1186/s12918-016-0364-2

  • 106

    Yu, L., Zhao, J., and Gao, L. (2017b). Drug repositioning based on triangularly balanced structure for tissue-specific diseases in incomplete interactome. Artif. Intell. Med. 77, 53–63. doi: 10.1016/j.artmed.2017.03.009

  • 107

    Yu, L., Zhao, J., and Gao, L. (2018). Predicting potential drugs for breast cancer based on miRNA and tissue specificity. Int. J. Biol. Sci. 14, 971–982. doi: 10.7150/ijbs.23350

  • 108

    Zeng, X., Ding, N., Rodríguez-Patón, A., and Zou, Q. (2017). Probability-based collaborative filtering model for predicting gene disease associations. BMC Med. Genomics 10:76. doi: 10.1186/s12920-017-0313-y

  • 109

    Zeng, X., Liu, L. L., and Zou, Q. (2018). Prediction of potential disease-associated microRNAs using structural perturbation method. Bioinformatics 34, 2425–2432. doi: 10.1093/bioinformatics/bty112

  • 110

    Zhang, J., Ju, Y., Lu, H., Xuan, P., and Zou, Q. (2016). Accurate identification of cancerlectins through hybrid machine learning technology. Int. J. Genomics 2016, 1–11. doi: 10.1155/2016/7604641

  • 111

    Zhang, J., and Liu, B. (2017). PSFM-DBT: identifying DNA-binding proteins by combing position specific frequency matrix and distance-bigram transformation. Int. J. Mol. Sci. 18:E1856. doi: 10.3390/ijms18091856

  • 112

    Zhang, L., Zhang, C., Gao, R., and Yang, R. (2015). An ensemble method to distinguish bacteriophage virion from non-virion proteins based on protein sequence characteristics. Int. J. Mol. Sci. 16, 21734–21758. doi: 10.3390/ijms160921734

  • 113

    Zhu, P., Hu, Q., Han, Y., Zhang, C., and Du, Y. (2016). Combining neighborhood separable subspaces for classification via sparsity regularized optimization. Inf. Sci. 370, 270–287. doi: 10.1016/j.ins.2016.08.004

  • 114

    Zhu, P., Hu, Q., Hu, Q., Zhang, C., and Feng, Z. (2018c). Multi-view label embedding. Pattern Recognit. 84, 126–135. doi: 10.1016/j.patcog.2018.07.009

  • 115

    Zhu, P., Xu, Q., Hu, Q., and Zhang, C. (2018a). Co-regularized unsupervised feature selection. Neurocomputing 275, 2855–2863. doi: 10.1016/j.neucom.2017.11.061

  • 116

    Zhu, P., Xu, Q., Hu, Q., Zhang, C., and Zhao, H. (2018b). Multi-label feature selection with missing labels. Pattern Recognit. 74, 488–502. doi: 10.1016/j.patcog.2017.09.036

  • 117

    Zhu, X.-J., Feng, C.-Q., Lai, H.-Y., Chen, W., and Hao, L. (2019). Predicting protein structural classes for low-similarity sequences by evaluating different features. Knowledge Based Syst. 163, 787–793. doi: 10.1016/j.knosys.2018.10.007

  • 118

    Zou, Q., Li, X. B., Jiang, W. R., Lin, Z. Y., Li, G. L., and Chen, K. (2014). Survey of MapReduce frame operation in bioinformatics. Brief. Bioinform. 15, 637–647. doi: 10.1093/bib/bbs088

  • 119

    Zou, Q., Wang, Z., Guan, X., Liu, B., Wu, Y., and Lin, Z. (2013). An approach for identifying cytokines based on a novel ensemble classifier. Biomed Res. Int. 2013:686090. doi: 10.1155/2013/686090

  • 120

    Zou, Q., Zeng, J., Cao, L., and Ji, R. (2016). A novel features ranking metric with application to scalable visual and bioinformatics data classification. Neurocomputing 173, 346–354. doi: 10.1016/j.neucom.2014.12.123

Summary

Keywords

phage virion proteins, machine learning, feature extraction, feature selection, hybrid sequence features

Citation

Ru X, Li L and Wang C (2019) Identification of Phage Viral Proteins With Hybrid Sequence Features. Front. Microbiol. 10:507. doi: 10.3389/fmicb.2019.00507

Received

24 December 2018

Accepted

27 February 2019

Published

26 March 2019

Volume

10 - 2019

Edited by

Hongsheng Liu, Liaoning University, China

Reviewed by

Zhiwei Ji, University of Texas Health Science Center, United States; Nuria Quiles Puchalt, University of Glasgow, United Kingdom

Copyright

*Correspondence: Chunyu Wang

This article was submitted to Systems Microbiology, a section of the journal Frontiers in Microbiology

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
