Classifying the Lifestyle of Metagenomically-Derived Phages Sequences Using Alignment-Free Methods

Phages are viruses that infect bacteria. The phages can be classified into two different categories based on their lifestyles: temperate and lytic. Now, the metavirome can generate a large number of fragments from the viral genomic sequences of entire environmental community, which makes it impossible to determine their lifestyles through experiments. Thus, there is a need to development computational methods for annotating phage contigs and making prediction of their lifestyles. Alignment-based methods for classifying phage lifestyle are limited by incomplete assembled genomes and nucleotide databases. Alignment-free methods based on the frequencies of k-mers were widely used for genome and metagenome comparison which did not rely on the completeness of genome or nucleotide databases. To mimic fragmented metagenomic sequences, the temperate and lytic phages genomic sequences were split into non-overlapping fragments with different lengths, then, I comprehensively compared nine alignment-free dissimilarity measures with a wide range of choices of k-mer length and Markov orders for predicting the lifestyles of these phage contigs. The dissimilarity measure, d2S, performed better than other dissimilarity measures for classifying the lifestyles of phages. Thus, I propose that the alignment-free method, d2S, can be used for predicting the lifestyles of phages which derived from the metagenomic data.


INTRODUCTION
Viruses are distributed in every corner of the earth, and they play important roles in the ecosystem (Srinivasiah et al., 2008). A virus is a small individual with a simple structure, containing only one type of nucleic acid (DNA or RNA), and must parasitize and replicate in living cells (Whitman et al., 1998). Viruses can infect all kinds of organisms, from mammals to bacteria. An important class of viruses is bacteriophages, which can infect and kill bacterial cells.
The lifestyles of phages can be divided into two different types, temperate and lytic (Chopin et al., 2001;Knowles et al., 2016). Temperate phages can replicate and spread by integrating their genetic information into the bacterial genome. However, the lytic phages replicate themselves in bacterial cells and spread by killing the cells. Bacteriophages play important roles in microbial community, and identifying their lifestyles is the first step to understand their functions. Now, the metavirome can generate a large number of fragments from the viral genomic sequences of entire environmental community, which makes it impossible to determine their lifestyles through experiments. So, developing computational methods is necessary to predict the lifestyles of phages.
The previous studies of classifying phages using genomic data were mainly using alignment-based methods (Proux et al., 2002;Rohwer and Edwards, 2002;Lima-Mendez et al., 2008, 2011. However, very few studies focus on classifying the lifestyles of phages. McNair et al., 2012 utilizes a similarity algorithm and a supervised Random Forest classifier to predict the lifestyle of a phage (McNair et al., 2012). The similarity algorithm which creates a training set from phages with known lifestyles based on the alignment of protein sequences, is used to train a Random Forest to classify the lifestyle of a phage. Mavrich and Hatfull (2017) identified the temperate phages as those containing at least one temperate phage Pham (Mavrich and Hatfull, 2017). These two methods are based on protein sequence alignment, thus, the completely assembled phages sequences were needed before the usage of their methods. Nowadays, metavirome studies using high throughput sequencing technology can generate massive amounts of short read sequences from virus genomes (Lecuit and Eloit, 2013;Wylie et al., 2013;Brum et al., 2015). However, assembly of these short reads were difficulty for the highly mosaic organization of virus genomes (Hendrix et al., 1999). So, metavirome studies could produce large amount of incomplete of fragments from viral genome which made the previous alignment-based methods could not been used for predicting the lifestyles.
Alignment-free methods based on the frequencies of k-mers (k-words or k-tuples) were widely used for genome and metagenome comparison as recently reviewed (Song et al., 2014;Zielezinski et al., 2017;Ren et al., 2018). A k-tuple is a short base fragment of length k on genomic sequences. The alignment-free dissimilarity measures, d S 2 and d * 2 , were firstly developed for comparing two long DNA sequences, and then, successfully applied in many other fields, including phylogenetic tree construction (Song et al., 2013), the comparison of metagenomic samples (Jiang et al., 2012;Liao et al., 2016;Song et al., 2019) and gene regulatory regions (Song et al., 2013), identification of horizontal gene transfer  and virus-host interactions (Ahlgren et al., 2017), and improving contig binning for metagenomes (Wang et al., 2017). Also, many other alignment-free methods have been developed and applicated in many fields, see the reviews (Zielezinski et al., 2017(Zielezinski et al., , 2019Ren et al., 2018).
In this study, I have conducted a comprehensive evaluation of nine alignment-free dissimilarity measures over various k-mer lengths for classifying the lifestyles of phages. To evaluate prediction accuracy, I used a benchmark dataset of 1,562 phages genomes available at the National Center for Biotechnology Information (NCBI) for which the lifestyles were reported. Then, the 1,225 of the phages identified before 31/12/2013 were used for constructing the training models. The 337 of the phages identified between 1/1/2014 and 31/12/2016 were used for testing different alignment-free methods. Overall, the d S 2 dissimilarity measure performed better than other dissimilarity measures for classifying the lifestyles of phages. The software is available at https://github.com/songkai1987/PhagePred.

Virus Databases
RefSeq genomes of phages infecting bacteria or archaea were downloaded from NCBI on 20/10/2019. The lifestyles for the 1,562 phages identified before 31/12/2016 were predicted in Mavrich and Hatfull (2017) (Mavrich and Hatfull, 2017). In the set of phages with a known lifestyle, there were 463 temperate phages and 1,099 lytic phages. The 1,225 phages identified before 31/12/2013 were used for constructing the training models (Supplementary Table 1). The 337 of the phages identified between 1/1/2014 and 31/12/2016 were used for testing different alignment-free dissimilarity measures (Supplementary Table 2). The 325 phages identified after 1/1/2017 were used for novel phages for testing (Supplementary Table 3). The lifestyles of these phages were predicted using the same methods in (Mavrich and Hatfull, 2017).
To mimic the phage contigs assembly from metagenomic data sets, the temperate and lytic phages genomic sequences were split into non-overlapping fragments with length L = 500, 1000, 3000, 5000, and 10,000 bp. Fragments were generated for phage genomes identified between 1 January 2014 and 31 December 2016 were used as testing sets ( Table 1). To generate the evaluation datasets with the same proportion of temperate and lytic phage contigs, the same number of contigs were randomly sampled from the genomic sequences of lytic phage as the number of contigs from temperate phages.

Alignment-Free Dissimilarity Measures
Several alignment-free dissimilarity measures based on genomic oligonucleotide frequencies have been developed to infer the relationship between genomic sequences. Here, I studied nine alignment-free measures based on two different principlesthose that consider background frequencies of k-mers and those that do not. First, the k-mer frequencies from the phage genomic sequences identified before 31 December 2013 were extracted and merged as two training sets for temperate and lytic phages, respectively. Then, for a contig, its k-mer frequencies were also extracted and used for calculating its distance to temperate and lytic k-mer frequencies for inferring its lifestyle. Several common methods are used to calculate the distance: Euclidean distance (Eu), Manhattan distance (Ma), Chebyshev distance (Ch), and d 2 (Blaisdell, 1986). The background normalization methods, including d * 2 , d S 2 (Song et al., 2013), CVTree (Qi et al., 2004a,b), Teeling (Teeling et al., 2004), and EuF (Pride et al., 2006), which compute the expected k-mer frequencies to eliminate the effect of background and enhance the signal of differences between the viral sequences. These dissimilarity measures are described below. Since a read could be from the forward or reverse strand of a genome, the read was considered together with its complement for calculating the occurrences of each k-mer. Thus, for a viral contig, all possible k-mers were calculated using a finite alphabet set S = {A, C, G, T}. For a given k-mer w, its occurrence in the contig is defined as X w and the relative frequency of this k-mer is defined as f X w = X w w X w . For a given k-mer w for temperate or lytic phages in training sets, its occurrence is defined as Y w .
Some dissimilarity measures, such as d * 2 and d S 2 , need an r-th order Markov model for the background sequence. The expected number of occurrences of word w = w 1 w 2 · · · w k , E(X w ), can be calculated using the Markov model. The transition probability matrix for the Markov model can be estimated based on the r-mers and (r-1)-mers, and the estimated probability of observing the k-mer w 1 w 2 · · · w r is P M (w r+1 |w 1 w 2 · · · w r ) = X w 1 w 2 ···w r+1 X w 1 w 2 ···w r . Then, E(X w ) can be calculated as: where L is the length of the contig. The difference between the occurrences of k-mer w and its expected occurrences is defined The Euclidean distance is defined as: The Manhattan distance is defined as: The Chebyshev distance is defined as: The d 2 dissimilarity measure is defined as: The d * 2 dissimilarity measure is defined as: The d S 2 dissimilarity measure is defined as: The CVTree dissimilarity measure is defined as: is estimated using a (k-2)-th order Markov model. The Teeling dissimilarity measure is defined based on the (k-2)-th order Markov model: where E(X w ) and var(X w ) for w = w 1 w 2 · · · w k can be calculated as: The EuF dissimilarity measure is also defined based on the (k-2)-th order Markov model: is estimated based on the (k-2)-th order Markov model as above.

RESULTS
The framework of my method is given in Figure 1.  Table 2). In order to evaluate the ability of these measures for classifying novel viruses based on the previous sequenced phage genomes, date was used for parting the training and testing sequences. To mimic fragmented metagenomic sequences, phage genomes in testing sets were split into non-overlapping fragments of various lengths L = 500, 1000, 3000, 5000, and 10,000 bp ( Table 1).
The Effects of k-mer Length, Markov Order, and Contig Length I used the temperate and lytic phages genomic sequences identified before 31 December 2013 to construct two k-mer frequency vectors, then calculated the distance (dissimilarity values) between a novel contig with these two k-mer frequency FIGURE 1 | The work flow of my approach. First, the k-mer frequencies from two training sets of temperate and lytic phages were extracted, respectively. Then, the k-mer frequencies for a contig from new phage were extracted and used for calculating its distance to temperate and lytic k-mer frequencies for inferring its lifestyle.
vectors. The ratio between the distance to temperate phages and to lytic phages which reflected the possibility of the contigs was temperate or not was calculated. The values lower than one indicated the contigs closer to the temperate phages and had higher possibility been from temperate phages. Then, the receiver operating characteristic (ROC) curves were used to evaluate d * 2 and d S 2 's performances for classification, while high values of the area under the ROC curves (AUROC) indicate good performance. For d * 2 and d S 2 , AUROC values increased as k-mer length increased (Figure 2 and Supplementary Figure 1). For contigs with length ≥1,000 bp, AUROC values also increased as the Markov order of background sequences increased. These two dissimilarity measures, d * 2 and d S 2 , had similar performance. For contigs with length ≥3,000 bp, the AUROC values were larger than 0.90 when the k-mer length was eight and Markov order was three. These high AUROC values demonstrate the strong ability of the d * 2 and d S 2 dissimilarity measures to correctly classifying newly obtained viral sequences. Based on these results, Markov order three was chosen for subsequent comparison with other alignment-free dissimilarity measures. To prove the validity of my proposed method, Supplementary Figure 2 showed that the distance of newly viral sequences to the temperate and lytic genomes.

Comparison of These Alignment-Free Dissimilarity Measures' Performance
I assessed the ability of the alignment-free dissimilarity measures, d * 2 and d S 2 , to correctly classify phage contigs in comparison to other alignment-free dissimilarity measures. All these measures were tested using the same set of evaluation contigs as above: equal numbers of contigs subsampled from temperate and lytic phage genomes identified after 1 January 2014 and before 31 December 2016. AUROC values were scored for each of these measures using k-mer lengths from 6 to 10 and contig lengths 500 -10,000 bp ( Table 2 and  Supplementary Table 4). AUROC values generally increased for all the measures when k-mer length was increased from 6 to 10. For contigs with length ≥1,000 bp, both of d * 2 Frontiers in Microbiology | www.frontiersin.org FIGURE 2 | The AUROC values of d * 2 and d S 2 for classifying the lifestyles of phage contigs using k-mer lengths from 6 to 10, Markov order from 0 to 3, and contig lengths 5,000 bp (a) and 3,000 bp (b).

Sensitivity of d * 2 and d S 2 to Mutations
Because of the alignment-free dissimilarity measures relies on nucleotide k-mer frequency and there are errors in sequencing technologies, the sensitivity of our newly developed alignmentfree dissimilarity measures, d * 2 and d S 2 , to mutations were tested. In Supplementary Figure 3, thirty replicates subsampled contigs with randomly inserted mutations at three different rates (0.001, 0.005, and 0.01) were used for comparing the performance with no mutations. The AUROC values were lower but not significantly for Markov order 0 and 1 at all the three different mutation rates. For Markov order 2 and 3, the AUROC values were only significantly lower at the highest rates of 0.01 mutations per bp (P-value < 0.01, t-test). As the sequencing error rates of Illumina and 454 platforms are ∼0.001 or 0.01, respectively (Glenn, 2011), sequencing errors only slightly impact the performance of the alignment-free dissimilarity measures for the NGS technologies.

Assessment of the Classification of Novel Phages
To assess the ability of these alignment-free dissimilarity measures to classify novel phages, the 136 phages (Supplementary Table 5) (18 temperate phages and 108 lytic phages, identified after 1 January 2014 and before 31 December 2016) that had no significant nucleotide similarity (blastn search, E-values < 10−5) to previously phages genome sequences were used for testing. I classify these novel phages according to their distance to the temperate and lytic trained k-mer frequencies.
The True Positive Rates (TPR) for temperate and lytic phages and the accuracy of classification for these phages were scored for these alignment-free dissimilarity measures using k-mer lengths from 6 to 10. I only showed the results that the TPRs for temperate and lytic phages were both larger than 60%. Table 3 showed that the best classification result was obtained by d S 2 using k-mer length of 10 and Markov order of three. d S 2 could correctly predicted 12 (66.7%) of temperate phages and 95 (87.9%) of lytic phages. For other alignment-free dissimilarity measures, the best classification result was obtained by Euclidean (Eu) distance using k-mer length of 10. Euclidean distance correctly predicted 11 (61.1%) of temperate phages and 91 (84.3%) of lytic phages.

Application to Classification of Phages Identified After January 2017
The 325 phage genomes identified after 1 January 2017 were downloaded for analysis (Supplementary Table 3). The lifestyle of these phages were predicted used the same method in Mavrich andHatfull, 2017 (Mavrich andHatfull, 2017), then 72 temperate phages and 253 lytic phages were identified. These phages were used for assessing the classification accuracy of the alignment-free dissimilarity measures. Table 4 showed that the best classification result was obtained by d S 2 using k-mer length of 10 and Markov order of three. d S 2 could correctly  To mimic fragmented metagenomic sequences, these virus genomes were split into non-overlapping fragments of various length L = 1000, 3000, and 5000 bp. Table 5 showed that the best classification result was also obtained by d S 2 using k-mer length of 10 and Markov order of three for contigs with length = 5000 bp. d S 2 could correctly predicted 665 (86.4%) contigs from temperate phages and 3441 (81.6%) contigs from lytic phages. For contigs with length = 1000 and 3000 bp, d S 2 also got the best classification results using k-mer length of 10 and Markov order of 3 (Supplementary Tables 6,7).

DISCUSSION
In this study, I have conducted a comprehensive evaluation of nine alignment-free dissimilarity measures over various k-mer lengths for classifying the lifestyles of phages. For these dissimilarity measures requiring a background model, different orders of Markov chains were used for estimating background k-mer frequencies.
These alignment-free dissimilarity measures, with a wide range of choices of k-mer length and Markov orders, were compared using the simulated metagenomic fragments of different length. The dissimilarity measure, d S 2 , could obtain the best performance for classifying the lifestyles of the phages contigs among these measures.
There are several limitations of the current study. First, for the dissimilarity measure, d * 2 , could obtain well performance as d S 2 using the evaluation of ROC values, however, the performance for d * 2 to classify novel phage contigs according to the distance to the temperate and lytic k-mer frequencies was very bad. The distribution of ratios between the distance to temperate phages and the distance to lytic phages calculated by d * 2 was skewed to larger than one which reflect the systematic deviation in predicting the lifestyles for this dissimilarity measure. The unequal number of temperate and lytic phage genomes used in training set maybe cause this deviation for d * 2 . Second, the performance of these alignment-free dissimilarity measures depends on the phage genomes chosen in the training sets. In this study, I used the date as a criterion to split the phage genomes into training and testing sets. However, only less than two thousand phage genomes could be used in the study of these alignment-free measures which limits the accuracy of these methods. With the high-throughput sequencing technology widely used in viromics research, the assembled genomes for phages are becoming increasingly more available which would facilitate the development and application of these alignment-free dissimilarity measures. Third, the k-mer size k and orders of Markov models can markedly impact the performance of these alignment-free measures. In general, the k-mer size of 9 or 10 and Markov order of 2 or 3 for background sequences can give good performance. Since the viral genomes have great variability and highly mosaic organization, so longer length of k-mer and higher order of Markov chain can model the genomic sequences well. More studies are needed to see if this conclusion is robust for more phage genomes sequenced in the future.
In this study, I focused on classifying the lifestyles of phage contigs using alignment-free dissimilarity measures. Compared to alignment-based methods, the alignment-free methods can have better performance in classifying short contigs as a few kilobases without complete gene structure, however, alignmentfree methods cannot give insights about the genome information responsible for the contigs. From this perspective, I can say that alignment-free and alignment-based methods for classifying phage contigs complement each other and should be used interactively for phage contigs classification.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.