Phage family classification under Caudoviricetes: A review of current tools using the latest ICTV classification framework

Bacteriophages, which are viruses infecting bacteria, are the most ubiquitous and diverse entities in the biosphere. There is accumulating evidence revealing their important roles in shaping the structure of various microbiomes. Thanks to (viral) metagenomic sequencing, a large number of new bacteriophages have been discovered. However, lacking a standard and automatic virus classification pipeline, the taxonomic characterization of new viruses seriously lag behind the sequencing efforts. In particular, according to the latest version of ICTV, several large phage families in the previous classification system are removed. Therefore, a comprehensive review and comparison of taxonomic classification tools under the new standard are needed to establish the state-of-the-art. In this work, we retrained and tested four recently published tools on newly labeled databases. We demonstrated their utilities and tested them on multiple datasets, including the RefSeq, short contigs, simulated metagenomic datasets, and low-similarity datasets. This study provides a comprehensive review of phage family classification in different scenarios and a practical guidance for choosing appropriate taxonomic classification pipelines. To our best knowledge, this is the first review conducted under the new ICTV classification framework. The results show that the new family classification framework overall leads to better conserved groups and thus makes family-level classification more feasible.

It is now demonstrated that phages can be found in a wide variety of environments, including aquatic ecosystems (Paul et al., 2002;Guttman et al., 2005), human gut (Manrique et al., 2017;Sutton and Hill, 2019), and soil (Chow and Suttle, 2015;Williamson et al., 2017). The first viral metagenome of uncultured marine viral communities was published in 2002 (Breitbart et al., 2002). Phages can shape the composition and function of underlying ecosystems through two different lifestyles: temperate and virulent. Temperate phages will integrate their genomes into bacterial chromosomes and replicate with their host. They will maintain this state, which is also called prophages, until being induced by the environment's condition, such as appropriate temperature and pH value. Then, temperate phages will enter the lytic cycle to kill the host (Campbell, 2003;Howard-Varona et al., 2017). In contrast, virulent phages do not integrate their genomes into the hosts. They stay in the lytic cycle and kill the hosts after replicating themselves (Hobbs and Abedon, 2016).
The unique properties and life styles make phages key players in multiple applications. For example, phage therapy is a promising strategy for treating bacterial infections, particularly those with antibiotic-resistant bacteria. It has been found that intravenous phage preparations could treat Staphylococcus aureus that induced pneumonia in mice (Saussereau and Debarbieux, 2012;Oduor et al., 2016). In addition, phages can be used to treat gastrointestinal infections. It has been demonstrated that phages are effective in reducing intestinal pathogens and have less impact on the composition of the intestinal microbiota compared to antibiotics (Jaiswal et al., 2013;Galtier et al., 2016;Nale et al., 2016;Gutiérrez and Domingo-Calap, 2020). Moreover, phages are important in food safety. The use of specific phage treatments in the food industry can prevent product spoilage and limit the spread of bacteria, providing a safe environment for animal and plant food production (Garcia et al., 2008;Coffey et al., 2010;Sillankorva et al., 2012;Gutiérrez et al., 2017). However, despite the abundance and importance of phages in various ecosystems, our understanding of phages is still very limited. According to the database supported by the National Center for Biotechnology Information (NCBI), the number of identified phages in class Caudoviricetes changed from 1,359 in 2015 to 4,483 in 2022 in the RefSeq database, which is tripled in size. Besides the reference genomes, there are roughly 63,588 assembled phages belonging to Class Caudoviricetes in the Genbank database in 2022, an almost five fold increase compared to 2015 (16,232). However, the characterization of phages cannot keep pace with the fast increase of the sequencing data.
Assigning phages into different taxonomic groups is a fundamental step following phage discovery. The official taxonomy was established by the International Committee on Taxonomy of Viruses (ICTV) (Adams et al., 2017), which organizes viruses in several taxonomic levels, including class, order, family, subfamily, genus, and so on. Within the ICTV, the Bacterial and Archaeal Viruses Subcommittee (BAVS) is responsible for the phages' taxa. BAVS classifies phages based on a variety of phage properties, including the molecular composition of the genome (ss/ds, DNA, or RNA), the morphology, the structure of the capsid, and the host range (Dion et al., 2020). Recently, with the increasing availability of viral genomes, using genomes for taxonomic classification has become more widely accepted (Lefkowitz et al., 2018). Due to the extensive sequencing efforts for virus discovery, ICTV cannot catch up with the sheer number of newly identified phages, and thus many viruses are still not classified. One challenge behind this delay is the lack of standard, accurate, and comprehensive taxonomic classification tools for phages. Indeed, phage classification is not a trivial problem. The taxonomic standard in ICTV is constantly changing as new phages are discovered. Recently, ICTV updated the phage classification system in August 2022, in which several major families in the previous ICTV system are removed, such as Siphoviridae, Podoviridae, and Myoviridae. These changes can significantly affect the performance of family classification. To our best knowledge, no quantitative evaluations of the performance change have been conducted. Table 1 shows the average similarity (calculated by Dashing; Baker and Langmead, 2019) of the largest four families in the old and new ICTV taxonomy classification systems. The updated families are more conserved as shown by the increased average similarity, making family-level classification more feasible.
Available taxonomic classification tools often have different designs and were tested on different datasets by their authors. Without a comprehensive comparison on the same training/reference data set and test set, it is difficult for users to choose the most appropriate solution for their needs. This paper presents a comprehensive benchmark of the main players in phage taxonomic classification under the latest ICTV standard. The remaining of this review is organized as follows. First, .
/fmicb. . we will describe the main methods/models for existing phage taxonomic classification approaches and discuss whether they can be retrained/used under the new ICTV taxonomy standard. Then, we evaluate the four representative approaches that can be retrained by newly labeled sequences in different usage scenarios. In particular, we tested these tools on complete virus genomes, short contigs, simulated metagenomic datasets, and low-similarity datasets. In addition, we conducted a leave-onefamily-out experiment to test whether these tools can recognize out-of-distribution sequences. By comparing their performance and analyzing the underlying reasons, we draw conclusions and provide guidance for users about choosing the most appropriate tools for different scenarios.

. Approaches for phage taxonomic classification
Most phage taxonomic classification approaches can output classification results in different ranks, such as order, family, and genus. In this review, we focus on comparing different tools' performance at the family level because of the following reasons. First, the taxonomy by ICTV is under constant changes, which affects the total genus number significantly. For example, there are 735 genera in the ICTV database released in 2016. However, the number of genera increased to 2,224 in 2020. The overhaul of the genus-level taxonomy can make the definition of "ground truth" ambiguous. In addition, hundreds of rare genera only contain one phage, making the construction of reference and test set difficult. Second, classification at higher taxonomic ranks is usually easier than at lower ranks due to the smaller inter-class similarities and more abundant sequences in each class. Thus classification at order or above is not as challenging as family classification. Caudoviricetes, a class of phage known as the tailed phages whose hosts are phage and archaea, contains the majority of the total phage sequences and can be classified by almost all of the tools mentioned above, we thus focus on the classification of the families under Caudoviricetes in this work.
The phage taxonomic classification methods are summarized in Table 2 following the chronological order, which includes a brief description, publication year, required input data type, and the lowest predicted level of each tool. A majority of these tools conduct phage taxonomic classification based on sequence comparison, utilizing nucleotide-level or protein-level similarity between a query virus and the reference database. The comparison-based methods differ in their constructed reference database, the alignment method, and how they utilize these alignments. Both pairwise sequence alignment and hidden Markov model (HMM)-based profile alignments are commonly used. Multiple tools construct virus protein families and use them as marker genes. Using markers usually incurs less memory usage than using all phage genomes. But newly sequenced phages with novel genes may not be aligned to any marker gene families and thus cannot be assigned to a known class. Learning-based models have also been applied to phage classification. Learning models can automatically infer the sequence patterns in phage genomes of different families and use the learned features for automatic classification. A more detailed description of these tools is provided below.
Phage Proteomic Tree (Rohwer and Edwards, 2002;Nishimura et al., 2017) is a relatively early program providing phage genome classification down to the family level. It extracts protein sequences from virus genomes and clusters these sequences using BLASTP (Altschul et al., 1997). Then the clusters in Phage Proteomic Tree are refined and scored. Finally, the alignment scores are converted to distances, which were used to generate the final tree using the neighbor-joining algorithm.
Taxon-specific signature genes can be identified in most virus taxa. POGs (Phage Orthologous Groups) (Kristensen et al., 2013) is a collection of clusters of orthologous genes from phages, presented as profiles (multiple sequence alignment).

The viral families of POGs are filtered as "Viruses[Organism] NOT cellular organisms [ORGN] NOT srcdb_refseq[PROP] AND vhost bacteria[filter] AND 'complete genome' [All Fields
]" in NCBI. Signatures are extracted for each taxon, and we can use BLASTP to search for matches among the viral protein sequences. POGs are designed to be well-suited for defining taxon-specific signature genes, and the profiles built from POGs are more sensitive and specific to search for signature genes in a given dataset.
GRAViTy (Aiewsakun and Simmonds, 2018) also extracts protein sequences from virus genomes and cluster these sequences using BLASTP (Altschul et al., 1997). GRAViTy generates protein profile hidden Markov models (PPHMMs) .

Name Year Description Input data Lowest level
Phage Proteomic Tree (Rohwer and Edwards, 2002) 2002 It uses the BLASTP distance and protein distance scores (similarity between two proteins) to generate phage proteomic trees, which can describe the relationships between different phages and can serve as a genome-based classification system for phages. Genome sequences Family and genomic organization models (GOMs) based on the sequences from BLASTP-based clustering. Then it computes Composite Generalized Jaccard (CGJ) similarity scores (a geometric mean of the two generalized Jaccard scores computed for a pair of PPHMM signatures and a pair of GOM signatures) between each sequence pair to construct the heat map and dendrogram and estimate sequences' relatedness. GRAViTy requires users to choose reference database freely but need sequences in GenBank format as input. CCP77 (Low et al., 2019) applies a concatenated protein phylogeny for the classification of tailed dsDNA viruses belonging to the specific order Caudovirales. Classiphage (Chibani et al., 2019a,b) uses phage-specific Hidden Markov Models (HMMs) (Eddy, 2011) profiles generated from clusters of related proteins for classification. The HMM profiles are built using the produced multisequence alignment files by the "hmmbuild" command. Classiphage 2.0 additionally trains an Artificial Neutral Network (ANN) using phage family-proteome to phagederived HMMs scoring matrix, which can classify more phage families and include more features than its previous version.
. /fmicb. . vConTACT (Bolduc et al., 2017;Bin Jang et al., 2019) is a high-throughput network-based approach utilizing wholegenome gene-sharing profiles. It clusters the input viral genomes together with characterized genomes. The genomes in the same cluster indicate the same family or genus, and the predicted family can be inferred if there are characterized genomes in the same cluster.
CAT (von Meijenfeldt et al., 2019) provides taxonomic classification using homology searches. It uses DIAMOND BLASTP to identify homologous sequences and then assigns query sequences into taxa with a voting approach. The authors of CAT show that using the best hit strategy can lead to low specificity and thus design a more robust strategy based on multiple hits. Users can select the reference database and tune the setting, which is more flexible than some other tools. Moreover, it has a very low memory usage.
MMseqs2 (Mirdita et al., 2021) is a fast contig taxonomic assignment tool. Similar to CAT, it conducts protein homology search against reference sequences and uses majority vote to assign the most specific taxon for a contig. With some optimizations and adoption of 2bLCA (Hingamp et al., 2013), MMseqs2 circumvents the need of adjusting a parameter in CAT and achieves faster speed on the tested bacterial and eukaryotic datasets. It allows users to supply a customized reference database.
VPF-Class (Pons et al., 2021) provides both taxonomic classification and host prediction for input viral genomes. It compares predicted proteins against the set of constructed Viral Protein Families (VPFs) (from the IMG/VR system). Then it derives taxonomic classifications and confidence scores from the list of VPFs detected on each query genome. However, VPF-Class does not require users to download and select the reference datasets.
PhaGCN (Shang et al., 2021) is a semi-supervised learning model for phage taxonomic classification developed by our team. This model constructs a knowledge graph by combining the DNA sequence features learned by Convolutional Neural Networks (CNN) and protein sequence similarity gained from the gene-sharing network. The learning model can incorporate the automatically learned features for each contig. However, unlike sequence comparison-based approaches, PhaGCN only accepts phage-like sequences as input. Thus, a pre-processing step is needed for detecting those contigs from metagenomic data. A number of tools, such as VirFinder (Ren et al., 2020), Seeker (Auslander et al., 2020), and PhaMer (Shang et al., 2022) can be applied in the pre-processing step.

. Experiments and results
Because of the changes in the ICTV classification system, the models/reference databases need to be updated using the latest labeled sequences. However, not all the tools in Table 2 can be updated easily. Among them, only CAT, GRAViTy, PhaGCN, MMseqs2, and vConTACT 2.0 allow users to change their reference databases or retrain the models with reasonable efforts. The others do not specify the feasibility of changing models or reference databases in the descriptions. The source code of CCP77 is only available on request but not to the public. The code of GRAViTy released at GitHub is the alpha version and the author mentioned that they are currently working on a new and improved version that is more userfriendly and written in python3. Nevertheless, we downloaded and installed the alpha version of GRAViTy. The alpha version is computationally expensive and requires 30 h to build a reference database with about 1200 genomes and another 25 h to process just 300 queries. Therefore, we focus on evaluating the performance of the four tools: PhaGCN, vConTACT 2.0, CAT, and MMseqs2. These tools were recently published and demonstrated good performance in their own or others' tests. In addition, the corresponding codes and tools are still under maintenance. None of them requires an internet connection or a web server. To mimic the scenario of applying these tools to datasets without known taxonomic composition, we apply all these tools with their default parameters, which are optimized by the authors. The commands for running all these tools are available in the Supplementary material. All the tools were run on IntelVR XeonVR Gold 6258 R CPU with 8 cores.

. . Dataset
We rigorously evaluated these phages taxonomic classification tools on multiple datasets. The detailed information is listed below.
• The RefSeq dataset RefSeq is a widely used benchmark dataset in phage classification tasks. By October 2022, there are 1,826 complete sequences with family-label under Class Caudoviricetes in the RefSeq database. In this paper, we only focus on the phages infecting bacteria. After filtering out the families that infect archaea or contain sequences less than 6, there are 19 families (including 1460 complete sequences) we can use in our experiments. Table 3 shows the number of sequences within the 19 families under class Caudoviricetes, among which Autographiviridae contains the largest number of sequences. For the tools that require protein sequences, we used Prodigal (Hyatt et al., 2010) to predict and translate the nucleotide sequence into the proteins.
We sorted the sequence by their release time at RefSeq. Then, we used the first 80% of the labeled complete sequences from each family as the training set/reference database to retrain/update the four tools, and the rest 20% as test set. Because .
/fmicb. . we split the data in chronological order, the data in the test set are more recent (almost all were released in 2020 or after).
For each length, we cut ten segments from each phage genome by selecting a random start position. Finally, we had 2,930 phage contigs for each length and 29,300 for all different lengths. Then, we used these segments to evaluate the performance of the four tools on short contigs.
• Simulated metagenomic dataset We used a simulated metagenomic dataset generated by six common bacteria living in human gut (Shang et al., 2022). We first utilized metaSPAdes (Nurk et al., 2017) to assemble the reads into contigs. Then PhaMer (Shang et al., 2022) was applied to identify bacteriophages from metagenomic data, and the labels of the contigs were determined using BLAST (Camacho et al., 2009). Eventually, 37 contigs were used in the experiments. More details about this dataset will be provided in the section of Experiment 4.
• Low-similarity dataset To test the tools' performance on classifying highly diverged phages, we constructed a hard case where the test sequences share low similarity with the reference database/training data. Specifically, we calculated the Dashing pairwise similarity of the sequences in each family and then used the approach in Petti and Eddy (2022) to partition the data into two parts with specified maximum similarity. With this method, we got 264 and 45 genomes for training and test, where each test genome has at most 0.015 Dashing similarity with any reference genome. Then we randomly cut 15 contigs with a length of 3,000 and 5,000 bp, respectively, from each testing genome. Finally, there are 675 contigs for each length in the test set.

. . Evaluating criteria for di erent tools . . . Metrics
An ideal phage classification tool should assign correct labels for as many inputs as possible. Nevertheless, there is usually a tradeoff between the percentage of prediction and the accuracy of the prediction. Some tools may sacrifice the percentage of prediction in order to achieve high specificity and accuracy, while others may predict more with lower accuracy. Thus the first metric is prediction rate, which is the ratio of outputs with prediction results (N pred in Equation 1) to the total input (N all in Equation 1). Because some tools only provide a family name as output, commonly used metrics such as AUROC cannot be computed. In this work, we calculated accuracy, recall, and precision for each tool (Equations 2-4). N correct is the number of sequences with correct predictions in output. N total is the total number of sequences used to evaluate, which can be N all or N pred when we report accuracy for all input phage sequences (N all ) or only for sequences with predictions in output (N pred ), respectively. Providing accuracy for all input sequences has the advantage of using the same denominator (i.e., N all ) for all tools. But it penalizes the tools of low prediction rate twice. On the other hand, reporting accuracy for only sequences with predictions removes the impact of prediction rate but may favor tools with low prediction rate (i.e., small N pred ). Thus, reporting both can provide a more comprehensive evaluation for users. For example, if there are 293 (N all ) sequences input, among which 290 sequences have classification prediction results (N pred ), and 285 of them have correct results (N correct ), the accuracy on all input will be 285/293 = 0.973, and the accuracy on predicted sequences will be 285/290 = 0.983. We only calculate the recall and precision of each family (Precision i and Recall i ) to check the performance on different families. TP i , FP i , and FN i are the true positive, false positive, and false negative for family i, respectively. .

. . Description of the output
Because the output format of each tool is different, we will describe how we process the output and calculate the metrics in detail.
vConTACT 2.0 can output the result of each sequence and assign it a "VC State", including "Singleton", 'Outlier", or "Clustered". In addition, the sequences with a "Clustered" state will be assigned to a VC cluster/subcluster. When the query sequence is within the same VC cluster as a reference genome, the taxonomic labels can be assigned based on the known labels. However, some sequences are clustered but have no reference genome in the same VC cluster, so they can not be assigned with a known label. Therefore, we treat the sequence with VC state of "Singleton", "Outlier", and "Clustered" but no reference genome in the same clusters, as "no prediction". In other words, N pred of vConTACT 2.0 refers to the number of the sequences that are clustered with reference genomes.
PhaGCN will not output the classification results for the sequences they can not classify, so N pred of PhaGCN is the number of sequences that can be predicted.
MMseqs2 and CAT will not output any prediction result for the sequences they cannot classify. The classification result of MMseqs2 and CAT can be a label at different ranks. If the prediction at the lowest rank is above family, we also treat this sequence as "no prediction" for the family level. The number of the rest sequences is N pred of MMseqs2/CAT.

. . Experiment : Leave-one-family-out experiments
The constant change of ICTV underscores a need for classification tools to recognize the sequences that are not part of the current classification system. For example, the three largest families, Siphoviridae, Podoviridae, and Myoviridae, were largely removed from the current ICTV system. Some of the sequences that belonged to these three families are not part of any existing family. Thus, the classification tools need to handle these out-ofdistribution sequences by providing a signal for users.
To examine whether the tested tools can single out those out-of-distribution sequences, we removed all the phages in one family from the training data and retrained the models. Then the retrained models are applied to the removed family members. Ideally, the test sequences in this removed family should not be classified into any existing family labels.
At first, we conducted the experiments on a small and a relatively large family: Guelinviridae and Rountreeviridae. The classification results are plotted in Figures 1, 2, which show that PhaGCN assigned all of the query genomes to one of the other families in the training set, while CAT and MMseqs2 can correctly recognize a few sequences as "no family label". However, vConTACT 2.0 can assign all sequences to "Outlier/Singleton" or a "VC cluster" without reference genomes.
We then extended the experiment to each family. Because the current version of PhaGCN is not designed to handle outof-distribution sequences, we only show the results for CAT, MMseqs2, and vConTACT 2.0 in Table 4. The output of these three tools for the test sequences are divided into two parts: those that did not output a family label ("no prediction", defined in the Section 3.2.2), and those that can output a family label from the training data (i.e., a misclassification in this experiment). Table 4 shows the misclassification rate of each tool. CAT and MMseqs2 assign more test sequences to other families in the reference database. In contrast, vConTACT 2.0 can assign almost all sequences of each family to "Outlier/Singleton" labels or "VC cluster" without reference genomes. The misclassification rates of CAT and MMseqs2 vary widely across different families, with the ranges 0-1 and 0-0.92, respectively. A closer look at those results reveals that the misclassified phages tend to distribute in a small set of families. For example, almost all sequences belonging to Guelinviridae are classified into Salasmaviridae by CAT, which is likely due to the higher interfamily similarity between them. Specifically, 29.6% proteins of Guelinviridae can align with Salasmaviridae using BLASTP. Similarly, sequences from Zobellviridae tend to be classified into family Autographiviridae because they share about 16.9% proteins. Therefore, the inter-family similarity is an essential factor leading to misclassification. Overall, the misclassification results of MMseqs2 are more divergent than CAT. For example, CAT will classify Autographiviridae genomes into 4 other families, while MMseqs2 will assign them into 8 families (including the 4 families in CAT).
Then we extended the experiment to the genomes that are unclassified at the family level in the RefSeq database under class Caudoviricetes. Because the three largest families Myoviridae, Siphoviridae and Podoviridae were removed, we used the genome sequences that initially belonged to these three families but now no longer have a family label as the test data. There are 2445 of them, and the classification result is shown in Figure 3. MMseqs2 and CAT misclassified about 65% of the input sequences. vConTACT 2.0 can identify .

FIGURE
The classification result of Guelinviridae sequences in tools that are retrained by removing all Guelinviridae sequences. "independent clustered": The sequences are in a VC cluster without any reference genome.

FIGURE
The classification result of Rountreeviridae sequences in tools that are retrained by removing all Rountreeviridae sequences. "independent clustered": The sequences are in a VC cluster without any reference genome.  98% unclassified sequences by assigning them in independent clusters or outputting a "Singleton/Outlier" label and only misclassified 2% sequences. In conclusion, vConTACT 2.0 performs better in identifying novel phages than the other three tools.
. . Experiment : Classification performance As we described in Section "Dataset", we used 20% (293) of the complete sequences from the RefSeq database as the test set, and the other 80% as the reference/training set. To mimic metagenomic assembled contigs, we generated six sets of segments of different lengths for comparison, including 500, 1,000, 3,000, 5,000, 10,000, and 15,000 bp. We randomly selected the start positions for each length and cut ten segments from each complete sequence. Finally, we had 2,930 phage fragments for each length and 29,593 for all different lengths as the test data (293 complete sequences + 2930 * 10 short fragments).
A good taxonomic classification tool should have a high prediction rate and high accuracy. First, we recorded the prediction rate of each tool on different lengths. Because PhaGCN only accepts contigs longer than 2,000 bp, we do not show its results on 500 and 1,000 bp in Figure 4. The prediction rate ( Figure 4A) of all tools becomes higher with the increase in sequence length. This is expected because longer sequences usually provide more information for classification. Almost all pipelines can maintain a high prediction rate (>80%) on short sequences except vConTACT 2.0. PhaGCN has the highest prediction rate if the inputs are longer than 5,000 bp, while CAT is slightly lower. vConTACT 2.0 is mainly designed for complete or long sequences, and its prediction rate drops sharply when the inputs are shorter than 15,000 bp. All four can handle more than 95% of complete sequences, among which PhaGCN can predict all of them (100%), and the prediction rates of MMseqs2, CAT, and vConTACT 2.0 are 99.3, 97.9, and 95.1%, respectively. Figure 4B shows the accuracy of the four tools on phage sequences with predictions (N pred in Equation 1). Similar to the prediction rates above, the accuracy of these approaches becomes better as the sequence lengths increase. The classification ability of CAT, PhaGCN, and MMseqs2 are not significantly affected by the change of contig lengths. On incomplete contigs, the accuracy of vConTACT 2.0 has an obvious upward trend when length increases. CAT gains the best prediction accuracy for contigs longer than 5,000 bp. Combined with the slightly lower prediction rate of CAT mentioned above, we can conclude that there is a tradeoff between the prediction rate and the accuracy of CAT. The accuracy of PhaGCN is slightly lower than the other two on contigs, and all three tools reach a high accuracy (100%) for all complete sequences with predictions. Figure 4C shows the accuracy of the four tools on all input phage contigs (N all in Equation 1), which combines the results in Figures 4A,B in order to display the overall performance of each tool. It reveals that PhaGCN keeps the best performance on contigs longer than 5,000 bp and reaches 100% accuracy on complete genomes because it gains 100% accuracy and prediction rate in Figures 4A,B, respectively. It is worth noting that the other three tools all have a less than 100% recall on Autographiviridae, most likely due to the lower pairwise similarity in Autographiviridae (Table 1). Due to the length limitation of PhaGCN, it is not suitable for classifying contigs shorter than 2,000 bp. When classifying contigs longer than 2,000 bp, PhaGCN and MMseqs2 are recommended for obtaining high prediction rates. Otherwise, CAT is a better choice if precision is the primary consideration.

. . Experiment : Impact of training set size on classification performance
Being a learning-based classification tool, PhaGCN can be affected by training data size. To test whether PhaGCN and other alignment-based tools suffer from reduced training data/reference database, we used 80% (the same as Experiment 2), 60%, and 50% of the RefSeq databases as the reference database for these tools, respectively. Then we tested them on the same test set as in Experiment 2. As shown in Figure 5A, the prediction rates of PhaGCN with different reference databases have no obvious differences. There is a slight change in the prediction rate of CAT, MMseqs2, and vConTACT 2.0, but the differences do not exceed 0.2%. In addition, the accuracy of these tools shown in Figure 5B are almost identical and are less affected than the prediction rate.

. . Experiment : Classification performance on the simulated metagenomic dataset
In this experiment, we used the simulated metagenomic dataset provided in PhaMer (Shang et al., 2022). The dataset is a small-scale metagenomic dataset simulated by CAMISIM (Fritz et al., 2019) using the commonly seen bacteria living in the human gut and the phages that infect these bacteria. The reads were assembled into contigs using metaSPAdes (Nurk et al., 2017).
We kept contigs of size above 3,000 bp. To assign labels to the contigs, we used BLAST (Camacho et al., 2009) to map contigs to reference genomes and calculated the coverage. Only the contigs with at least 90% of the sequence aligning to a reference genome were kept. Others are likely chimeric contigs due to assembly .

FIGURE
The classification result of , unclassified sequences. "independent clustered": The sequences are in a VC cluster without any reference genome.  errors and thus are not used for testing. Finally, the number of contigs we could use in the experiment is 37. The name of the families and the number of genomes within each family are listed in Table 5. Compared to Table 3, this test set contains a different abundance distribution for the component families, which can thus change the performance of these tools. As shown in Figure 6A, PhaGCN, MMseqs2, and CAT can classify all the simulated sequences correctly, which is slightly higher than that on the RefSeq data in Experiment 2. A plausible reason is that most of the sequences in this simulated dataset belong to Straboviridae and Ackermannviridae, which make up a large part of the reference database according to Table 3 (14%  and 4%). In addition, they have greater intra-family similarities. The performance of vConTACT 2.0 is lower than the other three tools because the assembled contigs are short. This experiment shows that PhaGCN, MMseqs2, and CAT can process assembled contigs with different lengths.

. . Experiment : Classification performance on the low-similarity dataset
Although the updated families under the new ICTV standard exhibit higher pairwise sequence similarity, there are still some diverged members. The diverged members may appear more often when sequencing new or underrepresented ecosystems. Thus, we test these tools' performance on predicting highly diverged sequences using the "low similarity dataset". There are 45 genomes in the test set with the maximum Dashing similarity of 0.015 with any reference genome. Then we randomly cut 15 contigs with a length of 3,000 bp and 5,000 bp from each test genome, leading to 1,350 contigs in total. Figure 6B shows the accuracy of all inputs. Because vConTACT 2.0 can not handle short contigs, we exclude it from this experiment. Figure 6B reveals that the accuracy of MMseqs2 decreases by more than 10% compared to Figure 4C from Experiment 2. And the accuracy drop in CAT (6%, 5.2%) are greater than PhaGCN (3.3%, 2%) on the contigs of the same lengths. Therefore, the increased divergence between test and training data has a greater impact on alignment-based tools than PhaGCN in this experiment.

. . Comparison of running time
Running time is also an essential factor to consider for practical usage. Table 6 shows the running time of the tools for processing 500 complete sequences in RefSeq when using a different number of CPUs. Users can save more time by increasing the number of CPUs. The table also shows that CAT and MMseqs2 take the least time to process 500 complete phages.

. Discussion and conclusion
This work presents a review of taxonomic classification tools on phage family classification under Caudoviricetes. To our  best knowledge, this is the first review under the new ICTV standard released in August 2022. Compared to the previous version of ICTV, the updated families in the latest system are more conserved, which warrants a high prediction rate and accuracy of alignment-based tools. For example, the prediction rate of CAT and vConTACT 2.0 were 62 and 92% on the data in the previous ICTV system, respectively. And their accuracy on complete genomes were only 61.7 and 86%. However, their prediction rate and accuracy are significantly better under the new classification system. The constant change of the taxonomic classification system by ICTV emphasizes the need for a tool to provide database updating or model retraining. Tools without these utilities can return obsolete or even wrong labels, making their practical usage limited. Many of these tools in Table 2 either lack this option or need excessive efforts to retrain.
Despite great efforts, the current classification system by ICTV is not complete. New families can appear with new viruses sequenced and discovered, particularly those from underrepresented ecosystems. Thus, it is desired that a classification tool can handle out-of-distribution inputs, which are not part of any existing families. Based on our leave-onefamily-out experiment, vConTACT 2.0 is more sensitive to those out-of-distribution sequences than others. However, a price paid by vConTACT 2.0 is its low prediction rate on short contigs, which is likely caused by the low gene sharing significance score between the query and the reference. Other tools perform better on short contigs, which is important for virus composition analysis in metagenomic data.
PhaGCN can only classify sequences on the family level. The lowest levels that the other three tools can classify are genus level or below. The experimental results show that all of them can perform well on complete genomes from the RefSeq database after retraining. PhaGCN has the highest prediction rate when classifying short contigs (>3,000 bp), and CAT gains a higher accuracy with a slightly lower prediction rate. Therefore, when classifying incomplete contigs larger than 3,000 bp, PhaGCN, CAT, and MMseqs2 can all be considered, but PhaGCN has a better overall performance. In addition, CAT and MMseqs2 can be used to classify contigs shorter than 2,000 bp because PhaGCN can not handle that length. All these four tools are robust against the size reduction of the reference database/training data.
The performance of PhaGCN is less affected in classifying highly diverged sequences that share low similarity with the reference genomes.
The focus of this review is family-level classification. While the current families annotated by ICTV usually contain multiple phages per family, the genus size distribution exhibits a much more skewed distribution with many genera only containing one phage genome. It is not trivial to create appropriate reference database/training data vs. test data with hundreds of rare genera. It is our future work to examine the impact of the long tail distribution on current classification tools.