The Promises and Pitfalls of Machine Learning for Detecting Viruses in Aquatic Metagenomes

Ponsero, Alise J.; Hurwitz, Bonnie L.

doi:10.3389/fmicb.2019.00806

PERSPECTIVE article

Front. Microbiol., 16 April 2019

Sec. Aquatic Microbiology

Volume 10 - 2019 | https://doi.org/10.3389/fmicb.2019.00806

1. Department of Biosystems Engineering, The University of Arizona, Tucson, AZ, United States
2. BIO5 Institute, The University of Arizona, Tucson, AZ, United States

Abstract

Tools that identify viral sequences in host-associated and environmental metagenomes allow for a better understanding of the genetics and ecology of viruses and their hosts. Recently, new approaches were published that use machine learning methods to distinguish viral from bacterial signal, based on k-mer sequence signatures, in order to identify viral contigs in metagenomes. The promise of these content-based approaches is the ability to discover new viruses with few or no known relatives. In this perspective paper, we examine the use of the content-based machine learning tool VirFinder for the identification of viral sequences in aquatic metagenomes and explore the possibility of using ecosystem-focused models targeted to marine metagenomes. We discuss the impact of training set composition on tool performance and the current limitations in retrieving low-abundance viral sequences from metagenomes. We identify potential biases that could arise from machine learning approaches for viral hunting in real-world datasets and suggest possible avenues to overcome them.

Introduction

Viruses infect host cells from all domains of life and are highly adapted to their host genetics and their environmental niches (Hurwitz et al., 2014; Brum et al., 2015). Recently, metagenomics has laid the groundwork for understanding viruses and their uncultured hosts. Several tools provide rapid and accurate taxonomic assignment of metagenomic sequences directly from a microbiome by comparing them to known bacterial and viral genomes using k-mer based tools, such as Centrifuge (Kim et al., 2016), CLARK (Classifier based on Reduced K-mers) (Ounit et al., 2015), USEARCH (Edgar, 2010), KRAKEN (Wood and Salzberg, 2014), and NBC (Naive Bayes Classifier) (Rosen et al., 2008) [reviewed in Hurwitz et al. (2018)]. Importantly, these tools rely on finding sequence similarity to known viral sequences, which represent only a small portion of viral diversity (Roux et al., 2015b). In practice, viromes contain a high number of reads with no match to known viral genomes: in prior studies, less than 10% of reads from ocean viromes could be assigned (Hurwitz and Sullivan, 2013).

To explore viral biodiversity and ecology, a number of bioinformatic tools perform a high-level taxonomic assignment (viral or cellular origin) of metagenomic sequences. They aim to collect all viral sequences in a metagenome and to aid the discovery of new viral groups. Several approaches have been used: some tools align short reads to a viral marker gene database [MetaPhlAn (Segata et al., 2012), MetaPhlAn2 (Truong et al., 2015)] or to a reference database of whole genomes [MG-RAST (Meyer et al., 2008), ViromeScan (Rampelli et al., 2016), VIP (Li et al., 2016), HoloVir (Laffy et al., 2016), FastViromeExplorer (Tithi et al., 2018)]. Other tools align assembled contigs to a viral genome database [Metavir (Roux et al., 2011, 2014), Virome (Wommack et al., 2012), MetaPhinder (Jurtz et al., 2016)].

These reference-based classification tools are limited in their ability to identify novel viruses and are biased toward previously isolated viruses. However, large-scale efforts to retrieve viral sequences from metagenomes and viromes, such as the IMG/VR database, allow for broader research into non-isolated viruses (Paez-Espino et al., 2019). VirSorter, released in 2015, allows the user to identify potential viral sequences in metagenomes using Hidden Markov Models (HMM). The tool relies on both known viral genomes and viral sequences from viromes for broader detection of unknown viruses (Roux et al., 2015a).

In contrast to these reference-based approaches, an emerging approach is to use composition-based pattern detection leveraging machine learning algorithms. The idea is to train a machine learning model to identify a set of features that signal a viral origin, so that the identification generalizes to all viral sequences. VirFinder (Ren et al., 2017) uses a machine learning approach to classify sequences as viral (phage) or prokaryotic based on their k-mer signatures. The model presented in the paper is a logistic classifier trained on known phage and bacterial genomes from the RefSeq database (we refer to this model as the “phages-prok” model), and it was shown to provide better accuracy for viral sequence detection than VirSorter, especially on short sequences (<5000 bp). Importantly, the tool was shown to have better recall for the identification of previously unknown phage sequences. The authors also provide a model trained on all DNA viral genomes, including some eukaryotic viruses, and on prokaryotic genomes from RefSeq (we refer to this second model as the “DNAvirus-prok” model). Other machine learning-based tools for viral hunting in metagenomes, such as MARVEL, have been developed (Amgarten et al., 2018; Bzhalava et al., 2018). However, these other tools base their predictions on various genomic features such as relative synonymous codon usage (Bzhalava et al., 2018), or gene density, strand shifts, and the number of significant hits against the pVOGs database (Amgarten et al., 2018). While these approaches are valuable, they lose the information contained in non-coding sequences, and their use is limited to long contigs. Moreover, these tools may introduce additional bias through the choice of gene caller used to extract the genomic features.
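
To illustrate what a k-mer signature looks like in practice, the minimal Python sketch below computes normalized k-mer frequencies for a contig. It is not VirFinder's code: the function names, the default k = 3, and the choice to merge each k-mer with its reverse complement are assumptions made for this example. A frequency vector of this kind is the sort of input that a logistic classifier could then be trained on.

```python
from itertools import product


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def kmer_frequencies(seq, k=3):
    """Normalized k-mer frequencies of one contig (illustrative helper).

    Assumption for this sketch: a k-mer and its reverse complement are
    counted together, so the result does not depend on the strand.
    Windows containing ambiguous bases (for example N) are skipped.
    """
    seq = seq.upper()
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    canonical = sorted({min(km, reverse_complement(km)) for km in all_kmers})
    counts = dict.fromkeys(canonical, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[min(kmer, reverse_complement(kmer))] += 1
            total += 1
    return {km: (c / total if total else 0.0) for km, c in counts.items()}


# Toy contig; real contigs would be hundreds to thousands of bases long.
signature = kmer_frequencies("ACGTACGTNACGT", k=3)
```

Because the frequencies are normalized, contigs of different lengths yield comparable vectors, although short contigs give noisier estimates.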

In this work, we review the potential biases and pitfalls of composition-based machine learning approaches such as VirFinder for the detection of viral sequences in aquatic ecosystems. Because MARVEL relies on the pVOGs database, we focus our discussion on VirFinder, which to our knowledge is the only tool using a completely database-independent approach for the detection of phages. In particular, we discuss three points of importance: (1) the training set composition of supervised machine-learning models and the possibility of obtaining marine-focused models, (2) the impact of eukaryotic contamination in metagenomes, and (3) the limitations of current tools given the low abundance of viral sequences in most metagenomes.

The Composition of the Training Set and Discovery of New Viruses

Soueidan et al. (2015) explored the difficulty of classifying metagenomic sequences using k-mer-based machine-learning approaches. In their perspective paper, the authors used the concept of “hardness of the task.” Hardness measures were developed to understand why some instances are harder to classify correctly than others (Smith et al., 2014). Overlap of data from different classes was shown to be a principal contributor to instance hardness. Using a k-Disagreeing Neighbors (kDN) algorithm, Soueidan et al. (2015) showed that, for a k-mer size of 3 bp, the high-level classification of viral sequences mixed with non-viral sequences is a hard task, whereas low-level (family-level) classification is easier. These results suggest that machine learning models trained to classify viral sequences against cellular sequences may have a hard time generalizing to unknown viral families.
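
The kDN measure can be sketched in a few lines of Python: for each instance, it is the fraction of its nearest neighbors that carry a different class label. The version below is a minimal illustration under stated assumptions (Euclidean distance, a parameter named n_neighbors to avoid confusion with the k-mer size, and toy two-dimensional points standing in for k-mer frequency vectors). It is not the implementation used by Soueidan et al. (2015).

```python
import math


def kdn_scores(points, labels, n_neighbors=3):
    """k-Disagreeing Neighbors score for every instance.

    A score of 0.0 means all nearest neighbors share the instance's label
    (easy instance); 1.0 means none of them do (hard instance).
    """
    scores = []
    for i, p in enumerate(points):
        # Distance to every other instance, closest first (ties by index).
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors = [j for _, j in dists[:n_neighbors]]
        disagree = sum(labels[j] != labels[i] for j in neighbors)
        scores.append(disagree / len(neighbors) if neighbors else 0.0)
    return scores


# Hypothetical toy data: one "cellular" point sits inside the "viral" cloud.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
labels = ["viral", "viral", "viral", "cellular"]
hardness = kdn_scores(points, labels, n_neighbors=2)
```

In this toy case the overlapping cellular point obtains the maximal score, which is the situation that class overlap creates for viral versus cellular sequences.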

This idea is further supported by the performance of VirFinder, which depends on how abundant known viral groups are in the tool’s training set. The viral training set of the “phages-prok” model is mainly composed of phages infecting Proteobacteria and Firmicutes from the RefSeq database, and contains few phages infecting Archaea. Discussing this bias in their training set, the authors showed how VirFinder’s performance varied across several groups of viruses. They reported a markedly lower performance for the detection of archaeal phages than of bacterial phages, and revealed that the tool is biased toward the identification of the most represented viral groups in its training set (Ren et al., 2017). Because different ecosystems harbor different viral groups and hosts, we expect VirFinder’s ability to retrieve viral sequences to be significantly affected by the ecosystem considered. We evaluated the true positive rate, or recall (the proportion of truly viral sequences that are returned), of the “phages-prok” model for viral sequences isolated from various aquatic ecosystems (Figure 1A). The evaluation sets were composed of viral sequences isolated from pelagic, freshwater, hot spring, coral-associated, and wastewater metagenomes available in the IMG/VR database. Not surprisingly, the recall of VirFinder varies with the ecosystem considered. We measured a lower recall for viral contigs isolated from hot spring, coral-associated, and wastewater environments than for viral sequences isolated from pelagic and freshwater metagenomes. This suggests that while viral groups present in pelagic ecosystems are now well represented in the RefSeq database, a variety of viruses present in less-studied ecosystems, such as coral-associated environments, are currently not represented (Figure 1A).

Some of the differences in recall across ecosystems could be explained by the presence of sequences from viruses infecting eukaryotes, which would not be recognized as viral by the “phages-prok” model. However, the VirFinder “DNAvirus-prok” model, trained to identify both phages and eukaryote-infecting viruses, shows the same drop in recall for hot spring, coral-associated, and wastewater metagenomes (Supplementary Figure S1).

FIGURE 1

The inability of these models to recognize certain viral groups may be reduced by an increased effort to sequence new viral genomes. It is indeed possible to train models on an ever-growing number of sequences: deep learning approaches are particularly suited to this task, since they can deal with complex patterns and their performance increases with the number of training examples. In contrast to this approach, we explored the possibility of training simpler models, tuned to an ecosystem of interest using metagenomic sequences. Indeed, viral communities vary in composition by environment as a function of host populations, which in turn occupy niches defined by specific physical and chemical properties (Hurwitz et al., 2014; Brum et al., 2015). Thus, when working in a given environment, the model only needs to recognize a small subset of viral sequences. Viromes can provide a set of viral signatures from a given ecosystem that can be used to inform a machine learning model.

As a proof of concept, we developed pelagic-focused classifiers trained on Tara Oceans viromes and microbiomes. Using VirFinder’s training function, we trained models using metagenomes from the Tara Oceans dataset (prokaryote-enriched fractions, 0.22 to 1.6 μm and 0.22 to 3 μm) for the non-viral sequences and sequences from the Tara Oceans viromes (virus-enriched fraction, <0.22 μm) for the viral sequences (Sunagawa et al., 2015) (material and methods are detailed in Supplementary File S1 and the metagenomes are listed in Supplementary File S2).

Two evaluation sets were constructed using published phage and prokaryotic genomes, isolated either from marine ecosystems (“marine evaluation set”) or from various ecosystems (“all genomes evaluation set”) (material and methods in Supplementary File S1, list of genomes in Supplementary File S2). To ensure that those sequences were not used to train the VirFinder “phages-prok” or “DNAvirus-prok” models, only genomes published after 2014 were used in the evaluation sets.

To take into account both the recall and the precision (the proportion of returned results that are truly viral sequences) of the models in this evaluation, we calculated the F1-score (the harmonic mean of precision and recall). Overall, the F1-score of the Tara-trained models is equivalent to that measured for the VirFinder “phages-prok” model on the marine-focused evaluation set. The Tara-trained models also show a preferential detection of marine viral groups, as their performance is greatly reduced when sequences from other ecosystems are taken into account (Figure 1B). More precisely, the recall is greatly reduced when the model is evaluated on all RefSeq genomes regardless of their origin, showing an ecosystem-focused specialization of the Tara-trained models (Supplementary Figure S2).
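
For reference, the three metrics used in this evaluation can be computed from the counts of true positives, false positives, and false negatives, as in the short Python sketch below. The function name and the example counts are hypothetical and are not taken from our evaluation sets.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score from confusion-matrix counts.

    tp: viral sequences predicted as viral
    fp: non-viral sequences predicted as viral
    fn: viral sequences predicted as non-viral
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean: low if either precision or recall is low.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 viral contigs found, 20 missed, 20 false alarms.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```

Because the F1-score is a harmonic mean, a model cannot reach a high score by trading away all of its precision for recall, or the reverse.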

This result shows that it is possible to obtain ecosystem-focused models for the identification of viruses in metagenomes using the information from viromes available for the ecosystem of interest. Although we believe that this approach could be applied to other ecosystems, it is important to highlight that viromes can also provide a biased representation of the actual viral population. For example, viromes target only the dsDNA viral community. Moreover, the DNA extraction method (Wood-Charlson et al., 2015) or the filtration size (López-Pérez et al., 2017) used for viromes can greatly affect the composition of the viral groups retrieved, and therefore introduce a bias in the training set. While this approach could provide an avenue to investigate environments where few viral genomes are available, it requires the availability of several virome and microbiome datasets from the ecosystem of interest. Such a sequencing effort is rarely met; however, this issue is expected to diminish as the number of available metagenomic datasets increases.

The Problem of Potential Eukaryotic Contamination

We further argue here that training on an ever-growing number of sequences may lead to unexpected effects. VirFinder’s published model, “phages-prok,” was trained on RefSeq phage and prokaryotic genomes; however, the authors also provide online the “DNAvirus-prok” model, trained on all DNA viral and prokaryotic genomes from RefSeq. This model is more exhaustive in terms of the viral groups included in its training set; however, it strongly misclassifies eukaryotic sequences, with a false positive rate (FPR) above 0.7 for genomic sequences from known fungi, plants, human, and protozoa (Supplementary Figure S3).

Because the training set of the “phages-prok” model does not contain any eukaryotic sequences, the model shows an elevated false positive rate on eukaryotic sequences. This false positive rate is higher still with the “DNAvirus-prok” model, for which misclassification increases with sequence length, suggesting that this model learned to identify eukaryotic sequences as viral. At the tetranucleotide level, both prokaryotic and eukaryotic viruses have been shown to share a sequence composition close to that of their hosts, providing a potential explanation for this model’s behavior (Pride et al., 2006).

When sequencing a metagenome, eukaryotic contamination is common. Eukaryotic sequences in metagenomes usually come from human contamination during sample processing, but can also come from the eukaryotic host in the case of host-associated microbiomes (human or cow gut metagenomes, for example). In those cases, the eukaryotic sequences can easily be removed by mapping the input contigs against the human and host genomes. However, in aquatic ecosystems, eukaryotic sequences in metagenomes can also come from micro- and pico-eukaryotes naturally present in the ecosystem. Dealing with those sequences is more difficult because of the lack of complete genomes for these organisms in the databases. New tools that take eukaryotic sequences into account are critical for exploring a variety of ecosystems of interest.

Dealing With Low Viral Content in Metagenomes

While viromes allow for enrichment in viral sequence content, real-world metagenomes often contain a low proportion of viral sequences (Breitbart et al., 2002; Reyes et al., 2010; Daly et al., 2011). Like other tools for viral detection in metagenomes, VirFinder’s precision when dealing with rare events is hampered by a simple Bayes’ relationship (Ren et al., 2017). Indeed, in datasets where viral reads are rare (less than 10%), the number of false positives can become comparable to, or even exceed, the number of true positive hits. Some metrics for machine learning model performance are appropriate to study such imbalanced datasets. Receiver operating characteristic (ROC) curves are commonly used to evaluate the performance of binary classifiers. Because ROC curves do not depend on a particular threshold value, they provide a better measure of the tradeoff between true and false positive rates. The area under a ROC curve (AUC) can be used to summarize a model’s performance. It is, however, important to note that this metric relies only on true and false positive rates and is therefore misleading when evaluating models on imbalanced datasets. On the other hand, metrics like precision-recall curves (PRC) and the area under the precision-recall curve (AUPRC) allow one to measure the loss of precision when moving to imbalanced datasets.
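
The Bayes’ relationship mentioned above can be written as precision = TPR × p / (TPR × p + FPR × (1 − p)), where p is the fraction of viral sequences in the dataset. The Python sketch below evaluates it for a fixed classifier at two viral fractions. The function name and the TPR and FPR values are hypothetical, chosen only to show the effect, and do not describe any published model.

```python
def expected_precision(tpr, fpr, viral_fraction):
    """Expected precision of a classifier at a given class balance.

    tpr: true positive rate (recall) on viral sequences
    fpr: false positive rate on non-viral sequences
    viral_fraction: proportion of viral sequences in the dataset
    """
    true_pos = tpr * viral_fraction
    false_pos = fpr * (1 - viral_fraction)
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


# Hypothetical classifier with 90% recall and a 5% false positive rate.
balanced = expected_precision(tpr=0.90, fpr=0.05, viral_fraction=0.50)
rare = expected_precision(tpr=0.90, fpr=0.05, viral_fraction=0.05)
# balanced is roughly 0.95, rare is roughly 0.49
```

With these illustrative values, the same classifier goes from about one false positive in twenty hits on a balanced set to slightly more false positives than true positives when only 5% of sequences are viral.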

In this context, models and methods with increased precision are certainly valuable. As an example, the Tara-trained models showed a lower false positive rate than the VirFinder “phages-prok” model on a marine evaluation set. The precision and AUPRC of those models were evaluated on a set composed of 5% viral sequences from marine phage genomes and 95% non-viral sequences from marine prokaryotic genomes. While we do not claim that all ecosystem-focused models would perform better in the detection of rare events, this experiment shows how valuable high-precision models can be in the case of very imbalanced datasets, with a significant improvement in the precision (Figure 2A) and AUPRC (Figure 2B) of the Tara-trained model compared to the VirFinder “phages-prok” model.
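
The contrast between ROC-based and precision-recall-based summaries on imbalanced data can be reproduced on toy scores. The Python sketch below uses hypothetical scores and two simple helper functions (a rank-based ROC AUC and average precision, used here as a common estimate of AUPRC) to compare a balanced set with one in which the non-viral class is ten times larger. The scores and names are assumptions made for the example and are unrelated to the data behind Figure 2.

```python
def roc_auc(scores, labels):
    """ROC AUC as the probability that a positive outscores a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values measured at each true positive.

    Items are ranked by decreasing score; ties are kept in input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda item: -item[0])
    hits, precision_sum = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Hypothetical scores: 1 = viral, 0 = non-viral.
viral_scores = [0.9, 0.8, 0.4]
nonviral_scores = [0.7, 0.3, 0.2]

balanced_scores = viral_scores + nonviral_scores
balanced_labels = [1, 1, 1, 0, 0, 0]

# Same score distributions, but ten times more non-viral sequences.
imbalanced_scores = viral_scores + nonviral_scores * 10
imbalanced_labels = [1, 1, 1] + [0] * 30

auc_balanced = roc_auc(balanced_scores, balanced_labels)
auc_imbalanced = roc_auc(imbalanced_scores, imbalanced_labels)
ap_balanced = average_precision(balanced_scores, balanced_labels)
ap_imbalanced = average_precision(imbalanced_scores, imbalanced_labels)
```

In this toy case the ROC AUC is identical for both sets, because it depends only on the score distributions within each class, whereas the average precision drops once non-viral sequences dominate.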

FIGURE 2

Discussion

Sample bias occurs when the data used to train the algorithm does not accurately represent the problem space the model will operate in. A model trained on an incomplete and unrepresentative training dataset will be highly unlikely to perform well in real-world situations.

VirFinder is based on a logistic classifier model trained on genomic datasets from RefSeq. The resulting model is tuned to identify certain viral groups that are well represented in the database. We argue that it is possible to develop ecosystem-focused models that are trained on sequences representative of the environment they are specialized in. Because these ecosystem-focused models focus on a subset of viral and prokaryotic groups, they can be trained on a smaller training set than models trying to encompass all ecosystems. As a proof of concept, we used metagenomic sequences from the Tara Oceans expedition as a training set and obtained models tuned for the identification of marine viral sequences. As expected, our marine-focused models performed poorly on viral groups isolated in other ecosystems. While this approach is limited by the number and quality of viromes available, it is possible that a training set composed of both genomes and sequences from metagenomes could increase the recall of previously uncharacterized viruses.

We strongly believe that the development of machine-learning approaches tuned to deal with low-event detection is the key to developing reliable tools for viral sequence detection in metagenomes. Moreover, the investigation of a large number of ecosystems will require the development of tools dealing with potentially high eukaryotic sequence contamination.

Statements

Author contributions

BH and AP designed the theoretical study and experiments, analyzed the data, and wrote the manuscript. AP performed the experiments.

Funding

This work was supported in part by National Science Foundation award #1640775 to BH.

Acknowledgments

The authors would like to thank Jana U’Ren and members of the Hurwitz lab for helpful discussions.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2019.00806/full#supplementary-material

FIGURE S1

Recall of VirFinder “DNAvirus-prok” model on viral contigs isolated in various aquatic ecosystems.

FIGURE S2

Recall and precision of a classifier trained on Tara Oceans Metagenomes.

FIGURE S3

False positive rate of VirFinder models for eukaryotic and prokaryotic sequences.

FILE S1

Material and methods.

FILE S2

List of metagenomes and genomes used in training and evaluation sets.

References

Amgarten, D., Braga, L. P. P., da Silva, A. M., and Setubal, J. C. (2018). MARVEL, a tool for prediction of bacteriophage sequences in metagenomic bins. Front. Genet. 9:304. doi: 10.3389/fgene.2018.00304
Breitbart, M., Salamon, P., Andresen, B., Mahaffy, J. M., Segall, A. M., Mead, D., et al. (2002). Genomic analysis of uncultured marine viral communities. Proc. Natl. Acad. Sci. U.S.A. 99, 14250–14255. doi: 10.1073/pnas.202488399
Brum, J. R., Ignacio-Espinoza, J. C., Roux, S., Doulcier, G., Acinas, S. G., Alberti, A., et al. (2015). Patterns and ecological drivers of ocean viral communities. Science 348:1261498. doi: 10.1126/science.1261498
Bzhalava, Z., Tampuu, A., Bała, P., Vicente, R., and Dillner, J. (2018). Machine learning for detection of viral sequences in human metagenomic datasets. BMC Bioinformatics 19:336. doi: 10.1186/s12859-018-2340-x
Daly, G. M., Bexfield, N., Heaney, J., Stubbs, S., Mayer, A. P., Palser, A., et al. (2011). A viral discovery methodology for clinical biopsy samples utilising massively parallel next generation sequencing. PLoS One 6:e28879. doi: 10.1371/journal.pone.0028879
Edgar, R. C. (2010). Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461. doi: 10.1093/bioinformatics/btq461
Hurwitz, B. L., Ponsero, A., Thornton, J., and U’Ren, J. M. (2018). Phage hunters: computational strategies for finding phages in large-scale ‘omics datasets. Virus Res. 244, 110–115. doi: 10.1016/j.virusres.2017.10.019
Hurwitz, B. L., and Sullivan, M. B. (2013). The Pacific Ocean Virome (POV): a marine viral metagenomic dataset and associated protein clusters for quantitative viral ecology. PLoS One 8:e57355. doi: 10.1371/journal.pone.0057355
Hurwitz, B. L., Westveld, A. H., Brum, J. R., and Sullivan, M. B. (2014). Modeling ecological drivers in marine viral communities using comparative metagenomics and network analyses. Proc. Natl. Acad. Sci. U.S.A. 111, 10714–10719. doi: 10.1073/pnas.1319778111
Jurtz, V. I., Villarroel, J., Lund, O., Larsen, M. V., and Nielsen, M. (2016). MetaPhinder—identifying bacteriophage sequences in metagenomic data sets. PLoS One 11:e0163111. doi: 10.1371/journal.pone.0163111
Kim, D., Song, L., Breitwieser, F. P., and Salzberg, S. L. (2016). Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 26, 1721–1729. doi: 10.1101/gr.210641.116
Laffy, P. W., Wood-Charlson, E. M., Turaev, D., Weynberg, K. D., Botté, E. S., van Oppen, M. J. H., et al. (2016). HoloVir: a workflow for investigating the diversity and function of viruses in invertebrate holobionts. Front. Microbiol. 7:822. doi: 10.3389/fmicb.2016.00822
Li, Y., Wang, H., Nie, K., Zhang, C., Zhang, Y., Wang, J., et al. (2016). VIP: an integrated pipeline for metagenomics of virus identification and discovery. Sci. Rep. 6:23774. doi: 10.1038/srep23774
López-Pérez, M., Haro-Moreno, J. M., Gonzalez-Serrano, R., Parras-Moltó, M., and Rodriguez-Valera, F. (2017). Genome diversity of marine phages recovered from Mediterranean metagenomes: size matters. PLoS Genet. 13:e1007018. doi: 10.1371/journal.pgen.1007018
Meyer, F., Paarmann, D., D’Souza, M., Olson, R., Glass, E., Kubal, M., et al. (2008). The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9:386. doi: 10.1186/1471-2105-9-386
Ounit, R., Wanamaker, S., Close, T. J., and Lonardi, S. (2015). CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 16:236. doi: 10.1186/s12864-015-1419-2
Paez-Espino, D., Roux, S., Chen, I.-M. A., Palaniappan, K., Ratner, A., Chu, K., et al. (2019). IMG/VR v.2.0: an integrated data management and analysis system for cultivated and environmental viral genomes. Nucleic Acids Res. 47, D678–D686. doi: 10.1093/nar/gky1127
Pride, D. T., Wassenaar, T. M., Ghose, C., and Blaser, M. J. (2006). Evidence of host-virus co-evolution in tetranucleotide usage patterns of bacteriophages and eukaryotic viruses. BMC Genomics 7:8. doi: 10.1186/1471-2164-7-8
Rampelli, S., Soverini, M., Turroni, S., Quercia, S., Biagi, E., Brigidi, P., et al. (2016). ViromeScan: a new tool for metagenomic viral community profiling. BMC Genomics 17:165. doi: 10.1186/s12864-016-2446
Ren, J., Ahlgren, N. A., Lu, Y. Y., Fuhrman, J. A., and Sun, F. (2017). VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome 5:69. doi: 10.1186/s40168-017-0283-5
Reyes, A., Haynes, M., Hanson, N., Angly, F. E., Heath, A. C., Rohwer, F., et al. (2010). Viruses in the fecal microbiota of monozygotic twins and their mothers. Nature 466, 334–338. doi: 10.1038/nature09199
Rosen, G., Garbarine, E., Caseiro, D., Polikar, R., and Sokhansanj, B. (2008). Metagenome fragment classification using N-Mer frequency profiles. Adv. Bioinforma. 2008:205969. doi: 10.1155/2008/205969
Roux, S., Enault, F., Hurwitz, B. L., and Sullivan, M. B. (2015a). VirSorter: mining viral signal from microbial genomic data. PeerJ 3:e985. doi: 10.7717/peerj.985
Roux, S., Hallam, S. J., Woyke, T., and Sullivan, M. B. (2015b). Viral dark matter and virus–host interactions resolved from publicly available microbial genomes. eLife 4:e08490. doi: 10.7554/eLife.08490
Roux, S., Faubladier, M., Mahul, A., Paulhe, N., Bernard, A., Debroas, D., et al. (2011). Metavir: a web server dedicated to virome analysis. Bioinformatics 27, 3074–3075. doi: 10.1093/bioinformatics/btr519
Roux, S., Tournayre, J., Mahul, A., Debroas, D., and Enault, F. (2014). Metavir 2: new tools for viral metagenome comparison and assembled virome analysis. BMC Bioinformatics 15:76. doi: 10.1186/1471-2105-15-76
Segata, N., Waldron, L., Ballarini, A., Narasimhan, V., Jousson, O., and Huttenhower, C. (2012). Metagenomic microbial community profiling using unique clade-specific marker genes. Nat. Methods 9, 811–814. doi: 10.1038/nmeth.2066
Smith, M. R., Martinez, T., and Giraud-Carrier, C. (2014). An instance level analysis of data complexity. Mach. Learn. 95, 225–256. doi: 10.1007/s10994-013-5422-z
Soueidan, H., Schmitt, L.-A., Candresse, T., and Nikolski, M. (2015). Finding and identifying the viral needle in the metagenomic haystack: trends and challenges. Front. Microbiol. 5:739. doi: 10.3389/fmicb.2014.00739
Sunagawa, S., Coelho, L. P., Chaffron, S., Kultima, J. R., Labadie, K., Salazar, G., et al. (2015). Structure and function of the global ocean microbiome. Science 348:1261359. doi: 10.1126/science.1261359
Tithi, S. S., Aylward, F. O., Jensen, R. V., and Zhang, L. (2018). FastViromeExplorer: a pipeline for virus and phage identification and abundance profiling in metagenomics data. PeerJ 6:e4227. doi: 10.7717/peerj.4227
Truong, D. T., Franzosa, E. A., Tickle, T. L., Scholz, M., Weingart, G., Pasolli, E., et al. (2015). MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat. Methods 12, 902–903. doi: 10.1038/nmeth.3589
Wommack, K. E., Bhavsar, J., Polson, S. W., Chen, J., Dumas, M., Srinivasiah, S., et al. (2012). VIROME: a standard operating procedure for analysis of viral metagenome sequences. Stand. Genomic Sci. 6:421. doi: 10.4056/sigs.2945050
Wood, D. E., and Salzberg, S. L. (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15:R46. doi: 10.1186/gb-2014-15-3-r46
Wood-Charlson, E. M., Weynberg, K. D., Suttle, C. A., Roux, S., and van Oppen, M. J. H. (2015). Metagenomic characterization of viral communities in corals: mining biological signal from methodological noise. Environ. Microbiol. 17, 3440–3449. doi: 10.1111/1462-2920.12803

Summary

Keywords

virus, metagenomic, machine learning, sequence classification, viral signature

Citation

Ponsero AJ and Hurwitz BL (2019) The Promises and Pitfalls of Machine Learning for Detecting Viruses in Aquatic Metagenomes. Front. Microbiol. 10:806. doi: 10.3389/fmicb.2019.00806

Received

03 December 2018

Accepted

29 March 2019

Published

16 April 2019

Volume

10 - 2019

Edited by

Curtis A. Suttle, The University of British Columbia, Canada

Reviewed by

Jessica Labonté, Texas A&M University at Galveston, United States; Mohammad Moniruzzaman, Virginia Tech, United States

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Bonnie L. Hurwitz, bhurwitz@email.arizona.edu

This article was submitted to Aquatic Microbiology, a section of the journal Frontiers in Microbiology

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
