The Challenge and Potential of Metagenomics in the Clinic

The bacteria, fungi, and viruses that live on and in us have a tremendous impact on our day-to-day health and are often linked to many diseases, including autoimmune disorders and infections. Diagnosing and treating these disorders relies on accurate identification and characterization of the microbial community. Current sequencing technologies allow the sequencing of the entire nucleic acid complement of a sample, providing an accurate snapshot of the community members present in addition to the full genetic potential of that microbial community. There are a number of clinical applications that stand to benefit from these data sets, such as the rapid identification of pathogens present in a sample. Other applications include the identification of antibiotic-resistance genes, diagnosis and treatment of gastrointestinal disorders, and many other diseases associated with bacterial, viral, and fungal microbiomes. Metagenomics also allows the physician to probe more complex phenotypes, such as microbial dysbiosis in intestinal disorders and disruptions of the skin microbiome that may be associated with skin disorders. Many of these disorders are not associated with a single pathogen but emerge as a result of complex ecological interactions within microbiota. Currently, we understand very little about these complex phenotypes, yet clearly they are important and in some cases, as with fecal microbiota transplants in Clostridium difficile infections, treating the microbiome of the patient is effective. Here, we give an overview of metagenomics and discuss a number of areas where metagenomics is applicable in the clinic, and the progress being made in these areas. These include (1) the identification of unknown pathogens, and those pathogens particularly hard to culture, (2) utilizing functional information and gene content to understand complex infections such as Clostridium difficile, and (3) predicting antimicrobial resistance of the community using genetic determinants of resistance identified from the sequencing data. All of these applications rely on sophisticated computational tools, and we also discuss the importance of skilled bioinformatic support for the implementation and use of metagenomics in the clinic.

INTRODUCTION
The complement of microorganisms that live on and within us, our microbiome, and its role in health and disease has become a central focus of current research. Research over the past few years has revealed how fundamentally intertwined we are with our microbial passengers. We have known for many years the connection to immune system development and gut health, but the impact of the microbiome on our health goes far beyond this (1). Metabolism (2), skin health (3-6), and even our mental health (7) have been shown to be influenced or to influence the microbiome. Many diseases, often complex and multi-factorial, such as allergy (8), asthma (9), inflammatory bowel disease (10), and even cardiovascular disease (11) and cancer (12,13) have all been associated with alterations to the microbiome.
Shotgun sequencing of purified DNA, or metagenomics, is rapidly emerging as a powerful tool for both microbiology research and clinical applications due to the depth and breadth of information that can be acquired. The volume of DNA sequencing required to fully sequence a sample such as the human gut microbiome has traditionally made routine metagenomics unfeasible, particularly for diagnostics. However, ever-increasing sequencing volumes and dropping prices are slowly bringing metagenomics into the clinician's toolbox. One of the powerful aspects of metagenomics is the non-targeted nature of the sequencing. Once acquired, the DNA sequence can be queried to answer any number of questions, such as the presence of pathogens, metabolic pathways, antimicrobial resistance genes, and overall community composition. In this review, we discuss how this information is useful in a clinical context for both diagnostics and research, with a particular emphasis on complex microbiome-associated phenotypes such as dysbiosis and Clostridium difficile infection. Furthermore, we make the case that if metagenomics is to be useful in the clinical context, skilled bioinformatic support will be essential, not just in the development of novel tools and algorithms but also in the application of these tools and the interpretation of results. As sequencing technology continues to evolve rapidly, the most significant bottleneck for metagenomics (and other genomic analyses, for that matter) will not be sequencing but the availability of skilled analysts to analyze the data and tease out the clinically relevant information.

WHAT IS METAGENOMICS?
Historically, microbial diversity has been primarily studied with culturing. Selective media and careful selection of culture conditions can recover a tremendous range of organisms, and culturing is remarkably sensitive. In fact, culturing remains a valuable, albeit undervalued, tool to study the microbiome (14-17). However, microbiologists noticed some time ago a discrepancy between the number of organisms that could be counted under a microscope and the number that could be recovered in laboratory culture (18). This led to the development of culture-independent techniques, based primarily on the analysis of small-subunit (SSU) rRNA genes. The reduced costs and ever-increasing read lengths of high-throughput sequencing have transformed this into an accessible and powerful technique. Sequencing of rRNA gene amplicons from hundreds of samples can be done in parallel, and these data are used to infer abundance and taxonomic profiles of the microbial species in a sample. Metagenomics, on the other hand, is the study of the collective genetic material from all the genomes present in the sample (1) and provides a view into the functional potential of the population.
The workflow for a metagenomics experiment is fairly straightforward and is easily implemented in molecular biology laboratories. Several recently published studies outline many of the steps required when undertaking a microbiome study (19-23). A brief overview of the key steps for a metagenomics experiment is given here and is outlined in Figure 1.

DNA extraction
The first step in a metagenomics experiment is the same as in any other culture-independent method: extracting the DNA from the sample. Unfortunately, sample extraction gets little attention relative to other aspects of the workflow, yet it can have a significant impact on the outcome. A recent evaluation of extraction protocols from two major microbiome initiatives, the Human Microbiome Project (HMP) and MetaHIT, revealed differences in the distribution of bacterial taxa as well as differences in gene composition depending on the protocol used to extract the fecal samples (24). Such differences are thought to arise in part from differential lysis due to cell wall composition (24) and will likely be even more pronounced for analyses that aim to evaluate a multi-kingdom microbiota that includes fungi. Therefore, the choice of extraction protocol is an important first step and will influence downstream results. Preliminary or pilot studies may choose to include a few different protocols for evaluation. For clinical applications, rigorous and systematic evaluation of different protocols for different tissues will no doubt be a key component of implementing a metagenomics workflow.
To improve the detection of target organisms, sample preparation can also include steps to enrich target sequences or remove unwanted sequences. For example, detection of viruses can be improved by filtering out cellular material (25). Human DNA can be removed using laboratory methods to increase the proportion of DNA coming from target sequences, although this often leaves only small amounts of DNA and requires newer kits to prepare the sequencing library (23).

Kit Contamination and Batch Effects
It is worth discussing in brief some recent work that has looked at the microbiome components that can be identified within the reagents and buffers of many commercial DNA extraction kits (26). The authors of this study sequenced the components of commonly used DNA extraction kits and demonstrated that contaminating DNA in these kits could significantly confound both 16S gene sequencing and metagenomics results, particularly for low-biomass samples. They went on to analyze a previously published study (27), demonstrating that the main finding in that study was completely confounded by the lot of the extraction kit used. Once the contaminating sequences were removed, the most significant feature of the data disappeared. Clearly, batch effects are a key issue for metagenomics studies, yet they are not often considered when designing these experiments. For example, a recent primer in Cell (20) neglects to even mention batch effects, although the authors do include a section on potential contamination. Batch effects are widespread within high-throughput genomics experiments (28), and ameliorating and dealing with batch effects should be a key priority for any metagenomics experiment. Salter et al. (26) provide some recommendations (summarized in Figure 1), which mainly include sequencing reagent blanks as well as processing samples in random order if they need to be processed in multiple batches.
FIGURE 1 | Overview of a metagenomics experiment from sample preparation to analysis. There are a number of key steps in a metagenomics experiment that require attention as they will influence the final results. Considerations on DNA extraction method and contamination issues are very important at the sample preparation stage. At the sequencing stage, the main consideration is read depth and pilot experiments are recommended for this. Analysis can be quite complex and will vary depending on the research or clinical question. Essentially, following quality control, analysis can be done at the contig level, i.e., following metagenome assembly, or directly at the read level.
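The two recommendations attributed to Salter et al., randomizing sample processing across batches and sequencing reagent blanks, can be illustrated with a minimal Python sketch. The function name `assign_batches`, the sample labels, and the batch size are hypothetical and chosen only for illustration; a real study design may also need to balance cases and controls within each batch.

```python
import random

def assign_batches(sample_ids, batch_size, seed=0):
    """Shuffle samples, split them into batches, and add one reagent blank per batch."""
    rng = random.Random(seed)  # fixed seed so the layout can be reproduced and recorded
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)      # random order avoids aligning batches with phenotype
    batches = []
    for start in range(0, len(shuffled), batch_size):
        batch = shuffled[start:start + batch_size]
        # Hypothetical label for the extraction negative control of this batch.
        batch.append(f"reagent_blank_{len(batches) + 1}")
        batches.append(batch)
    return batches

# Hypothetical study with six cases and six controls, four samples per extraction batch.
samples = [f"case_{i}" for i in range(1, 7)] + [f"control_{i}" for i in range(1, 7)]
for number, batch in enumerate(assign_batches(samples, batch_size=4), start=1):
    print(number, batch)
```

Recording the seed and the resulting layout alongside the kit lot number makes it possible to test for batch effects after sequencing.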

Sequencing Depth and Instrumentation
Although long-read sequencers from companies like PacBio and Oxford Nanopore have great potential for improving metagenomic datasets, particularly de novo assembly, these methods are currently too expensive to be practical. Hence, Illumina's short-read sequencers are currently the main choice for these types of experiments, and to acquire the read depth needed for good sample coverage, the higher output instruments such as the HiSeq and NextSeq are used. Previously, the main competitor for Illumina was a company called 454 (owned by Roche), whose sequencers could produce much longer reads (600-700 bp) as compared to Illumina's 300 bp. However, in 2013, Roche announced it was closing 454 (https://www.genomeweb.com/sequencing/rocheshutting-down-454-sequencing-business) and phasing out all the sequencers. Currently, Illumina's sequencers cannot be matched in terms of read quality and price-per-base and are really the only option for metagenomic studies. The ubiquity of Illumina's sequencers across research laboratories and a growing number of clinical laboratories makes them relatively accessible for both clinical and basic researchers.
The question then becomes one of read depth: how many reads are needed to answer a particular question? There has been some attempt to calculate the required sequencing depth based on the predicted taxonomic profile of the community of interest (29); however, there is currently no hard and fast rule for how much sequence to acquire, and the amount will depend on the desired outcome. For instance, if the sequencing is being done solely to identify an unknown pathogen, this pathogen will presumably be present in some numbers, and less depth may be required to detect it. The final choice of read depth will be highly dependent on experimental design and budget and is best determined empirically in pilot experiments.
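As a rough starting point for planning a pilot experiment, the relationship between read depth and coverage of one organism can be sketched in a few lines of Python. This is a back-of-envelope idealization, not the method of Ref. (29): it assumes reads are sampled uniformly, that a known fraction of all reads derives from the organism of interest, and that the Lander-Waterman approximation for the covered fraction of a genome holds. The function names and the numbers in the example are hypothetical.

```python
import math

def reads_required(genome_size_bp, read_fraction, read_length_bp, target_coverage):
    """Total reads needed so that one organism reaches the target mean coverage.

    read_fraction is the assumed share of all reads that come from that organism.
    """
    expected = target_coverage * genome_size_bp / (read_fraction * read_length_bp)
    return math.ceil(expected)

def fraction_covered(mean_coverage):
    """Expected fraction of the genome hit by at least one read (Lander-Waterman)."""
    return 1.0 - math.exp(-mean_coverage)

# Hypothetical example: a 5 Mb genome contributing 1% of reads, 150 bp reads.
total_reads = reads_required(5_000_000, 0.01, 150, target_coverage=10)
print(total_reads, round(fraction_covered(10), 5))
```

Detecting a pathogen needs far less than full coverage, which is why the required depth depends so strongly on the question being asked.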

Analysis
Generating sequencing data is done mostly with standard molecular biology protocols that are accessible to most research and clinical laboratories. However, analysis of metagenomic data is far from standardized, is an area of active research, and is often the bottleneck for these types of studies. New methods are being published almost on a weekly basis, and a good analysis will invariably require a skilled analyst (see the section at the end for more discussion on this point). Clinical applications of metagenomics are still limited by computational methods (see the next section for a specific example); however, with good experimental design and an appropriate budget for analysis (often underestimated or neglected), the bioinformatics need not be prohibitive to the successful application of metagenomics. Figure 1 includes an overview of the main steps that may be included in a typical analysis. The initial quality control steps are very important to ensure that nothing went wrong with the sequencing. From there, depending on the experiment, an analysis can typically be divided into different areas, although in most cases a variety of analyses will be done. Assigning taxonomy to the individual reads gives the taxonomic composition of the community, similar to what one would obtain from a 16S gene survey, and is a key step for pathogen identification. There are quite a number of tools available to do this, but few reach the speed and accuracy of Kraken (30), with MetaPhlAn as another common choice (31). To get functional composition, the reads are searched against a protein database such as KEGG. Again, a number of tools are available for this, including the web-based MG-RAST (32) and MEGAN (33), which includes a graphical user interface. Some tools, such as HUMAnN (34), attempt to do metabolic reconstruction of the metagenomic data. As an alternative to read-level analysis, there is considerable work being done on metagenome assembly. A typical workflow might consist of a metagenome assembly with a specialized assembler such as Meta-IDBA (35), followed by binning contigs into groups with software like CONCOCT (36) and manual analysis and binning correction with a tool like Anvi'o (37). Contigs can then be annotated, and subsequent functional analysis can be performed. Newer algorithms and tools are allowing for strain-level analysis of metagenomic samples (38,39), which holds much promise for infectious disease metagenomics as it allows the simultaneous identification of specific pathogenic strains and any corresponding antibiotic resistance and virulence genes these strains might carry.
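To make the read-level step concrete, the sketch below turns per-read taxonomic assignments into a relative abundance profile. It is deliberately tool-agnostic: the input is assumed to be a list of (read identifier, taxon) pairs that has already been parsed from the output of a classifier, with `None` marking unclassified reads. The function name and the example assignments are hypothetical.

```python
from collections import Counter

def taxonomic_profile(assignments):
    """Relative abundance of each taxon among classified reads.

    assignments: iterable of (read_id, taxon) pairs; taxon is None if unclassified.
    """
    counts = Counter(taxon for _, taxon in assignments if taxon is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the profile from the most to the least abundant taxon.
    return {taxon: n / total for taxon, n in counts.most_common()}

# Hypothetical assignments for five reads, one of which could not be classified.
example = [
    ("read1", "Bacteroides"),
    ("read2", "Bacteroides"),
    ("read3", "Clostridioides difficile"),
    ("read4", None),
    ("read5", "Bacteroides"),
]
print(taxonomic_profile(example))
```

Read-count proportions of this kind are not corrected for genome size, so they approximate the share of sequenced DNA and not the number of cells.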

METAGENOMICS FOR DIAGNOSIS OF INFECTIOUS DISEASE
Currently, diagnosis of the vast majority of microbial diseases is carried out using traditional culture-based methods. In a clinical context, culture-based methods can fail to isolate the disease-causing organism (40-42) and are time consuming and labor intensive (43). While not yet standard practice, utilizing a metagenomics approach in a clinical setting has the potential to identify and characterize bacterial and viral pathogens (44,45). It is likely that reduced costs and increased sensitivity will encourage the use of culture-independent metagenomics approaches in clinical practice, particularly for new and emerging pathogens, which do not yet have standard diagnostic testing (44).
As discussed in the previous section, extracting useful information from these large datasets is non-trivial, often requiring multiple complex steps, which are dependent on the particulars of the data set being examined [e.g., see Ref. (44)]. Some inroads have been made in the development of tools for the rapid detection of pathogens in metagenomic datasets, which are designed to be fast and easy to use, both important factors for integration into a clinical lab. PathSeq (46) utilizes a sequence subtraction method, where host DNA is first identified by aligning to a human reference database and removed, leaving a much smaller dataset to search. This approach was used to identify a previously unknown pathogen as the causative agent in cord colitis syndrome (47). Clinical PathoScope (48) also uses the sequence subtraction method but claims to be much faster. Other computational strategies have been employed to deal with the large amount of host DNA; rapid identification of non-human sequences (RINS) (49) utilizes a custom reference database, which, while dramatically reducing run times, requires some sort of hypothesis about the organism being identified. Parallel processing has also been used as a way to reduce computational times (50), while others have attempted to leverage improved alignment algorithms and increasingly available cloud computing resources (51). The latter, known as sequence-based ultrarapid pathogen identification (SURPI), was recently used to diagnose a viral infection in a patient with encephalitis (52).
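The idea behind sequence subtraction can be shown with a toy Python sketch. Note the substitution: the tools described above identify host reads by alignment to a human reference, whereas this example uses a simple shared k-mer fraction as a stand-in so that it stays self-contained. The function names, the k-mer size, the threshold, and the sequences are all hypothetical, and reverse-complement matches are ignored for brevity.

```python
def kmers(sequence, k):
    """Set of all overlapping substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def subtract_host(reads, host_reference, k=8, max_host_fraction=0.5):
    """Keep reads whose share of k-mers found in the host reference is low."""
    host_kmers = kmers(host_reference, k)
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k cannot be assessed, so it is dropped
        host_fraction = len(read_kmers & host_kmers) / len(read_kmers)
        if host_fraction <= max_host_fraction:
            kept.append(read)  # candidate non-host read for downstream searching
    return kept

# Hypothetical host reference and two reads: one host-derived, one not.
host = "ACGTACGTTAGCTAGCTAGGATCCGATCGATCGTACGATCG"
reads = [host[5:25], "TTTTTGGGGGCCCCCAAAAATTTTT"]
print(subtract_host(reads, host))
```

Only the reads that survive subtraction need to be compared against microbial databases, which is where the reduction in search time comes from.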
Given the rapid development of tools targeted for pathogen identification, it is feasible that this may be an area where metagenomics will play a key role in the clinical laboratory.

METAGENOMICS FOR STUDYING DYSBIOSIS AND CLOSTRIDIUM DIFFICILE INFECTION
In addition to infection diagnostics, metagenomics has a great deal of potential for unraveling the microbial ecology of complicated disorders such as Clostridium difficile infection (CDI). The role of the microbiome in CDI is well studied, and CDI is considered the prototypical example of a disease state that occurs as a result of dysbiosis. Reduction of microbial diversity in the gut, most often as a result of antibiotic use (53,54), results in reduced colonization resistance, promoting the overgrowth of Clostridium difficile. Using non-sequencing methods, it has been demonstrated that CDI patients had a decrease in the number of Bacteroides, Prevotella, and Clostridia groups IV and XIVa as well as higher levels of Enterobacteriaceae compared with healthy counterparts (55,56). More recent sequencing studies, which provide a more in-depth analysis of the community structure, have demonstrated a less diverse gut microbiota in CDI patients relative to controls, with reductions in the Bacteroidaceae, Lachnospiraceae, and Ruminococcaceae families (57-59).
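Statements about a "less diverse" microbiota are usually backed by a diversity index computed from taxon counts. The sketch below computes the Shannon index, one common choice; whether the cited studies used this particular index is not stated here, and the counts in the example are hypothetical.

```python
import math

def shannon_diversity(counts):
    """Shannon index H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("at least one taxon must have a positive count")
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)

# Hypothetical read counts per family in two samples.
even_community = [25, 25, 25, 25]     # four families at equal abundance
dominated_community = [97, 1, 1, 1]   # one family dominates the sample
print(shannon_diversity(even_community), shannon_diversity(dominated_community))
```

The index is highest when taxa are evenly represented and falls as a single taxon comes to dominate, which matches the low-diversity pattern described for CDI.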
In the clinic, standard treatment of CDI is with metronidazole for mild disease or vancomycin for more severe disease (60). Both treatments are damaging to the normal microbiota, which contributes to the approximately 20% recurrence risk of disease (56). Therefore, new treatments for CDI must not only kill the pathogen but also simultaneously prevent destruction of the protective host microbiota. We are beginning to see studies looking at the microbiome-sparing properties of new antibiotics (61) and anticancer drugs (62) in the literature, and it has been suggested that, going forward, the Food and Drug Administration could look at the host microbiota during the clinical trial phase for all new drugs to determine their effect on the microbiome.
Remarkably, a lot of the information that has been gathered on the gut microbiota has come about as a result of studies on fecal microbiota transplantation (FMT). FMT is essentially the transfer of the gut microbiota from the stool of a healthy donor to a sick patient. It has become widely used in the last decade due to its high success rate (up to 92-100% depending on the protocol used) (63,64). In studies comparing recurrent CDI patients pre- and post-FMT, it has been shown that the intestinal microbiota changes from a low-diversity disease state dominated by Streptococcaceae, Enterococcaceae, and Enterobacteriaceae to a more diverse community, with significantly increased numbers of Lachnospiraceae and Ruminococcaceae (65,66). However, although we know FMT works in the clinic, we do not yet fully understand the specific mechanisms of why it works. Studies over the last two decades have demonstrated that bacterial mixtures of 6-33 different species can resolve recurrent C. difficile as effectively as whole fecal transplants (67-69). Additionally, recent germ-free mouse studies have shown that a single Lachnospiraceae strain can suppress C. difficile infection in mice (70). Understanding what constitutes colonization resistance is made inherently more complex by the variability of the gut microbiome between people and the fact that identification of specific bacterial populations in the gut does not provide any concrete information regarding overall function.
The diversity of the human intestinal microbiome is key to a number of biological processes that ensure the wellbeing of an individual. While certain bacterial species have been suggested as potential "keystone" species (71), it is likely that the functional state of the microbiome plays a more important role than the presence of a single species. Metagenomic sequencing offers a more comprehensive approach than marker gene approaches. Not only can it provide a complete view of the microbial community present, but it also has the ability to resolve information about overall community function. Weingarden et al. (72) demonstrated that patients with recurrent CDI (rCDI) have high concentrations of primary bile acids and that FMT can restore the intestinal microbiota and the composition of fecal bile acids to that seen in non-CDI donor samples. Further evidence of a role for bile acids (BAs) was indicated by a recent study that identified Clostridium scindens as an efficient inhibitor of CDI. This bacterium can convert primary BAs to secondary BAs, thereby correcting the biosynthesis of secondary bile acids and inhibiting CDI.
Additional studies looking at the role of other metabolites in CDI have suggested that butyrate deficiency in the colon increases growth and toxin production of C. difficile (73,74). In turn, butyrate producers in the gut such as Lachnospiraceae and Ruminococcaceae are thought to have a protective role in preventing CDI. Paradoxically, butyrate has also been shown to be an activator of toxin synthesis in C. difficile (75). Furthermore, it was demonstrated using a gnotobiotic mouse model that an abundance of commensally derived succinate allows C. difficile to expand efficiently and cause disease (76). In addition, genes involved in the conversion of succinate to butyrate were highly expressed under these conditions, suggesting a complex metabolic network is involved in pathogenesis. Finally, metagenomics will identify not only potentially beneficial bacteria and their functional role but also potential issues, such as antibiotic-resistance genes or virulence genes; 16S rRNA gene profiling is not sufficiently sensitive to differentiate between pathogenic and non-pathogenic or antibiotic-resistant and antibiotic-sensitive strains.

UNDERSTANDING THE GLOBAL THREAT OF ANTIMICROBIAL RESISTANCE USING METAGENOMICS
Antimicrobial resistance is recognized as a growing global threat. Studying disruption of the human microbiome through use of antimicrobials is a topic of growing interest among healthcare professionals, because it could be a driving force behind the introduction and proliferation of antibiotic-resistant organisms (AROs) in health-care settings. In addition to CDI, microbial imbalance in the gut is the major predisposing factor for vancomycin-resistant enterococci (VRE) (77,78) as well as other AROs including Klebsiella and Escherichia coli (53). If microbial imbalance is the major predisposing factor for infection with these organisms, then it stands to reason that a healthy microbiome is one's best defense against acquiring these organisms. Using a metagenomics approach, we can investigate the functional role of the host microbiome in the carriage and transmission of AROs; in patients who are asymptomatic carriers or actively infected with AROs, what is the state of their gut microbiome relative to non-carriers? Can we identify specific communities and functions that provide colonization resistance against some or all AROs?
In a diagnostic context, metagenomic data have tremendous potential for predicting antimicrobial sensitivity and resistance. Traditional methods of detecting antibiotic resistance suffer from the same problems as identifying pathogens, given that these assays are performed on isolated organisms. However, the metagenome contains the collection of resistance determinants within the microbial community, known as the resistome (79), and can provide a comprehensive picture of the resistance "potential" of a community (80). Here, genetic determinants of resistance are identified in the data set and used to predict what the resistance patterns will be. This works well for known mechanisms of resistance but would not be useful for discovering novel mechanisms or if the primary mechanism of resistance in a particular community is one that is unknown. The key advantage of a metagenomics approach is that all known determinants can be identified, even ones that are not present in the disease-causing pathogen but may be transferred to it due to the highly mobile nature of many of these genes and pathways.
Although designed for whole genomes from single organisms, a great example of how this could work effectively is software called Mykrobe predictor (81), which uses de Bruijn graphs to identify a variety of different allele types, such as single-nucleotide polymorphisms (SNPs), indels, and genes that are associated with antibiotic resistance. Using these data, a prediction is made on which antibiotics the organism will be resistant to. Although the authors demonstrate that it can be used to identify very minor alleles in a mixed infection, it is, at this stage, unlikely to work well in a complex infection environment such as the gut, which would contain a fairly large number of antibiotic-resistance determinants from a variety of organisms (82,83). Nonetheless, due to its speed, ease of use, and accuracy, this tool is a very promising step forward and demonstrates how sequencing data can be used to make rapid predictions about antibiotic resistance.
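The general idea of checking sequencing reads for a resistance-associated allele with k-mers can be sketched as follows. This is a much-simplified illustration in the spirit of a de Bruijn graph approach and is not the algorithm implemented in Mykrobe predictor: it only asks whether every k-mer along a reference or a variant allele is present in the reads, which corresponds to an unbroken path through the graph. The function names and the short sequences are hypothetical and are not real resistance alleles.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def allele_support(counts, allele_sequence, k):
    """Minimum k-mer count along the allele; 0 means the path is broken.

    The allele sequence must be at least k bases long.
    """
    path = [allele_sequence[i:i + k] for i in range(len(allele_sequence) - k + 1)]
    return min(counts[kmer] for kmer in path)  # Counter returns 0 for unseen k-mers

def call_allele(reads, reference_allele, resistant_allele, k=5):
    """Report read support for the reference and the resistance-associated allele."""
    counts = kmer_counts(reads, k)
    return {
        "reference": allele_support(counts, reference_allele, k),
        "resistant": allele_support(counts, resistant_allele, k),
    }

# Hypothetical alleles differing by one SNP, and three reads carrying the variant.
reference = "ACGTTGCATGA"
resistant = "ACGTTACATGA"
print(call_allele([resistant] * 3, reference, resistant))
```

In a mixed sample both alleles can receive support at once, which hints at why attributing a determinant to a particular organism in a complex community such as the gut is difficult.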
Identifying antibiotic-resistance genes in metagenomic datasets will depend strongly on the quality and completeness of resistance gene databases. Efforts such as the Comprehensive Antibiotic-Resistance Database (CARD) (84) and the Antibiotic-Resistance Gene Database (ARDB) (85) are extremely important for the appropriate interpretation of resistance levels of a sample based on gene content.

THE ROLE OF THE BIOINFORMATICIAN IN CLINICAL METAGENOMICS
The semantics of definitions aside (86,87), we put forward that if metagenomics is to be useful in a clinical context, it will require skilled bioinformatic analysts in addition to novel and efficient computational tools. Metagenomics can be immensely useful in clinical diagnostics, as demonstrated by a study using a sequence-based metagenomics approach to investigate an outbreak of Shiga-toxigenic Escherichia coli (44), but such a study required a complex and non-standard analysis. Standard pipelines and tools with simple user interfaces can be set up, and Mykrobe predictor (81) is a good example of this. Indeed, as demonstrated by the number of pathogen identification tools intended to be user friendly, there is considerable effort being put forth to remove complex analysis as a bottleneck. However, many datasets and analyses are not standard and require custom analysis. Furthermore, and perhaps more importantly, interpreting the output of an analysis should never be done blindly; that is to say, a fundamental understanding of the tools and their limitations is paramount to acquiring accurate answers from metagenomic data in the clinic.
An advantage of metagenomics is that as new tools are created and new discoveries are made, the sequencing data can be utilized in ways previously unknown (Figure 1). In addition, new sequencing technologies, such as single-molecule sequencing, which produces very long reads, are emerging, and established tools may or may not work with these new types of data. Having skilled bioinformaticians will become even more important as clinical labs and research become more and more dependent on sequencing data.

CONCLUSION
Metagenomics holds much promise for microbial diagnostics and research, and there are several exciting proof-of-concept studies demonstrating the power of this approach for the clinical laboratory. Decreasing costs and increasing throughput will likely remove sequencing as a bottleneck, leaving computational power, effective tools, and timely analysis as key issues that will need to be addressed to realize the full potential of metagenomic sequencing in the clinic.

AUTHOR CONTRIBUTIONS
All authors listed have made substantial, direct, and intellectual contributions to the work and approved it for publication.