Analysis of plant microbe interactions in the era of next generation sequencing technologies

Next generation sequencing (NGS) technologies have impressively accelerated research in biological science during the last years by enabling the production of large volumes of sequence data to a drastically lower price per base, compared to traditional sequencing methods. The recent and ongoing developments in the field allow addressing research questions in plant-microbe biology that were not conceivable just a few years ago. The present review provides an overview of NGS technologies and their usefulness for the analysis of microorganisms that live in association with plants. Possible limitations of the different sequencing systems, in particular sources of errors and bias, are critically discussed and methods are disclosed that help to overcome these shortcomings. A focus will be on the application of NGS methods in metagenomic studies, including the analysis of microbial communities by amplicon sequencing, which can be considered as a targeted metagenomic approach. Different applications of NGS technologies are exemplified by selected research articles that address the biology of the plant associated microbiota to demonstrate the worth of the new methods.


INTRODUCTION
Plants live in association with diverse microorganisms, which thrive below ground in the rhizosphere and above in the phyllosphere (Vorholt, 2012;Bulgarelli et al., 2013). They are found as endophytes within the plant, as epiphytes attached on plant surfaces and in the nearby soil around the roots. These microorganisms can have beneficial, neutral, or detrimental effects on plant health and development (Newton et al., 2010). The majority of the diverse plant colonizing microorganisms follows a commensal lifestyle; they do not cause obvious harm to the plant, nor do they exert a strong plant growth promoting effect as known for instance from symbiotic nitrogen-fixing bacteria or mycorrhizal fungi. The opening questions to better understand the association between plants and their associated microbiota are the "Who is there?" and "What are they doing?" These are extended by "How do they life under given conditions?" "How do they respond to environmental changes and perturbations?" "How do they interact with each other?" and "How do they affect plant health and development?" Finding answers to these questions will lead to a better understanding of the association between microorganisms and plants; a prerequisite to assess if and how associated microorganisms may be used in the future to support plant growth and improve crop yield.
DNA based studies of the plant associated microbiota are of high value to address the aforementioned questions. Genomic analyses of individual microbial strains or metagenomic studies of whole microbial communities provide insight into the composition and physiological potential of plant associated microorganisms. RNA based studies can extend such studies in order to elucidate the actual metabolic activities and regulatory mechanisms of the microbial cells under given conditions. NGS technologies have a tremendous impact on DNA and RNA based analysis methods; they allow finding answers to questions that could not be addressed before, largely due to technical and financial limitations. Thus, plant microbe associations can now be studied at a speed and depth as never before.
The present review summarizes the main features of the currently available NGS systems and gives a brief outlook about what may be expected in the future. It critically discusses limitations of NGS platforms and shows up ways to compensate these. Applications in the context of plant-microbe-interactions are highlighted that profit from these new technologies, focusing on metagenomic analyses.

NEXT GENERATION SEQUENCING TECHNOLOGIES
Different NGS systems have in common that they produce a massive amount of sequencing data (up to gigabases and soon even terabases) in parallel. Often, NGS instruments are classified as second and third generation sequencing technologies (e.g., Schadt et al., 2010;Niedringhaus et al., 2011;Pareek et al., 2011;Liu et al., 2012). There is no consistent definition for this terminology, and it is difficult to assign all different instruments unambiguously to one or the other category (Schadt et al., 2010;Thompson and Milos, 2011). In this review I refer to all those methods that depend on a PCR step for signal intensification prior to sequencing as second generation sequencing instruments, opposed to single molecule sequencing. Second generation sequencing technology includes the 454 instruments from Roche, the different Illumina platforms and the Life Technologies instruments, i.e., the Sequencing by Oligonucleotide Ligation and Detection (SOLiD) and Ion Torrent sequencers. The only third generation sequencing instrument that is currently commercially available is the PacBio RS by Pacific Biosciences.

COMMON AND DISTINCT FEATURES OF SECOND GENERATION SEQUENCING TECHNOLOGIES
The main characteristics of NGS sequencers are described here in a comparative way in order to point out similarities and differences. A detailed description of second generation sequencing platforms and principles can be found in dedicated reviews (e.g., Voelkerding et al., 2009;Metzker, 2010;Glenn, 2011;Pareek et al., 2011;Zhang et al., 2011;Liu et al., 2012;Shokralla et al., 2012;Mardis, 2013;Morey et al., 2013). Despite differences in terms of sequencing principle, all current second generation sequencing platforms have several shared features with regard to library preparation, library amplification and the sequencing process (Figure 1, Table 1).

LIBRARY PREPARATION
Library preparation can be done from DNA (genomic or PCR amplified fragments) or RNA as input material. The latter has to be converted into cDNA during the library preparation process, direct sequencing of RNA is not yet possible. Due to size limitations for library molecules, genomic DNA and often also mRNA is fragmented, which is usually done mechanically, e.g., by sonication or nebulization, or enzymatically. The fragment size of a library is critical and depends on the sequencing platform that is going to be used. The standard fragment size of Illumina libraries is between 300 and 550 bp including adapters. Longer fragments up to 800 bp can be sequenced if cluster density on the flow cell is reduced to prevent interference of library molecules during the sequencing process. The size of libraries prepared for 454 sequencing depends on the sequencing run conditions. To obtain long reads with a modal length of 700 bp, a size of approximately 1500 bp is recommended. Libraries prepared for sequencing on the small-scale 454 Junior instrument or for sequencing using the older FLX chemistry should be smaller (300-750 bp). Libraries that are sequenced on the Ion Torrent Personal Genome Machine (PGM) platform should never be longer than the requested read length.
Libraries are constructed by adding sequencing platformspecific DNA adapters to the DNA molecules. This enables binding of the library fragments to a surface, which is either a microbead (454, Ion PGM, SOLiD) or a glass slide (Illumina, SOLiD). Moreover, the adapters allow amplification of the library fragments by emulsion PCR (emPCR) or bridge PCR. When amplicons are sequenced, e.g., in microbial community analyses, adapters are often already added during PCR using fusion primer constructs.
Diverse library preparation kits are commercially available and even more protocols have been published that are adapted to the specific needs of research projects. During the last years, library preparation methods were streamlined to reduce costs and preparation time and to enable high throughput library preparation on automated systems (e.g., Adey et al., 2010;Caruccio, 2011;Neiman et al., 2012;Rohland and Reich, 2012;Langevin et al., 2013) Methods were also optimized to reduce potential bias, e.g., by excluding PCR amplification steps (Kozarewa et al., 2009;Adey et al., 2010;Mamanova and Turner, 2011;Oyola et al., 2012;Van Dijk et al., 2014). Another goal is the reduction of the amount of input material. This ranges from several micrograms down to hundreds of pictograms (e.g., Adey et al., 2010;Tariq et al., 2011;Parkinson et al., 2012;Bowman et al., 2013;Langevin et al., 2013). In microbial metagenomic studies, which often aim at indepth analysis of gene diversity, it is advisable to prepare libraries from microgram amounts of input material to cover as much of the diversity as possible and obtain high sequencing depth. It also has to be considered that library preparation from just a few nanograms of input material will require additional PCR steps to amplify the material, which is a potential source of bias.
Library construction using standard methods can easily be outsourced. If library preparation is done by oneself, care has to be taken that the generated libraries are compatible with the sequencing platform that is used for sequencing, as adapters were in some cases modified since the release of the first instruments. For instance, the sequencing of libraries that are constructed according to an Illumina GAIIx protocol is not necessarily fully supported on HiSeq or MiSeq instruments. Details should be discussed prior to the preparation of libraries with the sequence provider.

BARCODING OF LIBRARIES
At least one of the library adapters usually carries a library specific DNA sequence, often a 6-to 12-mer, referred to as barcode, molecular identifier (MID) or tag. This barcode enables the pooling of different libraries, which can then be further processed and sequenced within the same region of a picotiterplate (454), a lane of a flow cell (Illumina, SOLiD) or on a chip (Ion PGM). Barcoding allows sequencing of a complex set of libraries at rather low depth, which is of particular interest in large-scale ecological or biodiversity studies comprising many samples. In amplicon sequencing projects, a sample specific barcode is often already added during PCR amplification of the target genes to enable parallel sample processing at an early step. It should be noted that bias may be introduced when using complex fusion primers with adapters and different barcodes. This can be compensated to certain extent by using a two-step PCR procedure (Berry et al., 2011).
Several different barcode sets have been developed by hand or using software tools. They vary in length and account more and more strictly for different types of sequencing errors and sequencing platform specific needs to maximize data output (Faircloth and Glenn, 2012 and references therein;Kircher et al., 2012;Buschmann and Bystrykh, 2013;Costea et al., 2013). In some articles the use of a dual barcoding strategy is proposed for paired end sequencing in order to decrease sample misidentification rate or to decrease the number of individually tagged PCR primers Carlsen et al., 2012;Degnan and Ochman, 2012;Kircher et al., 2012;Kozich et al., 2013).

LIBRARY AMPLIFICATION BY EMULSION PCR OR BRIDGE PCR
PCR amplification of the library molecules is required to increase signal intensity for the sequencing process. Amplification has to occur spatially separated for the individual library fragments on microbeads (454, PGM, SOLiD) via emPCR or on a glass surface FIGURE 1 | Schematic presentation of the library preparation and sequencing process of the most commonly used next generation sequencing platforms. All different types of starting molecules are converted into doublestranded DNA molecules that are flanked by adapters. Adapters are sequencing platform specific and enable the binding of the library molecules to surfaces, either beads or a flow cell, where they are amplified prior to sequencing. Clonal amplicons are spatially separated on the glass slides, chips, or picotiterplate.
Sequencing is either a sequencing by ligation process with fluorescently labeled oligonucleotides of known sequence (SOLiD) or a sequencing by synthesis process. During Illumina sequencing, four differently labeled nucleotides are flushed over the flow cell in multiple cycles, depending on the desired read length. During 454 and Ion PGM sequencing unlabeled nucleotides are flushed in a sequential order over the flow cell. Incorporation is detected via a coupled light reaction (454) or the detection of proton release during nucleotide incorporation.
(Illumina, SOLiD) via bridge PCR. Hybridization of the library fragments to the surfaces occurs via the adapters to surface-bound oligonucleotides. In the bead based method, each bead obtains only a single library molecule. The beads are spatially separated from each other during emPCR in individual water droplets in a water-oil emulsion. Beads with successfully amplified fragments are enriched and deposited in a picotiterplate (454), a semiconductor chip (Ion PGM) or hybridized to a glass surface (SOLiD) for sequencing. When library molecules are directly hybridized to a glass surface, their density on the surface has to be sufficiently low to prevent interference of library molecules, even after fragment amplification via bridge PCR (Figure 1). Since the production and recovery of successfully templated beads from the water-oil emulsion during emPCR is time consuming, technically challenging and rather expensive, sequencing companies search for alternative methods to amplify library molecules. This has been realized in the recently released Wildfire technology for the SOLiD sequencer (SOLiD 5500 W) and is under development for Ion Torrent sequencers (Merriman et al., 2012).

THE SEQUENCING PROCESS
Sequencing is performed in a massively parallel manner for ten thousands to billions of library fragments. It occurs via repeated cycles of nucleotide addition by a DNA polymerase or ligase (SOLiD), detection of incorporated nucleotides and washing steps. Due to this iterative procedure including extensive washing steps, sequencing lasts several hours to days. In case of Illumina and SOLiD sequencing the four differently labeled nucleotides are flushed over the glass slide in parallel, while a sequential flooding of non-labeled native nucleotides occurs during 454 and Ion PGM sequencing. In the former case incorporation of nucleotides is detected based on specific fluorescent labels attached to the nucleotide, in the latter case products of the enzymatic nucleotide incorporation reaction are detected, i.e., proton or pyrophosphate release. While proton release can be directly measured as pH change by the semiconductor chip of the Ion Torrent instruments (Merriman et al., 2012), the pyrophosphate signal is further converted into a light signal via subsequent reactions including the enzyme luciferase (Ronaghi et al., 1998). The generation of a light signal has led to the term "pyrosequencing" for this technology.
The different strategies of adding nucleotides to the DNA template strand affect sequence read length. During Illumina and SOLiD sequencing, a blocking group at each of the (oligo-) nucleotides prevents the addition of more than one molecule, so that the sequence is increased by one (oligo-) nucleotide at each step and the full read length is determined by the number of sequencing cycles performed (Bentley et al., 2008). In contrast, 454 and Ion PGM sequencing result in sequence reads of variable length. Due to the fact that the four different nucleotides are applied in a specified sequential order, a variable number of nucleotides is incorporated after four cycles, depending on the sequence of the respective library molecules. Several nucleotides are incorporated within the same cycle if the DNA template strand shows a homopolymeric region. This comes along with a proportional increase in signal strength, so that signal intensity is used to calculate the number of incorporated nucleotides (Margulies et al., 2005).

SPECIFICATIONS OF THE DIFFERENT SEQUENCING PLATFORMS
Major progress has been made during the last years with regard to sequence read length and output (number of reads per run) by technically improving the instruments, the chemistry and basecalling algorithms. A compilation of current specifications as given in Table 2 is useful to assess and compare the potential of the different instruments. The presented data were taken from the websites of the sequence providers. It should be kept in mind that those data were generated under optimum conditions. The specifications may not be met when more difficult sampling material is sequenced, e.g., libraries with more extreme GC content or of sub-optimal fragment length.
The SOLiD and Illumina HiSeq sequencers generate the largest amount of data per run at the lowest costs per base. Soon Illumina HiSeq instuments will produce up to 1000 Gb per run. At the same time, these platforms generate the shortest reads. In particular the very short SOLiD sequence reads are mostly used for resequencing and transcriptomics projects, in which reads can be mapped to known genomes, but not frequently in de novo sequencing projects. Between 8 and 11 days are needed to perform a run with maximum data output on these instruments. Illumina has developed strategies during the last years to reduce run time, resulting in the upgrade of the HiSeq 2000 instrument to HiSeq 2500. The upgrade allows sequencing in rapid run mode, which produces a smaller amount of data (approximately 25-30% of data compared to a so-called "high-output" run) within hours to 2 days, depending on the desired read length. The upgrade came along with an increase in maximum read length from 100 to 150 bp in rapid run mode.
The Illumina MiSeq platform was launched in 2011. This platform produces 22-25 million reads with a maximum length of 300 bp when using the new V3 chemistry. The costs per sequenced base are higher compared to the HiSeq instrument. However, the longer read length in combination with the lower read number can be of particular interest for amplicon sequencing projects. It is also very suitable for small scale metagenomics projects or initial sample evaluation prior to deep sequencing on a HiSeq. The newest releases from Illumina are the NextSeq 500 platform, which performs at intermediate scale in terms of output, read length, and costs per base compared to HiSeq and Miseq, and the HiSeqX ten, a package of 10 HiSeq sequencers, which allow even higher throughput than the HiSeq2500 in shorter time.
The 454 sequencer was the first commercially available NGS instrument (since 2005). In comparison to Illumina and SOLiD platforms, it generates longer reads (modal read length 750 bp, average read length 700 bp) in a shorter run time (1 day) using FLX+ chemistry. The total output per run of this platform is clearly lower in terms of reads (1 million) and bases (700 Mb). The higher costs per base are a major reason why its use is meanwhile often replaced by the aforementioned platforms, in particular in projects in which coverage is more important than read length, as it is for instance the case in transcriptomics projects, some metagenomic projects or amplicon sequencing projects. Also Roche has released a smaller-scale benchtop sequencing instrument, the 454 GS Junior (available since 2009). This sequencer produces approximately 100,000 reads per run with a modal read length of 450 bp, comparable to the read length obtained with the FLX+ platform when run with FLX chemistry instead of FLX+ chemistry.
The Ion Torrent PGM sequencer is available on the market since the end of 2010. Sequencing on this platform is done using semiconductor chips of different scale, which allow to sequence between 0.4 and 5.5 million reads. Read length on this platform increased successively from approximately 100 bp to meanwhile 400 bp. Sequencing on Ion instruments is very fast, taking only a couple of hours. The Ion Proton is a larger-scale instrument that produces 10-fold more bases per run using the Ion PI chip. A  larger scale chip (Ion PII) is announced for this platform. In terms of sequencing costs per base, the Ion PGM ranges in between 454 and Illumina/SOLiD technologies.

PAIRED END SEQUENCING AND MATE PAIR LIBRARIES
Most sequencers allow sequencing of library fragments from both ends. A corresponding reverse read can be assigned to each individual forward read in Illumina and SOLiD paired end sequencing mode. Since the average size of the library molecules is known, the distance between forward and reverse read is also known. This information is very helpful when performing assembly or read mapping. Paired end reads can also be used to improve sequence quality of short amplicons when overlapping reads are generated. Paired end sequencing is also possible on the Ion Torrent instruments and protocols are available, but this sequencing mode is not yet officially supported by the company. Paired end sequencing can be done for library fragments of up to approximately 800 bp. However, in de novo sequencing projects read pairs spanning even larger distances are helpful to bridge longer repetitive regions (Mavromatis et al., 2012). Paired sequence reads spanning distances between 1.5 and 20 kb can be obtained from mate pair libraries. The construction principle of such libraries is shown in Figure 2. Mate pair libraries are sequenced in paired end run mode if available. On 454 instruments, mate pair libraries can also be sequenced; the reads will contain sequence information from both ends, separated by the linker sequence somewhere in the middle of the read.
The construction of mate-pair libraries is quite expensive not only monetarily, but also with regard to the amount of input material. Mate pair libraries spanning long distances need 15-20 μg of high molecular weight DNA of which most is lost during the enrichment step of the end-to-end ligated fragments. A certain percentage of library molecules will consist of molecules in which one of the two ends is only represented by a few nucleotides due to the random fragmentation process of the circularized molecules. Such short fragments cannot be assembled with certainty and are discarded. Moreover, the library construction procedure is not free of bias, which can negatively affect assembly, and the diversity of fragments can be rather low, in particular when the amount of input material is limited. When sequencing organisms with small genomes such as bacterial strains, a few hundred thousand reads are usually sufficient to cover the diversity of constructs present in a library. The use of sequencing platforms that produce long reads such as the PacBio instrument appears to become an interesting alternative to mate pair library sequencing.

SINGLE MOLECULE SEQUENCING
Despite the fact that single molecule sequencing approaches are mostly still under development, they have already been described in diverse review articles (e.g., Gupta, 2008;Xu et al., 2009;Schadt et al., 2010;Treffer and Deckert, 2010;Niedringhaus et al., 2011;Pareek et al., 2011;Zhang et al., 2011;Liu et al., 2012;Morey et al., 2013). Currently, the instrument from Pacific Biosciences is the only commercially available platform. Helicos Biosciences, the company that actually released the first single molecule sequencer, vanished from the market in 2012. The major goals that guide the development of single molecule sequencing platforms are longer read length, higher throughput, higher accuracy, faster turnaround time and lower costs per base (Schadt et al., 2010). It remains to be seen how well all these specifications can be met by one single instrument and which of the different systems currently under development will successfully establish on this highly competitive market.

SINGLE MOLECULE SEQUENCING WITH THE PacBio RS
The sequencing technology of the PacBio RS is described in detail in the above mentioned reviews about single molecule sequencing and in articles that introduce this sequencing system to the scientific community (Eid et al., 2009;Korlach et al., 2010). In brief, the principle of this single molecule real-time (SMRT) technology is to attach a DNA polymerase molecule on the bottom surface of a zero-mode waveguide detector (ZMW). The ZMW enables the detection of fluorescence of individual nucleotides that are incorporated by the polymerase into a single complementary DNA strand during the synthesis process. Each type of dNTP has a unique fluorescent label that is cleaved off during DNA synthesis. The ZMWs can be considered as densely arranged nano-chambers in a perforated metal film on top of a glass surface, enabling the parallelization of the sequencing process in 150,000 ZMWs within a SMRT cell (Levene et al., 2003). The ZMWs are scanned for fluorescent signals by a confocal imaging system, resulting in movies of up to 120 min or even 240 min in the near future that document the successive incorporation of nucleotides, from which the sequence is deduced. Nucleotide incorporation occurs continuously without intermittent washing steps, which accelerates sequencing substantially compared to second generation sequencing systems.
Initially, the DNA synthesis reaction could be monitored only in half of the ZMWs on the PacBio RS system at the same time, but a recent upgrade to RS II enables parallel recording of all ZMWs. However, not all ZMWs produce usable reads, so that the expected number of reads for a SMRT cell is approximately 50,000 for the RS II system. Currently, sequencing is done with the C2/P4 chemistry, but will soon be changed to C3/P5, which will support longer movies and thus the generation of longer reads. The mean read length of the instrument is around 8000 bases, probably increasing to 8500 bases with the new chemistry. A maximum read length of more than 20 kb was observed in different projects, reads of 16 kb are regularly obtained in runs with good quality libraries. In comparison to other sequencing platforms, read length and sequencing time are superior, while output per run is clearly lower and the costs per base are rather high. However, the costs for one SMRT cell are relatively low. These specifications suit in particular bacterial genome sequencing projects.
To improve sequence read quality, a circular consensus sequencing (CCS) strategy was developed. It is based on the fact that PacBio libraries have a circular molecule structure, referred to as SMRTbell template (Travers et al., 2010). These libraries are constructed by ligating hairpin loop adapters to the DNA fragments. The circular structure allows a continuous and repeated sequencing of sense and antisense strand, which can be used to generate single consensus reads with very high accuracy (>99%). The accuracy comes at the expense of read length, since the maximum recording time is limited. Thus, the length of the library molecules determine how often a strand is sequenced within the given time. The higher the desired accuracy of the reads the shorter the reads should be. It depends on the project whether high accuracy reads or longer reads are more valuable. In de novo genome sequencing projects the length of the reads is of higher relevance to support genome assembly. In contrast, high-accuracy single consensus sequencing can be useful in metagenomic and especially in amplicon sequencing projects, as higher accuracy prevents an overestimation of biological diversity due to sequencing errors.

FUTURE SINGLE MOLECULE SEQUENCING TECHNOLOGIES
Nucleotide identification of currently available sequencing platforms is mostly based on optical systems that detect incorporation of fluorescently labeled nucleotides or reaction products during DNA synthesis. Future sequencing methods aim at realtime label-free sequencing, e.g., by direct analysis of the DNA molecule using electron microscopic techniques, scanning tunneling microscopy and spectroscopy, or analysis by Raman spectroscopy. Nanopore sequencing is another strategy that has gained much attention and has already been addressed in a couple of reviews (Bayley, 2006;Branton et al., 2008;Xu et al., 2009;Timp et al., 2010;Maitra et al., 2012). The different nanopore sequencing strategies that are under development enable individual base detection based on the measurement of conductivity changes across a lipid membrane while a DNA fragment is pulled through a nano-scale pore by an electric current. Conductivity changes are nucleotide-specific, enabling the identification of nucleotides as they traverse the pore. Biological nanopores are either constructed from engineered proteins, e.g., α-hemolysin (originally from Staphylooccus aureus) or MspA (Mycobacterium smegmatis porin A), or are entirely synthetic, e.g., graphene (Schadt et al., 2010;Thompson and Milos, 2011;Maitra et al., 2012). One of the major challenges in nanopore sequencing is reliable signal detection of each individual nucleotide at the high speed at which the DNA molecule traverses the pore and against a background of stochastic alterations in translocation rate (Branton et al., 2008;Morey et al., 2013).
As single molecule sequencing technologies do no longer depend on a PCR amplification step for signal detection, they overcome any bias introduced during emPCR or bridge PCR as well as dephasing problems (see Section Error Accumulation toward the End of Reads) that result in signal decay, which largely limits read length of current second generation instruments. These advantages come along with a higher sequencing error rate in individual reads, as errors cannot be compensated by the consensus read-out of clonal molecules in a cluster or on a bead. Future improvements of the sequencing technologies and the generation of consensus sequences, as explained for the PacBio instrument, have the potential to compensate these errors.

SEQUENCING ERRORS ESTIMATED ERROR RATES OF SECOND GENERATION SEQUENCING PLATFORMS
In comparison to Sanger sequencing, NGS technologies are known for higher error rates and different types of errors in the generated sequence reads. A direct comparison of error rates from different sequencing platforms and studies is difficult due to differences with regard to the sequenced sample material, the library preparation method, data filtering, and error calculation methods, and the fact that reads of different length (not necessarily the maximum possible length of a platform) are analyzed. Nevertheless, some values are compiled and provided as Table  S1 for orientation. They are mostly in the range of 0.4-1% for Roche 454, Illumina and the Ion PGM platforms. Clear differences between these platform are not evident from the data. The quality of Ion PGM data, which is discussed quite controversially in the literature, is often slightly lower in direct comparison to Illumina and 454 platforms (Liu et al., 2012;Loman et al., 2012;Quail et al., 2012;Jünemann et al., 2013;Perkins et al., 2013). Read quality of HiSeq data was mostly reported to be slightly better compared to GAIIx data (Meacham et al., 2011;Minoche et al., 2011;Quail et al., 2012). The error profiles for the Illumins GA, HiSeq, and MiSeq instruments remain principally the same (Minoche et al., 2011;Quail et al., 2012). The quality of sequencing data from different 454 platforms appears to be similar. Likewise differences in dependence of the used chemistry or the analyzed library type (shotgun or amplicon) are not evident.
Substantial effort has been made to identify different types and sources of sequencing errors with the aim to reduce these either during the sequencing process or afterwards by applying improved analyses and correction algorithms. Some sequencing errors are observed on all sequencing platforms, while others are platform-specific. The following discussion about sequencing errors is largely focused on two sequencing platforms, 454 and Illumina, since error evaluation has been most intensively done for these platforms and these are the most frequently used platforms.

ERROR DISTRIBUTION WITHIN READS OF A LIBRARY
If the distribution of errors among 454 reads would be completely random, an error rate of 0.5% would mean that each read of 500 bp has on average 2.5 errors. But sequencing errors occur only in a certain percentage of reads; most studies report around 70% error-free reads (Huse et al., 2007;Niu et al., 2010;Prabakaran et al., 2011;Niklas et al., 2013). Huse et al. (2007) observed that many of the erroneous reads in an amplicon dataset were characterized by the simultaneous presence of ambiguous base calls and explained this with multitemplated beads that carry similar library fragments.
In Illumina datasets, an increasing number of errors is observed in a successively decreasing number of reads (Dohm et al., 2008;Hillier et al., 2008;Nguyen et al., 2011). The percentage or error free reads was reported to be 57% for the GAIIx platform and 76% for the MiSeq platform in two available reports (Hillier et al., 2008;Quail et al., 2012). During paired end sequencing, the forward read was usually of slightly better quality than the reverse read (Quail et al., 2008;Minoche et al., 2011).

TYPES OF SEQUENCING ERRORS AND THEIR FREQUENCY
Insertions are the most frequent type of error during 454 sequencing (e.g., Margulies et al., 2005;Prabakaran et al., 2011;Vandenbroucke et al., 2011;Skums et al., 2012;Niklas et al., 2013). Several studies have reported deletions to be the second-most frequent type of error, followed by substitution errors (Huse et al., 2007;Gilles et al., 2011;Schloss et al., 2011;Niklas et al., 2013). The majority of indel errors occurs in homopolymeric regions (Margulies et al., 2005;Huse et al., 2007;Rozera et al., 2009;Gilles et al., 2011;Shao et al., 2013). The longer the homopolymeric region, the higher the probability of an indel error and the lower the quality scores of the bases toward the end of this region Luo et al., 2012b;Skums et al., 2012;Niklas et al., 2013). Indel errors are explained by the underlying sequencing principle. The preciseness of the proportionality of the detected light signal decreases with increasing number of identical bases (Margulies et al., 2005). Due to an analogous sequencing principle, the Ion PGM sequencer shows a similar error profile, dominated by indel errors in homopolymeric regions and clearly less substitution errors (Loman et al., 2012;Merriman et al., 2012;Bragg et al., 2013).
In contrast, substitution errors are the most frequent error type in Illumina sequencing (Dohm et al., 2008;Hillier et al., 2008;Hoffmann et al., 2009;Minoche et al., 2011;Nguyen et al., 2011) and for SOLiD sequencers (Shendure and Ji, 2008;Ratan et al., 2013). For the Illumina platform, Nguyen et al. (2011) identified 79-88% of all errors as substitution errors. Hillier et al. (2008) reported a 3.7-fold higher substitution error rate than indel error rate. Deletions are more frequent than insertions and insertions are likely to occur in homopolymeric regions (Dohm et al., 2008;Minoche et al., 2011). The lower rate of indel errors compared to 454 sequencing is achieved by the terminal blocking strategy during the sequencing process, which allows the incorporation of only one base per sequencing cycle, so that a homopolymeric region is sequenced base by base.

ERROR ACCUMULATION TOWARD THE END OF READS
Sequencing errors accumulate toward the end of reads, along with decreasing quality of the called bases. This is well known for Illumina reads, but has also been reported for 454 and Ion PGM data (Campbell et al., 2008;Lind et al., 2010;Schröder et al., 2010;Huse and Welch, 2011;Schloss et al., 2011;Loman et al., 2012;Bragg et al., 2013;Perkins et al., 2013). This accumulation of errors is the result of a decreasing signal-to-noise ratio during the sequencing process, which largely determines the maximum read length of all sequencing platforms.
Errors in 454 reads occur more likely beyond base 200-300 under FLX run conditions on the FLX and the GS Junior platform (Campbell et al., 2008;Gilles et al., 2011;Schloss et al., 2011;Niklas et al., 2013). In particular substitutions and ambiguous base calls accumulate (Gilles et al., 2011). Such an error profile is the result of a loss of synchronism during the sequencing process on the multitemplated beads. Even though the basecalling software accounts for this artifact and reads are trimmed, it does not fully eliminate these effects (Margulies et al., 2005;Gilles et al., 2011). Another reason for a decreasing signal-to-noise ratio toward the end of a read is signal drooping due to premature termination of the sequencing process on templates. This was reported for Ion PGM sequencing (Merriman et al., 2012;Golan and Medvedev, 2013).
In Illumina reads, an accumulation of errors toward the end mainly affects long reads. It becomes obvious in the last third to fourth of 100 or 150 bp reads (Dohm et al., 2008;Claesson et al., 2010;Minoche et al., 2011;Nakamura et al., 2011;Liu et al., 2012). The result of this accumulation are lower overall quality values for longer reads. Also on Illumina platforms, the decreasing signal-to-noise ratio is largely a problem of signal dephasing during the sequencing process (Erlich et al., 2008;Kircher et al., 2009;Metzker, 2010;Schadt et al., 2010). Dephasing occurs when part of the clonal fragments in a cluster on the flow cell lag behind or are advanced compared to the overall sequencing procedure. The signal-to-noise ratio also decreases when the fluorescent label is not efficiently cleaved from the nucleotides added in the previous cycle (Dohm et al., 2008), and due to fluorescent dye decay during the sequencing process over several days (Kircher et al., 2009).

SEQUENCING ERROR CONTEXT DEPENDENCE
Substitution errors in Illumina reads were analyzed in more detail to identify possible error sources (Dohm et al., 2008;Meacham et al., 2011;Minoche et al., 2011;Nakamura et al., 2011;Nguyen et al., 2011;Abnizova et al., 2012;Luo et al., 2012b;Quail et al., 2012). Certain types of substitutions were found to occur more frequently than others and accumulate at specific positions. They are sequence context dependent, for instance after G-rich regions (Dohm et al., 2008;Minoche et al., 2011). Moreover, many substitution errors occur strand-specific, i.e., either predominantly in reads that cover a genomic region in forward direction or in those of reverse direction (Meacham et al., 2011;Nguyen et al., 2011). Such errors can be identified during data assembly or read mapping based on their strand-specificity and the fact that they are associated with low quality values for the respective erroneous base (Minoche et al., 2011). Abnizova et al. (2012) observed that the correct base was frequently detected with the second most intensive sequencing signal at erroneous positions, providing a possibility for correction. That errors tend to accumulate at specific positions within a genome was also observed for SOLiD data (Meacham et al., 2011).
PCR steps were identified as a major cause introducing GC bias (Hillier et al., 2008;Aird et al., 2011;Quail et al., 2012). Standard Illumina and Ion PGM library preparation protocols include a PCR amplification step prior to bridge PCR or emPCR. To reduce GC bias, PCR free protocols have been developed for Illumina library construction (Kozarewa et al., 2009;Mamanova and Turner, 2011) and have meanwhile also been implemented in dedicated Illumina kits. Since PCR-free library preparation methods are problematic when the available input material is limited, PCR protocols were also optimized, as well as other library preparation steps that may introduce such bias (Van Dijk et al., 2014). High cluster densities on the Illumina flow-cell were also discussed to suppress GC-rich reads (Aird et al., 2011). Error correction algorithms were developed and can be applied to account for GC-bias in projects where quantitative information is inferred from the sequencing data such as transcriptomic studies (Hansen et al., 2010;Li et al., 2010;Benjamini and Speed, 2012).

DUPLICATE READS
Another artifact that has been reported in particular for 454 sequencing data is the occurrence of duplicate reads in shotgun (meta-)genomic sequencing projects. These start at the same base position and, depending on the strictness of the definition, are fully identical or different in only few positions and/or read length. Such sequence reads can be true duplicates that arise when genomic DNA is sequenced at very high coverage, or they are artificial duplicates. The source of this type of error is not fully known. It was speculated that duplicates are generated during emPCR, when amplified DNA is attaching to empty beads (Briggs et al., 2007). However, emPCR is also used to amplify library fragments during Ion PGM sequencing, but duplicate reads appeared not to be a major problem in one study in which this issue was specifically assessed (Bragg et al., 2013).
The analysis of several metagenomic sequencing projects revealed between 10 and 45% of duplicate reads (Gomez-Alvarez et al., 2009;Niu et al., 2010;Balzer et al., 2013). Duplicate reads can affect quantitative data analyses, e.g., species or gene abundance analyses in metagenomic studies. To identify and remove duplicates, software tools such as cd-hit-454 (Niu et al., 2010) (Balzer et al., 2013) can be applied. Criteria that define artificial duplicates can be defined in such software tools. Nevertheless, some true duplicate reads may also be eliminated by these filters. The percentage of true duplicates among all identified duplicates can vary largely between 2 and 72% (Niu et al., 2010).

REPRODUCIBILITY ACROSS RUNS AND BETWEEN REGIONS OR LANES
The overall reproducibility between 454 runs and samples from different regions of the picotiter plate is usually high (Vandenbroucke et al., 2011;Niklas et al., 2013). However, variation in error rates, in particular for indel errors, was seen between different 454 sequencing runs (Gilles et al., 2011;Prabakaran et al., 2011;Shao et al., 2013). Variation in terms of read composition of a sample may also occur, as observed in a study in which the same 16S rRNA gene PCR products were sequenced at different sequencing centers and in different runs (Schloss et al., 2011). A similarity analysis of the datasets revealed a clustering according to sequencing centers and, to lesser extent, to runs.
For Illumina, some studies report variation between runs and from lane to lane, e.g. with regard to sequencing errors (He et al., 2010;Aird et al., 2011;Nguyen et al., 2011;Chen et al., 2013), but also in this case it seems not to be a consistent problem (Abnizova et al., 2012;Benjamini and Speed, 2012). Nguyen et al. (2011) reported that variation with regard to sequencing errors largely diminished after data quality filtering. Highly reproducible results were also obtained in a study by Caporaso et al. (2012) across lanes and even on different platforms (i.e., HiSeq 2000 and MiSeq), showing that cross-platform data handling is possible (Bokulich et al., 2013).
It will depend on the project whether possible variation in sequencing performance is acceptable or will negatively affect results and conclusions. It can be a relevant issue when highly similar samples are comparatively analyzed, e.g., in amplicon sequencing projects. To identify method related variation in such critical studies, the inclusion of a standardized reference sample is highly recommended (Schloss et al., 2011;Bokulich et al., 2013).

SEQUENCING ERRORS OF THE PacBio RS SYSTEM
Sequencing errors of PacBio single reads are reported in the range of 13-20% (Thompson and Milos, 2011;Quail et al., 2012) but this high error rate can be reduced to 1% or less by CCS (Metzker, 2010). Sequencing errors on the PacBio system are mostly insertions and deletions (Eid et al., 2009). During single molecule sequencing, dephasing is not an issue, so that errors do not accumulate toward the end of the reads. Moreover, sequencing errors appear not to be sequence context specific (Carneiro et al., 2012;Koren et al., 2012) contributing to the high consensus accuracy that can be achieved when sequencing is done with high coverage (>20-fold) or by using the CCS strategy. Good performance was reported in difficult to sequence regions and GC-rich samples, resulting in more even coverage (Quail et al., 2012;Ross et al., 2013;Shin et al., 2013).

COMPENSATING AND CORRECTING SEQUENCING ERRORS
Once the types and sources of sequencing errors are known, different strategies and tools can be developed to compensate and correct errors. As a general strategy, accuracy is improved by sequencing with high coverage, usually 20-to 60-fold, depending on the sequencing purpose (Margulies et al., 2005;Voelkerding et al., 2009;Luo et al., 2012b). Also, the combination of sequencing data generated from different sequencing platforms with different error profiles was suggested and has been applied to identify and eliminate sequencing errors (Nakamura et al., 2011;Koren et al., 2012). These strategies are effective in de novo genomic sequencing and resequencing projects, but they are of limited use in metagenomic or metatranscriptomic studies that deal with biological variation. Each different read can represent a distinct genotype in such studies or is the result of a sequencing error. Sophisticated methods are needed to distinguish between natural sequence variation and sequencing errors in order not to overestimate diversity.
One way to reduce error rates is to apply alternative basecallers that show superior performance compared to the standard basecalling algorithms (e.g., Ledergerber and Dessimoz, 2011;Das and Vikalo, 2013;Golan and Medvedev, 2013). However, their application is often limited, as it comes along with a transfer of massive amounts of raw signal data from the sequencing service center to the customer and the need for high computational power to perform basecalling, in particular for large Illumina datasets.
In order to improve data quality after basecalling, filtering algorithms were developed. Such filters discard reads with lowquality bases or with uncalled/ambiguous bases, or they clip the lower quality 3 -ends of reads. Many of these filters use the information contained in quality values that are calculated for each base during the base calling process. Minoche et al. (2011) studied the effect of different filtering methods on Illumina data and could reduce the error rate to <0.2% by eliminating approximately 15-20% of the low-quality bases, mostly via 3 -end trimming. Nguyen et al. (2011) reported a 5-fold decrease of the error rate by applying a filter that eliminated reads with low quality bases (<Q30; i.e., with 0.1% likelihood of a false basecall), which resulted in a loss of 24-35% of sequence reads. It has to be kept in mind that low quality bases are to certain extent localized in specific regions of a genome. Discarding such reads can result in a more uneven coverage, introducing potential bias in quantitative studies (Minoche et al., 2011;Nakamura et al., 2011).
An alternative strategy to read clipping and exclusion of low quality reads is error correction. Several tools (e.g., Coral, HiTEC, Musket, Quake, RACER, Reptile, or SHREC) have been developed for this purpose, in particular for the correction of substitution errors in Illumina data (Ilie and Molnar, 2013;Liu et al., 2013;Yang et al., 2013). Some of these tools (Coral, HSHREC, KEC, and ET) have implemented indel correction algorithms and are thus suited for the analysis of 454 and Ion PGM data (Salmela, 2010;Salmela and Schröder, 2011;Skums et al., 2012). Error correction methods make use of the high sequence coverage in order to identify and correct errors. Moreover, most algorithms take into account the quality scores given for the individual bases and/or analyze the neighboring contextual sequence information. The application of error correction tools has been proven useful in de novo genome sequencing projects, resequencing and amplicon sequencing projects (e.g., Skums et al., 2012;Yang et al., 2013). At the same time, Yang et al. (2013) pointed out a need for improved algorithms, in particular for non-uniform data sets, such as metagenomic or (meta-)transcriptomic data. A strategy that can be applied in metagenomics studies to correct sequencing errors is the generation of overlapping paired end reads that are assembled prior to further analyses (Zhou et al., 2011;Masella et al., 2012;Eren et al., 2013).

METAGENOMIC SEQUENCING OF THE PLANT ASSOCIATED MICROBIOTA SEQUENCING AND ANALYSIS STRATEGIES FOR METAGENOMICS STUDIES
The optimal sequencing strategy for a metagenomics project will largely depend on the aim of the project. For a functional description of a microbial community, the Illumina HiSeq sequencing platform will be a good choice due to the low costs per sequenced base, which allows sequencing to high depth in order to gain as much information as possible, even from less-abundant microorganisms that may nevertheless play important roles for ecosystem functioning. Initially, the rather short read length of this platform was considered to be a critical issue (Wommack et al., 2008), but it appears that this is not necessarily a problem. A comparative study of a metagenomic analysis based on 454 and Illumina reads revealed that assembled data derived from both methods reflected the genomic composition of the sample equally well, with the Illumina dataset showing even a slightly better assembly result (using a 5-fold higher volume of data) (Luo et al., 2012b). Annotation of unassembled reads was slightly better for the longer 454 reads. In general, short reads will not allow the generation of a high number of large contigs, in particular for complex samples. As an example, assembly success for a metagenomic sample from the soybean phyllosphere microbiota, which showed medium complexity, was only moderate. The assembly of approximately 1 mio 454 reads with a mean read length of 235 bp resulted in 140,000 contigs with a mean length of 276 bp and left 30% of the reads unassembled. The largest contig had a length of 12,888 bp (Delmotte et al., 2009). In another study with datasets from complex freshwater microbial communities between 50 and 60% of 454 and Illumina reads remained unassembled (Luo et al., 2012b). Despite this moderate success, gene prediction or identification of protein domains is possible. This is even the case for unassembled short reads, though it becomes more difficult when no close homolog is present in the reference database (Scholz et al., 2012;Luo and Moran, 2013). Moreover, annotation of several million unassembled short reads can become a very time-consuming step, depending on the algorithm that is used.
An alternative to assembly and/or direct annotation of short sequence reads is the mapping of reads to existing genomes. The prerequisite for this strategy is that the genomes of the organisms of interest have been genome sequenced. This is currently still a limiting factor (Weinstock, 2011), although the entries in public databases are much more strongly growing since NGS technologies became available. Currently, there are nearly 3000 complete genome sequences of microorganisms deposited in the NCBI database and genomic information of approximately 16,000 microorganisms is available as scaffolds or contigs. It can be a very valuable step to enrich, isolate and sequence the dominant community members, as it is for instance done in the Human Microbiome Project (Turnbaugh et al., 2007), or was already done for 21 bacterial isolates from the Populus rhizosphere . Such attempts will be of value for diverse studies of plant associated microorganisms, as the plant associated microbiota appears to show certain degree of consistency in terms of colonizing taxa (Bulgarelli et al., 2012;Lundberg et al., 2012;Vorholt, 2012), so that stains sequenced in one study may support data analysis of another study using plants grown under different conditions or even different model plants. Thus, the generation of further individual genome sequences will improve data analysis of future metagenomics, metatransriptomics, and metaproteomics studies of plant-associated microorganisms.
As several microbial taxa remain unculturable, some metagenomic studies aim at the reconstruction of individual genomes to obtain information from these organisms. In such studies sequence read assembly is a key step and challenging due to the complexity and uneven composition of microbial communities (Scholz et al., 2012). Assembly will be most successful if the complexity of the microbial community is rather low and dominated by one or a few phylogenetically distinct bacterial taxa. Different studies have meanwhile demonstrated that genome reconstruction of individual members in metagenomic samples is possible, even when rather short Illumina reads are generated (Mackelprang et al., 2011;Albertsen et al., 2013).
Assembly success also depends on sequence read length and the coverage with which the genome(s) of interest are sequenced (Kunin et al., 2008;Schatz et al., 2010;Weinstock, 2011;Luo et al., 2012a); parameters that can be considered in the design of the sequencing strategy. In an in silico study, Luo et al. (2012a) demonstrated that a 20-fold coverage was sufficient to reconstruct the genome of a dominant member in a metagenomic sample and that a higher coverage did not substantially improve the assembly result. Strategies that are frequently applied in pure culture genome sequencing projects to improve assembly are the inclusion of longer reads, paired end reads or reads from mate pair libraries (Schatz et al., 2010). This strategy can also be useful in metagenomic sequencing projects. The combination of sequencing data from different platforms that generate reads of different lengths and with different error profiles was reported multiple times as a successful strategy to improve genome assembly of individual bacterial strains (Aury et al., 2008;Reinhardt et al., 2009;Koren et al., 2012). In particular the PacBio instrument holds potential to fulfill the need for long reads in order to bridge larger gaps or repetitive regions (English et al., 2012;Mavromatis et al., 2012). These strategies have not yet been widely applied in metagenomics projects, but it appears likely that they are of value (Niedringhaus et al., 2011).
Assemblies may also be improved by using new assembly strategies, e.g., a nested strategy, in which the short reads are assembled to longer reads in a first step, before those are further assembled. The in silico generation of Sanger-like reads from Illumina reads by filling the gaps between paired end reads can be done by searching for reads within the same library that fill the gap between a read pair or by constructing paired end libraries of successively decreasing insert length, which are searched for suitable paired end reads to close the gaps between those paired end reads that are contained in the library with the largest library molecules (Rodrigue et al., 2010;Nadalin et al., 2012;Ruan et al., 2013). This strategy may be of particular help to fill small gaps, i.e., of a distance smaller than the size of the largest library molecules, but will not help to bridge repetitive regions that are larger than the largest library molecules.

BIOINFORMATICS TOOLS FOR METAGENOMIC DATA ANALYSIS
The massive amount of sequence data that are generated in metagenomic projects demand new and efficient computational methods for data processing, analysis, and storage (Pop and Salzberg, 2008;Tautz et al., 2010). Substantial progress has been made in this field, as evident from the many different tools that are meanwhile available, e.g., for sequence read assembly, read mapping, or gene prediction (for an overview of available tools see for instance Voelkerding et al., 2009;Guazzaroni and Ferrer, 2011;Zhang et al., 2011;Thomas et al., 2012). New tools become available that are specifically designed for the analysis of metagenomic data, including assemblers such as MetaVelvet or Meta-IDBA (Peng et al., 2011;Namiki et al., 2012), annotation tools such as MG-RAST or CAMERA (Glass et al., 2010;Sun et al., 2011), tools for read mapping and alignment and for further data analysis, e.g., taxon identification and analysis of the microbial community composition based on phylogenetic marker genes (e.g., Stark et al., 2010;Scholz et al., 2012;Sunagawa et al., 2013). It would go beyond the scope of this review to discuss the diverse options for the analysis of metagenomic data along with the available software tools. Several recent reviews have addressed this aspect in detail (Kunin et al., 2008;De Filippo et al., 2012;Hunter et al., 2012;Logares et al., 2012;Scholz et al., 2012;Teeling and Glöckner, 2012;Davenport and Tümmler, 2013;Kim et al., 2013;Preheim et al., 2013;Segata et al., 2013).
Not only powerful software tools are required for the analysis of NGS data, but also high-performance computing capacity, in particular for large metagenomics datasets. This may pose a problem to research laboratories that are not specialized on NGS data analysis. Cloud computing, i.e., the rental of processing time on a computer cluster on demand over a network, is discussed and developing as a possible solution to this problem (Angiuoli et al., 2011;Wilke et al., 2011;Zhang et al., 2011;Dai et al., 2012;Nagasaki et al., 2013), though it has to be considered that this is often not free of costs and may pose security issues related to data transfer (Angiuoli et al., 2011;Hunter et al., 2012).

TARGETED GENE SEQUENCING OF AMPLICONS FROM METAGENOMIC DNA SELECTING THE APPROPRIATE SEQUENCING STRATEGY FOR AMPLICON SEQUENCING
Targeted sequencing approaches of metagenomic DNA are mostly applied to identify the members of microbial communities or to compare their composition in different samples. Diversity studies are usually based on the 16S rRNA gene as bacterial marker and 18S rRNA or ITS as fungal markers (Table S2), while functional marker genes are analyzed when microorganisms with specific metabolic functions such as chitin degradation are addressed (Cretoiu et al., 2012). Until now the fast majority of amplicon sequencing studies have been performed using 454 technology (Table S2), mostly due to the fact that this was the first available NGS platform and due to the relatively long reads, that can be obtained from this platform. However, a shift toward the Illumina platform is currently noticable. First studies were already performed on the GAIIx platform with 76 bp paired end reads and later on with longer paired end reads up to 150 bp, followed by analysis on the HiSeq instrument and recently also on the MiSeq platform (Claesson et al., 2010;e.g., Gloor et al., 2010;Hummelen et al., 2010;Caporaso et al., 2011Caporaso et al., , 2012Jogler et al., 2011;Degnan and Ochman, 2012;Kozich et al., 2013;Bokulich et al., 2014). The generation of overlapping paired end reads is recommended on these platforms as it will help to minimize the error rate (Eren et al., 2013;Kozich et al., 2013). As outlined above, errors accumulate toward the end of the reads, so that they can be corrected if consensus reads are generated from the read pairs. In particular the MiSeq instrument is a suitable platform for such studies, as it produces reads with a length comparable to those of the first 454 instruments, but at much lower costs. The read number obtained from MiSeq runs will in many cases be sufficient to obtain a sequencing depth that allows to answer a research question. In a few studies, the Ion Torrent PGM was used to analyze bacterial or fungal communities based on reads with a length of approximately 100 or 200 bp (Whiteley et al., 2012;Kemler et al., 2013). Longer reads are meanwhile possible on this sequencer and a protocol for paired end sequencing is available (though not yet officially supported by the company), so that this platform can be an alternative to the previously mentioned systems for amplicon sequencing.
The taxonomic resolution that is achieved with reads from these sequencers is clearly lower compared to Sanger reads. Nearly full length 16S rRNA gene sequences were Gold standard for clone library analysis based on Sanger reads and have led to the comprehensive sequence databases we have today. They enable species differentiation and often even the distinction of different strains. In contrast, the short NGS reads provide a resolution at maximum down to genus level. It turned out that this is frequently sufficient, in particular if the method is used for comparative purposes and microbial communities in the samples of interest do not contain many closely related species. Compared to clone library analysis, DGGE or T-RFLP, NGS amplicon sequencing allows analysis at greater depth so that many more low-abundant taxa can be detected. Thus, despite the lower taxonomic resolution, sensitivity of the method is reached here due to sequencing depth. It is up to the researcher to decide which information, resolution of taxa or sequencing depth will be more important for a project.
In case taxon resolution is important, sequence information of longer reads is needed, and the Roche 454 sequencer is a better choice. With the latest software update to version 2.9, amplicon sequencing is supported under FLX+ run conditions. Under these conditions, 16S rRNA gene and ITS sequence reads with a mean length of 650 and 750 bp were obtained (Perazzolli et al., 2014). Even longer amplicons can be sequenced when using the PacBio RS platform. A recent study demonstrated the feasibility of amplicon sequencing for community analysis on this platform (Marshall et al., 2012;Fichot and Norman, 2013), although another study reported higher error rates for PacBio amplicon sequence reads compared to 454 reads of equal length, despite that fact that the CCS strategy was used (Mosher et al., 2013). Rather short movies of only 45 min were recorded in that study. By increasing the recording time higher quality sequences can be obtained. The current release of new sequencing chemistry and future improvements will enable the generation of higher quality sequences that will probably allow resolution even below genus level.

SEQUENCE READ ANALYSIS OF AMPLICON DATA
Diverse tools have been developed specifically for the analysis of amplicon data derived from metagenomic DNA, in particular for 454 data. This is largely due to the fact that many projects aim at an estimation of the microbial diversity within samples and along with this the indispensable need to differentiate between true diversity and sequencing errors (Sogin et al., 2006;Quince et al., 2009;. The fact that amplicon sequencing on NGS platforms is more and more widely applied has expedited the development of specific data analysis tools. Based on the initial findings of Huse et al. (2007), who reported an accumulation of errors within a rather small subset of 454 reads, it became common to discard reads with one or more errors in the index and the target gene specific primer region. Likewise, reads with ambiguous basecalls (Ns), of unexpected length, with low quality scores or those that cannot be aligned to the gene of interest are assumed to be unspecific PCR products and are often removed (Huse et al., 2007(Huse et al., , 2010Huse and Welch, 2011;Schloss et al., 2011;Zhou et al., 2011). Read trimming based on quality scores has also been applied to improve quality of 454 and Illumina data Caporaso et al., 2011;Schloss et al., 2011;Bokulich et al., 2013). In some studies singletons, i.e., sequence reads that occur only once, are removed from the datasets to further reduce the error rate (Caporaso et al., 2011;Shade et al., 2013).
Besides this quality filtering, specific algorithms are applied to improve quality. These aim at the correction of errors and the selection of representative sequence reads (=denoising), so that the number of reads or bases is not further decreased. The methods are based on the assumption that erroneous reads are representatives of more abundant error-free reads. Representative error free reads are identified and selected based on comparative sequence analysis, e.g., in the single-linkage preclustering (SLP) approach of Huse et al. (2010) or by the Pyrotagger tool . Denoising algorithms such as PyroNoise, its successor AmpliconNoise or the DeNoiser analyze 454 flow grams (Quince et al., 2009;Reeder and Knight, 2010;Quince et al., 2011). The latter two algorithms have been reported to be very efficient, but demand much computational power, which has limited their application Bragg et al., 2012). The SeqNoise algorithm, implemented in the software package Mothur, is less computationally demanding and therefore more often used. In comparative studies, the AmpliconNoise algorithm performed very well for OTU estimation Bragg et al., 2012;Gaspar and Thomas, 2013). Critical analyses of different denoising tools demonstrated that parameters have to be chosen very carefully in order not to introduce bias by read modification during the generation of representative consensus reads. Default settings did not necessarily provide the best results (Bragg et al., 2012;Gaspar and Thomas, 2013).
The identification and elimination of chimeric sequences is another type of error that needs to be accounted for. Chimeric sequences originate during PCR and have been reported to contribute between 5 and 45% of a PCR product (Lahr and Katz, 2009;Haas et al., 2011). Available algorithms to eliminate these artifacts are Perseus, which was developed together with AmpliconNoise , ChimeraSlayer , or UCHIME (Edgar et al., 2011). While ChimeraSlayer needs a chimera-free reference database for chimera detection, Perseus is used without reference database. UCHIME offers both options and was reported to be faster compared to the other two methods (Edgar et al., 2011). UCHIME performed best in a comparative study when a reference database was used. Without reference database, UCHIME and Perseus performed equally well (Schloss et al., 2011). Considering that the use of databaseindependent methods is not limited by the quality and diversity of data in the reference database, database-free methods may be preferred.
Not all tools can be applied to Illumina datasets, for instance denoising algorithms that use 454 flow grams as input data. Moreover, some tools are computationally too demanding to be used for large Illumina datasets. A specific quality filtering approach for Illumina data was recently described using the "Quantitative Insights Into Microbial Ecology" (QIIME) toolkit (Bokulich et al., 2013). Other packages that combine the above mentioned analysis steps for error reduction with further analyses such as OTU clustering, taxonomy assignment or multiple sample comparison, are Mothur or the UPARSE pipeline (Caporaso et al., 2010;Schloss et al., 2011;Edgar, 2013).

APPLICATION OF NGS TECHNOLOGIES IN PRESENT STUDIES OF PLANT ASSOCIATED MICROORGANISMS SHOTGUN METAGENOMIC STUDIES
Until today, only a limited number of shotgun metagenomic studies of plant associated microorganisms exist ( Table 3). Most of the studies are based on Roche 454 sequencing technology and generated a few hundred Mb of sequence data. In a very recent study of Mendes et al. (2014) the epiphytic rhizosphere microbiome of soybean was compared to that in bulk soil with regard to taxonomic and functional composition. A specific rhizosphere microbiota was observed, representing a subset of the taxonomic and functional diversity present in bulk soil. Moreover, functions that may be of benefit for the plant in terms of growth promotion and nutrition were detected, likewise as in a study of Sessitsch et al. (2012), who performed the first extensive metagenomic study of plant associated microorganisms, still using Sanger sequencing technology. In two other rhizosphere studies, the genomic basis for phosphorous acquisition was addressed. Unno and Shinano (2013) analyzed the rhizosphere metagenome of plants that showed enhanced growth in the presence of phytic acid and detected genes encoding enzymes related to phytic acid utilization such as alkaline phosphatase or citrate synthase. Chhabra et al. (2013) applied a targeted metagenomic approach by constructing a fosmid library in Escherichia coli, which was screened in an assay for mineral phosphate solubilization activity. Six positive clones were shotgun sequenced using 454 technology. Genes and operons with homology to phosphorous uptake systems, regulatory, and solubilization mechanisms were identified. Metagenomic data of phyllosphere associated microbial communities are available from soybean, rice, clover, Arabidopsis thaliana, Tamarix, and tomato (Delmotte et al., 2009;Atamna-Ismaeel et al., 2012a;Knief et al., 2012;Ottesen et al., 2013b). Some of these datasets were analyzed in combination with metaproteomic data obtained from the same sampling material (Delmotte et al., 2009;Knief et al., 2012). These analyses revealed high consistency in the metaproteomes of phyllosphere bacteria from different plant species. In agreement, microbial community composition as inferred from these phyllosphere metagenomic datasets revealed consistency in microbial community composition at phylum level (Vorholt, 2012). Comparative analyses of metagenomic and metaproteomic data of rice phyllosphere and rhizosphere samples revealed a higher complexity of the rhizosphere microbiota and a clearly distinct metagenomic and -proteomic composition (Knief et al., 2012). The phyllosphere metagenomic datasets generated in these studies were further used in combination with a metagenomic dataset from Tamarix associated phyllosphere bacteria to screen for photosynthetic genes that are known from other microorganisms to be involved in light-driven energy generation (Atamna-Ismaeel et al., 2012a,b).
Another kind of metagenomic project was performed with the aim to obtain a complete sequence of an unculturable plant pathogen, "Candidatus Liberibacter asiaticus," which causes citrus huanglongbing (Duan et al., 2009). This pathogen is transmitted by phloem-feeding insects. Metagenomic DNA was extracted from a single Asian citrus psyllid and not from an infected plant, due to the fact that the natural enrichment of the target organism is higher in the insect. Extracted DNA was subjected to multiple displacement amplification prior to sequencing using 454 technology. Sequence read assembly resulted in 38 contigs for "Candidatus L. asiaticus," which were identified by PCR confirmation reactions from a total of 1475 generated contigs. Gap closure was achieved by sequencing gap bridging PCR products. Genome analysis revealed a heavily reduced genome of this highly divergent member of the family Rhizobiacea, as it is seen frequently for microorganisms with a predominantly intracellular lifestyle.

AMPLICON SEQUENCING STUDIES
NGS technologies are increasingly often used for amplicon sequencing of bacterial and fungal marker genes in order to characterize the communities in the phyllosphere and rhizosphere. There are more than 100 rhizosphere and at least 37 phyllosphere articles published until now that have used these techniques (see Supplementary Material for a compilation of studies). The fast majority of these studies applied Roche 454 sequencing technology. Only few used the Ion PGM platform (Kavamura et al., 2013;Kemler et al., 2013;Yergeau et al., 2014) or the Illumina MiSeq . A detailed look at the phyllosphere studies (Table S2) reveals that the generated read numbers in amplicon studies are mostly in a range from a few thousand to ten thousand reads per sample (Table S2). The obtained read length increased successively over time, along with the development of the Roche 454 sequencing platform. With the 454 FLX+ instrument a mean read length of 750 bp was recently obtained for 16S rRNA gene amplicons (Perazzolli et al., 2014).
NGS amplicon sequencing was so far almost exclusively applied for the analysis of bacterial or fungal communities. Bacterial phyllosphere communities were studied based on the 16S rRNA gene without a preference for one specific region within this gene (Table S2). Fungal communities were mostly analyzed based on the ITS region. The only functional marker gene that has been studied so far in plant associated microorganisms via amplicon sequencing is chiA, encoding a chitinase (Cretoiu et al., 2012). The aim of that particular study was an assessment of chiA gene diversity in different habitats, including rhizosphere samples from two arctic plant species. Analysis revealed that the rhizosphere of Oxyria digyna was among the samples with the highest chiA diversity.
Most amplicon sequencing studies in the phyllosphere were performed to describe and understand plant colonization by microorganisms. In particular biogeographic patterns, the role of the plant taxon for shaping communities and the temporal succession of the microbiota were addressed (e.g., Redford et al., 2010;Rastogi et al., 2012;Bokulich et al., 2014;Maignien et al., 2014). Also differences in the colonization of different plant compartments were analyzed (Bodenhausen et al., 2013;Ottesen et al., 2013b). The impact of specific treatments during plant cultivation such as irrigation were also addressed in some studies .
Amplicon sequencing projects performed in the rhizosphere addressed basically the same questions, i.e., aspects of biogeographical dispersal of rhizosphere microorganisms, or the impact of factors such as season, host plant species, soil type, or plant growth conditions (Gottel et al., 2011;Lundberg et al., 2012;Navarrete et al., 2013;Peiffer et al., 2013;Zhang et al., 2013). A major additional focus of rhizosphere studies is the analysis of endo-and ectomycorrhiza (Lumini et al., 2010;Dumbrell et al., 2011;Yu et al., 2012). It has become clear that the plant plays a significant role in shaping the associated microbiota and that root exudates are involved in this process (Badri et al., 2013), but to better understand how plants affect this process, plant mutant strains altered in root exudation or, in case of the phyllosphere with altered leaf surface properties, were analyzed (Badri et al., 2009;Reisberg et al., 2013). Furthermore, aspects of bioremediation, disease suppressiveness or possible impacts of herbicide application or of genetically modified plants have been addressed in rhizosphere studies (Barriuso et al., 2010;Rosenzweig et al., 2012;Dohrmann et al., 2013;Bell et al., 2014). All these exemplarily selected publications demonstrate the usefulness of NGS amplicon sequencing projects for studying microbial plant colonization. Future studies in this field will lead to an even better understanding of the factors that determine microbial plant colonization.

TRANSCRIPTOMIC AND METATRANSCRIPTOMIC STUDIES
NGS technologies have not only stimulated research in the field of (meta-)genomics, but are also excellent tools to perform (meta-)transcriptomic analyses. The appearance of these technologies has boosted transcriptomic studies of plant associated microorganisms, until now in particular of pathogenic fungi (e.g., Tremblay et al., 2012;Weßling et al., 2012;Thakur et al., 2013). Both, Illumina and 454 technology have been used in such studies. NGS is of particular advantage when the organisms of interest have not been genome sequenced, which is a prerequisite for the alternative microarray analyses. In some studies, the transcriptome of the host and the pathogen were even analyzed in parallel (e.g., Fernandez et al., 2012;Zhuang et al., 2012). The success of such parallel analyses depends on the ratio of plant to fungal mRNA in the sequenced sample.
First metatranscriptomic studies of the whole plant associated microbial communities appeared just recently. Chaparro et al. (2014) analyzed the microbial metatranscriptome of the Arabidopsis thaliana rhizosphere at different plant development stages. They observed that microbial genes involved in metabolism of carbohydrates, amino acids and secondary metabolites changed over time in correspondence to root exudate patterns, which also changed over time. Yergeau et al. (2014) compared the microbial metatranscriptomic composition in the rhizosphere of willow with that in bulk soil in soils contaminated with organic pollutants. Different genes involved in hydrocarbon degradation were expressed in rhizosphere and bulk soil microbial communities. Genes related to carbon and aminoacid uptake and utilization were in general up-regulated in the rhizosphere.
Instead of an mRNA analysis, Turner et al. (2013) performed rRNA sequencing to characterize the active microbiota in the rhizosphere of different crops (wheat, oat, pea). Analyzing microbial communities based on rRNA instead of their rRNA genes is assumed to reflect the physiologically active microbiota in a sample and does not necessarily need extensive PCR amplification of the target molecules prior to library preparation, as demonstrated in that study. Clear differences were observed in the composition of the active prokaryotic and eukaryotic communities compared to bulk soil samples and between the different plant species. A strong response in the fungal community to plant produced anti-fungal avenacins was observed in the rhizosphere.

APPLICATION OF NGS TECHNOLOGIES IN FUTURE METAGENOMICS STUDIES WILL ADVANCE UNDERSTANDING IN PLANT-MICROBE ASSOCIATIONS
With the availability of second generation sequencing platforms many of the limitations metagenomic studies had to deal with at the time when Sanger sequencing was the predominant technology have been overcome. In particular the preparation of metagenomic/sequencing libraries can be done much faster and the sequencing costs per base are drastically reduced. The new technologies allow much deeper sequencing of microbial communities, providing more information about identity and physiological potential of microbial communities associated to plants.
Limitations of NGS approaches such as shorter reads and higher sequencing error rates can be largely compensated by using specifically designed sequence data analyses methods. Future developments of the sequencing technology will enable us to obtain even more and longer reads; the generation of sequence information will thus most likely not be a limiting factor in future studies, but enable to address the open questions in phyllosphere and rhizosphere research, as outlined in the introduction, in even more detail.
A current limitation of metagenomic sequencing studies is a high ratio of sequences that represent unknown genes of known or unknown organisms, and of sequences for which no homolog is found in public databases that would enable to infer further information. To improve the still challenging task of linking genes and thus function to phylogeny, genomic sequencing of representative pure cultures and the genetic and physiological characterization of strains will remain an important task. Genome sequencing projects of strain collections from the ecosystems of interest are one step further to overcome this limitation (Turnbaugh et al., 2007;Brown et al., 2012). Concerted sequencing of currently underrepresented organisms in databases, e.g., based on evolutionary relationship as in the GEBA project, will further improve databases (Wu et al., 2009). Likewise, advance in single cell genome sequencing has recently enabled the sequencing of yet uncultivated microorganisms; 200 bacterial and archaeal cells representing diverse largely uncharacterized phyla were successfully sequenced (Rinke et al., 2013). This genomic information will enable a more specific assignment of metagenome reads to taxa. (Meta-)transcriptomic and -proteomic studies based on known and well characterized representative model organisms under controlled conditions will contribute to a deeper understanding of microbial life in the phyllosphere.
The complementation of metagenomics data with metatranscriptomic, metaproteomic, and (meta-)metabolomic data will be one of the future goals to obtain a more complete view of the activities and the physiological potential of plant associated microbial communities under given conditions at systems level (Zhang et al., 2010;Knief et al., 2011;Segata et al., 2013). Such information is inevitable to build up models that can explain and predict microbially mediated processes and interactions in the phyllosphere and rhizosphere under different environmental conditions, including agricultural practices, responses to pathogen attack and disease, or to climate change.