Edited by: Roger Deal, Emory University, United States
Reviewed by: Cory Hirsch, University of Minnesota Twin Cities, United States; Margaret L. Worthington, University of Arkansas, United States
†Present address: Shanshan Yang, Virginia G. Piper Center for Personalized Diagnostics, Biodesign Institute, Arizona State University, Tempe, AZ, United States
This article was submitted to , a section of the journal Frontiers in Plant Science
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Amplicon sequencing (AmpSeq) is a practical, intuitive strategy with a semi-automated computational pipeline for analysis of highly multiplexed PCR-derived sequences. This genotyping platform is particularly cost-effective when multiplexing 96 or more samples with a few amplicons up to thousands of amplicons. Amplicons can target from a single nucleotide to the upper limit of the sequencing platform. The flexibility of AmpSeq’s wet lab methods make it a tool of broad interest for diverse species, and AmpSeq excels in flexibility, high-throughput, low-cost, accuracy, and semi-automated analysis. Here we provide an open science framework procedure to output data out of an AmpSeq project, with an emphasis on the bioinformatics pipeline to generate SNPs, haplotypes and presence/absence variants in a set of diverse genotypes. Open-access tutorial datasets with actual data and a containerization open source software instance are provided to enable training in each of these genotyping applications. The pipelines presented here should be applicable to the analysis of various target-enriched (e.g., amplicon or sequence capture) Illumina sequence data.
A diverse array of genotyping platforms and molecular marker types have been utilized for various applications in plant species, including both nuclear and cytoplasmic loci, such as simple sequence repeats (SSR) and single-locus or multi-locus single nucleotide polymorphism (SNP) assays. SSRs are particularly well-suited for genetic mapping and comparative genomics because of their multi-allelic nature and high transferability among distinct species or genera (
Next-generation sequencing (NGS) technology offers a potential opportunity for unbiased genotyping with high-throughput and low per-sample cost. Simultaneous marker discovery and genotyping delivers many benefits including availability of flanking DNA sequence information, high resolution, and high-sample throughput and scalability (multiplexed loci) (
The AmpSeq genotyping platform (
The objectives of applying AmpSeq may be to develop markers for high-throughput genotyping or to explore the sequence diversity of the targets, and then use the output data for distinct purposes, such as population genetics or allele discovery and mining. The target sequences have been primarily based on four sources (
General workflow of the AmpSeq strategy. Here, we describe three sources of short AmpSeq markers derived from GBS 64 bp sequences (QTL analysis, FST, Discriminative tags) as well as longer amplicons from SSRs or gene sequences, and considerations for primer design and selection. A primary focus of the documentation provided here is on data processing and interpretation.
In this document, a pipeline for the processing of AmpSeq data prior to interpretation is presented. Emphasis will be given to three main types of data: SNPs, haplotypes and amplicon read counts. The basic information is presented here, along with key output files and their descriptions. Data and a more comprehensive step-by-step procedure and well as a containerization instance with the software required are available through the GitHub and Singularity Hub repositories.
Note that several variants of amplicon sequencing are available through commercial providers. In addition, sequence capture approaches have a parallel concept of target enrichment prior to sequencing and are available through an array of commercial providers. Whether genotyping in-house or with a service provider, in most cases these diverse target-enriched Illumina sequence data types could be analyzed with the pipelines described here, particularly if reference sequences (for SNP calling), 5′ sequence of the target region (for haplotyping), or a larger component of target sequence (for read counts) is available.
The requirements of hardware to analyze AmpSeq data will depend mostly on the number of samples to be analyzed, the type of data to be generated, and the sequencing platform used for the generation of the amplicon sequences. Up to 2500 samples have been successfully analyzed using the procedure described herein by using a workstation with 32 GB of RAM and a quad-core microprocessor. In terms of storage, for those 2500 samples processed with a mixture of small (up to 50 bp) and medium (up to 200 bp) length amplicons, less than 100 GB were required. The final outputs of the AmpSeq computational analysis are small, usually VCF files or text files of KB or MB (in the example provided here, the basic data is ∼300 MB, and after finishing all the routines the directory is ∼6 GB).
If the number of samples is greater than 3000 and hundreds of amplicons of distinct lengths are going to be analyzed, it is recommendable to use dedicated workstations or high-performance computing clusters, in order to accelerate the processing and also obtain support related to customizations of the scripts and container provided in this communication. In addition, having access to institutional or dedicated computing clusters allows both new and experienced users to save time in dealing with new installations and configuring access to software.
The computational analysis of AmpSeq data requires the use of the command line and the installation of Perl 5.16 and above, Java 8 and above. The software packages (and their dependencies) considered for this pipeline include: BWA (
Supporting this manuscript, a compilation of examples of input files and scripts is available at the AmpSeq repository:
These procedures design target-specific primers to use for library preparation. This routine was developed with the purpose of designing primers to amplify GBS-tags containing SNPs in a given chromosomic region of interest (e.g., a QTL). This routine requires a SAM file from the SNP discovery pipeline (in this case Tassel GBS pipeline v. 3), a VCF with the polymorphisms discovered out of the SNP-discovery pipeline, and a fasta file with the reference sequence of the chromosome, contig or scaffold. A subsample of the GBS data analyzed in
These are procedures to discover SNPs relative to a reference sequence. This routine automatically trims sequence reads for adaptors, length and quality to subsequently process reads through procedures of aligning against reference, sorting from Picard and calling SNPs through the HaplotypeCaller routine of GATK. The outcome is a VCF file with the list of SNPs per sample analyzed. The data for this section and Haplotyping consist of paired-ended sequencing data of two parents (Parent1 and 2) and 10 progenies (Progeny01 to 10) for which amplicons targeted the powdery mildew resistance loci
PCR amplicons can contain one or more polymorphic sites that together can be considered a unique haplotype, or unique sequence read. The haplotyping procedure identifies unique haplotypes as alleles and associates allelic combinations with each sample. This routine automatically trims the sequencing data (for adaptors, length, and quality) to subsequently process unique reads through multiple sequence alignment using Clustal Omega and sorting from Picard. If reference sequences are provided for the amplicons analyzed, it is possible also to call SNPs by using the HaplotypeCaller routine of GATK. The expected outcome is Variant Call Format (VCF)-type file and a list of haplotype sequences. The data for this section and SNP Calling consist of two parents (Parent1 and 2) and 10 progenies (Progeny01 to 10) described above. Example data are available at:
This procedure quantifies reads of known sequences. For this rapid routine written in a perl script, the reads are parsed using the forward primers or sequences of the amplicons and enumerated. This procedure is useful when presence/absence markers or a specific allele are used to assay the sample population. The expected outcome is a text file containing a matrix of read counts. A subsample of the data for amplicons generated through single-ended sequencing targeting the powdery mildew resistance loci
To get started with AmpSeq, the use of existing primers that are known to produce amplicons in any platform is the simplest way. Nearly all of these (>95%, in our experience) will return data in AmpSeq multiplexes. Similarly, simply using Primer3 (
As described by
The Perl wrapper script run_gatk2.pl trims sequence reads to 45 bp before aligning them to the reference genome. It produces one SAM file per individual, converted to BAM files for sorting, cleaning and merging before using GATK for SNP calling and VCF file generation. That VCF file contains all the AmpSeq SNPs identified in the analysis and can be manipulated using vcftools or TASSEL (
The amplicon sequencing technology is flexible with respect to the type of designs that one can develop. The initial emphasis was on short, 45 bp markers converted from GBS data to genotype SNPs and presence/absence sequences, or tags. In addition, longer amplicons from gene regions, SSRs, or other sources can be genotyped, taking advantage of pair-ended sequence technology. This enables cost-effective integration of diverse marker platforms, for applications such as marker-assisted selection. When multiplexing amplicons of different lengths, shorter amplicons tend to return significantly more data than longer amplicons. Further, data analysis is simplified when the range of amplicon sizes is restricted.
In this example, data analysis focuses on longer amplicons, specifically SSR markers (
Output of the haplotype alleles for two parent samples and seven markers in the data provided at
Sample |
|||
---|---|---|---|
Marker | Haplotypes (frequency)a | Parent1b | Parent2 |
Ren1_SC47_6 | 1(0.38);3(0.29);2(0.25);4(0.08); | 2/3:1587,1511 | 1/4:2927,2653 |
Ren6_PN9_063 | 2(0.29);4(0.29);3(0.12);1(0.12);10(0.08);11(0.04);9(0.04); | 9/10:7,4 | 1/3:610,170 |
Ren6_PN9_068 | 4(0.32);2(0.27);3(0.27);1(0.14); | 3/4:8,3 | 1/2:35,14 |
Ren7_PN19_018 | 1(0.50);2(0.33);4(0.08);3(0.08); | 2/4:141,130 | 1/1:1752 |
Ren7_VVin74 | 1(0.33);4(0.29);2(0.25);3(0.12); | 2/4:139,42 | 3/1:175,117 |
Rpv12_UDV014 | 4(0.32);2(0.18);1(0.18);5(0.18);6(0.14); | 1/2:76,63 | ./.:0 |
Rpv12_UDV370 | 1(0.33);2(0.33);3(0.17);5(0.12);4(0.04); | 3/1:3735,3018 | 5/2:4379,3709 |
GBS tags are reference-independent sequences unbiased by the genomic background of samples and are particularly useful for introgression of traits from wild accessions lacking a reference genome. Marker design uses the tags-by-taxa (TBT) file generated during the TASSEL-GBS pipeline, prior to read alignment to a reference, to detect differential presence/absence of tags with respect to phenotypic data (such as presence/absence of disease resistance). Tag sequences differentially enriched in a given germplasm pool can be used for BLAST query to identify redundant tags (optional) and Primer3-based primer design for the development of 45–50 bp presence/absence AmpSeq markers.
After sequencing, the output of tag_presence.pl is a matrix with the counts for every tag (rows) in every individual (columns), with normalized and non-normalized versions. The normalized version shows values normalized by total reads then multiplied times 1,000,000, which can be useful to deal with uneven DNA concentration and sequencing depth (
Output of the read counts in matrix_normalized.txt for four samples and seven markers in the data provided at
Sample |
||||
Marker | 7_18_06_3 | 7_18_06_5 | 7_20_03_1 | 7_20_03_3 |
---|---|---|---|---|
Ren4_Tag4 | 0 | 22 | 9211 | 11,013 |
Ren4_Tag6 | 0 | 0 | 2414 | 3607 |
Ren4_Tag9 | 118 | 156 | 99 | 143 |
Ren4_Tag16 | 0 | 0 | 172 | 239 |
Ren4_Tag18 | 0 | 0 | 74 | 24 |
Ren4_Tag19 | 0 | 0 | 99 | 96 |
Run1_CB3334_Haplo64 | 0 | 0 | 0 | 0 |
The files can be easily opened in Excel, where it is possible transpose the matrix and start to manipulate the information, such as by adding phenotypic information. The subsequent manipulation of the information depends on the data quantity and application. Often, quantitative read counts are continuously variable even for alleles that are present or absent, and mean read depth per marker is not even across loci. For these reasons, read count standardization and thresholding can be helpful (
Distribution of read counts for the REN4_Tag4 sequence for a marker-assisted selection pseudotestcross from
Establishing a threshold may be experiment specific, in part because sequencing depth is dependent on sequencing platform and the number and identity of multiplexed markers or samples, which may change from one batch to the next. For example, note that the maximum read count in
The read counts are somewhat influenced by the zygosity of the locus, even though the high number of AmpSeq PCR cycles makes the read counts quantitatively imprecise. Thus, loci that are in a homozygous state usually have higher read counts in comparison with those in heterozygous states (not necessarily double). Sometimes such a distinction may be useful to discriminate individuals resulting from unintended self-fertilization. Having selfs in the dataset may significantly deviate the thresholds to call for the presence/absence of the marker; therefore, it is highly desirable to analyze for possible selfs beforehand, which we typically do using data from biallelic SNP markers with known parental alleles.
AmpSeq provides flexibility to address the limitations of other marker platforms, particularly in high diversity, heterozygous taxa like
AK, JF-R, SY, LC-D, and QS analyzed the data. BR developed the germplasm. LC-D, QS, and JF-R planned the study. AK, JF-R, SY, and LC-D wrote the manuscript. All authors read and approved the final manuscript.
Mention of trade names or commercial products is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture. USDA is an equal opportunity provider and employer.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We gratefully acknowledge the USDA-NIFA Specialty Crop Research Initiative (Award No. 2011-51181-30635) for funding the project “Accelerating grape cultivar improvement via phenotyping centers and next generation markers” (
The Supplementary Material for this article can be found online at:
Output of primer design from genotyping-by-sequencing (GBS) markers linked with
Output of markers in the relevant chromosomes from paired-ended sequencing data of two parents (Parent1 and 2) and 10 progenies (Progeny01 to 10) for which amplicons targeted the powdery mildew resistance loci