Cross-Comparison of Exome Analysis, Next-Generation Sequencing of Amplicons, and the iPLEX® ADME PGx Panel for Pharmacogenomic Profiling

Whole-exome sequencing (WES) has been widely used for analysis of human genetic diseases, but its value for the pharmacogenomic profiling of individuals is not well studied. Initially, we performed an in-depth evaluation of the accuracy of WES variant calling in the pharmacogenes CYP2D6 and CYP2C19 by comparison with MiSeq® amplicon sequencing data (n = 36). This analysis revealed that the concordance rate between WES and MiSeq® was high, achieving 99.60% for variants that were called without exceeding the truth-sensitivity threshold (99%), defined during variant quality score recalibration (VQSR). Beyond this threshold, the proportion of discordant calls increased markedly. Subsequently, we expanded our findings beyond CYP2D6 and CYP2C19 to include more genes genotyped by the iPLEX® ADME PGx Panel in the subset of twelve samples. WES performed well, agreeing with the genotyping panel in approximately 99% of the selected pass-filter variant calls. Overall, our results have demonstrated WES to be a promising approach for pharmacogenomic profiling, with an estimated error rate of lower than 1%. Quality filters, particularly VQSR, are important for reducing the number of false variants. Future studies may benefit from examining the role of WES in the clinical setting for guiding drug therapy.


INTRODUCTION
Whole-exome sequencing (WES) is an increasingly important technology in rare-disease (Maxmen, 2011) and drug-response genetics (Price et al., 2012). Its core technique comprises simultaneous capture, enrichment and sequencing of protein-coding and untranslated regions within the genome (exome). Besides being an effective tool for detecting potentially diseasecausing variant(s), WES can also provide added information on variation in pharmacogenes. Though WES data has been shown to be highly accurate in previous studies, provided that appropriate quality filters are applied (Wang et al., 2013;Linderman et al., 2014;Strom et al., 2014;Yi et al., 2014), none of these studies have specifically explored the use of WES data for pharmacogenomic profiling. Notably, the accuracy of WES variant calling could be compromised by the failure of the technology to resolve highly similar genes (Drögemöller et al., 2013). In particular, sequencing of the CYP2D6 gene is confounded by the presence of closely related pseudogenes, CYP2D7 and CYP2D8, such that pre-amplification with long-range PCRs is usually applied to avoid undesired sequence contamination (Stüven et al., 1996).
The work described in this paper had two primary aims. First, we carried out an in-depth cross-validation of WES variant calls in CYP2D6 and CYP2C19 for 36 samples by amplicon sequencing on the MiSeq R platform. Then, we expanded our findings and evaluated the more general applicability of WES to pharmacogenomic profiling by cross-comparison with the iPLEX R ADME PGx Panel. The iPLEX R ADME PGx panel uses the MassARRAY R system (Agena Bioscience, San Diego, CA, USA) to simultaneously analyze 184 single nucleotide polymorphisms (SNP), insertions and deletions (INDELs) and 16 copy number variations (CNV) across 36 genes relevant to drug absorption, distribution, metabolism, and excretion.

Sample Population
A total of 36 samples were included in this study. These sequenced samples comprised various research samples referred to our laboratory for pharmacogenomic investigation. This study was approved by the Southern Health and Disability Ethics Committee, New Zealand. Potential participants were contacted first by mail and were required to indicate interest to participate by filling in and returning an enclosed form. Face-to-face interviews were subsequently conducted to obtain written consent and collect relevant medical history. The study information sheet and consent form included procedures for handling of incidental findings, which would be followed up in consultation with a clinical geneticist. DNA was extracted from peripheral blood leukocytes using a KingFisher Flex Magnetic Particle Processor, as per the manufacturer's instructions (Thermo Fisher Scientific, Waltham, MA, USA).
Details for processing of amplicon sequencing data are presented in Supplementary Methods. Raw sequence reads, which had a mean length of 151 bp, were first trimmed using Trimmomatic v0.30 to remove contaminating adapterindex sequences (Lohse et al., 2012). Subsequent analysis was performed using tools available on the Galaxy server (Giardine et al., 2005;Blankenberg et al., 2010;Goecks et al., 2010). Trimmed reads were aligned to a custom reference sequence using BWA-backtrack, duplicates were removed with Picard v1.56.0, then local base realignment around indels was carried out with GATK. Finally, variants were called with GATK's Unified Genotyper v0.0.6.
A subset of twelve samples were then selected for multiplexed genotyping by the iPLEX R ADME PGx Panel (Agena Bioscience, San Diego, CA, USA). DNAs from these samples were standardized to 10 ng/µL in a final volume of 200 µL. This was followed by genotyping on the MassARRAY R System (Agena Bioscience) using iPLEX R Gold Biochemistry and Typer v4.0 Software (Agena Bioscience).

Validation of WES Variant Calls in CYP2D6 and CYP2C19 by Amplicon Sequencing
Whole-exome sequencing genotype calls having a read depth < 4 or a genotype quality score < 10 were designated "not evaluable." GATK defines genotype quality score as "the Phredscaled confidence that the genotype assignment is correct." Further, to assess the effectiveness of VQSR at improving variant calling accuracy, we divided all variants into two sets, namely those that were called below the 99% truth sensitivity threshold, also designated "pass-filter, " and those that were called with surplus sensitivity (< 99.9%). These values represent varied levels of variant detection sensitivity relative to a set of known, true variants; and 99% is a commonly adopted threshold. Then, because the call-sets comprised a mix of onand off-target variants, those sites that were distant from the target segments, defined for the TruSeq TM capture kit, were apt to be poorly covered. Variant calls generated by MiSeq R amplicon sequencing were required to have an approximate read depth of at least 10-fold to be considered sufficiently confident. A false-positive event was defined as the calling of the alternate allele that was determined to be absent by MiSeq R amplicon or follow-up Sanger sequencing. A falsenegative event was defined as the failure to detect the alternate allele(s).

Cross-Comparison with the iPLEX R ADME PGx Panel
Comparison was carried out for variant calls emitted by both WES and the iPLEX R ADME PGx Panel. Again, only passfilter WES genotype calls, i.e., having a read depth ≥ 4 and a genotype quality score ≥ 10, were considered. For the iPLEX R data set, only genotype calls that had a call rate > 85%, indicating good quality, were selected for validation. The call rates were determined by the Typer 4 (Agena Bioscience) software, having assessed spectra quality related to the reported haplotype. Sanger sequencing was performed to resolve disagreement between the two platforms.

Validation of CYP2D6 and CYP2C19 Variants by MiSeq R Amplicon Sequencing
Whole-exome sequencing was carried out on 36 subjects referred to our laboratory for PGx analysis. We initially compared the WES data quality for two key pharmacogenes, CYP2D6 and CYP2C19, with data generated by an amplicon sequencing assay we developed on the MiSeq R (Illumina) platform, for the same 36 samples (Supplementary Methods; per-exon depths of coverage are presented in Supplementary Figures S1 and S2). A total of 43 variant sites, identified by WES in the CYP2D6 and CYP2C19 genes, were analyzed (Table 1); of these, 27 were called without violating the 99% truth-sensitivity threshold. WES variant calling was highly accurate for the 27 pass-filter variant sites, achieving a concordance rate of 99.60% ( Table 2). For the 43 variant sites identified by WES a total of 943 individual WES genotype calls were generated across the 36 samples, but 202 calls were excluded for being of insufficient quality (read depth < 4 or genotype quality < 10) or because MiSeq R amplicon sequencing data were not available. Of the 741 verifiable WES genotype calls, only three calls were found to be discordant with MiSeq R data. None were false-positive. Sanger sequencing revealed that two of the three WES genotype calls that were discordant with MiSeq R data (rs17885098 and rs3758581) were actually accurate ( Table 3).
In contrast, when the truth sensitivity threshold was relaxed to include an additional collection of 16 variant sites (< 99.9%), the rate of concordance with MiSeq R data dropped considerably to 89.11% ( Table 2). A total of 1512 calls were generated for the 43 variant positions; but quality filtering and a dearth of sufficiently confident MiSeq R data (read depth < 10) for a number of sites resulted in a final comparison set of 1249 WES genotype calls. Of these, 136 mismatches were identified, representing a large increase of 133 discordant calls in relation to the more stringent call-set (truth-sensitivity threshold < 99%). The majority of these discrepant calls were false-positives.

Cross-Comparison with the iPLEX R ADME PGx Panel
We next sought to examine the concordance between WES data and a broad PGx profile of 192 nucleotide variations in 36 genes, generated on a subset of 12 subjects using the iPLEX R ADME PGx Panel (Agena Bioscience). Of all 192 polymorphisms covered by the iPLEX R ADME PGx panel, 184 are SNPs and INDELs. Notably, 16 variant sites were not captured by the Nextera exome capture kit; hence, these variants could not be screened by WES (Table 4). Only pass-filter WES genotype calls (a read depth ≥ 4 and a genotype quality score ≥ 10) were considered for comparison, which resulted in a final set of 64 variant sites that were called by both WES and the iPLEX R ADME PGx panel ( Table 5).
After eliminating all low-quality (described above) and missed calls from both panels, 719 genotype calls were included in our final analysis. Of these, eight calls at four variant sites were found to be discordant with the iPLEX R ADME PGx Panel (Table 6), yielding a concordance rate of 98.89%. By Sanger sequencing, we confirmed that all these discordant calls were correctly genotyped based on the WES data.

DISCUSSION
Previous studies have established WES to be an effective, highthroughput variant detection tool that has been successfully used in the analysis of Mendelian disorders (Wang et al., 2013;Linderman et al., 2014;Strom et al., 2014;Yi et al., 2014). This sequencing technique constitutes a potent driving force in personalized medicine, generating genomic profiles that could also be utilized to tailor pharmacological treatment. Suppose an individual undergoes WES to aid diagnosis of an unknown condition, surely the added information on pharmacogenomic polymorphisms could be interpreted, curated and stored for guiding future drug therapy? Here we report the results of our assessment of WES as a potential tool for pharmacogenomic profiling.
We demonstrated that pass-filter WES variant calls in CYP2D6 and CYP2C19 were highly accurate, yielding a near-perfect degree of concordance with the MiSeq R amplicon sequencing data, despite previous concerns that WES is likely to underperform in genes with closely related homologs (Drögemöller et al., 2013). The concordance rate decreased substantially to approximately 89% when the truth-sensitivity threshold was raised to 99.9%. The difference in concordance rate was rather striking, confirming previous findings which have demonstrated the effectiveness of VQSR at reducing errors in WES data (Yi et al., 2014). In the clinical setting, a highly accurate call-set is desired, albeit at the expense of variant-detection sensitivity. We recommend 99% to be the optimal cut-off that should minimize the number of erroneous genotype calls without overly compromising variant discovery.
Another interesting point is the usability of off-target variants ( Table 1 and Supplementary Figure S3). For instance, an important defective CYP2D6 variant, rs3892097 ( * 4), is located outside the capture intervals of the TruSeq TM kit. Nevertheless, 100% agreement was found between WES and MiSeq R amplicon sequencing data for the variant. Had strict target definitions been adopted with no interval extension or "padding, " this variant would be overlooked. Given the intrinsic mechanism of TruSeq TM capture, labeling the off-target sequences as such is slightly misleading. The kit employs contiguous probes to target regions of interest; thus capture of overhanging sequences in fragmented genomic DNA is unavoidable. The current GATK practice supports a certain extent of flexibility in WES data processing to encompass variants located within the exon-intron boundaries. Off-target variants represent a valuable subset that gives added information, and this has already been reported by other investigators (Guo et al., 2012).
We then further analyzed the 20 off-target sites) with respect to their approximate read depth and distance from the nearest   target segment. Coverage appeared to range from poor to good even for variants located at similar distances from TruSeq TM targets, but was consistently low beyond 100 bases (<150fold per 36 samples; Supplementary Figure S3). This suggests that a maximum distance of 100 bases may be a reasonable qualifying threshold for invoking an off-target variant site. Nonetheless, further analysis will still be required to more accurately quantify the acceptable deviation from the target regions that would maintain sufficient data quality for variant discovery; and to ascertain whether call-sets produced using different commercial capture kits, which have varied target definitions, could be combined and subjected to the same analysis pathway.
Using the truth-sensitivity threshold pre-defined above, we then assessed the broader applicability of WES to pharmacogenomic profiling by cross-comparison with the iPLEX R ADME PGx Panel in a subset of 12 samples. The panel successfully detected 181 single-nucleotide variants per sample but of these, only 64 were also called by WES without violating the truth-sensitivity threshold. The majority of the WES-derived variant calls were consistent with those obtained from the iPLEX R ADME PGx Panel, giving a concordance rate of 98.89%. Eight discordant WES genotype calls were observed and were subsequently verified by Sanger sequencing, again confirming the accuracy of WES variants.
Overall, our results have demonstrated that as a pharmacogenomic screening tool, WES has an estimated error rate of lower than 1% for VQSR-filtered variants. This accuracy of the WES dataset reflects the increased reliability and quality of data now available using longer reads for WES analysis. The error rate would probably decrease further in the future with improved analysis software or sequencing workflows. Applying VQSR alone appears unlikely to remove all false-positive variant calls and additional filters are required. GATK Best Practice currently implements the Genotype Refinement workflow to achieve higher data quality. In this pipeline, genotypes with quality score < 20 are filtered out following VQSR. A recent study has also found that read depth ≥ 8 and genotype quality ≥ 20 are good thresholds for removing unreliable genotype calls (Carson et al., 2014). Our results suggest that less stringent cut-offs (read depth ≥ 4 and genotype quality ≥ 10) could be adopted to obtain more genotype calls, at marginal cost to the error rate.

LIMITATIONS
Remote or non-exonic variants of functional significance are not detectable using the WES technology and this may limit its usefulness (Londin et al., 2015). For instance, −806C > T, the transcription-enhancing promoter variant of the CYP2C19 gene, was not detected by WES. This variant is common in the Caucasian population with an allele frequency of 18% (Sim et al., 2006) and may have an important influence on clopidogrel responsiveness (Tiroch et al., 2010). Despite improved accuracy after applying the VQSR filter, 20% of the genotype calls from the WES dataset could not be analyzed due to their poor quality or low read depth limiting the use of WES as a reliable technique for clinical application to replace mutation scanning approaches. It is also worth noting that only about one-third of the 192 iPLEX R variants were covered by pass-filter WES data. This study did not examine any samples harboring CYP2D6 hybrid alleles, which arise from large-scale CYP2D7 conversion of the CYP2D6 gene. Consequently, it is not clear how well BWA-backtrack and the downstream variant caller would resolve these chimeric sequences, which would contain a large number of mismatches to the CYP2D6 reference sequence. Because BWAbacktrack is not designed to tolerate a high error rate (Li and Durbin, 2009), it is probably not able to process extensively CYP2D7-converted reads. The presence of hybrid alleles is therefore likely to adversely affect the quality of CYP2D6 variant calling on WES data, and other strategies may need to be employed to circumvent this issue.
Finally, different analysis pipelines were employed to process MiSeq R amplicon sequencing and WES data. This could have contributed to the observed discrepancy between the two sequencing approaches. However, we believe the effect was unlikely to have been sufficiently severe to affect the conclusions that we have drawn.

CONCLUSION
We have demonstrated that WES is a promising tool in detecting pharmacogenomic variants, even for complex loci such as the CYP2D6 gene. VQSR is an essential quality filter for the removal of likely false variant sites. Future studies should examine the adoption of WES in the clinical setting for guiding pharmacological therapy. For instance, exome analysis could be applied to subjects in whom CYP2D6 activity has been pharmacokinetically validated, to determine genotype-phenotype correlation (that could be occasionally obscured by extensive pseudogene conversion of the CYP2D6 gene). Various practical aspects of reporting WES results should be considered, such as obtaining patient consent for storing and utilizing this piece of information, translating the WES data into an easy-tounderstand format, and determining the actionability of novel reported variants.

AUTHOR CONTRIBUTIONS
EWC, SC, KT, and PS performed the experiments; EWC, SC, KT, and KL analyzed the sequence data; EWC and KT drafted the manuscript; SC, KL, PS, NH, and MK critically reviewed the manuscript.