Universal Target Capture of HIV Sequences From NGS Libraries

Background: Global surveillance of viral sequence diversity is needed to keep pace with the constant evolution of HIV. Recent next generation sequencing (NGS) methods have realized the goal of sequencing circulating virus directly from patient specimens. Yet, a simple, universal approach that maximizes sensitivity and sequencing capacity remains elusive. Here we present a novel HIV enrichment strategy to yield near complete genomes from low viral load specimens. Methodology: A non-redundant biotin-labeled probe set (HIV-xGen; n = 652) was synthesized to tile all HIV-1 (groups M, N, O, and P) and HIV-2 (A and B) strains. Illumina Nextera barcoded libraries of either gene-specific or randomly primed cDNA derived from infected plasma were hybridized to probes in a single pool and unbound sequences were washed away. Captured viral cDNA was amplified by Illumina adaptor primers, sequenced on a MiSeq, and NGS reads were demultiplexed for alignment with CLC Bio software. Results: HIV-xGen probes selectively captured and amplified reads spanning the entirety of the HIV phylogenetic tree. HIV sequences clearly present in unenriched libraries of specimens but previously not observed due to high host background levels, insufficient sequencing depth or the extent of multiplexing, were now enriched by >1,000-fold. Thus, xGen selection not only substantially increased the depth of existing sequence, but also extended overall genome coverage by an average of 40%. We characterized 50 new, diverse HIV strains from clinical specimens and demonstrated a viral load cutoff of approximately log 3.5 copies/ml for full length coverage. Genome coverage was <20% for 5/10 samples with viral loads 90% for 35/40 samples with higher viral loads. Conclusions: Characterization of >20 complete genomes at a time is now possible from a single probe hybridization and MiSeq run. With the versatility to capture all HIV strains and the sensitivity to detect low titer specimens, HIV-xGen will serve as an important tool for monitoring HIV sequence diversity.


INTRODUCTION
As the HIV epidemic continues, surveillance of HIV diversity is essential to monitor the emergence of new subtypes and the presence of new strains (Hemelaar, 2013). Knowing which subtypes or recombinants historically predominate in a geographic region and whether the identity and proportion of new infections reflect a static or changing landscape will help inform appropriate interventions. For example, in Europe where the majority were once subtype B, 50% of new infections are non-B or recombinants (Semaille et al., 2007). The number of non-B infections in the US has also been on the rise (Pyne et al., 2013;Oster et al., 2017). Of equal concern is knowing whether mutations have arisen in current strains that could lead to misdiagnosis of HIV status, under-quantification of patients on antiretroviral therapy, or inadequate screening of the blood supply (Brennan et al., 2006).
Next generation sequencing (NGS) has been applied to the fight against HIV in several capacities, including monitoring for drug resistance, cell tropism determination, superinfection studies, network analysis and transmission patterns, and intrapatient quasi-species detection (Bimber et al., 2010;Archer et al., 2012;Redd et al., 2012;Swenson et al., 2012;Giallonardo et al., 2014). These typically involve high-throughput amplicon sequencing of defined sub-genomic regions such as pol or env. By contrast, metagenomic approaches using random priming permit an assessment of the full extent of sequence diversity in strains and can be applied to surveillance. Complete genomes of HIV and co-infecting viruses can be obtained directly from patient specimens, provided viral loads are high enough (Luk et al., 2015;Yamaguchi et al., 2017). To increase sensitivity, the HIV-SMART method utilizes reverse primers in conserved sequences of HIV fused to a tag to facilitate combined virusspecific priming and amplification (Berg et al., 2016;Rodgers et al., 2017a). In an alternate HIV-specific approach, individual long fragments (e.g., >2 kb) amplified with HIV primer pairs can be converted to NGS libraries to achieve detection limits nearing that of PCR, although the process is more labor intensive (Gall et al., 2012). What remains elusive is a simple, universal NGS approach for surveillance that balances sensitivity and cost, allowing numerous specimens of any titer and from any HIV group to be sequenced simultaneously while also maximizing NGS capacity and minimizing the use of resources.
To increase the sensitivity of microbial metagenomics from patient specimens, numerous pre-treatment approaches have been applied to reduce host background, including nuclease pretreatment, rRNA depletion, filtration, centrifugation, etc. (Hall et al., 2014;Conceição-Neto et al., 2015). An alternate means of enriching for viral reads after NGS library preparation has shown great promise. Briese, et al synthesized nearly 2 million biotinlabeled probes to cover all coding regions of vertebrate viruses, hybridized them to conventional high throughput sequencing libraries containing spiked in viral nucleic acid, and amplified sequences captured on streptavidin beads. The VirCapSeq-VERT method yielded a 100-10,000-fold increase in viral reads from patient specimens (Briese et al., 2015). An analogous approach tailored specifically for hepatitis C virus (HCV) has been equally successful, tiling all seven genotypes with just 1,100 probes (Bonsall et al., 2015). HIV probes tiling only subtype B showed in principle that genomes of integrated provirus can be selectively amplified from host DNA (Miyazato et al., 2016). As with HCV, HIV-1 and HIV-2 exhibit substantial diversity and therefore probes inclusive of all subtypes and groups are needed for surveillance. Here we developed a comprehensive probe set to accommodate the full spectrum of HIV strains encountered. We demonstrate that the HIV-xGen method is highly effective, particularly for low titer patient specimens, moving the field closer toward a universal, sensitive, high volume NGS solution.

Specimens
Specimens were collected by the Abbott Global Surveillance Program through collaborations in Cameroon, Uganda, South Africa, Senegal, and Thailand. All specimens were de-identified and obtained according to local regulations in each country at the time of collection between 1998 and 2016, including local IRB approval when required. Specimens were identified as HIV-1 positive by rapid diagnostic tests done in source countries before sequence analysis and testing at Abbott Laboratories (Swanson et al., 2000;Brennan et al., 2008;Rodgers et al., 2017b). Deidentified specimens from the Democratic Republic of Congo were obtained in 1987 as part of Project SIDA by the US National Institute of Allergy and Infectious Diseases (Cohen, 1997).

Specimen Pretreatment and Extraction
Plasma specimens were pre-treated with Ultra-pure benzonase (Sigma, St. Louis MO) for 3 h at 37 • C and extracted on the m2000sp (Abbott Laboratories, Des Plaines IL) with the RNA protocol (500 µl input/70 µl elute) as described (Berg et al., 2016).

HIV Viral Loads
Viral loads were determined by the HIV-1 RealTime (Abbott Laboratories, Des Plaines IL) assay as described (Berg et al., 2016). cDNA Synthesis and Nextera XT Library Production Viral RNA was concentrated to 10 µl with RNA Clean and Concentrator-5 spin columns (Zymo Research, CA) as described (Rodgers et al., 2017a). For gene-specific reverse transcription, cDNA was generated with HIV-SMART essentially as described (Berg et al., 2016;Rodgers et al., 2017a). Minor modifications described here were made to primer concentrations, cDNA input and PCR cycling conditions to reduce loss and boost sensitivity. The pool of six HIV-SMART primers used was 300 nM final for each (formerly 1 µM), the entire 10 µl of SMART cDNA (without dilution) was used as input for SMART amplification, and 18 rounds (formerly 35) of PCR were performed. HIV-SMART libraries were then purified with 1.4X vol of AMPpure beads, eluted in 12 µl of EB buffer, and 5 µl (undiluted) was used as template for Nextera XT reactions. For reverse transcription by random primers, metagenomic libraries were prepared by using Superscript III (SSRTIII) 1st Strand reagents (Life Technologies) followed by 2nd strand synthesis with Sequenase V2.0 T7 DNA pol (Affymetrix). Double stranded cDNA was recovered with DNA clean and concentrator spin columns (Zymo Research) and eluted in 7 µl. HIV-SMART and SSRTIII libraries were tagmented by Nextera XT and amplified for 16 and 24 cycles, respectively, with Set A indices lacking 5 ′ biotin tags (see below), according to manufacturer instructions (Illumina, Carlsbad CA). Nextera libraries were purified with Agencourt AMPpure XP beads (Beckman Coulter), visualized on a BioAnalyzer TapeStation (Agilent), and quantified on a Qubit instrument (Life Technologies) with dsDNA broad range reagents (Molecular Diagnostics).

Design of HIV xGen Probe Set
For HIV-1 group M, approximately 10 complete reference sequences from each subtype (A-K, plus CRF02) and representing a diversity of countries of origin were selected from the Los Alamos National Lab 2015 alignment. HIV-xGen probes (n = 82) were first designed against the consensus sequence derived from this reduced alignment. All degenerate base codes (e.g., R, Y) were replaced with specific, majority base call designations to permit probe synthesis. Next, the consensus was compared to individual subtypes by scanning each sequence in 100 bp windows to identify regions with <80% identity. Where diversity exceeded this cutoff and substitutions were found to be common across several subtypes, additional probes (n = 101) tiling 122-366 bp segments were designed to ensure complete capture of these regions. In order to adequately tile stretches with >10% diversity (e.g., env), probes were either designed against the entire individual sequence or to those sequences flanking the variable region. Consensus sequences were similarly obtained to design probe sets for HIV-1 groups N (N = 78), O (N = 83) and P (N = 83), and HIV-2 groups A (N = 87) and B (N = 86), each supplemented with additional probes (N = 11, 31, 6, 3, and 0, respectively) covering regions of increased diversity.

Synthesis of HIV xGen Probes, Nextera
Barcoding Primers, and Blocking Oligos 120 nt probes encoding the sense strand of HIV with 1 nt overlap and modified with a 5 ′ biotin tag were synthesized at Integrated DNA Technologies (IDT, Coralville IA). Mini-pools made at 3 pmol/probe were resuspended in 15 µl of TE to a storage concentration of 0.2 pmol/probe/µl. Custom DNA Ultramers (4 nmol) of Nextera XT Set A indices (N701-N715 and S502-S511) lacking a biotin label and containing two 3 ′ phosphothioate modifications were synthesized at IDT and working stocks were each diluted to 5 µM. Blocking oligos complementary to Nextera Set A i5 and i7 index primers (e.g., P5/P7 adaptors_8 nt barcodes_R1/R2 sequencing primers) were also synthesized at IDT at a 1 µl/reaction working concentration.

xGen Hybridization, Washes and Library Amplification
The enrichment for HIV reads was performed essentially as described in the Hybridization capture of DNA libraries using xGen R Lockdown R probes and reagents protocol from IDT. Individually barcoded Nextera libraries (n = 6-26) were pooled to have a minimum 500 ng of cDNA and then combined with 5 µg of Cot-1 DNA carrier and 1 µl of each blocking oligo. Samples were dried for >15 min in a SpeedVac set at 45 • C. Pelleted material was resuspended in 8.5 µl of 2X hybridization buffer, 2.7 µl of Hybridization Buffer Enhancer, and 1.8 µl of nuclease-free water. After a 10 min denaturation at 95 • C, 4 µl of a working probe stock (100 attomoles/probe/µl) was added and mixed to bring the final volume to 17 µl. Hybridizations were incubated for 4 h at 65 • C. M-270 Streptavidin beads (100 µl per capture) were equilibrated in Bead Wash buffer, pelleted by magnet, and mixed with hybridizations for another 45-min incubation at 65 • C, vortexing every 12 min to ensure beads remained in solution. Washes were performed as recommended and beads were resuspended in 20 µl of nuclease-free water for an initial 12 cycles of amplification with the KAPA PCR mix. Agencourt AMPpure beads (1.5X volume/75 µl) were added to PCR reactions and captured/washed material was eluted in 20 µl. A repeat KAPA amplification of 10 cycles followed by AMP-Pure clean-up was performed and libraries were visualized on an Agilent 2200 TapeStation and quantified with a Qubit fluorometer using the dsDNA high-sensitivity kit.

NGS Analysis
Analysis was performed as described (Berg et al., 2016;Rodgers et al., 2017a). Barcodes were parsed on the MiSeq instrument and filtered for Q-scores above 30. Fastq files were imported into CLC Genomics Workbench 9.0 software (CLC bio/Qiagen, Aarhus Denmark, version 9.0) where reads below 70 nt were discarded and the SMART adaptor was removed separately from both strands. When no preliminary Sanger sequence was available, raw data was simultaneously mapped to multiple HIV reference sequences to determine the subtype/group with the greatest identity. The following alignment settings were applied: match = 1, mismatch = 2, insertion = 3, deletion = 3, length fraction = 0.7, and similarity fraction = 0.8. An iterative approach was used to derived the final sequence, whereby the initial consensus served as the reference to refine the consensus in a second round of alignment. The raw NGS data (see below) was realigned to the final consensus sequence to generate NGS run statistics found in Tables 1, 2.

Removal of Contaminating Sequences
To detect possible contaminating reads originating from a different set of barcodes, raw data from one sample was individually mapped to the consensus sequences of samples sequenced in the same run, requiring a similarity fraction of 0.99 (e.g., ≥99% identical). Mapped reads were removed from the fasta file. Unmapped reads (e.g., unique to the sample of interest) were collected and realigned to an appropriate reference to generate a consensus.

Sequence Gap Closure
RT-PCR and Sanger (population) sequencing were performed as described to fill gaps in NGS sequence data (Rodgers et al., 2017a). Primer pairs used and their sequences are listed in the

Phylogenetic Analysis
Alignments, neighbor-joining trees, and recombination analysis by SIMPLOT were all performed as described (Berg et al., 2016).

Nucleotide Sequence Accession Numbers
Open reading frames for all 28 full and 3 near complete genomes were verified and annotated using SeqBuilder

HIV-xGen Target Enrichment Strategy
To enable full genome characterization of all HIV strains, we designed xGen probes to selectively capture and amplify viral sequences from cDNA libraries. Individual alignments were compiled for HIV-1 group M (subtypes A-K and circulating recombinant form [CRF] 02), groups N, O and P, as well as HIV-2 groups A and B. A minimum 80% identity has been shown to be required for effective xGen probe hybridization of viral sequences (Bonsall et al., 2015). Therefore, to eliminate redundancy and the synthesis of a prohibitively expensive number of probes, a consensus sequence from each group was generated from which an initial set of 120 nt single-stranded DNA probes tiling the genome at 1X coverage was derived. Alignments were then scanned in 100 nt windows to identify regions of strong nucleotide conservation and those of considerable heterogeneity. The minimum number of probes were selected to tile the former (e.g., >80% identity in pol), whereas those for heterogeneous regions (e.g., <80% identity in env) found in individual subtypes and strains were added as needed. Using this approach, only 183 probes were required for HIV-1 group M, compared to several thousand that would have been needed if probes were designed against the entire genomic sequence of individual strains. A total of 651 probes covered all HIV-1 and HIV-2 strains (Figure 1). cDNA from HIV-infected plasma can be synthesized by either random (Superscript RTIII; SSRTIII) or virus sequence-specific priming (HIV-SMART; Figure 1), followed by topoisomerasemediated fragmentation, adaptor tagging and amplification with Nextera XT (Berg et al., , 2016. Previously, despite a 17-20-fold increase in sensitivity over metagenomic (random primed) libraries, together with additional optimization of the HIV-SMART protocol described here in section Materials and Methods to now consistently obtain full genomes from ≥log4 copies/ml samples, both library approaches on their own still yield a minority of viral sequences (1-5%; red inserts in Figure 1) relative to host and reagent background (black inserts in Figure 1; Luk et al., 2015;Berg et al., 2016). Using the probes described above, we explored whether target capture of HIV reads from these conventional libraries could boost NGS sensitivity for viral sequences present in low abundance (Figure 1).
Since xGen probes are modified with a 5 ′ -biotin tag, Nextera XT adaptors lacking biotin needed to be synthesized to avoid streptavidin-mediated capture of all input sequences. Individually barcoded libraries were pooled together for a single capture, hybridized to HIV xGen probes on magnetic streptavidin beads, and washed to eliminate background (non-HIV) sequences. After PCR amplification, we note that the size range of resulting xGen libraries was often noticeably larger than their unenriched precursors (350-500 vs. 200-300 nt; data not shown). MiSeq runs were performed on an HIV-xGen superlibrary that typically multiplexed 6-26 samples, all with unique dual barcodes to permit parsing of data (Figure 1).

Diverse Sequences Are Captured and Enriched in HIV-xGen Libraries
To establish that the expected range of diversity was indeed captured by this method and determine whether the yield of viral reads increased, HIV xGen was applied to a variety of HIV-1 and HIV-2 strains. We began with high-titer specimens or virus isolates we previously sequenced to assess the fidelity of probes and exclude the possibility that gaps in xGen-generated coverage were a result of reads missing from the Nextera starting material. Also, by remaking libraries and arriving at the same consensus, we could demonstrate that HIV-xGen was not introducing artifacts or sequence bias. For HIV-1 group M, a FIGURE 1 | HIV-xGen strategy. 651 probes were selected to tile all HIV-1 and HIV-2 strains present in the phylogenetic tree at 1X coverage. Reverse transcription and second strand synthesis were performed by random priming with Superscript/Sequenase or by the HIV-SMART method. Nextera XT was used to convert cDNA to barcoded Illumina libraries consisting of both HIV (red inserts) and background (black) reads. Pooled libraries were hybridized to xGen probes (green) with 5 ′ -biotin tags (gold) for a single capture and selected by magnetic bead separation. Multiplexed libraries were amplified by universal KAPA primers, sequenced on a MiSeq, and reads were parsed by barcode.
high titer subtype C strain from South Africa (8119636; log 5.62 copies/ml) previously sequenced by HIV-SMART was remade this time by random priming (SSRTIII) and once again yielded 98% coverage (Berg et al., 2016). Following a post-Nextera HIV-xGen selection, 100% coverage was obtained, implying that sufficiently complementary sequences were present among the probes (Figure 2; Table 1). Notably, whereas only 2.69% of metagenomics reads mapped to HIV, this improved dramatically to 98.23% with HIV-xGen selection. The resulting SSIII-derived HIV-xGen consensus sequence was 99.99% identical to the HIV-SMART sequence. We continued to evaluate the probe set for the ability to capture HIV-1 groups O and N and HIV-2A sequences from virus isolate-generated libraries. Complete coverage was previously obtained in each case, with HIV-SMART libraries comprised of 3.8% (LA34; group O) and 10.0% (LA28; group N) HIV reads and randomly primed Ovation Single Cell libraries with 1.33% (LA38; HIV-2A) HIV reads (Berg et al., 2016;Yamaguchi et al., 2017). Here, cDNA libraries of each isolate were remade by random priming in this study and once again, 100% coverage was achieved for all three strains with a post-Nextera HIV-xGen enrichment, confirming that probes adequately covering these diverse groups were present and functional. The percentage of HIV reads once again increased to 97.23% for LA34 (group O), 99.39% for LA28 (group N) and 96.89% for LA38 (HIV-2A; Figure 2A). Likewise, the consensus sequences derived from HIV-xGen were 100%, 99.99%, and 100% identical to prior sequences.
The HIV-xGen method was then applied to additional clinical samples from Cameroon and Senegal with either viral loads ≥log5 copies/ml or those with unknown titers, this time with HIV-SMART libraries as the starting cDNA. For HIV-1 group M, a subtype F2, a CRF06, and a unique recombinant (URF) were sequenced and 100%, 91%, and 100% coverage was obtained for each, respectively ( Figure 2B; Table 2). An HIV-1 group O strain (O-AB260) also yielded 99% coverage. The percentages of HIV reads in the total ranged from 86 to 99%. Thus, a diverse set of high titer specimens were fully sequenced by HIV-xGen selection regardless of which cDNA synthesis method was deployed.

HIV-xGen Dramatically Increases Sensitivity for Low Titer Specimens
The value of HIV-xGen will reside in its ability to fully sequence low titer specimens while multiplexing to the same or greater extent. Representative results from strains with viral loads of log 4.59 copies/ml (PBS1342; URF), log 3.86 copies/ml (PBS1191; subtype G) and log 3.47 copies/ml (PBS888; subtype A) demonstrate the dramatic improvements in coverage with enrichment compared to without ( Figure 3A). For HIV SMART, genome coverage increases from 86, 51, and 56% without xGen to 95%, 100%, and 100% with xGen, respectively. Similarly, genome coverage with Superscript (SSRTIII) changes from 99, 45, and 54% without xGen to 97, 96, and 96% with xGen, respectively. Here again, HIV-xGen libraries were comprised almost entirely (90-99%) of HIV sequence regardless of the cDNA synthesis method chosen (SMART or SSRTIII) and resulting consensus sequences were 99.51, 99.04, and 99.76% identical, respectively.
Thirteen additional samples from the Democratic Republic of Congo, ranging in titers of log 3.58 to 5.78 copies/ml, were sequenced here by the HIV-SMART ± xGen method ( Table 1). These specimens together with the examples above reveal a median 1,147x (range 24.7-56, 509x) boost in HIV read yield upon xGen enrichment. For high titer specimens, percent genome coverage is largely unaffected (see below Figure 3C), although the depth of coverage is substantially increased as the same reads are re-sequenced. However, specimens <log 4.5 copies/ml saw both a significant increase in depth and overall genome coverage, indicating that additional HIV reads are present in libraries which have not been sequenced without enrichment (Table 1, Figure 3C). In a few instances, fewer than 10 reads were initially mapped, which following xGen selection, resulted in 97-100% coverage (subtype K, P3844; subtype A1, PBS0724). The average increase in percent genome coverage for xGen-enriched compared to unenriched samples was 40.5%.
Indeed, all samples >log 3.5 copies/ml had ≥79% coverage, with the majority (72%) of these having >95% genome coverage. Once again, strain consensus sequences were virtually identical independent of xGen, as well as when PCR duplicate reads were removed during mapping (Supplemental Table S2A). Similarly, the total number of minor variants (MV; 10-50%) detected was consistent between methods. However, while PCR duplicate removal had minimal effect on the extent to which the exact same MVs were detected (70-100% overlap), this overlap was reduced when comparing (-)xGen to (+)xGen datasets (28-84% overlap; Supplemental Table S2B).
We next explored strains with viral loads ranging from log 2.6-3.6 copies/ml, continuing with the HIV-SMART + HIV xGen approach ( Table 2). Coverage plots for xGen-enriched libraries in Figure 3B illustrate that while full genomes are not possible with this method in this titer range, the partial coverage obtained for some can still be substantial (e.g., 40-80%). Once again, sequences from different geographies were successfully captured, including a subtype F2 (112-11, Cameroon), D (129-26; Cameroon), and CRF01 (10047105267; Thailand). As testament to the method sensitivity, an HIV-2 strain (136-16; Cameroon), which typically replicates at low titers, was extracted from diluted patient plasma and attained 75% genome coverage ( Figure 3B). The average percent genome sequenced for those with ≤log 3 copies/ml is 26 ± 18%. For two samples at log 3.24 and 3.26 copies/ml, zero HIV reads were obtained. It is noteworthy that input for cDNA synthesis in this range is <200 copies of virus and that for many of these samples, simply obtaining a product by RT-PCR was also a challenge.
To summarize the results for all samples attempted in which a titer was known, genome coverage was plotted against viral load. Most samples >log 3.5 copies/ml achieved near-complete coverage and those below this threshold yielded a partial genome ( Figure 3C). Different colors (±xGen) for the same samples (identical viral loads) illustrate the significant jumps in coverage following enrichment. There were several strains for which low sample volume precluded viral load testing, particularly for HIV-1 group O and rare circulating recombinants ( Table 2). For more than half of these we succeeded in obtaining >90% of the genome.

HIV xGen Facilitates Classification and Characterization of Diverse HIV Strains
A total of 50 new strains originating from the Democratic Republic of Congo, Cameroon, Thailand, South Africa, Senegal, and Uganda were sequenced by HIV-xGen. Phylogenetic classification of 28 complete or near-complete (>90%) genomes determined in this study are shown using a 6,252 nt gap-stripped alignment ( Figure 4A). Many of the major subtypes and CRFs are represented here, as well as four group O strains, illustrating the breadth of viral diversity captured by this method. Phylogenetic classifications of partial genome sequences with 75-90% (Supplemental Figure S1) and <75% (Supplemental Figure S2) coverage are found in the Supplemental Information.
Strains branching basal to Group M subtype nodes were investigated further by Simplot and boot-scanning to reveal evidence of recombination. For each initial Simplot analysis preceding the final bootscan shown, appropriate references were included to verify that the strain in question was more similar to the recombinant sequence than to the pure reference sequence of the same subtype (e.g., A, G, etc.). Sub-genomic RT-PCR of env immunodominant region (IDR) originally categorized 770-8 as a CRF13. While NGS confirmed the classification of this portion of the sequence, we were able to determine with the full genome that it was actually a unique recombinant form consisting of CRF02, CRF37 and CRF13 sequences. Similarly, full genome characterization of 814-43 demonstrated it was not simply a CRF06, but rather a URF consisting of CRF06 and CRF02 sequence ( Figure 4B). Plotting consensus base call percentages at each position ruled out dual or super-infections, since aside from the occasional minor variant, values approached 100% throughout the genome. No continuous stretches of lower consensus base call percentages were observed, indicating only one major recombinant species was present (data not shown).

DISCUSSION
HIV-xGen is a universal, robust, and cost-effective backend to any cDNA method deployed for next generation sequencing of HIV-1 and HIV-2. Previously, with benzonasetreated extractions and either our optimized gene-specific (HIV-SMART) or standard random priming approaches for cDNA synthesis, we were still challenged by sensitivity. While we obtained much greater coverage for samples with titers between log 4 and log 5 copies/ml, genomes were still incomplete (Berg et al., 2016;Rodgers et al., 2017a). Now, with HIV-xGen, we can routinely obtain full genomes at a lower limit of log 3.5 copies/ml, whereas previously without enrichment, samples in the log 3.5-4.5 range would only yield 20-50% coverage. Thus, it was clear that these HIV reads were actually present in libraries, but we were not sequencing to a sufficient read depth to observe them. Below this log 3.5 threshold, the ability to adequately sequence samples is likely a limitation of cDNA synthesis, and not xGen; it cannot capture and amplify material that was never reverse transcribed. With specimens each having inherent differences in host background, detection is not linear in this range. Some may obtain >50% coverage whereas others of similar titer recover no sequence at all. Overall, our results are consistent with what others have reported for probe-mediated positive selection of viral sequences (Bonsall et al., 2015;Briese et al., 2015). Simply sequencing deeper, on a higher throughput instrument (e.g., HiSeq vs. MiSeq), or multiplexing less could certainly provide improved detection and coverage of low titer samples, but the overall read proportions would likely remain the same. Now with xGen, far more samples can be processed and sequenced at once as a greater percentage of reads are viral, saving time and resources. Throughput and NGS capacity are further increased by virtually eliminating the sequencing of host background.
As with HIV-SMART, the objective with HIV-xGen was to fully characterize strains for the purpose of surveillance. With the sensitivity we now demonstrate, the number of specimens previously deemed too challenging to sequence by NGS due to low titer has markedly declined. As a greater proportion of patients are on therapy and able to suppress viral loads, it is imperative that our methods can adequately characterize low titer specimens. While we only attempted to multiplex a maximum of 26 samples at a time, the high percentage of HIV reads from the total per barcode suggests many more libraries (e.g., an entire 96well plate) could be pooled without detriment. Considering the depth of coverage now possible, assessing levels of minor variants and quasi-species in samples could be an attractive application for HIV-xGen. For example, detection of minor variants at clinically relevant levels (e.g., >10%) to predict the emergence of drug resistance and treatment failure should be readily achievable with this method (Li and Kuritzkes, 2013;Obermeier et al., 2014;Noguera-Julian et al., 2017). However, primer IDs controlling for starting cDNA populations and potential PCR bias were not used in this study and the original proportions of minor variants after Nextera amplification might be expected to drift further after additional rounds of HIV-xGen amplification (Jabara et al., 2011;Boltz et al., 2016). Indeed, we showed that while consensus sequences do not change with enrichment or inclusion of duplicate reads, minor variant populations are affected by xGen (Supplemental Table S2). Nevertheless, our primary goal was to not limit interrogation to sub-genomic regions via amplicon sequencing, but rather to better comprehend the complete extent of diversity in entire genomes found in different geographic regions at different times. As an example, our prior surveillance efforts in Cameroon have consistently observed a predominance of CRF02, CRF06, CRF13, and CRF37 (Rodgers et al., 2017b). Now it appears that the URFs we are detecting in this region are recombinants and contain genome segments that do not phylogenetically cluster with homologous sequences derived from any of the classified HIV-1M (Figure 4).
With the batching of samples during hybridization and amplification steps, the inability to control for the resulting proportion of reads/barcode is one shortcoming of the method. Generally, the resulting read numbers were a reflection of the starting viral loads. Thus, we recommend, if possible, that samples be grouped by titer to avoid an imbalance. Nevertheless, given the potential for one or more samples to predominate in a run relative to the others in the pool, cross-contamination of reads between barcodes on the MiSeq was of concern (Lee et al., 2016). Therefore, raw data from each sample was mapped to each individual HIV xGen-derived consensus sequence of other samples in the same run to identify regions of perfect identity which might indicate the incorrect binning of reads. Fortunately, this was infrequently observed, with most errant reads derived from high read depth samples, and particularly in those samples where a barcode may have been shared. Thus, we further recommend that unique dual barcode pairs be chosen for each library in a pool. Here, reads clearly originating from another sample were removed from the dataset and the mapping was repeated to derive an accurate consensus. In most cases, we had prior Sanger data to compare the NGS consensus sequences against to verify that the final sequence obtained was correct.
Another shortcoming of the method is that the full scope of HIV diversity cannot be known, therefore probes comprehensive of all sequences cannot be designed, and yet this is an argument in favor of continued surveillance. As an example in Figure 3A, PBS1342 exhibits a gap in the env region using either cDNA synthesis approach. The HIV-SMART reverse primer binding site is considerably downstream of the gap and fails to explain the absence of sequence, irrespective of enrichment. In contrast, randomly primed libraries did sequence successfully across this region, for which only a small portion (nt 7,196-7,455) was not recovered by xGen. Comparison of the 120 nt probes spanning this region to the PBS1342 consensus sequence revealed an overall lower identity (78-85%), but presumably the concentration of mismatches and indels we observed were the major factor. In the first probe, 14 of the total 18 mismatches were focused in the 3 ′ 50 nucleotides, the middle probe had two insertions (6, 3 nt), one deletion (3 nt), and 13 mismatches all situated in the 5 ′ 53 nucleotides, and in the third there were 19 mismatches over a stretch of 33 nucleotides, preceded by two insertions (6, 27 nt) and a deletion (15 nt). Adequate coverage of envelope will likely remain a challenge, particularly as we explore geographies with high sequence diversity such as the DRC, yet the tolerance for mismatches displayed throughout the genomes of numerous strains from HIV-1 and HIV-2 groups speaks to the robustness of the method.
It is essential that diagnostic tests keep pace with HIV and other rapidly mutating viruses by proactively seeking out strains in circulation that may escape detection with current assays (Brennan et al., 2006). The ability to fully characterize multiple samples simultaneously without regard for subtype, group, or titer opens up greater opportunities for future surveillance. At the same time, using HIV-xGen to retrospectively examine archived specimens and understand the origins of the HIV epidemics is an equally compelling application (Rodgers et al., 2017a). Metagenomic NGS combined with selective viral sequencing promise a new era in diagnostic virology (Barzon et al., 2011;Quiñones-Mateu et al., 2014;Kumar et al., 2017). As sensitivity and throughput increases with methods like HIV-xGen, we are one step closer to realizing this potential.

AUTHOR CONTRIBUTIONS
JY and MB conceived of methodology and designed HIV-xGen probes. JY, AO, and KF performed experiments and data analysis. MR performed data analysis, assisted with experimental design, and reviewed manuscript. MB performed data analysis, directed experimental design, wrote manuscript and made figures. GC assisted experimental design and reviewed manuscript. TQ and OL collected and characterized DRC samples and reviewed manuscript. NN, DM, and LK collected and characterized Cameroon samples and reviewed manuscript.

FUNDING
This study was funded by Abbott Laboratories.