SNP and Phylogenetic Characterization of Low Viral Load SARS-CoV-2 Specimens by Target Enrichment

Background: Surveillance of SARS-CoV-2 across the globe has enabled detection of new variants and informed the public health response. With highly sensitive methods like qPCR widely adopted for diagnosis, the ability to sequence and characterize specimens with low titers needs to keep pace. Methods: Nucleic acids extracted from nasopharyngeal swabs collected from four sites in the United States in early 2020 were converted to NGS libraries to sequence SARS-CoV-2 genomes using metagenomic and xGen target enrichment approaches. Single nucleotide polymorphism (SNP) analysis and phylogeny were used to determine clade assignments and geographic origins of strains. Results: SARS-CoV-2-specific xGen enrichment enabled full genome coverage for 87 specimens with Ct values <29, corresponding to viral loads of >10,000 cp/ml. For samples with viral loads between 103 and 106 cp/ml, the median genome coverage for xGen was 99.1%, sequence depth was 605X, and the “on-target” rate was 57 ± 21%, compared to 13%, 2X and 0.001 ± 0.016%, respectively, for metagenomic sequencing alone. Phylogenetic analysis revealed the presence of most clades that existed at the time of the study, though clade GH dominated in the Midwest. Conclusions: Even as vaccines are being widely distributed, a high case load of SARS-CoV-2 infection persists around the world. Viral genetic surveillance has succeeded in warning the public of new variants in circulation and ensured that diagnostic tools remain resilient to a steadily increasing number of mutations. Target capture offers a means of characterizing low viral load samples which would normally pose a challenge for metagenomic sequencing.


INTRODUCTION
Since the first cases of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection were reported in late 2019 in China, nearly 205 million cases of coronavirus disease 2019 (COVID-19) and 4.3 million COVID-19 deaths have been reported (as of August 2021; Johns Hopkins University, https:// coronavirus.jhu.edu/). The severity of COVID-19 symptoms and the risk of hospitalization and/or death are linked to various patient co-morbidities, age, and viral load (VL) (1,2). Sequencing and phylogenetic analyses have been used to track the global spread of SARS-CoV-2; for example, in the United States, the virus first entered from the west into Washington state from China, and then from the east into New York from Europe (3)(4)(5)(6). Over time, surveillance efforts have documented the emergence of variants and major clades, with the ascendency of clade G (containing spike-D614G) and its descendants over the early clades S (containing NS8-L84S), V (containing NS3-G251V), and L (7,8). In late 2020 and early 2021, new SARS-CoV-2 variants linked to heightened transmissibility have been reported almost monthly, such as Alpha, Beta, Gamma, and Delta, revealing the need for constant and rapid surveillance to characterize mutations and evaluate their effects on infection patterns and vaccine efficacy (9)(10)(11)(12).
A tremendous amount of sequencing information has been generated for SARS-CoV-2, with well over 1 million sequences deposited in GISAID and the NCBI as of August 2021 (https:// www.gisaid.org/ and https://www.ncbi.nlm.nih.gov/sars-cov-2/). These sequences are generally >99.9% identical, distinguished by only a few SNPs (13). As an RNA virus, SARS-CoV-2 has a capacity for rapid mutation, and variants will continue to arise as the pandemic persists and the virus circulates. The introduction of vaccines may also be applying a selective pressure on the virus. While there are >1 million SARS-CoV-2 sequences in GISAID, having a sensitive, scalable, reliable, and rapid wholegenome sequencing method for low VL specimens is needed. Thus, far, genetic surveillance has relied on multiplex PCRs (e.g., the ARTIC method); however, two weaknesses of this method are inherent amplification bias and mutations that negatively affect the generation of amplicons (14)(15)(16)(17). Additionally, in individuals with low VL, PCR-based methods must be even more sensitive to detect the virus.
Random priming coupled to xGen target enrichment permits both metagenomic exploration and direct capture for sequencing of SARS-CoV-2 from the same library. Long probe lengths readily tolerate the limited number of expected mutations (all known SARS-CoV-2 genomes are >99.8% identical) and amplify targeted libraries from low VL specimens. As demonstrated by our previous work with HIV, Hepatitis B, and Hepatitis Delta viruses, and Bonsall et al.'s work with Hepatitis C virus, the xGen approach is a highly sensitive tool to maximize viral genome coverage when coupled with Illumina's next-generation sequencing platform (18)(19)(20)(21). Here, we examined SARS-CoV-2 full-genome diversity on specimens collected from sites in the Midwest and Northeast United States from March to June 2020. We detected sequences representative of all clades that existed at the time, and in specimens with low viral loads, we applied a SARS-CoV-2-specific xGen approach to achieve full genome coverage.

Virus Quantification
Specimens were heat-inactivated at 65 • C for 30 min in a Biosafety Level 3 (BSL-3) facility, then subjected to the RealTime SARS-CoV-2 qualitative EUA diagnostic assay. There are 10 unread ("dark") cycles on the Abbott Molecular platform, in which case C t values are 10 less than most other manufacturer assays; the C t values reported in this study are adjusted by 10 to include the dark cycles. Viral loads were approximated based on equivalent C t values obtained with a RealTime SARS-CoV-2 quantitative assay on the m2000sp/rt that uses calibrators and controls (Abbott Molecular, Des Plaines, IL) (22)(23)(24).

Nucleic Acid Extraction for Sequencing
Remaining volumes of the inactivated specimens were subjected to total nucleic acid (TNA) extraction on an m2000sp (Abbott Molecular, Des Plaines IL) using a research-use only RNA/DNA protocol (500 µl input/50 µl eluate). This TNA extraction uses silica coated magnetic particles for capture and modified buffers to recover both RNA and DNA.

mNGS Library Production
Metagenomic libraries (mNGS) were prepared in a 96-well format with an epMotion automated liquid handler (Eppendorf North America, Enfield, CT). Starting with 10 µl of TNA, RNA was reverse transcribed with random primers and oligodeoxythymidine using Superscript IV (SSRT-IV) 1st Strand reagents (Life Technologies, Carlsbad, CA), followed by 2nd strand synthesis with Sequenase v2.0 DNA Polymerase (Applied Biosystems, Foster City, CA). Double stranded DNA/cDNA was recovered with Agencourt AMPpure XP beads (Beckman Coulter, Brea, CA). The purified DNA/cDNAs were tagmented and barcoded using a Nextera XT Library Prep Kit (Illumina Inc., San Diego, CA) coupled with custom non-biotinylated IDT-Nextera indexes (IDTDNA, Coralville, IA) using 24 cycles of amplification. Nextera libraries were purified with Agencourt AMPpure XP beads (Beckman Coulter) and assessed for quality and concentration by a 2200 TapeStation (Agilent, Santa Clara, CA) and Qubit fluorometer (Life Technologies), respectively. SARS-CoV-2 xGen Capture xGen probe sets were designed by, and purchased from, IDTNDA (Coralville, IA) based on the NCBI Wuhan-Hu-1 reference strain sequence, accession NC_045512.2. A total of 498 probes, each 120-nt in length, tiled the genome with 2X coverage. Probe stocks modified with a 5 ′ biotin tag (3 pmol supplied of each probe) and blocking oligos complementary to Nextera Set A i5 and i7 index primers (1 µl per reaction) were included in the kit. The hybridization procedure (including probe and blocking oligo working concentrations), streptavidin bead captures, and washes were performed essentially as previously described (20). Following bead capture and removal of non-bound libraries, an initial 12 cycles of amplification were performed with the libraries bound to the beads, followed by elution off streptavidin beads and a repeat KAPA amplification of 10 cycles. SARS-CoV-2 xGen captures were performed on mNGS libraries for which <90% genome coverage was obtained. In general, 15-25 Nextera libraries were pooled per capture, stratified by C t as described below.

Next-Generation Sequencing (NGS)
Metagenomic libraries were segregated according to SARS-CoV-2 C t values prior to multiplexing. To avoid intra-run contamination from "barcode hopping, " those with high viral loads (Abbott RealTime SARS-CoV-2 C t < 15, C t < 25 for other manufacturers) were pooled and sequenced together while those with low viral loads (Abbott RealTime SARS-CoV-2 C t > 15) were pooled and sequenced on separate runs. Genome coverage results determined which libraries proceeded to additional xGen capture and enrichment. SARS-CoV-2 xGen libraries were sequenced independently of corresponding mNGS libraries since these share the same barcodes. mNGS and xGen libraries were sequenced on a MiSeq (in-house) or a HiSeq (Novogene, Sacramento, CA) using 2x 150 bp paired-end chemistry.

Sequence Analysis
Mapping of NGS reads to the NCBI NC_045512.2 reference sequence was performed with CLC Genomics Workbench 21.0 software (Qiagen, Aarhus, Denmark). Duplicate reads were removed for xGen-enriched libraries as previously described (19). The consensus SARS-CoV-2 sequence for each specimen was generated using the Extract Consensus Sequence function in the NGS Toolbox.
SNPs in each specimen relative to the reference sequence were recorded in Excel using the pivot table function. Pairwise global alignments of each full genome sequence against the reference strain were generated and merged together in BioEdit (Nucleics, Woollahra, Australia) and evaluated with Los Alamos National Laboratory's online Highlighter tool (https://www.hiv.lanl.gov/ content/sequence/HIGHLIGHT/highlighter_top.html) using the NC_045512.2 sequence as the master reference. Nucleotide changes were indicated by color: green (mutation to A), red (mutation to T), orange (mutation to G), light blue (mutation to C), dark blue (ambiguous base call), and gray (nucleotide deletion) hashes. Additionally, a local instance of the open-source NextClade software (https://github.com/nextstrain/ nextclade) was used to output a table of every SNP and assign GISAID-style and NextStrain-style clades for each sequence.

Phylogenetic Analysis
NextStrain (https://nextstrain.org/ncov/global) was accessed on January 14, 2021. This service sub-samples the entire GISAID database (https://www.gisaid.org) for sequences representative of the overall clade and regional distribution of SARS-CoV-2 genomes to perform a reduced time-correlated lineage analysis. From GISAID, we downloaded all ∼4,000 sequences that appeared in NextStrain on that day, eliminated those sequences with <95% genome coverage, then calculated the clade (e.g., S, G, GH, etc.) and regional (e.g., North America, Europe, etc.) distributions of the remaining sequences. Then, 110 sequences (including the Wuhan-Hu-1 reference) were randomly sampled such that the distribution of clade assignments and regions was the same as before the random sampling. These 110 sequences were used as our reference dataset and their accession numbers are listed in the Supplementary Data.
The 110 references were combined with 79 of our 87 in-house generated sequences (8 of the 87 sequences were omitted due to excessively long, i.e., >200 bp long, strings of ambiguous or uncalled nucleotides) into a single file, then full-genome multiple sequence alignment was performed using the program MAFFT v7.475 (25) with the settings optimized for full-length alignment of closely related (>99% identity) viral genomes (https://mafft. cbrc.jp/alignment/software/closelyrelatedviralgenomes.html). The 5 ′ and 3 ′ UTRs were then removed from the alignment due to the general concentration of ambiguous base calls and sequencing errors in these regions (26,27).
The evolutionary history of these 189 sequences was inferred using the Maximum Likelihood method implemented in IQ-TREE (28). Sequence alignment revealed that in a total of 29,383 aligned sites, 28,409 were constant/invariant (96.6825%) and 232 sites were parsimony-informative. Forty-eight nucleotide models were tested using IQ-TREE's ModelFinder tool (29). The General Time Reversible substitution model with empirical base frequencies and a two-category model of flexible rateheterogeneity-across-sites (GTR+F+R2) generated the lowest Bayesian Information Criterion score (108,416.6063) and was considered the best to describe the substitution pattern. By default, IQ-TREE uses a stochastic algorithm for finding initial ML trees. The Nearest Neighbor Interchange (NNI) heuristic method was then performed to further improve those starting trees. The ML tree generated with the highest log likelihood (−52,212.7286) is presented in this study. Additionally, 1,000 Ultrafast Bootstrap (UFboot) replications (30) and 1,000 replications of the Shimodaira-Hasegawa approximate likelihood ratio test (SH-aLRT) (31) were performed to provide branch supports to the presented tree.
The ML tree was imported into Dendroscope for midpoint rooting (32). Simple mid-point rooting was chosen due to the statistical uncertainty in outgroup-rooting SARS-CoV-2 phylogenies (33). The rooted tree was then visualized using ggtree (34). The presented tree is drawn to scale, with branch lengths scaled by the average number of substitutions per site. All nodes with support >60% for both the SH-aLRT and UFboot tests are denoted with black dots.

Nucleotide Sequence Accession Numbers
The N = 79 full-length SARS-CoV-2 genome sequences which were part of the phylogenetic analysis were submitted to GenBank under the accessions MZ363485-MZ363563.

Target Capture Dramatically Improves Genome Coverage of Low VL SARS-CoV-2 Specimens
One-hundred ninety-four nasopharyngeal swabs in VTM with positive SARS-CoV-2 diagnostic tests were collected from several clinical sites in the United States between March 2020 and June 2020. We first subjected these specimens to the Abbott RealTime SARS-CoV-2 qPCR EUA assay. This assay has 10 "dark" cycles that are not read, so to relate our data to values obtained by other manufacturers, we add 10 cycles to the C t . The C t distribution for these specimens (Figure 1A; left panel) indicated approximately one third (36%) had C t ≤ 25, corresponding to a viral load > 5.35 log copies/ml. Another 39% had C t 25-35, with viral loads between 5.35 log and 2.1 log copies/ml. Seventeen percent had detectable virus below 2.0 log copies/ml. The remaining 8% tested negative in our laboratory by the Abbott RealTime SARS-CoV-2 qPCR assay, even though they tested positive at the originating laboratory. This C t distribution is consistent with SARS-CoV-2 titers determined by Abbott for unrelated specimens, including N = 48 specimens sold by New York Biologics (New York, NY; of which N = 9 were actually qPCR negative by Abbott RealTime) and from N = 51 Abbott RealTime positive specimens obtained from 11 pre-clinical sites across the United States; when these cohorts are combined (N = 99), a similar C t distribution to the present study is observed (Figure 1A, right panel). This same trend is evident in the literature, wherein the majority of SARS-CoV-2 clinical specimens had titers <100,000 cp/ml, which generally presents a challenge for further characterization by NGS (35,36).
Metagenomic libraries (mNGS) were generated for the N = 194 cohort (Rush, Montefiori, iSpecimen, and MCW) by coupling randomly primed cDNA synthesis to a Nextera XT Library Prep kit for barcoding and adapter addition. Libraries were segregated into groups according to C t value to avoid "barcode hopping" of high titer specimen reads into those with low titers. As expected, mNGS genome coverage (Figure 1B; orange bars) was greatest for specimens with low C t values. Any specimen with VL higher than 7.0 log cp/ml obtained full coverage, but this varied considerably across 6 log, and below 5.9 log we were assured of only partial coverage. In total, 16 out of 194 specimens generated enough mNGS reads to assemble a SARS-CoV-2 genome with >90% coverage.
To address the remaining specimens that did not generate >90% genome coverage via mNGS, we deployed custom biotinylated xGen probes tiling the Wuhan-Hu-1 (NC_045512.2) reference genome to select and amplify SARS-CoV-2 sequences from the metagenomic Nextera libraries. xGen target capture was performed on libraries with 5-90% coverage by mNGS and qPCR C t values between 20 and 36. A substantial boost in coronavirus reads per million total reads and genome coverage was realized across this spectrum of VLs, even in those at log 2 cp/ml or lower ( Figure 1B; blue bars). We consistently obtained near-complete or complete genome coverage for specimens below a threshold of C t = 29, corresponding to 10,000 cp/ml or higher. Beyond this cutoff, recovery of the SARS-CoV-2 sequence was highly variable, in agreement with data for other viruses to which we have applied specific xGen target enrichment (18)(19)(20). Specimens that were negative by qPCR were also negative by xGen-NGS.

Clade Representation Matches Global Trends
Our 87 genomes were aligned to the Wuhan-Hu-1 NC_045512.2 reference to reveal SNPs and indels (Figure 2A). Sequences were grouped by collection site and listed in descending order by viral load, with each nucleotide change (A, T, C, G) represented by a different color. Our observations were consistent with a metaanalysis performed on 48,635 strains retrieved from GISAID on June 26, 2020, which ran concurrent with the time we collected our specimens (8). Genomes had an average of 8.7 SNPs (range: 4-16) compared to 7.23 globally, with transitions (usually C→T, in red) favored over transversions and indels. Due to early collection of these specimens, we utilize the GISAID clade nomenclature, which was more commonly used at the time.  N = 99, right). Specimens that were confirmed qPCR negative by Abbott's RealTime test were indicated as positive by the originating lab. (B) Stacked histogram plot indicating genome coverage (y-axis) without (mNGS; orange) or with (xGen; blue) SARS-CoV-2 xGen target capture. Libraries (N = 106) with paired NGS and VL data are sorted from low to high C t (x-axis). (C) Sequence statistics (median ± standard deviation for genome coverage, read depth, and % "on target") by viral load intervals comparing mNGS (orange) to xGen (blue). In each panel, a color bar illustrating the VL range in log cp/ml is included to facilitate comparison of VL to C t . The majority of sequences (>81%) contained hallmark SNPs at the four positions (relative to the Wuhan-Hu-1 reference genome) indicative of the major clade G: 241 (C→T; 5 ′ UTR; silent), 3037 (C→T; nsp3; silent), 14408 (C→T; nsp12b/RDRP; P314L), and 23403 (A→G; spike; D614G) (Table 1, Figure 2B). Most sequences from New York (Montefiori) and Michigan (iSpecimen) were GH clade, a derivative of G that was dominant from an early time in the Americas, with co-occurring SNPs at positions 1,059 (C→T; nsp2; T85I) and 25,563 (G→T; ORF3a; Q57H). Within this GH clade there were four related strains from New York (mzz119200090, −034, −093, and −015) with additional shared mutations at positions 11,916 (C→T; nsp7; S25L) and 18,998 (C→T; nsp14; A320V). These were known to circulate in North America and were seen occasionally in specimens from other cohorts (e.g., Rush-0063). By contrast, most Midwest sequences from Chicago (Rush) and Milwaukee (MCW) lacked the 1,059 and/or 25,563 mutations. Of note are four specimens from Wisconsin (CA017, −018, −064, and −071) which have a spike mutation at position 23,592 (A→G; spike; Q677R). This spike mutation has not been widely reported. Three of these specimens contain two other novel mutations at positions 7,764 (C→T; nsp3; S1682F) and 25,904 (C→T; ORF3a; S171L). A specimen from New York, mzz119200019, contains the mutation at position 25,904, but not the ones at positions 23,592 or 7,764.
We observed the linked G28881A/G28882A/G28883C mutations (nucleocapsid, R203K/G204R) indicative of the GR clade seven times, all originating from the Milwaukee site. These same individuals (CA082, CA076, CA099, CA005, CA032, CA085, and CA086) also possessed silent mutations at positions 12,025 and 19,839, a variant which has not yet been widely reported. At the time these specimens were collected, the GR clade was most common in Europe, Australia, and South America (8).
There were seven specimens from the Midwest presumably belonging to the S clade, defined by mutations at positions 8,782 (C→T; nsp4; silent) and 28 The profiles of five specimens from Milwaukee did not match any widely reported criteria and presumably belong to an "O" clade. CA026 was dissimilar to all other specimens, lacking all mutations in common to the other 86 sequences. CA092 and CA093 were distinct from all other specimens but highly related to each other, with shared mutations at positions 556, 10,029, 12,729, 18,812, 22,295, and 29,711. CA027 was largely devoid of SNPs, but resembled CA092 and CA093 by having a mutation at position 556 (C→T; nsp1; silent at 78Y) and CA092, CA093, and CA026 by having a mutation at position 29711 (G→T; 3 ′ -UTR). Specimen CA034 also has a mutation pattern unlike most other specimens, except that it contains a mutation at position 25,563 (G→T; ORF3a; Q57H) that is usually indicative of the GH clade. Curiously, specimen CA034 does not contain other mutations indicative of the G clade or its descendants.

Mutational Profiles Are Representative of Those Found Early in the Pandemic and Reflect Multiple Independent Introductions
Phylogenetic trees were constructed to confirm the clade assignments that were made based on SNPs alone. One hundred and nine full-length SARS-CoV-2 sequences with representatives from each clade were downloaded from GISAID on January 14, 2021 and aligned with 79 of our 87 complete genomes and the Wuhan-Hu-1 reference to construct maximum likelihood trees (Figure 3). As expected from our analysis of hallmark SNPs, most specimens grouped within the GH clade, with high representation from all four sampling sites. Most of the Michigan (iSpecimen) and Milwaukee (MCW) specimens within the GH clade had short branch lengths, owing to their relatively low number of mutations, whereas related specimens from Chicago (RUSH) and New York (Montefiore) had longer branch lengths.
The four Wisconsin specimens bearing the Q677R spike protein mutation (CA017, −018, −064, and −071) formed a monophyletic group within the GH clade, indicating that this mutation arose just once in our cohort and that these four specimens are likely closely related in time and space. Indeed, a retrospective chart review revealed 3 of the 4 subjects (all 61 years of age or older) were from the same long-term care facility and all had the additional mutation at position 7,764. While patients would have been symptomatic, no medical records outside of the testing could be obtained. Q677 lies within a loop near the furin cleavage site, downstream of the receptor binding domain. Except for the proline (P681) immediately proximal to the polybasic RRAR motif and Q675 (∼4,600 mutant sequences), Q677 exhibits greater diversity compared to other residues in this disordered region (Supplementary Figure 2) The GR clade sequences were all from Milwaukee and these also clustered together with short branch lengths. We detected a small number of original clade G specimens in Chicago and Milwaukee, and we confirmed the presence of S and O clade representatives in the Midwest. Thus, our tree depicts the heterogeneity of clades found early in the pandemic and reflects a low number of accumulated mutations. The specimens CA027, −026, and −093, which could be assigned to no known clade, clustered close to the Wuhan-Hu-1 reference sequence in our tree. The other O clade specimen, CA034, does not branch near these other three and represents a separate O lineage.

DISCUSSION
The enormous worldwide effort to sequence SARS-CoV-2 specimens has produced the most complete picture of the emergence and early evolution of a viral pathogen in realtime (37). Genetic surveillance of this virus is necessary to identify new variants that can lead to superspreading events, evasion of diagnostic tests, and circumvention of vaccination efforts. In this study, we succeeded in applying a SARS-CoV-2specific xGen enrichment method to boost sequencing coverage in specimens with low VL. With metagenomic NGS alone, we could only recover high-coverage genomes from specimens with high VL (reliable high coverage >log 7 cp/ml, variable rates of high coverage down to log 5.6 cp/ml); however, our xGen enrichment method allowed us to extend this coverage in specimens with VLs as low as log 2-3 cp/ml. This range of sensitivity agrees with that observed by Xiao et al. (38); however, our study used more clinical specimens and compared target capture-derived genome coverage directly to qPCR C t values obtained from the on-market Abbott RealTime SARS-CoV-2 assay. While no greater genetic diversity was observed in the low viral load samples (nor did we expect to see this), we stress that a benefit of the method is its "on target" rate. With more than half (57%) of xGen reads aligning to SARS-CoV-2 vs. 0.001% for mNGS, this represents a substantial improvement for samples in the log 3-6 cp/ml interval (Supplementary Table 1). A onetime purchase of probes, minimal reagent cost, and an additional day of work will add expense; however, we contend that the data quality (e.g., genome coverage, sequence depth) and the ability to multiplex more samples per NGS run is more than worth the trade-off. Additionally, xGen target enrichment has an advantage over multiplex PCR techniques like ARTIC in that leftover unenriched metagenomic libraries can be analyzed by NGS to look for secondary acute or chronic infections that may contribute to a patient's disease state.
Using xGen target enrichment also provides flexibility in obtaining full genomes over a larger window during infection. Low VL specimens may simply result from being collected during the early stage of viremia or just prior to clearance when viral loads are expected to be low. Thus, one can capture full genomes of strains that are still be capable of transmitting and contributing to the pandemic. Since over half of the 194 SARS-CoV-2-positive clinical specimens evaluated in this study had VL values below log 5 cp/ml (C t > 26, Figure 1A), we extrapolate that this enrichment method could potentially double the number of specimens from which full SARS-CoV-2 genomes can be recovered by NGS in other laboratories.
With xGen-NGS we were able to recover an extra 71 full or nearly-full genomes from medium and low VL specimens, bringing our total to 87. This >4X boost in quality genomes recovered allowed us to better evaluate the geographic distribution of SNPs at an early stage of the pandemic in the midwestern and northeastern United States. Through SNP and phylogenetic analysis, we observed an overall pervasiveness of specimens belonging to the GH clade, reinforcing previous work showing that this clade quickly became dominant in the United States in the spring of 2020 (8). However, a surprising finding was the high level of clade diversity seen from the specimens originating from MCW and Rush in May-June 2020; this indicates that in the early stages of the pandemic the populous greater Chicago and Milwaukee areas (separated by <100 miles) acted as a "melting pot" of sorts. The presence of clade G (and its descendants; probably due to early introduction from Europe) and clades S/V/O/L (probably due to early introduction from Asia) highlights the high probability of multiple introduction events due to international travel before mitigation efforts were put into place.
On the other hand, less diversity was seen in the specimens sourced from New York City (Montefiori Health System); all specimens closely clustered within Clade GH in Figure 3. Linking this finding to the mitigation efforts that were in place at the time potentially indicates that the travel and business restrictions imposed on the city during its first major case wave in the spring of 2020 temporarily lessened viral diversity. Equally homogenous were the specimens from Michigan (iSpecimen). Though these specimens were collected even earlier in the pandemic (March 2020) and could be biased due to a smaller sample size, the majority of them belonged to clade GH, which at the time contained an overall higher number of SNPs than most other clades. We can thus conclude that even before the first major American mitigation efforts were imposed in early March, the emergence of clade GH and its further diversification had already been achieved for some time.
A Q677R mutation in spike (position 23,592) was observed in four symptomatic patients in Wisconsin. Surveillance efforts have independently identified multiple lineages containing other Q677 mutations, such as Q677P and Q677H, on the rise in the southeastern and southwestern United States in the latter half of 2020 (39). It was suggested that given the proximity to the S1/S2 boundary in spike this mutation may affect proteolytic processing and cell entry, providing a selective advantage for transmissibility to drive convergent evolution at this position (39). The appearance of Q677 variants around the globe supports the idea of positive selection (40). Alternatively, an alignment of betacoronaviruses illustrates Q677 is not well-conserved and that multiple amino acids are tolerated at this position (41). The paucity of other reported mutations for SARS-CoV-2 in the immediate vicinity reflect this evolutionary conservation (Supplementary Figure 2). These Milwaukee variants containing 7,764 and 25,904 mutations ultimately did not spread out of Wisconsin, but our collection in April 2020 establishes a much earlier emergence of Q677 mutations in general, in agreement with those deposited by nearby University of Wisconsin-Madison (GISAID sequences USA/WI-UW-950/2020 and USA/WI-UW-957/2020).
More recent variants of concern affecting the spike receptor binding domain (RBD, aa 350-550) such as N501Y, E484K, K417T, and P681R from the Alpha (B.1.1.7), Beta (B.1.351), Gamma (B.1.1.28), and Delta (B.1.617.2) variants, respectively, now dominate worldwide (9,10,12). Sequencing infections in a high-quality manner is the first step for ascertaining whether new mutations have appeared and if they have potential to impact transmissibility, disease progression, or immune evasion (42)(43)(44)(45). Having the ability to sequence low viral load samples through xGen target enrichment will allow us to characterize specimens that may be recalcitrant to standard sequencing approaches, positively contributing to the mission of recovering as many SARS-CoV-2 sequences as possible.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: https://www.ncbi.nlm.nih. gov/genbank/, MZ363485-MZ363563.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Rush Medical Center, Medical College of Wisconsin, Montefiori Health Center. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
GO and MB performed data analysis, interpretation, and wrote paper. KF performed experiments and analysis. TM, IM, and AM performed bioinformatic analysis. MF, CJ, AL, DG, and AM supervised specimen collection. GC supervised project and wrote paper. All authors contributed to the article and approved the submitted version.

FUNDING
All funding for this research was provided by Abbott Laboratories. The funder provided support in the form of salaries for authors GO, KF, TM, IM, AM, MB, and GC, but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.