Somatic Populations of PGT135–137 HIV-1-Neutralizing Antibodies Identified by 454 Pyrosequencing and Bioinformatics

Select HIV-1-infected individuals develop sera capable of neutralizing diverse viral strains. The molecular basis of this neutralization is currently being deciphered by the isolation of HIV-1-neutralizing antibodies. In one infected donor, three neutralizing antibodies, PGT135–137, were identified by assessment of neutralization from individually sorted B cells and found to recognize an epitope containing an N-linked glycan at residue 332 on HIV-1 gp120. Here we use next-generation sequencing and bioinformatics methods to interrogate the B cell record of this donor to gain a more complete understanding of the humoral immune response. PGT135–137-gene family specific primers were used to amplify heavy-chain and light-chain variable-domain sequences. Pyrosequencing produced 141,298 heavy-chain sequences of IGHV4-39 origin and 87,229 light-chain sequences of IGKV3-15 origin. A number of heavy and light-chain sequences of ∼90% identity to PGT137, several to PGT136, and none of high identity to PGT135 were identified. After expansion of these sequences to include close phylogenetic relatives, a total of 202 heavy-chain sequences and 72 light-chain sequences were identified. These sequences were clustered into populations of 95% identity comprising 15 for heavy chain and 10 for light chain, and a select sequence from each population was synthesized and reconstituted with a PGT137-partner chain. Reconstituted antibodies showed varied neutralization phenotypes for HIV-1 clade A and D isolates. Sequence diversity of the antibody population represented by these tested sequences was notably higher than observed with a 454 pyrosequencing-control analysis on 10 antibodies of defined sequence, suggesting that this diversity results primarily from somatic maturation. 
Our results thus provide an example of how pathogens like HIV-1 are opposed by a varied humoral immune response, derived from intrinsic mechanisms of antibody development, and embodied by somatic populations of diverse antibodies.


INTRODUCTION
Recent years have seen revolutions in both genomics and computational science (Lander et al., 2001; Venter et al., 2001; Chen et al., 2012). In both of these fields, capabilities are advancing exponentially (Kahn, 2011). The impact of this non-linear development on biology is pervasive and multifaceted. With respect to virus research, the influence has been profound and is the focus of this special issue of Frontiers. Medical interest in viruses is focused on pathogens and their infection, and the biological mirror of infection is the host immune response. Advances in genomics and computational science have the potential for an equally profound impact on our understanding of the immune response. Here we focus on the application of new genomic and computational techniques, particularly 454 pyrosequencing of B cell transcripts (Reddy et al., 2009; Reddy and Georgiou, 2011; Wu et al., 2011) and systems-level bioinformatics (Kitano, 2002), to understand the antibody response to infection.
The human immunodeficiency virus type 1, HIV-1, is the etiological agent of a global pandemic, which has killed over 30 million people and currently infects ∼1% of adults worldwide (UNAIDS, 2010). HIV-1 is a retrovirus and member of the lentivirus genus (Gonda et al., 1985; Sonigo et al., 1985). Global genetic diversity of HIV-1 is extraordinarily high (Starcich et al., 1986; Korber et al., 2001), and this is thought to result from the low fidelity of its genome replication (Preston et al., 1988) as well as the persistent nature of the infection: the diversity of HIV-1 virus within a single individual after 6 years of infection is equivalent to the global diversity of H1N1 influenza observed annually (Korber et al., 2001). Infection by HIV-1 elicits many antibodies, but in general these are not capable of neutralizing diverse strains of HIV-1. However, after several years of infection, 10-25% of infected individuals develop broadly neutralizing antibodies (Li et al., 2007; Gray et al., 2009; Sather et al., 2009; Simek et al., 2009; Stamatatos et al., 2009; Doria-Rose et al., 2010; Gnanakaran et al., 2010). These antibodies provide little or no benefit to the infected host, as the evolution of the virus outpaces the immune response (Poignard et al., 1999; Wei et al., 2003). Nevertheless, these antibodies, when tested in humanized mice or macaque models by passive antibody transfer, impart effective immunity to challenge with HIV-1 or simian/human chimeric immunodeficiency viruses (Mascola et al., 1999, 2000; Parren et al., 2001; Mascola, 2003; Veazey et al., 2003; Hessell et al., 2009a,b; Balazs et al., 2011), indicating the potential for their use as targets for re-elicitation by rationally designed vaccines (reviewed in Walker and Burton, 2010; Kwong et al., 2011). Thus, substantial interest has focused on understanding human antibodies that effectively neutralize diverse strains of HIV-1.
A number of techniques have recently been applied to the identification of such antibodies. These methods, including antigen-specific B cell sorting (Scheid et al., 2009; Wu et al., 2010) and direct assessment of neutralization by antibodies secreted from individually sorted B cells, each coupled to single B cell sequencing techniques, have so far yielded dozens of broadly HIV-1-neutralizing antibodies. These antibodies represent an extraordinarily sparse sampling of the humoral immune response, which typically generates roughly a billion new B cells in a healthy individual each day. We therefore asked whether the revolutionary new capabilities of next-generation sequencing (Mardis, 2008a,b; Boyd et al., 2010; Hawkins et al., 2010) and computational science could expand this sampling to generate a more complete understanding of the humoral immune response. In principle, memory B cells contain a persistent record of the antibody response to infection. As memory B cells are readily obtained from blood, they provide a convenient means to access the antibody record, with B cell transcripts in peripheral blood mononuclear cells (PBMCs) providing a genetic representation. Using three antibodies, PGT135-137 from Protocol G donor 39, as an example, we used 454 pyrosequencing of PCR-amplified heavy- and light-chain transcripts to capture a more comprehensive genetic record. We used bioinformatics approaches to interrogate this record, to identify populations of neutralizing antibodies, and to characterize their ontogenies. We link these ontogenies to the natural mechanisms of B cell development to provide a view of how somatic populations of antibodies engender a diverse immunological response to infection.

HUMAN SPECIMENS
The PBMCs of the HIV-1 infected donor 39 were obtained from the International AIDS Vaccine Initiative (IAVI) protocol G. The same sample was used to isolate broadly neutralizing antibodies PGT135-137. Human peripheral blood samples were collected after obtaining informed consent and appropriate Institutional Review Board (IRB) approval.

SAMPLE PREPARATION FOR 454 PYROSEQUENCING
Ten previously described heavy-chain plasmids with known sequences (Wu et al., 2011) were selected to assess 454 pyrosequencing error. Ten plasmids (100 ng each) were combined in 35 µl water, and 1 µl of the ten-plasmid combination was used to template polymerase chain reactions (PCRs). The heavy and kappa chain PCR samples for 454 pyrosequencing from donor 39 were prepared as described (Wu et al., 2011) with minor modifications. Briefly, mRNA was extracted from 20 million PBMCs into 200 µl of elution buffer (Oligotex kit, Qiagen), then concentrated to 10-30 µl by centrifuging the buffer through a 30 kD micron filter (Millipore). The reverse transcription was performed in one or multiple 35 µl reactions, each composed of 13 µl of mRNA, 3 µl of oligo(dT)12-18 at 0.5 µg/µl (Invitrogen), 7 µl of 5× first strand buffer (Invitrogen), 3 µl of RNase Out (Invitrogen), 3 µl of 0.1 M DTT (Invitrogen), 3 µl of dNTP mix (each at 10 mM), and 3 µl of SuperScript II (Invitrogen). The reactions were incubated at 42˚C for 2 h. The cDNAs from each reaction were combined, applied to the NucleoSpin Extract II kit (Clontech), and eluted in 20 µl of elution buffer. In this way, 1 µl of the cDNA comprised transcripts from 1 million PBMCs. The immunoglobulin gene family-specific PCR was set up in a total volume of 50 µl, using 1 µl of the heavy-chain plasmid mix or 5 µl of the cDNA as template (equivalent to transcripts from 5 million PBMCs). The DNA polymerase system used was the Platinum Taq High-Fidelity (HiFi) DNA Polymerase System (Invitrogen). According to the instructions of the manufacturer, the reaction mix was composed of water, 5 µl of 10× buffer, 1 µl of supplied MgSO4, 2 µl of dNTP mix (each at 10 mM), 1-2 µl of primers (Table S1 in Supplementary Material) at 25 µM, and 1 µl of Platinum Taq HiFi DNA polymerase. The primers each contained the appropriate adaptor sequences (XLR-A or XLR-B) for subsequent 454 pyrosequencing.
The PCRs were initiated at 95˚C for 30 s, followed by 25 cycles of 95˚C for 30 s, 58˚C for 30 s, and 72˚C for 1 min, then incubated at 72˚C for 10 min. The PCR products at the expected size (∼500 bp) were gel extracted and purified (Qiagen), followed by further phenol/chloroform purification.

PYROSEQUENCING AND LIBRARY PREPARATION
The 454 pyrosequencing was carried out as described previously (Wu et al., 2011). Briefly, PCR products were quantified using Qubit (Life Technologies, Carlsbad, CA, USA). Library concentrations were determined using the KAPA Biosystems qPCR system (Woburn, MA, USA) with 454 pyrosequencing standards provided in the KAPA system. Pyrosequencing of the PCR products was performed on a GS FLX sequencing instrument (Roche-454 Life Sciences, Branford, CT, USA) using the manufacturer's suggested methods and reagents. Initial image collection was performed on the GS FLX instrument, and subsequent signal processing, quality filtering, and generation of nucleotide sequence and quality scores were performed on an off-instrument Linux cluster using 454 application software (version 2.5.3). The amplicon quality filtering parameters were adjusted based on the manufacturer's recommendations (Roche-454 Life Sciences Application Brief No. 001-2010). Quality scores were assigned to each nucleotide using methodologies incorporated into the 454 application software to convert flowgram intensity values to Phred-based quality scores, as described (Brockman et al., 2008). The quality of each run was assessed by analysis of internal control sequences included in the 454 pyrosequencing reagents. Reports were generated for each region of the PicoTiterPlate (PTP) for both the internal controls and the samples.

BIOINFORMATICS ANALYSIS OF 454 PYROSEQUENCING-DETERMINED ANTIBODY SEQUENCES
Our previously described bioinformatics pipeline (Wu et al., 2011) was refined and currently consists of five steps. Starting from a 454 pyrosequencing-determined antibodyome, each sequence read was (1) reformatted and labeled with a unique index number; (2) assigned to variable (V), diversity (D), and joining (J) gene families and alleles using an in-house implementation of IgBLAST (footnote 1), with sequences having an E-value > 10⁻³ for V gene assignment rejected; (3) subjected to a template-based error-correction procedure, in which 454 pyrosequencing homopolymer errors in V, D, and J regions were detected based on the alignment to their respective germline sequences (note that only insertion and deletion errors of less than three nucleotides were corrected, and that D and J genes were corrected only when their gene assignment was reliable, as indicated by an E-value < 10⁻³); (4) compared with a set of template antibody sequences at both the nucleotide level and the amino-acid level using a global alignment module in CLUSTALW2 (Larkin et al., 2007); and (5) subjected to a multiple sequence alignment (MSA)-based scheme to determine the third complementarity-determining region (CDR H3 or L3), which was further compared with a set of template CDR H3 or L3 sequences at the nucleotide level, and to determine the sequence boundary of the variable domain. For a large population of highly similar sequences, a "divide-and-conquer" procedure could be used to derive a consensus sequence to represent the population and to reduce random sequencing errors. First, clustering using BLASTClust (Altschul et al., 1997) with a 95% sequence identity cutoff is performed on the sequence population. Then, the largest cluster is divided into 10-50 sets, for each of which a consensus can be derived from an MSA. A final consensus is obtained by averaging over the subset consensuses.
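The "divide-and-conquer" consensus step can be illustrated with a minimal Python sketch. It is not the pipeline's own code: it assumes the reads of the largest cluster have already been aligned to equal length (with "-" marking gaps), it interprets "averaging over the subset consensuses" as a per-column majority vote, and the function names are hypothetical.

```python
import random
from collections import Counter

def column_consensus(aligned):
    """Majority-vote consensus of equal-length aligned sequences ('-' marks a gap)."""
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")
    # For every alignment column, keep the most frequent character.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))

def divide_and_conquer_consensus(aligned, n_sets=10, seed=0):
    """Split a cluster into n_sets subsets, derive one consensus per subset,
    then take the consensus of the subset consensuses (assumed majority vote)."""
    reads = list(aligned)
    random.Random(seed).shuffle(reads)           # random assignment to subsets
    n_sets = max(1, min(n_sets, len(reads)))     # never create empty subsets
    subsets = [reads[i::n_sets] for i in range(n_sets)]
    subset_consensuses = [column_consensus(s) for s in subsets]
    final = column_consensus(subset_consensuses)
    return final.replace("-", "")                # drop alignment gaps
```

Because random errors are scattered across reads, they are rarely the majority in any subset and even more rarely in the vote across subsets, which is why the final consensus tends to recover the input sequence.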
Intra-donor phylogenetic analysis uses the same procedure as cross-donor phylogenetic analysis, which has been described in detail in a previous study (Wu et al., 2011), except that the template antibodies are from the same donor (intra-donor) rather than added exogenously (cross-donor), and intra-donor phylogenetic analysis is equally applicable to heavy and light chains. Briefly, the computational procedure consists of an iterative analysis based on the neighbor-joining (NJ) method (Kuhner and Felsenstein, 1994) implemented in CLUSTALW2 (Larkin et al., 2007) and a final analysis based on the maximum-likelihood (ML) method with molecular clock implemented in DNAMLK (footnote 2) in the PHYLIP package v3.69 (footnote 3). In the NJ-based analysis, donor sequences of a particular germline origin were first randomly shuffled and divided into subsets of no more than 5,000 sequences. Then, PGT135-137 and the respective germline sequence, IGHV4-39*07 for heavy chain and IGKV3-15*01 for light chain, were added to each subset. An NJ tree was constructed for each subset using the "Phylogenetic trees" option in CLUSTALW2 (Larkin et al., 2007). The donor sequences that clustered in the smallest branch that contains PGT135-137 were extracted from each NJ tree and combined into a new data set for the next round of analysis. The analysis was repeated until convergence, where all the donor sequences resided within a subtree containing PGT135-137 and no other sequences resided between this subtree and the root, and where further repetition of the analysis did not change the NJ tree. The ML-based analysis was used to confirm the intra-donor dendrogram derived from the NJ-based analysis. Starting from the data set obtained from the last iteration of NJ analysis, the MSA generated by CLUSTALW2 (Larkin et al., 2007) was provided as input to construct a phylogenetic tree using DNAMLK. Usually, any sequences outside the ML-defined subtree were discarded, but in this study we tested light chains identified by the NJ method but lying immediately outside the rooted ML-defined PGT135-137 subtree. The displayed phylogenetic trees were generated using Dendroscope (Huson et al., 2007), ordered to ladderize right and rooted at the germline genes.
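The bookkeeping of the iterative NJ screen (shuffle, split into subsets of at most 5,000, add templates and germline, keep the template-containing branch, repeat) can be sketched as below. This is an illustrative outline only: tree construction is delegated to a caller-supplied function, here called `select_subtree`, which is a hypothetical stand-in for building an NJ tree with CLUSTALW2 and extracting the smallest branch containing the templates. The convergence test is simplified to "the retained set no longer changes" and does not inspect tree topology.

```python
import random

def make_nj_subsets(donor_ids, template_ids, germline_id, max_size=5000, seed=0):
    """Shuffle donor sequence IDs, split them into chunks of at most max_size,
    and append the template antibodies and the germline gene to every chunk."""
    if max_size < 1:
        raise ValueError("max_size must be positive")
    ids = list(donor_ids)
    random.Random(seed).shuffle(ids)
    chunks = [ids[i:i + max_size] for i in range(0, len(ids), max_size)]
    return [chunk + list(template_ids) + [germline_id] for chunk in chunks]

def iterate_until_converged(donor_ids, select_subtree, template_ids, germline_id,
                            max_size=5000, max_rounds=50):
    """Repeat the split/select cycle until the retained donor set stops changing.

    select_subtree(subset) is a hypothetical callback that returns the IDs found
    in the smallest branch containing the templates for one subset."""
    current = set(donor_ids)
    for round_no in range(max_rounds):
        kept = set()
        subsets = make_nj_subsets(sorted(current), template_ids, germline_id,
                                  max_size, seed=round_no)
        for subset in subsets:
            kept.update(select_subtree(subset))
        kept &= current          # ignore templates/germline; never re-admit reads
        if kept == current:      # simplified convergence criterion
            return kept
        current = kept
    return current
```

Splitting into bounded subsets keeps each tree small enough to compute, while the repeated rounds progressively concentrate the template-related sequences into a single data set suitable for the final ML analysis.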
A description of the antibodyomics software (Antibody-omics1.0) utilized in this paper is being prepared for publication.

ANTIBODY EXPRESSION AND PURIFICATION
Antibody production followed previously described procedures (Wu et al., 2011). Briefly, sequences were selected using the respective bioinformatics procedure and checked for sequencing errors using an automatic error-correction procedure followed by manual inspection. The corrected antibody sequences were synthesized (GenScript USA Inc. and Blue Heron Biotech, LLC.) and cloned into the CMV/R expression vector (Barouch and Nabel, 2005) containing the constant regions of IgG1. All synthesized heavy chains were paired with PGT137 light-chain DNA, and synthesized light chains were paired with PGT137 heavy-chain DNA for transfection. Full-length IgGs were expressed from transient transfection of 293F cells and purified using a recombinant protein-A column (Pierce).

HIV-1 NEUTRALIZATION
Neutralization was measured using HIV-1 Env-pseudoviruses to infect TZM-bl cells as described (Li et al., 2005; Wu et al., 2009; Seaman et al., 2010). Neutralization curves were fit by non-linear regression using a five-parameter Hill slope equation as described. The 50% and 80% inhibitory concentrations (IC50 and IC80) were reported as the antibody concentrations required to inhibit infection by 50% and 80%, respectively.
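To make the relationship between a fitted curve and the reported IC50 and IC80 values concrete, the sketch below uses one common parameterization of a five-parameter logistic (Hill-type) curve. The exact equation and fitting software used in this study are not specified here, so the functional form, parameter names, and function names are all assumptions. Curve fitting itself is not shown; the code only evaluates a curve and inverts it analytically.

```python
def five_pl(conc, bottom, top, c, hill, asym):
    """Percent neutralization at antibody concentration conc (> 0).

    Assumed form: bottom + (top - bottom) / (1 + (c / conc)**hill)**asym,
    which rises from `bottom` to `top` as the concentration increases."""
    return bottom + (top - bottom) / (1.0 + (c / conc) ** hill) ** asym

def inhibitory_concentration(percent, bottom, top, c, hill, asym):
    """Concentration at which the curve reaches `percent` neutralization
    (e.g. 50 for IC50, 80 for IC80), obtained by inverting five_pl."""
    if not bottom < percent < top:
        raise ValueError("percent must lie strictly between bottom and top")
    ratio = (top - bottom) / (percent - bottom)
    return c / (ratio ** (1.0 / asym) - 1.0) ** (1.0 / hill)
```

With `asym = 1`, `bottom = 0`, and `top = 100`, the curve reduces to a standard four-parameter form and the IC50 equals the midpoint parameter `c`.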

RESULTS
Experiments involving both sequencing technologies and computational analyses are described. Because variable region transcripts of antibodies are over 300 nucleotides in length and because the high similarity between different antibody transcripts precludes assembly of full sequences from fragments, we used 454 pyrosequencing, which is currently one of the few next-generation sequencing technologies to provide reads of sufficient length (Reddy et al., 2009; Reddy and Georgiou, 2011; Wu et al., 2011). However, 454 pyrosequencing is known to suffer from high error rates (Prabakaran et al., 2011). We therefore begin by characterizing the accuracy of 454 pyrosequencing applied to a set of plasmid standards consisting of known HIV-neutralizing antibodies. We then describe 454 pyrosequencing of antibody heavy-chain transcripts from donor 39, and analyze these data bioinformatically and functionally. We follow this with a similar analysis of donor 39 light-chain transcripts.
FIGURE 1 | Sequence variation as a consequence of 454 pyrosequencing for ten plasmid-control antibodies. To quantify sequencing error, ten antibodies, input as purified plasmid DNA, were subjected to 454 pyrosequencing. Tested plasmid antibodies included VRC01, VRC03, VRC-PG04, VRC-CH31, VRC-CH33, a codon-optimized version of the inferred, reverted unmutated ancestor of VRC-PG04 (termed VRC-PG04 cog), gVRC-H3 d74, gVRC-H6 d74, gVRC-H12 d74, and gVRC-H15 d74. Heavy-chain sequences are plotted as a function of sequence identity to the plasmid antibody (vertical axes) and of sequence divergence from their germline gene allele, IGHV1-2*02 (horizontal axes). The sequencing data used for divergence/identity analysis were processed by the standard bioinformatics pipeline without the error-correction step. Color coding indicates the number of sequences. For VRC01 and VRC03, additional contour plots displaying the estimated mutational error range (one root-mean-square deviation, 1.38% for the VRC01 group and 1.26% for the VRC03 group) have been shaded red around the input antibody.
www.frontiersin.org

CHARACTERIZATION OF 454 PYROSEQUENCING ERRORS ON ANTIBODY TRANSCRIPTS
To investigate the extent of 454 pyrosequencing errors on the antibodyome analysis, we carried out a sequencing experiment on the heavy chains of 10 selected antibodies (Wu et al., 2011), including five from B cell sorting-based isolation (VRC01, VRC03, VRC-PG04, VRC-CH31, and VRC-CH33), one codon-optimized version of the inferred reverted unmutated ancestor of VRC-PG04 (termed VRC-PG04 cog), and four identified from a previous 454 pyrosequencing study (gVRC-H3 d74, gVRC-H6 d74, gVRC-H12 d74, and gVRC-H15 d74). The plasmid sequencing data were processed with the same bioinformatics pipeline used for donor sequencing data (Figure S1 in Supplementary Material). Sequence reads were subjected to an error-correction procedure, which aimed to fix deletion and insertion errors that cause protein translation problems (see Materials and Methods). Results obtained with and without error correction were compared to examine the effect of error correction on observed sequence variation.
Frontiers in Microbiology | Virology
A divergence/identity analysis was first carried out on the 10-plasmid data set, obtained without (Figure 1) and with error correction (Figure S2 in Supplementary Material). Since divergence and identity were calculated at the nucleotide level, error correction appeared to have little effect on the sequence distribution. Ideally, if the 454 pyrosequencing did not produce any errors, especially mutations, the distribution, irrespective of the antibody being used as template, would yield on divergence/identity plots 10 discrete points, each corresponding to one of the input sequences. In contrast, divergence/identity plots revealed broad islands centered around each of these 10 antibody sequences (Figure 1). The shape and area of each island provide a visual representation of the extent of the 454 pyrosequencing errors. As shown in Table 1, 5 of the 10 antibodies (those with an identity gap of 25% or greater to the next most closely related sequence) were easily distinguished from each other, while other more closely related variants, e.g., VRC-CH31 and VRC-CH33, overlapped (Figure 1). Based on identity considerations (Table 1) and the scope of each island in divergence/identity plots (Figure 1), a single cutoff of 75% was applied to group 454 pyrosequencing-determined sequences for VRC01, VRC03, VRC-PG04 cog, gVRC-H3 d74, and gVRC-H6 d74. Each of these five 454 pyrosequencing-determined sequence groups was analyzed for mutations, insertions, and deletions relative to the input plasmid sequence, as well as total number of reads and their redundancy (Table 2). For four of the plasmids ∼50,000 reads were obtained; for gVRC-H6 d74, however, only about one fourth as many were obtained, which may relate to a lower efficiency of the primer used for gVRC-H6 d74.
In terms of redundancy, for three of the plasmids between one fifth and one half of the reads were identical to the input plasmid, whereas for VRC01 and gVRC-H6 d74, only a small fraction (<1% and <10%, respectively) of the reads were identical to the input plasmid, a result of insertions in most of the sequences. Note that after error correction, between 20 and 3,254 more sequences became identical to the input antibodies (Table 2). Overall, for an antibody of typical length, ∼5 nucleotide mutations were observed between 454 pyrosequencing reads and corresponding input sequences. Error correction appeared to cause an increased count of mutation errors while decreasing insertion and deletion errors that produce stop codons and nonsense codons in protein translation. The currently used correction procedure was able to improve the identity of the translated protein sequence to the respective germline gene by an average of 14.1% (Figures S1C,D in Supplementary Material).
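The two coordinates of a divergence/identity plot can be computed from pairwise alignments as in the following Python sketch. It assumes each read has already been globally aligned to the germline allele and to the template antibody (equal-length strings, "-" for gaps), and it counts a gap opposite a nucleotide as a mismatch. How gaps are scored in the actual pipeline is not stated here, so that choice and the function names are assumptions.

```python
def percent_identity(a, b):
    """Percent identical positions between two pre-aligned, equal-length sequences.

    Columns in which both sequences carry a gap are skipped; a gap opposite a
    nucleotide counts as a mismatch (an assumption of this sketch)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        raise ValueError("alignment has no informative columns")
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)

def divergence_identity_point(read_vs_germline, read_vs_template):
    """Return (divergence from germline, identity to template) for one read.

    Each argument is a pair of aligned strings: (read, reference)."""
    divergence = 100.0 - percent_identity(*read_vs_germline)
    identity = percent_identity(*read_vs_template)
    return divergence, identity
```

An error-free read of an input plasmid would land at 100% identity and at the plasmid's own divergence from germline; sequencing errors scatter reads around that point, producing the islands seen in Figure 1.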
We then examined the accuracy of bioinformatically selected representative sequences for these five antibody groups. Note that all these sequences have been subjected to a template-based error-correction procedure in the pipeline processing. A "divide-and-conquer" procedure (see Materials and Methods) was used for sequence calculation. Remarkably, the representative sequence was 100% identical to the "true" sequence used as input for 454 pyrosequencing for VRC-PG04 cog, gVRC-H3 d74, and gVRC-H6 d74, while having one 1-nucleotide deletion and two 1-nucleotide insertions for VRC01 and VRC03, respectively. None had mutation errors. Such a consensus-based sequence-picking procedure may prove useful in cases where a population of closely related sequences is observed on the divergence/identity plot, as indicated by a densely populated island.

PYROSEQUENCING OF DONOR 39 IGHV4 FAMILY AND BIOINFORMATICS ANALYSIS OF HEAVY CHAINS
We next performed 454 pyrosequencing of PGT135-137-related heavy-chain transcripts from donor 39 PBMCs. mRNA from ∼5 million PBMCs was used for reverse transcription to produce template cDNA, and PCR was used to amplify IgG and IgM heavy-chain sequences from the IGHV4 family using forward primers that overlapped the end of the V gene leader sequence and the start of the V region and reverse primers covering the start of the constant domain (Table S1 in Supplementary Material).
Table footnotes: X denotes the type of sequencing error to be characterized, mutation (Mut), insertion (Ins), and deletion (Del), respectively; X denotes the averaged sequencing error;
N Seq denotes the total number of sequences within a given antibody group.
The RMS values were normalized using RMS_normalized = (RMS_unnormalized / length_unnormalized) × 100 to take into account the difference in sequence length.

Next-generation pyrosequencing provided 918,298 reads, which were processed with a bioinformatics pipeline that involved assignment of germline origin genes, 454 pyrosequencing-error correction, and extraction of CDR H3 regions for lineage assignment. Overall about 85.3% of the raw reads spanned over 400 nucleotides, covering the entire variable domain. After computational assignment of V, D, and J gene components, 142,842 sequences were assigned to the IGHV4-39 germline family, accounting for ∼16% of the expressed VH4 antibodyome. Each sequence was subjected to an automatic error-correction scheme. For donor 39 heavy chains, the correction procedure improved the accuracy of protein translation, measured by protein sequence identity to the inferred germline gene, by an average of 20.4%. The results for pipeline processing of the heavy-chain data set are listed in Figure S3 in Supplementary Material.
First, germline family analyses were performed using two standard methods, IMGT (Brochet et al., 2008) and IgBLAST (see text footnote 1; Table 3). These analyses assigned PGT135-137 gene origins to IGHV4-39 with two possible alleles (*03 or *07), to three potential D genes, and to the J gene IGHJ5*02. An analysis of the third complementarity-determining region of the heavy chain (CDR H3) showed 80-90% sequence identity between PGT135-137, suggestive of a common lineage. The likely clonal origin of PGT135-137 indicates that they will all have the same V(D)J origin, with the different origin gene assignments by IMGT and IgBLAST likely due to their high divergence of ∼20% from the ancestral gene.
Second, a divergence/identity analysis of 454 pyrosequencing-derived sequences assigned to IGHV4-39 origin was performed (Figure 2). The IGHV4-39-related sequences revealed a maximum divergence of 30.4% and an average divergence of 7.7% from germline. An island of sequences was observed at ∼90% identity to PGT137 with divergence of 20-25% from VH4-39, indicative of PGT137-related antibodies with similar evolutionary distance from the origin.
Third, intra-donor phylogenetic analysis (see Materials and Methods) was applied to identify the somatic variants of PGT135-137 from the donor 39 heavy-chain sequencing data. In this analysis, a set of clonally related template antibodies is used to interrogate sequences from the same donor using phylogenetic analysis. Phylogenetic analysis, using a tree rooted by the inferred germline gene IGHV4-39*07, produced an ML dendrogram with 202 heavy-chain variable-domain sequences identified by their co-segregation with PGT135-137 (Figure 3). Most of the intra-donor-identified sequences clustered with PGT137, and one sequence clustered with PGT136.
Fourth, CDR H3 variation was analyzed for the 202 PGT135-137-related heavy-chain variable-domain sequences. One hundred seven were found to have CDR H3 sequences identical to the nucleotide-sequence consensus. With a maximum of five mutations from the consensus, the average CDR H3 variation was 1.2 mutations, indicative of a rather conserved signature of the PGT135-137 lineage.
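The summary statistics quoted for CDR H3 variation (number identical to the consensus, maximum, and mean number of mutations) correspond to a simple per-sequence mismatch count. The sketch below assumes CDR H3 nucleotide sequences of equal length, so that only substitutions are counted; the function names are hypothetical and the input shown in the usage note is toy data, not data from this study.

```python
def mutations_from_consensus(seq, consensus):
    """Number of positions at which seq differs from the consensus (substitutions only)."""
    if len(seq) != len(consensus):
        raise ValueError("this sketch assumes CDR H3 sequences of equal length")
    return sum(1 for a, b in zip(seq, consensus) if a != b)

def cdr3_variation_summary(cdr3_sequences, consensus):
    """Count sequences identical to the consensus and report max and mean mutations."""
    counts = [mutations_from_consensus(s, consensus) for s in cdr3_sequences]
    if not counts:
        raise ValueError("no sequences given")
    return {
        "identical": counts.count(0),
        "max": max(counts),
        "mean": sum(counts) / len(counts),
    }
```

For example, `cdr3_variation_summary(["ACGT", "ACGT", "ACGA", "TCGA"], "ACGT")` reports two identical sequences, a maximum of two mutations, and a mean of 0.75.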

PGT135-137 SOMATIC HEAVY-CHAIN POPULATIONS AND FUNCTIONAL CHARACTERIZATION
To gain insight into the functional diversity of the antibodies identified by 454 pyrosequencing and bioinformatics methods, a clustering procedure was used to analyze the 202 identified heavy chains and to select representative sequences for further characterization. We used the BLASTClust (Altschul et al., 1997) clustering function and an identity cutoff of 95% to sample the natural variation. We chose this cutoff so that the sequence difference tolerated within a cluster (5%) is greater than the ∼1.6% "false" sequence variation induced by 454 pyrosequencing errors (Table 2). A total of 15 clusters emerged. In the BLASTClust output, the first sequence of each cluster was selected to "represent" the cluster (Figure 4A). These representative sequences were synthesized and reconstituted with the PGT137 light chain for functional assessment of HIV-1 neutralization, which was carried out on two viruses sensitive to PGT135-137 antibodies. Out of 15 tested heavy-chain variable-domain sequences, when paired with PGT137 light chain, 11 reconstituted antibodies showed neutralization to different extents (Table 4).

Table footnotes: SeqID to germ is the nucleotide-sequence identity with respect to the germline gene. SeqID to PGT135 is the nucleotide-sequence identity with respect to PGT135.
The two largest clusters, with 136 and 46 sequences, respectively, accounted for ∼90% of the sequences (Figure 4B), while 10 of the 15 clusters contained only a single member. A consensus sequence (ConsAA), calculated from the alignment of 15 representative sequences (Figure 4C), was also synthesized. Notably, the reconstituted amino-acid consensus displayed neutralization almost on par with wild-type PGT137 (Table 4).
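The idea of grouping sequences at a 95% identity cutoff can be illustrated with a small single-linkage clustering sketch in Python. This is not BLASTClust itself: the pairwise identity function is supplied by the caller, the all-against-all comparison is quadratic, and the names are hypothetical. As in the procedure described above, the first member of each returned cluster could serve as its representative.

```python
def single_linkage_clusters(ids, identity, cutoff=95.0):
    """Group IDs so that any two sequences with identity >= cutoff share a cluster.

    `identity(a, b)` is a caller-supplied function returning percent identity.
    Clusters are returned largest first; members keep their input order."""
    ids = list(ids)
    parent = {i: i for i in ids}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if identity(a, b) >= cutoff:
                parent[find(a)] = find(b)   # merge the two clusters

    clusters = {}
    for x in ids:
        clusters.setdefault(find(x), []).append(x)
    return sorted(clusters.values(), key=len, reverse=True)
```

Under single linkage, two sequences below the cutoff can still share a cluster if a chain of intermediate sequences connects them. For the cluster sizes reported here, the two largest clusters hold (136 + 46) / 202 ≈ 90% of the 202 sequences.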
Despite their apparent clonality, the clustering procedure reveals 15 clusters. The topology of the dendrogram produced from phylogenetic analysis indicates that these 15 clusters represent populations of somatically related antibodies evolving along distinct branches by standard mechanisms of hypermutation (Figure 3). We analyzed these 15 somatic populations for prevalence of mutations, insertions, and deletions (Table S2 in Supplementary Material). Note that the representative sequence of cluster 1 (#844305) contained two insertions in the CDR H3 region which were not seen in other members of the cluster, suggesting that these insertions might be sequencing errors. Indeed, this heavy chain could not be expressed when reconstituted with PGT137 light chain. We also analyzed each of these populations by divergence/identity plot (Figure 5). Overall, sequences chosen to represent the 15 somatic populations showed diverse neutralization characteristics (Table S2 in Supplementary Material). Some antibodies, for example from clusters 2, 3, 14, and
Columns include sequence index (for heavy chains, the amino-acid consensus is denoted by "ConsAA"; for controls, the antibody name is used as sequence index), neutralizing antibody name based on the nomenclature used in previous studies (Wu et al., 2011; Zhu et al., under review)
The wild-type mAbs PGT135-137 and wild-type VRC01 were included as controls.
MuLV stands for murine leukemia virus, which was included as a negative control.
"-" denotes expressed but non-neutralizing sequence after reconstituted with the PGT137-partner chain.

PYROSEQUENCING OF DONOR 39 IGKV3 FAMILY AND BIOINFORMATICS ANALYSIS OF LIGHT CHAINS
We next performed 454 pyrosequencing of PGT135-137-related light-chain transcripts from donor 39 PBMCs. mRNA from ∼5 million PBMCs was used for reverse transcription to produce template cDNA, and PCR was used to amplify light-chain sequences from the IGKV3 family.
The 454 pyrosequencing provided 971,165 reads, which were then processed using a pipeline adapted for κ-chain analysis. For donor 39, about 83.3% of the raw reads were 400 nt or longer, effectively covering the light-chain variable domain. After V and J gene assignment, 91,951 sequences were determined to belong to the IGKV3-15 germline family, accounting for 10% of the light-chain reads obtained. After error correction, the accuracy of protein translation measured by the protein sequence identity to the inferred germline gene was improved by an average of 16.5%. The results for pipeline processing of the light-chain data set are listed in Figure S4 in Supplementary Material.
First, the recombination origins of PGT135-137 light chains were analyzed (Table 3). PGT135-137 light chains were assigned to the same germline V gene allele, IGKV3-15 * 01, recombined with the same J gene, IGKJ1 * 01, supporting the notion that the discrepancy in heavy-chain germline assignment was likely an artifact caused by their high divergence.
Second, the divergence/identity analysis of 454 pyrosequencing-derived sequences assigned to the IGKV3-15*01 origin was performed (Figure 6). The IGKV3-15*01-related sequences revealed a maximum divergence of 20.9% and an average divergence of 6.3% from germline. Distinct sequence islands were observed at ∼100% identity to PGT136 and 95% identity to PGT137, both with divergence of 10-15% from IGKV3-15*01. No distinct sequence island was observed that was closely related to PGT135.
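As a minimal sketch of how a single point on a divergence/identity plot might be computed, the Python below assumes sequences that have already been aligned to equal length. The function names and toy sequences are hypothetical, and the alignment and scoring details of the actual pipeline are not reproduced here.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity between two pre-aligned, equal-length sequences.

    Columns in which both sequences carry a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


def divergence_identity(read: str, germline: str, reference: str):
    """Return (divergence from germline, identity to reference antibody)."""
    divergence = 100.0 - percent_identity(read, germline)
    identity = percent_identity(read, reference)
    return divergence, identity


# Toy 10-nt sequences (hypothetical, not real IGKV3-15*01 or PGT137 data).
germline = "ACGTACGTAC"
reference = "ACGTTCGTAA"
read = "ACGTTCGTAC"
print(divergence_identity(read, germline, reference))
```

Each read contributes one such pair, and reads with similar pairs form the sequence islands described above.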
Third, to identify light-chain somatic variants, we performed intra-donor phylogenetic analysis that combined an iterative NJ procedure for the high-throughput screening of sequencing data, and a ML calculation to confirm the NJ analysis and to provide the final dendrogram (see Materials and Methods). The two methods were usually in agreement, e.g., for donor 39 heavy chains, but differed here. The NJ-based analysis yielded 72 sequences within the PGT135-137 subtree, whereas the subsequent ML-based analysis retained 57 of the 72 sequences within the PGT135-137 subtree (Figure 7), providing an example for functional characterization of similar but somatically unrelated sequences.

PGT135-137 SOMATIC LIGHT-CHAIN POPULATIONS AND FUNCTIONAL CHARACTERIZATION
By using the same 95% clustering procedure as for heavy chains, 14 light-chain clusters were identified from the phylogenetic tree. Representative sequences were selected, also as described for heavy chains, from the first 10 clusters for functional characterization (Figure 8A). We analyzed these 10 clusters for prevalence of mutations, insertions, and deletions (Table S3 in Supplementary Material). The largest cluster, lying within the population of PGT137-like sequences, contained 45 sequences or 63% of the subtree sequences (Figure 8B). All selected light-chain sequences possessed CDR L3s of the same length except for the sequences selected from clusters 2 and 3 (Figure 8C). Out of 10 tested light-chain variable domain sequences, when reconstituted with the PGT137 heavy chain, six antibodies, representing six sequence clusters, showed neutralization of two HIV-1 strains from clade A and clade D. Notably, two of the light chains (gVRC-L1 d39 and gVRC-L2 d39 ) showed neutralization breadth slightly better than PGT135-137, and the light-chain variants neutralized clade A about 10-fold more effectively than clade D (Table 4).
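To illustrate what grouping sequences at a 95% identity threshold can look like, the Python below gives a minimal greedy sketch. It is an assumption for illustration only: the clustering procedure actually used is described in Materials and Methods, and the function names, the greedy representative-based strategy, and the toy sequences are all hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def greedy_cluster(seqs, threshold=95.0):
    """Assign each sequence to the first cluster whose representative
    (its first member) is at least `threshold` percent identical;
    otherwise start a new cluster."""
    clusters = []
    for s in seqs:
        for members in clusters:
            if percent_identity(s, members[0]) >= threshold:
                members.append(s)
                break
        else:
            clusters.append([s])
    return clusters


# Toy 20-nt sequences (hypothetical): one mismatch equals 95% identity.
base = "ACGTACGTACGTACGTACGT"
s2 = "ACGTACGTACGTACGTACGA"  # 1 mismatch to base, joins its cluster
s3 = "TCGTACGTACGTACGTACGA"  # 2 mismatches to base, founds a new cluster
s4 = "TGGTACGTACGTACGTACGA"  # 1 mismatch to s3, joins the second cluster
clusters = greedy_cluster([base, s2, s3, s4])
print([len(c) for c in clusters])
```

A greedy scheme of this kind depends on input order, so a representative chosen from each cluster (here simply the first member) is only one of several reasonable choices.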
In contrast to the 454 pyrosequencing-identified heavy chains, the six neutralizing light-chain clusters were not localized to a single divergence/identity island (Figure 9). Indeed, neutralization was observed with clusters from at least three diverse locations on the divergence/identity plot. Nevertheless, the topology of the light-chain phylogenetic analysis indicates that these six clusters represent populations of somatically related antibodies (Figure 7).

DISCUSSION
Recently, select antibodies with the ability to neutralize diverse strains of HIV-1 have been identified in HIV-1-infected donors (Corti et al., 2010; Wu et al., 2010, 2011; Scheid et al., 2011). Like PGT135-137, antibodies from these donors often appear to be clonally related, to possess similar neutralization characteristics, and to cluster in a localized island (or islands) on identity/diversity plots. These islands observed in 454 pyrosequencing-derived analyses are often nearby but rarely overlap the few antibodies experimentally isolated from the same individual (even if they start with samples of exactly the same time point, as we have done here with donor 39). The differences between antibodies identified from sorting of memory B cells and by 454 pyrosequencing of B cell transcripts suggest that the experimental approaches may capture or sample different B cell populations. In addition to exploring differences in phenotype of antibodies identified by the two methods, we also explored differences related to the quantity of identified antibodies. In particular, we asked whether the less-sparse view of the antibody repertoire provided by next-generation sequencing and systems-level bioinformatics might provide insight into the diversity of the antibody response.
With the heavy chains of PGT135-137, select sequences representing 15 distinct populations showed dramatically different neutralization characteristics toward clade A and D viruses when reconstituted with the same light chain from PGT137. With the light chains of PGT135-137, select sequences representing 10 distinct populations were not localized to a discrete sequence island, indicating substantial differences in identity and diversity (Figure 8). Thus, even though these antibodies are somatically related, both their neutralization and sequence characteristics can diverge substantially (Table 4). These results demonstrate the utility of next-generation sequencing, which provides a more comprehensive sampling of sequences, and of systems-level bioinformatics approaches, which enable these data to be mined effectively. Overall, data-intensive methods may generally be required to obtain true insight into questions of biological diversity such as the humoral immune response.
Prior next-generation sequencing and bioinformatics analyses have revealed the extraordinary genetic diversity of HIV-1 (Eriksson et al., 2008; Archer et al., 2009; Tsibris et al., 2009; Fischer et al., 2010). These same methods are now beginning to reveal the extraordinary diversity of antibodies generated in response to HIV-1 infection (Wu et al., 2011). Although this response appears to provide little benefit to the HIV-1-infected host (Poignard et al., 1999), if similar responses could be generated through vaccination, then in principle effective protection could be achieved in the setting of initial infection (Burton, 2002; Burton et al., 2004, 2005). The populations of antibodies we identify here may provide broader protection than a monoclonal member of the group. Furthermore, responses to infection or vaccination would be expected to generate diverse populations of antibodies, as we have shown here. Thus, population diversity, even within a single antibody clone or lineage, is likely to have a substantial impact on the effectiveness of the immune response.

ACKNOWLEDGMENTS
We thank H. Coleman, M. Park, B. Schmidt, and A. Young for 454 pyrosequencing at the NIH Intramural Sequencing Center (NISC), and J. Stuckey for assistance with figures. We also thank members of the Structural Biology Section and Structural Bioinformatics Core, Vaccine Research Center, for discussions or comments on the manuscript. We would like to thank all the study participants and research staff at each of the Protocol G clinical centers, and all of the Protocol G team members, the IAVI Human Immunology Laboratory, and all of the Protocol G clinical investigators, specifically, George Miiro, Anton Pozniak, Dale McPhee, Olivier Manigart, Etienne Karita, Andre Inwoley, Walter Jaoko, Jack DeHovitz, Linda-Gail Bekker, Punnee Pitisuttithum, Robert Paris, Jennifer Serwanga, and Susan Allen. Support for this work was provided by the Intramural Research Program of the Vaccine Research Center, National Institute of Allergy and Infectious Diseases and the National Human Genome Research Institute, National Institutes of Health, and by grants from the International AIDS Vaccine Initiative's Neutralizing Antibody Consortium.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at http://www.frontiersin.org/Virology/10.3389/fmicb.2012.00315/abstract