cAb-Rep: A Database of Curated Antibody Repertoires for Exploring Antibody Diversity and Predicting Antibody Prevalence

The diversity of B cell receptors provides a basis for recognizing numerous pathogens. Antibody repertoire sequencing has revealed relationships between B cell receptor sequences, their diversity, and their function in infection, vaccination, and disease. However, many repertoire datasets have been deposited without annotation or quality control, limiting their utility. To accelerate investigations of B cell immunoglobulin sequence repertoires and to facilitate development of algorithms for their analysis, we constructed a comprehensive public database of curated human B cell immunoglobulin sequence repertoires, cAb-Rep (https://cab-rep.c2b2.columbia.edu), which currently includes 306 immunoglobulin repertoires from 121 human donors, who were healthy, vaccinated, or had autoimmune disease. The database contains a total of 267.9 million V(D)J heavy chain and 72.9 million VJ light chain transcripts. These transcripts are full-length or near full-length, have been annotated with gene origin, antibody isotype, somatic hypermutations, and other biological characteristics, and are stored in FASTA format to facilitate their direct use by most current repertoire-analysis programs. We describe a website to search cAb-Rep for similar antibodies along with methods for analysis of the prevalence of antibodies with specific genetic signatures, for estimation of reproducibility of somatic hypermutation patterns of interest, and for delineating frequencies of somatically introduced N-glycosylation. cAb-Rep should be useful for investigating attributes of B cell sequence repertoires, for understanding characteristics of affinity maturation, and for identifying potential barriers to the elicitation of effective neutralizing antibodies in infection or by vaccination.

The diversity of B cell receptors provides a basis for recognizing numerous pathogens. Antibody repertoire sequencing has revealed relationships between B cell receptor sequences, their diversity, and their function in infection, vaccination, and disease. However, many repertoire datasets have been deposited without annotation or quality control, limiting their utility. To accelerate investigations of B cell immunoglobulin sequence repertoires and to facilitate development of algorithms for their analysis, we constructed a comprehensive public database of curated human B cell immunoglobulin sequence repertoires, cAb-Rep (https://cab-rep.c2b2.columbia.edu), which currently includes 306 immunoglobulin repertoires from 121 human donors, who were healthy, vaccinated, or had autoimmune disease. The database contains a total of 267.9 million V(D)J heavy chain and 72.9 million VJ light chain transcripts. These transcripts are full-length or near full-length, have been annotated with gene origin, antibody isotype, somatic hypermutations, and other biological characteristics, and are stored in FASTA format to facilitate their direct use by most current repertoire-analysis programs. We describe a website to search cAb-Rep for similar antibodies along with methods for analysis of the prevalence of antibodies with specific genetic signatures, for estimation of reproducibility of somatic hypermutation patterns of interest, and for delineating frequencies of somatically introduced N-glycosylation. cAb-Rep should be useful for investigating attributes of B cell sequence repertoires, for understanding characteristics of affinity maturation, and for identifying potential barriers to the elicitation of effective neutralizing antibodies in infection or by vaccination.

INTRODUCTION
B cells comprise a crucial component of the adaptive immune response (1). B cells recognize three-dimensional epitopes of antigens through the variable domains of the B cell receptor (BCR), or its various secreted forms of antibody. The variable domains of BCRs and antibodies are composed of immunoglobulin heavy and light chains, encoded by separate genes.
Through V(D)J gene recombination and somatic hypermutation (SHM), a high level of sequence diversity is introduced to the variable domain (1)(2)(3)(4), allowing B cells to recognize diverse antigens. Thus, an interrogation of BCR diversity and function is key to understanding the B cell immune response. The application of next-generation sequencing (NGS) to BCR repertoires provides snapshots of BCR diversity, and such studies in the past decade have characterized numerous features of B cell responses to infection, immunization, and autoimmune disease (3,(5)(6)(7)(8)(9)(10)(11)(12)(13)(14)(15)(16)(17)(18). Databases, programs, and websites have been developed to store, process, and annotate BCR repertoire data (17,(19)(20)(21)(22)(23)(24) [software reviewed in Chaudhary and Wesemann (25)]. Nonetheless, most repertoire data are deposited in public databases in the format of raw NGS reads, which contain both sequencing duplicates and sequencing errors and may be difficult to annotate. A comprehensive database of curated and well-annotated BCR transcripts, should therefore accelerate repertoire studies including but not limited to characterization of B cell receptor diversity, mechanisms of clonal expansion, development of BCR repertoire analysis algorithms, estimation of frequency of antigen-specific antibodies and their precursorlike cells, and effects of SHM. In addition, such a database should assist researchers in performing repertoire-related data mining.
Designing vaccines that can elicit broadly neutralizing antibodies (bnAbs) is a long-term goal for preventing infections from fast evolving and/or highly diversified pathogens such as HIV-1 and influenza (26,27). Studies have revealed that some epitopes could elicit bnAbs in different individuals with similar modes of recognition and shared genetic signatures [e.g., V(D)J gene origin, complementarity determining region 3 (CDR3) length, CDR3 motifs, SHM] (9, 28-33), defined as convergent, multidonor, or public bnAb classes. Such antibody classes often target conserved epitopes; those that can elicit the activation and maturation of naïve B cells with bnAb potential might provide templates for universal vaccines (30,(34)(35)(36). However, not all antibody classes appear with high frequencies in BCR repertoires; those that do not may be limited due to disfavored developmental steps or "roadblocks." Thus, the identifications of bnAb classes with precursor-like cells prevalent in humans is a critical consideration for determining which template antibodies are good targets for elicitation, and thus for immunogen design (9,37,38). A comprehensive database of curated B cell repertoire transcripts, for which uncertainties and variations in B cell repertoires (e.g., sampling bias, infection history, aging, and genetic diversification) (12) have been minimized, would be helpful for predicting antibody class prevalences.
A database of curated BCR repertoires could, moreover, be used to investigate SHM preference and mechanisms of affinity maturation. Antibodies accumulate mutations with high preference determined by both intrinsic gene mutability and functional selection (1,39). We previously developed genespecific substitution profiles (GSSPs) to characterize positional substitution types and frequencies in 69 human V genes (14). By incorporating additional BCR transcripts, GSSPs could be built for more genes, and this could have broad application; GSSPs or similar approaches have been applied to examine whether bnAbs mature with shared pathways (40), to identify highly frequent SHMs and common mechanisms of affinity modulation (41,42), and to estimate whether rare SHMs (SHMs generated with very low frequency by the SHM machinery) in bnAbs could form barriers to re-elicitation by vaccination (34,43).
In this study, we constructed a database of curated human B cell immunoglobulin sequence repertoires (cAb-Rep) from 306 high quality human repertoires, and developed methods to search cAb-Rep using either sequence or sequence signature. We used the database to construct GSSPs for 102 human antibody V genes and developed a script to identify rare SHMs in input sequences. We evaluated the robustness of query results relative to the number of repertoires, using antibody prevalence estimates as a test case. In summary, the database developed here should help to investigate B cell repertoire features, to find antibody templates for vaccine design, and overall to understand mechanisms of antibody development.

BCR Repertoire Dataset
We assembled a total of 376 BCR repertoire datasets from 108 human donors deposited in the NCBI short reads archive (SRA) database (https://www.ncbi.nlm.nih.gov/sra). These repertoires were all sequenced using illumina MiSeq or HiSeq and were published in 11 studies ( Table 1). In addition, annotated full length sequences of three donors (sequenced by AbHelix with high depth or HD repertoires, ∼10 8 high quality sequences per donor) (4) and the near full length V(D)J sequences (sequenced with framework 1 region primers) from 10 HD repertoires (2) were downloaded from the links provided by the authors.

Next-Generation Sequencing Data Processing
The NGS data were analyzed with the SONAR pipeline, version 2.0 (https://github.com/scharch/sonar/) developed in our lab (23). Briefly, USEARCH was used to merge the 2 × 300 or 2 × 250 raw reads to single transcripts and to remove transcripts potentially containing more than 20 miscalls calculated from sequencing quality score (44). Merged transcripts shorter than 300 nucleotides were removed. BLAST (http://www.ncbi.nlm. nih.gov/blast/) was used to assign germline V, D, and J genes to each transcript with customized parameters (23,45). CDR3 was identified from BLAST alignment using the conserved 2nd Cysteine in V region and WGXG (heavy chain) or FGXG (light chain) motifs in J region (X represents any of the 20 amino acids). To assign an isotype for each heavy chain transcript, we used BLAST with default parameters to search the 3 ′ terminus of each transcript against a database of human heavy chain constant domain 1 region obtained from the international ImMunoGeneTics information system (IMGT) database. A BLAST E-value threshold of 1E-6 was used to find significant isotype assignments. Then, sequences other than the V(D)J region of a transcript were removed and transcripts containing frame-shift and/or stop codon were excluded. We then applied two approaches to remove PCR duplicates and sequencing errors. For datasets sequenced with unique molecular identifier (UMI), we first grouped merged raw transcripts having identical UMI. Due to PCR crossover and UMI collisions, transcripts in a group may originate from different cells (2,12). We therefore clustered the raw transcripts in each group using USEARCH with a 97% sequence identity. All transcripts in a USEARCH cluster were aligned using CLUSTALO (46) and a consensus sequence was generated. To further reduce redundant transcripts derived from multiple mRNA molecules of the same cell, we removed duplicates in the consensus sequences. Singletons, which have neither UMI duplicates nor consensus duplicates, were removed. SONAR was used to annotate the unique transcripts. For datasets sequenced without UMI, similar to our previous studies (14,18,23), we clustered transcripts of each repertoire using USEARCH with sequence identity of 0.99, and selected only one transcript with the highest sequencing depth or numbers of duplicates per cluster for later analyses. To further remove low quality reads, we excluded clusters with size <2. Finally, a unique dataset of transcripts was generated for each repertoire. To reduce artifacts and effect of small sample size on the comparisons of features of antibody repertoires, we excluded repertoires containing fewer than 1,700 unique transcripts.
For each of the 13 HD repertoires curated in previous studies, we further removed duplicates, sequences containing frameshifts or/and stop codons, and corrected annotation errors to form a unique dataset. The gene annotation information was retrieved from the annotation files provided by the studies.

Germline Gene Database and Prediction of New Gene/Alleles
The human germline gene database from IMGT was used to assign germline V(D)J genes. To detect new germline genes or alleles, we combined unique reads from 108 repertoires, and used IgDiscover v0.9 with default parameters to identify potential new germline genes or alleles that are observed in multiple repertoires (47). While IgDiscover prefers to identify new genes or alleles from IgM repertoire, we used repertoires containing all isotypes. Nonetheless, identical unique reads were observed for each predicted gene/allele in at least two donors, suggesting that the predictions are still reliable. The predicted genes were submitted to European Nucleotide Archive (ENA) with project accession numbers: PRJEB31020. For each of the 13 HD repertoires, we randomly selected 1 million IgM transcripts for novel gene prediction (as recommended by IgDiscover manual) and no new gene or allele was found.

V and J Gene Usage and Antibody Position Numbering
The unique dataset for each repertoire was used to calculate the distributions of germline gene usage and SHM levels. For each transcript, we used ANARCI to assign each position according to the Kabat, Chothia, and IMGT numbering schemes (48).

Signature Prevalence and Rarefaction Analysis
We developed a python script, Ab_search.py, to search the BCR repertoire database in two modes: signature motif and full V(D)J sequence. Briefly, in the signature searching mode, an amino acid signature motif from either CDR3 region or other positions of interest (defined with the Kabat, Chothia, and IMGT numbering scheme) is converted to python regular expression format and searched against all transcripts in the database. For the sequence searching mode, BLAST+ is called to find similar transcripts with E-value < 1E-6 (49). Signature frequency in each repertoire is calculated by dividing the number of matched transcripts with the total number of unique transcripts originated from the same germline gene or allele.
To evaluate the effect of number of repertoires on estimation of signature frequency, we performed rarefaction analysis by sampling i subset repertoires from PBMC samples (1 to 35). For each subset i, the random sampling was performed N times (we used N = 20 in the current study) and the mean signature frequency from each sampling f i was calculated. Then the coefficient of variance for each i, CV i , was calculated as:

Identification of Antibody Clones and Construction of GSSPs
For each repertoire, we first sorted all transcripts to groups based on identical V and J gene usages. For each group, transcripts with 90% CDR3 sequence identity and the same CDR3 length were clustered into clones using USEARCH. One representative sequence was selected in each clone and representative transcripts from all repertoires were combined to build GSSPs for V genes using mGSSP (14). We exclude GSSPs built with <100 clones because of lack of information (14). A python script, SHM_freq.py, was developed to identify SHMs in an input sequence, to search the GSSP of an assigned V gene, and to output the rarity of each mutation, calculated as: Rarity = (1 -Frequency of mutation) * 100%.

Construction of Gene-Specific N-Glycosylation Profile
Glycosylation sites for sequences having SHMs >1% in the unique datasets of healthy and vaccination donors were predicted using an artificial neural network method NetNGlyc v1.0 (http://www.cbs.dtu.dk/services/NetNGlyc/) 1 . Predictions with high specificity (potential >0.5 and jury agreement of 9/9) and do not match Asn-Pro-Ser/Thr motifs were included. Nglycosylation sites encoded in germline genes were excluded.

Statistical Tests
ANOVA and T-tests were performed in R.

cAb-Rep Contains a Diverse Compendium of Annotated Next-Generation Sequencing Datasets
We first assembled 376 BCR repertoire deep sequencing data sets from NCBI SRA database, each sequenced by Illumina MiSeq or HiSeq and with library preparation protocols to cover full length V(D)J region [5 ′ primers at leader regions or 5 ′ Rapid amplification of cDNA ends (RACE), 3 ′ primers targeting constant region 1] or near full length (5 ′ primers at N-terminus of framework 1 region). These datasets contain a total of 247 million raw reads ( Table 1 and Table S1). We performed quality control including removing sequencing errors, PCR crossover, PCR duplicates, and annotated each sequence using SONAR (See Methods and Figure 1). A total of 293 repertoires from 108 donors were selected to construct the cAb-Rep database, with 6.6 and 0.9 million curated full length V(D)J heavy and light chain transcripts. Twenty-five thousand five hundred forty-one curated transcripts were obtained per repertoire. In addition, annotated sequences from 13 healthy donor repertoires sequenced with high depth (more than 20 million unique sequences from about 10 7 -10 8 B cells per donor) (2, 4), were curated and incorporated. These repertoires were from studies including Hepatitis B vaccination, influenza vaccination, Tetanus vaccination, Multiple Sclerosis, Myasthenia Gravis, Systemic Lupus Erythematosus, and healthy donors [ Table 1 and Table S1; (6-9, 12, 15-17, 50)]. Among all repertoires, 72 repertoires were from 43 healthy donors (35.5% of total donors). The database repertoires were obtained from three tissues: 2 from brain, 15 from cervical lymph node, and 289 from peripheral blood mononuclear cell (PBMC). Among PBMC samples, 16 were from antibody secreting B cells, 27 from Hepatitis B surface antigen (HBsAg)+ B cells, 8 from Human Leukocyte Antigen -DR isotype (HLA-DR) + plasma cells, 139 from IgG + B cells, 20 from memory B cells, 34 from IgM or naïve B cells, and 52 from whole PBMC (Table S1). Thus, cAb-Rep contains BCR repertoire data in different settings, which will enable cross-study comparisons.
For each transcript in cAb-Rep, we numbered positions using Kabat, Chothia, and IMGT schemes (Figure 1), and annotated gene origin, CDR3 sequence and length, somatic hypermutation level, isotype, donor, and clonotype. These curated datasets can be used for downstream analyses. The transcripts are stored in FASTA format with annotation information in the header line of each transcript. Such a format can be easily altered to feed into other programs for repertoire analysis.
Because the 13 HD repertoires contained over 330 million unique sequences, to avoid sampling bias for profile constructions and improve speed of searching cAb-Rep (see sections below), we randomly selected 20,000 unique sequences of each isotype of each repertoire and combined with the rest of repertoires for all analyses below. Nonetheless, in case rare FIGURE 1 | Flowchart for the processing of repertoire data and developed tools. Next-generation sequencing data was processed and annotated using SONAR. Other published programs used were highlighted in bold font. Scripts developed in this study were in italic font. events of BCR signatures are of interests, we provided an option in our scripts and at cAb-Rep website to search the 13 HD repertoires alone.

Ab_Search Identifies Similar Antibody Transcripts and Estimates Signature Frequency
To identify antibody transcripts of interest from cAb-Rep, we developed a python script, Ab_search.py. The script accepts sequences or amino acid motifs as input and output frequency of the signature and matched sequences (See Materials and Methods). Here, as examples, we searched transcripts having signature motifs of a selected malaria antibody class in PBMC BCR repertoires or of a selected HIV-1 bnAb classes in IgM and IgG repertoires (Figure 2).
The selected malaria antibody class contains both heavy (IGHV3-33 gene with a tryptophan at position 52) and light chain (IGKV1-5 gene with 8 amino acid CDRL3) signatures (51). The heavy chain signature was observed with a frequency of 3.2 ± 0.94% in 15 PBMC repertoires in healthy donors (Figure 2A), the light chain signature was found with a frequency of 1.24 ± 0.18% in 17 healthy donor PBMC repertoires. All healthy donors we searched contain both heavy and light sequences. By assuming random pairing (52), heavy-light paired antibodies similar to the malaria antibody class could be generated with a high frequency (∼40 per million B cells). Thus, vaccine mediated elicitation of such antibodies are promising candidates to enable protection.
The HIV-1 VRC01 class contains both heavy and light chain signatures (31)(32)(33)53). Its heavy chain uses IGHV1-2 * 02 allele. The light chain is originated from a limited set of genes, including IGKV1-33, IGKV3-15, IGKV3-20, or IGLV2-14 and contains a CDR3 of five amino acids matching motif X-X-[AFILMYWV]-[EQ]-X. Searching cAb-Rep with these signatures showed that the heavy chain allele for VRC01 appears in IgM and IgG repertories with similar frequencies (2.7 ± 1.9% and 2.85 ± 2.13%, respectively; Figure 2B). Further, VRC01 light chain-like transcripts were found in ∼33% of donors, with a mean frequency of 0.005 ± 0.0098% in PBMCs ( Figure 2C). By assuming random pairing of heavy and light chains, our calculation of the mean frequency of VRC01 class-like antibodies was 1.4 per million B cells in healthy donors, close to estimates from previous studies (33,38).

Impact of Repertoire Diversity on Signature Prevalence
Due to diversification in BCR repertoires by antigen selection and other factors, the predicted prevalence of transcripts of interest could vary. To evaluate the effect of the number of sampled repertoires on prevalence prediction, we performed rarefaction analysis to sample a subset of repertoires (ranging from 1 to 35) with each subset randomly sampling 20 times to calculate frequencies of the malaria IGHV3-33/IGKV1-5 class and VRC01 class light chain signatures. For each sampling size, we calculated the coefficient of variation (standard deviation/mean) to measure the degree of variation among sampling repeats (Figure 2D). For all signatures, our analysis revealed that the coefficient . We used coefficient of variation to measure the variation of signature frequency at each sampling size. The analysis showed that the coefficient of variation decreases substantially when sampling more than 10 repertoires, suggesting that increasing sampling size helps to estimate signature frequency with higher confidence. The coefficient of variation for malaria VH3-33/KV1-5 class antibody heavy and light chain signatures were colored blue and orange, respectively, while those of the HIV bnAb VRC01 light signature were colored magenta.
of variation decreases dramatically when the sampling size increases to 10 or more. This suggests that 10 or more repertoires will be optimal to have a consistent estimation of signature frequency.

Gene-Specific Substitution Profiles and Substitution Frequency Analysis
To investigate substitution preferences in V genes, we predicted new germline genes in the database, clustered transcripts in each repertoire into clones, and selected one representative sequence per clone to build gene-specific substitution profiles (GSSPs) (see Methods). Overall, we identified 5 novel heavy chain alleles, 2 novel lambda chain alleles, and 1 kappa chain allele, each of which was found in two or more donors. By incorporating these germline gene sequences in cAb-Rep sequence annotation, we built GSSPs for 102 human V genes, compared to GSSPs built for 69 genes using three repertoires in our previous study. Our analysis further showed that the GSSPs of the two studies were highly consistent (r = 0.983 for IGHV1-2 gene, Figure 3A).
To facilitate exploring substitution preference, we developed a python script, SHM_freq.py, to identify mutations in an input sequence, call the GSSP of corresponding V gene, and find the frequency of the mutation being generated by the somatic hypermutation machinery. To demonstrate how this information can be helpful, we analyzed frequencies of substitutions observed in the heavy chain of VRC01 class bnAbs (Figure 3B). This analysis showed that all lineages in this class contain over 30% mutations, with ∼30% of the mutations being low frequency or rare mutations (frequency <0.5% in IGHV1-2 GSSP). These mutations are generated with low frequency either because they require multiple nucleotide substitutions (14) or are from single substitutions in silent SHM positions (43). Functional studies have shown that some rare mutations are essential for potency and neutralization (54). However, the likelihood of immunogens maturing antibodies to have similar mutations could be low or require longer maturation times. In contrast, The gene-specific substitution profile of IGHV1-2 and rarity of somatic hypermutations in HIV-1 bnAbs and autoantibodies. Rare mutations, colored red, are observed frequently in HIV-1 bnAbs but not in autoantibodies, suggesting the mutation patterns in HIV-1 bnAbs may be generated with low frequency. For each antibody sequence, residues identical to IGHV1-2*02 germline gene were shown with dots. Missing residues were showed with minus sign. The disease and antigen were labeled on the right side of each sequence.
we observed that autoantibodies [e.g., collected from HIV, autoimmune thyroid disease, atherosclerosis, Hashimoto disease, and rheumatic carditis (55-60)] originated from IGHV1-2 genes contain very few rare mutations, suggesting somatic mutations may not provide a barrier to elicitation of these lineages.

Gene-Specific N-Glycosylation Profiles (GSNPs)
Post-translation modifications (PTM) (glycosylation, tyrosine sulfation, etc.), which affects antibody functions (42,61), can be introduced to antibodies by V(D)J recombination and somatic hypermutation processes. To understand the frequency and preference of PTMs generated by somatic hypermutation, as an example, we predicted V-gene-specific frequency of N-glycosylation sequons at each position using healthy and vaccination donor unique sequences that having more than 1% SHM. Overall, consistent with previous study (42), the predicted N-glycosylation sites were enriched in CDR1, CDR2, and framework 3 regions, but different genes have different hotspots for glycosylation ( Figure 4A). Structural analysis showed that the side chains of these hotspot positions to be surface-exposed (Figure 4B), suggesting these sites to be spatially accessible for modification. GSNPs should thus be able to provide information for further experimental validation and investigations of impact of N-glycosylations.

cAb-Rep Website to Search Frequencies of Signature Motif and SHM
While we developed scripts to search cAb-Rep, these may be of limited utility to users not familiar with programing. Therefore, we developed a website for searching cAb-Rep (https://cab-rep. c2b2.columbia.edu/). The website implements all functions of the scripts we developed above, including querying cAb-Rep using the three signature modes (CDR3, position, BLAST) with specified isotype, numbering scheme, and VJ recombinations, identifying rare SHMs for an input sequence, and showing the GSSP of a V gene (Figure 5). Users can also query GSNP of an input gene as well as download all the datasets in this study.

DISCUSSION
To understand the mechanisms of BCR diversity generated by V(D)J recombination and somatic hypermutation, high quality BCR repertoire datasets are critical. In this study, we have constructed a database of curated BCR transcripts from previous deep sequencing studies of B cell immunoglobulin sequence repertoires and have demonstrated how this database can provide helpful information for repertoire studies both locally and online. Currently, heavy-light pairing information is not obtained in cAb-Rep datasets. As new technologies are applied to repertoire sequencing, we believe more paired heavy-light chain transcripts will be incorporated, which will greatly advance functional characterization of BCRs. We will continue to update this database to include more disease conditions as well as those from animal models.
In cAb-Rep, we incorporated BCR datasets sequenced both with and without UMI. Besides filtering out transcripts with low sequencing quality, different approaches were used to identify high quality transcripts. For repertoires sequenced using UMI, we used the UMI information to generate consensus sequences, which have proven effective at removing sequencing errors (62). For BCR repertoires sequenced without UMI, we used a clustering approach, which we first sorted transcripts based on redundancy and selected transcripts with the most redundancy as seeds to cluster transcripts at a given identity cutoff and only used the seeds as high quality transcripts.
The assumption was that each B cell contains multiple BCR mRNA molecules and the original BCR could be PCR amplified and sequenced with many copies. Sequencing errors and PCR crossover are rare and close to random events (63). Sequences that differ from the seed due to PCR crossover are likely to appear as singletons after clustering, whereas differences arising from sequencing errors are likely to be only a few hamming distances away from the seed transcripts, and cluster together. Thus, removing transcripts highly similar to the seed transcripts and singletons will remove many sequencing errors, although we may lose a portion of biological transcripts with low sequencing coverage. This approach is proven effective at finding functional transcripts in previous studies (64,65).
Compared to other BCR repertoire databases, cAb-Rep provides more flexibility to genetic signature search. Another public database, iReceptor, supported by the adaptive immune receptor repertoire community, has been developed to store, query, and analyze immune receptor repertoire data (66). While iReceptor includes BCR repertoires amplified with various library preparation protocols and sequenced in multiple platforms (Illumina, 454, ion torrent, etc.), cAb-Rep is limited to include full-length or near-full length sequences with enough sample size for repertoire related analyses. For repertoire analysis, iReceptor allows query with V(D)J recombination and a CDR3 peptide (no regular expression grammars allowed), while cAb-Rep provides BLAST mode and a more flexible CDR3 searching mode. cAb-Rep also allows to search motifs across the V(D)J region in the position mode. Although iReceptor includes curated repertoire data deposited by researchers, these curated datasets are processed using different bioinformatics pipeline and parameters (germline gene database, exclusion of redundancy, etc.), which may introduce bias when performing statistical analysis across studies. Another B cell repertoire database is available, but only includes repertoires from three donors (20). Moreover, the 13 HD curated BCR repertoires, which provide high depth BCR diversity information, have not been incorporated in other databases. These datasets increase the statistical power to study features as well as rare events of BCR diversity.
A further goal of cAb-Rep is to advance the investigation of impact of somatic hypermutation, which, as far as we know, is not available in other B cell repertoire databases. We provide a query portal and GSSPs for human antibody V genes to understand gene-and positional-specific substitution preferences. We also predicted gene-specific frequencies of N-glycosylation in human antibody V genes. Such sequence-derived information, together with functional study, is critical to identify common mechanism of modulation of affinity and understand similarities in pathways of antibody affinity maturation (40)(41)(42). Nonetheless, our current knowledge on impact of SHM is still very limited. In the future, we will collect literatures to characterize how SHM affects antibody structure and function as well as develop new tools to predict effects of SHM. A portal in cAb-Rep will be created to query studied SHMs, which will elucidate the relations of antibody sequence, structure, and function and provide knowledge for antibody design.
In summary, we believe cAb-Rep and the tools developed in this study to be complimentary to other B cell repertoire databases and to be helpful to researchers not familiar with repertoire annotation for exploring features of repertoires and compare datasets across studies and diseases.

DATA AVAILABILITY STATEMENT
The raw next-generation sequencing data were downloaded from NCBI SRA database (https://www.ncbi.nlm.nih.gov/sra). The project accession numbers were listed in the ProjectID column of Table S1. The curated data can be found at our website: https:// cab-rep.c2b2.columbia.edu/tools/.

AUTHOR CONTRIBUTIONS
PK, LS, and ZS designed the research. YG, KC, and ZS analyzed data. YG, PK, LS, and ZS wrote the paper. All authors reviewed, commented on, and approved the manuscript.