The usage of human IGHJ genes follows a particular nonrandom selection: The recombination signal sequence affects the usage of human IGHJ genes

The formation of the B cell receptor (BCR) heavy chain variable region is derived from the germline V(D)J gene rearrangement according to the “12/23” rule and the “beyond 12/23” rule. The usage frequency of each V(D)J gene in the peripheral BCR repertoires is related to the initial recombination, self-tolerance selection, and the clonal proliferative response. However, their specific differences and possible mechanisms are still unknown. We analyzed in-frame and out-of-frame BCR-H repertoires from human samples with physiological and various pathological conditions by high-throughput sequencing. Our results showed that IGHJ gene frequency follows a similar pattern where IGHJ4 is used at high frequency (>40%), IGHJ6/IGHJ3/IGHJ5 is used at medium frequencies (10%∼20%), and IGH2/IGHJ1 is used at low frequency (<4%) under whether physiological or various pathological conditions. Furthermore, analysis of the recombination signal sequences suggested that the conserved nonamer and heptamer and certain 23 bp spacer length may affect the initial IGHD-IGHJ recombination, which results in different frequencies of IGHJ genes among the initial BCR-H repertoire. Based on this “background repertoire”, we recommend that re-evaluation and further investigation are needed when analyzing the significance and mechanism of IGHJ gene frequency in self-tolerance selection and the clonal proliferative response.


INTRODUCTION
The diversity of the initial vertebrate B cell receptor (BCR) originates from the recombination of multiple germline genes (V(D)J) and insertion and deletion during the recombination process. There is a consensus recombination signal sequence (RSS) (1) at the 5' or 3' end of each V(D)J gene segment that participates in recombination according to the "12/23" rule (2,3,4) and the "beyond 12/23" rule (5). In addition, recombination-activating gene (RAG) enzymes, terminal deoxynucleotidyl transferase (TDT), heterodimer-KU70/KU80, DNA-dependent protein kinase (DNA-PK/Artemis), DNA ligase IV (XRCC4) and other proteins are involved in the complex V(D)J recombination process (4,6).
Theoretically, the usage frequency of V(D)J gene segments is random in the pro-B cell or pre-B cell recombination process (before autoantigen selection). However, in vitro experiments in B cell lines confirmed that V(D)J gene segments contribute unequally to the primary repertoire, and the consensus heptamer and nonamer sequences of the RSSs are considered the major factor (7). The contributing factors may relate to the usage frequency of V(D)J gene segments. The usage of proximal and distal gene segments in recombination is not random; for example, the JH-proximal VH gene of pre-B cell lines has a preferential usage (8), and VH near Cu may be preferred during early rearrangement (9). During pre-B cell differentiation and development, the initial DH-JH rearrangements employ more 3' (JH-proximal) DH segments (10); however, Feeney AJ et al found that there is no apparent preference for the more JK-proximal over the more JK-distal genes in the proximal region (11). In addition, compared with RSSs with one or more base mutations, the corresponding gene subfamily of RSSs with a consensus heptamer/nonamer (conserved) has preferred usage (3,12,13,14,15). Moreover, the usage frequency of the corresponding gene segment will be affected when the lengths of the 12 bp spacer/23 bp spacer in RSSs increase or decrease (12,13,15) and when the base sequences of the 12 bp spacer/23 bp spacer in RSS change (16,17,18).
However, these results are derived from experiments based on B cell lines in vitro, and whether RSSs influence the V(D)J usage frequency of initial repertoires in vivo is unclear. The difference in each V(D)J usage frequency in the peripheral B cell repertoires is mainly derived from the selection of self-tolerance and the response of clonal proliferation (8,19,20,21). How the difference in usage frequency of each V(D)J gene segment in initial repertoires influences the peripheral repertoire has not been clarified and has received little attention.
With the development of high-throughput sequencing (HTS) analysis for V(D)J tracking, analyzing each V(D)J usage frequency in individual BCR-H repertoires is now possible. We have broadly analyzed the composition characteristics of BCR-H repertoires by HTS since 2013 and found that the human IGHJ4 gene subfamily has the highest usage frequency in physiological and various pathological conditions, followed by IGHJ6, IGHJ3 and IGHJ5 with medium usage frequency and by IGHJ1 and IGHJ2 with significantly low usage frequency. Additionally, the usage frequency of 6 IGHJ gene families shows amazing consistency by analyzing the BCR-H sequences of public databases (IMGT, etc.) and published articles (HTS data) from subjects with physiological or various pathological conditions. Moreover, we analyzed the composition characteristics of the RSSs in human IGHJ genes. Our results suggest that the consensus nonamer and heptamer, the standard spacer length (23 bp), and the mutation site of RSSs may affect the usage frequency of 6 IGHJ gene segments (nonrandom selection), and this specific primary repertoire may result in the lack of significant changes in the usage frequency of 6 IGHJ genes in the peripheral repertoire under physiological and various pathological conditions.

The IGHJ gene frequency follows a similar pattern and is rarely influenced by antigen selection
The number of BCR-H sequences from 6 healthy volunteer samples ranged from 250,000 to 1,250,000 (Supplementary Table 1). The order of frequency of IGHJ genes (in-frame) was IGHJ4>IGHJ6>IGHJ3>IGHJ5>IGHJ2>IGHJ1, while out-of-frame sequences followed an order of IGHJ4>IGHJ6>IGHJ5>IGHJ3>IGHJ1>IGHJ2 ( Figure 1A). For these two groups, the frequency of IGHJ4 was significantly higher than that of each IGHJ gene, while IGHJ1 and IGHJ2 were significantly less frequently used ( Figure 1A). Supplementary Table 2 shows the data of the naive B cell repertoire (primary repertoire, n=48,167) and the memory B cell repertoire (n=50,290). The order of IGHJ gene usage (in-frame) was IGHJ4>IGHJ6>IGHJ3>IGHJ5>IGHJ2>IGHJ1. Sequences (n=9,340) from the IMGT/LIGM-DB also followed this pattern (Supplementary Table 3 and Figure 1C), while the usage of IGHJ genes (out-of-frame) followed IGHJ4>IGHJ6>IGHJ5>IGHJ3>IGHJ1>IGHJ2 ( Figure 1B). Similarly, IGHJ4 was significantly used, while the IGHJ1 or IGHJ2 frequency was significantly lower than those of other IGHJ genes.
In addition, we analyzed the ratio of unique to total sequences of each IGHJ gene (in-frame and out-of-frame) and found no differences in 6 IGHJ gene families (Supplementary Table 1 and 2, and Figure 2), which suggests that the multiplex PCR library and the experimental system of HTS did not show obvious bias.
Taken together, these results indicate that IGHJ gene frequency follows a similar pattern where IGHJ4 is used at high frequency (>40%), IGHJ6/IGHJ3/IGHJ5 is used at medium frequencies (10%~20%), and IGH2/IGHJ1 is used at low frequencies (<4%). Therefore, the pattern shows high consistency in physiological and various pathological conditions, which suggests that the recombination selection of each IGHJ gene is nonrandom and rarely influenced by antigen selection.

IGHJ-IGHD pairing and trimming and insertion between IGHD and IGHJ
Six IGHJ gene families have different initial BCR-H repertoires, which may be related to nonrandom selection Trimming and insertion between IGHD and IGHJ mainly presented as 3'D trimmed, 5'J trimmed, and N2 insertion ( Figure 4). We found that the mean length of 5'J trimmed showed significant differences among different IGHJ genes under some conditions, while 3'D trimmed and N2 insertion did not show significant differences (data not shown). For IGHJ1 and IGHJ2, the 5'J trimmed length of IGHJ1 (in-frame sequences) showed significant differences compared with the other IGHJ subfamilies in the SLE and IgM with HBV vaccine groups (one-way ANOVA with Bonferroni correction, p<0.001). A similar situation occurred on 5'J trimmed of IGHJ2 in the breast cancer group. The mean length of 5'J trimmed of the IGHJ4 (in-frame or out-of-frame sequences) showed significant differences compared with the other IGHJ genes in the SLE group (one-way ANOVA with Bonferroni correction, each p<0.001). In all groups, IGHJ4 (high usage) showed significant differences compared with IGH1 and IGHJ2 (low usage) (one-way ANOVA with Bonferroni correction, each p<0.001). The mean length of 5'J trimmed from IGHJ6/IGHJ5/IGHJ3 (in-frame sequences) showed significant differences compared with that of the other 5 IGHJ subfamilies in different groups (one-way ANOVA with Bonferroni correction, each p<0.001). These results suggest that the composition of the IGHJ front end (5'J trimmed) may have an impact on the usage and efficiency of the D-J recombination, especially for the IGHJ genes with high or low usage.

The usage frequency of 6 IGHJ families in the BCR-H repertoires from 19 published articles
We analyzed the usage frequency of the 6  The usage frequency of the IGHJ4 gene subfamily was higher than that of other IGHJ genes, suggesting that IGHJ4 had the highest frequency in the initial rearrangement and showed high consistency in peripheral repertoires (after self-tolerance selection or the clonal proliferation response). The usage frequencies of IGHJ1 and IGHJ2 were significantly lower than those of the other IGHJ genes, suggesting that IGHJ1 and IGHJ2 may be partially restricted in the initial rearrangement and that they showed consistency in the peripheral repertoires. IGHJ6, IGHJ3, and IGHJ5 have a medium usage frequency, and the usage frequency of IGHJ6 was higher than that of IGH3 and IGHJ5, except for articles 2, 7, 8, 13 and 19. Additional results showed that IGHJ3 usage was higher than IGHJ5. Regardless of the physiological or pathological conditions, the usage frequencies of the 6 IGHJ gene families in our results are almost identical to those in the 19 published articles. The overall results indicate the nonrandomness of the 6 human IGHJ gene usages during the initial rearrangement process.

IGHD-IGHJ recombination may affect IGHJ gene usage through the RSS composition
Recombination of IGHJ-IGHD can be divided into two phases. The first phase involves recognition and cleavage of the DNA, and the second phase involves resolution and joining (4,6). In the evolutionary process, Variation from the conserved sequences in the heptamer and nonamer of the RSSs is considered a major factor affecting the relative representation of gene segments in the primary repertoire. The mechanism of RSSs on gene recombination is mainly related to the interaction efficiency of RAG protein (recombinase) (50)(51)(52)(53)(54). Based on the composition of human IGHJ gene families, we found differences in RSSs among 6 IGHJ gene families (Supplementary Table 9 and Figure 5), which suggests that these differences may affect the usage frequency (nonrandom) of IGHJ gene families.
The nonamer of human IGHJ4 and IGHJ6 is the conserved sequence 5'-GGTTTTTGT-3' or initial RAG protein binding (12). A single base mutation of the nonamer resulted in a reduction in overall cleavage levels when the heptamer was retained, but the entire nonamer was substituted with random sequence. Both nicks and hairpins were still found, but overall cleavage was reduced fold. Kowalski D et al found (55,56) that A-rich core sequences of the nonamer may be important to facilitate strand dissociation during the process of recombination.
The presence of three consecutive A residues was necessary for efficient recombination in the nonamer; furthermore, the nucleotides flanking the A-rich core needed to be other than one residue. The mechanism may be that the recombinase must measure the distance between the heptamer and the nonamer to satisfy the 12/23-bp spacer rule (3,12,13,14,15). Akamatsu Y et al found that the A residue at position 5 (nonamer A-rich core) was most crucial in their recombination assay (13). However, Hesse JE et al considered that the "A residue" at position 6 (nonamer) was most crucial in their recombination assay (3). Regarding the effect of nonamer A-rich core mutation and corresponding gene usage, Akamatsu Y et al found that recombination frequency decreased to 27.3% of the control with the mutant 9-4G (position 4 was changed to G) (13). A mutant at position 9-5C gave the lowest recombination frequency (10.4%). With the double mutant at positions 9-3G and 9-4G, the joining rate dropped only to 19.3% (9-6G and 9-7G was 26.0%). According to the results from cell line experiments, human IGH4 and IGHJ6 gene subfamilies appear to have a "complete A-rich core" in the nonamer (conserved), which may play an important role in their high usage selection.
In addition, Akamatsu Y et al found that the nonamer 9-2C was changed to 9-2A, and the recombination frequency was reduced to 2.7% of the control level; 9-2C was changed to 9-2T, and the frequency was reduced to 12.9%; and 9-2C was changed to 9-2G, and the frequency remained at 61.3% (13). When 9-8C/9-9C were changed to 9-8N/9-9A, the recombination frequency dramatically dropped to less than 0.1%, which suggested that the C residue plays an important role when the recombinase measures the distance between the heptamer and the nonamer sequences. In this study, one factor for the low usage frequency of the human IGHJ2 gene may be its 9-9C mutation to 9-9A.
The relationship between the heptamer and the recombination frequency of the corresponding gene family has been confirmed by several laboratories. Both studies found that the mutation of the entire heptamer resulted in low levels of nicking distributed across several sites, the mechanism of heptamer affecting recombination was related to the formation of hairpins, and the nicks and hairpins were reduced 2-fold when the sequence of the last four positions of the heptamer was changed (12,57). In addition, nicking formation depended on the heptamer for the generation of double strand breaks (DSBs) by RAG1 and RAG2, and the nonamer at the correct distance would improve heptamer efficiency in the natural RSSs. The first three nucleotide positions were nearly 100% conserved (CAC/GTG) in the BCR gene. The mutations were in the first three positions, and cleavage was impaired either at the nicking step or the hairpin formation site. No rearrangement was detected with the mutant at position l (7-1G). Mutations at position 2 (7-2T) and position 3 (7-3G) produced detectable levels of recombination, 0.5% and 0.6%, respectively. The G residue at position 5 was changed to C (7-5C), and the recombination frequency dropped to 5.9% of the control level. For the rest of the residues in the heptamer, mutation effects were moderate, ranging from 28.5 to 52.0%. Akamatsu Y et al found that no rearrangement was detected with the mutant at position l (7-1G), and mutations at position 2 (7-2T) and position 3 (7-3G) produced detectable levels of recombination, 0.5% and 0.6%, respectively. The recombination frequency dropped to 5.9% of the control level when the G residue at position 5 was changed to C (7-5C); for the rest of the residues in the heptamer, mutation effects were moderate, ranging from 28.5 to 52.0% (13).
The first three positions of the 6 human IGHJ gene subfamily heptamers are a conserved CAC/GTG sequence. Based on the results of Akamatsu Y et al, position 7-4A of human IGHJ1 mutated to 7-4G, and 7-6T/7-7G of IGHJ2 mutated to 7-6C/7-7C, which may be one important factor causing their low usage frequency. In addition, the 7-5G mutation (IGHJ3, IGHJ4, IGHJ5, and IGHJ6) may have a moderate effect on their usage frequency. The effect of mutations in the human IGHJ heptamer on usage frequency needs to be further explored.

The RSS spacer and combination frequency
The length of the spacer is also a determining factor contributing to the usage frequency of V(D)J rearrangement. Human IGHJ4 and IGHJ3 gene subfamilies have a conserved 23 bp length; however, the IGHJ1, IGHJ2, IGHJ5, and IGH6 gene subfamilies have 21 bp or 22 bp spacer lengths.
Akamatsu Y et al found that the recombination frequency dropped to 7.7% with the 11-bp RSSs when one C residue was added to the 12 bp RSSs (13 bp spacer) (11.0% joining rate); when two C residues were added  Table 9). Base C has the highest ratio in IGHJ4. Is this the reason for the higher usage frequency in the recombination of the IGHJ4 gene subfamily? Whether the base composition of spacer sequences such as the nonamer has the key "A-rich core" structure need to be further explored.

Distance and combination frequency
It has been confirmed that the proximal gene has preferred usage in the initial rearrangement (8-10). Malynn, B.A., et al. believe that the difference in IGHV gene usage in adult spleen B cells is mainly due to the selection of the initial rearrangement rather than the changes in expression frequency after rearrangement (59). The "proximal and distal" studies of BCR recombinant genes are mainly focused on the V gene. "Proximal and distal" differences in the J gene have not been reported. In our results, we did not find the "proximal" phenomenon in the 6 IGHJ gene families with high usage frequencies.

Other factors and combination frequency
Ramsden DA et al found that the sequence of the coding end may be related to the usage frequency of gene combination (12). We found that there are differences in the amino acid length and the coding flank sequences of the human 6 IGHJ families (Supplementary Table 9). The IGHJ4 gene has the shortest 16 amino acid components. The sequence of the coding end and AA length may affect the usage frequency of IGHJ. We analyzed the deletions of the 3'D end, the 5'J end and the insertion between the D-J end and found that there was a difference between the 5'J end of IGHJ4 and other IGHJ genes ( Figure 3 and Supplementary   Table 11). Whether it was a factor for high usage of IGHJ4 needs to be further studied. In addition, IGHD gene families may also affect the nonrandomness of IGHJ genes. VanDyk LF et al suggested that V(D)J recombination was targeted by RSSs, while the RSSs flanking D segments appeared to be equivalent. They were not randomly utilized, suggesting that the D-3' RSSs were not simply superior targets for the D-J recombinase but instead that targeting certain 12/23-bp spacer RSS combinations is more effective (60).
We found that the conserved nonamer of IGHJ4 and IGHJ6 had a higher "double-stranded complementary paired" rate than the 27 IGHD nonamer sequences (Supplementary Table 10 and Supplementary figure 1), although it did not show obvious differences. At present, no evidence to support that RAG has an effect on the "double-stranded complementary paired" of the J-heptamer to D-heptamer and J-nonamer to D-nonamer exists; the mechanism is still unknown. We hypothesize that two genes with high complementarity (7-7/9-9) may be more favorable for binding, cleaving, hairpin formation, and DSB in the recombination process ( Figure   5), which is a very interesting entry point for further research in BCR gene recombination.
In summary, for the possible impacts of RSSs on IGHJ usage frequency, RSSs of human IGHJ4 genes are consistent with conserved RSSs. The length of the IGHJ6 spacer (23 bp) changed slightly, the nonamer and heptamer of IGHJ3 changed, and the length and nonamer of IGHJ5 changed. However, the nonamer, heptamer and spacer of IGHJ1 and IGHJ2 changed significantly. These may be factors that resulted in nonrandom usage of the human IGHJ gene (generally, IGHJ4>IGHJ6>IGHJ3> or ≈ IGHJ5>IGHJ2≈IGHJ1) in the initial rearrangement. In the initial human BCR-H repertoires (before antigen selection), the "background" of the 6 IGHJ genes (the initial usage frequency) is different and rarely influenced by antigen selection. These results suggest that re-evaluation and further investigation are needed when analyzing the significance and mechanism of each IGHJ gene usage in self-tolerance selection and the clonal proliferative response.

Total RNA/DNA extraction and cDNA synthesis
Total RNA was extracted from the PBMCs in three volunteers with immunization with HBV vaccine according to the manufacturer's protocol for the total RNA extraction kit (OmegaBio-Tek). The total RNA was then reverse transcribed into cDNA using Oligo dT18 according to the manufacturer's protocol for the reverse transcription kit (MBI, Fermentas). The genomic DNA from PBMCs in other samples was obtained using the QIAamp DNA Mini Kit (QIAGEN, CA) and was stored in a QIAsafe DNA tube (QIAGEN).

High-throughput sequencing
All the DNA samples were sent to Adaptive Biotechnologies Corp (http://www.adaptivebiotech.com) for multiplex PCR amplification of human BCR-HCDR3 regions. Error from bias in this multiplex PCR assay was controlled using synthetic templates (26), and the HCDR3 sequences were acquired by HTS on the ImmunoSEQ platform (http://www.adaptivebiotech.com) (23). All the PCR products of cDNA samples after PCR amplification were sent to Tongji-SCBIT Biotechnology Corporation for HTS, and detailed experimental procedures have been described in our previous article (25). The HCDR3 regions were identified within the sequencing reads according to the definition established by the International ImMunoGeneTics (IMGT) collaboration. A standard algorithm was used to identify which V(D)J segments contributed to each HCDR3 sequence.

Public data
We used 9,340 unique in-frame BCR-H sequences (non-HTS data in different pathological states) derived from the IMGT/LIGM-DB to analyze the IGHJ gene frequency by IMGT/HighV-QUEST (27). In addition, 50,290 BCR-H sequences of memory B cells and 48,167 HCDR3 sequences of naive B cells from a public database (HTS data from 3 healthy volunteers) were used for this study (28). The unique in-frame BCR-H sequences (n=84,804) and out-of-frame sequences (n=13,653) were compared and analyzed by IMGT/HighV-QUEST software in this study.

Sequence analysis
The raw sequences in FASTA format were analyzed with IMGT/HighV-QUEST online software (version 1.3.1, http://www.imgt.org). Using the IMGT summary document, the sequences not meeting the following criteria were filtered out: (1) no results (sequences for which IMGT/HighV-QUEST did not return any result) and (2) unknown (sequences for which no functionality was detected. This category corresponds to the sequences for which the junction could not be identified (no evidence of rearrangement, no evidence of junction anchors).).
In-frame and out-of-frame unique sequences remaining after filtering were used for IGHJ gene frequency, D-J pairing, and nucleotide insertion and deletion analyses.

RSS composition analysis
According to the accession numbers of these human IGHJ and IGHD genes in IMGT and GenBank, we obtained detailed annotations of complete human genome sequences for RSS composition analysis, including sequence characteristics of nonamers and heptamers, length characteristics of 12 bp and 23 bp spacers, and the IGHJ gene segment (amino acid) composition of code end.

Software and statistics
IMGT/HighV-QUEST (version 1.3.1) was used for identification of sequences (JH and DH), evaluation of functionality and statistical analysis of the sequence data; IMGT/V-QUEST (version 3.3.1) was used for identification of nonamers, heptamers, 12 bp and 23 bp spacers, and IGHJ gene segments of the coding end; Microsoft Office Excel (version 365) was used for storage, filtering and statistical analysis of the sequences.
The resulting sequences were graphed using Prism 8 software (GraphPad). IGHJ gene usages were compared using a χ 2 test. Insertions and deletions of the nucleotides were compared using one-way ANOVA with Bonferroni correction. All statistically significant differences are indicated as *=p<0.05; **=p<0.01, and ***=p<0.001.