Characterizing Features of Human Circulating B Cells Carrying CLL-Like Stereotyped Immunoglobulin Rearrangements

Chronic Lymphocytic Leukemia (CLL) is characterized by the accumulation of monoclonal CD5+ B cells with low surface immunoglobulins (IG). About 40% of CLL clones utilize quasi-identical B cell receptors, defined as stereotyped BCR. CLL-like stereotyped-IG rearrangements are present in normal B cells as a part of the public IG repertoire. In this study, we collected details on the representation and features of CLL-like stereotyped-IG in the IGH repertoire of B-cell subpopulations purified from the peripheral blood of nine healthy donors. The B-cell subpopulations were also fractioned according to the expression of surface CD5 molecules and IG light chain, IGκ and IGλ. IG rearrangements, obtained by high throughput sequencing, were scanned for the presence of CLL-like stereotyped-IG. CLL-like stereotyped-IG did not accumulate preferentially in the CD5+ B cells, nor in specific B-cell subpopulations or the CD5+ cell fraction thereof, and their distribution was not restricted to a single IG light chain type. CLL-like stereotyped-IG shared with the corresponding CLL stereotype rearrangements the IGHV mutational status. Instead, for other features such as IGHV genes and frequency, CLL stereotyped-IGs presented a CLL-like subset specific behavior which could, or could not, be consistent with CLL stereotyped-IGs. Therefore, as opposed to the immuno-phenotype, the features of the CLL stereotyped-IG repertoire suggest a CLL stereotyped subset-specific ontogeny. Overall, these findings suggest that the immune-genotype can provide essential details in tracking and defining the CLL cell of origin.


INTRODUCTION
Chronic Lymphocytic Leukemia (CLL) is characterized by the accumulation of monoclonal B lymphocytes expressing CD5, CD23, and low surface immunoglobulin in blood, bone marrow, and lymphoid tissues (1,2). Analyses of many CLL clones demonstrated that the IG gene rearrangements encoding the CLL B cell receptor (BCR) exhibit a striking skewed use of IGHV genes resulting in an IG repertoire different from that of normal B lymphocytes (3)(4)(5). Moreover, despite the enormous diversity potentially generated by the recombination of IGHV-IGHD-IGHJ genes, up to 40% of the CLL clones (6, 7) exhibit highly similar stereotyped BCR, which has led to the categorization of the CLL clones with stereotyped BCR into subsets based on their similarities. Although several hundred CLL subsets have been identified, those most frequently encountered, defined as "major subsets," are limited in numbers. Stereotyped BCRs are determined based on the VH CDR3 features of at least 50% of amino acids identity and 70% of amino acid motif similarities, identical VH CDR3 length and location of a shared pattern(s) among sequences of the same group (6,8), and the use of IGHV genes belonging to the same phylogenetic clan. IGHV clans are IGHV family genes with structural similarities (9). Conceptually, the stereotyped gene rearrangements should be considered part of the public IG repertoire because different individuals share them (10). In addition, CLL clones of the same stereotyped subsets show IG light chain restrictions (e.g., #1, #2, #4, #6, #8, #64b, and #99) (6,(11)(12)(13) and IGKV-IGKJ/IGLV-IGLJ gene rearrangements presenting stereotypy features similar to those of IGHV rearrangements with limitations in IG light chain gene usage and VL CDR3 composition.
The above observations support the notion of a role for BCR stimulation in CLL ontogeny (14)(15)(16); moreover, the results of therapies with inhibitors of BCR-associated kinases suggest that stimulation via BCR may be critical for the survival/proliferation of CLL cells in full-blown leukemia (17).
Previous studies have identified IGHV-IGHD-IGHJ rearrangements sharing features with that characteristic of CLL subsets in splenic and circulating B cells from normal subjects (18)(19)(20). These rearrangements, from now onward defined as CLL-like stereotyped-IG or CLS-IG, can be observed in different B cell subpopulations, even though they accumulate in the CD5 + B-cells (20,21).
This study used high-throughput sequencing technology on peripheral blood B cell subsets to elucidate CLS-IG's features and cellular distribution. For this purpose, the cells, separated into defined subsets, were also fractionated according to CD5 or IG light chain expression. The data obtained provide a new perspective for interpreting the origin of CLL cell repertoire and possibly for disease ontogeny.

Samples
Peripheral blood cells were obtained from the leukopak of anonymous blood donors (nine donors aged 55 to 64 years old) at the San Martino Hospital Blood Center presentation. Each leukopak is derived from~500ml of blood. B cells were enriched with RosetteSep Human B Cell Enrichment Cocktail (Stemcell Technologies, Vancouver, Canada), obtaining on average 42x10 6 B cells per donor (24x10 6 to 60x10 6 ) (Supplementary Table S1).

Library Preparation and Sequencing
IGH sequencing and analysis were performed as previously reported in detail (22). Briefly, the library was prepared from mRNA with a multiplex approach with IGHV-specific forward primers on the leader sequence and reverse primers on the constant region. The primer set was kindly provided by TIB Molbiol srl (Genoa, Italy). The amplicons obtained included the entire IGHV-IGHD-IGHJ gene and enough constant region to assign the isotype; UMIs (14 to 16 nucleotides) were inserted during ds-cDNA synthesis. The libraries were indexed with Illumina Nextera XT V2 kit (Illumina, San Diego, CA, USA) and sequenced on Illumina MiSeq (MiSeq V3, 2x300 kit, Illumina). DNA sequences were deposited on the NCBI Sequence Read Archive (SRA) portal with BioProject ID: PRJNA807871.

Bioinformatics Analysis
Raw reads were processed with a custom-built workflow using pRESTO (23) as previously described (20); processed sequences were then annotated by IMGT/HighV-QUEST (24). Only productive rearrangements derived from the consensus of two or more raw reads without N nucleotide passed the quality filter. ChangeO (25) was used to define and annotate clonotypes as sequences with identical amino acid VH CDR3 sequence using the same IGHV gene and IGHJ gene.

CLL Subsets Assignment
To identify CLL-like stereotyped sequences (CLS-IG), we first selected the sequences in our database consistent with the core features (IGHV clan, IGHV mutational status, and VH CDR3 length) (26) of each of the 19 major CLL stereotyped subsets and then submitted them to ARResT/AssignSubsets (26) for the assignment. Sequences assigned to CLL stereotyped subsets with a confidence "average" or higher were considered CLS-IG. Sequences not assigned to CLL stereotyped subsets were defined as non-CLS-IG. The entire analysis was then performed at the clonal level, i.e., for each clonotype, a single representative sequence was considered.
To identify VH CDR3 aa sequences consistent with a CLL stereotyped subset but with the reverse (r) mutational status compared to the core feature of CLL stereotypies (rCLS-IG), we re-submitted to aRResT the IGHV-IGHD-IGHJ rearrangements sequences in which the original IGHV gene region was replaced with one with the opposite mutational status, generating an in silico chimeric sequence.

Statistics
Statistical analyses were performed in R. Paired Wilcoxon tests or binomial test with Bonferroni correction applied to assess differences in CLS-IG or VH gene frequency (*p ≤ 0.05, **p ≤ 0.01, ***p ≤ 0.001). The paired Wilcoxon test was calculated only if three or more donors presented data for both points. CLS-IGs' frequency was calculated only when the number of CLS-IG sequences in the group was two or more. Frequencies obtained from only one CLS-IG sequence within a group were not considered informative and therefore not plotted unless specified.  Figure S1 for phenotypes and details. After quality filtering, a total of 2,679,224 productive, unique IGHV-IGHD-IGHJ sequences were obtained from 8,043,500 sorted B cells. Curated sequences were subsequently clustered into 2,184,656 clonotypes (detailed in Supplementary Table S2), of which 1754 (0.08%) were assigned to one of the major CLL stereotyped subsets (CLL-subset) with ARResT/AssignSubsets (26) (as detailed in Methods) and defined as CLL-like stereotypes-IG (CLS-IG) (detailed in Supplementary  Table S3). Clonal families were used as references for the entire analysis.

Correlation of the Higher CLL-Like Stereotyped-IG Representation in CD5 + B Cells With an Asymmetrical Distribution of U and M IGHV Rearrangements in CD5 + and CD5 -B Cells
First, sequences from all CD5 + or CD5subpopulations respectively were pooled and analyzed for the presence of CSL-IG to see whether the CSL-IG were predominant in the CD5 + cell fraction, and we found that the proportion of CLS-IG was significantly higher in CD5 + than in CD5 -B cells ( Figure 2A). However, when the IGHV gene rearrangements were separated into mutated (≥2% IGHV gene mutations, M-IG) and unmutated (<2% mutations, U-IG), there was no difference in the frequency of CLS-IG between CD5 + and CD5 -B cells within a single mutational status group ( Figure 2B). Furthermore, U-IG clonotypes had more CLS-IG than M-IG clonotypes in both CD5 + and the CD5populations. When looking at the average IGHV mutation frequency, the CD5 + clonotypes appear to be enriched in U-IG, contrary to CD5clonotypes enriched in M-IG ( Figure 2C).

Presence of CLL-Like Stereotyped-IG in B-Cell Subpopulations
CLS-IG were found in all B-subsets, although in different proportions ( Figure 3A); i.e., N and TR B cells had a significantly higher CLS-IG representation (0.09%) than MZ (0.05%) and SM (0.03%) B cells. N and TR B cells also had the highest U-IG sequences ( Figure 3C). Further fractionation of each B-subset into CD5 + and CD5cells did not show a significant predominance of CLS-IG in any of the CD5 + cell fractions ( Figure 3B). In most cases, SM, DN, and MO B-subsets presented none or just one CLS-IG per donor. It must be noted that these B-subsets have the lower frequency of CD5 + cells and, therefore, the least IGH sequences (Supplementary Table S2).

Absence of IG Light Chain Restriction in CLL-Like Stereotyped-IGs
Many CLL clones with stereotyped receptors have restricted IG light chain usage (27,28). This is particularly evident in leukemic clones utilizing subsets #1, #2, #4, #6, #8, #64b, and #99 (6,12). We investigated whether CLS-IG from normal B cells showed light chain restriction. As each B-subset was fractionated starting from pre-sorted IGk + or IGl + B-cells and sequenced independently, we could obtain the CLS-IG distribution in both IGk and IGl-bearing cells. Among the CLL subsets for which clear evidence of IG light chain restriction is present in CLL clones, sufficient sequences for an informative analysis were obtained from subsets #1, #2, #64b, and #9 ( Figure 5). There was no statistically significant difference in the IG light chain association for any CLS-IG analyzed. Only CLS-IG of subset #2 showed a trend indicating enrichment in IGk + normal B cells; this contrasts with CLL clones of the same subset that consistently show a pairing with IGl. The entire distribution of the IG light chain among CLS-IG stereotypes is reported in Supplementary Figure S2.

Similar Utilization of Mutated and Unmutated IGHV Genes by CLL Clones and CLL-Like Stereotyped-IGs
Most CLL stereotyped rearrangements are restricted to the utilization of U or M IGHV genes (6,18). This characterizing feature is used by ARResT for CLL stereotyped subset assignment, meaning that a VH CDR3, potentially showing stereotyped features, will be excluded from a given subset if the IGHV gene, concomitantly utilized, fails to share the mutational status expected for that subset. Based on these considerations, we analyzed all the sequences from normal cells identified in this study to see whether there were restrictions to a specific mutational status for CLS-IG. To this end, we investigated whether VH CDR3 aa sequences, consistent with a CLL stereotyped subset, were present independently of the IGHV mutational status by extending the analysis to sequences with a reversed (r) mutational status compared to the core feature of the CLL subset (rCLS-IG) (see Methods). We observed that the "r" rearrangements were significantly less frequently classified as stereotyped than the native ones ( Figure 6A). The same trend was observed for CLS-IG and rCLS-IG from individual CLL subsets when sufficient data were available for analysis (#1, #5, and #14, Figure 6B). The median mutation values of CLS-IG from the major CLL subsets followed a pattern like that of the corresponding CLL stereotyped-IGs ( Figure 6C). For example, #1 was preferentially unmutated, #2 had around 2% mutations, and #14 was mutated. This pattern was similar for both the IGk and IGl B-cells.
Identification of "Typical" and "Non-Typical" CLL-Like Stereotyped-IGs One of the core features defining each CLL stereotyped subset is represented by the VH clan utilized, and, in each CLL subset, the rearranged IGHV genes often present a restriction at this level (6). This restriction ranges from about half the genes within a clan to a single gene (i.e., #4 -IGHV4-34, #8 -IGHV4-39). Again, all the CLS-IG sequences of this study were analyzed and classified as "typical" if expressing an IGHV gene of those found at least once in their respective CLL counterparts (6) and "non-typical" if expressing a different IGHV gene within the VH clan. For each CLL subset, the CLS-IG VH genes usage distribution was compared to control sequences from the same dataset. Control sequences were defined as sharing the core features of the CLL subset in consideration (6) (e.g., #1 -the use of a clan I VH gene, a U-IGHV gene, and a VH CDR3 length of 13 aa). The patterns observed varied among subsets (see Figure 7) and can be summarized in three broad categories as follows: i) CLS-IG falling almost exclusively in the typical category as in subset #1, where even though all rearrangements but one (1/379) were typical, the IGHV gene usage followed a distribution that was not consistent with that observed in either CLL or control non-CLS-IG sequences. ii) CLS-IG distributed in typical and non-typical categories with an IGHV gene distribution like non-CLS-IG control sequences and not presenting the CLL IGHV bias. Two examples of this condition were the CLS-IG sequences from subsets #2 and #3, which utilized several IGHV genes (with a frequency similar to that of non-CLS-IG control sequences) rather than the IGHV3-21 and the IGHV1-69 genes characterizing the #2 and #3 CLL subset, respectively. iii) CLS-IG showing the same bias toward the utilization of a specific IGHV gene marking the CLL stereotyped subset. An example is subset #14, where most CLS-IG expressed the IGHV4-4 gene like the CLL stereotyped-IG and, in this respect, were different from non-CLS-IG control sequences. When separating the sequences by the IG light chain used, some fluctuation in the IGHV gene used could be observed; nevertheless, the general pattern described above remained valid for sequences from both the IGk and IGl group (see Supplementary Figure S3). Figure 7A. In half of the informative CLL subsets (#1, #2, #5, #7H, #12, #14, and #99), we observed a significantly higher representation of typical IGHV genes in CLS-IG compared to controls ( Figure 7B). As for the predominant IGHV gene often noticed in each single CLL subset, we observed that this bias generally was not present in CLS-IG. Indeed nine out of the 12 CLS-IG did not show evidence for this condition ( Figure 7C).

DISCUSSION
The identification of CLL clones' stereotypical IG receptors indicates that different individuals share immunoglobulin rearrangements with these specific features. This aspect reminds the existence of identical CDR3 aa sequences between individuals, which defines the so-called public repertoire (10,29). More recently, the definition of public repertoire has been made less stringent by including rearrangements with a high degree of similarity (30). In this respect, the public repertoire could enclose stereotypic rearrangements recognized in CLL clones. The scope of this study was that of investigating whether, in healthy donors, CLL IG stereotyped sequences were present in the IG repertoire of any of the circulating B-subsets and to which extent. The donors analyzed did not have evidence of lymphocytosis, thus excluding the possibility of introducing biases due to some type of preleukemic stage, which may result in B-cell repertoire alterations (31). Blood samples were collected from donors with a median age of 58.8 (range 56-67), which is in the range of possible identification of clonotypes ascribable to CLL clones that became clinically evident later. Indeed, it has been reported that clonotypes attributable to the leukemic clone can be identified in the PB up to 16 years before CLL diagnosis (32). It is, however, unknown whether phenotypic changes were present in the PB B lymphocytes at this early stage. The age choice appears relevant for comparison with CLL, given that changes in B cell repertoire within single B-cell subsets may occur with advancing age and may influence the cell population in which leukemogenesis occurs. In addition, it has been reported that changes in the representation of certain CLS-IG can be observed in aging individuals (19).
Although CLL stereotyped receptors have been reported in healthy donors (18,19,(33)(34)(35), our approach provides more details on their phenotypic features, considering, for instance, the expression of IG light chains and CD5 expression within Bsubsets. The identification and characterization of CLL stereotyped IG in healthy donors could provide information on the cell(s) of origin of CLL. Attempts to identify a specific Bsubset as a compartment of origin for CLL have been unsatisfying. Here we attempt to lay the ground for a better understanding of the CLL cell of origin.
Consistent with previous reports (20,21), our data show a significantly higher presence of CLS-IG in bulk CD5 + compared to bulk CD5 -B cells. Nevertheless, analysis of U-and M-IGHV gene rearrangements within CD5 + and CD5 -B cells highlighted no enrichment of CLS-IG in CD5 + compared to CD5 -B cells within each mutational category ( Figure 2B). Instead, CLS-IGs were enriched in U-IGHV genes regardless of being expressed by CD5 + or CD5 -B cells.
CLS-IG representation was also investigated in B-subset and their respective CD5 + and CD5fractions. The analysis showed a higher CLS-IG percentage in TR and N than in MZ and SM Bsubset ( Figure 3A), a finding in line with a previous study from our group on splenic CLS-IG carrying the IGHV-1 family genes (20). Since TR and N cells utilize almost exclusively U-IGHV rearrangements ( Figure 3C), the higher representation of CLS-IG in these cell fractions is likely related to the predominance of U-IGHV rearrangements. The analysis of CD5 + and CD5cells from each B-subset separately showed no dominant presence of CLS-IG in any of the cell fractions studied ( Figure 3B). However, MZ shows an evident trend where CD5 + have more CLS-IG than CD5 -. It must be noticed that MZ presents a considerable amount of U-IGs concentrated in the CD5+ fraction ( Figure 3C). It must be mentioned that because CD5 + B cells are scarce in predominantly mutated B cells subsets, a limited number of IG rearrangements were obtained in some CD5 + Bcell fractions (i.e., MO, SM, DN), impairing the possibility of statistical analysis (see Supplementary Table S2 for details). Altogether, our data do not present evidence of a phenotypically distinct B-subset with a significant accumulation of CLS-IG. Their predominance in CD5 + N and TR B cells depends upon the higher presence of U-IG rearrangements in these B-cell fractions.
The relative distribution of CLL subsets in normal B cells did not follow that typically reported for CLL ( Figure 4B). For example, CLL subsets #4 and #8, relatively frequent among the CLL major subsets, were rarely (or never) identified. A preferential representation of individual CLS-IG in the IGk or IGl expressing B-subset was not observed. In contrast, specific CLL subsets (e.g., #1, #2, #4, #6, #8, #64b, and #99) show, in CLL, marked IG-light chain use restrictions (6,12,36). It might be worth mentioning that CLS-IG #2 showed a trend indicating a higher representation in IGk + normal B-cells. This would be compatible with the observation that, in some instances, CLL clones belonging to subset #2 show functional rearrangements of the IGk light chain inactivated with a Kappa Deleting Element (37). Thus IG-receptor editing may operate in this apparently large proportion of CLS-IG subset #2/IGk + B cells. These observations indicate no major structural constraints contributing to certain IGH-IG light chain pairing in the CLS-IG of a normal B-cell repertoire. This is in keeping with what has been reported about the possibility of CLL subset #2 IGHV rearrangement to pair with "non-native" (e.g., other than IGLV3.21 encoded IGl light chain) IG light chains (38).
The IGHV genes used by CLS-IG rearrangements within each CLL subset showed a heterogeneous pattern. Their overall utilization was not always as restricted as CLL stereotyped IG. For instance, the IGHV1-69 gene was virtually absent in CLS-IG from CLL subset #1, even though this gene is one of the IGHV1 genes highly represented in control IGs. CLS-IG in CLL subsets #1 and #3 exemplify the absence of IGHV selection, whereas CLS-IG in CLL subset #14 acts closely to what is observed in CLL with a highly prevalent representation of the IGHV4-4 gene. This indicates that specific VH CDR3 sequences may have a nonrandom association with IGHV genes within a VH clan, possibly related to restrictions during IGHV-IGHD-IGHJ rearrangements and/or positive and negative selection in ontogenesis or the course of early antigenic challenges. The absence of IGk and IGl restriction for each of these CLS-IG subsets points out, at this stage, negligible participation of the IG light chains. Altogether, the leukemogenic process likely involves a further selection of IGHV genes and the IG light chains in a CLL stereotype-specific manner.
The mutational status of CLS-IG deserves particular comment. When the CLS-IG analysis was extended by removing the IGHV mutational status as a prerequisite for classification (see Methods), the representation of CLS-IG mainly followed the mutational status characterizing the original CLL clones. For instance, subset #1 (always unmutated in CLL) was identified predominantly in the U-IG repertoire of normal donors and the majority of subset #14 CLS-IG (mutated in CLL) were recognized in the M-IG repertoire. Likewise, subset #2 CLS-IG were found in the U and M CLS-IG repertoire, as observed in the CLL cohorts. The observation that the mutational status of CLS-IG parallels that of CLL IG stereotypes suggests that specific IG rearrangements could influence clonal function, e.g., by limiting the generation of a post-germinal center progeny (CLL subset #1) or by determining an accumulation of memory B cells (CLL subset #14).
Thus, single CLS-IG in recirculating B cell has only marginally superimposable features compared to those encountered in CLL clones and each one appears to have its characteristics. For instance, subsets #4 and #8 are substantially absent in the CLG-IG repertoire; subset #1 CLS-IG shows the utilization of IGHV genes closer to CLL stereotypes, whereas that of CLS-IG, subset #3 is more random. It can be presumed that the trajectory determining the emergence of CLL clones is very heterogeneous.
The above data demonstrate that CLS-IG detected in peripheral B cells from donors with no evidence of peripheral lymphocytosis have different features than those typically identified in leukemic clones, suggesting the shaping of CLL BCR repertoire and the emergence of the leukemic clones is dictated by numerous selecting factors. A recent study (32) showed that skewing of the B-cell repertoire is observable in some clusters before the clinical presentation of CLL. Thus, subjects in a pre-leukemic phase or predisposed to developing CLL are likely to have a different condition than the donors analyzed here. The observations that CLS-IGs are not enriched in PB derived B cells with a defined phenotype can be interpreted in different ways: 1) CLL cells originate from non-circulating B cells residing in solid lymphoid tissue, 2) CLL cells originate from B cells with a different phenotype than the ones explored in this study, 3) CLL cells originate from B cells without a defined immunophenotype. The above interpretations may not be mutually exclusive. In addition, it is possible that the leukemogenic process can be accompanied by immunophenotypic changes that encompass an elevated expression of CD5 typically observed in CLL and MBL cells and presumably the pre-monoclonal B-cell lymphocytosis described by Kolijn,P. et al. (32) In this context, B-cell immune-genotype appears to be a relevant factor in the quest to identify the CLL clones' cell of origin, adding additional elements useful for understanding CLL emergence routes.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://www.ncbi. nlm.nih.gov/genbank/, PRJNA807871.

AUTHOR CONTRIBUTIONS
DB designed the study, designed the library preparation protocol, analyzed data and wrote the manuscript. MoC performed FACS Sorting, prepared the library, sequenced and analyzed the data and wrote the manuscript. DR designed and performed FACS Sorting. SM and RM processed biological samples and prepared the library. GU and VA collected samples. LA, SV, and AM performed data and statistical analysis. AN, FM, and GC supervised research. MaC and FG prepared the figures. MF and FF designed the study and wrote the paper. All authors read and approved the final manuscript. four-way pre-sort with yield setting to separate enriched viable B cells into IGk + /CD5 + , IGk + /CD5 -, IGl + /CD5 + and IGl + /CD5 -B cells (C) IGk + / CD5 -B cells subset (here shown as representative of all the other above subsets) was further sorted by gating CD24 -CD38 high B cells to exclude plasmablasts and separated into CD24 high CD38 high transitional (TR). Gating CD38 -/+ B cells were then separated based upon IgD and CD27 expression markers and sorted as IgD ++ CD27naive (N), IgD low CD27 + marginal zone-like (MZ), IgD -CD27double negative (DN) B cells. IgD -CD27 + memory B cells were gated to isolate IgM + IgD -CD27 + IgM-only memory (MO) and IgM -IgD -CD27+ switch-memory (SM) B cells. Blue bars indicate non-CLS-IG control sequences (i.e., sharing the same core features as the CLL subset in consideration -IGHV clan, IGHV mutational status, and VH CDR3 length). These rearrangements were used as controls. Black horizontal lines indicate the level of IGHV representation in the reference CLL cohort.