Single-Cell Transcriptomic Analyses Define Distinct Peripheral B Cell Subsets and Discrete Development Pathways

Separation of B cells into different subsets has been useful to understand their different functions in various immune scenarios. In some instances, the subsets defined by phenotypic FACS separation are relatively homogeneous and so establishing the functions associated with them is straightforward. Other subsets, such as the “Double negative” (DN, CD19+CD27-IgD-) population, are more complex with reports of differing functionality which could indicate a heterogeneous population. Recent advances in single-cell techniques enable an alternative route to characterize cells based on their transcriptome. To maximize immunological insight, we need to match prior data from phenotype-based studies with the finer granularity of the single-cell transcriptomic signatures. We also need to be able to define meaningful B cell subsets from single cell analyses performed on PBMCs, where the relative paucity of a B cell signature means that defining B cell subsets within the whole is challenging. Here we provide a reference single-cell dataset based on phenotypically sorted B cells and an unbiased procedure to better classify functional B cell subsets in the peripheral blood, particularly useful in establishing a baseline cellular landscape and in extracting significant changes with respect to this baseline from single-cell datasets. We find 10 different clusters of B cells and applied a novel, geometry-inspired, method to RNA velocity estimates in order to evaluate the dynamic transitions between B cell clusters. This indicated the presence of two main developmental branches of memory B cells. A T-independent branch that involves IgM memory cells and two DN subpopulations, culminating in a population thought to be associated with Age related B cells and the extrafollicular response. The other, T-dependent, branch involves a third DN cluster which appears to be a precursor of classical memory cells. In addition, we identify a novel DN4 population, which is IgE rich and closely linked to the classical/precursor memory branch suggesting an IgE specific T-dependent cell population.


INTRODUCTION
B cells can differentiate into plasma cells and secrete large amounts of antibody. There are, however, many more B cell functions which contribute to an effective immune response. B cells are highly effective antigen presenting cells, capable of presenting both protein and lipid antigens to T and NKT cells. B cell -T cell interactions trigger a variety of activation signals in both directions resulting in enduring affinity matured B cell memory, that may also be class switched, and activated T helper cells. B cells can also be activated via TLRs, producing proinflammatory cytokines such as IL6, TNFa and IFNg in the process and resulting in differentiation into short-lived plasmablasts. The former, T-dependent, response will involve formation of germinal centers over time and, since it is dependent on T cells for maturation which have also been through tolerance checkpoints, it would normally have low risk of producing autoantibodies. The latter, extrafollicular, B cell response has the advantage of being more rapid, but also runs some risk of producing lower specificity antibodies. B cells can also be regulatory, producing IL10 and ensuring that autoreactive responses are not perpetuated.
In studying different functions of human B cells in health and disease most studies rely upon phenotypic differentiation in FACS analyses from peripheral blood using IgD and CD27, or CD24 and CD38, in conjunction with the pan B cell marker CD19. For example, the CD19+CD27+IgD+ IgM memory population (1,2) is reduced in the elderly as a percentage of total B cells (3,4) This has important consequences for older people, since the IgM memory population is thought to provide protection against the bacterial polysaccharide T-independent antigens. Higher dimensional phenotyping shows that the IgM memory population in the blood is heterogeneous and further age-related differences are also seen (5), although the likely functional significance of this age related heterogeneity has yet to be determined.
The CD19+IgD-CD27-'double negative' (DN) cells are of particular interest. Many different roles have been ascribed to this population; 'memory precursors', "exhausted memory cells", 'tissue based memory', 'extrafollicular ASC precursors' and 'atypical memory' (6)(7)(8)(9)(10)(11)(12)(13)(14)(15) or the most recent nomenclature DN1 (memory precursor) and DN2 (extrafollicular ASC precursor B cells) (16). DN cells are increased in older people, and in chronic infections such as HIV (6,7,10,11). DN cells are also expanded in autoimmune disease such as Systemic Lupus Erythematosus (SLE) (14,17) where they are responsive to IFNg and thought to be precursors for pathogenic antibody secreting cells (8,9,18). Repertoire studies to try and clarify the relationship of DN cells to classical CD19+CD27+IgDmemory cells have been carried out and find both evidence for a close relationship with classical cells, with clones shared in both populations, as well as a difference in overall average repertoire character, with less hypermutation and larger complementarity determining regions (19). It is therefore likely that the DN compartment is functionally heterogeneous and only with high resolution techniques, such as single-cell transcriptomics, will it be possible to tease apart the sub-populations.
Single-cell transcriptomics is rapidly becoming a key methodology in biology thanks to its high resolution in terms of individual cells and high dimensional data. It offers the ability to discriminate between subsets of heterogeneous populations to understand individual contributions which may have previously been confounded by the Simpson's paradox of studying averaged data. Unsupervised clustering algorithms offer us the chance to define subsets transcriptionally and interrogate the results to find tractable markers for use in phenotypical distinction of the same. Information about the possible functions of cell clusters can be inferred from the transcriptome relative to other clusters. scRNAseq is particularly useful in B cell immunology, where it has made pairing of heavy and light chain sequences possible (20).
Curated reference databases, such as the Human Cell Atlas (21), and baseline transcriptomic profiles of particular immune sites (22) are important for comparison to experimental datasets as well as providing new insights into already well studied biological systems. A frequently occurring problem in singlecell datasets, however, is the relative lack of B cells, given that most human experiments are run on readily available PBMCs; where B cells only make up between 5-10% of the total. Hence B cells are often under-described in single-cell experiments, and often only divided into naive, memory and plasma cells which may miss important distinctions in disease. This is a general problem but has been particularly evident in recent COVID-19 papers (16,(23)(24)(25)(26)(27)(28)(29). We have therefore produced a dataset looking exclusively at five commonly defined B cell populations by phenotype to improve the resolution and better understand the heterogeneity present in B cells. This phenotyping followed by transcriptomic characterization was performed to give the clearest possible identity of known functionally relevant populations at the transcriptomic level. By producing this dataset, we reconcile known phenotypic populations with those found in the transcriptome and can identify previously unknown populations in high dimensional space, thus providing a basic map of peripheral blood B cells as a reference dataset for the B cell community. We describe 10 major populations of B cells, including a novel population of IgEexpressing cells in the double negative compartment.

Data Preprocessing, Clustering and Differential Expression
Data was processed through CellRanger (10X Genomics, v3.1.0) and aligned to the GRCh38 genome. The raw transcript count matrix was loaded into R (v3.6.3) using the Seurat (v3.0) package (30,31). Cells were selected for further analyses according to the following criteria: (i) express zero CD3E, GNLY, CD14, FCER1A, GCGR3A, LYZ, PPBP and CD8A transcripts, to exclude any non-B cells and; (ii) express at least 200 distinct genes. Additionally, cells with total transcript count in the top 1% percentile were removed, as these cells were manually inspected to express transcripts of multiple V gene families per cell, indicating possible cell clumps tagged with the same barcode. Gene counts were log-normalized and the top 2,000 variably expressed genes were extracted using a variance-stabilizing transformation (vst) as implemented in the FindVariableFeatures function in Seurat. The following genes were removed from the list of variably expressed genes in order to prevent downstream dimensionality reduction and clustering to reflect individual/clonotype specific gene usage: all Ig V, D, J genes (extracted using the regular expression [regex] "IG[HKL][VDJ]"), Ig constant genes (IGHM, IGHD, IGHE, IGHA [1][2], IGHG [1][2][3][4], IGKC, IGLC [1][2][3][4][5][6][7], and AC233755.1 [which encodes IGHV4-38-2]), IGLL genes, T-cell receptor genes (regex "TR[ABGD][CV]"). This pruned the variably expressed gene list to n = 1,840 genes. Principal Component Analysis (PCA) was then performed on this pruned gene list. Surveying the first 50 principal components, the proportion of variance explained plateaued at~1.5% from the 15 th PC onwards. Uniform Manifold Approximation and Projection (UMAP) was performed based on the first 14 PCs, using the implementation in the python umap-learn package with correlation as the distance measure in the PC space. UMAP projections were produced on both two-dimensional and threedimensional spaces. To define cell clusters, a shared nearest neighbor (SNN) graph was constructed using Seurat:: FindNeighbors based on the first 14 PCs, and cell clusters were defined on the SNN graph with Seurat::FindClusters (resolution parameter = 1). Clusters were named according to manual inspection for their composition in terms of the original FACSdefined populations. Differential expression was examined using the Wilcoxon rank-sum test provided in Seurat::FindMarkers. Analysis for the additional samples followed the same procedure, retaining principal components that explained at least 1.5% of the variance. We followed the data integration protocol in Seurat to map all samples to the same UMAP projection. The integrated dataset contains a total of 12,973 cells.

Trajectory and RNA Velocity Analysis
To study the trajectory across the Seurat-defined cell subsets, a spanning tree across the data points was inferred using the monocle3 package (v0.2.0), based on the 3D UMAP embedding produced as detailed above. To estimate RNA velocity, spliced and unspliced transcripts were enumerated using the velocyto package (v0.17) (32). RNA velocity stream was mapped onto the UMAP space according to the pipeline provided in the SeuratWrappers package (v0.1.0) which uses functionalities provided in the velocyto.R package (v0.6) (32).
Based on the inferred velocity stream we also scored the transition between cell clusters using a geometrical approach. We evaluated, for each arrow depicting the velocity stream, its alignment with the projection towards cluster centroids, using the following procedure (illustrated in Supplementary Figure  S2): first, for each arrow A i with starting point X i and ending point Y i , we determined the 'starting cluster' (i.e. the original cell identity from which transition is considered) by a majority vote of the identities of cells within the grid around X i as given when the velocity stream was overlaid. We next considered, for every 'destination' cell cluster, the distance between X i and its cluster centroid Z j . This permitted calculation of the angle q between and X i Z j , using the cosine formula: for each arrow I and each 'destination' cluster j. q should approach zero (and hence cosq = 1) if the arrow is perfectly aligned to the direction pointing from the starting cluster to the destination. This forms the basis to define a 'transition score', taking into effect of the strength of the velocity (represented by the size of the arrow projectedi.e. X i Y i ; it should positively contribute to the score) and the distance X i Z j (which should negatively impact the score). This score, here denoted S, is given by: For every I and j the (1−cosq) functional form ensures S i,j increases as the arrow becomes more geometrically align to point to cluster j. This was performed for every arrow I and every cluster j. To summarize scores from individual i's into a composite transition score (CTS) between every starting cluster j 1 and destination cluster j 2 , all i's are classified by their starting cluster and S i,j are summed together for each j. Finally the CTS, i.e. S i∈j 1 ,j=j 2 S i,j , are adjusted by a weighting factor based on the distance between cluster centroids d 1 (for cluster j 1 ) and d 2 (for j 2 ). We first computed the pairwise distance between d 1 and d 2 . Then, for each d 2 (which corresponds to the destination cluster), we scaled all distances into a weighting factor w of range [0,1]: for all distances z pointing towards d 2 . Again, this functional form implies heavier weight is given for clusters which are close to one another in the dimensionality-reduced space. The final CTS was scaled by this factor: For our datasets RNA velocity streams were considered on two perspectives to view the 3D UMAP space: (1) UMAP1 vs UMAP3 (2) UMAP2 vs UMAP3, as visually these two perspectives separate the clusters the best. All arrows visualized in these two perspectives contributed towards the final CTS. These were plotted in Figure 2F in log 2 scale to dampen extreme signals.

Protein-Protein Interaction Network Inference
The R package ACTIONet (v1.0) (33) was used to infer proteinprotein interaction networks (PPINs) for (i) each FACS-defined population; (ii) each cell cluster defined by Seurat, and; (iii) cells positive for T-bet (TBX21) transcripts. ACTIONet takes as input a reference protein interactome and uses the transcriptomic data to filter and score protein interactions relevant to a given cell subset.
Here, the Unified PPIN (UniPPIN), which we previously compiled and recently updated (April 2020; Marzuoli, Ng and Fraternali, manuscript in preparation), was used as the reference interactome. UniPPIN contains in total of 17,997 proteins and 647,508 interactions, after mapping UniProt identifiers to HGNC gene symbols using the biomaRt (v2.42.1) R package. To reduce dimensionality, genes expressed in less than 10 cells were omitted, leaving a total of n = 12,867 genes for PPIN inference. Raw gene counts were used to calculate a 'gene specificity' score using the compute.annotations.feature.specificity function in ACTIONet, for each cell subset as defined above. This step enhances signals from genes exhibiting expression patterns specific to distinct cell subpopulations and penalizes commonly-expressed genes. These gene specificities were subsequently used to infer a PPIN for each cell subset using ACTIONet::run.SCINET.annotations. Each node (i.e. protein) in the network is scored by a 'specificity' metric, which quantifies the centrality of the protein in the network scaled by its gene expression specificity. Since ribosomal and mitochondrial proteins frequently formed cliques (highly interconnected components) in biological networks, in order to mediate the impact of the inclusion of these proteins in analyzing the topological and functional properties of the inferred PPINs, all ribosomal and mitochondrial proteins were removed from all the inferred networks prior to any statistical analyses and functional annotations.

Functional Annotation
Gene sets from Gene Ontology (GO) biological processes (BP) were downloaded from MSigDB (v7.1). Cluster-specific markers were annotated for GO BPs, by testing for gene set overrepresentation using Fisher exact tests. Gene Set Enrichment Analysis (GSEA) was also carried out (using the fgsea package [v1.4.1]) on the ACTIONet-inferred PPINs, with the proteins ranked using the ACTIONet per-node specificity metric.

Data Visualization
Reads coverage was visualized using the Integrated Genomics Viewer (IGV, v2.8.2). All statistics and data visualization were performed under the R statistical computing environment (v3.6.3). Heatmaps of gene expression were produced using pheatmap (v1.0.12). Visualizations of PPINs were produced using visNetwork (v2.0.9). Volcano plots were produced using EnhancedVolcano (v1.4.0). Visualization of three-dimensional data embeddings were performed using the rgl package (v0.100.54). All other data visualizations were produced using functionalities provided in Seurat and the ggplot2 package (v3.3.0).  Figure 1A). Of the 10 clusters formed in the UMAP analysis in all but 1 case (DN1) over 75% of the cells matched with their original FACS sorted population ( Figure 1B); the disparity between phenotype and transcriptome was possibly due to impure sorting. Differential gene expression analysis highlights these discrete populations ( Figure 1C). Even though these cells were CD19-sorted only 34.4% percent of cells had CD19+ in the transcriptome, CD20 proving to be a much more reliable B cell marker in the transcriptome ( Figure 1D) (30). Similarly, CD27 mRNA was not as abundant as CD27 surface protein, particularly in the Cmem cells. Importantly, we have explicitly filtered any immunoglobulin-related genes prior to clustering to avoid biasing the clustering algorithms by light chain or class of antibody. Even so, we find that IGHE and IGHA2 cells can be differentiated by their transcriptome: DN1 being enriched for IGHA2 and DN4 for IGHE. After finding the unusual DN4 IgE population, further investigation leads us to postulate that it is due to a cat hair allergy, exposure 3 days prior to sampling. M-mem1 cells were distinguished from M-mem2 by higher IgD expression ( Figure 1D). Transitional and Naive cells are best distinguished from other B cells by expression of TCL1A, higher in transitional, present in Naive and absent in all other cells.

Relationships Between B Cell Subsets
We performed dimensionality reduction and pseudotime analyses to decipher the relationships between clusters and RNA velocity measurements were added to superimpose directionality information onto the trajectory (Figure 2). Four distinct branches of B cell clusters were seen in three dimensions ( Figure 2A); 1) a transitional and naïve branch, 2) a classical memory and DN1 branch, 3) an IgM and DN3/2 branch and 4) a separate branch for the IgE-high DN4s (Figures 2A, B). In branch 1 the clear direction of development was transitional to naïve to memory and markers such as CD38 and TCL1A are lost in the progression ( Figure 2C). The relationships between the different memory populations are more complex (Figures 2A, D,  E). We therefore developed a new method, by performing geometry-based calculations on the mapped velocity landscape (Supplementary Figure S2; also see Materials and Methods), to summarize the directionality, strength and position of individual RNA velocity streams between individual cell clusters. This produced a composite transition score (see Materials and Methods) that quantifies the developmental flow of cell clusters ( Figure 2F) which can be mapped conceptually onto the inferred trajectory ( Figure 2G). From this the M-mem2 population almost appears as a separate originating singularity with flow out of the IgM memory cluster, to DN3/2 and C-mem1, but with relatively little flow from Naïve into the M-mem clusters themselves ( Figures 2F, G). The unidirectional flow from DN3 to termination at DN2 is strong. We also see a very strong flow from DN1/4 towards C-mem1/2 terminating in C-mem2.
A UMAP projection considering only the DN population highlights the distinct and robust partitioning of these cells into 4 subpopulations ( Figure 3A). DN3 and DN2 display similar immunoglobulin class distribution, although DN3 expressed IgM which is absent in DN2. DN1 is IgA2-rich, while DN4 is IgE-rich ( Figure 3B).
As a unifying marker both DN3 and DN2 have higher expression of RHOB (Ras Homolog Family Member B, involved in intracellular protein trafficking) compared to DN1 and DN4 ( Figure 3C  related B cells (34). This population is also enriched in the inhibitory receptors FCRL3, FCRL5 and lacks lymphotoxin-B (LTB), CD24 and CXCR5 ( Figure 3D). These are features in common with a previously identified precursor population for extrafollicular antibody secreting cells (ASCs) (16). T-bet has been shown to mediate IFNg-dependent (but not IL4 dependent) development of antibody secreting cells from B cells (35). With this in mind we looked at all cells that were T-bet positive and showed that the large majority were in the DN population in the order DN2>DN3>DN1>DN4 ( Figure 3E). Other transcriptional markers for the T-bet+ cells include ACTB, MPP6, (Figures 3F,  G) and ALOX5AP, GSTP1, LAPTM5 ( Figure 4A) The DN1 population expresses high levels of IGHA2 and JCHAIN (Figure 4). Upregulation of Transgelin-2 (TAGLN2) expression suggests an activated state (36). Inferred proteinprotein interaction network (PPIN) analysis of the DN1 population shows genes enriched in proteins involved in actin filament formation and organization ( Figure 4C). This population falls on a separate differentiation pathway from DN3 and DN2 and more closely resembles classical Memory B cells in differential gene expression ( Figures 1C, 2G).
The IgE-expressing DN4 population has very high levels of FCER2 expression, the low affinity receptor for IgE, as well as IL4R, IL13RA1, and the co-stimulator CD40 (Figures 4D, E); all forming part of the same activation network ( Figure 4F). This IgE population is therefore in an active state and suggests the donor was undergoing an allergic response.
Traditionally human memory B cells were known as CD27+IgDand we have designated these "classical memory". The transcriptomic clustering reveals two different subpopulations in our data: C-mem1 and C-mem2. The main differences between the two are in expression of mitochondrial and ribosomal genes   Figure S5A). This might reflect a low overall gene expression by C-mem1, meaning ribosomal genes are overrepresented (Supplementary Figure S5A). Coupled with their relative positioning in the pseudotime/RNA velocity analysis ( Figure 2G) this could indicate a more quiescent population awaiting activation for C-mem1, while C-mem2 are more active with higher active mitochondrial gene expression. Comparison of Cmem with DN indicates that DN cells have higher levels of CD20 (MS4A1) and HLA genes ( Figure 5A). Analysis of individual genes and inferred PPINs ( Figure 5B; Supplementary Figure S5B) shows genes predominantly involved in protein translation are expressed by the C-mem1, while actin networks involved in movement, division and immune synapse formation, form the top networks for C-mem2. IgM memory cells display a number of genes of interest ( Figure  6A) including CD44 involved in cell to cell interactions and activation (37,38) along with MARCKS involved in actin cross linking (39), TCF4 (encoding E2-2) which recognized the E-box binding site originally identified in immunoglobulin enhancement (40) and SMARCB1 which relieves repressive chromatin structure (41). Such a signature of expression would suggest an active group of cells involved in cell-to-cell interaction. We also highlight very strong expression of CD1C, a marginal zone B cell marker (42), in a small but distinct number of IgM memory cells.
The two IgM memory subpopulations are predominately defined by the quantity of transcript rather than expression of any one set of genes, M-mem2 containing far fewer transcripts. Taken together with the position of these two subsets relative to one another in the pseudotime and velocity analysis, these suggest an activation pathway from a quiescent population M-mem2 to the active M-mem1 population. Most of the gene markers shown in Figure 6A are expressed in both the M-mem1 and M-mem2 clusters (Supplementary Figure S5) except CD1C and AP3B1 which are uniquely expressed in M-mem1 cells ( Figure 6C). Both genes are involved in presentation of lipid antigens (43). Very few other genes were differentially expressed between M-mem1 and M-mem2 ( Figure 6D). The inferred protein interaction networks were more informative, showing translational housekeeping networks dominating the M-mem2 population while M-mem1 networks include both cell adhesion and structural networks suggestive of cell binding ( Figure 6E).
In order to confirm these B cell populations PBMCs from two additional people were also processed. We projected the transcriptomes of cells from all three samples to the same UMAP dimensionality-reduced space, and demonstrated that all the same cell populations exist in all individuals ( Figure 7A); albeit in some cases at different ratios, as is the case with a lower proportion of DN3 in the two additional individuals ( Figure 7B). Additionally, we find a greater overlap between DN1/4 and classical memory highlighting the transcriptomic similarity between these two class switched populations. We also confirmed that cells clustered in the UMAP space show similar expression patterns of marker genes (Supplementary Figure S6) across three individuals.

DISCUSSION
The separation of B cells into functionally different populations is useful in trying to understand immune responses to challenge. Typically, this has been done using surface markers to separate cell types and then ascribing functions to the subsets. However, averaging effects over a heterogeneous population can quite often mask important information. The advent of single-cell technologies enables more precise differentiation between cell types and more understanding of their possible functions based on the genes transcribed. There is a wealth of immunological data based on phenotypic subset separation and maximization of insight to be gained from all available information requires the unification of phenotype and transcriptome techniques. Here we used scRNAseq on the major peripheral B cell populations of an individual, FACS sorted by IgD, CD27 and CD10 (Transitional, Naive, IgM Memory, Classical Memory, Double Negative), to consolidate the phenotype with the transcriptome. Building upon conventional clustering, pseudotime and RNA velocity analyses, we devised a geometry-inspired method to summarize information from individual RNA velocity streams, and derived quantifications for the transitions between cell clusters (Figures 2F, G). RNA velocity analysis has been widely developed and applied since its conception (32,44), providing insights into the transcriptional dynamics of single cells. Individual velocity streams could be difficult to interpret, especially in systems where multi-way transitions amongst several cell clusters are possible. It is worth noting that as a result of phenotypic characterization, in order to ensure the clearest transcriptomes of functionally relevant phenotypes, intermediates between states might have been missed. Here we devised a method to build upon these individual velocity streams and provide a summary score, which allowed us to consider the dynamical relationships between all the discovered cell clusters in an unbiased manner ( Figure 2G). (F) Inferred protein-protein interaction network of IL4R and its direct neighbors. The expression levels of these genes in the DN4 cluster were mapped onto the nodes of the network.
We have found that the DN population (IgD-CD27-) clusters into four different sub-populations. Our DN1/4 populations both have increased levels of switched Ig genes and CD24 expression. Sporadic expression of CD21 and CD11c in the DN1/4 makes them hard to identify, however, CD24 and switched Ig expressions suggests they are 'precursor memory' or 'switched memory' cells (16,45). This is highlighted by the RNA velocity data which displays high flow between our precursor memory (DN1/4) and classical memory populations whose development is closely linked (46). The 'closeness' of this relationship and flow between populations is further highlighted by HB34 and HB78 and the proximity of DN1/4 cells to the classical memory cells; particularly when there is only partial coverage in the transcriptome of the main marker (CD27) for differentiating the two populations.
Our DN2 population differentially expressed more TBX21 and FCRL5 but no CD24, matching the descriptions of Agerelated B cells and extrafollicular ASCs (8,14,16,34). The DN3 population shares no previously described gene expression profile, but the RNA velocity data shows DN3 cells flow predominantly into DN2 cells. It is possible the DN3 population are activated naïve (aNAV) cells which develop into precursor ASCs, but the only marker we find for the aNAV population is high IgM expression with the other major markers (e.g. increased TRL7, CD11C and decreased CXCR5, CD21) lacking sufficient coverage in the transcriptome to make a definitive conclusion. Additionally, our velocity flow from IgM memory to DN3/2 suggests an alternative activation pathway distinct from the activated naïve (aNAV) cell pathway previously described in SLE patients (8).
Together the transcriptome, UMAP analysis and RNA trajectories, all suggest that DN1/4 and DN2/3 are separate DN memory compartments that contribute to two distinctly different branches of development with hardly any crossover between them ( Figures 2F, G). T-independent B cells responses are known to expand the IgM memory cell population and plasma cell lineage commitment (47)(48)(49)(50) which would appear, on the evidence provided here, to be occurring through an IgM memory/DN3/DN2 pathway. We therefore suggest that these two pathways represent the T-independent (DN2/3) and Tdependent (DN1/4) B cell developmental pathways. The end points of both these pathways, DN2 and C-mem2, have RNA velocity arrows continuing to point outwards which may indicate further development, for example into ASCs. IgM memory cells display several genes involved in actin regulation and cell to cell interactions that might suggest formation of B cell synapses. Expression of the gene MARCKS has recently been found to increase lateral motility of the BCR by modulating the cytoskeleton (39), CD44 is a known lymphocyte activator (37,38), TCF4 (E2-2) a developmental transcription factor which regulates the IgG enhancer (24,40), and SMARCB1 a chromatin re-modeler that relives repressive structures (51). The presence of a distinct number of IgM cells strongly expressing CD1C, a marginal zone B cell marker (52), is supportive of the idea that these are circulating marginal zone cells following a T-independent pathway of differentiation (42,53,54); this separation of IgM memory into two clusters is clearly visible in all three patients. In contrast to other clusters, which have limitations as to their interactions with other groups, the RNA velocity data seems to suggest that cells from the Mmem2 cluster could give rise to cells in many different groups in all branches of the model ( Figure 2F).
It is worth noting that the DN1 population is enriched in IgA, particularly genes encoding IgA2 and J chain. This, despite exclusion of immunoglobulin related genes from the clustering, indicates that IgA2-expressing cells are functionally different from the IgG-expressing cells ( Figure 3B). In a similar fashion, we also found an IgE-expressing sub-population of DN cells, DN4. This population is over represented in the data due to our enrichment of the DN subset by FACS sorting and, in reality, represents approximately 0.5% of total B cells in this volunteer; DN making up 1.46% of this patients B cells (Supplementary Figure S1) and 34.21% of cells in the DN clusters are DN4. Even this level is unusually high and is assumed to be the result of exposure to cat hair 3 days prior to sampling although no symptoms were present at blood draw. CD27-IgE+ cells have been seen previously (55), alongside CD27+IgE+ cells, but in these data DN4 is the only cluster to show IgE expression ( Figures 1C, 3B). The study of IgE-expressing cells is usually hampered by their scarcity, the poor polyadenylation of IgE transcripts and the difficulties in distinguishing IgE-expressing cells from those binding to IgE via FcϵRII (56,57). Single cell transcriptomics can circumvent the antibody staining issues and shows here that this population seems to be activated, given the evident RNA velocity stream into and out of DN4 from other clusters ( Figure 2G). The presence of CD24 transcripts and RNA velocity flow into the classical memory population suggests these are memory precursors, although this is difficult to reconcile with the lack of IgE in the classical population. The tendency for some DN4 velocity arrows to point away from the UMAP plot might indicate that these cells are precursor to ASCs, as we have suggested for DN2 and C-mem2. We also briefly note that transitional and naive cell populations are transcriptionally homogenous. The T1, T2 and T3 transitional cells that have previously been distinguished by gradations of surface markers such as CD10, or CD24+CD38+, are not clustered separately in this data; potentially the result of read depth issues inherent in single-cell approaches and therefore requiring surface marker annotation (e.g. CITE-seq). RNA velocity highlights the highly directional development of transitional cells into naive cells, but only indicates small velocities out of the naive pool into memory. This may look different in an immune challenged volunteer where one would expect active differentiation between naïve and memory. However, the naïve to memory differentiation processes may occur in tissue and not be apparent in the blood.
The occasional disparities between mRNA and protein level gene expression means that markers in the transcriptome do not always translate into the proteome and vice versa. Our atlas of peripheral blood B cells can be used as a tool to identify the same cell subsets in other single-cell transcriptome datasets, using referencedbased bioinformatic approaches. It can also be used to identify new markers for tractable methods of B cell sorting (Supplementary Figure S6). It is worth noting that transcriptome data can suffer from lack of sequencing depth, so previously well-known phenotypic markers may not always be reliable at a single cell level. For example, CD27 suffers from incomplete transcript coverage. When clustering data from B cells we strongly suggest removal of variable genes and immunoglobulin constant regions as these tend to skew UMAP results. For example, the kappa lambda light chain distinction between cells is clear and the resulting skew can mask important transcriptomic differences between other, more functional, distinctions. We also note that in this dataset we only find TBX21+ cells in the DN population and not in the naive compartment as with SLE (8,9,58). Additionally, we did not find a population of FCRL4+ DN cells typical of HIV patients (59), highlighting the uniqueness of these populations to particular immune disorders.
In summary, we show branching pathways of B cell development that appear to separate into T dependent and Tindependent pathways. The subset of B cells previously known as "Double negative" contains a variety of different cells belonging to both pathways. DN3 and DN2 being in the T-independent pathway closely related to IgM Memory cells, and DN1 and DN4 being more closely related to classical memory cells. In addition, the serendipitous use of an allergic individual after allergen contact has helped to show an IgE-expressing B cell subset within the DN family (DN4), but clustering on its own separate branch of the RNA velocity pathway. That said, it seems to have strong links to the T-dependent development pathway containing classical memory cells and the DN1 cluster rather than the IgM memory/DN3/DN2 development pathway. The data we have collated can be used to map these B cell subsets in other single cell transcriptomic data.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://www.ebi.ac. uk/arrayexpress/, E-MTAB-9544.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by University of Surrey Ethics Committee, UEC/2017/ 052/FHMS. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
AS, FF, and DD-W devised and designed the experiment. AS carried out the experiment and data acquisition, with FACS sorting performed by GW and VT. JN and AS performed bioinformatics analysis. All authors contributed to the article and approved the submitted version.