Identification of B cell subsets based on antigen receptor sequences using deep learning

B cell receptors (BCRs) denote antigen specificity, while corresponding cell subsets indicate B cell functionality. Since each B cell uniquely encodes this combination, physical isolation and subsequent processing of individual B cells become indispensable to identify both attributes. However, this approach accompanies high costs and inevitable information loss, hindering high-throughput investigation of B cell populations. Here, we present BCR-SORT, a deep learning model that predicts cell subsets from their corresponding BCR sequences by leveraging B cell activation and maturation signatures encoded within BCR sequences. Subsequently, BCR-SORT is demonstrated to improve reconstruction of BCR phylogenetic trees, and reproduce results consistent with those verified using physical isolation-based methods or prior knowledge. Notably, when applied to BCR sequences from COVID-19 vaccine recipients, it revealed inter-individual heterogeneity of evolutionary trajectories towards Omicron-binding memory B cells. Overall, BCR-SORT offers great potential to improve our understanding of B cell responses.


Introduction
From infectious diseases to various types of cancers and autoimmune diseases, B cells play a crucial role in establishing antibody-based protective immunity (1)(2)(3)(4)(5)(6)(7).Within the highly heterogeneous B cell populations, individual B cells play distinct roles, each expressing a unique combination of cell subset and antigen receptor (8)(9)(10).The cell subset serves as a primary criterion for the functionality of B cells, on the other hand, the BCR is an antibody molecule uniquely expressed by each B cell to recognize its cognate antigen.Given that the BCR and the corresponding cell subset represent two different modalities of the B cell, simultaneous interrogation of both is essential for comprehensive understanding about B cells.
Before antigenic exposure, the BCRs expressed by naïve B cells establish the foundation for preexisting immunity, preparing the initiation of the B cell response (11,12).When faced with an immunological challenge, B cells undergo a maturation process to optimize immunological role of the B cells by developing their cell subsets and BCR sequences simultaneously.In detail, naïve B cells differentiate into either memory B cells or antibody-secreting cells (ASCs), while at the same time, the BCR sequences also mature through somatic hypermutations (SHMs) and class switching (13).Notably, this bifurcation of maturation trajectories carries distinct implications.BCRs expressed by memory B cells epitomize past antigenic encounters, thus widely utilized in identifying immunological memories engraved in vaccine recipients or convalescent patients.For example, the human antibody response against SARS-CoV-2 was revealed by the discovery of memory B cells expressing neutralizing BCRs among convalescent COVID-19 patients (14)(15)(16)(17)(18)(19).On the other hand, BCRs expressed by ASCs reflect the current antigenic challenges, thus widely utilized in circumstances under explicit exposure to antigens, such as autoimmune disorders or ongoing severe infections.As an instance, potential autoreactive BCRs and their maturations through continuous exposure to autoantigens were analyzed by examining ASCs persistent in autoimmune disease patients (20).Collectively, coupling of antigen receptor and its originating cell subset resolves the role of B cells while unveiling the maturation trajectories.
Various methods have been utilized to investigate B cell subset and antigen receptor.However, approaches combining B cell subset and antigen receptor information of individual B cells while addressing the high diversity of B cell populations are lacking.For example, fluorescence-activated cell sorting (FACS) (21-25) or single B cell RNA sequencing (scRNA-seq) (4-7, 9) was utilized to identify B cell subset and antigen receptor simultaneously.However, both methods require additional complex machinery and experimental procedures for the physical isolation and processing of numerous bulk B cells in a single-cell resolution, which is costly and susceptible to information loss (26).In detail, compartmentalization of multiple cell subsets using FACS necessitates consecutive gating rounds and the preparation of separate libraries for sequencing.Therefore, in most cases, practical use is constrained to specific cell subsets of interest while excluding the other cell subsets (27).This results in a loss of information regarding the transition landscape between cell subsets, which is inherent in most B cell maturation processes.On the other hand, scRNA-seq couldn't cover the vast diversity of the entire B cell population owing to prohibitive expenses, thus resulting in severe undersampling of information.In practice, scRNA-seq covers B cell diversity that is more than one order of magnitude lower compared to bulk BCR high-throughput sequencing (HTS), thereby erasing the maturational relationships between B cells.
Prediction of cell subsets directly from the sequence of corresponding antigen receptors could provide high throughput information of antigen receptors combined with their cell types.In previous studies, SHMs induced in the IGHV region were quantified and B cells exhibiting low SHMs and un-switched isotypes (IgM/IgD) were categorized as naïve B cells, while the rest were classified as antigen-experienced B cells (either memory B cells or ASCs) (28-30).Yet, this heuristic approach was unable to distinguish between memory B cells and ASCs.Considering that the complementarity-determining region 3 of heavy chain (HCDR3) region undergoes the most frequent SHMs, and that B cell activation and maturation are driven by the binding of its BCR to a cognate antigen, there is a chance that HCDR3 harbors pivotal information for predicting cell types.However, no method has yet been proposed to exploit the potential of HCDR3 information.Recently, a machine learning approach was proposed, which harnessed HCDR3 information for cell subset prediction (23).However, fixed vector embeddings that represent physicochemical properties of amino acids were inadequate for HCDR3 sequence representations, resulting in a performance suboptimal to replace FACS or scRNA-seq.Thus, a novel method to utilize HCDR3 as a valuable source of information for cell subset prediction is essential.
To this end, we proposed BCR-SORT, a deep learning model that predicts the originating B cell subset using the HCDR3 sequence of a given BCR.Unlike previous approaches, BCR-SORT established a direct link between the BCR and its originating cell subset by deciphering the inherent B cell maturation features encoded within the HCDR3 sequence.Exploiting these features, BCR-SORT offered a scalable and costeffective method to accurately couple antigen receptors with their cell subsets, especially when used in conjunction with HTS of the BCR repertoire (Figure 1A).Through benchmark tests against FACS and scRNA-seq, we validated the applicability of BCR-SORT on datasets obtained from diverse immunological conditions.In addition, we demonstrated that BCR-SORT enabled cell subset-aware reconstruction of BCR lineage, which rearranged the original evolutionary scenario of the lineage to follow the biological process of B cell differentiation (Figure 1B).Finally, BCR-SORT was applied to various unlabeled datasets from autoimmune diseases and vaccinations.Of note, BCR-SORT revealed treatment-resistant ASC populations from autoimmune disease patients, and at the same time, revealed the maturation trajectory towards Omicron-binding memory B cells induced by the triple vaccinations of wild-type SARS-CoV-2 virus.

Model architecture and training
BCR-SORT was constructed to process HCDR3 sequence, IGHV gene, IGHJ gene, and isotype as inputs and conducted multi-class classification.HCDR3 amino acid sequences were converted to numeric feature vectors by first encoding them using learnable embedding vectors and then processing them using 2layer bidirectional Long Short-Term Memory (LSTM) followed by 3-layers of 1-dimensional Convolutional Neural Network (1D-CNN).Other input attributes were also converted to embedding vectors and concatenated with the sequence feature vector.The concatenated feature vectors were fed into a 3-layer multi-layer perceptron (MLP) to produce output vector.For better generalizability, we randomly masked a single amino acid from each HCDR3 sequence and created an auxiliary task to predict the masked amino acid using the concatenated feature vector.Total loss was calculated by the sum of cross entropy loss for the original classification task and the auxiliary task after scaling the auxiliary loss to 0.05.The model was trained using a learning rate of 10 -4 and a batch size of 1024.The accuracy of the model was assessed by 5fold cross validation.PyTorch (v1.12.0) was used for the implementation of LSTM, CNN, and MLP.

Dataset preparation
We used the BCR-B cell subset coupled data archived in the Observed Antibody Space (31) (OAS) database to train BCR-SORT.We confirmed that each dataset was constructed while satisfying the following criteria: naïve B cell (CD19+CD27-), memory B cell (CD19+CD27+CD38-), and ASC (CD19+CD27+CD38+).Memory B cell data were further obtained from Mitsunaga et al. (25) for the sake of balancing with other labels.To identify HCDR3 sequence, IGHV, IGHJ, and isotype from each BCR sequence, sequence annotation results provided by OAS were utilized as input, whereas the data obtained from Mitsunaga et al. were annotated in-house using BLAST (32) and IgBLAST (33).HCDR3 sequences with length between 8 and 25 were used.For each B cell subset, the isotype proportion was balanced to reflect the actual proportion in the body, following a previous study (24) (Supplementary Table 1).Identical BCR sequences duplicated in multiple B cell subsets were discarded.In total, 2.65 million sequences were collected and split into training set (98%), validation set (1%) and test set (1%).
To evaluate the performance of the state-of-the-art model proposed in previous work (23), additional BCR sequence features (the number of mutations, the physicochemical properties of junction sequences, and the replacement/silent mutation ratios) were calculated in R using shazam (34) (v1.1.2) and alakazam (34) package (v1.2.1).

Integrated gradients
IG is a model interpretation method that calculates the cumulative sum of gradients along a path from a baseline value to the input value to measure the influence score of each input feature.Each input consisted of 25 amino acid sequence features (including padding), 1 isotype feature, 1 IGHV gene usage feature, and 1 IGHJ gene usage feature, and the IG value was calculated with regard to the baseline feature.We used the zero vector as a baseline feature, which was also an embedding of the padding token.The IG value was calculated as follows.
where F denotes the BCR-SORT model, V denotes an input feature, V ij k denotes k-th value of the i-th input and j-th feature, Ṽ denotes a baseline feature, p denotes an interpolation step, and m denotes the number of total interpolation steps.As the IG value was calculated for each vector element, the final IG value for each input feature was computed as the average of the IG vector elements.To focus on features with high influence on the output, input features with top 10% IG values (high-IG) were selected for further analysis.Integrated gradients function in the captum (35) (v0.5.0) module was used to calculate the IG values.

Parapred for paratope prediction
Parapred is a sequence-based paratope prediction model that utilizes a bidirectional LSTM and CNN architecture to leverage local and neighboring residue features (36).By applying the pretrained Parapred model to HCDR3 sequences, the probabilities of each residue corresponding to a paratope were obtained.Each amino acid residue was predicted as a paratope if the obtained probability was higher than 0.488, which was the threshold used in the original paper (36).

In silico saturation mutagenesis
Using high-IG HCDR3 features, we applied in silico saturation mutagenesis by replacing the amino acid of each high-IG residue with 19 different amino acids, while retaining the other input features (isotype, IGHV gene usage, and IGHJ gene usage).To mimic the mechanism of SHM, sequence length was identically controlled and a single substitution of amino acid was mutated, since indels are rare in SHM (37).Subsequently, we predicted cell subsets of the mutated sequences using BCR-SORT and verified whether the applied mutation altered the prediction.

Validation of in silico cell subset alteration on human repertoire
In silico mutations driving the alteration of B cell subset prediction was validated in in vivo circumstances using a human repertoire dataset.To reflect the actual in vivo alteration of B cell subsets, only human repertoires comprising the entire cell subsets (naive, memory, and ASC) in the identical dataset were used.We verified whether the original sequence and the mutated sequence pairs analyzed in in silico experiments were identifiable in the human repertoire as identical cell subsets.Each mutation was grouped based on the types of previous amino acids and the location of the mutation.Subsequently, the number of cell subset alterations for each mutation group was quantified based on the number of BCR sequences that induced a cell subset alteration via a single mutation within that mutation group.Finally, the correlation between mutation groups observed in in vivo alterations and in silico alterations was analyzed.

Benchmarking FACS and scRNA-seq
External validation datasets for benchmarking FACS and scRNA-seq were prepared by manually downloading, and BCR sequences were annotated following the same manner with training set (Supplementary Table 2).In case of benchmarking FACS, BCR sequencing data acquired after sorting the single biological sample into three different B cell subsets using FACS was collected to evaluate the model in real-world settings.B cell subset sorted by the phenotypic criteria identical to the training set was utilized.In case of benchmarking scRNA-seq, each paper independently identified B cell subsets from their gene expression data, while we followed the annotation of B cell subset provided in each paper.In case of benchmarking FACS, the validation datasets were constructed per individual, while those from multiple individuals were pooled when benchmarking scRNA-seq due to data scarcity issue.

Transfer learning
Trained BCR-SORT was fine-tuned into a disease-specific model or a chronological model using external datasets, which were prepared for benchmarking FACS and scRNA-seq.For finetuning, the model was initialized with the pre-trained weights and then further trained on the same task to achieve finely adjusted model.Corresponding to each type of disease, data from a single individual were provided for fine-tuning, and the accuracy on datasets from other individuals was measured to assess the model's performance in unseen situations with data-scarcity.As data from multiple individuals were pooled in case of scRNA-seq, data from a single paper were used for fine-tuning and the accuracy on datasets from other papers with same diseases was measured.For chronological transfer, datasets containing multiple time points from identical individuals were selected.Similarly, data from a single time point were provided for fine-tuning, and the accuracy on datasets from different time points was measured for each individual.All parameters were updated during fine-tuning without freezing specific layer to optimize the model (38), and auxiliary task was not provided during fine-tuning to prevent overfitting on limited dataset.A learning rate of 0.1 times lower than the original learning rate was used to delicately adjust the model's weights, ensuring that the pre-existing knowledge was not abruptly overridden by new data.

Reconstruction of BCR lineage using phylogenetic analysis
Full-length nucleotide sequences in the BCR repertoire were processed for phylogenetic analysis following the workflow of Change-O (34).Briefly, BCR sequences sharing identical IGHV/ IGHJ genes were assigned to a lineage when the similarity between junction sequences exceeded 85%.For each lineage, the virtual germline sequence was inferred after masking the IGHD segment due to its unreliable alignment.The phylogenetic tree of the BCR lineage was constructed using the BuildTrees method in IgPhyML (39,40).To consider lineages undergone active maturations, those containing 10 or more BCR sequences and isotype switching were taken into consideration.Lineages containing more than 200 BCR sequences were down-sampled to prevent excessive computation time required when constructing the phylogenetic tree (41).The phylogenetic tree was analyzed and visualized in R using dowser (42, 43) (v1.1.0),and ggtree (44) (v3.6.2) package.

Rerooting of BCR phylogenetic tree using BCR-SORT
Given a phylogenetic tree, we predicted cell subsets of constituent BCRs using BCR-SORT and examined whether antigen-experienced B cell (memory B cell or ASC) preceded antigen-unexperienced naïve B cell along the tree.If this was the case, we selected a naïve B cell as an alternative root of the tree and rearranged the tree ("reroot") based on the new root.When multiple naïve B cells were identified, we employed the presence of somatic hypermutations (SHMs), even at a low level, as a criterion and selected the least mutated naïve B cell as the new root, considering that naïve B cells can exhibit a small number of SHMs (45)(46)(47).If multiple naïve B cells were identified as the least mutated, we randomly sampled one of them to designate as the new root.IgPhyML was utilized to construct the rerooted phylogenetic tree, after removing the existing virtual germline and setting the new root of the tree as decided.In case of longitudinal studies providing lineages comprised of BCRs across multiple time points, the root was determined among BCRs obtained from the earliest time point in the lineage to reflect chronological development simultaneously.

Blood sample collection
Peripheral blood samples were obtained at three time points from one patient with pemphigus vulgaris (PV)at diagnosis, one month after the second dose of rituximab (RTX), and at relapseand from one patient with myasthenia gravis (MG)before RTX, 1 week after the first dose of RTX, and three months after RTX.The study involving human participant was approved by the Institutional Review Board of Gangnam Severance Hospital (IRB No. 3-2019-0191, PV) and by the Institutional Review Board of Severance Hospital, Yonsei University Health System (IRB No. 4-2023-1059, MG).Peripheral blood mononuclear cells (PBMCs) were obtained from blood samples of the patients using Ficoll-Paque (GE Healthcare).Total RNA was isolated using TRIzol Reagent (Invitrogen) according to the manufacturer's protocols.

Library preparation and nextgeneration sequencing
Genes encoding the variable domain (VH) and part of the first constant domain of the heavy chain (CH1) were amplified, using specific primers, as described previously (11).Briefly, total RNA was used as a template to synthesize complementary DNA (cDNA), using the SuperScript IV First-Strand Synthesis System (Invitrogen), with specific primers targeting the CH1 domain of each isotype (IgM, IgD, IgG, IgA, and IgE), according to the manufacturer's instructions.After cDNA synthesis, SPRI beads (Beckman Coulter, AMPure XP) were used to purify cDNA.The purified cDNA was subjected to second-strand synthesis using V gene-specific primers and KAPA Biosystems kit (Roche, KAPA HiFi HotStart PCR Kit).The PCR conditions for sample collected from the PV patient at one month after the second dose of RTX was as follows: 95°C for 3min; 4 cycles of 98°C for 30s, 60°C for 45s, 72°C for 1 min; and 72°C for 5 min.The PCR conditions for other samples were as follows: 95°C for 3min, 1 cycle of 98°C for 30s, 60°C for 45s, 72°C for 1 min, and 72°C for 5 min.After second-strand synthesis, double-stranded DNA was purified using SPRI beads.VH-CH1 genes were amplified using purified dsDNA using primers containing indexing sequences and a KAPA Biosystems kit.The PCR conditions were as follows: 95°C for 5 min; 25 cycles of 98°C for 30 s, 60°C for 30 s, and 72°C for 1 min; and 72°C for 5 min.PCR products were subjected to electrophoresis on a 1.5% agarose gel and purified using a QIAquick gel extraction kit (QIAGEN) according to the manufacturer's instructions.The gelpurified PCR products were purified again using SPRI beads.SPRIpurified libraries were quantified with a 4200 TapeStation System (Agilent Technologies) using a D1000 ScreenTape assay and subjected to next-generation sequencing on the Illumina NovaSeq platform.

NGS data processing
NGS data were processed following the procedures described previously (11).Briefly, NGS reads were merged by PEAR (48) (v0.9.10), and those with 95% of the bases having Phred scores higher than 20 were left for use.VH-CH1 primers and unique molecular identifier (UMI) sequences were recognized from each read, and the reads were clustered according to the UMI sequences.The clustered reads were aligned using Clustal Omega (49, 50) (v1.2.4), and a consensus sequence was extracted within each UMI cluster based on the most dominant base pairs at each position.Isotype and V(D)J regions were annotated in-house for each consensus read, as described above.

Persistent clone
Within the BCR repertoire from the PV patient, a BCR clone was defined as a group of BCR sequences sharing identical IGHV/ IGHJ gene and HCDR3 amino acid sequence, following the definition of previous work (20).Clones present from pre-RTX to relapse were defined as persistent clones.Those comprised solely of IgM or IgD BCRs were excluded to focus on B cells eliciting antigenspecific responses.

Identification of SARS-CoV-2-specific BCR lineage
BCR repertoire data provided by Park et al. (51) were processed to reconstruct the lineage, as described above.SARS-CoV-2-specific BCR sequences in the repertoires were identified using verified binder sequences archived in CoV-AbDab (52).Omicron-specific BCR sequences and the effective concentration (EC50) of BCRs within the Omicron binder lineages were identified from the data provided in the original paper.BCR sequences in the repertoire were labeled as binders based on the HCDR3 sequence by allowing a single amino acid mismatch.
To examine the relationship between consecutive vaccinations and Omicron immunity, lineages containing the verified Omicron binders after the 3 rd dose of vaccine were selected.To focus on vaccine antigen-driven responses, one naïve B cell lineage was discarded and eight different lineages were identified as a result.Among those eight lineages, memory B cell lineages (n=5) were subjected to phylogenetic analysis, considering the significance of memory B cells on driving Omicron immunity.

Prediction of B cell subsets using BCR-SORT
We assumed that the BCR sequence, especially the HCDR3 sequence, possesses learnable features since activation and maturation of B cells are driven by the binding of their BCR with cognate antigens.Utilizing inputs comprised of HCDR3 amino acid sequence, IGHV/IGHJ gene usage, and isotype information, BCR-SORT predicted the most probable originating B cell subset among naïve B cell, memory B cell, and ASC, as those three cell subsets comprise the majority of the B cell population (53) (Figure 1C).BCR-SORT was trained on 2.6 million BCR sequences obtained from individuals with heterogeneous immunological conditions (Supplementary Table 3).By leveraging the HCDR3 sequence features extracted from LSTM layers followed by 1D-CNN layers, BCR-SORT predicted corresponding cell subsets with high accuracy, outperforming the existing state-of-the-art method based on handcrafted features (23) (Figure 1D).Although those handcrafted features were created using the entire variable regions of BCRs, BCR-SORT exhibited superior performance using only the HCDR3 sequences, indicating that HCDR3 inherently contains plenty of information to represent cell subsets.
In addition to the HCDR3 sequence, other input attributes also contributed to accurate prediction, with the isotype and IGHV/ IGHJ gene following the HCDR3 sequence in terms of importance (Supplementary Table 4).Further, the performance of BCR-SORT was assessed over the entire set of cell subsets, isotypes, and HCDR3 lengths (Figures 1E-G).Interestingly, an increase in input sequence length was observed to enhance accuracy, further corroborating the benefits of HCDR3 information for prediction.Meanwhile, BCR sequences embedded by the BCR-SORT exhibited trajectory-like patterns spanning the entire B cell subsets, resembling the biological course of B cell differentiation as well as the actual cell sorting process (Figure 1H).

Interpretation of BCR-SORT
Learning the relationship between BCRs and their corresponding cell subsets implies understanding the BCR sequence features relevant to B cell activation and differentiation.To ascertain whether BCR-SORT generated outputs by truly utilizing these features, we probed the relationships between input features and model predictions using Integrated Gradients (IG) (54).Through this method, we calculated the influence of each input feature on model output, and to focus on input features with high influence on the output, those with top 10% IG values (high-IG) were selected for further analysis.
Depending on B cell subsets, input features contributed to the prediction results to different extents (Figure 2A).Isotype features exhibited substantial influence when predicting naïve B cells as their isotypes are dominated by IgM and IgD in nature (25).Nevertheless, high-IG features corresponded predominantly to the HCDR3 sequence throughout the entire B cell subsets.Notably, the influence of HCDR3 was more pronounced in antigen-experienced cells (memory B cells and ASCs) than antigen-unexperienced naïve B cells, as BCR-SORT was found to be capable of utilizing B cell activation and maturation signatures encoded in the HCDR3 sequence.
B cell activation and maturation are associated with BCRantigen binding and the accumulation of SHMs.In accordance with this, we discovered that high-IG values were assigned to HCDR3 residues related to BCR-antigen binding and SHM events.By identifying HCDR3 residues contributing to antigen binding (paratope) using Parapred (36), we observed that the proportion of paratope was significantly higher for high-IG residues (Figure 2B).This tendency was notable in antigenexperienced cells, indicating that BCR-SORT considered the antigen binding capability of BCRs during prediction.In addition, high-IG values were assigned to the middle part of the HCDR3, which encoded higher diversity owing to frequent SHMs compared to the amino-terminal and the carboxyl-terminal parts of the HCDR3 (55) (torso of HCDR3), during the prediction of antigenexperienced cells (Figure 2C, Supplementary Figure S1).In sum, the aforementioned evidences indicated that the activation and maturation signatures encoded in BCR sequences were exploited by BCR-SORT.Finally, we assessed the impact of high-IG residues on cell subset prediction.This was achieved by conducting in silico saturation mutagenesis on high-IG residues and identifying alterations in cell subset prediction (56).This experiment efficiently covered all possible mutation scenarios, while mimicking the mechanism of SHM (Figure 2D).Consequently, in silico mutations of a single high-IG residue were found to alter the model's predictions with significantly higher probability than those induced in random residues (Figure 2E).While our primary focus was on in silico experiments, we found reproducible results in human BCR repertoire data.This was evident as identical mutations frequently observed in the in silico experiment were also repeatedly identified in the human BCR repertoire data (Figure 2F, Supplementary Figure S2).
Benchmarking FACS and scRNA-seq BCR-SORT was further validated by comparing the cell subset prediction results with cell subsets identified by FACS and scRNAseq using external datasets.Assessed on multiple unseen datasets comprising various diseases to benchmark both FACS and scRNAseq (4-7, 21-23, 57) (Supplementary Table 2), BCR-SORT In (E), the statistics are calculated using paired t-test.***p< 0.001.
outperformed the current state-of-the-art method (Figures 3A, B).Although the BCR-SORT was trained on datasets originated from various diseases (Supplementary Table 3), it had no experience of datasets such as COVID-19, tetanus toxoid (TT) vaccination, and systemic lupus erythematosus (SLE).However, these unseen diseases were also included to assess the model's general applicability in diverse physiological and pathological circumstances.Overall, BCR-SORT exhibited stable performance on various diseases, except a slight performance degradation on the SLE dataset.
In fact, BCR profiles expressed by each B cell subset are divergent depending on the type of disease owing to different antigen-specific B cell responses, which affects the prediction performance.We hypothesized that the performance of BCR-SORT can be further improved if it is optimized for specific diseases.Based on this hypothesis, we utilized transfer learning to overcome the differences originating from various diseases and leverage the full capacity of the model.Thus, for each type of disease, we fine-tuned the established BCR-SORT into disease-specific models and measured the improvements in accuracy (Figure 3C).Clearly, fine-tuning improved the model performance across all diseases by integrating specific diseaserelated features with the general cell subset features.By contrast, BCR-SORT trained from scratch using only the disease-specific data could not outperform these fine-tuned models (Figure 3D).Given that the type of disease is specified in most analyses and considering the effectiveness of disease-specific fine-tuning even in data-scarce settings, this approach can be viewed as generally applicable in practice.
In addition to disease-specific fine-tuning, BCR-SORT could be fine-tuned on single time point data of an individual to mitigate individual BCR heterogeneity.As the type of disease is likely to be consistent in most patients over multiple time points, we expected that the addition of individual-level BCR properties over diseasespecific properties by chronological transfer would provide further accuracy.Despite the chronological gap between training and evaluation data, the addition of individual-level BCR properties resulted in extra accuracy gain compared with disease-specific properties (Figure 3E), thus demonstrating that BCR-SORT is applicable for chronological analysis of an individual's B cell population with personalized optimization.

Cell subset-aware lineage reconstruction
Beyond characterization of individual B cells, we demonstrated that BCR-SORT can also be applied to investigate maturation hierarchy within families of relevant B cells.As a proxy of maturation hierarchy, SHM hierarchy was inferred starting from SHM-free BCR sequences (germline) by simulating the accumulation of SHMs.To postulate the origin of such hierarchy, previous methods assume a virtual naïve B cell ancestor containing a germline sequence (39,40).However, the germline sequence cannot be specified around HCDR3 owing to unreliable alignment of IGHD germline gene, in other words, SHMs induced within HCDR3 cannot be fully elucidated.Consequently, previous methods are restricted in representing the actual B cell maturation hierarchy, and at times, result in an evolutionary scenario that contradicts the biology of B cell differentiation (e.g.ASC preceding naïve B cell).
This uncertainty in virtual naïve B cell ancestor is mitigated by BCR-SORT as it provides an actual naïve B cell as an alternative ancestor to replace the virtual one.Herein, using the alternative naïve B cell ancestor as a new root of the tree, we demonstrated that BCR-SORT contributes to rearranging the original phylogenetic tree, thereby suggesting alternative evolutionary scenarios better aligned with the biology of B cell differentiation (Figure 4A, Supplementary Figure S3).To this end, we designed a process to rearrange the phylogenetic tree inferred by previous methods ("reroot") and evaluated the outcome of the process (Materials and methods).We compared the sequential order of BCR evolutions between phylogenetic trees reconstructed by IgPhyML and BCR-SORT via rank correlation (Figure 4B).Rerooting using BCR-SORT substantially altered the sequential order of BCR evolution, with the rerooting results well-aligned with those obtained using B cell subset information verified by FACS or scRNA-seq.In detail, rerooting using BCR-SORT relocated naïve B cells to the front of the tree, and ASCs to the rear (Figure 4C).Likewise, IgM BCRs were relocated to the front of the tree, whereas class-switched isotypes, such as IgG or IgA, were relocated to the rear (Figure 4D).
Owing to uncertainty of germline sequence, previous methods had difficulty in clarifying the BCR sequence development along evolution trajectory (affinity maturation process).On the other hand, lineage rerooting using BCR-SORT yielded clear mutation history since the root of the mutation was specified by the designated naïve B cell (Supplementary Figure S4).

Unveiling treatment-resistant B cell subpopulations in autoimmune disorders using BCR-SORT
Before applying BCR-SORT on publicly accessible, unlabeled BCR repertoire datasets, we first applied the model on in-house datasets.While these datasets were also unlabeled, they were anticipated to exhibit discernable patterns in the distribution of B cell subpopulations, thereby serving as a robust foundation for our preliminary assessments.Rituximab (RTX), a monoclonal antibody that targets CD20 molecules on B cells, is a treatment option for various autoimmune diseases mediated by autoantibody-producing B cells.RTX treatment induces profound alterations in B cell subpopulations by depleting naïve B cells and memory B cells, whereas ASCs are resistant to depletion owing to low CD20 expression on cell surface (58).Besides, these RTX-resistant ASCs have been substantiated to serve as a basis of post-RTX relapse (20).Thus, we utilized BCR-SORT to identify RTX-driven alterations and to investigate RTX-resistant ASCs with potential relevance to disease relapse.
We recruited one MG patient and one PV patient to investigate their BCR repertoires.Using BCR-SORT, we identified alterations in B cell subpopulations and observed that ASCs exhibited resistance to RTX treatment in both MG and PV patients (Figure 5A).As post-RTX relapse was identified in PV patient, autoantibody-producing B cells which resist the RTX treatment and survive until the relapse might be the potential source of the relapse.Therefore, we identified persistent clones, groups of similar BCR sequences that were present in all three peripheral blood sampling points (pre-RTX treatment, after it, and after disease relapse), and analyzed using BCR-SORT.Consistent with previous studies (20), ASCs defined by BCR-SORT appeared to be more persistent than memory B cells (Figure 5B).Moreover, phylogenetic analysis revealed that these persistent BCRs continuously acquired SHMs from pre-RTX stage to relapse, implying the continuous exposure to autoantigens (Figure 5C).
Elucidating COVID-19 vaccine-induced maturation of BCRs using BCR-SORT Finally, we applied BCR-SORT to an unlabeled BCR repertoire dataset to interpret the role of B cells by recovering the B cell subset information.Using datasets constructed by Park et al. (51), BCR repertoires obtained from 41 COVID-19 vaccine recipients during three doses of vaccination were analyzed using BCR-SORT after fine-tuning the model with labeled COVID-19 data (Supplementary Figure S5A).Prior to employing BCR-SORT, the intricate B cell responses against SARS-CoV-2 were not clearly understood (Supplementary Figure S5B).However, following its use, these responses showed clear patterns distinct to each B cell subpopulation in accordance with the vaccination schedule (Figure 6A), representing the initial antigenic encounter (Dose 1), boosting (Dose 2), and the persistence of immunological memory over time (Dose 2 + 6wk and Dose 2 + 30wk), followed by rapid recall response (Dose 3).
Previous works have reported that triple vaccinations with the wild-type virus protein (original Wuhan strain) induces a potent antibody response to variants of concern including Omicron (59)(60)(61).Specifically, Omicron-binding memory B cells are known to emerge after the 2 nd dose of vaccine (62)(63)(64) and increase in number by the 3 rd dose of vaccine upon reactivation (59)(60)(61).However, the underlying landscape of SHMs and inter-individual heterogeneity of the recall response has remained unknown.
Consistent with the previously published findings, immunity against Omicron was found to be initiated by memory B cells defined by BCR-SORT after the 2 nd dose of vaccine, followed by their expansion after the 3 rd dose, with a large margin compared to ASCs (Figure 6B, Supplementary Figure S6).Of note, we identified three lineages from two vaccine recipients that contained Omicronbinding memory B cells, which exhibited significant expansion in number and evolutionary relationships between B cells emerging after the 2 nd and 3 rd doses.Using BCR-SORT, we identified inter-individual heterogeneity of memory B cell emergence during the course of vaccinations.Through phylogenetic analysis, we derived direct evidence that memory B cells emerging after the 2 nd dose of vaccines became the source of Omicron-specific recall response after the 3 rd dose via accumulation of SHMs (Figure 6C).However, no evidence of antigen experience was observed until the 3 rd dose of vaccine in a lineage obtained from another vaccine recipient (Figure 6D), although both lineages comprised similar HCDR3 sequences sharing an identical IGHV gene (Supplementary Figure S7).Interestingly, the emergence of Omicron-specific memory B cells we captured was in accordance with the presence of antibodies exhibiting high levels of SHM, further enhancing our observations (Figure 6E, Supplementary Figure S8).Altogether, our observation suggested that the evolution of memory B cells could reveal individual differences in vaccine efficacy against SARS-CoV-2 variants.

Discussion
In this study, we proposed BCR-SORT, a deep learning model designed to predict B cell subsets based on the given BCR sequence.Contrary to conventional B cell subset identification methods such as FACS or scRNA-seq, BCR-SORT enabled the coupling of antigen receptors with B cell subsets using solely the sequencing modality.
Leveraging HCDR3 sequences as a reliable source of B cell subset-specific features, BCR-SORT outperforms the current stateof-the-art method.When full-length BCR sequences were used as inputs instead of HCDR3 sequences, we observed no improvement in performance, further indicating the importance of HCDR3 in cell subset prediction (Supplementary Figure S9).Extensive analysis on the model behavior via IG revealed that BCR-SORT utilized B cell activation and maturation signatures encoded in the HCDR3 sequence during B cell subset prediction.BCR-SORT was validated to be generally applicable to BCR datasets from various types of disease, with disease or individual-specific fine-tuning further enhancing the performance.In addition, BCR-SORT, in conjunction with conventional phylogenetic analysis method, enabled cell subset-aware rearrangement of BCR lineages, which yielded more interpretable results following the biology of B cell differentiation.Finally, BCR-SORT was applied to unlabeled datasets obtained from autoimmune disease patients and COVID-19 vaccine recipients.Notably, BCR-SORT not only reproduced results consistent with previous knowledge, but also elucidated the varying effects of vaccines on constructing immunological memory among different vaccine recipients.Current limitation of BCR-SORT is the absence of a method to analyze individual BCRs while incorporating the context of the BCR repertoire, i.e., similar BCRs originating from the same source are not considered during prediction.In fact, Miho et al. (65) discovered that the similarity relations between BCRs exhibit distinctive features depending on the B cell subset, as the biological mechanism to diversify BCRs differs between cell subsets.Therefore, incorporating this information into predictions is expected to further improve the performance of the model.Additionally, BCR-SORT is unable to distinguish two different B cell subsets encoded by the identical BCR.Unfortunately, indistinguishable cases may emerge when the maturation of the B cell subset precedes BCR sequence maturation by a significant margin.However, results from post-hoc analysis on our dataset implied that these could be extremely rare in practice.We found that only 0.32%, 1.01%, and 0.09% of BCRs were found to be indistinguishable between naïve-memory, memory-ASC, and naïve-ASC compartments within our dataset, respectively.Finally, benchmarking scRNA-seq has been constrained due to limited datasets available compared to FACS, indicating the need for more extensive validation.
In recent developments, the integration of large language models (LLMs) has uncovered unknown representations within biological sequences, paving the way to harness their potential uses (66,67).By embracing the development of LLMs, complicated representations encoded in the HCDR3 sequences are expected to be further deciphered.
Lately, various analytic solutions have been proposed to decipher B cell or T cell responses by combining functional landscape and antigen specificity (68-72).However, these solutions focus on developing a novel method to integrate antigen receptor sequence data and gene expression data, while delegating the data generation to the costly scRNA-seq analysis.Consequently, these solutions are inherently restrained to be widely applied in various contexts and to cover high diversity of immune cells.Instead, BCR-SORT focuses on the generation of missing B cell subset information to combine functional landscape and antigen specificity.High-throughput sequencing has already been established as a standard method to investigate BCR repertoire.Therefore, BCR-SORT is broadly applicable to establish missing links between individual BCR and cell subset.We anticipate that BCR-SORT will contribute to the generalization of the simultaneous analysis of B cell subsets and corresponding antigen receptors to clarify the roles of B cells during various immunological challenges.
University Health System (IRB No. 4-2023-1059).The studies were conducted in accordance with the local legislation and institutional requirements.The participants provided their written informed consent to participate in this study.Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.

1
FIGURE 1 BCR sequence-based prediction of B cell subsets using BCR-SORT.(A) Schematic depicting the coupling of B cell subsets with corresponding antigen receptors using BCR-SORT.(B) Schematic example of BCR lineage rearrangement using BCR-SORT.Lineage inferred without cell subset information is reconstructed using BCR-SORT.(C) Detailed illustration of the BCR-SORT architecture.(D) Performance of BCR-SORT with respect to input attributes.The influence of input attributes on accuracy is shown, including that in the case of the existing state-of-the-art method.(E) Confusion matrix representing the average prediction results of BCR-SORT with respect to B cell subsets.(F, G) Bar plots representing the relationship between model accuracy and the isotype of input instances (F), and HCDR3 sequence length (G), respectively.(H) tSNE visualization of penultimate layer activation in BCR-SORT.5,000 instances are randomly selected.Accuracy in (D-G) is measured with 5-fold cross validation.

2
FIGURE 2 Interpretation of B cell subset-specific BCR sequence features using Integrated Gradients.(A) Composition of the top 10% high-IG features with regard to cell subsets.(B) Proportion of HCDR3 residues predicted as paratope using Parapred.Mean predicted paratope proportions and their confidence intervals (error bars) are plotted.The confidence interval is obtained by bootstrapping 100,000 high-IG and non-high-IG sequences 10,000 times.(C) Proportion of high-IG HCDR3 residues with regard to position.Sequence length of 15, which is the most dominant length in the dataset, is shown as an example.(D) Schematic of the in-silico saturation mutagenesis experiment.A single substitution of HCDR3 residue is considered in the region-of-interest, and B cell subset of the mutated sequence is predicted by BCR-SORT.(E) Proportion of in silico mutations altering the prediction of BCR-SORT.To examine if the mutation in high-IG residue was critical on altering the prediction, HCDR3 residues to induce mutations were selected from high-IG residues and randomly selected residues.The proportion is calculated with respect to sequence length.(F) Correlation plot between the cell subset alteration counts measured in in silico saturation mutagenesis and in human repertoire data.Alteration between memory B cell and ASC with the sequence length of 15 is plotted as an example.Spearman's correlation coefficient is shown in the figure.In (E), the statistics are calculated using paired t-test.***p< 0.001.

FIGURE 3
FIGURE 3 External validation and transfer learning for benchmarking FACS and scRNA-seq.(A, B) Performance of BCR-SORT and the current state-of-the-art random forest (RF)-based model on external validation datasets, verified by FACS (A) and scRNA-seq (B).Accuracy is measured using various benchmark datasets and presented according to the source of the dataset.Performance of RF is evaluated on datasets for whom full length of nucleotide BCR sequences are available, and error bars of datasets containing only a single sample are calculated by bootstrapping 150 BCR sequences 1,000 times.(C) Accuracy gains achieved by fine-tuning the established BCR-SORT into disease-specific models with respect to the number of examples utilized in fine-tuning.(D) Performance gap between models trained with the same disease-specific examples, but built either from the established BCR-SORT or from scratch.In cases of using maximum number of training examples are shown.(E) Comparison of accuracy gains achieved using two different data transfer schemes.In case of inter-individual transfer, the established BCR-SORT was fine-tuned using data from other individuals under identical type of disease [in the same manner with (C)], while in case of intra-individual transfer, data from identical individual's chronological data was given.

4
FIGURE 4Comparisons of BCR phylogenetic trees inferred using cell subset-aware lineage reconstruction and conventional method.(A) An example of phylogenetic tree reconstructed using IgPhyML (left) and BCR-SORT (right).Naïve B cell node selected as a new root of the tree is labeled.(B) Distribution of rank correlations representing the sequential alignment of BCR evolutions between two phylogenetic trees.Within the same lineage, sequential order of BCR evolutions is identified along the phylogenetic tree reconstructed by IgPhyML and the tree rerooted by BCR-SORT, and compared with the tree rerooted based on the ground truth naïve B cells.(C, D) Effect of rerooting using BCR-SORT on rearranging BCRs within the phylogenetic tree.The distribution of B cell subset (C) and isotype (D) within the lineage is measured before and after rerooting using BCR-SORT.Each lineage is divided into three parts based on the evolutionary time scale, and positional variation of BCRs within the lineage is measured.In total, 442 lineages are analyzed in (B-D).Statistics are calculated using Mann-Whitney U test with p-value adjustment using Bonferroni correction in (C, D).

5
FIGURE 5 Chronological tracing of treatment-resistant B cells from autoimmune disease patients using BCR-SORT.(A) Proportion of B cell subsets throughout the course of treatment and relapse predicted using BCR-SORT.Data from MG patient (left) and PV patient (right) were shown.(B) BCR fractions in each persistent clone.Persistent clones were divided into those dominated by ASCs (left) and memory B cells (right), and their BCR fractions were measured within each subset.The number of persistent clones analyzed was indicated on top.(C) Representative phylogenetic tree of persistent clones accompanying multiple B cell subsets during the course of treatment and relapse.