Edited by: Ana Cumano, Institut Pasteur, France
Reviewed by: Yuxin Sun, University College London, United Kingdom; Martin Bachmann, University of Bern, Switzerland
*Correspondence: Gur Yaari,
This article was submitted to B Cell Biology, a section of the journal Frontiers in Immunology
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
The adaptive branch of the immune system learns pathogenic patterns and remembers them for future encounters. It does so through dynamic and diverse repertoires of T- and B-cell receptors (TCRs and BCRs, respectively). These huge immune repertoires in each individual present investigators with the challenge of extracting meaningful biological information from multi-dimensional data. The ability to embed these DNA and amino acid textual sequences in a vector space is an important step towards developing effective analysis methods. Here we present Immune2vec, an adaptation of a natural language processing (NLP)-based embedding technique for BCR repertoire sequencing data. We validate Immune2vec on amino acid 3-gram sequences, continuing to longer BCR sequences, and finally to entire repertoires. Our work demonstrates Immune2vec to be a reliable low-dimensional representation that preserves relevant information of immune sequencing data, such as n-gram properties and IGHV gene family classification. Applying Immune2vec along with machine learning approaches to patient data exemplifies how distinct clinical conditions can be effectively stratified, indicating that the embedding space can be used for feature extraction and exploratory data analysis.
Antibodies, the secreted form of BCRs, play a crucial role in the adaptive immune system, by binding specifically to pathogens and neutralizing their activity (
In NLP, the term “embedding” refers to the representation of symbolic information in text at the word-level, phrase-level, and even sentence-level, in terms of real number vectors. “Word embedding” was first introduced by (
Here, we continue this line of work, focusing on developing an embedding technique for B/T cell receptor HTS data and using this embedding, along with machine learning methods, to answer real-life questions in computational immunology. To do so, we created an initial word2vec model for immune system sequence embedding, referred to below as Immune2vec. We use a bottom-up approach, starting by validating the proposed model on short 3-gram amino acid sequences, continuing to longer, complementarity-determining region 3 (CDR3) sequences, and finally to the full-scale problem of representing entire immune receptor repertoires. Such an approach can pave the way to a wide range of computational immunology applications, which can also exploit the countless ongoing developments in machine learning and NLP for different types of high-dimensional immunological data analyses.
The overall goal of this research is to develop a methodology to embed BCR sequences in a real vector space using methods from NLP. We rely on the evident analogy between immune receptor sequences and natural language, and the applicability of natural language processing methods to immune receptor sequences, to address open questions. The above analogy is illustrated in
The research structure and workflow.
The data sets used in this research are:
The above data sets were used for data analyses, classification processes, and to create the following models:
The workflow of model generation is shown in
• The first step in creating the model is building an adequate corpus for word2vec training. The data set is critical to the context analysis and must be chosen carefully to reflect the type of data we plan to analyze without over-fitting to it. All our models are generated from amino acid sequences. Building a model based on nucleotide sequences is also possible, but is beyond the scope of this study.
• As suggested by (
• The term “corpus” refers to the collection of words extracted from all documents. In our case, the corpus is a file containing all the sequences generated by the non-overlapping method.
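The non-overlapping splitting can be sketched in Python. The exact convention is not fully shown above; this sketch assumes each amino acid sequence is split into non-overlapping 3-grams once per possible reading frame, so every sequence contributes three "sentences" to the corpus (the function name is ours, for illustration):

```python
def split_nonoverlapping(seq, n=3):
    """Split an amino acid sequence into non-overlapping n-grams,
    once for each of the n possible reading frames."""
    sentences = []
    for frame in range(n):
        words = [seq[i:i + n] for i in range(frame, len(seq) - n + 1, n)]
        if words:
            sentences.append(words)
    return sentences

# One CDR3-like amino acid sequence yields up to three "sentences"
print(split_nonoverlapping("CARDGYW"))
```

Concatenating the output for all sequences in the training data produces the corpus file used for word2vec training.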
• An unsupervised training method composed of a shallow two-layer neural network was used, as described in (
Work flows applied to the different levels.
The model has several hyperparameters that affect the results and must be taken into account. In particular:
The trained model includes the key function called
and
These vectors are summed and normalized by the number of words of all 3 sequences, so:
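This summation can be sketched as follows, assuming the trained model exposes a dictionary-like mapping from n-grams to their learned vectors (the `vectors` mapping here is a hypothetical stand-in for the model's vocabulary):

```python
import numpy as np

def embed_sequence(sentences, vectors):
    """Embed one receptor sequence: sum the vectors of every n-gram in
    all three reading-frame splits, then normalize by the total number
    of n-grams across the three "sentences"."""
    words = [w for sentence in sentences for w in sentence]
    return sum(vectors[w] for w in words) / len(words)
```

For example, with toy 2-dimensional n-gram vectors, a sequence whose splits contain one n-gram each simply averages them.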
Logistic regression (sections
k-Nearest neighbors (see the section
Decision Tree (see the section
Random forest (see the section
The first building block of the research is validation of the word-level representation model. This validation includes an analysis of the 3-gram representation. The vector representation in 2-dimensional space is compared to known properties of these sequences, and spatial auto-correlation methods are applied to validate that the vector representation captures biologically meaningful information.
Multiple amino acid biophysical and biochemical properties can be obtained using the “alakazam” package in R (
“Gravy”: grand average of hydrophobicity.
“Bulkiness”: average bulkiness.
“Polarity”: average polarity.
“Aliphatic”: normalized aliphatic index.
“Charge”: normalized net charge.
“Acidic”: acidic side chain content.
“Basic”: basic side chain residue content.
“Aromatic”: aromatic side chain content.
Several clustering methods on the 2-dimensional space with Euclidean distance were implemented in this section. The first is k-means unsupervised clustering. The clustering was done with a depth of 2, using
3-gram embedding analysis using several tools
After clusters are defined, our goal is to understand the correlation between the clusters and the biological properties. From
where
In the second method, distances in terms of biophysical or biochemical properties are calculated between all vectors within each cluster, and this distribution is compared to the distribution of the same distances between all vectors in the data. For each property we show a box plot of the distance of this property between all data points
Spatial auto-correlation is characterized by a correlation in a signal among nearby locations in space. Using this method, we rule out any dependence of the results on the clustering parameters, by calculating the spatial auto-correlation over the entire 2D embedding space. Moran’s Index, or Moran’s I, (
where
Choosing a list of distances
Constructing the spatial weights matrix
Calculating Moran’s index for each property and each distance.
Analyzing the results, where the expected value under the null hypothesis of no spatial auto-correlation is 0 for large sample sizes as in our case. Positive values indicate positive auto-correlation.
The implementation of this section was done using the python pysal package by (
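While the paper's implementation uses the pysal package, the quantity being computed can be illustrated with a self-contained NumPy sketch of Moran's I. The binary distance-band weights (w_ij = 1 when points are within a threshold distance) are an assumption here, for illustration:

```python
import numpy as np

def morans_i(coords, values, threshold):
    """Moran's I with binary distance-band weights:
    w_ij = 1 if 0 < dist(i, j) <= threshold, else 0.

    I = (N / W) * sum_ij w_ij (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2

    where W = sum_ij w_ij. Under the null hypothesis of no spatial
    auto-correlation the expected value is ~0 for large N; positive
    values indicate positive auto-correlation."""
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(values, dtype=float)
    n = len(x)
    # Pairwise Euclidean distances between embedding coordinates
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = ((d > 0) & (d <= threshold)).astype(float)
    z = x - x.mean()
    return (n / w.sum()) * (z @ w @ z) / (z @ z)
```

Two tight, distant groups whose property values agree within each group give a strongly positive index, matching the interpretation in the step list above.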
In this part we focus on one application of sequence-level representation in a classification problem. We chose to classify IGHV families using CDR3 sequences. The models used in the analysis are IVM1 and IVM2. As illustrated in
Data preparation – the required output is one file containing (at least) the trimmed CDR3 translated to amino acids, and the V Call. The 3 most common IGHV families were chosen from the data: IGHV1, IGHV3, and IGHV4. To avoid bias, the same number of samples was chosen from each family. As indicated, the sequences were trimmed from both ends. As can be seen in the sequence logo images in
Generate vectors from the data set by applying the trained Immune2vec model.
Reduce vector dimensions – PCA was used in this section.
Use R to extract IGHV family labels using the “getFamily” function from the “alakazam” package by (
Split the data into train and test sets. In the train set, each CDR3 sequence was labeled with the corresponding IGHV family of its original VDJ sequence. The train set contains 75% of the data, and the test set the remaining 25%.
Use different classifiers to distinguish between families and show the results. In this case, the two classification methods used were decision tree and k-nearest neighbors. The classifiers are optimized to the train set, and tested on the independent test set. The classifiers are applied using the python “Scikit-learn” package (
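The steps above can be sketched end-to-end as follows. Synthetic Gaussian blobs stand in for real Immune2vec CDR3 embeddings; the dimensions, sample counts, and class separation are illustrative assumptions, not the study's actual data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Stand-ins for embedded CDR3 vectors: three IGHV families, each a
# Gaussian blob in 100-d space (synthetic, for illustration only)
X = np.vstack([rng.normal(loc=c, size=(200, 100)) for c in (0.0, 0.5, 1.0)])
y = np.repeat(["IGHV1", "IGHV3", "IGHV4"], 200)

X = PCA(n_components=10).fit_transform(X)  # reduce vector dimensions
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the two classifiers on the train set, score on the held-out test set
for clf in (DecisionTreeClassifier(random_state=0), KNeighborsClassifier()):
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(type(clf).__name__, f1_score(y_te, pred, average="weighted"))
```

On real embeddings the classes overlap far more than in this toy setup, which is why the reported f1-scores sit well below 1.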
The general work flow for this part is shown in
For the repertoire-level classification we used 10 CI and 10 SC samples from DS1. For the sequence embedding step we tested the model with the following corpora: data sets DS1, DS2, DS5, and DS6, as described in the section
Raw reads were filtered in several steps to identify and remove low quality sequences, as described in (
At this point, Immune2vec was applied to the CDR3 sequences, producing a d-dimensional vector representation for each sequence, with
The output of applying the model to all sequences, is a matrix of [
Vectors were first grouped according to their V-gene, J-gene, and CDR3 length. For each group, the difference between each pair of vectors was calculated by Euclidean distance. Hierarchical clustering by a complete linkage method was applied and sequences were clustered by a distance threshold. Setting this threshold for the complete linkage presented a new free variable of the model. In (
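The per-group clustering step can be sketched with SciPy's complete-linkage implementation; this is an illustrative stand-in, with the distance threshold left as the free parameter discussed above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_group(vectors, threshold):
    """Cluster the embedded vectors of one (V-gene, J-gene, CDR3-length)
    group: complete-linkage hierarchical clustering on Euclidean
    distances, cut at a fixed distance threshold."""
    vectors = np.asarray(vectors, dtype=float)
    if len(vectors) == 1:
        return np.array([1])  # a singleton group is its own cluster
    Z = linkage(pdist(vectors, metric="euclidean"), method="complete")
    return fcluster(Z, t=threshold, criterion="distance")
```

Two tight pairs of vectors far apart from each other, cut at a threshold between the within-pair and between-pair distances, yield two clusters.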
As described in (
Once the features are selected, the information is processed to a feature table, a basic [
Feature elimination was performed by a random forest model, choosing the most informative 18 features.
Logistic regression with an L2 regularization penalty was applied to these 18 remaining features.
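The two steps above can be sketched as follows. The feature table, sample counts, and planted signal are synthetic, for illustration only; the count of 18 retained features is taken from the text:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_samples, n_features, n_keep = 20, 100, 18
X = rng.normal(size=(n_samples, n_features))   # synthetic feature table
y = rng.integers(0, 2, size=n_samples)         # synthetic binary labels
X[:, 0] += 3 * y                               # plant one informative feature

# Step 1: rank features by random-forest importance, keep the top 18
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:n_keep]

# Step 2: L2-penalized logistic regression on the selected features
clf = LogisticRegression(penalty="l2").fit(X[:, top], y)
```

The planted informative feature survives the elimination step, and the final regularized model is fit only on the reduced feature set.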
The source code for the implementation can be found in
The first step we took to evaluate Immune2vec’s performance on immunological data was to show that the embedding captures immunological properties of the data, as these properties are not part of the model’s training. In general, the analyses presented throughout the manuscript are divided into three steps:
Pre-processing AIRR-seq data and extracting the core of the amino acid CDR3 sequences.
Constructing an embedding model (Immune2Vec), and using it to embed AIRR-seq data.
Analyzing the
The goal of this section is to test whether unsupervised training of the embedding step (Immune2Vec) preserves the biological properties of all possible amino acid n-grams. As can be seen in this section, vectors corresponding to different n-grams cluster in the embedded space according to the bio-physico-chemical properties of their amino acid n-grams.
The first dataset analyzed was DS1, containing Hepatitis C Virus (HCV) infected individuals and healthy controls (
The t-SNE dimension reduction method does not preserve distances or densities between points, and thus tends to cluster points with high affinity during the process. The 2-dimensional representations of the 3-grams were assigned to clusters by k-means unsupervised clustering, as explained in the section
A comparison between within cluster and between cluster variability. The within cluster variability is calculated using the pooled variance method (see
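This comparison can be sketched as a pooled-variance ratio. The implementation below is ours, for illustration; it computes the ratio of overall variance to pooled within-cluster variance reported for each property:

```python
import numpy as np

def variance_ratio(values, labels):
    """Ratio of the overall variance of a property to its pooled
    within-cluster variance. A high ratio means the property is much
    more homogeneous inside clusters than across the whole data."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    num, den = 0.0, 0
    for lab in np.unique(labels):
        v = values[labels == lab]
        if len(v) > 1:
            # Pooled variance: weight each cluster's variance by (n_i - 1)
            num += (len(v) - 1) * v.var(ddof=1)
            den += len(v) - 1
    return values.var(ddof=1) / (num / den)
```

Two tight clusters of property values far apart from each other give a large ratio, as in the table below.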
Spatial auto-correlation. To ensure that the clustering parameters did not affect our results, we quantified the spatial auto-correlation of the entire 2D space using Moran’s Index. As explained in the section
Amino acids properties distribution.
| Property name | General variance | In-cluster variance (k-means, 400 clusters) | Ratio (400 clusters) | In-cluster variance (k-means, 180 clusters) | Ratio (180 clusters) |
|---|---|---|---|---|---|
| Gravy | 2.83 | 1.08 | 1.37 | 2.1 | 2.3 |
| Bulkiness | 6.8 | 3.6 | 1.9 | 4.06 | 1.7 |
| Polarity | 2.29 | 0.91 | 2.5 | 1.14 | 2.0 |
| Aliphatic | 0.55 | 0.24 | 2.3 | 0.3 | 1.8 |
| Charge | 0.6 | 0.25 | 2.5 | 0.33 | 1.8 |
| Acidic | 0.03 | 0.01 | 2.2 | 0.02 | 1.8 |
| Basic | 0.04 | 0.01 | 3.0 | 0.02 | 1.9 |
| Aromatic | 0.05 | 0.02 | 2.3 | 0.03 | 1.8 |
The table compares the general variance of each property across all sequences with the in-cluster variance, and the ratio between them, for two k-means clustering granularities. A high ratio indicates strong similarity of the property within clusters.
We examined a sequence-level classification problem, as an intermediate step between word-level and whole immune repertoire analysis. As shown schematically in
First, we extracted CDR3 sequences, trimmed them, and embedded them in real space using Immune2vec. We assigned to each sequence a label with its IGHV family, as inferred from the adjacent 5’ sequence. We then split the data into train and test sets. The training process included optimizing the model parameters on the train set, and the resulting model was assessed on the test set.
IGHV family classification based on CDR3 sequence using Decision Tree (DT), Random Forest (RF) and K-nearest neighbor (KNN).
| Model (corpus) | Data set | DT f1-score | RF f1-score | KNN f1-score | Number of samples per family |
|---|---|---|---|---|---|
| IVM1 | DS1 | 0.65 | 0.67 | 0.58 | 150K |
| IVM1 | DS2 | 0.50 | 0.51 | 0.46 | 300K |
| IVM1 | DS3 | 0.67 | 0.69 | 0.61 | 100K |
| IVM2 | DS1 | 0.65 | 0.66 | 0.64 | 150K |
| IVM2 | DS2 | 0.50 | 0.52 | 0.46 | 300K |
| IVM2 | DS3 | 0.67 | 0.69 | 0.59 | 100K |
An association between the trimmed CDR3 and its adjacent V family exists, and this association is preserved after Immune2vec embedding. IVM2 shows slightly better results, most likely because it is based on a much larger corpus file. The effect of the number of dimensions used by Immune2vec on the classification was tested by using IVM2 with DS3: the same classification process was performed with a varying number of dimensions. The results for IGHV family classification are shown in
After gaining confidence in the word and sentence level representations, the next step is repertoire-level representation. The challenge here is to represent accurately an entire immune repertoire, and to use it as an input for machine learning applications such as classification. To investigate whole immune repertoires, we embedded all CDR3 amino acid sequences of a repertoire using Immune2vec. To this end we used IVM3 (see
To evaluate the above approach we used repeated cross validations with 100 repeats, leaving in each instance two different samples, one of each label, as the test set. The results of this binary classification are shown in
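This repeated leave-pair-out evaluation can be sketched as follows. The repertoire feature vectors and labels here are synthetic stand-ins; the 10+10 sample split, the 18 features, the L2-penalized logistic regression, and the 100 repeats follow the setup described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-ins for 18-feature repertoire vectors: 10 CI and 10 SC samples
X = np.vstack([rng.normal(0, 1, (10, 18)), rng.normal(1, 1, (10, 18))])
y = np.array([0] * 10 + [1] * 10)   # 0 = CI, 1 = SC (synthetic labels)

accs = []
for _ in range(100):
    # Leave out one sample of each label as the test set
    i = rng.integers(0, 10)          # a CI sample
    j = 10 + rng.integers(0, 10)     # an SC sample
    mask = np.ones(20, dtype=bool)
    mask[[i, j]] = False
    clf = LogisticRegression(penalty="l2").fit(X[mask], y[mask])
    accs.append(clf.score(X[~mask], y[~mask]))
print(np.mean(accs))
```

Averaging the per-repeat accuracy over the 100 held-out pairs gives the classification accuracy reported for the SC-CI comparison.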
Accuracy of the SC-CI BCR and TCR repertoires classification. For validation purposes, the model was trained and applied on randomly labeled data.
To examine the effect of the corpus on Immune2vec, we constructed Immune2vec embedding models using CDR3 sequences from different data sets and of different sizes. The results for classification of patients’ clinical status are shown in
Biology and medicine are data-rich disciplines. The high throughput measurements used in these fields are complex and require profound knowledge to understand. For this reason, deep learning is well suited to solve problems in these fields, as it is capable of combining raw inputs into features. New deep learning algorithms were developed for a variety of biomedical problems such as patient classification, biological process analysis and treatment methods (
This research focuses on developing a methodology to embed BCR cDNA and protein sequences in a real vector space using methods from NLP. The embedding proposed here enables the numerical representation of high-dimensional sequences, and as such enables the application of standard tools, such as dimensionality reduction and classification, to these data. Obtaining meaningful numeric representations of immune receptor sequences, which do not suffer from the curse of dimensionality and can act as input to various machine learning architectures, is thus necessary. It opens the door to countless applications developed in recent years for natural language processing and widely used across industries, including machine translation, sentiment analysis, and auto-completion, which can be adapted and explored for real-life problems in the medical field.
Several studies specifically applied word embedding to different types of biological data. Medical text analysis was studied by (
In natural language analysis, one can examine the results of an embedding with simple measures, such as viewing the words that cluster together on a 2-dimensional map, or examining the relations between known words, to assess the quality of the model. In biological data there is more than meets the eye: sequences carry no human-readable semantics, and deep biological understanding is required to analyze and evaluate the results. In addition, the problem becomes even more complex when trying to represent a wide combination of sequences, such as the entire immune receptor repertoire. The proposed method is the first of its kind, involving several steps, multiple hyperparameters, and degrees of freedom.
Bearing in mind these challenges, we built this work in steps, each designed to raise our confidence in the model and allow a scale-up in complexity with each result. Following the structure of natural language, in which words compose sentences and sentences compose whole texts, we treat short amino acid sequences as words, CDR3 sequences as sentences, and the immune repertoire as text. This last analogy is not completely accurate: the order of sentences in books or articles is meaningful, while immune receptor repertoire data is a collection of sequences without biologically relevant order or context. Thus, we did not apply common text analysis tools, such as doc2vec, but preferred to represent a repertoire by an overall analysis of its set of sequences.
A comprehensive comparison between different AIRR-seq embedding methods that would encompass the many dimensions of AIRR-seq such as the type of chain (IGH, IGK, IGL, TRB, TRA, etc.), type of signal (nucleotide/amino acid enriched motifs, changes in the overall diversity of a repertoire, etc.), DNA library preparation protocol, sequencing technology, etc., is of utmost importance. Such a comparison falls outside the scope of the current manuscript, and is therefore aimed for a future independent study.
Our work demonstrates that sequence embedding using NLP methods enables low-dimensional representations that preserve relevant information about the sequencing data, such as n-gram properties and sequence gene family classification. We show that this information is meaningful, as we found clinical condition indicators that enable classification of HCV patients. This indicates that the embedding space can be used for feature extraction and exploratory data analysis. At the computational level, first, once the data are correctly transformed into a numeric representation, we can exploit ongoing developments from other fields and adapt them to answer biological questions; the fields of machine learning and NLP are advancing at a rapid pace, with a plethora of approaches and tools. Second, a corpus file can be generated from non-labeled data, so a large pool of immune receptor repertoire data can be used to build a large corpus for several applications. Furthermore, once a labeled data set is available, it is easily transformed to numerical representations and processed
While this work provides encouraging results on immune system representation using NLP methods, much remains to be done before the suggested methods can be clinically valuable. First, a large amount of data should be collected and incorporated into a large corpus, which can be used to train a general immune repertoire model independent of the effects of the different conditions and diseases that exist when basing a model on a single data set. Second, a major challenge lies in the clustering and feature extraction phase of repertoire data: if we choose to keep a large number of features, the data becomes sparse and hard to cluster in high-dimensional space, calling for approaches tailored to the type of data presented here. Future work should aim at elucidating the effects of different training data sets, obtaining a better understanding of the feature representations, further developing the repertoire representation approach, and extending the sequence-level representation to include information about the specific antigens associated with sets of receptors.
To conclude, Immune2vec embeds BCR and TCR sequences in a real vector space using natural language processing methods. It shows great promise, and we hope to further develop this approach into a simple and robust workflow that can be easily applied to new data sets.
Publicly available datasets were analyzed in this study. These data can be found here:
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. The patients/participants provided their written informed consent to participate in this study.
GY and MO-B conceived the project and developed the ML approach. MO-B and BF implemented the algorithms and analyzed the data. GY supervised the project. GY, PP, and MO-B wrote the paper. All authors edited the manuscript. All authors contributed to the article and approved the submitted version.
ISF [832/16]; European Union’s Horizon 2020 research and innovation program [825821]. The contents of this document are the sole responsibility of the iReceptor Plus Consortium and can under no circumstances be regarded as reflecting the position of the European Union.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We thank Ayelet Peres for discussions and helpful advice, and Yaacov Weiss for help processing the sequencing raw reads.
The Supplementary Material for this article can be found online at: