Edited by: Nadia Caccamo, University of Palermo, Italy
Reviewed by: Vanessa Venturi, University of New South Wales, Australia; Benny Chain, University College London, United Kingdom
This article was submitted to T Cell Biology, a section of the journal Frontiers in Immunology
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Current sequencing methods allow for detailed sampling of T cell receptor (TCR) repertoires. To determine from a repertoire whether its host had been exposed to a target, computational tools that predict TCR-epitope binding are required. Current tools are based on conserved motifs and are applied to peptides with many known binding TCRs. We employ new Natural Language Processing (NLP)-based methods to predict whether any TCR and peptide bind. We combined large-scale TCR-peptide dictionaries with deep learning methods to produce ERGO (pEptide tcR matchinG predictiOn), a highly specific and generic TCR-peptide binding predictor. A set of standard tests is defined for the performance of peptide-TCR binding, including the detection of TCRs binding to a given peptide/antigen, choosing among a set of candidate peptides for a given TCR, and determining whether any pair of TCR and peptide bind. ERGO reaches similar results to state-of-the-art methods in these tests even when not trained specifically for each test. The software implementation and data sets are available at
T lymphocytes (T cells) are pivotal in the cellular immune response (
Within the TCRβ chain, the complementarity-determining region 1 (CDR1) and CDR2 loops of the TCR contact the MHC alpha-helices while the hypervariable complementary determining regions (CDR3) interact mainly with the peptide (
Following specific binding of T cell receptors to viral and bacterial-derived peptides bound to MHC (
A direct method for using TCR Rep-Seq as biomarkers has been proposed by Emerson et al. (
In contrast, many TCR responses are characterized by a high level of cross-reactivity with single TCRs binding a large number of MHC-bound peptides, and single peptides binding a large number of TCRs (
Important steps have been made in this direction by Glanville et al. (
All the approaches above are sequence-based and do not model the structure of the interaction. Moreover, TCR-peptide binding is treated as a binary prediction, rather than explicitly computing the on-rate or off-rate. This is a simplification, based on an arbitrary cutoff of the affinity. Also, all such predictions ignore cross-reactivity; no attempt is made to predict whether a given peptide is the only target of a TCR (or vice versa).
Still, this question is of importance in two experimental setups. The first case is the attempt to predict from deep sequencing whether a given host has been infected by a pathogen [e.g., CMV (
The next required step for using the repertoire to develop specific biomarkers would be to distinguish between TCRs binding different peptides. An essential step in the development of high-precision predictors is the standardization of the comparison methods.
In contrast with most machine learning tasks, where one attempts to predict the output for a given input (e.g., predicting the content of an image), TCR binding is a pairing problem, where one is given a pair of inputs X and Y (a peptide and a TCR), and the goal is to predict whether they would bind. As such, there are many ways to divide the training and test sets, and as a result many possible tests. One could, for example, assume that either X or Y is fixed and already seen in the training phase, while the other is varied. An alternative division could be that all X and Y are already known during training, and the test is whether a given pair of known X and Y would bind. Finally, one could imagine a more complex scenario where one asks whether X and Y, both absent from the training set, bind.
This formally translates into five different tests, each with a different outcome, which we propose as standard methods to estimate such predictions (
Single Peptide Binding—SPB. Testing whether an unknown TCR binds a predefined target, using (as training information) TCRs known to bind to this target (
Multi-Peptide Selection—MPS. Given a set of predefined peptides, predict which of those will be bound by a new TCR. In such a case, one trains on a set of different target peptides, and TCRs are again divided into disjoint training and test sets. The outcome of that would be the accuracy of the choice as a function of the number of candidate peptides.
TCR-Peptide Pairing I—TPP-I. Given a large set of peptides and TCRs, test whether a randomly chosen TCR binds a randomly chosen peptide. In this task, all TCRs and peptides appear in both training and test sets. However, TCR-peptide pairs are divided into disjoint training and test sets.
TCR-Peptide Pairing II—TPP-II is similar to TPP-I, except that TCRs appearing in pairs of the training set cannot appear in the test set.
TCR-Peptide Pairing III—TPP-III is again a similar test on pairs, but here neither TCR nor peptide can be in both training and test set.
We propose these different tests as standard measures for the quality of TCR-peptide binding predictions. The TCR-peptide pairing (TPP) task is often addressed in Natural Language Processing (NLP) using recurrent neural networks (RNN) (
Target peptides and TCRs have different generation mechanisms [TCRs through VDJ recombination and junctional diversity (
For the peptides, we first used a random initial embedding, translating each amino acid (AA) into a 10-dimensional embedding vector. Changing the encoding dimension did not significantly change the obtained accuracy. To merge the encoding vectors of each position into a single vector representing the peptide, each vector was used as input to an LSTM, and we used the last output of the LSTM as the encoding of the whole sequence. The embedding matrix values, the weights of the LSTM, and the weights of the FFN were trained simultaneously. For the TCR encoding, we either used a similar approach or an autoencoder (AE) (see Methods and
These models were trained on two large datasets of published TCRs binding specific peptides (
ERGO was trained to solve the TPP-I problem (pairing TCR and peptide) on the two datasets above, and then tested on all five mentioned tests. To test the performance of ERGO on the SPB task (detecting whether a previously unseen TCR binds a known peptide), we analyzed the five most frequent peptides in each dataset and tested whether ERGO detects if a randomly selected TCR binds each peptide. The AUC for the binary classifications ranged between 0.695 and 0.980 (
Comparison between the different versions of the ERGO classifier [AE (Autoencoder) vs. LSTM and McPAS (
| Peptide (McPAS) | AE | LSTM | Peptide (VDJdb) | AE | LSTM |
| LPRRSGAAGA | 0.772 | 0.760 | KLGGALQAK | 0.695 | 0.731 |
| GILGFVFTL | 0.843 | 0.832 | GILGFVFTL | 0.820 | 0.817 |
| NLVPMVATV | 0.835 | 0.821 | NLVPMVATV | 0.665 | 0.686 |
| GLCTLVAML | 0.803 | 0.816 | AVFDRKSDAK | 0.676 | 0.695 |
| SSYRRPVGI | 0.969 | 0.980 | RAKFKQLL | 0.828 | 0.825 |
Note that ERGO is never trained on any specific target. Instead, it learns a model for the entire set of peptides through the LSTM. As such, its performance on different peptides varies and is a function of the fit of the trained model to this specific peptide. This is both a strength and a weakness of ERGO. It is a strength in that it applies to a wide range of peptides, but a weakness since for a specific peptide with a large number of known binding TCRs, it can perform worse than existing classifiers.
To compare ERGO to current approaches, we tested its performance against existing tools that predict TCR-peptide binding. We first compared it to the work of Jokinen et al. (
Comparison between the different versions of the ERGO classifier [AE vs. LSTM and McPAS (
| Peptide | AE (McPAS) | LSTM (McPAS) | AE (VDJdb) | LSTM (VDJdb) | TCRGP (McPAS) | TCRGP (VDJdb) | NetTCR (McPAS) | NetTCR (VDJdb) | Reported (Jokinen et al.) |
| GLCTLVAML | 0.803 | 0.816 | 0.764 | 0.770 | 0.708 | 0.686 | 0.816 | 0.782 | 0.82 ± 0.02 |
| NLVPMVATV | 0.835 | 0.821 | 0.665 | 0.686 | 0.624 | 0.632 | 0.587 | 0.651 | 0.72 ± 0.01 |
| GILGFVFTL | 0.843 | 0.832 | 0.820 | 0.817 | 0.725 | 0.712 | 0.818 | 0.822 | 0.81 ± 0.01 |
We used two-sample-logos (
The single peptide binding task can be extended to the single antigen protein task, where we predict whether a TCR would bind any peptide from a protein. Instead of testing whether an unseen TCR can bind a specific peptide, we tested whether it can bind any peptide from a target protein. The performance on this task varies drastically between target proteins, with AUC ranging from 0.71 to 0.97 (
Comparison between the different versions of the ERGO classifier [AE vs. LSTM and McPAS (
| Protein (McPAS) | AE | LSTM | Protein (VDJdb) | AE | LSTM |
| NP177 | 0.772 | 0.767 | IE1 | 0.703 | 0.738 |
| M1 | 0.843 | 0.832 | M | 0.825 | 0.820 |
| pp65 | 0.814 | 0.803 | pp65 | 0.702 | 0.716 |
| BMLF1 | 0.808 | 0.819 | EBNA4 | 0.711 | 0.717 |
| PB1 | 0.958 | 0.970 | Gag | 0.890 | 0.897 |
To use a TCR as a biomarker, one should be able to predict which specific peptide it binds. To test for that, we computed the accuracy (measured as the sum of the diagonal of the normalized confusion matrix) of predicting the proper target, with a varying number of possible targets (
A more important task from a diagnostic point of view would be to distinguish between TCRs binding different peptides for any set of either known or previously unseen TCRs and peptides. To test the specificity of the prediction, we evaluated ERGO's AUC on the three TPP tasks. The easiest task (TPP-I) is predicting unknown TCR-peptide pairings (AE AUC value 0.86). A more complex task is the prediction of pairs containing a known peptide with an unknown TCR (TPP-II—AE AUC value 0.81). The hardest pairing task is to predict the binding of a previously unseen peptide and a previously unseen TCR (TPP-III). This task has never been tested before, and ERGO reaches an AUC of 0.669 (
AUC of TPP task with either known peptide and TCR (but unknown pairing TPP-I), known peptide unseen TCR (TPP-II), and unseen peptide and TCR (TPP-III).
| Task | AE (McPAS) | LSTM (McPAS) | AE (VDJdb) | LSTM (VDJdb) | NetTCR | AE (NetTCR data) | LSTM (NetTCR data) |
| TPP-I | 0.859 | 0.840 | 0.842 | 0.776 | 0.761 | 0.805 | 0.813 |
| TPP-II | 0.798 | 0.792 | 0.764 | 0.770 | 0.745 | 0.805 | 0.813 |
| TPP-III | 0.601 | 0.562 | 0.669 | 0.522 | 0.636 | 0.570 | 0.646 |
To test if this performance can be improved by enlarging the training set to better learn the generic properties, we trained ERGO on McPAS (
To further test the effect of the training set size, we subsampled the training set and tested the TPP-I AUC score for different sample sizes. The AUC increased with sample size and did not seem to saturate at the current sample size (
In the future, ERGO may contribute to the development of TCR-based diagnostic tools. However, it can already be used for the detection of TCRs that bind specific tumor antigens. Given a neoantigen extracted from full genome sequencing of tumors (
While TPP-III was never previously tested, TPP-II was recently tested by Jurtz et al. (
To test which position along the CDR3 has the strongest effect on the binding prediction, we trained ERGO while ignoring one TCR amino-acid position at a time, by nullifying that position in the autoencoder-based model or by skipping that position's input in the LSTM-based model (
We propose a set of standard tests to evaluate the accuracy of TCR-peptide binding and show that training a model using a combination of deep learning methods and curated datasets on the complex task of pairing random peptides and TCR can lead to high accuracy on all other tests. The main element affecting prediction accuracy is the training size. Enlarging the database improves the prediction accuracy for unseen peptides. Also, when subsampling the existing datasets, the accuracy increases with sample size and does not seem to saturate at the current sample size (
Several other elements can affect the results, such as the V and J gene used and the alpha chain. In general, TCR-sequencing has often been limited to the TCR β chain due to its greater combinatorial and junctional diversity (
ERGO is based on LSTM networks to encode sequential data. Previous models by Jurtz et al. (
ERGO randomly initializes its amino-acid embeddings and trains them together with the model parameters. Using word-embedding algorithms such as Word2Vec (
The prediction method presented here can serve as a first step in identifying neoantigen-reactive T cells for adoptive cell transfer (ACT) of tumor-infiltrating lymphocytes (TILs) targeting neoantigens (
Three TCR-peptide datasets were used in the binding prediction task. The McPAS-TCR dataset was downloaded from
The TCR autoencoder was trained on data derived from a prospective clinical study (NCT00809276) by Kanakry et al. (
In each model, training data was loaded as batches of positive and negative examples. For the positive examples, we took the existing TCR-peptide pairs in the database and split them into a train set and a test set. To create the negative examples for the TPP-I task, we first chose a peptide randomly from the peptides in the training set. Then, we chose five random TCRs from the training set that are not reported to bind this peptide, creating five internal wrong pairs. A similar process was used to create a test set containing positive and negative examples. Thus, the number of negative examples is five times larger than the number of positive examples in both train and test sets.
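As a concrete illustration, a minimal Python sketch of this 1:5 negative-sampling scheme might look as follows (function and variable names are ours, not taken from the ERGO implementation):

```python
import random

def sample_negatives(pairs, n_neg=5):
    """A minimal sketch of the negative-sampling scheme described above;
    'pairs' is a list of observed (tcr, peptide) tuples."""
    tcrs = [t for t, _ in pairs]
    peptides = [p for _, p in pairs]
    observed = set(pairs)
    negatives = []
    for _ in range(len(pairs)):
        pep = random.choice(peptides)       # random peptide from the training set
        added = 0
        while added < n_neg:
            tcr = random.choice(tcrs)       # random TCR from the training set
            if (tcr, pep) not in observed:  # not reported to bind this peptide
                negatives.append((tcr, pep))
                added += 1
    return negatives                        # five negatives per positive pair
```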
We used two models for predicting TCR-peptide binding. The models use deep-learning architectures to encode the TCR and the peptide. Then the encodings are fed into a multilayer perceptron (MLP) to predict the binding probability. Two encoding methods are applied—LSTM acceptor encoding and Autoencoder-based encoding. The peptide is always encoded using the LSTM acceptor method, so the two models differ in the TCR encoding method.
First, the amino acids were embedded using an embedding matrix: we assigned each amino acid a randomly initialized embedding vector. Next, the TCR or the peptide was fed into a Long Short-Term Memory (LSTM) network as a sequence of vectors. The LSTM network outputs a vector for every prefix of the sequence; we used the last output as the encoding of the whole sequence. We used different embedding matrices and LSTM parameters for the TCR and peptide encodings. The embedding dimension of the amino acids was 10. We used a two-layer stacked LSTM with 500 units at each layer, and a dropout rate of 0.1 between the layers.
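A minimal PyTorch sketch of such an LSTM acceptor encoder, using the dimensions stated above (10-dimensional embeddings, two stacked LSTM layers of 500 units each, dropout 0.1); class and variable names are illustrative:

```python
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

class LSTMAcceptor(nn.Module):
    """Sketch of the LSTM acceptor: embed each amino acid, run a
    two-layer LSTM, and keep the last output as the sequence encoding."""
    def __init__(self, embedding_dim=10, lstm_dim=500):
        super().__init__()
        self.embedding = nn.Embedding(len(AMINO_ACIDS), embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, lstm_dim,
                            num_layers=2, dropout=0.1, batch_first=True)

    def forward(self, seq):                  # seq: LongTensor of shape (1, length)
        vectors = self.embedding(seq)        # (1, length, 10)
        outputs, _ = self.lstm(vectors)      # (1, length, 500)
        return outputs[:, -1, :]             # last output encodes the whole sequence

# encoding a single CDR3 sequence
tcr = torch.tensor([[AA_TO_IDX[aa] for aa in "CASSLAPGATNEKLFF"]])
encoding = LSTMAcceptor()(tcr)               # shape (1, 500)
```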
The TCR autoencoder was trained before training the autoencoder-based binding prediction model. To train the TCR autoencoder, we first added a "stop codon" at the end of every TCR CDR3 sequence. Each amino acid was represented as a one-hot vector of 21 numbers (20 possible amino acids and an additional stop codon), where all values were set to zero except the index corresponding to that amino acid, which was set to 1. The one-hot vectors of each CDR3 were concatenated and terminated with the "stop codon" one-hot vector. Zero-padding was then added, completing the vectors to a maximum length chosen according to the length distribution of the data; each padding position was represented as an all-zero vector.
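Under these conventions, the encoding of a single CDR3 could be sketched as follows (the helper name and the use of NumPy are our assumptions):

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 amino acids plus a stop symbol "X"

def encode_cdr3(cdr3, max_len):
    """Sketch of the CDR3 representation described above: one 21-dimensional
    one-hot vector per residue, a stop symbol appended, then zero-padding
    up to max_len (chosen from the data length distribution)."""
    mat = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    for i, aa in enumerate(cdr3 + "X"):       # append the stop "codon"
        mat[i, ALPHABET.index(aa)] = 1.0
    return mat.flatten()                      # padding positions remain all-zero
```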
The concatenated TCR vectors were fed into the autoencoder network, which was based on a combination of linear layers, forming matching "encoder" and "decoder" networks. In the encoder, the TCRs were first fed into a layer of 300 units, then into a layer of 100 units, and then into the encoding layer of 100 units; the output of this layer was used as the TCR encoding in the trained autoencoder model. We used Exponential Linear Unit (ELU) activations between the linear layers and a dropout rate of 0.1. The decoding layers mirrored the encoding layers in reverse order: the encoded TCR vectors were fed into a layer of 100 units, then into a layer of 300 units, and then into a layer with as many units as the original concatenated one-hot TCR vector. We applied a softmax (a function translating the last layer's output into a probability distribution) to every sub-vector of the last decoder layer output corresponding to an input amino-acid one-hot position.
We used a Mean Squared Error (MSE) loss (the decoder output should reconstruct the concatenated one-hot input). The autoencoder was trained using the Adam optimizer with a learning rate of 1e-4 and a batch size of 50, for 300 epochs.
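Putting the architecture and training setup together, a minimal PyTorch sketch might be (layer sizes and hyperparameters follow the text; the maximum CDR3 length of 28 is an illustrative assumption):

```python
import torch
import torch.nn as nn

class TCRAutoencoder(nn.Module):
    """Sketch of the autoencoder described above: a 300-100-100 encoder,
    a mirrored decoder, ELU activations, and dropout of 0.1."""
    def __init__(self, input_dim, encoding_dim=100):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 300), nn.ELU(), nn.Dropout(0.1),
            nn.Linear(300, 100), nn.ELU(), nn.Dropout(0.1),
            nn.Linear(100, encoding_dim))
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 100), nn.ELU(), nn.Dropout(0.1),
            nn.Linear(100, 300), nn.ELU(), nn.Dropout(0.1),
            nn.Linear(300, input_dim))

    def forward(self, x):
        decoded = self.decoder(self.encoder(x))
        # softmax over each 21-dimensional amino-acid sub-vector
        batch = x.shape[0]
        return torch.softmax(decoded.reshape(batch, -1, 21), dim=2).reshape(batch, -1)

# an illustrative maximum CDR3 length of 28 residues (21 channels per position)
model = TCRAutoencoder(input_dim=21 * 28)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # as stated above
loss_fn = nn.MSELoss()  # decoder output is compared with the one-hot input
```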
In order to read the TCR from the decoded vector, we first split the long vector into "one-hot"-like sub-vectors. We back-translated each sub-vector into an amino acid by taking the amino acid matching the index of the maximal value (which should be 1). We dropped the stop codon and all positions after it to obtain the predicted TCR sequence. The autoencoder was trained on 80% of the data and evaluated on the rest. It was trained only on the TCRs; no information on the peptides was ever used to train the autoencoder.
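The decoding step can be sketched as follows (names are illustrative; the alphabet matches the encoding sketch above):

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # as in the encoding sketch above

def decode_cdr3(decoded_vec):
    """Sketch of reading a CDR3 back from the decoder output: split the long
    vector into 21-dimensional sub-vectors, take the amino acid with the
    maximal value in each, and truncate at the stop symbol."""
    seq = ""
    for sub in np.asarray(decoded_vec).reshape(-1, len(ALPHABET)):
        aa = ALPHABET[int(sub.argmax())]
        if aa == "X":          # stop symbol: drop it and everything after it
            break
        seq += aa
    return seq
```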
In both models, the TCR encoding was concatenated to the peptide encoding and fed into the MLP. The MLP contains one hidden layer whose size is half of the concatenated vector size, and a sigmoid is applied to the output of the last layer to obtain a probability. In both models, the activation in the MLP is Leaky ReLU, with a dropout rate of 0.1 between layers.
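A minimal sketch of this prediction head, assuming the two encodings are given as batched tensors:

```python
import torch
import torch.nn as nn

class BindingMLP(nn.Module):
    """Sketch of the prediction head described above: one hidden layer of half
    the concatenated size, Leaky ReLU, dropout 0.1, and a sigmoid output."""
    def __init__(self, tcr_dim, pep_dim):
        super().__init__()
        concat_dim = tcr_dim + pep_dim
        self.net = nn.Sequential(
            nn.Linear(concat_dim, concat_dim // 2),
            nn.LeakyReLU(), nn.Dropout(0.1),
            nn.Linear(concat_dim // 2, 1))

    def forward(self, tcr_enc, pep_enc):
        x = torch.cat([tcr_enc, pep_enc], dim=1)  # concatenate the two encodings
        return torch.sigmoid(self.net(x))         # binding probability
```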
As mentioned, we used two models: the LSTM-based model and the autoencoder-based model. In the first model, we trained the embeddings, the LSTM parameters, and the MLP; in the second, the TCR autoencoder, the peptide LSTM encoder, and the MLP parameters. The trained TCR autoencoder parameters were loaded into the autoencoder-based model and fine-tuned together with all other model parameters.
We used a Binary Cross-Entropy (BCE) loss. Since the described sampling method yields five times more negative samples than positive samples, the loss is weighted by a factor of 5/6 for positive samples and 1/6 for negative samples. The optimizer was Adam with a learning rate of 1e-3 and a weight decay of 1e-5, with a batch size of 50. The model was trained for 100 epochs, using 80% of the data for training and 20% for evaluation for all datasets.
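A minimal sketch of this weighted loss (the helper name is ours):

```python
import torch
import torch.nn.functional as F

def weighted_bce(pred, target):
    """Sketch of the class-weighted BCE loss described above: positives are
    weighted 5/6 and negatives 1/6, offsetting the 1:5 positive-to-negative
    sampling ratio."""
    weights = target * (5.0 / 6.0) + (1.0 - target) * (1.0 / 6.0)
    return F.binary_cross_entropy(pred, target, weight=weights)

# the optimizer used with this loss, per the text:
# torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
```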
All models were implemented in Python using the PyTorch library.
The prediction models were evaluated using the Area Under the ROC Curve (AUC) score.
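For instance, with scikit-learn (assuming it is available; the labels and scores below are illustrative):

```python
from sklearn.metrics import roc_auc_score

labels = [1, 0, 0, 1, 0]              # illustrative binding labels of test pairs
scores = [0.9, 0.3, 0.5, 0.7, 0.1]    # illustrative model output probabilities
print(roc_auc_score(labels, scores))  # 1.0 for this toy example
```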
The hyperparameters of both the LSTM-based and the autoencoder-based models were optimized using a grid search over the hyperparameter space, including the embedding matrix dimension, the LSTM dimensions, the learning rate, the weight decay, and the activation functions. All models were tested with the same grid search. Once the grid search was finished, we chose five new sets of training and test sets and reported their results. Note that the results are quite robust to most parameter changes, and that the size of the TCR-peptide pair space is much larger than any of the training sets used during parameter tuning.
At the broad level, the ERGO model was trained and designed to solve the TPP-I task. Since the train and test sets are chosen randomly for each training process (as described above), five trained models, along with their matching train and test sets, were analyzed for each database [McPAS (
To compute the single-peptide binding score, samples from each test set were examined. For every peptide, we looked for the pairs in the test set containing that peptide (positive and negative samples). The ROC and AUC scores were computed from the model predictions for those pairs. SPB scores were computed for the five most frequent peptides in each database (
Single protein scores are extracted similarly, by analyzing all pairs in the test set that contain a peptide from the specific protein.
First, the number of classes k was set. Trained model prediction scores were extracted for each TCR in the test set, paired with every peptide among the top k most frequent peptides in the relevant database. The predicted target of each TCR was the peptide receiving the maximal score. Accuracy was computed using the true samples in the test set. This was done for class numbers ranging between 2 and 10, as well as 20 and 30. Mean accuracies are shown in
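A minimal sketch of this multi-peptide selection accuracy, assuming the model scores of each TCR against the k candidate peptides are precomputed:

```python
import numpy as np

def mps_accuracy(scores, true_peptides, candidates):
    """Sketch of the multi-peptide selection accuracy described above.
    scores: array of shape (n_tcrs, k), model score of each TCR against each
    of the k candidate peptides; true_peptides: the actual target of each TCR."""
    predicted = np.argmax(scores, axis=1)                    # highest-scoring candidate
    true_idx = np.array([candidates.index(p) for p in true_peptides])
    return float(np.mean(predicted == true_idx))             # fraction chosen correctly
```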
New test TCRs and new test peptides, absent from the training set, were derived from the train and test sets. The TPP-I score is the AUC of the model predictions on the original test set. TPP-II is the AUC of the predictions on the positive and negative samples of the new test TCRs. TPP-III is the AUC of the predictions on pairs of new test TCRs and new test peptides, positive and negative samples.
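A minimal sketch of deriving the three TPP test sets from a fixed train/test split (names are illustrative):

```python
def tpp_test_sets(train_pairs, test_pairs):
    """Sketch of the three TPP test sets described above; each pair is a
    (tcr, peptide, label) tuple."""
    train_tcrs = {t for t, _, _ in train_pairs}
    train_peps = {p for _, p, _ in train_pairs}
    tpp1 = test_pairs                                         # unseen pairings only
    tpp2 = [x for x in test_pairs if x[0] not in train_tcrs]  # unseen TCRs
    tpp3 = [x for x in tpp2 if x[1] not in train_peps]        # unseen TCRs and peptides
    return tpp1, tpp2, tpp3
```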
All models were trained and evaluated using the same train and test partition. Every model train set was a sub-sample of the original train set, while the test set remained the same. Ten thousand new train samples were added at each iteration.
Again, all models were evaluated with the same test set and train/test partition. In this experiment, the training data was modified by dropping a single amino acid at a specific position at a time. Practically, this was done by deleting this position from all TCRs in the LSTM-based model, or by nullifying the relevant position in the one-hot encoding of the TCRs in the AE-based model. This experiment was repeated five times; mean TPP-I scores are shown in
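For the AE-based model, nullifying a position can be sketched as follows (assuming the one-hot TCRs are stored as a NumPy matrix; names are ours):

```python
import numpy as np

def nullify_position(one_hot_tcrs, pos, aa_dim=21):
    """Sketch of the AE-model ablation described above: zero out the one-hot
    sub-vector of one CDR3 position for all TCRs."""
    ablated = one_hot_tcrs.copy()                 # shape: (n_tcrs, max_len * aa_dim)
    ablated[:, pos * aa_dim:(pos + 1) * aa_dim] = 0.0
    return ablated
```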
First, TCR records per peptide were counted in the original McPAS (
The united IEDB and MIRA datasets were downloaded from
Publicly available datasets were analyzed in this study. This data can be found here:
IS developed the formalism and implemented it. SD designed the TCR autoencoder formalism. NT-M developed the libraries and wrote the manuscript. YL supervised the work and wrote the manuscript. HB designed the initial formalism. All authors contributed to the article and approved the submitted version.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We wish to thank Prof. Luning Prak for helpful critiques and suggestions. This manuscript has been released as a pre-print (
The Supplementary Material for this article can be found online at: