Detection of Enriched T Cell Epitope Specificity in Full T Cell Receptor Sequence Repertoires

High-throughput T cell receptor (TCR) sequencing allows the characterization of an individual's TCR repertoire and directly queries their immune state. However, it remains a non-trivial task to couple these sequenced TCRs to their antigenic targets. In this paper, we present a novel strategy to annotate full TCR sequence repertoires with their epitope specificities. The strategy is based on a machine learning algorithm to learn the TCR patterns common to the recognition of a specific epitope. These results are then combined with a statistical analysis to evaluate the occurrence of specific epitope-reactive TCR sequences per epitope in repertoire data. In this manner, we can directly study the capacity of full TCR repertoires to target specific epitopes of the relevant vaccines or pathogens. We demonstrate the usability of this approach on three independent datasets related to vaccine monitoring and infectious disease diagnostics by independently identifying the epitopes that are targeted by the TCR repertoire. The developed method is freely available as a web tool for academic use at tcrex.biodatamining.be.


INTRODUCTION
T cells form an important part of the adaptive immune system as they can recognize potentially pathogen-derived or aberrant peptides (epitopes), which are presented by major histocompatibility complex (MHC) molecules on the cell surface of nucleated host cells or professional antigen-presenting cells, and induce an immune response. The T cell receptor (TCR) molecule is responsible for the recognition of the MHC presented epitope. Each TCR protein is encoded by a genomic region that undergoes non-homologous recombination during T cell maturation, a process termed VDJ recombination. The randomness of the recombination process ensures the production of many different TCR proteins, with each T cell clone expressing a particular TCR, and allows the recognition of different epitopes. Epitope binding by the TCR is a critical step for the activation of targeted immune responses.
As high-throughput sequencing of TCR repertoires becomes more common, the need for more advanced analysis methods is increasing. A major challenge involves the functional identification of activated T cells from TCR repertoire data. To date, considerable efforts have been made to address this problem either by the detection of proliferated T cell clones, through the comparison of the abundance of TCR clonotypes between different time points (1,2), or by the identification of enriched groups of similar T cell clones using VDJ recombination probabilities (3). Along with these identification methods, several strategies have been used to compare the TCR repertoires of pathology-associated tissues with healthy tissues or peripheral blood, for example in cancer (4,5). However, the current identification methods lack the ability to discriminate disease-reactive T cells from other T cells that play a role in concurrent, non-related immune responses, thus requiring additional experiments. Different experimental strategies are currently available to determine epitope-specific TCRs, i.e., TCRs that recognize the epitope of interest, such as epitope-MHC multimer assays and peptide stimulation experiments. However, these techniques are not error-free as they can miss important epitope-specific TCRs (6,7) or falsely report non-binding TCRs to be epitope specific (8). Nevertheless, these methods have been extensively used to generate a large amount of epitope-specific TCR data which has been collected in public databases such as the VDJdb (9) and McPAS-TCR (10).
Here, we present a new strategy to study T cell responses based on the direct identification of epitope-specific TCRs from TCR repertoire data. To determine the epitope specificity of TCRs, machine learning models have been developed to analyze TCR sequences and predict the probability that they recognize and bind specific epitopes. These methods are based on the principle that similar TCR sequences often target the same epitope (11) and that machine learning techniques can be used to learn the molecular underpinnings that are shared by these epitope-specific TCR sequences (12)(13)(14). While these methods have been shown to be performant on small targeted datasets, their application on full repertoire datasets remains challenging.
In this paper, we specifically address two issues that currently limit the use of prediction methods on full TCR repertoires. The first is the uncertainty on where to place the cut-off for a correct prediction. In other words, how confident must the prediction be before we accept it as the truth. If the cut-off is chosen too strictly, we may miss many epitope-specific T cells. Conversely, if it is chosen too low, we increase the number of TCRs that are mistakenly identified as epitope-binding. To address this first issue, the strategy proposed here includes the estimation of a model-specific baseline prediction rate, which can be used to set the prediction cut-off in an informed manner. The second issue is that the presence of epitope-specific TCRs is not restricted to TCR repertoires that have been stimulated by an epitope, since they can also occur in small numbers in naïve, healthy TCR repertoires. Therefore, the identification of single epitope-specific TCR sequences by itself does not provide sufficient information on the capacity of the full TCR repertoire to target specific epitopes. Current prediction strategies do not evaluate whether a TCR repertoire as a whole contains more TCRs against a specific epitope than expected in a healthy TCR repertoire that is not enriched for these TCRs. To address this second issue, we introduce an enrichment analysis strategy to identify those specific epitopes that are being targeted within a given TCR repertoire.
We built upon the epitope-specific TCR classifier described in De Neuter et al. (12) and developed an approach designed for the identification of various epitope-specific TCRs in full repertoire datasets. We have applied this new approach to three independent datasets (1,2,15), which have previously been analyzed with the traditional TCR repertoire data analyses described above. In this manner, we show that by adding this layer of epitope information, we are able to link T cell data to broader immune repertoire targets. The described method is available as a web tool at tcrex.biodatamining.be.

Collection of Training Data and Assembly of the Background TCR Dataset
A positive training dataset was constructed containing human epitope-specific TCR beta chain sequences collected from the manually curated catalog of pathology-associated T cell receptor sequences [McPAS-TCR (10)] (11 339 TCR-pathology combinations) and the VDJ database [VDJdb (9)] (17 792 TCRepitope combinations) on November 18, 2018. To train the prediction models, both the CDR3 beta amino acid sequence and the V/J genes were gathered. The following quality filtering steps were applied for the McPAS-TCR dataset: (1) retaining epitope-specific TCRs determined with peptide-MHC tetramers or peptide stimulation and (2) removal of TCRs with missing information (i.e., CDR3 beta sequence, V/J genes or the specific epitope), reconstructed J genes, V/J genes with special characters that could not be matched to known V/J genes, CDR3 beta sequences with lower case amino acids or non-amino acid characters, TCRs with additional quality remarks and TCRs from studies using mouse strains. The dataset was further extended with 550 TCR-epitope combinations found in the published literature (13,(16)(17)(18). Standard TCRex filtering steps were carried out (Supplemental Material S1). Furthermore, entries with multiple V genes were split into separate entries, each having one of the V genes. For the retained TCR beta sequences, we checked whether their V/J genes were present in the IMGT format as formulated by the IMGT database (19). The latter was necessary due to possible inconsistencies in the way V/J genes were reported across studies. All V/J genes not reported in the IMGT format were corrected using our in-house IMGT parser. Finally, duplicate TCR-epitope combinations across different sources were removed. Only beta chain TCR sequence data for epitopes having at least 30 unique epitope-specific TCRs was retained in order to ensure the presence of an adequate number of TCRs to train and validate the classifiers. The final dataset contained 18 679 unique TCR-epitope combinations.
The control training dataset was designed to contain nonbinding TCRs for each epitope. As one epitope can be presented by more than one MHC-molecule, and one TCR sequence might interact with different MHC molecules (20), we did not take the epitopes' MHC background into account. Instead, we assembled one control dataset for all epitope-specific models by collecting TCR beta chain sequences from healthy TCR repertoires from the study by Dean et al. (21). When considering a certain epitope, the vast majority of TCRs in healthy TCR repertoires will not target it. However, it is not unlikely that these repertoires contain a few TCRs that do target this specific epitope. The presence of these TCRs in our control dataset and their effect on the performance of the prediction models is expected to be negligible due to the low probability of their inclusion. In total, bulk TCR beta chain repertoire data from 587 healthy individuals was available. We randomly selected 1500 TCR beta sequences from the repertoire of each of these individuals. Subsequently standard TCRex filtering steps (Supplemental Material S1) were applied and the V/J genes of the remaining TCR beta chain sequences were transformed to the IMGT format. From this final dataset, 250 000 unique TCR beta chain sequences were randomly selected to make up the negative training dataset and 100 000 unique TCR beta chain sequences were randomly selected to represent the background dataset needed for the calculation of the baseline prediction rate. We ensured that the negative training and background dataset contained TCRs with different CDR3 beta sequences.

Collection of TCR Datasets From Infectious-Disease Studies
To evaluate the use of our epitope-specific models for the functional annotation of TCR repertoires, we collected TCR data from two independent yellow fever virus (YFV) vaccination studies (1, 2) and one cytomegalovirus (CMV) study (15). From the first YFV study (1), pre-(day 0) and post-vaccination (day 14) bulk TCR beta repertoire data (i.e., derived from peripheral blood mononuclear cells or PBMC) was collected from nine volunteers in the immunoSEQ Analyzer format. This study also reported TCRs from activated T cells in the post-vaccination repertoires for each of the volunteers ["CD3 + CD8 + CD14 − CD19 − CD38 + HLA-DR + Ag-experienced, activated effector T cells" (1)]. From the second YFV study (2), pre-(day 0) and post-vaccination (day 15) bulk TCR repertoire data (i.e., derived from PBMC) was collected from three pairs of twins 1 In case biological replicates were present for a volunteer at a given time point, these were merged. As the TCR repertoires from this study were reported in a MiXCR file format that did not follow the standard TCRex input requirements (i.e., two additional columns reporting the best V gene and the best J gene, respectively, for each TCR sequence), we parsed these files and selected the most likely V/J gene for every TCR sequence based on the provided alignment score. For those TCRs with several V/J genes tied for the highest score, only the first one was retained. The second YFV study (2) also reported a list of vaccine-associated TCR beta sequences that were identified in the volunteers using a statistical analysis based on clonal expansion. This list was used to compare the number of identified TCRs between TCRex and the YFV study. From the CMV study (15), we gathered a dataset 1 https://github.com/mptouzel/pogorelyy_et_al_2018 of 164 TCRs that were reported to be CMV-associated following a statistical analysis of a large group (>600) of healthy CMV+ and CMV-volunteers.

Model Training and Performance Evaluation
For each epitope present in the collected training dataset, we trained a random forest model to identify epitope-binding TCRs in a TCR repertoire dataset. These machine learning models have the ability to identify TCRs from multiple epitope-specific clusters (Supplemental Material S2). Training of epitope-specific prediction models was based on the method established in De Neuter et al. (12). This method was shown to perform comparably to other state-of-the-art classifiers in a recent independent study (22). In brief, the amino acid sequences of the CDR3 regions of the TCR beta chains were converted into physicochemical features and the beta chain's V/J genes and families were one-hot encoded. For each epitope in the collected training data, a random forest classifier was trained with 100 individual decision trees using the scikit-learn library (23). We sampled 10 times more negative training data than positive training data from the negative control set of 250 000 sequences, in order to mimic the natural imbalance of epitopespecific T cells in full TCR repertoires. TCR beta chain sequences that occurred in both the positive training and negative training dataset were removed from the negative training dataset to avoid ambiguously labeled TCR beta chain sequences. Given a TCR beta chain sequence, each model returns the confidence value of binding a specific epitope. All models were evaluated using a stratified 5-fold cross-validation strategy. With this validation strategy we obtained the receiver operating characteristic (ROC) curve, precision-recall (PR) curve, the balanced accuracy, the area under the ROC curve (AUC) value and the average precision value for each classifier as implemented in the scikit-learn library (23). Models that had poor AUC (<0.7) or poor average precision (<0.35) values were excluded from the model collection.

Controlling the Baseline Prediction Rate
The class probability score for a TCR beta chain sequence is specific for each epitope model and does not take false positive predictions into account. To control the number of false positives, we provide each TCR sequence with a baseline prediction rate (BPR) value along with its class probability score. This BPR value represents the fraction of epitope-specific TCRs that can be detected in a background dataset (i.e., a control dataset representing a normal, healthy TCR repertoire) at a certain confidence cut-off. Since background datasets are expected to contain only a limited number of epitope-specific TCRs, this BPR can also be used as an estimate for the false positive rate (FPR). For every TCR sequence, its class probability score is compared to the class probability scores of the 100 000 unique background TCR beta chain sequences. A BPR value is then calculated for each TCR sequence as the fraction of randomly selected background sequences with a class probability score greater than or equal to the score of the TCR sequence. Thus, for a desired BPR threshold, a confidence cut-off c can be calculated so that: Where p is the prediction function and B is the background dataset. This allows us to filter the prediction results with a custom threshold across different epitope models, by using a calculated cut-off c that is specific for each model. Furthermore, this allows a direct interpretation of the cut-off with regards to the expected number of false positive hits for a typical TCR sequence dataset.

Epitope Specificity Enrichment Analysis
In addition to the detection of epitope-specific TCR beta chain sequences, we implemented a method to identify those epitopes for which we find a significantly higher number of unique TCR sequences in a target repertoire than expected in a representative control TCR repertoire. This analysis is performed by comparing the abundance of the epitope-specific TCR sequences in the TCR repertoire to a hypothetical dataset of randomly selected TCRs, i.e., the background dataset. TCRex does not use any information on the frequency of the TCR clonotypes in this comparison: only the number of unique predicted epitope-specific TCRs are considered. This enrichment analysis strategy uses a BPR-based enrichment threshold that represents the percentage of unique epitope-specific TCRs identified in the background repertoire at the chosen BPR threshold. For each epitope of interest, a onesided exact binomial test is used to compare the number of TCR sequences with a BPR score smaller or equal to the specified enrichment threshold. More specifically, we test whether we can reject the null hypothesis that the observed number of identified TCRs follows the binomial distribution X ∼ B n, p and thus accept the alternative hypothesis that the abundance of TCRs specific for the epitope in the studied TCR repertoire exceeds the chosen enrichment threshold. Here, X is the number of unique identified TCRs specific for the studied epitope at a chosen enrichment threshold, n the size of the TCR repertoire and p the chosen enrichment threshold. The latter represents the percentage of epitope-specific TCRs that can be identified in a hypothetical background dataset at a specific BPR threshold. Since one false positive hit (i.e., a TCR sequence with a BPR score below the BPR threshold that does not recognize the epitope) is possible at every BPR threshold, we restrict the statistical analysis to those epitopes for which at least two epitope-specific TCR sequences are found.

Web Tool
The trained prediction models were collected and made available as a web application, called TCRex, which is accessible at tcrex.biodatamining.be. This web tool was built using the Django web framework (24) and makes use of several Python libraries including scikit-learn (23), SciPy (25), Pyteomics (26), Altair (27), NumPy (28), Pandas (29), and Dask (30). It provides a userfriendly web interface to predict epitope binding for human TCR beta chain sequences and perform epitope specificity enrichment analyses for various viral and cancer epitopes. It is compatible with several common TCR sequence data formats, namely the immunoSEQ Analyzer formats 2 and MiXCR (31) format, and the TCRex tab-delimited format which contains the V/J genes and CDR3 beta amino acid sequences for each TCR. More information concerning the use of the web tool is provided in Supplemental Material S3. In this paper, the TCRex web tool was used with the default BPR threshold of 0.01% for all analyses. This value is based on the assumption that roughly 1-100 epitope-specific TCRs can be found for every million TCRs in a healthy CD8 + TCR repertoire (32,33).

Overview of the Epitope-Specific Models
The following performance metrics were calculated for each epitope-specific model using a 5-fold stratified crossvalidation approach: the balanced accuracy, the area under the ROC curve (AUC) value and the average precision. These are shown in Supplemental Material S4. Only models with an AUC of at least 0.7 and an average precision of at least 0.35 were included in the final model collection, and thus in the current version (version 0.3.0) of the TCRex web tool. Forty nine out of 54 prediction models passed these inclusion criteria. Out of these 49 models, 44 were trained for viral epitopes and 5 for cancer epitopes (Figure 1). The MHC background for the epitopes is shown in Supplemental Material S5. It is worth to note that the  detection of epitope-specific TCRs in this paper solely relies on these prediction models without any additional experimental validation. Therefore, all identified epitope-specific TCRs should be interpreted as TCRs predicted to bind a specific epitope. However, some of these TCRs might also be present in the training dataset, which is a collection of experimentally identified TCRs.

Evaluation of the Prediction Models Using a Leave-One-Study-Out Validation Approach
To evaluate how our epitope-specific models handle data from new studies, we designed a validation approach wherein all data of one study was left out during the training of the models and served as a holdout validation set. We selected viral epitopes with a large set of epitope-specific training data (at least 1,000 unique TCRs) from various studies. For each of these epitopes, one study was removed from the training dataset (Supplemental Material S6). We ensured that this holdout dataset did not contain any of the training TCRs (i.e., TCRs with the same V/J genes and CDR3 beta sequence as training TCRs). After training the epitope-specific models, predictions on the holdout dataset were made using a strict BPR threshold of 0.01%. For each epitope-specific model, we expected a maximum of one false positive when analyzing these small datasets with a 0.01% BPR threshold, as these datasets varied in size from 66 to 124 TCRs. To verify this assumption, the same models were used on a negative control dataset, i.e., a dataset lacking TCRs specific for the epitope of interest. For this purpose, we randomly selected TCRs from the training dataset of the heteroclitic cancer epitope ELAGIGILTV up to the size of the holdout validation datasets. We also ensured that this negative test set did not contain any training TCRs (i.e., TCRs sharing both their V/J genes and CDR3 beta sequence with a TCR from the training dataset), as this might be the case for putative cross-reactive TCRs. Standard performance metrics including accuracy, sensitivity, specificity and precision were calculated for each of the models (Supplemental Material S7). A summary of the results is shown in Table 1. The majority of the identified TCRs did not match any of the training CDR3 beta sequences, which demonstrates the ability to find TCRs with unseen CDR3 beta sequences. Because of the strict BPR threshold, we achieved a very high specificity and precision, but a limited sensitivity. For both the GILGFVFTL and GLCTLVAML models, the specificity and precision reached the maximum score of 1 which corresponds to zero false positives. For the NLVPMVATV model, two false positives were detected, which does not align with the chosen BPR threshold. One of these TCRs, however, shares its CDR3 beta sequence and J gene with a NLVPMVATV-specific training TCR. This may indicate the presence of cross-reactive TCR recognition patterns for both the cancer and the CMV epitope. If this is the case, the TCR could be considered a true positive. The presence of epitope-specific TCRs in background datasets is a potential issue, as epitope-specificity studies mainly report TCRs binding to the epitope of interest and not the non-binding TCRs. Since there is no public information available on non-binding TCRs, our control datasets might contain a small number of epitope-specific TCRs. This could have led to an overestimation of the number of false positives in this analysis. Overall, the models performed well and did not seem to be subject to any study specific biases.

The Number of Unique Epitope-Specific TCRs Increases Post-vaccination
Previous studies have already demonstrated the feasibility of identifying T cell clones that have proliferated between time points in TCR repertoire data (1,2). However, such methods do not provide insight into the targeted epitopes.
Considering the fact that T cells proliferate upon binding with vaccine-associated epitopes, we hypothesized that postvaccination repertoires contain an elevated level of unique epitope-specific TCRs compared to pre-vaccination repertoires.
To evaluate the difference in the number of unique epitopespecific TCRs (i.e., each TCR clonotype is considered once) captured within the sequenced repertoire before and after vaccination, we reanalyzed published TCR repertoires from two independent YFV vaccination studies. In the first study, nine healthy volunteers with unknown HLA types received the YF-17D vaccine (1). In the second study, the YF-17D vaccine was used to vaccinate three pairs of HLA-A * 02:01 positive twins (2). For each of the participants, TCR beta repertoire data from PBMC samples taken before and 2 weeks after vaccination were available. At the time of analysis, TCRex featured one YFV epitope prediction model: the immunodominant LLWNGPMAV peptide. We used this model to identify unique LLWNGPMAVspecific TCRs in both pre-and post-vaccination repertoires. To control the number of false positives, a strict BPR threshold of 0.01% was applied. As can be seen in Figure 2, the percentage of unique predicted LLWNGPMAV-specific TCR sequences increased post vaccination for the majority of the volunteers [one-sided Wilcoxon signed-rank test, p = 0.03711 for (1) and p = 0.01563 for (2)].

Comparison of the Identification Level With Techniques Relying on Clonal Expansion
The detection of vaccine-associated T cells with our new method is restricted to the recognition of epitope-specific T cells for which trained prediction models are available. Because only one YFV epitope model passed the quality filters at the time of analysis, we expected to detect a considerably smaller number of vaccine-associated TCRs in comparison with a differential analysis between time points. To evaluate the identification level of our new strategy, we compared our results from the post-vaccination repertoires to the outcome of the original   Table 2). These results could indicate that our approach might be able to detect epitope-specific TCRs that might be missed by other techniques. However, it is important to point out that the list of epitope-specific TCRs reported by TCRex may contain a small number of non-specific TCRs due to the presence of false positive predictions. The lower identification rate of TCRex can be explained by its limited sensitivity at strict BPR thresholds, as demonstrated by the leave-one-study-out validation, and its restriction to one YFV epitope. Thus, any T cells reacting with other vaccinerelated epitopes that were detected as expanded TCR clones in Pogorelyy et al. (2), would be missed by TCRex. On the other hand, as clonal expansion techniques do not take into account the epitope specificity of the TCRs, they might also falsely report bystander TCRs or TCRs associated with other non-vaccine related immune responses. Indeed, in the original YFV vaccine study three different experimental assays (i.e., a multimer assay using the LLWNGPMAV epitope, evaluation of the interferon gamma production following vaccine-stimulation and identification of activated CD8 + CD38 + HLA-DR + T cells) were carried out to assess whether the identified expanded T cell clones for one of the volunteers (S1) were vaccineassociated (2). About half of the expanded 773 TCR clones were experimentally validated, of which 68 clonotypes were found to be LLWNGPMAV-specific. This is less than the 170 epitope-specific TCRs identified with TCRex. Although tetramer assays are known to miss epitope-specific TCRs (6, 7), this low identification rate might also be a consequence of the fact that the multimer assay was performed 2 years after vaccination.

The Majority of the Identified LLWNGPMAV-Specific T Cells Are Not Present in Public Databases
Since the detection of epitope-specific T cells relies on a prediction model trained on previously identified TCRs recognizing the epitope, we assessed to what extent our models are able to identify new epitope-specific TCRs. For this, we compared the number of TCRex-identified LLWNGPMAVspecific T cells in the post-vaccination repertoires from the two YFV vaccine studies (1, 2) with the number of public TCRs (i.e., known LLWNGPMAV-specific TCR sequences present in the training data) in these repertoires. For both YFV studies, TCRex achieved a higher identification rate in comparison with the database search technique (Figure 3). The number of TCRs identified by both techniques is rather limited (Supplemental Material S8). TCRex does find some training TCRs in the repertoires, however, we can deduce that the bulk of the TCRs identified by TCRex were not present in our training data and thus not publicly known.

Identification of Enriched Vaccine-Associated TCR Repertoire Targets
In addition to the identification of epitope-specific TCRs in whole TCR repertoires, our newly developed strategy can also be used to detect enriched TCR repertoire targets. In essence, we detect epitopes for which more specific TCRs are present than expected based on the epitope-specific TCR occurrence in a healthy, non-vaccinated TCR repertoire. Here, we demonstrate the feasibility of this enrichment analysis by comparing the levels of LLWNGPMAV-specific TCRs before and after vaccination for the volunteers from the two YFV studies. For each of the volunteers, we evaluated whether their pre-and post-vaccination TCR repertoires contained more LLWNGPMAV-specific TCRs than expected in a background TCR repertoire, i.e., a TCR dataset representing a healthy, non-vaccinated TCR repertoire. As all the LLWNGPMAV-specific prediction analyses were performed   with a 0.01% BPR threshold, we expect that the abundance of LLWNGPMAV-specific TCRs in a background repertoire will not exceed 0.01%. Therefore, we assessed for each of the volunteers' TCR repertoires whether the percentage of epitopespecific TCRs in the TCR repertoires was significantly higher than 0.01% and thus were enriched in LLWNGPMAV-specific TCRs. The number of identified TCRs and the results of the epitope-specificity enrichment analyses for all volunteers from Pogorelyy et al. (2) are shown in Table 3. The majority of the post-vaccination TCR repertoires (4 out of 6) are enriched in LLWNGPMAV-specific TCRs, as opposed to the pre-vaccination TCR repertoires (2 out of 6). The enrichment in pre-vaccination repertoires might be explained by the use of an unrepresentative background dataset, which is a common problem addressed in the next section. A similar trend can be seen for the volunteers from Dewitt et al. (1). Here, the post-vaccination TCR repertoires from 3 of the 9 volunteers demonstrated an elevated level of LLWNGPMAV-specific TCRs while no volunteer showed an enrichment in their pre-vaccination TCR repertoire ( Table 4).
For the YFV study from Dewitt et al. (1) we additionally studied the abundance of LLWNGPMAV-specific TCRs in experimentally enriched TCR repertoires as Dewitt et al. reported the activated subset [i.e., "CD3 + CD8 + CD14 − CD19 − CD38 + HLA-DR + Ag-experienced, activated effector T cells" (1)] of the TCR repertoire for each of the nine volunteers 14 days after vaccination with the YFV-17D vaccine. For each of these activated TCR repertoires, LLWNGPMAV-specific TCRs were identified using a BPR threshold of 0.01% and compared with the abundance of specific TCRs in the unsorted PBMC samples at the same time point. The number of LLWNGPMAVspecific TCRs in the activated TCR dataset increased drastically in comparison to their unsorted PBMC samples for the majority of the volunteers (one-sided Wilcoxon signed-rank test, p = 0.001953) (Figure 4). The same trend can be seen in the results of the enrichment analysis, presented in Table 4, with 4 out of 9 of the activated TCR repertoires demonstrating an enrichment in LLWNGPMAV-specific TCRs whereas this is limited to 3 out of 9 for the unsorted post-vaccination repertoires.

The Importance of the Background Dataset in Enrichment Analyses
In the previous section, we showed how an enrichment in epitope-specificity in TCR repertoires can be detected by comparing the abundances of epitope-specific TCRs to a general background dataset with a predefined enrichment threshold. Here, we demonstrate how the selection of a representative background dataset can improve these enrichment analyses. For each of the volunteers from the two YFV studies (1, 2), we compared the abundance of LLWNGPMAV-specific TCRs in their post-vaccination repertoire to the abundance in their pre-vaccination repertoire. For this, we slightly adapted our general epitope specificity enrichment analysis: (1) the number of successes in the one-sided exact binomial test was set equal to the number of epitope-specific TCRs in the postvaccination repertoire that was identified with a 0.01% BPR threshold and (2) the enrichment threshold was set equal to the fraction of the number of identified TCRs in their prevaccination repertoire. A comparison between the two different enrichment analysis strategies demonstrates the importance of the use of a representative background dataset (Tables 3, 4). When the percentage of epitope-specific TCRs in the prevaccination repertoires is much lower than we generally expect in healthy TCR repertoires, the use of a standard background dataset might not discover the increase of epitope-specific TCRs in post-vaccination repertoires. This can be seen for volunteer P1 and P2 from study (2) ( Table 3). Conversely, when the percentage of epitope-specific TCRs in the pre-vaccination repertoires exceeds our expectations, the use of a background dataset could falsely indicate a significant increase of epitopespecific TCRs in post-vaccination repertoires. This can be seen for volunteer 2 from study (1) ( Table 4). In conclusion, the selection of the background dataset is an important step when performing enrichment analyses and depends on the research question and experimental set-up of the study.

Identification of Enriched Disease-Associated TCR Repertoire Targets
Previously, Emerson et al. introduced a new statistical method to discover disease-associated TCRs from TCR repertoire data (15). Briefly, they sequenced the TCR beta repertoires of a large group (more than 600) healthy CMV + and CMV − volunteers and set up a statistical test to identify public TCRs that were enriched in the healthy CMV + volunteers. In total, 164 TCRs were found to be associated with a positive cytomegalovirus serostatus of which three were matched with a CMV epitope through an additional validation experiment. Here, we use our enrichment analysis strategy to identify the CMV epitopes that are targeted by the TCRs in this dataset. In short, we searched for epitopes for which an elevated level of specific TCRs are present in this dataset, i.e., more than expected in a background TCR repertoire. To see whether we only find an enrichment for CMV-specific TCR sequences, and not for non-CMV epitopes, we performed binding predictions for all epitopes present in TCRex with a strict BPR threshold of 0.01%. In total, 8 out of 164 different TCRs were predicted to bind to one of the CMV epitopes ( Table 5). Two epitopes, both derived from CMV, were found to be significantly enriched in the dataset: NLVPMVATV (adjusted p = 7.1e-7, Benjamini-Hochberg adjusted p-value for two tests) and TPRVTGGGAM (adjusted p = 5.7e-09, Benjamini-Hochberg adjusted p-value for two tests), compared to a TCR background dataset where we would expect to identify 0.01% specific TCRs for each epitope with our selected BPR threshold. Additionally, we also found one YFV-specific TCR in the list of 164 CMV associated TCR sequences. This might be a false positive identification, where the LLWNGPMAV-specific model wrongly classified the TCR as epitope-binding. However, this TCR sequence shared an identical combination of a CDR3 beta sequence and J gene with TCR sequences linked with the YFV epitope in public databases and thus the model training set. It is therefore possible that this TCR is cross-reactive and recognizes both the immunodominant YFV epitope and a CMV epitope. It could also be that the identified TCR is not CMVreactive as the dataset reported by the original study represents TCR sequences that were statistically associated with the CMV seropositive volunteers which does not directly entail CMVreactivity. In summary, this analysis indicates the relevance of using prediction models to identify epitope-specific T cells and search for enriched T cell epitopes in a TCR beta dataset.

Evaluation of the Impact of MHC Bias in Epitope-Specific Prediction Models
Considering each epitope-specific model is based on a positive and negative training dataset, with, respectively, a very constrained and broad MHC background, the performance of the prediction models could be hampered by an inherent MHC bias. More specifically, the models are at risk of being MHC-specific, and thus predict the MHC binding property of a TCR sequence in addition to its epitope specificity. This could obstruct the identification of epitope-specific TCRs that bind MHC proteins other than those included in the training dataset. To assess the MHC-specificity of our prediction models, we studied the influence of the MHC bias on the prediction results. To this end, we collected epitope-specific TCR sequences from epitopes that are presented by more than one MHCI molecule. Out of the 49 epitopes available in TCRex, 8 were shown to be presented by different MHCI proteins (Supplemental Material S5). We selected the epitopes that were associated with two different HLA allele groups, i.e., their HLA allele names differ in the first two digits ( Table 6): TPGPGVRYPL and TPQDLNTML. For each of these epitopes, a model was trained on a dataset of epitopespecific TCRs that recognizes only one of the MHCI proteins.
Hereafter, predictions (0.01% BPR) were made on the remaining epitope-specific TCRs, which were associated with the other MHCI protein. TCRs with a different MHC background than the training dataset were identified for both epitopes ( Table 6). This indicates that these prediction models are MHC-agnostic.
To support this result, we performed an additional analysis where the HLA-B * 42:01 restricted models for the TPGPGVRYPL and TPQDLNTML epitopes were used to predict the specificity of TCRs with the same HLA-B * 42:01 MHC background but different epitope specificity. In total, TCRex contained 4 epitopes associated with HLA-B * 42:01: LPPIVAKEI (62 TCRs), IIKDYGKQM (54 TCRs), HPKVSSEVHI (75 TCRs), and FPRPWLHGL (120 TCRs). Using the HLA-B * 42:01 restricted models (0.01% BPR), none of the TCRs associated with these 4 epitopes were predicted to bind TPGPGVRYPL or TPQDLNTML, which shows that the MHC binding property of a TCR was not learned by these prediction models. Taken together, the results of this study suggest that the MHC-bias in the prediction models does not necessarily have a major impact on the prediction results.

DISCUSSION
Traditional methods to identify epitope-specific T cells require targeted experiments, which are known to fail to identify a considerable amount of reactive T cells (6,7). Approaches to solve this problem computationally have been proposed, but they are hard to translate to full repertoires due to their inherent prediction error. To overcome these limitations, we propose a novel approach built on the identification of epitope-specific T cells. Our method uses epitope-specific prediction models trained on publicly available TCR beta sequence data to assess the probability of binding the epitope of interest and corrects for prediction errors by using a stringent BPR threshold. As such, it allows for the easy identification of epitope-specific T cells in full TCR repertoires and consequently the detection of enriched TCR repertoire targets. In the current implementation, it is based on a random forest classifier trained on common physicochemical properties. The implemented method currently only uses TCR beta chain information, as this is the bulk of the data that is currently available. As linked beta-alpha TCR data becomes more common (34,35), the approach can be easily extended to include alpha chain sequence data.
This novel approach was validated on several independent datasets to address different research questions. Firstly, we examined whether we could detect a difference in the presence of epitope-specific TCRs in pre-and post-vaccination repertoires. Considering that vaccines are expected to activate T cells upon binding with vaccine-associated epitopes, leading to a targeted proliferation of these T cells, we expected to find more unique epitope-specific T cells after vaccination. The analysis of the two YFV studies (1, 2) seems to support this hypothesis as an increase in predicted LLWNGPMAV-specific T cells is found in the post-vaccination repertoires from both studies.
One limitation of our method is the restriction on the number of epitopes that can be studied as the models rely on the presence of sufficient epitope-specific TCR data from targeted experiments. Consequently, it is only possible to inspect the specificity of TCR repertoires for epitopes which have already been studied. However, considering our analysis of the two YFV studies was restricted to one epitope, we achieved a relatively high identification rate in comparison with the method of Pogorelyy et al. (2). This might be a result of the immunodominance of the studied YFV epitope. The majority of the TCRs identified with TCRex, however, did not overlap with the results in Pogorelyy et al. (2). This could indicate that we might be able to find epitope-specific TCRs that are missed by other methods. For example, it is possible that some epitope-specific TCRs expanded less in comparison to others and were therefore not detected using the technique relying on clonal expansion. Though more extensive experimental validation would be required, TCRex has the potential to be used as a complementary technique to existing TCR data analysis strategies by annotating single TCRs with their epitope target and/or by discovering epitope response trends in (activated) TCR repertoires.
Finally, we demonstrated how our approach enables the search for enriched epitope specificity in TCR repertoire datasets using our built-in enrichment analysis test. In short, we used an enrichment test to check whether a TCR repertoire contains significantly more TCRs recognizing a specific epitope than a random background TCR dataset. As a proof of concept, we compared the elevated level of LLWNGPMAV-specific TCRs in pre-and post-vaccination TCR repertoires and analyzed a dataset enriched in CMV-associated TCRs. The former analysis shows the importance of the selection of the right background dataset when performing enrichment analyses, which we discuss in more detail in the next section. For the latter, our enrichment analysis method easily detected a significant increase in epitope-specific TCRs for two CMV epitopes. No enrichment was found for other viral epitopes, which illustrates the robustness of our approach.
While this novel approach allows a full repertoire exploration, several potential sources of bias may still impact the results. A first problem lies in the diversity of epitope-specific repertoires: various groups of T cells can be found in these repertoires, each carrying a specific pattern (11,13,14). Patterns that are present in more T cell clonotypes, will be picked up more efficiently in TCR repertoire studies. This pattern bias will therefore be present in the training dataset. Consequently, T cells carrying common patterns might be detected more easily by the prediction models while T cells with rare patterns are more likely to be missed. Although it is unlikely that all specificity patterns can be discovered with a prediction model, it could still be possible to detect the general trends following a disease or treatment, as these might influence the frequency of both TCRs with common and rare patterns.
Another major concern was the effect of the MHC bias in the training data on the prediction models. This bias could lead to an underestimation of the epitope specificity of TCR repertoires from individuals whose HLA types do not match the MHC background of the prediction models. In addition, it could cause an increase of false positive predictions for individuals with matching HLAs. Our analysis on the impact of the MHC bias demonstrated that the MHC background of epitope-specific TCRs is not likely to hamper the prediction results. This indicates that the epitope-specific TCR patterns might transcend the MHC background of epitopes, which is in line with (11). However, caution needs to be taken when extending this conclusion to all other prediction models as the relative importance between the epitope-specific and MHC-specific TCR patterns could vary between the different models.
Still, even if the epitope-specific models are not MHCagnostic, potential biases in the enrichment results can be avoided by comparing the abundance of epitope-specific TCRs in a TCR repertoire to a background dataset with a matching MHC background. In general, we recommend the use of a representative background dataset (e.g., background data from the same subject using the same TCR sequencing and analysis set up) when available as any other biases between the TCR repertoire and the background dataset might affect the enrichment results. This is demonstrated in the enrichment analyses of the YFV studies, where the replacement of the standard background dataset by the pre-vaccination repertoires gave better results. Nevertheless, if no suitable background dataset is available, the enrichment analysis, as implemented in the TCRex web tool, still provides an easy and fast exploration of a TCR dataset, since it gives an initial overview of the possible TCR targets that may be enriched in this dataset. As bulk sequenced TCR repertoires are commonly not annotated with their epitope specificities and require laborious additional experiments to discover these targets, this enrichment analysis can inform subsequent experimental efforts.
Taken together, we have developed a novel strategy to study the epitope-specificity of full TCR repertoires. Our approach relies on the identification of epitope-specific T cell receptor sequences using epitope-specific prediction models. This enables the prediction of human epitope-specific TCR sequences and the identification of enriched TCR repertoire targets. We have embedded our method into a user-friendly web tool, called TCRex, which is available at tcrex.biodatamining.be. This tool will facilitate the study of the epitope specificity of TCR sequences that underly immune responses in different diseaseassociated studies.

DATA AVAILABILITY STATEMENT
The training dataset of the TCRex web tool will be made available by the authors upon request. The TCR repertoire datasets analyzed in this study were obtained from three different