TCRex: a webtool for the prediction of T-cell receptor sequence epitope specificity

Identifying the epitope targets of T-cell receptor (TCR) repertoires is an important part of many TCR repertoire studies. To date, the identification of epitope-specific TCR sequences still relies on time-consuming epitope binding experiments. Recently, we showed that the prediction of epitope-TCR interaction is possible using a random forest model. We implemented this method in a webtool called TCRex. TCRex is the first tool that enables the prediction of TCR-epitope recognition. It allows users to upload TCR sequences and either predict interaction with multiple known cancer or viral epitopes or train new prediction models for new epitopes. TCRex is freely available for academic use at tcrex.biodatamining.be.


Introduction
T-cells form an important part of the adaptive immune system as they can recognize potentially pathogenic or aberrant peptides (epitopes), presented either on the cell surface of nucleated host cells or by professional antigen-presenting cells, and induce an immune response. The T-cell receptor (TCR) molecule is responsible for the recognition of the epitope. Each TCR protein is encoded by a genomic region that undergoes non-homologous recombination during T-cell maturation. The randomness of the recombination process creates many different TCR proteins, most of which are unique to a single T-cell clone, and this diversity allows the recognition of different epitopes. Epitope binding by the TCR is a critical step for the activation of targeted immune responses. Although multiple immunoinformatics tools exist to predict epitopes and their binding with MHC molecules (e.g. NetCTLpan (Stranzl et al., 2010), CRFMHC (Meysman et al., 2015) and Immune epitope database analysis resource (IEDB-AR) (Zhang et al., 2008)), we still lack useful tools for the prediction of epitope recognition by TCRs.
Here, we present the first tool to analyze TCR sequences and predict the likelihood that they target specific epitopes. This tool is based on the principle that similar TCR sequences often target the same epitope and that machine learning techniques can be used to learn the commonalities shared by epitope-specific TCR sequences (Dash et al., 2017; Glanville et al., 2017). It builds on our prior work on the feasibility of predicting TCR-epitope recognition using TCR beta sequences. In that study, we showed that a random forest classifier trained to predict TCR-epitope interactions from TCR amino acid physicochemical properties can achieve a high accuracy. We extended this work into a complete framework trained on a large dataset containing different viral and cancer epitopes and made it freely available as a webtool called TCRex.

Data collection
Epitope-specific TCR sequences were downloaded from the manually curated catalogue of pathology-associated T-cell receptor sequences (McPAS-TCR) (Tickotsky et al., 2017) and the VDJ database (VDJdb) (Shugay et al., 2018) on the 9th of April 2018. Data collection is currently restricted to TCR beta sequences from human CD8+ T-cells, as these constitute the bulk of the available data. Non-canonical TCR sequences (i.e. not starting with cysteine or not ending with phenylalanine) were removed. Control CD8+ TCR sequences were collected from Seay et al. (2016) and Thome et al. (2016). For both the epitope-specific and control TCR sequences, the CDR3 beta amino acid sequence and the corresponding V/J genes and families were retrieved.
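The canonical-sequence filter described above can be illustrated with a minimal Python sketch. It assumes that CDR3 sequences are given as one-letter amino acid strings; the function names and example sequences are hypothetical and are not taken from the TCRex code base.

```python
def is_canonical_cdr3(cdr3: str) -> bool:
    """Return True if the CDR3 starts with cysteine (C) and ends with phenylalanine (F)."""
    seq = cdr3.strip().upper()
    return len(seq) >= 2 and seq.startswith("C") and seq.endswith("F")


def filter_canonical(cdr3_sequences):
    """Keep only canonical CDR3 beta sequences, preserving the input order."""
    return [s for s in cdr3_sequences if is_canonical_cdr3(s)]


if __name__ == "__main__":
    # Illustrative sequences only: the second lacks the leading C,
    # the third does not end with F.
    example = ["CASSLGQAYEQYF", "ASSLGQAYEQYF", "CASSLGQAYEQYV", "CASSPGTGELFF"]
    print(filter_canonical(example))
```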

Model training and performance evaluation
Epitope-specific prediction models were trained in the same manner as in our prior work. In brief, the amino acid sequences of the CDR3 regions of the TCR beta chains were converted into physicochemical features and the V/J genes and families were one-hot encoded. For each epitope with at least 30 epitope-specific sequences, a random forest classifier was trained with 100 individual decision trees. Given a TCR beta sequence, these models give the probability of binding a specific epitope. All models were evaluated using a repeated subsampling strategy to obtain the receiver operating characteristic (ROC) curve, the precision-recall (PR) curve, the accuracy, the mean area under the ROC curve (AUC) and the mean PR value. Models with poor AUC (< 0.7) or PR (< 0.35) values were excluded from the final webtool. In total, 43 prediction models with a sufficiently high performance were retained for the initial release. For each of these models, classification thresholds were determined for false positive rates (FPR) of 1%, 0.1% and 0.01%. These threshold values were calculated by predicting the class probability scores for 100 000 randomly selected control TCR beta sequences. The 1%, 0.1% and 0.01% FPR thresholds were set to the 1000th, the 100th and the 10th highest class probability score, respectively.
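The threshold rule above can be sketched in a few lines of Python. For N control scores and a target FPR f, the threshold is the (N x f)-th highest control score, so that with 100 000 controls the 1%, 0.1% and 0.01% thresholds are the 1000th, 100th and 10th highest scores. The function name and the synthetic scores below are assumptions made for illustration; they are not part of TCRex.

```python
def fpr_thresholds(control_scores, fprs=(0.01, 0.001, 0.0001)):
    """Map each target FPR to the k-th highest control score, with k = N * FPR.

    With a 'score >= threshold' decision rule and no tied scores, exactly k of
    the N control sequences are called positive, giving an empirical FPR of k/N.
    """
    ranked = sorted(control_scores, reverse=True)
    n = len(ranked)
    thresholds = {}
    for fpr in fprs:
        k = round(n * fpr)  # rank of the score used as threshold (1-based)
        if k < 1:
            raise ValueError(f"too few control scores ({n}) for FPR {fpr}")
        thresholds[fpr] = ranked[k - 1]
    return thresholds


if __name__ == "__main__":
    # Synthetic, evenly spaced control scores in [0, 1); real scores would be
    # the class probabilities predicted for the control TCR beta sequences.
    scores = [i / 100000 for i in range(100000)]
    for fpr, threshold in fpr_thresholds(scores).items():
        print(fpr, threshold)
```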

Webtool
TCRex provides a user-friendly web interface to predict epitope binding for human TCR beta sequences. An overview of the general workflow is presented in figure 1. To start a prediction analysis, users can upload a TCR data file. Different file formats are supported by TCRex, including the immunoSEQ (https://www.adaptivebiotech.com/immunoseq) and MiXCR (Bolotin et al., 2015) output formats. In addition, we propose a very simple tab delimited format that includes the CDR3 amino acid sequences and the V/J genes for all TCR beta sequences following the international ImMunoGeneTics information system (IMGT) notation (Lefranc et al., 2015). In case no V/J gene information is available, users are advised to provide the corresponding V/J families.
Fig. 1. Overview of the TCRex workflow. TCR-epitope interaction predictions start after uploading a TCR data file and selecting the epitopes of interest. If the latter are not available in the database, users can train new prediction models by uploading epitope-specific TCR sequences. After choosing the classification threshold, the prediction results can be downloaded as a CSV file together with the performance metrics of the prediction model in case a new classifier was trained.
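As an illustration of the simple tab delimited input described above, the following Python sketch reads such a file and derives the V/J family from each IMGT gene name (e.g. TRBV7-9 belongs to family TRBV7). The column names CDR3_beta, TRBV_gene and TRBJ_gene, as well as both function names, are assumptions made for this example; the TCRex website should be consulted for the exact header it expects.

```python
import csv
import io


def gene_to_family(gene: str) -> str:
    """Derive the gene family from an IMGT gene name.

    The allele suffix (after '*') and the gene number (after '-') are dropped,
    so 'TRBV7-9*01' becomes 'TRBV7'; a name without '-' is returned unchanged.
    """
    name = gene.split("*", 1)[0].strip()
    return name.split("-", 1)[0]


def read_tcr_table(handle):
    """Read a tab delimited TCR table into a list of dictionaries."""
    reader = csv.DictReader(handle, delimiter="\t")
    records = []
    for row in reader:
        records.append({
            "cdr3": row["CDR3_beta"].strip(),   # hypothetical column name
            "v_gene": row["TRBV_gene"].strip(),  # hypothetical column name
            "j_gene": row["TRBJ_gene"].strip(),  # hypothetical column name
            "v_family": gene_to_family(row["TRBV_gene"]),
            "j_family": gene_to_family(row["TRBJ_gene"]),
        })
    return records


if __name__ == "__main__":
    # Illustrative in-memory file with a single TCR beta sequence.
    demo = io.StringIO(
        "TRBV_gene\tCDR3_beta\tTRBJ_gene\n"
        "TRBV7-9\tCASSLGQAYEQYF\tTRBJ2-7\n"
    )
    print(read_tcr_table(demo))
```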
After uploading the input file, one or more epitopes can be selected from the database. In this first release, prediction models for 43 different epitopes are available, including 38 viral and 5 cancer epitopes. With the development and improvement of new TCR sequencing techniques and the rising interest in epitope-specific TCR repertoire analysis, we expect this number to grow rapidly in the near future. The database will be updated every six months with new epitope data made available in the scientific literature. It is also possible to make predictions for epitopes that are not available in the database. To this end, users can upload their own dataset of epitope-specific TCR sequences, which is then used to train new prediction models for their own use. These custom models are hidden from other users and are removed when the task is finished.
When all the required information is submitted, the webtool redirects the user to a webpage that gives an overview of all the steps in the prediction process and the current status of the analysis. Once the analysis is finished, a webpage with the prediction results is returned. This interactive results summary allows the user to select the desired classification threshold. By default, a threshold with 0.01% FPR is used, but this can easily be changed to any user-defined class probability threshold. After the analysis, the prediction results can be downloaded as a CSV file. For every TCR sequence provided by the user and every selected epitope, this CSV file contains a score that represents the probability of epitope-TCR recognition. TCR sequences with a score greater than or equal to the chosen classification threshold are considered to bind the epitope of interest. These epitope-TCR pairs are indicated in the output file with an asterisk.
If the user trained a new prediction model, the results are presented with a summary of the performance metrics along with the ROC and PR curves and a visualization of the important features. All results are kept available for 48 hours.
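The thresholding rule for the results file can be sketched as follows: every TCR-epitope pair whose score is greater than or equal to the chosen threshold is marked with an asterisk. This is a minimal illustration only; the column layout, the function name and the example scores are assumptions and do not reproduce the actual TCRex output format.

```python
import csv
import io


def write_predictions(rows, threshold, handle):
    """Write (cdr3, epitope, score) rows as CSV and flag predicted binders.

    A pair is flagged with '*' when its score is greater than or equal to the
    chosen classification threshold.
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["cdr3", "epitope", "score", "binder"])  # hypothetical header
    for cdr3, epitope, score in rows:
        flag = "*" if score >= threshold else ""
        writer.writerow([cdr3, epitope, f"{score:.4f}", flag])


if __name__ == "__main__":
    # Made-up scores for illustration; real scores come from the trained models.
    predictions = [
        ("CASSLGQAYEQYF", "GILGFVFTL", 0.95),
        ("CASSPGTGELFF", "GILGFVFTL", 0.20),
    ]
    buffer = io.StringIO()
    write_predictions(predictions, threshold=0.9, handle=buffer)
    print(buffer.getvalue())
```

Because the comparison is inclusive, a score exactly equal to the threshold is flagged, which matches the "greater than or equal to" rule stated in the text.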

Conclusion
TCRex is the first toolbox enabling the prediction of epitope-specific T-cell receptor sequences. Predictions can be made for various cancer and viral epitopes present in the TCRex database or for new epitopes by training new prediction models. TCRex is freely available for academic use as a user-friendly webtool and can be used for the identification of T-cell receptor repertoire targets.