<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="brief-report" dtd-version="2.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Immunol.</journal-id>
<journal-title>Frontiers in Immunology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Immunol.</abbrev-journal-title>
<issn pub-type="epub">1664-3224</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fimmu.2022.1014256</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Immunology</subject>
<subj-group>
<subject>Perspective</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>On TCR binding predictors failing to generalize to unseen peptides</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Grazioli</surname>
<given-names>Filippo</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="author-notes" rid="fn001">
<sup>*</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1814479"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>M&#xf6;sch</surname>
<given-names>Anja</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/678865"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Machart</surname>
<given-names>Pierre</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1950233"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Li</surname>
<given-names>Kai</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1970961"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Alqassem</surname>
<given-names>Israa</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1998203"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>O&#x2019;Donnell</surname>
<given-names>Timothy J.</given-names>
</name>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Min</surname>
<given-names>Martin Renqiang</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="author-notes" rid="fn001">
<sup>*</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1018436"/>
</contrib>
</contrib-group>
<aff id="aff1">
<sup>1</sup>
<institution>Biomedical AI Group, NEC Laboratories Europe</institution>, <addr-line>Heidelberg</addr-line>, <country>Germany</country>
</aff>
<aff id="aff2">
<sup>2</sup>
<institution>Machine Learning Department, NEC Laboratories America</institution>, <addr-line>Princeton, NJ</addr-line>, <country>United States</country>
</aff>
<aff id="aff3">
<sup>3</sup>
<institution>Division of Hematology and Medical Oncology, Icahn School of Medicine at Mount Sinai</institution>, <addr-line>New York, NY</addr-line>, <country>United States</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>Edited by: Matthew Call, The University of Melbourne, Australia</p>
</fn>
<fn fn-type="edited-by">
<p>Reviewed by: Pieter Meysman, University of Antwerp, Belgium</p>
</fn>
<fn fn-type="corresp" id="fn001">
<p>*Correspondence: Filippo Grazioli, <email xlink:href="mailto:sendtofilippo@gmail.com">sendtofilippo@gmail.com</email>; Martin Renqiang Min, <email xlink:href="mailto:renqiang@nec-labs.com">renqiang@nec-labs.com</email>
</p>
</fn>
<fn fn-type="other" id="fn002">
<p>This article was submitted to T Cell Biology, a section of the journal Frontiers in Immunology</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>21</day>
<month>10</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>13</volume>
<elocation-id>1014256</elocation-id>
<history>
<date date-type="received">
<day>08</day>
<month>08</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>05</day>
<month>10</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2022 Grazioli, M&#xf6;sch, Machart, Li, Alqassem, O&#x2019;Donnell and Min</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Grazioli, M&#xf6;sch, Machart, Li, Alqassem, O&#x2019;Donnell and Min</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Several recent studies investigate TCR-peptide/-pMHC binding prediction using machine learning or deep learning approaches. Many of these methods achieve impressive results on test sets that include peptide sequences also present in the training set. In this work, we investigate how state-of-the-art deep learning models for TCR-peptide/-pMHC binding prediction generalize to unseen peptides. We create a dataset including positive samples from IEDB, VDJdb, McPAS-TCR, and the MIRA set, as well as negative samples from both randomization and 10X Genomics assays. We name this collection of samples <italic>TChard</italic>. We propose the <italic>hard split</italic>, a simple heuristic for training/test split, which ensures that test samples exclusively present peptides that do not belong to the training set. We investigate the effect of different training/test splitting techniques on the models&#x2019; test performance, as well as the effect of training and testing the models using mismatched negative samples generated randomly, in addition to the negative samples derived from assays. Our results show that modern deep learning methods fail to generalize to unseen peptides. We provide an explanation of why this happens and verify our hypothesis on the <italic>TChard</italic> dataset. We then conclude that robust prediction of TCR recognition is still far from being solved.</p>
</abstract>
<kwd-group>
<kwd>tcr</kwd>
<kwd>peptide</kwd>
<kwd>MHC</kwd>
<kwd>binding prediction</kwd>
<kwd>interaction prediction</kwd>
<kwd>machine learning</kwd>
<kwd>TCR - T cell receptor</kwd>
</kwd-group>
<counts>
<fig-count count="2"/>
<table-count count="0"/>
<equation-count count="0"/>
<ref-count count="39"/>
<page-count count="8"/>
<word-count count="3622"/>
</counts>
</article-meta>
</front>
<body>
<sec id="s1" sec-type="intro">
<label>1</label>
<title>Introduction</title>
<p>Studying T-cell receptors (TCRs) has become an integral part of cancer immunotherapy and human infectious disease research (<xref ref-type="bibr" rid="B1">1</xref>&#x2013;<xref ref-type="bibr" rid="B4">4</xref>). TCRs are able to identify intracellularly processed peptides originating from infected or aberrant cells. TCRs are heterodimers consisting of an &#x3b1;- and a &#x3b2;-chain, which bind to peptides presented on the cell surface by either major histocompatibility complex (MHC) class I or class II molecules, depending on the cell type (<xref ref-type="bibr" rid="B5">5</xref>&#x2013;<xref ref-type="bibr" rid="B7">7</xref>). The binding of the TCR to the peptide-MHC (pMHC) complex occurs primarily (but not exclusively) at the complementarity-determining region 3 (CDR3). The CDR3&#x3b1; consists of alleles from the V and J genes; for the CDR3&#x3b2;, the D gene is additionally involved (<xref ref-type="bibr" rid="B8">8</xref>, <xref ref-type="bibr" rid="B9">9</xref>). These alleles can be recombined in a very large number of ways, which results in a high TCR repertoire diversity, essential for a broad T cell-based immune response (<xref ref-type="bibr" rid="B10">10</xref>). When a naive TCR is exposed to an antigen and activated for the first time, a memory T-cell population with this TCR may develop, which enables a long-lasting immune response (<xref ref-type="bibr" rid="B11">11</xref>, <xref ref-type="bibr" rid="B12">12</xref>).</p>
<p>Numerous recent studies investigate TCR-peptide/-pMHC binding prediction by applying different machine or deep learning methods (<xref ref-type="bibr" rid="B13">13</xref>&#x2013;<xref ref-type="bibr" rid="B24">24</xref>). Many of these studies use data from the Immune Epitope Database (IEDB) (<xref ref-type="bibr" rid="B25">25</xref>), VDJdb (<xref ref-type="bibr" rid="B26">26</xref>), and McPAS-TCR (<xref ref-type="bibr" rid="B27">27</xref>), which mainly contain CDR3&#x3b2; data and lack information on CDR3&#x3b1;. Such methods achieve high test performance when evaluated on test sets that belong to the same source as the training set. However, we show that these methods exhibit weak cross-dataset generalization, i.e., the models suffer from severe performance degradation when tested on a different dataset. For example, as shown in <xref ref-type="supplementary-material" rid="SM1">
<bold>Figure S1</bold>
</xref>, several machine learning models trained on McPAS-TCR perform poorly on VDJdb.</p>
<p>In this work, in order to evaluate the relevance of the available data for deep-learning-based TCR-peptide/-pMHC binding prediction, we aggregate binding samples obtained from IEDB, VDJdb, and McPAS-TCR. Non-binding data points are collected from IEDB, as well as from the 10X Genomics samples provided in the NetTCR-2.0 repository (<xref ref-type="bibr" rid="B22">22</xref>). We additionally consider a set of samples from (<xref ref-type="bibr" rid="B28">28</xref>, <xref ref-type="bibr" rid="B29">29</xref>), which are included in the NetTCR-2.0 GitHub repository; we refer to it as the MIRA set. A simple analysis of the class distribution (binding versus non-binding) of the resulting data points reveals that all TCR sequences exclusively appear in either binding or non-binding TCR-peptide/-pMHC pairs; no CDR3 sequence is observed in both positive and negative samples (see <xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1C</bold>
</xref>). Machine learning models trained naively on data with this class distribution are prone to learning undesirable inductive biases. In fact, our results in <italic>Section 4.1</italic> suggest that they tend to classify samples only as a function of the CDR3 sequences, which could be memorized.</p>
<fig id="f1" position="float">
<label>Figure&#xa0;1</label>
<caption>
<p>Separate class distributions for unique peptides (first row), CDR3<italic>&#x3b2;</italic> (second row), and CDR3<italic>&#x3b1;</italic> (third row) sequences in all (<italic>peptide, CDR3&#x3b2;, CDR3&#x3b1;</italic>) samples. A point on the <bold>
<italic>x</italic>
</bold>-axis represents one unique sequence of amino acids. The <bold>
<italic>y</italic>
</bold>-axis represents how frequently a given peptide, CDR3<italic>&#x3b2;</italic>, or CDR3<italic>&#x3b1;</italic> sequence appears in the considered samples. Sequences are sorted by count. <bold>(A)</bold> Negative samples only include randomized data points (i.e., no negative assays). <bold>(B)</bold> Negative samples include negative assays and randomized negative samples. <bold>(C)</bold> Negative samples only include negative assays.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fimmu-13-1014256-g001.tif"/>
</fig>
<p>For unbiased evaluation, we perform experiments on a dataset derived from the integration of the aforementioned samples. We name the resulting collection of samples <italic>TChard</italic>. To the best of our knowledge, this dataset constitutes the largest set of TCR-peptide/-pMHC samples available at the time this work is being written.</p>
<p>We perform deep learning experiments using two state-of-the-art models for TCR-peptide/-pMHC interaction prediction: ERGO II (<xref ref-type="bibr" rid="B23">23</xref>) and NetTCR-2.0 (<xref ref-type="bibr" rid="B22">22</xref>). ERGO II is a deep learning approach that adopts long short-term memory (LSTM) networks and autoencoders to compute representations of peptides and CDR3s. It can also handle additional input modalities, i.e., V and J genes, MHC, and T-cell type. NetTCR-2.0 employs a simple 1D CNN-based model, integrating peptide and CDR3 sequence information for the prediction of TCR-peptide specificity. Both models take peptides and CDR3s as input in the form of amino acid sequences. The selection of these two models is motivated by the intention to analyze two of the most successful classes of deep learning models: feed-forward convolutional networks (e.g., NetTCR-2.0) and recurrent neural networks (e.g., ERGO II, which includes an LSTM encoder). For this analysis, we do not consider methods that rely on external sources of information, e.g., TITAN (<xref ref-type="bibr" rid="B24">24</xref>), which performs pre-training on BindingDB (<xref ref-type="bibr" rid="B30">30</xref>).</p>
<p>We perform experiments on <italic>TChard</italic> and investigate the effect of different training/test splitting strategies. In contrast to previous works (<xref ref-type="bibr" rid="B22">22</xref>, <xref ref-type="bibr" rid="B23">23</xref>), we place special emphasis on testing the models on unseen peptides. We propose the <italic>hard split</italic>, a splitting heuristic meant to create test sets that only contain unseen peptides, i.e., peptides not included in the training set. In the context of neoantigen-based cancer vaccine development, neoepitopes exhibit enormous variability in their amino acid sequences; employing TCR binding predictors for this application requires robust generalization to unseen peptides. In accordance with recent findings (<xref ref-type="bibr" rid="B17">17</xref>), we show that evaluating the models on unseen peptides reveals poor generalization. In the <xref ref-type="supplementary-material" rid="SM1">
<bold>Supplementary Material</bold>
</xref>, we describe the training/test splitting strategies adopted by Montemurro et&#xa0;al. (<xref ref-type="bibr" rid="B22">22</xref>) and Springer et&#xa0;al. (<xref ref-type="bibr" rid="B23">23</xref>).</p>
</sec>
<sec id="s2">
<label>2</label>
<title>The <italic>TChard</italic> dataset</title>
<p>In this section, we describe the creation of the <italic>TChard</italic> dataset. All samples in <italic>TChard</italic> include a peptide and a CDR3&#x3b2; sequence, associated with a binary binding label. A subset of these samples may additionally have (i) CDR3&#x3b1; sequence information, and/or (ii) allele information of the MHC (class I or II) in complex with peptides. A sample therefore consists of a tuple of two to four molecules. When available, the V and J alleles for the &#x3b1;-chain and the V, D, and J alleles for the &#x3b2;-chain are also included. We refer to the binding tuples as <italic>positive</italic> and to the non-binding ones as <italic>negative</italic>.</p>
<sec id="s2_1">
<label>2.1</label>
<title>Dataset creation</title>
<p>First, we collect positive assays from the IEDB, VDJdb<xref ref-type="fn" rid="fn1">
<sup>1</sup>
</xref>, and McPAS-TCR databases. Additionally, we include the binding samples from the MIRA set (<xref ref-type="bibr" rid="B28">28</xref>, <xref ref-type="bibr" rid="B29">29</xref>), which are publicly available in the NetTCR-2.0 repository<sup>
<xref ref-type="fn" rid="fn2">
<sup>2</sup>
</xref>
</sup>.</p>
<p>Second, we include negative assays, i.e., non-binding tuples of molecules extracted from IEDB. Additionally, a set of negative samples extracted from the NetTCR-2.0 repository is considered; this is derived from 10X Genomics assays described by Montemurro et&#xa0;al. (<xref ref-type="bibr" rid="B22">22</xref>). In this work, we refer to the negative tuples derived from negative assays as the NA set.</p>
<p>Third, we filter the samples by amino acid sequence length, keeping only those with a peptide sequence length smaller than 16, a CDR3&#x3b1; sequence length between 7 and 21, and a CDR3&#x3b2; sequence length between 9 and 23. These filtering steps are meant to exclude a small portion of data points that present considerably longer amino acid sequences. Including them in the dataset would substantially increase the padding required by NetTCR-2.0, making computation more expensive.</p>
<p>Fourth, we generate negative samples <italic>via</italic> random recombination of the sequences found in the positive tuples. Building from the positive samples, we associate the peptides or pMHC complexes (when MHC allele information is available) with CDR3&#x3b1; and CDR3&#x3b2; sequences randomly sampled from the dataset, as done in previous studies (<xref ref-type="bibr" rid="B23">23</xref>). We sample twice as many mismatched negative samples as there are positive ones. We discard randomly generated samples that share the same (peptide, CDR3&#x3b2;) pair with any positive sample. In this work, we refer to the randomized negative tuples as the RN set. Additional remarks on invalid residues and CDR3 sequence homogenization are included in the <xref ref-type="supplementary-material" rid="SM1">
<bold>Supplementary Material</bold>
</xref>.</p>
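<p>The length filter and the random mismatching described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used to build <italic>TChard</italic>: the function names <monospace>keep_sample</monospace> and <monospace>sample_random_negatives</monospace> are hypothetical, samples are reduced to (peptide, CDR3&#x3b2;) pairs, the length bounds are assumed to be inclusive, and two mismatched CDR3&#x3b2; sequences are drawn per positive sample before colliding pairs are discarded.</p>
<preformat><![CDATA[
```python
import random


def keep_sample(peptide, cdr3b, cdr3a=None):
    """Length filter: peptide < 16, CDR3b in [9, 23], CDR3a in [7, 21] (if given)."""
    if len(peptide) >= 16:
        return False
    if not 9 <= len(cdr3b) <= 23:
        return False
    # CDR3a is optional; it is only checked when it is available.
    if cdr3a is not None and not 7 <= len(cdr3a) <= 21:
        return False
    return True


def sample_random_negatives(positives, ratio=2, seed=0):
    """Pair each positive peptide with `ratio` randomly drawn CDR3b sequences.

    Pairs that coincide with a known positive (peptide, CDR3b) pair are
    discarded, so the result can hold fewer than ratio * len(positives) items.
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    cdr3bs = [cdr3b for _peptide, cdr3b in positives]
    negatives = []
    for peptide, _cdr3b in positives:
        for _ in range(ratio):
            pair = (peptide, rng.choice(cdr3bs))
            if pair not in positive_set:
                negatives.append(pair)
    return negatives
```
]]></preformat>
<p>In this sketch the CDR3&#x3b2; sequences are drawn from the positive tuples only, which keeps the example self-contained; drawing from a larger pool would only change the list passed to <monospace>rng.choice</monospace>.</p>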
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Description of the data distributions</title>
<p>The full dataset, i.e., considering negative samples from both NA and RN, presents the following:</p>
<list list-type="simple">
<list-item>
<p>&#x2022; 528,020 unique (<italic>peptide, CDR3&#x3b2;</italic>) tuples, 385,776 of which are negative and 142,244 are positive;</p>
</list-item>
<list-item>
<p>&#x2022; 400,397 unique (<italic>peptide, CDR3&#x3b2;, MHC</italic>) tuples, 300,168 of which are negative and 100,229 are positive;</p>
</list-item>
<list-item>
<p>&#x2022; 111,041 unique (<italic>peptide, CDR3&#x3b2;, CDR3&#x3b1;</italic>) tuples, 82,631 of which are negative and 28,410 are positive; and</p>
</list-item>
<list-item>
<p>&#x2022; 110,266 unique (<italic>peptide, CDR3&#x3b2;, CDR3&#x3b1;, MHC</italic>) tuples, 82,037 of which are negative and 28,229 are positive.</p>
</list-item>
</list>
<p>The dataset statistics considering negative samples derived from either RN or NA are presented in <xref ref-type="supplementary-material" rid="SM1">
<bold>Table S1</bold>
</xref>. <xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref> depicts the class distribution for (<italic>peptide, CDR3&#x3b2;, CDR3&#x3b1;</italic>) samples. Analogously, <xref ref-type="supplementary-material" rid="SM1">
<bold>Figures S2&#x2013;S4</bold>
</xref> depict the class distribution for (<italic>peptide, CDR3&#x3b2;</italic>), (<italic>peptide, CDR3&#x3b2;, MHC</italic>) and (<italic>peptide, CDR3&#x3b2;, CDR3&#x3b1;, MHC</italic>) samples, respectively. <xref ref-type="supplementary-material" rid="SM1">
<bold>Figure S5</bold>
</xref> depicts the length distribution for all sequences.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Predicting TCR recognition with deep learning</title>
<p>We perform experiments on the <italic>TChard</italic> dataset with two publicly available state-of-the-art deep learning methods for TCR-peptide/-pMHC interaction prediction: ERGO II and NetTCR-2.0 <sup>
<xref ref-type="fn" rid="fn3">
<sup>3</sup>
</xref>
</sup>.</p>
<p>We perform TCR-peptide interaction prediction considering peptide and CDR3&#x3b2;, as well as TCR-pMHC interaction prediction considering peptide, CDR3&#x3b2;, CDR3&#x3b1;, and MHC. NetTCR-2.0 is not explicitly designed to account for MHC information; we circumvent this shortcoming by concatenating the MHC pseudo-sequence <xref ref-type="fn" rid="fn4">
<sup>4</sup>
</xref> to the other input amino acid sequences and applying BLOSUM50 encoding (<xref ref-type="bibr" rid="B32">32</xref>). We do not make distinctions between class I and II MHCs and train a single model for both types.</p>
<sec id="s3_1">
<label>3.1</label>
<title>Random and hard training/test splits</title>
<p>For performance evaluation, we investigate two different strategies for training/test splits.</p>
<p>
<bold>Random split (RS).</bold> Given a training/test ratio (80/20 in this work), this procedure consists of sampling test samples uniformly from the dataset without replacement until the desired budget is filled. The remaining samples constitute the training set. In this work, we write RS(RN) when the negative tuples only belong to the RN set, RS(NA) when the negative tuples only belong to the NA set, and RS(RN+NA) when all negative samples are considered.</p>
<p>The nature of TCR recognition is combinatorial. In our dataset, although a given tuple of molecules is only observed once, a given peptide can appear multiple times, paired with different CDR3&#x3b2;, CDR3&#x3b1;, or MHC. Using a random training/test split ensures that test tuples are not observed at training time. However, this can lead to testing the model on peptides, MHCs, or CDR3&#x3b2; and CDR3&#x3b1; sequences that were already observed at training time in combination with different sequences. Our results show that this can lead to overoptimistic estimates of machine learning models&#x2019; real-world performance. To enable neoantigen-based cancer vaccines and T-cell therapy, it is fundamental to test the model on sequences that were never observed at training time. In fact, neoantigens display enormous variability in their amino acid sequences; to identify the most immunogenic vaccine elements, we need models that generalize to unseen sequences.</p>
<p>
<bold>Hard split (HS).</bold> We propose a simple heuristic, which we refer to as <italic>hard split</italic>. Considering the whole dataset, which consists of a set of tuples, we first select a <italic>minimum</italic> training/test ratio (85/15 in this work). Let <italic>P<sub>l,u</sub>
</italic> be the set of all peptides that are observed in at least <italic>l</italic> tuples but no more than <italic>u</italic> tuples in our dataset. We randomly sample a peptide from <italic>P<sub>l,u</sub>
</italic> without replacement. All tuples that include that peptide are assigned to the test set. If the current number of test samples is smaller than the budget defined by the training/test ratio, the sampling from <italic>P<sub>l,u</sub>
</italic> is repeated.</p>
<p>This heuristic ensures that the peptides that belong to the test set are not observed by the model at training time. For the (<italic>peptide, CDR3&#x3b2;</italic>) tuples, which contain 1,360 different peptides, we set <italic>l</italic> and <italic>u</italic> to 500 and 10,000, respectively. This selects a set of 104 possible test peptides. For the (<italic>peptide, CDR3&#x3b2;, CDR3&#x3b1;, MHC</italic>) tuples, which contain 870 different peptides, we set <italic>l</italic> and <italic>u</italic> to 100 and 5,000, respectively. This results in a set of 42 possible test peptides. The <italic>l</italic> parameter is a lower bound and ensures that the selected test peptides are paired with a sufficiently broad variety of CDR3 sequences. The <italic>u</italic> parameter is an upper bound and excludes peptides that would saturate the test budget too quickly and thereby reduce the variety of test peptides. We create five different hard splits using five different random seeds for the sampling of the test peptides. For the creation of the hard training/test splits, we consider all positive samples, as well as the negative samples from the RN set, i.e., excluding the negative samples from the negative assays. We refer to this type of split as HS(RN).</p>
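<p>The hard split heuristic can be sketched as follows. This is a minimal Python illustration under stated assumptions rather than the implementation used in this work: the function name <monospace>hard_split</monospace> and its parameters are hypothetical, each sample is assumed to be a tuple whose first element is the peptide, and drawing peptides without replacement is realized by shuffling the candidate list once.</p>
<preformat><![CDATA[
```python
import random
from collections import Counter


def hard_split(samples, test_fraction=0.15, lower=500, upper=10_000, seed=0):
    """Split samples so that no test peptide appears in the training set.

    samples: list of tuples whose first element is the peptide sequence.
    """
    rng = random.Random(seed)
    counts = Counter(sample[0] for sample in samples)
    # P_{l,u}: peptides observed in at least `lower` and at most `upper` tuples.
    candidates = sorted(p for p, n in counts.items() if lower <= n <= upper)
    rng.shuffle(candidates)  # iterating a shuffled list = sampling w/o replacement
    budget = test_fraction * len(samples)
    test_peptides = set()
    n_test = 0
    for peptide in candidates:
        if n_test >= budget:
            break
        # All tuples that contain this peptide go to the test set.
        test_peptides.add(peptide)
        n_test += counts[peptide]
    train = [s for s in samples if s[0] not in test_peptides]
    test = [s for s in samples if s[0] in test_peptides]
    return train, test
```
]]></preformat>
<p>Calling the function with five different values of <monospace>seed</monospace> yields five different sets of test peptides. Note that only the peptides are held out; a CDR3 sequence may still occur in both the training and the test set.</p>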
<p>
<xref ref-type="supplementary-material" rid="SM1">
<bold>Tables S2</bold>
</xref> and <xref ref-type="supplementary-material" rid="SM1">
<bold>S3</bold>
</xref> describe the different HS(RN) hard splits for the (<italic>peptide, CDR3&#x3b2;</italic>) and (<italic>peptide, CDR3&#x3b2;, CDR3&#x3b1;, MHC</italic>) samples, respectively. They present the lists of test peptides and the number of positive and negative samples associated with each of them. Each displayed test peptide is paired with different TCRs. The test TCRs can be observed at training time, as the HS only ensures that test peptides are unseen.</p>
<p>Since a subset of the available samples is included in more than one source database, we drop duplicate data points for the two considered settings, i.e., (<italic>peptide, CDR3&#x3b2;, label</italic>) and (<italic>peptide, CDR3&#x3b2;, CDR3&#x3b1;, MHC, label</italic>).</p>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Validation approach and performance evaluation</title>
<p>For robust performance evaluation, we repeat the experiments for each different training/test split (i.e., five times). The area under the receiver operating characteristic (AUROC) curve (<xref ref-type="bibr" rid="B33">33</xref>, <xref ref-type="bibr" rid="B34">34</xref>), the area under the precision&#x2013;recall (AUPR) curve (<xref ref-type="bibr" rid="B35">35</xref>, <xref ref-type="bibr" rid="B36">36</xref>), the F1 score (F1) (<xref ref-type="bibr" rid="B37">37</xref>), precision, recall, and classification accuracy are computed on the test sets and averaged.</p>
<p>We adopt the default configuration for both ERGO II and NetTCR-2.0, as proposed in their original implementations. For ERGO II, we adopt the LSTM amino acid sequence encoder. The training is performed for a maximum of 1,000 epochs and, in order to avoid over-fitting, the best model is selected by saving the weights corresponding to the epoch where the AUROC is maximum on the validation set. The validation set is obtained <italic>via</italic> an 80/20 stratified random split of the training set.</p>
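<p>As a reminder of what the AUROC measures, it can be computed as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The following Python sketch illustrates this definition and the averaging over repeated splits; the function names <monospace>auroc</monospace> and <monospace>mean_auroc</monospace> are hypothetical, and the pairwise formulation is chosen for readability, not because it is the routine used in our experiments.</p>
<preformat><![CDATA[
```python
from statistics import mean


def auroc(labels, scores):
    """AUROC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    # Pairwise comparison: quadratic cost, fine for small illustrative inputs.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auroc(runs):
    """Average AUROC over repeated splits; `runs` holds (labels, scores) pairs."""
    return mean(auroc(labels, scores) for labels, scores in runs)
```
]]></preformat>
<p>A value of 0.5 corresponds to random ranking, which is the reference level for the hard split results.</p>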
</sec>
</sec>
<sec id="s4" sec-type="results">
<label>4</label>
<title>Results</title>
<p>
<xref ref-type="fig" rid="f2">
<bold>Figure&#xa0;2</bold>
</xref> shows test results for ERGO II and NetTCR-2.0, for the RS and HS splitting strategies, in both the peptide+CDR3&#x3b2; and the peptide+CDR3&#x3b2;+CDR3&#x3b1;+MHC settings. We perform experiments considering negative samples from the NA set only, from the RN set only, and jointly from both the NA and RN sets. Additionally, in the <xref ref-type="supplementary-material" rid="SM1">
<bold>Supplementary Material</bold>
</xref>, we report results of experiments performed exclusively on VDJdb samples with quality score &#x2265; 1.</p>
<fig id="f2" position="float">
<label>Figure&#xa0;2</label>
<caption>
<p>Test results for ERGO II and NetTCR-2.0 for TCR-peptide/-pMHC interaction prediction trained and tested on <italic>TChard</italic>. AUPR: area under the precision&#x2013;recall curve. AUROC: area under the receiver operating characteristic curve. NA: negative samples from negative assays. RN: negative samples from random mismatching. RS(&#xb7;): random split. HS(&#xb7;): hard split. Confidence intervals are the standard deviation over five experiments with independent training/test splits. <bold>(A&#x2013;D)</bold> ERGO II and NetTCR-2.0 results on (<italic>peptide, CDR3&#x3b2;</italic>) and (<italic>peptide, CDR3&#x3b2;, CDR3&#x3b1;, MHC</italic>) samples. Legend: <italic>Source of training negatives | Training/test split</italic>. <bold>(E)</bold> Peptide-specific AUROC computed on the (<italic>peptide, CDR3&#x3b2;</italic>) test set obtained with hard split 0 (see <xref ref-type="supplementary-material" rid="SM1">
<bold>Table S2</bold>
</xref>). <bold>(F)</bold> Peptide-specific AUROC computed on the (<italic>peptide, CDR3&#x3b2;, CDR3&#x3b1;, MHC</italic>) test set obtained with hard split 0 (see <xref ref-type="supplementary-material" rid="SM1"><bold>Table S3</bold></xref>).</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fimmu-13-1014256-g002.tif"/>
</fig>
<sec id="s4_1">
<label>4.1</label>
<title>Overoptimistic classification performance due to sequence memorization</title>
<p>As depicted in <xref ref-type="fig" rid="f2">
<bold>Figure&#xa0;2</bold>
</xref>, almost perfect classification is achieved when training with negative samples only from the NA set and testing using the RS(NA) split. As shown in <xref ref-type="supplementary-material" rid="SM1">
<bold>Figures S2C</bold>
</xref> and <xref ref-type="supplementary-material" rid="SM1">
<bold>S4C</bold>
</xref>, when considering negative samples from the NA set only, the binding and non-binding class histograms of the CDR3 sequences are disjoint. Hence, models can learn to map a large portion of test tuples to the correct label simply by memorizing the CDR3 sequences, ignoring the peptide. We believe that these results are overoptimistic and should not be taken as an approximation of these models&#x2019; real-world performance.</p>
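<p>The memorization shortcut can be made concrete with a toy lookup-table classifier that ignores the peptide entirely. The Python sketch below is purely illustrative and is not one of the evaluated models; the function names are hypothetical, and samples are assumed to be (peptide, CDR3, label) triples.</p>
<preformat><![CDATA[
```python
def fit_cdr3_lookup(train):
    """Memorize the label of every training CDR3; the peptide is ignored."""
    return {cdr3: label for _peptide, cdr3, label in train}


def predict(lookup, samples, default=0):
    """Predict from the CDR3 alone; unseen CDR3s fall back to `default`."""
    return [lookup.get(cdr3, default) for _peptide, cdr3, _label in samples]


def accuracy(lookup, samples):
    """Fraction of samples whose memorized CDR3 label matches the true label."""
    predictions = predict(lookup, samples)
    hits = sum(p == label for p, (_pep, _cdr3, label) in zip(predictions, samples))
    return hits / len(samples)
```
]]></preformat>
<p>When every CDR3 sequence occurs with a single label, as in the NA-only setting, such a classifier labels a randomly split test set correctly whenever the test CDR3 was seen during training, regardless of the peptide. It fails as soon as a memorized binding CDR3 reappears in a mismatched negative pair.</p>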
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>The hard split allows for realistic evaluation</title>
<p>Using the HS heuristic appears to make prediction on the test set consistently harder, if not impossible. This tendency is observed in the peptide+CDR3&#x3b2; setting (<xref ref-type="fig" rid="f2">
<bold>Figures&#xa0;2A, B</bold>
</xref>) and in the peptide+CDR3&#x3b2;+CDR3&#x3b1;+MHC setting (<xref ref-type="fig" rid="f2">
<bold>Figures&#xa0;2C, D</bold>
</xref>). In the peptide+CDR3&#x3b2; setting, when testing the models using the HS(RN) split, the predictions on the test set barely exceed random-level performance, i.e., almost no generalization to unseen peptides is occurring (AUROC &#x2248; 0.55). This phenomenon is observed when the models are trained using negative samples from the RN set only, as well as when using negative samples from both the RN and NA sets.</p>
<p>Including negative samples from NA at training time does not significantly influence test performance when the HS is adopted. Conversely, when RS is performed, using the negative samples from NA causes significant differences. This reinforces our claims regarding sequence memorization. ERGO II, in the peptide+CDR3&#x3b2; setting (<xref ref-type="fig" rid="f2">
<bold>Figure&#xa0;2A</bold>
</xref>), achieves overoptimistic performance when the negative samples come from both NA and RN and testing is performed using RS(RN+NA). The same phenomenon is observed in <xref ref-type="fig" rid="f2">
<bold>Figure&#xa0;2B</bold>
</xref> for ERGO II in the peptide+CDR3&#x3b2;+CDR3&#x3b1;+MHC setting and in <xref ref-type="fig" rid="f2">
<bold>Figure 2D</bold>
</xref> for NetTCR-2.0 in the peptide+CDR3&#x3b2;+CDR3&#x3b1;+MHC setting.</p>
<p>
<xref ref-type="supplementary-material" rid="SM1">
<bold>Figure S6</bold>
</xref> depicts NetTCR-2.0 results on the (<italic>peptide, CDR3&#x3b2;, CDR3&#x3b1;, MHC</italic>) samples, but ignoring the MHC; we report these results for fairness, as NetTCR-2.0 is not originally designed to handle MHC pseudo-sequences.</p>
</sec>
</sec>
<sec id="s5" sec-type="discussion">
<label>5</label>
<title>Discussion</title>
<p>In this work, we aim to test the reliability of state-of-the-art deep learning methods on TCR-peptide/-pMHC binding prediction for unseen peptides. To this end, we integrate TCR-peptide/-pMHC samples from different databases. We name this collection of samples <italic>TChard</italic>.</p>
<p>We perform experiments with two state-of-the-art deep learning models for TCR-peptide/-pMHC interaction prediction, ERGO II and NetTCR-2.0. We study the peptide+CDR3&#x3b2; and the peptide+CDR3&#x3b2;+CDR3&#x3b1;+MHC settings. We compare the effect of different training/test splitting strategies, RS and HS. RS is a naive random split, while HS allows testing the models on unseen peptides. We investigate the effect of training and testing the models using mismatched negative samples generated randomly (RN), in addition to the negative samples derived from assays (NA).</p>
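<p>The difference between the two splitting strategies and the random generation of mismatched negatives can be illustrated with a minimal sketch. It assumes that each sample is a (peptide, CDR3, label) tuple; the function names, the split fractions and the sampling details are hypothetical choices made for illustration, and the procedure actually used to build TChard may differ.</p>
<preformat>
```python
import random

# Assumption: each sample is a (peptide, cdr3, label) tuple, label 1 = binding.
# All function names below are illustrative, not taken from the TChard code.


def random_split(samples, test_fraction=0.2, seed=0):
    """RS: shuffle all samples, so test peptides may also occur in training."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def hard_split(samples, test_fraction=0.2, seed=0):
    """HS: hold out whole peptides, so no test peptide occurs in training.

    Here test_fraction is a fraction of the distinct peptides, not of samples.
    """
    rng = random.Random(seed)
    peptides = sorted({peptide for peptide, _, _ in samples})
    rng.shuffle(peptides)
    n_test = max(1, int(len(peptides) * test_fraction))
    test_peptides = set(peptides[:n_test])
    train = [s for s in samples if s[0] not in test_peptides]
    test = [s for s in samples if s[0] in test_peptides]
    return train, test


def random_negatives(positives, n, seed=0, max_tries=10000):
    """RN: pair a peptide with a CDR3 it is not known to bind (label 0)."""
    rng = random.Random(seed)
    known = {(p, c) for p, c, _ in positives}
    peptides = sorted({p for p, _ in known})
    cdr3s = sorted({c for _, c in known})
    negatives = set()
    for _ in range(max_tries):
        if len(negatives) == n:
            break
        pair = (rng.choice(peptides), rng.choice(cdr3s))
        if pair not in known:  # keep only mismatched pairs
            negatives.add(pair)
    return [(p, c, 0) for p, c in sorted(negatives)]
```
</preformat>
<p>In this sketch, the set of peptides in the test portion returned by hard_split is disjoint from the set of peptides in the training portion, whereas random_split gives no such guarantee.</p>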
<p>As shown in our experiments, when the HS is performed, the two models do not generalize to unseen peptides; this appears to contrast with the TPP-III results presented by Springer et&#xa0;al. (<xref ref-type="bibr" rid="B23">23</xref>). Conversely, when a simple RS is employed and negative samples belong only to NA, almost perfect classification is achieved. We believe that this phenomenon is due to the class distribution of the CDR3 sequences and the related sequence memorization. As shown in <xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1C</bold>
</xref>, when considering negative samples from NA only, the positive and the negative samples are completely disjoint in terms of their CDR3 sequences. Hence, a given CDR3 sequence appears only in binding samples or only in non-binding samples. This leads the models to learn an inductive bias that classifies tuples as binding or non-binding exclusively based on the CDR3 sequence, without considering which peptide it is paired with; this also appears to be confirmed by the findings of Weber et&#xa0;al. (<xref ref-type="bibr" rid="B24">24</xref>).</p>
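<p>The memorization argument above can be made concrete with a small sketch. It is not part of our experiments: the helper names are hypothetical, samples are again assumed to be (peptide, CDR3, label) tuples, and the lookup baseline is a deliberately trivial stand-in for a model that ignores the peptide.</p>
<preformat>
```python
from collections import defaultdict

# Assumption: each sample is a (peptide, cdr3, label) tuple, label in {0, 1}.
# Names are illustrative only.


def cdr3_class_overlap(samples):
    """Return the CDR3 sequences that occur with both labels."""
    labels_by_cdr3 = defaultdict(set)
    for _, cdr3, label in samples:
        labels_by_cdr3[cdr3].add(label)
    return {c for c, labels in labels_by_cdr3.items() if len(labels) == 2}


class Cdr3Lookup:
    """Baseline that memorizes the label of each training CDR3.

    It never looks at the peptide, so any accuracy it reaches reflects
    sequence memorization rather than learned binding rules.
    """

    def fit(self, samples):
        self.label_by_cdr3 = {cdr3: label for _, cdr3, label in samples}
        return self

    def predict(self, samples, default=0):
        # Unseen CDR3 sequences fall back to the default label.
        return [self.label_by_cdr3.get(cdr3, default) for _, cdr3, _ in samples]


def accuracy(predictions, samples):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(int(p == s[2]) for p, s in zip(predictions, samples))
    return correct / len(samples)
```
</preformat>
<p>If cdr3_class_overlap returns an empty set and the same CDR3 sequences occur in both the training and the test portions, as can happen under RS, such a lookup classifies the shared sequences correctly without using the peptide at all.</p>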
<p>To make progress towards robust TCR-peptide/-pMHC interaction prediction, machine learning models should achieve satisfactory test performance on the hard training/test split (HS) that we propose in this work. Only then will such models be suitable for real-world applications, e.g., personalized cancer immunotherapy and T-cell engineering. Possible strategies to achieve this goal might include exploring different feature representations, e.g., SMILES (<xref ref-type="bibr" rid="B38">38</xref>) encodings as proposed in TITAN (<xref ref-type="bibr" rid="B24">24</xref>). Further possible methods might rely on physics-based simulations for the generation of large-scale datasets. Additionally, transfer learning techniques (<xref ref-type="bibr" rid="B39">39</xref>) might make it possible to leverage knowledge from large databases of protein-ligand binding affinity, e.g., BindingDB (<xref ref-type="bibr" rid="B30">30</xref>), which includes more than 1 million labeled samples.</p>
</sec>
<sec id="s6" sec-type="data-availability">
<title>Data availability statement</title>
<p>The dataset adopted for this study can be found in the following repository: <uri xlink:href="https://doi.org/10.5281/zenodo.6962043">https://doi.org/10.5281/zenodo.6962043</uri>. The code used to create the dataset and to run the machine learning experiments can be found in <uri xlink:href="https://github.com/nec-research/tc-hard">https://github.com/nec-research/tc-hard</uri>.</p>
</sec>
<sec id="s7" sec-type="author-contributions">
<title>Author contributions</title>
<p>FG pre-processed the data, created the dataset, performed the machine learning experiments, and drafted the manuscript. All other authors contributed to the conceptualization of the work and revised the manuscript. In particular, AM supported the data pre-processing and provided immuno-oncological guidance. All authors contributed to the article and approved the submitted version.</p>
</sec>
</body>
<back>
<sec id="s8" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>Authors FG, AM, PM and IA are employed by NEC Laboratories Europe. KL and MM are employed by NEC Laboratories America.</p>
<p>The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="s9" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec id="s10" sec-type="supplementary-material">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fimmu.2022.1014256/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fimmu.2022.1014256/full#supplementary-material</ext-link>
</p>
<supplementary-material xlink:href="DataSheet_1.pdf" id="SM1" mimetype="application/pdf"/>
</sec>
<fn-group>
<fn id="fn1">
<label>1</label>
<p>As we aim to create the largest possible collection of samples, we do not filter the VDJdb samples by quality score, at the cost of introducing noise.</p>
</fn>
<fn id="fn2">
<label>2</label>
<p>
<uri xlink:href="https://github.com/mnielLab/NetTCR-2.0/tree/main/data">https://github.com/mnielLab/NetTCR-2.0/tree/main/data</uri>.</p>
</fn>
<fn id="fn3">
<label>3</label>
<p>ERGO II: <uri xlink:href="https://github.com/IdoSpringer/ERGO-II">https://github.com/IdoSpringer/ERGO-II</uri>; NetTCR-2.0: <uri xlink:href="https://github.com/mnielLab/NetTCR-2.0">https://github.com/mnielLab/NetTCR-2.0</uri>.</p>
</fn>
<fn id="fn4">
<label>4</label>
<p>Taken from the PUFFIN (<xref ref-type="bibr" rid="B31">31</xref>) repository: <uri xlink:href="https://github.com/gifford-lab/PUFFIN/blob/master/data/">https://github.com/gifford-lab/PUFFIN/blob/master/data/</uri>.</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<label>1</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kalos</surname> <given-names>M</given-names>
</name>
<name>
<surname>June</surname> <given-names>C</given-names>
</name>
</person-group>. <article-title>Adoptive T cell transfer for cancer immunotherapy in the era of synthetic biology</article-title>. <source>Immunity</source> (<year>2013</year>) <volume>39</volume>:<fpage>49</fpage>&#x2013;<lpage>60</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1016/j.immuni.2013.07.002</pub-id>
</citation>
</ref>
<ref id="B2">
<label>2</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Woodsworth</surname> <given-names>DJ</given-names>
</name>
<name>
<surname>Castellarin</surname> <given-names>M</given-names>
</name>
<name>
<surname>Holt</surname> <given-names>RA</given-names>
</name>
</person-group>. <article-title>Sequence analysis of T-cell repertoires in health and disease</article-title>. <source>Genome Med</source> (<year>2013</year>) <volume>5</volume>:<fpage>98</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1186/gm502</pub-id>
</citation>
</ref>
<ref id="B3">
<label>3</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Maus</surname> <given-names>MV</given-names>
</name>
<name>
<surname>Fraietta</surname> <given-names>JA</given-names>
</name>
<name>
<surname>Levine</surname> <given-names>BL</given-names>
</name>
<name>
<surname>Kalos</surname> <given-names>M</given-names>
</name>
<name>
<surname>Zhao</surname> <given-names>Y</given-names>
</name>
<name>
<surname>June</surname> <given-names>CH</given-names>
</name>
</person-group>. <article-title>Adoptive immunotherapy for cancer or viruses</article-title>. <source>Annu Rev Immunol</source> (<year>2014</year>) <volume>32</volume>:<fpage>189</fpage>&#x2013;<lpage>225</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1146/annurev-immunol-032713-120136</pub-id>
</citation>
</ref>
<ref id="B4">
<label>4</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kunert</surname> <given-names>A</given-names>
</name>
<name>
<surname>van Brakel</surname> <given-names>M</given-names>
</name>
<name>
<surname>van Steenbergen-Langeveld</surname> <given-names>S</given-names>
</name>
<name>
<surname>da Silva</surname> <given-names>M</given-names>
</name>
<name>
<surname>Coulie</surname> <given-names>PG</given-names>
</name>
<name>
<surname>Lamers</surname> <given-names>C</given-names>
</name>
<etal/>
</person-group>. <article-title>MAGE-C2&#x2013;specific TCRs combined with epigenetic drug-enhanced antigenicity yield robust and tumor-selective T cell responses</article-title>. <source>J Immunol</source> (<year>2016</year>) <volume>197</volume>:<page-range>2541&#x2013;52</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.4049/jimmunol.1502024</pub-id>
</citation>
</ref>
<ref id="B5">
<label>5</label>
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Alberts</surname> <given-names>B</given-names>
</name>
<name>
<surname>Johnson</surname> <given-names>A</given-names>
</name>
<name>
<surname>Lewis</surname> <given-names>J</given-names>
</name>
<name>
<surname>Morgan</surname> <given-names>D</given-names>
</name>
<name>
<surname>Raff</surname> <given-names>M</given-names>
</name>
<name>
<surname>Roberts</surname> <given-names>K</given-names>
</name>
<etal/>
</person-group>. <source>Molecular biology of the cell</source>. <publisher-loc>New York, USA</publisher-loc>: <publisher-name>WW Norton &amp; Company</publisher-name> (<year>2017</year>).</citation>
</ref>
<ref id="B6">
<label>6</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rowen</surname> <given-names>L</given-names>
</name>
<name>
<surname>Koop</surname> <given-names>BF</given-names>
</name>
<name>
<surname>Hood</surname> <given-names>L</given-names>
</name>
</person-group>. <article-title>The complete 685-kilobase DNA sequence of the human <italic>&#x3b2;</italic> T cell receptor locus</article-title>. <source>Science</source> (<year>1996</year>) <volume>272</volume>:<page-range>1755&#x2013;62</page-range>. doi: <pub-id pub-id-type="doi">10.1126/science.272.5269.1755</pub-id>
</citation>
</ref>
<ref id="B7">
<label>7</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Glanville</surname> <given-names>J</given-names>
</name>
<name>
<surname>Huang</surname> <given-names>H</given-names>
</name>
<name>
<surname>Nau</surname> <given-names>A</given-names>
</name>
<name>
<surname>Hatton</surname> <given-names>O</given-names>
</name>
<name>
<surname>Wagar</surname> <given-names>LE</given-names>
</name>
<name>
<surname>Rubelt</surname> <given-names>F</given-names>
</name>
<etal/>
</person-group>. <article-title>Identifying specificity groups in the T cell receptor repertoire</article-title>. <source>Nature</source> (<year>2017</year>) <volume>547</volume>:<page-range>94&#x2013;8</page-range>. doi: <pub-id pub-id-type="doi">10.1038/nature22976</pub-id>
</citation>
</ref>
<ref id="B8">
<label>8</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Feng</surname> <given-names>D</given-names>
</name>
<name>
<surname>Bond</surname> <given-names>CJ</given-names>
</name>
<name>
<surname>Ely</surname> <given-names>LK</given-names>
</name>
<name>
<surname>Maynard</surname> <given-names>J</given-names>
</name>
<name>
<surname>Garcia</surname> <given-names>KC</given-names>
</name>
</person-group>. <article-title>Structural evidence for a germline-encoded T cell receptor&#x2013;major histocompatibility complex interaction &#x2018;codon&#x2019;</article-title>. <source>Nat Immunol</source> (<year>2007</year>) <volume>8</volume>:<page-range>975&#x2013;83</page-range>. doi: <pub-id pub-id-type="doi">10.1038/ni1502</pub-id>
</citation>
</ref>
<ref id="B9">
<label>9</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rossjohn</surname> <given-names>J</given-names>
</name>
<name>
<surname>Gras</surname> <given-names>S</given-names>
</name>
<name>
<surname>Miles</surname> <given-names>JJ</given-names>
</name>
<name>
<surname>Turner</surname> <given-names>SJ</given-names>
</name>
<name>
<surname>Godfrey</surname> <given-names>DI</given-names>
</name>
<name>
<surname>McCluskey</surname> <given-names>J</given-names>
</name>
</person-group>. <article-title>T cell antigen receptor recognition of antigen-presenting molecules</article-title>. <source>Annu Rev Immunol</source> (<year>2015</year>) <volume>33</volume>:<fpage>169</fpage>&#x2013;<lpage>200</lpage>. doi: <pub-id pub-id-type="doi">10.1146/annurev-immunol-032414-112334</pub-id>
</citation>
</ref>
<ref id="B10">
<label>10</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qi</surname> <given-names>Q</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>Y</given-names>
</name>
<name>
<surname>Cheng</surname> <given-names>Y</given-names>
</name>
<name>
<surname>Glanville</surname> <given-names>J</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>D</given-names>
</name>
<name>
<surname>Lee</surname> <given-names>JY</given-names>
</name>
<etal/>
</person-group>. <article-title>Diversity and clonal selection in the human T-cell repertoire</article-title>. <source>Proc Natl Acad Sci</source> (<year>2014</year>) <volume>111</volume>:<page-range>13139&#x2013;44</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1073/pnas.1409155111</pub-id>
</citation>
</ref>
<ref id="B11">
<label>11</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jameson</surname> <given-names>SC</given-names>
</name>
<name>
<surname>Masopust</surname> <given-names>D</given-names>
</name>
</person-group>. <article-title>Understanding subset diversity in T cell memory</article-title>. <source>Immunity</source> (<year>2018</year>) <volume>48</volume>:<page-range>214&#x2013;26</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1016/j.immuni.2018.02.010</pub-id>
</citation>
</ref>
<ref id="B12">
<label>12</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Omilusik</surname> <given-names>KD</given-names>
</name>
<name>
<surname>Goldrath</surname> <given-names>AW</given-names>
</name>
</person-group>. <article-title>Remembering to remember: T cell memory maintenance and plasticity</article-title>. <source>Curr Opin Immunol</source> (<year>2019</year>) <volume>58</volume>:<fpage>89</fpage>&#x2013;<lpage>97</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1016/j.coi.2019.04.009</pub-id>
</citation>
</ref>
<ref id="B13">
<label>13</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jurtz</surname> <given-names>VI</given-names>
</name>
<name>
<surname>Jessen</surname> <given-names>LE</given-names>
</name>
<name>
<surname>Bentzen</surname> <given-names>AK</given-names>
</name>
<name>
<surname>Jespersen</surname> <given-names>MC</given-names>
</name>
<name>
<surname>Mahajan</surname> <given-names>S</given-names>
</name>
<name>
<surname>Vita</surname> <given-names>R</given-names>
</name>
<etal/>
</person-group>. <article-title>NetTCR: sequence-based prediction of TCR binding to peptide-MHC complexes using convolutional neural networks</article-title>. <source>bioRxiv</source> (<year>2018</year>), <fpage>433706</fpage>. doi: <pub-id pub-id-type="doi">10.1101/433706</pub-id>
</citation>
</ref>
<ref id="B14">
<label>14</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>De Neuter</surname> <given-names>N</given-names>
</name>
<name>
<surname>Bittremieux</surname> <given-names>W</given-names>
</name>
<name>
<surname>Beirnaert</surname> <given-names>C</given-names>
</name>
<name>
<surname>Cuypers</surname> <given-names>B</given-names>
</name>
<name>
<surname>Mrzic</surname> <given-names>A</given-names>
</name>
<name>
<surname>Moris</surname> <given-names>P</given-names>
</name>
<etal/>
</person-group>. <article-title>On the feasibility of mining CD8+ T cell receptor patterns underlying immunogenic peptide recognition</article-title>. <source>Immunogenetics</source> (<year>2018</year>) <volume>70</volume>:<page-range>159&#x2013;68</page-range>. doi: <pub-id pub-id-type="doi">10.1007/s00251-017-1023-5</pub-id>
</citation>
</ref>
<ref id="B15">
<label>15</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jokinen</surname> <given-names>E</given-names>
</name>
<name>
<surname>Huuhtanen</surname> <given-names>J</given-names>
</name>
<name>
<surname>Mustjoki</surname> <given-names>S</given-names>
</name>
<name>
<surname>Heinonen</surname> <given-names>M</given-names>
</name>
<name>
<surname>L&#xe4;hdesm&#xe4;ki</surname> <given-names>H</given-names>
</name>
</person-group>. <article-title>Predicting recognition between T cell receptors and epitopes with TCRGP</article-title>. <source>PLoS Comput Biol</source> (<year>2021</year>) <volume>17</volume>:<elocation-id>e1008814</elocation-id>. doi: <pub-id pub-id-type="doi">10.1371/journal.pcbi.1008814</pub-id>
</citation>
</ref>
<ref id="B16">
<label>16</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wong</surname> <given-names>EB</given-names>
</name>
<name>
<surname>Gold</surname> <given-names>MC</given-names>
</name>
<name>
<surname>Meermeier</surname> <given-names>EW</given-names>
</name>
<name>
<surname>Xulu</surname> <given-names>BZ</given-names>
</name>
<name>
<surname>Khuzwayo</surname> <given-names>S</given-names>
</name>
<name>
<surname>Sullivan</surname> <given-names>ZA</given-names>
</name>
<etal/>
</person-group>. <article-title>TRAV1-2+ CD8+ T-cells including oligoconal expansions of MAIT cells are enriched in the airways in human tuberculosis</article-title>. <source>Commun Biol</source> (<year>2019</year>) <volume>2</volume>:<fpage>1</fpage>&#x2013;<lpage>13</lpage>. doi: <pub-id pub-id-type="doi">10.1038/s42003-019-0442-2</pub-id>
</citation>
</ref>
<ref id="B17">
<label>17</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Moris</surname> <given-names>P</given-names>
</name>
<name>
<surname>De Pauw</surname> <given-names>J</given-names>
</name>
<name>
<surname>Postovskaya</surname> <given-names>A</given-names>
</name>
<name>
<surname>Gielis</surname> <given-names>S</given-names>
</name>
<name>
<surname>De Neuter</surname> <given-names>N</given-names>
</name>
<name>
<surname>Bittremieux</surname> <given-names>W</given-names>
</name>
<etal/>
</person-group>. <article-title>Current challenges for unseen-epitope TCR interaction prediction and a new perspective derived from image classification</article-title>. <source>Briefings Bioinf</source> (<year>2021</year>) <volume>22</volume>:<fpage>bbaa318</fpage>. doi: <pub-id pub-id-type="doi">10.1093/bib/bbaa318</pub-id>
</citation>
</ref>
<ref id="B18">
<label>18</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gielis</surname> <given-names>S</given-names>
</name>
<name>
<surname>Moris</surname> <given-names>P</given-names>
</name>
<name>
<surname>Bittremieux</surname> <given-names>W</given-names>
</name>
<name>
<surname>De Neuter</surname> <given-names>N</given-names>
</name>
<name>
<surname>Ogunjimi</surname> <given-names>B</given-names>
</name>
<name>
<surname>Laukens</surname> <given-names>K</given-names>
</name>
<etal/>
</person-group>. <article-title>Detection of enriched T cell epitope specificity in full T cell receptor sequence repertoires</article-title>. <source>Front Immunol</source> (<year>2019</year>) <volume>10</volume>:<elocation-id>2820</elocation-id>. doi: <pub-id pub-id-type="doi">10.3389/fimmu.2019.02820</pub-id>
</citation>
</ref>
<ref id="B19">
<label>19</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tong</surname> <given-names>Y</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>J</given-names>
</name>
<name>
<surname>Zheng</surname> <given-names>T</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>X</given-names>
</name>
<name>
<surname>Xiao</surname> <given-names>X</given-names>
</name>
<name>
<surname>Zhu</surname> <given-names>X</given-names>
</name>
<etal/>
</person-group>. <article-title>SETE: Sequence-based ensemble learning approach for TCR epitope binding prediction</article-title>. <source>Comput Biol Chem</source> (<year>2020</year>) <volume>87</volume>:<fpage>107281</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.compbiolchem.2020.107281</pub-id>
</citation>
</ref>
<ref id="B20">
<label>20</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Springer</surname> <given-names>I</given-names>
</name>
<name>
<surname>Besser</surname> <given-names>H</given-names>
</name>
<name>
<surname>Tickotsky-Moskovitz</surname> <given-names>N</given-names>
</name>
<name>
<surname>Dvorkin</surname> <given-names>S</given-names>
</name>
<name>
<surname>Louzoun</surname> <given-names>Y</given-names>
</name>
</person-group>. <article-title>Prediction of specific TCR-peptide binding from large dictionaries of TCR-peptide pairs</article-title>. <source>Front Immunol</source> (<year>2020</year>) <volume>11</volume>:<elocation-id>1803</elocation-id>. doi: <pub-id pub-id-type="doi">10.3389/fimmu.2020.01803</pub-id>
</citation>
</ref>
<ref id="B21">
<label>21</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fischer</surname> <given-names>DS</given-names>
</name>
<name>
<surname>Wu</surname> <given-names>Y</given-names>
</name>
<name>
<surname>Schubert</surname> <given-names>B</given-names>
</name>
<name>
<surname>Theis</surname> <given-names>FJ</given-names>
</name>
</person-group>. <article-title>Predicting antigen specificity of single T cells based on TCR CDR3 regions</article-title>. <source>Mol Syst Biol</source> (<year>2020</year>) <volume>16</volume>:<elocation-id>e9416</elocation-id>. doi: <pub-id pub-id-type="doi">10.15252/msb.20199416</pub-id>
</citation>
</ref>
<ref id="B22">
<label>22</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Montemurro</surname> <given-names>A</given-names>
</name>
<name>
<surname>Schuster</surname> <given-names>V</given-names>
</name>
<name>
<surname>Povlsen</surname> <given-names>HR</given-names>
</name>
<name>
<surname>Bentzen</surname> <given-names>AK</given-names>
</name>
<name>
<surname>Jurtz</surname> <given-names>V</given-names>
</name>
<name>
<surname>Chronister</surname> <given-names>WD</given-names>
</name>
<etal/>
</person-group>. <article-title>NetTCR-2.0 enables accurate prediction of TCR-peptide binding by using paired TCR<italic>&#x3b1;</italic> and <italic>&#x3b2;</italic> sequence data</article-title>. <source>Commun Biol</source> (<year>2021</year>) <volume>4</volume>:<fpage>1</fpage>&#x2013;<lpage>13</lpage>. doi: <pub-id pub-id-type="doi">10.1038/s42003-021-02610-3</pub-id>
</citation>
</ref>
<ref id="B23">
<label>23</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Springer</surname> <given-names>I</given-names>
</name>
<name>
<surname>Tickotsky</surname> <given-names>N</given-names>
</name>
<name>
<surname>Louzoun</surname> <given-names>Y</given-names>
</name>
</person-group>. <article-title>Contribution of T cell receptor alpha and beta CDR3, MHC typing, V and J genes to peptide binding prediction</article-title>. <source>Front Immunol</source> (<year>2021</year>) <volume>12</volume>. doi: <pub-id pub-id-type="doi">10.3389/fimmu.2021.664514</pub-id>
</citation>
</ref>
<ref id="B24">
<label>24</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Weber</surname> <given-names>A</given-names>
</name>
<name>
<surname>Born</surname> <given-names>J</given-names>
</name>
<name>
<surname>Rodriguez Martinez</surname> <given-names>M</given-names>
</name>
</person-group>. <article-title>TITAN: T-cell receptor specificity prediction with bimodal attention networks</article-title>. <source>Bioinformatics</source> (<year>2021</year>) <volume>37</volume>:<page-range>i237&#x2013;44</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1093/bioinformatics/btab294</pub-id>
</citation>
</ref>
<ref id="B25">
<label>25</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vita</surname> <given-names>R</given-names>
</name>
<name>
<surname>Mahajan</surname> <given-names>S</given-names>
</name>
<name>
<surname>Overton</surname> <given-names>JA</given-names>
</name>
<name>
<surname>Dhanda</surname> <given-names>SK</given-names>
</name>
<name>
<surname>Martini</surname> <given-names>S</given-names>
</name>
<name>
<surname>Cantrell</surname> <given-names>JR</given-names>
</name>
<etal/>
</person-group>. <article-title>The Immune Epitope Database (IEDB): 2018 update</article-title>. <source>Nucleic Acids Res</source> (<year>2019</year>) <volume>47</volume>:<page-range>D339&#x2013;43</page-range>. doi: <pub-id pub-id-type="doi">10.1093/nar/gky1006</pub-id>
</citation>
</ref>
<ref id="B26">
<label>26</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bagaev</surname> <given-names>DV</given-names>
</name>
<name>
<surname>Vroomans</surname> <given-names>RM</given-names>
</name>
<name>
<surname>Samir</surname> <given-names>J</given-names>
</name>
<name>
<surname>Stervbo</surname> <given-names>U</given-names>
</name>
<name>
<surname>Rius</surname> <given-names>C</given-names>
</name>
<name>
<surname>Dolton</surname> <given-names>G</given-names>
</name>
<etal/>
</person-group>. <article-title>VDJdb in 2019: database extension, new analysis infrastructure and a T-cell receptor motif compendium</article-title>. <source>Nucleic Acids Res</source> (<year>2020</year>) <volume>48</volume>:<page-range>D1057&#x2013;62</page-range>. doi: <pub-id pub-id-type="doi">10.1093/nar/gkz874</pub-id>
</citation>
</ref>
<ref id="B27">
<label>27</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tickotsky</surname> <given-names>N</given-names>
</name>
<name>
<surname>Sagiv</surname> <given-names>T</given-names>
</name>
<name>
<surname>Prilusky</surname> <given-names>J</given-names>
</name>
<name>
<surname>Shifrut</surname> <given-names>E</given-names>
</name>
<name>
<surname>Friedman</surname> <given-names>N</given-names>
</name>
</person-group>. <article-title>McPAS-TCR: a manually curated catalogue of pathology-associated T cell receptor sequences</article-title>. <source>Bioinformatics</source> (<year>2017</year>) <volume>33</volume>:<page-range>2924&#x2013;9</page-range>. doi: <pub-id pub-id-type="doi">10.1093/bioinformatics/btx286</pub-id>
</citation>
</ref>
<ref id="B28">
<label>28</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Klinger</surname> <given-names>M</given-names>
</name>
<name>
<surname>Pepin</surname> <given-names>F</given-names>
</name>
<name>
<surname>Wilkins</surname> <given-names>J</given-names>
</name>
<name>
<surname>Asbury</surname> <given-names>T</given-names>
</name>
<name>
<surname>Wittkop</surname> <given-names>T</given-names>
</name>
<name>
<surname>Zheng</surname> <given-names>J</given-names>
</name>
<etal/>
</person-group>. <article-title>Multiplex identification of antigen-specific T cell receptors using a combination of immune assays and immune receptor sequencing</article-title>. <source>PLoS One</source> (<year>2015</year>) <volume>10</volume>:<elocation-id>e0141561</elocation-id>. doi: <pub-id pub-id-type="doi">10.1371/journal.pone.0141561</pub-id>
</citation>
</ref>
<ref id="B29">
<label>29</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nolan</surname> <given-names>S</given-names>
</name>
<name>
<surname>Vignali</surname> <given-names>M</given-names>
</name>
<name>
<surname>Klinger</surname> <given-names>M</given-names>
</name>
<name>
<surname>Dines</surname> <given-names>JN</given-names>
</name>
<name>
<surname>Kaplan</surname> <given-names>IM</given-names>
</name>
<name>
<surname>Svejnoha</surname> <given-names>E</given-names>
</name>
<etal/>
</person-group>. <article-title>A large-scale database of T-cell receptor beta (TCR<italic>&#x3b2;</italic>) sequences and binding associations from natural and synthetic exposure to SARS-CoV-2</article-title>. <source>Res Square</source> (<year>2020</year>). doi: <pub-id pub-id-type="doi">10.21203/rs.3.rs-51964/v1</pub-id>
</citation>
</ref>
<ref id="B30">
<label>30</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gilson</surname> <given-names>MK</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>T</given-names>
</name>
<name>
<surname>Baitaluk</surname> <given-names>M</given-names>
</name>
<name>
<surname>Nicola</surname> <given-names>G</given-names>
</name>
<name>
<surname>Hwang</surname> <given-names>L</given-names>
</name>
<name>
<surname>Chong</surname> <given-names>J</given-names>
</name>
</person-group>. <article-title>BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology</article-title>. <source>Nucleic Acids Res</source> (<year>2016</year>) <volume>44</volume>:<page-range>D1045&#x2013;53</page-range>. doi: <pub-id pub-id-type="doi">10.1093/nar/gkv1072</pub-id>
</citation>
</ref>
<ref id="B31">
<label>31</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zeng</surname> <given-names>H</given-names>
</name>
<name>
<surname>Gifford</surname> <given-names>DK</given-names>
</name>
</person-group>. <article-title>Quantification of uncertainty in peptide-MHC binding prediction improves high-affinity peptide selection for therapeutic design</article-title>. <source>Cell Syst</source> (<year>2019</year>) <volume>9</volume>:<page-range>159&#x2013;66</page-range>. doi: <pub-id pub-id-type="doi">10.1016/j.cels.2019.05.004</pub-id>
</citation>
</ref>
<ref id="B32">
<label>32</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Henikoff</surname> <given-names>S</given-names>
</name>
<name>
<surname>Henikoff</surname> <given-names>JG</given-names>
</name>
</person-group>. <article-title>Amino acid substitution matrices from protein blocks</article-title>. <source>Proc Natl Acad Sci</source> (<year>1992</year>) <volume>89</volume>:<page-range>10915&#x2013;9</page-range>. doi: <pub-id pub-id-type="doi">10.1073/pnas.89.22.10915</pub-id>
</citation>
</ref>
<ref id="B33">
<label>33</label>
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Davis</surname> <given-names>J</given-names>
</name>
<name>
<surname>Goadrich</surname> <given-names>M</given-names>
</name>
</person-group>. <article-title>The relationship between precision-recall and ROC curves</article-title>, in: <conf-name>Proceedings of the 23rd International Conference on Machine Learning</conf-name>. <publisher-loc>New York, NY, United States</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name> (<year>2006</year>), <page-range>233&#x2013;40</page-range>.</citation>
</ref>
<ref id="B34">
<label>34</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fawcett</surname> <given-names>T</given-names>
</name>
</person-group>. <article-title>An introduction to ROC analysis</article-title>. <source>Pattern Recognit Lett</source> (<year>2006</year>) <volume>27</volume>:<page-range>861&#x2013;74</page-range>. doi: <pub-id pub-id-type="doi">10.1016/j.patrec.2005.10.010</pub-id>
</citation>
</ref>
<ref id="B35">
<label>35</label>
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Manning</surname> <given-names>C</given-names>
</name>
<name>
<surname>Sch&#xfc;tze</surname> <given-names>H</given-names>
</name>
</person-group>. <source>Foundations of statistical natural language processing</source>. <publisher-loc>Cambridge, USA</publisher-loc>: <publisher-name>MIT Press</publisher-name> (<year>1999</year>).</citation>
</ref>
<ref id="B36">
<label>36</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Saito</surname> <given-names>T</given-names>
</name>
<name>
<surname>Rehmsmeier</surname> <given-names>M</given-names>
</name>
</person-group>. <article-title>The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets</article-title>. <source>PLoS One</source> (<year>2015</year>) <volume>10</volume>:<elocation-id>e0118432</elocation-id>. doi: <pub-id pub-id-type="doi">10.1371/journal.pone.0118432</pub-id>
</citation>
</ref>
<ref id="B37">
<label>37</label>
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Goutte</surname> <given-names>C</given-names>
</name>
<name>
<surname>Gaussier</surname> <given-names>E</given-names>
</name>
</person-group>. <article-title>A probabilistic interpretation of precision, recall and F-score, with implication for evaluation</article-title>. In: <source>European Conference on Information Retrieval</source>. <publisher-loc>Heidelberg, Germany</publisher-loc>: <publisher-name>Springer</publisher-name> (<year>2005</year>). p. <page-range>345&#x2013;59</page-range>.</citation>
</ref>
<ref id="B38">
<label>38</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Weininger</surname> <given-names>D</given-names>
</name>
<name>
<surname>Weininger</surname> <given-names>A</given-names>
</name>
<name>
<surname>Weininger</surname> <given-names>JL</given-names>
</name>
</person-group>. <article-title>SMILES. 2. Algorithm for generation of unique SMILES notation</article-title>. <source>J Chem Inf Comput Sci</source> (<year>1989</year>) <volume>29</volume>:<fpage>97</fpage>&#x2013;<lpage>101</lpage>. doi: <pub-id pub-id-type="doi">10.1021/ci00062a008</pub-id>
</citation>
</ref>
<ref id="B39">
<label>39</label>
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Weiss</surname> <given-names>K</given-names>
</name>
<name>
<surname>Khoshgoftaar</surname> <given-names>TM</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>D</given-names>
</name>
</person-group>. <article-title>A survey of transfer learning</article-title>. <source>J Big Data</source> (<year>2016</year>) <volume>3</volume>:<fpage>1</fpage>&#x2013;<lpage>40</lpage>. doi: <pub-id pub-id-type="doi">10.1186/s40537-016-0043-6</pub-id>
</citation>
</ref>
</ref-list>
</back>
</article>