TCR-peptide contact profile determines immunogenicity in pathogen/tumor-derived MHC-I epitopes.

One Sentence Summary: Accurate epitope prediction was achieved via machine learning by incorporating TCR-peptide contact profiles.

Abstract
Computational methodologies that predict cytotoxic T lymphocyte (CTL) epitopes will galvanize vaccine research and pave the way toward targeted immunotherapy of infections and cancer. However, the in silico classification of immunogenic epitopes and non-immunogenic major histocompatibility complex (MHC) class I ligands has yet to attain sufficient accuracy. Here, we demonstrated highly accurate epitope prediction by a machine learning-based classifier incorporating T cell receptor (TCR)-peptide contact profiles, with an accuracy of 0.77 and an area under the curve of 0.84 in hold-out validation. Predictive accuracy was retained for five major human leucocyte antigen supertypes. Successful prediction using independent datasets of viral epitopes and tumor neoepitopes was demonstrated. Collectively, this is the first study demonstrating accurate and generalizable CTL immunogenicity prediction from the TCR-peptide axis. The R package Repitope was implemented to maximize code reusability. Prospective validation in vaccination and/or cancer immunotherapy cohorts is warranted.


Introduction
The adaptive immune system is driven by antigen recognition. The capability of triggering immune responses is termed 'immunogenicity'. Antigens are processed into peptide fragments by proteasomes and coupled to major histocompatibility complex [MHC; also called the human leucocyte antigen (HLA) in humans] molecules on the surface of antigen-presenting cells (APCs). Antigenic peptides presented by MHC-bearing cells are called MHC ligands (MHCLs). Naïve T cells interact with MHCLs via their T cell receptor (TCR), and successful recognition activates them to initiate subsequent immunological orchestration(1). Immunogenic MHCLs are termed 'epitopes'. Being an MHCL, however, does not ensure immunogenicity(2).

Acquired immunity plays an indispensable role in rejecting both pathogens and tumors. Accumulating evidence points to mutation-derived epitopes, or neoepitopes, as the targets of anticancer T cell immunity. First, the efficacy of immune checkpoint inhibitors correlates with tumor mutational burden(3-6). Second, mismatch-repair deficiency, which increases overall genomic instability and tumor mutational burden, has been shown to predict a better outcome in patients receiving checkpoint blockade therapy(7), which eventually led to the FDA approval of the first pan-cancer efficacy biomarker(8). Third, the presence of neoepitope-specific T cells in [...]

bioRxiv preprint, doi: https://doi.org/10.1101/155317; this version posted June 25, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-NC 4.0 International license.

[...] Furthermore, patient-derived T cells transformed with the appropriate donor TCR successfully invigorated anti-neoepitope immunity. Their results suggest that even MHCLs non-immunogenic to autologous TCRs can serve as epitopes if recognized by appropriate TCRs.

We started this project to unravel the immunogenicity of MHC-I-loaded peptides on the basis of the following hypothesis: peptides that interact stably with the host TCR repertoire are more likely to be immunogenic. If this is the case, prediction of peptide immunogenicity may be significantly improved by incorporating the TCR-peptide axis.
Given that human TCR repertoires are evolutionarily optimized to combat pathogens and cancers effectively, we utilized as a reference a pooled human TCR repertoire sequenced from commercially available RNA of peripheral blood CD8+ T cells. We defined repertoire-wide TCR-peptide contact profiles (rTPCP) using amino acid pairwise contact potential (AACP) scales to quantitatively parametrize TCR-peptide interactions, and used them to classify epitopes and MHCLs through a machine learning (ML) approach. Our initial model achieved unprecedented accuracy in hold-out validation. When the rTPCP definition was modified to incorporate position-specific effects (mrTPCP), comparable accuracy was achieved with just one AACP scale. Prediction was not biased for at least five HLA supertypes. Permutation of peptide sequences, but not TCR sequences, undermined [...]

[...] because of its high demand for computational power; our goal is to construct a "portable" prediction framework that can be run on ordinary desktop computers. To simplify the framework, we adopted a sequence-based prediction strategy using AACPs listed in the AAIndex database(25) (http://www.genome.jp/aaindex/AAindex/list_of_potentials) as the measurement of energetic stability, or the decrease in free energy, of TCR-peptide interaction. We hereby propose the concept of rTPCP, where a given peptide contacts all TCRs in a given repertoire with varying contact potentials [...]

As an initial attempt, we focused on 450 epitopes and 450 ligands restricted to human leucocyte antigen A2 (HLA-A2). We retrieved 35 AAIndex AACP scales (table S1) [...]

Collectively, these observations suggest that the mrTPCP framework effectively mimics the biological mechanisms of CTL immunogenicity, thereby providing a promising methodology for accurate epitope prediction.
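The rTPCP idea, in which one peptide is scored against every TCR in a reference repertoire and the resulting contact potentials are summarized into features, can be illustrated with a minimal Python sketch. This is not the Repitope implementation (which is written in R). The three-letter toy alphabet, the values in TOY_AACP, the ungapped sliding-window scoring, the window size, and the choice of summary statistics are all assumptions made for illustration; a real analysis would load an AACP scale from AAIndex and a sequenced repertoire.

```python
from statistics import mean, pstdev

# Hypothetical symmetric contact potentials for a toy 3-letter alphabet.
# Real AACP scales come from AAIndex; these numbers are made up.
TOY_AACP = {
    ("A", "A"): -0.1, ("A", "L"): -0.4, ("A", "K"): 0.2,
    ("L", "L"): -0.9, ("L", "K"): 0.1, ("K", "K"): 0.5,
}


def pair_potential(a, b):
    """Look up the contact potential for a residue pair in either order."""
    return TOY_AACP.get((a, b), TOY_AACP.get((b, a)))


def best_window_potential(peptide, tcr, window=3):
    """Lowest (most stable) summed potential over all ungapped window pairings."""
    best = None
    for i in range(len(peptide) - window + 1):
        for j in range(len(tcr) - window + 1):
            score = sum(
                pair_potential(peptide[i + k], tcr[j + k]) for k in range(window)
            )
            if best is None or score < best:
                best = score
    return best


def contact_profile(peptide, repertoire, window=3):
    """Summarize one peptide's best contact potentials across a whole repertoire."""
    scores = [best_window_potential(peptide, tcr, window) for tcr in repertoire]
    return {"min": min(scores), "mean": mean(scores), "sd": pstdev(scores)}


# Toy repertoire and peptide (hypothetical sequences over the toy alphabet).
toy_repertoire = ["ALKLA", "KKAKK", "LLALL"]
profile = contact_profile("LALKL", toy_repertoire)
```

Each peptide is thereby reduced to a fixed-length vector of repertoire-wide summary statistics, which is the kind of input a classifier such as an SVM can consume regardless of peptide or repertoire size.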

Immunogenicity prediction using independent datasets
A pattern learned from one dataset does not always extend to datasets constructed in different contexts. Therefore, we tested the performance of our immunogenicity prediction model on independent datasets adopted from previous publications(4, 10, 24, 29-32), after removing peptides overlapping with those in the Chowell dataset. As expected, 10,000 randomly selected MHCLs retrieved from the Immune Epitope Database (IEDB) were predicted as either immunogenic or non-immunogenic in an approximately 1:1 ratio, with a uniform distribution of predicted probabilities (Fig. 5A and Table 1). In contrast, epitope datasets of viral and tumor origin were significantly enriched with peptides predicted as epitopes (p < 0.01 by Wilcoxon's rank sum test in comparison with randomly selected MHCLs from IEDB).

[...] by the methods implemented in the cocor package(33) (Fig. 6A). The progression-free survival (PFS) of three patients, namely, CA9903, CU9061, and SA9755, was better predicted (Fig. 6A). Next, we analyzed clinical and mutational data from melanoma patients treated with ipilimumab (n = 110)(5). Clinical benefit (CB) was defined as originally reported(5). There were significant differences in both mutational burden and predicted neoepitope burden between patients with and without CB (Fig. 6B). Overall, our results showed that neoepitope burden predicted through the mrTPCP framework is at least as useful a biomarker as conventional mutational burden, with a greatly reduced number of neoepitope candidates, enabling a more focused approach to precision immunotherapy.

[...] were the three most MHCL-enriched and neoepitope-enriched types of cancer (Fig. 7A). The ratio of neoepitope burden to MHCL burden varied significantly from gene to gene (Fig. 7B). Mitochondrial enzymes (MT-CO1 and MT-ND4) and olfactory receptors (OR2T2, OR4A5, OR4C16, OR4K2, OR5J2, and OR7D4) were particularly high-yield genes in terms of neoepitopes.
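The gene-by-gene comparison reduces to counting, for each gene, how many candidate MHCLs the classifier predicts to be immunogenic and dividing by the total number of candidate MHCLs. A minimal Python sketch follows. The record layout, the gene names GENE_A and GENE_B, the probability values, and the 0.5 cutoff are hypothetical and chosen only for illustration; the threshold actually used in the study is not stated in this section.

```python
from collections import defaultdict


def burden_ratio_by_gene(records, threshold=0.5):
    """Return {gene: neoepitope burden / MHCL burden}.

    records: iterable of (gene, predicted_probability), one per candidate MHCL.
    threshold: assumed probability cutoff for calling a peptide a neoepitope.
    """
    mhcl_burden = defaultdict(int)
    neoepitope_burden = defaultdict(int)
    for gene, probability in records:
        mhcl_burden[gene] += 1
        if probability >= threshold:
            neoepitope_burden[gene] += 1
    return {gene: neoepitope_burden[gene] / mhcl_burden[gene] for gene in mhcl_burden}


# Hypothetical predictions for two placeholder genes.
toy_records = [("GENE_A", 0.9), ("GENE_A", 0.2), ("GENE_A", 0.7), ("GENE_B", 0.1)]
ratios = burden_ratio_by_gene(toy_records)
```

Genes with a high ratio contribute disproportionately many predicted neoepitopes relative to the number of ligands they present.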

R package implementation of immunogenicity prediction framework
We implemented the R package Repitope to maximize code reusability. Repitope contains the datasets used in this study, functions to calculate rTPCP and mrTPCP variables for user-provided peptide datasets and reference TCR repertoire data, and the mrTPCP SVM classifier developed in this study. The source code is deposited for public use on GitHub (https://github.com/masato-ogishi/Repitope/).

[...] (Figs. 2 and 4).

Our mrTPCP framework has two notable features: independence from HLA specificity, and dependence on a reference TCR repertoire. First, pan-specific immunogenicity prediction may be feasible, as the framework does not depend on HLA information. We showed that our model worked with minimal performance reduction for at least five major HLA supertypes (HLA-A1, A2, B15, B44, and B57), for which sufficient peptide data were available (Fig. 4B). This point could be explored further as more immunogenicity data for various HLA alleles become available. Second, the framework requires a reference TCR repertoire. The model discussed in this study relies on a pooled TCR repertoire of German origin, which could be a source of bias. However, immunogenicity could still be predicted with a minimal decrease in AUC even when completely random sequences were used instead of the TCR repertoire (Fig. 4C). Conversely, manipulation of input peptide sequences resulted in a significant decrease in predictive accuracy (Fig. 4C). These observations suggest that the mrTPCP framework depends primarily on inherent features of the epitope sequence rather than on the reference repertoire. Interestingly, peptide sequence permutation and randomization with relative [...]

That being said, eliminating in silico the candidates least likely to be immunogenic should greatly expedite research in targeted immunotherapy, and the findings of the present study are encouraging: epitopes of viral and tumor origin not included in the training/testing dataset were successfully predicted with high sensitivity, whereas the predicted probabilities of MHCLs randomly retrieved from IEDB were distributed almost uniformly from 0 to 1 (Fig. 5 and Table 1). It is reasonable to assume a distribution of levels of immunogenicity in a dataset of randomly selected MHCLs without T cell assay-based annotation. Furthermore, we showed that the usefulness of neoepitope burden as a biomarker for clinical outcome was not affected, or was even slightly improved, when candidate mutations were filtered using our prediction model (Fig. 6). One caveat is the model's relatively low sensitivity in predicting HIV epitopes. In addition to the "general" rules learned from the Chowell dataset, which contains epitopes from various sources, some additional rules may be critical for HIV-specific CTL immunity and could be machine-learned with more data obtained specifically in the context of chronic HIV [...]

Study design
Research objectives. The purpose of this study was to construct a sequence-based epitope prediction model by incorporating TCR-peptide contact profiles.
Design. This is a retrospective, observational study. The entire analysis is exploratory; no predetermined experimental protocol was applied a priori.

[...]

The dataset primarily utilized in this study originates from the research led by Chowell et al.(23). Any peptide derived from a mouse experiment was removed to create a human-specific immunogenicity dataset. To avoid deliberate peptide selection, no additional data filtering was performed.

Machine learning for immunogenicity prediction
Machine learning (ML) procedures were streamlined using the caret package in R(49). The hold-out validation strategy was adopted: the input dataset was randomly split into training and testing subdatasets in a ratio of 2:1. The training subdataset was preprocessed (i.e., centered and scaled) using the preProcess function in caret. Ten-fold cross-validation (CV) was repeated ten times to train classifiers. The testing subdataset was preprocessed using the preprocessing model generated from the training subdataset, and immunogenicity was then predicted. Unless otherwise noted, the performance metric in each testing subdataset was reported. Because classifiers are tuned through CV, the performance metric obtained during CV is optimized for the input dataset. Our true interest is the performance of the trained classifier when applied to an external dataset not involved in either model training or optimization. Preliminary assessment suggested that the support vector machine (SVM) was the best algorithm. SVM has a long history of providing state-of-the-art, well-generalizable [...]
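The key point of the hold-out procedure is that centering and scaling parameters are learned from the training subdataset only and then reused, unchanged, on the testing subdataset. The study does this with caret in R; the sketch below illustrates the same principle in plain Python. The function names, the fixed seed, and the toy feature matrix are assumptions for illustration, and classifier training with repeated cross-validation is deliberately left out.

```python
import random
from statistics import mean, stdev


def holdout_split(rows, train_fraction=2 / 3, seed=0):
    """Shuffle rows reproducibly and split them into training and testing subsets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_scaler(train):
    """Learn per-feature mean and standard deviation from the training subset only."""
    columns = list(zip(*train))
    means = [mean(column) for column in columns]
    sds = [stdev(column) or 1.0 for column in columns]  # guard constant features
    return means, sds


def apply_scaler(rows, scaler):
    """Center and scale rows with parameters learned elsewhere (no refitting)."""
    means, sds = scaler
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


# Toy feature matrix: 9 samples, 2 features (hypothetical values).
toy_features = [[float(i), float(i % 3)] for i in range(9)]
train, test = holdout_split(toy_features)            # 2:1 split
scaler = fit_scaler(train)                           # fitted on training data only
train_scaled = apply_scaler(train, scaler)
test_scaled = apply_scaler(test, scaler)             # reuses training parameters
```

Fitting the scaler on the full dataset instead would leak information from the testing subdataset into preprocessing and make the reported hold-out performance optimistic.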

Epitope/ligand datasets for external validation
The hold-out validation strategy is by itself not sufficient for evaluating the generalizability of the ML classifier to external datasets. The ML algorithm, after all, mines hidden patterns applicable across the training dataset. When the hold-out validation strategy is adopted, training and testing subdatasets derived from a single data source lie in a single context, and consequently, patterns learned from the training subdataset are highly likely to apply to the testing subdataset. Therefore, the trained classifier should be tested and validated with other external datasets constructed in different contexts. In this study, the trained classifier may be biased, since autoimmunity- and cancer-associated immunogenic peptides were excluded from the epitope data, and pathogen-derived MHCLs were excluded from the MHCL data in the [...]

Table S1. AAIndex AACP scales used in this study.

Table 1. Prediction results on datasets independent from training/validation data. Immunogenicity was predicted using the mrTPCP-based SVM classifier (Fig. 4A) [...]