Epitope immunogenicity prediction through repertoire-wide TCR-peptide contact profiles

Computational methodologies to predict epitopes for cytotoxic T lymphocytes (CTLs) will galvanize vaccine research and pave the way toward targeted immunotherapy of infections and cancer. However, the classification of immunogenic epitopes and non-immunogenic major histocompatibility complex (MHC) class I ligands in silico remains difficult. Here, we defined a novel framework quantifying the interactions between a given peptide and T cell receptor (TCR) repertoire. Using 4738 peptide sequences and a pooled TCR repertoire, an epitope classifier with unprecedented accuracy in hold-out validation was constructed. The classifier was applicable to multiple human leucocyte antigen supertypes. The classifier was further validated independently using pathogen epitope datasets and tumor neoepitope datasets. A panel of neoepitope-rich genes was identified using The Cancer Genome Atlas (TCGA) datasets. The R package Repitope was implemented to maximize code reusability. This is the first study demonstrating in silico CTL epitope prediction with clinically meaningful robustness; thus, prospective validation is warranted.

Introduction

The adaptive immune system is driven by antigen recognition. The capability of triggering immune responses is termed 'immunogenicity'. Antigens are processed into fragments of peptides by proteasomes, and coupled to major histocompatibility complex [MHC; also called the human leucocyte antigen (HLA) in humans] molecules on the surface of antigen-presenting cells (APCs). Antigenic peptides presented by MHC-bearing cells are called MHC ligands. Naïve T cells interact with the MHC ligands (MHCLs) via their receptor (T cell receptor, TCR), and successful recognition activates them to initiate subsequent immunological orchestration(1). Immunogenic MHCLs are termed 'epitopes'. However, being an MHCL does not ensure immunogenicity(2).

Acquired immunity plays an indispensable role in rejecting both pathogens and tumors. Accumulating evidence is shedding light on mutation-derived epitopes, or neoepitopes, as the targets of anticancer T cell immunity. First, the efficacy of immune checkpoint inhibitors correlates with tumor mutational burden(3-6). Second, mismatch-repair deficiency, which increases the overall genomic instability and tumor mutational burden, has been shown to predict a better outcome in patients receiving checkpoint blockade therapy(7), which eventually led to the FDA approval of the first pan-cancer efficacy biomarker(8). Third, the presence of neoepitope-specific T cells in [...]

bioRxiv preprint first posted online Jun. 25, 2017; doi: http://dx.doi.org/10.1101/155317. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.

[...] successfully invigorated anti-neoepitope immunity. Their results suggest that even MHCLs non-immunogenic to autologous TCRs can serve as epitopes if recognized by appropriate TCRs.

We started this project aiming to unveil the enigma of the immunogenicity of MHC-I-loaded peptides on the basis of the following hypothesis: are peptides stably interacting with the host TCR repertoire more likely to be immunogenic? If this is the case, prediction of peptide immunogenicity may be significantly improved by incorporating the TCR-peptide axis. Given that human TCR repertoires are evolutionarily optimized so as to effectively combat pathogens and cancers, we utilized as a reference a pooled human TCR repertoire sequenced from commercially available RNA of peripheral blood CD8+ T cells. We defined repertoire-wide TCR-peptide contact profiles (rTPCP) using amino acid pairwise contact potential (AACP) scales to quantitatively parametrize TCR-peptide interactions and to classify epitopes and MHCLs through a machine learning (ML) approach. Our initial model achieved unprecedented accuracy in hold-out validation. When the rTPCP definition was modified to incorporate position-specific effects (mrTPCP), comparable accuracy was achieved with just one AACP scale. Prediction was not biased for at least five HLA supertypes. Permutation of peptide sequences, but not TCR sequences, undermined [...]

Preparation of pooled human TCR repertoire dataset

First, we screened public databases such as Sequence Read Archive, but failed to find a suitable human TCR sequence dataset. Therefore, we generated an in-house TCR repertoire dataset by sequencing the variable regions of TCR β chains (TCR-Vβ) from commercially available pooled human peripheral CD8+ T cells. Among the three complementarity-determining regions (CDRs), we focused on CDR3, because it has the largest diversity among the CDRs, and CDR1 and CDR2 are primarily involved in the recognition of MHC, not the ligand presented(1). Rarefaction analysis estimated the total [...]

Immunogenicity prediction from repertoire-wide TCR-peptide contact profiles parametrized using amino acid contact potentials

An immunogenicity prediction model necessitates quantitative parametrization of the likelihood that a given peptide stably interacts with a given set of TCRs. Although molecular dynamics simulation would be the most accurate method, it is not appropriate because of its high demand for computational power; our goal is to construct a "portable" prediction framework that can be run on ordinary desktop computers. To simplify the framework, we adopted a sequence-based prediction strategy using AACPs listed in the AAIndex database(25) (http://www.genome.jp/aaindex/AAindex/list_of_potentials) as the measurement of energetic stability, or the decrease in free energy, of TCR-peptide interaction. We hereby propose the concept of rTPCP, where a given peptide contacts all TCRs in a given repertoire with varying contact potentials (Fig. 1; see Supplementary Materials and Methods for details). Using the rTPCP variables, we attempted ML-based classification of MHCL peptides into immunogenic (functional epitope) and non-immunogenic subsets. We utilized the peptide dataset compiled by Chowell et al., which contains 7582 distinct human peptides(23). Preliminary analyses suggested support vector machine (SVM) as the most accurate and balanced algorithm. As an initial attempt, we focused on 450 epitopes and 450 ligands restricted to human leucocyte antigen A2 (HLA-A2). We retrieved 35 AAIndex AACP scales (table S1) [...]

MHC-loaded peptides interact with TCRs at specific positions(1). The effects of position-specific interactions may counterbalance each other in TPCP. To test this hypothesis, we modified the rTPCP definition to incorporate position-specific interactions (mrTPCP; schematically depicted in Figure 3) [...] (28), respectively.

The amino acid compositions of MHCLs are restricted by the HLA to which they are coupled. Since our mrTPCP framework is not dependent on HLA information, it might be useful for pan-specific immunogenicity prediction. To test this hypothesis, 4738 unique nonapeptides in the Chowell dataset were stratified based on their corresponding HLA supertypes, and ROC analysis was conducted (Fig. 4B). The trained classifier worked with no significant decrease in accuracy for at least five major HLA supertypes (HLA-A1, A2, B15, B44, and B57) for which a sufficient amount of peptide data was available.

Previous studies suggest that position-specific amino acid usage biases in MHC-coupled peptides affect their immunogenicity(21, 23). In our model, windows 1 and 2 seemed to be of higher importance, but no exceptionally important window was identified (fig. S4). To further evaluate these position-dependent characteristics, we next conducted sequence manipulation analysis; mrTPCP variables were calculated for manipulated peptide sequences or using manipulated reference TCR repertoire sequences. The classifier trained from authentic TCRs and peptides was then applied to perform ROC analysis. Manipulation of TCRs led to a minimal decrease in AUC, whereas manipulation of peptides led to a significant decrease in AUC (Fig. 4C). Difference in amino acid compositions between epitopes and MHCLs was only of partial predictive significance, indicating that position-specific or sequence-specific features are the major determinants of immunogenicity.

Collectively, these observations suggest that the mrTPCP framework effectively mimics the biological mechanisms of CTL immunogenicity, thereby providing a [...]

Immunogenicity prediction using independent datasets

Any pattern learned from one dataset is not always extendable to other datasets constructed in different contexts. Therefore, we tested the performance of our immunogenicity prediction model by utilizing independent datasets adopted from previous publications (4, 10, 24, 29-32), after removing peptides overlapping with those in the Chowell dataset. As expected, 10,000 randomly selected MHCLs retrieved from the Immune Epitope Database (IEDB) were predicted as either immunogenic or non-immunogenic in an approximately 1:1 ratio, with a uniform distribution of predicted probabilities (Fig. 5A and Table 1). In contrast, epitope datasets of viral and tumor origin were significantly enriched with peptides predicted as epitopes (p < 0.01 by Wilcoxon's rank sum test in comparison with randomly selected MHCLs from IEDB). It [...]

Encouraged by these observations, we next explored the possibility that our immunogenicity prediction model improves the correlation between neoepitope burden and clinical outcomes in checkpoint inhibitor trials. First, we adopted clinical and mutational data from non-small cell lung carcinoma (NSCLC) patients treated with pembrolizumab (n = 23)(3). We observed a slightly improved correlation between neoepitope burden and progression-free survival (PFS) (R = 0.55, p = 0.007), compared with the correlation between originally reported mutated peptide burden and PFS (R = 0.61, p = 0.002), although the improvement is not statistically significant as determined by the methods implemented in the cocor package(33) (Fig. 6A). The PFSs of three patients, namely, CA9903, CU9061, and SA9755, were better predicted (Fig. 6A). Next, we analyzed clinical and mutational data from melanoma patients treated with ipilimumab (n = 110)(5). Clinical benefit (CB) was defined as originally reported(5). There were significant differences in both mutational burden and predicted neoepitope burden between patients with and without CB (Fig. 6B). Overall, our results showed that neoepitope burden predicted through the mrTPCP framework is at least as useful a biomarker as conventional mutational burden, with a greatly reduced number of neoepitope candidates, enabling a more focused approach to precision immunotherapy.

Finally, we compared estimated neoepitope burden across 21 tumor types in The [...]

R package implementation of immunogenicity prediction framework

We implemented the R package Repitope to maximize code reusability. Repitope contains datasets used in this study, functions to calculate rTPCP and mrTPCP variables for user-provided peptide datasets and reference TCR repertoire data, and the mrTPCP SVM classifier developed in this study. Source code is deposited for public use at GitHub (https://github.com/masato-ogishi/Repitope/).

In this work, the accurate classification of epitopes and non-immunogenic MHC-I ligands was achieved by introducing the concept of repertoire-wide TCR-peptide contact profiles. Considering that current concepts of CTL epitope prediction are mostly focused on the peptide-MHC axis, it is of interest that our immunogenicity prediction model incorporating the TCR-peptide axis showed improved predictive capability over previous models.
We decided to use the dataset previously compiled by Chowell et al. for the following reasons. First, we eschewed compiling peptide datasets from scratch to avoid potential selection bias. Second, the dataset contains a sufficiently large number of human peptide data from IEDB, the largest and least biased data source available. Finally, the mutual exclusiveness of epitopes and MHCLs included in the dataset is ideal for model training and evaluation; the immunogenic CTL epitopes included were defined by T cell [...]

Moreover, our model also improved upon previous ones in that it employs a smaller number of variables(19, 22) (fig. S3). Generally, ML classifiers using smaller numbers of variables are preferable, since over-parametrization frequently causes ML algorithms to "cheat", or to find variables distributed unevenly between the two classes in question just because of stochastic fluctuation (with no generalizability for external data). Our mrTPCP model employs only one AACP scale for parametrization, which resulted in 187 mrTPCP variables. This is a fairly small number when considering the number of input peptides.
Our mrTPCP framework has two notable features: independence from HLA specificity, and dependence on a reference TCR repertoire. First, pan-specific immunogenicity prediction may be feasible, as it does not depend on HLA information. We showed that our model worked with minimal performance reduction for at least five major HLA supertypes (HLA-A1, A2, B15, B44, and B57), for which a sufficient amount of peptide data was available (Fig. 4B). This point could further be explored with more immunogenicity data for various HLA alleles in the future. Second, the framework requires a reference TCR repertoire. The model discussed in this study relies on the pooled TCR repertoire of German origin, which could be a source of bias. However, immunogenicity could still be predicted with a minimal decrease in AUC, even when using completely random sequences instead of the TCR repertoire (Fig. 4C). Conversely, manipulation of input peptide sequences resulted in a significant decrease in predictive accuracy (Fig. 4C). These observations suggest that the mrTPCP framework is primarily dependent on the inherent features of the epitope sequence but not the reference repertoire. Interestingly, peptide sequence permutation and randomization with relative amino acid compositions retained led to moderately decreased AUC (0.68 and 0.64, respectively), whereas completely random peptide sequences could not be classified (AUC = 0.50). This is consistent with the previous research of Calis et al., in which an AUC of 0.65 was obtained when residue-specific properties but not sequence-specific properties were taken into consideration(21). Collectively, both amino acid composition and sequence-specific features recapitulated by the mrTPCP framework are important in determining peptide immunogenicity.

The regulatory mechanisms of CTL activation are asymmetric, and it is this asymmetry that makes the construction of immunogenicity prediction models complicated. The activation part is relatively simple; stable and strong interactions in the TCR-peptide-MHC complex are the main driving force of T cell activation(1). In contrast, there are several immunomodulatory systems outside the TCR-peptide-MHC axis affecting the T cell activation process in vivo, including regulatory T cells (Tregs)(35), CTL exhaustion mediated by chronic immune checkpoint signals, and the immunosuppressive microenvironment engendered by solid tumors(36, 37). Considering this asymmetry, immunogenicity prediction models based solely on peptide sequence may in principle yield some false positives. Therefore, our results should be recognized as preliminary, warranting prospective validation to evaluate their clinical applicability. That being said, eliminating candidates least likely to be immunogenic in silico should greatly expedite research in targeted immunotherapy, and the findings in our present study are indeed encouraging; epitopes of viral and tumor origin not included in the training/testing dataset were successfully predicted with high sensitivity, whereas predicted probabilities of MHCLs randomly retrieved from IEDB were distributed almost uniformly from 0 to 1 (Fig. 5 and Table 1). It is reasonable to assume a distribution of levels of immunogenicity in the dataset of randomly selected MHCLs without T cell assay-based annotation. Furthermore, we showed that the usefulness of neoepitope burden as a biomarker for clinical outcome was not affected, or even slightly improved, when candidate mutations were filtered using our prediction model (Fig. 6). One caveat to be mentioned is its relatively low sensitivity in predicting HIV epitopes. In addition to the "general" rules learned from the Chowell dataset, which contains epitopes from various sources, some additional rules may be critical for HIV-specific CTL immunity and could be machine-learned with more data obtained specifically in the context of chronic HIV [...]

Similarly to previous studies on immunogenicity prediction, this study has several limitations. First, this is a retrospective observational study; no prospective identification of novel epitopes is demonstrated. Thus, prospective validation is indispensable before this model can be clinically applied. Second, the process of quantitative parametrization of TCR-peptide interactions could further be optimized, as our window-based pairwise interaction model might oversimplify the biophysicochemical nature of the TCR-peptide-MHC interactions. In particular, the hypothesis that either a 4-mer or 5-mer window size is sufficient for recapitulating TCR-peptide interactions is not experimentally verified, necessitating further exploration. Moreover, we limited our modeling to TCR-Vβ, omitting TCR-Vα; this point could further be explored. Despite these caveats, both the proposed framework of mrTPCP recapitulating the biology of TCR-peptide interactions and the demonstrated robustness of immunogenicity prediction represent noticeable progress toward fully unveiling the mechanisms underlying CTL immunity, paving the way toward precision immunotherapy against pathogens and cancer.

In conclusion, accurate epitope prediction was achieved through a machine learning approach by incorporating TCR-peptide interactions parametrized using an optimal amino acid pairwise contact potential scale. Unbiased prediction was demonstrated for peptides coupled to multiple major HLA supertypes. The framework was primarily reliant on the sequence-dependent features of the peptides, and only minimally affected by the perturbation of the reference TCR repertoire. The resultant classifier worked well for independent viral epitopes and tumor neoepitopes. These findings not only provide valuable insights into the mechanisms underlying CTL immunity, but could also bolster the ongoing precision immunotherapy initiatives. Code reusability was maximized by publicly distributing the R package Repitope, in which datasets and key scripts are bundled. Disease-specific, prospective cohort studies could be conducted to evaluate clinical usefulness in the future.

Study design

Research objectives. The purpose of this study was to construct a sequence-based epitope prediction model by incorporating TCR-peptide contact profiles.
Design. This is a retrospective, observational study. The entire analysis is exploratory; no predetermined experimental protocol was applied a priori.
Data collection. Peptide sequences accompanied by annotations on immunogenicity and other clinical profiles (if applicable) were manually retrieved by the authors from public databases and previously published articles.
Data size. The optimal sizes of the epitope and MHCL datasets are unknown, since we hereby proposed a novel framework. Therefore, no statistical estimation was performed to predetermine sample size.

Computational analysis

All in-house computational analyses were conducted using R ver. 3.4.0 (https://www.r-project.org/)(42). The latest versions of R packages were consistently used. Key datasets and scripts were bundled as the R package Repitope, and publicly distributed on GitHub (https://github.com/masato-ogishi/Repitope/). Other scripts are available upon request.

Preparation of pooled human TCR repertoire dataset

TCR repertoire sequencing was carried out as previously described(43) [...]

Epitope/ligand dataset for training/testing immunogenicity prediction model

The dataset primarily utilized in this study originates from the research led by Chowell et al.(23). Any peptide derived from a mouse experiment was removed to create a human-specific immunogenicity dataset. No additional data filtering was performed, so as to avoid deliberate peptide selection.

Machine learning for immunogenicity prediction

Machine learning (ML) procedures were streamlined using the caret package in R(49). The hold-out validation strategy was adopted; the input dataset was randomly split into training and testing subdatasets in a ratio of 2:1. The training subdataset was preprocessed (i.e., centered and scaled) using the preProcess function in caret. Ten-fold cross-validations (CVs) were repeated ten times to train classifiers. The testing subdataset was preprocessed using the preprocessing model generated from the training subdataset, and immunogenicity was predicted. Unless otherwise noted, the performance metric in each testing subdataset was reported. As any ML algorithm is designed to self-optimize through CVs, the performance metric obtained in the process of CVs is an optimized value for the input dataset. Our true interest is the performance of the trained classifier when applied to an external dataset not involved in either model training or optimization. Preliminary assessment suggested that the support vector machine (SVM) was the best algorithm. SVM has a long history of providing state-of-the-art, well-generalizable predictions in various biological contexts(50). Four SVM methods, namely, svmLinear, svmPoly, svmRadial, and svmRadialSigma, were tested. We chose svmPoly as the best algorithm on the basis of various factors including accuracy, AUC, balance between sensitivity and specificity, and the smoothness of the calibration curve. Accuracy was calculated using the confusionMatrix function implemented in caret, and AUC was calculated using either the classifierplots function in the classifierplots package, or the roc and auc functions implemented in the pROC package(27).
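The hold-out procedure described above (a random 2:1 split, centering and scaling fitted on the training subdataset only and then reused for the testing subdataset, and AUC as one performance metric) can be sketched in plain Python. This is only an illustration under stated assumptions: the study itself used caret and pROC in R, the helper names `holdout_split`, `fit_scaler`, `apply_scaler`, and `auc` are hypothetical, and the AUC below is the generic rank-based (Mann-Whitney) formulation rather than the pROC implementation. Classifier training and cross-validation are deliberately omitted.

```python
import random
from statistics import mean, stdev


def holdout_split(rows, ratio=(2, 1), seed=0):
    """Randomly split rows into training and testing parts (default 2:1)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = len(shuffled) * ratio[0] // sum(ratio)
    return shuffled[:cut], shuffled[cut:]


def fit_scaler(train_features):
    """Learn per-column mean and SD from the training features only."""
    columns = list(zip(*train_features))
    means = [mean(col) for col in columns]
    # A constant column has SD 0; fall back to 1.0 to avoid division by zero.
    sds = [stdev(col) or 1.0 for col in columns]
    return means, sds


def apply_scaler(features, scaler):
    """Center and scale features with statistics learned by fit_scaler."""
    means, sds = scaler
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in features]


def auc(scores, labels):
    """Rank-based AUC: chance that a positive outscores a negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The design point the sketch tries to make concrete is that the scaler is fitted once on the training subdataset and only applied to the testing subdataset, so no information from the held-out peptides leaks into preprocessing.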

Epitope/ligand datasets for external validation

The hold-out validation strategy is by itself not sufficient for evaluating the generalizability of the ML classifier for external datasets. The ML algorithm, after all, mines hidden patterns applicable across the training dataset. When the hold-out validation strategy is adopted, training and testing subdatasets derived from a single data source lie in a single context, and consequently, patterns learned from the training subdataset are highly likely to be applicable to the testing subdataset. Therefore, the trained classifier should be tested and validated with other external datasets constructed in different contexts. In this study, the trained classifier may be biased, since autoimmunity- and cancer-associated immunogenic peptides were excluded from the epitope data, and pathogen-derived MHCLs were excluded from the MHCL data in the Chowell dataset.

To independently assess the generalizability of the trained immunogenicity prediction model, epitope datasets were collected from the following sources: (i) [...] Note that the MHCL data lacks T cell assay annotation, and thus the true ratio of 'epitopes' to 'MHCLs' in the definitions discussed in this study is unknown. Moreover, we retrieved the epitope/MHCL dataset originally reported by Calis et al.(21). This dataset is suitable for assessing the specificity of our immunogenicity prediction model, because it contains experimentally validated non-immunogenic MHCLs, mostly originating from dengue virus. Peptide sequences containing alphabetical characters other than those representing the 20 authentic amino acid residues were removed. Any peptide contained in the Chowell [...]

Correlation with clinical outcomes in checkpoint inhibitor trials

Correlation between mutational landscapes and clinical outcome has been shown in various tumor types in checkpoint inhibitor trials(3-5). To test the predictive usefulness of neoepitope burden predicted through the proposed framework, we re-analyzed mutational datasets from two studies(3, 5). Datasets are available as supplementary data files (Data files S7 and S8).

Neoepitope burden across TCGA tumor types

To assess the difference in neoepitope burden across tumor types, we analyzed [...] binding stability of all nonapeptides were estimated using the EpitopePrediction [...]

Statistical analysis

No variable distribution was assumed a priori, and data were presented as median and interquartile range, unless otherwise stated. P values were reported unadjusted unless otherwise stated. No accounting for missing data values is applicable. All statistical analysis is exploratory; no predetermined experimental protocols were applied before initiating the entire project. All statistical analyses were conducted in R.

Table S1. AAIndex AACP scales used in this study.

[...] 419-66 (2006).
[...] in non-small cell lung cancer. Science 348, 124-128 (2015).
[...] TPCPs were calculated against multiple TCR CDR3 sequences, and rTPCP is expressed as a set of representative statistics of TPCPs.
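The idea in the caption above, contact potentials computed against many CDR3 sequences and then summarized, can be illustrated with a minimal Python sketch. The actual TPCP definition is given in the Supplementary Materials and Methods and is not reproduced here, so every specific below is an assumption made for illustration: `toy_potential` is a hypothetical placeholder and not an AAIndex AACP scale, the ungapped sliding alignment is one of several conceivable ways to pair residues, and the chosen summary statistics (min, max, mean, median) are only examples of "representative statistics". Repitope is an R package; this is not its API.

```python
from statistics import mean, median

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_potential(a, b):
    """Hypothetical symmetric pairwise potential; NOT a real AAIndex scale."""
    i, j = AMINO_ACIDS.index(a), AMINO_ACIDS.index(b)
    return -abs(i - j) / 19.0  # arbitrary placeholder values in [-1, 0]


def tpcp(peptide, cdr3, potential=toy_potential):
    """Contact potentials of one peptide against one CDR3 sequence.

    Assumption: the shorter sequence slides along the longer one without
    gaps, and each offset is scored as the sum of pairwise potentials.
    """
    short, long_ = sorted((peptide, cdr3), key=len)
    n_offsets = len(long_) - len(short) + 1
    return [
        sum(potential(res, long_[offset + k]) for k, res in enumerate(short))
        for offset in range(n_offsets)
    ]


def rtpcp(peptide, repertoire, potential=toy_potential):
    """Summarize TPCP scores of one peptide over a whole CDR3 repertoire."""
    scores = [s for cdr3 in repertoire for s in tpcp(peptide, cdr3, potential)]
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": mean(scores),
        "median": median(scores),
    }
```

Under these assumptions, each peptide is mapped to a fixed-length vector of summary statistics regardless of repertoire size, which is what allows the profile to serve as input to a classifier.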
[...] HLA-stratified ROC analysis. The entire Chowell dataset was sorted according to their HLA restriction, and six most data-rich HLA supertypes were selected for visualization. (C) Sequence manipulation analysis. Either the input peptide sequences or the reference TCR repertoire sequences were manipulated, and mrTPCP variables were calculated. The authentically trained SVM classifier was applied. Inv, inversion of the sequence; Perm, permutation of the sequence; Syns, randomly synthesized sequences with relative amino acid frequencies retained. For peptides, amino acid frequencies of immunogenic and non-immunogenic peptides were separately considered; Random, completely random sequences. (B-C) AUC was calculated and graphics were generated using pROC and plotROC packages in R, respectively.
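The four manipulations named in the caption (Inv, Perm, Syns, Random) can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, and the exact sampling scheme used in the study (for example, whether lengths are preserved per sequence) is not stated in the text, so preserving each sequence's length is an assumption.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def invert(seq):
    """Inv: reverse the sequence."""
    return seq[::-1]


def permute(seq, rng):
    """Perm: shuffle residues within one sequence (composition unchanged)."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def synthesize(seqs, rng):
    """Syns: random sequences drawn from the pooled residue frequencies.

    Assumption: each synthetic sequence keeps the length of its original.
    """
    counts = Counter("".join(seqs))
    residues = list(counts)
    weights = [counts[r] for r in residues]
    return ["".join(rng.choices(residues, weights=weights, k=len(s))) for s in seqs]


def randomize(seq, rng):
    """Random: uniform draw over the 20 standard amino acids."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in seq)
```

To mirror the caption's note that frequencies of immunogenic and non-immunogenic peptides were considered separately, `synthesize` would be called once per class rather than on the pooled peptide list.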
[...] burden/predicted neoepitope burden in non-small cell lung carcinoma (NSCLC) patients treated with pembrolizumab(3). The three patients for which scaled fitting residuals decreased by more than 1 when the predicted neoepitope burden was used as a correlate were labeled. Adjusted correlation coefficient (R²) was calculated using the stat_poly_eq function in the ggpmisc package. (B) Clinical benefit (CB) was associated with heavier mutational burden/predicted neoepitope burden in melanoma patients treated with ipilimumab(5). CB was defined as in the original paper(5). NCB, no clinical benefit.
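The labeling rule in panel (A), flagging patients whose scaled fitting residual decreased by more than 1 when the predictor was switched, can be sketched as below. The text does not define "scaled" precisely, so this sketch assumes an ordinary least-squares line and residuals divided by their sample standard deviation; the helper names are hypothetical, and the toy numbers in the usage note are invented for illustration and are not patient data.

```python
from statistics import mean, stdev


def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y ~ x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def scaled_residuals(x, y):
    """Residuals of the fit divided by their sample SD (assumed scaling)."""
    slope, intercept = linear_fit(x, y)
    residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]
    sd = stdev(residuals) or 1.0  # perfect fit: avoid division by zero
    return [r / sd for r in residuals]


def better_predicted(x_old, x_new, y, threshold=1.0):
    """Indices whose absolute scaled residual drops by more than threshold."""
    old = scaled_residuals(x_old, y)
    new = scaled_residuals(x_new, y)
    return [i for i, (o, n) in enumerate(zip(old, new)) if abs(o) - abs(n) > threshold]
```

For example, with an outcome `y = [1, 2, 3, 4, 5]`, an uninformative predictor `[1, 2, 3, 4, 0]`, and a perfectly informative predictor `[1, 2, 3, 4, 5]`, the first and last samples are flagged as better predicted.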
[...] (https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations). (B) The correlation between predicted MHCL burden and neoepitope burden per gene. HUGO symbols are shown for genes enriched with neoepitopes. Enrichment was defined as fitting residuals being larger than 10 for the purpose of tidy visualization. Adjusted correlation coefficient was calculated using the stat_poly_eq function in the ggpmisc package.
Table 1. Prediction results on datasets independent from training/validation data. Immunogenicity was predicted using the mrTPCP-based SVM classifier (Fig. 4A) [...]

SIMK990104
Distance-dependent statistical potential (contacts within 10-12 Angstroms)