Immunogenicity landscape of cytotoxic T lymphocyte epitopes in sequence space

Cytotoxic T lymphocytes (CTLs) recognize peptides known as epitopes presented onto MHC-I molecules to trigger an immune response. However, factors demarcating epitopes from non-immunogenic MHC binders remain poorly understood. Here, using 21,146 human MHC-I-loaded peptide sequences, we demonstrated an accurate immunogenicity prediction through a sequence-based approximation of the contact potential profiles (CPPs) of peptides against pooled TCR repertoires in conjunction with MHC binding information, achieving the area under the curve (AUC) of 79%. Predictive features provided insights into molecular scanning by TCR repertoire. Our framework also worked for primate and mouse peptide datasets. The quantitative immunogenicity scores delineated the landscapes of viral escaping and epitope formation through epitope mapping in sequence space and predicted overall survivals of cancer patients in checkpoint blockade cohorts. Overall, our analysis provides insights into the high-dimensional architectures of CTL activation and expedites translational efforts toward precision immunotherapy in infectious diseases and cancer.


Introduction
the host TCR repertoire would serve as predictive features of immunogenicity. Because 48 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101/155317 doi: bioRxiv preprint of the extreme diversity of TCR repertoire, neither in vitro assay nor structural modeling 49 is a feasible approach. Instead, we hereby established an alternative, sequence-based 50 framework which we termed Repitope, approximating the CPPs using amino acid 51 pairwise contact potential (AACP) scales. We made three assumptions. First, only a 52 small fraction of TCRs in the host TCR repertoire recognize a given MHC-loaded 53 peptide. Second, CDR3 loops, which exhibit the highest degree of genetic variability, 54 are sufficient in modeling the TCR-peptide interactions. Finally, only a small part of a 55 CDR3 loop interacts with a small part of a given peptide. With those assumptions in 56 mind, molecular scanning by the host TCR repertoire can be viewed as a parallelized 57 pairwise sequence alignment problem between peptide sequences and CDR3 fragment 58 sequences, with alignment scores calculated from AACP-derived custom substitution 59 matrix as surrogate measurements of inter-fragment contact potentials. 60 With the Repitope framework, we demonstrated an accurate and quantitative 61 prediction of immunogenicity of cytotoxic T lymphocyte (CTL) epitopes presented on 62 human HLA class I molecules. Incorporation of binding to representative HLA 63 supertypes improved the overall predictive accuracy, indicating the complementary 64 roles of HLA restriction and TCR recognition. The most predictive features provided 65 mechanistic insights into CTL immunity. Mapping adjacent peptides in sequence space 66 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101/155317 doi: bioRxiv preprint enabled prediction of bidirectional immune transitions including escaping and epitope 67 formation. In silico mutagenesis followed by neighbor network clustering analysis 68 enabled identification of transition-prone peptides after single amino acid substitutions. 69 Finally, we demonstrated stratification of long-surviving patients treated with 70 checkpoint inhibitors in melanoma and lung carcinoma cohorts from the accumulation 71 of changes in immunogenicity between wild-type and mutated peptides. Collectively, 72 our findings theoretically bolster our understanding of CTL immunity and galvanize the 73 rational design of precision immunotherapy. We distributed datasets and essential 74 functions as the R package Repitope (https://github.com/masato-ogishi/Repitope/) for 75 ensuring reproducibility and warranting prospective validations in independent cohorts. 76 77 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101/155317 doi: bioRxiv preprint 6

78
Immunogenicity prediction from TCR-peptide contact potential profiling 79 We first collected a set of 21,146 peptides of lengths 8 to 11 amino acids with 80 known HLA restrictions and T cell functional assay results from the Immune Epitope 81 Database (IEDB) 15 and other literature 8,9,[16][17][18][19][20][21] , yielding 6,942 immunogenic epitopes and 82 14,204 non-immunogenic MHC binders. Experimentally verified HLA genotype and 83 serotype information was integrated as HLA restriction (Figure 1a and Supplementary 84 Table 1 and Methods), against which prediction by NetMHC 4.0 6 showed stronger 85 binding than non-restricting HLAs (Figure 1b). Meanwhile, epitopes showed 86 significantly stronger binding than mere MHC binders (Figure 1c). Tumor neoepitopes 87 showed at least comparable binding compared to virus-derived epitopes (Figure 1d). 88 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101/155317 doi: bioRxiv preprint immunodeficiency virus type 1. Neoepitope and viral data were obtained from TANTIGEN and IEDB database, respectively. In the right panel, tumor neoepitopes were depicted the same as the left panel for comparative purposes. Grey dashed line indicates the threshold of strong binders (0.5%). P values were calculated by Kruskal-Wallis test followed by Wilcoxon rank sum test with false discovery rate (FDR)-adjustment. **, p < 0.01; ***, p < 0.001; ****, p < 0.0001; ns, not significant.
As expected, previously reported tools for immunogenicity prediction showed 89 only marginal predictive capabilities (Supplementary Figure 1). To achieve accurate 90 immunogenicity prediction, we hereby propose the concept of TCR-peptide contact 91 potential profiling (CPP) (Figure 2a). We extracted and pooled human TCR β CDR3 92 sequences from datasets deposited in the Sequence Read Archive using MiXCR 93 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a  The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101/155317 doi: bioRxiv preprint length 3 played vital roles (Figure 2c). Also notably, the existence of residues K and Y 103 was indicated as essential features (Supplementary Figure 3). indeed, both residues 104 were more enriched to MHC binders (Figure 2d). In contrast, RYQ and other trimer 105 sequence motifs were enriched toward epitopes (Figure 2e). Given that nonamer 106 peptides bound to MHC class II are required to share only five residues to bind the same 107 TCR 23 , and that MHC-I presents peptides shorter than MHC-II, it is likely that trimer 108 motifs are critical for at least the initial phase of molecular scanning by TCRs. By integrating physicochemical peptide descriptors, CPP features, and HLA 114 restrictions, we established a framework for immunogenicity prediction, which we 115 termed Repitope. Incorporation of either annotation-based HLA restrictions or binding 116 prediction results against twelve HLA representatives by NetMHC 4.0 6 improved 117 predictive performance (Supplementary Figure 4). Although annotation-based 118 predictions slightly outperformed those utilizing predicted binding categories, we 119 adopted the latter as "immunogenicity scores" for subsequent analyses (Figure 3a, b). 120 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101 This makes the entire framework dependent solely on input peptide sequences. The 121 classification of epitopes and MHC binders via immunogenicity score yields sensitivity 122 and specificity of 70% and 73%, respectively, with the best score threshold being 0.35. 123 The overall accuracy, kappa, and area under the curve (AUC) were 72%, 41%, and 79%, 124 respectively ( Table 1). The high predictive performance was retained across peptide 125 However, sequence-level peptide clustering led to slightly lower AUCs 128 Figure 7). Furthermore, we found that either our framework or the 129 prediction models constructed on the human peptide dataset successfully predicted 130 immunogenicity for the peptide datasets of other species (Supplementary Figure 8). 131

(Supplementary
Particularly, immunogenicity scores calculated on the primate peptides using human 132 models without HLA information yielded AUC of 70%. This finding implies the 133 existence of shared regulatory mechanisms of CTL immunity between human and other 134 primates. Predictive performances of various algorithms and conditions tested were 135 summarized in Table 1. 136 In contrast to HLA binding, tumor neoepitopes exhibited significantly lower 137 immunogenicity scores compared to epitopes of various viral origins (Figure 1c and 138 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101/155317 doi: bioRxiv preprint Figure 3c), with some exceptions such as gp100280-288 (score = 0.85) and NY-ESO-1157-139 165 (score = 0.70). This is consistent with previous observations that neoepitopes 140 dissimilar to self and similar to pathogen-derived epitopes are more likely to be 141 Collectively, the aforementioned observations strongly suggest that the Repitope 143 framework recapitulates the high-dimensional architectures of CTL immunity to enable 144 in silico prediction of epitope immunogenicity. 145 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101/155317 doi: bioRxiv preprint Kruskal-Wallis test followed by Wilcoxon rank sum test with FDR adjustment; *, p < 0.05; ***, p < 0.001; ****, p < 0.0001; ns, not significant.

146
. CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101/155317 doi: bioRxiv preprint Table 1. Summary of immunogenicity prediction results. Bolded was the prediction adopted as immunogenicity score. HLA, whether annotated HLA binding information or predicted binding by NetMHC 4.0 was used for machine learning. N_Feat, The number of CPP features (excluding HLA-related features) used. N_Pept, the number of peptides for machine learning. N_Epi, the number of epitopes in the dataset. ROC-AUC, the area under the receiver-operating curve.
PRC-AUC, the area under the precision-recall curve. Thr, the best threshold determined from ROC.

Immunogenicity score ratios of adjacent peptide pairs predict immune transitions 147
Single amino acid substitutions can drastically affect the immunogenicity of the 148 MHC-bound peptide. Delineating the landscape of "immune transitions," i.e., escaping 149 and epitope formation, is the critical step toward understanding the evolutionary 150 trajectories of microbes and cancer. To this end, we first characterized the 6,175 151 adjacent peptide pairs identified from our epitope metadataset, of which 940 were 152 transitional pairs based on their annotations. As expected, the ratios of their 153 immunogenicity scores predicted immune transitions, with AUC and the best threshold 154 of 74% and 1.26, respectively (Figure 4a, b). Substitutions at anchor positions 2 and 9 155 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101/155317 doi: bioRxiv preprint in nonamers significantly affected the immunogenicity of the mutated counterparts 156 (Figure 4c). Moreover, mutating residues K and Y, residues enriched in 157 non-immunogenic MHC binders, significantly augmented immunogenicity (Figure 2d  158 and Figure 4d). Next, we characterized transitional peptide pairs of the same viral 159 origins (Figure 4e, f). Ratios of immunogenicity scores of transitional peptide pairs 160 were significantly higher than those of non-transitional pairs in all of the five viruses 161 analyzed (Figure 4e). The combined analysis yielded AUC of 79% (Figure 4f). These 162 findings strongly suggested that the viral adaptation trajectory can be predicted from the 163 dynamic changes in immunogenicity scores.

Immune transitional landscapes delineated by epitope mapping in sequence space 165
Strong immunogenicity does not necessarily mean robust immunogenicity 166 because of potential immune evasion. The discrepancy of these two aspects was 167 exemplified by the neighbor network clustering analysis on an epitope RRYQKSTEL 168 and 132 neighbors (Figure 5a). This epitope, whose immunogenicity score was 0.994, 169 was chosen because it has the largest number of neighbors in our metadataset. Indeed, 170 there were 32 non-immunogenic neighbors, forming two clusters. The Cluster 3 171 consisted of P9 mutants, in which both annotations and predictions fluctuated. In 172 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101/155317 doi: bioRxiv preprint contrast, the Clusters 2, a cluster of P8 mutants, consisted of non-immunogenic peptides 173 with only one exception. These observations indicated that, while P8 mutation deprives 174 the epitope of immunogenicity, P9 mutation undermines the robustness of its 175 immunogenicity, possibly by affecting its binding to restricting HLAs. 176 To gain more comprehensive insights into the immunogenicity landscapes in 177 epitope sequence space, we next conducted in silico mutagenesis analysis; we 178 computationally generated all possible neighbors only single amino acid distant from 179 original viral peptides (Figure 4e, f). While mean scores were higher in epitopes, 180 coefficients of variance were higher in MHC binders (Figure 5b, c). When focusing on 181 the central cluster in neighbor networks, similar but more enhanced trends were 182 observed (Figure 5d, e). As for nonamers, P1-and P8-mutating clusters had higher 183 average scores, whereas P2-and P9-mutated clusters had higher score variances, 184 potentially because of disrupted interactions with restricting HLAs (Figure 5f, g). We 185 next classified peptides based on their annotated immunogenicity and existence of their 186 adjacent but transitional counterparts. The differences in immunogenicity scores, 187 per-cluster mean scores, and intracluster score variances between transitional and 188 non-transitional peptides were statistically significant (Figure 5h-j). Most notably, 189 non-immunogenic but next to immunogenic peptides exhibited higher scores and lower 190 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101/155317 doi: bioRxiv preprint intracluster score variances than robustly non-immunogenic ones. These observations 191 indicate that increased intracluster variance, as well as decreased immunogenicity score, 192 represents the hallmark of escaping mutations. Our findings may help to choose less 193 escape-prone epitopes for the development of robustly effective vaccines. 194 network. An epitope RRYQKSTEL was chosen as representative. Edges represent single amino acid substitution, weighted by inversed score ratios. Colors indicate predicted immunogenicity scores, and shapes indicate annotated immunogenicity (circle, immunogenic; square, non-immunogenic). A walktrap algorithm was adopted for clustering. Labels indicate consensus sequences for each cluster. Our analysis demonstrates highly accurate immunogenicity predictions in silico. 225 Our framework is novel in that it utilizes TCR fragment library as a peptide homology 226 evaluation function from the viewpoint of TCR repertoire, in contrast to previous studies 227 that define homology from sequence-level pairwise alignment scores 11,12 . TCR-peptide 228 contact potential profiling recapitulates the context-dependent effects of single amino 229 acid substitutions on immunogenicity. In contrast to high-throughput in vitro T cell 230 assays that only provide estimates of immunogenicity of the peptides tested, our 231 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101/155317 doi: bioRxiv preprint framework can extrapolate the high-dimensional rules of immunogenicity learned from 232 more than 20,000 peptide sequences to delineate the immunogenic landscapes in 233 sequence space. Finally, our framework is advantageous in that it does not depend on the 234 a priori knowledge of HLA restrictions, thus enabling in silico screening solely from 235 genomic data such as tumor mutation datasets. 236 Escaping is one of the most prominent issues in developing effective vaccines 34,35 , 237 and neoepitope formation is a promising target towards tumor-specific therapeutic 238 vaccines 36,37 . These phenomena are two sides of the same coin in sequence space. 239 However, the bidirectional effects of single amino acid substitutions in terms of 240 immunogenicity are likely highly context-dependent and remain barely understood. 241 Even a single mutation could affect various aspects of CTL immunity, including 242 proteolytic processing, transportation to endoplasmic reticulum, stability and structural 243 orientation in MHC presentation, and propensity for TCR recognition 1,9,38 . We found 244 that dynamic changes in immunogenicity scores are predictive of viral escaping and 245 response to checkpoint blockade therapy in cancer patients. Moreover, we characterized 246 the peculiarity of peptides adjacent to transitional boundaries defined from neighbor 247 network architectures in sequence space. To our knowledge, this work is the first 248 attempt to systematically explain the effects of single amino acid substitutions in the 249 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. MHCBN 20 , and TANTIGEN 21 ] and previous publications 8,9,19 . For the peptides from 274 IEDB database, only those with the evidence listed in Supplementary Table 2 were 275 considered valid. Peptides presented on non-human MHC molecules were discarded, 276 whereas those presented on HLA class I molecules expressed in non-human hosts (e.g., 277 transgenic mice) were included. Either peptide sequences whose lengths were other than 278 8, 9, 10, and 11, or those containing non-standard letters were discarded. In this manner, 279 a total of 21,146 unique peptides were identified. A total of 1,872 (8.9%) peptides that 280 had conflicting immunogenicity annotations were identified. In this study, peptides that 281 were annotated as immunogenic in at least one dataset were considered immunogenic. 282 Consequently, 6,942 (32.8%) unique epitopes and 14,204 (67.2%) MHC binders were 283 identified. 284 Other species. CTL epitopes and MHC binders loaded on species-specific MHC 285 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101/155317 doi: bioRxiv preprint molecules were retrieved from the IEDB database (as of Mar 30, 2018). Species studied 286 were mice and primates (bonobo, chimpanzee, gorilla, marmoset, and rhesus macaque). 287

Contact potential profiling (CPP) 288
Concept. Immunogenic peptides presented on the MHC class I molecules must be 289 recognized by the TCRs of CD8 + CTLs with strong affinity to trigger subsequent 290 immunological cascades 14 . The entire TCR repertoires do not necessarily recognize 291 those peptides; instead, a quite limited set of epitope-specific TCRs would be sufficient. 292 Moreover, given the flexible conformational changes upon binding between TCR and 293 peptide-MHC complex, it is plausible that only a restricted set of residue pairs between 294 the peptide and the CDR3 loop initially act as a "seed" of molecular scanning that may 295 endow the complex the time for conformational changes to further enhance the affinity 296 and trigger CTL activation 14,39,40 . In this view, the interacting surfaces could be 297 approximated as a pair of linear sequences, and by deconvoluting the TCR repertoire 298 into a set of fragments of a defined amino acid length, the process of molecular 299 scanning could be interpreted as a pairwise sequence alignment problem. One caution is 300 that, although an ordinary sequence alignment algorithm involves a similarity matrix 301 which gives higher values to biochemically similar residue pairs, higher scores must be 302 given to more strongly interacting residue pairs rather than more biochemically similar 303 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101/155317 doi: bioRxiv preprint 26 residue pairs in order to interpret the alignment as the reflection of intermolecular 304 interaction. For this purpose, amino acid pairwise contact potentials (AACPs) were 305 adopted from the AAIndex database 41 306 (http://www.genome.jp/aaindex/AAindex/list_of_potentials) to generate custom 307 substitution matrices. The Smith-Waterman local alignment algorithm maximizes the 308 alignment score of a given TCR-derived sequence fragment against the whole sequence 309 of a given peptide, and its alignment score can be interpreted as a correlate of the 310 intermolecular contact potential. Consequently, the interactions between a given peptide 311 and a given TCR repertoire can be profiled by summarizing the distribution of 312 alignment scores. 313 Fragment library. We hypothesized that human public TCR clonotypes are biased in 314 favor of "publicly" immunogenic peptides rather than non-immunogenic ones, and 315 therefore could be utilized as a probe for immunogenicity. We focused on the CDR3 316 loops because they are primarily responsible for the interactions with MHC-loaded 317 peptides, whereas MHC α-helices are typically contacted with more conserved CDR1 318 and CDR2 loops 1,42,43 . We restricted our analysis to human TCRβ (TRB) repertoire. Due 319 to the frequent lack of annotations of T cell sorting (e.g., CD8 + T cells), we included 320 TCR repertoires from all kinds of T cells. TCR repertoire datasets were collected from 321 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101/155317 doi: bioRxiv preprint NCBI Sequence Read Archive (SRA) and a previous study led by Britanova et al. 44 . 322 SRA was searched with the following terms '(T Cell Receptor β) AND human.' The 323 following BioProjects were included: PRJNA389805, PRJNA329041, PRJNA298417, 324 PRJNA273698, PRJNA258001, PRJNA229070, PRJNA79707, and PRJNA79435. 325 Projects in which healthy donors were not involved (e.g., cancer, autoimmune diseases) 326 were omitted. Datasets derived from HIV-infected subjects in the BioProject 327 PRJNA258001 were discarded. Meanwhile, PRJNA316572, involving the unique 328 molecular identifier strategy, were not included in order to maintain consistency in the 329 way of analysis. Fastq files were obtained using fastq-dump script with the following 330 options: --gzip --skip-technical --readids --read-filter pass --dumpbase --split-files --clip 331 --accession [SRA run number]. By using MiXCR software 22 , a total of 23,006,555 TRB 332 CDR3 clonotypes were identified. Subsequently, a total of 191,326 clonotypes observed 333 in at least 22 out of 220 (=10%) different datasets were retained as public clonotypes. 334 Finally, the pooled CDR3 sequences were fragmented by a sliding window strategy to 335 generate a fragment library. Sliding windows of amino acid length 3 to 8 were tested. 336 Moreover, since CDR3 loops could interact with peptides in either a forward parallel or 337 antiparallel fashion, reversed CDR3 sequences were simultaneously fragmented and 338 pooled. Resultant fragment pools were compressed to a defined library depth (e.g., 339 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101/155317 doi: bioRxiv preprint 10,000) by random down-sampling while preserving the original proportion of each 340

fragment. 341
Pairwise sequence alignment. The pairwiseAlignment function implemented in the 342 Biostrings package in Bioconductor 45 was utilized. This function seeks an optimal 343 alignment which maximizes the overall alignment score defined as a sum of pairwise 344 scores. Alignment type was set "global-local" to obtain an optimal alignment of a CDR3 345 fragment with consecutive subsequences of the peptide. Gaps were not allowed. A set of 346 substitution matrices were generated from a set of thirty-five AACP scales retrieved 347 from the AAIndex database 41 . The scales utilized in this study are summarized in 348 Supplementary Table 3. AACP matrices were rescaled so that numerical indicators 349 resided within the range between 0 and 1 for comparative purposes. For analytical 350 purposes, we also prepared inversed AACP matrices (rescaled as well) to mimic the 351 weakest interactions as well as the strongest interactions. A distribution of alignment 352 scores was summarized by calculating representative statistics. Following statistics were 353 utilized: mean, standard deviation, median, 10% trimmed mean, median absolute 354 deviation, skew, kurtosis, standard error, interquartile range, 10% quantile, and, 90% 355 quantile. Predictive features were generated by combining the fragment length, the 356 AACP scale, and the type of statistics. In this manner, a total of 4,620 CPP features 357 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101/155317 doi: bioRxiv preprint 29 were generated for each of the peptides. 358

Peptide descriptors 359
Apart from CPP features, sequence-based physicochemical features were also 360 calculated. Each peptide sequence was converted into a set of consecutive fragments of 361 a defined amino acid length, and peptide descriptors were calculated against each of the 362 fragments using functions in the Peptide package. Following functions were utilized: 363 aIndex, blosumIndices,boman,charge,crucianProperties,fasgaiVectors,hmoment,364 hydrophobicity,instaIndex,kideraFactors,mswhimScores,pI,protFP,vhseScales,and,365 zScales. The distributions of the values were summarized as described in the paragraph 366 of CPP. Moreover, a set of categorical features indicating whether the peptide of interest 367 is free from a specific amino acid residue (e.g., tyrosine) were included. Finally, the 368 length of the peptide was included as a feature as well. In this manner, combined with 369 CPP features, a total of 6,081 features were generated for each of the peptides. 370

HLA restriction 371
Annotation. Experimental annotations on binding to HLA class I molecules were 372 retrieved from IEDB and other databases. HLA genotypes were converted to supertypes 373 according to the rules from Sidney et al. 46 Serotypes were treated per se. In this work, 374 supertypes and serotypes were not strictly distinguished and treated as "HLA restriction" 375 . CC-BY 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 2, 2018. . https://doi.org/10.1101/155317 doi: bioRxiv preprint