A Highly Effective System for Predicting MHC-II Epitopes With Immunogenicity

In the past decade, the substantial achievements of therapeutic cancer vaccines have shed a new light on cancer immunotherapy. The major challenge for designing potent therapeutic cancer vaccines is to identify neoantigens capable of inducing sufficient immune responses, especially involving major histocompatibility complex (MHC)-II epitopes. However, most previous studies on T-cell epitopes were focused on either ligand binding or antigen presentation by MHC rather than the immunogenicity of T-cell epitopes. In order to better facilitate a therapeutic vaccine design, in this study, we propose a revolutionary new tool: a convolutional neural network model named FIONA (Flexible Immunogenicity Optimization Neural-network Architecture) trained on IEDB datasets. FIONA could accurately predict the epitopes presented by the given specific MHC-II subtypes, as well as their immunogenicity. By leveraging the human leukocyte antigen allele hierarchical encoding model together with peptide dense embedding fusion encoding, FIONA (with AUC = 0.94) outperforms several other tools in predicting epitopes presented by MHC-II subtypes in head-to-head comparison; moreover, FIONA has unprecedentedly incorporated the capacity to predict the immunogenicity of epitopes with MHC-II subtype specificity. Therefore, we developed a reliable pipeline to effectively predict CD4+ T-cell immune responses against cancer and infectious diseases.

In the past decade, therapeutic cancer vaccines have achieved excellent clinical study results (17)(18)(19)(20)(21)(22)(23)(24). For instance, in patients with anti-PD1-refractory/relapsed unresectable Stage III or IV melanoma, BioNTech's therapeutic cancer vaccine candidate BNT111 combined with cemiplimab elicited durable objective responses (23), which received Food and Drug Administration (FDA) Fast Track Designation in 2021. In another study, treatment with dendritic cell vaccine primed with WT1 mRNA could prevent or delay relapse in 43% of patients with AML in remission after chemotherapy in a Phase II trial (25). These two clinical studies utilized therapeutic vaccines based on TAA.
Therapeutic vaccines based on personalized neoantigens also made remarkable progress. In a clinical trial on melanoma patients conducted by Otto et al., a synthetic long peptide vaccine consisting of multiple epitopes established tumorspecific T-cell responses and demonstrated effectiveness over five years (20,22,26). In end-stage colorectal cancer (CRC) patients (3rd line or more advanced), Gritstone's GRANITE personalized immunotherapy showed a 44% molecular response rate (4/9) by circulating tumor DNA analysis that can be considered as a surrogate endpoint [NCT03639714].
Theoretically, neoantigens are superior to TAA as targets for therapeutic cancer vaccine: although TAAs have relatively higher expression levels in tumor cells, they may still be present in particular types of normal cells at low levels; Her2 and survivin would be good examples (27)(28)(29)(30). In contrast, neoantigens originate from mutations and aberrant translations of tumor RNA transcriptome. Consequently, they are "absolutely" specific to tumor cells as normal cells do not have such mutations and aberrant translations. Such "absolute" specificity means that Tcell responses against neoantigens are unlikely to elicit an offtarget effect on normal cells. Thus, the safety concern of neoantigen-based personalized vaccines would be minimal.
Despite its theoretical superiority on safety, neoantigen-based personalized vaccines still need to face a technical bottleneck: how to identify T-cell epitopes with sufficient immunogenicity from neoantigens for vaccine design. Especially, MHC-II epitopes are believed to be more necessary than MHC-I epitopes for preventing the immune escape of tumor cells (31,32,34). The insufficient capability of predicting MHC-II epitopes has obviously limited the development of neoantigen-based personalized vaccines, resulting in scarcely reported clinical studies involving MHC-II epitopes. Therefore, our research aims to break this bottleneck and provide a powerful tool for developing a neoantigen-based personalized vaccine.
Generally speaking, the conversion of aberrant peptides generated by genomic variations in tumor cells into epitopes eliciting in vivo T-cell immune responses is a complex process involving multiple hierarchical levels. Therefore, the prediction and identification of T-cell epitopes should preferably involve multiple levels to reflect complex biological processes. Such a "funnel-like" procedure (35-37) that would eliminate most Tcell epitope candidates would necessarily involve several major steps: 1. Mutation identification 2. Peptide-MHC binding prediction 3. Peptide-MHC presentation prediction 4. Peptide-MHC immunogenicity prediction Plenty of previous work has been accomplished by various research groups in the relevant field and thereafter generated several well-known software implements: • The latest version of NetMHCIIpan uses binding and elution datasets deconvoluted by NNalign_MA (38) to predict peptide ligands that can be presented by MHC-I and MHC-II on the cell surface (39,40). • MHCflurry improves the pan-allele prediction of MHC-Ipresented peptide ligands by incorporating antigen processing and MHC ligandome elution (41). • ForestMHC applied the deconvolution of polyallelic datasets trained by MixMHCpred based on position weight matrices (PWMs) and MHC-I-presented peptide ligands (42). • MARIA adopts a multimodal recurrent neural network that summarizes in vitro binding measurements, mRNA abundance, and protease cleavage signatures to predict MHC-II-presented peptide ligands (43).
However, the well-known tools listed above never touched the 4th step of the funnel: immunogenicity. Considering the negative selection of T cells during thymus development (44,45), the vast majority of self-derived peptides will not trigger a downstream immune response even if presented by APC such as DC (46,47), and such peptides account for 90% of all presented peptides. Obviously, the current antigen presentation prediction tools are NOT the ultimate solutions for the design of neoantigen-based personalized vaccines because even the peptide ligands presented by MHC-I or MHC-II may not be immunogenic at all.
Recently, several emerging studies have taken MHC-I immunogenicity prediction into consideration. For example, deepHLApan incorporated both peptide-MHC complex binding affinity and immunogenicity to predict the T-cell epitope (48). DeepNetBim extracted the attributes of the network as new features from peptide-MHC binding and immunogenic models as a pan-specific MHC-I epitope prediction tool (49).
Nevertheless, there remains an unfilled gap in identifying MHC-II epitopes with sufficient immunogenicity, as neoantigen-driven B-cell and CD4+ T-helper cell collaboration promotes anti-tumor CD8 T-cell responses (50). In this work, we developed an overarching framework to predict MHC-II epitopes: our convolutional neural network (CNN) model predicts the probability of a peptide to be presented to the cell surface by a designated MHC subtype, as well as its immunogenicity to activate immune T cells. The overall research consists of the following parts: (1) The datasets of peptide presentation and immunogenicity are obtained from an open database (IEDB) (51) and then processed with rigorous organization and cleaning. (2) We constructed a semiotic-based human leukocyte antigen (HLA)-encoding method with three levels to associate the information of the HLA allele nomenclature, which better represents the characteristics of different MHC subtypes that are not entirely independent or discrete.
(3) The encoded MHC subtypes and peptides are integrated into the deep learning model based on a specially designed CNN. (4) Independent validation datasets are used to evaluate the model's prediction performance.

Eluted Ligandome and Immunogenicity Data
The Eluted Ligandome date corresponding to various MHC-II subtypes is downloaded from the IEDB database; T-cell assay data reflecting the immunogenicity of peptides are extracted from the IEDB database. Python scripts are used to resolve raw XML data filtered with the following criteria: (1) MHC-II alleles include HLA-DP, DQ, and DR b chains, whereas a chains are reasonably omitted as they contribute little to ligand specificity. (2) Only MHC-II subtypes with explicit 2 fields in the HLA nomenclature such as HLA-DPB1*01:03 are retained. (3) The peptide length is in the range of 9~25 amino acids, representing 98% of total peptides (4) Peptide-MHC pairs with controversial assay results are excluded. (5) MHC-II subtypes with fewer than 10 corresponding peptides are excluded as the data size is too small to train our model, which leads to 65 available MHC-II subtypes. (6) T-cell assay data are based on wet-lab assays rather than predictions in original dataset's column named Assay Type.

Negative Elution Training Data Generation
We generate the negative datasets corresponding to elution data treated as positive data from the global maximum dissimilarity scoring matrix based on sequence dissimilarity with an additional NetMHCIIpan binding filter: (1) Full protein length F is extracted according to its accession ID (GenBank ID) given an eluted sequence P. (2) We use a window with the same length of P to slide on the full-length sequence F to get a list of candidates from which 10 negative sequences with the lowest sequence similarity compared to the entire positive dataset and the lowest possibility to be eluted sequences calculated by NetMHCIIpan 4.0 as a filter.
In total, we obtained 273,102 non-redundant eluted ligands (as positive data) and corresponding to 61 MHC-II subtypes (Supplementary Table 1 Table 2).

MHC-II Subtype Encoding Based on Hierarchical Relationship
Antigen presentation and immunogenicity are both closely associated with MHC-II subtypes because peptide ligands are finally presented on the cell surface by MHC-II to T-cell receptors. In order to develop useful tools to predict MHC-II epitopes, we need to "teach" computer programs how to distinguish various MHC-II subtypes. Therefore, setting a reasonable coding method for MHC-II subtypes is an inevitable question. In quite a number of earlier studies, MHC-II subtypes are converted into orthogonal vectors using one-hot encoding. Although a one-hot coding approach is feasible and straightforward, it apparently does not fully reflect biological mechanisms. One-hot coding treats each MHC-II subtype as a unique dimension: for example, in the perspective of one-hot coding, HLA-DRB1*01:01 and HLA-DRB1*01:09 are assumed to have no relation at all, neither are their corresponding ligandomes. However, such an assumption conflicts with realworld biological mechanisms: the evolution of various MHC subtypes can be reflected in phylogenetic trees, and some MHC-II supertypes consisting of multiple subtypes have been characterized by a partially shared ligandome in previous studies.
As an imperfect approach, one-hot coding for MHC-II subtypes may waste lots of training data as it does not recognize the overlapping ligands of closely related MHC-II subtypes. Moreover, one-hot coding would cause the MHC-II subtypes without abundant training data (e.g., fewer than 10 corresponding peptides) to be neglected, as the segregated data amount may not be sufficient for training the model. In order to develop more powerful tools for predicting MHC-II epitopes, we propose a novel coding system that could quantitatively reflect the relation among various MHC-II subtypes. Our goal is to use training data in a more scientific way with maximal utilization and also enable epitope prediction for the MHC-II subtypes without many available data.
The nomenclature rationale of each HLA allele is like a leaf node based on a tree, which enriches the hierarchical information and truly reflects the categories and associations of different HLA alleles. We creatively propose a new HLA coding method named hierarchical relationship-based HLA encoding, as shown in Figure 1. In this model, we regard the HLA gene (HLA-DRB1, DPB1 and DQB1) as layer 0, the first field number (e.g., '01' of DRB1*01) as layer 1, and the second field number (e.g., '02' of DRB1*01:02) as layer 2. We encode each single layer according to an [99×128] embedding table to get an E ∈ R 1×128 vector that represents each layer so that on the single layer, the same symbols have the same biological means while different symbols are discrete and orthogonal to each other mathematically. Afterwards, a transition matrix is adopted to transform the concatenated three-layer encoding matrix [3×128] into a one-dimensional vector [1×128] for later model training.
. Convolution of feature extraction, W i is the transfer matrix representing the weight of each layer. E i is the information coding of each layer; e is the integrated HLA embedding value.

Normalization of HLA Embedding Value
The obtained HLA embedding value needs to be normalized before feeding to the deep learning model. Batch normalization (BN), a commonly used method, is used to normalize the whole batch of the dataset to a standard Gaussian distribution (52) so that differences in distinct data distribution from different samples can be normalized according to Equation 3: . Batch normalization. v and s 2 are the perdimension mean and variance, respectively. Arbitrarily, the constant u is added in the denominator for numerical stability.
On the contrary, layer normalization (LN) normalizes all features of each sample in the sample scale (53) according to Equation4: . Layer normalization v and s 2 are the perdimension mean and variance, respectively. Arbitrarily, constant u is added in the denominator for numerical stability for each single layer l.
Both batch normalization and layer normalization could be used to avoid gradient disappearance or gradient explosion caused by excessive fluctuation of the input value, so as to simplify subsequent model training. However, they still have substantial differences: batch normalization depends more on the statistical parameters between different samples; thus, feature extraction and normalization calculation within a single sample are insufficient, whereas layer normalization eliminates the characteristic relationship between different samples in a batch and only normalizes different eigenvalues in the same sample. Because of the reasons described above, both methods are not very suitable for current HLA embedding normalization. Because we need to consider not only the characteristics of the same layer but also the impact of differences at different layers, we developed a new method of HLA normalization (HLAN): . HLA normalization equation µ is the mean value based on different levels, s is the level variance, x is the input value C is the layer according to the HLA-named system, and H is the length of the input value.
After normalization, the features are integrated through a convolution layer, and the final output results that are used as the input of the subsequent deep learning model are as follows: . Convolution layer to integrate an HLAN result.

HLA-Encoding Fusion Layer
We tested two different coding fusion layer schemas to fuse hierarchical representations from different layers representing an MHC-II subtype nomenclature in Figures 2A, B Figure 2C; shared parameters are calculated as the same embedding table called HLA_Norm.
For each input peptide sequence, we completed its length to 25 with the letter "J" and then converted each letter into a [1×128] vector according to its position in vocab to tokenize the whole sequence and finally get a [25×128] matrix presenting the input peptide.

HLA Subtype and Peptide Sequence Fusion Encoding
The MHC-II subtype and peptide sequence are paired, concatenated (Figure 3), and sent to our model for further training and testing.
To characterize an HLA-peptide sequence after pairing, we need a unified model that can extract the paired information. Compared with the recurrent neural network, a full CNN can (C) shows that the sequence information is integrated by concatenation after the weight is processed by using the shared index embedding table.
FIGURE 3 | Schema of HLA-peptide fusion encoding. The orange HLA_head is the result generated in Figure 2C. HLA_head is loaded on top of peptide sequence in concatenation mode to form a new constituent sequence containing HLA_head and peptide.
better model the information of adjacent positions. A onedimensional CNN can be expressed as follows: . f is the convolution kernel, * is the convolution operator, and X is the input value. f c is a one-dimensional convolution kernel of c dimension, and x s is the input value decomposed according to its own dimension.

10-Fold Cross-Validation
Ten-fold cross-validation is applied to evaluate model robustness. Before training, the dataset is randomly partitioned into 10 non-overlapping subsets. The cross-validation process is repeated 10 times, with each subset used as a validation set while the remaining subsets are utilized as the training set. The results of the cross-validation sets are averaged to obtain the final result. One hundred epochs are executed, and the model is saved if the validation accuracy is better than previous epochs.

The Architecture of FIONA
We used the matrix p(x) obtained by matrix transformation in Equation 6 that converts a one-dimensional vector sequence of the MHC-II subtype and peptide into a [26×128] matrix as input for the model to predict whether a peptide will be presented to the cell surface (FIONA-P) or trigger immunogenicity (FIONA-I) given a specific MHC subtype. In order to implement the above 2 predictive functions, we constructed two models with different training datasets (presentation and immunogenicity) explained in Section 2.1 and Section 2.2 with the same architecture shown in Figure 4. FIONA includes a CNN layer for prediction, which focuses on integrating and extracting overall features from MHC-II subtype-peptide pairs. In this process, HLA embedding and peptide embedding are integrated to play a synergistic role in improving the prediction performance. Additionally, in order to improve the prediction ability of our model, we added multiple pooling layers in the convolution layer to extract and integrate features.
Our model accepts fused HLA_Peptide embedding as input. Referring to Resnet's design pattern shown in the left of Figure 4, we created several aggregation modules in the form of blocks for stacked layers, connected layers, and convolution layers successively named BlockConv. The final convolution layer aggregates the internal characteristics of each embedding into a vector.

Ablation Experiment
We conducted ablation experiments to validate our HLAencoding schema and its impact on the overall results by eliminating the HLA_Norm layer or replacing the normalization layer with batch normalization and layer normalization individually. We divided the comparison into two parts: the first part is based on HLA_ Embedding using different encoding and normalization methods, and the second FIGURE 4 | Architecture of FIONA. The dataset is downloaded from the IEDB database according to Section 2.1. The first dotted box contains an HLA_head and peptide, which are encoded separately and then integrated for feature transformation. The middle part contains an HLA_head, and peptide_Embedding represents sequence feature information. The right dotted box is our CNN model for training and prediction. part partially modifies the architecture of our model to find out the impacts of these modifications on the performance of our model.
As shown in Table 1, the ablation test shows that our MHC subtype hierarchical relationship-encoding method greatly outperforms the traditional one-hot method regardless of subsequent normalization methods on both presentation data and immunogenicity data. In addition, the HLA_Norm method has the best performance on both presentation and immunogenicity datasets compared to the Batch Norm and Layer Norm. Meanwhile, the final architecture consisting of con_ANA and BlockConv has the best performance among all tests including eliminating the HLA coding content, which leads to a dramatic decrease of ROC and PR values.

FIONA-P Favors Balanced Positive and Negative MHC-II Peptide Presentation Data
In a natural environment, the proportion of presented antigens compared to non-presented peptides degraded by protease is relatively low; therefore, the unbalanced data amount should theoretically and more faithfully reflect the actual situation. However, the unbalanced data amount of positive and negative samples is a great challenge to the construction and optimization of the deep learning model. Here, we selected a specific number of samples from multiple negative samples generated by the method mentioned in Section 2.2 to build 2 datasets with relatively balanced and unbalanced positive and negative ratios (positive data to negative data = 1:1 and 1:5, respectively) to compare the influence to our model FIONA-P. As shown in Figure 5, FIONA-P has a better performance for balanced datasets, especially in terms of the performance of PR, which has a pronounced degradation if unbalanced data are used.
Similarly, we also compared the performance of natively uneven immunogenicity data from the IEDB and artificial synthetic datasets after randomly reducing a portion of the positive data (positive data to negative data = 1.62:1 and 1:1, respectively) to test the performance of the FIONA-I model under such circumstances. The test results show only minor changes in terms of ROC and PR.

FIONA-P Achieves Comparable Performance
The IEDB benchmark dataset is often used to compare the performance of different binding prediction tools. However, these datasets are usually intracellular binding data rather than elution data. To test the ability of the presentation prediction of several existing MHC-II epitope tools [Maria, NetMHCIIpan4.0, BERTMHC (54), and MixMHC2pred (55)], we used an independent dataset from the University of Tübingen (56) that contains 142,625 naturally eluted ligands from 29 tissues across 42 MHC-II subtypes (33 MHC-II subtypes in total after omitting the a chains of MHC, Supplementary Table 3). The independent dataset is deduplicated by sequence and the corresponding MHC-II subtypes compared with the training dataset. All the supported MHC-II subtypes that overlap the MHC-II subtypes of the independent dataset are tested. For all tools, our FIONA-P model achieved the best performance for 25 out of the 33 MHC-II subtypes, especially in subtypes with higher corresponding eluted peptides as shown in Figure 6. Our model has shown a bit of advancement compared with MixMHC2 and great improvement compared with other tools. However, MixMHC2 only supports 38 MHC-II subtypes; thus, 3 of unsupported MHC-II subtypes have no available results in this comparison. Our model not only supports 65 MHC-II subtypes by direct training but is also able to predict the peptide presentation of corresponding untrained MHC-II subtypes by our new breakthrough HLA hierarchical encoding method. Since the number of supported MHC-II subtypes is also very important in epitope prediction, our model has greatly broadened the scope of available MHC-II subtypes.

FIONA-I Improves Positive Prediction Value of True Neoantigen Through Validation of Curated Neoantigen Dataset
As previously discussed, only a small proportion of peptides presented by APC can trigger the downstream immunogenicity of T cells, resulting in a fairly low false-positive rate (FPR), which is presumably one of the main reasons that cancer vaccines do not have enough clinical benefits since these vaccines cannot load sufficient epitopes to inhibit the immune escape of cancer cells given such high FPR.   We used a fully manually annotated neoantigen database, NEPdb (57), newly published in 2021 to demonstrate that our FIONA-I model substantially improves the positive predictive value (PPV) of neoantigen prediction compared to the antigen presentation model. All MHC-II neoantigen data entries containing DP, DQ, and DR alleles were retrieved from the NEPdb, which contains 182 positive and 3,508 negative epitopes across 31 different MHC-II subtypes (Supplementary Table 4). FIONA-I, FIONA-P, and other MHC-II epitope tools (Maria, NetMHCIIpan4.0, BERTMHC, and MixMHC2pred) are used to calculate the PPV with a default parameter setting. Maria/ BERTMHC directly returns '0' for negative and '1' for positive, NetMHCIIpan4.0 and MixMHC2pred take top 10% peptides as positive; all the MHC-II subtypes that are not supported by these tools are neglected. As shown in Table 2, FIONA-I raises the PPV from 22.51% (mean PPV of FIONA-P, Maria, NetMHCIIpan4.0, BERTMHC, and MixMHC2pred) to 40.27%, obtaining a near doubling of the increasement. The results showed that FIONA-I could improve the PPV significantly and retain the sensitivity at 0.89, indicating that the immunogenicity model could greatly contribute to highconfidence neoantigen identification.

Web Service
We developed a user-friendly web interface (http://therarna.cn/ fiona.html), allowing visitors to quickly query whether peptides would be presented or able to trigger an immune reaction given the specific MHC-II subtypes.

DISCUSSION
Since 2018, a couple of useful tools for antigen presentation prediction have been reported. For example, Gritstone has published its proprietary software for predicting the MHC-I epitopes presented on the cell surface; similarly, MARIA is capable of predicting the MHC-II epitopes presented on the cell surface. Both tools use the same underlying hypothesis: antigen processing by proteases, antigen abundance, and peptide-MHC interaction are 3 important factors that participate in antigen presentation. Therefore, both tools involved these 3 factors into the algorithm training and by far outperformed early versions of NetMHCIIpan and MHCFlurry.
The 3-factor theory of antigen presentation actually makes good scientific sense: short peptides need to be cleaved from long peptides by proteases to enable their binding to MHC-I or MHC-II; antigen abundance measured by the mRNA level determines the amount of MHC ligands displayed on the cell surface to be recognized by a T-cell receptor, whereas peptide-MHC interaction would tell which ligands are more favorable to be displayed by MHC. However, there might be better ways to integrate these 3 factors into antigen presentation prediction. For instance, in the algorithm structure of MARIA, antigen presentation is simplified to a cleavage score, peptide-MHC interaction is simplified to an HLA-DR binding score, and antigen abundance is standardized as mRNA TPM (transcript per million); afterwards, the 3 types of data from different dimensions were put into the algorithm training. We are not saying "that approach is not right," but we seriously want to discuss what a better model should be. Biologically, long peptide cleavage by proteases occurs before short peptides interact with MHC. Therefore, it is more reasonable to develop a tool to enumerate all short peptides generated from a long peptide by protease cleavage, and the pool of short peptides would be the input of the next-step antigen presentation prediction. Moreover, in the antigen presentation process, antigen abundance would no longer be a limiting factor once it exceeds a reasonable level, which has been proven in the neoantigen meta-analysis of TESLA (58). Therefore, we may use TPM>35 proposed by TESLA as a cut-off point of antigen abundance. In other words, the antigens whose expression levels are above the cutoff point should be regarded as "abundant" to be presented. Furthermore, in a natural infection caused by an exogenous virus or bacteria, all pathogen-related antigens should be regarded as "abundant," even though their expression levels could hardly be standardized as TPM.
Based on the mechanistic analysis above, antigen processing had better been analyzed with a separate upstream tool, whereas antigen abundance could be reasonably simplified to a criterion of TPM >35. Therefore, our antigen presentation prediction tool focuses more on peptide-MHC interaction. We would not recommend MARIA's approach of oversimplifying the peptide-MHC interaction to a binding score because the amino sequences of peptide ligands as well as MHC complex may reveal important information relevant to the antigen presentation process. For example, previous studies confirmed that peptide-MHC binding affinity reflected as IC50 (nM) does not accurately reflect the stability of the peptide-MHC complex. Thus, the sequences of peptide ligands would provide information in more than one dimension. Taking all the foresaid into account, our antigen presentation prediction tool involves the sequences of peptide ligands and MHC into deep learning and therefore avoids the issue of oversimplification. This could be a possible explanation that our model outperforms the well-known tools. Compared to antigen presentation, predicting the immunogenicity of MHC ligands is more challenging due to the lack of powerful theories. As previously discussed, there is a scientifically sound 3-factor theory that explains the mechanism of antigen presentation, and this theory effectively guided the development of multiple prediction tools. In contrast, the root cause of immunogenicity is more difficult to interpret.
Immunogenicity is shaped by a T-cell-negative selection; thus, the real challenge of immunogenicity prediction is the limited understanding of the mechanism of a T-cell-negative selection. A T-cell-negative selection process in thymus removes T cells reactive to self-antigens from the T-cell repertoire and therefore provides protection against unwanted T-cell responses. A T-cellnegative selection determines which MHC ligands will NOT elicit an immune response, whereas other MHC ligands may still encounter the corresponding TCR in the T-cell repertoire.
So far, the T-cell-negative selection process is still a "blackbox," and there is no powerful theory that clearly interprets its delicate mechanism. Especially, no theory could define what factors constitute the "sufficient condition" to trigger a T-cellnegative selection. At least, self-antigen alone does not constitute the "sufficient condition." A T-cell-negative selection does NOT remove all T cells that recognize the MHC ligands derived from self-antigens, and such complexity is endorsed by 2 facts in clinical studies: 1. Self-reactive T cells are present in patients with autoimmune diseases (59). 2. A peptide vaccine could elicit T-cell responses against TAA in cancer patients (2).
The lack of a robust theory to interpret a T-cell-negative selection makes it challenging to predict immunogenicity. All software tools for predicting immunogenicity, including ours, are based on an empirical approach: the tools are trained with T-cell assay data that distinguish immunogenic peptides from nonimmunogenic ones, matched with MHC subtypes. Of course, even an empirical approach could solve many problems. For example, our trained software could achieve PPV at 40.27% on an independent dataset. Nevertheless, the limitation of the empirical approach should not be forgotten: such methodology requires tremendous T-cell assay data to train a functional model. For those MHC subtypes that do not have many corresponding T-cell assay results, the empirical approach cannot be used. Based on our discussion above, a more accurate immunogenicity prediction tool would rely on the emergence of a more robust theory that interprets the mechanism of the T-cell-negative selection mechanism. By then, it might be possible to deduce the immunogenicity of peptide ligands based on the host's MHC genotype and proteome information.
Our study proposed a systematic workflow that could identify MHC-II restricted epitopes that can be presented on the cell surface and elicit immune responses. This tool could be of great usefulness for identifying potential epitopes from cancer neoantigens and paving the way of designing effective cancer therapeutic vaccines.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.