Ranking-Based Convolutional Neural Network Models for Peptide-MHC Class I Binding Prediction

T-cell receptors can recognize foreign peptides bound to major histocompatibility complex (MHC) class-I proteins, and thus trigger the adaptive immune response. Therefore, identifying peptides that can bind to MHC class-I molecules plays a vital role in the design of peptide vaccines. Many computational methods, for example, the state-of-the-art allele-specific method MHCflurry, have been developed to predict the binding affinities between peptides and MHC molecules. In this manuscript, we develop two allele-specific Convolutional Neural Network-based methods named ConvM and SpConvM to tackle the binding prediction problem. Specifically, we formulate the problem as to optimize the rankings of peptide-MHC bindings via ranking-based learning objectives. Such optimization is more robust and tolerant to the measurement inaccuracy of binding affinities, and therefore enables more accurate prioritization of binding peptides. In addition, we develop a new position encoding method in ConvM and SpConvM to better identify the most important amino acids for the binding events. We conduct a comprehensive set of experiments using the latest Immune Epitope Database (IEDB) datasets. Our experimental results demonstrate that our models significantly outperform the state-of-the-art methods including MHCflurry with an average percentage improvement of 6.70% on AUC and 17.10% on ROC5 across 128 alleles.


INTRODUCTION
Immunotherapy, an important treatment of cancers, treats the disease by boosting patients' immune systems to kill cancer cells Waldman et al. (2020); Esfahani et al. (2020); Couzin-Frankel (2013); Mellman et al. (2011).To trigger patients' adaptive immune responses, Cytotoxic T cells, also known as CD8+ T-cells, have to recognize peptides presented on the cancer cell surface Blum et al. (2013); Valitutti et al. (1995).These peptides are fragments derived from self-proteins or pathogens by proteasomal proteolysis within the cell.To have the peptides presented on the cell surface to be recognized by CB8 receptors, they need to be brought from inside the cells to the cell surface, typically through binding with and transported by major histocompatibility complex (MHC) class-I molecules.To mimic natural occurring proteins from pathogens, synthetic peptide vaccines are developed for therapeutic purposes Purcell et al. (2007).Therefore, to design successful peptide vaccines, it is critical to identify and study peptides that can bind with MHC class-I molecules.
Chen et al.

RCNN4PMHC
Many computational methods have been developed to predict the binding affinities between peptides and MHC class-I molecules O'Donnell et al. (2018); Han and Kim (2017).These existing computational methods can be categorized into two types: allele-specific methods and pan methods.Allele-specific methods train one model for one allele such that the model can capture binding patterns specific to the allele, and thus it is better customized to that allele Lundegaard et al. (2008); O'Donnell et al. (2018).Pan methods train one model for all the alleles at a same time, and thus the information across different alleles can be shared and integrated into a general model Jurtz et al. (2017); Hu et al. (2018).These existing methods can achieve significant performance on the prediction of binding affinities.However, most existing methods formulate the prediction problem as to predict the exact binding affinity values (e.g., IC 50 values) via regression.Such formulations may suffer from two potential issues.First of all, they tend to be sensitive to the measurement errors when the measured IC 50 values are not accurate.In addition, many of these methods use ranking-based measurement such as Kendall's Tau correlations to measure the performance of regression-based methods Bhattacharya et al. (2017); O'Donnell et al. (2020).This could lead to sub-optimal solution as small regression errors do not necessarily correlate to large Kendall's Tau.Therefore, these methods are limited in their capability of prioritizing the most possible peptide-MHC pairs of high binding affinities.
In this study, we formulate the problem as to prioritize the most possible peptide-MHC binding pairs via ranking based learning.We propose three ranking-based learning objectives such that through optimizing these objectives, we impose peptide-MHC pairs of high binding affinities ranked higher than those of low binding affinities.Coupled with these objectives, we develop two allele-specific Convolutional Neural Network (CNN)-based methods with attention mechanism, denoted as ConvM and SpConvM.ConvM extracts local features of peptide sequences using 1D convolutional layers, and learns the importance of different positions in peptides using self-attention mechanism.In addition to the local features used in ConvM, SpConvM represents the peptide sequences at different granularity levels by leveraging both global and local features of peptide sequences.We also develop a new position encoding method together with self-attention mechanism so as to differentiate amino acids at different positions.We compare the various combinations of model architectures and objective functions of our methods with the state-of-the-art baseline MHCflurry O' Donnell et al. (2018) on IEDB datasets Vita et al. (2018).Our experimental results demonstrate that our models significantly outperform the state-of-the-art methods with an average percentage improvement of 6.70% on AUC and 17.10% on ROC5 across 128 alleles.
We summarize our contributions below: • We formulate the problem as to optimize the rankings of peptide-MHC pairs instead of predicting the exact binding affinity values.Our experimental results demonstrate that our ranking-based learning is able to significantly improve the performance of identifying the most possible peptide-MHC binding pairs.
• We develop two allele-specific methods ConvM and SpConvM with position encoding and self attention, which enable a better learning of the importance of amino acids at different positions in determining peptide-MHC binding.
• We incorporate both global and local features in SpConvM to better capture and learn from different granularities of peptide sequence information.
Chen et al.

LITERATURE REVIEW
The existing computational methods for peptide-MHC binding prediction can be generally classified into two categories: linear regression-based methods and deep learning (DL)-based methods.Below, we present a literature review for each of the categories, including the key ideas and the representative work.

Peptide Binding Prediction via Linear Regression
Many early developed methods on peptide-MHC binding prediction are based on linear regression.For example, Peters et al.Peters and Sette (2005) proposed a method named Stabilized Matrix Method (SMM), which applied linear regression to predict the binding affinities from one-hot encoded vector representation of peptide sequences.Kim et al. Kim et al. (2009) derived a novel amino acid similarity matrix named Peptide:MHC Binding Energy Covariance (PMBEC) matrix and incorporated it into the SMM approach to improve the performance of SMM.In PMBEC, each amino acid is represented by its covariance of relative binding energy contributions with all other amino acids.Some recent work Bonsack et al. (2019); Zhao and Sher (2018) demonstrates these linear regression-based methods are inferior to DL-based methods, and therefore, in our work, we focus on DL-based methods.

Peptide Binding Prediction via Deep Learning
The DL-based models can be categorized into allele-specific methods and pan methods.Allele-specific methods train a model for each allele and learn the binding patterns of each allele separately.Instead, pan methods train a model for all alleles to learn all the binding patterns together within one model.Both the methods use similar encoding methods such Onehot encoding, BLOSUM encoding and Word2Vec Goldberg and Levy (2014).

Allele-Specific Deep Learning Methods
Among these allele-specific methods, Lundegaard et al.Lundegaard et al. (2008) proposed NetMHC3.0 that takes the embeddings of peptide sequences as input, and they applied neural networks with one hidden layer to predict peptide-MHC binding for peptides of fixed length.In NetMHC3.0, the hidden layer is a fullyconnected (FC) layer, and learns the global features of peptide sequences such as the position and types of specific amino acids.Andreatta et al. Andreatta and Nielsen (2015) extended NetMHC3.0 to NetMHC4.0 by padding so that the model can handle peptides of variable length.Kuksa et al. Kuksa et al. (2015) developed two nonlinear high-order methods including high-order neural networks (HONN) pre-trained with high-order semi-restricted Boltzmann machine (RBM), and high-order kernel support vector machines (hkSVM).Both the high-order RBMs and the high-order kernel are designed to capture the direct strong high-order interactions between features.Bhattacharya et al. Bhattacharya et al. (2017)

Pan Deep Learning Methods
Nielsen et al. Nielsen and Andreatta (2016) developed a DL-based pan method named NetMHCpan3.0.This method takes the embedding of pseudo MHC sequences and peptide sequences as input, and then applies an ensemble of neural networks to predict the binding affinities of peptide-MHC pairs.Jurtz et al.Jurtz et al. (2017) extended NetMHCpan3.0 to NetMHCpan4.0 by training the model on both binding affinity data and eluted ligand data.Their model shares a hidden layer among two kinds of data and applies Chen et al.

RCNN4PMHC
two different output layers to predict binding affinities and eluted ligands, respectively, for peptide-MHC pairs.Phloyphisut et al.Phloyphisut et al. (2019) developed a deep learning model, which uses GRUs to learn the embeddings of peptides, and a FC layer to learn the embeddings of alleles.The two types of embeddings are then concatenated to predict peptide-MHC binding probabilities.Han et al. Han and Kim (2017) encoded peptide-MHC pairs into image-like array (ILA) data and applied deep 2D convolutional neural networks to extract the possible peptide-MHC interactions from the ILA data.Hu et al. Hu et al. (2018) combined a deep convolutional neural network with an attention module.They applied multiple convolutional layers to extract features of different levels.The extracted features are integrated with the features learned from attention mechanism and fed into the output layer to predict binding affinities of peptide-MHC pairs.O' Donnel et al. O'Donnell et al. (2020) developed a pan-allele binding affinity predictor MHCflurry2.0 BP and an allele-independent antigen presentation predictor MHCflurry2.0 AP to calculate the presentation scores of peptide-MHC pairs.Their binding affinity predictor includes upstream and downstream residues of peptides from their source proteins to improve the performance of models.Note that MHCflurry2.0 is a pan method and requires source proteins of peptides.Therefore, we do not compare our methods with MHCflurry2.0.

Peptide-MHC Binding Data
The dataset is collected from the Immune Epitope Database (IEDB) Vita et al. (2018).Each peptide-MHC entry m in the dataset measures the binding affinity between a peptide and an allele.These binding affinity entries could be of either quantitative values (e.g., IC 50 ) or qualitative levels indicating levels of binding strength.The mapping between quantitative values and qualitative levels is shown in Table 1 We combined the widely used IEDB benchmark dataset curated by Kim et al. Kim et al. (2014) and the latest data added to IEDB (downloaded from the IEDB website on Jun. 24, 2019).The benchmark dataset contains two datasets BD2009 and BD2013 compiled in 2009 and 2013, respectively.BD2009 consists of 137,654 entries, and BD2013 consists of 179,692 entries.The latest dataset consists of 189,063 peptide-MHC entries.Specifically, we excluded those entries with non-specific, mutant or unparseable allele names such as HLA-A2.We then combined the datasets by processing the duplicated entries and entries with conflicting affinities as follows.We first mapped the quantitative values of all these duplicated or conflicting entries into qualitative levels based on Table 1, and used majority voting to identify the major binding level of the peptide-MHC pairs.If such binding levels cannot be identified, we simply removed all the conflicting entires; otherwise, we assigned the average quantitative values in the identified major binding level to the peptide-MHC pairs.The combined dataset consists of 202,510 entries across 128 alleles and 53,253 peptides as in Table 2.We further normalized the binding affinity values ranging from 0 Chen et al.

RCNN4PMHC
to where x is the measured binding affinity value, and clamp(1 − log 50,000 (x), 0, 1) represents that 1 − log 50,000 (x) is clamped into range [0, 1].By using the above clamp function, smaller/larger binding affinity values corresponding to higher/lower binding affinities will be converted to higher/lower normalized values.

DEFINITIONS AND NOTATIONS
All the key definitions and notations are listed in Table 3.

METHODS
We developed two new models: ConvM and SpConvM (will be discussed in Section 5.1 and Section 5.2), and compare them with MHCflurry O' Donnell et al. (2018), where MHCflurry is the state-of-the-art and used as the baseline.In terms of the embeddings of amino acids, we compare the performance of SpConvM with three embedding methods for amino acids and their combinations.In terms of the loss functions, we

Convolutional Neural Networks with Attention Layers (ConvM)
In this section, we introduce our new model ConvM, a convolutional neural network with attention layers.Figure 1 presents the architecture of ConvM.

Peptide Representation in ConvM
In ConvM, we first represent each amino acid, denoted as a j , in a peptide sequence, denoted as p = [a 1 , ..., a j , ..., a n ] (started from C-end), using two types of information.The first information encodes the type of amino acids using BLOSUM62 matrix or Onehot encoding.The details about the encoding of amino acids are described in Section 6.1.1.The second information encodes the position of each amino acid in peptide sequences, as it has been demonstrated O 'Donnell et al. (2018) that different positions of peptides contribute differently to their binding to alleles.In particular, each position of the peptide sequences, regardless of which amino acid is at the position, will have two position vectors: for the j-th position from the C-end, we use o j and o −j ∈ R d o ×1 to represent the position information with respect to the C-end and the N-end, respectively.The two position vectors will together accommodate the variation of peptide lengths.Thus, each amino acid a j is represented as a feature vector , where e j ∈ R d e ×1 is a j 's embedding with an encoding method in Section 6.1.1,and d o is the dimension of position embedding vectors; a peptide of n amino acids is represented as a feature matrix The position vectors will be learned in order to optimize the peptide representations.

Model Architecture of ConvM
The ConvM model consists of a 1D convolutional layer, a self-attention layer and a fully-connected layer as demonstrated in Figure 1.The 1D convolutional layer takes the peptide feature matrix F ∈ R (d e +2d o )×n as input, and extracts local feature patterns from the peptide sequences via 1D convolution using d r kernels of size (d e + 2d o ) × k.The output of the 1D convolutional layer is an embedding matrix R = [r 1 , ..., r (n−k+1) ] ∈ R d r ×(n−k+1) , in which each column r i represents the embedding of the i-th k-mer out of the d r kernels.Batch normalization is applied to fix the mean and variance of the embedding Chen et al.

RCNN4PMHC
matrix R in each batch.After batch normalization, rectified linear unit (ReLU) activation is applied as the non-linear activation function on the embedding matrix.Then, we apply the self-attention mechanism Chorowski et al. (2015) to convert the embedding matrix into an embedding vector c for the input peptide as follows.First, the weight w i for i-th k-mer is calculated as follows, where and v ∈ R 1×d a are the parameters of self-attention layer, and d a is the number of hidden units in the self-attention layer.With the weight w i on each k-mer, the embedding of the whole sequence is calculated as the weighted sum of all k-mer embeddings, that is, The embedding vector c ∈ R d r ×1 of the input peptide is then fed into the fully-connected layer to predict peptide binding at the output layer.We will discuss the loss function used at the output layer later in Section 5.3.

Convolutional Neural Networks with Global Kernels and Attention Layers (SpConvM)
We further develop ConvM into a new model with global kernels, denoted as SpConvM as in Figure 1.The use of global kernels is inspired by Bhattacharya et al. Bhattacharya et al. (2017), which demonstrates that global kernels within CNN models can significantly improve the performance of peptide binding prediction.As ConvM primarily extracts and utilizes local features, the additional global kernels extract global features from the entire peptide sequences that could be useful for prediction but cannot be captured by local convolution.In order to use global kernels, we pad the peptide sequences of various lengths to length 15, with padding 0 vectors in the middle of the peptide representations in a same way as in MHCflurry.More details about padding are available later in Section 6.1.2.The padded peptide sequences will be encoded into a feature matrix F G in the same way as in ConvM, except that the position embeddings are not included because the global kernels will overwrite the local information after the convolution.
Given the input F G , the convolution using d g global kernels will generate a vector g ∈ R dg×1 .We concatenate g and c as in ConvM (i.e., the embedding vector calculated from local kernels) to construct a local-global embedding vector c = [c; g] for the input peptide sequence and feed c into the fully-connected layer to predict peptide prediction as in Figure 1.

Loss Functions
We propose three pair-wise hinge loss functions, denoted as H v , H l and H i , respectively.We will compare these loss functions with the widely used mean-square loss function O'Donnell et al. (2018), denoted as MS, in learning peptide bindings.

Hinge Loss Functions for Peptide Binding Ranking
We first evaluate the hinge loss as the loss function in conjunction with various model architectures.The use of hinge loss is inspired by the following observation.We noticed that in literature, peptide-MHC binding prediction is often formulated into either a regression problem, in which the binding affinities between peptides and alleles are predicted, or a classification problem, in which whether the peptide will bind to the allele is the target to predict.However, in practice, it is also important to first prioritize the Chen et al.

RCNN4PMHC
most promising peptides with acceptable binding affinities for further assessment, whereas regression and classification are not optimal for prioritization.Besides, recent work has already employed several evaluation metrics on top ranked peptides, for example, Zeng and Gifford (2019) evaluated the performance through the true positive rate at 5% false positive rate, which suggests the importance of top-ranked peptides in addition to accurate affinity prediction.All of these inspire us to consider ranking based formulation for peptide prioritization.
Given two normalized binding affinity values bi and bj of any two peptides p i and p j with respect to an allele, the allele-specific pair-wise ranking problem can be considered as to learn a scoring function S (•), such that Please note that S ( p i ) is a score for peptide p i , which is not necessarily close to the binding affinity bi , as long as it reconstructs the ranking structures among all peptides.This allows the ranking based formulation more flexibility to identify the most promising peptides without accurately estimating their binding affinities.To learn such scoring functions, hinge loss is widely used, and thus we develop three hinge loss functions to emphasize different aspects during peptide ranking.

Value-based Hinge Loss Function
The first hinge loss function, denoted as H v , aims to well rank peptides with significantly different binding affinities.Given two peptides p i and p j , this hinge loss function is defined as follows: where li denotes the binding level of peptide p i according to the Table 1; li > lj denotes that the binding level of peptide p i is higher than the peptide p j ; bi and bj are the ground-truth normalized binding affinities of p i and p j , respectively; c > 0 is a pre-specified constant to increase the difference between two predicted scores.H v learns from two peptides of different binding levels and defines a margin value between two peptides as the difference of their ground-truth binding affinities bi − bj plus a constant c.If two peptides p i and p j are on different binding levels li > lj , and the difference of their predicted scores is smaller than the margin c + ( bi − bj ), this pair of peptides will contribute to the overall loss; otherwise, the loss of this pair will be 0. Note that H v is only defined on peptides of different binding levels.For the peptides with the same or similar binding affinities, H v allows incorrect ranking among them.

Level-based Hinge Loss Function
Instead of ranking with respect to the margin as in H v , we relax the ranking criterion and use a margin according to the difference of binding levels (Table 1).Thus, the second hinge loss, denoted as H l , is defined as follows: where r > 0 is a constant.Given a pair of peptides in two different binding levels, similar to H v , H l requires that if the difference of their predicted scores is smaller than a margin, this pair of peptides will contribute to the overall loss; otherwise, the loss of these two peptides will be 0.However, unlike H v , the margin defined in H l depends on the difference of binding levels between two peptides (i.e., r × ( li − lj )).Therefore, in H l , the margin values of all the peptides ( p 1 , p 2 , • • • , p n ) on the level li to any other peptides on the level lj will be the same (i.e., r × ( li − lj )).Note that H l is defined on peptides of different binding levels, and thus also allows incorrect ranking among peptides of same binding levels as in H v ; the difference with H v is on how the margin is calculated.
Chen et al.

Constrained Level-based Hinge Loss Function
The third hinge loss function H i extends H l by adding a constraint that two peptides of a same binding level can have similar predicted scores.This hinge loss is defined as follows: Given a pair of peptides on a same binding level, the added constraint (the case if li = lj ) requires that if the absolute difference | S ( p i ) − S ( p j )| is smaller than the pre-specified margin r, the loss will be zero; otherwise, this pair will have non-zero loss.The constraint on the absolute difference allows incorrect ranking among peptides on a same binding level as long as their predicted scores are similar.

Mean-Squares Loss
We also compare a mean-squares loss function, denoted as MS, proposed in O'Donnell et al. ( 2018); Paul et al. (2019), to fit the entries without exact binding affinity values as below: if m i is qualitative and l i = 1 (i.e., negative binding), where "m i is quantitative" denotes that the peptide-MHC entry m i is associated with an exact binding affinity value x i .In this case, the MS loss is calculated as the squared difference between the predicted score S ( p i ) and bi ( bi is normalized from x i as in Equation 1).In Equation 8, "m i is qualitative" denotes that m i is associated with a binding level li instead of a binding affinity value (Table 1).In this case, bi is normalized from the binding level thresholds (i.e., x i ∈ {100, 500, 1000, 5000} in calculating bi in Equation 1).When qualitative m i has li = 1, that is, the peptide does not bind to the allele and the binding affinity is low (i.e., large binding affinity value), the predicted score S ( p i ) should be small enough compared to bi in order not to increase the loss.When quantitative m i has li > 1, that is, the peptide binds to the allele with reasonably high binding affinity (i.e., small binding affinity value), the predicted score S ( p i ) should be large enough compared to bi in order not to increase the loss.
Note that in MS, the predicted score S ( p ) needs to be normalized into range [0, 1].This is because b is in range [0, 1] (Equation 1) so that S ( p ) needs to be in the same range and thus neither S ( p ) nor b will dominate the squared errors due to substantially large or small values.However, in the three hinge loss functions (Equation 5, 6 and 7), the potential different range between S ( p ) and b or l could be accommodated by the constant c (Equation 5) or r (Equation 6 and 6), respectively.In MS, we use sigmoid function to normalize S ( p ).

Encoding Methods
Encoding methods represent each amino acid with a vector.2017) and Word2Vec embedding method Vang and Xie (2017).BLOSUM encoding utilizes the BLOSUM62 matrix Henikoff and Henikoff (1992), which measures the evolutionary divergence information among amino acids.We use the i-th row of the BLOSUM matrix as the feature of i-th amino acid.Onehot encoding represents the i-th natural amino acid with an one-hot vector, in which all elements are '0' except the i-th position as '1'.Word2Vec learns the embeddings of amino acids from their contexts in protein sequences or peptide sequences.This embedding method requires learning on a large corpus of amino acid sequences, and is much more complicated than Onehot.However, it is demonstrated Phloyphisut et al. (2019) that Word2Vec embedding method is comparable to Onehot encoding method, and therefore, we use BLOSUM encoding and Onehot encoding, but not Word2Vec.Besides the above encoding methods, we also evaluate another deep encoding method, denoted as Deep, in which the encoding of each amino acid is learned during the training process.Deep encoding is not deterministic and is learned during the training process; the dimension of embedding vector needs to be specified as a predefined hyper-parameter.We also combine different representations of amino acid generated by the above three encoding methods.These combinations include BLOSUM+Onehot, BLOSUM+Deep, Onehot+Deep and BLOSUM+Onehot+Deep, where "+" represents concatenation of the embeddings of amino acid from different encoding methods.

Baseline Method -Local Connected Neural Networks MHCflurry
MHCflurry O'Donnell et al. ( 2018) is a state-of-the-art deep model with locally-connected layers for peptide binding prediction.In MHCflurry, all peptides of length 8 to 15 are padded into length 15 by keeping the first and last four residues and inserting the padding elements in the middle (e.g., "GGFVPNMLSV" is padded to "GGFVXXPNXXXMLSV").The padded sequences are encoded into a feature matrix E ∈ R 15×20 using BLOSUM encoding method.MHCflurry employs locally-connected layers to extract local feature patterns for each k-mers in peptide sequences.Unlike CNN using common filters across all k-mer residues in peptides, locally-connected layers apply local filters for each k-mer to encode the position-specific features.The encoded feature embeddings for all k-mers are then concatenated into a vector, and fed into the fully-connected layer for binding prediction.To the best of our knowledge, MHCflurry is one of the best neural network model for allele-specific peptide binding prediction problem.

Batch Generation
For models with MS as the loss function, we randomly sample a batch of peptides as the training batch.For models with the proposed pair-wise hinge loss functions (H v , H l , H i ), to reduce computational costs, we construct pairs of peptides for each training batch from a sampled batch of peptides.Specifically, for H v and H l , each pair consists of two peptides from different binding levels; and for H i , the constructed pairs can consist of two peptides from the same or different binding levels.

Model Training
We use 5-fold cross validation to tune the hyper-parameters of all methods through a grid search approach.For each allele, we run the grid search algorithm to find the optimal hyper-parameters for the allele-specific model.We apply stochastic gradient descent (SGD) to optimize the loss functions.We use 10% of the training set as a validation set.This validation set is applied to adjust the learning rate dynamically and determine the early stopping of training process.We set the dimension of Deep encoding method as 20, which is equal to the dimension of BLOSUM and Deep encoding method.We also set both the constant c in H v and the constant r in H l and H i as 0.2.

Evaluation Metrics
We use 4 types of evaluation metrics, including average rank (AR), hit rate (HR), area under the roc curve (AUC), and ROC, to evaluate the performance of the various model architectures and loss functions.Both AR and HR metrics are employed to measure the effectiveness of our model on the prioritization of promising peptides.Specifically, AR metric measures the average overall rankings of promising peptides; HR metric measures the ratio of promising peptides ranked at top; AUC metric measures the possibility that positive Chen et al.

RCNN4PMHC
peptides are ranked higher than negative peptides; ROC metric measures the ratio of positive peptides that are prioritized higher than top-n false positive peptides.We denote s i as the rank of peptide p i based on their predicted scores, P h as the set of peptides with binding affinities smaller than h (e.g., h = 500nM).Then AR h (e.g., AR 500 ) is defined as follows, where | P h | is the size of P h .Smaller values of AR h indicate that promising peptides are ranked higher in general, and thus better model performance.
The hit rate HR h (e.g., HR 500 ) is defined as follows, where Pt denotes the set of peptides with predicted scores ranked at top t.Larger values of HR h indicate that more promising peptides are prioritized to top-t by the model, and thus better performance.
We use h = 500nM as the threshold to distinguish positive peptides and negative peptides, and apply two metrics for classification to evaluate the model performance.The first classification metric AUC is calculated as below, where P is the set of all peptides, and | P | is the number of peptides in the dataset; P500 is the set of all positive peptides, and | P500 | is the number of positive peptides; 1(•) is an indicator function (1(x) = 1 if x is true, otherwise 0).Larger values of AUC indicate that positive peptides are more likely to be ranked higher than negative peptides.ROC t (e.g., ROC 5 ) score is the area under the roc curve up to t false positives.ROC t is calculated as below.
Larger values of ROC t indicate that the model can prioritize more positive peptides up to first t false positive peptides.We use 7 metrics constructed from the above 4 types of metrics to evaluate the model performance.These 7 metrics include AR 100 , HR 100 , AR 500 , HR 500 , AUC, ROC 5 and ROC 10 .
In order to compare the models with respect to one single metric in a holistic way, we define a hybrid metric H by combining all the evaluation metrics .Given a model trained with a set of hyper-parameters Y, we denote its performance on metric "mtrc" (mtrc =AR h , HR h , AUC h , ROC h ) as mtrc(Y), and the best metric value as best Y (mtrc) = max Y (mtrc(Y)).Then, the hybrid metric H for a model with hyper-parameters Y is defined as below, Chen et al.

RCNN4PMHC
where I(↓ mtrc) is an identity function: I(↓ mtrc) = +1 if smaller values on metric mtrc indicate better performance; I(↓ mtrc) = −1 otherwise.For metrics AR 100 and AR 500 , a smaller value represents a better model performance.

EXPERIMENTAL RESULTS
We present the experimental results in this section.All the parameters used in the experiments are reported in the Appendix.

Model Architecture Comparison
We evaluate all the 12 possible combinations of the 3 model architectures (ConvM, SpConvM, MHCflurry) and the 4 loss functions (H v , H l , H i , MS) with all the encoding methods through 5-fold cross validation.Table 4 presents the overall performance comparison with BLOSUM+Onehot+Deep encoding method (encoding method comparison will be presented later in Section 7.3).We apply the grid search to determine the optimal hyperparameters of each method on each allele with respect to the hybrid metric H (Appendix Section A1.1), and report the best performance in Table 4.We use MHCflurry with MS loss in O 'Donnell et al. (2018) as the baseline, and calculate the percentage improvement of our methods over the baseline across 128 alleles.In Table 4, the best model for each allele is selected with respect to H ; the model performance is further evaluated using the 7 evaluation metrics.The values in the table are percentage improvement compared with the baseline MHCflurry with MS.Models are trained using BLOSUM+Onehot+Deep encoding methods, and selected with respect to H and evaluated using the 7 evaluation metrics.The best improvement with respect to each metric is bold.
Table 4 shows that as for the model architectures, on average, SpConvM achieves the best performance overall among all three model architectures (e.g., SpConvM with MS has 8.66% improvement in AR 100 and 8.14% in HR 100 over MHCflurry with MS).Please note that when we calculate the improvement, we exclude alleles on which our models achieve more than 150% improvement (typically no more than 15 such alleles under different metrics).This is to remove potential bias due to a few alleles on which the improvement is extremely substantial.SpConvM performs better than ConvM on average.SpConvM extends ConvM with global kernels to extract global features from the entire peptide sequences.The better performance of SpConvM than that of ConvM indicates that global features could capture useful information from entire peptide sequences, which are typically short, for binding prediction.In addition, MHCflurry outperforms ConvM on average.The difference between MHCflurry and ConvM is that MHCflurry learns Chen et al.

RCNN4PMHC
position-specific features via position-specific kernels, and ConvM learns local features via kernels that are common to all the locations.As demonstrated in other studies O'Donnell et al. ( 2018) that certain positions of peptides are more critical for their binding to alleles, the better performance of MHCflurry over ConvM could be attributed to its position-specific feature learning capablity.Moreover, since the peptide sequences are usually short (8-15 amino-acid long), it is very likely that these short sequences do not have strong local patterns, and thus ConvM could not capture a lot of useful local information.In comparison with MHCflurry, SpConvM integrates both local features via its ConvM component and global features via global kernels.Such integration could enable SpConvM to capture global information as compensation to local features, and thus to improve model performance.
In addition to using the hybrid metric H to determine the optimal hyperparameters, we also apply another four metrics AR 100 , HR 100 , AUC and ROC 5 to select the hyperparameters.The results of the best models in terms of these four metrics are presented in Table A1 A2 A3 and A4 in the Appendix, respectively.The results show the same trend as that in Table 4, that is, on average, SpConvM outperforms ConvM and MHCflurry on all 7 metrics and H v loss function is the best among all loss functions.4. Figure 2 shows that in ConvM, about half of the alleles have performance improvement compared to that in MHCflurry with MS.Overall, there is an average 4.71% improvement among all the alleles.Figure 3 shows that more alleles have performance improvement in SpConvM compared to that in ConvM and in MHCflurry with MS, and more alleles have significant improvement.This indicates the strong performance of SpConvM. Figure 4 shows that in comparison with MS as the loss function, MHCflurry has more improvement using H v as the loss function (average improvement 8.93%).
Chen et al.

Loss Function Comparison
Table 4 also demonstrates that H v is the most effective loss function in combination with each of the learning architectures, and all three hinge loss functions H v , H l and H i can outperform the MS loss function.For example, for SpConvM, the performance improvement of H v , H l , H i and MS in terms of AR 100 follows the order H v (11.58%) > H i (10.01%) > H l (8.97%) > MS(8.66%), compared with the baseline MHCflurry with MS.This trend is also consistent for ConvM and MHCflurry.The better performance of H v may be due to the use of a margin in the loss function that is determined by the binding affinity values (Equation 5).This value-based margin could enforce granular ranking among peptides even when they are from a same binding level.In H l (Equation 6) and H i (Equation 7), the margins are determined based on the levels of the binding affinities.While H l and H i can still produce ranking structures of peptides according to their binding levels, they may fall short to differentiate peptides of a same binding level.
All the three hinge loss functions H v , H l and H i outperform MS across all the model architectures.This might be due to two reasons.First, the pairwise hinge loss functions are less sensitive to the imbalance of different amounts of peptides, either strongly binding or weakly/non-binding, by sampling and constructing pairs from respective peptides.Thus, the learning is not biased by one type of peptides, and the models can better learn the difference among different types of peptides, and accordingly produce better ranking orders of peptides.Second, the pairwise hinge loss functions can tolerate insignificant measurement errors to some extent.All the three hinge loss functions do not consider pairs of peptides with similar binding affinities.This enables our models to be more robust and tolerant to noisy data due to the measurement inaccuracy of binding affinities.

Encoding Method Comparison
We evaluate three encoding methods (BLOSUM, Onehot, Deep) and their combinations (BLOSUM+Deep, Onehot+Deep, BLOSUM+Onehot, BLOSUM+Onehot+Deep) over SpConvM with H v loss (the best loss function overall) using BLOSUM through 5-fold cross validation.We report the results of best models across all 128 alleles in the same way as in Section 7.1 (i.e., model selection with respect to H , evaluated using the 7 metrics).Table 5 presents the average percentage improvement of the 7 encoding methods over the baseline on the 7 metrics.The reported results in Table 5 are from the models with the optimal hyperparameters that are selected according to the hybrid metric H . Table 5 shows that BLOSUM+Onehot+Deep encoding method achieves the best performance in general.BLOSUM+Onehot+Deep encodes the amino acids using their inherent evolutionary information via BLOSUM and identity of different amino acids via Onehot, both of which are deterministic and not specific to the learning problem, and also the allele-specific information via Deep, which is learned in the model and thus specific to the learning problem.The combination of deterministic, amino acid identities and learned features enables very rich information content in the embeddings, and could be the reason why it outperforms others.With a similar rationale, Chen et al.

RCNN4PMHC
BLOSUM+Deep achieves the second best performance in general.BLOSUM on its own outperforms Onehot and Deep, respectively, indicating BLOSUM is rich in representing amino acid information.Combing BLOSUM with Onehot and Deep, respectively, introduces notable improvement over BLOSUM alone, indicating that BLOSUM+Onehot and BLOSUM+Deep are able to represent complementary information rather than that in BLOSUM.Onehot on itself alone performs the worst primarily due to its very limited information content.Combing Onehot with Deep improves from Onehot but does not perform well compared to Deep alone.This may be due to that Onehot (i.e., amino acid identity) information still plays a substantial role in Onehot+Deep so Deep information does not supply sufficient additional information.The values in the table are percentage improvement compared with SpConvM with Hv using BLOSUM.Models are selected with respect to H and evaluated using the 7 evaluation metrics.The best improvement with respect to each metric is bold.
We also select the optimal set of hyperparameters with respect to AR 100 , HR 100 , AUC and ROC 5 , and report the corresponding results in Table A7, A6, A8 and A5 in Appendix, respectively.With different model selection metrics, the encoding methods have different performance.However, in general, BLOSUM+Onehot+Deep achieves better performance than other encoding methods over all the metrics.

Attention Weights
Figure 5 shows the attention weight of HLA-A-0201 data learned by the attention layer of ConvM (with BLOSUM+Onehot+Deep, H v ).In Figure 5, each column represents the weight of 1-mer embedding, that is, the embedding over one amino acid, because the best kernel size for HLA-A-0201 data in ConvM is 1; each row represents an attention weight learned for a specific peptide by ConvM. Figure 5 shows the amino acids located at the second position and the last position contribute most to the binding events.This is consistent with the conserved motif calculated by SMM matrix Peters and Sette (2005).This indicates that ConvM with the attention layer is able to accurately learn the importance of different positions in peptides in predicting peptide activities.We do not present the attention weights learned by SpConvM, as in SpConvM, the attention weights do not show binding patterns as clear as those in ConvM.This is due to that SpConvM incorporates both local features and global features, and the global features might significantly contribute to the prediction and therefore the contribution from local features is reduced.

CONCLUSIONS
Our methods contribute to the study of peptide-MHC binding prediction problem in two ways.First, instead of predicting the exact binding affinities values as in the existing methods, we formulate the problem as to prioritize most possible peptide-MHC binding pairs via a ranking-based learning.We developed three pairwise ranking-based learning objectives for such prioritization, and the corresponding learning methods that impose the peptide-MHC pairs of higher binding affinities ranked above those with lower binding affinities with a certain margin.Our experimental results in comparison with the state-of-the-art developed a deep recurrent neural network based on gated recurrent units (GRUs) to capture the sequential features from peptides of various length.Vang et al.Vang and Xie (2017) applied two layers of 1D convolution on the embeddings of peptide sequences so as to learn local binding patterns existing in each k-mer amino acids.O'Donnell et al.O'Donnell et al. (2018) designed a deep model named MHCflurry with locally-connected layers.This locally-connected layer is used to learn the position-specific local features from the peptide sequences.MHCflurry has been demonstrated to achieve better or similar performance compared with most of The other prediction methods Boehm et al. (2019); Zhao and Sher (2018).

Chen et al. RCNN4PMHC Figure 1 .
Figure 1.Architectures of ConvM and SpConvM developed three pair-wise hinge loss functions, and compare them with the conventional mean-square loss function used in MHCflurry.

Figure 5 .
Figure 5. Attention Weights for HLA-A-0201 Learned from ConvM . Note that higher IC 50 values indicate lower binding affinities.

Table 3 .
Notations original/normalized binding affinity for a peptide-MHC pair l binding level for a peptide-MHC pair embeddings e encoding vector of amino acid type r embedding vector of each amino acid o position embedding of each k-mer amino acids R feature matrix for a peptide sequence F G feature matrix for a padded peptide sequence (i.e., input of global kernel in SpConvM)