Trans-Allelic Model for Prediction of Peptide:MHC-II Interactions

Major histocompatibility complex class two (MHC-II) molecules are trans-membrane proteins and key components of the cellular immune system. Upon recognition of foreign peptides presented in the MHC-II binding groove, CD4+ T cells mount an immune response against invading pathogens. Therefore, mechanistic identification and knowledge of the physicochemical features that govern interactions between peptides and MHC-II molecules are useful for the design of effective epitope-based vaccines, as well as for understanding immune responses. In this article, we present a comprehensive trans-allelic prediction model, a generalized version of our previous biophysical model, that can predict peptide interactions for all three human MHC-II loci (HLA-DR, HLA-DP, and HLA-DQ), using both peptide sequence data and structural information of MHC-II molecules. The advantage of this approach over other machine learning models is that it offers a simple and plausible physical explanation for peptide–MHC-II interactions. We train the model using a benchmark experimental dataset and measure its predictive performance using novel data. Despite its relative simplicity, we find that the model has comparable performance to the state-of-the-art method, NetMHCIIpan. Focusing on the physical basis of peptide–MHC binding, we find support for previous theoretical predictions about the contributions of certain binding pockets to the binding energy. In addition, we find that binding pocket P5 of HLA-DP, which was not previously considered a primary anchor, makes a strong contribution to the binding energy. Together, the results indicate that our model can serve as a useful complement to alternative approaches to predicting peptide–MHC interactions.


INTRODUCTION
Major histocompatibility complex class two (MHC-II) molecules are surface proteins that exist on the membrane of antigen presenting cells (APCs) such as macrophages, dendritic cells, and B cells. They bind short peptide fragments derived from exogenous proteins and present them to CD4+ helper T cells. Upon the recognition of foreign peptides presented by MHC-II molecules, the helper T cells (precisely speaking, CD4+ effector T cells) initiate proper adaptive immune responses, including enabling sufficient maturation of B cells and cytotoxic CD8+ T cells (1). Therefore, the binding of peptides to MHC-II molecules is considered to be a fundamental and prerequisite step in the initiation of adaptive immunity (2,3). As such, mechanistic identification of the basic determinants of peptide-MHC-II interactions presents potential for understanding the immune system's mechanisms and improving the process of designing peptide- and protein-based vaccines.
MHC genes for humans, referred to as human leukocyte antigen (HLA), are among the most polymorphic genetic elements found within a long continuous stretch of DNA on chromosome 6 (4). Such high polymorphism reflects the immense contribution of MHC molecules to the adaptive immune system and underpins their capacity to recognize a wide range of pathogens. Nonetheless, some viruses, such as hepatitis C, avian/swine influenza, and human immunodeficiency virus (HIV), undergo extensive mutations that allow them to partially escape recognition by the MHC molecules (5). MHC genes can be divided into HLA class I, II, and III. Loci corresponding to HLA class I are A, B, and C; HLA class II loci are DP, DQ, and DR; HLA class III genes encode for several other immune-related proteins and provide support for the former two classes (1,4).
MHC-II molecules influence the likelihood of success of organ transplantation, and there are well-established associations between many disorders and particular classes of MHC-II molecules. These include the contribution of HLA-DQ genes to insulin-dependent diabetes (6) and of HLA-DR genes to multiple sclerosis and narcolepsy (7), along with other autoimmune diseases resulting from degeneracy and misregulation in the process of peptide presentation (8). Moreover, genetic and epidemiological data have implicated MHC-II molecules in susceptibility to many infectious diseases such as HIV/AIDS and malaria (9), as well as to cancer (10).
Experimental assays of peptide-MHC-II interactions face important obstacles, including the substantial resources, time, and labor needed for laboratory work. This is the case in particular for experimental work aimed at finding out which promiscuous epitopes bind to specific MHC molecules, a necessary step in the design of peptide-based vaccines that protect against a broad range of pathogen variants. Computational methods, which are more efficient and less costly than biological assays, have been employed to complement these assays. Due to advances in sequencing technologies, immunological data have grown at an unprecedented pace and continue to accrue. This has been exploited in systematic computational analyses of the genomes of multiple pathogens to determine which subunits might induce a potent immune response. The result has been the design and development of new vaccine candidates against HIV, influenza, and other hyper-variable viruses (11). The use of computational methods has reduced experimental effort and costs by up to 85% (12).
Many immunoinformatics methods for prediction of peptide-MHC interactions, for both class I and II, have been developed. They range from machine learning approaches such as simple pattern motifs (13), support vector machines (SVM) (14), hidden Markov models (HMM) (15), and neural network (NN) models (16–18), to quantitative structure-activity relationship (QSAR) analysis (19), structure-based methods, and biophysical methods (2,20,21; Degoot et al., unpublished). These methods can be divided into two categories, namely, intra-allele (allele-specific) and trans-allele (pan-specific) methods. Intra-allelic methods are trained for a specific MHC molecule on a limited set of experimental peptide-binding data and applied for prediction of peptides binding to that molecule. Because of the extreme polymorphism of MHC molecules, with thousands of allele variants, combined with the lack of sufficient experimental binding data, it is impossible to build a prediction model for each allele. Thus, trans-allele and general purpose (22) methods such as MULTIRTA (2), NetMHCIIpan (18), and TEPITOPEpan (23) have been developed using richer peptide-binding data spanning many alleles or species (18). Similar methods for MHC-I are also available, such as NetMHCpan (24,25) and KISS (26).
The trans-allelic models are often designed to extrapolate either structural similarities or shared physicochemical binding determinants among HLA genes, to predict affinities for alleles that are not part of the training dataset. These models generally have better predictive performance for new alleles and a wide range of potential applications compared with the intra-allelic models.
Most of the existing trans-allelic models for MHC-II are extended versions of their earlier intra-allelic counterparts: TEPITOPEpan (23) was extended from TEPITOPE (27); MULTIRTA (2) evolved from RTA (20); and the NetMHCIIpan series (versions 1.0, 2.0, 3.0, and 3.1) (17,18,28,29) was generalized from the NN-align (30) method. In the same vein, in this article, we present a trans-allele method, an extension of our previous method (Degoot et al., unpublished), for prediction of peptide-HLA class II interactions based on biophysical ideas.
The principal strength of the method presented here over other existing advanced data-driven approaches is its physical basis. We formulate the binding between a peptide and an MHC-II molecule as an inverse problem of statistical physics. From the observable macroscopic (bound and unbound) states of experimental data, we compute the microscopic parameters (Hamiltonians for amino acid residues involved in the interaction) that govern the system. In fact, many problems in computational biology can be solved in such a way (31,32), taking advantage of the availability of vast amounts of genomic data and high-resolution structural information. Solutions obtained using this approach are more plausible and physically interpretable than those obtained using purely sequence-based methods (2; Degoot et al., unpublished). In addition, because sparsity is a hallmark feature of biological processes, we adjust the model's parameters by incorporating an L1 regularization term into the model. The L1 constraint, commonly named Lasso, encourages sparsity and improves the predictive performance of the model on novel data.
The rest of this article is organized as follows: in Section 2.1, we describe the idea of MHC-II polymorphic residue groups, which is employed to capture structural similarity among MHC-II alleles. In Section 2.2, we define our methodology and formulate the learning function. After that, we briefly describe the benchmark dataset used to test the predictive performance of the model in Section 2.3 and present the results in Section 3. Finally, in Section 3.3, we summarize and discuss our results and compare our method with the state-of-the-art method.
domains equally contribute to the binding affinity. For HLA-I molecules, the β domain is largely conserved, and variation occurs mostly in the α domain. On the other hand, polymorphism occurs in both domains of HLA-II molecules, except for HLA-DR alleles, where the variation takes place in the β domain. In addition, the peptide-binding groove of HLA-II molecules is open at both ends, which allows binding of peptides of variable lengths (33), ranging from 9 to 30 amino acid residues, or even an entire protein (29,34). This is in contrast to the peptide-binding groove of HLA-I alleles, which accommodates only short peptides of 8 to 11 amino acids. This loose constraint on peptide length, together with the immense polymorphism of MHC-II molecules, contributes to a lower predictive performance of computational methods for peptide-MHC-II interactions compared with MHC-I methods (2,22).
The notion of MHC polymorphic residue groups, introduced by Bordner and Mittelmann (2), is based on a simple observation of an intrinsic (peptide-independent) feature of the MHC-II binding groove. Although a peptide could bind to an MHC-II molecule in various registers, due to the open-ended nature of the MHC-II binding groove, the strength of the binding affinity is primarily determined by the 9 residues occupying the binding groove pockets. Interestingly, most of the polymorphism in MHC-II genes occurs at these binding pockets (see the discussion in Section 3.3).
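Because the groove is open-ended, a peptide of length n offers n − 8 candidate 9-residue cores (registers). The following is a minimal sketch of that enumeration; the function name `binding_registers` and the example peptide are illustrative choices, not part of the published method.

```python
def binding_registers(peptide, core_length=9):
    """Return every contiguous core of `core_length` residues in `peptide`."""
    n_registers = len(peptide) - core_length + 1
    if n_registers < 1:
        return []  # peptide too short to span the nine pockets
    return [peptide[i:i + core_length] for i in range(n_registers)]


# Example with an arbitrary 13-residue peptide: 13 - 9 + 1 = 5 registers.
cores = binding_registers("PKYVKQNTLKLAT")
print(len(cores))  # 5
print(cores[0])    # PKYVKQNTL
```

Each returned core places its residues at pockets P1 to P9 in order, which is the alignment the pocket-level energies refer to.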
From the limited crystallographic structural data available in the Protein Data Bank (PDB) (35) for peptide-MHC-II complexes of a few MHC-II molecules (summarized in Table S1 in Supplementary Material), sets of important positions can be extracted. These are the positions of polymorphic residues in the binding groove that contact one or more peptide-binding cores, within a distance of not more than 4 Å (2, 18, 36), in one or more of the MHC-II complex structures. Then, by extrapolating the similarities among MHC molecules, their corresponding residues in different genes are determined using multiple sequence analysis (MSA) (37). Exploiting the fact that HLA-DR alleles are polymorphic only in the β domain and have the same α domain, the polymorphic residue groups for HLA-DR are extracted from its β domain sequences. Similarly, assuming sufficiency of the β domains for predicting MHC-peptide binding preferences (2) and for the sake of simplicity of the model, residue groups for HLA-DP and HLA-DQ were also extracted from the β domain.
Next, the set of polymorphic residues that always co-occur at the specified positions are clustered into the same group. The rationale of clustering polymorphic residue groups, rather than individual residues, is to avoid over-parametrization of the model. Table S2 in Supplementary Material shows such polymorphic residue groups for HLA-DRB, HLA-DP, and HLA-DQ alleles, assembled by the procedures described earlier.
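The grouping step can be illustrated with a small sketch. It assumes, as a simplification, that each distinct combination of residues observed at a pocket's contact columns in an alignment forms one polymorphic residue group; the function name, the toy sequences, and the column indices are all hypothetical.

```python
from collections import defaultdict


def residue_groups(aligned_sequences, pocket_positions):
    """Group alleles by the residues they carry at each pocket's contact columns.

    aligned_sequences: dict mapping allele name -> aligned beta-chain sequence
    pocket_positions:  dict mapping pocket number -> list of 0-based columns
    Returns {pocket: {residue_combination: [alleles carrying it]}}.
    """
    groups = {}
    for pocket, columns in pocket_positions.items():
        members = defaultdict(list)
        for allele, sequence in aligned_sequences.items():
            key = "".join(sequence[c] for c in columns)
            members[key].append(allele)
        groups[pocket] = dict(members)
    return groups


# Toy alignment (not real HLA sequences) and hypothetical contact columns.
toy_alleles = {"allele_A": "GDTRY", "allele_B": "GDTRF", "allele_C": "GETRY"}
toy_pockets = {1: [1, 4], 4: [0, 2]}
groups = residue_groups(toy_alleles, toy_pockets)
print(len(groups[1]))  # 3 groups at pocket 1: DY, DF, EY
print(len(groups[4]))  # 1 group at pocket 4: GT
```

Counting the keys per pocket gives the number of groups G(j) for that pocket, so the parameter count grows with the number of observed combinations rather than with the number of alleles.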

Trans-Allele Model
In our previous intra-allele model (Degoot et al., unpublished), the probability that peptide $P^{(k)}$ binds an MHC molecule $M^{(T(k))}$ was computed by equation (1) as a function of $\delta E^{(k)}$, the change in binding energy expressed as the sum of the differences of first- and second-order Hamiltonians between the bound and unbound states. Specifically, $\delta E^{(k)}$ is given by equation (2), in which $|P^{(k)}|$ is the length of peptide $k$, $R$ is the number of all possible configurations (registers) in which the peptide binds to the particular MHC molecule, and $\delta S$ is the difference in entropy between the bound and unbound states.
For the trans-allele model, two changes were introduced into the second term of equation (2). First, instead of the residue-residue interaction $\delta H^{(2)}$ between a residue on the peptide sequence and residue $b_j$ on the MHC binding pocket, we focus on the residue-polymorphic group interaction $\delta H^{(2)}$ between a peptide residue and $g_{jn}$, where $g_{jn}$ is residue group number $n$ of position $j$ as defined in Section 2.1. Next, we introduce a binary operator $T(k, j, n)$ that equals 1 if the MHC molecule type $M^{(T(k))}$ corresponding to peptide $P^{(k)}$ contains polymorphic residue group $n$ at the set of pre-determined positions of pocket $j$, and equals 0 otherwise. Hence, $\delta E^{(k)}$ is given by equation (3), where $G(j)$ is the number of polymorphic residue groups for binding pocket $j$. Column two of Table S2 in Supplementary Material shows $G(j)$, $j = 1, 2, \ldots, 9$, for HLA-DR, HLA-DP, and HLA-DQ alleles.
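To make the role of the indicator T(k, j, n) concrete, the sketch below scores a peptide against one allele. It is illustrative only: the averaging over registers, the sign convention (negative energy favours binding), the logistic link, the omission of first-order terms, and the lookup table `h2` are assumptions made here, not the model's exact equations.

```python
import math


def register_energy(core, allele_groups, h2):
    """Sum second-order terms for one 9-residue core.

    allele_groups[j] is the residue-group id the allele carries at pocket j,
    so the indicator T(k, j, n) is 1 only for n == allele_groups[j].
    h2 is a hypothetical table {(residue, pocket, group): energy}.
    """
    return sum(
        h2.get((residue, pocket, allele_groups[pocket]), 0.0)
        for pocket, residue in enumerate(core)
    )


def binding_energy(peptide, allele_groups, h2, delta_s=0.0):
    """Assumed form: mean register energy minus an entropy term."""
    n_registers = len(peptide) - 8
    if n_registers < 1:
        raise ValueError("peptide must have at least 9 residues")
    total = sum(
        register_energy(peptide[i:i + 9], allele_groups, h2)
        for i in range(n_registers)
    )
    return total / n_registers - delta_s


def binding_probability(delta_e):
    """Assumed logistic link: lower energy gives higher binding probability."""
    return 1.0 / (1.0 + math.exp(delta_e))


# Toy parameters: leucine at pocket index 0 is attractive for group 0.
toy_h2 = {("L", 0, 0): -2.0}
toy_groups = [0] * 9
energy = binding_energy("LAAAAAAAAA", toy_groups, toy_h2)
print(energy)  # -1.0 (two registers, only the first places L at pocket 0)
print(binding_probability(energy) > 0.5)  # True
```

Because only the group actually carried by the allele contributes at each pocket, alleles that share a group at a pocket share the corresponding parameters, which is what lets the model transfer information across alleles.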
Let $\Delta$ denote the model's parameters. Using equations (1) and (3), we formulate, through the maximum likelihood approach, the cost function of equation (4), which combines an empirical loss function $G_k(\Delta)$ for each peptide $k$ with a regularization term $\lambda P(\Delta)$. Here $y_k \in \{0, 1\}$ is the experimental value ($y_k = 1$ for binding peptides and $y_k = 0$ for non-binding ones), $\lambda > 0$ is a hyper-parameter, and $d$ is the dimension of the parameter vector $\Delta$, which varies depending on the type of MHC-II molecule. The L1 constraint penalty term $P(\Delta)$, the sum of the absolute values of the $d$ parameters and also known as Lasso (38), has an important role in the model. As the model is defined on a large number of parameters ($d$ = 2,321, 561, and 401 for HLA-DR, HLA-DP, and HLA-DQ molecules, respectively), only a few parameters are expected to contribute to the binding affinity while the rest are expected to be noisy. Lasso filters out the noisy parameters by inducing sparsity in the model, as it shrinks most of the parameter values to 0, and thereby avoids overfitting. The hyper-parameter $\lambda$ controls the degree of sparsity of the model: the larger the value of $\lambda$, the sparser the model. Equation (4) is a non-linear function and, due to the L1 constraint, non-smooth. It is, however, convex, and we minimized it, after quadratic approximation, by means of an iterative, cyclic coordinate descent approach using a soft-thresholding operator. This learning function takes the form of a generalized linear model, and the algorithm we used to solve it is both fast and efficient. Details of this optimization method are found in Friedman et al. (39) and are summarized in the supplementary material.
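The optimization strategy described above (quadratic approximation followed by cyclic coordinate descent with soft-thresholding) can be sketched in plain Python for a generic L1-penalized logistic regression. This is a didactic sketch in the spirit of Friedman et al., not the authors' implementation: the feature matrix `X`, the absence of an intercept, the fixed iteration counts, and the toy data are all assumptions.

```python
import math


def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def fit_l1_logistic(X, y, lam, n_outer=25, n_inner=50):
    """Minimize mean logistic loss + lam * ||beta||_1 (no intercept)."""
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    for _ in range(n_outer):
        # Quadratic approximation of the log-likelihood at the current beta.
        eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
        eta = [max(-30.0, min(30.0, e)) for e in eta]  # guard exp overflow
        p = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [max(pi * (1.0 - pi), 1e-5) for pi in p]
        z = [e + (yi - pi) / wi for e, yi, pi, wi in zip(eta, y, p, w)]
        # Cyclic coordinate descent on the penalized weighted least squares.
        for _ in range(n_inner):
            for j in range(d):
                num = den = 0.0
                for i in range(n):
                    partial = sum(beta[k] * X[i][k] for k in range(d) if k != j)
                    num += w[i] * X[i][j] * (z[i] - partial)
                    den += w[i] * X[i][j] ** 2
                beta[j] = soft_threshold(num / n, lam) / (den / n) if den > 0 else 0.0
    return beta


# Toy data: the label follows the first feature.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
print(fit_l1_logistic(X, y, lam=0.3))   # [0.0, 0.0]: a strong penalty zeroes everything
print(fit_l1_logistic(X, y, lam=0.05))  # a weaker penalty lets the first weight grow
```

The soft-thresholding step is what produces exact zeros, so raising `lam` yields progressively sparser parameter vectors.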

Binding Affinity Dataset
The model has been developed by using both quantitative peptide-binding data and MHC-II molecule sequences. We obtained a total of 51,023 peptide-binding measurements for 24 HLA-DR, 5 HLA-DP, and 6 HLA-DQ molecules from the IEDB database (40). This is a well-curated dataset and was used to develop NetMHCIIpan (18), the state-of-the-art method. The binding affinity data were given in the form of log-transformed measurements of the IC50 (half-maximal inhibitory concentration) according to the formula 1 − log(IC50)/log(50,000) (16). We dichotomized these data using a moderate threshold of IC50 = 500 nM (equivalent to 0.426 on the log-transformed scale). Peptides with an IC50 less than or equal to 500 nM (log-transformed value ≥ 0.426) were considered binders, and all others non-binders. This moderate threshold, which has been used in previous methods including the state-of-the-art method (20,29,30,41), allows us to make direct comparisons. Amino acid sequences for the MHC-II alleles used in this study were obtained from the EMBL-EBI online database (42). Table 1 gives a summary of the peptide-binding dataset used to develop the method.
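The transformation and the binder threshold can be checked with a few lines of code. The function names below are illustrative; the formula and the 500 nM cutoff are the ones stated in the text.

```python
import math


def transform_ic50(ic50_nm, max_ic50=50_000.0):
    """Log-transform an IC50 value (nM): 1 - log(IC50) / log(50,000)."""
    return 1.0 - math.log(ic50_nm) / math.log(max_ic50)


def is_binder(ic50_nm, threshold_nm=500.0):
    """A peptide is labelled a binder when IC50 <= 500 nM."""
    return ic50_nm <= threshold_nm


print(round(transform_ic50(500.0), 3))  # 0.426
print(is_binder(120.0), is_binder(5000.0))  # True False
```

Lower IC50 values map to higher transformed scores, so the 500 nM cutoff on the raw scale corresponds to scores of at least 0.426 on the transformed scale.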

RESULTS
This section presents prediction results of the model obtained from the dataset of three MHC-II allotypes as described in Section 2.3. We applied a fivefold cross validation analysis to the model and compared it against its intra-allelic version (Table S3 in Supplementary Material). We also examine its predictive performance on data which were previously unseen by the model.

Performance of the Trans-Allele Model
We tested the predictive performance of the model by using fivefold cross validation. The partitioning of the data used in fivefold cross validation was previously done by Andreatta et al. (29), by clustering together peptides in a way that minimizes over-estimation of predictive performance, using the technique described by Nielsen et al. (30). Figure 1 shows results of the test done using alleles belonging to the three MHC-II loci considered in this study. The performance was measured in terms of area under the curve (AUC) (43) values, which range between 0 and 1. The higher the AUC value, the better the predictive performance of the model. Values below 0.5 reflect a worse performance than random guessing. The model has an excellent performance for HLA-DP molecules (average AUC value = 0.930), and a good predictive power for both HLA-DQ and HLA-DR molecules (average AUC values = 0.830 and 0.802, respectively). The surprisingly excellent performance for HLA-DP could be the result of both a higher structural similarity (see Section 3.3) and a higher number of peptides per allele for HLA-DP. Indeed, for all HLA-DP alleles, the number of available peptides exceeds the empirically required number of peptide-binding measurements (≈200 peptides (22)), but this is not the case for all HLA-DR alleles. HLA-DQ alleles have a sufficient number of peptide measurements, but they have a lower structural similarity than HLA-DP alleles (see Section 3.3). Table S3 in Supplementary Material shows AUC values obtained with the intra-allele and trans-allele versions of the model. For the intra-allele version, the model was evaluated on peptide-binding data corresponding to an individual allele only. On average, the performance of the trans-allele model is comparable to that of the intra-allele model for HLA-DP (0.930 vs 0.928), worse for HLA-DQ (0.830 vs 0.857), and better for HLA-DR (0.780 vs 0.771) (Figure 2).
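For readers less familiar with the metric, the AUC equals the probability that a randomly chosen binder receives a higher score than a randomly chosen non-binder. A minimal pairwise sketch follows; the function name and toy scores are illustrative, and a library routine would normally be used for large datasets.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one binder and one non-binder")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # 1.0 (perfect ranking)
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))  # 0.75
```

A value of 0.5 corresponds to a ranking no better than chance, which is why it serves as the baseline in the comparisons reported here.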

Comparing the Intra-Allele vs Trans-Allele Methods
These results support two important observations. First, there is a common binding preference among MHC-II loci, which is the basis of all trans-allelic models, and it has been successfully captured by the definition of MHC-II polymorphic groups for the HLA-DP locus, and to a lesser extent for HLA-DQ and HLA-DR. Second, the trans-allelic model is able to extrapolate similarities among the MHC-II allotypes and achieve good predictive performance. As a result, the overall performance of the trans-allelic model is comparable to that of the intra-allele model, even though the former is applied to a much more diverse set of MHC-II sequences.
The decreased performance of the trans-allelic model compared with the intra-allelic method for HLA-DQ molecules is consistent with results reported for NetMHCIIpan (18). Here we suggest that this is probably because of the limited structural information available for HLA-DQ alleles. In fact, because of this limited structural information there are only 17 polymorphic residue groups for all the 9 binding pockets defined for HLA-DQ alleles. By contrast, there are 25 and 115 polymorphic residue groups defined for HLA-DP and HLA-DR molecules, respectively.
TABLE 1 | The first column gives the names of the 35 genes used to develop the method, distributed as 24, 5, and 6 for HLA-DR, HLA-DP, and HLA-DQ genes, respectively. The second column represents the index for each allele in the EMBL-EBI database (42). The third and fourth columns give the total number of peptides and the number of binder peptides, respectively, per allele. The last column shows the percentage of binder peptides. Binder peptides were identified using an IC50 binding cutoff of 500 nM, as in previous studies (2,17,18,30). The last row presents the overall statistics for the last three columns.
Another reason for the reduction of the trans-allelic model's performance for HLA-DQ alleles is that there is a large sequence diversity of MHC-II molecules belonging to this locus. We will examine the empirical support for this assertion in Section 3.3.

Prediction on a Novel Dataset
We examined the predictive power of the model on a blind dataset, i.e., a dataset which was not used in the training phase. More precisely, to make peptide-binding predictions for a particular allele, we train the model on an entirely different allele. The allele used for training was chosen based on its similarity to the focal allele as quantified using three different approaches: nearest-neighbor, Hamming distance, and Leave-One-Out (LOO).
In the nearest-neighbor approach, the distance between two MHC molecules A and B is defined as in Ref. (17) in terms of S(A, B), the BLOSUM50 (44) score between the amino acid sequences of A and B. The BLOSUM50 matrix quantifies the likelihood that one amino acid will be substituted by another on evolutionary time scales, so the resulting distance reflects the genetic distance between two sequences. The Hamming distance simply counts the positions at which the corresponding amino acid residues of two sequences differ. For both the nearest-neighbor and Hamming metrics, we train the model on peptide data belonging to the nearest allele, and then assess its accuracy in terms of AUC values calculated from peptide data belonging to the focal allele using the fitted parameters. However, unlike TEPITOPE and the series of NetMHCIIpan methods, which define the nearest neighbor at the pocket level, we derive both the nearest-neighbor metric and the Hamming distance at the residue level. Our choice is based on the fact that accounting for the entire MHC-II sequence provides broader allele coverage (2) and hence extends the model's applicability. Computing sequence similarity at the residue level is an intuitive and natural way to compare sequences, even though other, more artificial approaches may be more computationally efficient. We found that 71% (for HLA-DR), 60% (HLA-DP), and 67% (HLA-DQ) of alleles used for training were consistent between the residue-level and pocket-level approaches. These statistics indicate that, as mentioned before, most MHC-II polymorphisms occur at the binding pockets.
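Both distances can be sketched for pre-aligned sequences of equal length. Two assumptions are made here and should be read as such: the nearest-neighbor distance is taken to be the normalized form d = 1 − S(A,B)/sqrt(S(A,A)·S(B,B)) used in the NetMHCpan family of methods, and `toy_score` is a stand-in for a real BLOSUM50 lookup. All names and sequences are hypothetical.

```python
import math


def hamming_distance(seq_a, seq_b):
    """Number of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def similarity(seq_a, seq_b, score):
    """Sum of per-position substitution scores S(A, B)."""
    return sum(score(a, b) for a, b in zip(seq_a, seq_b))


def nn_distance(seq_a, seq_b, score):
    """Assumed normalized distance: 1 - S(A,B) / sqrt(S(A,A) * S(B,B))."""
    s_ab = similarity(seq_a, seq_b, score)
    s_aa = similarity(seq_a, seq_a, score)
    s_bb = similarity(seq_b, seq_b, score)
    return 1.0 - s_ab / math.sqrt(s_aa * s_bb)


def toy_score(a, b):
    """Stand-in for BLOSUM50: reward identity, penalize mismatch."""
    return 5 if a == b else -1


def nearest_allele(focal, sequences, distance):
    """Name of the allele (other than `focal`) closest to the focal allele."""
    others = [name for name in sequences if name != focal]
    return min(others, key=lambda name: distance(sequences[focal], sequences[name]))


toy = {"allele_A": "GDTRY", "allele_B": "GDTRF", "allele_C": "AETKF"}
print(hamming_distance(toy["allele_A"], toy["allele_B"]))  # 1
print(nearest_allele("allele_A", toy, hamming_distance))   # allele_B
```

Passing the scoring function as an argument keeps the sketch independent of any particular substitution-matrix library; a real analysis would replace `toy_score` with BLOSUM50 values.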
The LOO approach involved partitioning the data into two parts: the peptide-binding data not belonging to the allele under consideration are used to learn the model's parameters, and the remaining data, the peptide-binding data belonging to the focal allele, are used as test data. Figure 3 shows a comparison of results from these three approaches (details are in Table S4 in Supplementary Material). The results show that, regardless of the metric used, the trans-allele method has a high predictive power for HLA-DP alleles and a moderate predictive power for the other alleles.
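The leave-one-out partition is simple to express in code. The sketch below assumes each record is an (allele, peptide, label) tuple; the record layout, the function name, and the toy records are illustrative only.

```python
def leave_one_allele_out(records, focal_allele):
    """Split records into training data (all other alleles) and test data
    (the focal allele only)."""
    train = [r for r in records if r[0] != focal_allele]
    test = [r for r in records if r[0] == focal_allele]
    return train, test


# Toy records: (allele, peptide, binder label).
toy_records = [
    ("allele_A", "PKYVKQNTLKLAT", 1),
    ("allele_A", "AAAAAAAAAAAAA", 0),
    ("allele_B", "GELIGILNAAKVP", 1),
    ("allele_C", "QQQQQQQQQQQQQ", 0),
]
train, test = leave_one_allele_out(toy_records, "allele_A")
print(len(train), len(test))  # 2 2
```

Because no peptide measured on the focal allele reaches the training set, the resulting AUC reflects how well the shared parameters transfer to an allele the model has not seen.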
The much higher predictive power for HLA-DP compared with the other loci is likely due to the comparatively lower sequence diversity of HLA-DP alleles. To make this assertion more precise, we carried out a regression analysis by defining the AUC values from the LOO approach as functions of both NN and Hamming metric distances. Figure 4 gives the results of our analysis. As seen in Figure 4, all HLA-DQ alleles fall below the least squares lines for both metrics (blue points). We also found that model performance for HLA-DP alleles (red points) increases as the distance between alleles decreases. The authors of NetMHCIIpan arrived at the same conclusion (18), but only for the NN metric.

Analysis of the Model's Parameters
To determine the key factors that contribute to the binding affinities for the three MHC-II loci considered in this study, we calculated the Hamiltonians corresponding to each amino acid residue and the 9 binding pockets of the MHC-II binding groove. These values were then averaged over only the polymorphic residue groups defined for each pocket containing the particular amino acid.
Analysis of HLA-DR parameters revealed that pocket P1 has moderate attractive interactions with the peptide (negative energies indicated by blue color in Figure 5), via hydrophobic (I, L, W, Y) side chains and, to a lesser extent, via the aromatic (F, W) amino acids and a single hydrophilic residue (K). Notably, previous studies (2,46) arrived at a similar conclusion of a strong tendency of position P1 toward interactions involving hydrophobic side chains. The repulsive interactions (positive energies indicated by red color in Figure 5) of pocket P1 mostly occur with the hydrophilic side chains (D, E, N, S, T) and the aliphatic residue (A). Generally, most of the primary anchor pockets (P1, P4, P6, P7, P9) confer attractive interactions, but pocket P1 makes the largest contribution. This is consistent with results obtained using the MULTIRTA method (2). Among the secondary anchors, we found that pocket P2 has attractive interactions with aromatic (F, Y) and hydrophobic (I, M, Y) side chains. The most repulsive interactions come from pocket P8, which has strongly unfavorable interactions involving the side chains of residues C, D, E, F, G, I, L, W, and Y (see Figure 5A).
FIGURE 2 | Comparing results between the intra-allele (gray bars) and the trans-allele (red bars) methods in terms of AUC values. These bars show that there is a significant increase in performance of the trans-allele method for HLA-DR molecules and a decrease for HLA-DQ molecules compared with the intra-allele method. The difference for the HLA-DP locus is limited.
For HLA-DP, we found that pocket P9 has significantly attractive interactions involving the hydrophobic residue (L). This is consistent with the previous results of Ref. (47) (see Figure 5B). Also, we found that pockets P4 and P5 have important attractive interactions with the peptide via hydrophobic (Y) and aromatic (F) side chains, respectively. The contribution of pocket P4 is concordant with other studies such as Ref. (41), but the contribution of pocket P5 was not reported in the study of Andreatta and Nielsen (47), which was specifically dedicated to HLA-DQ and HLA-DP alleles. Furthermore, we found that the other two pockets, P1 and P6, which were reported as primary anchors in that study, make a moderate contribution to the calculated binding energies (see Figure 5B).
The pattern of energetic contributions for HLA-DQ alleles is less ordered. There is no common pattern except for the significant attractive interaction of pocket P1 via the hydrophobic residue (W) and the repulsive interaction of pocket P4 via the side chains of C, E, and D (see Figure 5C). This finding is in line with the observations of Andreatta and Nielsen (47).

Discussion
Interactions between peptides and MHC-II molecules are central to the adaptive immune system. Precise prediction and knowledge of the physicochemical determinants that govern such interactions are useful in designing effective and affordable epitope-based vaccines, in providing insights into the immune system's mechanisms, and in understanding the pathogenesis of diseases. In this study, we have developed a trans-allelic model that can predict peptide interactions with the three human MHC-II loci. It can be readily applied to MHC-II molecules of other species provided that the relevant structural information is available. This method is based on biophysical ideas, an alternative to the dominant machine learning approaches.
The model presented here is, in addition to NetMHCIIpan, only the second trans-allelic method that allows comprehensive prediction analysis of peptide binding to all three human MHC-II loci. Most trans-allelic models for MHC-II peptides are restricted to HLA-DR and HLA-DP alleles. The TEPITOPEpan method (23), which is popular among immunologists and is the successor of a pioneer method in this field, is limited to HLA-DR alleles.
In this work we employed the definition of MHC polymorphic residue groups of the MULTIRTA method (2), which is more intuitive and inclusive than the MHC pseudo sequences of NetMHCIIpan (18), in developing our trans-allelic model. Utilizing new structural data for MHC-II complexes, which were not present when MULTIRTA was being developed, we extended that idea to cover all three human MHC-II loci. There exist similar exercises for capturing structural similarity among MHC molecules. The earlier works of Murthy and Stern (48) and Sinigaglia and Hammer (49) were mostly limited to HLA-DR molecules. But in a previous study (2), the "polymorphic residue groups" were shown to be useful for inferring the interaction energy. This physical way of capturing structural similarity among MHC molecules works well in our biophysical approach.
FIGURE 6 | Performance comparison between our model and NetMHCIIpan. Each model was used to predict the probability of peptide binding to query alleles belonging to each of three HLA loci (i.e., HLA-DP, HLA-DQ, and HLA-DR) after training it using peptide-binding data for a different allele. The allele that was most similar to the query allele was used for training. As in previous work (18), similarity between HLA alleles was defined based on two metrics: nearest neighbor (NN) and leave-one-out (LOO). See the text for definitions of these metrics. For each query allele, we measured each model's predictive performance (accounting for both sensitivity and specificity) by calculating an AUC value. The higher the AUC value, the better the predictive performance. The plot shows the average difference between the AUC values for alleles belonging to the same locus obtained using our model vs. the corresponding values obtained using NetMHCIIpan, when similarity is defined based on either (A) the NN or (B) the LOO metric. Error bars denote SDs. Strikingly, our model performs better than NetMHCIIpan when predicting peptide binding to HLA-DQ using the NN metric (P-value = 0.015). For all other cases, both models have equivalent performance.
We compared how well our model predicts the MHC-II allele binding preferences of a novel peptide dataset vs. how well the state-of-the-art NetMHCIIpan method performs the same task. In this comparison we applied both our model and NetMHCIIpan to predict binding preferences for peptides known to either bind or not bind a reference allele after training both models using peptide-binding data for a second allele. For a given MHC-II locus, the second allele was the one that was most similar to the reference allele. Similarity was quantified based on either a leave-one-out approach or a nearest-neighbor approach (see Section 3.3). When using the nearest-neighbor approach, we found that our model performs significantly better than NetMHCIIpan in predicting peptide-binding preferences for HLA-DQ alleles (P-value = 0.015; Figure 6A). Furthermore, at the 95% confidence level, for all other cases, we found no significant difference between the performances of the two models (Figure 6).
These results are reassuring and indicate that our inverse-physics approach constitutes a promising complement to the widely used pattern-based approach to peptide-MHC-II binding predictions. The outstanding predictive accuracy of NetMHCIIpan is not the result of its theoretical basis. Rather, it derives from the use of sophisticated ensembles of neural networks, which are very powerful. However, our method has a distinguishing advantage over the advanced machine learning models in that it is more physically meaningful. It is worth noting that our prediction results for peptide-MHC-II interactions were based on in silico analysis of real data. Additional in vivo and in vitro investigations are needed to further validate the reported predictive performance.