ORIGINAL RESEARCH article

Front. Genet., 25 April 2022

Sec. Statistical Genetics and Methodology

Volume 13 - 2022 | https://doi.org/10.3389/fgene.2022.875112

PredMHC: An Effective Predictor of Major Histocompatibility Complex Using Mixed Features

  • College of Electrical and Information Engineering, Quzhou University, Quzhou, China

Abstract

The major histocompatibility complex (MHC) is a large locus on vertebrate DNA that contains a tightly linked set of polymorphic genes encoding cell surface proteins essential for the adaptive immune system. The groups of proteins encoded in the MHC play an important role in the adaptive immune system. Therefore, the accurate identification of the MHC is necessary to understand its role in the adaptive immune system. An effective predictor called PredMHC is established in this study to identify the MHC from protein sequences. Firstly, PredMHC encoded a protein sequence with mixed features including 188D, APAAC, KSCTriad, CKSAAGP, and PAAC. Secondly, three classifiers including SGD, SMO, and random forest were trained on the mixed features of the protein sequence. Finally, the prediction result was obtained by the voting of the three classifiers. The experimental results of the 10-fold cross-validation test in the training dataset showed that PredMHC can obtain 91.69% accuracy. Experimental results on comparison with other features, classifiers, and existing methods showed the effectiveness of PredMHC in predicting the MHC.

Introduction

As a large locus on vertebrate DNA, the major histocompatibility complex (MHC) contains a tightly linked set of polymorphic genes encoding cell surface proteins that are essential for immune surveillance. These cell surface proteins are called MHC molecules (Kubiniok et al., 2022). MHC molecules are classified into MHC class I, MHC class II, and MHC class III according to variation in molecular structure, function, and distribution (Marcoux et al., 2021). MHC class I molecules are expressed on all nucleated cells and platelets (essentially all cells except red blood cells) and display antigens to cytotoxic T lymphocytes, which carry the cluster of differentiation 8 (CD8+) co-receptor (McShan et al., 2021). MHC class II molecules are expressed on antigen-presenting cells, such as B cells, dendritic cells, and macrophages, where they normally bind to CD4+ receptors on helper T cells to clear foreign antigens. MHC class III genes are interleaved with class I and class II genes on the short arm of human chromosome 6, but their proteins play different physiological roles.

MHC molecules are cell surface glycoproteins with a three-dimensional structure and are of vital importance to infection, autoimmunity, transplantation, and tumor immunotherapy. MHC-binding prediction plays an important role in identifying potential novel therapeutic strategies. Mahoney et al. (2021) pointed out that MHC phosphopeptides can be considered potential immunotherapeutic targets for cancer and other chronic diseases. Therefore, many researchers have studied MHC-binding prediction. The first computational method for uncovering MHC-binding peptides was developed by Altuvia et al. (1995); it is based on protein structure and was later improved to distinguish candidate peptides that bind to hydrophobic binding pockets of MHC molecules (Altuvia et al., 1997). SVRMHC (Liu et al., 2006) is an MHC-binding peptide model that encodes peptides with physicochemical properties and trains support vector machines to construct a prediction model on mouse data. NetMHC-3.0 (Lundegaard et al., 2008) is a high-performance web server for predicting peptide binders based on artificial neural networks. Boehm et al. (2019) proposed a method named ForestMHC to identify immunogenic peptides. ForestMHC encodes a peptide sequence with physicochemical properties and trains a random forest classifier to construct an identification model. Saxena et al. (2020) used a deep learning model named OnionMHC to predict the binding potential of peptides to the MHC, which is critical for designing peptide-based therapeutics. In consideration of the importance of structural information, OnionMHC represents peptides with sequence and structure-based features for peptide-HLA-A*02:01 binding prediction (Lv et al., 2020). Jiang et al. (2021) gave a comprehensive review of the state-of-the-art literature on MHC-binding peptide prediction and an in-depth evaluation of feature representation methods, prediction models, and model training strategies on benchmark datasets. To overcome the limitation that existing models handle only peptide sequences of fixed length, Jiang et al. proposed a novel variable-length MHC-binding prediction model named BVLSTM-MHC. Experimental results on an independent validation dataset showed that BVLSTM-MHC performs better than ten mainstream prediction tools.

Scientists are devoted to discovering MHC molecules in various vertebrate genomes. Hopkins et al. (1986) described a rat monoclonal antibody that recognizes MHC class II antigens in sheep and appears to recognize nonpolymorphic determinants. Using this antibody, they investigated the distribution of sheep class II molecules and the variation in class II expression by cells in efferent lymph and peripheral. Westbrook et al. (2015) combined single-molecule real-time (SMRT) sequencing with circular consensus sequencing (CCS) and validated SMRT-CCS for identifying class I transcripts in Mauritian-origin cynomolgus macaques. Furthermore, SMRT-CCS was applied to characterize 60 new full-length class I transcript sequences expressed in the Chinese cynomolgus monkey population. Using high-resolution pyrosequencing and Sanger sequencing, Shiina et al. (2015) genotyped 127 unrelated animals and identified 112 different alleles. Moreover, the International Society for Animal Genetics (ISAG) standardized the nomenclature and established the IPD-MHC database, which is used to curate MHC genes and allele sequences from nonhuman organisms (Giuseppe et al., 2017; Maccari et al., 2018; Ali et al., 2021; Burton et al., 2021; Karcioglu and Bulut, 2021; Roy et al., 2021; Safaei et al., 2021; Wang et al., 2021).

In the early stages, MHC research was based on experiments in mice. With the availability of large amounts of data and the development of machine learning, building machine learning–based models to study the MHC has become feasible. Li et al. (2019) proposed an MHC identification method based on an extreme learning machine algorithm. Although high accuracy has been achieved, many aspects are still worthy of further investigation (Lv et al., 2019; Lv et al., 2021a; Lv et al., 2021b). In this study, we propose a new MHC predictor, PredMHC, to further improve prediction performance.

Materials and Methods

Framework of PredMHC

In this study, we introduce a novel MHC predictor named PredMHC, the framework of which is shown in Figure 1. First, PredMHC encodes a protein sequence with mixed features including 188D, APAAC, KSCTriad, CKSAAGP, and PAAC. Second, three classifiers (SGD, SMO, and random forest) are trained on the mixed features of the protein sequence. Finally, the prediction result is obtained by the voting of the three classifiers. The datasets, feature extraction methods, and classifiers are introduced in detail in the following sections.

FIGURE 1. Framework of PredMHC.

Dataset

The dataset constructed by Li et al. (2019) is used in this study; it can be downloaded from the ELM-MHC web server developed by the same authors. We used the same dataset as ELM-MHC for three reasons. First, the dataset was constructed by searching for MHC sequences in the UniProt database, so it is reliable. Second, the dataset was de-duplicated with CD-HIT: the protein sequences were clustered according to the parameter settings, and the longest sequence in each cluster was kept as the representative sequence, which removes redundant and homology-biased sequences. Finally, and most importantly, using the same dataset allows a fair comparison with the existing method. The final dataset contains 13,488 protein sequences, consisting of 6,712 MHC protein sequences (positive examples) and 6,776 nonMHC protein sequences (negative examples). All protein sequences were divided into two groups: 10,790 sequences for 10-fold cross-validation and 2,698 sequences for independent validation. The training dataset (Train-10790) comprises 5,370 MHC protein sequences and 5,420 nonMHC protein sequences, randomly selected from the positive and negative examples, respectively. They were then further randomly divided into ten sets for 10-fold cross-validation. The independent testing dataset (Test-2698) contains 1,342 positive and 1,356 negative examples.
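The partition of Train-10790 for cross-validation can be illustrated with a minimal sketch. The function names, the fixed random seed, and the plain (non-stratified) shuffling are assumptions made for illustration; the exact splitting procedure used for the dataset is not described here.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold j takes every k-th shuffled index starting at offset j.
    return [indices[j::k] for j in range(k)]

def train_validation_splits(folds):
    """Yield (train, validation) index lists, holding out one fold at a time."""
    for held_out, validation in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, validation

folds = k_fold_indices(10790, k=10)
print([len(fold) for fold in folds])  # ten folds of 1,079 sequences each
```

Each of the ten rounds then trains on 9,711 sequences and validates on the remaining 1,079, and the reported scores are averaged over the rounds.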

Feature Extraction

To classify a protein sequence into different categories using machine learning, the first step is to encode the protein sequence with features. A feature that effectively discriminates positive examples from negative examples can greatly improve the prediction performance of the model. In this study, we encode protein sequences with mixed features including 188D, APAAC, KSCTriad, CKSAAGP, and PAAC. The mixed features represent a protein sequence from different perspectives and can therefore better distinguish different protein sequences.

SVMProt-188D

SVMProt-188D is a feature extraction method based on amino acid composition and physicochemical properties (Dubchak et al., 1995; Saxena et al., 2021). It encodes each protein sequence as a 188-dimensional feature vector. The first 20 features are the frequencies of the 20 amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y in alphabetical order) occurring in the sequence. The formula is defined as

$$f_i = \frac{N_i}{L}, \quad i = 1, 2, \ldots, 20,$$

where $N_i$ denotes the number of the ith amino acid in the protein sequence and $L$ denotes the length of the sequence. Obviously, $\sum_{i=1}^{20} f_i = 1$.
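As a minimal sketch of these first 20 features, the following function counts each amino acid and divides by the sequence length. The function name and the example sequence are hypothetical; note that the frequencies sum to 1 only when the sequence contains no nonstandard residue codes.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Return the 20 frequencies N_i / L in alphabetical one-letter order."""
    sequence = sequence.upper()
    length = len(sequence)
    if length == 0:
        raise ValueError("sequence must not be empty")
    return [sequence.count(aa) / length for aa in AMINO_ACIDS]

freqs = amino_acid_composition("ACDAAC")
print(freqs[:3])  # frequencies of A, C, and D
```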

The remaining 168 dimensions are derived from eight physicochemical properties, namely, hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure, and solvent accessibility. Each physicochemical property contributes 21 numbers. For each property, the amino acids are divided into three classes, and three descriptors are computed: composition (C), transition (T), and distribution (D). C indicates the proportion of amino acids of each class among all amino acids, and its dimension is 3; T represents the percentage frequency with which an amino acid of one class is followed by an amino acid of another class, and its dimension is 3; and D represents, for each class, the positions along the chain (as a fraction of the chain length) at which the first, 25, 50, 75, and 100% of the amino acids of that class are located, and its dimension is 15. Therefore, after analyzing the composition and eight physicochemical properties of amino acids, we obtain a total of 20 + (3 + 3 + 15) × 8 = 188 features.
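The 21 numbers obtained for one property can be sketched as follows. The three-class hydrophobicity grouping (polar, neutral, hydrophobic) is the commonly used one and is an assumption here, as is the rounding rule for the distribution descriptor; published implementations differ in such details.

```python
import math

# Assumed three-class hydrophobicity grouping: polar, neutral, hydrophobic.
HYDROPHOBICITY_CLASSES = ("RKEDQN", "GASTPHY", "CLVIMFW")

def ctd_descriptors(sequence, classes=HYDROPHOBICITY_CLASSES):
    """Return 3 composition + 3 transition + 15 distribution values."""
    lookup = {aa: idx for idx, group in enumerate(classes) for aa in group}
    labels = [lookup[aa] for aa in sequence.upper() if aa in lookup]
    n = len(labels)
    if n < 2:
        raise ValueError("need at least two standard residues")

    # C: fraction of residues in each class.
    composition = [labels.count(c) / n for c in range(3)]

    # T: fraction of adjacent pairs that switch between two given classes.
    transition = []
    for a, b in ((0, 1), (0, 2), (1, 2)):
        switches = sum(1 for x, y in zip(labels, labels[1:]) if {x, y} == {a, b})
        transition.append(switches / (n - 1))

    # D: relative position of the first, 25%, 50%, 75%, and last residue of each class.
    distribution = []
    for c in range(3):
        positions = [i + 1 for i, label in enumerate(labels) if label == c]
        for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                distribution.append(0.0)
                continue
            rank = max(1, math.ceil(fraction * len(positions)))
            distribution.append(positions[rank - 1] / n)
    return composition + transition + distribution

print(len(ctd_descriptors("RRGGCC")))  # 21 values for one property
```

Repeating this for all eight properties and prepending the 20 amino acid frequencies gives 20 + 21 × 8 = 188 values.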

Amphiphilic Pseudo Amino Acid Composition

Amphiphilic pseudo amino acid composition (APAAC), originally proposed by Chou (Chou, 2005; Lv et al., 2021a; Awais et al., 2021; Naseer et al., 2021; Yan et al., 2021), is an effective protein descriptor and has been applied in diverse protein sequence analyses. APAAC is different from traditional AAC: it incorporates a partial sequence-order effect by using the hydrophobicity and hydrophilicity of the constituent amino acids in a protein. For the convenience of the readers, we briefly introduce the concept of APAAC. Let $R_1 R_2 R_3 \ldots R_L$ be a protein sequence of length $L$, where $R_1$ denotes the residue at position 1, $R_2$ denotes the residue at position 2, and so forth. According to the definition of APAAC, a protein can be denoted as a vector $P$ with dimension $(20+2\lambda)$. Vector $P$ is defined as follows:

$$P = [P_1, P_2, \ldots, P_{20}, P_{20+1}, \ldots, P_{20+2\lambda}]^{T} \tag{1}$$

where $P_1, P_2, \ldots, P_{20}$ in Eq. 1 represent the classic AAC and the next $2\lambda$ discrete numbers describe the sequence correlation factors.
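For reference, the components of P are usually written as below in the standard formulation of APAAC (Chou, 2005). This is a sketch recalled from that formulation rather than taken from the present text: $f_u$ is the normalized occurrence frequency of the uth amino acid, $w$ is a weight factor, and $h^{1}$ and $h^{2}$ denote the standardized hydrophobicity and hydrophilicity values of a residue.

$$
P_u =
\begin{cases}
\dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{2\lambda} \tau_j}, & 1 \le u \le 20 \\[2ex]
\dfrac{w\,\tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{2\lambda} \tau_j}, & 21 \le u \le 20 + 2\lambda
\end{cases}
$$

$$
\tau_{2k-1} = \frac{1}{L-k} \sum_{i=1}^{L-k} h^{1}(R_i)\, h^{1}(R_{i+k}), \qquad
\tau_{2k} = \frac{1}{L-k} \sum_{i=1}^{L-k} h^{2}(R_i)\, h^{2}(R_{i+k}), \qquad k = 1, \ldots, \lambda
$$

Under this formulation, the odd-indexed factors capture hydrophobicity correlations and the even-indexed factors capture hydrophilicity correlations between residues that are k positions apart.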

K-Spaced Conjoint Triad

The k-spaced conjoint triad (KSCTriad) (Chao et al., 2018; Zhen et al., 2020) is an effective protein descriptor that has been widely applied in biological sequence analysis. Unlike the conjoint triad descriptor, which counts only triads of three consecutive amino acid units, KSCTriad also counts triads whose units are separated by k residues.
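A minimal sketch of this descriptor is given below. It assumes the seven residue classes of the original conjoint triad method and normalizes counts to frequencies; the class grouping, the `max_gap` parameter name, and the normalization are assumptions, and other implementations scale the counts differently.

```python
from itertools import product

# Assumed seven residue classes of the conjoint triad descriptor.
TRIAD_CLASSES = ("AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C")
CLASS_OF = {aa: str(idx) for idx, group in enumerate(TRIAD_CLASSES) for aa in group}
TRIADS = ["".join(t) for t in product("0123456", repeat=3)]  # 343 class triads

def ksc_triad(sequence, max_gap=1):
    """Count class triads at positions (i, i+g+1, i+2g+2) for g = 0..max_gap."""
    sequence = sequence.upper()
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(TRIADS, 0)
        step = gap + 1
        for i in range(len(sequence) - 2 * step):
            residues = (sequence[i], sequence[i + step], sequence[i + 2 * step])
            if all(r in CLASS_OF for r in residues):
                counts["".join(CLASS_OF[r] for r in residues)] += 1
        total = sum(counts.values())
        # Normalise to frequencies; other implementations scale by the maximum count.
        features.extend(counts[t] / total if total else 0.0 for t in TRIADS)
    return features

print(len(ksc_triad("AGVILFPYMTS", max_gap=1)))  # 343 features per gap value
```

With a gap of 0 the function reduces to the ordinary conjoint triad, so each additional gap value adds another 343 features.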

Composition of K-Spaced Amino Acid Group Pairs

The composition of k-spaced amino acid pairs (CKSAAP) method (Chen et al., 2010; Ahmad et al., 2021; Akbar et al., 2021; Al-Qazzaz et al., 2021; Alar and Fernandez, 2021; Alim et al., 2021; Buriro et al., 2021) describes the order-related information of a protein sequence by taking the occurrence frequency of two amino acids separated by k residues as a feature element. Because proteins are built from 20 amino acids, a 400-dimensional feature vector is obtained for each interval k. The composition of k-spaced amino acid group pairs (CKSAAGP) is a variation of the CKSAAP method. The 20 amino acids are classified into five groups based on the chemical properties of their side chains: the aliphatic group, aromatic group, positively charged group, negatively charged group, and uncharged group. The CKSAAGP method is based on the frequency of pairs of groups separated by k residues, which gives a 25-dimensional feature vector for each interval k.
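The group-pair counting can be sketched as follows. The assignment of residues to the five groups follows a commonly used grouping and is an assumption here, as are the function and parameter names.

```python
from itertools import product

# Assumed side-chain grouping: aliphatic, aromatic, positive, negative, uncharged.
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}
GROUP_PAIRS = list(product(GROUPS, repeat=2))  # 5 x 5 = 25 ordered pairs

def cksaagp(sequence, max_gap=2):
    """Return 25 group-pair frequencies for each gap k = 0..max_gap."""
    sequence = sequence.upper()
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(GROUP_PAIRS, 0)
        for i in range(len(sequence) - gap - 1):
            first, second = sequence[i], sequence[i + gap + 1]
            if first in GROUP_OF and second in GROUP_OF:
                counts[(GROUP_OF[first], GROUP_OF[second])] += 1
        total = sum(counts.values())
        features.extend(counts[p] / total if total else 0.0 for p in GROUP_PAIRS)
    return features

print(len(cksaagp("GAFK", max_gap=1)))  # 25 features per gap value
```

Grouping reduces each interval from 400 amino acid pairs to 25 group pairs, which keeps the feature vector compact when several values of k are combined.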

Pseudo-Amino Acid Composition

The conventional amino acid composition is defined in a 20-D space, and each dimension represents the frequency of occurrence of one of the 20 native amino acids. Different from the conventional amino acid composition, the pseudo-amino acid composition (Chou, 2001; Awais et al., 2021), which is a vector with $20+\lambda$ discrete components, contains much more sequence-order and sequence-length information. According to the concept of pseudo-amino acid composition, the feature is given by

$$P = [p_1, p_2, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+\lambda}]^{T},$$

where the first 20 components are the occurrence frequencies of the 20 amino acids in the protein, the same as in the conventional amino acid composition, while the additional components $p_{20+1}, \ldots, p_{20+\lambda}$ are the sequence-order correlation factors of the different ranks.
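A minimal sketch of how the 20 + λ components can be computed is shown below, following the standard formulation in which the rank-j correlation factor is the mean squared property difference between residues j positions apart. For brevity it uses a single property scale supplied by the caller; the function name, the `scale` argument, and the default weight of 0.05 are assumptions for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pseudo_amino_acid_composition(sequence, scale, lam=2, weight=0.05):
    """Return the 20 + lam PAAC components for one standardized property scale."""
    residues = [aa for aa in sequence.upper() if aa in scale]
    length = len(residues)
    if length <= lam:
        raise ValueError("sequence must be longer than lam")

    freqs = [residues.count(aa) / length for aa in AMINO_ACIDS]

    # theta_j: mean squared property difference between residues j positions apart.
    thetas = []
    for j in range(1, lam + 1):
        diffs = [(scale[residues[i + j]] - scale[residues[i]]) ** 2
                 for i in range(length - j)]
        thetas.append(sum(diffs) / (length - j))

    denominator = sum(freqs) + weight * sum(thetas)
    return [f / denominator for f in freqs] + [weight * t / denominator for t in thetas]

# Hypothetical property scale, used only to make the example runnable.
toy_scale = {aa: i / 10 for i, aa in enumerate(AMINO_ACIDS)}
print(len(pseudo_amino_acid_composition("ACDACD", toy_scale, lam=2)))  # 22 components
```

Because every component shares the same denominator, the 20 + λ values always sum to 1.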

Classifier

To obtain better classification results, we adopted the voting of three base classifiers as the final classification result. The three classifiers were, respectively, random forest, SMO, and SGD. The three classifiers are popular and have been successfully used in bioinformatics many times.

Random forest is an ensemble classifier based on the decision tree algorithm and was proposed by Breiman (2001). To solve regression or classification tasks, a random forest constructs many decision trees, each trained on a subset drawn from all the samples by bootstrap sampling, and obtains the prediction result by voting over these decision trees. Random forests are widely used in bioinformatics because of their low computational overhead and their ability to handle unbalanced data.

The support vector machine (SVM) (Hearst et al., 1998) is a well-known machine learning algorithm that completes various classification tasks by constructing a separating hyperplane in a high-dimensional space. However, the training speed of support vector machines is heavily influenced by data size. To solve this problem, the sequential minimal optimization (SMO) algorithm (Platt, 1999) was proposed, which decomposes the large quadratic programming (QP) problem of an original SVM into a series of the smallest possible QP problems. Moreover, the solution process of SMO needs no additional matrix storage, thus saving both time and space costs.

Stochastic gradient descent (SGD) is an iterative optimization algorithm that fits a model by minimizing a target (loss) function. The parameter values are first initialized and then repeatedly updated, each time using a gradient estimated from a randomly selected sample or small batch, until the target function converges. The SGD algorithm is widely used to process large-scale sparse data, such as in text classification tasks.
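The way the three base classifiers are combined can be illustrated with a short sketch of majority (hard) voting. Whether PredMHC votes on predicted labels or on class probabilities is not stated here, so hard voting is an assumption, and the prediction lists below are hypothetical.

```python
from collections import Counter

def majority_vote(*classifier_predictions):
    """Combine per-sample labels from several classifiers by hard voting."""
    combined = []
    for votes in zip(*classifier_predictions):
        label, _ = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined

# Hypothetical outputs of three base classifiers on four sequences (1 = MHC, 0 = nonMHC).
sgd_pred = [1, 0, 1, 0]
smo_pred = [1, 1, 0, 0]
rf_pred = [0, 1, 1, 0]
print(majority_vote(sgd_pred, smo_pred, rf_pred))  # [1, 1, 1, 0]
```

With three classifiers and two classes, a tie cannot occur, so every sequence receives the label chosen by at least two of the three models.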

Measurement

To evaluate the performance of the proposed method, we used four indicators commonly used in bioinformatics: sensitivity (SE), specificity (SP), accuracy (ACC), and Matthews correlation coefficient (MCC). The formulae of these indicators are as follows (Zhang et al., 2021a; Lv et al., 2021b; Zhang et al., 2021b; Zhang et al., 2021c; Zhang et al., 2021d; Zhang et al., 2021e; Zhao et al., 2021; Zhu et al., 2021; Zou et al., 2021; Zhao et al., 2022):

$$SE = \frac{TP}{TP + FN}$$

$$SP = \frac{TN}{TN + FP}$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP (true positives) is the number of MHC proteins correctly predicted as MHC proteins; FP (false positives) is the number of nonMHC proteins incorrectly predicted as MHC proteins; TN (true negatives) is the number of nonMHC proteins correctly predicted as nonMHC proteins; and FN (false negatives) is the number of MHC proteins incorrectly predicted as nonMHC proteins. SE and SP represent the predictive accuracy of the model on positive and negative samples, respectively. Both ACC and MCC represent the overall performance of the model. For all the aforementioned metrics, a higher score indicates better performance of the model.
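The four indicators can be computed directly from the confusion-matrix counts, as in the following sketch. The function name and the example counts are hypothetical and are not taken from the experiments reported in this study.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Compute SE, SP, ACC, and MCC from confusion-matrix counts."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty; report 0 then.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc}

# Hypothetical counts, not taken from the paper's experiments.
print(classification_metrics(tp=90, fp=5, tn=95, fn=10))
```

For these counts, SE is 0.90, SP is 0.95, ACC is 0.925, and MCC is approximately 0.851.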

Result and Discussion

Cross-Validation Results of Train-10790

In many experiments, we tried a variety of methods to extract highly recognizable features from protein sequences in the training set and used several algorithms to train the model to achieve optimal accuracy. The experimental comparison results of different features are explained in Performance of Different Features on Cross-Validation, and the experimental comparison results of different classifiers are explained in Performance of Different Classifiers on Cross-Validation.

Performance of Different Features on Cross-Validation

Using the voting of random forest, SMO, and SGD as the classification model, we first tried 188D, APAAC, KSCTriad, CKSAAGP, PAAC, and their combinations. Table 1 shows the performance of the five single features and of several well-performing feature combinations in the 10-fold cross-validation. As shown in Table 1, the mixed features proposed in this study have the highest MCC and ACC; thus, our method has the best overall performance. According to SE, APAAC has the highest score, whereas its ACC, MCC, and SP are lower, which indicates that APAAC is biased toward classifying a protein as an MHC protein. Similar to APAAC, PAAC also has a high SE and lower values on the other indicators. Therefore, from the overall perspective, the mixed features perform better than all the other feature sets.

TABLE 1

| Features | ACC | MCC | SE | SP |
|---|---|---|---|---|
| (1)-188D | 0.8953 | 0.7927 | 0.8596 | 0.9310 |
| (2)-APAAC | 0.8329 | 0.6824 | 0.9494 | 0.7108 |
| (3)-KSCTriad | 0.8764 | 0.7580 | 0.8177 | 0.9350 |
| (4)-CKSAAGP | 0.8682 | 0.7469 | 0.7826 | 0.9529 |
| (5)-PAAC | 0.8283 | 0.6739 | 0.9485 | 0.7018 |
| 188D + APAAC | 0.9003 | 0.8019 | 0.8735 | 0.9276 |
| APAAC + KSCTriad | 0.8872 | 0.7782 | 0.8386 | 0.9360 |
| KSCTriad + CKSAAGP | 0.8993 | 0.8039 | 0.8404 | 0.9576 |
| CKSAAGP + PAAC | 0.8848 | 0.7728 | 0.8376 | 0.9316 |
| 188D + APAAC + KSCTriad | 0.9121 | 0.8268 | 0.8734 | 0.9511 |
| APAAC + KSCTriad + CKSAAGP | 0.9054 | 0.8155 | 0.8518 | 0.9589 |
| KSCTriad + CKSAAGP + PAAC | 0.9041 | 0.8127 | 0.8516 | 0.9565 |
| 188D + APAAC + KSCTriad + CKSAAGP | 0.9157 | 0.8351 | 0.8701 | 0.9618 |
| APAAC + KSCTriad + CKSAAGP + PAAC | 0.9065 | 0.8178 | 0.8522 | 0.9608 |
| Our mixed feature | 0.9169 | 0.8370 | 0.8761 | 0.9587 |

Result of different features on Train-10790.

Performance of Different Classifiers on Cross-Validation

To verify the performance of the classifier used in this study, we compared it with other classifiers. Table 2 shows the experimental results. The voting of SGD, SMO, and random forest used in our identification system performs better than each single classifier: our classification model has 91.69% accuracy and an MCC of 0.8370, both higher than those of the other classifiers, which verifies that it has better overall performance. Our classification model wins on three of the four indicators, the highest number of wins. Table 2 also shows that the SE of our classification model is slightly lower than that of random forest. However, its ACC, MCC, and SP are clearly higher than those of random forest. Therefore, from the overall perspective, our classification model performs better than all the other classifiers.

TABLE 2

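| Classifiers | ACC | MCC | SE | SP |
|---|---|---|---|---|
| SGD | 0.8794 | 0.7600 | 0.8504 | 0.9081 |
| SMO | 0.9038 | 0.8106 | 0.8594 | 0.9478 |
| Random forest | 0.8850 | 0.7699 | 0.8830 | 0.8869 |
| Our classification model | 0.9169 | 0.8370 | 0.8761 | 0.9587 |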

Result of different classifiers on Train-10790.

Independent-Validation Results of Test-2698

To evaluate the generalization performance of the proposed model, we tested it on the Test-2698 dataset. In detail, we trained the model proposed in this study on the Train-10790 dataset and then computed its performance on the Test-2698 dataset. The experimental results are shown in Tables 3 and 4. As shown in Tables 3 and 4, the feature extraction method and classifier used in this study perform better than the other feature extraction methods and classifiers, respectively.

TABLE 3

| Features | ACC | MCC | SE | SP |
|---|---|---|---|---|
| 188D | 0.8926 | 0.7869 | 0.8593 | 0.9259 |
| APAAC | 0.8357 | 0.6892 | 0.9533 | 0.7139 |
| KSCTriad | 0.8741 | 0.7504 | 0.8355 | 0.9127 |
| CKSAAGP | 0.8774 | 0.7614 | 0.8098 | 0.9442 |
| PAAC | 0.8326 | 0.6826 | 0.9527 | 0.7056 |
| 188D + APAAC | 0.9010 | 0.8061 | 0.8482 | 0.9530 |
| APAAC + KSCTriad | 0.8940 | 0.7888 | 0.8697 | 0.9182 |
| KSCTriad + CKSAAGP | 0.9055 | 0.8155 | 0.8540 | 0.9573 |
| CKSAAGP + PAAC | 0.8901 | 0.7818 | 0.8571 | 0.9230 |
| 188D + APAAC + KSCTriad | 0.9172 | 0.8355 | 0.8938 | 0.9412 |
| APAAC + KSCTriad + CKSAAGP | 0.9130 | 0.8287 | 0.8729 | 0.9532 |
| KSCTriad + CKSAAGP + PAAC | 0.9155 | 0.8337 | 0.8769 | 0.9544 |
| 188D + APAAC + KSCTriad + CKSAAGP | 0.9198 | 0.8416 | 0.8841 | 0.9550 |
| APAAC + KSCTriad + CKSAAGP + PAAC | 0.9134 | 0.8300 | 0.8693 | 0.9574 |
| Our mixed feature | 0.9246 | 0.8502 | 0.9034 | 0.9466 |

Result of different features on Test-2698.

TABLE 4

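| Classifier | ACC | MCC | SE | SP |
|---|---|---|---|---|
| SGD | 0.8959 | 0.7918 | 0.8935 | 0.8982 |
| SMO | 0.9063 | 0.8147 | 0.8682 | 0.9440 |
| Random forest | 0.8948 | 0.7896 | 0.8913 | 0.8982 |
| Our classification model | 0.9246 | 0.8502 | 0.9034 | 0.9466 |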

Result of different classifiers on Test-2698.

Comparison With Other Predictors

To evaluate the performance of PredMHC, we compared it with ELM-MHC on the same dataset, including Train-10790 and Test-2698. The comparison results of the 10-fold cross-validation are shown in Table 5. As Table 5 shows, PredMHC scores higher than ELM-MHC on ACC, MCC, and SP, whereas its SE is lower. PredMHC therefore wins on three of the four indicators, and according to ACC and MCC, it has better overall performance than ELM-MHC. Therefore, PredMHC is superior to the existing method in the prediction of MHC proteins.

TABLE 5

| Method | ACC | MCC | SE | SP |
|---|---|---|---|---|
| ELM-MHC | 0.9166 | 0.822 | 0.893 | 0.908 |
| Our method | 0.9185 | 0.8403 | 0.8741 | 0.9627 |

Comparison of 10-fold cross-validation with the existing method on all data.

Conclusion

In this study, we proposed an efficient, reliable, and simple model for predicting MHC proteins based on mixed features. After a large number of comparative experiments, we selected the mixed features of 188D, APAAC, KSCTriad, CKSAAGP, and PAAC, which showed the best overall performance on both the 10-fold cross-validation of the training dataset and the independent test dataset. We then used the voting of SGD, SMO, and random forest to build a prediction model, which also achieved the best performance on both the training and test datasets. In terms of the main indicators, our model obtained an MCC of 0.8370 and an ACC of 0.9169 in the 10-fold cross-validation on the Train-10790 dataset and an MCC of 0.8502 and an ACC of 0.9246 in the independent validation on the Test-2698 dataset. In conclusion, we believe that our model provides an efficient and reliable method to screen MHC proteins from a large number of protein sequences. In the future, we will pay more attention to deep learning classifiers and evolution strategies (Tahoces et al., 2021; Tandel et al., 2021; Tavolara et al., 2021; Togacar, 2021; Tsiknakis et al., 2021; Turki and Taguchi, 2021; Usman et al., 2021; Vafaeezadeh et al., 2021; Wang et al., 2021; Watanabe et al., 2021; Yap et al., 2021; Yildirim et al., 2021).

Statements

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

Author contributions

Conceptualization, YL; data curation, DC; formal analysis, DC; project administration, DC; writing—original draft, YL; and writing—review and editing, DC.

Funding

This work was supported by the Research Start-up Funding Project of Quzhou University (BSYJ202112 and BSYJ202109), the National Natural Science Foundation of China (61901103 and 61671189), and the Natural Science Foundation of Heilongjiang Province (LH 2019F002).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

  • 1

    AhmadF.FarooqA.KhanM. U. G. (2021). Deep Learning Model for Pathogen Classification Using Feature Fusion and Data Augmentation. Cbio16 (3), 466–483. 10.2174/1574893615999200707143535

  • 2

    AkbarS.AhmadA.HayatM.RehmanA. U.KhanS.AliF.et al (2021). iAtbP-Hyb-EnC: Prediction of Antitubercular Peptides via Heterogeneous Feature Representation and Genetic Algorithm Based Ensemble Learning Model. Comput. Biol. Med.137, 104778. 10.1016/j.compbiomed.2021.104778

  • 3

    Al-QazzazN. K.AlyasseriZ. A. A.AbdulkareemK. H.AliN. S.Al-MhiqaniM. N.GugerC. (2021). EEG Feature Fusion for Motor Imagery: A New Robust Framework towards Stroke Patients Rehabilitation. Comput. Biol. Med.137, 104799. 10.1016/j.compbiomed.2021.104799

  • 4

    AlarH. S.FernandezP. L. (2021). Accurate and Efficient Mosquito Genus Classification Algorithm Using Candidate-Elimination and Nearest Centroid on Extracted Features of Wingbeat Acoustic Properties. Comput. Biol. Med.139, 104973. 10.1016/j.compbiomed.2021.104973

  • 5

    AliF.AkbarS.GhulamA.MaherZ. A.UnarA.TalpurD. B.et al (2021). AFP-CMBPred: Computational Identification of Antifreeze Proteins by Extending Consensus Sequences into Multi-Blocks Evolutionary Information. Comput. Biol. Med.139, 105006. 10.1016/j.compbiomed.2021.105006

  • 6

    AlimA.RafayA.NaseemI. (2021). PoGB-pred: Prediction of Antifreeze Proteins Sequences Using Amino Acid Composition with Feature Selection Followed by a Sequential-Based Ensemble Approach. Cbio16 (3), 446–456. 10.2174/1574893615999200707141926

  • 7

    AltuviaY.SchuelerO.MargalitH. (1995). Ranking Potential Binding Peptides to MHC Molecules by a Computational Threading Approach. J. Mol. Biol.249 (2), 244–250. 10.1006/jmbi.1995.0293

  • 8

    AltuviaY.SetteA.SidneyJ.SouthwoodS.MargalitH. (1997). A Structure-Based Algorithm to Predict Potential Binding Peptides to MHC Molecules with Hydrophobic Binding Pockets. Hum. Immunol.58 (1), 1–11. 10.1016/s0198-8859(97)00210-3

  • 9

    AwaisM.HussainW.RasoolN.KhanY. D. (2021). iTSP-PseAAC: Identifying Tumor Suppressor Proteins by Using Fully Connected Neural Network and PseAAC. Cbio16 (5), 700–709. 10.2174/1574893615666210108094431

  • 10

    BoehmK. M.BhinderB.RajaV. J.DephoureN.ElementoO. (2019). Predicting Peptide Presentation by Major Histocompatibility Complex Class I: an Improved Machine Learning Approach to the Immunopeptidome. BMC Bioinformatics20 (1), 7. 10.1186/s12859-018-2561-z

  • 11

    BreimanL. (2001). Random Forests. Mach Learn.45 (1), 5–32. 10.1023/a:1010933404324

  • 12

    BuriroA. B.AhmedB.BalochG.AhmedJ.ShoorangizR.WeddellS. J.et al (2021). Classification of Alcoholic EEG Signals Using Wavelet Scattering Transform-Based Features. Comput. Biol. Med.139, 104969. 10.1016/j.compbiomed.2021.104969

  • 13

    BurtonW. S.MyersC. A.JensenA.HamiltonL.ShelburneK. B.BanksS. A.et al (2021). Automatic Tracking of Healthy Joint Kinematics from Stereo-Radiography Sequences. Comput. Biol. Med.139, 104945. 10.1016/j.compbiomed.2021.104945

  • 14

    ChaoZ.WangC.LiuH.ZhouQ.QianL.GuoY.et al (2018). Identification and Analysis of Adenine N6-Methylation Sites in the rice Genome. Nat. Plants4 (8), 554–563. 10.1038/s41477-018-0214-x

  • 15

    ChenK.JiangY.DuL.KurganL. (2010). Prediction of Integral Membrane Protein Type by Collocated Hydrophobic Amino Acid Pairs. J. Comput. Chem.30 (1), 163–172. 10.1002/jcc.21053

  • 16

    ChouK-C. (2001). Prediction of Protein Cellular Attributes Using Pseudo-amino Acid Composition. Proteins Struct. Funct. Bioinformatics43 (3), 246–255. 10.1002/prot.1035

  • 17

    ChouK.-C. (2005). Using Amphiphilic Pseudo Amino Acid Composition to Predict Enzyme Subfamily Classes. Bioinformatics21 (1), 10–19. 10.1093/bioinformatics/bth466

  • 18

    DubchakI.MuchnikI.HolbrookS. R.KimS. H. (1995). Prediction of Protein Folding Class Using Global Description of Amino Acid Sequence. Proc. Natl. Acad. Sci.92 (19), 8700–8704. 10.1073/pnas.92.19.8700

  • 19

    GiuseppeM.JamesR.KeithB.GuethleinL. A.UnniG.JimK.et al (2017). IPD-MHC 2.0: an Improved Inter-species Database for the Study of the Major Histocompatibility Complex. Nucleic Acids Res.45 (D1), D860. 10.1093/nar/gkw1050

  • 20

    HearstM. A.DumaisS. T.OsunaE. (1998). Support Vector Machines: Training and Applications. IEEE Intel. Syst. App.13 (4), 18–28.

  • 21

    HopkinsJ.DutiaB. M.McconnellI. (1986). Monoclonal Antibodies to Sheep Lymphocytes. I. Identification of MHC Class II Molecules on Lymphoid Tissue and Changes in the Level of Class II Expression on Lymph-Borne Cells Following Antigen Stimulation In Vivo. Immunology59 (3), 433

  • 22

    JiangL.YuH.LiJ.TangJ.GuoY.GuoF. (2021). Predicting MHC Class I Binder: Existing Approaches and a Novel Recurrent Neural Network Solution. Brief. Bioinform.22 (6), bbab216. 10.1093/bib/bbab216

  • 23

    KarciogluA. A.BulutH. (2021). The WM-Q Multiple Exact String Matching Algorithm for DNA Sequences. Comput. Biol. Med.136, 104656. 10.1016/j.compbiomed.2021.104656

  • 24

    KubiniokP.MarcuA.BichmannL.KuchenbeckerL.SchusterH.HamelinD. J.et al (2022). Understanding the Constitutive Presentation of MHC Class I Immunopeptidomes in Primary Tissues. Iscience25 (2), 103768. 10.1016/j.isci.2022.103768

  • 25

    LiY.NiuM.ZouQ. (2019). An Improved MHC Identification Method with Extreme Learning Machine Algorithm. J. proteome Res.18 (3), 1392–1401. 10.1021/acs.jproteome.9b00012

  • 26

    LiuW.MengX.XuQ.FlowerD. R.LiT. (2006). Quantitative Prediction of Mouse Class I MHC Peptide Binding Affinity Using Support Vector Machine Regression (SVR) Models. BMC Bioinformatics7 (1), 182. 10.1186/1471-2105-7-182

  • 27

    LundegaardC.LamberthK.HarndahlM.BuusS.LundO.NielsenM. (2008). NetMHC-3.0: Accurate Web Accessible Predictions of Human, Mouse and Monkey MHC Class I Affinities for Peptides of Length 8-11. Nucleic Acids Res.36, W509–W512. 10.1093/nar/gkn202

  • 28

    LvZ.AoC.ZouQ. (2019). Protein Function Prediction: From Traditional Classifier to Deep Learning. Proteomics19 (14), e1900119. 10.1002/pmic.201900119

  • 29

    LvZ.CuiF.ZouQ.ZhangL.XuL. (2021). Anticancer Peptides Prediction with Deep Representation Learning Features. Brief Bioinform22 (5), bbab008. 10.1093/bib/bbab008

  • 30

    LvZ.DingH.WangL.ZouQ. (2021). A Convolutional Neural Network Using Dinucleotide One-Hot Encoder for Identifying DNA N6-Methyladenine Sites in the Rice Genome. Neurocomputing422, 214–221. 10.1016/j.neucom.2020.09.056

  • 31

    Lv, Z., Wang, P., Zou, Q., and Jiang, Q. (2020). Identification of Sub-golgi Protein Localization by Use of Deep Representation Learning Features. Bioinformatics 36 (24), 5600–5609. doi:10.1093/bioinformatics/btaa1074

  • 32

    Maccari, G., Robinson, J., Bontrop, R. E., Otting, N., de Groot, N. G., Ho, C. S., et al. (2018). IPD-MHC: Nomenclature Requirements for the Non-human Major Histocompatibility Complex in the Next-Generation Sequencing Era. Immunogenetics 70 (10), 619–623. doi:10.1007/s00251-018-1072-4

  • 33

    Mahoney, K. E., Shabanowitz, J., and Hunt, D. F. (2021). MHC Phosphopeptides: Promising Targets for Immunotherapy of Cancer and Other Chronic Diseases. Mol. Cell Proteomics 20 (640), 100112. doi:10.1016/j.mcpro.2021.100112

  • 34

    Marcoux, G., Laroche, A., Hasse, S., Bellio, M., Mbarik, M., Tamagne, M., et al. (2021). Platelet EVs Contain an Active Proteasome Involved in Protein Processing for Antigen Presentation via MHC-I Molecules. Blood 138 (25), 2607–2620. doi:10.1182/blood.2020009957

  • 35

    McShan, A. C., Devlin, C. A., Morozov, G. I., Overall, S. A., Moschidi, D., Akella, N., et al. (2021). TAPBPR Promotes Antigen Loading on MHC-I Molecules Using a Peptide Trap. Nat. Commun. 12 (1), 3174–3218. doi:10.1038/s41467-021-23225-6

  • 36

    Naseer, S., Hussain, W., Khan, Y. D., and Rasool, N. (2021). NPalmitoylDeep-PseAAC: A Predictor of N-Palmitoylation Sites in Proteins Using Deep Representations of Proteins and PseAAC via Modified 5-Steps Rule. Curr. Bioinform. 16 (2), 294–305. doi:10.2174/1574893615999200605142828

  • 37

    Platt, J. C. (1999). "Fast Training of Support Vector Machines Using Sequential Minimal Optimization," in Advances in Kernel Methods: Support Vector Learning

  • 38

    Roy, S., Sharma, B., Mazid, M. I., Akhand, R. N., Das, M., Marufatuzzahan, M., et al. (2021). Identification and Host Response Interaction Study of SARS-CoV-2 Encoded miRNA-like Sequences: An In Silico Approach. Comput. Biol. Med. 134, 104451. doi:10.1016/j.compbiomed.2021.104451

  • 39

    Safaei, M., Sundararajan, E. A., Driss, M., Boulila, W., and Shapi'i, A. (2021). A Systematic Literature Review on Obesity: Understanding the Causes & Consequences of Obesity and Reviewing Various Machine Learning Approaches Used to Predict Obesity. Comput. Biol. Med. 136, 104754. doi:10.1016/j.compbiomed.2021.104754

  • 40

    Saxena, D., Sharma, A., Siddiqui, M. H., and Kumar, R. (2021). Development of Machine Learning Based Blood-Brain Barrier Permeability Prediction Models Using Physicochemical Properties, MACCS and Substructure Fingerprints. Curr. Bioinform. 16 (6), 855–864. doi:10.2174/1574893616666210203104013

  • 41

    Saxena, S., Animesh, S., Fullwood, M., and Mu, Y. (2020). OnionMHC: A Deep Learning Model for Peptide - HLA-A*02:01 Binding Predictions Using Both Structure and Sequence Feature Sets. J. Micromech. Mol. Phys. 5 (03), 2050009

  • 42

    Shiina, T., Yamada, Y., Aarnink, A., Suzuki, S., Masuya, A., Ito, S., et al. (2015). Discovery of Novel MHC-Class I Alleles and Haplotypes in Filipino Cynomolgus Macaques (Macaca fascicularis) by Pyrosequencing and Sanger Sequencing. Immunogenetics 67 (10), 563–578. doi:10.1007/s00251-015-0867-9

  • 43

    Tahoces, P. G., Varela, R., and Carreira, J. M. (2021). Deep Learning Method for Aortic Root Detection. Comput. Biol. Med. 135, 104533. doi:10.1016/j.compbiomed.2021.104533

  • 44

    Tandel, G. S., Tiwari, A., and Kakde, O. G. (2021). Performance Optimisation of Deep Learning Models Using Majority Voting Algorithm for Brain Tumour Classification. Comput. Biol. Med. 135, 104564. doi:10.1016/j.compbiomed.2021.104564

  • 45

    Tavolara, T. E., Gurcan, M. N., Segal, S., and Niazi, M. K. K. (2021). Identification of Difficult to Intubate Patients from Frontal Face Images Using an Ensemble of Deep Learning Models. Comput. Biol. Med. 136, 104737. doi:10.1016/j.compbiomed.2021.104737

  • 46

    Togacar, M. (2021). Detection of Segmented Uterine Cancer Images by Hotspot Detection Method Using Deep Learning Models, Pigeon-Inspired Optimization, Types-Based Dominant Activation Selection Approaches. Comput. Biol. Med. 136, 104659. doi:10.1016/j.compbiomed.2021.104659

  • 47

    Tsiknakis, N., Theodoropoulos, D., Manikis, G., Ktistakis, E., Boutsora, O., Berto, A., et al. (2021). Deep Learning for Diabetic Retinopathy Detection and Classification Based on Fundus Images: A Review. Comput. Biol. Med. 135, 104599. doi:10.1016/j.compbiomed.2021.104599

  • 48

    Turki, T., and Taguchi, Y. h. (2021). Discriminating the Single-Cell Gene Regulatory Networks of Human Pancreatic Islets: A Novel Deep Learning Application. Comput. Biol. Med. 132, 104257. doi:10.1016/j.compbiomed.2021.104257

  • 49

    Usman, S. M., Khalid, S., and Bashir, S. (2021). A Deep Learning Based Ensemble Learning Method for Epileptic Seizure Prediction. Comput. Biol. Med. 136, 104710. doi:10.1016/j.compbiomed.2021.104710

  • 50

    Vafaeezadeh, M., Behnam, H., Hosseinsabet, A., and Gifani, P. (2021). A Deep Learning Approach for the Automatic Recognition of Prosthetic Mitral Valve in Echocardiographic Images. Comput. Biol. Med. 133, 104388. doi:10.1016/j.compbiomed.2021.104388

  • 51

    Wang, X., Wang, S., Fu, H., Ruan, X., and Tang, X. (2021). DeepFusion-RBP: Using Deep Learning to Fuse Multiple Features to Identify RNA-Binding Protein Sequences. Curr. Bioinform. 16 (8), 1089–1100. doi:10.2174/1574893616666210618145121

  • 52

    Watanabe, S., Sakaguchi, K., Murata, D., and Ishii, K. (2021). Deep Learning-Based Hounsfield Unit Value Measurement Method for Bolus Tracking Images in Cerebral Computed Tomography Angiography. Comput. Biol. Med. 137, 104824. doi:10.1016/j.compbiomed.2021.104824

  • 53

    Westbrook, C. J., Karl, J. A., Wiseman, R. W., Mate, S., Koroleva, G., Garcia, K., et al. (2015). No Assembly Required: Full-Length MHC Class I Allele Discovery by PacBio Circular Consensus Sequencing. Hum. Immunol. 76 (12), 891–896. doi:10.1016/j.humimm.2015.03.022

  • 54

    Yan, N., Lv, Z., Hong, W., and Xu, X. (2021). Editorial: Feature Representation and Learning Methods with Applications in Protein Secondary Structure. Front. Bioeng. Biotechnol. 9 (822). doi:10.3389/fbioe.2021.748722

  • 55

    Yap, M. H., Hachiuma, R., Alavi, A., Brüngel, R., Cassidy, B., Goyal, M., et al. (2021). Deep Learning in Diabetic Foot Ulcers Detection: A Comprehensive Evaluation. Comput. Biol. Med. 135, 104596. doi:10.1016/j.compbiomed.2021.104596

  • 56

    Yildirim, K., Bozdag, P. G., Talo, M., Yildirim, O., Karabatak, M., and Acharya, U. R. (2021). Deep Learning Model for Automated Kidney Stone Detection Using Coronal CT Images. Comput. Biol. Med. 135, 104569. doi:10.1016/j.compbiomed.2021.104569

  • 57

    Zhang, J., Sun, Q., and Liang, C. (2021). Prediction of lncRNA-Disease Associations Based on Robust Multi-Label Learning. Curr. Bioinform. 16 (9), 1179–1189. doi:10.2174/1574893616666210712091221

  • 58

    Zhang, Q., Zhou, J., and Zhang, B. (2021). Computational Traditional Chinese Medicine Diagnosis: A Literature Survey. Comput. Biol. Med. 133, 104358. doi:10.1016/j.compbiomed.2021.104358

  • 59

    Zhang, S., Yuan, Z., Wang, Y., Bai, Y., Chen, B., and Wang, H. (2021). REUR: A Unified Deep Framework for Signet Ring Cell Detection in Low-Resolution Pathological Images. Comput. Biol. Med. 136, 104711. doi:10.1016/j.compbiomed.2021.104711

  • 60

    Zhang, Y., Duan, G., Yan, C., Yi, H., Wu, F.-X., and Wang, J. (2021). MDAPlatform: A Component-Based Platform for Constructing and Assessing miRNA-Disease Association Prediction Methods. Curr. Bioinform. 16 (5), 710–721. doi:10.2174/1574893616999210120181506

  • 61

    Zhang, Z., Yu, S., Qin, W., Liang, X., Xie, Y., and Cao, G. (2021). Self-supervised CT Super-resolution with Hybrid Model. Comput. Biol. Med. 138, 104775. doi:10.1016/j.compbiomed.2021.104775

  • 62

    Zhao, S., Ju, Y., Ye, X., Zhang, J., and Han, S. (2021). Bioluminescent Proteins Prediction with Voting Strategy. Curr. Bioinform. 16 (2), 240–251. doi:10.2174/1574893615999200601122328

  • 63

    Zhao, X., Du, Y., and Zhang, R. (2022). A CNN-Based Multi-Target Fast Classification Method for AR-SSVEP. Comput. Biol. Med. 141, 105042. doi:10.1016/j.compbiomed.2021.105042

  • 64

    Zhen, C., Pei, Z., Fuyi, L., Marquez-Lago, T. T., André, L., Jerico, R., et al. (2020). iLearn: An Integrated Platform and Meta-Learner for Feature Engineering, Machine Learning Analysis and Modeling of DNA, RNA and Protein Sequence Data. Brief. Bioinform. 21 (3), 1047–1057. doi:10.1093/bib/bbz041

  • 65

    Zhu, Q., Fan, Y., and Pan, X. (2021). Fusing Multiple Biological Networks to Effectively Predict miRNA-Disease Associations. Curr. Bioinform. 16 (3), 371–384. doi:10.2174/1574893615999200715165335

  • 66

    Zou, Y., Wu, H., Guo, X., Peng, L., Ding, Y., Tang, J., et al. (2021). MK-FSVM-SVDD: A Multiple Kernel-Based Fuzzy SVM Model for Predicting DNA-Binding Proteins via Support Vector Data Description. Curr. Bioinform. 16 (2), 274–283. doi:10.2174/1574893615999200607173829

Summary

Keywords

protein classification, major histocompatibility complex, machine learning, feature extraction, identification

Citation

Chen D and Li Y (2022) PredMHC: An Effective Predictor of Major Histocompatibility Complex Using Mixed Features. Front. Genet. 13:875112. doi: 10.3389/fgene.2022.875112

Received

13 February 2022

Accepted

07 March 2022

Published

25 April 2022

Volume

13 - 2022

Edited by

Quan Zou, University of Electronic Science and Technology of China, China

Reviewed by

Chunyu Wang, Harbin Institute of Technology, China

Haiying Zhang, Xiamen University, China

*Correspondence: Yanjuan Li,

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
