Edited by: Noton Kumar Dutta, Johns Hopkins University, United States
Reviewed by: Tikam Chand Dakal, Hospital Maisonneuve-Rosemont and University of Montreal, Canada; William Farias Porto, Universidade Católica Dom Bosco, Brazil
This article was submitted to Antimicrobials, Resistance and Chemotherapy, a section of the journal Frontiers in Microbiology
†These authors have contributed equally to this work.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Designing drug delivery vehicles using cell-penetrating peptides is an active area of research in medicine. In the past, a number of
Throughout human history, therapeutic molecules have been used to cure illness and to extend lives (Tosato et al.,
A universal mechanism of CPP internalization remains an open question, as the pathways involved are not yet fully understood. The difficulty arises from the differing sizes, physicochemical properties, and concentrations of diverse CPPs and CPP-conjugates (Guidotti et al.,
Despite the numerous properties and potential applications of CPPs, their use in practice is still limited. The primary limitation associated with CPPs is entrapment in the endosomal compartment, which reduces the bioavailability of the drug several-fold. It has been shown in the literature that the bioavailability of CPPs can be increased several-fold by introducing a chemical modification into a CPP (Postlethwaite et al.,
In the last few years, several computational methods have been developed for the prediction of CPPs. These methods are based on various features such as amino acid composition (Sanders et al.,
Cell-penetrating peptides were extracted from the CPPsite 2.0 database (Agrawal et al.,
The dataset was divided into two sets, namely a training (main) dataset and a validation dataset (Bhalla et al.,
Atom composition was computed for CPPs and non-CPPs by converting peptide structures to SMILES format using Open Babel (O'Boyle et al.,
Where atom (
We computed the diatom composition in the same way as the atomic composition for CPPs and non-CPPs. The diatomic composition gives the composition of each pair of atoms (e.g., C-C, C-O, etc.) in each residue of the peptide, and is used to convert modified peptides of variable length into fixed-length feature vectors. The diatomic composition yields a fixed-length vector of 64 (8 × 8) features.
Where diatom (
A biological property of any chemical molecule is determined by its chemical descriptors, which have been used in the past to develop QSAR based molecules (Kumar et al.,
To calculate amino acid composition for the positive and negative datasets, we substituted the symbol of each modified residue with that of its original natural amino acid. This left us with sequences containing only the 20 natural amino acids, which generated a vector of length 20.
Here,
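The amino acid composition described above can be sketched as follows. This is a minimal illustration, assuming the composition is expressed as a percentage of the residues in the sequence and that modified residues have already been replaced by their natural counterparts; the function name `amino_acid_composition` and the alphabetical residue ordering are illustrative choices, not taken from the original implementation.

```python
from collections import Counter

# The 20 natural amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-element vector of percent composition for a peptide."""
    counts = Counter(sequence.upper())
    # Only natural residues contribute to the total length used here.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no natural amino acids")
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    vector = amino_acid_composition("RRKK")
    print(len(vector), vector[AMINO_ACIDS.index("R")])
```

Every peptide, regardless of its length, maps to a vector of the same size, which is what allows a standard classifier to be trained on it.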
We also calculated the dipeptide composition of the peptides, since it provides global information about the peptide. The dipeptide composition was calculated using Equation 4, and it generated a vector of length 400 (20 × 20).
Where dipeptide (
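As an illustration of the 400-element dipeptide vector, the sketch below counts overlapping residue pairs and normalizes by the number of pairs in the sequence. The percentage normalization, the function name `dipeptide_composition`, and the ordering of the 400 dipeptides are assumptions made for this example.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered residue pairs (ordering is an assumption).
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-element vector of percent dipeptide composition."""
    seq = sequence.upper()
    # Overlapping pairs: positions (0,1), (1,2), ..., (L-2, L-1).
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not valid:
        raise ValueError("sequence contains no valid dipeptides")
    counts = Counter(valid)
    return [100.0 * counts[dp] / len(valid) for dp in DIPEPTIDES]


if __name__ == "__main__":
    vector = dipeptide_composition("RRKR")
    print(len(vector), vector[DIPEPTIDES.index("RR")])
```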
We also calculated the amino acid composition as well as the dipeptide composition of the N- and C-termini for developing prediction models. The composition of 5, 10, and 15 residues from the N-terminus as well as the C-terminus was taken into account. We also joined the terminal residues (N5C5, N10C10, and N15C15) for developing models.
To observe the residue preference at particular positions in the peptide, web logos were prepared for the first 15 N-terminal and 15 C-terminal residues along with their modifications using the online WebLogo software (Crooks et al.,
To check whether there is any significant difference between modified CPPs and non-CPPs, we performed Welch
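The terminal windows can be illustrated with a short sketch. It assumes that the N-terminal window is the first n residues, the C-terminal window is the last n residues, and the joined window is their concatenation; how peptides shorter than n residues were handled is not stated in the text, so rejecting them here is purely an assumption, as is the helper name `terminal_windows`.

```python
def terminal_windows(sequence, n):
    """Return the N-terminal, C-terminal, and joined windows of size n."""
    if n <= 0:
        raise ValueError("window size must be positive")
    if len(sequence) < n:
        # Assumption: peptides shorter than the window are not used.
        raise ValueError("sequence is shorter than the requested window")
    n_term = sequence[:n]
    c_term = sequence[-n:]
    return {
        f"N{n}": n_term,
        f"C{n}": c_term,
        f"N{n}C{n}": n_term + c_term,
    }


if __name__ == "__main__":
    for name, window in terminal_windows("ABCDEFGHIJKL", 5).items():
        print(name, window)
```

Each window can then be passed to the same composition functions used for the full-length sequence.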
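For reference, the Welch statistic for one feature can be sketched in plain Python. This assumes the test is applied feature by feature to the values observed in the two classes; the function name `welch_t` and the example values are hypothetical. The sketch returns only the t statistic and the Welch–Satterthwaite degrees of freedom; obtaining a p-value additionally requires the Student t distribution, which SciPy provides through `scipy.stats.ttest_ind(..., equal_var=False)`.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance test."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard error contributed by each sample (sample variance / n).
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    if se_a + se_b == 0:
        raise ValueError("both samples have zero variance")
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof


if __name__ == "__main__":
    print(welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
```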
Different parameters were used to check the performance of various models developed in this study. These parameters are divided into two groups.
This category includes Sensitivity (Sen), Specificity (Spc), Accuracy (Acc), and Matthews correlation coefficient (MCC), where sensitivity is the true positive rate, specificity is the true negative rate, accuracy is the proportion of all peptides that are classified correctly, and MCC is a correlation coefficient between the observed and predicted classes. These can be calculated using the following equations.
Where
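The four threshold-dependent measures can be computed from a confusion matrix as in the sketch below. It uses the standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), with Sen, Spc, and Acc expressed as percentages; the function name `classification_metrics` and the example counts are illustrative.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return Sen, Spc, Acc (in percent) and MCC from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    sen = 100.0 * tp / (tp + fn)                   # true positive rate
    spc = 100.0 * tn / (tn + fp)                   # true negative rate
    acc = 100.0 * (tp + tn) / (tp + tn + fp + fn)  # fraction classified correctly
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sen": sen, "Spc": spc, "Acc": acc, "MCC": mcc}


if __name__ == "__main__":
    print(classification_metrics(tp=80, tn=90, fp=10, fn=20))
```

On a balanced dataset, accuracy is the mean of sensitivity and specificity.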
In this study, we also used a threshold-independent measure to evaluate the performance of models. For threshold-independent measures, the Receiver Operating Characteristic (ROC) curve is drawn by plotting the true positive rate against the false positive rate. To measure performance, the area under the ROC curve (AUROC) is computed.
We computed the average percent composition of atoms in CPPs and non-CPPs to understand the preference for certain types of atoms. Overall, the profile is more or less the same in both CPPs and non-CPPs. CPPs are slightly richer in H and N atoms, whereas non-CPPs are slightly richer in C, O, and S (Figure
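The area under the ROC curve can be computed without drawing the curve, because it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counting as one half). The sketch below uses this pairwise formulation; it is a simple quadratic-time illustration with a hypothetical function name `auroc`, not the routine used in the study.

```python
def auroc(labels, scores):
    """Return the area under the ROC curve for binary labels (1 = positive)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.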
Percentage amino acid composition of CPPs and non-CPPs.
In addition to compositional preference, we also computed the positional preference of different types of residues in CPPs. This revealed that some specific types of residues are preferred in the positive dataset (CPPs) as compared to the negative dataset (non-CPPs). Residues like R and K are highly preferred at various positions in CPPs, particularly at the N-terminus (Figure
Weblogo illustrating residue preference of first 15 N terminal residues of modified
Weblogo illustrating residue preference of first 15 C terminal residues of modified
We used various machine-learning approaches like SVM, Random Forest, Naive Bayes, J48, and SMO for developing the prediction models. These models utilize different features or descriptors to discriminate between CPPs and non-CPPs. The results are explained in detail in the following sections.
The tertiary structure of a peptide can represent all types of modifications. The structure of a peptide was therefore used to predict the cell penetration ability of modified peptides. In this study, we obtained peptide structures from the CPPsite 2.0 and SATPdb databases. The models were developed using various features of the peptide structures. First, we developed a model using the atomic composition of peptides. To obtain the atomic composition of a peptide from its structure, we converted the structure from sdf format to SMILES and calculated the atomic composition from the SMILES string. Prediction models were developed using different classifiers like SVM, RF, Naive Bayes, SMO, and J48 with atomic composition as the input feature. The Random Forest based classification model provided the highest accuracy of 84.02%, MCC of 0.68, and AUROC of 0.91 on the training dataset. On the validation dataset, it achieved an accuracy of 78.33%, MCC of 0.57, and AUROC of 0.88. Performance of different classifiers is given in Table
Performance of different machine learning methods on atom composition.
Method | Parameter | Training Sen (%) | Training Spc (%) | Training Acc (%) | Training MCC | Training AUROC | Validation Sen (%) | Validation Spc (%) | Validation Acc (%) | Validation MCC | Validation AUROC |
SVM | | 81.10 | 80.58 | 80.84 | 0.62 | 0.84 | 79.33 | 75.33 | 77.33 | 0.55 | 0.81 |
Random Forest | Ntree = 30 | 83.33 | 84.71 | 84.02 | 0.68 | 0.91 | 79.33 | 77.33 | 78.33 | 0.57 | 0.88 |
SMO | | 77.66 | 83.51 | 80.58 | 0.61 | 0.80 | 75.33 | 82.67 | 79.00 | 0.58 | 0.79 |
J48 | | 75.43 | 80.58 | 78.01 | 0.56 | 0.82 | 80.00 | 76.00 | 78.00 | 0.56 | 0.79 |
Naive Bayes | Default | 74.57 | 65.46 | 70.02 | 0.40 | 0.80 | 80.00 | 69.33 | 74.67 | 0.50 | 0.82 |
Performance of different machine learning methods on diatom composition.
Method | Parameter | Training Sen (%) | Training Spc (%) | Training Acc (%) | Training MCC | Training AUROC | Validation Sen (%) | Validation Spc (%) | Validation Acc (%) | Validation MCC | Validation AUROC |
SVM | | 90.38 | 86.43 | 88.40 | 0.77 | 0.93 | 85.33 | 96.67 | 91.00 | 0.83 | 0.97 |
Random Forest | Ntree = 30 | 88.49 | 88.49 | 88.49 | 0.77 | 0.94 | 85.33 | 82.00 | 83.67 | 0.67 | 0.93 |
SMO | | 86.25 | 89.00 | 87.63 | 0.75 | 0.87 | 86.67 | 84.00 | 85.33 | 0.71 | 0.85 |
J48 | | 82.47 | 81.10 | 81.79 | 0.64 | 0.81 | 85.33 | 81.33 | 83.33 | 0.67 | 0.82 |
Naive Bayes | Default | 71.65 | 70.45 | 71.05 | 0.42 | 0.78 | 72.67 | 66.67 | 69.67 | 0.39 | 0.77 |
We developed models individually for 2D descriptors, 3D descriptors, and fingerprints, as well as a single model combining 2D descriptors, 3D descriptors, and fingerprints. The descriptors were computed using the PaDEL software from the tertiary structure of peptides (sdf format). The models were developed on the features selected by the attribute evaluator “CfsSubsetEval” with the “BestFirst” search method at default parameters in the forward direction (amount of backtracking,
Performance of different machine learning methods on 2D descriptors.
Method | Parameter | Training Sen (%) | Training Spc (%) | Training Acc (%) | Training MCC | Training AUROC | Validation Sen (%) | Validation Spc (%) | Validation Acc (%) | Validation MCC | Validation AUROC |
SVM | | 89.00 | 84.48 | 86.75 | 0.74 | 0.92 | 86.00 | 82.67 | 84.33 | 0.69 | 0.92 |
Random Forest | Ntree = 60 | 92.78 | 91.90 | 92.34 | 0.85 | 0.97 | 94.67 | 88.67 | 91.67 | 0.83 | 0.97 |
SMO | | 83.16 | 86.38 | 84.77 | 0.70 | 0.84 | 81.33 | 87.33 | 84.33 | 0.69 | 0.84 |
J48 | | 89.52 | 88.79 | 89.16 | 0.78 | 0.89 | 90.00 | 87.33 | 88.67 | 0.77 | 0.89 |
Naive Bayes | Default | 75.09 | 78.79 | 76.94 | 0.54 | 0.85 | 74.67 | 77.33 | 76.00 | 0.52 | 0.84 |
In the case of 3D descriptors, a total of 47 features were calculated, which were reduced to 6 after applying feature selection (Table
Performance of different machine learning methods on 3D descriptors.
Method | Parameter | Training Sen (%) | Training Spc (%) | Training Acc (%) | Training MCC | Training AUROC | Validation Sen (%) | Validation Spc (%) | Validation Acc (%) | Validation MCC | Validation AUROC |
SVM | | 76.29 | 74.40 | 75.34 | 0.51 | 0.80 | 71.14 | 73.15 | 72.15 | 0.44 | 0.80 |
Random Forest | Ntree = 700 | 80.93 | 72.16 | 76.55 | 0.53 | 0.85 | 79.87 | 67.11 | 73.49 | 0.47 | 0.83 |
SMO | | 69.42 | 72.85 | 71.13 | 0.42 | 0.71 | 63.09 | 76.51 | 69.80 | 0.40 | 0.69 |
J48 | | 74.74 | 76.12 | 75.43 | 0.51 | 0.78 | 72.48 | 74.50 | 73.49 | 0.47 | 0.78 |
Naive Bayes | Default | 69.24 | 74.40 | 71.82 | 0.44 | 0.78 | 69.80 | 75.84 | 72.82 | 0.46 | 0.79 |
Performance of different machine learning methods on fingerprints.
Method | Parameter | Training Sen (%) | Training Spc (%) | Training Acc (%) | Training MCC | Training AUROC | Validation Sen (%) | Validation Spc (%) | Validation Acc (%) | Validation MCC | Validation AUROC |
SVM | | 90.19 | 88.12 | 89.16 | 0.78 | 0.95 | 93.33 | 89.33 | 91.33 | 0.83 | 0.96 |
Random Forest | Ntree = 600 | 94.32 | 90.19 | 92.25 | 0.85 | 0.98 | 96.67 | 88.00 | 92.33 | 0.85 | 0.98 |
SMO | | 85.54 | 85.03 | 85.28 | 0.71 | 0.85 | 88.67 | 85.33 | 87.00 | 0.74 | 0.87 |
J48 | | 90.02 | 89.33 | 89.67 | 0.79 | 0.89 | 88.67 | 88.67 | 88.67 | 0.77 | 0.90 |
Naive Bayes | Default | 86.40 | 84.34 | 85.37 | 0.71 | 0.90 | 82.67 | 85.33 | 84.00 | 0.68 | 0.90 |
Finally, we calculated all the 2D descriptors, 3D descriptors, and fingerprints together, which generated 15,204 features. Feature selection reduced these to 48 important features, on which different machine learning classifiers were evaluated. Here, the Random Forest model achieved the maximum accuracy of 95.10%, MCC of 0.90, and AUROC of 0.99 on the main dataset, and 92.33% accuracy, 0.85 MCC, and 0.98 AUROC on the validation dataset (Table
Performance of different machine learning methods on 2D, 3D and fingerprints collectively.
Method | Parameter | Training Sen (%) | Training Spc (%) | Training Acc (%) | Training MCC | Training AUROC | Validation Sen (%) | Validation Spc (%) | Validation Acc (%) | Validation MCC | Validation AUROC |
SVM | | 83.33 | 79.21 | 81.27 | 0.63 | 0.89 | 78.67 | 82.67 | 80.67 | 0.61 | 0.87 |
Random Forest | Ntree = 60 | 95.19 | 95.02 | 95.10 | 0.90 | 0.99 | 91.33 | 93.33 | 92.33 | 0.85 | 0.98 |
SMO | | 76.80 | 76.98 | 76.89 | 0.54 | 0.76 | 75.33 | 83.33 | 79.33 | 0.59 | 0.79 |
J48 | | 89.69 | 87.63 | 88.66 | 0.77 | 0.90 | 84.67 | 92.00 | 88.33 | 0.77 | 0.92 |
Naive Bayes | Default | 95.19 | 88.14 | 91.67 | 0.84 | 0.95 | 92.00 | 89.33 | 90.67 | 0.81 | 0.96 |
ROC curve showing performance of models on various structural features.
We observed a significant difference between the positive and negative features based on adjusted
It is nearly impossible to represent a modified peptide by its amino acid sequence alone, so predicting the properties of a modified peptide directly from its sequence is not possible. At the same time, generating the tertiary structure of a peptide is a tedious job for a biologist. We therefore attempted to develop models that predict the cell penetration ability of modified peptides from their amino acid sequence only, ignoring the modifications in the peptide. First, we developed simple composition-based models using various machine learning techniques. The SVM based model showed the best performance among all the classifiers used in the study. An accuracy of 91.67%, MCC of 0.83, and AUROC of 0.96 was achieved on the main dataset. On the validation dataset, we obtained an accuracy of 89.67%, MCC of 0.79, and AUROC of 0.96 (Table
Secondly, we developed models using dipeptide composition. The SVM classifier showed the highest accuracy of 91.84%, MCC of 0.84, and AUROC of 0.96 on the main dataset. On the independent dataset, an accuracy of 92.33%, MCC of 0.85, and AUROC of 0.97 was achieved (Table
To assist the scientific community, the best models are provided freely at
CPPs have shown a promising impact in the field of therapeutics and in targeting specific diseases (Bechara and Sagan,
Computational algorithms have proved widely successful in designing therapeutic peptides (Dhanda et al.,
We have developed various models using machine learning techniques such as SVM, Random Forest, J48, Naive Bayes, and SMO, individually for atom composition, 2D descriptors, 3D descriptors, and fingerprints, as well as a single model combining 2D descriptors, 3D descriptors, and fingerprints. We obtained the best performance with Random Forest, both for the combined features (2D, 3D, and fingerprint descriptors) and for fingerprints alone, with an accuracy of 92.33% and an AUROC of 0.98 on the validation dataset. Since fingerprints alone are computationally cheaper than the combined feature set, we have implemented the fingerprint-based model on the web server.
We believe this work will be of great assistance to researchers who aim to design cell-penetrating peptides, incorporate different modifications, and check their effect on cell penetration ability. In the future, this method can be improved if better structure prediction methods are developed; at present, PEPstrMOD can handle only peptides of 7–25 amino acids, and the other leading method, I-TASSER, deals only with natural residues. In conclusion, this field must grow alongside improvements in the state of the art of structure prediction.
VK and PA generated the dataset. VK, PA, RK, and SB performed the experiments. VK, PA, and RK performed data analysis and prepared the tables and figures. VK, PA, RK, SB, and SU developed the web interface. VK, RK, PA, SU, and GR wrote the manuscript. GR and GV conceived the idea and coordinated the project.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The authors are thankful to the funding agencies J. C. Bose National Fellowship (DST), Council of Scientific and Industrial Research (CSIR), Department of Science and Technology (DST-INSPIRE), Indian Council of Medical Research (ICMR), University Grants Commission (UGC), and Department of Biotechnology (DBT) for fellowships and financial support.
The Supplementary Material for this article can be found online at:
Percentage atomic composition of modified CPPs and non-CPPs.
Percentage amino acid composition of CPPs and non-CPPs
List of 2D features with their positive mean value, negative mean value and
List of 3D features with their positive mean value, negative mean value and
List of fingerprints with their positive mean value, negative mean value and
Performance of different machine learning methods on amino acid composition.
Performance of SVM method on amino acid composition features of terminus residues.
Performance of different machine learning methods on dipeptide composition.
Performance of SVM method on dipeptide composition features of terminus residues.