In-Silico Tool for Predicting, Scanning, and Designing Defensins

Defensins are host defense peptides present in nearly all living species, which play a crucial role in innate immunity. These peptides provide protection to the host, either by killing microbes directly or indirectly by activating the immune system. In the era of antibiotic resistance, there is a need to develop a fast and accurate method for predicting defensins. In this study, a systematic attempt has been made to develop models for predicting defensins from available information on defensins. We created a dataset of defensins and non-defensins called the main dataset that contains 1,036 defensins and 1,035 AMPs (antimicrobial peptides, or non-defensins) to understand the difference between defensins and AMPs. Our analysis indicates that certain residues like Cys, Arg, and Tyr are more abundant in defensins in comparison to AMPs. We developed machine learning technique-based models on the main dataset using a wide range of peptide features. Our SVM (support vector machine)-based model discriminates defensins and AMPs with MCC of 0.88 and AUC of 0.98 on the validation set of the main dataset. In addition, we created an alternate dataset that consists of 1,036 defensins and 1,054 non-defensins obtained from Swiss-Prot. Models were also developed on the alternate dataset to predict defensins. Our SVM-based model achieved maximum MCC of 0.96 with AUC of 0.99 on the validation set of the alternate dataset. All models were trained, tested, and validated using standard protocols. Finally, we developed a web-based service “DefPred” to predict defensins, scan defensins in proteins, and design the best defensins from their analogs. The stand-alone software and web server of DefPred are available at https://webs.iiitd.edu.in/raghava/defpred.


INTRODUCTION
Defensins are a group of antimicrobial peptides (AMPs) that are an essential part of the innate immune system. Because of their broad-spectrum antimicrobial efficacy, they are imperative effector components in the defense of a host against infections (1)(2)(3). Based on configuration, defensins are categorized into two categories: a-defensins (a-helices) and b-defensins (b-sheets). Defensins are minute, cationic peptides that enable phagocytes, the skin, and the mucosa to fight bacteria. They also have a broad range of antimicrobial activity against viruses, mycoplasma, tumor, and fungi. They do have an amphipathic nature and acts on the membrane or envelopes the wall using their nature (4)(5)(6). The critical cellular secretors of these peptides include neutrophils and epithelial cells, but defensins are also generated by monocytes, macrophages, dendritic cells, and lymphocytes (7). According to previous studies, defensins are commonly dispersed among different body compartments in nearly all living organisms; however, they seem to be elevated in specific pathogenic body cells (8). These host defense peptides aid in the fight against bacterial, viral, and fungal infections via cells that produce them (7). Defensin peptides mostly destroy the structure of bacterial cell membranes as part of their action m e c h a n i s m d u r i n g w h i c h t h e y i n fl i c t m e m b r a n e permeabilization, which thereby results in the release of nutrients from the bacterial cell (9). They achieve this by binding to the membrane and forming destructive pores on the cell membrane. Defensins are induced by various stimuli (10). They are majorly synthesized and released from dendritic cells, monocytes, neutrophils, eosinophils, and epithelia cells. In addition to their antimicrobial activity, defensins are also actively involved in a range of immune-modulatory functions such as mitogenesis, cytokine release, and histamine release, as depicted in Figure 1.
• In the era of drug resistance, many emerging strains of pathogens (i.e., bacteria, fungi, parasites) are being found to be resistant to existing drugs, particularly against antibiotics (11,12). This includes multidrug-resistant strains that are resistant to most of the existing drugs (13)(14)(15). In order to manage treatment of drug-resistant strains of pathogens, researchers are exploring alternatives to antibiotics (16,17). One of the potential alternatives to antibiotics is protein-/ peptide-based therapeutics. In the last two decades, there is a significant rise in the number of peptide-based therapeutics approved by the FDA (18)(19)(20)(21). Some of the FDA-approved AMPs include poly(2-oxazoline)s, which are used as synthetic mimics of host defense peptides (22), as well as daptomycin, gramicidin, and colistin (23).
AMPs are one of the major classes of therapeutic peptides that are commonly used to kill microbial pathogens including the drug-resistant strain of pathogens (24,25). In the past, numerous computational resources and methods have been developed for predicting AMPs including chemically modified AMPs (26)(27)(28)(29)(30)(31)(32)(33)(34). In addition to AMPs, a number of methods have been developed to predict peptides for killing a specific class of microorganism which include prediction of antibacterial, antituberculosis, antiviral, antifungal, and antiparasite peptides (35)(36)(37)(38)(39)(40)(41). Though these antimicrobial peptides are an alternative to smallmolecule-based drugs, their toxicity, half-life, and allergenicity are major challenges (42)(43)(44). Thus, there is a need to explore a new class of AMPs called defensins, which are used by hosts to defend themselves from pathogens. These defensins have numerous advantages over AMPs as they are damage-associated molecular patterns (DAMPs) and released in the host itself. Due to this, they are less toxic and are highly tolerated by the body. They occur naturally and are recognized by pattern recognition receptors (PRRs) (45,46). In the past, a number of methods have been developed for predicting defensins and their classes (47)(48)(49)(50). We discussed the available tools in the section Comparison With Existing Tools.
In this paper, we describe a reliable method developed for predicting defensins with high precision. We systematically collected defensins, AMPs, and non-defensins from various sources to create the largest possible datasets. In this study, we tried to understand the differences and similarities between defensins and AMPs. We observed significant differences in defensins and AMPs. Thus, we developed models for discriminating antimicrobial peptides and defensins. In addition, we developed models for discriminating defensins and non-defensins. In order to help the scientific community, we developed a stand-alone software as well as a web server.

Creation of Datasets
Defensins were obtained from various sources that include previous studies (48)(49)(50), DRAMP2.0 (51), and CAMPR3 (30). We only collected experimentally validated defensin sequences which have antimicrobial activity. It was observed that defensins have a wide range of lengths (5-120 residues), but most of them (77.59% of the total sequences) have 10-60 residues. Thus, in this study, we removed all defensins which have number of residues less than 10 or more than 60 residues. We also removed sequences containing non-natural or non-standard amino acids (B, J, O, U, X, and Z). Finally, 1,036 unique defensins were obtained. These defensin sequences have been used to create two datasets, as described below.

Defensins/AMPs or the Main Dataset
Our main dataset contains defensins as positive sequences and AMPs as negative sequences. As described above, we collected 1,036 defensins from different sources. We obtained 2,297 experimentally validated AMPs from the CAMPR3 database. Basically, we have taken all peptides excluding peptides of the defensin family. Similar to defensins, the sequence lengths were restricted between 10 and 60 residues. We also discarded sequences containing amino acids other than natural amino acids. In summary, our main dataset contains 1,036 experimentally validated defensins and 1,035 AMPs (or non-defensins).

Defensins/Proteins or the Alternate Dataset
Our alternate dataset has defensins and non-defensins. In order to obtain non-defensins, we searched Swiss-Prot (52) with following queries: "Non-AMPs" and "Non-Defensin" and "Not antibacterial" and "Not antifungal" and "Not antiviral" and "Not antiparasitic" and "Not antimicrobial" proteins. Initially, we obtained~42,357 protein sequences, out of which we randomly selected 1,055 unique sequences having a number of residues between 10 and 60. In simple words, our alternate dataset contains 1,036 defensins and 1,054 non-defensin sequences as shown in Figure 2.

Sequence-Based Features
The stand-alone version of Pfeature (53) was used to calculate a variety of features from protein sequences in this analysis. Thousands of features/descriptors of protein or peptide sequences can be calculated using Pfeature. We applied the composition-based function module of Pfeature and created a vector of 8,968 features. Apart from these, we have also tried different composition features individually from Pfeature on both datasets. The corresponding results are shown in Supplementary Tables 1, 2.

Feature Selection and Ranking
Identifying an essential collection of features from the vast dimension of features is one of the main challenges of the study. We used the SVC-L1-based feature selection strategy, which incorporates the support vector classifier (SVC) with linear kernel, penalized with L1 regularization. SVC-L1 was chosen because it uses many methods to pick the right features from a vast number of feature vectors and is incredibly quick in comparison to other methods (54). Its main goal is to reduce the objective function, which takes into account the loss function and regularization. To minimize dimensions, the SVC-L1 algorithm chooses non-zero coefficients and, afterwards, applies the L1 penalty to choose appropriate features. During the optimization process, the L1 regularization generates sparse models by removing a few of the features from the model by setting the coefficients to zero. The sparsity is regulated by the "C" parameter, which is dependent on the number of features selected; the smaller the "C" value, the fewer features are selected. For parameter "C," we used the default value of 0.01 (55). Subsequently, the significance of these features in classifying proteins was then evaluated using the software "feature selector." The program "feature selector" ranks the features depending on the amount of time a feature is used to split data across all trees, using a DT-based algorithm called the Light Gradient Boosting Machine (56).

Machine Learning
In this study, several machine learning algorithms have been used to develop models for classification using Python's library scikit-learn (57). It includes extra tree (ET), random forest (RF), logistic regression (LR), support vector machine (SVM), knearest neighbors (KNNs), and multilayer perceptron (MLP). Different hyperparameters corresponding to these classifiers were tuned using "GridSearch" and only the best results were incorporated.

Cross-Validation Techniques
In order to provide internal and external validation, we divide our datasets into training and validation sets in 80% and 20% ration, respectively. In case of internal validation, we used a five-fold cross-validation technique, where sequences in the training sets are first arbitrarily divided into five equivalent folds (58,59). Thereafter, four of these folds are used for training and the remaining fold is used for testing. The procedure is replicated five times until each of the five folds has been used for testing at least once. Finally, the performance of the model is calculated by averaging the performance on the five folds. This is called internal validation where parameters are optimized on 80% training dataset to achieve the best performance. In order to validate the performance of our models, we evaluate the performance on 20% validation dataset, called external validation.

Evaluation Parameters
We used well-established evaluation criteria to assess the efficacy of various machine learning classification models. We used both threshold-dependent and independent parameters in this analysis like sensitivity (Sens), specificity (Spec), and accuracy (Acc). To assess the results of the models, a receiver operating characteristic (ROC) curve was plotted between sensitivity and 1 − specificity. Thereafter, we used the typical thresholdindependent parameter AUROC (area under the ROC curve) values for assessment. The following equations were used to quantify these parameters: where TP = true positive, FP = false positive, TN = true negative, and FN = false negative.

Architecture of the Web Server
To predict defensins and AMPs and defensins and nondefensins, a web server called "DefPred" (https://webs.iiitd.edu. in/raghava/defpred) was developed. HTML5, Java, CSS3, and PHP scripts were used to build the front end of the web server. It was built on responsive templates, which modify the size of the screen depending on the device. It works for virtually all electronic devices, including smartphones, tablets, and desktop computers.

RESULTS
We conducted some preliminary analyses on the main and alternate dataset sequences to understand the preference of certain types of residues. Thereafter, the models were developed on the "main" and "alternate" datasets. A comprehensive detail about these analyses as well as the performance of the models is shown in the following sections.

Compositional Analysis
The amino acid composition (AAC) for defensins, AMPs, and non-defensin peptides was calculated. Figure 3 depicts the typical amino acid composition of defensin, antimicrobial, and non-defensin peptides. As shown in Figure 3, defensins have a higher amino acid composition for certain types of residues (i.e., C, D, E, N, R, T, Y) in comparison to AMPs. In comparison to non-defensins, defensins have a higher amino acid composition for the following types of residues: C, G, R, and Y. Similarly, AMPs have a higher composition for certain types of residues (e.g., C, I, K, L) in comparison to non-defensins. These observations indicate that defensin and AMPs are different in terms of preference of residues, despite that both of them have antimicrobial activity. These observations indicate that antimicrobial peptide prediction is not suitable for predicting defensins as both have preference to different types of residues. Besides, we also conducted the "Mann-Whitney test" to determine the statistical significance among these three groups.
We found that among 60 pairs, 54 were statistically significant (Supplementary Table 3). AMPs and non-defensins have nonsignificant amino acid residue pairs like A and W. AMPs and defensins have M residue as a non-significant pair. At the same time, non-defensins with defensins have F, H, and T as nonsignificant pairs.

Preferential Position Analysis
In this analysis, the preference of a particular amino acid at a specific position in the protein string was studied. A two-sample logo (TSL) for the main and alternate datasets is represented in Figure 4. The most significant amino acid residue represents the relative abundance in the sequence. It is important to note that the first 10 positions represent the N-terminal residues of peptides, and the last 10 positions represent the C-terminus of peptides. We observed that the amino acid "C" was enriched at positions 1, 2, 3, 5, 6, 7, 8, and 9 of the C-terminus and at positions 3, 4, 5, 6, 8, and 9 of the N-terminus. Also, the amino acid "N" was enriched at position 10 of the C-terminus and "S" was enriched at position 7 of the N-terminus. However, the nondefensins show an abundance of "K," "L," and "A" at various positions in both C-and N-termini.

Machine Learning-Based Models on Selected Features
As outlined earlier, a total of 8,948 features fetched from Pfeature's composition-based module were reduced to 93 (main dataset) and 68 (alternate dataset) features after applying the SVC-L1-based selection procedure. A range of machine learning classifiers including SVM, LR, KNN, RF, MLP, and ET were implemented on both these datasets. The

Machine Learning-Based Models on Top-Ranked Selected Features
In addition to the development of prediction models over complete selected features, we assessed the significance of various feature sets. The goal was to determine the feature set with minimal features that can reliably distinguish defensins with

Comparison With Existing Methods
We have also compared our models developed in this study with the methods developed in the past. As shown in  (61), or from Swiss-Prot (62). One of the limitations of previous studies is the size of the dataset. In this study, we have taken the largest possible dataset to develop reliable models where data were obtained from different sources. In addition, we created two datasets called the main and alternate datasets to discriminate defensin from antimicrobial peptide and non-defensins. Our web service not only allows to predict defensin but also to scan defensin peptides in proteins as well as to design highly efficient defensins. In contrast, most of the web services developed in the past are inactive. This justifies the development of this new method which will complement existing methods.

The Web Server "DefPred"
We built a user-friendly prediction web server that incorporates various modules to predict defensin proteins in order to support the scientific community. The prediction models of the study are applied in the web server. Based on the score of the prediction models at a different threshold, users will predict whether a query peptide is defensin or non-defensin. Predict, Protein-scan, Design, Downloads, and Algorithm are the five major modules in the web server. The user can distinguish defensins from nondefensin peptides using the "Predict" module. The positive and negative datasets used in this analysis are both available for download in FASTA format. HTML, Java, and PHP scripts were used to build the web server "DefPred." A detailed description of these modules is provided below. The Predict module predicts whether the submitted protein sequence is defensin or not. Users can submit multiple peptides in FASTA format in the box or can upload the file containing the same. This module allows the user to predict using model-1 developed on the main dataset to predict defensins from AMPs. Model-2 was developed to predict defensins and non-defensins. The Design module allows the user to generate all possible analogs for a sequence and then rank these peptide sequences based on their scores. This allows the user to identify the best analog of defensin. The Scan module is developed to identify regions in a protein that have defensin-like properties. In order to serve the community, we have developed the stand-alone software in Python. We have also provided a stand-alone facility in the form of Docker technology. This stand-alone software is integrated into our package "GPSRdocker," which can be downloaded from the site https:// webs.iiitd.edu.in/gpsrdocker/ (63).

DISCUSSION
Antibiotic resistance is emerging among microbes throughout the world, and current treatments are ineffective to treat drug- resistant microorganisms. The fear of a post-antibiotic age, with rising pathogen drug resistance, necessitates the development of alternatives to traditional antibiotics or small molecule-based treatments. AMPs are a class of potential agents with curative prospects due to their diverse therapeutic properties. The innate immune systems of several organisms rely heavily on these evolutionarily conserved molecules. Defensins are a special class of AMPs that have a wide range of functions and use several modes of action, making them less likely to be drug resistant (8,64). Moreover, the differences in the mechanism of microbicidal action of defensins from other antibiotics make them beneficial in fighting infections when used in tandem with conventional antibiotic treatments (65). Naturally existing defensins are efficient, non-toxic microbicides that might be effective for treating infections caused by antibiotic-resistant pathogens. Recent studies have suggested that they achieve this by damaging bacterial cell membranes but not mammalian cell membranes. With this information, developing next-generation defensins with enhanced biological activity profiles is a plausible objective that will allow defensins to be employed to augment human health in the near future. New antimicrobials with defensin-based bactericidal and immunomodulatory characteristics may be effective in conjunction with conventional antibiotic therapy against drug-resistant bacteria while also increasing survival from common infections (65). Furthermore, previous research has demonstrated that defensin and antibiotic combinations may be utilized synergistically to battle infections, including biofilms, permitting for lower dosages of both drugs while still improving treatment efficacy (66)(67)(68)(69). The advancement in in-silico research particularly in the field of bioinformatics has led to the identification and delineation of properties of defensins that enable them to exert their diverse range of biological activities. However, since defensins and AMPs have highly similar nature, it is difficult to distinguish defensins and thereby challenging to develop solely defensinbased therapeutics.
Our study addresses this issue by proposing state-of-the-art machine learning models which can be employed to discriminate and predict defensins from other AMPs and defensins from other proteins (non-defensins). Additionally, since the dataset is crucial in machine learning as well as for a robust in-silico prediction model, we created a very detailed and up-to-date dataset using updated repositories. To better understand the structure and positional preference of defensins, TSL and compositional analytical experiments were conducted. In previous studies, defensins have been found to be high in cysteine (C) amino acid (7) which is consistent with our findings. The properties of the experimentally validated defensins present in the literature were utilized for developing various prediction models. The program "Pfeature" was used to generate 8,968 features from sequence data. The SVC-L1 of the scikit package was used to pick selected features, which were then ranked using feature selector methods. The compositional analysis demonstrated that some types of residues such as C, R, N, L, and Y are preferred in defensins, whereas others such as M are not. This was also corroborated from one of the topranked selected features AAC_C which denotes the amino acid composition of cysteine in a protein sequence. AAC_C ranked first in the main and second in the alternate dataset. Some other high-ranked features included CeTD_SA1 which is compositionenhanced transition and distribution of group 1 (A, L, F, C, G, I, V, W) for solvent accessibility attribute, and PAAC1_E is the pseudo-amino acid composition of glutamic acid in the main dataset (Supplementary Table 4 and Table 2). In the case of the alternate dataset, a few top-ranked features were CeTD_SS1, which is a composition of group 1 (A, L, F, C, G, I, V, W) residue for the secondary structure attribute, and BTC_T, which is the total bond composition present in the sequence (Supplementary Table 5 and Table 2). Amino acid composition of cysteine is common in both main and alternate datasets, indicating that defensins outstand with more "C" content ( Figure 3). It is worth noting that new feature selection strategies picked 93 features for the main and 68 features for the alternate dataset, which include the abovementioned features. In our work, we used these 93 and 68 features to build the two classification models. Furthermore, a five-fold cross-validation technique was used to validate the performance of different models based on the top-ranked features. We wanted a minimal set of features with the least amount of performance loss to prevent over-optimization of the models. For the final classification models, for the main and alternate datasets, we chose the top 60 and top 50 features, respectively. Model-1, which utilized the main dataset, is a SVM classifier that achieved maximum performance of 0.98 AUROC and 0.88 MCC in the training dataset and 0.98 AUROC and 0.88 MCC in the validation dataset for classifying defensins from AMPs, whereas model-2, which used the alternate dataset, classified defensins from non-defensins. Model-2 is also a SVM classifier which performed best on the training and validation datasets with AUROC of 0.99 and MCC of 0.93 and AUROC of 0.99 and MCC of 0.96, respectively.
Despite numerous improvements, there are a few limitations of this study. The current study aimed to develop a prediction method for identifying defensins/AMPs and defensins/nondefensins. To achieve this, we used the sequence data from all available species such as mammals, plants, and insects due to the small number of experimentally validated defensins, although the ideal process to develop a host-specific method for predicting defensins should contain data from the concerned host only. Additionally, our models do not account for structural properties such as secondary structure details, surface accessibility rating, and disulfide bond information. Furthermore, for prediction, our models ignore information regarding post-translational modifications (e.g., terminus modification, incorporation of chemical moieties, glycosylation, and phosphorylation). Although a systematic effort has been made in this analysis to create the best possible models under the current conditions, it is Notavailable expected that the future research will be able resolve these issues in order to improve prediction.
Finally, in order to serve the scientific community, we have developed a web server named "DefPred" as well as the standalone version which incorporated our best models. The standalone version is Python-based and offers numerous options to the user. On the other hand, the associated server is user-friendly and compatible with multiple screens such as laptops, android mobile phones, iPhone, and iPad. We have also provided a stand-alone facility in the form of Docker technology. This stand-alone software is integrated into our package "GPSRdocker," which can be downloaded from the site https:// webs.iiitd.edu.in/gpsrdocker/ (63). We anticipate that this work will benefit researchers working in the area of vaccine designing and also enable a deeper understanding of immune defense response.

CONCLUSION
In this work, we have presented a prediction server "DefPred" for the identification and classification of defensins. It possesses two models "model-1" and "model-2" for the classification of defensins from other AMPs (the main dataset) and defensins from any random proteins (the alternate dataset), respectively. Both models have been created from different datasets that are available on the web server. The web server employs SVM supervisory models in both datasets. Around 9,000 features have been taken into account, and after feature selection and ranking, 98 features for the main dataset and 68 features for the alternate dataset have been selected. Furthermore, among them, the best models for the main and alternate datasets were obtained at the top 60 and top 50, respectively. The present work is an attempt to provide a platform for addressing this important aspect of defensin prediction. To facilitate the scientific community in developing better methods for the prediction of defensins, we have provided our datasets used in the present study. Also, we have provided the stand-alone version for "DefPred."