Anti-flavi: A Web Platform to Predict Inhibitors of Flaviviruses Using QSAR and Peptidomimetic Approaches

Flaviviruses are arboviruses comprising more than 70 viruses that cover broad geographic ranges and are responsible for significant mortality and morbidity globally. Because efficient inhibitors targeting flaviviruses are lacking, designing novel and efficient anti-flavi agents is an important problem. Therefore, in the current study, we have developed a dedicated prediction algorithm, anti-flavi, to identify the inhibition ability of chemicals and peptides against flaviviruses through a quantitative structure–activity relationship based method. We extracted 2168 non-redundant chemicals and 117 non-redundant peptides with reported IC50 values from the ChEMBL and AVPpred databases, respectively. The regression-based models developed on training/testing datasets of 1952 chemicals and 105 peptides displayed Pearson's correlation coefficients (PCC) of 0.87 (chemicals) and 0.84 (peptides) using the support vector machine, and 0.87 and 0.83 using the random forest technique. We also explored the peptidomimetics approach, in which the most contributing descriptors of peptides were used to identify chemicals having anti-flavi potential. Conversely, the selected descriptors of chemicals performed well in predicting anti-flavi peptides. Moreover, the developed models proved to be robust when checked through approaches such as independent validation and decoy datasets. We hope that our web server will prove a useful tool to predict and design efficient anti-flavi agents. The anti-flavi webserver is freely available at http://bioinfo.imtech.res.in/manojk/antiflavi.


INTRODUCTION
According to the World Health Organization, flaviviruses are responsible for serious outbreaks worldwide and are hence considered a global health burden (Liang et al., 2015; Wilder-Smith and Byass, 2016). For example, epidemics of dengue virus, DENV (100 countries in Africa, the Eastern Mediterranean, the Americas, the Western Pacific, and South-East Asia), Zika virus, ZIKV (42 countries), Yellow fever virus, YFV (Angolan capital city, China), and many more have been reported recently. Flaviviruses are arboviruses, which are known for their shifting epidemiology in response to changing societal factors, e.g., population growth and urbanization (Petersen and Marfin, 2005). Among all the mosquito species, the Aedes species are known to have a prominent role in flavivirus transmission, owing to their ability to thrive in diverse ecological niches (beyond their resident tropical forest niche).
In the literature, limited computational resources are available for predicting the antiviral potential of a compound. Our group has been developing various web servers, viz. AVPpred for predicting effective antiviral peptides (Thakur et al., 2012) and AVP-IC50Pred for identifying the antiviral activity of a peptide in terms of its half maximal inhibitory concentration (Qureshi et al., 2015). Likewise, the AVCpred platform was designed to predict general antiviral compounds (Qureshi et al., 2017), and HIVProtI to predict and design inhibitors specifically against Human Immunodeficiency Virus proteins (Qureshi et al., 2018). Since flaviviruses have emerged as a worldwide threat, affecting more than 50% of the population globally (∼40% infected by DENV alone) (Holbrook, 2017), there is a need to accelerate the development of efficient therapeutics. Hence, in the current study we provide anti-flavi, a web platform for the prediction and design of novel antiviral compounds specifically against flaviviruses.
In the current study, we fetched the data for inhibitors (chemicals and peptides) designed to "target" the whole "organism." The chemicals against the whole organism were extracted using specific keywords such as "Dengue virus," "Hepatitis C virus," "West Nile Virus," "Yellow Fever virus," and "Japanese encephalitis virus." The majority of the inhibition profiles were reported as the half maximal inhibitory concentration, i.e., IC50; therefore, we proceeded with it in our study. Likewise, the anti-flaviviral peptides were extracted from the AVPdb database with the inhibition profile given as the half maximal inhibitory concentration.

Quantitative Structure Activity Relationship Based Model Development
The quantitative structure-activity relationship (QSAR) is a mathematical relationship between the biological activity and the physicochemical properties of a compound (Cherkasov et al., 2014). It uses various descriptors that represent the chemical characteristics of a molecule in numerical form, i.e., 1D, 2D, and 3D. We utilized the PaDEL software to extract the molecular descriptors and fingerprints (Yap, 2011). These descriptors were then used for model development on the anti-flaviviral compounds. Initially, PaDEL yielded 16,384 descriptors across the 2D, 3D, and fingerprint categories. This strategy has been employed for algorithm development in various previous studies (Qureshi et al., 2017, 2018; Rajput et al., 2018).

Format Conversion
We performed format conversion before extracting the PaDEL descriptors, in order to obtain the 3D descriptors along with the 2D descriptors and fingerprints. For the chemicals, the SMILES retrieved from ChEMBL were translated to SDF format using the obabel software (O'Boyle et al., 2011). The anti-flavi peptides, in contrast, were available as amino acid sequences of 7 to 25 residues, which were first converted to pdb format using the pepstrmod software (Singh et al., 2015). The pdb files were then converted to SDF format using obabel. We carried out the pdb to sdf conversion because the pdb format does not yield the complete set of peptide descriptors that the sdf format does.
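The two conversion steps can be sketched as Open Babel command lines. The snippet below only assembles the `obabel` argument lists and does not run them; the file names are hypothetical, and the use of `--gen3d` to generate 3D coordinates from SMILES is an assumption, since the exact options used are not stated in the text.

```python
def build_obabel_command(src: str, dst: str, gen3d: bool = False) -> list[str]:
    """Assemble an Open Babel command that converts src to dst.

    obabel infers the input and output formats from the file extensions.
    """
    cmd = ["obabel", src, "-O", dst]
    if gen3d:
        # Assumption: 3D coordinates are generated at this step so that
        # 3D descriptors can be computed afterwards.
        cmd.append("--gen3d")
    return cmd


# Hypothetical file names, for illustration only.
chemical_cmd = build_obabel_command("chemicals.smi", "chemicals.sdf", gen3d=True)
peptide_cmd = build_obabel_command("peptide_001.pdb", "peptide_001.sdf")

print(" ".join(chemical_cmd))
print(" ".join(peptide_cmd))
```

Each list could be passed to `subprocess.run` on a machine where Open Babel is installed.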

Ten-Fold Cross Validation
Initially, the models were developed on the training/testing data set by dividing it into 10 nearly equal parts. Of the 10 subgroups, one was retained for testing while the remaining nine were used for training. This process was iterated 10 times, so that every subgroup served once as the testing data set. The performance of the developed model was then obtained by averaging the results of all 10 iterations (Rajput et al., 2015; Thakur et al., 2016). Finally, the model developed on the training/testing data set was evaluated on the independent validation data set.
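The fold construction described above can be illustrated with a minimal sketch in plain Python. The function names, the fixed random seed, and the `fit_and_score` callback are hypothetical and stand in for whatever learner and metric are used; this is not the authors' implementation.

```python
import random


def ten_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and split them into n_folds nearly equal parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th index, so fold sizes differ by at most one.
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(n_samples, fit_and_score, n_folds=10, seed=0):
    """Average a score over the folds; each fold is the test set exactly once."""
    folds = ten_fold_indices(n_samples, n_folds, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # The other nine folds together form the training set.
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / len(scores)


folds = ten_fold_indices(1952)  # size of the chemical training/testing set
print([len(fold) for fold in folds])
```

Here `fit_and_score(train_idx, test_idx)` is expected to train on the first index list and return a score, such as the PCC, on the second.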

Support Vector Machine
For developing the regression-based predictive models, we used the support vector machine (SVM) learning algorithm (Hearst, 1998). In regression mode, the SVM defines a loss function (the epsilon-insensitive loss) that ignores errors lying within a specified distance of the actual value (Bouboulis et al., 2015). Support vector regression (SVR) is of two types, i.e., linear and non-linear. The non-linear SVR is more complex, as it employs the kernel approach to address the curse of dimensionality. We employed the SVMlight package to develop all the models.
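To make the epsilon-insensitive loss concrete, a minimal sketch is given below. The function name and the default epsilon of 0.1 are illustrative assumptions; the tube width actually used for the models is not stated here.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon=0.1):
    """Return zero inside the epsilon tube and a linear penalty outside it."""
    return max(0.0, abs(actual - predicted) - epsilon)


# Illustrative values only: one prediction inside the tube, one outside.
inside = epsilon_insensitive_loss(5.00, 5.05)
outside = epsilon_insensitive_loss(5.00, 5.50)
print(inside, outside)
```

Predictions that fall within epsilon of the actual value contribute nothing to the objective, which is why only the points outside the tube (the support vectors) shape the fitted function.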

Random Forest
Random forest (RF) is an ensemble learning method built on decision tree models combined with a bootstrapping algorithm. Decision trees are first built from the training data sets; an unknown sample is then assigned the mode of the classes predicted by the trees in classification, or the mean of their predictions in regression. RF was used through the Waikato Environment for Knowledge Analysis (WEKA) package for prediction model development (Frank et al., 2004).
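The two ingredients named above, bootstrap sampling and aggregation of per-tree outputs, can be sketched as follows. This is a simplified illustration with hypothetical function names and made-up tree outputs; it does not reproduce WEKA's implementation and omits the tree building itself.

```python
import random
from statistics import mean, mode


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement (one bootstrap replicate)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def aggregate_tree_outputs(tree_outputs, task="regression"):
    """Combine per-tree outputs: mean for regression, mode for classification."""
    if task == "regression":
        return mean(tree_outputs)
    if task == "classification":
        return mode(tree_outputs)
    raise ValueError(f"unknown task: {task}")


# Made-up outputs of three trees for a single sample.
predicted_value = aggregate_tree_outputs([5.1, 4.9, 5.3])
predicted_class = aggregate_tree_outputs(
    ["active", "inactive", "active"], task="classification"
)
print(predicted_value, predicted_class)
```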

Feature Selection
Feature selection is an important technique for extracting the best contributing features from the existing ones. We used the WEKA package for feature selection. Initially, the RemoveUseless filter was used for preprocessing. The attributes were then selected through CfsSubsetEval (attribute evaluator) and BestFirst (search method) (Frank et al., 2004). Finally, we obtained the most relevant, representative features for all the models.
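A rough idea of what the two steps do can be given in a short sketch. The first function is a simplification of a RemoveUseless-style filter that only drops columns that never vary; the second computes the correlation-based feature selection (CFS) merit that rewards subsets correlated with the target but not with each other. Function names and example values are hypothetical, and the search over subsets (BestFirst) is not shown.

```python
from math import sqrt


def varying_columns(rows):
    """Return the indices of columns whose value changes across rows."""
    n_cols = len(rows[0])
    return [c for c in range(n_cols) if len({row[c] for row in rows}) > 1]


def cfs_merit(mean_feature_target_corr, mean_feature_feature_corr, k):
    """CFS merit of a k-feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff)."""
    return (k * mean_feature_target_corr) / sqrt(
        k + k * (k - 1) * mean_feature_feature_corr
    )


# Made-up descriptor matrix: column 0 is constant and would be dropped.
rows = [[1, 0, 3], [1, 2, 3], [1, 5, 4]]
print(varying_columns(rows))
print(cfs_merit(0.5, 0.0, 4), cfs_merit(0.5, 1.0, 4))
```

For the same feature-target correlation, the merit falls as the features become more redundant with one another, which is how the evaluator favors compact subsets.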

Performance Measure
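The performance of the developed QSAR models was evaluated using the correlation coefficient (R, PCC).
Pearson's correlation coefficient (R), or bivariate correlation, determines the association between two variables (actual and predicted) and is calculated by the formula:
$$R = \frac{n\sum_{i=1}^{n} E_i^{act} E_i^{pred} - \sum_{i=1}^{n} E_i^{act} \sum_{i=1}^{n} E_i^{pred}}{\sqrt{n\sum_{i=1}^{n} \left(E_i^{act}\right)^2 - \left(\sum_{i=1}^{n} E_i^{act}\right)^2}\,\sqrt{n\sum_{i=1}^{n} \left(E_i^{pred}\right)^2 - \left(\sum_{i=1}^{n} E_i^{pred}\right)^2}}$$
Here, $n$, $E_i^{pred}$, and $E_i^{act}$ are the size of the data set and the predicted and actual efficiencies, respectively. The value of R ranges from +1 to −1, where +1 means that the two variables are positively correlated and −1 indicates a negative correlation between them.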
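As a worked illustration, Pearson's correlation coefficient between actual and predicted values can be computed with the standard sum-based form below. The function name and the toy inputs are illustrative only and are not taken from the study's data.

```python
from math import sqrt


def pearson_r(actual, predicted):
    """Pearson's correlation between two equal-length sequences.

    Undefined (division by zero) if either sequence is constant.
    """
    n = len(actual)
    sum_a = sum(actual)
    sum_p = sum(predicted)
    sum_ap = sum(a * p for a, p in zip(actual, predicted))
    sum_aa = sum(a * a for a in actual)
    sum_pp = sum(p * p for p in predicted)
    numerator = n * sum_ap - sum_a * sum_p
    denominator = sqrt(n * sum_aa - sum_a ** 2) * sqrt(n * sum_pp - sum_p ** 2)
    return numerator / denominator


# Toy values: a perfectly increasing and a perfectly decreasing relationship.
print(pearson_r([1, 2, 3], [2, 4, 6]))
print(pearson_r([1, 2, 3], [3, 2, 1]))
```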

Model Performance
We checked the appropriateness of the developed models by plotting the actual v/s predicted inhibition (Qureshi et al., 2018). The plots were constructed from the actual and predicted values of the training/testing as well as the independent validation data sets. A scatter plot was used to depict the relationship between the two sets of values. The closer the points lie to the trend line, the better the predictive ability of the model.

Decoy Set
We used a decoy set to check the robustness of our developed models. A few tools are available for designing decoys of chemicals, such as DUD (Huang et al., 2006), DecoyFinder (Cereto-Massague et al., 2012), and RADER (Wang et al., 2017). In our study, the decoys were generated with the most recent of these, the RApid DEcoy Retriever (RADER) software (Wang et al., 2017), against the 2168 anti-flavi chemicals, selecting compounds with similar 1D physicochemical properties but different 2D topology.

Clustering
We performed clustering of the chemicals using the ChemMine tool (Backman et al., 2011). We used the multidimensional scaling clustering method in both 2D and 3D with a similarity cutoff of 0.4. The clustering of the peptide sequences was done using the CLuster ANalysis of Sequences (CLANS) software.

Feature Selection
The 16,384 features of the anti-flavi chemicals and peptides were subjected to feature selection. Preprocessing with the RemoveUseless filter left 8700 and 3822 features for chemicals and peptides, respectively. These were then processed using the CfsSubsetEval evaluator with the BestFirst search method and reduced to 124 features for chemicals and 19 for peptides. Detailed information on all the selected descriptors of chemicals and peptides is provided in Supplementary Tables S1, S2, respectively. The models were developed using these reduced and relevant features.

Performance of QSAR-Based Models
The 2168 anti-flavi chemicals were divided by randomization into training/testing and independent validation data sets of 1952 and 216 chemicals, respectively. The best performing model displayed a correlation of 0.87 with both the SVM and RF machine learning techniques during 10-fold cross-validation (Table 1). On the independent validation data set, the models developed during cross-validation showed correlations of 0.87 and 0.86 with the SVM and RF techniques, respectively (detailed in Supplementary Table S3).
The 117 anti-flavi peptides were grouped into 112 sequences as training/testing and 15 as independent validation data sets. Out of the three randomized models, the best one achieved correlations of 0.84 and 0.83 using the SVM and RF techniques, respectively, during 10-fold cross-validation on the training/testing data sets (Table 1). The independent validation data set displayed correlations of 0.84 and 0.86 with the RF and SVM techniques, respectively (detailed in Supplementary Table S4).
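A random split of this kind can be sketched in a few lines. The integer identifiers below are placeholders for the 2168 chemicals, and the function name and seed are hypothetical; the actual randomization procedure used in the study is not described in detail.

```python
import random


def random_split(items, n_validation, seed=0):
    """Shuffle items and hold out n_validation of them for independent validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_validation:], shuffled[:n_validation]


compound_ids = list(range(2168))  # placeholder identifiers
train_test, validation = random_split(compound_ids, n_validation=216)
print(len(train_test), len(validation))
```

Repeating the split with different seeds gives several randomized data sets, from which the best performing model can be selected.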

Model Performance
We checked the robustness of the models by plotting the actual v/s predicted values and the residuals v/s predicted values on the independent validation data sets of both chemicals and peptides. The plot of actual against predicted inhibition indicated statistically significant agreement between the actual and predicted pIC50 values on the independent data sets. Most points lie close to the trend line, which shows that the models developed using the training/testing data sets are robust. The scatter plot of actual v/s predicted inhibition efficiency for the independent validation data set using the SVM technique is provided in Figure 1, whereas that using the RF technique is available in Supplementary Figure S1. Further, the residual plots also support the robustness of the developed models, as most points lie close to the origin line, as shown for the SVM (Figure 2) and RF (Supplementary Figure S2) models.
In addition, the robustness of the model was checked using the decoy data set. We selected the topmost hit for each of the 2168 chemicals, which resulted in 1417 decoys. The predicted pIC50 of the decoys ranges from 3.03 to 7.83, as shown in Supplementary Table S5.
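For readers less familiar with the pIC50 scale, the conversion to and from IC50 is sketched below. It uses the usual definition pIC50 = −log10(IC50), with IC50 expressed in molar units; the units are an assumption here, as they are not stated in the text, and the function names are illustrative.

```python
from math import log10


def ic50_to_pic50(ic50_molar):
    """Convert an IC50 given in mol/L to the pIC50 scale."""
    return -log10(ic50_molar)


def pic50_to_ic50(pic50):
    """Convert a pIC50 value back to an IC50 in mol/L."""
    return 10 ** (-pic50)


# The reported decoy range of predicted pIC50 values, converted back to IC50.
for pic50 in (3.03, 7.83):
    print(pic50, pic50_to_ic50(pic50))
```

On this scale a higher pIC50 corresponds to a lower IC50, i.e., a more potent inhibitor.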

Peptidomimetics Approach
We examined the peptidomimetics approach for the anti-flaviviral inhibitors by swapping the most contributing features of peptides (19) and chemicals (124) between the two data sets, and by using the hybrid features (143), with 10-fold cross-validation through the SVM technique. On applying the 124 features of chemicals to the 117 anti-flavi peptides and the 19 features of peptides to the 2168 chemicals, we achieved PCCs of 0.53 and 0.74, respectively. Interestingly, on combining the top contributing features of anti-flavi chemicals and peptides, i.e., 143, we obtained PCCs of 0.83 and 0.87 on chemicals and peptides, respectively. Detailed results are tabulated in Table 2.

Clustering
We performed clustering of the anti-flavi chemicals and peptides. The clustering showed that the anti-flavi compounds are highly diverse, falling into 58 different clusters. We also performed clustering of the peptide sequences to check the diversity in our anti-flavi data sets (as shown in Figure 4). The p-value range for the clustering was set between 1e−90 and 0.1, and most of the peptides were singletons. At such a stringent p-value we obtained only 10 clusters; the remaining sequences were left unclustered.

Webserver
Anti-flavi integrates SVM and RF predictive models to identify the inhibition efficiency of any chemical or peptide using QSAR-based approaches. For the prediction of anti-flavi chemicals, the user can provide input as one or more structures in sdf format, and the output is presented in tabulated form with the SMILES, 2D structure, important chemical descriptors, and inhibition efficiency. For predicting the flaviviral inhibition potential of peptides, the input is provided in pdb format, and the output gives the percentage inhibition of the peptide along with other details such as the SMILES, 2D image, and descriptors. As the calculation for unknown chemicals and peptides usually takes 2-5 min, the user can note the job id and retrieve the results at any time using the "check job status" page.
The anti-flavi webserver also displays the clustering analyses of both chemicals and peptides under the "analysis" section. Moreover, we provide a format conversion facility, where the user can draw or paste a structure and obtain the output in SMILES, sdf, and mol formats. The overall architecture of anti-flavi is provided in Figure 5.

DISCUSSION
Flaviviruses have emerged as an expanding threat to human health globally (Daep et al., 2014). Various efforts have been made to develop effective anti-flavi drugs aimed at specific replication, structural, non-structural, and host proteins, as well as non-specific targets (Sampath and Padmanabhan, 2009; Bollati et al., 2010; García et al., 2017). To tackle the severity of RNA viruses, the European Union launched the VIZIER (Coutard and Canard, 2010) and SILVER (https://www.silverpcp.eu/project-overview/) projects for drug discovery against viruses. However, computational efforts alongside the experimental ones would be useful to speed up the antiviral drug discovery process. In this regard, the present study focuses on developing the first dedicated computational platform against flaviviruses.
We used anti-flavi chemicals and peptides for developing the predictive models. The peptidomimetics approach was also explored in addition to the individual chemical and peptide models. Interestingly, the performance of the developed models on chemicals is higher than on peptides and in the peptidomimetics setting. Additionally, the robustness of the developed models was cross-checked by plotting actual v/s predicted inhibition values of the training/testing and independent validation data sets. Finally, the predictive models proved to be statistically significant, which indicates their ability to predict unknown agents as anti-flaviviral with high efficiency (Fatemi et al., 2015).
The concept of peptidomimetics has proved successful among drug inhibitors; a few examples of peptidomimetics have been reported against WNV (Lim et al., 2011; Hammamy et al., 2013), HIV (Kazmierski et al., 2006), etc. Our study demonstrated the same, as we achieved good performance when applying the most contributing features of chemicals to the peptides and vice versa. Intriguingly, the performance of the models increased when the most contributing features of chemicals and peptides were used together. Therefore, our study suggests that the concept of peptidomimetics can also be applied to anti-flaviviral agents.
We evaluated the performance of the models through independent validation and decoy data sets. The developed predictive models showed good performance on both, which further supports their robustness. We also tried to compare our algorithm with existing ones but were unable to perform a direct comparison, owing to the lack of any method for anti-flaviviral agents.
The diversity of the chemicals and peptides was also explored using different clustering methods for the two types of agents. The clustering analyses displayed a high level of diversity among the anti-flavi agents under statistically significant conditions. The majority of chemicals and peptides tend to remain unclustered rather than grouping into clusters by similarity.
Effective inhibitors against flaviviruses are the need of the hour. Combining computational approaches with experimental ones would speed up the discovery of anti-flavi agents. We used 10-fold cross-validation to develop a robust prediction algorithm, which was further validated with independent validation as well as decoy data sets. To our knowledge, this is the first time the peptidomimetics approach has been incorporated in a prediction algorithm against flaviviruses. Therefore, this computational method would be beneficial to microbiologists and virologists working to develop novel and effective antiviral agents. The algorithm can be used to filter out highly effective anti-flavi agents, which can be tested directly in the experimental lab rather than through initial high-throughput screening. The limitation of our study is that the predictive models were developed on the major flaviviral species, e.g., HCV, DENV, ZIKV, and WNV, rather than on all of them.

AUTHOR CONTRIBUTIONS
MK conceived the idea and helped in overall supervision. AR and MK performed the data collection, model development, analyses, and wrote the manuscript. AR executed the web server.