ToxiM: A Toxicity Prediction Tool for Small Molecules Developed Using Machine Learning and Chemoinformatics Approaches

Experimental methods for determining molecular toxicity are tedious and time-consuming, so computational approaches can offer alternative methods for toxicity prediction. We have developed a tool for the prediction of molecular toxicity along with the aqueous solubility and permeability of any molecule/metabolite. Using a comprehensive and curated set of toxin molecules as a training set, different chemical and structure-based features such as descriptors and fingerprints were exploited for feature selection, optimization and development of machine learning based classification and regression models. The compositional differences in the distribution of atoms were apparent between toxins and non-toxins, and hence, the molecular features were used for the classification and regression. On 10-fold cross-validation, the descriptor-based, fingerprint-based and hybrid-based classification models showed similar accuracy (93%) and Matthews correlation coefficient (0.84). The performances of all three models were comparable (Matthews correlation coefficient = 0.84–0.87) on the blind dataset. In addition, the regression-based models using descriptors as input features were also compared and evaluated on the blind dataset. The random forest based regression model for the prediction of solubility performed better ($R^2$ = 0.84) than the multi-linear regression (MLR) and partial least squares regression (PLSR) models, whereas the partial least squares based regression model for the prediction of permeability (caco-2) performed better ($R^2$ = 0.68) than the random forest and MLR based regression models. The performance of the final classification and regression models was evaluated using two validation datasets comprising known toxins and commonly used constituents of health products, which attests to their accuracy.
The ToxiM web server would be a highly useful and reliable tool for the prediction of toxicity, solubility, and permeability of small molecules.

INTRODUCTION
The human body is exposed to numerous chemical substances in daily life, including natural compounds, cosmetics, pharmaceuticals, and other chemicals. Several of these chemicals are known to cause adverse drug reactions, non-acute and subacute poisoning leading to allergic reactions, and sometimes disability or death due to their mutagenic, carcinogenic or toxic nature (Shonkoff et al., 2012). In addition, the body is also exposed to toxic gases and aerosols such as carbon monoxide, lead, cigarette smoke, wood smoke and workplace chemicals, and to dietary exposures to pesticides and herbicides that cause chronic disease mortality (Seaton et al., 1995). Even toxins and oncometabolites that are produced naturally during microbial metabolic processes in the gut have been shown to cause diabetes, kidney disease and cancer (Nowicki and Gottlieb, 2015). The damaging potential of a toxin is determined by a multitude of related factors and not just by its inherent toxicity. The observed clinical consequences of any toxin are influenced by chemical, biological and exposure-related factors, and depend upon its absorption, metabolism and elimination from the body. Dose, along with the duration, frequency and route of administration of the toxin, are the crucial exposure-related factors (Klaassen and Amdur, 1996). Overall, toxicity can be understood as the sum of adverse effects exhibited by a substance on any organism.
The chemical ingredients commonly used in human applications usually go through clinical trials to be certified as safe for use within certain limits. The simplest experimental measure of toxicity is the use of bio-assays involving animals injected with the toxin (Borenfreund and Puerner, 1985). The experimental measures are known to be tedious and time-consuming, and have their own limitations (Harry et al., 1998). Therefore, there is a need for alternative methods, which can use the inherent properties of a given molecule for the determination of its toxic nature (Hinderliter et al., 2010). In this scenario, computational methods appear promising in determining the toxicity of a given compound using its structural and molecular properties. The most commonly used features for these properties are molecular descriptors (Dong et al., 2015) and fingerprints (Xue and Bajorath, 2000), which can extract the chemical and structural information inherent in any given molecule for prediction-based approaches. The chemical properties of a molecule also determine its solubility, which influences its absorption; a molecule with poor solubility will show limited absorption and hence reduced toxicity (Hutchinson et al., 1979). Thus, the aqueous solubility and the permeability (caco-2 permeability) are important toxicity-determining factors. Caco-2 cells are a human colon epithelial cancer cell line, which is used as a model to predict the intestinal absorption of molecules using experimental methods (Van Breemen and Li, 2005). Therefore, the chemical and structural properties of a molecule can be exploited for the prediction of its toxicity, the determination of which is an important and challenging task. At present, feature-based methods such as ToxiPred (Mishra et al., 2014) and DeepTox (Mayr et al., 2016) are available for the prediction of toxicity (Cheng et al., 2012).
Another tool is admetSAR, a freely available tool which uses various classification and regression models for the prediction of ADMET properties. Also, a large number of machine learning tools are available for toxicological QSAR problems, but they are molecule-specific because they were trained on highly similar molecules sharing a similar backbone. In summary, the available tools have their own limitations, since most of them are either highly specific for a particular toxicity or are not freely available. Thus, there is a need for an accurate, efficient, comprehensive and easily accessible computational tool to predict the toxicity and toxicity-related properties of a molecule. In this work, by integrating machine-learning and chemoinformatics approaches, we have developed a computational method "ToxiM" for the prediction of toxicity of molecules using fingerprints and descriptors as input features. It is freely available at http://metagenomics.iiserb.ac.in/ToxiM/.

Dataset Preparation
To develop the prediction modules using machine learning approaches, two distinct datasets of molecules were curated: (i) a positive dataset, consisting of known toxins, and (ii) a negative dataset, consisting of non-toxins.

Positive Dataset
The positive dataset contained 3,519 toxins retrieved from the T3DB database (http://www.t3db.ca/) (Lim et al., 2010). All metals and small peptides (670 toxins) were removed from the dataset, and the final dataset contained 2,849 toxins. The dataset included toxic compounds with a recorded medical consequence in relatively low concentrations.

Negative Dataset
The negative dataset was constructed using the human metabolites from the RECON1 model in the BIGG database (http://bigg.ucsd.edu/data_access) (King et al., 2016). The human metabolites were used because they are produced by conserved pathways and have no known reports of human toxicity; they can thus serve as a good negative set, given that the aim of the tool was to predict the molecules toxic for the human system. Using the id information available in the RECON1 model, a total of 1,263 unique molecules were retrieved from different sources (Supplementary Text S1). These molecules were considered as the negative dataset in this study.

Validation Datasets
Two validation datasets were used to optimize the performance. The first validation set consisted of 41 drugs, which were withdrawn from the commercial market due to their adverse effects. The second validation set consisted of 15 compounds commonly used as additives in food, cosmetics, detergents and preservatives. Though these compounds are available in the commercial market, they are reported to be toxic and their use has been debated (Gernhardt et al., 2004).

Compositional Analysis
To differentiate between the positive and negative datasets, a compositional analysis was performed using the atomcount function of the "ChemmineR" library in R. The atom count is a physicochemical descriptor that gives the elemental composition of a molecule, and it was used to compare the elements occurring in toxins and non-toxins (Cao et al., 2008).
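The idea behind the compositional analysis can be illustrated with a minimal Python sketch. This is not the ChemmineR atomcount function used in the study: it assumes that each molecule is given as a simple molecular formula string (no parentheses or charges), and the function names `atom_counts` and `element_frequency` are hypothetical.

```python
import re
from collections import Counter

def atom_counts(formula: str) -> Counter:
    """Count atoms per element in a simple formula such as 'C2H6O2'."""
    counts = Counter()
    # An element symbol is an uppercase letter plus an optional lowercase
    # letter, followed by an optional integer (a missing integer means 1).
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts

def element_frequency(formulas):
    """Fraction of all atoms contributed by each element in a set of molecules."""
    total = Counter()
    for formula in formulas:
        total.update(atom_counts(formula))
    n_atoms = sum(total.values())
    return {element: count / n_atoms for element, count in total.items()}

# Illustrative inputs only: ethylene glycol and sodium hypochlorite.
print(atom_counts("C2H6O2"))
print(element_frequency(["C2H6O2", "NaClO"]))
```

Computing `element_frequency` separately for the positive and negative datasets would give the kind of per-element comparison described above.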

Descriptors and Fingerprints
Descriptors represent the structural properties of a molecule in the form of numerical values, and are thus a form of conceptual projection of the molecular properties. Molecular descriptors can be categorized into one-, two- and three-dimensional descriptors. 1D descriptors contain information on bulk properties such as molecular weight, molar refractivity, permeability and solubility. Physicochemical properties of a molecule such as LogP, LogD, and Topological Polar Surface Area (TPSA) can be estimated based on 2D structures described by the fragmental descriptors. The 1D and 2D descriptors were used in this study. Fingerprints are a complex form of descriptors. They typically encode the information of molecular structures into a bit string, which produces a pattern characteristic of a given molecule. Bits account for different structural fragments, and the presence or absence of a fragment is indicated by 1 or 0, respectively. Fingerprint overlap was determined as a measure of similarity and was calculated using the Tanimoto coefficient (Dong et al., 2015).
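As an illustration of the fingerprint overlap measure, the following minimal Python sketch computes the Tanimoto coefficient of two fingerprints, assuming they are given as equal-length strings of '0' and '1'. The function name `tanimoto` and the example bit strings are hypothetical and do not come from the study.

```python
def tanimoto(bits_a: str, bits_b: str) -> float:
    """Tanimoto coefficient: shared 'on' bits divided by bits 'on' in either."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(bits_a, bits_b) if a == "1" and b == "1")
    either = sum(1 for a, b in zip(bits_a, bits_b) if a == "1" or b == "1")
    # Two all-zero fingerprints share no fragments; return 0.0 by convention.
    return both / either if either else 0.0

# Two toy 4-bit fingerprints: 2 shared bits out of 3 bits set in either.
print(tanimoto("1101", "1001"))
```

A value of 1 means that the two molecules set exactly the same bits, and a value of 0 means that they share none.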

Development of Classification Models for the Prediction of Toxicity
Fingerprints and descriptors were used as the input features for the development of various classification models. In total, 10,208 bits belonging to 10 different fingerprints were calculated for 2,843 molecules in the positive dataset and 1,262 molecules in the negative dataset using "PaDEL" (Supplementary Text S2; Yap, 2011). The fingerprint dataset (10,208 bits) from 4,105 molecules was divided randomly in an 80:20 ratio, from which the 80% part (3,282 molecules) was used for training and the remaining 20% (823 molecules) was used for testing. In total, 196 descriptors belonging to six different descriptor families (Supplementary Text S3) were calculated using "RDKit" for 2,835 molecules in the positive dataset and 1,247 molecules in the negative dataset. Descriptors such as MinPartialCharge, MaxPartialCharge, MinAbsPartialCharge, MaxAbsPartialCharge, and Ipc were excluded because their values could not be calculated using RDKit for all the molecules present in the dataset. The final dataset consisting of 191 descriptors from 4,082 molecules was divided randomly in an 80:20 ratio, from which the 80% part (3,263 molecules) was used for training and the remaining 20% (819 molecules) was used for testing.

Principal Component Analysis (PCA)
The data are high-dimensional, being represented by a large number of input features. Thus, PCA was carried out to compress the data dimensions by identifying the components that are distinctly different in the two datasets (Wold et al., 1987).

Selection of Appropriate Machine Learning Model for Classification
10-fold cross validation was performed on the training data with descriptors and fingerprints as feature inputs for various machine learning algorithms, including SVM, CART, Random Forest and KNN, to compare their performance (Hanley and McNeil, 1982) using the "caret" package in R (Liaw and Wiener, 2002).
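The partitioning step of 10-fold cross validation can be sketched in plain Python as follows. This only illustrates how the folds are formed and is not the caret implementation used in the study; the function name `k_fold_indices` and the fixed random seed are assumptions.

```python
import random

def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Example with the size of the descriptor training set (3,263 molecules).
for fold_number, (train, test) in enumerate(k_fold_indices(3263), start=1):
    print(fold_number, len(train), len(test))
```

Each molecule is used for testing exactly once and for training in the other nine folds, so the averaged performance is not tied to a single split.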

Optimization of Various Parameters for the Development of Classification Models
The mtry (number of randomly selected variables), ntree (number of trees generated by random forest algorithm) and input feature parameters were optimized using the random forest package in R (version 3.3.2) (Ihaka and Gentleman, 1996). The importance of each fingerprint and descriptor was calculated using the mean decrease in accuracy values at the best mtry value obtained using tuneRF function, which calculates optimal value for mtry for random forest using out of bag error estimate. The Out-of-Bag (OOB) error, which represents the prediction error of random forest algorithm, of fingerprint-based classification models was calculated using top 0.5, 1, 5, 10, and 20% fingerprints from the total 3,282 molecules using various mtry values at ntree = 200-1,000. Similarly, the OOB error performance of descriptor-based classification models was evaluated using top 20, 40, 60, 80, and 100 descriptors derived from the total 3,263 molecules using various mtry values at ntree = 200-1,000. The best model with minimum OOB error from fingerprint-based and descriptor-based classification models was selected. On the basis of performance of these two models, the top 10% (1,021 bits) fingerprints and the top 100 descriptors were combined together to develop a hybrid model with 1,121 input features. Performances of various classification models were optimized using 10-fold cross validation.
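The construction of the hybrid feature set can be sketched as below. The importance scores are made-up placeholders (real values would be the mean decrease in accuracy from the random forest), the helper names `top_fraction` and `top_n` are hypothetical, and rounding the 10% cut-off up to a whole number of bits is an assumption chosen because it reproduces the 1,021 bits stated above.

```python
import math

def top_fraction(importance: dict, fraction: float) -> list:
    """Names of the top `fraction` of features, most important first."""
    n_keep = math.ceil(len(importance) * fraction)  # rounding up is an assumption
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:n_keep]

def top_n(importance: dict, n: int) -> list:
    """Names of the n most important features."""
    return sorted(importance, key=importance.get, reverse=True)[:n]

# Placeholder importance scores for 10,208 fingerprint bits and 191 descriptors.
fingerprint_importance = {f"fp_bit_{i}": (i * 37) % 1009 for i in range(10208)}
descriptor_importance = {f"descriptor_{i}": (i * 11) % 193 for i in range(191)}

hybrid_features = (top_fraction(fingerprint_importance, 0.10)
                   + top_n(descriptor_importance, 100))
print(len(hybrid_features))  # 1,021 fingerprint bits + 100 descriptors
```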

Performance Validation of Final Classification Models
Performances of the fingerprint-based, descriptor-based and hybrid models were evaluated on the 20% test dataset, and also on two separate validation datasets, using the following performance measures: sensitivity, specificity, precision, accuracy and MCC, where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative and MCC = Matthews correlation coefficient.
Frontiers in Pharmacology | www.frontiersin.org
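The formulas for the performance measures are not reproduced in this text. The sketch below therefore assumes the standard confusion-matrix definitions of sensitivity, specificity, precision, accuracy and MCC, which are the five measures reported for the models in the results; the function name `classification_metrics` and the example counts are hypothetical.

```python
import math

def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard measures computed from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Guard against empty classes instead of dividing by zero.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),        # true positive rate
        "specificity": ratio(tn, tn + fp),        # true negative rate
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }

# Made-up counts, for illustration only.
print(classification_metrics(tp=90, tn=80, fp=20, fn=10))
```

Unlike accuracy, MCC uses all four counts symmetrically, which makes it informative when the two classes differ in size, as they do here (2,849 toxins against 1,263 non-toxins).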

Development of Regression Models for the Prediction of Solubility and Permeability
To examine the toxicity-determining factors such as aqueous solubility and caco-2 permeability, additional regression models were developed to calculate the numerical values of LogS (solubility) and LogP (used here for caco-2 permeability). These values are expected to be high in the case of a toxin and low in the case of a non-toxin (Artursson and Karlsson, 1991).

Construction of Datasets
To prepare the solubility dataset, a total of 452 molecules were considered from the positive and negative datasets for which experimental values of aqueous solubility were available in the admetSAR dataset (Cheng et al., 2012) and in the work by Palmer et al. (2007). All these molecules are structurally diverse and solid at room temperature (Palmer et al., 2007). To prepare the permeability dataset, 133 molecules from the positive and negative datasets of the toxicity classification model were considered for which experimental values of caco-2 cell permeability were available in the admetSAR dataset (Cheng et al., 2012) and in the work by Wang et al. (2016). Both datasets were divided into two groups in the ratio of 80:20, where the 80% part was used as the training dataset and the 20% part was used as the testing dataset. These training and testing datasets were used for the construction and validation of the regression models. The final solubility dataset contained 362 molecules in the training set and 90 molecules in the testing set, whereas the final permeability dataset consisted of 115 molecules in the training set and 18 molecules in the testing set.

Selection of Appropriate Machine Learning Based Models
Molecular descriptors were used as the input features for the development of various regression models because descriptors capture molecular information including electronic, topological, thermodynamic and structural properties. Multi-linear regression (MLR), Random Forest Regression (RFR) and Partial Least Squares Regression (PLSR) were optimized to calculate the aqueous solubility and caco-2 cell permeability (Schneider et al., 2010). MLR relates two or more independent variables to the experimentally known data by fitting a linear equation. In general, the linear equation for MLR with n independent variables is $Y = a_1 X_1 + a_2 X_2 + a_3 X_3 + \dots + a_n X_n + C$. In this case, every independent variable $X_i$ (i = 1, 2, 3, ..., n) is associated with the dependent variable Y according to the above equation. $R^2$ was used to evaluate the LogS and LogP estimates obtained with the MLR algorithm (Kujawski et al., 2012). For the development of the final models, the best descriptors were selected based on their minimum p-values.
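To make the MLR equation concrete, the following self-contained Python sketch estimates the coefficients a_1, ..., a_n and the constant C by ordinary least squares, solving the normal equations with Gauss-Jordan elimination. It is an illustration under stated assumptions rather than the procedure used in the study: the function name `fit_mlr` and the toy data are hypothetical, and no p-value based descriptor selection is performed.

```python
def fit_mlr(X, y):
    """Least-squares fit of Y = a_1*X_1 + ... + a_n*X_n + C.

    X is a list of rows (one row of descriptor values per molecule) and
    y the list of observed values. Returns (coefficients, intercept).
    """
    rows = [list(map(float, row)) + [1.0] for row in X]  # trailing 1 for C
    p = len(rows[0])
    # Augmented normal equations: (A^T A | A^T y).
    m = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("descriptors are collinear; no unique solution")
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        for r in range(p):
            if r != col:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    weights = [m[i][p] for i in range(p)]
    return weights[:-1], weights[-1]

# Toy data generated from Y = 2*X_1 - 3*X_2 + 0.5 (illustrative only).
X = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
y = [2 * x1 - 3 * x2 + 0.5 for x1, x2 in X]
coefficients, intercept = fit_mlr(X, y)
print(coefficients, intercept)
```

The collinearity check reflects a practical limit of MLR: when descriptors are strongly correlated the normal equations become ill-conditioned, which is the situation PLSR is designed to handle.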
The RFR models for both datasets were trained using the randomForest package in R. Models were constructed using mtry = 14, step = 0.5, improvement factor = $10^{-5}$ and ntree = 100. Variable importance information for each descriptor was obtained from the random forest models. For the development of the final models, the best descriptors were selected based on their mean decrease in accuracy values.
Partial Least Squares Regression (PLSR) is a widely used predictive modeling method for dealing with highly collinear predictor variables (Van Huffel, 1997). In this model, it is assumed that all the independent variables ($X_i$ = descriptors) are linearly related to the dependent variable (Y = LogS/LogP) (Geladi and Kowalski, 1986). Variable selection was performed using the variable.selection function of the plsgenomics library (Kotsiantis et al., 2007).

Compositional Analysis Reveals the Difference between Toxins and Non-toxins
The compositional analysis revealed that non-toxic molecules were rich in Carbon, Oxygen, Nitrogen, and Phosphorus, which are the building blocks of living matter (Supplementary Figure S1). In contrast, the toxins were rich in Chlorine, Bromine, Arsenic, Lead, Cobalt and Fluorine in addition to Carbon and Oxygen (Supplementary Figure S2; Data Sheet 2). The atom frequency of compositional elements showed clear differences between toxins and non-toxins, which suggests that composition and related properties are important determinants of the toxicity of a compound.

Principal Component Analysis
To compute variance between toxins and non-toxins, PCA was performed on training data using fingerprints and descriptors as input features (Supplementary Figure S3 and Supplementary Text S4).

Selection of Appropriate Machine Learning Method for Classification
The classification performance of various machine learning methods was evaluated using the "caret" package in R (Kotsiantis et al., 2007). Using fingerprints as input features (Figure 1A), the ROC (Receiver Operating Characteristic) values displayed by RF, SVM, KNN, and CART were 0.97, 0.96, 0.95, and 0.77, respectively. Similarly, using descriptors as the input features, the ROC values displayed by RF, SVM, KNN, and CART were 0.97, 0.94, 0.94, and 0.85, respectively (Figure 1B). Thus, it was apparent that the RF-based model outperformed the other machine learning based models.
FIGURE 3 | Optimization of random forest at various mtry and ntree values using fingerprints as input features, (A) performance using top 0.5% fingerprints, (B) performance using top 1% fingerprints, (C) performance using top 5% fingerprints, (D) performance using top 10% fingerprints, and (E) performance using top 20% fingerprints.

Optimization of RF Parameters and Development of Classification Models
Mean decrease in accuracy values for each variable in descriptors and fingerprints were calculated at the best mtry (optimized by tuneRF function) and ntree = 500. The mean decrease in accuracy values for the top 30 fingerprints and descriptors are shown in Supplementary Figures S4A,B, and the complete list of fingerprints and descriptors with the mean decrease in accuracy values are provided in Supplementary Tables S1, S2.
The performance of the descriptor-based and fingerprint-based classification models was examined at various mtry values and ntree = 200-1,000 using different combinations of descriptors and fingerprints. From Table 1 and Figure 2, it is apparent that the model developed using the top 100 descriptors at mtry = 5 and ntree = 1,000 performed better than the other descriptor-based models and showed an accuracy of 0.93, an MCC of 0.84, and the lowest OOB error of 6.9%. From Table 2 and Figure 3, it is apparent that the model developed using the top 10% (1,021 bits) of fingerprints at mtry = 88 and ntree = 800 performed better than the other fingerprint-based models, as it displayed an accuracy of 0.93, an MCC of 0.84 and the lowest OOB error of 6.7%. A hybrid set was constructed using the top 100 descriptors and 1,021 fingerprints, and the performance of the RF-based classification model was evaluated using the optimized parameters of the fingerprint-based model (mtry = 88 and ntree = 800). The performance of the hybrid model was very similar to that of the fingerprint-based model, with Sensitivity, Specificity, Precision, Accuracy and MCC of 0.95, 0.89, 0.95, 0.93, and 0.84, respectively. Based on these results, the best performing fingerprint-based, descriptor-based and hybrid models were selected as the final models for the classification of molecules as toxic or non-toxic.

Identification of Statistically Significant Descriptors and Fingerprints
A separate statistical analysis (Wilcoxon rank sum test) was performed on the training datasets of top 100 descriptors and top 1,021 fingerprints to find out the significantly discriminating (P ≤ 0.05) descriptors and fingerprints among toxins and non-toxins. The distribution of top descriptors and fingerprints in toxins and non-toxins are shown in Figures 4, 5, respectively. From the results, it is apparent that the proportion of these descriptors and fingerprints varied considerably from toxins to non-toxins. Detailed characterization of the discriminating features for toxin-like property in any molecule will be of great use for the researchers working in the toxicology field.

Performance Evaluation of RF Classification Models Using Blind Set
The performances of the descriptor-based, fingerprint-based and hybrid models were evaluated using the blind dataset (methods section). From Figure 6, it is apparent that all the models showed very high true positive rates as compared to the false positive rates. The ROC value shown by the hybrid model (Figure 6C) was slightly higher (0.98) than that of the fingerprint-based (Figure 6B) and descriptor-based (Figure 6A) models (0.97). The detailed performances of each model are provided in Table 3.
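Assuming that the reported ROC values are areas under the ROC curve, such a value can be computed directly from predicted probabilities, as in the minimal Python sketch below. It uses the rank interpretation of the area (the probability that a randomly chosen toxin receives a higher score than a randomly chosen non-toxin); the function name `roc_auc` and the example scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from binary labels (1 = toxin) and scores."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    # Count toxin/non-toxin pairs ranked correctly; ties count as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Made-up predictions for three toxins and two non-toxins.
print(roc_auc([1, 1, 1, 0, 0], [0.9, 0.8, 0.4, 0.5, 0.2]))
```

The pairwise loop is quadratic in the number of molecules, which is acceptable for a blind set of a few hundred compounds.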

Development and Validation of Regression Models
In addition to the prediction of toxicity of compounds, MLR-, RF-, and PLS-based regression models were also developed to calculate the solubility and permeability values using descriptors as the input features. To construct the MLR-based regression models, the top 15 descriptors were selected (based on their importance) to calculate LogS and the top 11 descriptors were selected to calculate LogP (Supplementary Tables S3, S4). Similarly, to construct the RF-based regression models, the top 40 most important descriptors were selected to calculate LogS, and the top 10 descriptors were selected to calculate LogP (Supplementary Tables S5, S6). To construct the PLS-based regression models, the top five descriptors were used to calculate LogS and LogP (Supplementary Tables S7, S8). The performance of each model was evaluated using $R^2$ values, a statistical measure of how closely the data points fit the regression line.
The solubility predictive ability of the RF-based regression model was higher ($R^2$ = 0.84) than that of the MLR-based regression model ($R^2$ = 0.61) and the PLS-based regression model ($R^2$ = 0.76) (Figure 7). Therefore, the RF-based model was selected for validation, and it displayed a correlation accuracy of 0.92 on the blind dataset (Table 4). The permeability predictive ability of the PLS-based regression model was marginally higher ($R^2$ = 0.678) than that of the RF-based regression model ($R^2$ = 0.675) and the MLR-based regression model ($R^2$ = 0.66) (Figure 8). Therefore, the PLS-based regression model was selected for validation, and it performed with a correlation accuracy of 0.82 on the blind dataset (Table 5).
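The $R^2$ statistic used to compare the regression models can be sketched as follows, assuming the usual definition as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the example values are hypothetical and are not data from the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    if len(observed) != len(predicted):
        raise ValueError("inputs must have the same length")
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total

# Made-up LogS values: experimental versus predicted.
observed = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.1, -1.9, -3.2, -3.8]
print(r_squared(observed, predicted))
```

A model that always predicts the mean of the observed values scores 0 under this definition, so the reported values of 0.84 and 0.678 indicate how much of the variance each selected model explains beyond that baseline.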

Development of ToxiM Webserver for the Prediction of Toxicity
The steps involved in the construction of the ToxiM classification models for the prediction of toxicity of molecules are represented in Figure 9. Using the final classification and regression models, a web server, ToxiM, was developed to facilitate the online submission of a query by the user and to display the prediction results. The toxicity prediction page enables the user to submit the query molecule either by its PubChem CID or by uploading its SDF file. The query SDF file is processed through the models available at the web server for the prediction of toxicity, permeability and solubility properties. All three best performing models, i.e., fingerprint-based, descriptor-based and hybrid-based, are available at the webserver for selection. The query is analyzed through the selected model, and the results are displayed in the form of prediction probabilities for the classification of a query as toxic or non-toxic. To examine the solubility and permeability of a given query molecule, the best performing descriptor-based models, the RF-based regression model for the calculation of LogS and the ML-based regression model for the calculation of LogP, are available as the default options. A tutorial page is provided to explain the navigation through the website. The webserver can be accessed at http://metagenomics.iiserb.ac.in/ToxiM/.

Performance Validation of ToxiM Using Validation Datasets
The performance of ToxiM was evaluated on two validation datasets. The first dataset consisted of 41 drugs, which were withdrawn due to their potentially harmful effects. The second dataset consisted of 15 commonly used molecules whose toxicity to humans has been debated (Wishart et al., 2006). ToxiM predicted all the withdrawn drugs to be toxic with a probability score >0.5, and could thus successfully validate the toxicity of these drugs (Supplementary Table S9). For the withdrawn drugs, the toxicity prediction probability scores of ToxiM were higher than those of admetSAR (Table 6). The predictions of ToxiM on the second validation set are provided in Supplementary Table S10. In addition to the toxicity prediction of withdrawn drugs and other molecules, LogP and LogS values were also calculated using the regression analysis on both sets. The results for LogP and LogS for the FDA withdrawn molecules are given in Supplementary Tables S11, S12, respectively. Similarly, the values of LogP and LogS calculated for the second dataset are given in Supplementary Tables S13, S14, respectively.
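The way a prediction probability is turned into a toxic or non-toxic call, and how the three models can agree or disagree, can be sketched as follows. The 0.5 cut-off is taken from the text above, but treating it as a strict decision threshold is an assumption; the function names and the example scores are hypothetical and are not ToxiM outputs.

```python
def label_from_probability(probability: float, threshold: float = 0.5) -> str:
    """Call a molecule toxic when its score exceeds the threshold."""
    return "toxic" if probability > threshold else "non-toxic"

def summarize_predictions(scores: dict, threshold: float = 0.5):
    """Per-model labels plus an overall call ('mixed' when models disagree)."""
    labels = {model: label_from_probability(p, threshold)
              for model, p in scores.items()}
    distinct = set(labels.values())
    overall = distinct.pop() if len(distinct) == 1 else "mixed"
    return labels, overall

# Made-up scores for a hypothetical query molecule.
example = {"descriptor": 0.40, "fingerprint": 0.54, "hybrid": 0.68}
print(summarize_predictions(example))
```

Reporting the per-model labels alongside the overall call keeps disagreements between the models visible to the user.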

Analysis and Discussion of ToxiM Prediction Results on Second Validation Set
Food additives such as aspartame, saccharin and MSG were predicted to be toxic by all three modules of ToxiM (Maher and Wurtman, 1987; Ellwein and Cohen, 1990; Freeman, 2006). Aspartame and saccharin are artificial non-carbohydrate calorie-free sweeteners, which are commonly available in the commercial market. These were predicted to be toxic (descriptor model score: 0.54 and 0.74, respectively) and were also found to be soluble and permeable. Similarly, MSG, which is used as a flavor enhancer, was predicted to have a toxicity score of 0.57, and was found to be soluble and permeable. Thus, the ToxiM predictions also point toward their potentially harmful effects.
Pesticides are widely used to control insects, rodents and other pests in agricultural fields. The continued usage of pesticides has highly detrimental environmental impacts on air, water, soil and food, and could be toxic to humans, lower animals and the food chain. Exposure to pesticides has been linked to hormone disruption, cancer, neurological effects such as loss of memory, and impaired neurological and reproductive development. The ToxiM tool predicted the highly used pesticides DCPA and EDTA (Fountain and Reith, 2014) as toxic with prediction values >0.9; both were also found to be soluble, but EDTA was predicted to be non-permeable. Another commonly used class of compounds comprises beauty and cosmetic products. Molecules such as Butylhydroxybutylnitrosamine (Parkinson and Lotzová, 1989) and Sodium tetradecanesulfonate, which are present in cosmetics, were predicted to be toxic (0.89 and 0.88, respectively) by ToxiM. Imidazolidinyl urea is used in cosmetics as an antimicrobial preservative due to its high solubility in water, and its use has been debated (https://ntp.niehs.nih.gov/ntp/htdocs/chem_background/exsumpdf/imidazolidinylurea_508.pdf). Mixed results were obtained in the case of Imidazolidinyl urea: the ToxiM descriptor model predicted it to be non-toxic, whereas both the fingerprint and hybrid models predicted it to be toxic. Imidazolidinyl urea is not known to induce any toxicity in humans, but it can produce a positive reaction on exposure in patients with contact dermatitis.
Benzethonium chloride (National Toxicology Program, 1995) is commonly used in cosmetics, medicaments, deodorants, and mouthwash because of its antiseptic and antimicrobial properties. It was predicted to be highly toxic by all the three models of ToxiM, and was also predicted to be soluble and permeable. Another commonly used synthetic product, Polysorbate-80 (Roberts et al., 2010), which is used as an emulsifier in vitamins, vaccines, medicines, surfactant in soaps and cosmetics, defoamer in the fermentation of wine, and as a binding agent in ice cream, was also predicted to be highly toxic by all the three models. Mixed results were obtained for Sodium hypochlorite, which is commonly known as bleaching powder, and is used on a large scale for bleaching, surface purification, and disinfection of water. Sodium hypochlorite was predicted to be toxic by fingerprint model and non-toxic by descriptor and hybrid models. The reason for the inconsistency in predictions in the case of sodium hypochlorite can be explained using compositional analysis, which revealed that toxins had atoms such as Cl, Br etc., and sodium hypochlorite also contained Cl atom in its structure. Also, it is predicted to be toxic only by fingerprint model that uses the structural properties of a molecule to make the prediction. On the other hand, the descriptor module predicted bleaching powder to be non-toxic, as it takes into consideration physical and chemical properties of the molecule. It is well-established that sodium hypochlorite acts as an irritant for the human skin but it is not toxic at lower concentrations. The hybrid model uses the feature information from both fingerprint and descriptor modules for making the prediction, which was inconsistent in this case due to the above reasons. Sodium hypochlorite was also predicted to be highly water soluble (Budavari, 1996) and permeable, which is in accordance with experimentally known facts. 
Asbestos, which is commonly used in construction works because of its thermal insulation and fire protection, was predicted to be toxic (0.54-0.84) (Kanarek, 2011). Ethylene glycol, an antifreeze agent whose toxic effects have been long debated, was predicted to be toxic using the hybrid and fingerprint-based ToxiM models (0.68 and 0.54, respectively) (Jacobsen and McMartin, 1986), whereas it was classified as a non-toxin by the descriptor model. Mixed results obtained for ethylene glycol can be explained by the fact that it does not have any inherent toxicity before it is metabolized (http://emedicine.medscape.com/article/814701-overview#a5).
Methyl methacrylate, a monomer used in the manufacture of PMMA and MBS and also in hip and knee replacements, was predicted to be toxic, soluble and permeable. Polyacrylamide-butylamine, a polymer used in the manufacture of pesticides, emulsifiers and pharmaceuticals, was predicted to be toxic by the descriptor model but non-toxic by the fingerprint and hybrid models. Polyacrylamide-butylamine was also predicted to be soluble and permeable by the regression models used in ToxiM. A total of 12 out of 15 cases were predicted to be toxic, of which 9 compounds were predicted to be permeable, 11 compounds were predicted to be soluble, and 9 compounds were predicted to be both water soluble and permeable.
We also attempted to derive the relation between the classification and regression prediction scores of the molecules in the second validation dataset. Out of the 15 compounds in the second validation set, three compounds (ethylene glycol, sodium hypochlorite and imidazolidinyl urea) were predicted to be non-toxic by the descriptor module (Supplementary Table S15). Among the 12 compounds that were predicted to be toxic, two terminal cases [Polysorbate 80 (glycol) and asbestos] did not show a linear relation of permeability and solubility with the toxicity prediction made by the descriptor module. Some of these cases are discussed in Supplementary Text S5.
Dimethyl tetrachloroterephthalate (DCPA), a pre-emergent herbicide, was predicted to be permeable but insoluble in water, which is supported by the Merck Index, where DCPA is reported to be <5% soluble in water (Budavari, 1989). Human toxicity excerpts indicate that this compound does not show any detectable health effects, which explains the continued use of DCPA (Gleason et al., 1957; Hamilton and Hardy, 1974).
Thus, for most of the discussed cases and those presented in Supplementary Tables S10 and S15, the toxicity prediction values corroborated well with their known toxic nature, which attests to the usability of the tool for predicting the toxicity of a given compound. However, in a few cases the predictions contrasted with experimental knowledge. For example, benzethonium chloride (National Toxicology Program, 1995) was predicted to be toxic, permeable and non-soluble in water, although experimentally it is known to be highly water soluble. A plausible explanation for the contrasting solubility prediction can be inferred from the performance of the models for predicting solubility and permeability, which showed R2 values of 0.84 and 0.678, respectively. Polysorbate 80 (glycol), which is known to be highly water soluble, was predicted to be neither Caco-2 cell permeable nor water soluble. Polysorbate 80 is still used in cosmetics and as a food additive because it does not show any adverse effects at lower concentrations (Gosselin et al., 1976; Supplementary Text S5). Solubility and permeability are known to be complex properties, the prediction of which has been a challenging task, since the available datasets compile data generated in different laboratories using different techniques, which reduces the quality of the data and its usability as a good training set (Bergström, 2005). Thus, it is suggested that the user independently validate the predictions made for the solubility and permeability of the query molecule.
The 10-fold cross-validation at all steps, from algorithm selection to model optimization, helped in avoiding overfitting, which is usually a significant concern during training.
The reported accuracies in assigning a molecule as toxic or non-toxic, and the coefficient of determination values for the prediction of aqueous solubility and permeability, attest to the performance of the tool.
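For readers who wish to check a regression model against their own measurements, the coefficient of determination can be computed as in the minimal sketch below. It assumes the common definition R2 = 1 - SS_res/SS_tot; the exact formula used by the ToxiM pipeline is not stated here, and the function name `r_squared` is purely illustrative.

```python
def r_squared(observed, predicted):
    """Coefficient of determination, assuming R2 = 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: squared error of the model predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: squared deviation of observations from their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot
```

Under this definition, a model that reproduces every observation gives a value of 1, and a model that always predicts the mean of the observations gives 0.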
The performance of the developed models was measured using the validation sets, and ToxiM also showed higher performance in comparison with admetSAR. Discrepancies among the predictions of the different models were observed in only a few cases, which justifies the inclusion of the different prediction models in ToxiM, and the known literature also supported the variable toxicity predictions observed for such molecules. Thus, it is recommended that the user examine the query molecule using all three models available at the web server for comprehensive results.
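The recommendation to consult all three models can be illustrated with a short sketch that tabulates the individual calls and flags disagreement. Everything in it is an assumption made for illustration: the 0.5 decision threshold, the function name `summarize`, and the descriptor score of 0.40 are hypothetical, while the fingerprint (0.54) and hybrid (0.68) scores are the values reported above for ethylene glycol.

```python
from typing import Dict

THRESHOLD = 0.5  # assumed cut-off for calling a molecule toxic (not from the paper)


def summarize(scores: Dict[str, float], threshold: float = THRESHOLD) -> dict:
    """Convert per-model toxicity scores into calls and flag disagreement."""
    calls = {model: score >= threshold for model, score in scores.items()}
    toxic_votes = sum(calls.values())
    return {
        "calls": calls,
        "toxic_votes": toxic_votes,
        # Unanimous when every model agrees, whether toxic or non-toxic.
        "unanimous": toxic_votes in (0, len(calls)),
    }


# Ethylene glycol: the descriptor score is a hypothetical placeholder below
# the assumed threshold; the other two scores are those quoted in the text.
example = {"descriptor": 0.40, "fingerprint": 0.54, "hybrid": 0.68}
result = summarize(example)
```

A result with `unanimous` set to `False`, as in this example, marks a molecule whose prediction should be interpreted with the supporting literature rather than taken from a single model.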
Several factors need to be considered while determining the toxicity of a given molecule, such as its compositional, structural and molecular properties, concentration, aqueous solubility and permeability, and the target organism/system, which together provide useful information on the toxic potential of the molecule. In this work, we have focused on the structural and molecular properties of molecules to develop a tool that predicts which molecules could be toxic for humans. Although the current tool is limited to toxicity prediction based on structural and molecular properties, the accuracy achieved on the validation datasets attests to the importance of these properties and also justifies the application of machine learning for toxicity prediction. It would be a challenging task to include the effect of concentration, binding with downstream targets, or any other factor that leads to physiological toxicity. The accuracy and applicability of the tool may be improved further in the future by including other properties, but generating such models would require a large amount of experimental data. The predictions made using this server will provide valuable information to the scientific community for examining the environmental and physiological toxicity of a given molecule, especially to those working in the field of xenobiotic metabolism and toxicity.

AUTHOR CONTRIBUTIONS
VS and AS conceived the work and participated in the design of the study. AS, GS, and AR developed the classification and regression based models. AS and GS were involved in the development of the web server. VS, AS, GS, and AR drafted the manuscript. All the authors read and approved the final manuscript.

ACKNOWLEDGMENTS
This work was supported by intramural funding received from IISER Bhopal. We thank the MHRD, Govt of India, funded Center for Research on Environment and Sustainable Technologies (CREST) at IISER Bhopal for its support. However, the views expressed in this manuscript are those of the authors alone, and no approval of the same, explicit or implicit, by MHRD should be assumed. GS thanks CREST for the fellowship.