Azole Compounds as Inhibitors of Candida albicans: QSAR Modelling

Candida albicans is an opportunistic pathogenic yeast found in the human gut flora. It can also live outside the human body, and it causes diseases ranging from minor to deadly. Candida albicans begins as a budding yeast that can form hyphae in response to a variety of environmental or biological triggers. Although both forms have been associated with virulence, the hyphal form is responsible for the development of multidrug-resistant biofilms. Here, we propose linear and SPA-linear quantitative structure-activity relationship (QSAR) models for predicting the activity of Candida albicans inhibitors. The data set consisted of 60 derivatives of benzoxazoles, benzimidazoles, and oxazolo(4,5-b)pyridines. After leverage analysis was applied to detect outlier molecules, 55 compounds remained. The SPA-MLR model is superior to the multiple linear regression (MLR) model, with a Q² of 90% for the activity of the antifungal derivatives. This paper focuses on the role of SPA-MLR in model development. The accuracy of the SPA-MLR model was assessed using leave-one-out (LOO) cross-validation. The mean effect of descriptors and the sensitivity analysis show that RDF090u is the most important parameter affecting the behavior of the inhibitors of Candida albicans.


INTRODUCTION
Despite significant advances in medicinal chemistry, infectious diseases caused by fungi continue to be a major danger to public health. Patients with serious diseases, such as neoplasia, and those receiving long-term total parenteral nutrition should be especially careful. Despite the discovery of several successful antifungal medications over the last three decades, molecules with the properties needed to treat systemic yeast infections remain to be found. As a result, finding new and more effective antimicrobial medicines is critical (Gheidari et al., 2020), and most research efforts are focused on developing new compounds. Miconazole and clotrimazole (Brincker, 1976; Smith, 1976; Rippon, 1982) are imidazole compounds that have demonstrated good clinical efficacy in dermatophytoses and nonsystemic candidiasis. Unfortunately, systemic miconazole use has been linked to reversible thrombocytosis and anemia, whereas clotrimazole use has been linked to severe gastrointestinal problems. 2-(4-Thiazolyl)benzimidazole (I) (thiabendazole) is another imidazole derivative with high clinical effectiveness in the treatment of dermatophytic infections in tropical areas. Since thiabendazole was shown to be useful in the treatment of a number of helminthic diseases, a number of benzimidazole compounds have been tested for anti-infective properties. The most thoroughly investigated of these compounds is 2-(α-hydroxybenzyl)benzimidazole (II) (HBB), which is a specific inhibitor of RNA-containing enteroviruses.
According to mechanism-of-action studies, HBB has no effect on viral adsorption, penetration, or uncoating. Although the specific mechanism of this suppression is as yet unknown, the major site of action of this antiviral drug appears to be inhibition of viral RNA synthesis. On the other hand, the in vivo antiviral efficacy of these drugs has been accompanied by symptoms of toxicity. One approach for reducing harmful effects and achieving the required selective activity is to apply structural changes to the basic molecule and create new derivatives or analogues. Accordingly, a novel series of benzoxazole and oxazolo(4,5-b)pyridine derivatives that are analogues of benzimidazole were investigated for antifungal activity against Candida albicans in this work, and their structures were determined using instrumental analytical methods. Quantitative structure-activity relationships (QSAR) are one of the most important methods for predicting the biological activity of unknown compounds based on their molecular structures (Konovalov et al., 2008). In QSAR/QSPR studies, three considerations are very important: the first is the choice of descriptors, which must carry enough information about molecular structure to interpret the activity or property; the second is the modeling method employed; and the third, and most important, is the validation of the QSAR models (Tetko et al., 2008). The use of internal and external validation has recently become a source of heated discussion (Roy et al., 2007). One group of QSAR workers supports internal validation, whereas another believes that internal validation is insufficient for testing model robustness and that external validation is required. Hawkins et al., the most vocal proponents of internal validation, argue that cross-validation can test model fit and examine whether predictions would hold true with new data not used in the model fitting process.
They claim that when the sample size is small, holding back a portion of it for testing is inefficient, and that it is far preferable to employ "computationally more burdensome" leave-one-out cross-validation instead (Hawkins, 2003; Hawkins et al., 2003). For feature selection in this study, we used the successive projections algorithm (SPA), a forward selection method that starts with one variable and adds a new one at each iteration until N variables have been selected (Hawkins, 2003). SPA is a strategy for selecting subsets of variables with minimal collinearity and thereby improving the conditioning of multiple linear regression (MLR) models. The technique was first presented for wavelength selection in spectroscopic data sets, particularly in cases with strong spectral overlap (Araújo et al., 2001). MLR models obtained using SPA have been reported to be superior to partial least squares (PLS) models in various applications such as UV-VIS (Araújo et al., 2001; DantasFilho et al., 2005; Di Nezio et al., 2007; Grünhut et al., 2008), ICP-OES (Kawakami HarropGalvão et al., 2001), FT-IR (Honorato et al., 2005), and NIR spectroscopy (Breitkreitz et al., 2003; Filho et al., 2004). SPA has also been used in a number of classification studies (Pontes et al., 2005; Gambarraneto et al., 2009). The objective of the technique is to pick variables with the least redundant information content in order to overcome collinearity problems. For a given initial variable k(0) and number of variables N, the SPA steps, following the original formulation (Araújo et al., 2001), are:
Step 0. Before the first iteration (n = 1), let $x_g$ be the gth column of the training data matrix $X_{train}$; $g = 1, \ldots, n_c$.
Step 1. Let S be the set of variables that have not yet been chosen, that is, $S = \{ g \mid 1 \le g \le n_c \text{ and } g \notin \{k(0), \ldots, k(n-1)\} \}$.
Step 2. Calculate the projection of $x_g$ on the subspace orthogonal to $x_{k(n-1)}$: $P x_g = x_g - (x_g^T x_{k(n-1)})\, x_{k(n-1)}\, (x_{k(n-1)}^T x_{k(n-1)})^{-1}$ for all $g \in S$, where P is the projection operator.
Step 3. Let $k(n) = \arg\max_{g \in S} \lVert P x_g \rVert$.
Step 4. Let $x_g = P x_g$ for all $g \in S$.
Step 5. Let n = n + 1; if n < N, go back to Step 1.
End: The resulting variables are {k(n); n = 0, . . . , N-1}.
Figure 1 depicts these steps for the first iteration of SPA. The approach was originally designed to create multivariate calibration models (Araújo et al., 2001), but it was later extended to classification problems (Pontes et al., 2005).
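The SPA procedure can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `dot` and `spa_select` and the toy data are hypothetical, the columns are assumed to be mean-centred beforehand, and the choice of the column with the largest projection norm at each iteration follows the original SPA formulation (Araújo et al., 2001).

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def spa_select(columns, k0, n_select):
    """Return the indices chosen by SPA (illustrative sketch).

    columns  : list of descriptor columns (each a list of floats),
               assumed to be mean-centred beforehand
    k0       : index of the initial variable k(0)
    n_select : number N of variables to select
    """
    cols = [list(c) for c in columns]      # Step 0: working copy of x_g
    selected = [k0]
    for _ in range(1, n_select):
        ref = cols[selected[-1]]           # x_k(n-1)
        ref_sq = dot(ref, ref)
        best, best_norm = None, -1.0
        for g in range(len(cols)):
            if g in selected:              # Step 1: only unselected g
                continue
            # Steps 2 and 4: project x_g orthogonally to x_k(n-1), keep it
            coef = dot(cols[g], ref) / ref_sq
            cols[g] = [x - coef * r for x, r in zip(cols[g], ref)]
            norm = dot(cols[g], cols[g])   # squared norm of P x_g
            if norm > best_norm:           # Step 3: largest projection
                best, best_norm = g, norm
        selected.append(best)              # Step 5: next iteration
    return selected


# Hypothetical toy data: column 1 is nearly collinear with column 0.
toy = [[1.0, 0.0, 0.0], [1.0, 0.1, 0.0], [0.0, 0.0, 1.0]]
order = spa_select(toy, k0=0, n_select=3)
```

In this toy case the nearly collinear column is picked last, which is the behaviour SPA is meant to provide. In practice the procedure is usually repeated for several choices of k(0) and N, and the subset giving the best validation statistics is kept.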

Data Set
The three classes of compounds investigated in this study are 2,5,6-trisubstituted benzoxazoles (III), benzimidazoles (IV), and 2-substituted oxazolo(4,5-b)pyridine (V) derivatives (Yalçin et al., 2000). Figure 2 and Table 1 show the chemical structures and logarithmic experimental activities of these compounds. The IC50 activity parameter is a measure of antifungal potency: it is the molar concentration of each compound necessary to lower the Candida albicans concentration by 50% compared with the concentration measured in an infected culture. Before the molecular descriptors were computed, the 3D structures of the investigated compounds were optimized with the semi-empirical quantum-chemical AM1 method as implemented in the HyperChem software (Hyperchem, 1993).
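As a brief illustration of the logarithmic activity scale, the sketch below converts a molar IC50 to pIC50. It assumes the usual convention pIC50 = -log10(IC50 in mol/L), which the text does not state explicitly; the function name `pic50` and the concentrations are hypothetical and are not taken from Table 1.

```python
import math


def pic50(ic50_molar):
    """Convert an IC50 given in mol/L to pIC50 = -log10(IC50)."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be a positive molar concentration")
    return -math.log10(ic50_molar)


# Hypothetical concentrations in mol/L, not values from Table 1.
examples = [1e-4, 2.5e-5, 1e-6]
values = [pic50(c) for c in examples]
```

On this scale a lower IC50 (a more potent compound) gives a larger pIC50, which is why the regression models are fitted to pIC50 rather than to the raw concentrations.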

Molecular Descriptor
The most essential stage in any QSAR study is the identification and computation of structural descriptors, that is, numerically encoded parameters that define chemical structures. The molecular descriptors in this study were calculated with the Dragon program, web version 3.0 (Todeschini et al., 2003). Several QSAR studies have used the Dragon program to compute chemical descriptors (Garkani-Nejad et al., 2004; González et al., 2004; González et al., 2005; Khalafi-Nezhad et al., 2005; Liu et al., 2006).

Regression Analysis
A Stepwise-MLR technique was used for variable selection. In biological systems, this technique has been used for variable selection and model building (Gupta et al., 2005; Leonard and Roy, 2006). For regression analysis, the data set is subdivided into two groups, a training set and a prediction set, and a model is then built. In the present study, an MLR model was first built using all 60 molecules. The statistical parameters of this model (number of descriptors, squared correlation coefficient (R²), standard error (SE), and F statistic) indicated that several molecules deviate strongly from the model; therefore, in the next stage, we identified the outlier molecules. In our view, this is the first QSAR study to identify outliers using a powerful and systematic method. Leverage analysis was used to detect the outliers (Despagne et al.,). In the first step, PCA was used to select the PCs that captured the highest data variance. Since the first two PCs met this condition, they were selected as the main and most important PCs. The leverage of each sample was then plotted against the sample number. As illustrated in Figure 3, samples 7, 8, 9, 59, and 60 have higher leverage than the rest of the molecules; they were identified as outliers and omitted.
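The leverage calculation can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to produce Figure 3: it assumes that the scores on the retained PCs have already been computed and are mean-centred with mutually orthogonal columns, and it uses the common score-based definition h_i = 1/n + Σ_a t_ia² / (t_a' t_a). The function name `score_leverages` and the toy scores are hypothetical.

```python
def score_leverages(scores):
    """Leverage of each sample computed from its PCA scores.

    scores : list of rows, one per sample, holding the scores on the
             retained PCs (two in the text); assumed mean-centred,
             with mutually orthogonal columns
    """
    n = len(scores)
    n_pc = len(scores[0])
    # Sum of squared scores for each retained PC (t_a' t_a)
    col_ss = [sum(row[a] ** 2 for row in scores) for a in range(n_pc)]
    return [1.0 / n + sum(row[a] ** 2 / col_ss[a] for a in range(n_pc))
            for row in scores]


# Hypothetical scores for ten samples; the last two lie far from the rest.
toy_scores = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0],
              [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0],
              [6.0, 0.0], [-6.0, 0.0]]
h = score_leverages(toy_scores)
# Rank samples from highest to lowest leverage, as in a leverage plot.
ranked = sorted(range(len(h)), key=lambda i: h[i], reverse=True)
```

Samples at the top of `ranked` are the candidates for removal. The cutoff above which a sample is declared an outlier is not given in the text, so any threshold applied to `h` (for example a multiple of the mean leverage) would be an additional assumption.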
A trustworthy MLR model has high R² and F values, a low SE, and the fewest descriptors. The model should also have a high level of predictability. On this basis, the best model among the many candidates was selected; its characteristics are listed in Table 3. As expected, R² grows as the number of descriptors grows. Figure 4 illustrates the effect of increasing the number of descriptors on the R² values.
From this figure one can see that increasing the number of parameters up to twelve strongly improves the correlation. Consequently, we chose twelve descriptors as the best number of parameters. The descriptors IC2, BEHm8, Qxxe, RDF105m, RDF050v, Mor16u, Mor22u, Mor32u, Mor16m, Mor31m, E2V, and Mor30V appear in this model, and their meanings are presented in Table 3. The formulas of these descriptors are not presented here for the sake of brevity; however, the Dragon software can easily compute them (Todeschini and Consonni, 2000).
The correlation matrix (Table 4) shows that the selected descriptors are significantly correlated with one another, which is a problem for this model. Low-correlation descriptors should be used when building a model, so that the molecular descriptors act as independent variables.
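A collinearity check of this kind can be sketched in a few lines. The example below computes pairwise Pearson correlations and lists descriptor pairs whose absolute correlation exceeds a cutoff. The cutoff of 0.9, the function names, and the descriptor names and values (`d1`, `d2`, `d3`) are all assumptions for illustration and are not taken from Table 4.

```python
def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)


def collinear_pairs(descriptors, threshold=0.9):
    """List (name_i, name_j, r) for pairs with |r| above the threshold.

    descriptors : dict mapping a descriptor name to its column of values
    threshold   : assumed cutoff; the text does not specify one
    """
    names = list(descriptors)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(descriptors[names[i]], descriptors[names[j]])
            if abs(r) > threshold:
                pairs.append((names[i], names[j], r))
    return pairs


# Hypothetical descriptor columns: d2 closely tracks d1, d3 does not.
toy_descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [2.1, 3.9, 6.2, 7.8],
    "d3": [1.0, -1.0, -1.0, 1.0],
}
flagged = collinear_pairs(toy_descriptors)
```

Any pair returned by `collinear_pairs` carries largely redundant information, so keeping both descriptors in an MLR model would make the regression coefficients unstable.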

RESULTS AND DISCUSSION
The major purpose of this study was to use SPA to select variables for MLR modeling and thereby develop a QSAR model to estimate the activity parameter (pIC50) of the compounds depicted in Figure 1 as Candida albicans inhibitors. It can be seen from this figure and Table 1 that the inhibitors of Candida albicans comprise three different classes with very diverse substituents. A robust and interpretable QSAR model capable of properly predicting pIC50 is therefore required. As a first step, we created a linear MLR model, the parameters of which are listed in Table 3. This model was created with two objectives in mind. The first was to choose appropriate variables using a Stepwise-MLR technique. Table 3 shows that, out of 257 parameters, twelve descriptors (IC2, BEHm8, Qxxe, RDF105m, RDF050v, Mor16u, Mor22u, Mor32u, Mor16m, Mor31m, E2V, and Mor30V) were chosen. These descriptors are classed as Information, BCUT, Geometrical, RDF, 3D-MoRSE, and WHIM descriptors. Detailed descriptions of these descriptors are given in the literature (Todeschini and Consonni, 2000). The second objective was to assess the linear relationship between these descriptors and the biological activity of the Candida albicans inhibitors. The R²pred value of 0.60 for this model shows that it accounts for 60% of the variance in pIC50. The Stepwise-MLR model is therefore ineffective in predicting the biological activities of these compounds. These results led us to choose a more powerful method for selecting variables: the successive projections algorithm was used for the final selection of descriptors. This study investigates the role of SPA-MLR, which has received little attention from researchers. In this method, the descriptors with minimum correlation are selected first, and the MLR method is then used for the final selection of the best model.
In the present study, this method yielded a final model with thirteen descriptors; its statistical parameters and the names of its descriptors are presented in Table 5.
As seen in the correlation matrix (Table 6), there is no significant correlation between the selected descriptors.
The leave-one-out method was also used to demonstrate the stability of the model produced using the SPA-MLR method. The data set (n = 55) was split into a training set of 41 compounds and a test (external assessment) set of 14 compounds by random selection. The internal validation gave Q² = 0.30 and RMSE = 0.74. According to these Q² and RMSE values, the good results for the SPA-MLR model are not attributable to chance correlation or structural dependency of the training set. Table 1 shows the observed and SPA-MLR predicted pIC50 values for all inhibitors of Candida albicans investigated in this study. The plot of the SPA-MLR predicted versus experimental pIC50 values for the data set is shown in Figure 5. The correlation coefficient of this plot indicates the reliability of the model.
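Leave-one-out cross-validation of an MLR model can be sketched as below. This is a minimal illustration, not the software used in the study: it fits ordinary least squares through the normal equations, and it assumes the common definitions Q² = 1 - PRESS/SS_tot and RMSE = sqrt(PRESS/n), which the text does not spell out. All function names and the toy data are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]   # augmented matrix
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n + 1):
                m[r][c] -= f * m[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mlr(X, y):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    A = [[1.0] + list(row) for row in X]           # add intercept column
    k = len(A[0])
    ata = [[sum(r[i] * r[j] for r in A) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * yi for r, yi in zip(A, y)) for i in range(k)]
    return solve(ata, aty)


def predict(coef, row):
    """Predicted response for one descriptor row."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def loo_q2_rmse(X, y):
    """Leave-one-out Q2 and RMSE: refit n times, each omitting one sample."""
    n = len(y)
    press = 0.0
    for i in range(n):
        coef = fit_mlr(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, X[i])) ** 2
    y_mean = sum(y) / n
    ss_tot = sum((v - y_mean) ** 2 for v in y)
    return 1.0 - press / ss_tot, (press / n) ** 0.5


# Hypothetical two-descriptor data with a little noise in the response.
X_toy = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
         [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
y_toy = [2.1, 4.9, 1.0, 4.2, 6.9, 3.0]
q2, rmse = loo_q2_rmse(X_toy, y_toy)
```

Because every prediction is made for a compound that was excluded from the fit, Q² is always less than or equal to the fitted R² of the same model and gives a more conservative picture of its stability.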
In Figure 6, the residuals of the SPA-MLR calculated pIC50 values are plotted against the experimental values. The distribution of the residuals on both sides of the zero line indicates that the proposed model has no systematic error. In addition, the values of R²pred, REP, and SEP were determined using the external validation approach, and these parameters were then used to assess the predictivity of the model. In this study, 25% of the data was chosen for external assessment. Figure 7 shows the residuals of the SPA-MLR computed pIC50 values in the external assessment plotted against the experimental values. The fact that the residuals are distributed on both sides of the zero line suggests that the SPA-MLR model was developed without systematic error.
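The external validation statistics can be illustrated with a short sketch. The text does not give the formulas, so the definitions used here are assumptions based on common usage: R²pred = 1 - Σ(y_obs - y_pred)² / Σ(y_obs - ȳ_train)², SEP = sqrt(Σ(y_obs - y_pred)² / n), and REP(%) = 100 · SEP / ȳ_obs. The function name `external_metrics` and the numbers are hypothetical.

```python
def external_metrics(y_obs, y_pred, y_train_mean):
    """External validation statistics for a test set (assumed definitions).

    y_obs        : observed pIC50 values of the test compounds
    y_pred       : values predicted by the model for the same compounds
    y_train_mean : mean observed pIC50 of the training set
    """
    n = len(y_obs)
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss = sum((o - y_train_mean) ** 2 for o in y_obs)
    sep = (press / n) ** 0.5                  # standard error of prediction
    rep = 100.0 * sep / (sum(y_obs) / n)      # relative error, in percent
    return {"r2_pred": 1.0 - press / ss, "sep": sep, "rep_percent": rep}


# Hypothetical test-set values, not taken from Table 1.
metrics = external_metrics(
    y_obs=[4.2, 5.0, 5.8, 6.1],
    y_pred=[4.4, 4.9, 5.6, 6.3],
    y_train_mean=5.1,
)
```

Using the training-set mean in the denominator of R²pred means the model is compared with the simplest predictor available before the test compounds were seen, which is the usual motivation for this form.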
Finally, we used the proposed linear models to gain insight into the mechanism of action of the Candida albicans inhibitors. This requires examining which of the thirteen descriptors of the MLR model are the most important predictors. Figure 8 shows the relative mean effect and sensitivity of each variable for the SPA-MLR model. The model shows that RDF090u has a significant influence on the biological activity of the Candida albicans inhibitors.
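The mean effect of a descriptor is not defined in the text. One definition often used in QSAR work, assumed here purely for illustration, is MF_j = β_j Σ_i d_ij / Σ_j (β_j Σ_i d_ij), where β_j is the regression coefficient of descriptor j and d_ij is its value for compound i. The function name `mean_effects` and the numbers are hypothetical.

```python
def mean_effects(coefficients, descriptor_rows):
    """Relative mean effect of each descriptor (assumed definition).

    coefficients    : regression coefficients b_j, without the intercept
    descriptor_rows : list of rows, one per compound, with values d_ij
    """
    # Sum of each descriptor over all compounds
    col_sums = [sum(row[j] for row in descriptor_rows)
                for j in range(len(coefficients))]
    contributions = [b * s for b, s in zip(coefficients, col_sums)]
    total = sum(contributions)
    return [c / total for c in contributions]


# Hypothetical two-descriptor model and data.
effects = mean_effects([2.0, -1.0], [[1.0, 1.0], [2.0, 1.0]])
```

Under this definition the effects sum to one, and the sign of each value shows whether the descriptor contributes in the same direction as the overall model response or against it.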

CONCLUSION
The use of QSAR methods has been effective in establishing a mathematical relationship between the activity of Candida albicans inhibitors and 2D autocorrelation, Geometrical, RDF, 3D-MoRSE, and GETAWAY descriptors. The results show that the SPA-MLR model outperforms the Stepwise-MLR model. This is because, unlike regression analysis, SPA-MLR allows for flexible mapping of the chosen characteristics by changing their functional dependency implicitly. This approach enabled us to develop a precise and relatively quick method for determining the IC50 of various antifungal derivative series, as well as to accurately estimate the IC50 of novel antifungal compounds.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding authors.