Modeling structure–activity relationships with machine learning to identify GSK3-targeted small molecules as potential COVID-19 therapeutics

Coronaviruses induce severe upper respiratory tract infections, which can spread to the lungs. The nucleocapsid protein (N protein) plays an important role in genome replication, transcription, and virion assembly in SARS-CoV-2, the virus causing COVID-19, and in other coronaviruses. Glycogen synthase kinase 3 (GSK3) activation phosphorylates the viral N protein. To combat COVID-19 and future coronavirus outbreaks, interference with the dependence of N protein on GSK3 may be a viable strategy. Toward this end, this study aimed to construct robust machine learning models to identify GSK3 inhibitors from Food and Drug Administration–approved and investigational drug libraries using the quantitative structure–activity relationship approach. A non-redundant dataset consisting of 495 and 3070 compounds for GSK3α and GSK3β, respectively, was acquired from the ChEMBL database. Twelve sets of molecular descriptors were used to define these inhibitors, and machine learning algorithms were selected using the LazyPredict package. Histogram-based gradient boosting and light gradient boosting machine algorithms were used to develop predictive models that were evaluated based on the root mean square error and R-squared value. Finally, the top two drugs (selinexor and ruboxistaurin) were selected for molecular dynamics simulation based on the highest predicted activity (negative log of the half-maximal inhibitory concentration, pIC50 value) to further investigate the structural stability of the protein-ligand complexes. This artificial intelligence-based virtual high-throughput screening approach is an effective strategy for accelerating drug discovery and finding novel pharmacological targets while reducing the cost and time.


Introduction
Coronaviruses are a family of enveloped viruses known for their ability to infect humans, typically leading to respiratory illnesses (1,2). To address the current epidemic of the novel coronavirus SARS-CoV-2 causing COVID-19, numerous approaches for virus identification and infection prevention and treatment are required. In this context, high-throughput screening has been performed to identify bioactive compounds that inhibit SARS-CoV-2 replication in tissue culture models (3)(4)(5)(6). However, the mode of action and clinical efficacy of these candidates remain to be fully characterized, and additional targets need to be identified to further combat new and emerging SARS-CoV-2 variants. Among other structural and non-structural SARS-CoV-2 proteins, the nucleocapsid (N) protein is an essential RNA-binding protein that plays a crucial role in viral replication, transcription, and assembly (7)(8)(9)(10)(11). Inhibiting SARS-CoV-2 transcription will be a crucial objective, along with strengthening the immune response to the virus and reducing cytokine release syndrome linked to severe cases of COVID-19.
Glycogen synthase kinase 3 (GSK3) is a serine-threonine kinase signaling protein that plays a crucial role in a variety of biological processes, and its aberrant activity has been associated with diabetes, inflammation, and neurodegenerative and psychiatric disorders (12-15). GSK3 has two structurally identical isoforms (a and b), which are 97% similar in their catalytic domains but differs in their N and C (16)(17)(18). Both of these isoforms shows 98% of sequence identity (19). GSK3 (a and b) are required for the phosphorylation of SARS-CoV-2 N protein, and its inhibition blocks SARS-CoV-2-mediated infection in human lung epithelial tissues (16). In addition, knockout of GSK3a and GSK3b validates that it is vital for N phosphorylation (16). Inhibition of GSK3 can increase adaptive T cell and innate natural killer responses of CD8 + T cells while also inhibiting SARS-CoV-2 viral replication (20). Moreover, GSK3 phosphorylates N proteins within the arginine-serine (RS) region of the JHM strain of mouse hepatitis virus (JHMV) and SARS-CoV, which caused the outbreak of severe acute respiratory syndrome in 2002-2004 (9, 10, 21, 22). Phosphorylation of the JHMV N protein is necessary for the recruitment of the ATP-dependent RNA helicase DDX1 for the transcription of long sub-genetic RNAs (10). The N protein from infectious bronchitis virus (IBV) and SARS-CoV interact directly with GSK3, and its knockdown was shown to disrupt the replication of IBV in the Vero cell line (23,24). Moreover, GSK3 inactivators inhibit the coronavirus protease Mpro (or 3C-like protease), which cleaves SARS-CoV-2-encoded polyproteins (pp1a and pp1ab) required for viral replication and transcription. Additionally, GSK3b was also identified to control the autophagy pathway as it involves in the regulation of transcription factor EB (TFEB) nuclear expression mediated via mechanistic target of rapamycin complex 1 (mTORC1) dependent manner (25,26). It also modulates TFEB through the signaling pathways of protein kinase C (PKC) and eukaryotic translation initiation factor 4A-3 (eIF4A3) (27,28). Furthermore, GSK3binduced phosphorylation of TFEB leads to its cytoplasmic retention and contributes to the blockage of the lysosomal-mediated autophagy pathway (29,30). Autophagy plays a critical role in the degradation of dysfunctional cytoplasmic organelles and infectious pathogens, whereas defective autophagy has been observed in SARS-CoV-2 pathogenesis (31,32). An interplay between viral infection and autophagy directs towards the development of an effective therapeutic approach for COVID-19 (31,33). Therefore, it is reasonable to propose that inhibition of GSK3 using small-molecule inhibitors would effectively block SARS-CoV-2 replication as found for SARS-CoV-1.
Quantitative structure-activity relationship (QSAR) approaches are increasingly being applied to aid in the preclinical development of small-molecule drugs. QSAR models help predict the physicochemical properties, biological activity, toxicity, chemical reactivity, and metabolism of chemical compounds (34)(35)(36). The main goal of QSAR analysis is to link a set of predictor variables (X) to the response variable (Y). Techniques for linking X and Y and molecular descriptors have received substantial research attention. In this context, a key strategy in drug discovery is the development of machine learning (ML) techniques to estimate drug-target interactions. QSAR approaches employ a variety of linear and non-linear ML algorithms to produce predictive models for ligand binding to a biological receptor. The term "QSAR" refers to regression models that establish quantitative relationships between molecules' chemical structures and their physical, chemical, or biological characteristics. ML techniques such as gradient boosting, support vector machines, partial least squares, artificial neural networks, or linear regression use a set of molecular descriptors as input data to predict chemical features.
We attempted to develop ML-based QSAR models that could identify GSK3 inhibitors using the bioactivity data available in the ChEMBL and PubChem databases. ML models were developed using two algorithms, histogram-based gradient boosting (HGBM), and light gradient boost machine (LGBM), to prospectively identify GSK3 inhibitors from the Food and Drug Administration (FDA)approved drug library. The rationale behind selecting GSK3 as a drug target is that most of the anti-viral therapies are primarily designed to target the viral structure which however is frequently associated with drawbacks such as drug resistance as a consequence of viral mutation (37,38). Here drug-oriented machine learningbased repurposing approach was adopted based on the physicochemical and pharmacological properties of both active and inactive GSK3 inhibitors to build a model that can identify already approved drugs against the selected target (GSK3). Furthermore, this drug discovery strategy is highly efficient, saves time, and cost and proves to be a prospective approach towards finding already approved drugs against SARS-CoV-2 (39). Finally, molecular dynamics (MD) simulation was performed to further investigate the structural stability of the protein-ligand complexes.

Data compilation and curation
The GSK3a (Target ID: CHEMBL2850) and GSK3b (Target ID: CHEMBL262) datasets used in this study were extracted from the ChEMBL database to generate ML models (40). These datasets are composed of a diverse set of molecules that have been experimentally validated for their inhibitory activity against GSK3. A schematic illustration of the ML-based QSAR and structural dynamics workflow used in this study is shown in Figure 1. Initially, the total number of collected compounds with various bioactivity units, including IC 50 , Ki, EC 50 , and KD, were collected for GSK3a and GSK3b, consisting of 587 and 3637 molecules, respectively. The dataset was cleaned and preprocessed by applying the following filters: first, compounds with undefined activity were discarded; second, compounds containing salt or mixtures, along with overlapping compounds, were removed. Furthermore, in this study, the subset of bioactivity data obtained with IC 50 as a unit was investigated for both GSK3a and GSK3b, consisting of 495 and 3070 unique bioactive compounds, respectively. As the objective of this study was to develop classification models of biologically active compounds, the activity dataset was divided into active and inactive compounds with IC 50 thresholds of<1 and >10 mM, respectively, Machine learning (ML)-based quantitative structure-activity relationship (QSAR) and structural dynamics analysis workflow. HGBM, histogram-based gradient boosting ML model; LGBM, light gradient boosting ML model.
whereas compounds with intermediate activity levels were not considered. The selected compounds were filtered based on Lipinski's rule of five (RO5). Finally, after pre-and postprocessing, two sets of non-redundant and curated compounds for GSK3 (a, b) were used for further investigation: 1885 active compounds and 1679 inactive compounds. All of the collected compounds were randomly segregated into a training set to generate the QSAR classification model and a test set to validate the model quality with a ratio of 80:20.

Molecular descriptors
A molecular descriptor is a mathematical (quantitative/or qualitative) representation of a molecule that is encoded with various sources of chemical information, which is converted and coded to deal with the biological, chemical, and pharmacological features of different small molecules that are used for model construction (41, 42). To develop robust non-linear binary QSAR classification models with better performance, various descriptors such as electronic, topological, geometrical, and spatial descriptors were computed for each molecule in the dataset. We used PaDEL-Descriptor software for molecular descriptor calculation based on a Python code to calculate 12 sets of molecular descriptors, as shown in Table 1 (43). The descriptors belong to nine different types: CDK fingerprint, CDK extended, CDK graph only, Klekota-Roth, AtomPairs 2D, MACCS, E-state, PubChem, and Substructure. These descriptor types can be further divided into two versions: binary and count versions. In this context, the descriptors Klekota-Roth, AtomPairs 2D, and Substructure belong to both versions and provide a detailed description of the substructural components of the studied molecules. The remaining descriptors (n = 9) belong to the binary version. We also computed Lipinski's RO5 molecular descriptors to be used as classification parameters for the identification of drug-like molecules.

Data filtering
Molecular fingerprints with redundant or constant variables were discarded to remove any inherent biases that could negatively impact the resulting model. Not all molecular descriptors are required to represent the features of inhibitors and non-inhibitors. Moreover, the model learns the biases in the data and continues to amplify them, which could lead to overfitting. A selection criterion is required to discard irrelevant descriptors that can measure the relevance of a specific descriptor to the output of any classifier (44). In this context, the VarianceThreshold class from Scikit-learn was implemented to remove the low-variance features with a threshold value higher than 0.1, and the remaining features were used for further analysis.

Data splitting and test selection
Following data filtering, the GSK3 (a, b) datasets were split using the Kennard-Stone algorithm, which separated the entire dataset into internal and external sets with a ratio of 80% and 20%, respectively. The internal datasets were used for training the ML model, and its capability to extrapolate to unknown molecules was simulated by analysis against the external dataset. Finally, the training set was used to estimate the performance of the model using a five-fold cross-validation scheme. The correlation plots for the experimental versus predicted pIC 50 values for GSK3 inhibition in the training and test sets are shown in Figure 2. Additionally, the effectiveness of the ML model was evaluated using 20 compounds that were identified as inhibitors of this target in numerous previous studies (45)(46)(47). The model performance measures and the activity threshold for this external dataset were compared with the experimental IC 50 values, as shown in the Supplementary Tables S1, S2.

ML-based QSAR classification
The QSAR classification model can represent molecular descriptors as a relationship between the dependent variable (IC 50 ) and independent variables (molecular descriptors), each demonstrating the category of the corresponding sample (GSK3 inhibitory activity). In structure-activity relationships, the association between the corresponding datasets is complex and non-linear; thus, a QSAR modeling-based approach was used because it has previously shown outstanding performance in this regard (48). In this context, ML algorithms can cluster instances or observations present in data into classes. A variety of ML algorithms have been employed to construct QSAR classification models from dataset activity labels and molecular descriptors. For example, support vector machines, naïve Bayes classifiers, neural networks, rule-based classifiers, and decision trees are various ML-based techniques used to elucidate the classification problem. In this study, the LazyPredict package was employed using a Python script for model selection, which generates a variety of ML algorithms and authenticates the best-performing algorithm, as shown in Supplementary Table S3 (49). The top models were selected based on the R-squared and root mean squared error (RMSE) values to train our regression model to precisely predict the activity of GSK3 inhibitors. These ML algorithms were implemented using Python software. Subsequently, to determine the optimal values, hyperparameter tuning of the selected models was performed using GridSearchCV implemented in Scikit-learn. The list of the best parameters selected for hyperparameter tuning is presented in Table 2.

Statistical assessment for model validation/performance
Model validation is a crucial step to ensure that a fitted model can accurately predict responses to unknown data. We used two statistical parameters, the Pearson correlation coefficient (r) and RMSE, to assess the performance of the QSAR models. The Pearson correlation coefficient is a common statistic used to describe the strength of the relationship between two variables of interest, which ranges from -1 to +1, with negative values denoting a negative correlation between two variables and positive values indicating a positive correlation. The relative error of the prediction model is frequently examined using the RMSE.
The Y-scrambling test, external validation, and 10-fold crossvalidation were used to confirm the predictive capacity of the QSAR model. The 10-fold cross-validation method does not use the entire dataset to create a predictive model. As an alternative, it separates the data into training and testing datasets, enabling the model to be tested on the testing dataset using the training dataset as a basis. By repeating the 10-fold validation, the average accuracy was used to examine the performance of the prediction model. To further assess the performance and prediction accuracy of the model on external/ benchmarking datasets, mean absolute percentage error (MAPE) was evaluated, a smaller value indicates better model performance.
Xk is the actual value, X k is the predicted value of the model, |X k −X k | represents the absolute error, and N is the number of incidence data points.

Docking and MD simulation
The selected FDA-approved and natural product compounds were selected based on the predicted IC 50 values produced by our QSAR model and docked to the active-site of GSK3 using a protocol described in our previous report (50). From the top 10 docked conformations the best pose was selected based on the MOE docking score (S), protein-ligand interaction fingerprint (PLIF) calculation, including hydrogen and ionic interactions. These selected best poses of selinexor, ruboxistaurin alongwith GSK3 apo and PF-04802367 (PDB ID: 5K5N) were subjected to MD simulation (200 ns) using GROMACS. Furthermore, the dynamic behavior and stability of each system was validated. The detailed methodology adopted here can be found in our previous study (50).

Binding free energy calculations
Molecular mechanics Poisson-Boltzmann surface area (MMPBSA) method was used to measure the binding free energy between GSK3 and ligand complexes. The Poisson-Boltzmann equation was used to determine the effects of the solvent's polar and nonpolar components on the free energy.
In the above equation, DGbind represents binding free energy, DEMM is the intermolecular energy difference, DDGsol shows the difference in solvation energy, whereas T and DS stands for absolute temperature and change in entropy. The study was performed using the gmx_MMPBSA (51). Frames were extracted throughout the trajectory with an interval of 30-frames. The detailed protocol is described in our previous study (52).

Results and discussion
The computational workflow for elucidating the underlying basis of the bioactivity of GSK3 is summarized in Figure 1. To gain a deeper understanding of the dataset, a standard chemical space analysis was performed on the investigated compounds. The preprocessed dataset was utilized to create predictive classification and regression models using the HGBM and LGBM models after rigorous data curation. Subsequently, hyperparameter optimization was performed to determine the optimal parameter configuration of the model. The best-performing model was used to evaluate the predictive capability after training the selected algorithms to gain biological insight. Finally, MD simulations were performed on selected FDA-approved drugs based on the pIC 50 value to further assess the structural dynamics and stability of the proteindrug complexes.

Exploratory chemical space analysis of GSK3 inhibitors
A set of 3,565 compounds tested against GSK3 (a, b) were extracted from public databases. This dataset included seven bioassay formats characterized using the BioAssays Ontology number (53). However, 95% of the dataset was linked to the same bioassay (BAO_0000357) connected to the single-protein affinity format, which could infer the homogeneity of the dataset.
Chemical space analysis of GSK3 inhibitors was explored using Lipinski's RO5 descriptors to gain an understanding of structureactivity relationships. Chemical space analysis can provide considerable understanding of the general characteristics of compounds that define their inhibitory activity. Exploratory data analysis was performed using RO5 descriptors, including the number of hydrogen bond acceptors (nHBAcc), number of hydrogen bond donors (nHBDon), molecular weight (MW), and octanol/water partition coefficient (LogP). The MW of a chemical compound is often used to compute its size, as it facilitates the analysis and prediction of the appropriate size of a drug, which is critical for its transport across a lipid membrane (54). Molecular hydrophobicity (lipophilicity) is usually computed as LogP, which is an important estimator of chemical membrane penetration and permeability (55). Moreover, nHBDon and nHBAcc were computed to estimate the hydrogen bond-forming capacity of a chemical compound.
Initially, the analysis was carried out by visualizing the distribution of active and inactive compounds as determined by the scatter plot of MW vs. LogP, followed by a comparative analysis of active and inactive compounds as a function of Lipinski's RO5 descriptors (56). Figure 3 shows the chemical space distribution of the training set in a scatter plot of the MW versus the logarithm of the LogP. The MW and LogP of the active and inactive compounds showed almost identical distributions, with the majority of the compounds having an MW falling between approximately 200 and 600 Da and a LogP falling between 0 and 7. Most oral drugs are more likely to have optimal physicochemical and absorption, distribution, metabolism, and excretion properties between 1 and 4 (57,58), which is also evident in Figure 3. Furthermore, Figure 3 shows that MW and ALogP cannot be used to discriminate between active and inactive compounds because of their shared chemical space. Figure 4A illustrates the total distribution of the compounds. As can be inferred from the median values in the box plot, active compounds tended to have a slightly higher MW of approximately 380 Da than that inactive compounds, which was approximately 350 Da. Similarly, inactive compounds had LogP values of approximately 3.8, which was higher than that of active compounds approximately 3.5 ( Figure 4B). The distribution of nHBAcc, as deduced from the median, shows that the active compounds have a higher number of hydrogen bond acceptors (n ≈ 7) than inactive compounds (n ≈ 5) ( Figure 4C). However, no significant differences were observed between the active and inactive compounds for nHBDon ( Figure 4D). Therefore, it is difficult to predict the activity of inhibitors using simple molecular descriptors.

Model construction for the prediction of GSK3 kinase inhibitors using ensemble boosting
After the molecular descriptor calculation (Table 1), the LazyPredict package was used to acquire robust ML models, as described in Section 2.5. The best-performing models, HGBM and LGBM, with an R-squared value of 0.53 and 0.52 were used for model construction to better target the GSK3 kinase protein.The performance of the model was evaluated using the R-squared and RMSE metrics, as shown in Supplementary Table S3. Boosting algorithms (HGBM and LGBM) are a type of ensemble learning technique that gradually add tree models to fix the prediction error that already exists in the ensemble (59). To evaluate the performance of our ML models, an external test set using 20 known GSK3 inhibitors was used (45)(46)(47)60). Because these external test set compounds were not considered when creating the models, the resulting performance showed the ability of the ML models to precisely predict the inhibitory activity of already known GSK-3 kinase inhibitors, as shown in Supplementary Tables S1 and S2. The performance of the predictive ML model can also be seen in the correlation plot between predicted values vs. experimental values, as shown in Figure 5. Additionally, the overlapping GSK3 a, and b compounds in the benchmarking dataset of already known inhibitors are shown on Figure 6. The model evaluation metrics, MAPE was used to validate the prediction accuracy in benchmarking datasets (Figure 7). Both the HGBM and LGBM showed relatively good mean absolute percentage error scores of 19.1% and 22.6%, respectively, which represents the percentage difference between the predicted and experimental values. In addition to it, we compared the predictive performance of our QSAR model with another dataset (Supplementary Table S4). The data shows that compounds with higher activity in our model was comparable to the reported model (60).The results in Table 2 indicate that both models exhibited good overall prediction accuracy; however, HGBM was the more accurate model.
This QSAR-based model was then used to predict the inhibitory activity of FDA-approved drugs against GSK3 kinase. Each compound was assigned a pIC 50 value; higher values indicate that the drug is effective at lower concentrations against GSK3 kinase and therefore will show lower systemic toxicity (61).
The top 10 FDA-approved drugs based on pIC 50 values from both the HGBM and LGBM models are presented in Tables 3, 4. The drugs with the best-predicted pIC 50 values produced by our model were selinexor and ruboxistaurin hydrochloride.
Selinexor is an FDA-approved drug for the treatment of multiple myeloma that binds to and inhibits exportin-1 (XPO) and is being evaluated against SARS-CoV-2 in a phase-2 clinical trial (NCT04349098) (62). XPO-1 protein plays an important role in the export of RNA transcripts and nuclear proteins having leucinerich nuclear export signals (NES) (47,(60)(61)(62). However, blocking XPO-1 with its selective inhibitors causes the nuclear accumulation of transcription factor EB (TFEB) and results in autophagy enhancement in human cells and model organisms (33). Because of this, it has been demonstrated that selinexor inhibits the spread of SARS-CoV-2 by preventing the movement of nuclear proteins into the cytoplasm (63, 64). Similar phenomena have been observed when GSK3 is inhibited; this causes translocation of TFEB into the nucleus, where it controls the transcription of around 400 genes involved in autophagy, which eliminates the invading viruses like SARS-CoV-2 (65)(66)(67). The current study indicates additional, mechanisms through which selinexor could prevent SARS-CoV-2 replication, such as by preventing the phosphorylation of the virally encoded N protein. In addition to the drug's effect on autophagy (16,20). Chemical space of the training set. The molecular weight (MW) on the X-axis and the logarithm of the octanol/water partition coefficient (LogP) on the Y-axis serve as the parameters for the chemical space. Red and green spots, respectively, represent active and inactive substances.
Ruboxistaurin is an investigational drug that targets protein kinase C beta for the treatment of diabetic retinopathy. Protein kinase C (PKC) is a serine/threonine protein kinase that has been identified to modulate autophagy (68). Autophagy is an innate immune response to kill and degrade invading viruses (69). In this context, one such mechanism is the activation of PKC-a/d induced by GSK3b inhibition which leads to the phosphorylation repression of TFEB and its nuclear localization and activation of autophagy pathways (30,70). The nuclear localization of TFEB induced by PKC-a/d occurs via GSK3b in mTORC1-independent manner (30). Additionally, an orally active PKC inhibitor ruboxistaurin proves to be active against SARS-CoV-2 as it inhibits NETosis, and has passed phase 3 for other indications (71). Previous studies have also shown that ruboxistaurin was active against GSK3a and GSK3b with IC 50 values of 695.9 nM and 97.3 nM, respectively (60). This data indicates that ruboxistaurin inhibits both PKC and GSK3 which supports our QSAR model prediction.
In conclusion, the top predicted medications (selinexor, ruboxistaurin) by our model also correlate with the data that revealed their effectiveness against SARS-CoV-2 as shown above, and GSK3 inhibition seems to play a significant part in the activity of these drugs. Furthermore, the drugs with the highest predicted IC50 including selinexor and ruboxistaurin indicate that the molecular fingerprints of these compounds exist in the compounds used to develop the training set of GSK3 inhibitors. In addition, we validate the performance of our ML model from a structural viewpoint, these two drugs were subjected to MD simulation.

Atomic-level characterization and binding free energy calculations
Characterization of the protein-ligand complex is essential for predicting selective GSK3b inhibitors. The top two FDA-approved drugs (selinexor and ruboxistaurin) were selected for molecular dynamics (MD) simulation based on the highest predicted activity [according to the negative log of the half-maximal inhibitory concentration (pIC 50 ) value] to further investigate the structural stability of the protein-ligand complexes. The three-dimensional structure (PDB ID: 5K5N) of GSK3b was retrieved from the Protein DataBank and used to generate multiple docking poses to select the best conformer for the MD simulation.
All GSK3-apo, PF-04802367 and ligand-bound complexes were docked and subjected to MD simulations in an aqueous environment for 200 nano second (ns) to discern conformational variations and dynamic stability. The dynamic nature of the GSK3b protein and the test drugs (selinexor, ruboxistaurin, and PF-04802367) was explored separately using MD simulation trajectories. To explore the average displacement of the atoms, the root mean squared deviation (RMSD) of the complexes was evaluated and contrasted with the RMSD of the GSK3-apo and PF-04802367 (control) bound structures, as shown in Figure 8A. The GSK3b-ruboxistaurin complex showed a stable trajectory with slight variations throughout the simulation, as the RMSD value ranged from 0 to 2.4 Å. However, from 100 ns to 150 ns, the RMSD showed an incremental deviation from 2.5 to 2.7 Å before attaining the final trajectory of 2.4 Å. In the GSK3-selinexor complex, a similar RMSD trajectory profile was observed from 0 to 2.1, and from 50 to 150 ns, with an incremental deviation (2 to 2.4 Å) before reaching 2.1. The RMSD plots for GSK3-PF-04802367 (control drug) showed a steady incremental deviation with the trajectory originating from 0 to 2.8 Å, along with some acceptable variations during different time intervals. The RMSD profile of the apo structure showed a pattern similar to that of GSK3b in complexes with PF-04802367 and ruboxistaurin. However, acceptable deviations were observed in complexes with selinexor compared with those of the apo-form. These findings suggest that after a few conformational changes, the protein-ligand complex achieved a stable conformation during the simulations.
Additionally, to analyze the protein flexibility in the ligandbound complexes, root mean square fluctuation (RMSF) values were calculated, as shown in Figure 8B. The RMSF of the residues measures the residue-level structural fluctuation of a specific segment of the protein that deviates from its mean structure, which often occurs upon ligand binding. The variations observed for each residue represent the degree of flexibility they attained. The ATP-binding site of GSK3b is present at the interface of the N and C termini and consists of Pro136, Leu132, Asp133, Tyr134, and Val135; however, the hydrophobic side chain of Arg141 forms another segment of the pocket (72). In the case of the GSK3b-apo structure, no significant fluctuations occurred in the binding site residues such as Pro136 (0.733 Å), Asp133 (0.633 Å), Tyr134 (0.637 Å), Leu132 (0.806 Å), and Val135 (0.87 Å). In the GSK3-PF04802367 complex, the RMSF values of residues Pro136 (0.614 Å), Leu132 (0.681 Å), Tyr134 (0.584 Å), Val135 (0.547 Å), and Arg141 (0.77 Å) were lower than those found for the apo structure. Similarly, in the GSK3b-selinexor complex, all active-site residues fluctuated less than in the apo structure, particularly residues Pro136 (0.64 Å), Leu132 (0.715 Å), Asp133 (0.593 Å), Tyr134 (0.589 Å), Val135 (0.698 Å), and Arg141 (0.797 Å). The fluctuation patterns of residues Asp133 (0.609 Å) and Tyr134 (0.622 Å) in the GSK3-ruboxistaurin complex were similar to those of the apo structure; however, slightly higher fluctuations were observed for  Venn diagrams visualizing the overlap of known GSK3 (a, b) inhibitors retrieved from the ChEMBL database. In total benchmarking dataset has 20 compounds out of which 8 are specific to GSK3b, 11 are to both GSK3 (a,/b) and 1 is to GSKa specific.   residues Pro136 (0.938 Å) and Arg141 (0.959 Å). The GSK3-ligand interaction reduced the overall fluctuations of the protein; the RMSF value of the selected FDA drugs (selinexor and ruboxistaurin) was similar to that of the control drug PF-04802367.
To further characterize the compactness of the protein in the binding of the ligand at the ATP site of GSK3, the radius of gyration (Rg) was determined, as shown in Figure 8C. The impact of ligand binding on the Rg of the GSK3b protein was calculated and compared with that of the apo-GSK3b protein structure. The Rg of the apo and ligand-bound complexes remained between 21.5 and 21.7 Å, which indicated their compactness and sustained stability.
To analyze the formation of hydrogen bonds throughout the MD simulation, the gmx hbond program from the GROMACS package was used. As shown in Figure 8D, the average hydrogen bond for the control (PF-04802367) was about 1.5; however, an increase in the number of hydrogen bonds from 1.5 to 3.9 was observed. On the other hand, the average hydrogen bond for selinexor and ruboxistaurin was about 1.3 and 1.6. However, during the early steps (0 to 66 ns) of MD simulation, a low number of hydrogen bonds was observed in selinexor, which was about 0.6. Overall, the hydrogen bond formation remains intact throughout the simulations, indicating that the ligands were present in the binding pocket throughout the process.
The MM-PBSA technique was used to calculate binding free energy to measure the strength of receptor-ligand binding. The change in binding free energy upon binding of ruboxistaurin, selinexor, and PF-04802367 (control) with GSK3 receptors is shown in Figure 9. Here, GSK3-drugs (ruboxistaurin, selinexor) displayed acceptable binding free energy, which is comparable with a positive control (PF-04802367). The cumulative binding energy (Gbinding) of selinexor and ruboxistaurin is -8509.006 56.335 kcal/ mol and -8491.473 57.039 kcal/mol, whereas PF-04802367 is -8681.320 58.8627 kcal/mol. However, the total binding energy of PF-04802367 is slightly higher but comparable with that of selinexor and ruboxistaurin, which clearly reflects the robustness of our ML model in predicting the drug candidates that could bind tightly to GSK3.

Conclusion
The COVID-19 pandemic caused by the new coronavirus, SARS-CoV-2, poses a serious threat to the global health system. We employed ML-based predictive modeling to identify FDAapproved and clinical candidate drugs inhibiting GSK3, as this kinase plays a critical role in the phosphorylation of the SARS-CoV-2 N protein that is required for viral replication (16,20). Furthermore, among the FDA-approved compound libraries, leads with good pIC 50 values were subjected to MD simulations to investigate protein-drug interactions in a dynamic environment. Recently, one of the identified drugs, selinexor, was also found to be effective against SARS-CoV-2. It is currently in the clinical trial recruitment phase (NCT04534725) for COVID-19 treatment.
We anticipate that the current research, which combines data curation from relevant databases with ML-based predictive algorithms to identify possible therapeutic candidates for COVID-19, could complement ongoing antiviral research efforts. These artificial intelligence-based pipelines may help in the design of preclinical laboratory studies, future clinical trials, and drug discovery. These approaches may also help to improve our understanding of other diseases and related biological phenomena. COVID-19 and potential future outbreaks of coronaviruses may be treatable because of the interference with the conserved dependence of the N protein on GSK3 and its potential role in the regulation of autophagy.

Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

Author contributions
Conceptualization, RP and SC. methodology, RP and BA. Writing-original draft preparation, RP. Writing-review and editing, RP, BA, NQ, and SC. Visualization, BA, and NQ. Supervision, SC. Funding acquisition, SC. All authors contributed to the article and approved the submitted version.

Conflict of interest
Authors RHP and SC were employed by S&K Therapeutics. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fendo.2023.1084327/ full#supplementary-material MM-PBSA binding free energy calculation. (A) Representative contributions of each Gas phase free energy (G-gas) components for GSK3 with the selected drugs (PF-04802367, ruboxistaurin, and selinexor); (B) The solvation free energy for GSK3 in complex with the selected drugs; (C) Total of gas phase and solvation free energies along with cumulative total binding free energy for PF-04802367, ruboxistaurin, and selinexor.