An Integrative Network Science and Artificial Intelligence Drug Repurposing Approach for Muscle Atrophy in Spaceflight Microgravity

Muscle atrophy is a side effect of several terrestrial diseases and also severely affects astronauts on space missions because of the reduced gravity in spaceflight. We present an integrative graph-theoretic, network-based drug repurposing methodology that quantifies the interplay of key gene regulations and protein–protein interactions in muscle atrophy conditions. Transcriptomic datasets from mice in spaceflight from GeneLab have been extensively mined to extract the key genes that cause muscle atrophy in organ muscle tissues such as the thymus, liver, and spleen. Top muscle atrophy gene regulators are selected by the Bayesian Markov blanket method, and a gene–disease knowledge graph is constructed using the Scalable Precision Medicine Knowledge Engine. A deep graph neural network is trained for predicting links in the network. The top-ranked diseases are identified, and drugs are selected for repurposing using the DrugBank resource. A disease–drug knowledge graph is constructed, and the graph neural network is trained for predicting new drugs. The results are compared with machine learning methods such as random forest and gradient boosting classifiers. Among the network measure based methods, preferential attachment shows good performance for link prediction in both the gene–disease and disease–drug graphs. The receiver operating characteristic curves and prediction accuracies for each method show that the random walk similarity measure and the deep graph neural network outperform the other methods. Several key target genes identified by the graph neural network are associated with diseases such as cancer, diabetes, and neural disorders. The novel link prediction approach applied to the disease–drug knowledge graph identifies monoclonal antibody drug therapy as a suitable candidate for drug repurposing for spaceflight-induced microgravity. A total of 21 drugs are identified as possible candidates for treating muscle atrophy.
The graph neural network is a promising deep learning architecture for link prediction in gene–disease and disease–drug networks.


INTRODUCTION
Drug discovery is an expensive process costing an average of $1.8 million per drug. Most drug discovery on Earth is done in a constant environment with a gravitational acceleration of 9.81 m/s². Spaceflight in satellites and on the International Space Station (ISS) provides a gravitational acceleration of 1 × 10⁻⁶ m/s². This is referred to as microgravity, which has direct and indirect effects on an organism. The direct effects are changes in weight, distortion and deformation of organelles, and other measurable changes. The indirect effects are those that occur as a result of microgravity. Bacterial virulence and increased genetic recombination have been observed in space, thereby requiring increased concentrations of antibiotics for treatment. The spaceflight environment is conducive to drug discovery, as observed in a spaceflight experiment that tested the molecules Amgn-0007 and sActRIIB for increasing bone mineral density in mice (Zea, 2015).
In addition to aging, muscle atrophy is implicated in the etiology of chronic diseases such as diabetes, cancer, obesity, and muscular dystrophy (Kalyani et al., 2014; Muscular Dystrophy, n.d.). Muscle wasting also develops as a consequence of acquired immune deficiency syndrome (AIDS) (Dudgeon et al., 2006), neuromuscular disorders, and organ failure (cachexia) (Wyart et al., 2020; Rausch et al., 2021). Muscle wasting is the hallmark of cancer cachexia and is associated with serious clinical consequences such as physical impairment, poor quality of life, reduced tolerance to treatments, and shorter survival (Burckart et al., 2010). Muscle atrophy is a severe disabling clinical condition that accompanies the development of pancreatic, lung, liver, and bladder cancer (Bei and Xiao, 2017; Yang et al., 2018). A prolonged stay in spaceflight of up to 4 months can lead to a 17% loss of muscle mass. Muscle atrophy is accelerated in space because microgravity unloads the muscles. Gene expression datasets have been analyzed using traditional fold change analysis and clustering methods to identify differentially regulated genes involved in muscle atrophy in mice flown in spaceflight (Horie et al., 2019a,b). Spaceflight simulation studies have shown differential expression of a small number of microRNAs in the context of muscle physiology in response to loading (Rullman et al., 2016). Recent miRNA studies have shown that muscle degeneration with accelerated aging, enhanced by exposure to space radiation and microgravity, is driven by circulating miRNAs, which are being suggested as potential biomarkers (Malkani et al., 2020). However, advanced network analysis to identify causally related key target genes and their association with other diseases, and the application of Artificial Intelligence (AI) methods to identify drugs suitable for treating muscle atrophy in spaceflight, have not been performed.
Several treatments have been proposed and used for countering muscle atrophy in humans. Inhibition of a protein called myostatin has been shown to increase muscle mass (Smith et al., 2020). The drug formoterol has been used for counteracting muscle atrophy in mice in spaceflight (Ballerini et al., 2020). There are many drug candidates that can be used for treating muscle atrophy, and the use of traditional methods for drug repurposing is time consuming due to the large volume of compounds that need to be tested. AI based methods have gained importance in this pandemic era for rapid, low-cost, and effective drug repurposing (Gysi et al., 2020). AI methods rely on the fact that drugs that target one disease can target another disease with similarly functioning protein-protein interaction networks. AI related methods are Machine Learning (ML) methods and/or Deep Learning (DL), a sub-branch of ML. ML methods such as Support Vector Machines (SVM), Random Forest (RF), and Gradient Boosting (Gboost) have been used for drug repositioning to treat schizophrenia and anxiety disorders (Zhao and So, 2018). ML based drug repositioning is a cost-effective way of automating the drug discovery process, gaining deeper knowledge of the genetic causality of diseases and their associations, and planning preclinical trials for the selected drugs (Koromina et al., 2019; Réda et al., 2020). DL neural network architectures can explore a large amount of data and search for similarities in several thousands of protein-protein interactions. If the input data is in the form of sequences, then Recurrent Neural Networks (RNN) are trained with the time-stamped data and used for prediction of drugs (Wang et al., 2020). Hybrid models that combine the power of Convolutional Neural Networks (CNN) and RNN have been used for drug repurposing (Xuan et al., 2019; Jarada et al., 2020).
Gene-protein and protein-protein interactions are generally depicted in the form of a graph, which has led to the identification of disease networks and to network medicine approaches for drug repurposing (Gysi et al., 2020). Network measures and evaluation metrics such as the Area Under the Receiver Operating Characteristic (AUROC) curve and the Area Under the Precision-Recall (AUPR) curve have been used for network link prediction in drug discovery (Chen et al., 2018; Abbas et al., 2021). Network medicine uses graph representations for learning the patterns of protein-protein interactions. The SPOKE database (Nelson et al., 2021) is a heterogeneous knowledge graph connecting biological and clinical data from over 30 databases; it is used in this work in combination with transcriptomic datasets to create the inputs to the AI model. The Bayesian Markov blanket method applied to spaceflight transcriptomic datasets for muscle atrophy gives information on which genes are highly activated due to muscle unloading in spaceflight.
In this paper, we analyze spaceflight gene expression datasets for muscle atrophy using advanced network analysis methods and combine them with the power of AI to identify drugs that can be repurposed for successful treatment of muscle atrophy. The rest of the paper is organized as follows. Section "Materials and Methods" presents the GeneLab datasets and the methods used for drug repurposing, Section "Results" presents the knowledge graphs and the link prediction results, Section "Discussion" presents a discussion of the gene-disease associations and disease-drug link predictions, and the "Conclusions" section presents the conclusions.

MATERIALS AND METHODS
This section describes the GLDS datasets used for mining, the SPOKE database, the network analysis methods, and the ML and AI methods for link prediction. Gene expression data were downloaded from NASA GeneLab repository. The datasets were preprocessed by NASA GeneLab.

GLDS-4
Thymus lobes were extracted from young adult C57BL/6NTac mice at 8 weeks of age after exposure to spaceflight aboard the space shuttle STS-118 for a period of 13 days. Gene expression analysis demonstrates that spaceflight induces significant changes in the thymic mRNA expression of genes that regulate stress, glucocorticoid receptor metabolism, and T cell signaling activity (Lebsack et al., 2010). Key master regulators such as TGF-β1 coordinating the systemic response of mice to spaceflight microgravity and/or space radiation were identified in Beheshti et al. (2018).

GLDS-244
A cohort of healthy mice was implanted with a subcutaneous nanofluidic delivery system (nF) of formoterol (FMT), a β2-adrenergic receptor agonist, for therapeutic treatment of skeletal muscle loss. The mice were subjected to spaceflight microgravity on the ISS for 29 and 56 days before being euthanized. RNA sequencing analysis of thymus tissues showed that nF-FMT treatment mass loss in comparison to control mice (Ballerini et al., 2020).

GLDS-245
Liver tissue was extracted from the same cohort of mice used in the GLDS-244 experiment. RNA sequence data were obtained from liver preserved in liquid nitrogen after dissection and stored at −80 °C. RNA sequencing analysis of the liver tissue was done.

GLDS-246
A cohort of forty 32-week-old female C57BL/6NTac mice were either sham operated or implanted with vehicle or treatment-filled nDS, and launched in two Transporters (20 mice per Transporter) on SpaceX-13. They were transferred to Rodent Habitats onboard the ISS and maintained in microgravity. After 56 days, they were euthanized on the ISS, RNA samples were extracted from spleen tissue, and sequencing analysis was performed.

GLDS-288
The spleens and lymph nodes were analyzed from mice flown aboard the ISS in orbit for 35 days, as part of a Japan Aerospace Exploration Agency mission. The mice were exposed to microgravity or artificial 1 × g on the ISS. Paired-end sequencing (PE36bp) was performed with NextSeq500. Whole-transcript cDNA sequencing (RNASeq) analysis of the spleen suggested that erythrocyte-related genes regulated by the transcription factors GATA1 and Tal1 were significantly down-regulated on the ISS (Horie et al., 2019b).

GLDS-289
Twelve C57BL/6J male mice (8-week-old for MHU-1 and 9-week-old for MHU-2) in transportation cage units (TCU) were launched aboard the SpaceX rocket from the KSC and transported to the ISS. After one month in spaceflight, RNA sequencing analysis showed a significantly reduced expression of cell cycle-regulating genes, resulting in reduced size of the thymus. However, exposure to 1 × g alleviated the impairment of thymus homeostasis induced by spaceflight (Horie et al., 2019a).

Gene Regulatory Network Inferencing Using Incremental Association Markov Blanket Method
In genomics, genome-to-phenome analysis and transcriptional regulatory analysis are facilitated by the construction of Gene Regulatory Networks (GRNs) from gene expression datasets. The GRNs also show causal relations between the genes. Traditionally, causal relations are difficult to infer and require careful application of experimental interventions. However, causal relations can be discovered by statistical analysis of purely observational data, which is known as causal structure learning (Anand, 2009). By the Markov property, a gene is conditionally independent of all other genes given its parents, its children, and its children's other parents. Causal relationships are useful for combining omics data with Genome Wide Association Studies (GWAS) and for inferring relationships between genotype and phenotype (Ainsworth et al., 2017).
The method used here for causal relation inferencing combines the Markov Blanket (MB) method with Bayesian Network (BN) learning (Tsamardinos et al., 2003; Ram and Chetty, 2011; Syed Sazzad et al., 2020). In a Bayesian network, joint conditional probabilities are represented by a graph whose nodes (genes) are connected according to the Markov property, which states that a node is conditionally independent of its non-descendants, given its parents. Applying the faithfulness condition, the MB of any node (gene) in a BN is the set of parents, children, and spouses (the other parents of their common children) of the gene. In our case, each gene is a variable with a series of expression values. The Markov blanket of a gene X is the smallest set MB(X) containing all genes carrying information about X that cannot be obtained from any other gene. Association measures and conditional independence tests are applied to identify the strongly relevant genes (Pellet and Elisseeff, 2008; Bui and Jun, 2012). Hence, the Incremental Association Markov Blanket (IAMB) method is a causal structure learning algorithm useful for the discovery of regulatory interactions among genes from gene expression data. Here, MB is used to construct GRNs for regulatory relationships between genes/proteins.
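The grow-and-shrink logic of an incremental association search can be sketched as follows. This is a minimal Python illustration and not the authors' R implementation: the Fisher z partial-correlation test, the threshold `alpha`, and the function names `partial_corr_pvalue` and `iamb` are assumptions chosen for the example, and the five-gene dataset is synthetic.

```python
import numpy as np
from scipy import stats


def partial_corr_pvalue(data, x, y, cond):
    """Two-sided Fisher z p-value for the partial correlation of columns x and y given cond."""
    idx = [x, y] + list(cond)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.pinv(corr)  # precision matrix of the selected columns
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    r = float(np.clip(r, -0.999999, 0.999999))
    dof = data.shape[0] - len(cond) - 3
    if dof <= 0:
        return 1.0  # too few samples to test: treat as independent
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(dof)
    return float(2 * stats.norm.sf(abs(z)))


def iamb(data, target, alpha=1e-3):
    """Simplified IAMB: return the estimated Markov blanket (column indices) of `target`."""
    mb = []
    # Growing phase: repeatedly add the candidate most associated with the target given mb.
    while True:
        candidates = [v for v in range(data.shape[1]) if v != target and v not in mb]
        if not candidates:
            break
        pvals = {v: partial_corr_pvalue(data, target, v, mb) for v in candidates}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        mb.append(best)
    # Shrinking phase: remove members that are independent of the target given the rest.
    for v in list(mb):
        rest = [w for w in mb if w != v]
        if partial_corr_pvalue(data, target, v, rest) >= alpha:
            mb.remove(v)
    return sorted(mb)


# Synthetic expression data (assumed structure): a -> t -> c <- s, with d unrelated.
rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)           # parent of the target
t = a + rng.normal(size=n)       # target gene
s = rng.normal(size=n)           # spouse (other parent of c)
c = t + s + rng.normal(size=n)   # child of the target
d = rng.normal(size=n)           # unrelated gene
data = np.column_stack([a, t, c, s, d])
mb = iamb(data, target=1)
print(mb)
```

In this toy structure the blanket of the target (column 1) should contain its parent (column 0), child (column 2), and spouse (column 3); the spouse only becomes detectable once the child has been added to the conditioning set.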

Gene Disease Knowledge Graph Using SPOKE
Gene-disease associations are important, as the key genes of muscle atrophy are also affected by other diseases, which can turn out to be lethal when transferred to the next generation. Hence, it is vital to predict which new diseases can occur because of the higher activity of particular genes in the GRNs for muscle atrophy. To obtain the gene-disease associations, we use the Scalable Precision Medicine Knowledge Engine (SPOKE), which is a large heterogeneous network containing multiple types of biological data capturing the essential structure of biomedicine and human health for discovery (Scalable Precision Medicine Knowledge Engine, n.d.). The maximally regulated genes identified from the GRNs are input to SPOKE, which returns all the diseases associated with these key genes. These associations are used to construct the Gene Disease Knowledge Graph (GDKG).

Network Measures
We define a network using a graph-based representation. Formally, a graph is a pair of sets $G = (V, E)$, where $V$ is the set of vertices (molecules, genes, proteins, nodes, points) and $E$ is the set of edges, each of which is a pair of elements of $V$. The graph $(V, E, o, t)$ is called directed if directed edges are allowed, i.e., not all edges have reverse edges as members of $E$. In a directed graph $G = (V, E, o, t)$, the edges are $e(u, v) \in E$; the origin $u$ of $e$ is denoted by $o(e)$ and the terminal $v$ is denoted by $t(e)$. In a network $G = (V, E)$, for a node $u$, $\Gamma(u) = \{v \mid (u, v) \in E\}$ represents the set of neighbors of node $u$. The link prediction task in a network $G = (V, E)$ is to determine whether there is or will be a link $e(u, v)$ between a pair of nodes $u$ and $v$, where $u, v \in V$ and $e(u, v) \notin E$. Similarity measures computed from neighborhoods in a graph are widely used in link prediction algorithms (Abbas et al., 2021). Random walks have also been used for link prediction. Random walk methods efficiently explore the neighborhood of a node to determine a path from a starting node to a terminal node. Probabilities are usually used to select the next neighboring node in the path. Biomolecular networks are complex, and random walks are an efficient way of exploring them (Costa and Travieso, 2007; Janwa et al., 2019). A semi-supervised scalable feature learning method is proposed in Grover and Leskovec (2016), where the authors develop a family of biased random walks resulting in a flexible search space of nodes for link prediction. We have used this method to obtain the highest ranked nodes for possible links between muscle atrophy genes and their associated diseases. Apart from random walks, we have computed the preferential attachment network measure to obtain possible gene-disease and disease-drug associations. Preferential Attachment is the product of the degrees of nodes $u$ and $v$: $PA(u, v) = |\Gamma(u)| \cdot |\Gamma(v)|$.
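As a small illustration of the two measures discussed here, the sketch below scores candidate links by preferential attachment and generates a random walk with networkx. The toy edge list is illustrative only (it is not the GDKG), the helper names are hypothetical, and the walk is uniform rather than the biased walk of Grover and Leskovec (2016), which adds return and in-out parameters.

```python
import random
import networkx as nx

# Toy gene-disease graph; edges are illustrative, not taken from SPOKE.
G = nx.Graph()
G.add_edges_from([
    ("PTEN", "diabetes"), ("PTEN", "cancer"),
    ("ATF3", "cancer"),
    ("TNFRSF19", "ovarian cancer"), ("TNFRSF19", "ectodermal dysplasia"),
    ("EEF1B2", "seizures"),
])


def preferential_attachment(graph, u, v):
    """Product of the degrees of u and v."""
    return graph.degree(u) * graph.degree(v)


# Candidate links that are not yet in the graph.
candidates = [("ATF3", "diabetes"), ("EEF1B2", "cancer")]
scores = {(u, v): preferential_attachment(G, u, v) for u, v in candidates}

# networkx provides the same index directly.
nx_scores = {(u, v): p for u, v, p in nx.preferential_attachment(G, candidates)}


def uniform_random_walk(graph, start, length, rng):
    """Walk of `length` nodes, choosing each next node uniformly among neighbors."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


walk = uniform_random_walk(G, "PTEN", 5, random.Random(0))
print(scores, walk)
```

Walks of this kind are what embedding methods such as node2vec consume to learn node features for a downstream link classifier.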

Graph Neural Network for Prediction of Gene-Disease Associations
A deep Graph Neural Network (GNN) architecture consisting of multiple layers and hundreds of nodes is constructed and takes as input the GDKG constructed as described in section "Gene Disease Knowledge Graph Using SPOKE." This graph $G = (V, E)$ is multimodal and heterogeneous, with $N$ nodes $v_i \in V$ representing proteins or genes, and diseases. The edges $E$ represent gene-disease associations. The link prediction task is to predict whether there is or will be a link $e(u, v)$ between a pair of nodes $u$ and $v$, where $u, v \in V$ and $e(u, v) \notin E$. A link prediction problem is set up on the GDKG representation for identifying links between genes and the diseases associated with them. The GNN is a three-layer model. The edge features of the GDKG are the input to the input layer of the GNN. The hidden layer consists of 300 neurons with the "tanh" activation function. The Limited-memory Broyden-Fletcher-Goldfarb-Shanno (lbfgs) solver from the sklearn library is used for link prediction. It approximates the second derivative matrix updates with gradient evaluations and stores only the last few updates, so it saves memory. The output of the GNN is a matrix consisting of new predicted edges.
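The three-layer configuration described here (one hidden layer of 300 "tanh" units trained with lbfgs) can be sketched with scikit-learn's `MLPClassifier`. That this class is what the authors used is an assumption inferred from the solver name; the feature matrix below is random stand-in data rather than GDKG edge features, and `max_iter` is an arbitrary choice.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Stand-in data: 200 node pairs with 100 features each, label 1 = link, 0 = no link.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
w = rng.normal(size=100)
y = (X @ w > 0).astype(int)

# Input layer -> one hidden layer (300 tanh units) -> output layer.
clf = MLPClassifier(
    hidden_layer_sizes=(300,),
    activation="tanh",
    solver="lbfgs",
    max_iter=500,
    random_state=0,
)
clf.fit(X[:150], y[:150])

# Predicted link probability for each held-out node pair.
link_proba = clf.predict_proba(X[150:])[:, 1]
print(link_proba[:5])
```

Thresholding `link_proba` (for example at 0.9) gives the kind of high-confidence predicted links that the ranking step relies on.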

Random Forest Method
The RF is a classifier that applies ensemble learning to a multitude of decision trees constructed at training time. It trains decision trees using random sampling with replacement. For each node in the base decision tree, random forest randomly chooses an attribute subset of k (k ≤ m) attributes from the attribute set of the node (which contains m attributes), and then chooses the best attribute from the subset to split samples (the optimal judgment is usually based on the minimum of a Gini index). The split process is repeated until the split termination condition is satisfied (generally, the Gini index is small enough), and the model integrated from multiple decision trees is a random forest (Wu et al., 2018). Each tree emits a prediction, and the class with the most votes becomes the model's prediction. It is based on the principle that many uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.
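The voting scheme described above can be made concrete with scikit-learn. This is a sketch on synthetic data with arbitrary settings (50 trees, square-root feature subsets); note that scikit-learn's own `predict` averages the trees' class probabilities, so the hard majority vote is computed by hand here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data standing in for node-pair features.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

rf = RandomForestClassifier(
    n_estimators=50,       # number of trees in the committee
    max_features="sqrt",   # size k of the random attribute subset tried at each split
    criterion="gini",      # split quality measured by the Gini index
    bootstrap=True,        # each tree sees a sample drawn with replacement
    random_state=0,
)
rf.fit(X, y)

# One row of votes per tree for the first five samples.
votes = np.stack([tree.predict(X[:5]) for tree in rf.estimators_])
majority = (votes.mean(axis=0) > 0.5).astype(int)
print(majority, rf.predict(X[:5]))
```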

Gradient Boosting Classifier
The Gboost classifier is also an ensemble learning method, similar to random forest except that it trains one tree at a time. This additive model (ensemble) works in a forward stage-wise manner, introducing a weak learner to improve the shortcomings of existing weak learners (Li et al., 2016). In Gboost, shortcomings are identified by gradients, whereas in Adaboost, shortcomings are identified by high-weight data points. Both high-weight data points and gradients tell us how to improve our model. RF combines results at the end of the process (by averaging or "majority rules"), while Gboost combines results along the way. Gboost is not the best method if there is a lot of noise in the data, as it tends to overfit. Its parameters are also harder to tune than those of RF.
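The forward stage-wise behavior can be observed directly with scikit-learn's `GradientBoostingClassifier`, whose `staged_predict` yields the ensemble's predictions after each added tree. The data are synthetic; the learning rate of 0.2 matches the value reported in the Implementation section, while the other settings are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary classification data standing in for node-pair features.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.2,
    max_depth=3,
    random_state=0,
)
gb.fit(X, y)

# Training accuracy of the partial ensemble after each boosting stage.
staged_accuracy = [float(np.mean(pred == y)) for pred in gb.staged_predict(X)]
print(staged_accuracy[0], staged_accuracy[-1])
```

Comparing the staged accuracy on held-out data against the training curve is a simple way to spot the overfitting that the text warns about.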

Disease Drug Link Prediction
The top ranked disease associations from the GDKG that have the highest probability of predicted and existing links are selected. For each of these diseases, up to ten drugs are chosen from the DrugBank database (Drugbank online, n.d.). Many drugs are used for several diseases. An adjacency matrix with rows for diseases and columns for drugs is constructed. The Disease Drug Knowledge Graph (DDKG) is generated, and the link prediction algorithm is run on this graph. This results in top ranked drugs with the highest probability of being repurposable for muscle atrophy. The choice of the best drug also depends on the diagnostics and prognostics of the disease; hence, the most prevalent comorbidities with muscle atrophy are considered for the drug selection. Drug selection is also a very sensitive process and requires clinical intervention; hence, we provide a list of drugs that may be considered for treatment of this condition in spaceflight. Figure 1 shows the sequence of steps followed for constructing the GDKG and DDKG, and for link prediction for drug repurposing in muscle atrophy. The predicted drugs have the highest probability for muscle atrophy treatment in spaceflight. Once the DDKG is constructed, any machine learning approach for link prediction can be used for predicting probable drugs. The network feature extraction method used here is based on random walks, which can be replaced with other local or global graph similarity based indices such as common neighbors, Jaccard index, Sorensen index, preferential attachment, Adamic-Adar index, resource allocation index, hub promoted index, Leicht-Holme-Newman index, parameter dependent index, local affinity structure index, individual attraction index, mutual information index, functional similarity weight, local neighbors link index, Katz index, and page rank (Fire et al., 2011; Mutlu et al., 2020). In this case, we found random walk and preferential attachment to give better results than the other features.
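The construction of the disease-by-drug adjacency matrix and the ranking of missing links can be sketched in Python as below. The disease and drug names are placeholders rather than DrugBank entries, and preferential attachment is used as the scoring index only as one of the options listed above.

```python
import networkx as nx
import pandas as pd

# Placeholder disease-drug pairs (hypothetical names, not from DrugBank).
pairs = pd.DataFrame({
    "disease": ["disease_1", "disease_1", "disease_2", "disease_3"],
    "drug": ["drug_A", "drug_B", "drug_A", "drug_C"],
})

# Adjacency matrix: rows are diseases, columns are drugs, 1 marks a known link.
adj = pd.crosstab(pairs["disease"], pairs["drug"]).clip(upper=1)

# Bipartite disease-drug graph built from the matrix.
G = nx.Graph()
G.add_nodes_from(adj.index, bipartite="disease")
G.add_nodes_from(adj.columns, bipartite="drug")
G.add_edges_from(
    (dis, drug) for dis in adj.index for drug in adj.columns if adj.loc[dis, drug] == 1
)

# Score every missing disease-drug pair and rank from most to least likely.
missing = [
    (dis, drug) for dis in adj.index for drug in adj.columns if adj.loc[dis, drug] == 0
]
ranked = sorted(nx.preferential_attachment(G, missing), key=lambda t: t[2], reverse=True)
print(adj)
print(ranked)
```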

Metrics for Evaluation of Link Prediction Methods
There are several measures to evaluate the performance of link prediction methods. The Receiver Operating Characteristic (ROC) curve represents the performance trade-off between true positives (TP) and false positives (FP) at different decision boundary thresholds. The Area Under the ROC curve (AUROC) is the area under the plot of the True Positive Rate (TPR) against the False Positive Rate (FPR). It represents the trade-off between TP and FP prediction rates. The TPR is also known as sensitivity, recall, or probability of detection. AUROC measures the separability of the classifier and is therefore a vital metric (Yang et al., 2015).
FIGURE 1 | Flow diagram showing sequence of steps followed for constructing GDKG, DDKG, and link prediction for drug repurposing to treat muscle atrophy in spaceflight microgravity.
Frontiers in Cell and Developmental Biology | www.frontiersin.org
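As a worked example of the AUROC metric defined in this section, the sketch below computes the ROC curve and its area with scikit-learn on four made-up predictions (two true links, two non-links). The numbers are illustrative and unrelated to the paper's results.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])             # 1 = link exists
y_score = np.array([0.1, 0.4, 0.35, 0.8])   # predicted link probabilities

# FPR and TPR at each decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve, computed two equivalent ways.
auroc = roc_auc_score(y_true, y_score)
auroc_from_curve = auc(fpr, tpr)
print(auroc, auroc_from_curve)
```

Three of the four (positive, negative) pairs are ranked correctly, so the area is 3/4.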

Computational Network Measures
The network measures used to analyze the GDKG and DDKG networks include spectral gap, girth, diameter, and density. Measures computed on the gene nodes and drug nodes are degree distribution, neighborhood connectivity, and subgraph centrality (Biggs, 1993).
Spectral gap: For a graph $G$, the Laplacian eigenvalues can be ordered as $1 = |\lambda_1| \geq |\lambda_2| \geq \cdots \geq |\lambda_n|$ ($G$ may be directed or undirected, weighted or unweighted, simple or not). The spectral gap is defined as $\delta_\lambda = |\lambda_1| - |\lambda_2|$. By normalizing the Laplacian matrix of $G$, the eigenvalues are $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n > 0$, and the Laplacian spectral gap will be $\delta_\lambda = 1 - |\lambda_2|$. The spectral gap is closely tied to random walks on $G$; in terms of this concept, $\lambda_2$ is the most important eigenvalue. Note that if the spectral gap is 0, which means $|\lambda_2| = 1$ [$G$ is not (strongly) connected or $G$ is bipartite], a typical random walk will not converge to a unique distribution or dominant eigenvector. As long as the spectral gap is greater than 0, which means $|\lambda_2| < 1$, the random walk converges to a unique dominant eigenvector, and the spectral gap measures the rate of convergence: the larger the spectral gap (the smaller $|\lambda_2|$), the better the network flow [large $h(G)$, diffusion, mixing, random walk, expansion, sparsity, and other highly desirable properties of the network $G$].
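A numerical sketch of the gap $1 - |\lambda_2|$ is given below. It assumes that the eigenvalues in question are those of the random-walk transition matrix $P = D^{-1}A$ (whose largest eigenvalue has magnitude 1), which is one reading of the normalization described above; the helper name `spectral_gap` is hypothetical, and the graph is assumed to have no isolated nodes.

```python
import networkx as nx
import numpy as np


def spectral_gap(graph):
    """1 - |lambda_2| for the random-walk matrix P = D^{-1} A of an undirected graph."""
    A = nx.to_numpy_array(graph)
    P = A / A.sum(axis=1, keepdims=True)   # row-normalize: each row sums to 1
    magnitudes = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return float(magnitudes[0] - magnitudes[1])


gap_complete = spectral_gap(nx.complete_graph(4))   # well connected, non-bipartite
gap_cycle = spectral_gap(nx.cycle_graph(4))         # bipartite 4-cycle
print(gap_complete, gap_cycle)
```

The complete graph on four nodes has eigenvalues 1 and -1/3, giving a gap of 2/3, while the bipartite 4-cycle has eigenvalues of magnitude 1 at both ends of the spectrum and therefore a gap of 0, matching the non-convergence case described in the text.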
Girth of a graph is the smallest positive integer $r$ such that $\mathrm{Trace}(A^r) > 0$, where $A$ is the adjacency matrix. Let $d = d(G)$ be the smallest integer (if it exists) such that for every pair of vertices $(u, v)$ there is a walk of length at most $d$ from $u$ to $v$. Then $d(G)$ is called the diameter or maximum eccentricity of the graph $G$.
Density of a graph is the ratio between the number of edges and the number of possible edges. Density is a measure of the compactness of a module (subnetwork) and measures the connectivity strength of pairs of genes in the module (Hussain Ahmed et al., 2020).
The clustering coefficient models the degree of clustering of a subset of nodes. A node is selected, and we see how connected its neighbors are with one another. The clustering coefficient is used to characterize network modularity, which is a measure of the strength of a network's division into modules or groups.
The degree of a node is the number of neighbors connected to it; in other words, it is the number of edges incident on the node. The degree distribution can give information about the structure of a network. The networks can be directed or undirected. In the undirected case, the degree of node i is the number of connections it has, which can be obtained from the adjacency matrix as the sum of row i over all nodes. For directed graphs, there are two types of degree distributions: in-degree, which is the number of connections entering the node, and out-degree, which is the number of outgoing connections. In this case, the degree distribution is computed for the genes in the GDKG and for the drugs in the DDKG.
Subgraph centrality of a node is a weighted sum of closed walks of different lengths in the network starting and ending at a node. Centrality measures are used widely in biological networks to infer protein-protein interactions and identify essential proteins (Opsahl et al., 2010).
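Several of the measures in this section are available directly in networkx, as the sketch below shows on a three-node path graph (0 - 1 - 2). The graph is a toy example, and mapping "neighborhood connectivity" to `average_neighbor_degree` is an assumption about the intended definition.

```python
import math
import networkx as nx

G = nx.path_graph(3)   # toy network: 0 - 1 - 2

density = nx.density(G)                          # edges / possible edges
diameter = nx.diameter(G)                        # maximum eccentricity
degree = dict(G.degree())                        # number of neighbors per node
neighbor_conn = nx.average_neighbor_degree(G)    # mean degree of each node's neighbors
centrality = nx.subgraph_centrality(G)           # weighted count of closed walks per node
print(density, diameter, degree, neighbor_conn, centrality)
```

For this path, the middle node takes part in more closed walks than the end nodes, so its subgraph centrality, cosh(√2) ≈ 2.18, exceeds theirs.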

Implementation
The GRN inferencing method using MB is implemented in R. This method is used to construct the GRNs for each of the spaceflight muscle atrophy datasets. The GDKG is constructed using the SPOKE database, and its adjacency matrix is created in MS Excel. The disease-drug adjacency matrix is first constructed in MS Excel after downloading the drugs for each disease from DrugBank. Cytoscape is used to visualize the networks. An exhaustive search with GridSearchCV from the sklearn library is used to estimate the best parameters for the link prediction methods. For the gene-disease link prediction, the parameters chosen are a depth of 15 for the RF method with 500 estimators, and a learning rate of 0.2 for the Gboost method. The GNN is a deep neural network with 10 layers consisting of 100 hidden nodes in each layer; it uses "relu" for activation and the Adam solver. For disease-drug link prediction, the estimated parameters are a depth of 5 for the RF method with 500 estimators, and a learning rate of 0.2 for the Gboost method. The GNN has 10 layers with 100 hidden nodes in each layer, and uses "relu" for activation and the lbfgs solver. GridSearchCV also handles the data splits for cross validation; in our implementation, we have chosen 10-fold cross validation. The computation of network features and graph features is implemented in Python using the libraries networkX, node2vec, pandas, numpy, and sklearn. The implementations are available on GitHub.
The network measurements computed for these genes from the GDKG are given in columns 2-4.
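The exhaustive parameter search with 10-fold cross validation can be sketched with scikit-learn's `GridSearchCV` as follows. The data are synthetic and the grid is deliberately small so that the example runs quickly; the candidate depths 5 and 15 echo the values reported above, while the estimator counts are assumptions and are smaller than the 500 used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the node-pair feature matrix and link labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {
    "max_depth": [5, 15],
    "n_estimators": [20, 50],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=10,                 # 10-fold cross validation, set by the user
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The same pattern applies to the Gboost and neural network models by swapping the estimator and the parameter grid.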

RESULTS
Results of GRN inferencing, knowledge graph construction, and the training and validation of link prediction methods are presented below.

GRN Inferencing and Construction of Knowledge Graphs
The gene expression values corresponding to spaceflight experiments are extracted from the Excel files for the six GeneLab datasets and input to the MB GRN inferencing method. The number of values ranges from three to eight. Figure 2 shows the MB GRN for the GLDS-246 dataset. Table 1 gives the list of the common genes identified from the GRNs that are highly activated due to muscle atrophy in spaceflight microgravity from the GLDS-4, 244, 245, 246, 288, and 289 datasets. Red nodes are genes with higher regulatory activity for muscle atrophy selected for constructing the GDKG. Figure 3 shows the GDKG constructed from the highly activated genes from the six datasets and the SPOKE database. Figure 4 shows the complete DDKG. Table 2 lists the diseases identified from the GDKG. The GDKG in matrix notation is of dimension 299 × 1,195, where 299 is the number of nodes and 1,195 is the number of edges.

Training and Validation of Link Prediction Methods
The Preferential Attachment (PA) method outputs the predicted links from the GDKG and DDKG matrices. These matrices are divided into training and validation sets. The training network is of size 299 × 298 and is input to the random walk network feature extraction method. The input matrix to the three link prediction methods of RF, Gboost, and GNN is a network measure matrix of dimension 2,199 × 100, where 2,199 is the number of pairs of nodes and 100 is the number of random walk features. Overall, the three link prediction methods perform better than the PA method. The output of all the link prediction methods is a matrix of nodes and edges with a "1" indicating a new edge between the node pairs. If an edge does not exist originally or after link prediction, that entry remains a "0." Table 3 ranks the top muscle atrophy gene-disease associations based on a link prediction probability greater than 90% using the GNN. The most common diseases associated with muscle atrophy are cancer, diabetes, and neural diseases. Table 4 lists the commonly used drugs for these diseases. There are about 180 drugs mentioned in the DrugBank database as recommended treatment for the diseases mentioned in Table 2 which overlap with the muscle atrophy condition. Table 5 lists 40 diseases with links to 21 drugs obtained from link prediction with probabilities higher than 80%. Some of these drugs treat more than one disease. Fewer drugs can be selected by choosing a higher threshold for the prediction probability. Table 6 lists the network measures computed for the 21 top ranked drugs in the DDKG. Table 7 lists the network measures for the GDKG and DDKG networks. Table 8 shows the True Positive, True Negative, False Positive, and False Negative counts for each of the link prediction methods for the GDKG and DDKG networks. Figures 5, 6 show the ROC curves for the GDKG and the DDKG link prediction, respectively. As can be seen, the GNN has the highest AUROC, followed by the RF method.
The input graph network features are divided into training and validation sets to evaluate the link prediction methods. A 10-fold cross validation is carried out. Tables 9, 10 summarize the 10-fold cross validation accuracies using the link prediction methods for the GDKG and DDKG networks, respectively. The average accuracies obtained for the gene-disease network link predictions are 93.07, 92.32, and 89.72% for the GNN, RF, and Gboost methods, respectively. The average accuracies obtained for the disease-drug network link predictions are 92.11, 92.63, and 91.62% for the GNN, RF, and Gboost methods, respectively. Overall, the GNN has the highest accuracy of 92.59%, followed by 92.48 and 90.67% for the RF and Gboost methods, respectively. The preferential attachment based link prediction gives an average accuracy of 83.92 and 67.06% for gene-disease and disease-drug link prediction, respectively. Here, we have combined the analysis of the six GeneLab GLDS datasets related to organ muscle atrophy in spaceflight. This is more advantageous than analyzing them individually, as it reduces the space and time complexity of processing. The three methods of RF, Gradient boosting, and GNN perform comparably well, with the GNN showing a slightly higher overall accuracy.
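The overall accuracies quoted above are consistent with a simple average of the gene-disease and disease-drug accuracies for each method. The source does not state how the overall figures were obtained, so the averaging below is an inference checked against the reported numbers:

```latex
\begin{aligned}
\text{GNN:}\quad    & \tfrac{1}{2}\,(93.07 + 92.11) = 92.59\% \\
\text{RF:}\quad     & \tfrac{1}{2}\,(92.32 + 92.63) = 92.475\% \approx 92.48\% \\
\text{Gboost:}\quad & \tfrac{1}{2}\,(89.72 + 91.62) = 90.67\%
\end{aligned}
```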

DISCUSSION
The shared key genes with maximal differential regulation from the Markov Blanket GRN of all six GeneLab datasets are given in Table 1. Figure 3 shows the GDKG constructed using the top regulated genes from the six GeneLab datasets and the SPOKE database. The red nodes represent the genes, and the blue nodes represent diseases. Table 2 lists the disease nodes present in Figure 3. Table 3 lists 15 new gene-disease associations predicted by the GNN link prediction method. Several differentially regulated genes result in reduced proliferation of thymic cells, thereby reducing the size of the thymus (Horie et al., 2019a). Of these, ATF3 is a key gene identified in Table 1. This gene encodes a member of the mammalian activation transcription factor family and is induced by a variety of signals, including many of those encountered by cancer cells. It is involved in the complex process of cellular stress response. This gene has 15 additional links predicted by the GNN. PTEN, an important gene that suppresses the growth of cells into tumors, has been identified as a key gene in the GDKG. This gene is found to regulate muscle protein degradation in diabetes (Hu et al., 2007). In the GDKG network this gene has 33 existing links, and 29 new links to existing diseases are predicted. Tumor Necrosis Factor (TNF) is one of the most important muscle-wasting cytokines, elevated levels of which cause significant muscular abnormalities (Bhatnagar et al., 2010). The protein encoded by TNFRSF19 is a member of the TNF-receptor family. When overexpressed, it activates the JNK signaling pathway. The diseases associated with this gene are ovarian cancer and ectodermal dysplasia (Dostert et al., 2019). This gene originally had nine links in the GDKG, and eleven new links were added by the GNN link prediction method, implying the importance of this gene in muscle atrophy prognosis in spaceflight. The  (Wu et al., 2020). 
This gene has eight existing links, and eight links have been added by the link prediction method, showing the importance of this gene in spaceflight induced muscle atrophy. The EEF1B2 gene encodes a translation elongation factor specifically expressed in neurons and muscles (Doig et al., 2013). The protein is a guanine nucleotide exchange factor involved in the transfer of aminoacylated tRNAs to the ribosome. Diseases associated with EEF1B2 are seizures, alacrima, achalasia, and intellectual instability syndrome. This gene has seven existing links in the GDKG and 17 new predicted links. Apart from these five key genes, there are 10 more listed in Table 3. The network measures for these 15 genes are listed in Table 1. Compared with the other genes in Table 1, these 15 genes, which have a higher number of predicted links, also have higher values of the degree distribution, neighborhood connectivity, and subgraph centrality network measures. These genes also have link prediction probabilities greater than 90%. The diseases associated with these genes are cancer, diabetes, and neurological disorders, most of which have muscle atrophy as a side effect. Prolonged exposure to spaceflight may increase the risk of developing these diseases. Hence, preventive medicine and therapeutics are key in warding off these conditions.
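Two of the node-level measures discussed above can be illustrated with a short Python sketch. Here the degree is the number of neighbors of a node, and the neighborhood connectivity is taken to be the average degree of those neighbors, a common definition in network analysis tools. This definition is an assumption, since the text does not state which implementation was used, and the toy edge list and labels are hypothetical.

```python
def degree_and_neighborhood_connectivity(edges):
    """Return (degree, neighborhood_connectivity) dicts for an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    # Neighborhood connectivity: mean degree over a node's neighbors.
    connectivity = {
        node: sum(degree[m] for m in nbrs) / len(nbrs)
        for node, nbrs in adj.items()
    }
    return degree, connectivity


# Hypothetical toy gene-disease graph (not taken from the GDKG).
toy_edges = [("g1", "d1"), ("g1", "d2"), ("g2", "d1")]
deg, nconn = degree_and_neighborhood_connectivity(toy_edges)
```

In this toy graph, `g1` has degree 2 and its neighbors `d1` and `d2` have degrees 2 and 1, giving a neighborhood connectivity of 1.5.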

Implications for Spaceflight
Several spaceflight experiments have shown that changes in the physical environment modulate cellular responses, thus accelerating the risk of age-related conditions such as bone loss, muscle atrophy, and impaired immune responses (Versari et al., 2013; Cadena et al., 2019). Investigations on muscle atrophy in organs and tissues, including cutaneous muscles, in rodent and human models have been conducted in spaceflight for over a decade (Däpp et al., 2004; Neutelings et al., 2015; Goropashnaya et al., 2020). There are about 20 datasets available in GeneLab on muscle atrophy investigations on animal models in spaceflight (NASA Gene Lab data repository, n.d.).
Formoterol is the only drug tested so far in spaceflight to mitigate muscle atrophy in mice (Ballerini et al., 2020). While experimental drug repurposing and clinical testing are prolonged and expensive, our proposed network science and artificial intelligence framework is computationally inexpensive and can be used for the rapid selection of candidate drugs to treat muscle atrophy in spaceflight. As muscle atrophy is a condition caused by many terrestrial diseases, the medications prescribed for these diseases can be useful candidates for repurposing for muscle atrophy. Hence, we constructed the GDKG for muscle atrophy to determine the diseases that have muscle atrophy as a primary side effect, and performed link prediction to identify the drugs that treat these diseases and can be repurposed for treating muscle atrophy. Figure 4 shows the DDKG constructed from the top ranked gene-disease associations from the GDKG and the drugs used in treating these diseases. The blue nodes represent diseases, and the purple nodes represent the drugs. Table 4 lists the drugs from the network in Figure 4. The three link prediction algorithms are applied to the DDKG to identify possible drugs for muscle atrophy treatment. Table 5 lists the drugs with probabilities higher than 80%. These drugs are used for treating conditions that have muscle atrophy as a severe side effect, such as cancer, diabetes, and nervous system disorders. For example, antidiabetic agents such as metformin, incretins, vitamin D, and formoterol are medications that can reduce muscle wastage while treating diabetes (Campins et al., 2017). Indeed, the GeneLab datasets GLDS-244 and GLDS-245 were collected to evaluate the efficacy of the drug formoterol in treating muscle atrophy in mice flown in spaceflight (Ballerini et al., 2020). Muscle loss is also present in Chronic Obstructive Pulmonary Disease (COPD). The medication bimagrumab, which treats COPD, also resulted in an increase in thigh muscle volume.
FIGURE 5 | Receiver Operating Characteristic curves for link prediction between genes differentially regulated in muscle atrophy and diseases in the GDKG using PA, RF, Gboost, and GNN methods.
FIGURE 6 | Receiver Operating Characteristic curves for link prediction between muscle atrophy related diseases and drugs used for their treatment in the DDKG using PA, RF, Gboost, and GNN methods.
By constructing the DDKG and applying link prediction, we have identified drugs belonging to the Monoclonal AntiBodies (MABs) family that are used for treating cancer as promising candidates for muscle atrophy in spaceflight. These include adalimumab, arcitumomab, certolizumab, golimumab, and infliximab. Table 5 lists the probabilities for these drugs as well as others that treat cancer and other diseases. Because some drugs treat more than one disease, a drug may appear several times in Table 5. In total, there are 21 drugs that have higher probabilities for predicted links. The network measures for these drug nodes in the DDKG network are listed in Table 6. As can be seen, all of these drugs have similar values for degree distribution and have a neighborhood connectivity between 9 and 10. The drugs with the highest measures for degree distribution, neighborhood connectivity, and subgraph centrality are Nimodipine, Arcitumomab, Selegiline, Tetracycline, and Loteprednol. Arcitumomab, L-Arginine, L-Ornithine, and Nimodipine are used for treating cancer and muscle disuse. Selegiline is used for treating cardiovascular diseases. Most of the 21 drugs that can be repurposed for muscle atrophy treat some type of cancer. The repurposing of a drug to treat muscle atrophy is limited by the drug database, as the condition itself is secondary to diseases that have no cures. The selection of drugs to treat muscle atrophy in spaceflight could be based on those that can provide clear cures and can be effectively repurposed. Table 7 lists the network measures of girth, density, and spectral gap for the GDKG and DDKG networks. As can be seen from these measures, the GDKG network has the higher spectral gap, 9.015. The larger the spectral gap (the smaller |λ₂|), the higher the network flow with sparseness, expansion, diffusion, and random walk. Hence, these networks have a higher measure of random walks, implying that nodes that lie closer to each other in the network perform similar functions. The advantage of using networks and AI methods for drug repurposing is that the graphs themselves are scalable and can include more gene, disease, and drug nodes, and the deep learning architecture can be built to handle the corresponding large scale prediction problems.
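Two of the graph-level measures reported in Table 7, girth and density, can be sketched in plain Python as follows. Density is taken as 2m / (n(n - 1)) for an undirected simple graph with n nodes and m edges, and girth as the length of the shortest cycle, found here by a breadth-first search from every node. These are the textbook definitions and are assumed, not confirmed, to match those used for Table 7. The spectral gap is omitted because it requires an eigenvalue solver and the normalization used is not stated. The toy graph and all function names are hypothetical.

```python
from collections import deque


def build_adjacency(edges):
    """Adjacency sets for an undirected simple graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj):
    """Fraction of possible edges that are present: 2m / (n(n - 1))."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def girth(adj):
    """Length of the shortest cycle, or infinity if the graph is acyclic."""
    best = float("inf")
    for src in adj:
        dist, parent = {src: 0}, {src: None}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v], parent[v] = dist[u] + 1, u
                    queue.append(v)
                elif parent[u] != v:  # non-tree edge closes a cycle
                    best = min(best, dist[u] + dist[v] + 1)
    return best


# Hypothetical toy bipartite graph: a 4-cycle g1-d1-g2-d2-g1.
toy_adj = build_adjacency([("g1", "d1"), ("d1", "g2"), ("g2", "d2"), ("d2", "g1")])
toy_density = density(toy_adj)
toy_girth = girth(toy_adj)
```

A graph whose edges connect only two disjoint node sets (for example, genes and diseases) contains no odd cycles, so its girth, when finite, is an even number of at least 4.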
The network science approach and the AI based tool can be used to predict key targets and potential diseases arising from spaceflight missions and will facilitate countermeasure development. Table 1 lists the highly activated genes from the spaceflight mice muscle atrophy datasets. These genes are involved in protein amino acid binding, glycoprotein binding, cell growth and/or maintenance, and cell adhesion receptor inhibitor activity. These genes are part of cellular metabolic pathways by which individual cells transform chemical substances, and of pathways involving organic or inorganic compounds that contain nitrogen. They are also involved in chemical reactions and pathways involving an organic substance (any molecular entity containing carbon), and in chemical reactions and pathways involving those compounds that are formed as a part of the normal anabolic and catabolic processes. Some of these genes are involved in organ system processes carried out by the organs or tissues of the neurological system. The 15 key genes with the highest number of newly predicted links are given in Table 3, and their associated diseases from GeneCards (2021) are also given here.

Key Genes Description
As can be seen, half of these genes are associated with some type of cancer, followed by diabetes.

CONCLUSION
We have presented a novel method for generating GDKGs for a particular disease from gene expression datasets using network analysis and the SPOKE database. In this research, we have worked with transcriptional gene expression datasets for muscle atrophy in mice flown in spaceflight microgravity. Link prediction applied to this network reveals interesting relationships of key genes with different types of cancer. The link prediction method is also used on the Disease Drug Knowledge Graph resulting in the identification of novel drugs that are possible candidates for treating muscle atrophy accelerated due to spaceflight travel. We have combined six GeneLab datasets in an innovative way with disease and drug databases and applied network analysis and artificial intelligence methods for drug repurposing.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: genelab.nasa.gov.