NNAN: Nearest Neighbor Attention Network to Predict Drug–Microbe Associations

Many drugs can be metabolized by human microbes; the resulting drug metabolites can significantly alter pharmacological effects and lead to low therapeutic efficacy for patients. Hence, it is crucial to identify potential drug–microbe associations (DMAs) before drug administration. Nevertheless, traditional experimental DMA determination cannot be applied at scale because of the tremendous number of microbe species, its high cost, and the time it requires. Thus, predicting possible DMAs computationally is an essential topic. Inspired by the success of deep learning on related problems, we designed a deep learning-based model named Nearest Neighbor Attention Network (NNAN). The proposed model consists of four components, namely, a similarity network constructor, a nearest-neighbor aggregator, a feature attention block, and a predictor. In brief, the similarity network constructor builds a microbe similarity network and a drug similarity network. The nearest-neighbor aggregator generates the embedding representations of drug–microbe pairs by integrating the drug neighbors and microbe neighbors of each drug–microbe pair in the network. The feature attention block evaluates the importance of each dimension of the drug–microbe pair embedding with a set of ordinary multi-layer neural networks. The predictor is an ordinary fully-connected deep neural network that functions as a binary classifier to distinguish potential DMAs among unlabeled drug–microbe pairs. Several experiments on two benchmark databases are performed to evaluate the performance of NNAN. First, the comparison with state-of-the-art baseline approaches under cross-validation demonstrates the superiority of NNAN in terms of predictive performance. Moreover, the interpretability inspection reveals that a drug tends to associate with a microbe if its top-l most similar neighbors associate with that microbe.


INTRODUCTION
The human microbiome refers to all the microbes associated with a human body, including bacteriophages, archaea, bacteria, eukaryotes, and fungi (Lynch and Pedersen, 2016). To assess the diversity and functions of the human microbiome, the Human Microbiome Project (HMP) was supported by the National Institutes of Health (NIH) from 2007 to 2016 (Turnbaugh et al., 2007).
HMP provided a complete description of the microbiome in five tissues of the human body, including skin, gut, nostrils, vagina, and mouth (Aagaard et al., 2013). The close associations between human microbes and human health have been verified by cell experiments, animal experiments, epidemiological studies, and clinical case studies (Schwabe and Jobin, 2013; Lynch and Pedersen, 2016). Previous works have revealed that abnormal microbe communities lead to metabolic disorders [e.g., non-alcoholic fatty liver disease (Younossi et al., 2016), obesity, and diabetes mellitus (Jaacks et al., 2019; Zheng et al., 2018)]. Oral drug administration is a typical treatment. Many drugs, however, can be metabolized by human microbes, and the drug metabolites can significantly alter pharmacological effects and result in low therapeutic efficacy for patients. For example, modification by gut microbes can activate a compound [e.g., salicylazosulfapyridine (Sousa et al., 2014)], inactivate it [e.g., inactivation of the cardiac drug digoxin by the intestinal actinomycete Eggerthella lenta (Haiser et al., 2013)], or induce toxicity [e.g., 70% of the toxicity of Brivudine may be attributed to intestinal microorganisms (Zimmermann et al., 2019b)]. The persistent findings of microbiome-induced individual pathogenesis, phenotypes, and treatment responses have made the microbiome an integral part of precision medicine (Kashyap et al., 2017). Therefore, drug-microbe association (DMA) prediction is of great significance for therapy and medicine development. However, acquiring DMAs experimentally requires large-scale assays that are costly, inefficient, time-consuming, and limited by culturing constraints.
To identify DMAs rapidly and effectively, machine learning methods, especially deep learning-based methods, have attracted many scientists due to their inspiring applications in other areas [e.g., predicting microbe-disease associations (He et al., 2018; Peng et al., 2018), drug-drug interactions (Yu et al., 2021a), lncRNA-miRNA interactions (Zhang L. et al., 2021), and lncRNA-protein interactions (Lihong et al., 2021; Zhou et al., 2021)].
In recent years, researchers have applied the Graph Attention Network [GAT (Velickovic et al., 2018)] to bioinformatics with remarkable results. For instance, Zhang Z. et al. (2021) used fragments containing functional groups to represent molecular maps for molecular property prediction through a fragment-oriented multi-scale graph attention model. Bang et al. (2021) predicted polypharmacy side effects with enhanced interpretability based on a graph feature attention network. Constructing a bipartite network is the most popular approach to represent associations between two types of nodes. The prediction problem of DMA can then be transformed into a link prediction problem in a bipartite graph network. However, few models predict DMAs through bipartite graph networks. For example, EGATMDA (Long et al., 2020b) used the drug-disease-microbe perspective to predict DMAs, which does not capture a direct relationship between drugs and microbes and may introduce noise. HMDAKATZ (Zhu et al., 2019) predicted the interactions between drugs and microbes based on the Katz measure (Katz, 1953); the way this method transmits node information (i.e., a node with a high centrality value transmits its high influence to all its neighbors) may not be appropriate in real life. GCNMDA (Long et al., 2020a) used GCN, random walk with restart, and GAT to learn node features, and it relies on the "step size" parameter of the random walk with restart algorithm. HNERMDA (Long and Luo, 2020) learned the drug-microbe heterogeneous network information with metapath2vec, which considers the type of nodes in the meta-path-based random walk, although its skip-gram does not treat them differently during training.
In the field of drug-target interaction prediction, there is a widely accepted assumption that structurally similar drugs tend to interact with the same target (Khalili et al., 2012). Analogously, we anticipate that if a drug ($d_x$) can associate with a microbe ($b_p$), the other drugs associated with the same microbe ($b_p$) are usually among the first l nearest neighbors of the drug ($d_x$). Therefore, we propose a new model, Nearest Neighbor Attention Network (NNAN), which aggregates the information from nodes' neighbors according to their entity types and maps them into a unified embedding space for further predicting potential DMAs. The comparison with state-of-the-art methods on two different databases demonstrates the superiority of our NNAN. Moreover, its interpretability is illustrated and validates our assumption. Finally, the case study assesses its ability to find potential associations between drugs and microbes. In general, our contributions are as follows:
• We make use of three networks: a drug-drug similarity network, a microbe-microbe similarity network, and a drug-microbe bipartite graph network. We imitate the idea of KNN [K-Nearest-Neighbor (Cover and Hart, 1967)] to learn the substructures of the bipartite graph network, which can improve the accuracy of link prediction.
• We follow the idea of GAT and use multiple DNNs to learn the weights of embedding features to improve the screening efficiency of potential associations.
• In a quantitative way, we verify the hypothesis that "If a drug can associate with a microbe, the other drugs that associate with the microbe are usually the first l nearest neighbors to the drug."

MATERIALS AND METHODS
In this section, we describe NNAN, a model for predicting DMAs in a bipartite graph network (Figure 1). It consists of four components: a similarity network constructor, a nearest-neighbor aggregator, a feature attention block, and a predictor. Firstly, the similarity network constructor builds a drug similarity network and a microbe similarity network (see sections "Drug Similarity Network" and "Microbe Similarity Network" for details). Secondly, the nearest-neighbor aggregator generates the embedding representations of drug-microbe pairs by integrating the drug neighbors and microbe neighbors of each drug-microbe pair in the network (see section "Nearest-Neighbor Aggregator for Drug-Microbe Pair Embeddings" for details). Thirdly, the feature attention block evaluates the importance of each dimension of the drug-microbe pair embedding with a set of ordinary multi-layer neural networks (see section "Feature Attention Block" for details). Finally, we make use of a fully-connected deep neural network as a binary classifier to predict potential DMAs (see section "Predictor" for details).

Drug Similarity Network
We calculate drug similarities by the following steps. First, drugs are represented by Functional-Class Fingerprints [FCFPs (Rogers and Hahn, 2010)], a generalized version of Extended-Connectivity Fingerprints [ECFPs (Rogers and Hahn, 2010)] that pays more attention to atom functions. The FCFPs are computed with RDKit (Landrum, 2010). Second, the similarity between drug $d_i$ and drug $d_j$ is calculated by the Tanimoto coefficient (Rogers and Tanimoto, 1960) as follows:
$$S_{d_i,d_j} = \frac{f_{d_i} \cdot f_{d_j}}{\|f_{d_i}\|^2 + \|f_{d_j}\|^2 - f_{d_i} \cdot f_{d_j}}$$
where $f_{d_i}$ and $f_{d_j}$ represent the FCFPs vectors of drug $d_i$ and drug $d_j$, respectively, and $\|\cdot\|$ indicates the norm of the vector.
Fingerprint similarity provides intuitive results (i.e., why two molecules have been determined to be similar), but this transparency tends to vanish completely when molecular fingerprints are used as input to machine learning models. Inspired by the similarity maps (Riniker and Landrum, 2013), we calculate the contribution of each atom to the similarity between two molecules. To make it easier to distinguish the drugs, we regard $d_i$ as a reference drug, $d_j$ as a comparison drug, and $S_{d_i,d_j}$ as the base similarity of this drug pair. RDKit automatically numbers each atom of the comparison drug $d_j$ with an index in $K = \{0, 1, \ldots, t-1\}$. Then, we remove the atoms of the comparison drug one by one in the order of their atom indices to form multiple new comparison drugs ($d_j^k$, $k \in K$). We calculate the new similarity $S_{d_i,d_j^k}$ between the reference drug ($d_i$) and each new comparison drug ($d_j^k$), and regard the difference between the new similarity and the base similarity as the weight ($w_j^k$) of the removed atom.
FIGURE 3 | Feature attention block. The representation matrix $E_{k \times g}$ is input into a set of DNNs to obtain an attention matrix $M_{k \times g}$ of drug-microbe embedding features. After the element-wise product operation of $M_{k \times g}$ and $E_{k \times g}$, the final feature matrix $F_{k \times g}$ of the drug-microbe pairs is obtained.
We set the dimension of the FCFPs vector to 1,024 bits, of which the non-zero bits indicate the occurrences of drug feature substructures. To obtain the weight of each non-zero bit, we add up the weights of all the atoms contained in the feature substructure:
$$w_{bit}^q = \mathrm{SUM}_q\left(w_j^k\right)$$
where $w_{bit}^q$ denotes the weight of the $q$th dimensional bit of the FCFPs vector, and the function $\mathrm{SUM}_q(\cdot)$ denotes the sum of all the atomic weights contained in the feature substructure represented by the $q$th dimensional bit of the FCFPs.
Then, the weighted Tanimoto similarity (Ioffe, 2010) $S^d_{d_i,d_j}$ between the reference drug and the comparison drug is calculated from the bit weights $w_{bit}^q$ and the bit values $f_{d_i}^q$ and $f_{d_j}^q$, which denote the $q$th dimension of the FCFPs vectors for the reference drug and the comparison drug.
Based on drug similarities, we can build a drug similarity network $Net_d$, where nodes are drugs. Two drugs are connected by an edge if they associate with the same microbe; each edge is weighted by the corresponding drug similarity.
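The similarity computations described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: fingerprints are plain 0/1 lists rather than RDKit FCFPs, the atom weight follows the base-minus-new sign convention used by RDKit similarity maps, and the weighted Tanimoto is written in the weighted-Jaccard form (sum of weighted minima over sum of weighted maxima), which is one common reading of the weighted variant. All function names are hypothetical.

```python
def tanimoto(f_i, f_j):
    """Tanimoto coefficient of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(f_i, f_j))
    norm_i = sum(a * a for a in f_i)
    norm_j = sum(b * b for b in f_j)
    denom = norm_i + norm_j - dot
    return dot / denom if denom else 0.0


def atom_weight(base_similarity, new_similarity):
    """Weight of a removed atom (assumed sign convention: base minus new)."""
    return base_similarity - new_similarity


def weighted_tanimoto(f_i, f_j, bit_weights):
    """Weighted Tanimoto, assuming non-negative per-bit weights."""
    num = sum(w * min(a, b) for w, a, b in zip(bit_weights, f_i, f_j))
    den = sum(w * max(a, b) for w, a, b in zip(bit_weights, f_i, f_j))
    return num / den if den else 0.0


# Toy 8-bit fingerprints standing in for 1,024-bit FCFPs.
f_ref = [1, 1, 0, 1, 0, 0, 1, 0]
f_cmp = [1, 0, 0, 1, 0, 1, 1, 0]
base = tanimoto(f_ref, f_cmp)

# With uniform weights the weighted form reduces to the plain coefficient.
uniform = weighted_tanimoto(f_ref, f_cmp, [1.0] * 8)

# Up-weighting the shared bits (positions 0, 3, 6) raises the similarity.
emphasized = weighted_tanimoto(f_ref, f_cmp, [2, 1, 1, 2, 1, 1, 2, 1])
```

With uniform weights the two measures coincide, so the bit weights only change the ranking of neighbors when some substructures matter more than others.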

Microbe Similarity Network
To calculate microbe similarities, we use BLAST (Altschul et al., 1990) to make pairwise alignments of microbial genomes. Specifically, the main function of BLAST is to discover local similarity regions between sequences and then use the local sequence alignment algorithm (Smith and Waterman, 1981) to calculate the similarity. For example, $G_A = g_A^1 g_A^2 \ldots g_A^n$ and $G_B = g_B^1 g_B^2 \ldots g_B^m$ are the genome sequences of microbe A and microbe B, where n and m are the lengths of sequences $G_A$ and $G_B$, respectively. BLAST creates the scoring matrix $H_{(n+1) \times (m+1)}$ and sets the elements of its first row and first column to zero. Each remaining element $H_{ij}$ ($i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, m$) follows the Smith-Waterman recurrence, written here in its common linear-gap form:
$$H_{ij} = \max\left\{0,\; H_{i-1,j-1} + s\left(g_A^i, g_B^j\right),\; H_{i-1,j} - W,\; H_{i,j-1} - W\right\}$$
where $s(\cdot,\cdot)$ is the substitution score for aligning two residues and $W$ is the gap penalty. The highest value in the matrix $H_{(n+1) \times (m+1)}$ is chosen as $sw(G_A, G_B)$. The similarity between microbes A and B adopts the same definition as Yamanishi et al. (2008), as follows:
$$S^b_{A,B} = \frac{sw(G_A, G_B)}{\sqrt{sw(G_A, G_A)}\,\sqrt{sw(G_B, G_B)}}$$
Based on microbe similarities, we can build a microbe similarity network $Net_b$, where nodes are microbes. Two microbes are connected by an edge if they associate with the same drug; each edge is weighted by the corresponding microbe similarity.
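The local alignment score and its normalization can be illustrated as follows. This is a simplified sketch rather than the BLAST pipeline itself: it fills the full scoring matrix with a linear gap penalty, and the match, mismatch, and gap values are arbitrary assumptions chosen only for the example.

```python
import math


def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=1):
    """Best local alignment score (highest cell of the scoring matrix H)."""
    n, m = len(seq_a), len(seq_b)
    # H has an extra first row and first column, both filled with zeros.
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + s,   # align the two residues
                          h[i - 1][j] - gap,     # gap in seq_b
                          h[i][j - 1] - gap)     # gap in seq_a
            best = max(best, h[i][j])
    return best


def normalized_similarity(seq_a, seq_b):
    """sw(A, B) divided by sqrt(sw(A, A) * sw(B, B))."""
    denom = math.sqrt(smith_waterman(seq_a, seq_a) * smith_waterman(seq_b, seq_b))
    return smith_waterman(seq_a, seq_b) / denom if denom else 0.0
```

Dividing by the two self-alignment scores makes the similarity of a sequence with itself equal to 1, so scores from genomes of different lengths become comparable.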

Nearest-Neighbor Aggregator for Drug-Microbe Pair Embeddings
In this section, inspired by the idea of KNN [K-Nearest-Neighbor (Cover and Hart, 1967)], we learn the substructures of the bipartite graph network to obtain the embedding representations of drug-microbe pairs. First, we construct the drug-microbe bipartite graph network $G = (D, B, E)$, where $D = \{d_1, d_2, \ldots, d_m\}$ represents m drugs, $B = \{b_1, b_2, \ldots, b_n\}$ represents n microbes, and each edge ($e_{ij}$) in the edge set E connects two nodes that belong to the two different sets of vertexes (i.e., i in D, j in B). We regard the DMAs as bidirectional links. That is, $e_{d_x \to b_p}$ denotes the edge pointing from the drug $d_x$ to the microbe $b_p$, and $e_{b_p \to d_x}$ denotes the edge pointing from the microbe $b_p$ to the drug $d_x$. Correspondingly, the nearest-neighbor aggregator contains two blocks (Figure 2): the microbe-specific drug neighbor aggregator (MsDNA) and the drug-specific microbe neighbor aggregator (DsMNA). Because their architectures are similar, we only illustrate the MsDNA block in this section.
The microbe-specific drug neighbor aggregator (Figure 2A) contains a virtual key dictionary $N = \{n_1, n_2, \ldots, n_m\}$ that indicates all the drugs. In the dictionary, we imitate the idea of KNN to learn the substructures of the bipartite graph network, where virtual keys are sorted by their semantic nearest neighbors. In simple terms, $n_1$ denotes $d_x$ itself, its nearest neighbor is the second key, and its farthest neighbor is the last key. The embedding representation of the edge from drug $d_x$ to microbe $b_p$ is formulated as follows:
$$a(d_x, b_p) = \sum_{n_i \in N_p} S^d_{d_x,n_i}\, v_i$$
where $N_p \subseteq N$ is a set of instantiated keywords, and $N_p$ denotes the neighbors of $d_x$ in $Net_d$. $S^d_{d_x,n_i}$ denotes the similarity of $d_x$ and $n_i$, and $v_i$ is the corresponding one-hot encoding vector of $n_i$ (i.e., the one-hot encoding has a non-zero value only in the $i$th element, and all other elements are zero).
Similarly, DsMNA (Figure 2B) produces the single directional embedding representation from $b_p$ to $d_x$ as $a(b_p, d_x)$. Then, the representation of a drug-microbe pair is encoded as
$$e_{d_x,b_p} = a(d_x, b_p) \oplus a(b_p, d_x)$$
where $e_{d_x,b_p}$ is generated via the concatenation of the bidirectional embeddings, and $\oplus$ is the concatenation operation. All the embedding representations of drug-microbe pairs are stacked as a matrix $E_{k \times g}$, where k is the number of all the drug-microbe pairs and g is the dimension of each embedding. The nearest-neighbor aggregator effectively learns the bipartite graph substructures, and $E_{k \times g}$ is input into a feature attention block to select crucial features for achieving a better DMA prediction.
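The construction of a pair embedding can be sketched as below. The sketch assumes that a key is instantiated when the corresponding entity is linked to the partner node in the bipartite graph, and that an entity is not counted as its own neighbor; both are our reading of the description rather than confirmed details. The function names, the toy similarity matrices, and the association dictionaries are hypothetical.

```python
def neighbor_embedding(x, sim_row, associated):
    """Similarity-sorted embedding of entity x with respect to one partner.

    sim_row[i] is the similarity between x and entity i (same type as x);
    `associated` is the set of same-type entities linked to the partner.
    """
    # Key dictionary: x itself first, then the others by descending similarity.
    others = sorted((i for i in range(len(sim_row)) if i != x),
                    key=lambda i: sim_row[i], reverse=True)
    keys = [x] + others
    # Position i holds the similarity if key n_i is instantiated, else 0.
    # Assumption: x itself is never treated as an instantiated neighbor.
    return [sim_row[k] if (k in associated and k != x) else 0.0 for k in keys]


def pair_embedding(d_x, b_p, drug_sim, microbe_sim, drugs_of, microbes_of):
    """Concatenate the drug-side (MsDNA) and microbe-side (DsMNA) embeddings."""
    a_db = neighbor_embedding(d_x, drug_sim[d_x], drugs_of[b_p])
    a_bd = neighbor_embedding(b_p, microbe_sim[b_p], microbes_of[d_x])
    return a_db + a_bd


# Toy data: 4 drugs, 3 microbes.
drug_sim = [[1.0, 0.8, 0.3, 0.5],
            [0.8, 1.0, 0.4, 0.2],
            [0.3, 0.4, 1.0, 0.6],
            [0.5, 0.2, 0.6, 1.0]]
microbe_sim = [[1.0, 0.7, 0.1],
               [0.7, 1.0, 0.4],
               [0.1, 0.4, 1.0]]
drugs_of = {0: {0, 1}, 1: {1, 3}, 2: {2}}        # microbe -> associated drugs
microbes_of = {0: {0}, 1: {0, 1}, 2: {2}, 3: {1}}  # drug -> associated microbes

# Unlabeled pair (drug 0, microbe 1): its two nearest drug neighbors
# (drugs 1 and 3) are both linked to microbe 1.
e = pair_embedding(0, 1, drug_sim, microbe_sim, drugs_of, microbes_of)
```

In this toy pair the non-zero cells fall on the leading positions of both halves of the embedding, which is the pattern the hypothesis expects for a likely association.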

Feature Attention Block
To improve the performance of the prediction, we build the feature attention block (Figure 3) for updating the embedding of drug-microbe pairs. Recall the equation of the output feature representation in GAT (Velickovic et al., 2018):
$$h_i' = \sigma\Big(\sum_{j \in K_i} \alpha_{ij} W h_j\Big) \tag{9}$$
where $\sigma$ is a nonlinear activation function, $K_i$ is the set of first-order neighbors of node i (including i), $\alpha_{ij}$ are the coefficients computed by the attention mechanism, $W$ is a weight matrix, and $h_j$ is the input feature of node j. To make equation (9) easier to understand, we write the coefficients in matrix form as:
$$[\alpha_{ij}] = \tilde{A} \odot M$$
where $\tilde{A} = A + I$ is the adjacency matrix of the undirected graph G with added self-connections (Kipf and Welling, 2017), $\odot$ is the element-wise product operation, and M is the attention matrix. Then, the layer-wise propagation rule in GAT can be formulated as:
$$H^{(l+1)} = \sigma\big((\tilde{A} \odot M)\, H^{(l)} W^{(l)}\big)$$
where $\sigma$ is a nonlinear activation function, $H^{(l)}$ is the matrix of node features at the $l$th layer, and $W^{(l)}$ is the weight matrix of the $l$th neural network layer. Inspired by the layer-wise propagation rule in GAT, we calculate the augmented representation matrix $F_{k \times g}$ by
$$F_{k \times g} = M_{k \times g} \odot E_{k \times g}$$
where $E_{k \times g}$ is the representation matrix of the drug-microbe pairs obtained from the nearest-neighbor aggregator, $M_{k \times g}$ is an attention matrix of $E_{k \times g}$, and $\odot$ is the element-wise product operation. We regard the representation matrix $E_{k \times g}$ as a set of features $\mathcal{F} = \{f_1, f_2, \ldots, f_g\}$ composed of g column vectors $f_i$ ($i = 1, 2, \ldots, g$). The feature attention block mainly uses $M_{k \times g}$ to indicate the importance of the features in $E_{k \times g}$. Each feature dimension $f_i$ can be labeled as "selected" or "discarded" in a hard way, or be associated with a probability of being selected in a soft way; we employ DNNs to model this mapping. Each DNN contains an input layer for each element of the feature dimension $f_i$ and an output layer with sigmoid as its activation function.
In total, we build k × g DNNs to obtain $M_{k \times g}$. The final feature matrix $F_{k \times g}$ of the drug-microbe pairs is obtained after the element-wise product operation of $M_{k \times g}$ and $E_{k \times g}$. $F_{k \times g}$ is further fed into a predictor to achieve better predictive performance.
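The gating step of the feature attention block can be illustrated with a deliberately small stand-in. The sketch below replaces each DNN with a single sigmoid neuron shared per feature column, which is a simplifying assumption and not the network described above; the weights and biases would normally be learned, and all names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def attention_matrix(E, weights, biases):
    """M[r][c] = sigmoid(w_c * E[r][c] + b_c), a one-neuron stand-in per feature."""
    return [[sigmoid(w * e + b) for e, w, b in zip(row, weights, biases)]
            for row in E]


def apply_attention(E, M):
    """Element-wise (Hadamard) product F = M * E."""
    return [[m * e for m, e in zip(m_row, e_row)]
            for m_row, e_row in zip(M, E)]


# Two drug-microbe pairs with two embedding features each.
E = [[0.0, 0.8],
     [0.5, 0.0]]
M = attention_matrix(E, weights=[0.0, 0.0], biases=[0.0, 0.0])
F = apply_attention(E, M)
```

Because every entry of M lies strictly between 0 and 1, the product rescales each embedding feature without changing its sign, and zero cells of E stay zero in F.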

Predictor
To implement the link prediction in the drug-microbe bipartite graph network, an ordinary DNN is utilized as the binary predictor. It contains an input layer for the embedding representation of drug-microbe pairs, a hidden layer with ReLU as its activation function, and a two-neuron output layer with sigmoid as its activation function. The output layer generates a probability that indicates the association likelihood of the drug and the microbe; this probability is obtained by applying the sigmoid activation function ϕ to the output of the fully-connected layer F(·). The entire network of NNAN, with the nearest-neighbor aggregator, feature attention weights, and DNN weights, can be jointly optimized by minimizing the binary cross-entropy loss between the truth labels Y of the drug-microbe pairs and the output of the DNN D(·), plus a regularization term λR(θ), where θ denotes the weight parameters in the entire network, R(·) is an L2-norm, and λ is the coefficient of the regularization term.
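A minimal sketch of such an objective is given below. It assumes that the cross-entropy is averaged over the samples and that the regularizer is the squared L2 norm of the parameters; the exact form used for NNAN may differ. The function name and the clipping constant are hypothetical.

```python
import math


def bce_with_l2(y_true, y_prob, params, lam=2e-4, eps=1e-12):
    """Mean binary cross-entropy plus lam times the squared L2 norm of params."""
    bce = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        bce -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    bce /= len(y_true)
    return bce + lam * sum(w * w for w in params)


# An uninformative predictor (probability 0.5 everywhere) costs log(2) per sample.
loss_plain = bce_with_l2([1, 0], [0.5, 0.5], params=[])
# Adding parameters [1, 2] with lam = 0.1 contributes 0.1 * (1 + 4) = 0.5.
loss_reg = bce_with_l2([1, 0], [0.5, 0.5], params=[1.0, 2.0], lam=0.1)
```

The default value of `lam` mirrors the regularization rate of 2e-4 reported in the experimental setup.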

Data
In our experiments, two databases are collected from MDAD (Sun et al., 2018) and Zimmermann et al. (2019a), respectively. The former work, MDAD (Sun et al., 2018), collected 5,505 clinically or experimentally supported DMAs between 1,388 drugs and 180 microbes. After removing redundant information, these association entries are grouped into Database 1, which contains 999 drugs, 133 microbes, and 1,708 DMAs. The latter work (Zimmermann et al., 2019a) originally studied how 76 kinds of human gut bacteria metabolize 271 oral drugs, and found that 176 out of the 271 drugs are significantly consumed by at least one bacterial strain. These associations are grouped into Database 2, which includes 176 drugs, 76 bacteria, and 4,194 associations (the two databases are summarized in Table 1).

Comparison
Since there are few existing approaches for predicting DMAs, we compare NNAN with three state-of-the-art methods that were proposed for bipartite link prediction.
• LAGCN (Yu et al., 2021b): A layer attention graph convolutional network for the drug-disease association prediction.
• NIMCGCN (Li et al., 2020): A neural inductive matrix completion with graph convolutional networks for miRNA-disease association prediction.
To evaluate the performance of these methods, we regard the known DMA pairs as positive samples and the unlabeled drug-microbe pairs as negative samples (Peng et al., 2020; Li et al., 2022). We set up a 5-fold cross-validation scenario in which we randomly divide the positive samples and the negative samples into five groups each. In each round, one group of positive samples and one group of negative samples are treated as test samples, and the remaining groups are used for training. Our model is trained by the Gradient Descent Optimizer (Cauchy, 2009) with a batch size of 3,000 for 2,000 epochs; the initial learning rate is set to 0.9, and the regularization rate is set to 2e-4. We use AUROC (area under the receiver operating characteristic curve) and AUPRC (area under the precision-recall curve) as metrics to measure the DMA prediction performance. Moreover, we measure the running time per epoch.
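The evaluation protocol can be sketched as follows. This is an illustrative sketch, not the evaluation code used in the experiments: the fold assignment simply shuffles indices with a fixed seed, and AUROC is computed directly from its pairwise definition, which is adequate for small examples but quadratic in the number of samples. The function names are hypothetical.

```python
import random


def five_fold_indices(n_items, n_folds=5, seed=0):
    """Shuffle item indices and deal them into n_folds disjoint groups."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]


def auroc(labels, scores):
    """Probability that a positive sample outranks a negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Positives and negatives are split separately, as in the protocol above;
# 1,708 is the number of known DMAs in Database 1.
pos_folds = five_fold_indices(1708, seed=0)
test_pos = pos_folds[0]                          # held-out positives, round 1
train_pos = [i for fold in pos_folds[1:] for i in fold]

example_auc = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Splitting the two classes separately keeps the ratio of positive to negative samples the same in every fold, which matters for AUPRC on a sparse database.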
The comparison (Table 2) shows that NNAN obtains the best AUROC value (0.911) and the best AUPRC value (0.502) in Database 1, and the next-highest AUROC value (0.902) and the best AUPRC value (0.840) in Database 2. To further characterize the performance of NNAN, we measure the running time for one epoch of the baselines and of NNAN. On the same computing equipment, NNAN takes the third-shortest running time in Database 1 and the shortest running time in Database 2. In general, NNAN is comparable to or better than the baselines in terms of AUROC, AUPRC, and computation time. Overall, these results indicate that NNAN is superior to the other methods on the databases we collected.

Interpretability of Nearest Neighbor Attention Network
How does NNAN support the hypothesis that "If a drug can associate with a microbe, the other drugs that associate with the microbe are usually the first l nearest neighbors to the drug"? The model has two significant advantages that enhance interpretability. First, each column vector $m_i$ of $M_{k \times g}$ indicates the global importance of each feature dimension $f_i$. Second, the element-wise product between $E_{k \times g}$ and $M_{k \times g}$ generates the importance map of embedding features.
We first use the MsDNA in the nearest-neighbor aggregator block to show how the representation of drug-microbe pairs can provide intuitive hints on which embedding features lead to the association. For a queried drug $d_x$ and its associated microbe $b_p$, the non-zero cells in the embedding representation $a(d_x, b_p)$ stand for its attention values derived from the drugs commonly linking $b_p$. Since the keys are sorted in descending order of similarity from the drug itself ($n_1$) to the farthest neighbor ($n_m$), the positions of the non-zero cells are crucial to the final association.
Take Database 1 as an example. By calculating two average embedding vectors, one for approved DMAs and one for unlabeled drug-microbe pairs, we obtained a distribution along the drug key dictionary from $n_1$ to $n_{66}$ (Figure 4A). As illustrated, the significantly high values of embedding features occurring among the first l nearest neighbors reveal that a drug ($d_x$) associated with a specific microbe ($b_p$) can always find its top-l nearest neighbors among other drugs that associate with the same microbe. This observation demonstrates that a drug is possibly associated with the microbe if it has more non-zero cells at the positions of the first l feature dimensions. This phenomenon could be caused by the fact that over 80% of approved drugs are "follow-on" or "me-too" drugs. Due to high cost and high risk, the design of novel drugs, except for pioneer drugs, always starts from the structures of one or several existing drugs, which are then slightly modified until they meet pharmacological needs (DiMasi and Faden, 2011). Analogously, the results of the DsMNA block along the microbe neighbor aggregator keys reveal that a microbe associated with a specific drug usually finds its near neighbors associated with the same drug.
Moreover, we illustrate how the feature attention matrix $M_{k \times g}$ can provide data-driven hints on which embedding features lead to the association. Since a high-value cell in $M_{k \times g}$ stands for a crucial feature dimension that contributes to determining the association between a queried drug and a microbe, the importance $m(:, i)$ of each feature $f_i$ can be measured by the average of the entries in the $i$th column of $M_{k \times g}$ (Figure 4B). The importance distribution along the sorted drug neighbor keys illustrates that highly important features are usually located among the first l nearest neighbors. In addition, the predictive performance with the top-l features is investigated as a function of l (Figures 4C,D). The number of top features is tuned over the list {1, 6, 11, 16, ..., 66}. As l increases to 16, the performance with the top-l features increases sharply. As l keeps increasing, the performance increases slowly and then even decreases at greater values of l. Again, this illustration demonstrates that selecting the crucial features performs better than using the set of all features.
In summary, both the embedding feature matrix $E_{k \times g}$, which is generated by the nearest-neighbor aggregator, and its feature attention matrix $M_{k \times g}$ provide measurable clues to the association outcome.
To complement the verification of the interpretability of NNAN, we selected one microbe (i.e., Staphylococcus aureus, which is a common causative agent of food poisoning) and one drug (i.e., Hexyl gallate, which has strong antimalarial activity against Plasmodium falciparum) from Database 1, and there was an association between them (de Lima Pimenta et al., 2013). We calculated the similarities between drugs using Hexyl gallate as the reference molecule and sorted the drugs in order of their similarity to Hexyl gallate. Then, we picked the top 10 drugs and checked whether these drugs were associated with S. aureus in Database 1. Finally, we found out that 8 out of the top 10 ranked drugs for Hexyl gallate are associated with S. aureus (Table 3).
From Table 3, it is clear that a drug tends to associate with a microbe if its top-l near neighbors associate with the same microbe. Moreover, the higher the ranks of its top-l near neighbors are, the more likely it is to associate with the microbe. This conclusion would be helpful for screening drug-like molecules.
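The neighbor check behind Table 3 amounts to ranking drugs by similarity to a reference drug and counting how many of the top l are linked to the microbe of interest. A minimal sketch follows; the drug names and similarity values are invented placeholders, not entries of Database 1, and the function name is hypothetical.

```python
def top_l_hit_count(ref, sim_row, associated, l=10):
    """Count how many of the l drugs most similar to `ref` are in `associated`.

    sim_row maps each drug name to its similarity with the reference drug.
    """
    ranked = sorted((d for d in sim_row if d != ref),
                    key=lambda d: sim_row[d], reverse=True)
    return sum(1 for d in ranked[:l] if d in associated)


# Placeholder similarities to a reference drug and a placeholder association set.
sim_to_ref = {"ref": 1.0, "drug_a": 0.9, "drug_b": 0.8, "drug_c": 0.7, "drug_d": 0.1}
linked_to_microbe = {"drug_a", "drug_c", "drug_d"}

hits = top_l_hit_count("ref", sim_to_ref, linked_to_microbe, l=3)
```

In the reported case, the same count with l = 10 for Hexyl gallate and S. aureus gives 8 hits.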

CASE STUDY OF NOVEL PREDICTION
To further confirm the effectiveness of NNAN, we apply our model to one microbe (i.e., Bacteroides fragilis) in Database 2 as a case study. Bacteroides are the major human colonic commensal microbes (Kuwahara et al., 2004). Although B. fragilis is rare in comparison to other Bacteroides species, it is the most prevalent clinical isolate of the genus (Salyers, 1984). Thus, we select B. fragilis for the case study experiment.
NNAN predicts potential associations between drugs and B. fragilis by scoring drug-microbe pairs with a probability. The higher the score, the more likely it is that the association between the drug and B. fragilis exists. In the case study, we verified whether NNAN could find potential linkages between B. fragilis and drugs. According to the ranking of potential DMAs, we validated the top 10, 20, and 50 predicted candidate drugs by a literature search. The validation indicates that 10, 17, and 38 of the top 10, 20, and 50 predicted drugs, respectively, have associations with B. fragilis reported in previously published literature. For example, 85% of the top 20 predicted candidate drugs for B. fragilis are validated (Table 4); more details can be found in the Supplementary Material. These prediction results demonstrate the ability of NNAN to predict potential DMAs in practice.

CONCLUSION
This work has introduced NNAN, a deep learning-based bipartite graph network model to predict potential associations between drugs and microbes. NNAN calculates drug similarities using the weights of feature substructures. It provides an embedding representation based on near-neighbor aggregation for drug-microbe pairs to enhance the explanation of DMAs. In addition, the model provides a feature selection attention matrix for achieving more accurate predictions. These three components of NNAN jointly reveal that a drug associated with a specific microbe can always find its top-l near neighbors among other drugs that associate with the same microbe. Moreover, they uncover that the higher the ranks of its top-l near neighbors are, the more likely it is to associate with the microbe. Under both a cross-validation setting and a realistic potential linkage discovery setting, the empirical comparison of the proposed framework with three state-of-the-art baselines demonstrates that NNAN has highly competitive performance in predicting DMAs. In addition, the framework of our model can also be evaluated on similar biological problems (e.g., miRNA-disease, drug-target, and compound-protein association prediction). Furthermore, there is still room to improve the model. We can set up new experimental scenarios that identify the DMAs for new drugs or new microbes, and we can also integrate more biological databases to enrich the information on DMAs and improve the predictive ability.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author/s.