Abstract
Background: Drug repositioning is considered a promising drug development strategy with the goal of discovering new uses for existing drugs. Compared with the experimental screening for drug discovery, computational drug repositioning offers lower cost and higher efficiency and, hence, has become a hot issue in bioinformatics. However, there are sparse samples, multi-source information, and even some noises, which makes it difficult to accurately identify potential drug-associated indications.
Methods: In this article, we propose a new scheme with improved tensor robust principal component analysis (ITRPCA) in multi-source data to predict promising drug–disease associations. First, we use a weighted k-nearest neighbor (WKNN) approach to increase the overall density of the drug–disease association matrix that will assist in prediction. Second, a drug tensor with five frontal slices and a disease tensor with two frontal slices are constructed using multi-similarity matrices and an updated association matrix. The two target tensors naturally integrate multiple sources of data from the drug-side aspect and the disease-side aspect, respectively. Third, ITRPCA is employed to isolate the low-rank tensor and noise information in the tensor. In this step, an additional range constraint is incorporated to ensure that all the predicted entry values of a low-rank tensor are within the specific interval. Finally, we focus on identifying promising drug indications by analyzing drug–disease association pairs derived from the low-rank drug and low-rank disease tensors.
Results: We evaluate the effectiveness of the ITRPCA method by comparing it with five prominent existing drug repositioning methods. This evaluation is carried out using 10-fold cross-validation and independent testing experiments. Our numerical results show that ITRPCA not only yields higher prediction accuracy but also exhibits remarkable computational efficiency. Furthermore, case studies demonstrate the practical effectiveness of our method.
1 Introduction
Over the past few decades, while funding for drug development has seen a substantial surge, the number of newly approved drugs for market release has remained limited. Notably, developing a new drug demands an average of 13.5 years and involves an average expenditure of 1.8 billion (Liu et al., 2020). This process is time-consuming and tremendously expensive and involves high risk (Chong and Sullivan, 2007; Dickson and Gagnon, 2009). Since the approved drugs already possess safety records, tolerance, and pharmacokinetic data of the human body in clinical trials, discovering new clinical indications for commercialized drugs is an important strategy to improve the efficiency of drug development (Ashburn and Thor, 2004). In fact, there have been a few successful repurposed drugs, such as sildenafil, thalidomide, and retinoic acid, which have been widely used in application (Luo et al., 2021).
Using computational methods to discover new uses for established drugs is a crucial aspect of drug repositioning, which is based on the assumption that drugs with similar properties tend to treat similar diseases. With the rapid development of high-throughput technology and continuously generating multi-omics data, there is an increasing focus on crafting computational methods for elevated precision (Wang et al., 2021). These approaches can be classified into four distinct groups: encompassing network-based methods, machine learning-based methods, matrix-based methods, and deep learning-based methods.
Network-based approaches infer the scores of drug–disease pairs by constructing drug and disease heterogeneous biological networks and extracting topological information. The fundamental assumption is guilt by association, whereby if a certain drug can interact with most of the target’s neighbors, it is probable that the target will also be able to interact with the same drug and vice versa. Based on the guilt-by-association principle, Wang et al. (2013) used a priori information about drugs and targets to establish a heterogeneous graph. A heterogeneous graph-based inference (HGBI) model was used to predict new drug–target interactions. Luo et al. (2016) enhanced the quality of the similarity between drugs and diseases by exploiting the existing drug–disease associations. Building on the combined similarity measures, a new bi-random walk algorithm called MBiRW was developed to infer potential associations between drugs and diseases. Qin et al. (2022) proposed a network-based inference model for new emerging diseases, which used genes as a bridge in a tripartite drug–gene–disease network to infer latent drug–disease associations. Additionally, to account for the structures of networks and the biological aspects related to drugs and indications, Zhao et al. (2022) presented a novel graph representation model based on heterogeneous networks, namely, HINGRL. It integrated the biological networks of drugs and diseases to learn the features from both topological and biological perspectives.
Machine learning-based approaches use supervised learning algorithms to identify potential indications for drugs based on input features and known associations (Vamathevan et al., 2019), including logistic regression (Yang et al., 2021), random forests Zhao et al. (2022), and support vector machines (Lavecchia, 2015). Jiang and Huang (2022) proposed a graph representation model based on random forest for drug repositioning. The method identified drug–disease associations by feeding combined features from the molecular association network into a random forest algorithm. Gao et al. (2022) presented a model for predicting associations between drugs and diseases, employing similarity kernel fusion (SKF) to merge diverse similarity kernels for drugs and diseases. This fusion resulted in two integrated similarity kernels, and the scores of association pairs were calculated using the Laplacian regularized least square (LapRLS) algorithm.
Matrix-based methods use the low-rank matrix representation of the drug–disease association space to identify novel associations based on the similarity of their profiles. Yang et al. (2019) developed a bounded nuclear norm regularization (BNNR) method to obtain the low-rank matrix of the drug–disease association. This method efficiently handles noise originating from the drug and disease similarity. Yang et al. (2021) proposed a multi-similarity bilinear matrix factorization (MSBMF) method that dynamically integrated multiple similarities of drug and disease into drug–disease association training. It limited the predicted values of the drug–disease association to non-negative. Huang et al. (2020) proposed a multi-task learning method that used ensemble matrix factorization to predict both treatment associations and non-treatment associations between drug and disease. The proposed method can capture complementary features associated with these two tasks. Yan et al. (2022) proposed a multi-view learning with matrix completion method (MLMC), which is capable of effectively utilizing multi-source similarity matrices. The Laplacian graph regularization was pulled into MLMC to acquire an all-encompassing feature representation derived from the multi-similarity information of drugs and diseases.
Deep learning-based methods typically use neural network models to learn the feature representation of drugs and diseases and use these features to predict new association pairs. Xuan et al. (2019) introduced a bidirectional deep learning model based on the convolutional neural network (CNN) and bi-directional long- and short-term memory (BiLSTM). This framework incorporates both similarities and associations between drugs and diseases in addition to pathways that connect specific drug–disease pairs. This approach effectively integrates raw and topological data between nodes. Combining similarity network fusion (SNF) and neural network (NN) deep learning models, Jarada et al. (2021) proposed a method known as SNF-NN, which was designed to forecast novel drug–disease associations. Yu et al. (2021) proposed a layer attention graph convolutional network model to detect the potential uses of drugs. The model performs graph convolutional processing on a heterogeneous network constructed from drug and disease information, thereby achieving association prediction.
To mine latent association features in multiple similarities and association data, we present an improved tensor robust principal component analysis (ITRPCA) method. First, we integrate the prior information of drug and disease to compute five indicators for drug similarity and two indicators for disease similarity. Considering that validated drug–disease associations are extremely sparse, a weighted
k-nearest neighbor (WKNN) preprocessing step is employed to enrich the association matrix that aids in prediction. Then, we construct a drug tensor and a disease tensor using multi-similarity matrices and an updated association matrix. Finally, we apply ITRPCA to isolate the low-rank tensor and noise information in these two tensors, respectively. We focus on the drug–disease association pairs in the clean low-rank tensor to infer promising indications for drugs.
Figure 1illustrates the comprehensive workflow of the ITRPCA method. Our method’s key contributions are as follows:
• ITRPCA presents a comprehensive scheme for incorporating diverse drug and disease similarities into prediction training.
• By leveraging the weighted tensor Schatten p-norm, ITRPCA can effectively extract the low-rank association tensor from the updated drug and disease tensors, which efficiently separates noisy data and leads to significantly improved accuracy, as demonstrated in our results.
• The ITRPCA model includes a boundary constraint that ensures all predicted tensor entries fall within the predefined interval.
• We have devised an iterative approach employing the augmented Lagrangian multiplier (ALM) to numerically address the ITRPCA model.
FIGURE 1
2 Materials
To validate the effectiveness of our proposed method, this study involves three crucial datasets: the gold standard dataset (Gottlieb et al., 2011), Cdataset (Luo et al., 2016), and CTD (Davis et al., 2019). Table 1 summarizes the details of these three data, such as the number of drugs and diseases, the number of known association pairs, and the intended purposes in this study. These drugs and diseases are obtained from DrugBank (Wishart et al., 2006) and the Online Mendelian Inheritance in Man (OMIM) database (Ada Hamosh, 2005), respectively. The corresponding drug–disease association matrix denoted as A is represented by a binary matrix, where the proven drug–disease associations are denoted by 1 s, while unproven associations are denoted by 0 s.
TABLE 1
| Dataset | # of drugs | # of diseases | # of association pairs | Purpose |
|---|---|---|---|---|
| Gold standard dataset | 593 | 313 | 1,933 | 10-fold cross-validation |
| Cdataset | 663 | 409 | 2,352 | Independent testing |
| CTD (February 2020) | 1,613 | 969 | 15,339 | Independent testing |
Number of drugs, diseases, and known association pairs in each dataset along with their respective purposes for dataset utilization.
Here, we calculate a total of five similarity matrices for drugs: chemical structure similarity Rchem, anatomical therapeutic chemical (ATC) code similarity Ratc, side-effect similarity Rse, drug–drug interactions similarity Rddi, and target profile similarity Rtarg. Based on the drug’s canonical SMILES (Weininger, 1988) file, we use the Chemical Development Kit (CDK) (Steinbeck et al., 2003) tool to compute the hashed fingerprints for all drugs and then obtained Rchem. ATC codes for all relevant drugs were extracted from DrugBank. We use the semantic similarity algorithm (Resnik et al., 1995) to calculate the similarity scores between ATC terms and then obtained Ratc. The rest of the similarities are calculated using the Jaccard similarity coefficient (Jaccard, 1908), which can be expressed as follows:where Si implies the side-effect profiles of drug i in Rse, drug–drug interaction profiles of drug i in Rddi, and drug–target interaction profiles of drug i in Rtarg.
For diseases, a total of two similarity measures are calculated: disease phenotypic similarity Dph and disease ontology similar Ddo. Dph is obtained from MimMiner (Van Driel et al., 2006), which calculates the frequency of MeSH (medical subject heading (MeSH) vocabulary terms co-occurring in the medical descriptions of two diseases retrieved from the OMIM database. According to the structure of the disease ontology academic language, Ddo is computed by using the Gene Ontology-based algorithm (Wang et al., 2007).
In summary, we have collected a total of one drug–disease association matrix A, five drug similarity matrices (i.e., Rchem, Ratc, Rse, Rddi, and Rtarg), and two disease similarity matrices (i.e., Dph and Ddo) for computational drug repositioning.
3 Methods
In this section, we introduce our method for identifying potential uses for established drugs. The structure is as follows: first, we depict the robust principal component analysis (RPCA) and tensor robust principal component analysis (TRPCA). Then, we propose the model of ITRPCA according to the requirements of drug repositioning. At last, the ALM method is demonstrated to solve the ITRPCA model in detail.
For the ease of reference, bold calligraphy letters represent third-order tensors, e.g., , capital letters denote the matrices, e.g., X, bold lower-case letters indicate the vectors, e.g., x, and lower-case denote the elements of .
3.1 Robust principal component analysis
RPCA stands as a prominent technique in low-rank representation, which can separate the noise matrix from the original data matrix and learn the clean low-rank matrix. It has found successful applications in computer vision and machine learning, such as video surveillance (Wright et al., 2009), facial modeling (Peng et al., 2012), and subspace clustering (Liu et al., 2010). RPCA is targeted at a matrix, which can decompose the target matrix into a low-rank matrix and a sparse matrix for achieving noise reduction. Generally, the mathematical formula of RPCA can be expressed aswhere M denotes the original matrix, X is the low-rank matrix, and E is the sparse noise matrix. ‖X‖* = ∑rσr(X) represents the nuclear norm of matrix X, where σr(X) is the rth singular value of X. denotes the L1-norm of E, and eij is the (i, j) element of E.
3.2 Tensor robust principal component analysis
TRPCA (Lu et al., 2020) is a continuation of RPCA. The primary motivation behind developing TRPCA is to handle multi-dimensional datasets, which are prevalent in various domains, including computer vision (Wang et al., 2014), object recognition (Zhang and Peng, 2019), and medical imaging (Pham et al., 2021). TRPCA aims to decompose the multi-dimensional data into a low-rank tensor, which captures the essential features of the data, and a sparse tensor, which contains the outliers and noise. The low-rank tensor can be interpreted as the underlying structure of the data, while the sparse tensor represents the deviations from this structure.
Similar to the nuclear norm of the matrix, the tensor nuclear norm (Kilmer and Martin, 2011) is defined aswhere and . is denoted as the ith frontal slice of , and is denoted as the discrete fast Fourier transform (FFT) of along the third dimension, i.e., . Thus, . The TRPCA model is formulated as follows:where is the original tensor data, measures the low-rank tensor, and denotes the sparse noise tensor. According to Eq. 3, model (4) is equally regularized for all singular values of the tensor data and shrunk with the same parameters when minimizing the tensor nuclear norm.
3.3 ITRPCA for drug repositioning
Weightedk-nearest neighbor preprocessing. and represent the collection of n disease nodes and m drug nodes, respectively. represents the original drug–disease association matrix, where Aij = 1 if disease di is recognized to have a known connection with drug rj; otherwise, Aij = 0. The ith row vector of matrix A, i.e., , represents the association profile of disease di. The jth column vector of matrix A, i.e., , represents the association profile of drug rj. In fact, if novel drug nodes or disease nodes are considered, the values of their corresponding columns or rows in the adjacency matrix are zero. This case will lead to unsatisfactory performance in prediction (Xiao et al., 2018). We utilize the WKNN algorithm to populate the drug–disease association matrix. This is achieved by considering the similarities of drugs and diseases.
For each drug rq, the similarities of the other k-nearest known drugs (where at least one validated association exists) are combined to update the drug’s association profile:where the drugs r1 to rk are arranged in descending order according to their similarity with rq. α ∈ [0, 1] is a decay term, and R denotes the mean matrix of five drug similarity matrices. This means that when the similarity between rj and rq is strong, a higher weight will be assigned; conversely, a lower weight will be assigned. Furthermore, is the normalization term.
In the same way, the updated association profile for each disease dp is obtained as follows:where d1 to dk are the diseases sorted in descending order based on their similarity to dp. α ∈ [0, 1] is a decay term, and D denotes the mean matrix of two disease similarity matrices. Qd is a normalization term, and . Finishing these profile operations, we obtain the aforementioned two matrices Ar and Ad from drug and disease spaces, respectively. Then, the new drug–disease association matrix ADR is calculated as follows:After the processing of WKNN, the density of the updated association matrix ADR is greatly improved, and it no longer contains all zero rows and all zero columns. However, some noise information is inevitably added into the association matrix. Subsequently, we will propose our new method for noise separation.
Algorithm 1 summarizes the preprocessing step for updating the drug–disease association matrix using WKNN.
Algorithm 1
• Input: The original drug–disease association matrix , the five drug similarity matrices: Rchem, Ratc, Rse, Rddi, Rtarg; the two disease similarity matrices: Dph, Ddo, decay term α, neighborhood sizes k.
• Output: Optimized association matrix ADR.
1. R = (Rchem + Ratc + Rse + Rddi + Rtarg)/5, D = (Dph + Ddo)/2
2. for each drug rq do
3. ; is the function to obtain the k known nearest neighbors of rq in matrix R in descending order.
4. ;
5.
6. end for
7. for each disease dp do
8. ;
9. ;
10.
11. end for
12. ;
13. return ADR.
Drug tensor and disease tensor. We construct a third-order drug tensor with five frontal slices denoted as . This tensor comprised five drug similarity matrices and an updated association matrix. Specifically, the first frontal slice of the drug tensor is a concatenation of Rchem and ADR, which can be described as follows:where . In the same way, the remaining four frontal slices of the drug tensor can be constructed with other similarity matrices and ADR, which is presented as
A third-order disease tensor with two frontal slices, namely, , is constructed using two disease similarity matrices and an updated association matrix. The disease tensor is stacked by two slices. Each of its slices can be denoted aswhere .
ITRPCA model. In the two tensors, and , some noise is involved in both the similarity data and the inferred association data by WKNN. TRPCA can be employed to separate noise tensors from low-rank tensors. In order to fully exploit the significant information embedded within drug and disease tensors, it is crucial that we adjust the shrinking of large and small singular values such that the large singular values shrink less and the small singular values shrink more. However, TRPCA fails to effectively utilize this prior knowledge during the minimization of tensor nuclear norm. Therefore, the weighted tensor Schatten p-norm is introduced to treat different singular values separately, which is defined aswhere denotes the jth singular value, and ωj denotes the weight value of the jth singular value. When p = 1 and is a special case of . Moreover, it is crucial to note that the entries of the low-rank tensor using TRPCA can be any real value in the range of (−∞, + ∞). However, it is imperative to ensure that the predicted values are contextually relevant, as any values falling outside the interval of [0,1] would be meaningless. To address this concern, a bound constraint should be incorporated to restrict the predicted values of unobserved elements within the interval [0, 1]. Our ITRPCA model is formulated as follows:where can be replaced by a drug tensor and disease tensor in practice.
Here, we use the drug tensor instead of as an example. By optimizing the ITRPCA model, a clean low-rank drug tensor can be obtained. Its potential low-rank representation comes from drug multiple similarity data and association information. Actually, we focus on the part of the association tensor in , which is denoted as and equal to . In order to obtain a predicted association matrix for inferring potential drug–disease pairs, we take the average of the tensor in the longitudinal direction. This operation can be expressed as , where is the optimized drug–disease association matrix from the drug’s perspective. In the same manner, we substitute the disease tensor for in model (12). A new low-rank tensor , the other association tensor , and the corresponding association matrix can be conducted from the perspective of diseases using ITRPCA. It should be noted that and . Finally, the integrated drug–disease association matrix was obtained by averaging the prediction results of both drugs and diseases.
Algorithm 2 summarizes the process of applying ITRPCA in drug repositioning. Based on the predicted pair scores in , the potential drug–disease association can be inferred.
Algorithm 2
• Input: Original drug–disease association matrix , the mean of drug multiple similarity , the mean of disease multiple similarity , neighborhood sizes k, p-value of Schatten p-norm.
• Output: Predicted association matrix .
1. ADR ←WKNN preprocessing ;
2. Assign and by the Eqs (8 − 10).
3. ;
4. ;
5. ;
6. ;
7. ;
8. ;
9. return .
3.4 Solutions for ITRPCA
In this subsection, the ALM method is derived to solve the model (12). Accordingly, the augmented Lagrangian function becomeswhere is the Lagrange multiplier and μ is the penalty parameter. The primary procedure comprises the subsequent distinct subtasks:
:We fix and to minimize for . The model (14) becomeswhere , and drawing inspiration from the soft-thresholding operator, we havewhere the (i, j, k)th element of is .
: We fix and to minimize for . The model (14) becomeswhere . This is a weighted tensor Schatten p-norm minimization (WTSNM) problem based on t-SVD (Kilmer and Martin, 2011). In order to tackle this concern, the subsequent lemma and theorems can be employed.
(
Xie et al., 2016). For the optimization problem
with the given
pand
ω, there exists a specific threshold
and we have the following conclusions:
1) When , the optimal solution of Eq. 18 is 0.
2) When , the optimal solution is , and can be obtain by solving .
Theorem 1 (Xie et al., 2016). Let be the SVD of , l = min (m, n), 0 ≤ ω1 ≤ ω2 ≤ ⋯ ≤ ωl, then a global optimal solution of the following model:iswhere and , which can be obtained by Lemma 1. The is organized in a descending order, while is arranged in an ascending order.
Theorem 2 (Gao et al., 2021). Suppose , let given the model
Then, a global optimal solution to the model (22) iswhere is a tensor and is the ith frontal slice of . and .
According to Theorem 2, the global optimal solution of model (17) is
In addition, we limit the entry values of to the interval [0,1] by using the following projection operator.where is defined as
: We fix and to minimize for . The model (14) becomes
Compute μk+1: In the ITRPCA model, we employ a scheme that gradually increases the learning rate to facilitate fast convergence (Gao et al., 2021). The penalty parameter becomes
Algorithm 3 provides the overall iterative scheme of the ITRPCA model. It can extract significant information from the drug tensor and disease tensor and ensure that the predicted drug–disease association values are within [0,1].
Algorithm 3
• Input: Tensor data (using drug tensor or disease tensor ), p-value of Schatten p-norm.
• Output: Low-rank tensor .
• Initialize: , ρ = 1.1, regularization coefficient λ and weight vector ω.
repeat
1. Update by Eq. 16.
2. Update by Eq. 24.
3. .
4. Update by Eq. 27.
5. Update μk+1 by Eq. 28.
6. k ← k + 1.
until convergence
.
4 Results and discussion
4.1 Evaluation metrics
To evaluate the effectiveness of ITRPCA, we employ 10-fold cross-validation to predict potential indications for existing drugs. In this process, known drug–disease associations within the gold standard dataset are randomly split into 10 distinct sets of comparable sizes. One subset serves as the test data, while the remaining nine subsets serve as the training data. This 10-fold cross-validation is repeated 10 times with varied random splits, and the resultant averages are considered the final results. Following prediction generation, potential diseases associated with the test drug are sorted in descending order according to their prediction scores. We utilize three evaluation metrics to evaluate the overall performance of ITRPCA: the area under the receiver operating characteristic curve (AUC), the area under the precision–recall curve (AUPR), and precision.
4.2 Parameter setting
In the ITRPCA algorithm, there are some default parameters and two key hyperparameters that need to be adjusted. These default parameters are determined empirically. Specifically, in the WKNN step, we set a decay term α equal to 0.95. In model (12), the regularization coefficient , where n1, n2, and n3 are the size of . For drug and disease tensors, we design an adaptive scheme to determine the weight vector ω of model (12). We divide ω into three parts: the first part ranges from 1 to u, the second part ranges from u + 1 to v, the third part ranges from v + 1 to the end, and the specific weights of each part are [1, 2, 4]. The number of u and v are determined bywhere
Actually, σj,i is the ith largest singular value of the jth frontal slice matrix of . It is evident that by minimizing the weighted tensor Schatten p-norm, the singular values of the second and third parts can be shrunk more compared to the first part. The reason is that these two parts are assigned weight values greater than 1. In addition, the two key hyperparameters are needed to be adjusted, which are neighborhood sizes k and p value of Schatten p-norm. We perform grid search to select the appropriate values according to the sum of AUC and AUPR in cross-validation. k is chosen from {10, 20, 30, 40, 50}, and p is picked from {0.6, 0.7, 0.8, 0.9, 1}. The numerical results for determining the parameters k and p are reported in Table 2. When k = 30 and p = 0.9, the highest rating value appears. Meanwhile, we terminate the ITRPCA algorithm when the following stopping criterions are satisfied or the maximum number of iteration steps is reached.where and tol1 and tol2 are the given tolerances, which are set as 10–3 and 10–4 in the algorithm, respectively.
TABLE 2
| p \ k | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.6 | 1.294 | 1.299 | 1.302 | 1.308 | 1.312 |
| 0.7 | 1.318 | 1.327 | 1.335 | 1.342 | 1.348 |
| 0.8 | 1.352 | 1.365 | 1.370 | 1.377 | 1.378 |
| 0.9 | 1.379 | 1.391 | 1.395 | 1.394 | 1.394 |
| 1 | 1.359 | 1.364 | 1.366 | 1.366 | 1.365 |
Sum of AUC and AUPR values using different k- and p-values in the 10-fold cross-validation.
4.3 Comparison with state-of-the-art drug repositioning methods
We compare ITRPCA with five state-of-the-art methods in computational drug repositioning: HGBI (Wang et al., 2013), MBiRW (Luo et al., 2016), BNNR (Yang et al., 2019), MSBMF (Yang et al., 2021), and MLMC (Yan et al., 2022). To ensure a fair comparison, the parameters used in these compared methods are set to the recommended values by the authors (HGBI: α = 0.4; MBiRW: α = 0.3 and l = r = 2; and BNNR: α = 1 and β = 10) or determined by a grid search (MSBMF: λ1 and λ2 are chosen from {0.001, 0.01, 0.1, 1}, and τ = 0.7; MLMC: λr and λd are selected from {0.0001, 0.001, 0.01, 0.1, 1}, and threshold = 0.8).
We assess the performance of all methods in a 10-fold cross-validation for the gold standard dataset. Table 3 shows the AUC, AUPR, and precision values of all compared methods. As shown in Table 3, ITRPCA has the best performance compared to other methods in terms of AUC, AUPR, and precision. Specifically, ITRPCA achieves the best AUPR value of 0.442, which is 67.424%, 4.492%, 4.988%, and 1.376% higher than the corresponding AUPRs of MBiRW, BNNR, MSBMF, and MLMC, respectively. It can be seen that ITRPCA performs slightly better than MLMC. The ROC curves of all methods in the 10-fold cross-validation are shown in Figure 2A.
TABLE 3
| Metric | ITRPCA | HGBI | MBiRW | BNNR | MSBMF | MLMC |
|---|---|---|---|---|---|---|
| AUC | 0.952 | 0.829 | 0.917 | 0.932 | 0.941 | |
| AUPR | 0.442 | 0.102 | 0.264 | 0.423 | 0.421 | |
| Precision | 0.476 | 0.130 | 0.304 | 0.463 | 0.455 |
AUC, AUPR, and precision values of all comparison methods in 10-fold cross-validation for the gold standard dataset.
The most optimal outcomes are indicated in bold, while the second-best results are underlined.
FIGURE 2
Based on the test results from the repeated 10-fold cross-validation, we used the Wilcoxon rank sum tests to evaluate the statistical significance of ITRPCA compared to other methods in terms of AUC, AUPR, and precision. The p-values were carefully adjusted using the Bonferroni correction to control for multiple testings. Table 4 shows the p-values obtained from the rank sum test and the Bonferroni correction. The results indicate that ITRPCA is significantly better than the other methods, except for MLMC (p-value ). It suggests that ITRPCA outperforms most of the compared methods in terms of AUC, AUPR, and precision. The significance of the comparison was carefully adjusted using the Bonferroni correction to control for multiple testing.
TABLE 4
| p-value | MLMC | MSBMF | BNNR | MBiRW | HGBI |
|---|---|---|---|---|---|
| AUC | 1.899 | 1.125e-09 | 3.019e-22 | 4.932e-32 | 1.281e-33 |
| AUPR | 1.154 | 4.741e-05 | 4.741e-05 | 4.741e-05 | 4.741e-05 |
| Precision | 3.846 | 4.554e-05 | 1.433e-11 | 1.235e-33 | 1.227e-33 |
p-values obtained through Wilcoxon rank sum tests and Bonferroni correction, comparing ITRPCA with other methods on AUC, AUPR, and precision.
In addition, to demonstrate the computational efficiency of the compared methods, we have recorded the average amount of time taken by each fold. The 10-fold cross-validation is executed on a personal laptop, which is powered by an Intel Core i7 processor and comes with 16 GB RAM. Figure 2B shows the average running time for each of the 10 folds across all comparison methods. As shown in Figure 2B, the methods with an average running time of less than 10 seconds are HGBI, MSBMF, and ITRPCA. The average required time for MLMC and MBiRW is relatively long, which is approximately five times that of our method. Therefore, ITRPCA is a promising prediction method that shows both effective predictions and efficient computational performance.
4.4 Independent testing
To further demonstrate the performance of ITRPCA in real applications, we conduct two types of independent testing experiments. The gold standard dataset is used as the training set to train the models, and the set of associated pairs in the Cdataset excluding the training set is used as the testing set to evaluate the performance of the models. To be specific, we have collected a total of 57 drug–disease association pairs in the testing set. Figure 3 shows the ROC curves of all comparative methods in independent testing. As shown in Figure 3, ITRPCA has demonstrated clear superiority over other methods in this independent testing. Specifically, ITRPCA yields an AUC value of 0.943, while HGBI, MBiRW, BNNR, MSBMF, and MLMC yield AUC values of 0.873, 0.908, 0.882, 0.925, and 0.892, respectively. It is worth mentioning that the AUC value of ITRPCA is 5.717% higher than that of MLMC.
FIGURE 3
In addition, the other independent testing is conducted using all known associations in the gold standard dataset as training samples and unknown associations as candidate samples. The prediction scores of all candidate pairs are obtained by computational methods and ranked for each specific drug. We focus on how many of the top n candidate indications for each drug could be found and confirmed to have been used in clinical treatment in the CTD (released in February 2020) (Davis et al., 2019). Specifically, among all the drugs and diseases involved in the gold standard dataset, we have identified a total of 938 drug–disease associations that were subsequently validated in the CTD. As shown in Figure 4, the number of correctly predicted associations for 593 drugs is counted for the top 5 to top 30 candidate indications. It is evident that ITRPCA predicts the highest number of correct associations among all the methods for all drugs, followed by MSBMF and MLMC. Specifically, the number of validated associations from the top 5 to top 30 identified by ITRPCA is 163, 269, 348, 409, 456, and 492, respectively. In contrast, the corresponding numbers of identified associations by MLMC are all lower than those by ITRPCA, with a difference of 7, 30, 35, 44, 45, and 46, respectively.
FIGURE 4
4.5 Ablation experiment
To elucidate the individual impact of components within ITRPCA, we designed four ablation experiments: “w/o WKNN,” “only WKNN,” “ITRPCA-drug,” and “ITRPCA-disease.” To be specific, “w/o WKNN” implies the ITRPCA method without WKNN preprocessing for prediction. “only WKNN” represents using only the WKNN algorithm to infer the potential drug–disease associations, without the need for using our tensor RPCA model. “ITRPCA-drug” represents that only the drug tensor in ITRPCA was used to predict drug–disease associations, while “ITRPCA-disease” only uses the disease tensor in ITRPCA. To ensure a rigorous and unbiased comparison, the same prior similarity information and parameters as the ITRPCA model are employed in the aforementioned experiments.
Table 5 shows the AUC, AUPR, and precision results obtained from the 10-fold cross-validation of the comparative methods on the gold standard dataset. As anticipated, ITRPCA performs the best with AUC, AUPR, and precision values. This indicates that combining WKNN and TRPCA has a positive impact on predictive performance. In fact, the “w/o WKNN” model does not exhibit prominent results in predicting latent associations. It illustrates that WKNN preprocessing in ITRPCA can assist in the prediction. For the “only WKNN” model, relevant information was added based on drug and disease similarity. However, this addition also introduced more noise, leading to poor prediction performance. It serves as evidence from the opposite perspective that the effectiveness of TRPCA in noise reduction is significant. Furthermore, based on the prediction results of “ITRPCA-drug” and “ITRPCA-disease,” we find that the simultaneous utilization of tensor information from both drugs and diseases leads to better performance compared to using only one type of tensor information. It implies that the effective enhancement of prediction outcomes can be achieved through the integration of prior knowledge from drugs and diseases.
TABLE 5
| Metric | ITRPCA | w/o WKNN | only WKNN | ITRPCA-drug | ITRPCA-disease |
|---|---|---|---|---|---|
| AUC | 0.952 | 0.929 | 0.903 | 0.940 | |
| AUPR | 0.442 | 0.397 | 0.336 | 0.425 | |
| Precision | 0.476 | 0.437 | 0.380 | 0.462 |
Compared results of ITRPCA, “w/o WKNN,” “only WKNN,” “ITRPCA-drug,” and “ITRPCA-disease” with 10-fold cross-validation on the gold standard dataset.
The most optimal outcomes are indicated in bold, while the second-best results are underlined.
4.6 Case studies
To demonstrate the practical application of ITRPCA, we conducted case studies with the aim of uncovering novel applications for existing drugs. By utilizing all available drug–disease associations and multiple similarities in the gold standard dataset, we applied the ITRPCA method to predict the unexplored relationship between drugs and diseases. Based on the prediction results of ITRPCA, we generated all possible candidate indications for each drug and sorted them according to their obtained scores. In recent years, the development of drugs for tumors and leukemia has received widespread attention. Here, we selected four commonly used anti-tumor drugs ( cisplatin, vincristine, doxorubicin, and methotrexate) and one anti-malignant hematologic drug ( cytarabine) to search for evidence of their candidate indications in the CTD.
Table 6 shows the top 10 candidate indications predicted by the ITRPCA algorithm for the five drugs, with confirmed indications highlighted in bold. It was observed that each drug had 4–6 validated indications among the top 10 predictions. As an example, doxorubicin (DB00997), a broad-spectrum antitumor medication with antibiotic-like properties, was found to be effective in treating various types of cancer, including esophageal cancer, colon cancer, prostate cancer, renal cell carcinoma (nonpapillary), and hepatocellular carcinoma, as shown in Table 6. Additionally, chronic lymphocytic leukemia (susceptibility to, 2) and reticulum cell sarcoma were ranked first and second in the candidate indication list, respectively. However, their validity as indications has not been confirmed yet. These unconfirmed candidate indications hold potential as promising targets for further research.
TABLE 6
| Drugs (DrugBank ID) | Top 10 candidate diseases (OMIM ID) |
|---|---|
| Cisplatin (DB00515) | Rhabdomyosarcoma 2 (268220); lung cancer (211980); lymphoblastic leukemia, acute, with lymphomatous features (247640); diffuse gastric and lobular breast cancer syndrome (137215); reticulum cell sarcoma (267730); leukemia, chronic lymphocytic, susceptibility to 2 (109543); Wilms tumor 1 (194070); breast cancer (114480); colorectal cancer (114500); thyroid cancer, and non-medullary, 2 (188470) |
| Vincristine (DB00541) | Leukemia, chronic lymphocytic (151400); mycosis fungoides (254400); myelofibrosis (254450); breast cancer (114480); osteogenic sarcoma (259500); bladder cancer (109800); lung cancer (211980); Kaposi sarcoma, susceptibility to (148000); small cell cancer of the lung (182280); and diffuse gastric and lobular breast cancer syndrome (137215) |
| Doxorubicin (DB00997) | Leukemia, chronic lymphocytic, susceptibility to, 2 (109543); reticulum cell sarcoma (267730); esophageal cancer (133239); small cell cancer of the lung (182280); testicular germ cell tumor (273300); colorectal cancer (114500); Dohle bodies and leukemia (223350); prostate cancer (176807); renal cell Carcinoma, nonpapillary (144700); and hepatocellular carcinoma (114550) |
| Methotrexate (DB00563) | Lung cancer (211980); Wilms tumor 1 (194070); leukemia, chronic lymphocytic (151400); myeloma, multiple (254500); prostate cancer (176807); renal cell carcinoma, nonpapillary (144700); neuroblastoma, susceptibility to, 1 (256700); thyroid cancer, nonmedullary, 2 (188470); myelofibrosis (254450); and moved to 619182 (175505) |
| Cytarabine (DB00987) | Leukemia, chronic lymphocytic (151400); mycosis fungoides (254400); myelofibrosis (254450); rhabdomyosarcoma 2 (268220); myeloma multiple (254500); colorectal cancer (114500); small cell cancer of the lung (182280); Kaposi sarcoma, susceptibility to (148000); testicular germ cell tumor (273300); and breast cancer (114480) |
Top 10 candidate indications for cisplatin, vincristine, doxorubicin, methotrexate, and cytarabine.
The predicted indications in bold have been confirmed by the CTD.
5 Conclusion
In the study, we have proposed a novel computational method called ITRPCA for identifying drug-associated indications. ITRPCA can not only effectively exhibit robustness in isolating the low-rank tensor and noise information but also restrict predicted entry values of the low-rank tensor within a specific interval. The cross-validation and independent testing experiments have shown that ITRPCA is a highly effective prediction method. In particular, when compared to existing drug repositioning methods in independent testing, ITRPCA outperforms them in all measures, indicating a clear advantage. Additionally, case studies have confirmed ITRPCA’s reliability in predicting new indications for known drugs. Therefore, we are confident that ITRPCA will serve as a valuable tool to successfully facilitate practical drug repositioning.
Statements
Data availability statement
Publicly available datasets were analyzed in this study. These data can be found at: https://github.com/YangPhD84/ITRPCA.
Author contributions
MY and BY wrote the manuscript. MY conducted the methodology and software development. BY performed the investigation and validation. GD and JW reviewed and edited the manuscript. All authors contributed to the article and approved the submitted version.
Funding
This work was supported by the National Natural Science Foundation of China (Grant Nos 62272309 and 61972423), the Key Project of Hunan Education Department (Grant No. 21A0478), the Natural Science Foundation of Hunan (Grant No. 2022JJ30549), and the Graduate Research Innovation Project of Shaoyang University (Grant No. CX2023SY053).
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
1
AdaH.ScottA. F.AmbergerJ. S.BocchiniC. A.McKusickV. A. (2005). Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res.33 (1), 514–517. 10.1093/nar/gki033
2
AshburnT. T.ThorK. B. (2004). Drug repositioning: identifying and developing new uses for existing drugs. Nat. Rev. Drug Discov.3 (8), 673–683. 10.1038/nrd1468
3
ChongC.SullivanD. (2007). New uses for old drugs. Nature448 (7154), 645–646. 10.1038/448645a
4
DavisA.GrondinC. J.JohnsonR. J.SciakyD.McMorranR.WiegersJ.et al (2019). The comparative toxicogenomics database: update 2019. Nucleic Acids Res.47 (D1), D948–D954. 10.1093/nar/gky868
5
DicksonM.GagnonJ. P. (2009). The cost of new drug discovery and development. Discov. Med.4 (22), 172–179.
6
GaoC. Q.ZhouY. K.XinX. H.MinH.DuP. F. (2022). DDA-SKF: predicting drug-disease associations using similarity kernel fusion. Front. Pharmacol.12, 784171. 10.3389/fphar.2021.784171
7
GaoQ.ZhangP.XiaW.XieD.GaoX.TaoD. (2021). Enhanced tensor RPCA and its application. IEEE Trans. Pattern Analysis Mach. Intell.43 (6), 2133–2140. 10.1109/TPAMI.2020.3017672
8
GottliebA.SteinG. Y.RuppinE.SharanR. (2011). Predict: A method for inferring novel drug indications with application to personalized medicine. Mol. Syst. Biol.7 (1), 496. 10.1038/msb.2011.26
9
HuangF.QiuY.LiQ.LiuS.NiF. (2020). Predicting drug-disease associations via multi-task learning based on collective matrix factorization. Front. Bioeng. Biotechnol.8, 218. 10.3389/fbioe.2020.00218
10
JaccardP. (1908). Nouvelles recheres sur la distribution florale. Bull. Soc. Vaud. Sci. Nat.44, 223–270.
11
JaradaT. N.RokneJ. G.AlhajjR. (2021). SNF-NN: computational method to predict drug–disease interactions using similarity network fusion and neural networks. BMC Bioinforma.22 (1), 28–20. 10.1186/s12859-020-03950-3
12
JiangH.HuangY. (2022). An effective drug–disease associations prediction model based on graphic representation learning over multi-biomolecular network. BMC Bioinforma.23, 9–17. 10.1186/s12859-021-04553-2
13
KilmerM. E.MartinC. D. (2011). Factorization strategies for third-order tensors. Linear Algebra Its Appl.435 (3), 641–658. 10.1016/j.laa.2010.09.020
14
LavecchiaA. (2015). Machine-learning approaches in drug discovery: methods and applications. Drug Discov. Today20 (3), 318–331. 10.1016/j.drudis.2014.10.012
15
LiuC.WeiD.XiangJ.RenF.HuangL.LangJ.et al (2020). An improved anticancer drug-response prediction based on an ensemble method integrating matrix completion and ridge regression. Mol. Therapy-Nucleic Acids21, 676–686. 10.1016/j.omtn.2020.07.003
16
LiuG.LinZ.YuY. (2010). “Robust subspace segmentation by low-rank representation,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), Madison, WI, United States, June 2010, 663–670.
17
LuC.FengJ.ChenY.LiuW.LinZ.YanS. (2020). Tensor robust principal component analysis with a new tensor nuclear norm. IEEE Trans. Pattern Analysis Mach. Intell.42 (4), 925–938. 10.1109/TPAMI.2019.2891760
18
LuoH.LiM.YangM.WuF. X.LiY.WangJ. (2021). Biomedical data and computational models for drug repositioning: A comprehensive review. Briefings Bioinforma.22 (2), 1604–1619. 10.1093/bib/bbz176
19
LuoH.WangJ.LiM.LuoJ.PengX.WuF. X.et al (2016). Drug repositioning based on comprehensive similarity measures and Bi-Random walk algorithm. Bioinformatics32 (17), 2664–2671. 10.1093/bioinformatics/btw228
20
PengY.GaneshA.WrightJ.XuW.MaY. (2012). Rasl: robust alignment by sparse and low-rank decomposition for linearly correlated images. IEEE Trans. Pattern Analysis Mach. Intell.34, 2233–2246. 10.1109/TPAMI.2011.282
21
PhamD. H.BasarabA.ZemmouraI.RemenierasJ. P.KouameD. (2021). Joint blind deconvolution and robust principal component analysis for blood flow estimation in medical ultrasound imaging. IEEE Trans. Ultrasonics, Ferroelectr. Freq. Control68 (4), 969–978. 10.1109/TUFFC.2020.3027956
22
QinL.WangJ.WuZ.LiW.LiuG.TangY. (2022). Drug repurposing for newly emerged diseases via network-based inference on a gene-disease-drug network. Mol. Inf.41 (9), 2200001. 10.1002/minf.202200001
23
ResnikP. (1995). “Using information content to evaluate semantic similarity in a taxonomy,” in Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, Quebec, Canada, August 1995.
24
SteinbeckC.HanY.KuhnS.HorlacherO.LuttmannE.WillighagenE. (2003). The Chemistry Development Kit (CDK): an open-source java library for chemo-and bioinformatics. J. Chem. Inf. Comput. Sci.43 (2), 493–500. 10.1021/ci025584y
25
VamathevanJ.ClarkD.CzodrowskiP.DunhamI.FerranE.LeeG.et al (2019). Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov.18 (6), 463–477. 10.1038/s41573-019-0024-5
26
VanD.BruggemanJ.VriendG.BrunnerH. G.LeunissenJ. A. M. (2006). A text-mining analysis of the human phenome. Eur. J. Hum. Genet.14 (5), 535–542. 10.1038/sj.ejhg.5201585
27
WangH.ZhaoS.ZhaoJ.FengZ. (2021). A model for predicting drug–disease associations based on dense convolutional attention network. Math. Biosci. Eng.18 (6), 7419–7439. 10.3934/mbe.2021367
28
WangJ. Z.DuZ.PayattakoolR.YuP. S.ChenC. F. (2007). A new method to measure the semantic similarity of GO terms. Bioinformatics23 (10), 1274–1281. 10.1093/bioinformatics/btm087
29
WangS. J.YanW. J.ZhaoG.FuX.ZhouC. G. (2014). Micro-Expression Recognition Using Robust Principal Component Analysis and Local Spatiotemporal Directional Features. European Conference on Computer Vision. Berlin, Germany: Springer.
30
WangW.YangS.LiJ. (2013). Drug target predictions based on heterogeneous graph inference. Pac. Symposium Biocomput.18, 53–64. 10.1142/9789814447973_0006
31
WeiningerD. (1988). SMILES, a chemical language and information system. 1. introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci.28 (1), 31–36. 10.1021/ci00057a005
32
WishartD.KnoxC.GuoA. C.ShrivastavaS.HassanaliM.StothardP.et al (2006). DrugBank: A comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res.34, 668–672. 10.1093/nar/gkj067
33
WrightJ.GaneshA.RaoS.PengY.MaY. (2009). Robust principal component analysis: exact recovery of corrupted low-rank matrices by convex optimization. Adv. Neural Inf. Process. Syst.22.
34
XiaoQ.LuoJ.LiangC.CaiJ.DingP. (2018). A graph regularized non-negative matrix factorization method for identifying microRNA-disease associations. Bioinformatics34 (2), 239–248. 10.1093/bioinformatics/btx545
35
XieY.GuS.LiuY.ZuoW.ZhangW.ZhangL. (2016). Weighted Schatten p-Norm Minimization for Image Denoising and Background Subtraction. IEEE Trans. Image Process.25 (10), 4842–4857. 10.1109/tip.2016.2599290
36
XuanP.YeY.ZhangT.ZhaoL.SunC. (2019). Convolutional neural network and bidirectional long short-term memory-based method for predicting drug-disease associations. Cells8 (7), 705. 10.3390/cells8070705
37
YanY.YangM.ZhaoH.DuanG.PengX.WangJ. (2022). Drug repositioning based on multi-view learning with matrix completion. Briefings Bioinforma.23 (3), bbac054. 10.1093/bib/bbac054
38
YangJ.HeS.ZhangZ.BoX. (2021a). NegStacking: drug-Target interaction prediction based on ensemble learning and logistic regression. IEEE/ACM Trans. Comput. Biol. Bioinforma.18 (6), 2624–2634. 10.1109/TCBB.2020.2968025
39
YangM.LuoH.LiY.WangJ. (2019). Drug repositioning based on bounded nuclear norm regularization. Bioinformatics35 (14), i455–i463. 10.1093/bioinformatics/btz331
40
YangM.WuG.ZhaoQ.LiY.WangJ. (2021b). Computational drug repositioning based on multi-similarities bilinear matrix factorization. Briefings Bioinforma.22 (4), bbaa267. 10.1093/bib/bbaa267
41
YuZ.HuangF.ZhaoX.XiaoW.ZhangW. (2021). Predicting drug–disease associations through layer attention graph convolutional network. Briefings Bioinforma.22 (4), bbaa243. 10.1093/bib/bbaa243
42
ZhangL.PengZ. (2019). Infrared small target detection based on partial sum of the tensor nuclear norm. Remote Sens.11 (4), 382. 10.3390/rs11040382
43
ZhaoB. W.HuL.YouZ. H.WangL.SuX. R. (2022). Hingrl: predicting drug–disease associations with graph representation learning on heterogeneous information networks. Briefings Bioinformatic23 (1), bbab515. 10.1093/bib/bbab515
Summary
Keywords
drug repositioning, tensor robust principal component analysis, weighted k-nearest neighbor, low-rank tensor, drug–disease associations
Citation
Yang M, Yang B, Duan G and Wang J (2023) ITRPCA: a new model for computational drug repositioning based on improved tensor robust principal component analysis. Front. Genet. 14:1271311. doi: 10.3389/fgene.2023.1271311
Received
02 August 2023
Accepted
23 August 2023
Published
18 September 2023
Volume
14 - 2023
Edited by
Rui Yin, University of Florida, United States
Reviewed by
Hailin Chen, East China Jiaotong University, China
Bo-Wei Zhao, Chinese Academy of Sciences (CAS), China
Updates
Copyright
© 2023 Yang, Yang, Duan and Wang.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Mengyun Yang, mengyun_yang@126.com; Jianxin Wang, jxwang@mail.csu.edu.cn
Disclaimer
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.