ITRPCA: a new model for computational drug repositioning based on improved tensor robust principal component analysis

Background: Drug repositioning is considered a promising drug development strategy with the goal of discovering new uses for existing drugs. Compared with experimental screening for drug discovery, computational drug repositioning offers lower cost and higher efficiency and has therefore become an active topic in bioinformatics. However, the available samples are sparse, the information comes from multiple sources, and the data contain noise, which makes it difficult to accurately identify potential drug-associated indications. Methods: In this article, we propose a new scheme with improved tensor robust principal component analysis (ITRPCA) over multi-source data to predict promising drug–disease associations. First, we use a weighted k-nearest neighbor (WKNN) approach to increase the overall density of the drug–disease association matrix, which assists in prediction. Second, a drug tensor with five frontal slices and a disease tensor with two frontal slices are constructed from multi-similarity matrices and the updated association matrix. The two target tensors naturally integrate multiple data sources from the drug side and the disease side, respectively. Third, ITRPCA is employed to separate the low-rank tensor from the noise information in each tensor. In this step, an additional range constraint is incorporated to ensure that all predicted entry values of the low-rank tensor lie within a specific interval. Finally, we identify promising drug indications by analyzing the drug–disease association pairs derived from the low-rank drug and disease tensors. Results: We evaluate the effectiveness of ITRPCA by comparing it with five prominent existing drug repositioning methods, using 10-fold cross-validation and independent testing experiments. Our numerical results show that ITRPCA not only yields higher prediction accuracy but also exhibits remarkable computational efficiency.
Furthermore, case studies demonstrate the practical effectiveness of our method.


Introduction
Over the past few decades, although funding for drug development has surged substantially, the number of newly approved drugs reaching the market has remained limited. Notably, developing a new drug takes an average of 13.5 years and costs an average of $1.8 billion (Liu et al., 2020). The process is time-consuming, tremendously expensive, and high-risk (Chong and Sullivan, 2007; Dickson and Gagnon, 2009). Since approved drugs already have safety records, tolerance data, and human pharmacokinetic data from clinical trials, discovering new clinical indications for commercialized drugs is an important strategy for improving the efficiency of drug development (Ashburn and Thor, 2004). Indeed, several successfully repurposed drugs, such as sildenafil, thalidomide, and retinoic acid, are already in wide use (Luo et al., 2021).
Using computational methods to discover new uses for established drugs is a crucial aspect of drug repositioning, which rests on the assumption that drugs with similar properties tend to treat similar diseases. With the rapid development of high-throughput technology and the continuous generation of multi-omics data, there is an increasing focus on crafting computational methods with elevated precision (Wang et al., 2021). These approaches can be classified into four distinct groups: network-based methods, machine learning-based methods, matrix-based methods, and deep learning-based methods.
Network-based approaches infer the scores of drug–disease pairs by constructing heterogeneous biological networks of drugs and diseases and extracting topological information. The fundamental assumption is guilt by association: if a certain drug can interact with most of a target's neighbors, it is probable that the target will also be able to interact with that drug, and vice versa. Based on the guilt-by-association principle, Wang et al. (2013) used a priori information about drugs and targets to establish a heterogeneous graph, and a heterogeneous graph-based inference (HGBI) model was used to predict new drug–target interactions. Luo et al. (2016) enhanced the quality of the similarity between drugs and diseases by exploiting the existing drug–disease associations; building on the combined similarity measures, a new bi-random walk algorithm called MBiRW was developed to infer potential associations between drugs and diseases. Qin et al. (2022) proposed a network-based inference model for newly emerging diseases, which used genes as a bridge in a tripartite drug–gene–disease network to infer latent drug–disease associations. Additionally, to account for both the structure of the networks and the biological characteristics of drugs and indications, Zhao et al. (2022) presented a novel graph representation model based on heterogeneous networks, namely HINGRL, which integrates the biological networks of drugs and diseases to learn features from both topological and biological perspectives.
Machine learning-based approaches use supervised learning algorithms to identify potential indications for drugs based on input features and known associations (Vamathevan et al., 2019), including logistic regression (Yang et al., 2021), random forests (Zhao et al., 2022), and support vector machines (Lavecchia, 2015). Jiang and Huang (2022) proposed a graph representation model based on random forests for drug repositioning; the method identifies drug–disease associations by feeding combined features from the molecular association network into a random forest classifier. Gao et al. (2022) presented a model for predicting associations between drugs and diseases that employs similarity kernel fusion (SKF) to merge diverse similarity kernels for drugs and diseases; this fusion yields two integrated similarity kernels, and the scores of association pairs are calculated using the Laplacian regularized least squares (LapRLS) algorithm.
Matrix-based methods use low-rank matrix representations of the drug–disease association space to identify novel associations based on the similarity of their profiles. Yang et al. (2019) developed a bounded nuclear norm regularization (BNNR) method to obtain a low-rank drug–disease association matrix; it efficiently handles noise originating from the drug and disease similarity data. Yang et al. (2021) proposed a multi-similarity bilinear matrix factorization (MSBMF) method that dynamically integrates multiple drug and disease similarities into drug–disease association training and constrains the predicted association values to be non-negative. Huang et al. (2020) proposed a multi-task learning method that uses ensemble matrix factorization to predict both treatment and non-treatment associations between drugs and diseases; it can capture complementary features across the two tasks. Yan et al. (2022) proposed a multi-view learning with matrix completion (MLMC) method that effectively utilizes multi-source similarity matrices; Laplacian graph regularization is incorporated into MLMC to acquire an all-encompassing feature representation derived from the multi-similarity information of drugs and diseases.
Deep learning-based methods typically use neural network models to learn feature representations of drugs and diseases and use these features to predict new association pairs. Xuan et al. (2019) introduced a bidirectional deep learning model based on a convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM); the framework incorporates the similarities and associations between drugs and diseases as well as the pathways connecting specific drug–disease pairs, effectively integrating raw and topological information between nodes. Combining similarity network fusion (SNF) and neural network (NN) deep learning models, Jarada et al. (2021) proposed a method known as SNF-NN, designed to forecast novel drug–disease associations. Yu et al. (2021) proposed a layer attention graph convolutional network model to detect potential uses of drugs; the model performs graph convolutions on a heterogeneous network constructed from drug and disease information, thereby achieving association prediction.
To mine latent association features in multiple similarity and association data, we present an improved tensor robust principal component analysis (ITRPCA) method. First, we integrate prior information on drugs and diseases to compute five indicators of drug similarity and two indicators of disease similarity. Considering that validated drug–disease associations are extremely sparse, a weighted k-nearest neighbor (WKNN) preprocessing step is employed to enrich the association matrix and aid prediction. Then, we construct a drug tensor and a disease tensor from the multi-similarity matrices and the updated association matrix. Finally, we apply ITRPCA to isolate the low-rank tensor and the noise information in these two tensors, respectively, and focus on the drug–disease association pairs in the clean low-rank tensor to infer promising indications for drugs. Figure 1 illustrates the comprehensive workflow of the ITRPCA method. Our method's key contributions are as follows:

• ITRPCA presents a comprehensive scheme for incorporating diverse drug and disease similarities into prediction training.

• By leveraging the weighted tensor Schatten p-norm, ITRPCA can effectively extract the low-rank association tensor from the updated drug and disease tensors, which efficiently separates noisy data and leads to significantly improved accuracy, as demonstrated in our results.

• The ITRPCA model includes a boundary constraint that ensures all predicted tensor entries fall within a predefined interval.

• We have devised an iterative approach employing the augmented Lagrangian multiplier (ALM) method to numerically solve the ITRPCA model.

Materials
To validate the effectiveness of our proposed method, this study involves three crucial datasets: the gold standard dataset (Gottlieb et al., 2011), Cdataset (Luo et al., 2016), and CTD (Davis et al., 2019). Table 1 summarizes the details of these three datasets, including the numbers of drugs and diseases, the number of known association pairs, and their intended purposes in this study. The drugs and diseases are obtained from DrugBank (Wishart et al., 2006) and the Online Mendelian Inheritance in Man (OMIM) database (Hamosh, 2005), respectively. The corresponding drug–disease association matrix, denoted $A$, is a binary matrix in which proven drug–disease associations are marked with 1s and unproven associations with 0s.
Here, we calculate a total of five similarity matrices for drugs: chemical structure similarity $R_{chem}$, anatomical therapeutic chemical (ATC) code similarity $R_{atc}$, side-effect similarity $R_{se}$, drug–drug interaction similarity $R_{ddi}$, and target profile similarity $R_{targ}$. Based on each drug's canonical SMILES (Weininger, 1988) representation, we use the Chemistry Development Kit (CDK) (Steinbeck et al., 2003) to compute hashed fingerprints for all drugs and then obtain $R_{chem}$. ATC codes for all relevant drugs are extracted from DrugBank; we use the semantic similarity algorithm (Resnik, 1995) to calculate similarity scores between ATC terms and then obtain $R_{atc}$. The remaining similarities are calculated using the Jaccard similarity coefficient (Jaccard, 1908):

$$R_{se/ddi/targ}(i, j) = \frac{|S_i \cap S_j|}{|S_i \cup S_j|},$$

where $S_i$ denotes the side-effect profile of drug $i$ for $R_{se}$, the drug–drug interaction profile of drug $i$ for $R_{ddi}$, and the drug–target interaction profile of drug $i$ for $R_{targ}$.
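As a concrete illustration, the Jaccard coefficient above can be computed for all pairs of binary profiles with a few lines of NumPy. This is a sketch; the profile matrix below contains toy values, not the paper's data:

```python
import numpy as np

def jaccard_similarity(profiles):
    """Pairwise Jaccard similarity between binary profile rows:
    R[i, j] = |S_i ∩ S_j| / |S_i ∪ S_j|."""
    P = np.asarray(profiles, dtype=bool)
    inter = (P[:, None, :] & P[None, :, :]).sum(axis=-1)  # |S_i ∩ S_j|
    union = (P[:, None, :] | P[None, :, :]).sum(axis=-1)  # |S_i ∪ S_j|
    # guard against empty profiles (similarity 0 in that case)
    return inter / np.maximum(union, 1)

# toy side-effect profiles for three hypothetical drugs (rows)
profiles = np.array([[1, 1, 0, 0],
                     [1, 0, 1, 0],
                     [1, 1, 0, 0]])
R_se = jaccard_similarity(profiles)
```

The same routine yields $R_{ddi}$ or $R_{targ}$ when fed interaction or target profiles instead.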
For diseases, two similarity measures are calculated: disease phenotypic similarity $D_{ph}$ and disease ontology similarity $D_{do}$. $D_{ph}$ is obtained from MimMiner (Van Driel et al., 2006), which calculates the frequency with which Medical Subject Headings (MeSH) vocabulary terms co-occur in the medical descriptions of two diseases retrieved from the OMIM database. According to the hierarchical structure of the Disease Ontology, $D_{do}$ is computed using a Gene Ontology-based algorithm (Wang et al., 2007).
In summary, we have collected a total of one drug-disease association matrix A, five drug similarity matrices (i.e., R chem , R atc , R se , R ddi , and R targ ), and two disease similarity matrices (i.e., D ph and D do ) for computational drug repositioning.

Methods
In this section, we introduce our method for identifying potential uses for established drugs. The structure is as follows: first, we describe robust principal component analysis (RPCA) and tensor robust principal component analysis (TRPCA); then, we propose the ITRPCA model according to the requirements of drug repositioning; finally, the ALM method for solving the ITRPCA model is presented in detail.
For ease of reference, bold calligraphic letters denote third-order tensors, e.g., $\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$; capital letters denote matrices, e.g., $X$; bold lower-case letters denote vectors, e.g., $\mathbf{x}$; and $X_{ijk}$ denotes the $(i, j, k)$th element of $\mathcal{X}$.

Robust principal component analysis
RPCA stands as a prominent technique in low-rank representation, which can separate a noise matrix from the original data matrix and learn a clean low-rank matrix. It has found successful applications in computer vision and machine learning, such as video surveillance (Wright et al., 2009), facial modeling (Peng et al., 2012), and subspace clustering (Liu et al., 2010). RPCA targets a matrix and decomposes it into a low-rank matrix and a sparse matrix to achieve noise reduction. Generally, the mathematical formulation of RPCA is

$$\min_{X,E}\ \|X\|_* + \lambda\|E\|_1 \quad \text{s.t.}\quad M = X + E,$$

where $M$ denotes the original matrix, $X$ is the low-rank matrix, and $E$ is the sparse noise matrix. $\|X\|_* = \sum_r \sigma_r(X)$ represents the nuclear norm of $X$, where $\sigma_r(X)$ is the $r$th singular value of $X$; $\|E\|_1 = \sum_{i,j} |e_{ij}|$, where $e_{ij}$ is the $(i, j)$ element of $E$.
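The RPCA decomposition is typically solved with an inexact augmented Lagrangian scheme built from two proximal operators: singular value thresholding for the nuclear norm and entrywise soft thresholding for the $\ell_1$ norm. The following is a minimal sketch, not the paper's implementation; the default $\lambda = 1/\sqrt{\max(n_1, n_2)}$ and the $\mu$-update schedule are common choices assumed here:

```python
import numpy as np

def svt(Y, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft(Y, tau):
    """Entrywise soft thresholding: proximal operator of the l1 norm."""
    return np.sign(Y) * np.maximum(np.abs(Y) - tau, 0.0)

def rpca(M, lam=None, mu=1.0, rho=1.2, mu_max=1e7, n_iter=300):
    """Minimal inexact-ALM sketch of RPCA: M = X (low rank) + E (sparse)."""
    if lam is None:
        lam = 1.0 / np.sqrt(max(M.shape))  # a common default choice
    X = np.zeros_like(M); E = np.zeros_like(M); L = np.zeros_like(M)
    for _ in range(n_iter):
        X = svt(M - E + L / mu, 1.0 / mu)      # low-rank update
        E = soft(M - X + L / mu, lam / mu)     # sparse update
        L = L + mu * (M - X - E)               # multiplier update
        mu = min(rho * mu, mu_max)             # penalty growth
    return X, E
```

The same two proximal operators reappear, in tensor form, in the ITRPCA solver described later.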

Tensor robust principal component analysis
TRPCA (Lu et al., 2020) is an extension of RPCA. The primary motivation behind TRPCA is to handle multi-dimensional datasets, which are prevalent in various domains, including computer vision (Wang et al., 2014), object recognition (Zhang and Peng, 2019), and medical imaging (Pham et al., 2021). TRPCA aims to decompose multi-dimensional data into a low-rank tensor, which captures the essential features of the data, and a sparse tensor, which contains the outliers and noise. The low-rank tensor can be interpreted as the underlying structure of the data, while the sparse tensor represents deviations from this structure.
Similar to the matrix nuclear norm, the tensor nuclear norm (Kilmer and Martin, 2011) is defined as

$$\|\mathcal{X}\|_* = \frac{1}{n_3}\sum_{i=1}^{n_3}\sum_{j=1}^{l} \sigma_j\big(\bar{X}^{(i)}\big),$$

where $\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ and $l = \min(n_1, n_2)$. $\bar{X}^{(i)}$ denotes the $i$th frontal slice of $\bar{\mathcal{X}}$, and $\bar{\mathcal{X}} = \operatorname{fft}(\mathcal{X}, [\,], 3)$ is the discrete fast Fourier transform (FFT) of $\mathcal{X}$ along the third dimension; correspondingly, $\mathcal{X} = \operatorname{ifft}(\bar{\mathcal{X}}, [\,], 3)$. The TRPCA model is formulated as follows:

$$\min_{\mathcal{X},\mathcal{E}}\ \|\mathcal{X}\|_* + \lambda\|\mathcal{E}\|_1 \quad \text{s.t.}\quad \mathcal{M} = \mathcal{X} + \mathcal{E},$$

where $\mathcal{M}$ is the original tensor data, $\mathcal{X}$ is the low-rank tensor, and $\mathcal{E}$ denotes the sparse noise tensor. According to Eq. 3, model (4) regularizes all singular values of the tensor data equally and shrinks them with the same parameter when minimizing the tensor nuclear norm.
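The tensor nuclear norm definition can be computed directly: take the FFT along the third mode, then average the matrix nuclear norms of the Fourier-domain frontal slices. A sketch following the t-SVD construction of Lu et al. (2020):

```python
import numpy as np

def tensor_nuclear_norm(T):
    """Tensor nuclear norm via t-SVD: FFT along the third mode, then
    the sum of singular values of each frontal slice in the Fourier
    domain, divided by n3."""
    Tf = np.fft.fft(T, axis=2)            # \bar{X} = fft(X, [], 3)
    n3 = T.shape[2]
    return sum(np.linalg.svd(Tf[:, :, i], compute_uv=False).sum()
               for i in range(n3)) / n3

# with a single frontal slice, the definition reduces to the matrix case
X = np.zeros((2, 2, 1))
X[:, :, 0] = np.diag([2.0, 3.0])
tnn = tensor_nuclear_norm(X)
```

Here `tnn` equals the matrix nuclear norm of `diag([2, 3])`, i.e., 5.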

ITRPCA for drug repositioning
Weighted k-nearest neighbor preprocessing. Let $\{d_1, d_2, \ldots, d_n\}$ and $\{r_1, r_2, \ldots, r_m\}$ represent the collections of $n$ disease nodes and $m$ drug nodes, respectively. $A \in \mathbb{R}^{n \times m}$ represents the original drug–disease association matrix, where $A_{ij} = 1$ if disease $d_i$ is recognized to have a known connection with drug $r_j$, and $A_{ij} = 0$ otherwise. The $i$th row vector of $A$, i.e., $A_d(d_i) = (A_{i1}, A_{i2}, \ldots, A_{im})$, represents the association profile of disease $d_i$; the $j$th column vector, $A_r(r_j) = (A_{1j}, A_{2j}, \ldots, A_{nj})$, represents the association profile of drug $r_j$. In fact, if novel drug or disease nodes are considered, their corresponding columns or rows in the adjacency matrix are all zero, which leads to unsatisfactory prediction performance (Xiao et al., 2018). We therefore utilize the WKNN algorithm, together with the similarities of drugs and diseases, to populate the drug–disease association matrix.
For each drug $r_q$, the similarities of its $k$ nearest known drugs (those with at least one validated association) are combined to update the drug's association profile:

$$A_r(r_q) = \frac{1}{Q_r}\sum_{j=1}^{k} \alpha^{j-1} R(r_j, r_q)\, A(r_j),$$

where the drugs $r_1$ to $r_k$ are arranged in descending order of their similarity to $r_q$, $\alpha \in [0, 1]$ is a decay term, and $R$ denotes the mean of the five drug similarity matrices. Thus, when the similarity between $r_j$ and $r_q$ is strong, a higher weight is assigned; conversely, a lower weight is assigned. $Q_r = \sum_{1 \le j \le k} R(r_j, r_q)$ is the normalization term. In the same way, the updated association profile for each disease $d_p$ is obtained as

$$A_d(d_p) = \frac{1}{Q_d}\sum_{i=1}^{k} \alpha^{i-1} D(d_i, d_p)\, A(d_i),$$

where $d_1$ to $d_k$ are the diseases sorted in descending order of their similarity to $d_p$, $D$ denotes the mean of the two disease similarity matrices, and $Q_d = \sum_{1 \le i \le k} D(d_i, d_p)$ is the normalization term. After these profile operations, we obtain the two matrices $A_r$ and $A_d$ from the drug and disease spaces, respectively, and the new drug–disease association matrix $A_{DR}$ is computed by fusing them with the original matrix $A$. After WKNN processing, the density of the updated association matrix $A_{DR}$ is greatly improved, and it no longer contains all-zero rows or columns. However, some noise is inevitably introduced into the association matrix.
Subsequently, we will propose our new method for noise separation. Algorithm 1 summarizes the preprocessing step for updating the drug–disease association matrix using WKNN.
Algorithm 1: WKNN preprocessing step for updating the association matrix.

• Input: The original drug–disease association matrix $A \in \mathbb{R}^{n \times m}$; the five drug similarity matrices $R_{chem}$, $R_{atc}$, $R_{se}$, $R_{ddi}$, $R_{targ}$; the two disease similarity matrices $D_{ph}$, $D_{do}$; decay term $\alpha$; neighborhood size $k$.
• Output: Optimized association matrix $A_{DR}$.

1. Compute the mean similarity matrices $R$ and $D$;
2. for each drug $r_q$ do
3.   $V \leftarrow \mathrm{KNN}(r_q, k, R)$; // the $k$ known nearest neighbors of $r_q$ in $R$, in descending order of similarity
4.   $Q_r \leftarrow \sum_{j=1}^{k} R(r_j, r_q)$;
5.   $A_r(r_q) \leftarrow \sum_{j=1}^{k} \alpha^{j-1} R(r_j, r_q) A(r_j)\, /\, Q_r$; // $r_j \in V$
6. end for
7. for each disease $d_p$ do
8.   $U \leftarrow \mathrm{KNN}(d_p, k, D)$;
9.   $Q_d \leftarrow \sum_{i=1}^{k} D(d_i, d_p)$;
10.  $A_d(d_p) \leftarrow \sum_{i=1}^{k} \alpha^{i-1} D(d_i, d_p) A(d_i)\, /\, Q_d$; // $d_i \in U$
11. end for
12. Combine $A$, $A_r$, and $A_d$ to obtain $A_{DR}$;
13. return $A_{DR}$.
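The WKNN preprocessing can be sketched as follows. The rule used at the end to merge $A$, $A_r$, and $A_d$ into $A_{DR}$ (the elementwise maximum against the averaged inferred profiles) is an assumption of this sketch, as is the toy data:

```python
import numpy as np

def wknn_profiles(A, S, k, alpha, axis):
    """WKNN-inferred association profiles (one direction of Algorithm 1).

    A: (n, m) binary associations.  S: mean similarity over drugs
    (axis=1, S is m x m) or diseases (axis=0, S is n x n)."""
    n, m = A.shape
    size = m if axis == 1 else n
    known = (A.sum(axis=0) > 0) if axis == 1 else (A.sum(axis=1) > 0)
    out = np.zeros_like(A, dtype=float)
    for q in range(size):
        cand = [j for j in range(size) if j != q and known[j]]
        cand.sort(key=lambda j: S[j, q], reverse=True)  # descending similarity
        top = cand[:k]
        Q = sum(S[j, q] for j in top)                   # normalization term
        if Q <= 0:
            continue
        for t, j in enumerate(top):
            w = alpha ** t * S[j, q] / Q                # decayed weight
            if axis == 1:
                out[:, q] += w * A[:, j]
            else:
                out[q, :] += w * A[j, :]
    return out

# toy data: 3 diseases x 4 drugs; flat mean similarities R (drugs), D (diseases)
A = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.],
              [0., 0., 1., 0.]])
R = 0.5 * np.ones((4, 4)); np.fill_diagonal(R, 1.0)
D = 0.5 * np.ones((3, 3)); np.fill_diagonal(D, 1.0)
A_r = wknn_profiles(A, R, k=2, alpha=0.9, axis=1)
A_d = wknn_profiles(A, D, k=2, alpha=0.9, axis=0)
# merging rule: an assumed fusion (elementwise max against the average)
A_DR = np.maximum(A, (A_r + A_d) / 2.0)
```

Note how the all-zero fourth drug column acquires nonzero entries after preprocessing, which is exactly the densification the paper describes.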
Drug tensor and disease tensor. We construct a third-order drug tensor with five frontal slices, denoted $\mathcal{R} \in \mathbb{R}^{(m+n) \times m \times 5}$, from the five drug similarity matrices and the updated association matrix. Specifically, the first frontal slice of the drug tensor is the concatenation of $R_{chem}$ and $A_{DR}$:

$$R^{(1)} = \begin{bmatrix} R_{chem} \\ A_{DR} \end{bmatrix} \in \mathbb{R}^{(m+n) \times m}.$$

In the same way, the remaining four frontal slices $R^{(2)}, \ldots, R^{(5)}$ are constructed from the other similarity matrices ($R_{atc}$, $R_{se}$, $R_{ddi}$, and $R_{targ}$) and $A_{DR}$. A third-order disease tensor with two frontal slices, namely $\mathcal{D} \in \mathbb{R}^{n \times (m+n) \times 2}$, is constructed analogously from the two disease similarity matrices and the updated association matrix; each of its slices can be written as

$$D^{(i)} = \begin{bmatrix} A_{DR} & D_{\cdot} \end{bmatrix} \in \mathbb{R}^{n \times (m+n)}, \quad i = 1, 2,$$

with $D_{\cdot} \in \{D_{ph}, D_{do}\}$.
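The slice stacking just described amounts to simple concatenation followed by stacking along a third axis. A sketch with toy sizes and identity matrices standing in for the similarity data:

```python
import numpy as np

def build_drug_tensor(drug_sims, A_DR):
    """Stack five frontal slices [R_i; A_DR] into R in R^{(m+n) x m x 5}."""
    return np.stack([np.vstack([S, A_DR]) for S in drug_sims], axis=2)

def build_disease_tensor(disease_sims, A_DR):
    """Stack two frontal slices [A_DR, D_i] into D in R^{n x (m+n) x 2};
    placing the association block in the first m columns matches the
    extraction D*(:, 1:m, :) used later."""
    return np.stack([np.hstack([A_DR, S]) for S in disease_sims], axis=2)

n, m = 3, 4                                   # toy sizes: 3 diseases, 4 drugs
A_DR = np.random.default_rng(0).random((n, m))
drug_sims = [np.eye(m) for _ in range(5)]     # placeholders for R_chem, ...
disease_sims = [np.eye(n) for _ in range(2)]  # placeholders for D_ph, D_do
R_tensor = build_drug_tensor(drug_sims, A_DR)
D_tensor = build_disease_tensor(disease_sims, A_DR)
```

Each frontal slice thus carries one similarity view plus the shared association block, which is what lets the low-rank decomposition couple the views.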
ITRPCA model. In the two tensors $\mathcal{R}$ and $\mathcal{D}$, some noise is present both in the similarity data and in the association data inferred by WKNN. TRPCA can be employed to separate noise tensors from low-rank tensors. To fully exploit the significant information embedded in the drug and disease tensors, however, it is crucial to adjust the shrinkage of the singular values so that large singular values shrink less and small singular values shrink more. TRPCA fails to utilize this prior knowledge when minimizing the tensor nuclear norm. Therefore, the weighted tensor Schatten p-norm is introduced to treat different singular values separately; it is defined as

$$\|\mathcal{X}\|_{\omega,S_p}^{p} = \frac{1}{n_3}\sum_{i=1}^{n_3}\sum_{j=1}^{h} \omega_j\, \sigma_j\big(\bar{X}^{(i)}\big)^{p},$$

where $\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, $h = \min(n_1, n_2)$, $\sigma_j$ denotes the $j$th singular value, and $\omega_j$ denotes its weight. When $p = 1$ and $\omega = \mathbf{1}$, $\|\mathcal{X}\|_*$ is a special case of $\|\mathcal{X}\|_{\omega,S_p}$. Moreover, it is crucial to note that the entries of the low-rank tensor recovered by TRPCA can be any real value in $(-\infty, +\infty)$, whereas predicted values falling outside the interval $[0, 1]$ would be meaningless in our context. To address this concern, a bound constraint is incorporated to restrict the predicted values of unobserved elements to $[0, 1]$. Our ITRPCA model is formulated as follows:

$$\min_{\mathcal{X},\mathcal{E}}\ \|\mathcal{X}\|_{\omega,S_p}^{p} + \lambda\|\mathcal{E}\|_1 \quad \text{s.t.}\quad \mathcal{M} = \mathcal{X} + \mathcal{E},\quad 0 \le \mathcal{X} \le 1,$$

where $\mathcal{M}$ can be replaced by the drug tensor $\mathcal{R}$ or the disease tensor $\mathcal{D}$ in practice.
Here, we take the drug tensor $\mathcal{R}$ as an example of $\mathcal{M}$. By optimizing the ITRPCA model, a clean low-rank drug tensor $\mathcal{R}^* \in \mathbb{R}^{(m+n) \times m \times 5}$ is obtained, whose low-rank representation derives from the multiple drug similarity data and the association information. We focus on the association part of $\mathcal{R}^*$, denoted $\mathcal{A}_{DR1}^*$ and equal to $\mathcal{R}^*(m+1{:}\,m+n, :, :)$. To obtain a predicted association matrix for inferring potential drug–disease pairs, we average this tensor along the third dimension, i.e., $A_{DR1}^* = \operatorname{avg}(\mathcal{A}_{DR1}^*, 3)$, where $A_{DR1}^*$ is the optimized drug–disease association matrix from the drug perspective. In the same manner, we substitute the disease tensor $\mathcal{D}$ for $\mathcal{M}$ in model (12); a new low-rank tensor $\mathcal{D}^*$, the other association tensor $\mathcal{A}_{DR2}^*$, and the corresponding association matrix $A_{DR2}^*$ are obtained from the perspective of diseases, where $\mathcal{A}_{DR2}^* = \mathcal{D}^*(:, 1{:}\,m, :)$ and $A_{DR2}^* = \operatorname{avg}(\mathcal{A}_{DR2}^*, 3)$. Finally, the integrated drug–disease association matrix $A_{DR}^*$ is obtained by averaging the drug-side and disease-side prediction results.
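The extraction-and-averaging step can be sketched as follows (0-based indexing; the constant-valued tensors are placeholders for the optimized $\mathcal{R}^*$ and $\mathcal{D}^*$):

```python
import numpy as np

def fuse_predictions(R_star, D_star, n, m):
    """Extract the association blocks from the denoised tensors, average
    over frontal slices, then average the drug-side and disease-side
    matrices (R*(m+1:m+n, :, :) and D*(:, 1:m, :) read in 0-based form)."""
    A1 = R_star[m:m + n, :, :].mean(axis=2)   # drug-side prediction
    A2 = D_star[:, :m, :].mean(axis=2)        # disease-side prediction
    return (A1 + A2) / 2.0

n, m = 3, 4
R_star = np.zeros((m + n, m, 5)); R_star[m:, :, :] = 0.2   # placeholder
D_star = np.zeros((n, m + n, 2)); D_star[:, :m, :] = 0.6   # placeholder
A_pred = fuse_predictions(R_star, D_star, n, m)
```

With these placeholder blocks, every fused score equals (0.2 + 0.6) / 2 = 0.4, which makes the averaging logic easy to verify.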
Algorithm 2 summarizes the process of applying ITRPCA in drug repositioning. Based on the predicted pair scores in $A_{DR}^*$, potential drug–disease associations can be inferred. Algorithm 2: ITRPCA algorithm in drug repositioning.

Solutions for ITRPCA
In this subsection, the ALM method is derived to solve model (12). Accordingly, the augmented Lagrangian function is

$$\Gamma(\mathcal{X}, \mathcal{E}, \mathcal{L}, \mu) = \|\mathcal{X}\|_{\omega,S_p}^{p} + \lambda\|\mathcal{E}\|_1 + \langle \mathcal{L}, \mathcal{M} - \mathcal{X} - \mathcal{E} \rangle + \frac{\mu}{2}\|\mathcal{M} - \mathcal{X} - \mathcal{E}\|_F^2,$$

where $\mathcal{L}$ is the Lagrange multiplier and $\mu$ is the penalty parameter. The primary procedure comprises the following subtasks.

Compute $\mathcal{E}^{k+1}$: fixing $\mathcal{X}^{k}$ and $\mathcal{L}^{k}$, and drawing inspiration from the soft-thresholding operator, we have

$$\mathcal{E}^{k+1} = \operatorname{sign}\big(\mathcal{M} - \mathcal{X}^{k} + \mathcal{L}^{k}/\mu_k\big) \odot \max\big(\big|\mathcal{M} - \mathcal{X}^{k} + \mathcal{L}^{k}/\mu_k\big| - \lambda/\mu_k,\ 0\big).$$

Compute $\mathcal{X}^{k+1}$: fixing $\mathcal{E}^{k+1}$ and $\mathcal{L}^{k}$, the resulting subproblem is a weighted tensor Schatten p-norm minimization (WTSNM) problem based on the t-SVD (Kilmer and Martin, 2011). To tackle this concern, the subsequent lemma and theorems can be employed.
Theorem 1 (Xie et al., 2016). The optimal solution of the matrix WTSNM subproblem is given by $P_{\tau p\omega}(Y) = \operatorname{diag}(\gamma_1, \gamma_2, \ldots, \gamma_l)$ with $\gamma_i = T_{GST}(\sigma_i(Y), \tau p\omega_i)$, which can be obtained by the generalized soft-thresholding rule of Lemma 1. Here, $\{\sigma_i(Y)\}$ is organized in descending order, while $\{\omega_i\}$ is arranged in ascending order.

Theorem 2 (Gao et al., 2021). A global optimal solution to model (22) is obtained by applying $P_{\tau p\omega}$ to each frontal slice in the Fourier domain, where $P_{\tau p\omega}(\bar{\mathcal{A}})$ is a tensor and $P_{\tau p\omega}(\bar{A}^{(i)})$ is its $i$th frontal slice.

According to Theorem 2, the global optimal solution of model (17) follows. In addition, we limit the entry values of $\mathcal{X}^{k+1}$ to the interval $[0, 1]$ by using the projection operator

$$\big(P_{[0,1]}(\mathcal{X})\big)_{ijk} = \min\big(\max(X_{ijk}, 0), 1\big).$$

Compute $\mathcal{L}^{k+1}$: we fix $\mathcal{E}^{k+1}$ and $\mathcal{X}^{k+1}$ and minimize $\Gamma(\mathcal{E}^{k+1}, \mathcal{X}^{k+1}, \mathcal{L}, \mu)$ for $\mathcal{L}^{k+1}$, which gives

$$\mathcal{L}^{k+1} = \mathcal{L}^{k} + \mu_k\big(\mathcal{M} - \mathcal{X}^{k+1} - \mathcal{E}^{k+1}\big).$$

Compute $\mu^{k+1}$: in the ITRPCA model, we employ a scheme that gradually increases the penalty parameter to facilitate fast convergence (Gao et al., 2021):

$$\mu_{k+1} = \min(\rho\mu_k, \mu_{\max}). \tag{28}$$

Algorithm 3 provides the overall iterative scheme of the ITRPCA model. It extracts significant information from the drug and disease tensors and ensures that the predicted drug–disease association values lie within $[0, 1]$.
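The two scalar operations at the heart of the $\mathcal{X}$-update, generalized soft-thresholding (GST) applied to each singular value and the $[0,1]$ projection, can be sketched as follows. The fixed-point iteration and its count are common implementation choices assumed here, not taken from the paper:

```python
import numpy as np

def gst(y, tau, p, n_iter=10):
    """Generalized soft-thresholding T_GST(y, tau): a fixed-point scheme
    for the scalar proximal operator of tau * |x|^p (the gamma_i rule in
    Theorem 1).  For p = 1 it reduces to ordinary soft thresholding."""
    y = np.asarray(y, dtype=float)
    # threshold below which the prox is exactly zero
    thr = (2.0 * tau * (1.0 - p)) ** (1.0 / (2.0 - p)) \
        + tau * p * (2.0 * tau * (1.0 - p)) ** ((p - 1.0) / (2.0 - p))
    x = np.abs(y).copy()
    for _ in range(n_iter):
        x = np.abs(y) - tau * p * np.maximum(x, 1e-12) ** (p - 1.0)
        x = np.maximum(x, 0.0)
    x[np.abs(y) <= thr] = 0.0
    return np.sign(y) * x

def project_box(X, lo=0.0, hi=1.0):
    """Projection onto [lo, hi]: the extra step ITRPCA adds so every
    predicted entry stays a meaningful association score."""
    return np.clip(X, lo, hi)
```

In the full solver, `gst` is applied to the singular values of every Fourier-domain frontal slice (with weight-dependent thresholds $\tau p\omega_i$), and `project_box` is applied to the reassembled tensor.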
• Input: Tensor data $\mathcal{M}$ (the drug tensor $\mathcal{R}$ or the disease tensor $\mathcal{D}$), the value $p$ of the Schatten p-norm.
Algorithm 3 Solution for the ITRPCA model.

Evaluation metrics
To evaluate the effectiveness of ITRPCA, we employ 10-fold cross-validation to predict potential indications for existing drugs. In this process, known drug–disease associations within the gold standard dataset are randomly split into 10 distinct subsets of comparable size. One subset serves as the test data, while the remaining nine subsets serve as the training data. This 10-fold cross-validation is repeated 10 times with varied random splits, and the resulting averages are taken as the final results. After the predictions are generated, the candidate diseases associated with each test drug are sorted in descending order of their prediction scores. We utilize three evaluation metrics to assess the overall performance of ITRPCA: the area under the receiver operating characteristic curve (AUC), the area under the precision–recall curve (AUPR), and precision.

FIGURE 2
Prediction results of all methods for 10-fold cross-validation on the gold standard dataset. (A) Receiver operating characteristic curves of the prediction results. (B) Average running time for each of the 10 folds.

(In the results tables, the most optimal outcomes are indicated in bold, while the second-best results are underlined.)
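For completeness, the AUC metric used above can be computed directly from ranks via the Mann–Whitney statistic, without an external library. A self-contained sketch:

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC from the Mann-Whitney U statistic with tie-aware average
    ranks; equivalent to the area under the ROC curve."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(y_score, kind="mergesort")
    sorted_scores = y_score[order]
    ranks = np.empty(len(y_score))
    i = 0
    while i < len(sorted_scores):
        j = i
        while j + 1 < len(sorted_scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1                              # group tied scores
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0  # average 1-based rank
        i = j + 1
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

A perfectly ranked fold gives 1.0, a perfectly inverted one 0.0, and all-tied scores 0.5.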

Parameter setting
In the ITRPCA algorithm, there are some default parameters and two key hyperparameters that need to be tuned. The default parameters are determined empirically. Specifically, in the WKNN step, we set the decay term $\alpha$ to 0.95. In model (12), the regularization coefficient is set to $\lambda = 1/\sqrt{\max(n_1, n_2)\, n_3}$ following the TRPCA convention (Lu et al., 2020), where $n_1$, $n_2$, and $n_3$ are the dimensions of $\mathcal{X}$. For the drug and disease tensors, we design an adaptive scheme to determine the weight vector $\omega$ of model (12). We divide $\omega$ into three parts: the first covers indices 1 to $u$, the second $u+1$ to $v$, and the third $v+1$ to the end, with part-wise weights $[1, 2, 4]$. The indices $u$ and $v$ are determined adaptively from the singular value spectra of the frontal slices, where $\sigma_{j,i}$ is the $i$th largest singular value of the $j$th frontal slice of $\bar{\mathcal{X}}$. It is evident that, by minimizing the weighted tensor Schatten p-norm, the singular values of the second and third parts are shrunk more than those of the first part, because these two parts carry weight values greater than 1. In addition, two key hyperparameters need to be adjusted: the neighborhood size $k$ and the value $p$ of the Schatten p-norm. We perform a grid search guided by the sum of AUC and AUPR in cross-validation, choosing $k$ from {10, 20, 30, 40, 50} and $p$ from {0.6, 0.7, 0.8, 0.9, 1}. The numerical results for determining $k$ and $p$ are reported in Table 2; the highest score occurs at $k = 30$ and $p = 0.9$. Meanwhile, we terminate the ITRPCA algorithm when the following stopping criteria are satisfied or the maximum number of iterations is reached.
where $f_k$ denotes the objective value at iteration $k$, and $tol_1$ and $tol_2$ are the given tolerances, set to $10^{-3}$ and $10^{-4}$ in the algorithm, respectively.
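The grid search over $k$ and $p$ described above is straightforward to express. Here `evaluate` is a hypothetical stand-in for a cross-validation run returning (AUC, AUPR); the toy evaluator below is constructed only so that the selected optimum matches the reported $k = 30$, $p = 0.9$:

```python
import itertools

def grid_search(evaluate, ks=(10, 20, 30, 40, 50),
                ps=(0.6, 0.7, 0.8, 0.9, 1.0)):
    """Pick (k, p) maximizing AUC + AUPR, mirroring the paper's
    selection rule.  `evaluate` is a user-supplied CV routine."""
    best, best_score = None, -float("inf")
    for k, p in itertools.product(ks, ps):
        auc, aupr = evaluate(k, p)
        if auc + aupr > best_score:
            best_score, best = auc + aupr, (k, p)
    return best

# toy evaluator peaking at k=30, p=0.9 (an illustrative assumption)
def toy_eval(k, p):
    return 1.0 - abs(k - 30) / 100.0, 1.0 - abs(p - 0.9)

best_kp = grid_search(toy_eval)
```

In practice, `evaluate` would run the full WKNN + ITRPCA pipeline under 10-fold cross-validation for each candidate pair.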
Comparison with existing methods

We assess the performance of all methods in 10-fold cross-validation on the gold standard dataset. Table 3 shows the AUC, AUPR, and precision values of all compared methods. As shown in Table 3, ITRPCA has the best performance in terms of AUC, AUPR, and precision. Specifically, ITRPCA achieves the best AUPR value of 0.442, which is 67.424%, 4.492%, 4.988%, and 1.376% higher than the corresponding AUPRs of MBiRW, BNNR, MSBMF, and MLMC, respectively; ITRPCA thus performs slightly better than MLMC. The ROC curves of all methods in the 10-fold cross-validation are shown in Figure 2A.
Based on the test results from the repeated 10-fold cross-validation, we used Wilcoxon rank-sum tests to evaluate the statistical significance of the performance differences between ITRPCA and the compared methods. In addition, to demonstrate the computational efficiency of the compared methods, we recorded the average time taken per fold. The 10-fold cross-validation was executed on a personal laptop powered by an Intel Core i7 processor with 16 GB of RAM. Figure 2B shows the average running time per fold for all comparison methods. As shown in Figure 2B, the methods with an average running time of less than 10 seconds are HGBI, MSBMF, and ITRPCA. The average time required by MLMC and MBiRW is relatively long, approximately five times that of our method. Therefore, ITRPCA is a promising prediction method that combines effective predictions with efficient computational performance.

Independent testing
To further demonstrate the performance of ITRPCA in real applications, we conduct two types of independent testing experiments. The gold standard dataset is used as the training set, and the set of associated pairs in the Cdataset excluding the training set is used as the testing set to evaluate the models. Specifically, we collected a total of 57 drug–disease association pairs in the testing set. Figure 3 shows the ROC curves of all comparative methods in independent testing. As shown in Figure 3, ITRPCA demonstrates clear superiority over the other methods in this independent testing. Specifically, ITRPCA yields an AUC value of 0.943, while HGBI, MBiRW, BNNR, MSBMF, and MLMC yield AUC values of 0.873, 0.908, 0.882, 0.925, and 0.892, respectively. It is worth mentioning that the AUC value of ITRPCA is 5.717% higher than that of MLMC.
In addition, the other independent test is conducted using all known associations in the gold standard dataset as training samples and the unknown associations as candidate samples. The prediction scores of all candidate pairs are obtained by each computational method and ranked for each specific drug. We examine how many of the top n candidate indications for each drug are confirmed to have been used in clinical treatment according to the CTD (released in February 2020) (Davis et al., 2019). Specifically, among all the drugs and diseases involved in the gold standard dataset, we identified a total of 938 drug–disease associations that were subsequently validated in the CTD. As shown in Figure 4, the number of correctly predicted associations for the 593 drugs is counted for the top 5 to top 30 candidate indications (in steps of 5). It is evident that ITRPCA predicts the highest number of correct associations among all the methods, followed by MSBMF and MLMC. Specifically, the numbers of validated associations from the top 5 to the top 30 identified by ITRPCA are 163, 269, 348, 409, 456, and 492, respectively. In contrast, the corresponding numbers identified by MLMC are all lower than those of ITRPCA, by 7, 30, 35, 44, 45, and 46, respectively.

Ablation experiment
To elucidate the individual impact of the components within ITRPCA, we designed four ablation experiments: "w/o WKNN," "only WKNN," "ITRPCA-drug," and "ITRPCA-disease." Specifically, "w/o WKNN" denotes the ITRPCA method without WKNN preprocessing; "only WKNN" uses only the WKNN algorithm to infer potential drug–disease associations, without our tensor RPCA model; "ITRPCA-drug" uses only the drug tensor in ITRPCA to predict drug–disease associations, while "ITRPCA-disease" uses only the disease tensor. To ensure a rigorous and unbiased comparison, the same prior similarity information and parameters as in the full ITRPCA model are employed in all of these experiments.
Table 5 shows the AUC, AUPR, and precision results obtained from the 10-fold cross-validation of the comparative methods on the gold standard dataset. As anticipated, ITRPCA achieves the best AUC, AUPR, and precision values, indicating that combining WKNN and TRPCA has a positive impact on predictive performance. In fact, the "w/o WKNN" model does not produce prominent results in predicting latent associations, which illustrates that WKNN preprocessing in ITRPCA assists prediction. The "only WKNN" model adds relevant information based on drug and disease similarity, but this addition also introduces more noise, leading to poor prediction performance; this serves as evidence, from the opposite direction, that TRPCA's noise reduction is significant. Furthermore, based on the prediction results of "ITRPCA-drug" and "ITRPCA-disease," we find that simultaneously utilizing tensor information from both drugs and diseases outperforms using either type of tensor information alone, implying that prediction outcomes can be effectively enhanced by integrating prior knowledge from both drugs and diseases.

Case studies
To demonstrate the practical application of ITRPCA, we conducted case studies aimed at uncovering novel applications for existing drugs. Utilizing all available drug-disease associations and the multiple similarities in the gold standard dataset, we applied the ITRPCA method to predict unexplored relationships between drugs and diseases. Based on the prediction results, we generated all possible candidate indications for each drug and sorted them by their predicted scores. In recent years, the development of drugs for tumors and leukemia has received widespread attention. Here, we selected four commonly used anti-tumor drugs (cisplatin, vincristine, doxorubicin, and methotrexate) and one anti-malignant hematologic drug (cytarabine) and searched for evidence supporting their candidate indications in the CTD.
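The candidate-generation step above amounts to masking out known associations and ranking the rest per drug. A minimal sketch, with hypothetical names (`scores`, `known_mask`, `disease_names`) standing in for the actual data structures:

```python
import numpy as np

def rank_candidate_indications(scores, known_mask, disease_names, top_k=10):
    """Return each drug's top-k candidate indications, excluding pairs
    already known in the training data."""
    scores = np.array(scores, dtype=float)
    scores[known_mask] = -np.inf          # never re-propose known indications
    ranked = {}
    for d in range(scores.shape[0]):
        order = np.argsort(scores[d])[::-1][:top_k]
        ranked[d] = [disease_names[j] for j in order
                     if np.isfinite(scores[d, j])]
    return ranked
```

Applying this ranking per drug produces candidate lists of the kind shown in Table 6.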
Table 6 shows the top 10 candidate indications predicted by the ITRPCA algorithm for the five drugs, with indications confirmed by the CTD highlighted in bold. Each drug had 4-6 validated indications among its top 10 predictions. As an example, doxorubicin (DB00997), a broad-spectrum antitumor medication with antibiotic-like properties, was found to be effective in treating various types of cancer, including esophageal cancer, colon cancer, prostate cancer, renal cell carcinoma (nonpapillary), and hepatocellular carcinoma, as shown in Table 6. Additionally, chronic lymphocytic leukemia (susceptibility to, 2) and reticulum cell sarcoma were ranked first and second in its candidate indication list, respectively; however, their validity as indications has not yet been confirmed. These unconfirmed candidates hold potential as promising targets for further research.

Conclusion
In this study, we have proposed a novel computational method, ITRPCA, for identifying drug-associated indications. ITRPCA not only exhibits robustness in separating the low-rank tensor from the noise information but also restricts the predicted entry values of the low-rank tensor to a specific interval. Cross-validation and independent testing experiments have shown that ITRPCA is a highly effective prediction method. In particular, when compared with existing drug repositioning methods in independent testing, ITRPCA outperforms them on all measures, indicating a clear advantage. Additionally, case studies have confirmed ITRPCA's reliability in predicting new indications for known drugs. Therefore, we are confident that ITRPCA will serve as a valuable tool to facilitate practical drug repositioning.

FIGURE 1
FIGURE 1 Overall workflow of ITRPCA. (A) Construction process of a drug tensor. (B) Drug-disease associations and WKNN preprocessing. (C) Construction process of a disease tensor. (D) ITRPCA model on a drug tensor. (E) ITRPCA model on a disease tensor. (F) Final association matrix.

FIGURE 3
FIGURE 3 ROC curves for all comparison methods tested independently on the Cdataset.

FIGURE 4
FIGURE 4 Number of top 5 to top 30 indications correctly predicted for all drugs by all comparison methods in the CTD. The x-axis represents the comparison of different methods across six specific top n scenarios. The y-axis represents the cumulative sum of confirmed indications among the top n predicted indications for each drug, as determined by the respective methods.

TABLE 1
Number of drugs, diseases, and known association pairs in each dataset along with their respective purposes for dataset utilization.

TABLE 2
Sum of AUC and AUPR values using different k- and p-values in the 10-fold cross-validation.

TABLE 3
AUC, AUPR, and precision values of all comparison methods in 10-fold cross-validation for the gold standard dataset.

TABLE 4
p-values obtained through Wilcoxon rank sum tests with Bonferroni correction, comparing ITRPCA with other methods on AUC, AUPR, and precision.