Applying negative sample denoising and multi-view feature for lncRNA-disease association prediction

Increasing evidence indicates that mutations and dysregulation of long non-coding RNAs (lncRNAs) play a crucial role in the pathogenesis and prognosis of complex human diseases. Computational methods for predicting the association between lncRNAs and diseases have gained increasing attention. However, these methods face two key challenges: obtaining reliable negative samples and incorporating lncRNA-disease association (LDA) information from multiple perspectives. This paper proposes a method called NDMLDA, which combines multi-view feature extraction, unsupervised negative sample denoising, and a stacking ensemble classifier. First, an unsupervised method (K-means) is used to design a negative sample denoising module that alleviates sample imbalance and the impact of potential noise in the negative samples on model performance. Second, graph attention networks are employed to extract multi-view features of both lncRNAs and diseases, thereby enhancing the learning of association information between them. Finally, lncRNA-disease association prediction is implemented through a stacking ensemble classifier. Existing research datasets are integrated into a single dataset, on which 5-fold cross-validation is conducted to evaluate performance. Experimental results demonstrate that NDMLDA achieves an AUC of 0.9907 and an AUPR of 0.9927, with a 5-fold cross-validation variance of less than 0.1%. These results outperform the baseline methods. Additionally, case studies further illustrate the model's potential in cancer diagnosis and precision medicine implementation.


Introduction
Non-coding transcripts, particularly lncRNAs that do not encode proteins, constitute the majority of the genome (Maher, 2012). Typically, lncRNAs are transcripts that exceed 200 nucleotides in length. Noteworthy examples of lncRNAs such as H19 (Brannan et al., 1990) and Xist (Brockdorff et al., 1992) were first implicated in epigenetic regulation in the early 1990s. Numerous functional examples have also demonstrated the involvement of lncRNAs in various human physiological processes, including embryonic stem cell pluripotency, cell cycle regulation, and complex diseases (Rinn and Chang, 2012).
Therefore, exploring the relationship between lncRNAs and complex human diseases will contribute to a better understanding of disease pathogenesis and the development of lncRNA-based pharmacology.
In the past decade, extensive studies have identified many types of lncRNAs that can serve as promising biomarkers for cancer diagnosis and targeted therapy. For instance, LINC01608 has been identified as a promising prognostic biomarker for hepatocellular carcinoma (Liu et al., 2022), NALT1 targets PEG10 by sponging microRNA-574-5p to advance colorectal cancer progression (Ye et al., 2022), and RNA demethylase ALKBH5 promotes lung cancer progression (Shen et al., 2022). However, traditional biological experiments used to identify the association between lncRNAs and diseases, such as PCR (Heid et al., 1996) and microarray analysis (Zhai et al., 2015), have always been limited by high costs and a lack of specificity in exploring and understanding lncRNAs.
With advances in computer technology and its ability to handle vast amounts of data, computational methods have been explored to identify LDAs and have yielded promising results. The first LDA prediction model (called LRLSLDA) was proposed by Chen et al. (Chen and Yan, 2013), utilizing the Laplacian regularized least squares method to predict LDA. This model is built on the hypothesis that similar diseases are associated with similar lncRNAs (Chen and Yan, 2013). Chen et al. (Chen et al., 2015) enhanced LRLSLDA by introducing a fusion method for lncRNA functional similarity. Although these methods did not achieve excellent prediction performance, they sparked further interest in studying the association between lncRNAs and diseases.
To capture comprehensive association information between lncRNAs and diseases, several LDA prediction methods based on similarity network feature fusion have been proposed. For example, Wei et al. proposed the iLncRNAdis-FB model for data fusion through feature blocks (Wei et al., 2021), Chen et al. proposed the iLDMSF model based on KNN for nonlinear multi-similarity fusion (Chen et al., 2021a), and Fan et al. proposed the GCRFLDA framework that integrates the conditional random field layer and the attention mechanism to fuse various similarities between lncRNAs and diseases in a linear manner as auxiliary features of nodes (Fan et al., 2022).
Moreover, data sets in bioinformatics usually present a high level of noise (Miranda et al., 2009). A noisy training data set increases the training time and complexity of the model. Consequently, identifying noisy instances and then eliminating or correcting them are useful techniques in data mining research (Nematzadeh et al., 2020). Chen et al. found that the presence of noisy samples can significantly impact the predictive performance of the LDA model (Chen et al., 2021b). Some papers (Yao et al., 2020; Wei et al., 2021; Kang et al., 2022; Lu and Xie, 2023) have used random sampling to create balanced datasets by including an equal number of unknown and positive samples in an attempt to mitigate the impact of unbalanced datasets. However, this approach may introduce potentially noisy data into the negative sample set. Lan et al. proposed an LDA prediction model based on an improved graph convolution network with Top-K negative sampling (Lan et al., 2021). Another method by Peng et al. involved screening reliable negative samples through a graph autoencoder (Peng et al., 2022). He et al. proposed two similarity-based negative sampling methods, one based on the Euclidean distance between unlabeled samples and positive samples, and the other reducing the number of unlabeled samples based on the functional similarity between lncRNAs (He et al., 2023).
Although existing methods have achieved good performance in predicting LDA, the association information between diseases and lncRNAs has not yet been fully exploited. Additionally, constructing the negative sample set may introduce latent LDAs as noise, reducing the predictive accuracy of the model. This paper proposes a more accurate LDA prediction model that combines multi-view feature extraction, an unsupervised negative sample denoising module, and a stacking ensemble classifier to uncover the associations between lncRNAs and diseases. The main contributions of this paper are as follows:
1. To mitigate the impact of sample imbalance and potential noise in negative samples on the model's performance, a negative sample denoising module is designed using an unsupervised method (K-means (Hartigan and Wong, 1979)). By simultaneously clustering positive and negative samples using K-means, this module not only improves the model's performance but also provides potential solutions for mitigating sample imbalance and achieving negative sample denoising in LDA.
2. To construct a more precise LDA model, we use graph attention networks (Veličković et al., 2017) to obtain multi-view features. These features are then combined with an unsupervised negative sample denoising module and a stacked ensemble classifier.
Experimental results consistently demonstrate the outstanding performance of the proposed LDA prediction model. This model has potential applications in cancer diagnosis and can contribute to the advancement of precision medicine.

Materials and methods
The research flowchart of this paper can be divided into three steps, as illustrated in Figure 1: (Ⅰ) data preprocessing; (Ⅱ) construction of the NDMLDA model by incorporating multi-view feature extraction, an unsupervised negative sample denoising module, and a stacking ensemble classifier; and (Ⅲ) utilization of the NDMLDA model to make predictions regarding the association between unknown lncRNAs and diseases. Furthermore, in section (Ⅰ) of Figure 1, DSS (disease semantic similarity network), DCS (disease cosine similarity network), DGS (disease Gaussian interaction profile kernel similarity network), LSES (lncRNA sequence similarity network), LGS (lncRNA Gaussian interaction profile kernel similarity network), and LFS (lncRNA functional similarity network) denote the six similarity networks.
To gain a more comprehensive understanding of the correlation between lncRNAs and diseases, we merged and manually curated the LDA data from three databases: Lnc2Cancer, LncRNADisease, and RNADisease (see supplementary for details). As a result, we obtained a total of 8,334 lncRNA-disease associations involving 629 lncRNAs and 511 diseases, which were stored in matrix A. Subsequently, we retrieved the sequence information of all lncRNAs in matrix A from the NONCODE database. Additionally, we applied the same preprocessing method to process the data from lncTarD, resulting in 504 lncRNA-disease associations between 103 diseases

Disease semantics similarity
We use the method proposed by Wang et al. (2010) to calculate the semantic similarity of diseases, which is given by the following formula:
$$DSS(d_i, d_j) = \frac{\sum_{t \in D_i \cap D_j} \left( SC_{d_i}(t) + SC_{d_j}(t) \right)}{SV(d_i) + SV(d_j)}$$
Where d represents a disease; D represents the ancestor nodes of d; ∩ represents intersection; SC and SV represent the semantic contribution value and semantic value of a disease, respectively.
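The semantic similarity computation can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes that a disease contributes 1 to its own semantic value, that each step up the disease hierarchy decays the contribution by a factor `delta` (0.5 is a commonly used value and is an assumption here), and that the hierarchy is given as a child-to-parents mapping. The toy hierarchy `TOY_PARENTS` and all function names are hypothetical.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Return {term: SC_disease(term)} for the disease and all of its ancestors."""
    sc = {disease: 1.0}          # a disease contributes fully to itself
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = delta * sc[node]
            # Keep the largest contribution over all paths to this ancestor.
            if value > sc.get(parent, 0.0):
                sc[parent] = value
                frontier.append(parent)
    return sc


def semantic_similarity(d1, d2, parents, delta=0.5):
    """Shared semantic contributions divided by the summed semantic values."""
    sc1 = semantic_contributions(d1, parents, delta)
    sc2 = semantic_contributions(d2, parents, delta)
    shared = set(sc1) & set(sc2)
    numerator = sum(sc1[t] + sc2[t] for t in shared)
    return numerator / (sum(sc1.values()) + sum(sc2.values()))


# Hypothetical toy hierarchy: child -> list of parents.
TOY_PARENTS = {
    "breast carcinoma": ["breast cancer", "carcinoma"],
    "breast cancer": ["cancer"],
    "carcinoma": ["cancer"],
    "lung carcinoma": ["lung cancer", "carcinoma"],
    "lung cancer": ["cancer"],
}

same = semantic_similarity("breast carcinoma", "breast carcinoma", TOY_PARENTS)
cross = semantic_similarity("breast carcinoma", "lung carcinoma", TOY_PARENTS)
```

In this toy hierarchy the two carcinomas share only the ancestors "carcinoma" and "cancer", so their similarity is well below that of a disease with itself.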

Disease cosine similarity
The cosine similarity between two diseases can be calculated using the following formula:
$$DCS(d_i, d_j) = \frac{A(i,:) \cdot A(j,:)}{\|A(i,:)\| \, \|A(j,:)\|}$$
Where the vector A(i, :) represents the set of elements in the ith row of matrix A, and the length of this vector is denoted as ‖A(i, :)‖.
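As a small worked example, the disease cosine similarity can be computed directly from association profiles. The sketch below assumes that each row of the matrix is the association profile of one disease, as in the definition of A(i, :) above; the matrix `TOY_A` and the zero-profile convention are illustrative assumptions, not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two association profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: a profile with no known association gets similarity 0.
        return 0.0
    return dot / (norm_u * norm_v)


def disease_cosine_similarity(A):
    """Pairwise cosine similarity between the rows of A."""
    n = len(A)
    return [[cosine_similarity(A[i], A[j]) for j in range(n)] for i in range(n)]


# Hypothetical toy matrix: 3 diseases (rows) x 4 lncRNAs (columns).
TOY_A = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
]
DCS = disease_cosine_similarity(TOY_A)
```

Diseases 0 and 1 share one associated lncRNA, so their similarity is 1/√2, while diseases 0 and 2 share none and get 0.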

Disease (lncRNA) Gaussian interaction profile kernel similarity
We utilize the algorithm presented by van Laarhoven et al. (2011) to calculate the Gaussian interaction profile kernel similarity for diseases (lncRNAs), which is given by the following formulas:
$$DGS(d_i, d_j) = \exp\left(-\gamma \, \|A(i,:) - A(j,:)\|^2\right)$$
$$LGS(l_i, l_j) = \exp\left(-\gamma \, \|A(:,i) - A(:,j)\|^2\right)$$
Where DGS and LGS represent the disease and lncRNA Gaussian interaction profile kernel similarity, respectively; γ represents the normalized kernel bandwidth.
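The Gaussian interaction profile kernel can be sketched as follows. This is a minimal illustration under stated assumptions: the bandwidth is normalized by the mean squared norm of the profiles, the unnormalized bandwidth `gamma_prime` is set to 1 (a common choice, assumed here), and at least one profile is non-zero. The function name and toy data are hypothetical.

```python
import math


def gip_kernel_similarity(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between rows of `profiles`."""
    n = len(profiles)
    # Normalize the bandwidth by the average squared norm of the profiles.
    mean_sq_norm = sum(sum(x * x for x in p) for p in profiles) / n
    gamma = gamma_prime / mean_sq_norm
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-gamma * sq_dist)
    return sim


# Hypothetical toy matrix with one interaction profile per row.
TOY_PROFILES = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
]
row_similarity = gip_kernel_similarity(TOY_PROFILES)
# The same function gives the similarity for the other entity type
# when it is applied to the transposed matrix.
column_similarity = gip_kernel_similarity([list(col) for col in zip(*TOY_PROFILES)])
```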

LncRNA functional similarity
We adopt the method proposed by Sun et al. (2014) to calculate the functional similarity of lncRNAs (LFS). The formula is as follows:
$$LFS(l_i, l_j) = \frac{\sum_{d \in D_i} SS(d, D_j) + \sum_{d \in D_j} SS(d, D_i)}{n_{D_i} + n_{D_j}}$$
Where d represents a disease associated with a lncRNA l; D represents the group of diseases associated with l and n_D represents the total number of diseases in this group; SS(d, D) represents the maximum semantic similarity between d and the diseases in D.
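The lncRNA functional similarity can be illustrated with a short sketch. It assumes that a disease semantic similarity matrix is already available and that each lncRNA is represented by the list of indices of its associated diseases; the matrix `TOY_DSS`, the disease groups, and the function names are hypothetical.

```python
def max_similarity(d, group, dss):
    """SS(d, D): the largest semantic similarity between d and any disease in D."""
    return max(dss[d][other] for other in group)


def lncrna_functional_similarity(group_i, group_j, dss):
    """Average best-match semantic similarity between two disease groups."""
    if not group_i or not group_j:
        # Assumption: lncRNAs without associated diseases get similarity 0.
        return 0.0
    total = sum(max_similarity(d, group_j, dss) for d in group_i)
    total += sum(max_similarity(d, group_i, dss) for d in group_j)
    return total / (len(group_i) + len(group_j))


# Hypothetical semantic similarity matrix for three diseases.
TOY_DSS = [
    [1.0, 0.4, 0.1],
    [0.4, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
# lncRNA i is associated with diseases 0 and 1, lncRNA j with diseases 1 and 2.
lfs_ij = lncrna_functional_similarity([0, 1], [1, 2], TOY_DSS)
```

Here the four best-match similarities are 0.4, 1.0, 1.0, and 0.2, so the result is 2.6 / 4 = 0.65.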

LncRNA sequence similarity
Inspired by Li et al. (2020), we introduce lncRNA sequence similarity (LSES), which is calculated using the following formula:
$$LSES(l_i, l_j) = 1 - \frac{cost(l_i, l_j)}{len(l_i) + len(l_j)}$$
Where len(l) represents the length of the sequence l; cost(l_i, l_j) measures the minimum cost required to transform the sequence of l_i into the sequence of l_j by performing three types of operations: insertion, deletion, and replacement (with an insertion or deletion cost of 1 and a replacement cost of 2).
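The weighted edit cost described above is a standard dynamic program. The sketch below assumes that the similarity is one minus the cost divided by the summed sequence lengths and that both sequences are non-empty; the function names and example sequences are illustrative only.

```python
def weighted_edit_cost(s, t, indel=1, replace=2):
    """Minimum cost to turn s into t (insert/delete cost 1, replace cost 2)."""
    # prev[j] holds the cost of turning the current prefix of s into t[:j].
    prev = [j * indel for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        curr = [i * indel]
        for j, ct in enumerate(t, start=1):
            substitute = prev[j - 1] + (0 if cs == ct else replace)
            delete = prev[j] + indel
            insert = curr[j - 1] + indel
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def sequence_similarity(s, t):
    """Similarity in [0, 1]; 1 means identical sequences."""
    return 1.0 - weighted_edit_cost(s, t) / (len(s) + len(t))


identical = sequence_similarity("ACGU", "ACGU")
one_replacement = sequence_similarity("ACGU", "ACGA")
disjoint = sequence_similarity("AAAA", "CCCC")
```

With a replacement cost of 2, the cost never exceeds the summed lengths, which keeps the similarity between 0 and 1.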

Methods
The construction process of NDMLDA is shown in Figure 2 and mainly consists of three steps: (A) multi-view feature extraction; (B) negative sample set denoising; (C) training and prediction of the stacking ensemble classifier.

Multi-view feature extraction
The Graph Attention Network (GAT) has demonstrated significant potential in predicting LDA as a primary approach for multi-view feature extraction (Shi et al., 2021; Liang et al., 2022; Zhao et al., 2022). Following the guidance of previous literature (Forster et al., 2022), we developed a GAT-based module for multi-view feature extraction, as depicted in Figure 3. To begin with, we transformed the similarity networks of the various views (DSS, DCS, DGS, LFS, LGS, and LSES) into edge list format, where each row contains a source, a target, and a weight. Subsequently, for each input network, we constructed an encoder by concatenating three GAT layers, as illustrated in Figure 3. This encoding process enables the learning of high-order neighborhood features for each input network using a dedicated encoder. Next, by employing feature aggregation and random loss processing, a unified disease (or lncRNA) feature H is generated. Finally, all node features were arranged in a matrix F. Through multiple experiments, we fixed the length of this feature to 64 (see supplementary material for details). To enhance the quality of feature extraction, we decode and reconstruct the network from the unified feature matrix F, which holds the learned lncRNA (or disease) features. Our objective is to minimize the discrepancy between the reconstructed network $F \cdot F^{T}$ and the original input network. The process of network reconstruction after decoding is exemplified below.

$$\hat{A} = F \cdot F^{T}$$
The loss function in this process can be defined as follows: Where n represents the total number of nodes in the input network, $b_j$ represents the node mask of input network j, $A_j$ represents the adjacency matrix corresponding to input network j, ⊙ represents the inner product, and $\|\cdot\|_F$ represents the F-norm.
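The decoding step and a masked reconstruction loss can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the loss is the squared F-norm of the difference between each input network and the product of F with its transpose, with the node mask applied to both rows and columns and the total divided by n². That normalization, the mask handling, and all function names are assumptions.

```python
def reconstruct(F):
    """Reconstructed network: the product of F with its transpose."""
    n = len(F)
    return [[sum(a * b for a, b in zip(F[i], F[j])) for j in range(n)]
            for i in range(n)]


def masked_reconstruction_loss(F, networks, masks):
    """Assumed loss: masked squared F-norm, summed over networks, divided by n^2."""
    n = len(F)
    A_hat = reconstruct(F)
    total = 0.0
    for A, b in zip(networks, masks):
        for i in range(n):
            for j in range(n):
                # A node absent from network j (mask 0) contributes nothing.
                diff = b[i] * (A[i][j] - A_hat[i][j]) * b[j]
                total += diff * diff
    return total / (n * n)


# Hypothetical two-node example with a two-dimensional feature per node.
F_toy = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
dense = [[1.0, 1.0], [1.0, 1.0]]

loss_perfect = masked_reconstruction_loss(F_toy, [identity], [[1, 1]])
loss_dense = masked_reconstruction_loss(F_toy, [dense], [[1, 1]])
loss_masked = masked_reconstruction_loss(F_toy, [dense], [[1, 0]])
```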

Negative sample set denoising
The process of negative sample set denoising is shown in Figure 4. First, the positions of elements with values 1 and 0 in the LDA matrix are recorded separately. Then, the unified feature vectors of the corresponding diseases and lncRNAs are retrieved based on these positions. These two feature vectors are directly concatenated, with diseases preceding lncRNAs, to form a sample.

FIGURE 3
The encoding process of a specific input network.

FIGURE 4
The denoising process of negative sample set.

Frontiers in Genetics frontiersin.org
The complete sample set (All Samples) is obtained by concatenating the features of all positions. Next, the number of clusters K is determined by calculating the silhouette coefficient. The silhouette coefficient (SC), which ranges from −1 to 1, is a commonly used indicator in previous studies for evaluating the effectiveness of clustering algorithms (Rousseeuw, 1987). SC varies with the value of K: the closer the silhouette coefficient is to 1, the closer K is to its ideal value. SC can be calculated as follows:
$$SC(i) = \frac{C_b(i) - C_a(i)}{\max\{C_a(i), C_b(i)\}}$$
Where $C_a(i)$ represents the average distance between sample i and the other samples in its cluster, while $C_b(i)$ represents the minimum average distance between sample i and the samples in different clusters. In this study, we set K as 3.
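To make the role of the silhouette coefficient concrete, the sketch below computes it for a given clustering and compares a good and a poor labeling of the same points. It assumes Euclidean distance and assigns a score of 0 to samples in singleton clusters (a common convention, assumed here); the data and function names are hypothetical.

```python
import math


def silhouette_scores(points, labels):
    """Per-sample silhouette coefficient for a given cluster labeling."""
    scores = []
    for i, p in enumerate(points):
        distances = {}  # cluster label -> distances from sample i
        for j, q in enumerate(points):
            if i != j:
                distances.setdefault(labels[j], []).append(math.dist(p, q))
        own = distances.pop(labels[i], None)
        if not own or not distances:
            scores.append(0.0)  # singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)                                # C_a(i)
        b = min(sum(d) / len(d) for d in distances.values())   # C_b(i)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores


def mean_silhouette(points, labels):
    scores = silhouette_scores(points, labels)
    return sum(scores) / len(scores)


# Hypothetical points forming two well-separated groups.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean_silhouette(toy_points, [0, 0, 1, 1])
poor = mean_silhouette(toy_points, [0, 1, 0, 1])
```

In practice, one would run the clustering for several candidate values of K and keep the value with the highest mean silhouette coefficient.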
We used the K-means algorithm (Hartigan and Wong, 1979) to perform 10 rounds of clustering on the entire sample set. The complete description of the negative sample denoising process is as follows: Let P represent the known positive sample set, $P = \{p_1, p_2, \ldots, p_m\}$, where each sample $p_i$ represents a known lncRNA-disease association. Let U represent the unknown sample set, $U = \{u_1, u_2, \ldots, u_{n-m}\}$. Assuming that the samples in U that are similar to P are noise samples, we take the following steps to denoise U: First, we cluster the entire sample set using the K-means algorithm, which results in cluster divisions $C = \{C_1, C_2, \ldots, C_k\}$, where each cluster $C_i$ is a set. For each cluster $C_i$, we calculate the proportion of positive samples and denote it as $r(C_i)$.
Then, we repeat the following steps 10 times:
1. Cluster the sample set using the K-means algorithm to obtain cluster divisions $C' = \{C'_1, C'_2, \ldots, C'_k\}$.
2. For each cluster $C'_i$, calculate the proportion of positive samples and denote it as $r'(C'_i)$.
3. Find the cluster $C'_i$ with the highest $r'(C'_i)$ and denote its unknown sample set as $U'$.
4. Save $U'$.
Finally, we take the intersection of the noise sample sets obtained from these 10 clustering iterations, $U_{noise} = U'_1 \cap U'_2 \cap \ldots \cap U'_{10}$, and remove these samples from U. The final denoised unknown sample set is represented as $U_{reliable} = U - U_{noise}$. The unknown samples in $U_{reliable}$ represent the denoised negative samples.
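The denoising procedure can be sketched end to end on toy data. This is a simplified illustration and not the authors' implementation: it uses a plain Lloyd-style K-means with a farthest-first initialization instead of the Hartigan-Wong algorithm cited in the text, works on two-dimensional points rather than concatenated disease and lncRNA features, and all names and data are hypothetical.

```python
import math
import random


def kmeans(points, k, rng, iters=50):
    """Lloyd-style K-means; farthest-first initialization (a simplification)."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def denoise_unknown(points, is_positive, k=3, rounds=10, seed=0):
    """Return (noise indices, reliable negative indices) among unknown samples."""
    rng = random.Random(seed)
    noise = None
    for _ in range(rounds):
        labels = kmeans(points, k, rng)

        def positive_ratio(c):
            flags = [pos for pos, l in zip(is_positive, labels) if l == c]
            return sum(flags) / len(flags) if flags else 0.0

        best = max(range(k), key=positive_ratio)
        # Unknown samples that fall in the most positive-rich cluster.
        suspects = {i for i, l in enumerate(labels) if l == best and not is_positive[i]}
        noise = suspects if noise is None else noise & suspects
    reliable = [i for i, pos in enumerate(is_positive) if not pos and i not in noise]
    return sorted(noise), reliable


# Hypothetical samples: indices 0-3 are positives; 4 and 5 are unknown samples
# that sit among the positives; 6-11 are unknown samples far from them.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8),
              (10, 10), (10, 11), (11, 10), (-10, 10), (-10, 11), (-11, 10)]
toy_positive = [True] * 4 + [False] * 8
noise_idx, reliable_idx = denoise_unknown(toy_points, toy_positive)
```

Taking the intersection over all rounds is conservative: an unknown sample is discarded only if it lands in the most positive-rich cluster in every round.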

Training stacking ensemble classifier
To overcome the limited predictive capabilities of individual classifiers, we draw inspiration from previous research (Li et al., 2021; Liang et al., 2022). The training process of the stacking ensemble classifier is illustrated in Figure 5. Five decision tree-based classifiers, namely CatBoost (Dorogush et al., 2018), ExtraTrees (Geurts et al., 2006), LightGBM (Ke et al., 2017), RandomForest (Breiman, 2001), and XGBoost (Chen and Guestrin, 2016), are employed as base classifiers, with LogisticRegression (Cramer, 2002) serving as the meta-classifier. This framework creates a stacked ensemble LDA prediction model (refer to the supplementary material for the training process of the ensemble classifier). We conduct five-fold cross-validation on 80% of the samples from the reconstructed new dataset (details can be found in the supplementary material), while the remaining 20% of samples are used as an independent dataset to evaluate the trained classifiers. Finally, we select the classifier with the best performance for the final LDA prediction.
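A stacking setup of this kind can be sketched with scikit-learn. The example below is an assumption-laden illustration, not the authors' pipeline: it uses only ExtraTrees and RandomForest as base classifiers so that it depends on scikit-learn alone (CatBoost, LightGBM, and XGBoost would be added to `estimators` in the same way), it runs on synthetic data from `make_classification`, and the hyperparameters are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the concatenated disease and lncRNA features.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           random_state=0)

# 80% for five-fold cross-validation, 20% held out as an independent set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("extra_trees", ExtraTreesClassifier(n_estimators=50, random_state=0)),
        ("random_forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-classifier
    stack_method="predict_proba",
    cv=5,  # out-of-fold predictions feed the meta-classifier
)

cv_auc = cross_val_score(stack, X_train, y_train, cv=5, scoring="roc_auc")
stack.fit(X_train, y_train)
test_accuracy = stack.score(X_test, y_test)
```

The meta-classifier is trained on out-of-fold predictions of the base classifiers, which limits leakage from the base models into the final estimator.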

Experimental settings
The performance evaluation of NDMLDA is conducted using five performance metrics: accuracy (ACC), Matthews correlation coefficient (MCC) (Harald, 1946), F1-score, area under the receiver operating characteristic curve (AUC), and area under the precision-recall curve (AUPR). The calculation formulas for these metrics are as follows:
$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
$$F1 = \frac{2 \times TP}{2 \times TP + FP + FN}$$
In the context of the confusion matrix, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

FIGURE 5
The training process of a stacking ensemble classifier.
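The threshold-based metrics follow directly from the four confusion matrix counts. The sketch below uses the standard definitions of ACC, MCC, and F1-score; the function name and the example counts are hypothetical, and returning 0 for an undefined ratio is a convention assumed here.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """ACC, MCC, and F1-score from confusion matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when a denominator is zero.
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    return {"ACC": acc, "MCC": mcc, "F1": f1}


# Hypothetical counts for a balanced test set of 200 samples.
example = confusion_metrics(tp=90, tn=85, fp=15, fn=10)
perfect = confusion_metrics(tp=5, tn=5, fp=0, fn=0)
```

AUC and AUPR are not covered by this sketch because they depend on the ranking of predicted scores rather than on a single set of counts.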
MAGCNSE (Liang et al., 2022) employs a two-step approach, first utilizing GCN to extract multi-view representations of lncRNAs and diseases, and then employing CNN to obtain the final representation. The integrated classifier is then used for prediction (Zhao et al., 2022). VGAELDA (Shi et al., 2021) proposes an LDA prediction method that combines variational inference and graph autoencoders.
LDAformer (Zhou et al., 2022) introduces an LDA prediction method based on topological feature extraction and Transformer encoding.
As shown in Table 1, NDMLDA achieved higher AUC and AUPR by 2.5% and 1.6%, respectively, compared to the second-best MAGCNSE.Furthermore, the overall performance of NDMLDA (with all metrics above 92%) is superior to other comparative methods.
MAGCNSE and CapsNet-LDA mitigate the impact of sparse features on the model through a multi-view approach, achieving good performance (overall performance higher than 0.8).However, they are affected by negative sample noise, resulting in suboptimal performance.Additionally, as shown in Table 1, our model, despite having a decrease in performance in five evaluation metrics without sample reconstruction, still outperforms methods such as SSMF-BLNP and CapsNet-LDA.This indicates that our negative sample denoising module is effective in mitigating the impact of negative sample noise on the model.
LDAformer proposed a method for LDA prediction based on topological feature extraction and a Transformer encoder. By enhancing feature extraction, it improves the performance of complex models. By comparison, even without the sample denoising module, our method obtains multi-view features through GAT and achieves better overall LDA prediction performance with a simple stacking model. This indicates that our multi-view feature processing method is effective.
Meanwhile, to further demonstrate the generalization ability of our method, we conducted comparative experiments on an independent dataset, lncTarD. The experimental results are shown in Table 2. It can be observed that our proposed method still outperforms the comparative methods in four main indicators, indicating the robustness of NDMLDA.

Ablation studies
The influence of negative sample set denoising on the predictive performance of NDMLDA
When the negative sample set denoising module is integrated into NDMLDA (as depicted in Figure 6), all five performance measures exhibit superior results compared to the state without the module. Notably, the addition of the module improves the AUC by 2.3%, AUPR by 2.2%, MCC by 12.4%, F1-score by 5.4%, and ACC by 5.6%. These findings suggest that incorporating the negative sample set denoising module enhances the prediction performance of NDMLDA. We visualized the distribution of samples before and after denoising using t-SNE (Maaten and Hinton, 2008). Figure 7 shows the visualization results. Comparing Figures 7A,C, it can be observed that our proposed method for denoising the negative sample set successfully removes the noisy samples.

FIGURE 6
The influence of negative sample set denoising module on the predictive performance of NDMLDA.

Classifier selection
Table 3 demonstrates that among the five metrics, the stacked ensemble classifier attained optimal results for three of them.While the stacked ensemble classifier's performance in terms of MCC and ACC is slightly lower than that of ExtraTrees (with a maximum difference of 0.02%), it surpasses ExtraTrees in the more significant evaluation metrics of AUC and AUPR (with improvements of 0.16% and 0.15% respectively).These results indicate that the inclusion of the stacked ensemble classifier can enhance the predictive performance of NDMLDA.

Combination of different views
According to Figure 8, the performance of the model is influenced by the combination of different views (AUC: 0.9709-0.9907; AUPR: 0.9818-0.9927). Furthermore, increasing the number of combined views leads to an improvement in the model's performance. To construct a more precise LDA prediction model, we have chosen to utilize fusion features from lncRNA, which include lncRNA Gaussian interaction profile kernel similarity (LGS), lncRNA functional similarity (LFS), and lncRNA sequence similarity (LSES), as well as fusion features from diseases, which
We systematically validated the top 30 lncRNAs associated with each specific type of cancer by cross-referencing three important databases: LncRNADisease v2.0, Lnc2Cancer v3.0, and RNADisease v4.0 (Bao et al., 2019; Gao et al., 2021; Chen et al., 2022), as well as consulting relevant literature records.

Case studies
To further validate the performance of NDMLDA in predicting the association between specific diseases and lncRNAs, we conducted case studies on six prevalent cancers: breast cancer, cervical cancer, colon cancer, esophageal cancer, lung cancer, and stomach cancer. In each case study, we utilized all samples related to the cancer under study as the testing set, while the remaining samples served as the training set. Subsequently, we trained NDMLDA on the training set and employed it to evaluate the samples in the testing set.
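The case-study split described above amounts to holding out every sample of one disease. A minimal sketch, assuming samples are stored as (lncRNA, disease, label) tuples; the sample names and the helper `split_by_disease` are hypothetical.

```python
def split_by_disease(samples, disease):
    """Hold out every sample of `disease` as the testing set."""
    test = [s for s in samples if s[1] == disease]
    train = [s for s in samples if s[1] != disease]
    return train, test


# Hypothetical samples: (lncRNA, disease, label).
toy_samples = [
    ("lncRNA_a", "breast cancer", 1),
    ("lncRNA_b", "breast cancer", 0),
    ("lncRNA_a", "lung cancer", 1),
    ("lncRNA_c", "colon cancer", 0),
]
train_set, test_set = split_by_disease(toy_samples, "breast cancer")
```

Because no sample of the held-out disease appears in training, the model has to rely on the similarity-derived features of that disease rather than on its known associations.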
The validated lncRNAs related to breast cancer and cervical cancer are summarized in Table 4 and Table 5, respectively. In the evidence column, "C" denotes candidate lncRNAs corroborated by the Lnc2Cancer database. "D" denotes candidate lncRNAs supported by the LncRNADisease database. "P" denotes candidate lncRNAs supported by a single literature source. "R" denotes candidate lncRNAs corroborated by the RNADisease database. "P*" denotes candidate lncRNAs supported by multiple published literature sources. Further details regarding the predictions of NDMLDA for lncRNAs associated with four other cancers can be found in the supplementary materials.

Discussion
The NDMLDA method utilizes the negative sample denoising module to obtain negative sample data that closely approximates the real distribution. Instead of introducing a new clustering method, our approach focuses on integrating the multi-view similarity network with the negative sample denoising technique. To achieve this, we adopt the K-means algorithm, a well-established clustering algorithm, as the core algorithm for negative sample denoising.
The NDMLDA model demonstrates good performance by combining stacked classifiers. However, we have also noticed that several single classifiers used for comparison have AUC and AUPR values around 0.99. On one hand, this is because we balanced the positive and negative samples during classifier evaluation. On the other hand, it is due to the relatively small number of known lncRNA-disease associations, which results in an insufficient number of samples for performance evaluation. However, considering the increasing complexity of data in future model applications, we have chosen the stacked ensemble classifier as our final classifier to ensure the competitiveness of our model.
However, our proposed model (NDMLDA) still has some limitations. Although we obtained a large number of known LDAs (8,334) by merging multiple databases, they remain very sparse compared with the huge number of unknown samples (313,085). At the same time, the dataset only includes a limited number of lncRNA-disease pairs, which is only a small fraction of those in real-world scenarios. Therefore, in the future, we will attempt to further expand the number of LDAs in the dataset to keep pace with the constantly changing real situation. We also recognize that there is still a possibility that some reliable negative samples may be discarded in the denoising process. To mitigate this, we plan to conduct further research and improvements in our future work.
LncRNAs have been established as pivotal regulators of gene expression, playing a significant role in a wide range of biological functions and disease processes, including cancer. This study presents a model known as NDMLDA, which integrates multi-view feature extraction, unsupervised negative sample denoising, and a stacked ensemble classifier. The experimental results demonstrate that the proposed prediction method achieves exceptional performance across five metrics (AUC, AUPR, MCC, F1-score, and ACC). Additionally, the accuracy and reliability of NDMLDA in the prediction process for LDA are further substantiated through six case studies (involving breast cancer, cervical cancer, colon cancer, esophageal cancer, lung cancer, and gastric cancer).

Conclusion
This article introduces an LDA prediction model (NDMLDA) that combines negative sample denoising and multi-view network feature extraction. The experimental results demonstrate that our method outperforms six recent baseline models, achieving excellent performance in five metrics (including AUC, AUPR, and MCC). Additionally, the results of six case studies (breast cancer, cervical cancer, colon cancer, esophageal cancer, lung cancer, and gastric cancer) further validate the accuracy and reliability of NDMLDA in LDA prediction tasks.
FIGURE 2
The construction process of NDMLDA comprises three parts: (A) multi-view feature extraction; (B) negative sample set denoising; (C) training and prediction of the stacking ensemble classifier.
Where α represents the attention coefficient, H represents the features of nodes in A, W represents the trainable weight parameters, T represents the transpose operation, and σ represents the non-linear activation function LeakyReLU (Maas et al., 2013).

FIGURE 7
Comparison of sample distribution before and after negative sample denoising. (A) represents the original distribution of samples, with red dots indicating positive samples and green dots indicating unknown samples; (B) uses gray dots to represent noisy samples in the unknown samples; (C) represents the distribution of samples after denoising.

FIGURE 8
AUC and AUPR corresponding to different combinations of views.

TABLE 1
Comparison of the performance of NDMLDA with other LDA prediction methods.

TABLE 2
Comparison of the performance of NDMLDA with other LDA prediction methods on lncTarD dataset.

TABLE 3
The performance comparison between individual classifiers and stacked ensemble classifiers.

TABLE 4
Top 30 lncRNAs related to breast cancer predicted by NDMLDA.

TABLE 5
Top 30 lncRNAs related to cervical cancer predicted by NDMLDA.