Finding potential lncRNA–disease associations using a boosting-based ensemble learning model

Introduction: Long non-coding RNAs (lncRNAs) have been used clinically as potential prognostic biomarkers for various types of cancer. Identifying associations between lncRNAs and diseases helps to capture potential biomarkers and design efficient therapeutic options. Wet-lab experiments for identifying these associations are costly and laborious. Methods: We developed LDA-SABC, a novel boosting-based framework for lncRNA–disease association (LDA) prediction. LDA-SABC extracts LDA features based on singular value decomposition (SVD) and classifies lncRNA–disease pairs (LDPs) by combining LightGBM with AdaBoost built on convolutional neural networks. Results: The performance of LDA-SABC was evaluated under fivefold cross validations (CVs) on lncRNAs, diseases, and LDPs. It outperformed four other classical LDA inference methods (SDLDA, LDNFSGB, LDASR, and IPCARF) in terms of precision, recall, accuracy, F1 score, AUC, and AUPR. Given this prediction performance, we used LDA-SABC to find potential lncRNA biomarkers for lung cancer. The results suggested that 7SK and HULC could be associated with non-small-cell lung cancer (NSCLC) and lung adenocarcinoma (LUAD), respectively. Conclusion: We hope that the proposed LDA-SABC method can help improve LDA identification.

Network-based methods predict new LDAs through label propagation and multi-information fusion on heterogeneous lncRNA–disease networks (Jiang et al., 2010; Zou et al., 2016; Hu et al., 2017; Wang et al., 2019; Yu et al., 2020; Qiu et al., 2023b). Chen et al. conducted many studies that significantly advanced LDA prediction (Chen and Yan, 2013; Chen et al., 2015; Chen, 2015a; Chen, 2015b). Based on these studies, they comprehensively summarized the current computational methods for non-coding RNA analysis and outlined existing challenges and corresponding solutions (Chen and Huang, 2022; 2023). Xie et al. used the unbalanced bi-random walk algorithm (Xie et al., 2020b; a) and bidirectional linear neighborhood label propagation (Xie et al., 2023) for LDA identification. In addition, the random walk with restart algorithm (Wang et al., 2022) is still applied to find new LDAs. Network-based methods found many possible LDAs, but they did not analyze the topological features of LDA networks.
To boost LDA prediction performance, we developed LDA-SABC, a novel boosting-based framework for LDA prediction. LDA-SABC extracts LDA features based on SVD and classifies LDPs by integrating LightGBM (Wang et al., 2023) and AdaBoost combined with the convolutional neural network (AdaBoost-CNN) (Taherkhani et al., 2020; Peng et al., 2023c). The performance of LDA-SABC was evaluated under fivefold cross validations (CVs) on lncRNAs, diseases, and LDPs. The approach identified several potential lncRNA biomarkers for lung cancer. LDA-SABC is publicly available at https://github.com/plhhnu/LDA-SABC.

Overview of LDA-SABC
LDA-SABC contains two main steps: 1) LDA feature extraction: the LDP linear features are extracted through SVD. 2) LDA classification: the association probability of each LDP is computed by integrating AdaBoost-CNN and LightGBM. The details are shown in Figure 1.

Data preparation
LDA-SABC was evaluated on two human LDA datasets (Peng et al., 2024a), namely, LncRNADisease (Chen et al., 2012) and MNDR (Cui et al., 2018). After removing diseases without regular names or MeSH data and lncRNAs without sequence data, the two datasets contain the numbers of lncRNAs, diseases, and LDAs listed in Table 1. Subsequently, an LDA network containing $n$ lncRNAs and $m$ diseases is denoted as $Y \in \mathbb{R}^{n \times m}$, where $y_{ij} = 1$ if lncRNA $l_i$ is associated with disease $d_j$, and $y_{ij} = 0$ otherwise.
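As a rough illustration of this encoding, the sketch below builds the association matrix from a list of known pairs. The function name `build_association_matrix` and the identifiers `lnc_a`, `dis_x`, and so on are hypothetical placeholders; they are not entries from LncRNADisease or MNDR.

```python
def build_association_matrix(lncrnas, diseases, known_pairs):
    """Return Y as n rows by m columns, with Y[i][j] = 1 for each known LDA."""
    row = {name: i for i, name in enumerate(lncrnas)}
    col = {name: j for j, name in enumerate(diseases)}
    Y = [[0] * len(diseases) for _ in lncrnas]
    for lnc, dis in known_pairs:
        Y[row[lnc]][col[dis]] = 1
    return Y


# Hypothetical toy identifiers (assumptions, not real database records).
lncrnas = ["lnc_a", "lnc_b", "lnc_c"]
diseases = ["dis_x", "dis_y"]
known_pairs = [("lnc_a", "dis_x"), ("lnc_c", "dis_x"), ("lnc_c", "dis_y")]

Y = build_association_matrix(lncrnas, diseases, known_pairs)
for name, row_values in zip(lncrnas, Y):
    print(name, row_values)
```

Every entry that is not listed as a known pair stays at 0, which means "no observed association" rather than a confirmed negative.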

LDA feature extraction
SVD (Abdi, 2007) can effectively extract features through matrix decomposition. By keeping only the larger singular values, SVD reduces the dimensionality of the data and removes features that contribute less to data variability, thereby reducing storage and computation costs. In addition, the singular vectors corresponding to smaller singular values tend to represent noise or redundant parts of the data, so keeping the larger singular values retains the main linear features while removing noise and redundant information. Furthermore, the magnitude of each singular value reflects the importance of the corresponding feature, so inspecting the singular values and their singular vectors helps us understand the structure and variation patterns of the data. Thus, SVD is used to extract lncRNA and disease features: the LDA matrix $Y \in \mathbb{R}^{n \times m}$ is factorized using Eq. 1:

$$Y = U \Sigma V^{T}, \tag{1}$$

where $V^{T}$ represents the transpose of $V$, $U \in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{m \times m}$ are two real matrices, and $\Sigma \in \mathbb{R}^{n \times m}$ denotes a rectangular diagonal matrix composed of the singular values. Subsequently, the $e$ largest singular values are selected to build an approximate representation using Eq. 2:

$$Y \approx U_e \Sigma_e V_e^{T}, \tag{2}$$

where $U_e$, $\Sigma_e$, and $V_e$ keep only the parts of $U$, $\Sigma$, and $V$ associated with the $e$ largest singular values. Consequently, $U_i$ and $V_j^{T}$ denote the features of the $i$th lncRNA $l_i$ and the $j$th disease $d_j$, respectively. As a result, the features of each lncRNA can be represented as an $a$-dimensional vector, and the features of each disease can be represented as a $b$-dimensional vector. The two features are concatenated as a $d$-dimensional ($d = a + b$) vector for characterizing each LDP.
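The feature extraction step can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the released LDA-SABC code: it takes $a = b = e$, uses the unscaled singular vectors as features, and orders the pairs row by row. The function name `svd_pair_features` and the toy matrix are hypothetical.

```python
import numpy as np


def svd_pair_features(Y, e):
    """Truncated-SVD features for lncRNAs, diseases, and every lncRNA-disease pair."""
    # Singular values come back sorted in descending order.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    lnc_feat = U[:, :e]        # n x e, one row per lncRNA
    dis_feat = Vt[:e, :].T     # m x e, one row per disease
    n, m = Y.shape
    # Pair (i, j) sits in row i * m + j: lncRNA features first, disease features second.
    pair_feat = np.hstack([np.repeat(lnc_feat, m, axis=0),
                           np.tile(dis_feat, (n, 1))])
    return lnc_feat, dis_feat, pair_feat


# Toy association matrix (assumption): 4 lncRNAs by 3 diseases.
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1],
              [0, 0, 1]], dtype=float)

lnc_feat, dis_feat, pair_feat = svd_pair_features(Y, e=2)
print(pair_feat.shape)  # (n * m, 2 * e)
```

Under these assumptions, a pair vector of dimension $d = 64$ would correspond to $e = 32$; whether the original implementation splits the dimensions evenly is not stated in the text.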

LDA prediction

LDA-AdaBoost-CNN
For an LDA dataset $D = (X, \hat{Y})$ with $p$ ($p = n \times m$) samples (i.e., $p$ LDPs), let $x_i \in X$ denote the $i$th LDP with $d$-dimensional features and $y_i \in \hat{Y}$ denote its label. For the $i$th feature map in the $l$th layer, its activity $y_i^{l}$ is computed from the feature maps $F_j^{l-1}$ of the previous layer using Eq. 3:

$$y_i^{l} = \sum_{j} w_{i,j}^{l} * F_j^{l-1} + b_i^{l}, \tag{3}$$

where $*$ denotes convolution, $w_{i,j}^{l}$ represents the weight of a convolutional kernel, which maps the $j$th feature at the $(l-1)$th CNN layer to the $i$th feature at the $l$th CNN layer, and $b_i^{l}$ is the bias of the $i$th feature in the $l$th layer. Finally, the output $F^{l}$ at the $l$th hidden layer is computed using Eq. 4:

$$F^{l} = f(y^{l}), \tag{4}$$

where $f(\cdot)$ denotes a non-linear function. Consequently, the probability distribution matrix $Z$ of all LDPs is computed via a softmax function using Eq. 5:

$$Z = \mathrm{softmax}(W_o F^{L} + b_o), \tag{5}$$

where $W_o$ denotes a weight matrix linking the last hidden layer with the output layer, $b_o$ indicates the bias, and $F^{L}$ represents the output at the last hidden layer. For the $i$th sample $x_i$, after training $Q$ CNNs, its output is computed based on its output $o_q^{k}(x_i)$ ($k = 1, 2$) in the $q$th CNN using Eq. 6:

LDA-LightGBM
LightGBM is a gradient boosting model. It uses two powerful techniques to acquire the optimal split node and accurately classify unknown samples: gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB). GOSS estimates the gain of a candidate split mainly from the samples with large gradients, where $g_i$ represents the negative gradient of the loss for sample $x_i$. However, LDA features have high dimensions and many zero values, that is, many features rarely take nonzero values simultaneously. To solve this problem, LDA-LightGBM first uses weights to characterize the conflicts between all LDA features and constructs a weighted graph. Subsequently, all LDA features are sorted, and each feature is either assigned to an existing bundle or used to create a new bundle. Finally, all LDPs are classified using Eq. 9:

$$F(x_i) = \sum_{q=1}^{T_q} h_q(x_i), \tag{9}$$

where $T_q$ is the maximum iteration number and $h_q(x_i)$ is the $q$th basic decision tree.
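To make the one-side sampling idea concrete, here is a minimal pure-Python sketch of gradient-based one-side sampling as it is generally described for LightGBM: keep the samples with the largest absolute gradients, randomly sample from the rest, and re-weight the sampled ones by $(1 - a)/b$. The function name `goss_sample`, the rates, and the toy gradients are assumptions for illustration; this is not LightGBM's internal code.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.2, seed=0):
    """Return (indices, weights) selected by gradient-based one-side sampling."""
    n = len(gradients)
    # Sort sample indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    # Randomly keep a fraction of the small-gradient samples.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))
    # Amplify the sampled small-gradient samples to compensate for the subsampling.
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * len(top) + [amplify] * len(sampled)
    return indices, weights


# Toy gradients (assumption): two samples clearly dominate.
gradients = [0.9, -0.05, 0.02, -0.8, 0.1, 0.01, -0.03, 0.4, 0.06, -0.02]
indices, weights = goss_sample(gradients)
print(indices, weights)
```

The large-gradient samples are always retained with weight 1, so poorly fitted LDPs keep driving the split search while the well-fitted majority is thinned out.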

Ensemble learning
Ensemble learning often exhibits stronger classification performance than individual classifiers. Thus, we combined LDA-AdaBoost-CNN and LDA-LightGBM for LDA identification. For one LDP $x_i$, let $C(x_i)$ and $F(x_i)$ represent its association scores computed by LDA-AdaBoost-CNN and LDA-LightGBM, respectively; its final association probability $p(x_i)$ is obtained using Eq. 10:

$$p(x_i) = \alpha C(x_i) + \beta F(x_i), \tag{10}$$

where $\alpha$ and $\beta$ ($\beta = 1 - \alpha$) weight the contributions of LDA-AdaBoost-CNN and LDA-LightGBM to the LDA inference performance, respectively.

Frontiers in Genetics, Zhou et al., 10.3389/fgene.2024.1356205

Results

Experimental settings
To assess the LDA inference performance of LDA-SABC, we implemented three fivefold CVs to compare it with four representative LDA prediction approaches, namely, SDLDA (Zeng et al., 2020), LDNFSGB (Zhang et al., 2020), IPCARF (Zhu et al., 2021), and LDASR (Guo et al., 2019). The parameters of these four methods were taken from their corresponding publications. For the LDA-SABC model, we set n_estimators, learning rate, and epochs to 100, 0.1, and 10, respectively, in LDA-AdaBoost-CNN and n_estimators and learning rate to 100 and 0.1, respectively, in LDA-LightGBM. The dimension $d$ of an LDA feature vector was set to 64.

Comparison with four classical LDA prediction methods
We used six evaluation metrics (precision, recall, accuracy, F1 score, AUC, and AUPR (Shen et al., 2022; Liu et al., 2023; Qiu et al., 2023a)) to assess the performance of LDA-SABC and four other LDA prediction algorithms (SDLDA, LDNFSGB, IPCARF, and LDASR) under three different fivefold cross validations. The three CVs are fivefold CV on lncRNAs ($CV_l$), fivefold CV on diseases ($CV_d$), and fivefold CV on LDPs ($CV_{ld}$). For details, see Peng et al. (2024a). Tables 2-4 report the performance of LDA-SABC and the four other methods on two databases (i.e., LncRNADisease and MNDR) under the three CVs. Figure 2 shows the corresponding ROC and precision-recall (PR) curves.
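For readers less familiar with these metrics, the sketch below computes precision, recall, accuracy, and F1 score from thresholded predictions, and AUC as the probability that a positive pair is scored above a negative pair. The 0.5 threshold, the function names, and the toy scores are assumptions; AUPR is omitted to keep the example short.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, accuracy, and F1 score for binary labels (1 = LDA)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores (assumptions, not results from the paper).
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.7, 0.2, 0.8, 0.4]
y_pred = [int(s >= 0.5) for s in scores]  # assumed decision threshold of 0.5

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

The pairwise AUC definition is quadratic in the number of samples, so it suits small illustrations; a rank-based implementation would be preferable on full datasets.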
$CV_l$ was used to compare the performance of LDA-SABC with SDLDA, LDNFSGB, LDASR, and IPCARF when identifying diseases linked to a new lncRNA. Under $CV_l$, all five methods randomly selected 80% of lncRNAs as the training set and used the remainder as the test set. The results are listed in Table 2 and Figure 2. LDA-SABC outperformed the four classical LDA prediction algorithms in terms of precision, recall, accuracy, F1 score, AUC, and AUPR. For example, LDA-SABC obtained the highest AUC values of 0.9328 and 0.9675 on the LncRNADisease and MNDR databases, exceeding those of the second best algorithm by 13.05% and 3.09%, respectively. It also achieved the highest AUPR values of 0.9304 and 0.9703, exceeding those of the second best algorithm by 8.43% and 1.76%, respectively. These results imply that LDA-SABC could accurately capture the underlying diseases linked to a new lncRNA.
$CV_d$ was applied to compare the performance of LDA-SABC with SDLDA, LDNFSGB, LDASR, and IPCARF when identifying lncRNAs linked to a new disease. Under $CV_d$, all five methods randomly selected 80% of diseases as the training set and used the remainder as the test set. As demonstrated in Table 3 and Figure 2, LDA-SABC significantly surpassed the four other algorithms on the two datasets. For example, LDA-SABC obtained the highest AUC values of 0.9630 and 0.9860 on the LncRNADisease and MNDR databases, exceeding those of the second best algorithm (i.e., SDLDA) by 8.42% and 3.01%, respectively. It also achieved the highest AUPR values of 0.9605 and 0.9836, exceeding those of the second best algorithm (i.e., SDLDA) by 6.71% and 2.75%, respectively. These results suggest that LDA-SABC could accurately infer potential lncRNAs linked to a new disease.
$CV_{ld}$ was used to compare the performance of all five LDA inference methods when identifying new LDAs from unknown LDPs. Under $CV_{ld}$, all five methods randomly selected 80% of LDPs as the training set and used the remainder as the test set. As demonstrated in Table 4 and Figure 2, LDA-SABC significantly improved LDA prediction in comparison with the four other methods. For example, LDA-SABC achieved the highest AUC values of 0.9628 and 0.9878 on the LncRNADisease and MNDR databases, exceeding those of the second best algorithm (i.e., SDLDA) by 8.54% and 3.18%, respectively. It also achieved the highest AUPR values of 0.9606 and 0.9881, exceeding those of the second best algorithm (i.e., SDLDA) by 6.54% and 2.42%, respectively. Thus, LDA-SABC could more accurately infer the underlying LDAs from known LDAs.
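The three CV settings differ only in what is held out: whole lncRNAs (rows), whole diseases (columns), or individual pairs. A minimal sketch of one fold of each setting is given below; the function names, the seed handling, and the toy sizes are assumptions, and the exact splitting procedure of the paper is described in Peng et al. (2024a).

```python
import random


def five_fold_indices(n_items, seed=0):
    """Shuffle item indices and deal them into five disjoint folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[k::5] for k in range(5)]


def split_pairs(n, m, scheme, fold, seed=0):
    """Return (train_pairs, test_pairs) for CV on lncRNAs, diseases, or pairs."""
    pairs = [(i, j) for i in range(n) for j in range(m)]
    if scheme == "lncRNA":      # hold out every pair of the selected lncRNAs
        held = set(five_fold_indices(n, seed)[fold])
        test = [p for p in pairs if p[0] in held]
    elif scheme == "disease":   # hold out every pair of the selected diseases
        held = set(five_fold_indices(m, seed)[fold])
        test = [p for p in pairs if p[1] in held]
    elif scheme == "pair":      # hold out individual lncRNA-disease pairs
        held = set(five_fold_indices(len(pairs), seed)[fold])
        test = [p for k, p in enumerate(pairs) if k in held]
    else:
        raise ValueError("scheme must be 'lncRNA', 'disease', or 'pair'")
    test_set = set(test)
    train = [p for p in pairs if p not in test_set]
    return train, test


# Toy sizes (assumption): 10 lncRNAs and 5 diseases, first fold of each scheme.
train_l, test_l = split_pairs(10, 5, "lncRNA", fold=0)
train_d, test_d = split_pairs(10, 5, "disease", fold=0)
train_p, test_p = split_pairs(10, 5, "pair", fold=0)
print(len(train_l), len(test_l), len(test_d), len(test_p))
```

In the lncRNA and disease settings, the held-out entities never appear in training, which is what makes them a test of prediction for a new lncRNA or a new disease.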

Ablation study
LDA-SABC combines AdaBoost-CNN and LightGBM for LDA prediction. In the ensemble model, $\alpha$ and $\beta$ weight the contributions of LDA-AdaBoost-CNN and LDA-LightGBM, respectively, to the LDA inference performance. As shown in Figure 3, $\alpha$ was set to 0, 0.2, 0.4, 0.6, 0.8, and 1 in turn, and LDA-SABC achieved the best performance on the LncRNADisease and MNDR databases under fivefold CVs on lncRNAs, diseases, and LDPs when $\alpha = 0.4$. Supplementary Tables S1-S3 show the detailed performance of LDA-SABC for each of these six values. Thus, we set $\alpha$ and $\beta$ to 0.4 and 0.6, respectively.
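The weighted combination and the sweep over $\alpha$ can be sketched as follows. This is a toy illustration under assumptions: the scores are made up, accuracy at a 0.5 threshold stands in for the six metrics used in the paper, and `ensemble_score` and `sweep_alpha` are hypothetical names.

```python
def ensemble_score(cnn_score, gbm_score, alpha):
    """Weighted average of the two base scores with beta = 1 - alpha."""
    return alpha * cnn_score + (1.0 - alpha) * gbm_score


def sweep_alpha(cnn_scores, gbm_scores, labels,
                alphas=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Return the accuracy obtained for each candidate alpha."""
    results = {}
    for alpha in alphas:
        combined = [ensemble_score(c, f, alpha)
                    for c, f in zip(cnn_scores, gbm_scores)]
        predictions = [1 if s >= 0.5 else 0 for s in combined]  # assumed threshold
        correct = sum(p == y for p, y in zip(predictions, labels))
        results[alpha] = correct / len(labels)
    return results


# Made-up scores for four pairs (assumptions, not values from the paper).
labels = [1, 0, 1, 0]
cnn_scores = [0.9, 0.6, 0.3, 0.2]
gbm_scores = [0.6, 0.3, 0.7, 0.4]

results = sweep_alpha(cnn_scores, gbm_scores, labels)
for alpha, accuracy in results.items():
    print(alpha, accuracy)
```

Setting $\alpha$ to 0 or 1 reduces the ensemble to a single base model, so the two ends of the sweep double as single-model baselines.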
To better understand the performance of ensemble learning, we compared LDA-SABC with other boosting algorithms, i.e., AdaBoost-CNN, AdaBoost, and LightGBM, under three different CVs. The boosting algorithms used the same feature extraction procedure as LDA-SABC and differed only in the boosting model used for classifying unknown LDPs. Tables 5-7 show their LDA prediction performance under fivefold CVs on lncRNAs, diseases, and LDPs, respectively. The results demonstrate that LDA-SABC achieved the best LDA inference accuracy on the two LDA databases under the three CVs in most cases, supporting the LDP classification performance of the proposed ensemble learning model with LightGBM and AdaBoost-CNN.

Case study
Lung cancer is one of the most frequent malignant tumors and has very high incidence and mortality rates. More importantly, its 5-year survival rate is much lower than that of other leading cancers (Huang et al., 2023). Non-small-cell lung cancer (NSCLC) and lung adenocarcinoma (LUAD) are two prevalent lung cancers, wherein NSCLC accounts for approximately 85% of lung cancers (Tan et al., 2023) and LUAD is the most predominant subtype (Li et al., 2023). lncRNAs have close associations with various complex diseases and are potential biomarkers of many types of cancers. Therefore, it is very important to discover potential lncRNAs and further provide therapeutic options for lung cancer.
Through performance comparison, we validated the LDA classification performance of LDA-SABC. Subsequently, we utilized LDA-SABC to discover potential lncRNAs for NSCLC and LUAD. We computed the association probabilities between all lncRNAs and NSCLC and LUAD. Tables 8 and 9 list the top 15 lncRNAs with the highest association probability with NSCLC and LUAD, respectively, among all lncRNAs that have no observed association with these diseases in the LncRNADisease and MNDR databases. Figure 4 shows the two predicted LDA networks for NSCLC and LUAD.
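The ranking step of the case study amounts to filtering out lncRNAs with an observed association and sorting the rest by predicted probability. A minimal sketch is shown below; the function name `top_k_candidates`, the lncRNA identifiers, and the probabilities are hypothetical.

```python
def top_k_candidates(scores, observed, lncrnas, k=15):
    """Rank lncRNAs without an observed association by predicted probability."""
    candidates = [(lncrnas[i], s)
                  for i, s in enumerate(scores) if observed[i] == 0]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:k]


# Hypothetical identifiers and probabilities for one disease (assumptions).
lncrnas = ["lnc_a", "lnc_b", "lnc_c", "lnc_d", "lnc_e"]
scores = [0.91, 0.15, 0.78, 0.66, 0.40]   # predicted association probabilities
observed = [1, 0, 0, 0, 0]                # 1 = association already in the dataset

top = top_k_candidates(scores, observed, lncrnas, k=2)
print(top)
```

The highest-scoring lncRNA in this toy example is skipped because its association is already recorded, which mirrors the restriction to pairs with no observed association.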
Among the inferred top 15 lncRNAs associated with LUAD, 8 and 11 lncRNAs predicted on the LncRNADisease and MNDR databases, respectively, have been reported by Lnc2Cancer 3.0, LncRNADisease v3.0, and/or RNADisease. We found that HULC, ranked 7th and 11th on the two databases, respectively, could be associated with LUAD. HULC is an oncogenic lncRNA and may serve as a prognostic biomarker of hepatocellular carcinoma development (Liu S. et al., 2023). Moreover, it shows potential as a novel biomarker for assisting acute myocardial infarction diagnosis when combined with other biomarkers (Xie et al., 2022).

Discussion and conclusion
Inferring possible LDAs can advance our understanding of complex human diseases in the context of lncRNAs. However, traditional experimental techniques for LDA identification are costly, laborious, and time-consuming, which restricts the number of verified LDAs. Thus, numerous computational frameworks have been developed. In this manuscript, we proposed a novel computational LDA inference framework, LDA-SABC, which combines SVD with an ensemble model of LightGBM and AdaBoost-CNN.
LDA-SABC first acquired LDP linear features using SVD. Next, it computed the association probability for each LDP with LDA-LightGBM and LDA-AdaBoost-CNN. Finally, all LDPs were classified through ensemble learning. To illustrate the effectiveness of LDA-SABC, it was compared with four classical computational methods (SDLDA, LDNFSGB, IPCARF, and LDASR) under three CVs. The results showed that its performance was significantly improved. To further validate LDA-SABC, we performed case studies to find potential biomarkers of NSCLC and LUAD and identified the top 15 lncRNAs linked to them from all unknown LDPs. The results demonstrated that, among the inferred top lncRNAs reported by the RNADisease, LncRNADisease v3.0, and/or Lnc2Cancer 3.0 databases, 7SK and HULC could be associated with NSCLC and LUAD, respectively.
The novelty of this study lies in using SVD to extract LDP features and in designing an ensemble model with LightGBM and AdaBoost-CNN to improve LDA prediction accuracy. Unlike traditional LDA prediction performance validation, LDA-SABC was assessed under fivefold CVs on lncRNAs, diseases, and LDPs. However, negative LDAs were selected randomly, which may have affected the overall performance of the model. In the future, we will design a more reasonable negative LDA selection strategy based on positive-unlabeled learning. More importantly, we will continue to explore stronger models for LDP classification by integrating various data and deep learning methods. We hope that the proposed LDA-SABC can contribute to lncRNA biomarker discovery for various complex diseases, especially cancers, and further help find new therapeutic options for various types of cancers.
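The random negative selection mentioned above can be sketched as follows. This is an assumption-laden illustration: it draws as many unlabeled pairs as there are positives (a balanced ratio that the text does not state), and the function name `sample_negatives` and the toy matrix are hypothetical.

```python
import random


def sample_negatives(Y, seed=0):
    """Randomly draw unlabeled pairs to serve as negatives, one per positive."""
    n, m = len(Y), len(Y[0])
    positives = [(i, j) for i in range(n) for j in range(m) if Y[i][j] == 1]
    unlabeled = [(i, j) for i in range(n) for j in range(m) if Y[i][j] == 0]
    rng = random.Random(seed)
    # Assumed 1:1 ratio between positives and sampled negatives.
    negatives = rng.sample(unlabeled, min(len(positives), len(unlabeled)))
    return positives, negatives


# Toy association matrix (assumption).
Y = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]

positives, negatives = sample_negatives(Y)
print(positives, negatives)
```

Because some unlabeled pairs may be true but unverified associations, a random draw can mislabel them as negatives, which is the limitation that a positive-unlabeled strategy would aim to reduce.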
Inspired by AdaBoost-CNN proposed by Hastie et al. (2009) and Taherkhani et al. (2020), we developed an LDA identification algorithm, LDA-AdaBoost-CNN, by integrating AdaBoost and CNNs based on transfer learning. Given $Q$ CNNs, LDA-AdaBoost-CNN uses CNNs as base estimators for predicting LDAs. During training, we use a vector $D$ with initial values $1/p$ to measure the importance of each sample. Next, the weights of all training samples are updated and normalized. Finally, LDA-AdaBoost-CNN outputs a binary vector $o_Q^{k}(x_i)$ with the last CNN to identify one LDP as LDA ($k = 1$) or non-LDA ($k = 2$).
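The sample-weight bookkeeping described above can be illustrated with one round of the classic discrete AdaBoost update. This is a stand-in, not the exact AdaBoost-CNN rule of Taherkhani et al. (2020): the base estimator's predictions are supplied as a plain list instead of coming from a CNN, and the function name `update_sample_weights` is hypothetical.

```python
import math


def update_sample_weights(weights, y_true, y_pred):
    """One discrete-AdaBoost round: up-weight misclassified samples, then normalize."""
    # Weighted error of the current base estimator, clipped away from 0 and 1.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)
    estimator_weight = 0.5 * math.log((1 - err) / err)
    updated = [w * math.exp(estimator_weight if t != p else -estimator_weight)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return [w / total for w in updated], estimator_weight


# p = 4 samples, each starting with weight 1/p as in the text.
p = 4
weights = [1.0 / p] * p
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 0]   # stand-in predictions; the third sample is misclassified

new_weights, estimator_weight = update_sample_weights(weights, y_true, y_pred)
print(new_weights, estimator_weight)
```

After the update, the misclassified sample carries more weight than any correctly classified one, so the next base estimator is pushed to focus on it.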

FIGURE 2
ROC and PR curves of LDA-SABC and four other methods: (A,B) ROC and PR curves on the LncRNADisease and MNDR databases under $CV_l$, respectively. (C,D) Curves under $CV_d$. (E,F) Curves under $CV_{ld}$.

FIGURE 3
Effects of the parameters α and β on the LDA prediction performance: (A) Performance of LDA-SABC based on different α and β values on LncRNADisease under $CV_l$, $CV_d$, and $CV_{ld}$, respectively. (B) Performance of LDA-SABC based on different α and β values on MNDR under $CV_l$, $CV_d$, and $CV_{ld}$, respectively.
FIGURE 4
(A) Inferred top 15 lncRNAs associated with NSCLC on LncRNADisease and MNDR databases. (B) Inferred top 15 lncRNAs associated with LUAD on LncRNADisease and MNDR databases.

TABLE 1
Introduction of two LDA datasets.

TABLE 2
Performance of five LDA inference methods under $CV_l$.

TABLE 3
Performance of five LDA inference methods under $CV_d$.

TABLE 4
Performance of five LDA inference methods under $CV_{ld}$.

TABLE 5
Performance of four boosting methods under $CV_l$.

TABLE 6
Performance of four boosting methods under $CV_d$.

TABLE 8
Predicted top 15 lncRNAs associated with NSCLC on LncRNADisease and MNDR.