Predicting potential lncRNA biomarkers for lung cancer and neuroblastoma based on an ensemble of a deep neural network and LightGBM

Introduction: Lung cancer is one of the most frequent neoplasms worldwide, with approximately 2.2 million new cases and 1.8 million deaths each year. The expression levels of programmed death ligand-1 (PDL1) demonstrate a complex association with lung cancer. Neuroblastoma is a high-risk malignant tumor that mainly affects children. Identification of new biomarkers for these two diseases can significantly promote their diagnosis and therapy. However, in vivo experiments to discover potential biomarkers are costly and laborious. Consequently, artificial intelligence technologies, especially machine learning methods, provide a powerful avenue to find new biomarkers for various diseases.
Methods: We developed a machine learning-based method named LDAenDL to detect potential long noncoding RNA (lncRNA) biomarkers for lung cancer and neuroblastoma using an ensemble of a deep neural network and LightGBM. LDAenDL first computes the Gaussian kernel similarity and functional similarity of lncRNAs and the Gaussian kernel similarity and semantic similarity of diseases to obtain their similarity networks. Next, LDAenDL combines a graph convolutional network, graph attention network, and convolutional neural network to learn the biological features of the lncRNAs and diseases based on their similarity networks. Third, these features are concatenated and fed to an ensemble model composed of a deep neural network and LightGBM to find new lncRNA-disease associations (LDAs). Finally, the proposed LDAenDL method is applied to identify possible lncRNA biomarkers associated with lung cancer and neuroblastoma.
Results: The experimental results show that LDAenDL computed the best AUCs of 0.8701, 0.8953, and 0.9110 under cross-validation on lncRNAs, diseases, and lncRNA-disease pairs on Dataset 1, respectively, and 0.9490, 0.9157, and 0.9708 on Dataset 2, respectively. Furthermore, AUPRs of 0.8903, 0.9061, and 0.9166 under the three cross-validations were obtained on Dataset 1, and 0.9582, 0.9122, and 0.9743 on Dataset 2. In terms of AUC and AUPR, LDAenDL outperformed the other four classical LDA prediction methods (i.e., SDLDA, LDNFSGB, IPCAF, and LDASR). Case studies demonstrate that CCDC26 and IFNG-AS1 may be new biomarkers of lung cancer, SNHG3 may associate with PDL1 for lung cancer, and HOTAIR and BDNF-AS may be potential biomarkers of neuroblastoma.
Conclusion: We hope that the proposed LDAenDL method can help the development of targeted therapies for these two diseases.


Introduction
Long non-coding RNAs (lncRNAs) are non-coding RNAs with more than 200 nucleotides (Bertone et al., 2004; Peng et al., 2022a; Peng et al., 2022b). LncRNAs play an important role in the development and progression of various diseases (Lanjanian et al., 2021; Meng et al., 2021; Yang and Li 2021; Peng et al., 2022c). They are closely associated with many diseases, for example, lung cancer, colorectal cancer, prostate cancer, and Alzheimer's disease (Klattenhoff et al., 2013; Tan et al., 2013; Chakravarty et al., 2014; He et al., 2014; Zhang et al., 2014). LncRNA H19 is associated with the under-regulation of renal carcinoma cells (Wang et al., 2015). The expression of EGOT in breast cancer is much lower than that in adjacent noncancerous tissues (Broadbent et al., 2008). NEAT1 is overexpressed in prostate cancer cells (Pasmant et al., 2011). The identification of lncRNA-disease associations (LDAs) helps us to further understand the biological processes and molecular mechanisms of various complex diseases. However, the number of known and experimentally validated LDAs is very small, so it is important to identify potential LDAs. Because determining LDAs through in vivo experiments is costly and time-consuming, it is necessary to design efficient computational approaches for identifying potential LDAs (Meng et al., 2021; Peng et al., 2022d). Computational LDA prediction methods are categorized as biological network-based methods and machine learning-based methods.
Biological network-based methods use network algorithms for association prediction (Liu et al., 2023a). This type of method first constructs heterogeneous networks of lncRNAs and diseases and then identifies LDAs via matrix decomposition, random walk, and so on. To predict potential LDAs, LRWRHLDA combined Laplace normalized random walk with restart (Wang et al., 2022), LDGRNMF used graph regularized nonnegative matrix factorization (Wang et al., 2021), DSCMF developed a dual sparse collaborative matrix factorization approach (Liu et al., 2021a), RWSF-BLP added random walk-based multi-similarity fusion to bidirectional label propagation (Xie et al., 2021), HBRWRLDA utilized bi-random walk on hypergraphs (Xie et al., 2022), and MHRWRLDA exploited a random walk model with restart through multiplex and heterogeneous networks (Yao et al., 2021).
With the rapid advance of RNA sequencing technologies, artificial intelligence has been widely applied in biomedical data analysis (Peng et al., 2023a; Peng et al., 2023b; Xu et al., 2023). Notably, artificial intelligence technologies, especially machine learning methods, have been widely applied to predict miRNA-disease associations (Liu et al., 2022) and circRNA-disease associations (Liu et al., 2023b). To find new LDAs, HGATLDA developed a novel heterogeneous graph attention network model (Zhao et al., 2022), DeepMNE extracted multi-omics data and designed a deep multi-network embedding model (Ma, 2022), iLncDA-LTR is a rank-based method (Wu et al., 2022), MAGCNSE utilized a graph convolutional network (Liang et al., 2022), LDAformer extracted topological features and used a transformer encoder for LDA classification (Zhou et al., 2022), BiGAN explored a bidirectional generative adversarial network (Yang et al., 2021), and SVDNVLDA extracted linear and nonlinear features and used XGBoost for LDA prediction (Li et al., 2021).
Computational methods have found many potential LDAs. However, network-based methods tend to favor well-investigated lncRNAs or diseases and cannot predict LDAs for new lncRNAs or new diseases, and machine learning-based methods have failed to effectively integrate different kernels from multiple data sources. Thus, in this study, we developed a machine learning-based method named LDAenDL to detect potential lncRNA biomarkers for lung cancer and neuroblastoma based on an ensemble of a deep neural network and LightGBM.

Materials and methods
As shown in Figure 1, LDAenDL first computes the Gaussian kernel similarity and functional similarity of lncRNAs and the Gaussian kernel similarity and semantic similarity of diseases to obtain their similarity networks. Next, LDAenDL combines a graph convolutional network (GCN) (Kipf and Welling, 2016), graph attention network (GAT) (Velickovic et al., 2017), and convolutional neural network (CNN) (Gu et al., 2018) to learn the biological features of lncRNAs and diseases based on their similarity networks. Third, these features are concatenated and fed to an ensemble model composed of a deep neural network (DNN) and LightGBM to find new LDAs. Finally, LDAenDL is applied to identify possible lncRNA biomarkers associated with lung cancer and neuroblastoma.

Data preparation
We used two human LDA datasets that were provided by Chen et al. (2012) and Cui et al. (2018). Dataset 1 contains 605 LDAs between 157 diseases and 82 lncRNAs. Dataset 2 contains 1,529 LDAs between 190 diseases and 89 lncRNAs. An LDA network can be denoted as a matrix $Y \in \mathbb{R}^{n \times m}$, where $y_{ij} = 1$ if lncRNA $l_i$ is associated with disease $d_j$ and $y_{ij} = 0$ otherwise.
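As a small illustration of the matrix $Y$ defined above, the following sketch builds a 0/1 association matrix from a list of known pairs. The function name and the toy identifiers are hypothetical and chosen only for illustration; they are not taken from the two datasets.

```python
def build_association_matrix(pairs, lncrnas, diseases):
    """Return Y as a list of rows, with Y[i][j] = 1 if (lncrnas[i], diseases[j]) is a known LDA."""
    row = {name: i for i, name in enumerate(lncrnas)}
    col = {name: j for j, name in enumerate(diseases)}
    Y = [[0] * len(diseases) for _ in lncrnas]
    for lnc, dis in pairs:
        Y[row[lnc]][col[dis]] = 1
    return Y


# Toy identifiers (hypothetical, not real associations)
lncrnas = ["lnc_a", "lnc_b", "lnc_c"]
diseases = ["dis_x", "dis_y"]
pairs = [("lnc_a", "dis_x"), ("lnc_b", "dis_x"), ("lnc_c", "dis_y")]
Y = build_association_matrix(pairs, lncrnas, diseases)
```

Rows index lncRNAs and columns index diseases, matching the $n \times m$ convention used in the text.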

Similarity computation
Inspired by the LDA-DLPU method (Peng et al., 2022a), we computed the Gaussian kernel similarity and functional similarity of lncRNAs and the Gaussian kernel similarity and semantic similarity of diseases. Based on the computed lncRNA similarity and disease similarity matrices, we learned the features of lncRNAs and diseases by combining a GCN, GAT, and CNN. Dai et al. (2022) designed a hybrid graph representation learning model (GraphCDA) to represent the features of circRNAs and diseases and obtained better circRNA-disease association prediction performance. Following GraphCDA, we adopted a GraphCDA-based LDA feature learning model.
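The text does not spell out the Gaussian kernel similarity, so the sketch below uses the commonly used Gaussian interaction profile kernel as an assumption: $K(i, j) = \exp(-\gamma \lVert Y_i - Y_j \rVert^2)$, with the bandwidth $\gamma$ set to the inverse of the mean squared norm of the interaction profiles. The function name and the bandwidth choice are illustrative and may differ from the formulation used in LDA-DLPU.

```python
import math


def gaussian_kernel_similarity(Y):
    """Gaussian interaction profile kernel between the rows of a 0/1 matrix Y."""
    n = len(Y)
    sq_norms = [sum(v * v for v in row) for row in Y]
    mean_sq = sum(sq_norms) / n
    # Bandwidth normalisation (assumed): gamma = 1 / mean squared profile norm
    gamma = 1.0 / mean_sq if mean_sq > 0 else 1.0
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist_sq = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
            K[i][j] = math.exp(-gamma * dist_sq)
    return K


# Rows are lncRNA interaction profiles; transpose Y to obtain disease similarity.
K_lnc = gaussian_kernel_similarity([[1, 0, 1], [1, 0, 1], [0, 1, 0]])
```

Identical interaction profiles receive a similarity of 1, and the similarity decays as the profiles diverge.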

Graph convolutional network
A GCN was applied to obtain the feature representations of lncRNAs and diseases based on their similarity networks. A graph G with N nodes is described by an adjacency matrix $S \in \mathbb{R}^{N \times N}$, and each node is described by an F-dimensional feature vector. The GCN outputs the node representation matrix $H_{new}$ as given in Eqs 1, 2, where $S' = I + S$; $A_{ii} = \sum_j S'_{i,j}$ and $Q \in \mathbb{R}^{F \times F}$ denote the degree matrix and the trainable weight matrix, respectively; and $\sigma(\cdot)$ denotes the ReLU activation function.
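The definitions above match the standard GCN propagation rule of Kipf and Welling (2016), $H_{new} = \sigma(A^{-1/2} S' A^{-1/2} H Q)$. The sketch below implements one such layer in plain Python, assuming this standard form; the helper names are illustrative, and the sketch assumes non-negative similarities so that every degree is positive.

```python
import math


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def gcn_layer(S, H, Q):
    """One GCN layer: ReLU(A^{-1/2} (I + S) A^{-1/2} H Q)."""
    n = len(S)
    # S' = I + S adds self-loops
    S1 = [[S[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Diagonal of the degree matrix A (row sums of S')
    deg = [sum(row) for row in S1]
    # Symmetric normalisation A^{-1/2} S' A^{-1/2}
    norm = [[S1[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    Z = matmul(matmul(norm, H), Q)
    return [[max(0.0, z) for z in row] for row in Z]  # ReLU


S = [[0.0, 1.0], [1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
H_new = gcn_layer(S, identity, identity)
```

With identity features and weights, the output is simply the normalised adjacency matrix, which makes the effect of the normalisation easy to inspect.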

Graph attention network
A GAT (Veličković et al., 2017) uses multi-head attention to set weights for all adjacent nodes based on their importance. LDAenDL introduces a GAT layer between two GCN layers to help the GCN extract high-level features of lncRNAs and diseases.
For the graph G, a GAT layer outputs the node representations $H_{new}$ as given in Eq. 3. For $K$ attention mechanisms in multi-head attention with weight matrices $W_k$, let $\vec{H}_i$ denote the input feature vector of the i-th lncRNA; its feature representation $H^{new}_i$ in $H_{new}$ is given by Eq. 4, where $\phi^k_{it}$ denotes the k-th attention coefficient between two lncRNA nodes $i$ and $t$. Here, $\|$ denotes the concatenation operation, $f$ denotes the LeakyReLU activation function, $a_k \in \mathbb{R}^{2F+1}$ denotes a weight vector related to the k-th attention mechanism, and $B_k$ denotes the weight of an edge $S_{ij}$.
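The attention coefficients can be sketched for a single head as follows. This is an assumption-laden illustration: it uses the usual GAT scoring $f(a^\top [W h_i \,\|\, W h_t])$ followed by a softmax over neighbours, and it assumes that the extra dimension of $a_k \in \mathbb{R}^{2F+1}$ holds the edge weight $S_{it}$. The function names and the LeakyReLU slope of 0.2 are illustrative.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_coefficients(i, H, S, W, a):
    """Softmax-normalised attention of node i over its neighbours (single head)."""
    def project(h):
        return [sum(w * x for w, x in zip(row, h)) for row in W]  # W h

    hi = project(H[i])
    # Neighbours are nodes joined to i by a positive edge weight, plus i itself
    neighbours = [t for t in range(len(H)) if S[i][t] > 0 or t == i]
    scores = {}
    for t in neighbours:
        z = hi + project(H[t]) + [S[i][t]]  # [W h_i || W h_t || S_it] (assumed layout)
        scores[t] = leaky_relu(sum(ak * zk for ak, zk in zip(a, z)))
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
S = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
phi = attention_coefficients(0, H, S, W, a=[0.1] * 5)
```

In a multi-head layer, this computation would be repeated with $K$ separate pairs $(W_k, a_k)$ and the resulting node representations combined.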

FIGURE 1
The pipeline of LDAenDL.

Feature representation of lncRNAs and diseases
For a lncRNA similarity network $G_c$ with adjacency matrix $C$ and node feature matrix $H_c^{(0)} \in \mathbb{R}^{N_c \times F_c}$, we alternately use GCN and GAT layers to obtain the graph feature representations of lncRNAs at different levels (Eq. 6). A 1D CNN is then used to produce the lncRNA feature representation matrix $X_c$ by combining the output features $H_c^{(1)}$ and $H_c^{(3)}$ of the different GCN layers. Similarly, the graph feature representations of diseases at different levels are given by Eq. 7, and a 1D CNN is used to produce the disease feature representation matrix $X_d$ by combining the output features $H_d^{(1)}$ and $H_d^{(3)}$ of the different GCN layers.

Preference matrix construction
The preference matrix $U$ that describes all lncRNA-disease pairs is computed from $X_c$ and $X_d$ (Eq. 8). We used binary cross-entropy as the loss function to evaluate the difference between the preference matrix $U$ and the known adjacency matrix $R$. By minimizing the loss function on the two LDA datasets, the feature representation matrices $X_c$ and $X_d$ of lncRNAs and diseases are learned.
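The sketch below shows one way the preference matrix and its loss could be computed. It assumes that $U$ is the element-wise sigmoid of the inner products $X_c X_d^\top$, which is a common choice but is not confirmed by the text, and it evaluates the mean binary cross-entropy against $R$. All function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def preference_matrix(Xc, Xd):
    """U[i][j] = sigmoid(<Xc[i], Xd[j]>); the inner-product form is an assumption."""
    return [[sigmoid(sum(a * b for a, b in zip(xc, xd))) for xd in Xd] for xc in Xc]


def binary_cross_entropy(U, R, eps=1e-12):
    """Mean binary cross-entropy between predicted scores U and 0/1 labels R."""
    total, count = 0.0, 0
    for u_row, r_row in zip(U, R):
        for u, r in zip(u_row, r_row):
            u = min(max(u, eps), 1.0 - eps)  # clip to avoid log(0)
            total -= r * math.log(u) + (1 - r) * math.log(1.0 - u)
            count += 1
    return total / count


U = preference_matrix([[0.0, 0.0]], [[1.0, 2.0]])
loss = binary_cross_entropy(U, [[1]])
```

In training, the loss would be minimised with respect to the parameters that produce $X_c$ and $X_d$; this sketch only evaluates it.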

DNN
We built a DNN to predict new LDAs based on known LDAs and the learned LDA features. The DNN contains an input layer, an output layer, and multiple hidden layers. The input layer has $F$ neurons, one for each LDA feature.
Given an LDA sample $x$, the input layer with $k$ inputs is represented by Eq. 9, where $x_i$ denotes the i-th feature in the sample $x$.
The hidden layer is represented by Eq. 10, where $w_i$ and $b_j$ denote the weight of $x_i$ and the bias in the j-th hidden layer, respectively. The output of the j-th hidden layer is given by Eq. 11, where $f$ denotes the ReLU activation function. Finally, the output layer with the sigmoid function outputs the LDA prediction results (Eq. 12).
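The forward pass described above (weighted sum plus bias, ReLU in the hidden layers, sigmoid at the output) can be sketched as follows. The layer sizes and weights are toy values chosen for illustration, not the trained network of the paper.

```python
import math


def dense(x, W, b):
    """Affine map: one weighted sum plus bias per output neuron."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]


def dnn_forward(x, layers):
    """Forward pass: ReLU in every hidden layer, sigmoid on the single output neuron."""
    h = x
    for W, b in layers[:-1]:
        h = [max(0.0, z) for z in dense(h, W, b)]  # ReLU
    W, b = layers[-1]
    z = dense(h, W, b)[0]
    return 1.0 / (1.0 + math.exp(-z))  # association probability


# Toy network: 2 inputs -> 2 hidden neurons -> 1 output
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.0]),
]
p = dnn_forward([2.0, 1.0], layers)
```

Here the hidden activations are 1.0 and 1.5, so the output is the sigmoid of 2.5.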

LightGBM
In this section, we built a LightGBM model (Ke et al., 2017) to identify new LDAs. For a training set $X = \{(x_i, y_i)\}_{i=1}^{n}$ with $n$ lncRNA-disease pairs, LightGBM builds an approximation $\hat{f}$ of a certain function $f(x)$ by minimizing the expected value of the loss function $L(y, f(x))$ (Eq. 13). LightGBM integrates $T$ regression trees to approximate the final model (Eq. 14). The regression trees are expressed as $w_{q(x)}$, $q \in \{1, 2, \ldots, J\}$, where $J$, $q$, and $w$ denote the number of leaves, the decision rules of the tree, and the sample weight of leaf nodes, respectively.
At step $t$, LightGBM is trained in an additive form. The objective function (15) is rapidly approximated with Newton's method (Sun et al., 2020).
To solve the objective function of LightGBM, we removed the constant term for simplicity, so that model (15) can be represented as Eq. 16, where $g_i$ and $h_i$ are the first-order and second-order gradients of the loss function. Given the sample set $I_j$ related to leaf $j$, Eq. 16 is transformed into Eq. 17. Given a certain tree structure $q(x)$, the optimal leaf weight $w_j^*$ of each leaf node and the extreme value of $\Gamma_k$ can be computed by Eq. 18, where $\Gamma_T^*$ is a scoring function used to evaluate the quality of a tree structure $q$. Finally, model (15) can be expressed in terms of $I_L$ and $I_R$, which denote the example sets in the left and right subtrees of $q$, respectively.
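The second-order derivation above leads, in the usual gradient boosting formulation, to a closed-form leaf weight $w_j^* = -G_j / (H_j + \lambda)$ and a split gain computed from the gradient sums of the left and right children, where $G_j$ and $H_j$ are the sums of $g_i$ and $h_i$ over $I_j$. The sketch below assumes this standard form; the L2 regularisation term `lam` is an assumption, since the text does not state the regulariser.

```python
def leaf_weight(g, h, lam=1.0):
    """Optimal leaf weight w* = -G / (H + lam) from first/second-order gradients."""
    return -sum(g) / (sum(h) + lam)


def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """Reduction in the second-order objective when a leaf is split into I_L and I_R."""
    def score(g, h):
        return sum(g) ** 2 / (sum(h) + lam)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent)


# Toy gradients for logistic loss at p = 0.5: g = p - y, h = p * (1 - p)
g_left, h_left = [-0.5, -0.5], [0.25, 0.25]    # two positive samples
g_right, h_right = [0.5, 0.5], [0.25, 0.25]    # two negative samples
w_left = leaf_weight(g_left, h_left)
gain = split_gain(g_left, h_left, g_right, h_right)
```

A split that separates positive from negative samples, as in this toy case, yields a positive gain, and the left leaf receives a positive weight that pushes its predictions toward the positive class.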

Ensemble learning
Through the solution of models (12) and (15), we can identify potential LDAs based on a DNN and LightGBM. Ensemble learning often achieves better prediction accuracy than a single model. To further improve LDA prediction accuracy, we combined a DNN and LightGBM and developed an ensemble model for LDA identification through soft voting in Eq. 16, where $C_{DNN}$ and $C_{LightGBM}$ denote the LDA prediction results from the DNN and LightGBM, respectively, and $\alpha$ and $\beta$ are their weights, with values of 0.4 and 0.6, respectively. In particular, a lncRNA-disease pair is taken as an LDA if its association probability is greater than 0.5; otherwise, the pair is taken as a negative LDA.
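The soft-voting step can be sketched as a weighted sum of the two predicted probabilities, $\alpha C_{DNN} + \beta C_{LightGBM}$, followed by the 0.5 threshold stated above. The function name is illustrative, and the input probabilities are toy values.

```python
def soft_vote(p_dnn, p_lgbm, alpha=0.4, beta=0.6, threshold=0.5):
    """Weighted soft voting over two lists of association probabilities."""
    scores = [alpha * a + beta * b for a, b in zip(p_dnn, p_lgbm)]
    labels = [1 if s > threshold else 0 for s in scores]  # 1 = predicted LDA
    return scores, labels


scores, labels = soft_vote([0.9, 0.2, 0.6], [0.8, 0.1, 0.4])
```

Because the weights sum to 1, the combined score remains a probability, and LightGBM's vote counts slightly more than the DNN's.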

Evaluation metrics
In this article, we compared our proposed LDAenDL method with four LDA prediction methods: SDLDA, LDNFSGB, IPCAF, and LDASR. Precision, recall, accuracy, F1-score, AUC, and AUPR were used to compare the methods.

Comparison of LDAenDL with the other four methods
To implement the performance evaluation, inspired by the three cross-validations proposed by Zhou et al. (2021), we conducted cross-validations on lncRNAs (CV1), diseases (CV2), and lncRNA-disease pairs (CV3). Tables 1-3 give the precision, recall, accuracy, F1-score, AUC, and AUPR under CV1, CV2, and CV3 on the two LDA datasets. In Tables 1-6, the bold font in each row denotes the best performance.
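For reference, the threshold-based metrics and the AUC reported in the tables can be computed as in the following sketch. The AUC is computed here as the fraction of positive-negative pairs that are ranked correctly, which is equivalent to the area under the ROC curve; the function names and the 0.5 threshold are illustrative.

```python
def classification_metrics(y_true, y_score, threshold=0.5):
    """Precision, recall, accuracy, and F1-score at a fixed decision threshold."""
    y_pred = [1 if s > threshold else 0 for s in y_score]
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


def auc(y_true, y_score):
    """AUC as the probability that a positive sample outranks a negative one."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


metrics = classification_metrics([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
area = auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a sketch; rank-based implementations are preferable for large test sets.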
Under CV1, LDAenDL randomly took 80% of lncRNAs as training samples and the rest as test samples to investigate the LDA prediction ability for new lncRNAs. The results in Table 1 show that LDAenDL obtained the best precision, recall, accuracy, F1-score, AUC, and AUPR on the two datasets under CV1, except for a slightly lower precision on Dataset 2 (0.9391 vs. 0.9399). It computed the highest AUPRs of 0.8903 and 0.9582, exceeding the AUPR values computed by SDLDA (0.8461 and 0.9533).
Figure 2 shows the AUC and AUPR values computed by LDAenDL and the other four methods on the two datasets under CV1. The results demonstrate that LDAenDL can discover possible diseases associated with a new lncRNA.
Under CV2, LDAenDL randomly took 80% of diseases as training samples and the rest as test samples to investigate the LDA prediction ability for new diseases. The results in Table 2 show that LDAenDL obtained better precision, AUC, and AUPR on the two datasets under CV2. However, SDLDA computed higher recall, accuracy, and F1-score than LDAenDL, which may be caused by the smaller number of disease samples.
Figure 3 shows the AUC and AUPR values computed by LDAenDL and the other four methods on the two datasets under CV2. The results show that LDAenDL can be applied to screen possible lncRNAs associated with a new disease.
Under CV3, LDAenDL randomly took 80% of lncRNA-disease pairs as training samples and the rest as test samples to investigate the LDA prediction ability. The results in Table 3 show that LDAenDL obtained the best precision, recall, accuracy, F1-score, AUC, and AUPR on the two datasets under CV3. It computed the highest AUCs of 0.9110 and 0.9708.
Figure 4 shows the AUC and AUPR values computed by LDAenDL and the other four methods on the two datasets under CV3. The results demonstrate that LDAenDL can find potential LDAs based on known LDAs.
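The three 80/20 splitting schemes described above differ only in what is held out: whole lncRNAs (CV1), whole diseases (CV2), or individual lncRNA-disease pairs (CV3). A minimal sketch follows; the function name, the seed handling, and the use of a single split rather than repeated folds are assumptions made for illustration.

```python
import random


def split_samples(lncrnas, diseases, mode, train_frac=0.8, seed=0):
    """Split all lncRNA-disease pairs into train/test lists under CV1, CV2, or CV3."""
    rng = random.Random(seed)
    pairs = [(l, d) for l in lncrnas for d in diseases]

    def keep_fraction(items):
        items = list(items)
        rng.shuffle(items)
        return set(items[:int(train_frac * len(items))])

    if mode == "CV1":    # hold out whole lncRNAs (rows of Y)
        kept = keep_fraction(lncrnas)
        train = [p for p in pairs if p[0] in kept]
    elif mode == "CV2":  # hold out whole diseases (columns of Y)
        kept = keep_fraction(diseases)
        train = [p for p in pairs if p[1] in kept]
    elif mode == "CV3":  # hold out individual lncRNA-disease pairs
        kept = keep_fraction(pairs)
        train = [p for p in pairs if p in kept]
    else:
        raise ValueError("mode must be 'CV1', 'CV2', or 'CV3'")
    train_set = set(train)
    test = [p for p in pairs if p not in train_set]
    return train, test


lncs = ["l%d" % i for i in range(10)]
dis = ["d%d" % j for j in range(5)]
train1, test1 = split_samples(lncs, dis, "CV1")
```

Under CV1 no test lncRNA appears in training, which is what makes the setting a test of prediction for new lncRNAs; CV2 gives the same guarantee for diseases.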

FIGURE 2
The AUC and AUPR values of five LDA prediction methods under CV1.

FIGURE 3
The AUC and AUPR values of five LDA prediction methods under CV2.

FIGURE 4
The AUC and AUPR values of five LDA prediction methods under CV3.

Comparison of LDAenDL with individual models
To measure the effect of the ensemble algorithm on LDA prediction performance, we compared LDAenDL with two individual models, the DNN and LightGBM. Tables 4-6 show the precision, recall, accuracy, F1-score, AUC, and AUPR of the DNN, LightGBM, and LDAenDL under CV1, CV2, and CV3, respectively.
Under CV1, as shown in Table 4, LDAenDL outperformed the DNN and LightGBM on the two LDA datasets under the majority of conditions. LDAenDL computed the best accuracy and F1-score on the two datasets. Although LDAenDL computed a slightly lower AUC than the DNN on dataset 1 and a slightly lower AUC than LightGBM on dataset 2, the differences were very small. For example, the DNN computed an AUC of 0.8712 while LDAenDL computed 0.8701 on dataset 1, and the DNN calculated an AUC of 0.9497 while LDAenDL calculated 0.9490 on dataset 2. LDAenDL obtained the best AUPR on dataset 1, whereas on dataset 2 LightGBM obtained an AUPR of 0.9586 and LDAenDL obtained 0.9582.
Under CV2, as shown in Table 5, LDAenDL outperformed the DNN under all conditions on the two LDA datasets. Recall, accuracy, and F1-score computed by LightGBM were slightly better than those of LDAenDL on the two datasets. However, LDAenDL calculated the best AUC and AUPR on dataset 1.
Under CV3, as shown in Table 6, LDAenDL computed the highest precision, recall, accuracy, F1-score, AUC, and AUPR on the two LDA datasets, except for a slightly lower recall on dataset 1. The results demonstrate that LDAenDL is appropriate for predicting possible LDAs from unknown lncRNA-disease pairs.

FIGURE 5
The top 20 predicted lncRNA biomarkers for lung cancer in each of the two datasets (the repeated lncRNAs in the two datasets have been removed). This figure was drawn using Cytoscape (Shannon et al., 2003).

Identifying possible lncRNA biomarkers for lung cancer
Lung cancer is one of the most prevalent causes of mortality globally. It mainly comprises small cell lung cancer and non-small cell lung cancer. Targeted drug therapy is one of its therapeutic options (Lahiri et al., 2023). We used the proposed LDAenDL method to predict possible lncRNA biomarkers for lung cancer. Table 7 shows the predicted top 20 lncRNA biomarkers for lung cancer. These 20 lncRNAs have no known association with lung cancer in the two datasets.
In dataset 1, LDAenDL predicted that CCDC26 could be associated with lung cancer. CCDC26 can enhance thyroid cancer malignant progression (Ma et al., 2021). It promotes imatinib resistance in human gastrointestinal stromal tumors (Yan et al., 2019). Its inhibition could increase the sensitivity of MDR-CML cells to doxorubicin (Liu et al., 2021b).
In dataset 2, LDAenDL predicted that IFNG-AS1 could be associated with lung cancer. IFNG-AS1 has been reported in long-lasting memory T cells (Castellucci et al., 2021). It can boost interferon gamma generation in human natural killer cells (Stein et al., 2019).
Figure 5 shows the top 20 predicted lncRNAs associated with lung cancer in each of the two datasets. Yellow solid lines and blue solid lines denote lncRNA-lung cancer associations confirmed by the literature among the predicted top 20 associations on datasets 1 and 2, respectively. Grey solid lines denote the predicted and co-occurring lncRNA-lung cancer associations that can be confirmed by the literature in the two datasets, and grey dashed lines denote the predicted and unconfirmed lncRNA-lung cancer associations in the two datasets. The repeated lncRNAs in the two datasets have been removed.
Recent advances in lung cancer treatment have demonstrated significant responses in patients treated with programmed death-1/programmed death-ligand 1 (PD-1/PD-L1) checkpoint blockade immunotherapies (Lahiri et al., 2023). To find possible lncRNAs associated with PDL1 for lung cancer, inspired by LPI-DLDN proposed by Peng et al. (2022a), we first downloaded the sequence of PDL1 from the UniProt database. Next, we extracted the biological features of PDL1 and depicted PDL1 as a 10,029-dimensional vector using BioTriangle. Finally, we used cosine similarity to compute the similarities between PDL1 and the other proteins in a lncRNA-protein interaction dataset (Li et al., 2015) and found the top 3 proteins with the highest interaction probabilities with PDL1. The results show that SNHG3 has a higher interaction probability with PDL1 and has been reported to be associated with lung cancer.
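The similarity ranking used in the PDL1 analysis can be sketched as follows: each protein is a feature vector, and candidates are ranked by cosine similarity to the query vector. The function names and the two-dimensional toy vectors are illustrative; the actual vectors described in the text have 10,029 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_similar(query, candidates, k=3):
    """Names of the k candidate proteins most similar to the query vector."""
    ranked = sorted(candidates.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy protein vectors (hypothetical names and values)
candidates = {"P1": [2.0, 0.0], "P2": [0.0, 1.0], "P3": [1.0, 1.0], "P4": [-1.0, 0.0]}
nearest = top_k_similar([1.0, 0.0], candidates)
```

Cosine similarity ignores vector length, so a protein whose features are a scaled copy of the query is treated as identical to it.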

Identifying possible lncRNA biomarkers for neuroblastoma
Neuroblastoma is the most frequent pediatric solid tumor and accounts for approximately 15% of childhood cancer-related mortality (Zafar et al., 2021). We used the proposed LDAenDL method to identify possible lncRNA biomarkers for neuroblastoma. Table 8 shows the top 20 predicted lncRNA biomarkers for neuroblastoma in each of the two datasets. The repeated lncRNAs in the two datasets have been removed.
In dataset 1, we predicted that HOTAIR could be associated with neuroblastoma with the highest probability. HOTAIR is a novel oncogenic biomarker in human cancer (Rajagopal et al., 2020). Its knockdown can promote radiosensitivity in colorectal cancer (Liu et al., 2020). It can also enhance the carcinogenesis of gastric cancer (Zhang et al., 2020). We identified that HOTAIR may be a biomarker of neuroblastoma in dataset 1.
In dataset 2, we predicted that BDNF-AS could be associated with neuroblastoma with the highest probability. PABPC1-induced stabilization of BDNF-AS helps inhibit malignant progression in glioblastoma cells (Su et al., 2020). BDNF-AS can regulate the miR-9-5p/BACE1 pathway that affects neurotoxicity in Alzheimer's disease (Ding et al., 2022). We identified that BDNF-AS is a possible biomarker of neuroblastoma in dataset 2.
Figure 6 shows the top 20 predicted lncRNAs associated with neuroblastoma in each of the two datasets. Yellow solid lines and blue solid lines denote lncRNA-neuroblastoma associations confirmed by the literature among the predicted top 20 associations on datasets 1 and 2, respectively. Grey solid lines denote the predicted and co-occurring lncRNA-neuroblastoma associations that can be confirmed by the literature in the two datasets, and grey dashed lines denote the predicted and unconfirmed lncRNA-neuroblastoma associations in the two datasets. The repeated lncRNAs in the two datasets have been removed.

Conclusion
Lung cancer and neuroblastoma are two diseases that severely affect human health. Detecting new biomarkers for them contributes to their diagnosis and therapy. Experimental biomarker identification methods are costly and laborious. Thus, we developed a machine learning-based method named LDAenDL to predict possible lncRNA biomarkers for the two diseases based on an ensemble of a deep neural network and LightGBM. LDAenDL first computed lncRNA similarity and disease similarity and then combined a GCN, GAT, and CNN to learn the biological features of lncRNAs and diseases. Finally, these features were fed to a DNN and LightGBM to find new LDAs.
LDAenDL was compared with four other classical LDA prediction methods (i.e., SDLDA, LDNFSGB, IPCAF, and LDASR). The results showed that LDAenDL computed the best AUCs and AUPRs under three cross-validations on two LDA datasets, demonstrating its strong LDA prediction performance. We further identified possible lncRNA biomarkers for lung cancer and neuroblastoma. The results demonstrated that CCDC26 and IFNG-AS1 may be new biomarkers for lung cancer, SNHG3 may be associated with PDL1 for lung cancer, and HOTAIR and BDNF-AS may be potential biomarkers for neuroblastoma.
In the future, we will combine data from multiple sources, for example, miRNAs, circRNAs, and drugs, to improve LDA identification performance. We will also design a new deep-learning model to efficiently extract the biological features of lncRNAs and diseases for LDA prediction. We hope that the proposed LDAenDL can help the development of targeted therapies for these two diseases.

FIGURE 6
The top 20 predicted lncRNA biomarkers for neuroblastoma in each of the two datasets (the repeated lncRNAs in the two datasets have been removed). This figure was drawn using Cytoscape (Shannon et al., 2003).

TABLE 1
Comparison of LDAenDL with the other four methods under CV1.

TABLE 2
Comparison of LDAenDL with the other four methods under CV2.

TABLE 3
Comparison of LDAenDL with the other four methods under CV3.

TABLE 5
Comparison of LDAenDL with individual models under CV2.

TABLE 6
Comparison of LDAenDL with individual models under CV3.

TABLE 7
The predicted top 20 lncRNA biomarkers for lung cancer in each of the two datasets.

TABLE 8
The top 20 predicted lncRNA biomarkers for neuroblastoma in each of the two datasets.