Predicting potential lncRNA biomarkers for lung cancer and neuroblastoma based on an ensemble of a deep neural network and LightGBM

Su, Zhenguo; Lu, Huihui; Wu, Yan; Li, Zejun; Duan, Lian

doi:10.3389/fgene.2023.1238095

ORIGINAL RESEARCH article

Front. Genet., 16 August 2023

Sec. RNA

Volume 14 - 2023 | https://doi.org/10.3389/fgene.2023.1238095

Zhenguo Su ¹†
Huihui Lu ²†
Yan Wu ³
Zejun Li ⁴*
Lian Duan ⁵,⁶,⁷,⁸*

1. Clinical Lab, Yantai Affiliated Hospital of Binzhou Medical University, Yantai, China
2. Department of Thoracic Cardiovascular Surgery, Hunan Province Directly Affiliated TCM Hospital, Zhuzhou, China
3. Geneis (Beijing) Co., Ltd., Beijing, China
4. School of Computer Science, Hunan Institute of Technology, Hengyang, China
5. Faculty of Pediatrics, The Chinese PLA General Hospital, Beijing, China
6. Department of Pediatric Surgery, The Seventh Medical Center of PLA General Hospital, Beijing, China
7. National Engineering Laboratory for Birth Defects Prevention and Control of Key Technology, Beijing, China
8. Beijing Key Laboratory of Pediatric Organ Failure, Beijing, China

Abstract

Introduction: Lung cancer is one of the most frequent neoplasms worldwide, with approximately 2.2 million new cases and 1.8 million deaths each year. The expression levels of programmed death ligand-1 (PDL1) demonstrate a complex association with lung cancer. Neuroblastoma is a high-risk malignant tumor that mainly affects children. Identification of new biomarkers for these two diseases can significantly promote their diagnosis and therapy. However, in vivo experiments to discover potential biomarkers are costly and laborious. Consequently, artificial intelligence technologies, especially machine learning methods, provide a powerful avenue to find new biomarkers for various diseases.

Methods: We developed a machine learning-based method named LDAenDL to detect potential long noncoding RNA (lncRNA) biomarkers for lung cancer and neuroblastoma using an ensemble of a deep neural network and LightGBM. LDAenDL first computes the Gaussian kernel similarity and functional similarity of lncRNAs and the Gaussian kernel similarity and semantic similarity of diseases to obtain their similarity networks. Next, LDAenDL combines a graph convolutional network, graph attention network, and convolutional neural network to learn the biological features of the lncRNAs and diseases based on their similarity networks. Third, these features are concatenated and fed to an ensemble model composed of a deep neural network and LightGBM to find new lncRNA-disease associations (LDAs). Finally, the proposed LDAenDL method is applied to identify possible lncRNA biomarkers associated with lung cancer and neuroblastoma.

Results: The experimental results show that LDAenDL achieved the best AUCs of 0.8701, 0.8953, and 0.9110 under cross-validation on lncRNAs, diseases, and lncRNA-disease pairs on Dataset 1, respectively, and 0.9490, 0.9157, and 0.9708 on Dataset 2, respectively. Furthermore, AUPRs of 0.8903, 0.9061, and 0.9166 under the three cross-validations were obtained on Dataset 1, and 0.9582, 0.9122, and 0.9743 on Dataset 2. The results demonstrate that LDAenDL significantly outperformed the other four classical LDA prediction methods (i.e., SDLDA, LDNFSGB, IPCARF, and LDASR). Case studies demonstrate that CCDC26 and IFNG-AS1 may be new biomarkers of lung cancer, SNHG3 may be associated with PDL1 in lung cancer, and HOTAIR and BDNF-AS may be potential biomarkers of neuroblastoma.

Conclusion: We hope that the proposed LDAenDL method can help the development of targeted therapies for these two diseases.

1 Introduction

Long non-coding RNAs (lncRNAs) are non-coding RNAs with more than 200 nucleotides (Bertone et al., 2004; Peng et al., 2022a; Peng et al., 2022b). LncRNAs play an important role in the development and progression of various diseases (Lanjanian et al., 2021; Meng et al., 2021; Yang and Li, 2021; Peng et al., 2022c). LncRNAs are closely associated with many diseases, for example, lung cancer, colorectal cancer, prostate cancer, and Alzheimer’s disease (Klattenhoff et al., 2013; Tan et al., 2013; Chakravarty et al., 2014; He et al., 2014; Zhang et al., 2014). LncRNA H19 is associated with the under-regulation of renal carcinoma cells (Wang et al., 2015). The expression of EGOT in breast cancer is much lower than that in adjacent noncancerous tissues (Broadbent et al., 2008). NEAT1 is overexpressed in prostate cancer cells (Pasmant et al., 2011). The identification of lncRNA-disease associations (LDAs) helps us further understand the biological processes and molecular mechanisms of various complex diseases. However, the number of known, experimentally validated LDAs is very small, so it is important to identify potential LDAs. Because determining LDAs through in vivo experiments is costly and time-consuming, it is necessary to design efficient computational approaches for identifying potential LDAs (Meng et al., 2021; Peng et al., 2022d). Computational LDA prediction methods are categorized as biological network-based methods and machine learning-based methods.

Biological network-based methods use network algorithms for association prediction (Liu et al., 2023a). This type of method first constructs heterogeneous networks of lncRNAs and diseases and then identifies LDAs via matrix decomposition, random walk, and so on. To predict potential LDAs, LRWRHLDA combined Laplace normalized random walk with restart (Wang et al., 2022), LDGRNMF used graph regularized nonnegative matrix factorization (Wang et al., 2021), DSCMF developed a dual sparse collaborative matrix factorization approach (Liu et al., 2021a), RWSF-BLP added random walk-based multi-similarity fusion to bidirectional label propagation (Xie et al., 2021), HBRWRLDA utilized bi-random walk on hypergraphs (Xie et al., 2022), and MHRWRLDA exploited a random walk model with restart through multiplex and heterogeneous networks (Yao et al., 2021).

With the rapid advance of RNA sequencing technologies, artificial intelligence has been widely applied in biomedical data analysis (Peng et al., 2023a; Peng et al., 2023b; Xu et al., 2023). Notably, artificial intelligence technologies, especially machine learning methods, have been widely applied to predict miRNA-disease associations (Liu et al., 2022) and circRNA-disease associations (Liu et al., 2023b). To find new LDAs, HGATLDA developed a novel heterogeneous graph attention network model (Zhao et al., 2022), DeepMNE extracted multi-omics data and designed a deep multi-network embedding model (Ma, 2022), iLncDA-LTR used a rank-based method (Wu et al., 2022), MAGCNSE utilized a graph convolutional network (Liang et al., 2022), LDAformer extracted topological features and used a transformer encoder for LDA classification (Zhou et al., 2022), BiGAN explored a bidirectional generative adversarial network (Yang et al., 2021), and SVDNVLDA extracted linear and non-linear features and used XGBoost for LDA prediction (Li et al., 2021).

Computational methods have found many potential LDAs. However, network-based methods tend to favor well-investigated lncRNAs or diseases and cannot predict LDAs for new lncRNAs or new diseases, while machine learning-based methods have failed to effectively integrate different kernels from multiple data sources. Thus, in this study, we developed a machine learning-based method named LDAenDL to detect potential lncRNA biomarkers for lung cancer and neuroblastoma based on an ensemble of a deep neural network and LightGBM.

2 Materials and methods

As shown in Figure 1, LDAenDL first computes the Gaussian kernel similarity and functional similarity of lncRNAs and the Gaussian kernel similarity and semantic similarity of diseases to obtain their similarity networks. Next, LDAenDL combines a graph convolutional network (GCN) (Kipf and Welling, 2016), graph attention network (GAT) (Veličković et al., 2017), and convolutional neural network (CNN) (Gu et al., 2018) to learn the biological features of lncRNAs and diseases based on their similarity networks. Third, these features are concatenated and fed to an ensemble model composed of a deep neural network (DNN) and LightGBM to find new LDAs. Finally, LDAenDL is applied to identify possible lncRNA biomarkers associated with lung cancer and neuroblastoma.

FIGURE 1

2.1 Data preparation

We used two human LDA datasets provided by Chen et al. (2012) and Cui et al. (2018). Dataset 1 contains 605 LDAs between 157 diseases and 82 lncRNAs. Dataset 2 contains 1,529 LDAs between 190 diseases and 89 lncRNAs. An LDA network can be denoted as a binary association matrix whose entry for a given lncRNA and disease equals 1 if the lncRNA is associated with the disease and 0 otherwise.
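The association matrix described above can be sketched as follows. This is a minimal illustration: the function names, the toy identifiers (`lncRNA_1`, `disease_1`, and so on), and the toy pairs are hypothetical placeholders, not entries from either dataset.

```python
def build_association_matrix(pairs, lncrnas, diseases):
    """Return a binary matrix Y with Y[i][j] = 1 if (lncrnas[i], diseases[j]) is a known LDA."""
    row = {name: i for i, name in enumerate(lncrnas)}
    col = {name: j for j, name in enumerate(diseases)}
    Y = [[0] * len(diseases) for _ in lncrnas]
    for lnc, dis in pairs:
        Y[row[lnc]][col[dis]] = 1
    return Y


def density(n_known, n_lncrnas, n_diseases):
    """Fraction of all lncRNA-disease pairs that are known associations."""
    return n_known / (n_lncrnas * n_diseases)


# Toy input (hypothetical identifiers, for illustration only).
lncrnas = ["lncRNA_1", "lncRNA_2", "lncRNA_3"]
diseases = ["disease_1", "disease_2"]
pairs = [("lncRNA_1", "disease_1"), ("lncRNA_2", "disease_2")]
Y = build_association_matrix(pairs, lncrnas, diseases)
```

With the counts reported for the two datasets, `density(605, 82, 157)` is roughly 0.047 and `density(1529, 89, 190)` is roughly 0.090, so known associations fill only about 5% and 9% of the respective matrices.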

2.2 Similarity computation

Inspired by the LDA-DLPU method (Peng et al., 2022a), we computed the Gaussian kernel similarity and functional similarity of lncRNAs and the Gaussian kernel similarity and semantic similarity of diseases. Based on the computed lncRNA similarity and disease similarity matrices, we learned the features of lncRNAs and diseases by combining a GCN, GAT, and CNN.
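As an illustration of the Gaussian kernel similarity, the sketch below uses the Gaussian interaction profile kernel that is common in association prediction work. It assumes that each lncRNA is represented by its row of the binary association matrix and that the bandwidth is normalized by the mean squared norm of the profiles; the exact formulation used by LDAenDL follows LDA-DLPU and may differ. The function name is hypothetical.

```python
import math


def gaussian_kernel_similarity(profiles):
    """Gaussian kernel similarity between interaction profiles.

    profiles: list of equal-length binary vectors, one per lncRNA
    (pass the transposed matrix to obtain disease similarities).
    """
    n = len(profiles)
    sq_norms = [sum(v * v for v in p) for p in profiles]
    mean_sq = sum(sq_norms) / n
    # Assumed bandwidth: inverse of the mean squared profile norm.
    gamma = 1.0 / mean_sq if mean_sq > 0 else 1.0
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-gamma * d2)
    return sim


# Three toy profiles: the first two are identical, the third shares nothing with them.
toy_profiles = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
toy_sim = gaussian_kernel_similarity(toy_profiles)
```

Identical profiles receive a similarity of 1, and the similarity decays toward 0 as the profiles diverge.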

2.3 Feature learning

Dai et al. (2022) designed a hybrid graph representation learning model (GraphCDA) to represent the features of circRNAs and diseases and obtained better circRNA-disease association prediction performance. Inspired by GraphCDA, we adopt a GraphCDA-based feature learning model for LDAs.

2.3.1 Graph convolutional network

A GCN was applied to obtain the feature representations of lncRNAs and diseases based on their similarity networks. The input graph G is described by its adjacency matrix over the nodes, and each node is described by a fixed-dimensional feature vector. The GCN outputs the node representation matrix according to Eqs 1, 2, which involve the degree matrix and a trainable weight matrix; σ(·) denotes a ReLU activation function.
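For reference, the widely used GCN propagation rule of Kipf and Welling (2016) matches the ingredients named above (adjacency matrix, degree matrix, trainable weight matrix, ReLU). The symbols below are notational assumptions: A is the adjacency matrix, I the identity, H^(l) the node representations at layer l, and W^(l) the trainable weights. The paper's Eqs 1, 2 may differ in detail.

```latex
\tilde{A} = A + I, \qquad \tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}

H^{(l+1)} = \sigma\!\left( \tilde{D}^{-1/2}\, \tilde{A}\, \tilde{D}^{-1/2}\, H^{(l)}\, W^{(l)} \right)
```

Adding self-loops keeps each node's own features in the aggregation, and the symmetric normalization prevents high-degree nodes from dominating.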

2.3.2 Graph attention network

A GAT (Veličković et al., 2017) uses multi-head attention to set weights for all adjacent nodes based on their importance. LDAenDL introduces a GAT layer between two GCN layers to help the GCN to extract high-level features of lncRNAs and diseases.

For the graph G, a GAT layer outputs node representations as in Eq. 3.

Each attention mechanism in the multi-head attention has its own weight matrix. Given the input feature vector of a lncRNA node, the output representation of that node is computed by Eq. 4 from the attention coefficients between pairs of lncRNA nodes. These coefficients are defined in Eq. 5, where || denotes a concatenation operation, the nonlinearity is the LeakyReLU activation function, each attention mechanism has its own weight vector, and each edge carries a weight.
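For orientation, the standard multi-head graph attention layer of Veličković et al. (2017) is given below. The notation is assumed: h_i is the input feature vector of node i, W^k and a_k are the weight matrix and weight vector of the k-th attention mechanism, N_i is the neighborhood of node i, and K is the number of heads. The coefficient described above also involves an edge weight, which this standard form omits.

```latex
e_{ij}^{k} = \mathrm{LeakyReLU}\!\left( \mathbf{a}_{k}^{\top} \left[ W^{k} h_{i} \,\Vert\, W^{k} h_{j} \right] \right)

\alpha_{ij}^{k} = \frac{\exp\!\left(e_{ij}^{k}\right)}{\sum_{t \in \mathcal{N}_{i}} \exp\!\left(e_{it}^{k}\right)}

h_{i}' = \Big\Vert_{k=1}^{K} \, \sigma\!\left( \sum_{j \in \mathcal{N}_{i}} \alpha_{ij}^{k}\, W^{k} h_{j} \right)
```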

2.3.3 Feature representation of lncRNAs and diseases

Given the lncRNA similarity network, its adjacency matrix, and its node feature matrix, we alternately apply GCN and GAT layers to obtain the graph feature representations of lncRNAs at different levels (Eq. 6).

A 1D CNN is then used to produce the lncRNA feature representation matrix by combining the output features of the different GCN layers.

Similarly, the graph feature representations of diseases at different levels are given by Eq. 7.

A 1D CNN is likewise used to produce the disease feature representation matrix by combining the output features of the different GCN layers.

2.3.4 Preference matrix construction

The preference matrix, which describes all lncRNA-disease pairs, is computed from the lncRNA and disease feature representation matrices (Eq. 8).

We used binary cross-entropy as the loss function to evaluate the difference between the preference matrix and the known adjacency matrix. By minimizing this loss on the two LDA datasets, the feature representation matrices of lncRNAs and diseases are learned.
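A minimal sketch of this objective is given below. It assumes, as one plausible reading rather than the confirmed form of Eq. 8, that each preference score is the sigmoid of the inner product between a lncRNA feature vector and a disease feature vector. All function and variable names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def preference_matrix(F_l, F_d):
    """Assumed form: P[i][j] = sigmoid(<F_l[i], F_d[j]>)."""
    return [[sigmoid(sum(a * b for a, b in zip(fl, fd))) for fd in F_d]
            for fl in F_l]


def binary_cross_entropy(P, Y, eps=1e-12):
    """Mean binary cross-entropy between predicted scores P and binary labels Y."""
    total, count = 0.0, 0
    for p_row, y_row in zip(P, Y):
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count


# Uninformative features give a score of 0.5 and a loss of ln 2.
loss_flat = binary_cross_entropy(preference_matrix([[0.0, 0.0]], [[1.0, 1.0]]), [[1]])

# Features aligned with the labels give a much smaller loss.
F_l = [[2.0, 0.0], [-2.0, 0.0]]
F_d = [[2.0, 0.0]]
loss_fit = binary_cross_entropy(preference_matrix(F_l, F_d), [[1], [0]])
```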

2.4 LDA prediction

2.4.1 DNN

We built a DNN to predict new LDAs based on known LDAs and the learned LDA features. The DNN contains an input layer, an output layer, and multiple hidden layers. The input layer has F neurons, where F equals the number of LDA features.

Given an LDA sample, the input layer is represented by Eq. 9 as the vector of the F features of that sample.

Each hidden layer is represented by Eq. 10 in terms of the weights and the bias of that layer.

The output of each hidden layer is obtained by applying a ReLU activation function (Eq. 11). Finally, the output layer applies the sigmoid function to produce the LDA prediction results (Eq. 12).
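The forward pass described above (affine hidden layers with ReLU, followed by a sigmoid output) can be sketched in a few lines. The layer sizes and weights below are arbitrary toy values chosen for illustration, and the function names are hypothetical; this is not the trained LDAenDL network.

```python
import math


def relu(v):
    return [max(0.0, x) for x in v]


def affine(W, b, x):
    """Compute W x + b for a weight matrix W (list of rows) and bias vector b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def dnn_forward(x, hidden_layers, output_layer):
    """hidden_layers: list of (W, b) pairs; output_layer: (weight vector, bias)."""
    h = x
    for W, b in hidden_layers:
        h = relu(affine(W, b, h))  # hidden layer: affine map followed by ReLU
    w_out, b_out = output_layer
    z = sum(w * hi for w, hi in zip(w_out, h)) + b_out
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid gives an association probability


# Toy network: 2 input features, one hidden layer of 2 units, 1 output.
hidden = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
output = ([2.0, 3.0], -2.0)
prob = dnn_forward([1.0, -2.0], hidden, output)
```

In this toy case the hidden layer maps the input to [1.0, 0.0], the output pre-activation is 0, and the predicted probability is 0.5.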

2.4.2 LightGBM

We built a LightGBM model (Ke et al., 2017) to identify new LDAs. Given a training set of lncRNA-disease pairs, LightGBM builds an approximation to an unknown target function by minimizing the expected value of a loss function (Eq. 13).

LightGBM sums multiple regression trees to approximate the final model (Eq. 14).

Each regression tree is characterized by its number of leaves, its decision rules, and the sample weights of its leaf nodes.

At each step, LightGBM is trained in an additive form (Eq. 15).

The objective function (15) is rapidly approximated with Newton’s method (Sun et al., 2020).

To solve the objective function of LightGBM, the constant term is removed for simplicity, and model (15) is rewritten as Eq. 16 in terms of the first-order and second-order gradients of the loss function. Grouping the samples by the leaf into which they fall transforms Eq. 16 into Eq. 17.

For a given tree structure, the optimal weight of each leaf node and the corresponding extreme value of the objective are computed by Eq. 18; this extreme value serves as a scoring function that evaluates the quality of the tree structure. Finally, model (15) can be expressed in terms of the example sets in the left and right subtrees of a node.
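The steps above follow the usual second-order derivation for gradient-boosted trees, sketched here under assumed notation: g_i and h_i are the first-order and second-order gradients of the loss for sample i, I_j is the set of samples in leaf j, w_j is the leaf weight, T is the number of leaves, and λ and γ are regularization constants. The regularization terms in particular are assumptions and may not appear in the same form in the paper's Eqs 16–18.

```latex
\tilde{\mathcal{L}}^{(t)} = \sum_{i} \left[ g_{i}\, f_{t}(x_{i}) + \tfrac{1}{2}\, h_{i}\, f_{t}(x_{i})^{2} \right] + \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_{j}^{2}

\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ G_{j}\, w_{j} + \tfrac{1}{2} \left( H_{j} + \lambda \right) w_{j}^{2} \right] + \gamma T,
\qquad G_{j} = \sum_{i \in I_{j}} g_{i}, \quad H_{j} = \sum_{i \in I_{j}} h_{i}

w_{j}^{*} = -\frac{G_{j}}{H_{j} + \lambda},
\qquad \tilde{\mathcal{L}}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_{j}^{2}}{H_{j} + \lambda} + \gamma T

\mathrm{Gain} = \frac{1}{2} \left[ \frac{G_{L}^{2}}{H_{L} + \lambda} + \frac{G_{R}^{2}}{H_{R} + \lambda} - \frac{\left( G_{L} + G_{R} \right)^{2}}{H_{L} + H_{R} + \lambda} \right] - \gamma
```

The last expression scores a candidate split by comparing the objective of the left and right children with that of the unsplit node.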

2.4.3 Ensemble learning

Solving models (12) and (15) yields LDA predictions from the DNN and LightGBM, respectively. Because an ensemble often achieves better prediction accuracy than a single model, we combined the DNN and LightGBM into an ensemble model for LDA identification through soft voting: the final association probability is the weighted sum of the prediction results of the DNN and LightGBM, with weights of 0.4 and 0.6, respectively. A lncRNA-disease pair is classified as an LDA if its association probability is greater than 0.5; otherwise, it is classified as a negative LDA.
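The soft-voting rule can be illustrated with a short sketch. The weights (0.4 for the DNN, 0.6 for LightGBM) and the 0.5 threshold come from the text; the function name and the example probabilities are hypothetical.

```python
def soft_vote(p_dnn, p_lgbm, w_dnn=0.4, w_lgbm=0.6, threshold=0.5):
    """Weighted soft voting over two lists of predicted association probabilities."""
    scores = [w_dnn * a + w_lgbm * b for a, b in zip(p_dnn, p_lgbm)]
    labels = [1 if s > threshold else 0 for s in scores]
    return scores, labels


# Hypothetical probabilities for three lncRNA-disease pairs.
scores, labels = soft_vote([0.9, 0.2, 0.7], [0.8, 0.4, 0.3])
```

For the third pair the DNN alone would predict an association (0.7), but the ensemble score is 0.4 × 0.7 + 0.6 × 0.3 = 0.46, so the pair is labeled negative. This shows how the larger LightGBM weight can override the DNN.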

3 Results

3.1 Evaluation metrics

In this article, we compared the proposed LDAenDL method with four LDA prediction methods: SDLDA, LDNFSGB, IPCARF, and LDASR. Precision, recall, accuracy, F1-score, AUC, and AUPR were used to compare their performance. These six metrics are defined by Peng et al. (2022b) and Shen et al. (2022).
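For readers who want the metrics in executable form, the sketch below computes the four threshold-based metrics from binary predictions and the AUC from scores using their standard definitions. AUPR is omitted for brevity. The function names and toy inputs are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, accuracy, and F1-score from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example with five pairs.
y_true = [1, 1, 0, 0, 1]
toy_scores = [0.9, 0.4, 0.3, 0.6, 0.8]
y_pred = [1 if s > 0.5 else 0 for s in toy_scores]
metrics = classification_metrics(y_true, y_pred)
toy_auc = auc(y_true, toy_scores)
```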

3.2 Comparison of LDAenDL with the other four methods

To evaluate performance, following the three cross-validation settings proposed by Zhou et al. (2021), we conducted cross-validations on lncRNAs (CV1), diseases (CV2), and lncRNA-disease pairs (CV3). Tables 1–3 give the precision, recall, accuracy, F1-score, AUC, and AUPR under CV1, CV2, and CV3 on the two LDA datasets. In Tables 1–6, the bold font in each row denotes the best performance.
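The difference between the three settings lies in which unit is held out. The sketch below is one way to generate such splits, assuming that CV1 holds out whole lncRNAs (matrix rows), CV2 holds out whole diseases (columns), and CV3 holds out individual pairs. The number of folds `k`, the seed, and all function names are hypothetical, and any negative sampling used in the actual experiments is not modeled.

```python
import random


def kfold_indices(n, k, seed=0):
    """Shuffle range(n) and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def cv_splits(n_lnc, n_dis, mode, k=5, seed=0):
    """Yield (train_pairs, test_pairs) for CV1 (lncRNAs), CV2 (diseases), or CV3 (pairs)."""
    pairs = [(i, j) for i in range(n_lnc) for j in range(n_dis)]
    if mode == "CV1":
        n_units, unit_of = n_lnc, (lambda idx, p: p[0])
    elif mode == "CV2":
        n_units, unit_of = n_dis, (lambda idx, p: p[1])
    elif mode == "CV3":
        n_units, unit_of = len(pairs), (lambda idx, p: idx)
    else:
        raise ValueError("mode must be 'CV1', 'CV2', or 'CV3'")
    for fold in kfold_indices(n_units, k, seed):
        held = set(fold)
        test = [p for idx, p in enumerate(pairs) if unit_of(idx, p) in held]
        train = [p for idx, p in enumerate(pairs) if unit_of(idx, p) not in held]
        yield train, test


# Toy setting: 6 lncRNAs, 4 diseases, 3 folds over lncRNAs.
cv1_splits = list(cv_splits(6, 4, "CV1", k=3))
```

Under CV1 no test lncRNA appears in training, which mimics prediction for a new lncRNA; CV2 does the same for diseases, while CV3 only hides individual pairs.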

TABLE 1

Metric	Dataset	SDLDA	LDNFSGB	IPCARF	LDASR	LDAenDL
Precision	Dataset 1	0.8514 ± 0.0509	0.7004 ± 0.0639	0.4878 ± 0.1309	0.6726 ± 0.1200	0.8764 ± 0.0493
Precision	Dataset 2	0.9399 ± 0.0154	0.8552 ± 0.0393	0.6615 ± 0.0966	0.8405 ± 0.0300	0.9391 ± 0.0290
Recall	Dataset 1	0.6521 ± 0.0732	0.6092 ± 0.0790	0.5721 ± 0.1580	0.5129 ± 0.0946	0.7019 ± 0.0639
Recall	Dataset 2	0.8239 ± 0.0437	0.8021 ± 0.0498	0.6434 ± 0.1545	0.7358 ± 0.0562	0.8304 ± 0.0523
Accuracy	Dataset 1	0.7799 ± 0.0341	0.6769 ± 0.0423	0.4906 ± 0.0951	0.6417 ± 0.0597	0.7996 ± 0.0312
Accuracy	Dataset 2	0.8857 ± 0.0283	0.8323 ± 0.0230	0.6526 ± 0.0775	0.7972 ± 0.0268	0.8879 ± 0.0289
F1-score	Dataset 1	0.7365 ± 0.0563	0.6462 ± 0.0451	0.5125 ± 0.1100	0.5668 ± 0.0536	0.7768 ± 0.0399
F1-score	Dataset 2	0.8775 ± 0.0278	0.8260 ± 0.0230	0.6401 ± 0.1017	0.7827 ± 0.0260	0.8804 ± 0.0334
AUC	Dataset 1	0.8023 ± 0.0477	0.7346 ± 0.0465	0.5096 ± 0.1432	0.7057 ± 0.0420	0.8701 ± 0.0339
AUC	Dataset 2	0.9366 ± 0.0195	0.8839 ± 0.0270	0.7104 ± 0.0997	0.8641 ± 0.0256	0.9490 ± 0.0220
AUPR	Dataset 1	0.8461 ± 0.0553	0.7239 ± 0.0626	0.5336 ± 0.1423	0.6775 ± 0.0971	0.8903 ± 0.0273
AUPR	Dataset 2	0.9533 ± 0.0129	0.8832 ± 0.0307	0.7128 ± 0.1012	0.8671 ± 0.0252	0.9582 ± 0.0167

Comparison of LDAenDL with the other four methods under CV1.