Edited by: Liang Cheng, Harbin Medical University, China
Reviewed by: Jingpu Zhang, Henan University of Urban Construction, China; Hui Liu, Changzhou University, China
This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Interactions between genetic factors and environmental factors (EFs) play an important role in many diseases. Long non-coding RNAs (lncRNAs) are an important class of non-coding RNAs that regulate life processes, so the ability to predict associations between lncRNAs and EFs is of practical significance. However, recent methods for predicting lncRNA-EF associations rarely use the topological information of heterogeneous biological networks, or they simply treat all objects as the same type without considering the different and subtle semantic meanings of the various paths in the heterogeneous network. To address this issue, this paper proposes GBDTL2E, a method based on the Gradient Boosting Decision Tree (GBDT) for predicting associations between lncRNAs and EFs. GBDTL2E integrates structural information with heterogeneous networks, combines HeteSim features and diffusion features through multi-feature fusion, and uses the GBDT machine learning algorithm to predict lncRNA-EF associations on the heterogeneous network. The experimental results demonstrate that the proposed algorithm achieves high performance.
An environmental factor (EF) is a biological or non-biological factor that affects a living organism. Non-biological factors include physical, chemical, and social factors. Biological factors include parasites and viruses. Many studies have demonstrated that gene-environment (G–E) interactions play an important role in the etiology and progression of many complex diseases (Xu et al.,
According to the central dogma of molecular biology, genetic information is mainly stored in DNA sequences. It is transcribed from DNA into RNA, which is then translated into proteins. Genome sequence analysis shows that protein-coding sequences account for only about 2% of the human genome, while the remaining 98% does not encode proteins (Bertone et al.,
There are many studies on the biological mechanism and interaction between genes, microRNAs (miRNAs), lncRNAs, EFs, and diseases, such as the relationship between genes and diseases, miRNAs and diseases, lncRNAs and diseases, miRNAs and EFs, etc. Among them, microRNA (miRNA) is a kind of non-coding RNA that has only about 21–25 nucleotides (Deng et al.,
For the association between genes and diseases, a data synthesis platform based on gene variation and gene expression was established by Luo et al. This method applies the method of network analysis to predict the interaction between genes and diseases (Luo Z. et al.,
For the association between miRNAs and diseases, KBMF-MDI was proposed by Lan et al. KBMF-MDI predicts the association between miRNAs and diseases based on their similarities to diseases (Lan et al.,
For the association between lncRNAs and diseases, a method to predict the association between human lncRNAs and diseases based on the random walk of the global network was proposed by Gu et al. (
For the association between miRNAs and EFs, the MiREFRWR was proposed by Chen et al., and it uses the Random Walk with Restart algorithm in a complex network to predict interactions (Chen,
With the application of computing technology in the field of biology, more and more public biological databases have also been established, such as HMDD (Huang et al.,
The development of genomics and bioinformatics facilitated the identification of lncRNA. LncRNA has also been found to interact with various EFs, such as chemicals, smoking, and air pollution (Flynn and Chang,
However, the aforementioned studies on predicting the association between disease-related lncRNAs and EFs usually use traditional similarity search methods, which focus on measuring the similarity between objects of the same type. These methods simply treat all objects as the same type without considering the different and subtle semantic meanings of different paths in the heterogeneous network, which reduces the accuracy and persuasiveness of the results. In this paper, we propose a method to predict the correlation between lncRNAs and EFs based on heterogeneous networks. The proposed method integrates structural information with heterogeneous networks, uses the HeteSim features and the diffusion features as data features, and uses the GBDT algorithm as the prediction model. HeteSim is a path-based measure in heterogeneous networks that can quantify the relatedness of objects of the same or different types. To our knowledge, HeteSim has not previously been used to predict the association between lncRNAs and EFs, and this is the first time it is integrated as a fusion feature in the feature extraction step for this task. GBDT is an ensemble learning method in machine learning and achieved higher accuracy than the other algorithms in our comparison. To our knowledge, this is also the first time that GBDT is used to investigate the association between lncRNAs and EFs. On the one hand, the proposed method provides an efficient computational approach for mining the association between lncRNAs and EFs, which saves manpower and material resources. On the other hand, it helps biologists explore the influence of environmental factors on diseases.
The rest of the paper is organized as follows: the materials and methods are presented in section 2, the experimental results and evaluation are discussed in section 3, and the paper is concluded in section 4.
The data used in this experiment are downloaded from the DLREFD database (Sun et al.,
In this section, we propose GBDTL2E, a method based on the Gradient Boosting Decision Tree (GBDT) for predicting the association between lncRNAs and EFs. GBDTL2E integrates structural information with heterogeneous networks, combines the HeteSim features and the diffusion features through multi-feature fusion, and uses the machine learning algorithm GBDT to predict the association. The method consists of the following steps: (1) From the lncRNA-EF correlation dataset downloaded from the public database DLREFD, duplicate data are removed, and the sets of lncRNAs and EFs and the association matrix A of the lncRNA-EF correlations are obtained. Then, the Gaussian interaction profile kernel similarity of lncRNAs (KL) and of EFs (KE) are calculated. (2) The chemical structure similarity matrix E between EFs is calculated with the published tool SimComp. (3) The lncRNA similarity KL is transformed by the logistic function to obtain the lncRNA similarity matrix SL, and the chemical structure similarity matrix E and the Gaussian interaction profile kernel similarity matrix KE are used to construct the similarity matrix SE of EFs. (4) A global heterogeneous network is constructed by integrating three subnetworks, namely the association matrix A, the lncRNA similarity matrix SL, and the EF similarity matrix SE, into the adjacency matrix G. On this network, the Random Walk with Restart (RWR) algorithm is used to calculate the diffusion scores and obtain the diffusion features, and singular value decomposition (SVD) is used to reduce their dimension. (5) The HeteSim feature (score) for each lncRNA-EF pair is calculated. (6) The feature data set is obtained by combining the diffusion features and the HeteSim scores.
The combined features are used to train the Gradient Boosting Decision Tree (GBDT) for predicting the relationship between lncRNAs and EFs.
Flowchart of our method:
In this section, the calculation of the gaussian interaction profile kernel similarity was presented first. The association matrix A of lncRNAs and EFs was obtained by the known lncRNA-EF correlations. The gaussian interaction profile kernel similarity matrix of lncRNA and the gaussian interaction profile kernel similarity matrix of EF were calculated. Let
The gaussian interaction profile kernel similarity matrix KL of lncRNA was constructed. For a given lncRNA
where γ
Similarly, the known lncRNA-EF correlations were used to construct the gaussian interaction profile kernel similarity matrix of EFs. For a given EF
where γ
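As a minimal sketch of this calculation, the Gaussian interaction profile kernel can be computed directly from the association matrix A. The bandwidth normalization below (dividing the bandwidth parameter by the mean squared norm of the interaction profiles) follows the commonly used definition of this kernel and is an assumption here; the function name `gip_kernel` and the toy matrix are hypothetical.

```python
import numpy as np

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    profiles: (n, m) binary matrix; row i is the interaction profile of entity i.
    gamma_prime: bandwidth parameter (assumed to be 1 in this sketch).
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    mean_sq_norm = sq_norms.mean()
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    # Assumed normalization: bandwidth divided by the mean squared profile norm.
    gamma = gamma_prime / mean_sq_norm
    # Squared Euclidean distance between every pair of rows.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dist)

# Toy association matrix A: 3 lncRNAs (rows) x 4 EFs (columns).
A = np.array([[1, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 1]])
KL = gip_kernel(A)    # lncRNA-lncRNA similarity, shape (3, 3)
KE = gip_kernel(A.T)  # EF-EF similarity, shape (4, 4)
```

The same function serves both cases because the EF profiles are simply the columns of A, so KE is obtained by passing the transpose.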
In this section, the computation of the chemical structure similarity has been given. The chemical structural similarity matrix between EFs is calculated using the SimComp tool (Hattori et al.,
Structural information and heterogeneous networks are integrated in the proposed GBDTL2E. This section describes the calculation of the transformed similarity matrix SL and the integrated similarity matrix SE. The lncRNA similarity matrix KL was transformed by a logistic function to obtain the lncRNA similarity matrix SL. The similarity matrix SE of EFs was constructed from the chemical structure similarity matrix E of EFs and the Gaussian interaction profile kernel similarity matrix KE of EFs, given by
where
where
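The two operations in this step can be sketched as follows. This is only an illustration under stated assumptions: the logistic parameters `c` and `d` are hypothetical defaults, and the choice that the weight multiplies E (with the remainder going to KE) is an assumption rather than the formula used in this paper.

```python
import numpy as np

def logistic_transform(K, c=-15.0, d=np.log(9999.0)):
    """Element-wise logistic map 1 / (1 + exp(c * K + d)).

    c and d are hypothetical defaults, not values taken from the paper.
    """
    return 1.0 / (1.0 + np.exp(c * np.asarray(K, dtype=float) + d))

def combine_ef_similarity(E, KE, weight=0.7):
    """Convex combination of chemical-structure and kernel similarity of EFs.

    Assumption: `weight` multiplies E and (1 - weight) multiplies KE.
    """
    E = np.asarray(E, dtype=float)
    KE = np.asarray(KE, dtype=float)
    if E.shape != KE.shape:
        raise ValueError("E and KE must have the same shape")
    return weight * E + (1.0 - weight) * KE

# Toy similarity matrices for two lncRNAs and two EFs.
KL = np.array([[1.0, 0.2], [0.2, 1.0]])
E = np.array([[1.0, 0.4], [0.4, 1.0]])
KE = np.array([[1.0, 0.8], [0.8, 1.0]])

SL = logistic_transform(KL)
SE = combine_ef_similarity(E, KE)
```

With these hypothetical parameters, the logistic map pushes high similarities toward 1 and low similarities toward 0, which sharpens the contrast in SL.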
In this section, the association matrix A of lncRNA-EF, the similarity matrix SL of lncRNAs, and the similarity matrix SE of EFs are integrated to construct a global heterogeneous network. On this heterogeneous network, the Random Walk with Restart (RWR) is used to calculate the diffusion score and obtain the diffusion features. Because higher-dimensional features are more susceptible to noise during model training, singular value decomposition (SVD) is used to reduce the dimension of the diffusion features. The details of each sub-step are as follows.
In this section, the roaming network was constructed firstly. The adjacency matrix
where AT represents the transpose of
where
The RWR algorithm (Liu et al.,
where
The initial assumption is that the transition probability value of each node is 1/
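A minimal sketch of this step, assuming the global adjacency matrix has the block form [[SL, A], [A^T, SE]] described above, a row-normalized transition matrix, and one restart vector per node. The function names, the uniform starting distribution, and the convergence tolerance are assumptions for illustration; the fixed point of the iteration does not depend on the starting distribution.

```python
import numpy as np

def build_global_adjacency(SL, SE, A):
    """Block matrix G = [[SL, A], [A^T, SE]] over lncRNA and EF nodes."""
    return np.block([[SL, A], [A.T, SE]])

def row_normalize(G):
    """Turn an adjacency matrix into a row-stochastic transition matrix."""
    row_sums = G.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # isolated nodes keep an all-zero row
    return G / row_sums

def rwr_diffusion(M, restart=0.5, tol=1e-10, max_iter=1000):
    """Random walk with restart, run from every node at once.

    Row i of the result is the stationary distribution of a walker that
    jumps back to node i with probability `restart` at each step.
    """
    n = M.shape[0]
    restart_vectors = np.eye(n)        # one one-hot restart vector per node
    P = np.full((n, n), 1.0 / n)       # assumed uniform starting distribution
    for _ in range(max_iter):
        P_next = (1.0 - restart) * P @ M + restart * restart_vectors
        if np.abs(P_next - P).max() < tol:
            return P_next
        P = P_next
    return P

# Toy network with 2 lncRNAs and 2 EFs.
SL = np.array([[1.0, 0.5], [0.5, 1.0]])
SE = np.array([[1.0, 0.2], [0.2, 1.0]])
A = np.array([[1.0, 0.0], [0.0, 1.0]])

G = build_global_adjacency(SL, SE, A)
M = row_normalize(G)
D = rwr_diffusion(M)  # diffusion states, one row per node
```

Each row of D is a probability distribution over all nodes of the network, so it can be read as a high-dimensional diffusion feature of the corresponding node.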
The calculation of low-dimensional diffusion features has been given in this section following the diffusion features obtained by RWR. As the number of nodes increases, the diffusion state increases in dimension as well. Singular value decomposition (SVD) (Golub and Reinsch,
where U and V are the left and right singular matrices, respectively. U and V are orthogonal matrices, and Σ is non-zero only on its diagonal. The diagonal entries are the singular values, ordered in Σ from largest to smallest. The larger a singular value, the more of the matrix's information it carries. Therefore, to reduce the computation, we keep only the 50 largest singular values and their corresponding singular vectors, which is sufficient to approximately restore the data. They are given by
where X is the low-dimensional node feature matrix derived from the high-dimensional diffusion feature. Each row of matrix X is the low-dimensional feature vector of each node in the network. W is the low-dimensional context eigenmatrix derived from the high-dimensional diffusion feature. Thus, we obtain the diffusion feature X after dimensionality reduction.
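The truncation can be sketched with NumPy's SVD routine. How Σ is shared between X and W is an assumption in this sketch: the square root of each retained singular value is assigned to both factors, so that the product X W is the rank-k approximation of the diffusion matrix. The function name `reduce_diffusion` and the random toy matrix are hypothetical.

```python
import numpy as np

def reduce_diffusion(D, dim=50):
    """Keep the `dim` largest singular values of the diffusion matrix D.

    Returns X (node features, one row per node) and W (context features).
    Assumption: sqrt(singular value) is assigned to both X and W.
    """
    U, s, Vt = np.linalg.svd(D, full_matrices=False)  # s is sorted descending
    k = min(dim, len(s))
    root = np.sqrt(s[:k])
    X = U[:, :k] * root           # shape (n, k)
    W = root[:, None] * Vt[:k]    # shape (k, n)
    return X, W

# Toy 6 x 6 matrix standing in for the diffusion states.
D = np.random.default_rng(0).random((6, 6))
X, W = reduce_diffusion(D, dim=2)
```

When `dim` is at least the number of nodes, no information is discarded and X @ W reproduces D up to round-off; smaller values trade reconstruction accuracy for fewer features.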
In order to obtain high performance, apart from the diffusion feature obtained in the above section, the proposed method combines the Hetesim features and the diffusion features based on multi-feature fusion. Another important feature is that HeteSim (Shi et al.,
Example for understanding the HeteSim measure. Circles of different colors denote three different kinds of objects in the heterogeneous network.
As we can see from
The paths from a lncRNA to an environmental factor in our heterogeneous network with a length of less than 5.
No. | Path abbreviation | Path | Length |
1 | LLE | lncRNA-lncRNA-EF | 2 |
2 | LEE | lncRNA-EF-EF | 2 |
3 | LLLE | lncRNA-lncRNA-lncRNA-EF | 3 |
4 | LELE | lncRNA-EF-lncRNA-EF | 3 |
5 | LLEE | lncRNA-lncRNA-EF-EF | 3 |
6 | LEEE | lncRNA-EF-EF-EF | 3 |
7 | LLLLE | lncRNA-lncRNA-lncRNA-lncRNA-EF | 4 |
8 | LLLEE | lncRNA-lncRNA-lncRNA-EF-EF | 4 |
9 | LLELE | lncRNA-lncRNA-EF-lncRNA-EF | 4 |
10 | LLEEE | lncRNA-lncRNA-EF-EF-EF | 4 |
11 | LELLE | lncRNA-EF-lncRNA-lncRNA-EF | 4 |
12 | LELEE | lncRNA-EF-lncRNA-EF-EF | 4 |
13 | LEELE | lncRNA-EF-EF-lncRNA-EF | 4 |
14 | LEEEE | lncRNA-EF-EF-EF-EF | 4 |
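The 14 paths in this table are exactly the node-type sequences that start at a lncRNA (L), end at an EF (E), and contain between 2 and 4 edges. A short sketch that enumerates them (the function name is hypothetical):

```python
from itertools import product

def enumerate_metapaths(max_length=4):
    """All L...E paths with 2 <= length <= max_length (length = number of edges)."""
    paths = []
    for length in range(2, max_length + 1):
        # A path with `length` edges has length - 1 interior nodes, each L or E.
        for interior in product("LE", repeat=length - 1):
            paths.append("L" + "".join(interior) + "E")
    return paths

paths = enumerate_metapaths()
```

The count is 2 + 4 + 8 = 14, which matches the 14-dimensional HeteSim feature vector; the enumeration order differs slightly from the order of the table rows.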
The HeteSim score between lncRNA and EF is calculated:
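As a hedged sketch of the HeteSim computation, the code below handles even-length paths only: the path is split at its middle node type, the reachable-probability matrices of the two halves are multiplied out, and the score is the cosine of the two probability vectors. Odd-length paths require the middle relation to be split through an intermediate edge object, which is omitted here. The dictionary `relations` and all function names are hypothetical.

```python
import numpy as np

def row_normalize(W):
    """Row-stochastic transition matrix of a (weighted) relation matrix."""
    W = np.asarray(W, dtype=float)
    sums = W.sum(axis=1, keepdims=True)
    sums[sums == 0] = 1.0
    return W / sums

def reach_matrix(path, relations):
    """Product of the transition matrices along `path` (for example "LLE")."""
    result = None
    for src, dst in zip(path, path[1:]):
        step = row_normalize(relations[src + dst])
        result = step if result is None else result @ step
    return result

def hetesim_even(path, relations):
    """HeteSim scores of all (source, target) pairs for an even-length path."""
    n_edges = len(path) - 1
    if n_edges < 2 or n_edges % 2:
        raise ValueError("this sketch handles even-length paths only")
    mid = n_edges // 2
    left = reach_matrix(path[:mid + 1], relations)      # source -> middle
    right = reach_matrix(path[mid:][::-1], relations)   # target -> middle
    norms = (np.linalg.norm(left, axis=1)[:, None]
             * np.linalg.norm(right, axis=1)[None, :])
    norms[norms == 0] = 1.0  # pairs with an empty reach vector score 0
    return (left @ right.T) / norms

# Toy network: 3 lncRNAs, 2 EFs.
SL = np.array([[1.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.0]])
SE = np.array([[1.0, 0.3], [0.3, 1.0]])
A = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
relations = {"LL": SL, "LE": A, "EL": A.T, "EE": SE}

scores_lle = hetesim_even("LLE", relations)  # shape (3 lncRNAs, 2 EFs)
```

Because all relation weights are non-negative, every score lies between 0 and 1, and a score of 0 means that the two objects reach no common middle object along the path.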
After the HeteSim features and the diffusion features were obtained, they were combined through multi-feature fusion. The method for training the GBDT classifier model to predict the association between lncRNAs and EFs based on heterogeneous networks is presented in this section. The 50-dimensional diffusion features and the 14-dimensional HeteSim scores were combined into a 64-dimensional feature data set. These features were used to train the Gradient Boosting Decision Tree (GBDT) (Friedman,
GBDT is an effective machine learning method for classification and regression problems. It is composed of multiple decision trees, and the final prediction is the sum of the outputs of all trees. Over multiple rounds of iteration, GBDT generates one weak classifier per round, each trained on the gradient (residual) of the classifiers from the previous rounds. The final classifier is the weighted sum of the weak classifiers obtained in all rounds, that is, an additive model. The model training steps are as follows:
The proposed GBDTL2E algorithm for predicting the association between lncRNAs and EFs based on heterogeneous networks is described in Algorithm 1. In lines 4–9 of Algorithm 1, the low-dimensional diffusion feature matrix X is obtained by using the random walk with restart algorithm and singular value decomposition. In lines 10–41, the HeteSim score is obtained. In lines 42–58, the training data are assembled and used to train the GBDT classifier, which yields the final classification model.
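As an illustration of the additive scheme described above, and not a reproduction of the authors' implementation, the following sketch boosts depth-1 regression trees (stumps) on the negative gradient of the logistic loss. The learning rate, the use of stumps, and all function names are assumptions; practical GBDT implementations use deeper trees and refined leaf-value updates.

```python
import numpy as np

def fit_stump(X, residual):
    """Best single-split regression tree (depth 1) under squared error."""
    best = None
    for j in range(X.shape[1]):
        # Excluding the largest value keeps the right branch non-empty.
        for threshold in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= threshold
            lv, rv = residual[left].mean(), residual[~left].mean()
            err = ((residual - np.where(left, lv, rv)) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, threshold, lv, rv)
    if best is None:  # no feature has two distinct values
        return (0, np.inf, residual.mean(), residual.mean())
    return best[1:]

def predict_stump(stump, X):
    j, threshold, lv, rv = stump
    return np.where(X[:, j] <= threshold, lv, rv)

def fit_gbdt(X, y, n_rounds=10, learning_rate=0.5):
    """Additive model F(x) = F0 + learning_rate * sum of stumps."""
    p = np.clip(y.mean(), 1e-6, 1 - 1e-6)
    f0 = np.log(p / (1 - p))                      # initial constant model (log-odds)
    F = np.full(len(y), f0)
    stumps = []
    for _ in range(n_rounds):
        residual = y - 1.0 / (1.0 + np.exp(-F))   # negative gradient of log-loss
        stump = fit_stump(X, residual)            # weak model fitted to the residuals
        F = F + learning_rate * predict_stump(stump, X)
        stumps.append(stump)
    return f0, learning_rate, stumps

def predict_proba(model, X):
    f0, learning_rate, stumps = model
    F = f0 + learning_rate * sum(predict_stump(s, X) for s in stumps)
    return 1.0 / (1.0 + np.exp(-F))

# Toy data: two features, three negative and three positive samples.
X = np.array([[0.1, 1.0], [0.2, 0.8], [0.3, 0.9],
              [0.7, 0.2], [0.8, 0.1], [0.9, 0.3]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

model = fit_gbdt(X, y, n_rounds=10)
proba = predict_proba(model, X)
```

Each round fits a new tree to what the current ensemble still gets wrong, so the predicted probabilities move toward the labels as rounds accumulate.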
GBDTL2E algorithm
1: Construct the adjacency matrix |
2: Initialize the global transition probability matrix |
3: Initialize the transition probability vector for each node |
4: |
5: Obtain the updated probability vector: |
6: |
7: |
8: |
9: |
10: Input L,P to calculate |
11: |
12: |
13: |
14: |
15: |
16: |
17: |
18: |
19: |
20: |
21: |
22: |
23: |
24: Divide the path into two parts. |
25: |
26: |
27: |
28: |
29: |
30: |
31: |
32: |
33: |
34: |
35: |
36: |
37: |
38: |
39: |
40: |
41: |
42: Combine the diffusion feature and HeteSim score to get the data set |
43: Dtrain = {( |
44: Use Dtrain to train the Gradient Boosting Decision Tree (GBDT). |
45: Initialize the model as Θ0( |
46: |
47: |
48: Calculate loss function: L(y, Θ |
49: Calculate the residuals: |
50: |
51: Construct the |
52: Get the corresponding leaf node area |
53: |
54: Calculate |
55: |
56: Update weak model: Θ |
57: |
58: Get the strong model Θ |
We randomly selected 300 positive samples and 300 negative samples for training the model. Positive samples were lncRNA-EF pairs with a known correlation, while negative samples were pairs without a known correlation. For objective performance evaluation, an independent test set was built by randomly selecting another 300 positive samples and 300 negative samples. All positive and negative samples in the test set were chosen independently and excluded from the training set.
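One way to draw such disjoint training and test sets is sketched below. The function name, the fixed random seed, and the strategy of drawing twice the required number of pairs and splitting them in half are assumptions for illustration.

```python
import random

def sample_pairs(A, n_pos, n_neg, seed=0):
    """Draw disjoint train/test sets of labeled (lncRNA, EF) index pairs.

    A: association matrix as a list of rows; 1 marks a known association.
    Returns two lists of ((i, j), label) tuples.
    """
    rng = random.Random(seed)
    positives = [(i, j) for i, row in enumerate(A) for j, v in enumerate(row) if v]
    negatives = [(i, j) for i, row in enumerate(A) for j, v in enumerate(row) if not v]
    if len(positives) < 2 * n_pos or len(negatives) < 2 * n_neg:
        raise ValueError("not enough pairs for disjoint training and test sets")
    # Sampling without replacement guarantees that the two halves never overlap.
    pos = rng.sample(positives, 2 * n_pos)
    neg = rng.sample(negatives, 2 * n_neg)
    train = [(p, 1) for p in pos[:n_pos]] + [(p, 0) for p in neg[:n_neg]]
    test = [(p, 1) for p in pos[n_pos:]] + [(p, 0) for p in neg[n_neg:]]
    return train, test

# Toy association matrix with 8 positive and 8 negative pairs.
A = [[1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 1, 0, 0],
     [0, 0, 1, 1]]
train, test = sample_pairs(A, n_pos=2, n_neg=2)
```

For the setting described above, the call would use `n_pos=300` and `n_neg=300` on the full association matrix.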
The 10-fold cross-validation was used to measure the performance of the GBDTL2E. The GBDTL2E parameters used are listed in
where
The experimental parameters of GBDTL2E.
475 | The number of lncRNAs | |
152 | The number of EFs | |
627 | The total number of lncRNAs and EFs | |
1 | The bandwidth parameter of the Gaussian interaction profile kernel similarity of lncRNAs | |
1 | The bandwidth parameter of the Gaussian interaction profile kernel similarity of EFs | |
0.7 | The weight parameter of correlation information of two environmental factors in SE | |
5 | The length constraint in Hetesim | |
50 | The dimension of the low-dimensional diffusion features | |
0.5 | The restart probability in the random walk with restart | |
600 | The number of training samples | |
10 | The number of training iterations |
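The 10-fold cross-validation over the 600 training samples can be sketched as an index splitter. The shuffling seed and the function name are assumptions; each sample is used for testing exactly once.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(600, k=10))
```

With 600 samples and 10 folds, every round trains on 540 samples and evaluates on the remaining 60.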
In this section, the proposed GBDTL2E method was compared with the following schemes, which include the k-nearest neighbor algorithm (KNN) (Cover and Hart,
The performance comparison with other machine learning methods.
KNN | 0.953 | 0.937 | 0.952 | 0.907 | 0.985 |
RF | 0.863 | 0.827 | 0.849 | 0.739 | 0.912 |
SVM | 0.966 | 0.967 | 0.966 | 0.933 | 0.988 |
GBDTL2E | 0.975 | 0.967 | 0.976 | 0.949 | 0.997 |
The ROC curve comparison with other machine learning methods.
The ROC curves comparison with other machine learning methods on independent dataset.
In order to verify the performance of combined diffusion and Hetesim features in GBDTL2E, we compared the performance by using two separate features and combined features in this section.
The performance comparison of different feature groups (Diffusion, HeteSim and combined feature).
The ROC curve comparison with different feature groups.
In this section, the GBDTL2E algorithm was compared with existing methods for predicting associations between lncRNAs and EFs. Only a few studies have predicted new potential associations between lncRNAs and EFs, so three methods were chosen for comparison with the proposed GBDTL2E method. These were KATZ (Vural and Kaya,
The ROC curve comparison with existing methods.
To further measure the performance of our proposed algorithm, we investigated an environmental factor “Cisplatin,” which is an effective chemotherapy drug for many cancers (Florea and Büsselberg,
After running our algorithm, we sorted the lncRNAs by their predicted correlation values with “Cisplatin” from largest to smallest. All of the top 10 lncRNAs are confirmed to be related to “Cisplatin” in the DLREFD database. The 10 lncRNAs and their corresponding PubMed reference IDs are shown in
The top 10 predicted lncRNAs related to cisplatin.
Rank | lncRNA | PubMed ID |
1 | AK12669 | 23741487 |
2 | AC015818.3 | 25250788 |
3 | ABCC6P1 | 25250788 |
4 | GABPB-AS1 | 24036268 |
5 | CASC2 | 28495512 |
6 | PSORS1C3 | 25250788 |
7 | H19 | 28189050 |
8 | AK125699 | 25250788 |
9 | SRGAP3-AS2 | 25250788 |
10 | XLOC_001406 | 25250788 |
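The ranking step of this case study amounts to sorting the candidate lncRNAs by their predicted score for one EF and keeping the first k. A minimal sketch, in which the function name, the placeholder lncRNA names, and the scores are all hypothetical:

```python
def top_k_lncrnas(scores, k=10):
    """Return the k (lncRNA, score) pairs with the highest predicted score.

    scores: dict mapping lncRNA name -> predicted association score for one EF.
    """
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

# Hypothetical scores for illustration only; these are not results of the paper.
example_scores = {"lncRNA_a": 0.20, "lncRNA_b": 0.90, "lncRNA_c": 0.50}
top2 = top_k_lncrnas(example_scores, k=2)
```

The returned pairs can then be checked one by one against a curated database or the literature.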
Recent studies have shown that the interaction between lncRNAs and EFs is closely related to the development of diseases. As more and more computational methods are applied to biological problems, greatly saving manpower, it has become possible to predict the interaction between lncRNAs and EFs computationally. In this paper, we proposed a method to predict the association between lncRNAs and EFs. The proposed method combines the HeteSim features and the diffusion features through multi-feature fusion and uses the machine learning algorithm GBDT to predict the association between lncRNAs and EFs based on heterogeneous networks. We evaluated our method with 10-fold cross-validation and compared it with other methods. An environmental factor was also examined in a case study to further assess the performance. The results show that GBDTL2E achieves high performance. In future work, we will investigate adding the expression profiles of lncRNAs to further improve the performance.
Publicly available datasets were analyzed in this study. This data can be found here:
JW, ZK, ZM, and GH conceived this work and designed the experiments. JW and ZK carried out the experiments. ZM and GH collected the data and analyzed the results. JW and ZK wrote, revised, and approved the manuscript.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We would like to thank the Experimental Center of School of Computer and Information Engineering, Central South University of Forestry and Technology, for providing computing resources.
The Supplementary Material for this article can be found online at: