METHODS article

Front. Genet., 22 February 2022

Sec. Computational Genomics

Volume 13 - 2022 | https://doi.org/10.3389/fgene.2022.832244

SAAED: Embedding and Deep Learning Enhance Accurate Prediction of Association Between circRNA and Disease

  • 1. School of Electronics and Information Technology, Sun Yat-Sen University, Guangzhou, China

  • 2. Macquarie Business School, Macquarie University, Sydney, NSW, Australia

  • 3. College of Information Science and Technology, Jinan University, Guangzhou, China

  • 4. College of Engineering, Shantou University, Shantou, China

Article metrics

View details

3

Citations

2k

Views

911

Downloads

Abstract

Emerging evidence indicates that circRNA can regulate various diseases. However, the mechanisms of circRNA in these diseases have not been fully understood. Therefore, detecting potential circRNA–disease associations has far-reaching significance for pathological development and treatment of these diseases. In recent years, deep learning models are used in association analysis of circRNA–disease, but a lack of circRNA–disease association data limits further improvement. Therefore, there is an urgent need to mine more semantic information from data. In this paper, we propose a novel method called Semantic Association Analysis by Embedding and Deep learning (SAAED), which consists of two parts, a neural network embedding model called Entity Relation Network (ERN) and a Pseudo-Siamese network (PSN) for analysis. ERN can fuse multiple sources of data and express the information with low-dimensional embedding vectors. PSN can extract the feature between circRNA and disease for the association analysis. CircRNA–disease, circRNA–miRNA, disease–gene, disease–miRNA, disease–lncRNA, and disease–drug association information are used in this paper. More association data can be introduced for analysis without restriction. Based on the CircR2Disease benchmark dataset for evaluation, a fivefold cross-validation experiment showed an AUC of 98.92%, an accuracy of 95.39%, and a sensitivity of 93.06%. Compared with other state-of-the-art models, SAAED achieves the best overall performance. SAAED can expand the expression of the biological related information and is an efficient method for predicting potential circRNA–disease association.

1 Introduction

CircRNA is a non-coding RNA formed by reverse splicing (Nigro et al., 1991; Danan et al., 2012; Salzman et al., 2013) and performs multiple functions in the nucleus, cytoplasm, and extracellular matrix (Li et al., 2018). In the nucleus, circRNA can regulate the splicing of their linear mRNA counterpart (Ashwal-Fluss et al., 2014; Zhang et al., 2014; Kelly et al., 2015) and control the transcription of parental genes (Li et al., 2015). In the cytoplasm, as miRNA sponges (Hansen et al., 2013) and ceRNAs (Salmena et al., 2011), circRNAs competitively bind with miRNA, which can interact with target mRNAs to induce mRNA degradation and translational repression (Fabian and Sonenberg, 2012). Moreover, it plays a regulatory role through binding proteins (Memczak et al., 2013), and can be translated (Chen and Sarnow, 1995; Conn et al., 2015; Wang and Wang, 2015). In vitro, circRNA can serve as an ideal biomarker, because it is more stable compared to other linear non-coding RNA molecules (Zhang and Xin, 2018; Shang et al., 2019; Slack and Chinnaiyan, 2019).

As described above, circRNA engages in a large number of biological processes and is associated with various diseases. It has been found that N6-methyladenosine-modified CircRNA-SORE, sequestering miR-103a-2-5p and miR-660-3p by acting as a microRNA sponge, sustains sorafenib resistance in hepatocellular carcinoma by regulating β-catenin signaling (Xu et al., 2020). In addition, it has been proved that circMRPS35 governs histone modification in anticancer treatment and advocates for triggering the circMRPS35/KAT7/FOXO1/3a pathway to combat gastric cancer (Jie et al., 2020).

So, analyzing the relationship between circRNA and disease can help understand the disease mechanism, treatments, and diagnoses (Ghosal et al., 2013; Liu et al., 2019a). However, traditional experiments are time-consuming and a lack of circRNA–disease association data limits further improvement. In recent years, various models are developed for association analysis. These models could be divided into three categories. The first model category involves the use of Gaussian Interaction Profile (GIP) or JACCARD index to calculate the similarity between circRNA and between diseases, and then the application of different models to extract features from the similarity matrix for further analysis, such as the KATZ measure (Fan et al., 2018a), path weighting methods (Lei et al., 2018), and k-nearest neighbor method with decreasing weight (Yan et al., 2018).

The second category is based on machine learning and deep learning. Wang et al. (2020a) propose an efficient computational method based on multi-source information combined with deep convolutional neural network (CNN) to predict circRNA–disease associations. The method extracts the hidden deep feature through the CNN and finally sends them to learning machine classifier for prediction. GCNCDA (Wang et al., 2020b) is based on the Fast learning with Graph Convolutional Networks (FastGCN) algorithm to predict the potential disease-associated circRNA. Specifically, the method first forms the unified descriptor by fusing disease semantic similarity information, disease, and circRNA GIP kernel similarity information. They use traditional methods to deal with association, which is difficult to integrate more knowledge.

The third category is based on embedding. In deep learning, the vector transformed by the embedding model is called embedding vector (Mikolov et al., 2013a). Recently, large-scale pre-trained models are the most popular embedding models. They can effectively introduce large amount of information into the embedding vectors with self-supervised learning and unlabeled data. Word2Vec (Mikolov et al., 2013b) and BERT (Jacob et al., 2019) are famous embedding models in natural language processing to calculate the embedding vector. Bordes et al. (2011) propose a structured distributed embedding method to learn the entities relations in Knowledge Bases. The embedding space allows to estimate the probability density of any relation between entities, preserves the knowledge of the original data, and presents the interesting ability of generalizing to new reasonable relations. In the field of bioinformatics, codon-based encoding (Zhang et al., 2019) and rna2vec (Xiao et al., 2018) are two embedding models transforming the codon and nucleobase into embedding vector. However, nucleobase and codon are heavily recurring in RNA sequences and require complex models to extract their semantics information. Xiao et al. (2021) propose an embedding model to calculate the embedding vector of circRNA and disease, but it cannot fuse more new association information into the embedding vector, and does not use large-scale learning methods such as deep learning.

Therefore, we proposed a novel neural network embedding model called Entity Relation Network (ERN) to calculate the embedding vector of diseases and circRNA. The model introduces various entity association information, i.e., circRNA–miRNA, circRNA–disease, disease–gene, disease–lncRNA, disease–drug, and disease–miRNA association, so the embedding vector contains more information than the previous model for extraction and analysis. Compared with traditional embedding vector, ERN can generate a fixed low-dimensional embedding vector, which is learnable and can reduce computational complexity. This means similar circRNAs or diseases will approach each other in the embedding space during the training process, hence making the model easier to converge and analyze the associations. By using the embedding vectors calculated by ERN, we have made significant progress in circRNA–disease association analysis.

2 Materials and Methods

2.1 Dataset of circRNA Association

In this study, we implement the model on the CircR2Disease (Fan et al., 2018b) and Circbank (Liu et al., 2019b) to calculate the embedding of the circRNA. CircR2Disease database supplies experimentally varied circRNA–disease associations, which can be freely obtained from http://bioinfo.snnu.edu.cn/CircR2Disease/. Currently, CircR2Disease has collected 725 associations between 661 circRNAs and 100 diseases from existing literatures. CircR2Disease is the benchmark dataset for evaluation of the circRNA–disease association analysis. Circbank is a comprehensive database of human circRNA with 16,844,375 circRNA–miRNA predicted associations between 1,917 miRNAs and 140,790 circRNAs, which can be freely obtained from http://www.circBANK.cn.

2.2 Dataset of Disease Association

We integrate DisGeNET (Piñero et al., 2016), HMDD (Huang et al., 2019), LncRNADisease (Bao et al., 2019), and Comparative Toxicogenomics Database (CTD) (Davis et al., 2021) for the calculation of the embedding of disease. DisGeNET is a dataset of disease–gene association, which can be freely obtained from https://www.disgenet.org/. It contains 1,134,942 gene-disease associations, between 21,671 genes and 30,170 diseases. HMDD is a dataset of disease–miRNA with 35,547 miRNA–disease associations between 1,206 miRNAs and 893 diseases, which can be freely obtained from https://www.cuilab.cn/hmdd. LncRNADisease is a dataset with 20,595 LncRNA–disease associations and 1,004 circRNA–disease associations from 19,166 lncRNAs, 823 circRNAs, and 529 diseases, which can be freely obtained from http://www.rnanut.net/lncrnadisease/. CTD is a dataset with 224,627 disease–drug associations from 10,152 drugs and 3,278 diseases, which can be freely obtained from http://ctdbase.org/.

Table1 shows the number of different types of associations in SAAED. We use two adjacency matrices to represent the associations between circRNA, disease, and other biological entities, respectively. When the specific circRNA or disease and specific entity is associated, the element is assigned a value of 1, otherwise 0.

TABLE 1

DatasetAssociationAmount of relationAmount of Entity 1Amount of Entity 2
CircR2DiseaseCircRNA–Disease725661100
CircbankCircRNA–miRNA16,844,375140,7901,917
DisGeNETDisease–Gene1,134,94230,17021,671
HMDDDisease–miRNA35,5478931,206
LncRNADiseaseDisease–LncRNA20,59552919,166
CTDDisease–Drug224,6273,27811,152
Total18,260,811

The number of different types of associations in SAAED.

2.3 Method Overview

SAAED consists of two parts, which is shown in Figure 1. The first part is the ERN for embedding circRNA and diseases. The second one is the Pseudo-Siamese network (Koch et al., 2015), which is used to analyze the probability of association between circRNA and diseases. More specifically, the input to SAAED is entity association information, which can be represented by an adjacency matrix. The size of the adjacency matrix is arbitrary. Therefore, the model can easily introduce information about semantic entities.

FIGURE 1

FIGURE 1

SAAED flowchart. (A) An outline of flowchart of data curation. (i) The association information is extracted from data sources. (ii) Data preprocess deal with data cleaning, redundancy removal, and unifying the names of circRNA and diseases from different datasets. (iii) The association information, i.e., represented by the adjacency matrix. (B) The Entity Relation Network (ERN) calculates the embedding vectors. (C) The Pseudo-Siamese Network calculates the probability of association with the embedding vector of circRNA and disease.

2.4 Embedding and Entity Relation Network

The similarity between circRNA or diseases can be analyzed through their structure or function. Structure refers to sequence information or spatial structure of the circRNA and disease. Function refers to the interaction between circRNA, disease, and other biological entities. Researchers often use one-hot encoding vector to convert these information into embedding vectors and analyze them by various models (Fan et al., 2018a; Lei et al., 2018; Yan et al., 2018; Wang et al., 2020a; Wang et al., 2020b), but one-hot encoding increases the computational complexity as the information used increases. In addition, the matrix of the one-hot encoding is sparse, and the computational efficiency is low. Hence, we try to construct a deep learning model called Entity Relation Network (ERN) to learn a fixed-length continuous embedding vector from the associated information.

In ERN, we construct the embedding model to transform a large amount of association information, such as circRNA–disease, circRNA–miRNA, disease–gene, disease–miRNA, disease–lncRNA, and disease–drug associations, into embedding vectors so that the model can analyze heterogeneous information simultaneously. In terms of solving biological association problems, embedding of ERN has the following advantages:

  • 1) Embedding vector has strong expression ability. Fixed length embedding vector learned by ERN is used to introduce semantic knowledge without increasing computational complexity, so as to solve the flexibility of semantic expansion.

  • 2) Embedding expression is more accurate. ERN translates the 0,1 matrix into a fixed length embedding vector by neural network learning algorithm, and optimizes the correlation degree by automatic learning, which is more accurate than the Euclidean distance correlation measure of GIP.

ERN adopts a probabilistic feedforward neural network language model (Bengio et al., 2003; Koch et al., 2015) to extract information from association data and further transforms it into an embedding vector. The association data between entity A and entity B can be represented by a adjacency matrix M, where I represents the number of entity A, and J represents the number of entity B. When entity A(i) is associated with entity B(j), the element M(A(i), B(j)) of matrix M is assigned the value of 1. Otherwise, it has a value of 0.

The adjacency matrix is used to represent entity association information. ERN projects the adjacency matrix onto the embedding vector, which consists of an input layer, a projection layer, a feedforward neural network, and an output layer. The flowchart of ERN is shown in Figure 2.

FIGURE 2

FIGURE 2

The flowchart of the Entity Relation Network (ERN). is the one-hot encoding vector of the entity A(i), is the projection matrix, is a feedforward neural network, and is the predicted probability vector of A(i), and is embedding vector of entity A(i). One-hot encoding vector selects the specific column of parameter matrix as the embedding vector of entity i, which is used for predicting the related entities. After training the embedding vector, i.e., the parameter matrix is learned.

We define the vector V(A(i)) to represent all associated entities of A(i), which is the ith row of the adjacency matrix M. ERN can be trained to predict probability of each associated entity related to A(i), and the feedforward neural network is used to analyze the embedding vector and output the probability.Where is the one-hot encoding vector of the entity A(i), is the projection matrix, is a feedforward neural network, and is the predicted probability vector of A(i), and is the embedding vector of entity A(i).

The ERN training uses one-hot encoding vector and adjacency matrix as the input and probability of association as the output. Taking the adjacency matrix as the learning goal, the ERN uses the linear transformation to generate the embedding vector by deep leaning algorithm.

The loss function of ERN is the mean square error,where is the jth element of , which is the probability that A(i) is associated with B(j); is the jth element of , which indicates whether A(i) is associated with entity B(j).

GIP is the most commonly used encoding method in the past. Compared with GIP, ERN has three advantages. Firstly, ERN can introduce any amount of external information into the embedding vectors. Secondly, the size of the embedding vector is fixed regardless of the number of entities and information introduced, keeping the complexity of the model constant. Thirdly, by reducing the loss function during training, the representation of features is enhanced, especially for two entities with high similarity, and their embedding vectors will be closer in the embedding space. It should be noted that in the training of ERN, we should overfit the model so that the embedding vector calculated by ERN can accurately reflect the relationship, i.e., the distance between different entities.

The use of embedding vectors with significant features makes the Pseudo-Siamese network easier to converge and analyze, so the overall model achieves better performance than previous models.

2.5 Calculation of the Disease Embedding Vector

Based on the disease–gene, disease–miRNA, disease–lncRNA, and disease–drug association data, the disease related adjacency matrix D is constructed: is the ith row of the adjacency matrix D, which reflects the associated genes of .

The input of the model is the one-hot encoding vector of .The details of the model are as follows:Where is the one-hot encoding vector of disease , is the projection matrix, is the embedding vector of disease , and are the weights of feedforward neural network, and is the predicted probability indicating which gene may be associated with the .

2.6 Calculation of circRNA Embedding Vector

Based on the circRNA–disease and circRNA–miRNA association data, the circRNA-related adjacency matrix C is constructed, and the embedding vector of circRNA is calculated.Where is the one-hot encoding vector of circRNA , is the projection matrix, is the embedding vector of , and are the weights of the model, and is the predicted probability indicating which disease may be associated with circRNA .

2.7 Information Fusion and circRNA–Disease Association Analysis

We used circRNA–disease association data from CircR2Disease as positive samples and randomly selected the same number of associations as negative samples. Although unconfirmed circRNA–disease associations may be regarded as negative samples, the probability is significantly lower.

The Pseudo-Siamese network is adopted to fuse information from circRNA and diseases to infer their relationship. The flowchart is shown in Figure 3. The embedding vectors of circRNA and disease are calculated based on different information. The Pseudo-Siamese network can learn two different transformations to project the embedding vectors of circRNA and diseases from the original semantic space into a new semantic space for analysis.

FIGURE 3

FIGURE 3

Flowchart of Pseudo-Siamese network. Two feedforward neural networks are used as the extractor to extract features and from the embedding vector and . The difference and element-wise product of u and v are concatenated with u and v for analyzing association by a Feedforward neural network.

The Pseudo-Siamese network has two inputs. After feature extraction, the features are concatenated and analyzed by feedforward neural network to output the probability of association. and are the embedding vectors calculated by ERN. Two different feedforward neural networks as the extractors extract feature vectors from these two inputs, respectively:Where u and v are the feature vectors transformed from the embedding vectors of circRNA and disease. The difference and element-wise product of u and v are calculated to enhance the inference of local information, and then concatenated with u and v. Finally, another feedforward neural network is used to calculate the probability.Where P is the predicted probability, is the feedforward neural network. The Sigmoid function is used to limit the predicted probability to a range from 0 to 1.

2.8 General Entity Embedding

After training the ERN and obtaining the embedding vector, the embedding vector can be used as the input to the feedforward neural network in ERN,

If we derive the inverse function of and use V(A(i)) as an input to , we haveWhere is the inverse function of .

Since it is difficult to derive the inverse function, a new feedforward neural network can be used to fit the inverse function with and Where is a feedforward neural network.

The above function indicates that the embedding vector can be calculated directly by using the association information, regardless of whether the disease is in CircR2Disease or not. Thus, the scope of circRNA–disease association analysis is greatly expanded. can be regarded as an embedding function learned by ERN from association information. Unlike GIP or other formulas, can be adjusted based on the data, so as to introduce more information into the embedding vector and improve the quality of the embedding vector.

3 Results

3.1 Performance Metrics

To evaluate the performance of SAAED, we used the fivefold cross-validation to divide the data into training sets and testing sets in the ratio of 4:1, i.e., 1,000 data are used for training and 250 data are used for testing. The fivefold cross-validation can make full use of the data to train and test the generalization capability of the model, and avoid the adverse effects of unreasonable division of the training and testing sets on model evaluation. The model is evaluated by accuracy (Accu.), sensitivity (Sen.), precision (Prec.), F1 score (F1), and AUC. They are defined as:where TP, FP, TN, and FN represent the number of true positive, false positive, true negative, and false negative, respectively. TP is the number of positive (given circRNA is related with given disease) correctly classified by the model; FP is the number of negative (given circRNA is not related with given disease) misclassification; TN is the number of negative (unrelated) correctly classified by the model; FN is the number of positive that is wrongly labeled.

3.2 Model Performance Evaluation

SAAED is implemented on the circR2Disease dataset to evaluate its ability to predict potential circRNA–disease associations. The results of fivefold CV are summarized in Table 2.

TABLE 2

Test setAccu.(%)Sen.(%)Prec.(%)F1(%)AUC(%)
195.5993.3399.4196.2899.08
294.0892.5795.8694.1998.47
395.5693.2698.2295.6899.26
496.1593.8298.8296.2598.36
595.5692.3199.4195.7399.44
Average

Result of fivefold CV generated by SAAED on the CircR2Disease Dataset.

According to statistical indicators, the average accuracy of the model is 95.39%, the average sensitivity is 93.06%, the average precision is 98.34%, the average F1 score is 95.63%, and the AUC is 0.9892, with all standard deviations less than 2. This indicated that SAAED achieved excellent robustness in the CircR2Disease dataset and is able to effectively predict circRNA–disease associations.

In addition, we also plotted the ROC curves generated by the model. As shown in Figure 4, the ROC curves can reach the upper left corner of the graph.

FIGURE 4

FIGURE 4

ROC curves of fivefold CV obtained by SAAED on the CircR2Disease Dataset.

We made a comparison of KATZHCDA (Liu et al., 2019a), PWCDA (Ghosal et al., 2013), DWNN-RLS (Fan et al., 2018a), and GCNCDA (Yan et al., 2018). The results are shown in Table 3. According to the fivefold CV AUC scores, SAAED obtained the highest AUC.

TABLE 3

MethodsSAAEDGCNCDADWNN-RLSPWCDAKATZHCDA
AUC(%)98.9290.9088.5489.0079.36

The fivefold CV AUC scores generated by various models on the same benchmark dataset CircR2Disease.

3.3 Cases Studies of the Association Between circRNA and Breast Cancer/HCC

In order to evaluate the practical value of SAAED, we choose the model with highest AUC to make predictions for circRNA associated with breast cancer and HCC, two diseases for which sufficient data are available in the CircR2Disease dataset to avoid model bias due to a lack of data as much as possible. We used the embedding vector of circRNA, breast cancer, and HCC as inputs to the model, and the predicted probability can reflect the relationship between the specified circRNA and disease.

As shown in Table 4, 16 of the 20 data with the highest predicted probabilities are confirmed to be associated with breast cancer, and according to the literature, 6 prediction results (hsa_circ_0000615, CDR1as, ciRS-7, cZNF609, hsa_circ_0007386, and circSMARCA5) are newly identified by the model.

TABLE 4

RankcircRNAEvidence (PMID)RankcircRNAEvidence (PMID)
1hsa_circ_00006153239866411hsa_circ_0001445Unconfirmed
2CDR1as3124592712circSMARCA532838810
3iRS-73007258213hsa_circ_0001785CircR2Disease
4cZNF6093239866414hsa_circ_0011946CircR2Disease
5hsa_circ_00073863280835015hsa_circ_0008717CircR2Disease
6circHIPK3CircR2Disease16hsa_circ_0000732CircR2Disease
7hsa_circ_0000284CircR2Disease17circRNA-001283CircR2Disease
8mmu_circ_0001878Unconfirmed18hsa_circ_0001721CircR2Disease
9hsa_circ_0067934Unconfirmed19circABCB10CircR2Disease
10hsa_circ_0004712Unconfirmed20hsa_circ_0086241CircR2Disease

The top 20 breast cancer-related candidate circRNA.

As shown in Table 5, 13 of the 20 data with the highest predicted probabilities are confirmed to be associated, and according to the literature, 4 of the associated circRNA (hsa_circ_0000615, cZNF609, has_circ_0000520, and hsa_circ_0000517) are newly identified by the model.

TABLE 5

RankcircRNAEvidence (PMID)RankcircRNAEvidence (PMID)
1hsa_circ_00006153239866411hsa_circ_0001819Unconfirmed
2CDR1asCircR2Disease12hsa_circRNA_000598Unconfirmed
3ciRS-7CircR2Disease13hsa_circ_000052027258521
4cZNF6093239866414hsa_circ_0004018CircR2Disease
5hsa_circ_0007386Unconfirmed15hsa_circ_0005986CircR2Disease
6circHIPK3CircR2Disease16circRNA_000839CircR2Disease
7hsa_circ_0000284CircR2Disease17hsa_circ_0056731Unconfirmed
8mmu_circ_0001878Unconfirmed18hsa_circ_0001400Unconfirmed
9hsa_circ_0067934CircR2Disease19hsa_circ_0067531CircR2Disease
10hsa_circ_0004712Unconfirmed20hsa_circ_000051731750237

The top 20 hepatocellular carcinoma-related candidate circRNA.

In addition, recall rates are analyzed for data with probabilities greater than 0.9, with a recall rate of 0.9310 for breast cancer and 0.7647 for HCC.

We plotted all the predicted results of the two diseases for detailed analysis. Figure 5 shows a significant decrease in both curves from 0.8 to 0.1, which indicates that the model has a strong reliability for each predicted result. Otherwise, the predicted probabilities would be distributed more around 0.5. In addition, a total of 286 data are greater than 0.8, while most of them are less than 0.2; presumably, most circRNA are not associated with breast cancer and HCC, which is in line with the reality.

FIGURE 5

FIGURE 5

Comparison of predicted probabilities. Both curves decrease from 0.8 to 0.1 significantly, while the curve of Breast Cancer is above the curve of HCC.

3.4 Prediction and Analysis of Six Diseases

It is worth noting that there is a difference in the prediction performance between breast cancer and HCC. In order to analyze whether it is caused by the difference of data volume in the CircR2Disease dataset, we selected four more diseases, predicted them by using SAAED, and plotted the results in box plots.

Table 6 shows the amount of related circRNA with diseases in CircR2Disease. The amount of related circRNA with breast cancer is the largest. The amount of related circRNA with infantile hemangioma is the smallest. Table 7 shows the median probability of the related circRNA with the diseases predicted by SAAED. The box plots in Figure 6 clearly show that most probability data are distributed between 0.92 and 1 (except for infantile hemangioma, whose median probability distribution is 0.8979), which indicates that SAAED can effectively identify most associations in CircR2Disease. Meanwhile, the median of each box plot is a common measure used in data centers, which indicates that the more the training data are, the closer the probability distribution is to 1. Therefore, collecting more training data can significantly improve model performance.

TABLE 6

DiseaseBreast CancerBladder CancerHCCOsteosarcomaGlioblastomaInfantile Hemangioma
Amount583130221612

Selected circRNA and their amount in CircR2Disease.

TABLE 7

DiseaseBreast CancerBladder CancerHCCOsteosarcomaGlioblastomaInfantile Hemangioma
Median0.97990.94120.92550.94830.92350.8979

Median of the predicted probability.

FIGURE 6

FIGURE 6

Probability distribution of specific associations in CircR2Disease predicted by SAAED. Figure 6 clearly shows that most probability data are distributed between 0.92 and 1 (except for infantile hemangioma, the median probability distribution is 0.8979), which indicates that SAAED can effectively identify most associations in CircR2Disease. The more the data are, the closer the predicted probability is to 1.

In addition, it is worth noting that the top 10 candidate circRNAs of both breast cancer and HCC are the same. We tried to analyze more diseases and found that most predicted results have similar top 10 candidate circRNAs. We selected the top 5 candidate circRNAs, i.e., hsa_circ_0000615, CDR1as, ciRS-7, hsa_circ_0007386, and circHIPK3 (cZNF609 is an alias of hsa_circ_0000615), and counted their associated diseases in CircR2Disease. The result is visualized in Figure 7. There are 20 associated diseases, which account for 28.6% of all diseases. However, most circRNAs in CircR2Disease are associated with only one disease. We believe that such imbalance of data introduces bias in the model.

FIGURE 7

FIGURE 7

Knowledge graph of the selected circRNA associations. A total of 5 circRNA and 20 related diseases is identified in the graph.

4 Conclusion

In this study, we proposed a method called SAAED to calculate embedding vectors of circRNA and diseases to predict associations. SAAED consists of ERN and the Pseudo-Siamese network, and ERN is an effective model to calculate entity embedding vectors. The innovative combination of embedding and deep learning can obtain biological association information without adding algorithm complexity. Experimental results show that the model outperforms other state-of-the-art models and can effectively identify circRNA–disease associations. In addition, SAAED can be used for association analysis between any entities. It provides a widely tried path for biological information mining.

It is worth mentioning the limitations of SAAED. First, the reliability of dataset may affect the semantic expression of embedding vector. For example, the imbalance of data in CircR2Disease leads to a similar top 10 associated circRNA of different diseases. Fusing multi-source data and mitigating the bias from different datasets are essential for the generalization and prediction for circRNA–disease association analysis. Second, the data diversity and inconsistency in different datasets are a challenge for data fusing and embedding. We can alleviate this problem by tedious preprocessing, while we believe that the introduction of knowledge graph network is a more effective way to improve the quality of embedding vector and introduce more information.

Statements

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

Author contributions

All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.

Funding

This research is funded by the National Natural Science Foundation of China, grant No. 61872396.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

  • 1

    Ashwal-FlussR.MeyerM.PamudurtiN. R.IvanovA.BartokO.HananM.et al (2014). circRNA Biogenesis Competes with Pre-mRNA Splicing. Mol. Cel.56 (1), 5566. 10.1016/j.molcel.2014.08.019

  • 2

    BaoZ.YangZ.HuangZ.ZhouY.CuiQ.DongD. (2019). LncRNADisease 2.0: an Updated Database of Long Non-coding RNA-Associated Diseases. Nucleic Acids Res.47 (D1), D1034D1037. 10.1093/nar/gky905

  • 3

    BengioY.DucharmeR.VincentP.JauvinC. (2003). A Neural Probabilistic Language Model. J. machine Learn. Res.3, 11371155. 10.1007/3-540-33486-6_6

  • 4

    BordesA.WestonJ.CollobertR.BengioY. (2011). “Learning Structured Embeddings of Knowledge Bases,” in Proceedings of the 25 th Annual Conference on Artificial Intelligence(AAAI), San Francisco, CA, USA, August 2011, 301306.

  • 5

    ChenC.-y.SarnowP. (1995). Initiation of Protein Synthesis by the Eukaryotic Translational Apparatus on Circular RNAs. Science268 (5209), 415417. 10.1126/science.7536344

  • 6

    ConnS. J.PillmanK. A.ToubiaJ.ConnV. M.SalmanidisM.PhillipsC. A.et al (2015). The RNA Binding Protein Quaking Regulates Formation of circRNAs. Cell160 (6), 11251134. 10.1016/j.cell.2015.02.014

  • 7

    DananM.SchwartzS.EdelheitS.SorekR. (2012). Transcriptome-wide Discovery of Circular RNAs in Archaea. Nucleic Acids Res.40 (7), 31313142. 10.1093/nar/gkr1009

  • 8

    DavisA. P.GrondinC. J.JohnsonR. J.SciakyD.WiegersJ.WiegersT. C.et al (2021). Comparative Toxicogenomics Database (CTD): Update 2021. Nucleic Acids Res.49 (D1), D1138D1143. 10.1093/nar/gkaa891

  • 9

    FabianM. R.SonenbergN. (2012). The Mechanics of miRNA-Mediated Gene Silencing: a Look under the Hood of miRISC. Nat. Struct. Mol. Biol.19 (6), 586593. 10.1038/nsmb.2296

  • 10

    FanC.LeiX.FangZ.JiangQ.WuF. X. (2018). CircR2Disease: a Manually Curated Database for Experimentally Supported Circular RNAs Associated with Various Diseases. Database (Oxford)2018, bay044. 10.1093/database/bay044

  • 11

    FanC.LeiX.WuF.-X. (2018). Prediction of CircRNA-Disease Associations Using KATZ Model Based on Heterogeneous Networks. Int. J. Biol. Sci.14 (14), 19501959. 10.7150/ijbs.28260

  • 12

    GhosalS.DasS.SenR.BasakP.ChakrabartiJ. (2013). Circ2Traits: a Comprehensive Database for Circular RNA Potentially Associated with Disease and Traits. Front. Genet.4, 283. 10.3389/fgene.2013.00283

  • 13

    HansenT. B.JensenT. I.ClausenB. H.BramsenJ. B.FinsenB.DamgaardC. K.et al (2013). Natural RNA Circles Function as Efficient microRNA Sponges. Nature495 (7441), 384388. 10.1038/nature11993

  • 14

    HuangZ.ShiJ.GaoY.CuiC.ZhangS.LiJ.et al (2019). HMDD v3.0: a Database for Experimentally Supported Human microRNA-Disease Associations. Nucleic Acids Res.47 (D1), D1013D1017. 10.1093/nar/gky1010

  • 15

    JacobD.ChangM-W.LeeK.KristinaT. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American, Chapter of the Association for Computational Linguistics:Human Language Technologies, Volume 1, Stroudsburg, PA, USA, 2019 (Pennsylvania, United States: Association for Computational Linguistics), 41714186.

  • 16

    JieM.WuY.GaoM.LiX.LiuC.OuyangQ.et al (2020). CircMRPS35 Suppresses Gastric Cancer Progression via Recruiting KAT7 to Govern Histone Modification. Mol. Cancer19 (1), 56. 10.1186/s12943-020-01160-2

  • 17

    KellyS.GreenmanC.CookP. R.PapantonisA. (2015). Exon Skipping Is Correlated with Exon Circularization. J. Mol. Biol.427 (15), 24142417. 10.1016/j.jmb.2015.02.018

  • 18

    KochG.ZemelR.SalakhutdinovR. (2015). Siamese Neural Networks for One-Shot Image Recognition. ICML deep Learn. WorkshopVol. 2.

  • 19

    LeiX.FangZ.ChenL.WuF.-X. (2018). PWCDA: Path Weighted Method for Predicting circRNA-Disease Associations. Ijms19 (11), 3410. 10.3390/ijms19113410

  • 20

    LiX.YangL.ChenL.-L. (2018). The Biogenesis, Functions, and Challenges of Circular RNAs. Mol. Cel.71 (3), 428442. 10.1016/j.molcel.2018.06.034

  • 21

    LiZ.HuangC.BaoC.ChenL.LinM.WangX.et al (2015). Exon-intron Circular RNAs Regulate Transcription in the Nucleus. Nat. Struct. Mol. Biol.22 (3), 256264. 10.1038/nsmb.2959

  • 22

    LiuM.WangQ.ShenJ.YangB. B.DingX. (2019). Circbank: A Comprehensive Database for circRNA with Standard Nomenclature. RNA Biol.16 (7), 899905. 10.1080/15476286.2019.1600395

  • 23

    LiuQ.CaiY.XiongH.DengY.DaiX. (2019). CCRDB: A Cancer circRNAs-Related Database and its Application in Hepatocellular Carcinoma-Related circRNAs. Database (Oxford)2019, baz063. 10.1093/database/baz063

  • 24

    MemczakS.JensM.ElefsiniotiA.TortiF.KruegerJ.RybakA.et al (2013). Circular RNAs Are a Large Class of Animal RNAs with Regulatory Potency. Nature495 (7441), 333338. 10.1038/nature11928

  • 25

    MikolovT.ChenK.CorradoG.DeanJ. (2013). Efficient Estimation of Word Representations in Vector Space. Comput. Sci.

  • 26

    MikolovT.SutskeverI.ChenK.CorradoG. S.DeanJ. (2013). “Distributed Representations of Words and Phrases and Their Compositionality,” in NIPS'13: Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, Nevada, USA, December 5–10, 2013 (NY, United States: Curran Associates Inc), 31113119.

  • 27

    NigroJ. M.ChoK. R.FearonE. R.KernS. E.RuppertJ. M.OlinerJ. D.et al (1991). Scrambled Exons. Cell64 (3), 607613. 10.1016/0092-8674(91)90244-s

  • 28

    PiñeroJ.BravoÀ.Queralt-RosinachN.Gutiérrez-SacristánA.Deu-PonsJ.CentenoE.et al (2016). DisGeNET: A Comprehensive Platform Integrating Information on Human Disease-Associated Genes and Variants. Nucleic Acids Res.45 (D1), D833D839. 10.1093/nar/gkw943

  • 29

    SalmenaL.PolisenoL.TayY.KatsL.PandolfiP. P. (2011). A ceRNA Hypothesis: the Rosetta Stone of a Hidden RNA Language?Cell146 (3), 353358. 10.1016/j.cell.2011.07.014

  • 30

    SalzmanJ.ChenR. E.OlsenM. N.WangP. L.BrownP. O. (2013). Cell-type Specific Features of Circular RNA Expression. Plos Genet.9 (9), e1003777. 10.1371/journal.pgen.1003777

  • 31

    ShangQ.YangZ.JiaR.GeS. (2019). The Novel Roles of circRNAs in Human Cancer. Mol. Cancer18 (1), 610. 10.1186/s12943-018-0934-6

  • 32

    SlackF. J.ChinnaiyanA. M. (2019). The Role of Non-coding RNAs in Oncology. Cell179 (5), 10331055. 10.1016/j.cell.2019.10.017

  • 33

    WangL.YouZ.-H.HuangY.-A.HuangD.-S.ChanK. C. C. (2020). An Efficient Approach Based on Multi-Sources Information to Predict circRNA-Disease Associations Using Deep Convolutional Neural Network. Bioinformatics36 (13), 40384046. 10.1093/bioinformatics/btz825

  • 34

    WangL.YouZ.-H.LiY.-M.ZhengK.HuangY.-A. (2020). GCNCDA: A New Method for Predicting circRNA-Disease Associations Based on Graph Convolutional Network Algorithm. Plos Comput. Biol.16 (5), e1007568. 10.1371/journal.pcbi.1007568

  • 35

    WangY.WangZ. (2015). Efficient Backsplicing Produces Translatable Circular mRNAs. Rna21 (2), 172179. 10.1261/rna.048272.114

  • 36

    XiaoQ.FuY.YangY.DaiJ.LuoJ. (2021). NSL2CD: Identifying Potential circRNA-Disease Associations Based on Network Embedding and Subspace Learning. Brief Bioinform22 (6), 6. 10.1093/bib/bbab177

  • 37

    XiaoY.CaiJ.YangY.ZhaoH.ShenH. (2018). “November). Prediction of Microrna Subcellular Localization by Using a Sequence-To-Sequence Model,” in 2018 IEEE International Conference on Data Mining (ICDM), Sentosa, Singapore, November, 2018 (IEEE), 13321337.

  • 38

    XuJ.WanZ.TangM.LinZ.JiangS.JiL.et al (2020). N6-methyladenosine-modified CircRNA-SORE Sustains Sorafenib Resistance in Hepatocellular Carcinoma by Regulating β-catenin Signaling. Mol. Cancer19 (1), 163. 10.1186/s12943-020-01281-8

  • 39

    YanC.WangJ.WuF. X. (2018). DWNN-RLS: Regularized Least Squares Method for Predicting circRNA-Disease Associations. BMC bioinformatics19 (19), 520581. 10.1186/s12859-018-2522-6

  • 40

    ZhangK.PanX.YangY.ShenH.-B. (2019). CRIP: Predicting circRNA-RBP-Binding Sites Using a Codon-Based Encoding and Hybrid Deep Neural Networks. Rna25 (12), 16041615. 10.1261/rna.070565.119

  • 41

    ZhangM.XinY. (2018). Circular RNAs: A New Frontier for Cancer Diagnosis and Therapy. J. Hematol. Oncol.11 (1), 2129. 10.1186/s13045-018-0569-5

  • 42

    ZhangX.-O.WangH.-B.ZhangY.LuX.ChenL.-L.YangL. (2014). Complementary Sequence-Mediated Exon Circularization. Cell159 (1), 134147. 10.1016/j.cell.2014.09.001

Summary

Keywords

circRNA–disease association, embedding, neural network, Pseudo-Siamese network, deep learning

Citation

Liu Q, Yu J, Cai Y, Zhang G and Dai X (2022) SAAED: Embedding and Deep Learning Enhance Accurate Prediction of Association Between circRNA and Disease. Front. Genet. 13:832244. doi: 10.3389/fgene.2022.832244

Received

09 December 2021

Accepted

17 January 2022

Published

22 February 2022

Volume

13 - 2022

Edited by

Leyi Wei, Shandong University, China

Reviewed by

Lihua Li, Hangzhou Dianzi University, China

Jian-Huan Chen, Jiangnan University, China

Updates

Copyright

*Correspondence: Xianhua Dai,

This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Outline

Figures

Cite article

Copy to clipboard


Export citation file


Share article

Article metrics