Advances in the Identification of Circular RNAs and Research Into circRNAs in Human Diseases

Circular RNAs (circRNAs) are a class of endogenous non-coding RNAs (ncRNAs) with a closed-loop structure that are mainly produced by variable processing of precursor mRNAs (pre-mRNAs). They are widely present in all eukaryotes and are very stable. Currently, circRNA studies have become a hotspot in RNA research. It has been reported that circRNAs constitute a significant proportion of transcript expression, and some are significantly more abundantly expressed than other transcripts. CircRNAs have regulatory roles in gene expression and critical biological functions in the development of organisms, such as acting as microRNA sponges or as endogenous RNAs and biomarkers. As such, they may have useful functions in the diagnosis and treatment of diseases. CircRNAs have been found to play an important role in the development of several diseases, including atherosclerosis, neurological disorders, diabetes, and cancer. In this paper, we review the status of circRNA research, describe circRNA-related databases and the identification of circRNAs, discuss the role of circRNAs in human diseases such as colon cancer, atherosclerosis, and gastric cancer, and identify remaining research questions related to circRNAs.


INTRODUCTION
Circular RNAs (circRNAs) are endogenous non-coding RNAs (ncRNAs) that have gained increasing attention in recent years. circRNAs are formed by exon or intron cyclization, in which the 5′ and 3′ ends are covalently joined so that the molecule has neither a free 5′ terminal cap nor a 3′ terminal poly(A) tail. They are mainly located in the cytoplasm or stored in exosomes, are resistant to RNA exonucleases, are more stably expressed and less susceptible to degradation than linear RNAs, and have been shown to exist in a wide variety of eukaryotic organisms (Pradeep et al., 2020). The widespread existence of circRNAs suggests that, like lncRNAs and microRNAs (miRNAs), they have biological functions (Jiang et al., 2009, 2014, 2015; Wang et al., 2014; Cheng L. et al., 2019; Liang et al., 2019; Wei and Liu, 2020; Yang et al., 2020). In recent years, studies have shown a diversity of formation mechanisms and biological functions of circRNAs. circRNAs are formed by various mechanisms; for example, spliceosomes (intracellular protein-RNA complexes) catalyze canonical splicing as follows (Salgia et al., 2003): first, the spliceosome recognizes introns, which are flanked by the splice donor (or 5′ splice site) and the splice acceptor (or 3′ splice site) with specific sequences at the 5′ and 3′ ends; then, the 2′ hydroxyl group of the branch point adenosine within the intron attacks the splice donor, resulting in a circular intron lariat structure; finally, the 3′ hydroxyl group of the upstream exon attacks the splice acceptor, the upstream and downstream exons are joined in sequence to form a linear structure, and the intron lariat is usually degraded rapidly by debranching enzyme. Alternative splicing is the process by which a precursor mRNA (pre-mRNA) is spliced in different ways, that is, through different combinations of splice sites, to produce mutually exclusive mRNA splice isoforms, which in turn are translated to produce different protein products (Pan et al., 2008).
This is the main function of RNA cyclization. Cyclization of circRNAs can be divided into intron and exon cyclization (Sanger et al., 1976), and the current mainstream cyclization mechanisms are categorized as follows: (1) exon skipping, (2) direct back-splicing of introns, (3) circRNA formation by RNA-binding proteins (RBPs; Chen, 2016; Zhang et al., 2018), and (4) circular intron RNA cyclization (Stoddard, 2014); the detailed mechanisms are shown in Figure 1. The diversity of circRNAs, and thus their diverse biological functions, is a direct result of these multiple formation mechanisms. For example, circRNAs can act as miRNA sponges (Hansen et al., 2013; Memczak et al., 2013; Zhao et al., 2020a), be translated into proteins (Yang et al., 2017), bind functional proteins (Li Z. et al., 2015), regulate RNA splicing (Conn et al., 2017), and regulate transcription (Chao et al., 1998; Memczak et al., 2013). Therefore, the identification of circRNAs contributes to our understanding of the formation and biological functions of circRNAs.
In 1976, Kolakofsky (1976) observed, for the first time, defective interfering RNAs in parainfluenza virus particles using electron microscopy. Sanger et al. (1976) discovered that plant-infecting viroids are a class of single-stranded, circular RNA molecules that have characteristics such as high thermal stability and a natural circular structure stabilized by self-complementarity. In 1979, similar circular transcripts were found in HeLa cells and yeast mitochondria by electron microscopy (Hsu and Coca-Prados, 1979). In 1981, a ribosomal RNA (rRNA) gene was discovered in Tetrahymena that contained an intron sequence that formed a circular RNA after splicing. In 1988, the intron of 23S rRNA in archaea was found to be spliced at a specific site to form a stable circular RNA and to function as a transposon. In 1991, researchers identified several circular transcripts formed by different splicing patterns in the human oncogene DCC (Nigro et al., 1991), and circular RNAs were subsequently found in the human ETS1 gene, the mouse Sry (sex-determining region Y) gene, the rat cytochrome P450 2C24 gene, and the human P450 2C18 gene.
Despite their early discovery, research on circRNAs progressed slowly for decades. Because circRNAs do not have free 3′ and 5′ ends, they could not be detected by molecular techniques that relied on poly(A) enrichment. In addition, cyclizable exons are joined by back-splicing, which differs from regular linear splicing, and the mapping algorithms of early transcriptome analysis could not directly map the resulting sequenced fragments to the genome, leading to the idea that circRNAs were byproducts of missplicing. With the development of high-throughput sequencing and bioinformatics technologies, it was first proposed in 2012 that circRNAs are circular transcripts generated by back-splicing of mRNA precursors and that they exist in large quantities in different types of human cells. In 2013, it was found that circRNAs can act as a sponge for miRNAs (Hansen et al., 2013; Memczak et al., 2013), which regulate the growth and development of organisms. Since then, circRNAs have rapidly become a research hotspot. To identify circRNAs, high-throughput sequencing (RNA-seq) is combined with computational methods such as CIRI (Gao et al., 2015), segemehl (Hoffmann et al., 2014), MapSplice, and CircSeq (Guo et al., 2014). In recent years, researchers have developed machine learning methods to identify circRNAs based on the above methods (Yin et al., 2021). Feature selection, which aims to select a subset of features by eliminating redundant and noisy features, is an important preprocessing step in bioinformatics and an important part of these machine learning models. Recently, Su et al. (2018) proposed a binomial distribution-based method to perform feature selection in computational genomics. The effectiveness of their method has been demonstrated by predicting lncRNA subcellular localization. Since both nucleotide and amino acid composition obey the binomial distribution, this method has been suggested for genomic and proteomic analysis.
We provide here an overview of the research progress of circRNAs, including the development of circRNA databases, identification of circRNAs, and the role of circRNAs in human diseases such as colon cancer, atherosclerosis, and gastric cancer.
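The binomial distribution-based feature selection mentioned above can be illustrated with a small sketch. It follows one common formulation, in which a feature (for example a k-mer) is scored by how unlikely its count in a class would be if occurrences were assigned to classes at random; this formulation, the function name `binomial_confidence`, and the toy counts are assumptions made for illustration and are not taken from Su et al. (2018).

```python
from math import comb

def binomial_confidence(counts):
    """Score features by a binomial confidence level (illustrative sketch).

    counts maps each feature to a list of occurrence counts, one per class.
    For feature i and class j, with N_i total occurrences of the feature and
    q_j the share of all occurrences that fall in class j, the tail
    probability P_ij = P(X >= n_ij), X ~ Binomial(N_i, q_j), is computed.
    The score of a feature is max_j (1 - P_ij); higher means the feature is
    more strongly over-represented in some class.
    """
    n_classes = len(next(iter(counts.values())))
    class_totals = [sum(v[j] for v in counts.values()) for j in range(n_classes)]
    grand_total = sum(class_totals)
    q = [t / grand_total for t in class_totals]

    scores = {}
    for feature, per_class in counts.items():
        n_i = sum(per_class)
        confidences = []
        for j, n_ij in enumerate(per_class):
            tail = sum(
                comb(n_i, m) * q[j] ** m * (1 - q[j]) ** (n_i - m)
                for m in range(n_ij, n_i + 1)
            )
            confidences.append(1 - tail)
        scores[feature] = max(confidences)
    return scores

# Toy k-mer counts in two classes (made-up numbers, not real data).
toy_counts = {"AAG": [40, 5], "CGT": [20, 20], "TTA": [20, 55]}
scores = binomial_confidence(toy_counts)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Features would then be ranked by score and added to the model in that order; the evenly distributed k-mer `CGT` ends up at the bottom of the ranking because its counts carry no class preference.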

circRNA-RELATED DATABASES
In recent years, as circRNA research has progressed, an increasing number of circRNAs have been discovered in different species, and circRNA-related databases have been created. Some of the main circRNA databases published so far are listed below.
(1) circBase collects and merges public circRNA datasets, presents the evidence supporting their expression in genomic context, and provides scripts to identify circRNAs in sequencing data (Glazar et al., 2014).
(2) Circ2Trait is a comprehensive database of potential associations of circRNAs with diseases and traits, built by studying the interaction network of circRNAs with miRNAs and by cataloguing the SNPs and Argonaute (Ago) interaction sites that the circRNAs contain (Ghosal et al., 2013).
(3) deepBase contains about 150,000 circRNA genes from organisms including human, mouse, Drosophila, and nematode. This database also constructs the most comprehensive expression map of circRNAs (Yang et al., 2010).
(4) CircNet mainly includes RNA-seq data of more than 400 samples from 26 tissues collected from the Sequence Read Archive database. This database not only includes basic information on circRNAs but also provides expression profile data of circRNAs in different tissues and the competing endogenous (ce)RNA regulatory network of circRNA-miRNA-gene interactions.
(5) starBase v2.0 integrates published circRNA data and constructs interaction networks of miRNAs with circRNAs and circRNAs with RBPs. In addition, the database looks for potential miRNA-ncRNA, miRNA-mRNA, ncRNA-RNA, RBP-ncRNA, and RBP-mRNA interactions through high-throughput data. starBase also predicts the function of ncRNAs (miRNAs, lncRNAs, and pseudogenes) and protein-coding genes from miRNA-mediated (ceRNA) regulatory networks using the online tools miRFunction and ceRNAFunction.

TOOLS FOR RECOGNITION OF circRNAs
Because of the low expression level of circRNAs and limitations of previous computational methods, these RNA molecules were found only in small numbers in individual genes and were therefore initially thought to be products of missplicing, byproducts of RNA splicing, incidental in animals, or precursors of linear RNAs. In recent years, with improved experimental and computational methods for circRNAs and the use of next-generation high-throughput sequencing technologies (Zeng et al., 2017, 2019), a large number of stable circRNAs have been found in a variety of cells; 85% of circRNAs can be mapped to known genes, and 84% of these overlap with coding exons (Memczak et al., 2013). Because of the special structure of circRNAs (they lack a 5′ terminal cap and a 3′ terminal poly(A) tail and form a covalently closed loop) and their maturation mechanism, early sequencing methods could not easily detect such molecules. Improvements in sequencing analysis techniques and computational methods have made detection more efficient (Malysiak-Mrozek et al., 2019; Mrozek, 2020). Studies on the identification of circRNAs are therefore reviewed here from two aspects: (1) identification based on sequencing data and (2) identification based on sequence features and machine learning methods.

Identification of circRNAs Based on Sequencing
Many algorithms exist for circRNA identification, including CIRI (Gao et al., 2015), segemehl (Hoffmann et al., 2014), MapSplice, CircSeq (Guo et al., 2014), and find_circ (Memczak et al., 2013). Using these algorithms, researchers have identified a large number of circRNAs in human, mouse, nematode, archaea, and other organisms (Yang et al., 2011; Jeck and Sharpless, 2014). We describe here several of these commonly used sequencing-based tools for identification of circRNAs. CIRI (Stoddard, 2014) was developed by Gao et al. (2015) to comprehensively identify and annotate circRNAs from RNA-seq data, and it is based on a novel chiastic clipping signal algorithm. Unlike other methods for annotating circRNAs, CIRI detects paired chiastic clipping signals in the Sequence Alignment/Map (SAM) output of BWA-MEM and combines this detection with systematic, multi-step filtering to eliminate false positives, which allows circRNAs to be detected from transcriptomic data accurately and without bias.
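The idea behind a chiastic (crossed) junction signal can be sketched in a few lines. The example below is a simplified illustration, not CIRI's implementation: it assumes that a read has already been split by an aligner into two segments on the forward strand of the same chromosome, and the `Segment` type, the `classify_split_read` function, and the `max_span` threshold are hypothetical names chosen for this sketch. Strand handling, paired-end evidence, splice-signal checks, and the filtering steps that real tools apply are all omitted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    """One aligned piece of a split read (0-based, end-exclusive coordinates)."""
    chrom: str
    read_start: int
    read_end: int
    ref_start: int
    ref_end: int

def classify_split_read(a, b, max_span=100_000):
    """Classify a two-segment read as linear, back-splice, or unclear.

    Returns (label, coords), where coords is (chrom, start, end) of the
    candidate circRNA for a back-splice and None otherwise.
    """
    # Order the segments by their position within the read.
    first, second = sorted((a, b), key=lambda s: s.read_start)
    if first.chrom != second.chrom:
        return "discordant", None
    # Collinear: the earlier read segment also lies upstream on the genome.
    if first.ref_end <= second.ref_start:
        return "linear", None
    # Chiastic: the earlier read segment lies downstream on the genome, so a
    # downstream donor has been joined to an upstream acceptor.
    if second.ref_end <= first.ref_start:
        start, end = second.ref_start, first.ref_end
        if end - start <= max_span:
            return "back-splice", (first.chrom, start, end)
        return "discordant", None
    return "ambiguous", None

# Toy 100-nt read: its first 60 nt map downstream of its last 40 nt.
head = Segment("chr1", 0, 60, 5000, 5060)
tail = Segment("chr1", 60, 100, 2000, 2040)
back_splice_call = classify_split_read(head, tail)

# Toy read spanning an ordinary intron.
linear_call = classify_split_read(
    Segment("chr1", 0, 60, 2000, 2060),
    Segment("chr1", 60, 100, 5000, 5040),
)
```

In this toy case the first read yields a candidate circRNA spanning chr1:2000-5060, whereas the second is consistent with ordinary linear splicing.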
CIRCexplorer is a tool for identifying circRNAs developed by Zhang et al. (2014), whose study was the first to elucidate how complementary sequences regulate the production of exon-derived circRNAs. The study revealed that alternative circularization is mediated by competitive pairing of complementary sequences, providing a new theoretical perspective on the complexity and diversity of gene expression at the transcriptional and posttranscriptional levels. Nearly 10,000 circRNAs were identified in the human embryonic stem cell line H9 by using a special nuclease to enrich circRNAs in combination with computational analysis software, demonstrating exon cyclization mediated by complementary sequences in intronic RNA. Competitive pairing of complementary sequences between different regions can selectively generate either linear RNAs or circRNAs.
CircSeq, a tool developed by Guo et al. (2014) to identify and characterize mammalian circRNAs, is a computational pipeline to identify and quantify the relative abundance of circRNAs from RNA-seq databases. Compared with other identification tools, CircSeq does not require available gene annotation to identify circRNAs. The application of the identification tool to non-polyA-selected RNA sequencing data in the ENCODE project proved its ability to classify and globally characterize more than 7000 human circRNAs.
The above sequencing methods all identify back-splicing sites from high-throughput sequencing data to detect circRNAs. In comparing some of the above identification tools, Hansen et al. (2016) and Sekar et al. (2019) found that only a small percentage of circRNAs could be predicted simultaneously by these tools, indicating significant differences and species variability. Therefore, the above tools developed around high-throughput sequencing technology have poor identification performance and low consistency. Moreover, these tools generally have high false-positive rates and low sensitivity (Hansen et al., 2016). To address these shortcomings, researchers have developed tools to identify circRNAs on the basis of sequence features and machine learning.
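The kind of tool-to-tool comparison described above can be sketched as a set operation on predicted back-splice junctions. The snippet below is only an illustration: the tool labels and junction coordinates are made up, and the `jaccard` and `agreement` helpers are hypothetical names rather than functions from any of the cited tools.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two sets of junctions (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def agreement(calls):
    """Summarize the consistency of circRNA calls across tools.

    calls maps a tool label to a set of (chrom, start, end) junctions.
    Returns the fraction of all predicted junctions that every tool
    reports, plus the pairwise Jaccard indices.
    """
    shared = set.intersection(*calls.values())
    union = set.union(*calls.values())
    shared_fraction = len(shared) / len(union) if union else 1.0
    pairwise = {
        (x, y): jaccard(calls[x], calls[y])
        for x, y in combinations(sorted(calls), 2)
    }
    return shared_fraction, pairwise

# Made-up predictions from three hypothetical tools on the same sample.
toy_calls = {
    "tool_a": {("chr1", 100, 900), ("chr1", 2000, 5060), ("chr2", 40, 700)},
    "tool_b": {("chr1", 2000, 5060), ("chr2", 40, 700), ("chr3", 10, 450)},
    "tool_c": {("chr2", 40, 700), ("chr5", 300, 1200)},
}
shared_fraction, pairwise = agreement(toy_calls)
```

A low shared fraction combined with low pairwise indices is the pattern that the cited comparisons describe; exact coordinate matching is a design simplification here, since real comparisons may need to tolerate small offsets in reported junction positions.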

Identification of circRNAs Based on Sequence Features and Machine Learning
Identifying circRNAs using sequence features that distinguish circRNAs from linear RNAs (especially mRNAs that encode proteins) is an urgent problem to be solved in bioinformatics. In recent years, the combination of sequence features and machine learning has been successfully used to solve biological problems such as the prediction of gene regulatory sites and splice sites (Wang et al., 2008; Xiong et al., 2015), protein function (Cao et al., 2017; Gbenro et al., 2020; Hippe, 2020; Zhai et al., 2020), and others (Mrozek et al., 2007, 2009; Wei et al., 2017b,c, 2018; Jin et al., 2019; Stephenson et al., 2019; Su et al., 2019a,b; Liu B. et al., 2020; Smith et al., 2020; Zhao et al., 2020b,c). Some tools have been developed to identify circRNAs using sequence features and machine learning methods. The basic framework of using machine learning methods to predict circRNAs is shown in Figure 2. One study selected 100 RNA circularization-related sequence features, including length, adenosine-to-inosine (A-to-I) density, and Alu sequences of introns upstream and downstream of the splice site, and established a machine learning model to identify circRNAs in the human genome. The classification abilities of two machine learning methods, random forest (RF; Cheng et al., 2019b; Liu et al., 2019) and support vector machine (SVM; Jiang et al., 2013; Wei et al., 2014, 2017a, 2019; Zhao et al., 2015; Cheng, 2019; Hong et al., 2020; Li and Liu, 2020; Shao and Liu, 2020), were also compared. The results showed that the selected sequence features could effectively identify RNA circularization and that different sequence features contribute differently to the classification and prediction ability of the model. The RF method showed better classification than the SVM method.
Yin et al. (2021) constructed a tool, named PCirc, to identify circRNAs using multiple sequence features and RF classification. This tool specifically targets the identification of circRNAs in plants, mainly from RNA-seq data. The tool encodes the sequence information of rice circRNAs by using three feature-encoding methods: k-mers, open reading frames, and splicing junction sequence coding (SJSC). Identification accuracy exceeds 80% when the encoded features are classified with the RF method. The identification model can be used not only for the identification of rice circRNAs, but also for the recognition of circRNAs in other plants such as Arabidopsis thaliana.
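Two of the feature types named above, k-mer composition and open reading frames, can be computed with very little code. The sketch below is an illustration under stated assumptions rather than PCirc's implementation: the function names, the choice of k = 3, the use of the longest ORF length as the ORF feature, and the toy sequence are all hypothetical, and SJSC is left out because its encoding is not described here.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}

def kmer_frequencies(seq, k=3):
    """Return the 4**k k-mer frequencies of seq in lexicographic k-mer order."""
    seq = seq.upper().replace("U", "T")
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing N or other symbols are skipped
            counts[kmer] += 1
            total += 1
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]

def longest_orf_length(seq):
    """Length in nt of the longest ATG...stop ORF on the given strand."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best

def encode(seq, k=3):
    """Concatenate k-mer frequencies with the ORF length relative to seq length."""
    orf_fraction = longest_orf_length(seq) / len(seq) if seq else 0.0
    return kmer_frequencies(seq, k) + [orf_fraction]

toy_sequence = "CCATGAAATAGGG"  # made-up sequence with one short ORF
features = encode(toy_sequence)
```

Vectors of this kind, computed for known circRNAs and for linear transcripts, would then be passed to a classifier such as a random forest.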

circRNAs AND HUMAN DISEASES
In terms of disease diagnosis, studies have found that the exosomes released by cancer cells contain abundant circRNAs, suggesting that circRNAs might be used as biological markers for clinical diagnosis. The key when using circRNAs for disease prediction is to identify the interaction site between the circRNA and miRNA or RBP, and then indirectly determine the association between the circRNA and disease by analyzing the relationship between the miRNA or RBP and disease (Jiang et al., 2010; Cheng et al., 2018; Liu, 2020; Zeng et al., 2020; Zuo et al., 2020).
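A first step in locating circRNA-miRNA interaction sites is often a scan for seed matches, and for a circular molecule the scan has to wrap around the back-splice junction. The sketch below shows this under simplifying assumptions: only perfect complementarity to miRNA nucleotides 2-8 is considered, the sequences are made up, and `seed_sites` is a hypothetical helper rather than part of any cited tool.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna):
    """Reverse complement of an RNA string made of A, C, G, and U."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))

def seed_sites(circ_seq, mirna, seed_start=1, seed_end=8):
    """Return 0-based start positions of perfect seed matches on a circRNA.

    The seed is mirna[seed_start:seed_end] (nucleotides 2-8 by default).
    The circular sequence is extended by seed length - 1 bases so that
    sites spanning the back-splice junction are found as well.
    """
    circ = circ_seq.upper().replace("T", "U")
    seed = mirna.upper().replace("T", "U")[seed_start:seed_end]
    target = reverse_complement(seed)
    k = len(target)
    if k == 0 or len(circ) < k:
        return []
    extended = circ + circ[:k - 1]
    return [i for i in range(len(circ)) if extended[i:i + k] == target]

# Made-up sequences: one internal site and one that spans the junction.
toy_mirna = "UACGGAUCCAUUGCAAGGCUAA"
toy_circ = "CGUAAAGAUCCGUAAAGAUC"
sites = seed_sites(toy_circ, toy_mirna)
```

Scanning the same sequence as if it were linear would miss the second site, which starts four bases before the junction and ends three bases after it.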
In 2015, exosomes were reported to be enriched with circRNAs, so it is possible that diseases such as colon cancer could be diagnosed by detecting circRNAs in serum. Aberrant expression of circRNAs in colorectal cancer and pancreatic ductal adenocarcinoma has been used as a diagnostic or predictive biomarker. Expression profiling has also indicated that circRNAs may be associated with the molecular pathogenesis of cutaneous basal cell carcinoma (Sand et al., 2016).
The first validated circRNA, cANRIL, is closely related to a single nucleotide polymorphism (SNP) that is thought to alter the splicing of cANRIL, thereby affecting expression of the INK4A/ARF loci and resulting in an increased incidence of atherosclerosis (Burd et al., 2010). Hypoxia is one of the key factors contributing to the development of atherosclerosis, and circRNAs have also been implicated in the hypoxic response (Boeckel et al., 2015). Xu et al. (2015) showed that mice of a transgenic line overexpressing the miR-7 gene in β-cells developed diabetes mellitus. The same study showed that overexpression of the circRNA ciRS-7 inhibited miR-7 function and thus improved insulin secretion. Potential target genes of miR-7 have been identified by bioinformatics analysis and include Myrip (a gene regulating insulin secretory granules) and Pax6 (a gene enhancing insulin transcription).
A study by Li P. et al. (2015) identified the circRNA hsa-circ002059 as being associated with gastric cancer. In that study, expression of this circRNA was downregulated in gastric tissues of patients compared with healthy controls. In addition, hsa-circ002059 was found at significantly lower levels in plasma of patients with gastric cancer than in healthy controls.
In bladder cancer, circRNAs have been identified using high-throughput microarray technology. Using this approach, Zhong et al. (2016) found two downregulated circRNAs (circFAM169A and circTRIM24) and four upregulated circRNAs (circTCF25, circZFR, circPTK2, and circBC048201) in bladder cancer tissue compared with adjacent non-tumor tissues. In addition, in the cancer tissues, circTCF25 could increase expression of the CDK6 gene by modulating miR-103a-3p and miR-107, an effect that is closely related to the development of cancer. Qin et al. (2016) identified hsa-cir0001649 in hepatocellular carcinoma (HCC) and found that its expression was significantly decreased compared with that in adjacent normal liver tissue. In contrast, Shang et al. (2016) found that another circRNA, hsa-cir0005075, was significantly upregulated in HCC compared with adjacent normal tissue.
Exosomes are highly enriched with circRNAs. Exosomes are extracellular vesicles, 40 to 160 nm in diameter, that function as important mediators of intercellular signaling (Kalluri and LeBleu, 2020). The exosome database exoRBase included 92 sequenced samples of serum exosomes, including samples from healthy volunteers and patients with coronary heart disease and colon cancer. The exosome samples contained 58,330 circRNAs and 18,333 mRNAs (Li et al., 2018). Zhang et al. (2019) demonstrated that circNRIP1, when secreted via exosomes, can be taken up by gastric cancer cells and promote their proliferation, migration, and invasion. Therefore, exosomes can be regarded as in vivo carriers of circRNAs that can amplify their biological functions.

CHALLENGES AND PROSPECTS
Compared with long non-coding RNAs and miRNAs, research on circRNAs is still in its infancy and many questions remain to be answered, primarily in four areas:
(1) Transport and degradation: because circRNAs can resist RNase digestion and are stable in cells, the process of their degradation is unclear.
(2) Formation: it is unknown whether circRNAs are produced during or after transcription.
(3) Expression, translation, and function of circRNAs: circRNAs have stable structures and are highly conserved, underpinning their ability to play important roles in different organisms. Their unconfirmed roles, including acting as miRNA sponges, regulating gene expression, and targeting RBPs, require comprehensive and extensive elucidation.
(4) Research methodology: the experimental methodologies and bioinformatics used to identify circRNAs are challenging. For example, in experimental methods, general RNA-seq procedures such as reverse transcription may cause technical mis-ligation and generate a large number of artificial circRNAs. These pseudo circRNAs can account for 34-55% of the sequencing quantity, seriously affecting the accuracy of the data. As for methods that use machine learning and sequence features, only a few identification tools exist and their accuracy needs to be improved. These tools are not stable across different species. Therefore, in the future, stable identification models and deep learning methods are needed to establish identification tools for circRNAs and improve the robustness of the models.
Accurate identification will help determine additional biological functions of circRNAs. Unique features of circRNAs, such as their ceRNA activity, may provide new ideas for drug discovery and development. The tissue specificity and stability of circRNAs make them potentially useful biomarkers. In the near future, it is likely that circRNAs will play important roles in the prevention, diagnosis, and treatment of various diseases.

AUTHOR CONTRIBUTIONS
ML and BG: conceptualization, writing-review and editing, and supervision. SJ, SH, and SW: investigation and writing-original draft preparation. All authors have read and agreed to the published version of the manuscript.

FUNDING
This work was supported by the National Natural Science Foundation of China (No. 62002087).