Long Non-coding RNAs in Endothelial Biology

In recent years, the recognized roles of RNA have expanded to the extent that protein-coding RNAs are now the minority, with a variety of non-coding RNAs (ncRNAs) comprising the majority of RNAs in higher organisms. A major contributor to this shift in understanding is RNA sequencing (RNA-seq), which provides a largely unconstrained method for monitoring the status of RNA from whole organisms down to a single cell. This observational power presents both challenges and new opportunities, which require specialized bioinformatics tools to extract knowledge from the data and the ability to reuse data for multiple studies. In this review, we summarize the current status of long non-coding RNA (lncRNA) research in endothelial biology. We then cover computational methods for identifying, annotating, and characterizing lncRNAs in the heart, especially in endothelial cells.


INTRODUCTION
The development of next generation sequencing (NGS) and RNA sequencing (RNA-seq) has significantly improved the understanding of transcriptomes. For example, we now know that most of the human genome is transcribed (Lander et al., 2001), yet only a small percentage of these RNAs code for protein (Weirick et al., 2016a). When the human genome was annotated (i.e., when genes and their corresponding exons were named and defined on the genome), it was originally thought that humans should have more protein-coding genes than lower organisms (e.g., yeast, plants, fishes, amphibians) (Mercer et al., 2011; Ezkurdia et al., 2014). However, when the numbers of protein-coding genes are compared among species, humans do not have more genes than lower organisms (Figure 1A). Given that humans are able to carry out more complex tasks than lower organisms, the question remains in the field: What aspect of our genome allows for the increased complexity? One school of thought is that proteins can be modified for various biological processes (e.g., phosphorylation of a protein for its activation). Another school points to the increased variety of isoforms produced from a single gene through alternative splicing events. In both schools, the ultimate products are proteins, as we know more about proteins than RNAs. A third school postulates that the increased number of ncRNAs (especially lncRNAs) underlies the greater complexity of humans, although this is highly subjective: the number of known ncRNAs depends on how well an organism has been studied, as illustrated by C. elegans having more lncRNAs than any other organism (Figure 1B). At the moment, it is most likely that a combination of these schools of thought will yield important answers to the question.

lncRNAs IN ENDOTHELIAL CELLS
Vessels deliver metabolites and oxygen to the tissue and export waste products to sustain the well-being of an organism (Asahara et al., 2011). After tissue injury [e.g., myocardial infarction (MI)], endothelial cells (ECs) migrate to the site of injury to re-establish the capillary network through a process called "angiogenesis" (Jakobsson et al., 2010; Oka et al., 2014). Furthermore, ECs contribute to the multicellular communications that maintain the balance between regeneration and dysfunctional or maladaptive healing (Cines et al., 1998; Libby, 2012; Kluge et al., 2013; Eelen et al., 2018). Although advances have been made in understanding angiogenesis, the recent emergence of lncRNAs has added another layer of complexity to the genetic network of angiogenesis. To date, a number of lncRNAs have been identified and characterized in ECs (Table 1), which can be found throughout the human body. For example, MALAT1 regulates endothelial cell function and vessel growth via cell cycle control (Michalik et al., 2014); the histone demethylase JARID1B controls the lncRNA MANTIS, which regulates EC function and vessel growth by binding to the chromatin modifying enzyme BRG1 (Leisegang et al., 2017); and several lncRNAs bind microRNAs (miRNAs) to function as miRNA sponges (Huang et al., 2015; Yan et al., 2015; Lu et al., 2016; Ming et al., 2016; Ma Y. et al., 2017; Sun et al., 2017; Zhang B. Y. et al., 2017; Bao et al., 2018).

RNA-seq DATA ANALYSIS USING BIOINFORMATICS
There are two major methods of generating libraries for RNA-seq, which are based on poly-A selection and ribosomal RNA (rRNA) depletion. Both methods aim to remove rRNAs, which constitute ∼80% of total RNA, followed by transfer RNAs (tRNAs) at 15% and all other RNAs, including protein-coding genes and lncRNAs, at only 5% (Lodish et al., 2000). Poly-A selection results in the identification of protein-coding genes and lncRNAs with poly-A tails (∼60% of total lncRNAs; Cheng et al., 2005), while rRNA depletion additionally identifies the remaining lncRNAs and circular RNAs (circRNAs). circRNAs are detected only with the latter method because they arise from spliced-out exons and/or introns, which are devoid of poly-A tails. Analysis of RNA-seq data usually involves a number of common computational steps to obtain the expression profiles of the RNA in a set of samples. At the start of a typical analysis pipeline, reads are trimmed to remove primers and low-quality regions. Next, the reads are aligned to a genome in a "guided alignment." For an organism with no reference genome, a "de novo assembly" of the transcriptome is performed instead. However, de novo assembly is more error-prone and difficult to operate, so we focus here on guided alignments. Traditionally, TopHat (Trapnell et al., 2012) has been the most popular aligner, but it is now being supplanted by newer programs (e.g., STAR, HISAT2), which offer greater speed and alignment accuracy (Engström et al., 2013; Conesa et al., 2016; Costa-Silva et al., 2017). Similar to protein-coding genes, lncRNAs undergo alternative splicing (AS) to produce isoforms (Deveson et al., 2017; White et al., 2017). The current understanding of AS is mainly based on EST-cDNA sequencing and short-read RNA-seq data.
In second-generation sequencing (e.g., Illumina-based short-read RNA-seq), long strands of cDNA must be broken into small segments to infer nucleotide sequences by amplification and synthesis (Metzker, 2010), an approach that falls short of detecting intact full-length transcripts. Third-generation sequencing (also known as "long-read sequencing") may address this shortcoming. PacBio RS II (Pacific Biosciences, CA, U.S.A.) is the first commercialized third-generation sequencer, which utilizes a novel single molecule real-time (SMRT) technology (Schadt et al., 2010). Compared to second-generation sequencing, SMRT technology offers long read lengths (up to 92 kb), high consensus accuracy (free of systematic sequencing errors), and a low degree of bias (even coverage across G+C content) (Nakano et al., 2017). When this technology is applied to transcriptome (cDNA) sequencing (e.g., RNA-seq), it is called "Iso-Seq," which can monitor AS (Abdel-Ghany et al., 2016). With Iso-Seq, the need for transcriptome assembly is eliminated because "one read = one transcript": each transcript can be read from its 5′-end to its poly-A tail. Iso-Seq has been applied to various species and tissues (Singh et al., 2016; Cheng et al., 2017; Hoang et al., 2017a,b; Jiang et al., 2017; Jo et al., 2017; Kim et al., 2017; Kuo et al., 2017; Wang et al., 2017a,b, 2018; Xue et al., 2017; Zhang S. J. et al., 2017; Zulkapli et al., 2017; Filichkin et al., 2018) but not yet to ECs.
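The read-trimming step at the start of the analysis pipeline described above can be illustrated with a minimal Python sketch. It is not the algorithm of any particular trimming program: it assumes Phred+33 quality encoding, exact adapter matching, and a 3′ quality cutoff of 20, and the function names and the example adapter sequence are chosen here purely for illustration.

```python
def phred33_to_scores(quality_string):
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(char) - 33 for char in quality_string]


def trim_read(sequence, quality_string, adapter=None, min_quality=20):
    """Clip an adapter (exact match only) and low-quality 3' bases from one read.

    Hypothetical helper for illustration; real trimmers also tolerate
    mismatches and partial adapters.
    """
    if adapter:
        adapter_start = sequence.find(adapter)
        if adapter_start != -1:
            # Discard the adapter and everything downstream of it.
            sequence = sequence[:adapter_start]
            quality_string = quality_string[:adapter_start]
    scores = phred33_to_scores(quality_string)
    end = len(sequence)
    # Walk back from the 3' end while the base quality is below the cutoff.
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# Toy read: 8 insert bases followed by an illustrative adapter sequence.
read = "ACGTACGT" + "AGATCGGAAGAGC"
quals = "IIIIII##" + "I" * 13  # 'I' = Q40, '#' = Q2
print(trim_read(read, quals, adapter="AGATCGGAAGAGC"))  # ('ACGTAC', 'IIIIII')
```

In this toy example the adapter is clipped first, and the two low-quality bases that precede it are then removed from the 3′ end, leaving six high-quality bases for alignment.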
The largely unbiased manner in which RNA-seq captures information is another interesting aspect of the technology, which enables new findings via re-analysis of published data. For example, most RNA-seq studies have focused on analyzing the expression of protein-coding genes. As lncRNAs are also present in these data sets, they offer a rich resource for studying lncRNA expression patterns. We have developed a number of bioinformatics tools to exploit these resources (Gellert et al., 2013; Weirick et al., 2015, 2016b, 2017), including some specifically designed to identify lncRNAs and to associate their expression with various tissues and cell types, including ECs (e.g., our database ANGIOGENES; Müller et al., 2016). Although ECs can be found throughout the human body, only a few databases contain the expression profiles of genes expressed in ECs [e.g., Causal Biological Network database (Boué et al., 2015), dbANGIO (Savas, 2012), and PubAngioGen (Li et al., 2015)]. Our ANGIOGENES is one of the few that contain the expression profiles of both protein-coding genes and lncRNAs in various ECs based on RNA-seq data. Furthermore, ANGIOGENES covers humans, mice, and zebrafish to allow for the screening of lncRNAs in positionally conserved regions (not necessarily sequence-conserved) (Weirick et al., 2015).
There are many transcripts whose sequencing reads are present in RNA-seq data but that are not annotated in the public databases, including NONCODE, which is one of the hallmark databases for lncRNAs. Our previous study (Weirick et al., 2016a) identified 77,656 novel isoforms of annotated reference transcripts and 102,848 intergenic transcripts in 12 human tissues (Nielsen et al., 2014), of which 58,789 (75.70%) and 101,993 (99.17%), respectively, were predicted to be non-coding, while 181,434 annotated transcripts (87.13% of the 208,244 transcripts in Ensembl version 77) were expressed in at least one of the 12 tissues analyzed. Although we could validate the presence of novel lncRNAs by RT-PCR experiments, many novel lncRNAs contain repetitive elements, such as microsatellites (Bidichandani et al., 1998) and short interspersed nuclear elements (SINEs), including Alu elements (Häsler and Strub, 2006). Thus, it is highly recommended to consult the available methods for characterizing lncRNAs (Liu et al., 2017), including CAGE-seq to annotate the 5′-end of lncRNAs (Hon et al., 2017) and ribo-seq/ribosomal footprinting RNA-seq technology to assess coding potential (Ruiz-Orera et al., 2014; Ji et al., 2015; Alvarez-Dominguez and Lodish, 2017), before proceeding to more functional experiments.
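As a quick check on the proportions quoted above, the percentages can be recomputed directly from the transcript counts. The short Python sketch below uses only the numbers given in the text; the helper name `percent` and the variable names are chosen here for illustration.

```python
def percent(part, total):
    """Return part/total as a percentage rounded to two decimal places."""
    return round(100.0 * part / total, 2)


# Counts as quoted in the text (Weirick et al., 2016a).
novel_isoforms_noncoding = percent(58_789, 77_656)   # novel isoforms predicted non-coding
intergenic_noncoding = percent(101_993, 102_848)     # intergenic transcripts predicted non-coding
annotated_expressed = percent(181_434, 208_244)      # annotated transcripts expressed in >= 1 tissue

print(novel_isoforms_noncoding, intergenic_noncoding, annotated_expressed)
```

The recomputed values agree with the 75.70%, 99.17%, and 87.13% stated in the text.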
It is well-known that ECs are heterogeneous populations of cells, as their activities and functions differ based on their physiological locations (Aird, 2012; Regan and Aird, 2012; Yuan et al., 2016). To understand this heterogeneity, it is important to perform single-cell RNA-seq (scRNA-seq) rather than bulk RNA-seq of a piece of tissue or of cells in a culture dish. As the technique for scRNA-seq matures, the immediate problem is the data analysis, especially assigning each cell to a particular cell type so that its molecular signature can be matched to the anatomical location from which the cell was isolated. For example, hearts contain multiple cell types (e.g., cardiomyocytes, ECs, fibroblasts, pericytes, and smooth muscle cells). With regard to ECs, expression profiles may differ between those residing in arteries and those in veins. When such profiles are compared to ECs from other tissues (e.g., kidneys, lungs), some genes are expressed at a similar level in all tissues, while others are expressed specifically in ECs isolated from a particular tissue. To understand this hierarchical organization of cells, cell types, and tissues, it is of utmost importance that the ontology of each cell be organized in relation to its corresponding cell type and tissue. To achieve this hierarchical and ontological organization, we recently introduced the use of logic programming (Weirick et al., 2016b), which was applied to kidneys. Logic programming is a programming paradigm based on formal logic, using a set of logical sentences consisting of facts, rules, and queries (Eklund and Klawonn, 1992). For example, consider a transcript expressed in the renal cortex. The renal cortex is located within the kidney. When the whole kidney is sequenced under the same conditions, the same transcript should therefore be expressed.
One could even descend to the level of cell types (e.g., ECs isolated from interlobular arteries, which are located within the kidney cortex). Similarly, all sequences expressed within these ECs are expressed in the kidney. However, it is well-known that high-abundance sequences can overwhelm lower-abundance sequences, so a transcript present in a substructure may go undetected in the whole organ. Thus, logic programming can be useful for integrating RNA-seq data at different hierarchical levels and beyond. This can be accomplished by: (1) modeling the anatomical and experimental relationships; (2) creating rules to define various types of expression characteristics; and (3) using queries to determine the expression characteristics of a given RNA. The analysis of RNA-seq data of ECs in the heart for lncRNAs, coupled with logic programming, should facilitate further use of the available RNA-seq data (e.g., scRNA-seq data from the heart) to test various hypotheses that were not originally intended when the data were generated. Such an approach should lead to the identification of lncRNAs in a variety of conditions (e.g., expressed in atherosclerotic plaques but not in the healthy artery), which can be further validated in functional studies.
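The three steps listed above (modeling relationships as facts, defining rules, and asking queries) can be sketched in Python as follows. This is only an illustration of the idea, not the implementation used in Weirick et al. (2016b), and it is written in Python rather than a dedicated logic programming language. The transcript names (`lncRNA_X`, `lncRNA_Y`) and the structure labels are hypothetical, and the rule deliberately ignores the detection limits caused by highly abundant sequences.

```python
# Facts (1): anatomical relationships, child structure -> containing structure.
LOCATED_IN = {
    "interlobular_artery_EC": "renal_cortex",
    "renal_cortex": "kidney",
}

# Facts (1): experimental observations, (transcript, structure that was sequenced).
OBSERVED = {
    ("lncRNA_X", "interlobular_artery_EC"),  # hypothetical transcript
    ("lncRNA_Y", "renal_cortex"),            # hypothetical transcript
}


def containing_structures(site):
    """Yield every structure that contains `site`, from nearest to outermost."""
    while site in LOCATED_IN:
        site = LOCATED_IN[site]
        yield site


def expressed_sites(transcript):
    """Rule (2): a transcript observed in a structure is also expressed in
    every structure that contains it (idealized; ignores abundance effects)."""
    sites = set()
    for observed_transcript, site in OBSERVED:
        if observed_transcript == transcript:
            sites.add(site)
            sites.update(containing_structures(site))
    return sites


def is_expressed(transcript, site):
    """Query (3): is `transcript` expected to be expressed in `site`?"""
    return site in expressed_sites(transcript)


print(is_expressed("lncRNA_X", "kidney"))                  # True
print(is_expressed("lncRNA_Y", "interlobular_artery_EC"))  # False
```

Expression propagates upward through the anatomy (ECs to cortex to kidney) but not downward: an observation in the renal cortex says nothing about which cell type within the cortex expresses the transcript, which is why the second query returns False.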

DETECTION OF RNA EDITING PATTERNS FROM RNA-seq DATA
In addition to studying lncRNAs, re-analysis of publicly available RNA-seq data is also useful for studying RNA editing. RNA editing is a post-transcriptional modification that alters the sequence of RNA molecules (Keegan et al., 2001; Hideyama and Kwak, 2011). The full extent of and reasons for RNA editing are largely unknown. However, recent studies show that editing in exons leads to amino acid substitutions from altered codons (Alon et al., 2015; Liscovitch-Brauer et al., 2017), whereas editing in 3′-untranslated regions (UTRs) may affect the binding of RNA binding proteins (RBPs) or microRNAs (miRNAs), thereby modulating RNA stability and/or translation (Keegan et al., 2001). There are two types of RNA editing: adenosine to inosine (A-to-I) and cytidine to uridine (C-to-U). A-to-I is the most common form and occurs through RNA editing enzymes called "adenosine deaminases acting on RNA (ADARs)," which convert adenosine in double-stranded RNA into inosine (Savva et al., 2012). When reverse transcribed to complementary DNA (cDNA), an inosine is read as guanosine ("G"), which can be identified as an A-to-G mismatch by comparison with the reference genome. A number of studies have been conducted to detect RNA editing events from RNA-seq data (Bahn et al., 2012; Park et al., 2012; Peng et al., 2012; Ramaswami et al., 2012, 2013; Solomon et al., 2013), including our recent study in ECs (Stellos et al., 2016). Because editing events can be detected from RNA-seq data, several databases of RNA editing events have been constructed to provide evidence for the frequency of RNA editing in various conditions (Kiran and Baranov, 2010; Picardi et al., 2011, 2017; Laganà et al., 2012; Ramaswami and Li, 2014; Solomon et al., 2016; Gong et al., 2017). We recently reported that cathepsin S (CTSS), which encodes a cysteine protease associated with angiogenesis and atherosclerosis, is highly edited (Stellos et al., 2016).
Such RNA editing enables the recruitment of the stabilizing RBP human antigen R (HuR) to the 3′-UTR of the CTSS transcript, thereby controlling CTSS mRNA stability and expression. The levels of the RNA editing enzyme ADAR1 and the extent of CTSS RNA editing are associated with changes in CTSS levels in patients with coronary artery diseases. Our study highlights the involvement of RNA editing in cardiovascular diseases, which had not previously been investigated (Uchida and Jones, 2018). Our finding was further supported by a recent large-scale, multi-center study analyzing RNA-seq data from the NIH Common Fund's Genotype-Tissue Expression (GTEx) program, which reported that the aorta, coronary, and tibial arteries were the most highly edited tissue types among 53 body sites from 552 individuals analyzed (Tan et al., 2017).
In humans, RNA editing occurs mostly in repetitive Alu regions (Levanon et al., 2004; Peng et al., 2012). Such regions are also found in lncRNAs, and lncRNAs can indeed be edited (Picardi et al., 2014; Szczesniak and Makalowska, 2016; Gong et al., 2017). Although this has been proposed but not tested extensively, the functions of lncRNAs may depend on their conformation (e.g., 3D structures), which can be affected by their primary sequences. This folding process can be influenced by a variety of factors, including (but not limited to) RNA modifications on lncRNAs, such as RNA editing. Given that RNA editing can be readily detected from RNA-seq data, more systematic analysis of RNA editing patterns is necessary, especially targeting lncRNAs in the heart (Uchida and Jones, 2018). For this purpose, several bioinformatics tools are available to detect editing within RNA-seq data, including GIREMI (Zhang and Xiao, 2015), JACUSA (Piechotta et al., 2017), RED (Sun et al., 2016), RED-ML (Xiong et al., 2017), REDItools (Picardi and Pesole, 2013), RES-Scanner, and our RNAEditor.
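The detection principle described in this section (an inosine is read as "G" in cDNA, so A-to-I editing appears as an A-to-G mismatch against the reference) can be illustrated with a minimal Python sketch. It is not the algorithm of any of the tools named above: it assumes gapless reads already aligned to the same strand as the reference, and the function name, the coverage threshold, and the editing-fraction threshold are hypothetical choices for illustration. Real tools additionally filter out genomic SNPs, sequencing errors, and alignment artifacts.

```python
def candidate_a_to_i_sites(reference, aligned_reads, min_coverage=2, min_fraction=0.1):
    """Return {position: edited fraction} for reference 'A' positions where
    aligned reads show 'G' (candidate A-to-I editing sites).

    aligned_reads: list of (start, sequence) pairs, 0-based and gapless
    (a simplifying assumption for this sketch).
    """
    coverage = {}   # reads covering each reference 'A'
    g_counts = {}   # reads showing 'G' at each reference 'A'
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            position = start + offset
            if position >= len(reference) or reference[position] != "A":
                continue
            coverage[position] = coverage.get(position, 0) + 1
            if base == "G":
                g_counts[position] = g_counts.get(position, 0) + 1
    sites = {}
    for position, g_count in g_counts.items():
        fraction = g_count / coverage[position]
        if coverage[position] >= min_coverage and fraction >= min_fraction:
            sites[position] = fraction
    return sites


# Toy data: reference has 'A' at positions 2 and 5.
reference = "CCATGACC"
reads = [(0, "CCGTGACC"), (0, "CCGTGACC"), (0, "CCATGACC"), (2, "ATGGCC")]
print(candidate_a_to_i_sites(reference, reads))  # {2: 0.5, 5: 0.25}
```

In this toy example, two of the four reads covering position 2 and one of the four reads covering position 5 carry a "G," giving candidate editing fractions of 0.5 and 0.25, respectively.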

HOW COULD WE TRANSLATE THE CONCEPT OF lncRNAs INTO RNA THERAPEUTICS?
One obvious use of lncRNAs in medicine is as diagnostic biomarkers, as lncRNAs are expressed in a more cell type-specific manner than protein-coding genes (Thurman et al., 2012; Gellert et al., 2013; Necsulea et al., 2014; Weirick et al., 2015). Although some progress has been made, most RNA-seq data analyzed so far have not been examined for lncRNAs, for the reasons mentioned above. Thus, without performing further RNA-seq experiments, it should be feasible to discover lncRNAs that are capable of differentiating between diseased and healthy individuals by re-analyzing publicly available RNA-seq data. For this purpose, the bioinformatics tools mentioned above should be useful.